3 Exploratory Data Analysis for Clinical Research

3.1 The Role of EDA in Clinical Research

Exploratory Data Analysis (EDA) is a critical step in understanding clinical data before formal modeling or hypothesis testing. In regulated clinical research, thorough exploration helps identify data issues, understand distributions, and guide analysis decisions.

3.1.1 EDA Goals in Clinical Research

Exploratory analysis in clinical settings serves several specific purposes:

1. Data quality assessment: Identifying issues that may have been missed during data cleaning
2. Understanding baseline characteristics: Examining the study population’s key features
3. Treatment pattern exploration: Visualizing medication adherence, dose adjustments, etc.
4. Outcome variable exploration: Understanding distribution and relationships with predictors
5. Informing modeling decisions: Guiding choices of statistical approaches
6. Generating hypotheses: Identifying unexpected relationships for further investigation
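
The first goal, data quality assessment, can be sketched in a few lines of base R. The example below is an illustration only: the data frame `toy`, its column names, and the plausibility limits are assumptions invented for the sketch, not clinical guidance. It tabulates missingness per variable and lists values that fall outside the assumed limits.

```r
# Simulated stand-in for a clinical dataset; all values are invented
set.seed(42)
toy <- data.frame(
  subject_id  = sprintf("S%03d", 1:50),
  systolic_bp = round(rnorm(50, mean = 130, sd = 15)),
  hba1c       = round(rnorm(50, mean = 6.2, sd = 0.8), 1)
)
toy$systolic_bp[c(3, 17)] <- NA   # inject two missing values
toy$systolic_bp[8] <- 1300        # inject a data-entry error (extra zero)

# Percentage of missing values per variable
missing_pct <- colMeans(is.na(toy)) * 100
missing_pct

# Hypothetical plausibility limits; real limits belong in the data management plan
limits <- list(systolic_bp = c(60, 260), hba1c = c(2, 20))

# Return the rows whose value falls outside the limits for one variable
flag_out_of_range <- function(data, var, lim) {
  x <- data[[var]]
  bad <- !is.na(x) & (x < lim[1] | x > lim[2])
  data.frame(
    subject_id = data$subject_id[bad],
    variable   = rep(var, sum(bad)),
    value      = x[bad]
  )
}

# Apply the check to every variable that has limits and stack the results
range_issues <- do.call(
  rbind,
  lapply(names(limits), function(v) flag_out_of_range(toy, v, limits[[v]]))
)
range_issues
```

Listing the offending subject IDs, rather than silently dropping the rows, keeps the decision about each value with the data management team.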

3.2 Descriptive Statistics for Clinical Data

3.2.1 Patient Demographics and Baseline Characteristics

A fundamental starting point is characterizing the study population:

library(gtsummary)
library(tidyverse)

# Create a baseline characteristics table
demographics %>%
  select(age, sex, race, ethnicity, bmi, 
         comorbidity_count, treatment_group) %>%
  tbl_summary(
    by = treatment_group,
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    missing = "no"
  ) %>%
  add_p() %>%
  add_overall() %>%
  bold_labels()

3.2.2 Exploring Distributions of Key Variables

Histograms with dashed reference lines at commonly used clinical cutoffs show how key measurements are distributed:

# Visualize distribution of key clinical measurements
library(patchwork)

p1 <- ggplot(clinical_data, aes(x = systolic_bp)) +
  geom_histogram(bins = 30, fill = "#0072B2", alpha = 0.7) +
  geom_vline(xintercept = 140, linetype = "dashed", color = "red") +
  labs(title = "Systolic Blood Pressure",
       x = "mmHg", y = "Count") +
  theme_minimal()

p2 <- ggplot(clinical_data, aes(x = ldl_cholesterol)) +
  geom_histogram(bins = 30, fill = "#0072B2", alpha = 0.7) +
  geom_vline(xintercept = 130, linetype = "dashed", color = "red") +
  labs(title = "LDL Cholesterol",
       x = "mg/dL", y = "Count") +
  theme_minimal()

p3 <- ggplot(clinical_data, aes(x = hba1c)) +
  geom_histogram(bins = 30, fill = "#0072B2", alpha = 0.7) +
  geom_vline(xintercept = 6.5, linetype = "dashed", color = "red") +
  labs(title = "HbA1c",
       x = "%", y = "Count") +
  theme_minimal()

p4 <- ggplot(clinical_data, aes(x = egfr)) +
  geom_histogram(bins = 30, fill = "#0072B2", alpha = 0.7) +
  geom_vline(xintercept = 60, linetype = "dashed", color = "red") +
  labs(title = "eGFR",
       x = "mL/min/1.73m²", y = "Count") +
  theme_minimal()

(p1 + p2) / (p3 + p4)
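
The reference lines in the histograms can be complemented with a numeric summary of how many values fall beyond each cutoff. The sketch below assumes simulated data (the values in `toy` are invented) and assumes that values at or above the cutoff are of interest for blood pressure, LDL, and HbA1c, while values below the cutoff are of interest for eGFR; both conventions should be checked against the study protocol.

```r
# Simulated measurements; names mirror the histograms, values are invented
set.seed(1)
toy <- data.frame(
  systolic_bp     = rnorm(200, 132, 16),
  ldl_cholesterol = rnorm(200, 120, 30),
  hba1c           = rnorm(200, 6.1, 0.9),
  egfr            = rnorm(200, 78, 18)
)

# Cutoffs taken from the plots; the direction column is an assumption
thresholds <- data.frame(
  variable  = c("systolic_bp", "ldl_cholesterol", "hba1c", "egfr"),
  cutoff    = c(140, 130, 6.5, 60),
  direction = c("above", "above", "above", "below"),
  stringsAsFactors = FALSE
)

# Share of non-missing values beyond each cutoff
thresholds$pct_flagged <- unname(mapply(
  function(v, cut, dir) {
    x <- toy[[v]]
    flagged <- if (dir == "above") x >= cut else x < cut
    100 * mean(flagged, na.rm = TRUE)
  },
  thresholds$variable, thresholds$cutoff, thresholds$direction
))
thresholds
```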

3.2.3 Summarizing by Treatment Groups

Summary statistics and confidence intervals for the primary endpoint can be compared across treatment arms:

# Compare key outcome measures by treatment group
outcomes_by_treatment <- clinical_data %>%
  group_by(treatment_group) %>%
  summarise(
    n = sum(!is.na(endpoint_change)),  # non-missing count, consistent with na.rm = TRUE
    mean_change = mean(endpoint_change, na.rm = TRUE),
    sd_change = sd(endpoint_change, na.rm = TRUE),
    median_change = median(endpoint_change, na.rm = TRUE),
    q1_change = quantile(endpoint_change, 0.25, na.rm = TRUE),
    q3_change = quantile(endpoint_change, 0.75, na.rm = TRUE),
    min_change = min(endpoint_change, na.rm = TRUE),
    max_change = max(endpoint_change, na.rm = TRUE),
    responder_rate = mean(responder == "Yes", na.rm = TRUE)
  ) %>%
  mutate(
    ci_lower = mean_change - qt(0.975, n-1) * sd_change / sqrt(n),
    ci_upper = mean_change + qt(0.975, n-1) * sd_change / sqrt(n)
  )

# Visualize treatment differences
ggplot(outcomes_by_treatment, 
       aes(x = treatment_group, y = mean_change, 
           ymin = ci_lower, ymax = ci_upper, 
           color = treatment_group)) +
  geom_pointrange(size = 1) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Mean Change from Baseline by Treatment Group",
       subtitle = "With 95% Confidence Intervals",
       x = "Treatment Group", 
       y = "Mean Change in Primary Endpoint") +
  theme_minimal() +
  theme(legend.position = "none")
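
The confidence limits in the summary are the usual t-based interval, mean ± t(0.975, n − 1) · sd / √n, where n has to be the number of non-missing values so that it agrees with the `na.rm = TRUE` mean and standard deviation. The following minimal sketch, on invented numbers, checks the manual formula against `t.test()` and shows how the interval becomes too narrow when the row count is used instead.

```r
# Invented change-from-baseline values with two missing observations
endpoint_change <- c(-4.1, -2.3, NA, -6.0, -1.2, -3.8, NA, -5.5, -0.9, -2.7)

n_obs <- sum(!is.na(endpoint_change))   # values that contribute to the mean
n_row <- length(endpoint_change)        # row count, what dplyr::n() would return

m <- mean(endpoint_change, na.rm = TRUE)
s <- sd(endpoint_change, na.rm = TRUE)

# t-based 95% CI using the non-missing count
ci_manual <- m + c(-1, 1) * qt(0.975, df = n_obs - 1) * s / sqrt(n_obs)

# Same interval from t.test(), which drops missing values itself
ci_ttest <- as.numeric(t.test(endpoint_change)$conf.int)

# Interval obtained if the row count is used instead
ci_rows <- m + c(-1, 1) * qt(0.975, df = n_row - 1) * s / sqrt(n_row)

rbind(ci_manual, ci_ttest, ci_rows)
```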

3.3 Visualizing Relationships in Clinical Data

3.3.1 Correlation Analysis

A correlation matrix gives a quick overview of pairwise linear relationships among continuous variables:

# Calculate correlations between key variables
library(corrplot)
library(corrr)

# Select numeric variables of interest
numeric_vars <- clinical_data %>%
  select(age, bmi, systolic_bp, diastolic_bp, ldl_cholesterol, 
         hdl_cholesterol, triglycerides, hba1c, creatinine, egfr) 

# Calculate correlation matrix
corr_matrix <- cor(numeric_vars, use = "pairwise.complete.obs")

# Create correlation plot
corrplot(corr_matrix, 
         method = "circle", 
         type = "upper", 
         tl.col = "black", 
         tl.srt = 45,
         diag = FALSE)

# Alternative with corrr package for a tidy approach
numeric_vars %>%
  correlate() %>%
  rearrange() %>%
  shave(upper = TRUE) %>%
  rplot(print_cor = TRUE)
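
With ten variables the plot contains 45 pairs, so a ranked table of the strongest correlations can be easier to scan. This is a sketch on simulated data: the data frame `toy` and the relationships built into it are invented for illustration.

```r
# Simulated data with a few built-in relationships; all values are invented
set.seed(7)
n <- 150
age <- rnorm(n, 60, 10)
toy <- data.frame(
  age         = age,
  systolic_bp = 100 + 0.5 * age + rnorm(n, 0, 12),
  egfr        = 130 - 0.8 * age + rnorm(n, 0, 12),
  hba1c       = rnorm(n, 6.2, 0.8)
)

corr_matrix <- cor(toy, use = "pairwise.complete.obs")

# Keep each pair once by indexing the upper triangle
idx <- which(upper.tri(corr_matrix), arr.ind = TRUE)
corr_pairs <- data.frame(
  var1 = rownames(corr_matrix)[idx[, "row"]],
  var2 = colnames(corr_matrix)[idx[, "col"]],
  r    = corr_matrix[idx]
)

# Strongest absolute correlations first
corr_pairs <- corr_pairs[order(-abs(corr_pairs$r)), ]
rownames(corr_pairs) <- NULL
corr_pairs
```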

3.3.2 Exploring Bivariate Relationships

Scatter plots and box plots, faceted by treatment group, show how pairs of variables relate:

# Create scatter plots with regression lines by treatment
ggplot(clinical_data, aes(x = baseline_value, y = endpoint_value, 
                          color = treatment_group)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = TRUE) +
  facet_wrap(~treatment_group) +
  labs(title = "Relationship Between Baseline and Endpoint Values",
       subtitle = "By Treatment Group",
       x = "Baseline Value", 
       y = "Endpoint Value") +
  theme_minimal()

# Relationship between continuous and categorical variable
ggplot(clinical_data, aes(x = age_category, y = endpoint_value, 
                          fill = age_category)) +
  geom_boxplot() +
  facet_wrap(~treatment_group) +
  labs(title = "Endpoint Values by Age Category and Treatment",
       x = "Age Category", 
       y = "Endpoint Value") +
  theme_minimal() +
  theme(legend.position = "none")
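
The regression lines drawn by `geom_smooth(method = "lm")` can also be inspected numerically by fitting one simple linear model per arm. The sketch below uses simulated data; the data-generating model and the data frame `toy` are assumptions made for illustration.

```r
# Simulated baseline and endpoint values; all numbers are invented
set.seed(11)
toy <- data.frame(
  treatment_group = rep(c("Control", "Treatment"), each = 60),
  baseline_value  = rnorm(120, 50, 10)
)
# Invented model: treated subjects end about 5 units lower on average
toy$endpoint_value <- 5 + 0.8 * toy$baseline_value -
  5 * (toy$treatment_group == "Treatment") + rnorm(120, 0, 6)

# One simple linear regression per arm
fits <- lapply(split(toy, toy$treatment_group), function(d) {
  lm(endpoint_value ~ baseline_value, data = d)
})

# Intercept, slope, and the 95% CI of the slope for each arm
slopes <- do.call(rbind, lapply(names(fits), function(g) {
  ci <- confint(fits[[g]])["baseline_value", ]
  data.frame(
    treatment_group = g,
    intercept   = unname(coef(fits[[g]])["(Intercept)"]),
    slope       = unname(coef(fits[[g]])["baseline_value"]),
    slope_lower = unname(ci[1]),
    slope_upper = unname(ci[2])
  )
}))
slopes
```

Clearly different slopes between arms would suggest a baseline-by-treatment interaction worth considering in the formal model.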

3.3.3 Stratified Analysis

A forest plot shows whether the treatment effect is consistent across key subgroups:

# Forest plot for treatment effect across subgroups
library(forestplot)

# For demonstration (in practice would be calculated from the data)
subgroup_results <- tibble(
  subgroup = c("Overall", "Male", "Female", "Age < 65", "Age ≥ 65", 
              "With Comorbidity", "Without Comorbidity"),
  n_treatment = c(150, 80, 70, 90, 60, 85, 65),
  n_control = c(150, 78, 72, 92, 58, 82, 68),
  mean_diff = c(12.3, 14.2, 10.1, 15.7, 8.4, 11.2, 13.5),
  lower_ci = c(8.7, 9.5, 5.2, 10.3, 3.1, 6.4, 8.9),
  upper_ci = c(15.9, 18.9, 15.0, 21.1, 13.7, 16.0, 18.1)
)

# Create a forest plot
forestplot(
  labeltext = subgroup_results$subgroup,
  mean = subgroup_results$mean_diff,
  lower = subgroup_results$lower_ci,
  upper = subgroup_results$upper_ci,
  xlab = "Treatment Effect (95% CI)",
  zero = 0,
  boxsize = 0.2,
  lineheight = unit(1, "cm"),
  col = fpColors(box = "#0072B2", lines = "#0072B2", summary = "#D55E00")
)
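
The subgroup table is hard-coded for demonstration. One way it might be computed from subject-level data is sketched here, assuming a simulated data frame `toy` and a hypothetical helper `subgroup_effect()` that returns the difference in means (Treatment minus Control) with a Welch 95% confidence interval. The choice of Welch's t-test is an assumption of this sketch; the statistical analysis plan would normally fix the estimator.

```r
# Simulated subject-level data; all values are invented
set.seed(2024)
n <- 300
toy <- data.frame(
  treatment_group = rep(c("Control", "Treatment"), each = n / 2),
  sex             = sample(c("Male", "Female"), n, replace = TRUE),
  age             = round(runif(n, 40, 85))
)
# Invented outcome: treated subjects improve by about 12 units more than controls
toy$endpoint_change <- rnorm(n, 0, 15) + 12 * (toy$treatment_group == "Treatment")

# Hypothetical helper: mean difference and Welch 95% CI for one subset of rows
subgroup_effect <- function(data, label) {
  trt <- data$endpoint_change[data$treatment_group == "Treatment"]
  ctl <- data$endpoint_change[data$treatment_group == "Control"]
  tt  <- t.test(trt, ctl)   # the interval is for mean(trt) - mean(ctl)
  data.frame(
    subgroup    = label,
    n_treatment = sum(!is.na(trt)),
    n_control   = sum(!is.na(ctl)),
    mean_diff   = mean(trt, na.rm = TRUE) - mean(ctl, na.rm = TRUE),
    lower_ci    = tt$conf.int[1],
    upper_ci    = tt$conf.int[2]
  )
}

subgroup_results <- rbind(
  subgroup_effect(toy, "Overall"),
  subgroup_effect(toy[toy$sex == "Male", ], "Male"),
  subgroup_effect(toy[toy$sex == "Female", ], "Female"),
  subgroup_effect(toy[toy$age < 65, ], "Age < 65"),
  subgroup_effect(toy[toy$age >= 65, ], "Age ≥ 65")
)
subgroup_results
```

Subgroup estimates produced this way are exploratory; the intervals are not adjusted for the number of subgroups examined.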

3.4 Time-Related Patterns in Clinical Data

3.4.1 Longitudinal Trends

Plotting visit-level means with confidence intervals shows how the endpoint evolves over time in each arm:

# Prepare longitudinal data
longitudinal_data <- clinical_data_long %>%
  filter(parameter == "primary_endpoint") %>%
  group_by(treatment_group, visit) %>%
  summarise(
    n = sum(!is.na(value)),  # non-missing count, consistent with na.rm = TRUE
    mean_value = mean(value, na.rm = TRUE),
    sd_value = sd(value, na.rm = TRUE),
    se_value = sd_value / sqrt(n),
    lower_ci = mean_value - qt(0.975, n-1) * se_value,
    upper_ci = mean_value + qt(0.975, n-1) * se_value,
    .groups = "drop"
  ) 

# Create longitudinal plot
ggplot(longitudinal_data, 
       aes(x = visit, y = mean_value, 
           group = treatment_group, color = treatment_group)) +
  geom_line(linewidth = 1) +  # `size` for lines is deprecated since ggplot2 3.4.0
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = lower_ci, ymax = upper_ci), width = 0.2) +
  labs(title = "Change in Primary Endpoint Over Time",
       subtitle = "Mean Values with 95% Confidence Intervals",
       x = "Visit", 
       y = "Mean Value") +
  theme_minimal()
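
Visit-level means are computed from whoever is still observed, so it helps to tabulate how many subjects contribute at each visit before reading the trend. A minimal sketch, assuming simulated long-format data (`toy_long`, the visit labels, and the dropout rates are all invented):

```r
# Simulated long-format data: 40 subjects, 4 visits; all values are invented
set.seed(3)
visits <- c("Baseline", "Week 4", "Week 8", "Week 12")
toy_long <- expand.grid(
  subject_id = 1:40,
  visit      = factor(visits, levels = visits)
)
toy_long$treatment_group <- ifelse(toy_long$subject_id <= 20, "Control", "Treatment")
toy_long$value <- rnorm(nrow(toy_long), 50, 8)

# Invented dropout pattern: later visits lose progressively more values
p_missing <- c(0, 0.05, 0.15, 0.30)[as.integer(toy_long$visit)]
toy_long$value[runif(nrow(toy_long)) < p_missing] <- NA

# Non-missing observations per arm and visit
n_observed <- with(
  toy_long,
  tapply(!is.na(value), list(treatment_group, visit), sum)
)
n_observed

# Percentage of the baseline sample still contributing at each visit
round(100 * sweep(n_observed, 1, n_observed[, "Baseline"], "/"), 1)
```

If retention differs markedly between arms, differences in the plotted means may partly reflect who dropped out rather than a treatment effect.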

3.4.2 Time-to-Event Analysis

Kaplan-Meier curves give a first look at time-to-event outcomes by treatment group:

# Kaplan-Meier curves
library(survival)
library(survminer)

# Fit Kaplan-Meier curves by treatment.
# Building Surv() inside the formula keeps the fit tied to `data`, which is more
# robust when ggsurvplot() re-evaluates the model to compute the p-value.
km_fit <- survfit(Surv(time_to_event, event_status) ~ treatment_group,
                  data = clinical_data)

# Plot Kaplan-Meier curves
ggsurvplot(
  km_fit,
  data = clinical_data,
  risk.table = TRUE,
  pval = TRUE,
  conf.int = TRUE,
  xlab = "Time (months)",
  ylab = "Survival Probability",
  ggtheme = theme_minimal(),
  palette = c("#0072B2", "#D55E00"),
  legend.title = "Treatment Group",
  risk.table.height = 0.25
)
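
The plot can be backed by numeric summaries: median survival per group and the log-rank test. The sketch below uses the `lung` dataset shipped with the survival package purely as a public stand-in (grouping by `sex` instead of a treatment arm is an assumption of the example, not part of the chapter's data).

```r
library(survival)

# survival::lung is a stand-in dataset; status is coded 1 = censored, 2 = dead
km_fit <- survfit(Surv(time, status) ~ sex, data = lung)

# Events and median survival time (days) with 95% confidence limits per group
km_table <- summary(km_fit)$table
km_table[, c("events", "median", "0.95LCL", "0.95UCL")]

# Log-rank test comparing the two survival curves
lr <- survdiff(Surv(time, status) ~ sex, data = lung)
p_logrank <- 1 - pchisq(lr$chisq, df = length(lr$n) - 1)
p_logrank
```

Reporting the median with its confidence limits alongside the p-value gives a sense of effect size that the p-value alone does not.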

3.5 Exploring Categorical Relationships

3.5.1 Contingency Tables and Mosaic Plots

Contingency tables, a chi-square test, and mosaic or stacked bar plots summarize how categorical variables relate:

# Create contingency table
contingency_table <- table(clinical_data$response_category, 
                           clinical_data$treatment_group)

# Print the contingency table
knitr::kable(contingency_table, 
             caption = "Response Category by Treatment Group")

# Chi-square test
chi_result <- chisq.test(contingency_table)
chi_result

# Visualize with mosaic plot
library(vcd)
mosaic(contingency_table,
       main = "Response Category by Treatment Group",
       shade = TRUE)

# Alternative visualization with ggplot
ggplot(clinical_data, 
       aes(x = treatment_group, fill = response_category)) +
  geom_bar(position = "fill") +
  labs(title = "Response Categories by Treatment Group",
       x = "Treatment Group", 
       y = "Proportion",
       fill = "Response Category") +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal()
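
The chi-square approximation becomes unreliable when expected cell counts are small, which is common for rare response categories. The sketch below uses an invented 3 x 2 table, inspects the expected counts, and falls back to Fisher's exact test when any expected count is below 5. The cutoff of 5 is a common rule of thumb, used here as an assumption.

```r
# Invented 3 x 2 table of response category by treatment group
contingency_table <- matrix(
  c( 1,  6,
     9, 10,
    12,  3),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    response_category = c("Complete", "Partial", "None"),
    treatment_group   = c("Control", "Treatment")
  )
)

# suppressWarnings() hides the small-count warning that is checked explicitly below
chi_result <- suppressWarnings(chisq.test(contingency_table))
chi_result$expected

# Rule of thumb: distrust the approximation if any expected count is below 5
small_cells <- any(chi_result$expected < 5)

# Use Fisher's exact test when the approximation is doubtful
test_result <- if (small_cells) fisher.test(contingency_table) else chi_result
test_result$method
test_result$p.value
```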

3.6 Exploring Adverse Events

Safety data are usually summarized as the percentage of subjects in each arm who experienced a given adverse event:

# Prepare adverse event data
ae_summary <- adverse_events %>%
  group_by(system_organ_class, preferred_term, treatment_group) %>%
  summarise(
    n_events = n(),
    n_subjects = n_distinct(subject_id),
    .groups = "drop"
  ) %>%
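  # Get total subjects per treatment group from the full study population;
  # counting within adverse_events would omit subjects with no AE and inflate rates.
  # Assumes demographics holds one row per treated subject.
  left_join(
    demographics %>%
      count(treatment_group, name = "total_subjects"),
    by = "treatment_group"
  ) %>%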
  # Calculate percentages
  mutate(
    percent = (n_subjects / total_subjects) * 100
  ) %>%
  # Filter for more common AEs
  filter(n_subjects >= 5) %>%
  arrange(system_organ_class, desc(n_subjects))

# Create adverse event plot
ggplot(ae_summary, 
       aes(x = reorder(preferred_term, percent), 
           y = percent, 
           fill = treatment_group)) +
  geom_col(position = "dodge") +
  coord_flip() +
  facet_wrap(~system_organ_class, scales = "free_y") +
  labs(title = "Adverse Events by Treatment Group",
       subtitle = "Events occurring in ≥5 subjects",
       x = "", 
       y = "Percentage of Subjects (%)",
       fill = "Treatment Group") +
  theme_minimal()
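
Two details drive the percentages in an adverse event summary: repeat events in the same subject should count once, and the denominator should be all treated subjects in the arm, not only those who reported an event. A toy example with invented data (the listing and the arm sizes in `safety_n` are assumptions) makes both points concrete.

```r
# Invented AE listing: subject 101 reports headache twice
adverse_events <- data.frame(
  subject_id      = c(101, 101, 102, 103, 201, 202),
  treatment_group = c("Treatment", "Treatment", "Treatment", "Treatment",
                      "Control", "Control"),
  preferred_term  = c("Headache", "Headache", "Nausea", "Headache",
                      "Headache", "Nausea")
)

# Hypothetical safety population: 10 treated subjects per arm, most without any AE
safety_n <- c(Control = 10, Treatment = 10)

# One row per subject, arm, and term so repeat events are not double counted
subject_level <- unique(
  adverse_events[, c("subject_id", "treatment_group", "preferred_term")]
)

n_subjects <- table(subject_level$preferred_term, subject_level$treatment_group)
n_events   <- table(adverse_events$preferred_term, adverse_events$treatment_group)

# Incidence (%) = subjects with the event / all treated subjects in that arm
incidence <- 100 * sweep(n_subjects, 2, safety_n[colnames(n_subjects)], "/")
incidence
```

Here headache occurs three times in the treatment arm but in only two subjects, so the incidence is based on two of ten subjects rather than three events.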

3.7 Interactive Exploration

For complex clinical data, interactive plots and tables make it easier to inspect individual subjects:

library(plotly)
library(DT)

# Create interactive scatter plot
p <- ggplot(clinical_data, 
           aes(x = age, y = endpoint_value, 
               color = treatment_group,
               text = paste(
                 "Subject ID:", subject_id,
                 "\nAge:", age,
                 "\nSex:", sex,
                 "\nBaseline:", baseline_value,
                 "\nEndpoint:", endpoint_value,
                 "\nChange:", endpoint_value - baseline_value
               ))) +
  geom_point(alpha = 0.7) +
  labs(title = "Relationship Between Age and Endpoint Value",
       x = "Age (years)", 
       y = "Endpoint Value") +
  theme_minimal()

# Convert to interactive plotly
ggplotly(p, tooltip = "text")

# Interactive data table
datatable(
  clinical_data %>%
    # select() cannot compute new columns, so derive the change first
    mutate(change = endpoint_value - baseline_value) %>%
    select(subject_id, age, sex, treatment_group, 
           baseline_value, endpoint_value, change),
  filter = "top",
  options = list(
    pageLength = 10,
    autoWidth = TRUE,
    scrollX = TRUE
  )
)

3.8 Exercises

1. Create a comprehensive baseline characteristics table for a clinical trial dataset.
2. Explore the distribution of a primary efficacy endpoint across different demographic subgroups.
3. Visualize the correlation between laboratory parameters in a longitudinal study.
4. Create a Kaplan-Meier plot for time-to-event data and explore differences between treatment groups.
5. Design a dashboard for exploratory analysis of adverse events from a clinical trial.

3.9 Summary

Exploratory data analysis is a critical step in clinical research: it supports data quality assessment, builds understanding of the study population, and guides subsequent formal analyses. R’s tidyverse ecosystem, together with specialized packages such as gtsummary, survival, and plotly, provides powerful tools for visualizing and summarizing clinical data. The insights gained during EDA not only help identify potential data issues but also inform hypothesis generation and modeling decisions in the analytical stages that follow.

In clinical research specifically, EDA serves an additional role in regulatory documentation, as these explorations often form the basis for determining analysis populations, addressing protocol deviations, and establishing the robustness of efficacy and safety conclusions.

