3 Exploratory Data Analysis for Clinical Research

3.1 The Role of EDA in Clinical Research

Exploratory Data Analysis (EDA) is a critical step in understanding clinical data before formal modeling or hypothesis testing. In regulated clinical research, thorough exploration helps identify data issues, understand distributions, and guide analysis decisions.

3.1.1 EDA Goals in Clinical Research

Exploratory analysis in clinical settings serves several specific purposes:

  1. Data quality assessment: Identifying issues that may have been missed during data cleaning
  2. Understanding baseline characteristics: Examining the study population’s key features
  3. Treatment pattern exploration: Visualizing medication adherence, dose adjustments, etc.
  4. Outcome variable exploration: Understanding distribution and relationships with predictors
  5. Informing modeling decisions: Guiding choices of statistical approaches
  6. Generating hypotheses: Identifying unexpected relationships for further investigation
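
As a minimal sketch of the first goal, the snippet below counts missing values per variable and flags physiologically implausible readings. The toy data frame `toy_vitals`, its column names, and the plausibility limits are assumptions made for illustration, not part of any real study.

```r
library(dplyr)

# Hypothetical toy data; column names and values are assumptions for illustration
toy_vitals <- tibble(
  subject_id  = 1:6,
  systolic_bp = c(128, 141, NA, 310, 119, 135),  # 310 is an implausible entry
  hba1c       = c(5.9, 7.2, 6.4, NA, NA, 8.1)
)

# Number of missing values in each measurement column
missing_counts <- toy_vitals %>%
  summarise(across(c(systolic_bp, hba1c), ~ sum(is.na(.x))))

# Rows whose systolic BP falls outside an assumed plausible range (60 to 260 mmHg)
implausible_bp <- toy_vitals %>%
  filter(!is.na(systolic_bp), systolic_bp < 60 | systolic_bp > 260)

missing_counts
implausible_bp
```

In a real study the plausibility limits would typically come from the data management plan rather than being chosen ad hoc.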

3.2 Descriptive Statistics for Clinical Data

3.2.1 Patient Demographics and Baseline Characteristics

A fundamental starting point is characterizing the study population:

Code
library(gtsummary)
library(tidyverse)

# Create a baseline characteristics table
demographics %>%
  select(age, sex, race, ethnicity, bmi, 
         comorbidity_count, treatment_group) %>%
  tbl_summary(
    by = treatment_group,
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    missing = "ifany"  # report missing counts so data gaps stay visible during EDA
  ) %>%
  add_p() %>%
  add_overall() %>%
  bold_labels()

3.2.2 Exploring Distributions of Key Variables

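Histograms show the shape of key clinical measurements. The dashed red lines mark commonly used clinical cut-offs (140 mmHg for systolic blood pressure, 130 mg/dL for LDL cholesterol, 6.5% for HbA1c, and 60 mL/min/1.73m² for eGFR):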

Code
# Visualize distribution of key clinical measurements
library(patchwork)

p1 <- ggplot(clinical_data, aes(x = systolic_bp)) +
  geom_histogram(bins = 30, fill = "#0072B2", alpha = 0.7) +
  geom_vline(xintercept = 140, linetype = "dashed", color = "red") +
  labs(title = "Systolic Blood Pressure",
       x = "mmHg", y = "Count") +
  theme_minimal()

p2 <- ggplot(clinical_data, aes(x = ldl_cholesterol)) +
  geom_histogram(bins = 30, fill = "#0072B2", alpha = 0.7) +
  geom_vline(xintercept = 130, linetype = "dashed", color = "red") +
  labs(title = "LDL Cholesterol",
       x = "mg/dL", y = "Count") +
  theme_minimal()

p3 <- ggplot(clinical_data, aes(x = hba1c)) +
  geom_histogram(bins = 30, fill = "#0072B2", alpha = 0.7) +
  geom_vline(xintercept = 6.5, linetype = "dashed", color = "red") +
  labs(title = "HbA1c",
       x = "%", y = "Count") +
  theme_minimal()

p4 <- ggplot(clinical_data, aes(x = egfr)) +
  geom_histogram(bins = 30, fill = "#0072B2", alpha = 0.7) +
  geom_vline(xintercept = 60, linetype = "dashed", color = "red") +
  labs(title = "eGFR",
       x = "mL/min/1.73m²", y = "Count") +
  theme_minimal()

(p1 + p2) / (p3 + p4)
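
The reference lines in the histograms can be backed by numbers. The following is a minimal sketch, assuming a toy data frame `toy_clinical` (hypothetical, standing in for `clinical_data`), that computes the share of patients beyond two of the cut-offs drawn above and reports how many values were missing:

```r
library(dplyr)

# Hypothetical toy data standing in for clinical_data
toy_clinical <- tibble(
  systolic_bp = c(118, 152, 139, 144, NA, 127),
  egfr        = c(92, 55, 71, 48, 88, NA)
)

threshold_summary <- toy_clinical %>%
  summarise(
    # mean() of a logical vector gives the proportion of TRUE values
    prop_sbp_ge_140 = mean(systolic_bp >= 140, na.rm = TRUE),
    prop_egfr_lt_60 = mean(egfr < 60, na.rm = TRUE),
    # Report missingness alongside, since na.rm = TRUE silently drops it
    n_sbp_missing   = sum(is.na(systolic_bp)),
    n_egfr_missing  = sum(is.na(egfr))
  )

threshold_summary
```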

3.2.3 Summarizing by Treatment Groups

Comparing key variables across treatment arms:

Code
# Compare key outcome measures by treatment group
outcomes_by_treatment <- clinical_data %>%
  group_by(treatment_group) %>%
  summarise(
    n = sum(!is.na(endpoint_change)),  # non-missing observations, so the CI matches mean/sd
    mean_change = mean(endpoint_change, na.rm = TRUE),
    sd_change = sd(endpoint_change, na.rm = TRUE),
    median_change = median(endpoint_change, na.rm = TRUE),
    q1_change = quantile(endpoint_change, 0.25, na.rm = TRUE),
    q3_change = quantile(endpoint_change, 0.75, na.rm = TRUE),
    min_change = min(endpoint_change, na.rm = TRUE),
    max_change = max(endpoint_change, na.rm = TRUE),
    responder_rate = mean(responder == "Yes", na.rm = TRUE)
  ) %>%
  mutate(
    ci_lower = mean_change - qt(0.975, n-1) * sd_change / sqrt(n),
    ci_upper = mean_change + qt(0.975, n-1) * sd_change / sqrt(n)
  )

# Visualize treatment differences
ggplot(outcomes_by_treatment, 
       aes(x = treatment_group, y = mean_change, 
           ymin = ci_lower, ymax = ci_upper, 
           color = treatment_group)) +
  geom_pointrange(size = 1) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Mean Change from Baseline by Treatment Group",
       subtitle = "With 95% Confidence Intervals",
       x = "Treatment Group", 
       y = "Mean Change in Primary Endpoint") +
  theme_minimal() +
  theme(legend.position = "none")

3.3 Visualizing Relationships in Clinical Data

3.3.1 Correlation Analysis

Examining relationships between variables:

Code
# Calculate correlations between key variables
library(corrplot)
library(corrr)

# Select numeric variables of interest
numeric_vars <- clinical_data %>%
  select(age, bmi, systolic_bp, diastolic_bp, ldl_cholesterol, 
         hdl_cholesterol, triglycerides, hba1c, creatinine, egfr) 

# Calculate correlation matrix
corr_matrix <- cor(numeric_vars, use = "pairwise.complete.obs")

# Create correlation plot
corrplot(corr_matrix, 
         method = "circle", 
         type = "upper", 
         tl.col = "black", 
         tl.srt = 45,
         diag = FALSE)

# Alternative with corrr package for a tidy approach
numeric_vars %>%
  correlate() %>%
  rearrange() %>%
  shave(upper = TRUE) %>%
  rplot(print_cor = TRUE)
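
Both `cor()` and `correlate()` compute Pearson correlations by default, which can be distorted by skewed laboratory values or outliers. A rank-based coefficient is a useful cross-check. The sketch below uses simulated toy data (`toy_labs` and its assumed relationship between the two variables are purely illustrative) to compare the two:

```r
set.seed(42)  # reproducible toy data

# Hypothetical right-skewed lab values; not real patient data
toy_labs <- data.frame(
  triglycerides = rlnorm(200, meanlog = 5, sdlog = 0.6)
)

# A monotone but non-linear relationship plus noise (assumed for illustration)
toy_labs$hdl_cholesterol <- 80 - 8 * log(toy_labs$triglycerides) +
  rnorm(200, sd = 3)

# Pearson measures linear association on the raw scale
pearson_r <- cor(toy_labs$triglycerides, toy_labs$hdl_cholesterol,
                 method = "pearson", use = "pairwise.complete.obs")

# Spearman works on ranks, so it is unaffected by monotone transformations
spearman_r <- cor(toy_labs$triglycerides, toy_labs$hdl_cholesterol,
                  method = "spearman", use = "pairwise.complete.obs")

c(pearson = pearson_r, spearman = spearman_r)
```

A large gap between the two coefficients is a hint to inspect the scatter plot or consider a transformation before modeling.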

3.3.2 Exploring Bivariate Relationships

Visualizing relationships between pairs of variables:

Code
# Create scatter plots with regression lines by treatment
ggplot(clinical_data, aes(x = baseline_value, y = endpoint_value, 
                          color = treatment_group)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = TRUE) +
  facet_wrap(~treatment_group) +
  labs(title = "Relationship Between Baseline and Endpoint Values",
       subtitle = "By Treatment Group",
       x = "Baseline Value", 
       y = "Endpoint Value") +
  theme_minimal()

# Relationship between continuous and categorical variable
ggplot(clinical_data, aes(x = age_category, y = endpoint_value, 
                          fill = age_category)) +
  geom_boxplot() +
  facet_wrap(~treatment_group) +
  labs(title = "Endpoint Values by Age Category and Treatment",
       x = "Age Category", 
       y = "Endpoint Value") +
  theme_minimal() +
  theme(legend.position = "none")

3.3.3 Stratified Analysis

Examining outcome patterns across key subgroups:

Code
# Forest plot for treatment effect across subgroups
library(forestplot)
library(grid)  # provides unit(), used for lineheight below

# For demonstration (in practice would be calculated from the data)
subgroup_results <- tibble(
  subgroup = c("Overall", "Male", "Female", "Age < 65", "Age ≥ 65", 
              "With Comorbidity", "Without Comorbidity"),
  n_treatment = c(150, 80, 70, 90, 60, 85, 65),
  n_control = c(150, 78, 72, 92, 58, 82, 68),
  mean_diff = c(12.3, 14.2, 10.1, 15.7, 8.4, 11.2, 13.5),
  lower_ci = c(8.7, 9.5, 5.2, 10.3, 3.1, 6.4, 8.9),
  upper_ci = c(15.9, 18.9, 15.0, 21.1, 13.7, 16.0, 18.1)
)

# Create a forest plot
forestplot(
  labeltext = subgroup_results$subgroup,
  mean = subgroup_results$mean_diff,
  lower = subgroup_results$lower_ci,
  upper = subgroup_results$upper_ci,
  xlab = "Treatment Effect (95% CI)",
  zero = 0,
  boxsize = 0.2,
  lineheight = unit(1, "cm"),
  col = fpColors(box = "#0072B2", line = "#0072B2", summary = "#D55E00")
)
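
The hard-coded subgroup values above would normally be derived from patient-level data. The following is one possible sketch, not the only way to do it: it assumes a simulated data set `toy_trial` with hypothetical columns `sex`, `treatment_group`, and `endpoint_change`, and uses a Welch t-interval for the difference in means. The helper `estimate_effect()` is introduced here for illustration only.

```r
library(dplyr)

set.seed(123)  # reproducible toy data

# Hypothetical patient-level data; column names and effect size are assumptions
toy_trial <- tibble(
  sex             = rep(c("Male", "Female"), each = 100),
  treatment_group = rep(rep(c("Treatment", "Control"), each = 50), times = 2),
  endpoint_change = rnorm(200,
                          mean = ifelse(treatment_group == "Treatment", 12, 0),
                          sd = 10)
)

# Mean difference (Treatment minus Control) with a 95% Welch t-interval
estimate_effect <- function(df) {
  trt <- df$endpoint_change[df$treatment_group == "Treatment"]
  ctl <- df$endpoint_change[df$treatment_group == "Control"]
  tt  <- t.test(trt, ctl)  # Welch two-sample t-test (unequal variances)
  tibble(
    n_treatment = length(trt),
    n_control   = length(ctl),
    mean_diff   = mean(trt) - mean(ctl),
    lower_ci    = tt$conf.int[1],
    upper_ci    = tt$conf.int[2]
  )
}

# Overall estimate followed by one row per level of the subgroup variable
subgroup_estimates <- bind_rows(
  estimate_effect(toy_trial) %>% mutate(subgroup = "Overall"),
  toy_trial %>%
    group_by(subgroup = sex) %>%
    group_modify(~ estimate_effect(.x)) %>%
    ungroup()
)

subgroup_estimates
```

The resulting columns match those passed to `forestplot()` above. Subgroup intervals computed this way are not adjusted for multiplicity, so they are best read as exploratory.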

3.5 Exploring Categorical Relationships

3.5.1 Contingency Tables and Mosaic Plots

Examining relationships between categorical variables:

Code
# Create contingency table
contingency_table <- table(clinical_data$response_category, 
                           clinical_data$treatment_group)

# Print the contingency table
knitr::kable(contingency_table, 
             caption = "Response Category by Treatment Group")

# Chi-square test
chi_result <- chisq.test(contingency_table)
chi_result

# Visualize with mosaic plot
library(vcd)
mosaic(contingency_table,
       main = "Response Category by Treatment Group",
       shade = TRUE)

# Alternative visualization with ggplot
ggplot(clinical_data, 
       aes(x = treatment_group, fill = response_category)) +
  geom_bar(position = "fill") +
  labs(title = "Response Categories by Treatment Group",
       x = "Treatment Group", 
       y = "Proportion",
       fill = "Response Category") +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal()
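
The chi-square approximation used by `chisq.test()` can be unreliable when expected cell counts are small. A common rule of thumb is to inspect the expected counts and switch to Fisher's exact test if any fall below 5. The sketch below illustrates this on a hypothetical 2 × 2 table; the counts in `toy_table` are invented for the example.

```r
# Hypothetical counts; rows are response categories, columns are treatment groups
toy_table <- matrix(
  c( 2, 7,
    13, 8),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    response_category = c("Responder", "Non-responder"),
    treatment_group   = c("Control", "Treatment")
  )
)

# suppressWarnings() hides R's warning that the approximation may be incorrect
chi_toy <- suppressWarnings(chisq.test(toy_table))

# Expected cell counts under independence of rows and columns
chi_toy$expected

# Rule of thumb: use an exact test if any expected count is below 5
if (any(chi_toy$expected < 5)) {
  test_result <- fisher.test(toy_table)
} else {
  test_result <- chi_toy
}

test_result$method
test_result$p.value
```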

3.6 Exploring Adverse Events

Specialized visualization for safety data:

Code
# Prepare adverse event data
ae_summary <- adverse_events %>%
  group_by(system_organ_class, preferred_term, treatment_group) %>%
  summarise(
    n_events = n(),
    n_subjects = n_distinct(subject_id),
    .groups = "drop"
  ) %>%
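  # Denominator: all subjects per treatment group, including those with no AEs.
  # Assumes `demographics` holds one row per subject in the safety population;
  # counting subjects from `adverse_events` alone would omit AE-free subjects
  # and inflate the percentages.
  left_join(
    demographics %>%
      count(treatment_group, name = "total_subjects"),
    by = "treatment_group"
  ) %>%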
  # Calculate percentages
  mutate(
    percent = (n_subjects / total_subjects) * 100
  ) %>%
  # Filter for more common AEs
  filter(n_subjects >= 5) %>%
  arrange(system_organ_class, desc(n_subjects))

# Create adverse event plot
ggplot(ae_summary, 
       aes(x = reorder(preferred_term, percent), 
           y = percent, 
           fill = treatment_group)) +
  geom_col(position = "dodge") +
  coord_flip() +
  facet_wrap(~system_organ_class, scales = "free_y") +
  labs(title = "Adverse Events by Treatment Group",
       subtitle = "Events occurring in ≥5 subjects",
       x = "", 
       y = "Percentage of Subjects (%)",
       fill = "Treatment Group") +
  theme_minimal()

3.7 Interactive Exploration

For complex clinical data, interactive tools can be valuable:

Code
library(plotly)
library(DT)

# Create interactive scatter plot
p <- ggplot(clinical_data, 
           aes(x = age, y = endpoint_value, 
               color = treatment_group,
               text = paste(
                 "Subject ID:", subject_id,
                 "\nAge:", age,
                 "\nSex:", sex,
                 "\nBaseline:", baseline_value,
                 "\nEndpoint:", endpoint_value,
                 "\nChange:", endpoint_value - baseline_value
               ))) +
  geom_point(alpha = 0.7) +
  labs(title = "Relationship Between Age and Endpoint Value",
       x = "Age (years)", 
       y = "Endpoint Value") +
  theme_minimal()

# Convert to interactive plotly
ggplotly(p, tooltip = "text")

# Interactive data table
datatable(
  clinical_data %>%
    # select() cannot compute new columns, so derive the change first
    mutate(change = endpoint_value - baseline_value) %>%
    select(subject_id, age, sex, treatment_group, 
           baseline_value, endpoint_value, change),
  filter = "top",
  options = list(
    pageLength = 10,
    autoWidth = TRUE,
    scrollX = TRUE
  )
)

3.8 Exercises

  1. Create a comprehensive baseline characteristics table for a clinical trial dataset.
  2. Explore the distribution of a primary efficacy endpoint across different demographic subgroups.
  3. Visualize the correlation between laboratory parameters in a longitudinal study.
  4. Create a Kaplan-Meier plot for time-to-event data and explore differences between treatment groups.
  5. Design a dashboard for exploratory analysis of adverse events from a clinical trial.

3.9 Summary

Exploratory data analysis is a critical step in clinical research: it supports data quality assessment, builds understanding of the study population, and guides subsequent formal analyses. R’s tidyverse ecosystem, together with specialized packages such as gtsummary, survival, and plotly, provides powerful tools for visualizing and summarizing clinical data. The insights gained during EDA help identify potential data issues and also inform hypothesis generation and modeling decisions in the analytical stages that follow.

In clinical research specifically, EDA serves an additional role in regulatory documentation, as these explorations often form the basis for determining analysis populations, addressing protocol deviations, and establishing the robustness of efficacy and safety conclusions.

3.10 References