8  Clinical Research Case Studies

8.1 Introduction to Case-Based Learning in Clinical Research

Case studies provide a powerful framework for applying the concepts, techniques, and tools we’ve explored throughout this book to real-world clinical research scenarios. In this chapter, we present comprehensive case studies that demonstrate how to implement reproducible R workflows in different clinical research contexts.

8.1.1 The Value of Case-Based Learning

Case-based learning offers several advantages for developing proficiency in clinical data analysis:

  1. Integration: Combines multiple concepts and skills into cohesive workflows
  2. Context: Places technical skills within realistic clinical research scenarios
  3. Decision points: Highlights key decision points and their implications
  4. Problem-solving: Develops critical thinking about real-world challenges
  5. Translation: Bridges the gap between theoretical knowledge and practical application

8.2 Case Study 1: Phase II Oncology Trial

8.2.1 Study Background

This case study examines a Phase II oncology trial evaluating a novel treatment for advanced non-small cell lung cancer (NSCLC):

Code
library(tidyverse)
library(survival)
library(here)
library(knitr)

# Study design information
study_design <- tribble(
  ~Parameter, ~Value,
  "Design", "Randomized, double-blind, placebo-controlled Phase II trial",
  "Population", "Adults with stage IIIB/IV NSCLC with EGFR mutations",
  "Sample size", "120 patients (1:1 randomization)",
  "Primary endpoint", "Progression-free survival (PFS)",
  "Secondary endpoints", "Overall survival (OS), objective response rate (ORR), safety",
  "Follow-up", "Up to 24 months or until disease progression"
)

# Display study design
kable(study_design, caption = "Study Design Summary")

8.2.2 Data Preparation

First, we’ll set up our project structure following the reproducible research principles from Chapter 6:

Code
# Create a function to initialize the project (normally run once)
setup_oncology_project <- function() {
  # Create directory structure
  dirs <- c(
    "data/raw",
    "data/processed",
    "R",
    "analysis",
    "reports/figures",
    "reports/tables"
  )
  
  for (d in dirs) {
    dir.create(here(d), recursive = TRUE, showWarnings = FALSE)
  }
  
  # Initialize renv
  renv::init()
  
  # Create helper scripts
  file.create(here("R", "data_cleaning.R"))
  file.create(here("R", "analysis_functions.R"))
  file.create(here("R", "visualization_functions.R"))
}

# Load and clean data
load_oncology_data <- function() {
  # Load sample datasets (in practice, these would be actual trial data)
  demographics <- read_csv(here("data", "raw", "oncology_demographics.csv"))
  baseline <- read_csv(here("data", "raw", "oncology_baseline.csv"))
  efficacy <- read_csv(here("data", "raw", "oncology_efficacy.csv"))
  safety <- read_csv(here("data", "raw", "oncology_safety.csv"))
  
  # Join datasets
  patient_data <- demographics %>%
    left_join(baseline, by = "patient_id") %>%
    mutate(
      treatment = factor(treatment_group, 
                        levels = c("Placebo", "Active")),
      sex = factor(sex),
      ecog = factor(ecog_status),
      stage = factor(disease_stage)
    )
  
  # Create analysis datasets
  efficacy_data <- efficacy %>%
    left_join(patient_data %>% select(patient_id, treatment, age, sex, ecog, stage),
             by = "patient_id")
  
  safety_data <- safety %>%
    left_join(patient_data %>% select(patient_id, treatment, age, sex),
             by = "patient_id")
  
  # Save processed datasets
  write_csv(patient_data, here("data", "processed", "patient_data.csv"))
  write_csv(efficacy_data, here("data", "processed", "efficacy_data.csv"))
  write_csv(safety_data, here("data", "processed", "safety_data.csv"))
  
  # Return list of datasets
  return(list(
    patient_data = patient_data,
    efficacy_data = efficacy_data,
    safety_data = safety_data
  ))
}

8.2.3 Primary Endpoint Analysis: Progression-Free Survival

Now, we’ll analyze the primary endpoint using survival analysis techniques:

Code
# Source helper functions
source(here("R", "analysis_functions.R"))

# Load processed data
efficacy_data <- read_csv(here("data", "processed", "efficacy_data.csv"))

# Prepare survival data
survival_data <- efficacy_data %>%
  select(patient_id, treatment, pfs_time, pfs_event, os_time, os_event) %>%
  mutate(
    pfs_event = as.logical(pfs_event),
    os_event = as.logical(os_event)
  )

# Fit Kaplan-Meier model for PFS
pfs_fit <- survfit(Surv(pfs_time, pfs_event) ~ treatment, data = survival_data)

# Summary statistics
pfs_summary <- summary(pfs_fit)
print(pfs_summary)

# Calculate hazard ratio using Cox proportional hazards model
pfs_cox <- coxph(Surv(pfs_time, pfs_event) ~ treatment, data = survival_data)
pfs_hr <- exp(coef(pfs_cox))
pfs_ci <- exp(confint(pfs_cox))
pfs_pvalue <- summary(pfs_cox)$waldtest["pvalue"]

# Create summary table
pfs_results <- tibble(
  treatment = c("Placebo", "Active"),
  n = pfs_fit$n,
  events = pfs_summary$table[, "events"],
  median = pfs_summary$table[, "median"],
  lower_ci = pfs_summary$table[, "0.95LCL"],
  upper_ci = pfs_summary$table[, "0.95UCL"]
)

# Display results
kable(pfs_results, 
     caption = "Progression-Free Survival by Treatment Group",
     col.names = c("Treatment", "N", "Events", "Median (months)", "95% CI Lower", "95% CI Upper"))

# Display hazard ratio
cat(sprintf("Hazard Ratio (Active vs. Placebo): %.2f (95%% CI: %.2f-%.2f), p = %.4f\n",
           pfs_hr, pfs_ci[1], pfs_ci[2], pfs_pvalue))

8.2.4 Visualizing the Primary Endpoint

We’ll create a publication-quality Kaplan-Meier curve using the techniques from Chapter 7:

Code
library(survminer)

# Source visualization functions
source(here("R", "visualization_functions.R"))

# Create Kaplan-Meier plot
pfs_plot <- ggsurvplot(
  pfs_fit,
  data = survival_data,
  risk.table = TRUE,
  pval = TRUE,
  conf.int = TRUE,
  palette = c("#E7B800", "#2E9FDF"),
  legend.labs = c("Placebo", "Active Treatment"),
  risk.table.height = 0.25,
  tables.theme = theme_cleantable(),
  xlab = "Time (Months)",
  ylab = "Progression-Free Survival Probability",
  title = "Kaplan-Meier Estimate of Progression-Free Survival",
  subtitle = paste0("Hazard Ratio: ", sprintf("%.2f (95%% CI: %.2f-%.2f)", 
                                            pfs_hr, pfs_ci[1], pfs_ci[2])),
  ggtheme = theme_bw() +
    theme(
      plot.title = element_text(face = "bold"),
      legend.position = "bottom"
    )
)

# Save the plot
ggsave(here("reports", "figures", "pfs_curve.png"), 
      plot = pfs_plot$plot, 
      width = 8, height = 7, dpi = 300)

# Display the plot
print(pfs_plot)

8.2.5 Response Rate Analysis

Next, we’ll analyze the objective response rate:

Code
# Extract response data
response_data <- efficacy_data %>%
  select(patient_id, treatment, best_response) %>%
  mutate(
    best_response = factor(best_response, 
                          levels = c("CR", "PR", "SD", "PD", "NE")),
    responder = best_response %in% c("CR", "PR")
  )

# Calculate response rates
response_summary <- response_data %>%
  group_by(treatment) %>%
  summarize(
    n = n(),
    responders = sum(responder, na.rm = TRUE),
    response_rate = mean(responder, na.rm = TRUE) * 100,
    ci_lower = binom.test(responders, n)$conf.int[1] * 100,
    ci_upper = binom.test(responders, n)$conf.int[2] * 100,
    .groups = "drop"
  )

# Display results
kable(response_summary,
     caption = "Objective Response Rate by Treatment Group",
     col.names = c("Treatment", "N", "Responders", "ORR (%)", "95% CI Lower", "95% CI Upper"))

# Fisher's exact test for response rate comparison
fisher_test <- fisher.test(
  table(response_data$treatment, response_data$responder)
)

cat(sprintf("Difference in ORR: p = %.4f (Fisher's exact test)\n", 
           fisher_test$p.value))

8.2.6 Safety Analysis

Now, we’ll analyze adverse events and create visualizations:

Code
# Load safety data
safety_data <- read_csv(here("data", "processed", "safety_data.csv"))

# Calculate adverse event frequencies
ae_summary <- safety_data %>%
  group_by(treatment, ae_term) %>%
  summarize(count = n(), .groups = "drop") %>%
  group_by(treatment) %>%
  mutate(percent = count / sum(count) * 100) %>%
  ungroup() %>%
  # Select common AEs (>=5% in any group)
  filter(percent >= 5) %>%
  arrange(desc(count))

# Create adverse event plot
ae_plot <- ggplot(ae_summary, 
                aes(x = reorder(ae_term, -percent), y = percent, 
                    fill = treatment)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(values = c("Placebo" = "#E7B800", "Active" = "#2E9FDF")) +
  labs(
    title = "Common Adverse Events (≥5% in Any Treatment Group)",
    x = "Adverse Event",
    y = "Percentage of Patients (%)",
    fill = "Treatment Group"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "bottom"
  )

# Save the plot
ggsave(here("reports", "figures", "ae_summary.png"), 
      plot = ae_plot, 
      width = 10, height = 6, dpi = 300)

# Display the plot
print(ae_plot)

8.2.7 Key Insights

This oncology case study demonstrates several important aspects of clinical trial analysis:

  1. Structured data preparation: Organizing trial data into analysis-ready datasets
  2. Survival analysis: Estimating treatment effects using time-to-event data
  3. Visualization techniques: Creating publication-quality Kaplan-Meier curves
  4. Response analysis: Evaluating treatment efficacy using categorical outcomes
  5. Safety assessment: Monitoring and visualizing adverse events

The integration of these approaches within a reproducible workflow provides a comprehensive analysis of the trial results, supporting accurate scientific conclusions and transparent reporting.

8.3 Case Study 2: Longitudinal Observational Study

8.3.1 Study Background

This case study examines a multi-center observational study of patients with rheumatoid arthritis:

Code
library(tidyverse)
library(lme4)
library(here)
library(knitr)

# Study design information
study_design <- tribble(
  ~Parameter, ~Value,
  "Design", "Prospective, multi-center observational cohort study",
  "Population", "Adults with rheumatoid arthritis on various treatments",
  "Sample size", "500 patients from 20 centers",
  "Primary objective", "Identify predictors of treatment response over time",
  "Follow-up", "5 years with biannual assessments",
  "Key measurements", "Disease Activity Score (DAS28), Health Assessment Questionnaire (HAQ), Biomarkers"
)

# Display study design
kable(study_design, caption = "Study Design Summary")

8.3.2 Data Preparation

We’ll prepare the longitudinal data for analysis:

Code
# Load longitudinal data
ra_data <- read_csv(here("data", "raw", "ra_longitudinal.csv"))

# Convert to long format
ra_long <- ra_data %>%
  pivot_longer(
    cols = matches("^(das28|haq|crp)_month_\\d+$"),
    names_to = c("measure", "time_point"),
    names_pattern = "^(.+)_month_(\\d+)$",
    values_to = "value"
  ) %>%
  mutate(
    time_point = as.numeric(time_point),
    treatment = factor(treatment_group),
    sex = factor(sex),
    center = factor(center_id)
  )

# Create separate datasets for each outcome
das28_data <- ra_long %>%
  filter(measure == "das28") %>%
  select(-measure) %>%
  rename(das28 = value)

haq_data <- ra_long %>%
  filter(measure == "haq") %>%
  select(-measure) %>%
  rename(haq = value)

crp_data <- ra_long %>%
  filter(measure == "crp") %>%
  select(-measure) %>%
  rename(crp = value)

# Join datasets
analysis_data <- das28_data %>%
  left_join(haq_data %>% select(patient_id, time_point, haq),
           by = c("patient_id", "time_point")) %>%
  left_join(crp_data %>% select(patient_id, time_point, crp),
           by = c("patient_id", "time_point"))

# Save processed data
write_csv(analysis_data, here("data", "processed", "ra_analysis_data.csv"))

8.3.3 Exploratory Data Visualization

We’ll create visualizations to explore longitudinal patterns:

Code
# Load processed data
ra_data <- read_csv(here("data", "processed", "ra_analysis_data.csv"))

# Create individual trajectory plot
trajectory_plot <- ggplot(ra_data, 
                         aes(x = time_point, y = das28, 
                             group = patient_id, color = treatment)) +
  geom_line(alpha = 0.2) +
  geom_smooth(aes(group = treatment), method = "loess", se = TRUE) +
  scale_color_brewer(palette = "Set1") +
  facet_wrap(~ treatment) +
  labs(
    title = "DAS28 Trajectories by Treatment Group",
    subtitle = "Individual patient trajectories with treatment group trends",
    x = "Month",
    y = "Disease Activity Score (DAS28)",
    color = "Treatment Group"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

# Save plot
ggsave(here("reports", "figures", "das28_trajectories.png"), 
      plot = trajectory_plot, 
      width = 10, height = 8, dpi = 300)

# Display plot
print(trajectory_plot)

# Create boxplot of DAS28 by time and treatment
boxplot_das28 <- ggplot(ra_data, 
                       aes(x = factor(time_point), y = das28, fill = treatment)) +
  geom_boxplot(alpha = 0.8) +
  scale_fill_brewer(palette = "Set1") +
  labs(
    title = "DAS28 Distribution Over Time by Treatment Group",
    x = "Month",
    y = "Disease Activity Score (DAS28)",
    fill = "Treatment Group"
  ) +
  theme_minimal() +
  theme(
    legend.position = "bottom",
    axis.text.x = element_text(angle = 0)
  )

# Save and display boxplot
ggsave(here("reports", "figures", "das28_boxplot.png"), 
      plot = boxplot_das28, 
      width = 10, height = 6, dpi = 300)
print(boxplot_das28)

8.3.4 Mixed Effects Modeling

We’ll use mixed effects models to analyze the longitudinal data:

Code
library(lme4)
library(lmerTest)

# Fit linear mixed effects model
lmm_model <- lmer(
  das28 ~ time_point * treatment + age + sex + (1 + time_point | patient_id) + (1 | center),
  data = ra_data
)

# Model summary
lmm_summary <- summary(lmm_model)
print(lmm_summary)

# Extract fixed effects
fixed_effects <- as.data.frame(coef(summary(lmm_model)))
fixed_effects$p_value <- fixed_effects$`Pr(>|t|)`
fixed_effects$variable <- rownames(fixed_effects)

# Display fixed effects
kable(fixed_effects %>% 
        select(variable, Estimate, `Std. Error`, `t value`, p_value),
     caption = "Fixed Effects from Linear Mixed Model",
     col.names = c("Variable", "Estimate", "Std. Error", "t value", "p-value"))

# Plot model predictions
ra_data$predicted <- predict(lmm_model)

prediction_plot <- ggplot(ra_data, aes(x = time_point, color = treatment)) +
  geom_point(aes(y = das28), alpha = 0.1) +
  geom_smooth(aes(y = predicted, group = treatment), method = "loess") +
  scale_color_brewer(palette = "Set1") +
  labs(
    title = "Mixed Effects Model Predictions of DAS28 Over Time",
    x = "Month",
    y = "Disease Activity Score (DAS28)",
    color = "Treatment Group"
  ) +
  theme_minimal()

# Save and display prediction plot
ggsave(here("reports", "figures", "lmm_predictions.png"), 
      plot = prediction_plot, 
      width = 8, height = 6, dpi = 300)
print(prediction_plot)

8.3.5 Treatment Response Analysis

We’ll analyze treatment response patterns over time:

Code
# Define treatment response criteria
ra_data <- ra_data %>%
  group_by(patient_id) %>%
  mutate(
    baseline_das28 = first(das28),
    das28_change = das28 - baseline_das28,
    responder = das28_change <= -1.2 | 
                (das28_change <= -0.6 & das28 <= 3.2)
  ) %>%
  ungroup()

# Calculate response rates by time and treatment
response_rates <- ra_data %>%
  filter(time_point %in% c(6, 12, 24, 36, 48, 60)) %>%
  group_by(time_point, treatment) %>%
  summarize(
    n = n(),
    responders = sum(responder, na.rm = TRUE),
    response_rate = responders / n * 100,
    .groups = "drop"
  )

# Display response rates
kable(response_rates,
     caption = "Response Rates by Time Point and Treatment",
     col.names = c("Month", "Treatment", "N", "Responders", "Response Rate (%)"))

# Plot response rates
response_plot <- ggplot(response_rates, 
                       aes(x = time_point, y = response_rate, 
                           color = treatment, group = treatment)) +
  geom_line(size = 1) +
  geom_point(size = 3) +
  scale_color_brewer(palette = "Set1") +
  labs(
    title = "Treatment Response Rates Over Time",
    x = "Month",
    y = "Response Rate (%)",
    color = "Treatment Group"
  ) +
  theme_minimal() +
  scale_x_continuous(breaks = c(6, 12, 24, 36, 48, 60))

# Save and display response plot
ggsave(here("reports", "figures", "response_rates.png"), 
      plot = response_plot, 
      width = 8, height = 6, dpi = 300)
print(response_plot)

8.3.6 Biomarker Analysis

We’ll examine the relationship between biomarkers and treatment response:

Code
# Fit logistic mixed model for response prediction
biomarker_model <- glmer(
  responder ~ treatment * crp + age + sex + (1 | patient_id) + (1 | center),
  family = binomial,
  data = ra_data %>% filter(time_point == 24)  # Looking at 24-month response
)

# Model summary
bio_summary <- summary(biomarker_model)
print(bio_summary)

# Extract odds ratios
odds_ratios <- exp(fixef(biomarker_model))
ci <- exp(confint(biomarker_model, method = "Wald"))

# Create results table
or_table <- data.frame(
  variable = names(fixef(biomarker_model)),
  odds_ratio = odds_ratios,
  lower_ci = ci[,1],
  upper_ci = ci[,2]
)

# Display odds ratios
kable(or_table,
     caption = "Odds Ratios for Treatment Response Prediction",
     col.names = c("Variable", "Odds Ratio", "95% CI Lower", "95% CI Upper"))

# Create prediction plot
ra_data$prob_response <- predict(biomarker_model, type = "response", 
                               newdata = ra_data %>% filter(time_point == 24))

biomarker_plot <- ggplot(ra_data %>% filter(time_point == 24), 
                       aes(x = crp, y = prob_response, color = treatment)) +
  geom_point() +
  geom_smooth(method = "loess") +
  scale_color_brewer(palette = "Set1") +
  labs(
    title = "Probability of Treatment Response by CRP Level",
    x = "C-Reactive Protein (mg/L)",
    y = "Probability of Response",
    color = "Treatment Group"
  ) +
  theme_minimal()

# Save and display biomarker plot
ggsave(here("reports", "figures", "biomarker_response.png"), 
      plot = biomarker_plot, 
      width = 8, height = 6, dpi = 300)
print(biomarker_plot)

8.3.7 Key Insights

This longitudinal case study highlights several important aspects of observational research analysis:

  1. Longitudinal data management: Transforming complex longitudinal data into analysis-ready formats
  2. Mixed effects modeling: Accounting for within-subject correlations and center effects
  3. Trajectory visualization: Creating informative visualizations of patient trajectories
  4. Response prediction: Using biomarkers to predict treatment response
  5. Time-varying effects: Capturing how treatment effects evolve over time

The methods demonstrated here can be applied to many types of longitudinal clinical studies, from small observational studies to large registry-based analyses.

8.4 Case Study 3: Registry-Based Pharmacovigilance Study

8.4.1 Study Background

This case study examines a large-scale pharmacovigilance study using registry data:

Code
library(tidyverse)
library(survival)
library(here)
library(knitr)

# Study design information
study_design <- tribble(
  ~Parameter, ~Value,
  "Design", "Registry-based retrospective cohort study",
  "Data source", "National health registry data",
  "Population", "Patients receiving one of four antihypertensive medications",
  "Sample size", "Approximately 50,000 patients",
  "Primary outcome", "Incidence of adverse events of special interest (AESI)",
  "Follow-up", "Up to 5 years from treatment initiation"
)

# Display study design
kable(study_design, caption = "Study Design Summary")

8.4.2 Data Preparation and Propensity Score Matching

Given the observational nature of registry data, we’ll implement propensity score matching:

Code
library(MatchIt)

# Load registry data
registry_data <- read_csv(here("data", "raw", "pharmacovigilance_data.csv"))

# Data preparation
analysis_data <- registry_data %>%
  mutate(
    treatment = factor(medication_class),
    sex = factor(sex),
    age_group = cut(age, breaks = c(0, 50, 65, 80, 100), 
                  labels = c("<50", "50-65", "65-80", ">80")),
    comorbidity_score = rowSums(select(., starts_with("comorbid_"))),
    # Calculating exposure time
    exposure_start = as.Date(medication_start),
    exposure_end = if_else(
      is.na(medication_end),
      as.Date(study_end),
      as.Date(medication_end)
    ),
    exposure_time = as.numeric(exposure_end - exposure_start) / 365.25
  )

# Filter to specific treatment comparisons
comparison_data <- analysis_data %>%
  filter(treatment %in% c("ACE_Inhibitor", "ARB"))

# Propensity score matching
ps_model <- matchit(
  treatment ~ age + sex + age_group + comorbidity_score + 
             comorbid_diabetes + comorbid_chf + comorbid_ckd + comorbid_cad,
  data = comparison_data,
  method = "nearest",
  ratio = 1
)

# Extract matched data
matched_data <- match.data(ps_model)

# Check balance
balance_summary <- summary(ps_model)
print(balance_summary)

# Save matched data
write_csv(matched_data, here("data", "processed", "matched_data.csv"))

8.4.3 Incidence Rate Analysis

We’ll analyze adverse event incidence rates in the matched cohort:

Code
# Load matched data
matched_data <- read_csv(here("data", "processed", "matched_data.csv"))

# Calculate incidence rates per 1000 person-years
aesi_summary <- matched_data %>%
  group_by(treatment) %>%
  summarize(
    n_patients = n(),
    person_years = sum(exposure_time, na.rm = TRUE),
    aesi_count = sum(aesi_occurred, na.rm = TRUE),
    incidence_rate = aesi_count / person_years * 1000,
    ir_lower = poisson.test(aesi_count, person_years)$conf.int[1] * 1000,
    ir_upper = poisson.test(aesi_count, person_years)$conf.int[2] * 1000,
    .groups = "drop"
  )

# Display incidence rates
kable(aesi_summary,
     caption = "Incidence Rates of Adverse Events of Special Interest",
     col.names = c("Treatment", "Patients", "Person-Years", "Events", 
                  "Incidence Rate (per 1000 PY)", "95% CI Lower", "95% CI Upper"))

# Calculate incidence rate ratio
ir_ratio <- (aesi_summary$incidence_rate[2] / aesi_summary$incidence_rate[1])
ir_ratio_ci <- epitools::rateratio(
  c(aesi_summary$aesi_count[1], aesi_summary$aesi_count[2]),
  c(aesi_summary$person_years[1], aesi_summary$person_years[2])
)$measure[2, c("estimate", "lower", "upper")]

cat(sprintf("Incidence Rate Ratio: %.2f (95%% CI: %.2f-%.2f)\n",
           ir_ratio, ir_ratio_ci["lower"], ir_ratio_ci["upper"]))

8.4.4 Time-to-Event Analysis

We’ll analyze the time to first adverse event using survival analysis:

Code
# Prepare survival data
surv_data <- matched_data %>%
  mutate(
    time_to_event = if_else(aesi_occurred == 1, time_to_aesi, exposure_time),
    event = aesi_occurred
  )

# Fit Kaplan-Meier model
km_fit <- survfit(Surv(time_to_event, event) ~ treatment, data = surv_data)

# Summary statistics
km_summary <- summary(km_fit)
print(km_summary)

# Create Kaplan-Meier plot
library(survminer)
km_plot <- ggsurvplot(
  km_fit,
  data = surv_data,
  conf.int = TRUE,
  risk.table = TRUE,
  cumevents = TRUE,
  palette = c("#E7B800", "#2E9FDF"),
  xlab = "Time (Years)",
  ylab = "AESI-Free Probability",
  title = "Time to First Adverse Event of Special Interest",
  legend.labs = c("ACE Inhibitor", "ARB"),
  ggtheme = theme_minimal()
)

# Display Kaplan-Meier plot
print(km_plot)

# Fit Cox proportional hazards model
cox_model <- coxph(Surv(time_to_event, event) ~ treatment + 
                  age + sex + comorbidity_score, 
                  data = surv_data)

# Model summary
cox_summary <- summary(cox_model)
print(cox_summary)

# Extract hazard ratios
hr_table <- data.frame(
  variable = names(coef(cox_model)),
  hazard_ratio = exp(coef(cox_model)),
  lower_ci = exp(confint(cox_model))[,1],
  upper_ci = exp(confint(cox_model))[,2],
  p_value = cox_summary$coefficients[,5]
)

# Display hazard ratios
kable(hr_table,
     caption = "Cox Proportional Hazards Model Results",
     col.names = c("Variable", "Hazard Ratio", "95% CI Lower", 
                  "95% CI Upper", "P-value"))

8.4.5 Risk Factor Analysis

We’ll identify risk factors for adverse events using regression analysis:

Code
library(glmnet)
library(caret)

# Prepare data for risk factor analysis
risk_data <- matched_data %>%
  select(
    aesi_occurred, age, sex, treatment, comorbidity_score,
    starts_with("comorbid_"), starts_with("medication_"),
    -medication_class, -medication_start, -medication_end
  ) %>%
  mutate(across(where(is.character), as.factor))

# Split data into training and testing sets
set.seed(42)
train_index <- createDataPartition(risk_data$aesi_occurred, p = 0.7, list = FALSE)
train_data <- risk_data[train_index, ]
test_data <- risk_data[-train_index, ]

# Prepare model matrix
x_train <- model.matrix(aesi_occurred ~ . - 1, data = train_data)
y_train <- train_data$aesi_occurred
x_test <- model.matrix(aesi_occurred ~ . - 1, data = test_data)
y_test <- test_data$aesi_occurred

# Fit LASSO regression for feature selection
cv_model <- cv.glmnet(x_train, y_train, family = "binomial", alpha = 1)
best_lambda <- cv_model$lambda.1se

# Final model with selected lambda
final_model <- glmnet(x_train, y_train, family = "binomial", 
                     alpha = 1, lambda = best_lambda)

# Extract coefficients
coefs <- coef(final_model)
important_features <- data.frame(
  feature = rownames(coefs)[which(coefs != 0)],
  coefficient = coefs[which(coefs != 0)]
) %>%
  arrange(desc(abs(coefficient)))

# Display important features
print(important_features)

# Evaluate model performance
predictions <- predict(final_model, newx = x_test, type = "response")
auc <- pROC::auc(y_test, predictions)
cat(sprintf("Model AUC: %.3f\n", auc))

# Create risk score categories
risk_scores <- predict(final_model, newx = model.matrix(aesi_occurred ~ . - 1, 
                                                      data = matched_data))
matched_data$risk_score <- risk_scores
matched_data$risk_group <- cut(matched_data$risk_score, 
                             breaks = quantile(matched_data$risk_score, 
                                             probs = seq(0, 1, 0.25)),
                             labels = c("Low", "Medium", "High", "Very High"),
                             include.lowest = TRUE)

# Calculate incidence by risk group
risk_incidence <- matched_data %>%
  group_by(risk_group) %>%
  summarize(
    n_patients = n(),
    person_years = sum(exposure_time, na.rm = TRUE),
    aesi_count = sum(aesi_occurred, na.rm = TRUE),
    incidence_rate = aesi_count / person_years * 1000,
    .groups = "drop"
  )

# Display risk stratification results
kable(risk_incidence,
     caption = "Incidence Rates by Risk Group",
     col.names = c("Risk Group", "Patients", "Person-Years", 
                  "Events", "Incidence Rate (per 1000 PY)"))

# Create risk stratification plot
risk_plot <- ggplot(risk_incidence, 
                  aes(x = risk_group, y = incidence_rate, fill = risk_group)) +
  geom_bar(stat = "identity", width = 0.7) +
  geom_text(aes(label = sprintf("%.1f", incidence_rate)), 
           vjust = -0.5, size = 3.5) +
  scale_fill_brewer(palette = "YlOrRd") +
  labs(
    title = "Adverse Event Incidence by Risk Group",
    x = "Risk Group",
    y = "Incidence Rate (per 1000 person-years)",
    fill = "Risk Group"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

# Display risk plot
print(risk_plot)

8.4.6 Sensitivity Analyses

We’ll conduct sensitivity analyses to test the robustness of our findings:

Code
# 1. Alternative matching methods
ps_model_alt <- matchit(
  treatment ~ age + sex + age_group + comorbidity_score + 
             comorbid_diabetes + comorbid_chf + comorbid_ckd + comorbid_cad,
  data = comparison_data,
  method = "optimal",
  ratio = 1
)

matched_data_alt <- match.data(ps_model_alt)

# Calculate incidence rates for alternative matching
aesi_summary_alt <- matched_data_alt %>%
  group_by(treatment) %>%
  summarize(
    n_patients = n(),
    person_years = sum(exposure_time, na.rm = TRUE),
    aesi_count = sum(aesi_occurred, na.rm = TRUE),
    incidence_rate = aesi_count / person_years * 1000,
    .groups = "drop"
  )

# Display alternative matching results
kable(aesi_summary_alt,
     caption = "Sensitivity Analysis: Optimal Matching Method",
     col.names = c("Treatment", "Patients", "Person-Years", 
                  "Events", "Incidence Rate (per 1000 PY)"))

# 2. IPTW (Inverse Probability of Treatment Weighting) analysis
ps_formula <- treatment ~ age + sex + age_group + comorbidity_score + 
                        comorbid_diabetes + comorbid_chf + comorbid_ckd + comorbid_cad
ps_model_iptw <- glm(ps_formula, family = binomial(), data = comparison_data)
comparison_data$ps <- predict(ps_model_iptw, type = "response")

# Calculate weights
comparison_data <- comparison_data %>%
  mutate(
    iptw_weight = case_when(
      treatment == "ACE_Inhibitor" ~ 1 / ps,
      treatment == "ARB" ~ 1 / (1 - ps)
    )
  )

# IPTW analysis
iptw_summary <- comparison_data %>%
  group_by(treatment) %>%
  summarize(
    n_patients = n(),
    weighted_n = sum(iptw_weight),
    person_years = sum(exposure_time * iptw_weight, na.rm = TRUE),
    aesi_count = sum(aesi_occurred * iptw_weight, na.rm = TRUE),
    incidence_rate = aesi_count / person_years * 1000,
    .groups = "drop"
  )

# Display IPTW results
kable(iptw_summary,
     caption = "Sensitivity Analysis: IPTW Method",
     col.names = c("Treatment", "Patients", "Weighted N", "Weighted Person-Years", 
                  "Weighted Events", "Incidence Rate (per 1000 PY)"))

8.4.7 Key Insights

This registry-based pharmacovigilance case study demonstrates several important techniques for real-world evidence analysis:

  1. Propensity score matching: Reducing confounding bias in observational data
  2. Incidence rate calculation: Properly accounting for varying exposure times
  3. Survival analysis for safety outcomes: Analyzing time-to-event data for adverse events
  4. Risk stratification: Identifying patients at higher risk of adverse events
  5. Sensitivity analyses: Testing the robustness of findings using alternative methods

These methods are essential for generating reliable evidence from real-world data sources, which increasingly complement traditional clinical trials in regulatory decision-making and clinical practice guidelines.

8.5 Conclusion

These case studies demonstrate how the principles, techniques, and tools covered throughout this book can be applied to solve real-world clinical research challenges. By integrating data preparation, statistical analysis, visualization, and reproducible workflows, we can effectively analyze and communicate clinical research findings.

The key lessons from these case studies include:

  1. Integration of methods: Each case study required multiple analytical approaches working together
  2. Reproducible workflows: Structured project organization and dependency management ensured reproducibility
  3. Visualization as communication: Tailored visualizations effectively communicated complex findings
  4. Context-specific approaches: Different clinical research contexts required adapting our analytical approach

In the next chapter, we’ll explore regulatory considerations for R-based clinical research, building on the practical applications demonstrated in these case studies.

8.6 References