4  Statistical Modeling for Clinical Research

4.1 Introduction to Statistical Modeling in Clinical Settings

Statistical modeling is at the core of evidence generation in clinical research. Models help us test hypotheses, quantify treatment effects, adjust for confounding factors, and make predictions about patient outcomes.

4.1.1 Statistical Modeling Objectives in Clinical Research

The key objectives of statistical modeling in clinical research include:

  1. Estimating treatment effects: Quantifying the impact of interventions on clinical outcomes
  2. Controlling for confounding: Adjusting for factors that may bias treatment effect estimates
  3. Predicting outcomes: Developing tools to forecast patient responses or disease progression
  4. Identifying risk factors: Understanding which variables are associated with outcomes
  5. Subgroup analysis: Exploring heterogeneity of treatment effects across patient populations
  6. Handling missing data: Employing methods that properly account for incomplete observations
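
The code throughout this chapter operates on a clinical_data data frame (and, later, longitudinal_data and observational_data) whose construction is not shown. As a self-contained starting point, the sketch below simulates a hypothetical dataset containing the variables used in Sections 4.2 and 4.3; the variable names match the later examples, but the sample size, distributions, and effect sizes are purely illustrative.

Code
library(tidyverse)

# Hypothetical simulated trial data -- for illustration only
set.seed(42)
n <- 300

clinical_data <- tibble(
  subject_id        = sprintf("S%03d", 1:n),
  treatment_group   = factor(rep(c("Control", "Treatment"), each = n / 2)),
  age               = round(rnorm(n, mean = 58, sd = 10)),
  sex               = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  comorbidity_score = rpois(n, lambda = 2),
  baseline_value    = rnorm(n, mean = 50, sd = 8),
  baseline_severity = rnorm(n, mean = 5, sd = 1.5)
) %>%
  mutate(
    # Continuous endpoint that tracks baseline, with an assumed treatment benefit
    endpoint_value = 10 + 0.8 * baseline_value -
      3 * (treatment_group == "Treatment") + rnorm(n, sd = 6),
    # Binary responder status, more likely under treatment
    responder = rbinom(n, 1, plogis(-1 + (treatment_group == "Treatment") -
                                      0.2 * baseline_severity))
  )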

4.2 Linear Models for Continuous Outcomes

4.2.1 Simple and Multiple Linear Regression

Linear regression remains a workhorse for analyzing continuous outcomes:

Code
library(tidyverse)
library(broom)
library(car)
library(ggplot2)

# Simple linear regression
simple_model <- lm(endpoint_value ~ baseline_value, data = clinical_data)

# Summarize the model
summary(simple_model)

# Tidy model output
tidy(simple_model, conf.int = TRUE)

# Multiple linear regression with treatment and covariates
multiple_model <- lm(endpoint_value ~ baseline_value + treatment_group + 
                      age + sex + comorbidity_score,
                    data = clinical_data)

# Summarize the multiple regression model
summary(multiple_model)

# Check assumptions
par(mfrow = c(2, 2))
plot(multiple_model)

# Check for multicollinearity
vif(multiple_model)

# Visualize the relationship with confidence intervals
ggplot(clinical_data, aes(x = baseline_value, y = endpoint_value, 
                         color = treatment_group)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "Relationship Between Baseline and Endpoint Values",
       subtitle = "By Treatment Group with 95% Confidence Bands",
       x = "Baseline Value", 
       y = "Endpoint Value") +
  theme_minimal()

4.2.2 Analysis of Covariance (ANCOVA)

ANCOVA compares treatment groups on a continuous endpoint while adjusting for its baseline value, and is a standard primary analysis for randomized trials:

Code
# ANCOVA model for treatment comparison adjusting for baseline
ancova_model <- lm(endpoint_value ~ baseline_value + treatment_group, 
                  data = clinical_data)

# Model summary
summary(ancova_model)

# Extract adjusted means
library(emmeans)
adjusted_means <- emmeans(ancova_model, ~ treatment_group)
adjusted_means

# Pairwise comparisons with adjustment for multiple testing
pairs(adjusted_means, adjust = "tukey")

# Visualize adjusted means with confidence intervals
plot(adjusted_means) +
  labs(title = "Adjusted Means by Treatment Group",
       subtitle = "Controlling for Baseline Value",
       x = "Treatment Group",
       y = "Adjusted Endpoint Value") +
  theme_minimal()

4.2.3 Diagnostics and Assumption Checking

Thorough model checking is essential in clinical research:

Code
# Comprehensive diagnostics for the ANCOVA model
# Check normality of residuals
shapiro.test(residuals(ancova_model))

# Check homogeneity of variance
leveneTest(endpoint_value ~ treatment_group, data = clinical_data)

# Create augmented data for diagnostics
model_diagnostics <- augment(ancova_model)

# Residual plots
ggplot(model_diagnostics, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  geom_smooth(method = "loess", se = FALSE) +
  labs(title = "Residuals vs Fitted Values",
       x = "Fitted Values", 
       y = "Residuals") +
  theme_minimal()

# QQ plot
ggplot(model_diagnostics, aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Normal Q-Q Plot of Residuals",
       x = "Theoretical Quantiles", 
       y = "Sample Quantiles") +
  theme_minimal()

# Check for influential observations
influence_measures <- influence.measures(ancova_model)
summary(influence_measures)

# Visual check for influential points
ggplot(model_diagnostics, aes(x = .hat, y = .cooksd)) +
  geom_point() +
  geom_text(aes(label = ifelse(.cooksd > 0.1, rownames(model_diagnostics), "")), 
            vjust = -1) +
  labs(title = "Cook's Distance vs Leverage",
       x = "Leverage", 
       y = "Cook's Distance") +
  theme_minimal()
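
If the diagnostics reveal heavy-tailed residuals or influential observations that cannot be traced to data errors, a useful sensitivity analysis is to refit the model with a robust estimator and compare the treatment effect. A minimal sketch using MASS::rlm (M-estimation with Huber weights by default), assuming treatment_group has levels Control and Treatment as elsewhere in this chapter:

Code
library(MASS)

# Robust refit of the ANCOVA model as a sensitivity analysis
robust_ancova <- rlm(endpoint_value ~ baseline_value + treatment_group, 
                     data = clinical_data)
summary(robust_ancova)

# Compare the treatment coefficient from the ordinary and robust fits
coef(ancova_model)["treatment_groupTreatment"]
coef(robust_ancova)["treatment_groupTreatment"]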

4.3 Generalized Linear Models for Non-Continuous Outcomes

4.3.1 Logistic Regression for Binary Outcomes

Logistic regression is essential for analyzing binary outcomes like treatment response:

Code
# Logistic regression for binary outcome (e.g., treatment response)
logistic_model <- glm(responder ~ treatment_group + age + sex + 
                      baseline_severity + comorbidity_score,
                     family = binomial(link = "logit"),
                     data = clinical_data)

# Model summary
summary(logistic_model)

# Tidy output with odds ratios
tidy(logistic_model, exponentiate = TRUE, conf.int = TRUE) %>%
  knitr::kable(digits = 3)

# Calculate predicted probabilities
clinical_data <- clinical_data %>%
  mutate(pred_prob = predict(logistic_model, type = "response"))

# Plot predicted probabilities by treatment group
# (points plus a smoother: predictions also depend on age and comorbidity
#  score, so connecting raw values with a line would zig-zag)
ggplot(clinical_data, aes(x = baseline_severity, y = pred_prob, 
                          color = treatment_group)) +
  geom_point(alpha = 0.5) +
  geom_smooth(se = FALSE) +
  facet_wrap(~ sex) +
  labs(title = "Predicted Probability of Response",
       subtitle = "By Treatment Group, Baseline Severity, and Sex",
       x = "Baseline Severity", 
       y = "Predicted Probability of Response") +
  theme_minimal()

# Assess model performance
library(pROC)
roc_curve <- roc(clinical_data$responder, clinical_data$pred_prob)
auc(roc_curve)

# Plot ROC curve (ggroc plots sensitivity against specificity)
ggroc(roc_curve) +
  geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), 
               linetype = "dashed", color = "gray") +
  labs(title = "ROC Curve for Logistic Regression Model",
       subtitle = paste("AUC =", round(auc(roc_curve), 3)),
       x = "Specificity", 
       y = "Sensitivity") +
  theme_minimal()
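
Discrimination (the AUC) is only part of model performance; it is also worth checking calibration, i.e., whether predicted probabilities agree with observed response rates. The sketch below bins predictions into deciles and compares the mean predicted probability with the observed rate in each bin; decile binning is one simple convention, and the code assumes responder is coded 0/1 as in the roc() call above.

Code
# Simple calibration check: observed vs predicted response rate by decile
calibration_data <- clinical_data %>%
  mutate(risk_decile = ntile(pred_prob, 10)) %>%
  group_by(risk_decile) %>%
  summarise(
    mean_predicted = mean(pred_prob),
    observed_rate = mean(responder),   # assumes responder is 0/1
    .groups = "drop"
  )

ggplot(calibration_data, aes(x = mean_predicted, y = observed_rate)) +
  geom_point(size = 2) +
  geom_abline(linetype = "dashed", color = "gray") +
  coord_equal(xlim = c(0, 1), ylim = c(0, 1)) +
  labs(title = "Calibration by Risk Decile",
       x = "Mean Predicted Probability", 
       y = "Observed Response Rate") +
  theme_minimal()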

4.3.2 Poisson and Negative Binomial Regression for Count Data

For analyzing count outcomes such as adverse events:

Code
# Poisson regression for count outcome (e.g., number of adverse events)
poisson_model <- glm(ae_count ~ treatment_group + age + exposure_time,
                    family = poisson(link = "log"),
                    data = clinical_data)

# Check for overdispersion
dispersion_test <- sum(residuals(poisson_model, type = "pearson")^2) / 
                  poisson_model$df.residual
dispersion_test

# If overdispersion is present, use negative binomial
if(dispersion_test > 1.5) {
  library(MASS)
  nb_model <- glm.nb(ae_count ~ treatment_group + age + exposure_time,
                    data = clinical_data)
  summary(nb_model)
  final_count_model <- nb_model
} else {
  summary(poisson_model)
  final_count_model <- poisson_model
}

# Calculate incidence rate ratios
tidy(final_count_model, exponentiate = TRUE, conf.int = TRUE) %>%
  knitr::kable(digits = 3)

# Offset for exposure time
offset_model <- glm(ae_count ~ treatment_group + age + offset(log(exposure_time)),
                   family = poisson(link = "log"),
                   data = clinical_data)

# Compare predictions
clinical_data <- clinical_data %>%
  mutate(
    pred_count = predict(final_count_model, type = "response"),
    pred_rate = predict(offset_model, type = "response")
  )

# Visualize predicted counts
# (points plus a smoother: predictions also depend on exposure time,
#  so connecting raw values with a line would zig-zag)
ggplot(clinical_data, aes(x = age, y = pred_count, color = treatment_group)) +
  geom_point(alpha = 0.5) +
  geom_smooth(se = FALSE) +
  labs(title = "Predicted Adverse Event Counts",
       subtitle = "By Treatment Group and Age",
       x = "Age", 
       y = "Predicted Number of Adverse Events") +
  theme_minimal()

4.3.3 Ordinal Regression for Ordered Categorical Data

For analyzing ordered outcomes like disease severity stages:

Code
# Ordinal logistic regression for ordered categorical outcome
library(MASS)
ordinal_model <- polr(ordered_outcome ~ treatment_group + age + 
                     baseline_category + comorbidity_score,
                     data = clinical_data,
                     Hess = TRUE)

# Summary of model
summary(ordinal_model)

# Calculate confidence intervals and exponentiated coefficients
ci <- confint(ordinal_model)
exp_coef <- exp(coef(ordinal_model))
exp_ci <- exp(ci)

# Create tidy output with odds ratios
ord_results <- data.frame(
  term = names(coef(ordinal_model)),
  odds_ratio = exp_coef,
  lower_ci = exp_ci[,1],
  upper_ci = exp_ci[,2]
)

# Test the proportional odds assumption
# nominal_test() works on clm fits from the ordinal package, not on polr
# objects, so refit the same model with clm() before testing
library(ordinal)
clm_model <- clm(ordered_outcome ~ treatment_group + age + 
                   baseline_category + comorbidity_score,
                 data = clinical_data)
nominal_test(clm_model)

4.4 Survival Analysis for Time-to-Event Data

4.4.1 Kaplan-Meier Estimation

Non-parametric survival analysis is fundamental in clinical research:

Code
library(survival)
library(survminer)

# Fit Kaplan-Meier curves by treatment group
# (building the Surv() object inside the formula, rather than as a separate
#  variable, keeps later survdiff/survminer calls working with `data`)
km_fit <- survfit(Surv(time_to_event, event_indicator) ~ treatment_group, 
                  data = clinical_data)

# Summary of the Kaplan-Meier fit
summary(km_fit)

# Extract median survival times
km_fit

# Log-rank test for group differences
log_rank <- survdiff(Surv(time_to_event, event_indicator) ~ treatment_group, 
                     data = clinical_data)
log_rank

# Kaplan-Meier plot with p-value
ggsurvplot(
  km_fit,
  data = clinical_data,
  pval = TRUE,
  conf.int = TRUE,
  risk.table = TRUE,
  risk.table.height = 0.25,
  ggtheme = theme_minimal(),
  xlab = "Time (months)",
  ylab = "Survival Probability",
  legend.title = "Treatment Group",
  legend.labs = c("Control", "Treatment"),
  palette = c("#0072B2", "#D55E00")
)

4.4.2 Cox Proportional Hazards Models

Cox regression is the standard approach for multivariable survival analysis:

Code
# Cox proportional hazards model
cox_model <- coxph(Surv(time_to_event, event_indicator) ~ treatment_group + 
                     age + sex + baseline_risk_score + strata(study_center),
                   data = clinical_data)

# Model summary
summary(cox_model)

# Hazard ratios with confidence intervals
tidy(cox_model, exponentiate = TRUE, conf.int = TRUE) %>%
  knitr::kable(digits = 3)

# Check proportional hazards assumption
ph_test <- cox.zph(cox_model)
ph_test

# Plot Schoenfeld residuals to visualize PH assumption
ggcoxzph(ph_test)

# Forest plot of hazard ratios
ggforest(cox_model, data = clinical_data)

# Adjusted survival curves, averaged over the covariate distribution
ggadjustedcurves(
  fit = cox_model,
  data = clinical_data,
  variable = "treatment_group",
  ggtheme = theme_minimal()
)
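
If cox.zph() flags a covariate, two common remedies are to stratify on that covariate or to model a time-varying effect for it. The sketch below illustrates the stratification approach, assuming hypothetically that sex is the offending variable; each stratum then gets its own baseline hazard, and no hazard ratio is reported for sex itself.

Code
# Refit, stratifying on the covariate that violates proportional hazards
# (sex is used here purely as an illustration)
cox_model_strat <- coxph(
  Surv(time_to_event, event_indicator) ~ treatment_group + age + 
    baseline_risk_score + strata(sex) + strata(study_center),
  data = clinical_data
)

summary(cox_model_strat)

# Re-check the assumption for the remaining covariates
cox.zph(cox_model_strat)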

4.4.3 Competing Risks Analysis

When multiple competing events are of interest:

Code
library(cmprsk)

# Create data for competing risks analysis
# Assume event_type: 0=censored, 1=event of interest, 2=competing event
cr_data <- clinical_data %>%
  mutate(
    status = case_when(
      event_indicator == 0 ~ 0,  # censored
      event_type == "primary" ~ 1,  # primary outcome
      event_type == "competing" ~ 2  # competing event
    )
  )

# Cumulative incidence function
cif <- cuminc(
  ftime = cr_data$time_to_event,
  fstatus = cr_data$status,
  group = cr_data$treatment_group
)

# Plot cumulative incidence
plot(cif, col = c("blue", "red", "green", "orange"))

# Fine-Gray competing risks regression
# crr() requires a numeric covariate matrix, so build one with model.matrix()
# (this also avoids dplyr::select() being masked by MASS)
cov_matrix <- model.matrix(~ age + sex + baseline_risk_score, 
                           data = cr_data)[, -1]

fg_model <- crr(
  ftime = cr_data$time_to_event,
  fstatus = cr_data$status,
  cov1 = cov_matrix,
  failcode = 1,
  cencode = 0
)

# Summarize model
summary(fg_model)

# Plot competing risks cumulative incidence with survminer
library(survminer)
ggcompetingrisks(
  fit = cif,
  palette = "jco",
  ggtheme = theme_minimal()
)

4.5 Mixed-Effects Models for Longitudinal Data

4.5.1 Linear Mixed Models

For analyzing repeated measures with continuous outcomes:

Code
library(lme4)
library(lmerTest)

# Linear mixed model for repeated measures
mixed_model <- lmer(
  outcome ~ treatment_group * visit + baseline_value + age + sex + 
  (1 | subject_id),
  data = longitudinal_data
)

# Model summary
summary(mixed_model)

# Type III ANOVA table (Satterthwaite degrees of freedom via lmerTest)
anova(mixed_model, type = "III")

# Extract fixed effects
fixef(mixed_model)

# Extract random effects
ranef(mixed_model)

# Check residuals
plot(mixed_model)
qqnorm(resid(mixed_model))
qqline(resid(mixed_model))

# Plot interaction between treatment and visit
library(emmeans)
emm <- emmeans(mixed_model, ~ treatment_group | visit)
plot(emm)

# Extract and plot predicted values
longitudinal_data$predicted <- predict(mixed_model)

ggplot(longitudinal_data, aes(x = visit, y = outcome, 
                             group = subject_id, color = treatment_group)) +
  geom_line(alpha = 0.3) +
  geom_smooth(aes(y = predicted, group = treatment_group), 
              size = 1.5, se = FALSE) +
  facet_wrap(~ treatment_group) +
  labs(title = "Individual Trajectories with Model Predictions",
       x = "Visit", 
       y = "Outcome Measure") +
  theme_minimal()
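
The random-intercept model above implies a single, constant correlation between any two measurements on the same subject. Exercise 4 asks you to explore alternative correlation structures; one way is to fit the same mean model with nlme::lme() under different residual correlation structures and compare fits by information criteria. A minimal sketch, assuming visits are roughly equally spaced so an AR(1) structure is plausible:

Code
library(nlme)

# Random intercept with independent within-subject residuals
# (comparable to the lmer() fit above)
lme_ind <- lme(outcome ~ treatment_group * visit + baseline_value + age + sex,
               random = ~ 1 | subject_id,
               data = longitudinal_data, method = "REML")

# Same model with an AR(1) correlation for residuals within subject
lme_ar1 <- update(lme_ind, correlation = corAR1(form = ~ 1 | subject_id))

# Compare fits (same fixed effects, so REML-based criteria are comparable)
AIC(lme_ind, lme_ar1)
anova(lme_ind, lme_ar1)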

4.5.2 Generalized Linear Mixed Models

For non-continuous longitudinal outcomes:

Code
# Generalized linear mixed model for binary repeated outcome
library(lme4)
binary_mixed_model <- glmer(
  binary_outcome ~ treatment_group * visit + baseline_value + age + sex +
  (1 | subject_id),
  family = binomial,
  data = longitudinal_data
)

# Model summary
summary(binary_mixed_model)

# Odds ratios with confidence intervals
fixef_results <- data.frame(
  estimate = fixef(binary_mixed_model),
  se = sqrt(diag(vcov(binary_mixed_model)))
)

fixef_results <- fixef_results %>%
  mutate(
    odds_ratio = exp(estimate),
    lower_ci = exp(estimate - 1.96 * se),
    upper_ci = exp(estimate + 1.96 * se)
  )

# Plot predicted probabilities over time
newdata <- expand.grid(
  visit = unique(longitudinal_data$visit),
  treatment_group = unique(longitudinal_data$treatment_group),
  baseline_value = mean(longitudinal_data$baseline_value),
  age = mean(longitudinal_data$age),
  sex = "Female"  # or whatever reference category you want
)

newdata$predicted_prob <- predict(binary_mixed_model, 
                                  newdata = newdata, 
                                  type = "response",
                                  re.form = NA)  # random effects set to zero (typical subject)

ggplot(newdata, aes(x = visit, y = predicted_prob, 
                    color = treatment_group, group = treatment_group)) +
  geom_line(size = 1.2) +
  geom_point(size = 3) +
  scale_y_continuous(limits = c(0, 1)) +
  labs(title = "Predicted Probability of Outcome Over Time",
       subtitle = "By Treatment Group",
       x = "Visit", 
       y = "Predicted Probability") +
  theme_minimal()

4.6 Advanced Modeling Techniques

4.6.1 Multiple Imputation for Missing Data

Proper handling of missing data is crucial in clinical research:

Code
library(mice)

# Examine missing data patterns
md.pattern(clinical_data)

# Create multiple imputations (set a seed so the imputations are reproducible)
imp <- mice(clinical_data, m = 10, seed = 123, printFlag = FALSE)
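
# Before pooling, inspect the imputations: the trace plots of chain means and
# SDs should mix well across iterations, and the imputed values (red) should
# look plausible against the observed values (blue)
plot(imp)
densityplot(imp)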

# Fit model on each imputed dataset
models <- with(imp, lm(endpoint_value ~ baseline_value + treatment_group + age + sex))

# Pool results across imputations
pooled_results <- pool(models)
summary(pooled_results)

# Compare with complete case analysis
complete_case_model <- lm(endpoint_value ~ baseline_value + treatment_group + age + sex,
                         data = clinical_data)
summary(complete_case_model)

# Visualize differences
pooled_coefs <- summary(pooled_results)
complete_coefs <- tidy(complete_case_model)

# Combine results for plotting
comparison <- bind_rows(
  pooled_coefs %>% 
    select(term = term, estimate = estimate, std.error = std.error) %>%
    mutate(method = "Multiple Imputation"),
  complete_coefs %>% 
    select(term = term, estimate = estimate, std.error = std.error) %>%
    mutate(method = "Complete Case")
) %>%
  filter(term != "(Intercept)") %>%
  mutate(
    lower_ci = estimate - 1.96 * std.error,
    upper_ci = estimate + 1.96 * std.error
  )

# Plot comparison
ggplot(comparison, aes(x = term, y = estimate, color = method)) +
  geom_point(position = position_dodge(width = 0.5), size = 3) +
  geom_errorbar(aes(ymin = lower_ci, ymax = upper_ci),
                position = position_dodge(width = 0.5),
                width = 0.25) +
  coord_flip() +
  labs(title = "Comparison of Multiple Imputation vs Complete Case Analysis",
       x = "", 
       y = "Coefficient Estimate (95% CI)") +
  theme_minimal()

4.6.2 Bayesian Approaches

Bayesian methods are increasingly used in clinical research, for example in adaptive designs and when incorporating prior evidence:

Code
library(rstanarm)
library(bayesplot)

# Bayesian linear model
bayes_model <- stan_glm(
  endpoint_value ~ baseline_value + treatment_group + age + sex,
  data = clinical_data,
  prior = normal(0, 5),
  prior_intercept = normal(0, 10),
  chains = 4,
  iter = 2000
)

# Model summary
summary(bayes_model)
prior_summary(bayes_model)

# Check convergence with trace plots of the posterior draws
mcmc_trace(as.array(bayes_model))

# Posterior intervals for the coefficients
mcmc_intervals(as.matrix(bayes_model))

# Posterior predictive checks
pp_check(bayes_model)

# Treatment effect with 95% credible interval
treatment_effect <- as.matrix(bayes_model)[, "treatment_groupTreatment"]
mean(treatment_effect)
quantile(treatment_effect, c(0.025, 0.975))

# Probability that treatment effect is greater than 0
mean(treatment_effect > 0)

# Visualize posterior distribution of treatment effect
mcmc_areas(as.matrix(bayes_model), pars = "treatment_groupTreatment") +
  geom_vline(xintercept = 0, linetype = "dashed") +
  labs(title = "Posterior Distribution of Treatment Effect",
       subtitle = "With 95% Credible Interval",
       x = "Treatment Effect", 
       y = "Density") +
  theme_minimal()

4.6.3 Propensity Score Methods

For observational studies and causal inference:

Code
library(MatchIt)
library(cobalt)

# Fit propensity score model
ps_model <- glm(treatment_group ~ age + sex + baseline_value + 
               comorbidity_score + center,
              family = binomial,
              data = observational_data)

# Add propensity scores to data
observational_data$propensity_score <- predict(ps_model, type = "response")

# Check propensity score distribution
ggplot(observational_data, aes(x = propensity_score, fill = treatment_group)) +
  geom_density(alpha = 0.5) +
  labs(title = "Propensity Score Distributions",
       x = "Propensity Score", 
       y = "Density") +
  theme_minimal()

# Propensity score matching
match_obj <- matchit(
  treatment_group ~ age + sex + baseline_value + comorbidity_score + center,
  data = observational_data,
  method = "nearest",
  ratio = 1,
  caliper = 0.2
)

# Examine balance before and after matching
summary(match_obj)

# Get matched data
matched_data <- match.data(match_obj)

# Visualize covariate balance
love.plot(match_obj, binary = "std", 
          thresholds = c(m = 0.1),
          var.order = "unadjusted",
          abs = TRUE) +
  labs(title = "Standardized Mean Differences Before and After Matching")

# Treatment effect analysis on matched data
matched_model <- lm(endpoint_value ~ treatment_group, data = matched_data)
summary(matched_model)
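
# A commonly recommended refinement (see the MatchIt vignettes) is to use the
# matching weights returned by match.data() and cluster-robust standard errors
# by matched pair (subclass); a sketch using the sandwich and lmtest packages
library(lmtest)
library(sandwich)

matched_model_w <- lm(endpoint_value ~ treatment_group,
                      data = matched_data, weights = weights)
coeftest(matched_model_w, vcov. = vcovCL, cluster = ~ subclass)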

# Alternative: Inverse probability weighting
observational_data$ipw <- ifelse(
  observational_data$treatment_group == "Treatment",
  1 / observational_data$propensity_score,
  1 / (1 - observational_data$propensity_score)
)

# Stabilize weights
observational_data <- observational_data %>%
  mutate(
    ipw_stabilized = ipw * ifelse(
      treatment_group == "Treatment",
      mean(treatment_group == "Treatment"),
      mean(treatment_group == "Control")
    )
  )

# IPW regression
ipw_model <- lm(
  endpoint_value ~ treatment_group,
  data = observational_data,
  weights = ipw_stabilized
)

summary(ipw_model)
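
Ordinary least-squares standard errors do not account for the weighting, so IPW analyses are usually reported with robust (sandwich) standard errors. A minimal sketch using the sandwich and lmtest packages:

Code
library(lmtest)
library(sandwich)

# Robust (sandwich) standard errors and confidence interval for the
# weighted treatment effect
coeftest(ipw_model, vcov. = vcovHC(ipw_model, type = "HC1"))
coefci(ipw_model, vcov. = vcovHC(ipw_model, type = "HC1"))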

4.7 Exercises

  1. Using a clinical trial dataset, conduct a complete ANCOVA analysis for a continuous outcome, including thorough diagnostics and visualization of results.

  2. Analyze a binary outcome from a clinical study using logistic regression, and create plots that illustrate how the probability of response varies with key covariates.

  3. Perform a survival analysis for time-to-event data, including both Kaplan-Meier estimation and Cox proportional hazards modeling. Check and address any violations of the proportional hazards assumption.

  4. Analyze longitudinal data from a clinical trial using a linear mixed-effects model. Explore different correlation structures and select the best-fitting model based on information criteria.

  5. Compare multiple imputation, complete case analysis, and LOCF (Last Observation Carried Forward) approaches on a dataset with missing values, and discuss the impact on parameter estimates and conclusions.

4.8 Summary

Statistical modeling in clinical research requires careful consideration of study design, outcome types, and data characteristics. This chapter has covered a range of modeling approaches commonly used in clinical settings, from traditional methods like linear and logistic regression to more specialized techniques for longitudinal and survival data. We’ve emphasized not just model fitting, but also thorough diagnostics, visualization, and interpretation—all essential components of reproducible and reliable clinical research.

When applying these methods, it’s critical to align the modeling approach with the research questions and to be transparent about assumptions and limitations. Modern R packages such as the tidyverse, lme4, survival, and rstanarm provide powerful tools for implementing these methods within a reproducible framework, allowing clinical researchers to generate robust evidence while maintaining regulatory compliance.
