4  Statistical Modeling for Clinical Research

4.1 Introduction to Statistical Modeling in Clinical Settings

Statistical modeling is at the core of evidence generation in clinical research. Models help us test hypotheses, quantify treatment effects, adjust for confounding factors, and make predictions about patient outcomes.

4.1.1 Statistical Modeling Objectives in Clinical Research

The key objectives of statistical modeling in clinical research include:

  1. Estimating treatment effects: Quantifying the impact of interventions on clinical outcomes
  2. Controlling for confounding: Adjusting for factors that may bias treatment effect estimates
  3. Predicting outcomes: Developing tools to forecast patient responses or disease progression
  4. Identifying risk factors: Understanding which variables are associated with outcomes
  5. Subgroup analysis: Exploring heterogeneity of treatment effects across patient populations
  6. Handling missing data: Employing methods that properly account for incomplete observations
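
The code throughout this chapter operates on a clinical_data data frame (and, later, longitudinal_data and observational_data) whose construction is not shown. As a self-contained starting point, the sketch below simulates a hypothetical dataset containing the variables used in Sections 4.2 and 4.3; the variable names match the later examples, but the sample size, distributions, and effect sizes are purely illustrative.

Code
library(tidyverse)

# Hypothetical simulated trial data -- for illustration only
set.seed(42)
n <- 300

clinical_data <- tibble(
  subject_id        = sprintf("S%03d", 1:n),
  treatment_group   = factor(rep(c("Control", "Treatment"), each = n / 2)),
  age               = round(rnorm(n, mean = 58, sd = 10)),
  sex               = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  comorbidity_score = rpois(n, lambda = 2),
  baseline_value    = rnorm(n, mean = 50, sd = 8),
  baseline_severity = rnorm(n, mean = 5, sd = 1.5)
) %>%
  mutate(
    # Continuous endpoint that tracks baseline, with an assumed treatment benefit
    endpoint_value = 10 + 0.8 * baseline_value -
      3 * (treatment_group == "Treatment") + rnorm(n, sd = 6),
    # Binary responder status, more likely under treatment
    responder = rbinom(n, 1, plogis(-1 + (treatment_group == "Treatment") -
                                      0.2 * baseline_severity))
  )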

4.2 Linear Models for Continuous Outcomes

4.2.1 Simple and Multiple Linear Regression

Linear regression remains a workhorse for analyzing continuous outcomes:

Code
library(tidyverse)
library(broom)
library(car)
library(ggplot2)

# Simple linear regression
simple_model <- lm(endpoint_value ~ baseline_value, data = clinical_data)

# Summarize the model
summary(simple_model)

# Tidy model output
tidy(simple_model, conf.int = TRUE)

# Multiple linear regression with treatment and covariates
multiple_model <- lm(endpoint_value ~ baseline_value + treatment_group + 
                      age + sex + comorbidity_score,
                    data = clinical_data)

# Summarize the multiple regression model
summary(multiple_model)

# Check assumptions
par(mfrow = c(2, 2))
plot(multiple_model)

# Check for multicollinearity
vif(multiple_model)

# Visualize the relationship with confidence intervals
ggplot(clinical_data, aes(x = baseline_value, y = endpoint_value, 
                         color = treatment_group)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "Relationship Between Baseline and Endpoint Values",
       subtitle = "By Treatment Group with 95% Confidence Bands",
       x = "Baseline Value", 
       y = "Endpoint Value") +
  theme_minimal()

4.2.2 Analysis of Covariance (ANCOVA)

ANCOVA compares treatment groups on a continuous endpoint while adjusting for its baseline value, and is a standard primary analysis for randomized trials:

Code
# ANCOVA model for treatment comparison adjusting for baseline
ancova_model <- lm(endpoint_value ~ baseline_value + treatment_group, 
                  data = clinical_data)

# Model summary
summary(ancova_model)

# Extract adjusted means
library(emmeans)
adjusted_means <- emmeans(ancova_model, ~ treatment_group)
adjusted_means

# Pairwise comparisons with adjustment for multiple testing
pairs(adjusted_means, adjust = "tukey")

# Visualize adjusted means with confidence intervals
plot(adjusted_means) +
  labs(title = "Adjusted Means by Treatment Group",
       subtitle = "Controlling for Baseline Value",
       x = "Treatment Group",
       y = "Adjusted Endpoint Value") +
  theme_minimal()

4.2.3 Diagnostics and Assumption Checking

Thorough model checking is essential in clinical research:

Code
# Comprehensive diagnostics for the ANCOVA model
# Check normality of residuals
shapiro.test(residuals(ancova_model))

# Check homogeneity of variance
leveneTest(endpoint_value ~ treatment_group, data = clinical_data)

# Create augmented data for diagnostics
model_diagnostics <- augment(ancova_model)

# Residual plots
ggplot(model_diagnostics, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  geom_smooth(method = "loess", se = FALSE) +
  labs(title = "Residuals vs Fitted Values",
       x = "Fitted Values", 
       y = "Residuals") +
  theme_minimal()

# QQ plot
ggplot(model_diagnostics, aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Normal Q-Q Plot of Residuals",
       x = "Theoretical Quantiles", 
       y = "Sample Quantiles") +
  theme_minimal()

# Check for influential observations
influence_measures <- influence.measures(ancova_model)
summary(influence_measures)

# Visual check for influential points
ggplot(model_diagnostics, aes(x = .hat, y = .cooksd)) +
  geom_point() +
  geom_text(aes(label = ifelse(.cooksd > 0.1, rownames(model_diagnostics), "")), 
            vjust = -1) +
  labs(title = "Cook's Distance vs Leverage",
       x = "Leverage", 
       y = "Cook's Distance") +
  theme_minimal()
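
If the diagnostics reveal heavy-tailed residuals or influential observations that cannot be traced to data errors, a useful sensitivity analysis is to refit the model with a robust estimator and compare the treatment effect. A minimal sketch using MASS::rlm (M-estimation with Huber weights by default), assuming treatment_group has levels Control and Treatment as elsewhere in this chapter:

Code
library(MASS)

# Robust refit of the ANCOVA model as a sensitivity analysis
robust_ancova <- rlm(endpoint_value ~ baseline_value + treatment_group, 
                     data = clinical_data)
summary(robust_ancova)

# Compare the treatment coefficient from the ordinary and robust fits
coef(ancova_model)["treatment_groupTreatment"]
coef(robust_ancova)["treatment_groupTreatment"]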

4.3 Generalized Linear Models for Non-Continuous Outcomes

4.3.1 Logistic Regression for Binary Outcomes

Logistic regression is essential for analyzing binary outcomes like treatment response:

Code
# Logistic regression for binary outcome (e.g., treatment response)
logistic_model <- glm(responder ~ treatment_group + age + sex + 
                      baseline_severity + comorbidity_score,
                     family = binomial(link = "logit"),
                     data = clinical_data)

# Model summary
summary(logistic_model)

# Tidy output with odds ratios
tidy(logistic_model, exponentiate = TRUE, conf.int = TRUE) %>%
  knitr::kable(digits = 3)

# Calculate predicted probabilities
clinical_data <- clinical_data %>%
  mutate(pred_prob = predict(logistic_model, type = "response"))

# Plot predicted probabilities by treatment group
# (points plus a smoother: predictions also depend on age and comorbidity
#  score, so connecting raw values with a line would zig-zag)
ggplot(clinical_data, aes(x = baseline_severity, y = pred_prob, 
                          color = treatment_group)) +
  geom_point(alpha = 0.5) +
  geom_smooth(se = FALSE) +
  facet_wrap(~ sex) +
  labs(title = "Predicted Probability of Response",
       subtitle = "By Treatment Group, Baseline Severity, and Sex",
       x = "Baseline Severity", 
       y = "Predicted Probability of Response") +
  theme_minimal()

# Assess model performance
library(pROC)
roc_curve <- roc(clinical_data$responder, clinical_data$pred_prob)
auc(roc_curve)

# Plot ROC curve (ggroc plots sensitivity against specificity)
ggroc(roc_curve) +
  geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), 
               linetype = "dashed", color = "gray") +
  labs(title = "ROC Curve for Logistic Regression Model",
       subtitle = paste("AUC =", round(auc(roc_curve), 3)),
       x = "Specificity", 
       y = "Sensitivity") +
  theme_minimal()
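
Discrimination (the AUC) is only part of model performance; it is also worth checking calibration, i.e., whether predicted probabilities agree with observed response rates. The sketch below bins predictions into deciles and compares the mean predicted probability with the observed rate in each bin; decile binning is one simple convention, and the code assumes responder is coded 0/1 as in the roc() call above.

Code
# Simple calibration check: observed vs predicted response rate by decile
calibration_data <- clinical_data %>%
  mutate(risk_decile = ntile(pred_prob, 10)) %>%
  group_by(risk_decile) %>%
  summarise(
    mean_predicted = mean(pred_prob),
    observed_rate = mean(responder),   # assumes responder is 0/1
    .groups = "drop"
  )

ggplot(calibration_data, aes(x = mean_predicted, y = observed_rate)) +
  geom_point(size = 2) +
  geom_abline(linetype = "dashed", color = "gray") +
  coord_equal(xlim = c(0, 1), ylim = c(0, 1)) +
  labs(title = "Calibration by Risk Decile",
       x = "Mean Predicted Probability", 
       y = "Observed Response Rate") +
  theme_minimal()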

4.3.2 Poisson and Negative Binomial Regression for Count Data

For analyzing count outcomes such as adverse events:

Code
# Poisson regression for count outcome (e.g., number of adverse events)
poisson_model <- glm(ae_count ~ treatment_group + age + exposure_time,
                    family = poisson(link = "log"),
                    data = clinical_data)

# Check for overdispersion
dispersion_test <- sum(residuals(poisson_model, type = "pearson")^2) / 
                  poisson_model$df.residual
dispersion_test

# If overdispersion is present, use negative binomial
if(dispersion_test > 1.5) {
  library(MASS)
  nb_model <- glm.nb(ae_count ~ treatment_group + age + exposure_time,
                    data = clinical_data)
  summary(nb_model)
  final_count_model <- nb_model
} else {
  summary(poisson_model)
  final_count_model <- poisson_model
}

# Calculate incidence rate ratios
tidy(final_count_model, exponentiate = TRUE, conf.int = TRUE) %>%
  knitr::kable(digits = 3)

# Offset for exposure time
offset_model <- glm(ae_count ~ treatment_group + age + offset(log(exposure_time)),
                   family = poisson(link = "log"),
                   data = clinical_data)

# Compare predictions
clinical_data <- clinical_data %>%
  mutate(
    pred_count = predict(final_count_model, type = "response"),
    pred_rate = predict(offset_model, type = "response")
  )

# Visualize predicted counts
# (points plus a smoother: predictions also depend on exposure time,
#  so connecting raw values with a line would zig-zag)
ggplot(clinical_data, aes(x = age, y = pred_count, color = treatment_group)) +
  geom_point(alpha = 0.5) +
  geom_smooth(se = FALSE) +
  labs(title = "Predicted Adverse Event Counts",
       subtitle = "By Treatment Group and Age",
       x = "Age", 
       y = "Predicted Number of Adverse Events") +
  theme_minimal()

4.3.3 Ordinal Regression for Ordered Categorical Data

For analyzing ordered outcomes like disease severity stages:

Code
# Ordinal logistic regression for ordered categorical outcome
library(MASS)
ordinal_model <- polr(ordered_outcome ~ treatment_group + age + 
                     baseline_category + comorbidity_score,
                     data = clinical_data,
                     Hess = TRUE)

# Summary of model
summary(ordinal_model)

# Calculate confidence intervals and exponentiated coefficients
ci <- confint(ordinal_model)
exp_coef <- exp(coef(ordinal_model))
exp_ci <- exp(ci)

# Create tidy output with odds ratios
ord_results <- data.frame(
  term = names(coef(ordinal_model)),
  odds_ratio = exp_coef,
  lower_ci = exp_ci[,1],
  upper_ci = exp_ci[,2]
)

# Test the proportional odds assumption
# nominal_test() works on clm fits from the ordinal package, not on polr
# objects, so refit the same model with clm() before testing
library(ordinal)
clm_model <- clm(ordered_outcome ~ treatment_group + age + 
                   baseline_category + comorbidity_score,
                 data = clinical_data)
nominal_test(clm_model)

4.4 Survival Analysis for Time-to-Event Data

4.4.1 Kaplan-Meier Estimation

Non-parametric survival analysis is fundamental in clinical research:

Code
library(survival)
library(survminer)

# Fit Kaplan-Meier curves by treatment group
# (building the Surv() object inside the formula, rather than as a separate
#  variable, keeps later survdiff/survminer calls working with `data`)
km_fit <- survfit(Surv(time_to_event, event_indicator) ~ treatment_group, 
                  data = clinical_data)

# Summary of the Kaplan-Meier fit
summary(km_fit)

# Extract median survival times
km_fit

# Log-rank test for group differences
log_rank <- survdiff(Surv(time_to_event, event_indicator) ~ treatment_group, 
                     data = clinical_data)
log_rank

# Kaplan-Meier plot with p-value
ggsurvplot(
  km_fit,
  data = clinical_data,
  pval = TRUE,
  conf.int = TRUE,
  risk.table = TRUE,
  risk.table.height = 0.25,
  ggtheme = theme_minimal(),
  xlab = "Time (months)",
  ylab = "Survival Probability",
  legend.title = "Treatment Group",
  legend.labs = c("Control", "Treatment"),
  palette = c("#0072B2", "#D55E00")
)

4.4.2 Cox Proportional Hazards Models

Cox regression is the standard approach for multivariable survival analysis:

Code
# Cox proportional hazards model
cox_model <- coxph(Surv(time_to_event, event_indicator) ~ treatment_group + 
                     age + sex + baseline_risk_score + strata(study_center),
                   data = clinical_data)

# Model summary
summary(cox_model)

# Hazard ratios with confidence intervals
tidy(cox_model, exponentiate = TRUE, conf.int = TRUE) %>%
  knitr::kable(digits = 3)

# Check proportional hazards assumption
ph_test <- cox.zph(cox_model)
ph_test

# Plot Schoenfeld residuals to visualize PH assumption
ggcoxzph(ph_test)

# Forest plot of hazard ratios
ggforest(cox_model, data = clinical_data)

# Adjusted survival curves, averaged over the covariate distribution
ggadjustedcurves(
  fit = cox_model,
  data = clinical_data,
  variable = "treatment_group",
  ggtheme = theme_minimal()
)
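
If cox.zph() flags a covariate, two common remedies are to stratify on that covariate or to model a time-varying effect for it. The sketch below illustrates the stratification approach, assuming hypothetically that sex is the offending variable; each stratum then gets its own baseline hazard, and no hazard ratio is reported for sex itself.

Code
# Refit, stratifying on the covariate that violates proportional hazards
# (sex is used here purely as an illustration)
cox_model_strat <- coxph(
  Surv(time_to_event, event_indicator) ~ treatment_group + age + 
    baseline_risk_score + strata(sex) + strata(study_center),
  data = clinical_data
)

summary(cox_model_strat)

# Re-check the assumption for the remaining covariates
cox.zph(cox_model_strat)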

4.4.3 Competing Risks Analysis

When multiple competing events are of interest:

Code
library(cmprsk)

# Create data for competing risks analysis
# Assume event_type: 0=censored, 1=event of interest, 2=competing event
cr_data <- clinical_data %>%
  mutate(
    status = case_when(
      event_indicator == 0 ~ 0,  # censored
      event_type == "primary" ~ 1,  # primary outcome
      event_type == "competing" ~ 2  # competing event
    )
  )

# Cumulative incidence function
cif <- cuminc(
  ftime = cr_data$time_to_event,
  fstatus = cr_data$status,
  group = cr_data$treatment_group
)

# Plot cumulative incidence
plot(cif, col = c("blue", "red", "green", "orange"))

# Fine-Gray competing risks regression
# crr() requires a numeric covariate matrix, so build one with model.matrix()
# (this also avoids dplyr::select() being masked by MASS)
cov_matrix <- model.matrix(~ age + sex + baseline_risk_score, 
                           data = cr_data)[, -1]

fg_model <- crr(
  ftime = cr_data$time_to_event,
  fstatus = cr_data$status,
  cov1 = cov_matrix,
  failcode = 1,
  cencode = 0
)

# Summarize model
summary(fg_model)

# Plot competing risks cumulative incidence with survminer
library(survminer)
ggcompetingrisks(
  fit = cif,
  palette = "jco",
  ggtheme = theme_minimal()
)

4.5 Mixed-Effects Models for Longitudinal Data

4.5.1 Linear Mixed Models

For analyzing repeated measures with continuous outcomes:

Code
library(lme4)
library(lmerTest)

# Linear mixed model for repeated measures
mixed_model <- lmer(
  outcome ~ treatment_group * visit + baseline_value + age + sex + 
  (1 | subject_id),
  data = longitudinal_data
)

# Model summary
summary(mixed_model)

# Type III ANOVA table (Satterthwaite degrees of freedom via lmerTest)
anova(mixed_model, type = "III")

# Extract fixed effects
fixef(mixed_model)

# Extract random effects
ranef(mixed_model)

# Check residuals
plot(mixed_model)
qqnorm(resid(mixed_model))
qqline(resid(mixed_model))

# Plot interaction between treatment and visit
library(emmeans)
emm <- emmeans(mixed_model, ~ treatment_group | visit)
plot(emm)

# Extract and plot predicted values
longitudinal_data$predicted <- predict(mixed_model)

ggplot(longitudinal_data, aes(x = visit, y = outcome, 
                             group = subject_id, color = treatment_group)) +
  geom_line(alpha = 0.3) +
  geom_smooth(aes(y = predicted, group = treatment_group), 
              size = 1.5, se = FALSE) +
  facet_wrap(~ treatment_group) +
  labs(title = "Individual Trajectories with Model Predictions",
       x = "Visit", 
       y = "Outcome Measure") +
  theme_minimal()
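
The random-intercept model above implies a single, constant correlation between any two measurements on the same subject. Exercise 4 asks you to explore alternative correlation structures; one way is to fit the same mean model with nlme::lme() under different residual correlation structures and compare fits by information criteria. A minimal sketch, assuming visits are roughly equally spaced so an AR(1) structure is plausible:

Code
library(nlme)

# Random intercept with independent within-subject residuals
# (comparable to the lmer() fit above)
lme_ind <- lme(outcome ~ treatment_group * visit + baseline_value + age + sex,
               random = ~ 1 | subject_id,
               data = longitudinal_data, method = "REML")

# Same model with an AR(1) correlation for residuals within subject
lme_ar1 <- update(lme_ind, correlation = corAR1(form = ~ 1 | subject_id))

# Compare fits (same fixed effects, so REML-based criteria are comparable)
AIC(lme_ind, lme_ar1)
anova(lme_ind, lme_ar1)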

4.5.2 Generalized Linear Mixed Models

For non-continuous longitudinal outcomes:

Code
# Generalized linear mixed model for binary repeated outcome
library(lme4)
binary_mixed_model <- glmer(
  binary_outcome ~ treatment_group * visit + baseline_value + age + sex +
  (1 | subject_id),
  family = binomial,
  data = longitudinal_data
)

# Model summary
summary(binary_mixed_model)

# Odds ratios with confidence intervals
fixef_results <- data.frame(
  estimate = fixef(binary_mixed_model),
  se = sqrt(diag(vcov(binary_mixed_model)))
)

fixef_results <- fixef_results %>%
  mutate(
    odds_ratio = exp(estimate),
    lower_ci = exp(estimate - 1.96 * se),
    upper_ci = exp(estimate + 1.96 * se)
  )

# Plot predicted probabilities over time
newdata <- expand.grid(
  visit = unique(longitudinal_data$visit),
  treatment_group = unique(longitudinal_data$treatment_group),
  baseline_value = mean(longitudinal_data$baseline_value),
  age = mean(longitudinal_data$age),
  sex = "Female"  # or whatever reference category you want
)

newdata$predicted_prob <- predict(binary_mixed_model, 
                                  newdata = newdata, 
                                  type = "response",
                                  re.form = NA)  # random effects set to zero (typical subject)

ggplot(newdata, aes(x = visit, y = predicted_prob, 
                    color = treatment_group, group = treatment_group)) +
  geom_line(size = 1.2) +
  geom_point(size = 3) +
  scale_y_continuous(limits = c(0, 1)) +
  labs(title = "Predicted Probability of Outcome Over Time",
       subtitle = "By Treatment Group",
       x = "Visit", 
       y = "Predicted Probability") +
  theme_minimal()

4.6 Advanced Modeling Techniques

4.6.1 Multiple Imputation for Missing Data

Proper handling of missing data is crucial in clinical research:

Code
library(mice)

# Examine missing data patterns
md.pattern(clinical_data)

# Create multiple imputations (set a seed so the imputations are reproducible)
imp <- mice(clinical_data, m = 10, seed = 123, printFlag = FALSE)
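
# Before pooling, inspect the imputations: the trace plots of chain means and
# SDs should mix well across iterations, and the imputed values (red) should
# look plausible against the observed values (blue)
plot(imp)
densityplot(imp)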

# Fit model on each imputed dataset
models <- with(imp, lm(endpoint_value ~ baseline_value + treatment_group + age + sex))

# Pool results across imputations
pooled_results <- pool(models)
summary(pooled_results)

# Compare with complete case analysis
complete_case_model <- lm(endpoint_value ~ baseline_value + treatment_group + age + sex,
                         data = clinical_data)
summary(complete_case_model)

# Visualize differences
pooled_coefs <- summary(pooled_results)
complete_coefs <- tidy(complete_case_model)

# Combine results for plotting
comparison <- bind_rows(
  pooled_coefs %>% 
    select(term = term, estimate = estimate, std.error = std.error) %>%
    mutate(method = "Multiple Imputation"),
  complete_coefs %>% 
    select(term = term, estimate = estimate, std.error = std.error) %>%
    mutate(method = "Complete Case")
) %>%
  filter(term != "(Intercept)") %>%
  mutate(
    lower_ci = estimate - 1.96 * std.error,
    upper_ci = estimate + 1.96 * std.error
  )

# Plot comparison
ggplot(comparison, aes(x = term, y = estimate, color = method)) +
  geom_point(position = position_dodge(width = 0.5), size = 3) +
  geom_errorbar(aes(ymin = lower_ci, ymax = upper_ci),
                position = position_dodge(width = 0.5),
                width = 0.25) +
  coord_flip() +
  labs(title = "Comparison of Multiple Imputation vs Complete Case Analysis",
       x = "", 
       y = "Coefficient Estimate (95% CI)") +
  theme_minimal()

4.6.2 Bayesian Approaches

Bayesian methods are increasingly used in clinical research, for example in adaptive designs and when incorporating prior evidence:

Code
library(rstanarm)
library(bayesplot)

# Bayesian linear model
bayes_model <- stan_glm(
  endpoint_value ~ baseline_value + treatment_group + age + sex,
  data = clinical_data,
  prior = normal(0, 5),
  prior_intercept = normal(0, 10),
  chains = 4,
  iter = 2000
)

# Model summary
summary(bayes_model)
prior_summary(bayes_model)

# Check convergence with trace plots of the posterior draws
mcmc_trace(as.array(bayes_model))

# Posterior intervals for the coefficients
mcmc_intervals(as.matrix(bayes_model))

# Posterior predictive checks
pp_check(bayes_model)

# Treatment effect with 95% credible interval
treatment_effect <- as.matrix(bayes_model)[, "treatment_groupTreatment"]
mean(treatment_effect)
quantile(treatment_effect, c(0.025, 0.975))

# Probability that treatment effect is greater than 0
mean(treatment_effect > 0)

# Visualize posterior distribution of treatment effect
mcmc_areas(as.matrix(bayes_model), pars = "treatment_groupTreatment") +
  geom_vline(xintercept = 0, linetype = "dashed") +
  labs(title = "Posterior Distribution of Treatment Effect",
       subtitle = "With 95% Credible Interval",
       x = "Treatment Effect", 
       y = "Density") +
  theme_minimal()

4.6.3 Propensity Score Methods

For observational studies and causal inference:

Code
library(MatchIt)
library(cobalt)

# Fit propensity score model
ps_model <- glm(treatment_group ~ age + sex + baseline_value + 
               comorbidity_score + center,
              family = binomial,
              data = observational_data)

# Add propensity scores to data
observational_data$propensity_score <- predict(ps_model, type = "response")

# Check propensity score distribution
ggplot(observational_data, aes(x = propensity_score, fill = treatment_group)) +
  geom_density(alpha = 0.5) +
  labs(title = "Propensity Score Distributions",
       x = "Propensity Score", 
       y = "Density") +
  theme_minimal()

# Propensity score matching
match_obj <- matchit(
  treatment_group ~ age + sex + baseline_value + comorbidity_score + center,
  data = observational_data,
  method = "nearest",
  ratio = 1,
  caliper = 0.2
)

# Examine balance before and after matching
summary(match_obj)

# Get matched data
matched_data <- match.data(match_obj)

# Visualize covariate balance
love.plot(match_obj, binary = "std", 
          thresholds = c(m = 0.1),
          var.order = "unadjusted",
          abs = TRUE) +
  labs(title = "Standardized Mean Differences Before and After Matching")

# Treatment effect analysis on matched data
matched_model <- lm(endpoint_value ~ treatment_group, data = matched_data)
summary(matched_model)
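
# A commonly recommended refinement (see the MatchIt vignettes) is to use the
# matching weights returned by match.data() and cluster-robust standard errors
# by matched pair (subclass); a sketch using the sandwich and lmtest packages
library(lmtest)
library(sandwich)

matched_model_w <- lm(endpoint_value ~ treatment_group,
                      data = matched_data, weights = weights)
coeftest(matched_model_w, vcov. = vcovCL, cluster = ~ subclass)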

# Alternative: Inverse probability weighting
observational_data$ipw <- ifelse(
  observational_data$treatment_group == "Treatment",
  1 / observational_data$propensity_score,
  1 / (1 - observational_data$propensity_score)
)

# Stabilize weights
observational_data <- observational_data %>%
  mutate(
    ipw_stabilized = ipw * ifelse(
      treatment_group == "Treatment",
      mean(treatment_group == "Treatment"),
      mean(treatment_group == "Control")
    )
  )

# IPW regression
ipw_model <- lm(
  endpoint_value ~ treatment_group,
  data = observational_data,
  weights = ipw_stabilized
)

summary(ipw_model)
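
Ordinary least-squares standard errors do not account for the weighting, so IPW analyses are usually reported with robust (sandwich) standard errors. A minimal sketch using the sandwich and lmtest packages:

Code
library(lmtest)
library(sandwich)

# Robust (sandwich) standard errors and confidence interval for the
# weighted treatment effect
coeftest(ipw_model, vcov. = vcovHC(ipw_model, type = "HC1"))
coefci(ipw_model, vcov. = vcovHC(ipw_model, type = "HC1"))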

4.7 Exercises

  1. Using a clinical trial dataset, conduct a complete ANCOVA analysis for a continuous outcome, including thorough diagnostics and visualization of results.

  2. Analyze a binary outcome from a clinical study using logistic regression, and create plots that illustrate how the probability of response varies with key covariates.

  3. Perform a survival analysis for time-to-event data, including both Kaplan-Meier estimation and Cox proportional hazards modeling. Check and address any violations of the proportional hazards assumption.

  4. Analyze longitudinal data from a clinical trial using a linear mixed-effects model. Explore different correlation structures and select the best-fitting model based on information criteria.

  5. Compare multiple imputation, complete case analysis, and LOCF (Last Observation Carried Forward) approaches on a dataset with missing values, and discuss the impact on parameter estimates and conclusions.

4.8 Summary

Statistical modeling in clinical research requires careful consideration of study design, outcome types, and data characteristics. This chapter has covered a range of modeling approaches commonly used in clinical settings, from traditional methods like linear and logistic regression to more specialized techniques for longitudinal and survival data. We’ve emphasized not just model fitting, but also thorough diagnostics, visualization, and interpretation—all essential components of reproducible and reliable clinical research.

When applying these methods, it’s critical to align the modeling approach with the research questions and to be transparent about assumptions and limitations. Modern R packages such as the tidyverse, lme4, survival, and rstanarm provide powerful tools for implementing these methods within a reproducible framework, allowing clinical researchers to generate robust evidence while maintaining regulatory compliance.
