4.1 Introduction to Statistical Modeling in Clinical Settings
Statistical modeling is at the core of evidence generation in clinical research. Models help us test hypotheses, quantify treatment effects, adjust for confounding factors, and make predictions about patient outcomes.
4.1.1 Statistical Modeling Objectives in Clinical Research
The key objectives of statistical modeling in clinical research include:
Estimating treatment effects: Quantifying the impact of interventions on clinical outcomes
Controlling for confounding: Adjusting for factors that may bias treatment effect estimates
Predicting outcomes: Developing tools to forecast patient responses or disease progression
Identifying risk factors: Understanding which variables are associated with outcomes
Subgroup analysis: Exploring heterogeneity of treatment effects across patient populations
Handling missing data: Employing methods that properly account for incomplete observations
4.2 Linear Models for Continuous Outcomes
4.2.1 Simple and Multiple Linear Regression
Linear regression remains a workhorse for analyzing continuous outcomes:
Code
library(tidyverse)library(broom)library(car)library(ggplot2)# Simple linear regressionsimple_model <-lm(endpoint_value ~ baseline_value, data = clinical_data)# Summarize the modelsummary(simple_model)# Tidy model outputtidy(simple_model, conf.int =TRUE)# Multiple linear regression with treatment and covariatesmultiple_model <-lm(endpoint_value ~ baseline_value + treatment_group + age + sex + comorbidity_score,data = clinical_data)# Summarize the multiple regression modelsummary(multiple_model)# Check assumptionspar(mfrow =c(2, 2))plot(multiple_model)# Check for multicollinearityvif(multiple_model)# Visualize the relationship with confidence intervalsggplot(clinical_data, aes(x = baseline_value, y = endpoint_value, color = treatment_group)) +geom_point(alpha =0.6) +geom_smooth(method ="lm", se =TRUE) +labs(title ="Relationship Between Baseline and Endpoint Values",subtitle ="By Treatment Group with 95% Confidence Bands",x ="Baseline Value", y ="Endpoint Value") +theme_minimal()
4.2.2 Analysis of Covariance (ANCOVA)
ANCOVA is particularly important for analyzing clinical trial data:
Code
# ANCOVA model for treatment comparison adjusting for baselineancova_model <-lm(endpoint_value ~ baseline_value + treatment_group, data = clinical_data)# Model summarysummary(ancova_model)# Extract adjusted meanslibrary(emmeans)adjusted_means <-emmeans(ancova_model, ~ treatment_group)adjusted_means# Pairwise comparisons with adjustment for multiple testingpairs(adjusted_means, adjust ="tukey")# Visualize adjusted means with confidence intervalsplot(adjusted_means) +labs(title ="Adjusted Means by Treatment Group",subtitle ="Controlling for Baseline Value",x ="Treatment Group",y ="Adjusted Endpoint Value") +theme_minimal()
4.2.3 Diagnostics and Assumption Checking
Thorough model checking is essential in clinical research:
Code
# Comprehensive diagnostics for the ANCOVA model# Check normality of residualsshapiro.test(residuals(ancova_model))# Check homogeneity of varianceleveneTest(endpoint_value ~ treatment_group, data = clinical_data)# Create augmented data for diagnosticsmodel_diagnostics <-augment(ancova_model)# Residual plotsggplot(model_diagnostics, aes(x = .fitted, y = .resid)) +geom_point() +geom_hline(yintercept =0, linetype ="dashed", color ="red") +geom_smooth(method ="loess", se =FALSE) +labs(title ="Residuals vs Fitted Values",x ="Fitted Values", y ="Residuals") +theme_minimal()# QQ plotggplot(model_diagnostics, aes(sample = .resid)) +stat_qq() +stat_qq_line() +labs(title ="Normal Q-Q Plot of Residuals",x ="Theoretical Quantiles", y ="Sample Quantiles") +theme_minimal()# Check for influential observationsinfluence_measures <-influence.measures(ancova_model)summary(influence_measures)# Visual check for influential pointsggplot(model_diagnostics, aes(x = .hat, y = .cooksd)) +geom_point() +geom_text(aes(label =ifelse(.cooksd >0.1, rownames(model_diagnostics), "")), vjust =-1) +labs(title ="Cook's Distance vs Leverage",x ="Leverage", y ="Cook's Distance") +theme_minimal()
4.3 Generalized Linear Models for Non-Continuous Outcomes
4.3.1 Logistic Regression for Binary Outcomes
Logistic regression is essential for analyzing binary outcomes like treatment response:
Code
# Logistic regression for binary outcome (e.g., treatment response)logistic_model <-glm(responder ~ treatment_group + age + sex + baseline_severity + comorbidity_score,family =binomial(link ="logit"),data = clinical_data)# Model summarysummary(logistic_model)# Tidy output with odds ratiostidy(logistic_model, exponentiate =TRUE, conf.int =TRUE) %>% knitr::kable(digits =3)# Calculate predicted probabilitiesclinical_data <- clinical_data %>%mutate(pred_prob =predict(logistic_model, type ="response"))# Plot predicted probabilities by treatment groupggplot(clinical_data, aes(x = baseline_severity, y = pred_prob, color = treatment_group)) +geom_line() +facet_wrap(~ sex) +labs(title ="Predicted Probability of Response",subtitle ="By Treatment Group, Baseline Severity, and Sex",x ="Baseline Severity", y ="Probability of Response") +theme_minimal()# Assess model performancelibrary(pROC)roc_curve <-roc(clinical_data$responder, clinical_data$pred_prob)auc(roc_curve)# Plot ROC curveggroc(roc_curve) +geom_abline(intercept =1, slope =1, linetype ="dashed", color ="gray") +labs(title ="ROC Curve for Logistic Regression Model",subtitle =paste("AUC =", round(auc(roc_curve), 3)),x ="False Positive Rate", y ="True Positive Rate") +theme_minimal()
4.3.2 Poisson and Negative Binomial Regression for Count Data
For analyzing count outcomes such as adverse events:
Code
# Poisson regression for count outcome (e.g., number of adverse events)poisson_model <-glm(ae_count ~ treatment_group + age + exposure_time,family =poisson(link ="log"),data = clinical_data)# Check for overdispersiondispersion_test <-sum(residuals(poisson_model, type ="pearson")^2) / poisson_model$df.residualdispersion_test# If overdispersion is present, use negative binomialif(dispersion_test >1.5) {library(MASS) nb_model <-glm.nb(ae_count ~ treatment_group + age + exposure_time,data = clinical_data)summary(nb_model) final_count_model <- nb_model} else {summary(poisson_model) final_count_model <- poisson_model}# Calculate incidence rate ratiostidy(final_count_model, exponentiate =TRUE, conf.int =TRUE) %>% knitr::kable(digits =3)# Offset for exposure timeoffset_model <-glm(ae_count ~ treatment_group + age +offset(log(exposure_time)),family =poisson(link ="log"),data = clinical_data)# Compare predictionsclinical_data <- clinical_data %>%mutate(pred_count =predict(final_count_model, type ="response"),pred_rate =predict(offset_model, type ="response") )# Visualizeggplot(clinical_data, aes(x = age, y = pred_count, color = treatment_group)) +geom_line() +labs(title ="Predicted Adverse Event Counts",subtitle ="By Treatment Group and Age",x ="Age", y ="Predicted Number of Adverse Events") +theme_minimal()
4.3.3 Ordinal Regression for Ordered Categorical Data
For analyzing ordered outcomes like disease severity stages:
Code
# Ordinal logistic regression for ordered categorical outcomelibrary(MASS)ordinal_model <-polr(ordered_outcome ~ treatment_group + age + baseline_category + comorbidity_score,data = clinical_data,Hess =TRUE)# Summary of modelsummary(ordinal_model)# Calculate confidence intervals and exponentiated coefficientsci <-confint(ordinal_model)exp_coef <-exp(coef(ordinal_model))exp_ci <-exp(ci)# Create tidy output with odds ratiosord_results <-data.frame(term =names(coef(ordinal_model)),odds_ratio = exp_coef,lower_ci = exp_ci[,1],upper_ci = exp_ci[,2])# Test proportional odds assumptionlibrary(ordinal)prop_odds_test <-nominal_test(ordinal_model)prop_odds_test
4.4 Survival Analysis for Time-to-Event Data
4.4.1 Kaplan-Meier Estimation
Non-parametric survival analysis is fundamental in clinical research:
Code
library(survival)library(survminer)# Create a survival objectsurv_obj <-Surv(time = clinical_data$time_to_event, event = clinical_data$event_indicator)# Fit Kaplan-Meier curves by treatment groupkm_fit <-survfit(surv_obj ~ treatment_group, data = clinical_data)# Summary of the Kaplan-Meier fitsummary(km_fit)# Extract median survival timeskm_fit# Log-rank test for group differenceslog_rank <-survdiff(surv_obj ~ treatment_group, data = clinical_data)log_rank# Kaplan-Meier plot with p-valueggsurvplot( km_fit,data = clinical_data,pval =TRUE,conf.int =TRUE,risk.table =TRUE,risk.table.height =0.25,ggtheme =theme_minimal(),xlab ="Time (months)",ylab ="Survival Probability",legend.title ="Treatment Group",legend.labs =c("Control", "Treatment"),palette =c("#0072B2", "#D55E00"))
4.4.2 Cox Proportional Hazards Models
Cox regression is the standard approach for multivariable survival analysis:
Code
# Cox proportional hazards modelcox_model <-coxph(surv_obj ~ treatment_group + age + sex + baseline_risk_score +strata(study_center),data = clinical_data)# Model summarysummary(cox_model)# Hazard ratios with confidence intervalstidy(cox_model, exponentiate =TRUE, conf.int =TRUE) %>% knitr::kable(digits =3)# Check proportional hazards assumptionph_test <-cox.zph(cox_model)ph_test# Plot Schoenfeld residuals to visualize PH assumptionggcoxzph(ph_test)# Forest plot of hazard ratiosggforest(cox_model, data = clinical_data)# Adjusted survival curvesadjusted_curves <- survminer::surv_adjustedcurves(fit = cox_model,data = clinical_data,variable ="treatment_group")# Plot adjusted curvesplot(adjusted_curves)
For analyzing repeated measures with continuous outcomes:
Code
library(lme4)library(lmerTest)# Linear mixed model for repeated measuresmixed_model <-lmer( outcome ~ treatment_group * visit + baseline_value + age + sex + (1| subject_id),data = longitudinal_data)# Model summarysummary(mixed_model)# Type III ANOVA tableanova(mixed_model, type =3)# Extract fixed effectsfixef(mixed_model)# Extract random effectsranef(mixed_model)# Check residualsplot(mixed_model)qqnorm(resid(mixed_model))qqline(resid(mixed_model))# Plot interaction between treatment and visitlibrary(emmeans)emm <-emmeans(mixed_model, ~ treatment_group | visit)plot(emm)# Extract and plot predicted valueslongitudinal_data$predicted <-predict(mixed_model)ggplot(longitudinal_data, aes(x = visit, y = outcome, group = subject_id, color = treatment_group)) +geom_line(alpha =0.3) +geom_smooth(aes(y = predicted, group = treatment_group), size =1.5, se =FALSE) +facet_wrap(~ treatment_group) +labs(title ="Individual Trajectories with Model Predictions",x ="Visit", y ="Outcome Measure") +theme_minimal()
4.5.2 Generalized Linear Mixed Models
For non-continuous longitudinal outcomes:
Code
# Generalized linear mixed model for binary repeated outcomelibrary(lme4)binary_mixed_model <-glmer( binary_outcome ~ treatment_group * visit + baseline_value + age + sex + (1| subject_id),family = binomial,data = longitudinal_data)# Model summarysummary(binary_mixed_model)# Odds ratios with confidence intervalsfixef_results <-data.frame(estimate =fixef(binary_mixed_model),se =sqrt(diag(vcov(binary_mixed_model))))fixef_results <- fixef_results %>%mutate(odds_ratio =exp(estimate),lower_ci =exp(estimate -1.96* se),upper_ci =exp(estimate +1.96* se) )# Plot predicted probabilities over timenewdata <-expand.grid(visit =unique(longitudinal_data$visit),treatment_group =unique(longitudinal_data$treatment_group),baseline_value =mean(longitudinal_data$baseline_value),age =mean(longitudinal_data$age),sex ="Female"# or whatever reference category you want)newdata$predicted_prob <-predict(binary_mixed_model, newdata = newdata, type ="response",re.form =NA) # Population-average predictionsggplot(newdata, aes(x = visit, y = predicted_prob, color = treatment_group, group = treatment_group)) +geom_line(size =1.2) +geom_point(size =3) +scale_y_continuous(limits =c(0, 1)) +labs(title ="Predicted Probability of Outcome Over Time",subtitle ="By Treatment Group",x ="Visit", y ="Predicted Probability") +theme_minimal()
4.6 Advanced Modeling Techniques
4.6.1 Multiple Imputation for Missing Data
Proper handling of missing data is crucial in clinical research:
Code
library(mice)# Examine missing data patternsmd.pattern(clinical_data)# Create multiple imputationsimp <-mice(clinical_data, m =10, printFlag =FALSE)# Fit model on each imputed datasetmodels <-with(imp, lm(endpoint_value ~ baseline_value + treatment_group + age + sex))# Pool results across imputationspooled_results <-pool(models)summary(pooled_results)# Compare with complete case analysiscomplete_case_model <-lm(endpoint_value ~ baseline_value + treatment_group + age + sex,data = clinical_data)summary(complete_case_model)# Visualize differencespooled_coefs <-summary(pooled_results)complete_coefs <-tidy(complete_case_model)# Combine results for plottingcomparison <-bind_rows( pooled_coefs %>%select(term = term, estimate = estimate, std.error = std.error) %>%mutate(method ="Multiple Imputation"), complete_coefs %>%select(term = term, estimate = estimate, std.error = std.error) %>%mutate(method ="Complete Case")) %>%filter(term !="(Intercept)") %>%mutate(lower_ci = estimate -1.96* std.error,upper_ci = estimate +1.96* std.error )# Plot comparisonggplot(comparison, aes(x = term, y = estimate, color = method)) +geom_point(position =position_dodge(width =0.5), size =3) +geom_errorbar(aes(ymin = lower_ci, ymax = upper_ci),position =position_dodge(width =0.5),width =0.25) +coord_flip() +labs(title ="Comparison of Multiple Imputation vs Complete Case Analysis",x ="", y ="Coefficient Estimate (95% CI)") +theme_minimal()
4.6.2 Bayesian Approaches
Bayesian methods are gaining popularity in clinical research:
Code
library(rstanarm)library(bayesplot)# Bayesian linear modelbayes_model <-stan_glm( endpoint_value ~ baseline_value + treatment_group + age + sex,data = clinical_data,prior =normal(0, 5),prior_intercept =normal(0, 10),chains =4,iter =2000)# Model summarysummary(bayes_model)prior_summary(bayes_model)# Check convergencemcmc_trace(bayes_model)# Posterior distributionsmcmc_intervals(bayes_model)# Posterior predictive checkspp_check(bayes_model)# Treatment effect with 95% credible intervaltreatment_effect <-as.matrix(bayes_model)[, "treatment_groupTreatment"]mean(treatment_effect)quantile(treatment_effect, c(0.025, 0.975))# Probability that treatment effect is greater than 0mean(treatment_effect >0)# Visualize posterior distribution of treatment effectmcmc_areas(as.matrix(bayes_model), pars ="treatment_groupTreatment") +geom_vline(xintercept =0, linetype ="dashed") +labs(title ="Posterior Distribution of Treatment Effect",subtitle ="With 95% Credible Interval",x ="Treatment Effect", y ="Density") +theme_minimal()
4.6.3 Propensity Score Methods
For observational studies and causal inference:
Code
library(MatchIt)library(cobalt)# Fit propensity score modelps_model <-glm(treatment_group ~ age + sex + baseline_value + comorbidity_score + center,family = binomial,data = observational_data)# Add propensity scores to dataobservational_data$propensity_score <-predict(ps_model, type ="response")# Check propensity score distributionggplot(observational_data, aes(x = propensity_score, fill = treatment_group)) +geom_density(alpha =0.5) +labs(title ="Propensity Score Distributions",x ="Propensity Score", y ="Density") +theme_minimal()# Propensity score matchingmatch_obj <-matchit( treatment_group ~ age + sex + baseline_value + comorbidity_score + center,data = observational_data,method ="nearest",ratio =1,caliper =0.2)# Examine balance before and after matchingsummary(match_obj)# Get matched datamatched_data <-match.data(match_obj)# Visualize covariate balancelove.plot(match_obj, binary ="std", thresholds =c(m =0.1),var.order ="unadjusted",abs =TRUE) +labs(title ="Standardized Mean Differences Before and After Matching")# Treatment effect analysis on matched datamatched_model <-lm(endpoint_value ~ treatment_group, data = matched_data)summary(matched_model)# Alternative: Inverse probability weightingobservational_data$ipw <-ifelse( observational_data$treatment_group =="Treatment",1/ observational_data$propensity_score,1/ (1- observational_data$propensity_score))# Stabilize weightsobservational_data <- observational_data %>%mutate(ipw_stabilized = ipw *ifelse( treatment_group =="Treatment",mean(treatment_group =="Treatment"),mean(treatment_group =="Control") ) )# IPW regressionipw_model <-lm( endpoint_value ~ treatment_group,data = observational_data,weights = ipw_stabilized)summary(ipw_model)
4.7 Exercises
Using a clinical trial dataset, conduct a complete ANCOVA analysis for a continuous outcome, including thorough diagnostics and visualization of results.
Analyze a binary outcome from a clinical study using logistic regression, and create plots that illustrate how the probability of response varies with key covariates.
Perform a survival analysis for time-to-event data, including both Kaplan-Meier estimation and Cox proportional hazards modeling. Check and address any violations of the proportional hazards assumption.
Analyze longitudinal data from a clinical trial using a linear mixed-effects model. Explore different correlation structures and select the best-fitting model based on information criteria.
Compare multiple imputation, complete case analysis, and LOCF (Last Observation Carried Forward) approaches on a dataset with missing values, and discuss the impact on parameter estimates and conclusions.
4.8 Summary
Statistical modeling in clinical research requires careful consideration of study design, outcome types, and data characteristics. This chapter has covered a range of modeling approaches commonly used in clinical settings, from traditional methods like linear and logistic regression to more specialized techniques for longitudinal and survival data. We’ve emphasized not just model fitting, but also thorough diagnostics, visualization, and interpretation—all essential components of reproducible and reliable clinical research.
When applying these methods, it’s critical to align the modeling approach with the research questions and to be transparent about assumptions and limitations. Modern R packages like tidymodels, lme4, survival, and rstanarm provide powerful tools for implementing these methods within a reproducible framework, allowing clinical researchers to generate robust evidence while maintaining regulatory compliance.
# Statistical Modeling for Clinical Research## Introduction to Statistical Modeling in Clinical SettingsStatistical modeling is at the core of evidence generation in clinical research. Models help us test hypotheses, quantify treatment effects, adjust for confounding factors, and make predictions about patient outcomes.```{r}#| echo: false#| fig-cap: "Statistical Modeling Workflow in Clinical Research"library(DiagrammeR)# This would render a workflow diagram in the actual document# Placeholder comment for the diagram code```### Statistical Modeling Objectives in Clinical ResearchThe key objectives of statistical modeling in clinical research include:1. **Estimating treatment effects**: Quantifying the impact of interventions on clinical outcomes2. **Controlling for confounding**: Adjusting for factors that may bias treatment effect estimates3. **Predicting outcomes**: Developing tools to forecast patient responses or disease progression4. **Identifying risk factors**: Understanding which variables are associated with outcomes5. **Subgroup analysis**: Exploring heterogeneity of treatment effects across patient populations6. **Handling missing data**: Employing methods that properly account for incomplete observations## Linear Models for Continuous Outcomes### Simple and Multiple Linear RegressionLinear regression remains a workhorse for analyzing continuous outcomes:```{r}#| echo: true#| eval: falselibrary(tidyverse)library(broom)library(car)library(ggplot2)# Simple linear regressionsimple_model <-lm(endpoint_value ~ baseline_value, data = clinical_data)# Summarize the modelsummary(simple_model)# Tidy model outputtidy(simple_model, conf.int =TRUE)# Multiple linear regression with treatment and covariatesmultiple_model <-lm(endpoint_value ~ baseline_value + treatment_group + age + sex + comorbidity_score,data = clinical_data)# Summarize the multiple regression modelsummary(multiple_model)# Check assumptionspar(mfrow =c(2, 2))plot(multiple_model)# Check for multicollinearityvif(multiple_model)# Visualize the relationship with confidence intervalsggplot(clinical_data, aes(x = baseline_value, y = endpoint_value, color = treatment_group)) +geom_point(alpha =0.6) +geom_smooth(method ="lm", se =TRUE) +labs(title ="Relationship Between Baseline and Endpoint Values",subtitle ="By Treatment Group with 95% Confidence Bands",x ="Baseline Value", y ="Endpoint Value") +theme_minimal()```### Analysis of Covariance (ANCOVA)ANCOVA is particularly important for analyzing clinical trial data:```{r}#| echo: true#| eval: false# ANCOVA model for treatment comparison adjusting for baselineancova_model <-lm(endpoint_value ~ baseline_value + treatment_group, data = clinical_data)# Model summarysummary(ancova_model)# Extract adjusted meanslibrary(emmeans)adjusted_means <-emmeans(ancova_model, ~ treatment_group)adjusted_means# Pairwise comparisons with adjustment for multiple testingpairs(adjusted_means, adjust ="tukey")# Visualize adjusted means with confidence intervalsplot(adjusted_means) +labs(title ="Adjusted Means by Treatment Group",subtitle ="Controlling for Baseline Value",x ="Treatment Group",y ="Adjusted Endpoint Value") +theme_minimal()```### Diagnostics and Assumption CheckingThorough model checking is essential in clinical research:```{r}#| echo: true#| eval: false# Comprehensive diagnostics for the ANCOVA model# Check normality of residualsshapiro.test(residuals(ancova_model))# Check homogeneity of varianceleveneTest(endpoint_value ~ treatment_group, data = clinical_data)# Create augmented data for diagnosticsmodel_diagnostics <-augment(ancova_model)# Residual plotsggplot(model_diagnostics, aes(x = .fitted, y = .resid)) +geom_point() +geom_hline(yintercept =0, linetype ="dashed", color ="red") +geom_smooth(method ="loess", se =FALSE) +labs(title ="Residuals vs Fitted Values",x ="Fitted Values", y ="Residuals") +theme_minimal()# QQ plotggplot(model_diagnostics, aes(sample = .resid)) +stat_qq() +stat_qq_line() +labs(title ="Normal Q-Q Plot of Residuals",x ="Theoretical Quantiles", y ="Sample Quantiles") +theme_minimal()# Check for influential observationsinfluence_measures <-influence.measures(ancova_model)summary(influence_measures)# Visual check for influential pointsggplot(model_diagnostics, aes(x = .hat, y = .cooksd)) +geom_point() +geom_text(aes(label =ifelse(.cooksd >0.1, rownames(model_diagnostics), "")), vjust =-1) +labs(title ="Cook's Distance vs Leverage",x ="Leverage", y ="Cook's Distance") +theme_minimal()```## Generalized Linear Models for Non-Continuous Outcomes### Logistic Regression for Binary OutcomesLogistic regression is essential for analyzing binary outcomes like treatment response:```{r}#| echo: true#| eval: false# Logistic regression for binary outcome (e.g., treatment response)logistic_model <-glm(responder ~ treatment_group + age + sex + baseline_severity + comorbidity_score,family =binomial(link ="logit"),data = clinical_data)# Model summarysummary(logistic_model)# Tidy output with odds ratiostidy(logistic_model, exponentiate =TRUE, conf.int =TRUE) %>% knitr::kable(digits =3)# Calculate predicted probabilitiesclinical_data <- clinical_data %>%mutate(pred_prob =predict(logistic_model, type ="response"))# Plot predicted probabilities by treatment groupggplot(clinical_data, aes(x = baseline_severity, y = pred_prob, color = treatment_group)) +geom_line() +facet_wrap(~ sex) +labs(title ="Predicted Probability of Response",subtitle ="By Treatment Group, Baseline Severity, and Sex",x ="Baseline Severity", y ="Probability of Response") +theme_minimal()# Assess model performancelibrary(pROC)roc_curve <-roc(clinical_data$responder, clinical_data$pred_prob)auc(roc_curve)# Plot ROC curveggroc(roc_curve) +geom_abline(intercept =1, slope =1, linetype ="dashed", color ="gray") +labs(title ="ROC Curve for Logistic Regression Model",subtitle =paste("AUC =", round(auc(roc_curve), 3)),x ="False Positive Rate", y ="True Positive Rate") +theme_minimal()```### Poisson and Negative Binomial Regression for Count DataFor analyzing count outcomes such as adverse events:```{r}#| echo: true#| eval: false# Poisson regression for count outcome (e.g., number of adverse events)poisson_model <-glm(ae_count ~ treatment_group + age + exposure_time,family =poisson(link ="log"),data = clinical_data)# Check for overdispersiondispersion_test <-sum(residuals(poisson_model, type ="pearson")^2) / poisson_model$df.residualdispersion_test# If overdispersion is present, use negative binomialif(dispersion_test >1.5) {library(MASS) nb_model <-glm.nb(ae_count ~ treatment_group + age + exposure_time,data = clinical_data)summary(nb_model) final_count_model <- nb_model} else {summary(poisson_model) final_count_model <- poisson_model}# Calculate incidence rate ratiostidy(final_count_model, exponentiate =TRUE, conf.int =TRUE) %>% knitr::kable(digits =3)# Offset for exposure timeoffset_model <-glm(ae_count ~ treatment_group + age +offset(log(exposure_time)),family =poisson(link ="log"),data = clinical_data)# Compare predictionsclinical_data <- clinical_data %>%mutate(pred_count =predict(final_count_model, type ="response"),pred_rate =predict(offset_model, type ="response") )# Visualizeggplot(clinical_data, aes(x = age, y = pred_count, color = treatment_group)) +geom_line() +labs(title ="Predicted Adverse Event Counts",subtitle ="By Treatment Group and Age",x ="Age", y ="Predicted Number of Adverse Events") +theme_minimal()```### Ordinal Regression for Ordered Categorical DataFor analyzing ordered outcomes like disease severity stages:```{r}#| echo: true#| eval: false# Ordinal logistic regression for ordered categorical outcomelibrary(MASS)ordinal_model <-polr(ordered_outcome ~ treatment_group + age + baseline_category + comorbidity_score,data = clinical_data,Hess =TRUE)# Summary of modelsummary(ordinal_model)# Calculate confidence intervals and exponentiated coefficientsci <-confint(ordinal_model)exp_coef <-exp(coef(ordinal_model))exp_ci <-exp(ci)# Create tidy output with odds ratiosord_results <-data.frame(term =names(coef(ordinal_model)),odds_ratio = exp_coef,lower_ci = exp_ci[,1],upper_ci = exp_ci[,2])# Test proportional odds assumptionlibrary(ordinal)prop_odds_test <-nominal_test(ordinal_model)prop_odds_test```## Survival Analysis for Time-to-Event Data### Kaplan-Meier EstimationNon-parametric survival analysis is fundamental in clinical research:```{r}#| echo: true#| eval: falselibrary(survival)library(survminer)# Create a survival objectsurv_obj <-Surv(time = clinical_data$time_to_event, event = clinical_data$event_indicator)# Fit Kaplan-Meier curves by treatment groupkm_fit <-survfit(surv_obj ~ treatment_group, data = clinical_data)# Summary of the Kaplan-Meier fitsummary(km_fit)# Extract median survival timeskm_fit# Log-rank test for group differenceslog_rank <-survdiff(surv_obj ~ treatment_group, data = clinical_data)log_rank# Kaplan-Meier plot with p-valueggsurvplot( km_fit,data = clinical_data,pval =TRUE,conf.int =TRUE,risk.table =TRUE,risk.table.height =0.25,ggtheme =theme_minimal(),xlab ="Time (months)",ylab ="Survival Probability",legend.title ="Treatment Group",legend.labs =c("Control", "Treatment"),palette =c("#0072B2", "#D55E00"))```### Cox Proportional Hazards ModelsCox regression is the standard approach for multivariable survival analysis:```{r}#| echo: true#| eval: false# Cox proportional hazards modelcox_model <-coxph(surv_obj ~ treatment_group + age + sex + baseline_risk_score +strata(study_center),data = clinical_data)# Model summarysummary(cox_model)# Hazard ratios with confidence intervalstidy(cox_model, exponentiate =TRUE, conf.int =TRUE) %>% knitr::kable(digits =3)# Check proportional hazards assumptionph_test <-cox.zph(cox_model)ph_test# Plot Schoenfeld residuals to visualize PH assumptionggcoxzph(ph_test)# Forest plot of hazard ratiosggforest(cox_model, data = clinical_data)# Adjusted survival curvesadjusted_curves <- survminer::surv_adjustedcurves(fit = cox_model,data = clinical_data,variable ="treatment_group")# Plot adjusted curvesplot(adjusted_curves)```### Competing Risks AnalysisWhen multiple competing events are of interest:```{r}#| echo: true#| eval: falselibrary(cmprsk)# Create data for competing risks analysis# Assume event_type: 0=censored, 1=event of interest, 2=competing eventcr_data <- clinical_data %>%mutate(status =case_when( event_indicator ==0~0, # censored event_type =="primary"~1, # primary outcome event_type =="competing"~2# competing event ) )# Cumulative incidence functioncif <-cuminc(ftime = cr_data$time_to_event,fstatus = cr_data$status,group = cr_data$treatment_group)# Plot cumulative incidenceplot(cif, col =c("blue", "red", "green", "orange"))# Fine-Gray competing risks regressionfg_model <-crr(ftime = cr_data$time_to_event,fstatus = cr_data$status,cov = cr_data %>%select(age, sex, baseline_risk_score) %>%as.matrix(),failcode =1,cencode =0)# Summarize modelsummary(fg_model)# Plot competing risks cumulative incidence with survminerlibrary(survminer)ggcompetingrisks(fit = cif,palette ="jco",ggtheme =theme_minimal())```## Mixed-Effects Models for Longitudinal Data### Linear Mixed ModelsFor analyzing repeated measures with continuous outcomes:```{r}#| echo: true#| eval: falselibrary(lme4)library(lmerTest)# Linear mixed model for repeated measuresmixed_model <-lmer( outcome ~ treatment_group * visit + baseline_value + age + sex + (1| subject_id),data = longitudinal_data)# Model summarysummary(mixed_model)# Type III ANOVA tableanova(mixed_model, type =3)# Extract fixed effectsfixef(mixed_model)# Extract random effectsranef(mixed_model)# Check residualsplot(mixed_model)qqnorm(resid(mixed_model))qqline(resid(mixed_model))# Plot interaction between treatment and visitlibrary(emmeans)emm <-emmeans(mixed_model, ~ treatment_group | visit)plot(emm)# Extract and plot predicted valueslongitudinal_data$predicted <-predict(mixed_model)ggplot(longitudinal_data, aes(x = visit, y = outcome, group = subject_id, color = treatment_group)) +geom_line(alpha =0.3) +geom_smooth(aes(y = predicted, group = treatment_group), size =1.5, se =FALSE) +facet_wrap(~ treatment_group) +labs(title ="Individual Trajectories with Model Predictions",x ="Visit", y ="Outcome Measure") +theme_minimal()```### Generalized Linear Mixed ModelsFor non-continuous longitudinal outcomes:```{r}#| echo: true#| eval: false# Generalized linear mixed model for binary repeated outcomelibrary(lme4)binary_mixed_model <-glmer( binary_outcome ~ treatment_group * visit + baseline_value + age + sex + (1| subject_id),family = binomial,data = longitudinal_data)# Model summarysummary(binary_mixed_model)# Odds ratios with confidence intervalsfixef_results <-data.frame(estimate =fixef(binary_mixed_model),se =sqrt(diag(vcov(binary_mixed_model))))fixef_results <- fixef_results %>%mutate(odds_ratio =exp(estimate),lower_ci =exp(estimate -1.96* se),upper_ci =exp(estimate +1.96* se) )# Plot predicted probabilities over timenewdata <-expand.grid(visit =unique(longitudinal_data$visit),treatment_group =unique(longitudinal_data$treatment_group),baseline_value =mean(longitudinal_data$baseline_value),age =mean(longitudinal_data$age),sex ="Female"# or whatever reference category you want)newdata$predicted_prob <-predict(binary_mixed_model, newdata = newdata, type ="response",re.form =NA) # Population-average predictionsggplot(newdata, aes(x = visit, y = predicted_prob, color = treatment_group, group = treatment_group)) +geom_line(size =1.2) +geom_point(size =3) +scale_y_continuous(limits =c(0, 1)) +labs(title ="Predicted Probability of Outcome Over Time",subtitle ="By Treatment Group",x ="Visit", y ="Predicted Probability") +theme_minimal()```## Advanced Modeling Techniques### Multiple Imputation for Missing DataProper handling of missing data is crucial in clinical research:```{r}#| echo: true#| eval: falselibrary(mice)# Examine missing data patternsmd.pattern(clinical_data)# Create multiple imputationsimp <-mice(clinical_data, m =10, printFlag =FALSE)# Fit model on each imputed datasetmodels <-with(imp, lm(endpoint_value ~ baseline_value + treatment_group + age + sex))# Pool results across imputationspooled_results <-pool(models)summary(pooled_results)# Compare with complete case analysiscomplete_case_model <-lm(endpoint_value ~ baseline_value + treatment_group + age + sex,data = clinical_data)summary(complete_case_model)# Visualize differencespooled_coefs <-summary(pooled_results)complete_coefs <-tidy(complete_case_model)# Combine results for plottingcomparison <-bind_rows( pooled_coefs %>%select(term = term, estimate = estimate, std.error = std.error) %>%mutate(method ="Multiple Imputation"), complete_coefs %>%select(term = term, estimate = estimate, std.error = std.error) %>%mutate(method ="Complete Case")) %>%filter(term !="(Intercept)") %>%mutate(lower_ci = estimate -1.96* std.error,upper_ci = estimate +1.96* std.error )# Plot comparisonggplot(comparison, aes(x = term, y = estimate, color = method)) +geom_point(position =position_dodge(width =0.5), size =3) +geom_errorbar(aes(ymin = lower_ci, ymax = upper_ci),position =position_dodge(width =0.5),width =0.25) +coord_flip() +labs(title ="Comparison of Multiple Imputation vs Complete Case Analysis",x ="", y ="Coefficient Estimate (95% CI)") +theme_minimal()```### Bayesian ApproachesBayesian methods are gaining popularity in clinical research:```{r}#| echo: true#| eval: falselibrary(rstanarm)library(bayesplot)# Bayesian linear modelbayes_model <-stan_glm( endpoint_value ~ baseline_value + treatment_group + age + sex,data = clinical_data,prior =normal(0, 5),prior_intercept =normal(0, 10),chains =4,iter =2000)# Model summarysummary(bayes_model)prior_summary(bayes_model)# Check convergencemcmc_trace(bayes_model)# Posterior distributionsmcmc_intervals(bayes_model)# Posterior predictive checkspp_check(bayes_model)# Treatment effect with 95% credible intervaltreatment_effect <-as.matrix(bayes_model)[, "treatment_groupTreatment"]mean(treatment_effect)quantile(treatment_effect, c(0.025, 0.975))# Probability that treatment effect is greater than 0mean(treatment_effect >0)# Visualize posterior distribution of treatment effectmcmc_areas(as.matrix(bayes_model), pars ="treatment_groupTreatment") +geom_vline(xintercept =0, linetype ="dashed") +labs(title ="Posterior Distribution of Treatment Effect",subtitle ="With 95% Credible Interval",x ="Treatment Effect", y ="Density") +theme_minimal()```### Propensity Score MethodsFor observational studies and causal inference:```{r}#| echo: true#| eval: falselibrary(MatchIt)library(cobalt)# Fit propensity score modelps_model <-glm(treatment_group ~ age + sex + baseline_value + comorbidity_score + center,family = binomial,data = observational_data)# Add propensity scores to dataobservational_data$propensity_score <-predict(ps_model, type ="response")# Check propensity score distributionggplot(observational_data, aes(x = propensity_score, fill = treatment_group)) +geom_density(alpha =0.5) +labs(title ="Propensity Score Distributions",x ="Propensity Score", y ="Density") +theme_minimal()# Propensity score matchingmatch_obj <-matchit( treatment_group ~ age + sex + baseline_value + comorbidity_score + center,data = observational_data,method ="nearest",ratio =1,caliper =0.2)# Examine balance before and after matchingsummary(match_obj)# Get matched datamatched_data <-match.data(match_obj)# Visualize covariate balancelove.plot(match_obj, binary ="std", thresholds =c(m =0.1),var.order ="unadjusted",abs =TRUE) +labs(title ="Standardized Mean Differences Before and After Matching")# Treatment effect analysis on matched datamatched_model <-lm(endpoint_value ~ treatment_group, data = matched_data)summary(matched_model)# Alternative: Inverse probability weightingobservational_data$ipw <-ifelse( observational_data$treatment_group =="Treatment",1/ observational_data$propensity_score,1/ (1- observational_data$propensity_score))# Stabilize weightsobservational_data <- observational_data %>%mutate(ipw_stabilized = ipw *ifelse( treatment_group =="Treatment",mean(treatment_group =="Treatment"),mean(treatment_group =="Control") ) )# IPW regressionipw_model <-lm( endpoint_value ~ treatment_group,data = observational_data,weights = ipw_stabilized)summary(ipw_model)```## Exercises1. Using a clinical trial dataset, conduct a complete ANCOVA analysis for a continuous outcome, including thorough diagnostics and visualization of results.2. Analyze a binary outcome from a clinical study using logistic regression, and create plots that illustrate how the probability of response varies with key covariates.3. Perform a survival analysis for time-to-event data, including both Kaplan-Meier estimation and Cox proportional hazards modeling. Check and address any violations of the proportional hazards assumption.4. Analyze longitudinal data from a clinical trial using a linear mixed-effects model. Explore different correlation structures and select the best-fitting model based on information criteria.5. Compare multiple imputation, complete case analysis, and LOCF (Last Observation Carried Forward) approaches on a dataset with missing values, and discuss the impact on parameter estimates and conclusions.## SummaryStatistical modeling in clinical research requires careful consideration of study design, outcome types, and data characteristics. This chapter has covered a range of modeling approaches commonly used in clinical settings, from traditional methods like linear and logistic regression to more specialized techniques for longitudinal and survival data. We've emphasized not just model fitting, but also thorough diagnostics, visualization, and interpretation—all essential components of reproducible and reliable clinical research.When applying these methods, it's critical to align the modeling approach with the research questions and to be transparent about assumptions and limitations. Modern R packages like tidymodels, lme4, survival, and rstanarm provide powerful tools for implementing these methods within a reproducible framework, allowing clinical researchers to generate robust evidence while maintaining regulatory compliance.## References