2 Data Preparation and Cleaning

2.1 Introduction to Clinical Data Structures

Clinical research generates several kinds of data, each with its own structure and its own preparation challenges. Understanding these structures is the first step toward preparing the data for analysis.

2.1.1 Common Clinical Data Sources

Clinical data typically originates from:

  1. Electronic Data Capture (EDC) systems: Purpose-built for clinical trials
  2. Electronic Health Records (EHR): Patient records from healthcare systems
  3. Electronic patient-reported outcomes (ePRO): Data collected directly from patients
  4. Laboratory data: Standardized clinical measurements
  5. Wearable devices: Continuous monitoring data
  6. Imaging data: Radiological and other imaging outputs

2.1.2 Standard Data Formats in Clinical Research

Clinical research often uses standardized data formats. For example, SDTM domains stored as SAS datasets can be imported into R with the haven package:

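```r
# Example of importing CDISC SDTM formatted data
library(haven)      # read_sas() for .sas7bdat files
library(tidyverse)  # glimpse() and data manipulation tools

# Import SAS datasets with SDTM structure (file paths are examples)
demographics <- read_sas("data/dm.sas7bdat")  # DM: Demographics domain
vitals <- read_sas("data/vs.sas7bdat")        # VS: Vital Signs domain
labs <- read_sas("data/lb.sas7bdat")          # LB: Laboratory Test Results domain

# View the structure: column names, types, and the first few values
glimpse(demographics)
```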
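SDTM findings domains such as VS are stored in a long layout, with one row per subject, visit, and test, and all domains share the subject key USUBJID. The following is a minimal sketch of reshaping such data and attaching demographics. It assumes toy stand-ins for the DM and VS domains: the variable names (USUBJID, VSTESTCD, VSSTRESN, VISITNUM) follow SDTM conventions, but the subject identifiers and values are invented for illustration.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the DM domain (one row per subject; values are invented)
dm <- tibble(
  USUBJID = c("STUDY1-001", "STUDY1-002"),
  AGE     = c(54, 61),
  SEX     = c("F", "M")
)

# Toy stand-in for the VS domain (one row per subject, visit, and test)
vs <- tibble(
  USUBJID  = c("STUDY1-001", "STUDY1-001", "STUDY1-002", "STUDY1-002"),
  VISITNUM = c(1, 1, 1, 1),
  VSTESTCD = c("SYSBP", "DIABP", "SYSBP", "DIABP"),
  VSSTRESN = c(128, 82, 141, 90)
)

# Reshape to one row per subject and visit, then attach demographics
vs_wide <- vs %>%
  pivot_wider(
    id_cols     = c(USUBJID, VISITNUM),
    names_from  = VSTESTCD,
    values_from = VSSTRESN
  ) %>%
  left_join(dm, by = "USUBJID")

print(vs_wide)
```

Keeping the data long until the analysis requires a wide layout usually makes cleaning steps, such as unit checks per test code, easier to write.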

The most common standardized formats include:

  • CDISC SDTM (Study Data Tabulation Model): Standardized structure that organizes collected study data into domains for regulatory submission
  • CDISC ADaM (Analysis Data Model): Analysis-ready datasets derived from SDTM, with traceability back to the source data
  • OMOP (Observational Medical Outcomes Partnership): Common data model for observational health data
  • FHIR (Fast Healthcare Interoperability Resources): HL7 standard for healthcare data exchange
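To illustrate the difference between tabulation (SDTM) and analysis (ADaM) data, the sketch below derives a small ADaM-style vital signs dataset from SDTM-style records. It is a simplified illustration, not a complete ADaM implementation: the input data are invented, the baseline flag is assumed to be supplied in VSBLFL, and change from baseline is computed for every record, including the baseline record itself.

```r
library(dplyr)

# Toy SDTM-style VS records for one subject (values are invented)
vs <- tibble(
  USUBJID  = rep("STUDY1-001", 3),
  VSTESTCD = "SYSBP",
  VISITNUM = c(1, 2, 3),
  VSSTRESN = c(130, 124, 121),
  VSBLFL   = c("Y", "", "")   # assumed: baseline record flagged with "Y"
)

# Rename to ADaM-style variables, then derive baseline and change from baseline
advs <- vs %>%
  transmute(
    USUBJID,
    PARAMCD = VSTESTCD,   # parameter code
    AVISITN = VISITNUM,   # analysis visit number
    AVAL    = VSSTRESN,   # analysis value
    ABLFL   = VSBLFL      # baseline record flag
  ) %>%
  group_by(USUBJID, PARAMCD) %>%
  mutate(
    BASE = AVAL[ABLFL == "Y"][1],  # baseline value per subject and parameter
    CHG  = AVAL - BASE             # change from baseline
  ) %>%
  ungroup()

print(advs)
```

The derived columns BASE and CHG are the kind of analysis variables that ADaM stores explicitly, so that statistical code does not need to recompute them.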

For more detailed information on CDISC standards and their implementation in R, see Chapter ?sec-cdisc.