NC TraCS: Technical Resources
This page contains educational resources (videos, slides, and Jupyter notebooks) on various parts of the data science lifecycle, geared towards medical research using OHDSI's OMOP Common Data Model.
If there's a specific topic you would like covered, please submit a request.
Data Science Workflow
Video
Slides
Exploratory Data Analysis
Video
Notebook
R Source Code
Data
SQL Tutorials
Introduction to SQL
More Advanced Queries
Sample SQL
Statistical Concepts
In healthcare and biomedical research, statistical methods are essential for drawing reliable conclusions from data, assessing the effectiveness of treatments, and making informed clinical decisions. From hypothesis testing in clinical trials to evaluating predictive models in diagnostics, a solid understanding of key statistical principles enhances the ability to interpret findings and apply evidence-based practices. This tutorial series introduces the fundamental statistical concepts needed to understand and conduct research in healthcare.
Hypothesis Testing
Hypothesis testing is a statistical method used to evaluate assumptions about a population based on sample data. It plays a crucial role in clinical trials and medical research. For example, hypothesis testing can be used to assess whether a new treatment provides greater benefits compared to the current standard of care. A hypothesis is a statement about a measurable population parameter, such as the effect of smoking on specific health outcomes. Hypothesis testing then assesses whether the data provide sufficient evidence to reject that statement in favor of an alternative, or whether the statement cannot be rejected.
In general, there are two hypotheses: a null hypothesis (H_0) and an alternative hypothesis (H_1 or H_a). The null hypothesis typically represents the assumption that there is no effect or difference. For example, a null hypothesis could be: “There is no difference in health outcomes between smokers and non-smokers.” The alternative hypothesis is typically the opposite: that there is an effect or difference. For example: “Smokers have worse health outcomes compared to non-smokers.”
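As a minimal sketch of the smoking example above, the following Python snippet runs a one-sided two-proportion z-test using only the standard library. The counts are hypothetical (they do not come from the tutorial data), the function name two_proportion_z_test is an illustrative choice, and the test relies on a large-sample normal approximation.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(events_a, n_a, events_b, n_b, alpha=0.05):
    """One-sided test of H0: p_a == p_b against H1: p_a > p_b."""
    p_a = events_a / n_a
    p_b = events_b / n_b
    # Under H0 both groups share one rate, so pool the counts to estimate it.
    pooled = (events_a + events_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Reject H0 when z exceeds the upper-tail critical value for alpha.
    critical = NormalDist().inv_cdf(1 - alpha)
    return z, critical, z > critical


# Hypothetical counts: 60 of 400 smokers and 45 of 500 non-smokers
# experienced the adverse outcome.
z, critical, reject = two_proportion_z_test(60, 400, 45, 500)
print(f"z = {z:.3f}, critical value = {critical:.3f}, reject H0: {reject}")
```

Failing to reject H_0 here would not show that the two groups are the same; it would only mean the sample did not provide enough evidence of a difference.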
Our video tutorials present some key concepts in hypothesis testing:
Rejection Region
Explore key concepts and important information about the Rejection Region in Hypothesis Testing. This tutorial uses real-world examples to guide you through step-by-step calculations and statistical reasoning.
Video
Slides
Python source code
R source code
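The rejection region can be illustrated with a short sketch, assuming a z-test whose statistic is standard normal under the null hypothesis. The function names rejection_region and in_rejection_region are hypothetical helpers written for this page, not part of the linked source code.

```python
from statistics import NormalDist


def rejection_region(alpha=0.05, alternative="two-sided"):
    """Return (lower, upper) cutoffs; H0 is rejected when z < lower or z > upper."""
    std_normal = NormalDist()
    if alternative == "two-sided":
        # Split alpha evenly between the two tails.
        cutoff = std_normal.inv_cdf(1 - alpha / 2)
        return -cutoff, cutoff
    if alternative == "greater":
        return float("-inf"), std_normal.inv_cdf(1 - alpha)
    if alternative == "less":
        return std_normal.inv_cdf(alpha), float("inf")
    raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")


def in_rejection_region(z, alpha=0.05, alternative="two-sided"):
    """True when the observed z statistic falls in the rejection region."""
    lower, upper = rejection_region(alpha, alternative)
    return z < lower or z > upper


lower, upper = rejection_region(0.05)
print(f"Two-sided region at alpha = 0.05: z < {lower:.3f} or z > {upper:.3f}")
```

The same observed statistic can fall inside a one-sided region and outside the two-sided one, which is why the direction of the alternative should be fixed before looking at the data.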
Type I and Type II Errors
Learn about Type I and Type II errors in hypothesis testing, including their meanings, implications, and how they affect statistical decision-making.
Video
Slides
Python source code
R source code
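As a rough illustration, the two error probabilities can be computed directly for a one-sided z-test of a mean with known standard deviation. The blood-pressure numbers below are hypothetical, and error_rates is an illustrative helper name.

```python
from math import sqrt
from statistics import NormalDist


def error_rates(mu0, mu1, sigma, n, alpha=0.05):
    """Type I and Type II error rates for H0: mu = mu0 vs H1: mu = mu1 > mu0."""
    std_err = sigma / sqrt(n)
    # H0 is rejected when the sample mean exceeds this cutoff.
    cutoff = NormalDist(mu0, std_err).inv_cdf(1 - alpha)
    # Type I error: rejecting H0 although the true mean is mu0.
    type_1 = 1 - NormalDist(mu0, std_err).cdf(cutoff)
    # Type II error: failing to reject H0 although the true mean is mu1.
    type_2 = NormalDist(mu1, std_err).cdf(cutoff)
    return cutoff, type_1, type_2


# Hypothetical setting: mean systolic pressure 120 under H0, 125 under H1,
# standard deviation 15, and 36 patients.
cutoff, type_1, type_2 = error_rates(120, 125, 15, 36)
print(f"cutoff = {cutoff:.2f}, Type I = {type_1:.3f}, Type II = {type_2:.3f}")
```

Lowering alpha moves the cutoff further from mu0, which reduces Type I errors but increases Type II errors for a fixed sample size.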
Power Function
Understand the power function in hypothesis testing and its relationship with sample size, significance level, effect size, and Type II errors.
Video
Slides
Python source code
R source code
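The relationship between power, sample size, significance level, and effect size can be sketched for a one-sided z-test with known standard deviation. The helper names power and smallest_n_for_power, and the numbers used, are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def power(effect, sigma, n, alpha=0.05):
    """Power of a one-sided z-test when the true mean exceeds the null mean by `effect`."""
    z_crit = NormalDist().inv_cdf(1 - alpha)
    # Standardized shift of the test statistic under the alternative.
    shift = effect * sqrt(n) / sigma
    return 1 - NormalDist().cdf(z_crit - shift)


def smallest_n_for_power(effect, sigma, target=0.80, alpha=0.05, max_n=100_000):
    """Smallest sample size whose power reaches `target`, or None if not found."""
    for n in range(1, max_n + 1):
        if power(effect, sigma, n, alpha) >= target:
            return n
    return None


# Hypothetical effect of 5 units with a standard deviation of 15.
for n in (10, 36, 100):
    print(f"n = {n:3d}: power = {power(5, 15, n):.3f}")
print("n needed for 80% power:", smallest_n_for_power(5, 15))
```

Power equals 1 minus the Type II error rate, so with zero effect it reduces to the significance level itself.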
Significance Level
Learn about the significance level (alpha) in hypothesis testing, its role in defining the rejection region, and its impact on Type I and Type II errors.
Video
Slides
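One way to see what the significance level means is to simulate many studies in which the null hypothesis is true and count how often it is rejected. The sketch below assumes a two-sided z-test with known standard deviation 1 and synthetic normal data; simulated_type_1_rate is a hypothetical helper name.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def simulated_type_1_rate(alpha, n=20, trials=5000, seed=0):
    """Fraction of simulated studies that reject a true H0: mu = 0."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # z statistic with known sigma = 1 and null mean 0.
        z = fmean(sample) * sqrt(n)
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials


for alpha in (0.01, 0.05, 0.10):
    print(f"alpha = {alpha:.2f}: simulated rejection rate = {simulated_type_1_rate(alpha):.3f}")
```

The simulated rejection rate should land close to alpha, up to simulation noise, because alpha is by definition the probability of rejecting a true null hypothesis.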
P-values
Learn about the P-value in hypothesis testing, how it helps determine statistical significance, and its relationship with the rejection region and significance level.
Video
Slides
Python source code
R source code
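A minimal sketch of the P-value calculation for a z statistic follows, assuming the statistic is standard normal under the null hypothesis. The function name p_value is illustrative.

```python
from statistics import NormalDist


def p_value(z, alternative="two-sided"):
    """Probability, under H0, of a z statistic at least as extreme as the one observed."""
    std_normal = NormalDist()
    if alternative == "greater":
        return 1 - std_normal.cdf(z)
    if alternative == "less":
        return std_normal.cdf(z)
    if alternative == "two-sided":
        return 2 * (1 - std_normal.cdf(abs(z)))
    raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")


alpha = 0.05
for z in (1.2, 1.96, 2.8):
    p = p_value(z)
    print(f"z = {z:.2f}: p = {p:.4f}, significant at alpha = {alpha}: {p < alpha}")
```

Comparing the P-value with alpha gives the same decision as checking whether the statistic falls in the rejection region for that alpha.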
Confidence and Prediction Intervals
In medical (or any) research, it is important to quantify the uncertainty around a point estimate. Confidence and prediction intervals are statistical tools for this purpose.
Python source code
R source code
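The difference between the two intervals can be sketched for the simplest case, a sample mean. This example assumes roughly normal data and uses a normal quantile for brevity; with small samples a t quantile with n - 1 degrees of freedom would be more appropriate. The readings and the helper name mean_intervals are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, fmean, stdev


def mean_intervals(values, level=0.95):
    """Approximate confidence interval for the mean and prediction interval for a new value."""
    n = len(values)
    center = fmean(values)
    s = stdev(values)  # sample standard deviation
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    # Confidence interval: uncertainty about the population mean only.
    ci_half = z * s / sqrt(n)
    # Prediction interval: adds the spread of a single new observation.
    pi_half = z * s * sqrt(1 + 1 / n)
    return (center - ci_half, center + ci_half), (center - pi_half, center + pi_half)


# Hypothetical systolic blood pressure readings (synthetic values).
readings = [118, 125, 130, 121, 127, 135, 122, 129, 124, 131, 126, 120]
ci, pi = mean_intervals(readings)
print(f"95% confidence interval for the mean: ({ci[0]:.1f}, {ci[1]:.1f})")
print(f"95% prediction interval for a new reading: ({pi[0]:.1f}, {pi[1]:.1f})")
```

The prediction interval is always wider, because it must cover the variability of an individual observation in addition to the uncertainty in the estimated mean.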
Bias and Variance
Explore bias and variance in statistical inference and their trade-offs in predictive accuracy.
Video
Slides
R source code
Test Data
Train Data
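As a small illustration of the trade-off, the sketch below simulates two estimators of a mean on synthetic normal data: the ordinary sample mean and a deliberately shrunken version of it. The setup (true mean, shrinkage factor 0.8, helper name bias_variance) is an assumption for illustration and is not taken from the linked tutorial.

```python
import random
from statistics import fmean, pvariance


def bias_variance(estimator, true_mean=10.0, sigma=4.0, n=10, trials=2000, seed=1):
    """Estimate bias, variance, and mean squared error of an estimator by simulation."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    estimates = []
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        estimates.append(estimator(sample))
    bias = fmean(estimates) - true_mean
    variance = pvariance(estimates)
    mse = fmean([(e - true_mean) ** 2 for e in estimates])
    return bias, variance, mse


def sample_mean(sample):
    return fmean(sample)


def shrunk_mean(sample):
    # Pulling the estimate toward zero lowers variance but introduces bias.
    return 0.8 * fmean(sample)


for name, estimator in (("sample mean", sample_mean), ("shrunk mean", shrunk_mean)):
    bias, variance, mse = bias_variance(estimator)
    print(f"{name}: bias = {bias:.3f}, variance = {variance:.3f}, MSE = {mse:.3f}")
```

In each case the mean squared error equals the squared bias plus the variance, which is the decomposition behind the bias-variance trade-off.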
ROC Curves
In healthcare analytics, predictive models are used for tasks such as diagnosing diseases or predicting patient outcomes. It is critical to evaluate how well these models perform using appropriate metrics. Commonly used evaluation metrics include the ROC (Receiver Operating Characteristic) curve and the AUC (Area Under the Curve), which measure the model’s ability to distinguish between positive and negative cases across various classification thresholds. Additionally, Precision, Recall, and the F1 Score are used to assess classification performance at a given threshold, highlighting how well the model identifies true positives while limiting false positives and false negatives.
Video
Slides
R source code
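These metrics can be computed by hand on a tiny example. The sketch below uses hypothetical labels and scores and standard-library Python only; the AUC is computed through its pairwise interpretation (the chance that a randomly chosen positive case scores higher than a randomly chosen negative case). The function names are illustrative.

```python
def auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def precision_recall_f1(labels, scores, threshold=0.5):
    """Precision, recall, and F1 after classifying scores >= threshold as positive."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical labels (1 = disease present) and model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
precision, recall, f1 = precision_recall_f1(labels, scores)
print(f"AUC = {auc(labels, scores):.3f}")
print(f"precision = {precision:.3f}, recall = {recall:.3f}, F1 = {f1:.3f}")
```

The AUC does not depend on a threshold, whereas precision, recall, and F1 change whenever the threshold changes.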
R-Squared and Adjusted R-Squared
When building predictive models for healthcare data, it is crucial to understand how well the model represents the data without overfitting or underfitting. We introduce the concepts of R-Squared and Adjusted R-Squared and discuss the bias-variance tradeoff.
Video
Slides
Python source code
R source code
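Both quantities follow directly from the residuals of a fitted model. The sketch below uses the standard definitions with hypothetical observed and predicted values; the helper names and numbers are illustrative and are not taken from the linked source code.

```python
from statistics import fmean


def r_squared(observed, predicted):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_obs = fmean(observed)
    ss_res = sum((y - f) ** 2 for y, f in zip(observed, predicted))
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(observed, predicted, n_predictors):
    """Adjusted R^2 penalizes R^2 for the number of predictors in the model."""
    n = len(observed)
    r2 = r_squared(observed, predicted)
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)


# Hypothetical observed values and model predictions.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]
print(f"R^2 = {r_squared(observed, predicted):.4f}")
for p in (1, 2):
    print(f"adjusted R^2 with {p} predictor(s) = {adjusted_r_squared(observed, predicted, p):.4f}")
```

For the same fit, the adjusted value drops as the number of predictors grows, which is why it is the more cautious guide when comparing models of different sizes.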
Note: All provided data is synthetic - no real patient data is available on this page.