{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"provenance":[],"authorship_tag":"ABX9TyPpA83DZn7z0sTr93hd8nnb"},"kernelspec":{"name":"ir","display_name":"R"},"language_info":{"name":"R"}},"cells":[{"cell_type":"markdown","source":["# Exploratory Data Analysis Tutorial for OMOP Dataset Extract"],"metadata":{"id":"zK1xJOsU01lS"}},{"cell_type":"markdown","source":["**Exploratory Data Analysis** (EDA) is the process of analyzing and summarizing the main characteristics of a dataset, often using visual methods. It’s an essential step in data analysis that helps to understand the data's structure, detect patterns, identify anomalies, test hypotheses, and determine relationships among variables."],"metadata":{"id":"A0sW6eE205bG"}},{"cell_type":"markdown","source":["### Setting up the environment"],"metadata":{"id":"yjMf-A0w1oAu"}},{"cell_type":"code","source":["library(dplyr) # Data manipulation\n","library(tidyr) # Data cleaning\n","library(ggplot2) # Data visualization\n","library(readr) # Reading CSV files\n","library(forcats) # Factor reordering\n","library(reshape2) # Melting the correlation matrix"],"metadata":{"id":"HXS8Uwte11Qe","executionInfo":{"status":"ok","timestamp":1731293301274,"user_tz":300,"elapsed":141,"user":{"displayName":"Yuqi Su","userId":"03184251790933787012"}}},"execution_count":93,"outputs":[]},{"cell_type":"markdown","source":["### Loading the dataset\n","The dataset that we will be using in this tutorial is the list of all patients suffering from COVID-19 in the OMOP database and their demographic data."],"metadata":{"id":"tflv4D0--ouy"}},{"cell_type":"code","source":["# Replace '/content/COVID-19-BMI.csv' with the actual path to your dataset\n","df <- read_csv('/content/COVID-19-BMI.csv')\n","\n","# Display the first few rows of the dataset to get a glimpse of its 
content\n","head(df)"],"metadata":{"id":"Yqk1uE841rfS","colab":{"base_uri":"https://localhost:8080/","height":436},"executionInfo":{"status":"ok","timestamp":1731292183335,"user_tz":300,"elapsed":403,"user":{"displayName":"Yuqi Su","userId":"03184251790933787012"}},"outputId":"0a9346bd-f234-4f9d-9ace-587701018d59"},"execution_count":46,"outputs":[{"output_type":"stream","name":"stderr","text":["\u001b[1mRows: \u001b[22m\u001b[34m88166\u001b[39m \u001b[1mColumns: \u001b[22m\u001b[34m9\u001b[39m\n","\u001b[36m──\u001b[39m \u001b[1mColumn specification\u001b[22m \u001b[36m────────────────────────────────────────────────────────\u001b[39m\n","\u001b[1mDelimiter:\u001b[22m \",\"\n","\u001b[31mchr\u001b[39m (4): gender, race, height, bmi\n","\u001b[32mdbl\u001b[39m (4): person_id, condition_concept_id, age, weight\n","\u001b[34mdate\u001b[39m (1): condition_start_date\n","\n","\u001b[36mℹ\u001b[39m Use `spec()` to retrieve the full column specification for this data.\n","\u001b[36mℹ\u001b[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.\n"]},{"output_type":"display_data","data":{"text/html":["
person_id | gender | race | condition_concept_id | condition_start_date | age | weight | height | bmi |
---|---|---|---|---|---|---|---|---|
<dbl> | <chr> | <chr> | <dbl> | <date> | <dbl> | <dbl> | <chr> | <chr> |
1 | F | white | 37311061 | 2020-03-11 | 62 | 71.3 | 160.8 | 27.6 |
2 | F | white | 37311061 | 2020-03-02 | 75 | 87.7 | 169.5 | 30.5 |
3 | M | white | 37311061 | 2020-03-14 | 52 | 91.9 | 177.2 | 29.4 |
5 | F | white | 37311061 | 2020-03-10 | 32 | 74.4 | 158.1 | 29.8 |
6 | F | white | 37311061 | 2020-02-12 | 44 | 71.0 | NULL | NULL |
7 | F | black | 37311061 | 2020-03-09 | 62 | 81.2 | 164.0 | 30.2 |