Project Overview¶
My interest in kidney disease prediction stems from a personal passion for health and wellness, as well as a family history of Polycystic Kidney Disease (PKD). With kidney failure rates rising across the United States, I wanted to explore how demographic data, lab results, and dietary factors can be used to predict the likelihood of an individual requiring dialysis or a transplant.
Objective¶
The goal of this phase is to prepare a high-quality dataset for predictive modeling. I drew upon medical literature to select relevant features, followed by a rigorous pipeline of cleaning, transformation, and imputation.
Methodology¶
- Data Preprocessing & Imputation: Cleaned the raw data and handled missing values to ensure a complete dataset.
- Feature Engineering & Selection:
  - Correlation Analysis: Identified and addressed multicollinearity within the feature matrix.
  - Multiple Linear Regression: Evaluated variance and ranked feature importance.
  - Recursive Feature Elimination (RFE): Systematically narrowed the dataset down to the 11 most impactful predictors.
Output: The result is a refined .csv dataset ready for the modeling phase, where I will evaluate various bagging and boosting techniques.
Note: This model is designed to assess the likelihood of current kidney failure rather than forecasting future onset.
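The RFE step above can be sketched with caret. This is a minimal illustration, not the notebook's exact call: it assumes a cleaned data frame `df` whose outcome column is `failingkidney` (names borrowed from later in this notebook), and uses random-forest ranking with 5-fold cross-validation.

```r
# Hedged sketch of recursive feature elimination with caret.
# `df` and the column names are assumptions, not the final pipeline.
library(caret)

set.seed(42)
ctrl <- rfeControl(functions = rfFuncs,  # random-forest based ranking
                   method = "cv",
                   number = 5)           # 5-fold cross-validation

rfe_fit <- rfe(x = df[, setdiff(names(df), "failingkidney")],
               y = as.factor(df$failingkidney),
               sizes = 11,               # evaluate an 11-feature subset
               rfeControl = ctrl)

predictors(rfe_fit)                      # names of the selected features
```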
Data¶
Where does the data come from?¶
Data Source: NHANES (CDC/NCHS)

The data for this model is sourced from the National Health and Nutrition Examination Survey (NHANES), a flagship program of the Centers for Disease Control and Prevention (CDC).
Why is this data significant?

- National Representation: The survey examines a representative sample of approximately 5,000 individuals across the U.S. annually.
- Gold-Standard Metrics: Unlike many datasets that rely solely on self-reported surveys, NHANES combines in-home interviews with physical examinations and laboratory tests conducted in mobile examination centers.
- Multidimensional Features: It provides a comprehensive view of a patient by linking demographic and socioeconomic factors with clinical lab results.
The Data Ecosystem¶
The power of this prediction model lies in its multi-dimensional approach, merging six distinct datasets from the NHANES 2013-2014 cycle:
- Demographics: Key Variables include Age, Gender, Race/Ethnicity, Education, and Income. Establishes baseline risk groups.
- Examinations: Key Variables include Blood Pressure, BMI, and Grip Strength. Identifies physical indicators of hypertension.
- Laboratory: Key Variables include Albumin, Creatinine, and Cholesterol. Provides direct chemical evidence of kidney efficiency.
- Dietary Data: Key Variables include Sodium and protein consumption.
- Questionnaires: Key Variables include Alcohol use and cardiovascular history.
- Medication: Identifies individuals already being treated for related conditions.
Data Processing & Methodology¶
The Technical Choice: Why R?

For this analysis, I utilized R and the Tidyverse ecosystem. While Python is noted for speed, R's specialized packages (specifically mice for imputation and caret for recursive feature elimination) offered the high-precision tools necessary for complex medical data.
library(tidyverse) # Data manipulation
library(mice) # Imputation
library(caret) # Model training & RFE
Overcoming Data Sparsity: Imputation Strategy¶
A significant challenge in the NHANES dataset is missing values. A "listwise deletion" approach (removing any row with an NA) would have resulted in the loss of nearly the entire sample, as medical exams are often incomplete for various participants.
To solve this, I used the MICE (Multivariate Imputation by Chained Equations) algorithm. This allowed me to preserve the statistical power of the dataset by predicting missing values based on the relationships between other observed variables, ensuring the final 11 features remained robust and representative.
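The MICE step can be sketched as follows. This is a hedged illustration, assuming a merged predictor data frame named `features` (a hypothetical name) and predictive mean matching as the per-column method; the actual run may differ.

```r
# Sketch of multivariate imputation with the mice package.
# `features` is an assumed data frame of predictor columns.
library(mice)

set.seed(123)
imp <- mice(features,
            m = 5,               # generate five imputed datasets
            method = "pmm",      # predictive mean matching for numerics
            printFlag = FALSE)   # suppress the iteration log

features_complete <- complete(imp, 1)  # extract the first completed dataset
```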
Next we can load in the datasets¶
There are 6 tables in this dataset. After a deeper dive into the components needed for kidney disease prediction, I will drop the tables that are not required for the bagging/boosting phase.
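The load step can be sketched like this, assuming the six NHANES 2013-2014 CSVs sit in the working directory under these (hypothetical) file names:

```r
# Sketch of loading the six NHANES tables; file names are assumptions.
library(tidyverse)

demographics  <- read_csv("demographic.csv")
diet          <- read_csv("diet.csv")
examinations  <- read_csv("examination.csv")
labs          <- read_csv("labs.csv")
medications   <- read_csv("medications.csv")
questionnaire <- read_csv("questionnaire.csv")
```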
Verifying Data¶
Next, it is helpful to verify the data. I will start with the head(), glimpse(), and dim() functions, as those are my favorites. I will also open the files in Excel to make sure the responses and patient IDs match. With this much data, it may have been cleaner to simply run dim(), confirm matching IDs (SEQN) with a join, and run dim() again.
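The verification calls described above look like this for one table (assuming the tables were loaded under the names used in this notebook); the printed output for each table follows.

```r
# Sketch of the verification pass, shown for the demographics table.
library(tidyverse)

head(demographics, 5)   # first rows, to eyeball values
glimpse(demographics)   # column names, types, and a preview
dim(demographics)       # number of rows and columns
```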
| SEQN | SDDSRVYR | RIDSTATR | RIAGENDR | RIDAGEYR | |
|---|---|---|---|---|---|
| <int> | <int> | <int> | <int> | <int> | |
| 1 | 73557 | 8 | 2 | 1 | 69 |
| 2 | 73558 | 8 | 2 | 1 | 54 |
| 3 | 73559 | 8 | 2 | 1 | 72 |
| 4 | 73560 | 8 | 2 | 1 | 9 |
| 5 | 73561 | 8 | 2 | 2 | 73 |
Rows: 10,175
Columns: 5
$ SEQN     <int> 73557, 73558, 73559, 73560, 73561, 73562, 73563, 73564, 73565…
$ SDDSRVYR <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8…
$ RIDSTATR <int> 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2…
$ RIAGENDR <int> 1, 1, 1, 1, 2, 1, 1, 2, 1, 2, 1, 2, 2, 2, 1, 2, 1, 2, 1, 1, 1…
$ RIDAGEYR <int> 69, 54, 72, 9, 73, 56, 0, 61, 42, 56, 65, 26, 0, 9, 76, 10, 1…
| SEQN | WTDRD1 | WTDR2D | DR1DRSTZ | DR1EXMER | |
|---|---|---|---|---|---|
| <int> | <dbl> | <dbl> | <int> | <int> | |
| 1 | 73557 | 16888.33 | 12930.89 | 1 | 49 |
| 2 | 73558 | 17932.14 | 12684.15 | 1 | 59 |
| 3 | 73559 | 59641.81 | 39394.24 | 1 | 49 |
| 4 | 73560 | 142203.07 | 125966.37 | 1 | 54 |
| 5 | 73561 | 59052.36 | 39004.89 | 1 | 63 |
Rows: 9,813
Columns: 5
$ SEQN     <int> 73557, 73558, 73559, 73560, 73561, 73562, 73563, 73564, 73566…
$ WTDRD1   <dbl> 16888.328, 17932.144, 59641.813, 142203.070, 59052.357, 49890…
$ WTDR2D   <dbl> 12930.89, 12684.15, 39394.24, 125966.37, 39004.89, 0.00, 4073…
$ DR1DRSTZ <int> 1, 1, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1, 1, 1…
$ DR1EXMER <int> 49, 59, 49, 54, 63, 49, 54, 54, 49, 61, 87, 22, 25, 61, NA, 4…
| SEQN | PEASCST1 | PEASCTM1 | PEASCCT1 | BPXCHR | |
|---|---|---|---|---|---|
| <int> | <int> | <int> | <int> | <int> | |
| 1 | 73557 | 1 | 620 | NA | NA |
| 2 | 73558 | 1 | 766 | NA | NA |
| 3 | 73559 | 1 | 665 | NA | NA |
| 4 | 73560 | 1 | 803 | NA | NA |
| 5 | 73561 | 1 | 949 | NA | NA |
Rows: 9,813
Columns: 5
$ SEQN     <int> 73557, 73558, 73559, 73560, 73561, 73562, 73563, 73564, 73566…
$ PEASCST1 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ PEASCTM1 <int> 620, 766, 665, 803, 949, 1064, 90, 954, 625, 932, 585, 710, 1…
$ PEASCCT1 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ BPXCHR   <int> NA, NA, NA, NA, NA, NA, 152, NA, NA, NA, NA, NA, NA, NA, NA, …
| SEQN | URXUMA | URXUMS | URXUCR.x | URXCRS | |
|---|---|---|---|---|---|
| <int> | <dbl> | <dbl> | <int> | <dbl> | |
| 1 | 73557 | 4.3 | 4.3 | 39 | 3447.6 |
| 2 | 73558 | 153.0 | 153.0 | 50 | 4420.0 |
| 3 | 73559 | 11.9 | 11.9 | 113 | 9989.2 |
| 4 | 73560 | 16.0 | 16.0 | 76 | 6718.4 |
| 5 | 73561 | 255.0 | 255.0 | 147 | 12994.8 |
Rows: 9,813
Columns: 5
$ SEQN     <int> 73557, 73558, 73559, 73560, 73561, 73562, 73563, 73564, 73566…
$ URXUMA   <dbl> 4.3, 153.0, 11.9, 16.0, 255.0, 123.0, NA, 19.0, 1.3, 35.0, 25…
$ URXUMS   <dbl> 4.3, 153.0, 11.9, 16.0, 255.0, 123.0, NA, 19.0, 1.3, 35.0, 25…
$ URXUCR.x <int> 39, 50, 113, 76, 147, 74, NA, 242, 18, 215, 31, 116, 177, 144…
$ URXCRS   <dbl> 3447.6, 4420.0, 9989.2, 6718.4, 12994.8, 6541.6, NA, 21392.8,…
| SEQN | RXDUSE | RXDDRUG | RXDDRGID | RXQSEEN | |
|---|---|---|---|---|---|
| <int> | <int> | <chr> | <chr> | <int> | |
| 1 | 73557 | 1 | 99999 | NA | |
| 2 | 73557 | 1 | INSULIN | d00262 | 2 |
| 3 | 73558 | 1 | GABAPENTIN | d03182 | 1 |
| 4 | 73558 | 1 | INSULIN GLARGINE | d04538 | 1 |
| 5 | 73558 | 1 | OLMESARTAN | d04801 | 1 |
Rows: 20,194
Columns: 5
$ SEQN     <int> 73557, 73557, 73558, 73558, 73558, 73558, 73559, 73559, 73559…
$ RXDUSE   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ RXDDRUG  <chr> "99999", "INSULIN", "GABAPENTIN", "INSULIN GLARGINE", "OLMESA…
$ RXDDRGID <chr> "", "d00262", "d03182", "d04538", "d04801", "d00746", "d04697…
$ RXQSEEN  <int> NA, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, NA, 1, 1, 1, 1, 2, 2, 2, 2,…
| SEQN | ACD011A | ACD011B | ACD011C | ACD040 | |
|---|---|---|---|---|---|
| <int> | <int> | <int> | <int> | <int> | |
| 1 | 73557 | 1 | NA | NA | NA |
| 2 | 73558 | 1 | NA | NA | NA |
| 3 | 73559 | 1 | NA | NA | NA |
| 4 | 73560 | 1 | NA | NA | NA |
| 5 | 73561 | 1 | NA | NA | NA |
Rows: 10,175
Columns: 5
$ SEQN    <int> 73557, 73558, 73559, 73560, 73561, 73562, 73563, 73564, 73565,…
$ ACD011A <int> 1, 1, 1, 1, 1, NA, NA, 1, NA, 1, 1, 1, NA, 1, 1, 1, 1, NA, NA,…
$ ACD011B <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ACD011C <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ACD040  <int> NA, NA, NA, NA, NA, 4, NA, NA, 5, NA, NA, NA, NA, NA, NA, NA, …
Dimensions returned by dim() for each table, in load order (rows × columns):

- demographics: 10175 × 47
- diet: 9813 × 168
- examinations: 9813 × 224
- labs: 9813 × 424
- medications: 20194 × 13
- questionnaire: 10175 × 953
I notice the majority of the data is in int and dbl format. There are a lot of NAs, which is expected, since not every patient will qualify for every examination or test (Missing Not At Random data). The missingness is therefore still useful information, carrying a meaningful signal as negative feedback.
The headers are a bit confusing given the shorthand chosen to format the long data. However, when I open the files in Excel, a description of each column is available, and the data card links to documentation describing each column. So I will need to do quite a bit of transformation and verification before the data is understandable enough for me to relay to a stakeholder.
As you can see, this is an absolutely massive and complex dataset. I have no formal medical training and am not really interested in how various medications are used on a per-person basis, so I will not be manipulating the medication set. I will also use only a few well-known measures from the labs; having a strong background in biology, I understand that lab reports can be multifaceted. High or low readings can mean different things depending on symptomology, and without an EDA or a clear direction they would not be useful and could even be misleading. The majority of the data I use will come from the demographics, examinations, and diet datasets. The questionnaire portion will be useful toward the end, when I have more insight into the sample population tested.
Checking ID's¶
My initial plan was to open the files in Excel and create pivot tables. However, this dataset has so many entries that Excel cannot handle it appropriately, so all analysis will take place within the notebook.
It looks like we have 10,175 participants, but only 9,813 of them were interviewed, given a physical examination, and have been responding to the questionnaire long-term.

There are no duplicate ID records, so every row is a unique respondent.
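The duplicate-ID check can be done in base R: if every SEQN appears exactly once, the number of distinct IDs equals the number of rows. Shown here on a toy vector of IDs rather than the full demographics table.

```r
# Toy SEQN vector standing in for demographics$SEQN.
seqn <- c(73557, 73558, 73559, 73560, 73561)

any(duplicated(seqn))                 # FALSE -> every row is a unique respondent
length(unique(seqn)) == length(seqn)  # TRUE for the same reason
```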
Dealing with NA¶
If I were to explore medication or do a deep dive into diet, the NAs would be very useful for comparing what individual participants are and are not doing. For my prediction, however, NAs do not provide useful information, and they need to be remediated for the model to run at all.
Combine datasets and Transform useful columns¶
There is too much data to go through each individual set and column for correlations and comparisons without first combining the tables with an inner join on the participants' SEQN (ID), then renaming the columns useful for finding out who these surveyed people are. (As stated previously, I will be exploring only 4 of the sets initially.)
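The join-and-rename step can be sketched with dplyr and purrr. This assumes the four tables share the SEQN key; the renamed columns shown are an illustrative subset, not the full renaming done in this notebook.

```r
# Sketch: inner-join the four tables on SEQN, then rename a few columns.
library(tidyverse)

combined <- list(demographics, diet, examinations, labs) %>%
  reduce(inner_join, by = "SEQN")   # keep only participants present in all four

combined <- combined %>%
  rename(id     = SEQN,
         gender = RIAGENDR,
         age    = RIDAGEYR)
```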
The specific kidney data I required was not available in the standard set. I followed the link to the original data and downloaded the correct year from the National Health Survey subset of the CDC site, which returned a file in .XPT format, the typical file type for SAS. I do not have access to SAS, so I will install the 'foreign' package to convert, export, and reimport the data as a .csv.
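The .XPT round-trip can be sketched with the foreign package. The file names here are assumptions for illustration.

```r
# Sketch: read a SAS transport (.XPT) file, export to .csv, reimport.
library(foreign)

kidney <- read.xport("KIQ_U_H.XPT")           # SAS transport file -> data frame
write.csv(kidney, "kidney.csv", row.names = FALSE)
kidney <- read.csv("kidney.csv")              # reimport as a plain .csv
```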
| SEQN | KIQ022 | KIQ025 | KIQ026 | KID028 | KIQ005 | KIQ010 | KIQ042 | KIQ430 | KIQ044 | KIQ450 | KIQ046 | KIQ470 | KIQ050 | KIQ052 | KIQ480 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| 73557 | 2 | NA | 2 | NA | 4 | 2 | 1 | 2 | 2 | NA | 2 | NA | 4 | 4 | 3 |
| 73558 | 2 | NA | 2 | NA | 1 | NA | 2 | NA | 2 | NA | 2 | NA | NA | NA | 2 |
| 73559 | 1 | 2 | 2 | NA | 2 | 3 | 2 | NA | 2 | NA | 2 | NA | NA | NA | 2 |
| 73561 | 1 | 2 | 2 | NA | 1 | NA | 2 | NA | 2 | NA | 2 | NA | NA | NA | 2 |
| 73562 | 2 | NA | 2 | NA | 1 | NA | 2 | NA | 2 | NA | 2 | NA | NA | NA | 0 |
| 73564 | 2 | NA | 2 | NA | 3 | 1 | 1 | 2 | 2 | NA | 2 | NA | 3 | 1 | 2 |
| 73565 | 2 | NA | 2 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 73566 | 2 | NA | 2 | NA | 1 | NA | 2 | NA | 2 | NA | 2 | NA | NA | NA | 1 |
| 73567 | 2 | NA | 2 | NA | 1 | NA | 2 | NA | 2 | NA | 2 | NA | NA | NA | 1 |
| 73568 | 2 | NA | 2 | NA | 1 | NA | 2 | NA | 2 | NA | 2 | NA | NA | NA | 0 |
| 73571 | 2 | NA | 1 | 3 | 1 | NA | 2 | NA | 2 | NA | 2 | NA | NA | NA | 1 |
| 73574 | 2 | NA | 1 | 1 | 1 | NA | 2 | NA | 2 | NA | 2 | NA | NA | NA | 1 |
| 73577 | 2 | NA | 2 | NA | 1 | NA | 2 | NA | 2 | NA | 2 | NA | NA | NA | 0 |
| 73580 | 2 | NA | 2 | NA | 2 | 1 | 1 | 1 | 1 | 1 | 2 | NA | 1 | 1 | 1 |
| 73581 | 2 | NA | 2 | NA | 2 | 1 | 2 | NA | 1 | 1 | 2 | NA | 2 | 2 | 0 |
| 73582 | 2 | NA | 2 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 73585 | 2 | NA | 2 | NA | 1 | NA | 2 | NA | 2 | NA | 2 | NA | NA | NA | 1 |
| 73589 | 2 | NA | 1 | 1 | 1 | NA | 2 | NA | 2 | NA | 2 | NA | NA | NA | 1 |
| 73592 | 2 | NA | 2 | NA | 1 | NA | 2 | NA | 2 | NA | 2 | NA | NA | NA | 1 |
| 73594 | 2 | NA | 2 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 73595 | 2 | NA | 2 | NA | 1 | NA | 2 | NA | 2 | NA | 2 | NA | NA | NA | 0 |
| 73596 | 2 | NA | 2 | NA | 2 | 1 | 2 | NA | 1 | 1 | 2 | NA | 2 | 2 | 1 |
| 73597 | 1 | 1 | 2 | NA | 1 | NA | 2 | NA | 2 | NA | 2 | NA | NA | NA | 0 |
| 73598 | 2 | NA | 2 | NA | 1 | NA | 2 | NA | 2 | NA | 2 | NA | NA | NA | 1 |
| 73600 | 2 | NA | 2 | NA | 1 | NA | 2 | NA | 2 | NA | 2 | NA | NA | NA | 0 |
| 73603 | 2 | NA | 2 | NA | 1 | NA | 1 | 1 | 2 | NA | 1 | 1 | 1 | 1 | 0 |
| 73604 | 2 | NA | 2 | NA | 5 | 1 | 2 | NA | 1 | 1 | 1 | 4 | 3 | 1 | 0 |
| 73607 | 2 | NA | 2 | NA | 4 | 2 | 2 | NA | 1 | 3 | 2 | NA | 1 | 1 | 1 |
| 73610 | 2 | NA | 2 | NA | 1 | NA | 2 | NA | 2 | NA | 2 | NA | NA | NA | 1 |
| 73613 | 2 | NA | 2 | NA | 2 | 1 | 2 | NA | 1 | 1 | 2 | NA | 3 | 1 | 3 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 83678 | 2 | NA | 2 | NA | 2 | 1 | 1 | 2 | 1 | 1 | 2 | NA | 2 | 1 | 1 |
| 83683 | 2 | NA | 2 | NA | 1 | NA | 2 | NA | 1 | 1 | 2 | NA | 5 | 2 | 2 |
| 83684 | 2 | NA | 1 | 1 | 1 | NA | 2 | NA | 1 | 1 | 2 | NA | 2 | 1 | 0 |
| 83687 | 2 | NA | 2 | NA | 3 | 1 | 1 | 1 | 1 | 1 | 1 | 4 | 2 | 1 | 3 |
| 83688 | 2 | NA | 2 | NA | 1 | NA | 2 | NA | 2 | NA | 2 | NA | NA | NA | 1 |
| 83689 | 2 | NA | 2 | NA | 3 | 1 | 1 | 2 | 2 | NA | 2 | NA | 2 | 1 | 1 |
| 83690 | 2 | NA | 2 | NA | 2 | 1 | 1 | 1 | 2 | NA | 2 | NA | 1 | 1 | 0 |
| 83692 | 2 | NA | 2 | NA | 4 | 2 | 2 | NA | 2 | NA | 1 | 4 | 1 | 1 | 5 |
| 83694 | 2 | NA | 2 | NA | 3 | 1 | 1 | 2 | 2 | NA | 2 | NA | 2 | 1 | 1 |
| 83699 | 2 | NA | 2 | NA | 1 | NA | 2 | NA | 2 | NA | 2 | NA | NA | NA | 0 |
| 83700 | 2 | NA | 2 | NA | 4 | 2 | 1 | 3 | 2 | NA | 2 | NA | 3 | 2 | 1 |
| 83701 | 2 | NA | 2 | NA | 2 | 2 | 2 | NA | 1 | 1 | 2 | NA | 5 | 1 | 1 |
| 83702 | 2 | NA | 2 | NA | 4 | 2 | 2 | NA | 1 | 3 | 2 | NA | 3 | 1 | 4 |
| 83703 | 2 | NA | 2 | NA | 1 | NA | 2 | NA | 2 | NA | 2 | NA | NA | NA | 1 |
| 83705 | 2 | NA | 2 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 83708 | 2 | NA | 2 | NA | 1 | NA | 1 | 1 | 2 | NA | 2 | NA | 1 | 1 | 1 |
| 83709 | 2 | NA | 2 | NA | 1 | NA | 2 | NA | 2 | NA | 2 | NA | NA | NA | 0 |
| 83711 | 2 | NA | 2 | NA | 1 | NA | 1 | 4 | 2 | NA | 2 | NA | 1 | 1 | 2 |
| 83712 | 2 | NA | 2 | NA | 2 | 2 | 2 | NA | 1 | 1 | 2 | NA | 2 | 1 | 2 |
| 83713 | 2 | NA | 2 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 83715 | 2 | NA | 2 | NA | 1 | NA | 2 | NA | 1 | 1 | 2 | NA | 1 | 1 | 2 |
| 83717 | 2 | NA | 2 | NA | 5 | 3 | 2 | NA | 1 | 4 | 2 | NA | 4 | 2 | 2 |
| 83718 | 2 | NA | 9 | NA | 3 | 2 | 2 | NA | 1 | 1 | 2 | NA | 2 | 2 | 2 |
| 83720 | 2 | NA | 2 | NA | 1 | NA | 2 | NA | 2 | NA | 2 | NA | NA | NA | 1 |
| 83721 | 2 | NA | 2 | NA | 1 | NA | 2 | NA | 2 | NA | 2 | NA | NA | NA | 1 |
| 83723 | 2 | NA | 2 | NA | 1 | NA | 2 | NA | 2 | NA | 2 | NA | NA | NA | 3 |
| 83724 | 2 | NA | 2 | NA | 1 | NA | 2 | NA | 2 | NA | 2 | NA | NA | NA | 3 |
| 83726 | 2 | NA | 2 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 83727 | 2 | NA | 2 | NA | 1 | NA | 2 | NA | 2 | NA | 2 | NA | NA | NA | 0 |
| 83729 | 2 | NA | 2 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| SEQN | DMDHRGND | DMDEDUC3 | DMDYRSUS | INDFMIN2 | RIDAGEYR | RIDRETH1 | RIDRETH3 | DR1DAY | DR1TCAFF | ⋯ | BPXDI1 | BPXSY1 | LBDSCRSI | URXUCR | URDACT | LBDSALSI | LBXSBU | KID028 | KIQ022 | KIQ026 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | ⋯ | <int> | <int> | <dbl> | <int> | <dbl> | <int> | <int> | <int> | <int> | <int> |
| 73557 | 1 | NA | NA | 4 | 69 | 4 | 4 | 2 | 203 | ⋯ | 72 | 122 | 106.96 | NA | 11.03 | 41 | 10 | NA | 2 | 2 |
| 73558 | 1 | NA | NA | 7 | 54 | 3 | 3 | 1 | 240 | ⋯ | 62 | 156 | 69.84 | NA | 306.00 | 47 | 16 | NA | 2 | 2 |
| 73559 | 1 | NA | NA | 10 | 72 | 3 | 3 | 6 | 45 | ⋯ | 90 | 140 | 107.85 | NA | 10.53 | 37 | 14 | NA | 1 | 2 |
| 73560 | 1 | 3 | NA | 9 | 9 | 3 | 3 | 3 | 0 | ⋯ | 38 | 108 | NA | NA | 21.05 | NA | NA | NA | NA | NA |
| 73561 | 1 | NA | NA | 15 | 73 | 3 | 3 | 1 | 24 | ⋯ | 86 | 136 | 64.53 | NA | 173.47 | 43 | 31 | NA | 1 | 2 |
| 73562 | 1 | NA | NA | 9 | 56 | 1 | 1 | 3 | 144 | ⋯ | 84 | 160 | 78.68 | NA | 166.22 | 43 | 18 | NA | 2 | 2 |
| 73563 | 1 | NA | NA | 15 | 0 | 3 | 3 | 3 | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 73564 | 2 | NA | NA | 10 | 61 | 3 | 3 | 7 | 4 | ⋯ | 80 | 118 | 81.33 | NA | 7.85 | 39 | 17 | NA | 2 | 2 |
| 73565 | 1 | NA | NA | 15 | 42 | 2 | 2 | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | 2 | 2 |
| 73566 | 2 | NA | NA | 4 | 56 | 3 | 3 | 2 | 266 | ⋯ | 74 | 128 | 48.62 | NA | 7.22 | 41 | 9 | NA | 2 | 2 |
| 73567 | 1 | NA | NA | 3 | 65 | 3 | 3 | 7 | 43 | ⋯ | 78 | 140 | 85.75 | NA | 16.28 | 40 | 15 | NA | 2 | 2 |
| 73568 | 2 | NA | NA | 15 | 26 | 3 | 3 | 7 | 199 | ⋯ | 60 | 106 | 65.42 | 31 | 80.65 | 45 | 12 | NA | 2 | 2 |
| 73569 | 2 | NA | NA | 77 | 0 | 5 | 7 | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 73570 | 2 | 2 | NA | 5 | 9 | 5 | 7 | 2 | 47 | ⋯ | 44 | 102 | NA | 116 | 23.45 | NA | NA | NA | NA | NA |
| 73571 | 1 | NA | NA | 14 | 76 | 3 | 3 | 7 | 264 | ⋯ | 68 | 124 | 105.20 | 177 | 14.58 | 43 | 17 | 3 | 2 | 1 |
| 73572 | 2 | 3 | NA | 2 | 10 | 4 | 4 | 4 | 0 | ⋯ | 54 | 88 | NA | NA | 35.42 | NA | NA | NA | NA | NA |
| 73573 | 1 | 4 | NA | 8 | 10 | 4 | 4 | NA | NA | ⋯ | 62 | 94 | NA | NA | 8.29 | NA | NA | NA | NA | NA |
| 73574 | 2 | NA | 4 | 8 | 33 | 5 | 6 | 1 | 872 | ⋯ | 56 | 122 | 52.16 | 173 | 7.51 | 43 | 11 | 1 | 2 | 1 |
| 73575 | 2 | NA | NA | 3 | 1 | 4 | 4 | 7 | 3 | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 73576 | 2 | 9 | NA | 8 | 16 | 4 | 4 | 6 | 0 | ⋯ | 68 | 108 | 78.68 | 166 | 3.01 | 51 | 14 | NA | NA | NA |
| 73577 | 2 | NA | 4 | 2 | 32 | 1 | 1 | 1 | 210 | ⋯ | 74 | 118 | 60.11 | 191 | 8.90 | 45 | 17 | NA | 2 | 2 |
| 73578 | 2 | 15 | NA | 5 | 18 | 1 | 1 | 4 | 0 | ⋯ | 58 | 120 | NA | 201 | 5.27 | NA | NA | NA | NA | NA |
| 73579 | 2 | 6 | NA | 10 | 12 | 3 | 3 | 1 | 0 | ⋯ | 72 | 108 | 50.39 | NA | 6.32 | 47 | 10 | NA | NA | NA |
| 73580 | 2 | NA | NA | 12 | 38 | 4 | 4 | 4 | 36 | ⋯ | 84 | 124 | 63.65 | NA | 2.79 | 38 | 10 | NA | 2 | 2 |
| 73581 | 1 | NA | 4 | 15 | 50 | 5 | 6 | 7 | 24 | ⋯ | 80 | 138 | 83.98 | NA | 4.95 | 43 | 11 | NA | 2 | 2 |
| 73582 | 2 | NA | NA | 3 | 23 | 4 | 4 | NA | NA | ⋯ | 56 | 98 | NA | NA | 7.61 | NA | NA | NA | 2 | 2 |
| 73583 | 1 | 1 | NA | 15 | 7 | 3 | 3 | 7 | 0 | ⋯ | NA | NA | NA | NA | 19.21 | NA | NA | NA | NA | NA |
| 73584 | 2 | 7 | NA | 9 | 13 | 3 | 3 | 1 | 0 | ⋯ | 54 | 108 | 52.16 | 106 | 4.72 | 42 | 14 | NA | NA | NA |
| 73585 | 1 | NA | 6 | 7 | 28 | 5 | 6 | 4 | 96 | ⋯ | 70 | 106 | 106.96 | NA | 2.70 | 47 | 23 | NA | 2 | 2 |
| 73586 | 1 | NA | NA | 6 | 4 | 5 | 6 | 7 | 0 | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋱ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 83702 | 2 | NA | NA | 7 | 80 | 3 | 3 | 3 | 95 | ⋯ | 86 | 154 | 90.17 | NA | 13.33 | 41 | 18 | NA | 2 | 2 |
| 83703 | 2 | NA | NA | 99 | 22 | 1 | 1 | 1 | 0 | ⋯ | 64 | 128 | 64.53 | 89 | 9.21 | 46 | 13 | NA | 2 | 2 |
| 83704 | 2 | 8 | NA | 1 | 15 | 3 | 3 | 4 | 5 | ⋯ | 38 | 108 | 63.65 | NA | 4.85 | 41 | 12 | NA | NA | NA |
| 83705 | 1 | NA | 4 | 4 | 35 | 2 | 2 | 4 | 125 | ⋯ | 64 | 100 | 70.72 | NA | 7.03 | 44 | 16 | NA | 2 | 2 |
| 83706 | 2 | 0 | NA | 6 | 6 | 4 | 4 | NA | NA | ⋯ | NA | NA | NA | NA | 10.00 | NA | NA | NA | NA | NA |
| 83707 | 2 | 13 | 2 | 3 | 18 | 1 | 1 | 3 | 5 | ⋯ | 54 | 106 | 74.26 | NA | 13.83 | 46 | 11 | NA | NA | NA |
| 83708 | 1 | NA | NA | 5 | 64 | 3 | 3 | 5 | 0 | ⋯ | 74 | 94 | 176.80 | NA | 10.55 | 39 | 28 | NA | 2 | 2 |
| 83709 | 2 | NA | NA | 15 | 24 | 3 | 3 | 1 | 177 | ⋯ | 62 | 116 | 86.63 | NA | 7.57 | 46 | 19 | NA | 2 | 2 |
| 83710 | 2 | NA | NA | 1 | 2 | 4 | 4 | 1 | 0 | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 83711 | 1 | NA | NA | 7 | 38 | 3 | 3 | 6 | 0 | ⋯ | 76 | 110 | 68.95 | 62 | 13.87 | 39 | 6 | NA | 2 | 2 |
| 83712 | 1 | NA | NA | 9 | 61 | 3 | 3 | 1 | 160 | ⋯ | 70 | 124 | 91.05 | 85 | 3.41 | 43 | 13 | NA | 2 | 2 |
| 83713 | 2 | NA | 4 | 6 | 34 | 5 | 6 | 5 | 66 | ⋯ | 66 | 118 | 97.24 | 80 | 10.25 | 49 | 12 | NA | 2 | 2 |
| 83714 | 1 | 13 | NA | 6 | 19 | 3 | 3 | 5 | 145 | ⋯ | 64 | 112 | 65.42 | NA | 4.48 | 45 | 9 | NA | NA | NA |
| 83715 | 1 | NA | NA | 3 | 58 | 3 | 3 | 7 | 189 | ⋯ | 74 | 118 | 78.68 | NA | 6.09 | 39 | 12 | NA | 2 | 2 |
| 83716 | 2 | 11 | NA | 15 | 17 | 3 | 3 | 5 | 62 | ⋯ | 60 | 104 | 112.27 | NA | 3.25 | 44 | 12 | NA | NA | NA |
| 83717 | 2 | NA | 6 | 6 | 80 | 1 | 1 | 2 | 0 | ⋯ | NA | NA | 69.84 | 125 | 29.60 | 39 | 20 | NA | 2 | 2 |
| 83718 | 2 | NA | NA | NA | 60 | 4 | 4 | NA | NA | ⋯ | 72 | 114 | 90.17 | NA | 10.00 | 44 | 11 | NA | 2 | 9 |
| 83719 | 1 | NA | NA | 8 | 3 | 1 | 1 | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 83720 | 1 | NA | NA | 8 | 36 | 4 | 4 | NA | NA | ⋯ | 88 | 130 | 114.04 | NA | 6.77 | 48 | 10 | NA | 2 | 2 |
| 83721 | 2 | NA | 3 | 15 | 52 | 3 | 3 | 1 | 560 | ⋯ | 70 | 108 | 98.12 | NA | 6.48 | 44 | 12 | NA | 2 | 2 |
| 83722 | 2 | NA | NA | 7 | 0 | 5 | 7 | 6 | 0 | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 83723 | 1 | NA | NA | 10 | 61 | 1 | 1 | 7 | 239 | ⋯ | NA | NA | 71.60 | 157 | 10.83 | 41 | 17 | NA | 2 | 2 |
| 83724 | 1 | NA | NA | 8 | 80 | 3 | 3 | 3 | 8 | ⋯ | 70 | 164 | 114.04 | NA | 5.98 | 38 | 26 | NA | 2 | 2 |
| 83725 | 1 | 1 | NA | 6 | 7 | 1 | 1 | NA | NA | ⋯ | NA | NA | NA | 88 | 11.36 | NA | NA | NA | NA | NA |
| 83726 | 1 | NA | 5 | 9 | 40 | 1 | 1 | NA | NA | ⋯ | NA | NA | NA | 114 | 4.56 | NA | NA | NA | 2 | 2 |
| 83727 | 2 | NA | NA | 77 | 26 | 2 | 2 | 7 | 37 | ⋯ | 68 | 110 | 97.24 | NA | 4.04 | 49 | 13 | NA | 2 | 2 |
| 83728 | 1 | NA | NA | 8 | 2 | 1 | 1 | 5 | 0 | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 83729 | 2 | NA | 6 | 7 | 42 | 4 | 4 | 6 | 12 | ⋯ | 82 | 136 | 72.49 | NA | 5.13 | 41 | 10 | NA | 2 | 2 |
| 83730 | 2 | 0 | NA | 6 | 7 | 2 | 2 | NA | NA | ⋯ | NA | NA | NA | NA | 5.23 | NA | NA | NA | NA | NA |
| 83731 | 1 | 5 | NA | 15 | 11 | 5 | 6 | 6 | 12 | ⋯ | 68 | 94 | NA | NA | 4.65 | NA | NA | NA | NA | NA |
We can see in the summary of our feature data that much of it is categorical, with some numerical, so let's rename the columns to get a better idea of what we are dealing with. The shorthand codes are documented on the NHANES website.
id gender education time_in_us
Min. :73557 Min. :1.0 Min. : 0.000 Min. : 1.000
1st Qu.:76100 1st Qu.:1.0 1st Qu.: 2.000 1st Qu.: 4.000
Median :78644 Median :1.0 Median : 5.000 Median : 5.000
Mean :78644 Mean :1.5 Mean : 6.162 Mean : 8.838
3rd Qu.:81188 3rd Qu.:2.0 3rd Qu.: 9.000 3rd Qu.: 7.000
Max. :83731 Max. :2.0 Max. :99.000 Max. :99.000
NA's :7372 NA's :8267
householdincome age race race2 water
Min. : 1.00 Min. : 0.00 Min. :1.000 Min. :1.00 Min. :1.000
1st Qu.: 5.00 1st Qu.:10.00 1st Qu.:2.000 1st Qu.:2.00 1st Qu.:3.000
Median : 7.00 Median :26.00 Median :3.000 Median :3.00 Median :5.000
Mean :10.51 Mean :31.48 Mean :3.092 Mean :3.29 Mean :4.501
3rd Qu.:14.00 3rd Qu.:52.00 3rd Qu.:4.000 3rd Qu.:4.00 3rd Qu.:6.000
Max. :99.00 Max. :80.00 Max. :5.000 Max. :7.00 Max. :7.000
NA's :123 NA's :1392
caffiene niacin protein diabp
Min. : 0.00 Min. : 0.215 Min. : 0.00 Min. : 0.00
1st Qu.: 0.00 1st Qu.: 13.583 1st Qu.: 45.78 1st Qu.: 58.00
Median : 25.00 Median : 20.196 Median : 66.05 Median : 66.00
Mean : 93.34 Mean : 23.509 Mean : 74.54 Mean : 65.77
3rd Qu.: 130.00 3rd Qu.: 29.152 3rd Qu.: 93.86 3rd Qu.: 76.00
Max. :2448.00 Max. :379.852 Max. :869.49 Max. :122.00
NA's :1644 NA's :1644 NA's :1644 NA's :3003
sysbp creatinine urinecreatinine uacr
Min. : 66.0 Min. : 25.64 Min. : 8.0 Min. : 0.21
1st Qu.:106.0 1st Qu.: 61.00 1st Qu.: 65.0 1st Qu.: 5.02
Median :116.0 Median : 72.49 Median :112.0 Median : 7.78
Mean :118.1 Mean : 77.81 Mean :127.6 Mean : 41.91
3rd Qu.:128.0 3rd Qu.: 86.63 3rd Qu.:171.0 3rd Qu.: 15.29
Max. :228.0 Max. :1539.04 Max. :659.0 Max. :9000.00
NA's :3003 NA's :3622 NA's :7485 NA's :2123
albumin bloodnitro stones failingkidney
Min. :24.00 Min. : 1.00 Min. : 0.00 Min. :1.000
1st Qu.:41.00 1st Qu.: 9.00 1st Qu.: 1.00 1st Qu.:2.000
Median :43.00 Median :12.00 Median : 1.00 Median :2.000
Mean :42.82 Mean :12.86 Mean : 29.84 Mean :1.977
3rd Qu.:45.00 3rd Qu.:15.00 3rd Qu.: 2.00 3rd Qu.:2.000
Max. :56.00 Max. :95.00 Max. :999.00 Max. :9.000
NA's :3622 NA's :3622 NA's :9640 NA's :4406
stonesboolean
Min. :1.000
1st Qu.:2.000
Median :2.000
Mean :1.921
3rd Qu.:2.000
Max. :9.000
NA's :4406
Feature Selection¶
Now that we have a dataframe set with our features used to predict the kidney failure rate we should look at all of the NA values within the columns to see how to handle them on a feature by feature basis.
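The per-column NA count can be computed with base R's colSums over is.na(). Shown here on a toy data frame; the notebook applies the same idea to the full feature set.

```r
# Toy frame standing in for the feature data frame.
toy <- data.frame(id    = 1:4,
                  water = c(3, NA, 5, NA),
                  age   = c(69, 54, NA, 9))

colSums(is.na(toy))   # NA count per column: id 0, water 2, age 1
```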
| . | |
|---|---|
| <dbl> | |
| id | 0 |
| gender | 0 |
| education | 7372 |
| time_in_us | 8267 |
| householdincome | 123 |
| age | 0 |
| race | 0 |
| race2 | 0 |
| water | 1392 |
| caffiene | 1644 |
| niacin | 1644 |
| protein | 1644 |
| diabp | 3003 |
| sysbp | 3003 |
| creatinine | 3622 |
| urinecreatinine | 7485 |
| uacr | 2123 |
| albumin | 3622 |
| bloodnitro | 3622 |
| stones | 9640 |
| failingkidney | 4406 |
| stonesboolean | 4406 |
- 10175
- 22
So it looks like there is a lot of missing data for education level, time spent in the US, and how frequently people experience kidney stones. I believe this is because, respectively: people may not want to disclose educational information (not having a high school degree, or only a GED); the question is not applicable to those born in the US; and people have not experienced kidney stones frequently enough to report. Since we still have a boolean value for having experienced kidney stones previously, I will drop the extra stones feature and fill education and time in the US with other methods.
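The column drop can be sketched with dplyr; `features` is an assumed name for the renamed feature frame from earlier in the notebook.

```r
# Sketch: drop the sparse stones-frequency column, keeping the boolean.
library(tidyverse)

features <- features %>%
  select(-stones)
```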
| id | gender | time_in_us | householdincome | age | race | race2 | water | caffiene | niacin | protein | diabp | sysbp | creatinine | urinecreatinine | uacr | albumin | bloodnitro | failingkidney | stonesboolean | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <dbl> | <dbl> | <int> | <int> | <dbl> | <int> | <dbl> | <int> | <int> | <int> | <int> | |
| 1 | 73557 | 1 | NA | 4 | 69 | 4 | 4 | 2 | 203 | 11.804 | 43.63 | 72 | 122 | 106.96 | NA | 11.03 | 41 | 10 | 2 | 2 |
| 2 | 73558 | 1 | NA | 7 | 54 | 3 | 3 | 1 | 240 | 65.396 | 338.13 | 62 | 156 | 69.84 | NA | 306.00 | 47 | 16 | 2 | 2 |
| 3 | 73559 | 1 | NA | 10 | 72 | 3 | 3 | 6 | 45 | 18.342 | 64.61 | 90 | 140 | 107.85 | NA | 10.53 | 37 | 14 | 1 | 2 |
| 4 | 73560 | 1 | NA | 9 | 9 | 3 | 3 | 3 | 0 | 21.903 | 77.75 | 38 | 108 | NA | NA | 21.05 | NA | NA | NA | NA |
| 5 | 73561 | 1 | NA | 15 | 73 | 3 | 3 | 1 | 24 | 15.857 | 55.24 | 86 | 136 | 64.53 | NA | 173.47 | 43 | 31 | 1 | 2 |
| 6 | 73562 | 1 | NA | 9 | 56 | 1 | 1 | 3 | 144 | 17.119 | 55.11 | 84 | 160 | 78.68 | NA | 166.22 | 43 | 18 | 2 | 2 |
| 7 | 73563 | 1 | NA | 15 | 0 | 3 | 3 | 3 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 8 | 73564 | 2 | NA | 10 | 61 | 3 | 3 | 7 | 4 | 29.342 | 91.15 | 80 | 118 | 81.33 | NA | 7.85 | 39 | 17 | 2 | 2 |
| 9 | 73565 | 1 | NA | 15 | 42 | 2 | 2 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2 | 2 |
| 10 | 73566 | 2 | NA | 4 | 56 | 3 | 3 | 2 | 266 | 13.148 | 42.26 | 74 | 128 | 48.62 | NA | 7.22 | 41 | 9 | 2 | 2 |
| 11 | 73567 | 1 | NA | 3 | 65 | 3 | 3 | 7 | 43 | 19.301 | 38.09 | 78 | 140 | 85.75 | NA | 16.28 | 40 | 15 | 2 | 2 |
| 12 | 73568 | 2 | NA | 15 | 26 | 3 | 3 | 7 | 199 | 23.003 | 139.21 | 60 | 106 | 65.42 | 31 | 80.65 | 45 | 12 | 2 | 2 |
| 13 | 73569 | 2 | NA | 77 | 0 | 5 | 7 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 14 | 73570 | 2 | NA | 5 | 9 | 5 | 7 | 2 | 47 | 18.372 | 76.40 | 44 | 102 | NA | 116 | 23.45 | NA | NA | NA | NA |
| 15 | 73571 | 1 | NA | 14 | 76 | 3 | 3 | 7 | 264 | 19.075 | 39.40 | 68 | 124 | 105.20 | 177 | 14.58 | 43 | 17 | 2 | 1 |
| 16 | 73572 | 2 | NA | 2 | 10 | 4 | 4 | 4 | 0 | 9.963 | 30.65 | 54 | 88 | NA | NA | 35.42 | NA | NA | NA | NA |
| 17 | 73573 | 1 | NA | 8 | 10 | 4 | 4 | NA | NA | NA | NA | 62 | 94 | NA | NA | 8.29 | NA | NA | NA | NA |
| 18 | 73574 | 2 | 4 | 8 | 33 | 5 | 6 | 1 | 872 | 81.974 | 274.72 | 56 | 122 | 52.16 | 173 | 7.51 | 43 | 11 | 2 | 1 |
| 19 | 73575 | 2 | NA | 3 | 1 | 4 | 4 | 7 | 3 | 6.656 | 21.60 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 20 | 73576 | 2 | NA | 8 | 16 | 4 | 4 | 6 | 0 | 14.930 | 48.91 | 68 | 108 | 78.68 | 166 | 3.01 | 51 | 14 | NA | NA |
| 21 | 73577 | 2 | 4 | 2 | 32 | 1 | 1 | 1 | 210 | 76.601 | 144.92 | 74 | 118 | 60.11 | 191 | 8.90 | 45 | 17 | 2 | 2 |
| 22 | 73578 | 2 | NA | 5 | 18 | 1 | 1 | 4 | 0 | 21.266 | 81.61 | 58 | 120 | NA | 201 | 5.27 | NA | NA | NA | NA |
| 23 | 73579 | 2 | NA | 10 | 12 | 3 | 3 | 1 | 0 | 20.340 | 81.54 | 72 | 108 | 50.39 | NA | 6.32 | 47 | 10 | NA | NA |
| 24 | 73580 | 2 | NA | 12 | 38 | 4 | 4 | 4 | 36 | 21.680 | 87.39 | 84 | 124 | 63.65 | NA | 2.79 | 38 | 10 | 2 | 2 |
| 25 | 73581 | 1 | 4 | 15 | 50 | 5 | 6 | 7 | 24 | 24.026 | 96.42 | 80 | 138 | 83.98 | NA | 4.95 | 43 | 11 | 2 | 2 |
| 26 | 73582 | 2 | NA | 3 | 23 | 4 | 4 | NA | NA | NA | NA | 56 | 98 | NA | NA | 7.61 | NA | NA | 2 | 2 |
| 27 | 73583 | 1 | NA | 15 | 7 | 3 | 3 | 7 | 0 | 9.927 | 25.81 | NA | NA | NA | NA | 19.21 | NA | NA | NA | NA |
| 28 | 73584 | 2 | NA | 9 | 13 | 3 | 3 | 1 | 0 | 7.823 | 13.11 | 54 | 108 | 52.16 | 106 | 4.72 | 42 | 14 | NA | NA |
| 29 | 73585 | 1 | 6 | 7 | 28 | 5 | 6 | 4 | 96 | 70.313 | 285.83 | 70 | 106 | 106.96 | NA | 2.70 | 47 | 23 | 2 | 2 |
| 30 | 73586 | 1 | NA | 6 | 4 | 5 | 6 | 7 | 0 | 7.143 | 24.92 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 10146 | 83702 | 2 | NA | 7 | 80 | 3 | 3 | 3 | 95 | 11.648 | 56.02 | 86 | 154 | 90.17 | NA | 13.33 | 41 | 18 | 2 | 2 |
| 10147 | 83703 | 2 | NA | 99 | 22 | 1 | 1 | 1 | 0 | 16.870 | 58.97 | 64 | 128 | 64.53 | 89 | 9.21 | 46 | 13 | 2 | 2 |
| 10148 | 83704 | 2 | NA | 1 | 15 | 3 | 3 | 4 | 5 | 15.008 | 81.36 | 38 | 108 | 63.65 | NA | 4.85 | 41 | 12 | NA | NA |
| 10149 | 83705 | 1 | 4 | 4 | 35 | 2 | 2 | 4 | 125 | 28.719 | 63.40 | 64 | 100 | 70.72 | NA | 7.03 | 44 | 16 | 2 | 2 |
| 10150 | 83706 | 2 | NA | 6 | 6 | 4 | 4 | NA | NA | NA | NA | NA | NA | NA | NA | 10.00 | NA | NA | NA | NA |
| 10151 | 83707 | 2 | 2 | 3 | 18 | 1 | 1 | 3 | 5 | 11.780 | 34.18 | 54 | 106 | 74.26 | NA | 13.83 | 46 | 11 | NA | NA |
| 10152 | 83708 | 1 | NA | 5 | 64 | 3 | 3 | 5 | 0 | 38.044 | 92.38 | 74 | 94 | 176.80 | NA | 10.55 | 39 | 28 | 2 | 2 |
| 10153 | 83709 | 2 | NA | 15 | 24 | 3 | 3 | 1 | 177 | 15.413 | 47.70 | 62 | 116 | 86.63 | NA | 7.57 | 46 | 19 | 2 | 2 |
| 10154 | 83710 | 2 | NA | 1 | 2 | 4 | 4 | 1 | 0 | 8.022 | 11.90 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 10155 | 83711 | 1 | NA | 7 | 38 | 3 | 3 | 6 | 0 | 12.883 | 28.43 | 76 | 110 | 68.95 | 62 | 13.87 | 39 | 6 | 2 | 2 |
| 10156 | 83712 | 1 | NA | 9 | 61 | 3 | 3 | 1 | 160 | 16.420 | 66.13 | 70 | 124 | 91.05 | 85 | 3.41 | 43 | 13 | 2 | 2 |
| 10157 | 83713 | 2 | 4 | 6 | 34 | 5 | 6 | 5 | 66 | 26.604 | 119.63 | 66 | 118 | 97.24 | 80 | 10.25 | 49 | 12 | 2 | 2 |
| 10158 | 83714 | 1 | NA | 6 | 19 | 3 | 3 | 5 | 145 | 15.292 | 65.31 | 64 | 112 | 65.42 | NA | 4.48 | 45 | 9 | NA | NA |
| 10159 | 83715 | 1 | NA | 3 | 58 | 3 | 3 | 7 | 189 | 22.995 | 85.38 | 74 | 118 | 78.68 | NA | 6.09 | 39 | 12 | 2 | 2 |
| 10160 | 83716 | 2 | NA | 15 | 17 | 3 | 3 | 5 | 62 | 21.254 | 181.74 | 60 | 104 | 112.27 | NA | 3.25 | 44 | 12 | NA | NA |
| 10161 | 83717 | 2 | 6 | 6 | 80 | 1 | 1 | 2 | 0 | 11.092 | 30.03 | NA | NA | 69.84 | 125 | 29.60 | 39 | 20 | 2 | 2 |
| 10162 | 83718 | 2 | NA | NA | 60 | 4 | 4 | NA | NA | NA | NA | 72 | 114 | 90.17 | NA | 10.00 | 44 | 11 | 2 | 9 |
| 10163 | 83719 | 1 | NA | 8 | 3 | 1 | 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 10164 | 83720 | 1 | NA | 8 | 36 | 4 | 4 | NA | NA | NA | NA | 88 | 130 | 114.04 | NA | 6.77 | 48 | 10 | 2 | 2 |
| 10165 | 83721 | 2 | 3 | 15 | 52 | 3 | 3 | 1 | 560 | 35.698 | 35.05 | 70 | 108 | 98.12 | NA | 6.48 | 44 | 12 | 2 | 2 |
| 10166 | 83722 | 2 | NA | 7 | 0 | 5 | 7 | 6 | 0 | 8.043 | 17.36 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 10167 | 83723 | 1 | NA | 10 | 61 | 1 | 1 | 7 | 239 | 19.643 | 70.52 | NA | NA | 71.60 | 157 | 10.83 | 41 | 17 | 2 | 2 |
| 10168 | 83724 | 1 | NA | 8 | 80 | 3 | 3 | 3 | 8 | 48.232 | 77.09 | 70 | 164 | 114.04 | NA | 5.98 | 38 | 26 | 2 | 2 |
| 10169 | 83725 | 1 | NA | 6 | 7 | 1 | 1 | NA | NA | NA | NA | NA | NA | NA | 88 | 11.36 | NA | NA | NA | NA |
| 10170 | 83726 | 1 | 5 | 9 | 40 | 1 | 1 | NA | NA | NA | NA | NA | NA | NA | 114 | 4.56 | NA | NA | 2 | 2 |
| 10171 | 83727 | 2 | NA | 77 | 26 | 2 | 2 | 7 | 37 | 68.311 | 223.32 | 68 | 110 | 97.24 | NA | 4.04 | 49 | 13 | 2 | 2 |
| 10172 | 83728 | 1 | NA | 8 | 2 | 1 | 1 | 5 | 0 | 11.309 | 47.55 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 10173 | 83729 | 2 | 6 | 7 | 42 | 4 | 4 | 6 | 12 | 31.590 | 89.37 | 82 | 136 | 72.49 | NA | 5.13 | 41 | 10 | 2 | 2 |
| 10174 | 83730 | 2 | NA | 6 | 7 | 2 | 2 | NA | NA | NA | NA | NA | NA | NA | NA | 5.23 | NA | NA | NA | NA |
| 10175 | 83731 | 1 | NA | 15 | 11 | 5 | 6 | 6 | 12 | 19.119 | 96.06 | 68 | 94 | NA | NA | 4.65 | NA | NA | NA | NA |
Imputation of NA's¶
Since we can see there are quite a few missing values within this dataframe, it is important to fill these values with various techniques to complete the set and allow the algorithm to function effectively.
- Simple Imputation: I start with a constant fill for variables where I would not expect much variance. For example, since the majority of people have never experienced kidney stones (resulting in many 0's) and household income was frequently unreported, this fill method suffices.
- Median Imputation: I use the `median()` function to fill variables that should follow a roughly normal distribution; given the sample size of 9,800 participants, this is a statistically sound approach. I also applied the median to caffeine, niacin, and protein, as these vary widely, and a "middling" value is the safest fill.
- Mode Imputation: I wrote a function to calculate the mode to impute Race and Water Consumption. As this is a U.S. survey, the majority of participants are Non-Hispanic White. While water consumption could use mean imputation, the mode is more appropriate here because daily consumption is recorded as discrete options, so the most common answer is the best stand-in.
- Predictive Mean Matching (PMM): Finally, I use PMM, drawing on the already-imputed values, to estimate the "hard-hitting" clinical features.
Industry Tip: For solid prediction models, PMM or KNN should be used if you cannot drop NA values. Standard mean imputation is often considered a "lazy" way just to get a model moving. I have used a combination of methods to demonstrate versatility across different data types.
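The first three fills can be sketched in a few lines of R. This is a hypothetical reconstruction, not the exact code used here: the frame name `kidney`, the helper `Mode()`, and the specific fill value for the stones column are all assumptions.

```r
# Hypothetical sketch of the simple/median/mode fills.
Mode <- function(x) {
  ux <- unique(x[!is.na(x)])
  ux[which.max(tabulate(match(x, ux)))]  # most frequent non-NA value
}

# Simple fill: assign the majority ("no stones") category to missing rows.
kidney$stonesboolean[is.na(kidney$stonesboolean)] <- 2

# Median fill for the widely varying intake variables.
for (col in c("caffiene", "niacin", "protein")) {
  kidney[[col]][is.na(kidney[[col]])] <- median(kidney[[col]], na.rm = TRUE)
}

# Mode fill for the discrete survey answers.
for (col in c("race", "water")) {
  kidney[[col]][is.na(kidney[[col]])] <- Mode(kidney[[col]])
}
```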
Understanding Data Missingness¶
In professional data science, the reason data is missing dictates the method of repair. My approach accounts for the three primary types:
- MCAR (Missing Completely at Random): Handled via Median/Mode.
- MAR (Missing at Random): Addressed through the multi-variable relationships in PMM.
- MNAR (Missing Not at Random): Acknowledged in medical contexts where a test wasn't performed because the patient didn't meet specific health criteria.
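The PMM step can be sketched with the `mice` package. The argument values below (5 imputed sets, 10 iterations) are inferred from the printed log that follows; the exact call and frame name are assumptions.

```r
# Sketch of the PMM run with mice; m and maxit are inferred from the log.
library(mice)
imp <- mice(kidney, method = "pmm", m = 5, maxit = 10, seed = 1)
```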
iter imp variable
1    1   creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
1    2   creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
⋮    ⋮   (the same variable set is cycled through all 5 imputations on each of the 10 iterations)
10   5   creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  25.64   61.00   72.49   77.81   86.63 1539.04    3622
| 1 | 2 | 3 | 4 | 5 | |
|---|---|---|---|---|---|
| <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
| 4 | 88.40 | 76.02 | 83.98 | 61.88 | 56.58 |
| 7 | 47.74 | 65.42 | 67.18 | 38.90 | 55.69 |
| 9 | 65.42 | 114.92 | 89.28 | 75.14 | 62.76 |
| 13 | 64.53 | 68.07 | 68.95 | 68.95 | 37.13 |
| 14 | 54.81 | 55.69 | 54.81 | 91.94 | 51.27 |
| 16 | 93.70 | 58.34 | 79.56 | 77.79 | 92.82 |
| 17 | 70.72 | 103.43 | 68.07 | 86.63 | 108.73 |
| 19 | 34.48 | 66.30 | 74.26 | 80.44 | 53.92 |
| 22 | 46.85 | 69.84 | 70.72 | 74.26 | 40.66 |
| 26 | 92.82 | 77.79 | 45.97 | 66.30 | 61.88 |
| 27 | 64.53 | 67.18 | 90.17 | 61.00 | 60.11 |
| 30 | 61.00 | 38.01 | 74.26 | 63.65 | 44.20 |
| 32 | 82.21 | 83.10 | 72.49 | 83.10 | 62.76 |
| 34 | 52.16 | 40.66 | 58.34 | 64.53 | 58.34 |
| 35 | 80.44 | 82.21 | 43.32 | 61.88 | 106.08 |
| 37 | 66.30 | 68.95 | 68.07 | 83.98 | 71.60 |
| 41 | 282.88 | 126.41 | 127.30 | 65.42 | 100.78 |
| 46 | 71.60 | 78.68 | 71.60 | 68.95 | 59.23 |
| 50 | 106.96 | 76.02 | 54.81 | 48.62 | 72.49 |
| 52 | 75.14 | 64.53 | 94.59 | 83.10 | 71.60 |
| 53 | 68.95 | 64.53 | 56.58 | 57.46 | 55.69 |
| 55 | 88.40 | 69.84 | 53.92 | 62.76 | 57.46 |
| 56 | 74.26 | 83.10 | 76.91 | 70.72 | 50.39 |
| 61 | 64.53 | 68.07 | 57.46 | 82.21 | 38.90 |
| 64 | 80.44 | 61.88 | 54.81 | 57.46 | 76.02 |
| 69 | 34.48 | 53.92 | 38.90 | 54.81 | 64.53 |
| 70 | 86.63 | 67.18 | 68.07 | 72.49 | 102.54 |
| 71 | 56.58 | 63.65 | 100.78 | 58.34 | 75.14 |
| 74 | 52.16 | 58.34 | 98.12 | 89.28 | 74.26 |
| 78 | 114.92 | 90.17 | 70.72 | 75.14 | 81.33 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 10079 | 119.34 | 100.78 | 78.68 | 44.20 | 62.76 |
| 10085 | 68.95 | 87.52 | 114.92 | 76.02 | 57.46 |
| 10088 | 61.88 | 89.28 | 63.65 | 67.18 | 60.11 |
| 10089 | 83.10 | 38.90 | 45.97 | 66.30 | 57.46 |
| 10093 | 102.54 | 61.88 | 51.27 | 60.11 | 51.27 |
| 10094 | 282.88 | 139.67 | 556.92 | 1103.23 | 146.74 |
| 10095 | 45.97 | 60.11 | 61.00 | 93.70 | 76.91 |
| 10096 | 88.40 | 55.69 | 68.95 | 83.10 | 79.56 |
| 10097 | 89.28 | 79.56 | 68.07 | 83.98 | 99.01 |
| 10098 | 95.47 | 90.17 | 56.58 | 59.23 | 68.95 |
| 10100 | 87.52 | 70.72 | 69.84 | 44.20 | 99.89 |
| 10101 | 56.58 | 87.52 | 76.02 | 38.01 | 102.54 |
| 10105 | 90.17 | 66.30 | 70.72 | 73.37 | 100.78 |
| 10114 | 61.00 | 45.97 | 70.72 | 48.62 | 57.46 |
| 10125 | 39.78 | 67.18 | 79.56 | 71.60 | 59.23 |
| 10129 | 54.81 | 60.11 | 62.76 | 53.92 | 40.66 |
| 10130 | 87.52 | 54.81 | 91.05 | 79.56 | 91.05 |
| 10134 | 65.42 | 56.58 | 85.75 | 63.65 | 35.36 |
| 10139 | 36.24 | 67.18 | 60.11 | 71.60 | 86.63 |
| 10141 | 61.00 | 64.53 | 72.49 | 57.46 | 61.00 |
| 10142 | 70.72 | 53.04 | 64.53 | 57.46 | 75.14 |
| 10150 | 45.97 | 65.42 | 40.66 | 69.84 | 55.69 |
| 10154 | 63.65 | 68.95 | 61.88 | 85.75 | 65.42 |
| 10163 | 89.28 | 74.26 | 57.46 | 46.85 | 78.68 |
| 10166 | 51.27 | 83.10 | 58.34 | 42.43 | 40.66 |
| 10169 | 76.02 | 59.23 | 61.00 | 71.60 | 98.12 |
| 10170 | 73.37 | 72.49 | 59.23 | 52.16 | 45.97 |
| 10172 | 59.23 | 74.26 | 45.97 | 56.58 | 55.69 |
| 10174 | 82.21 | 45.08 | 50.39 | 83.98 | 68.07 |
| 10175 | 72.49 | 76.02 | 81.33 | 49.50 | 39.78 |
Cleaned data¶
From this output, the dataset I am going with is set 1. Its values hover around the mean and median of the initial set, and it retains healthy variance without swinging to the extreme low and high values that shift the mean in set 3. I think it will train and generalize well for our purposes.
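Extracting one completed dataset from a `mice` object is a single call; the object name `imp` is an assumption here.

```r
# Pull imputed set 1 out of the mice object as a fully filled data frame.
kidney_complete <- complete(imp, 1)
```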
id gender time_in_us householdincome age
Min. :73557 Min. :1.0 Min. :0.0000 Min. : 1.00 Min. : 0.00
1st Qu.:76100 1st Qu.:1.0 1st Qu.:0.0000 1st Qu.: 5.00 1st Qu.:10.00
Median :78644 Median :1.0 Median :0.0000 Median : 7.00 Median :26.00
Mean :78644 Mean :1.5 Mean :0.1875 Mean :10.48 Mean :31.48
3rd Qu.:81188 3rd Qu.:2.0 3rd Qu.:0.0000 3rd Qu.:14.00 3rd Qu.:52.00
Max. :83731 Max. :2.0 Max. :1.0000 Max. :99.00 Max. :80.00
race race2 water caffiene
Min. :1.000 Min. :1.00 Min. :1.000 Min. : 0.0
1st Qu.:2.000 1st Qu.:2.00 1st Qu.:3.000 1st Qu.: 2.0
Median :3.000 Median :3.00 Median :6.000 Median : 25.0
Mean :3.092 Mean :3.29 Mean :4.706 Mean : 82.3
3rd Qu.:4.000 3rd Qu.:4.00 3rd Qu.:6.000 3rd Qu.: 102.0
Max. :5.000 Max. :7.00 Max. :7.000 Max. :2448.0
niacin protein diabp sysbp
Min. : 0.215 Min. : 0.00 Min. : 0.00 Min. : 66.0
1st Qu.: 14.829 1st Qu.: 49.63 1st Qu.: 62.00 1st Qu.:110.0
Median : 20.196 Median : 66.05 Median : 66.00 Median :116.0
Mean : 22.974 Mean : 73.17 Mean : 65.84 Mean :117.5
3rd Qu.: 26.979 3rd Qu.: 87.37 3rd Qu.: 72.00 3rd Qu.:122.0
Max. :379.852 Max. :869.49 Max. :122.00 Max. :228.0
creatinine urinecreatinine uacr albumin
Min. : 25.64 Min. : 8.0 Min. : 0.21 Min. :24.00
1st Qu.: 59.23 1st Qu.: 68.0 1st Qu.: 5.00 1st Qu.:42.00
Median : 70.72 Median :117.0 Median : 7.78 Median :43.00
Mean : 74.75 Mean :131.9 Mean : 42.15 Mean :42.88
3rd Qu.: 83.98 3rd Qu.:177.0 3rd Qu.: 15.17 3rd Qu.:44.00
Max. :1539.04 Max. :659.0 Max. :9000.00 Max. :56.00
bloodnitro failingkidney stonesboolean
Min. : 1.00 Min. :1.000 Min. :1.000
1st Qu.: 8.00 1st Qu.:2.000 1st Qu.:2.000
Median :11.00 Median :2.000 Median :2.000
Mean :11.77 Mean :1.974 Mean :1.921
3rd Qu.:14.00 3rd Qu.:2.000 3rd Qu.:2.000
Max. :95.00 Max. :2.000 Max. :9.000
| id | gender | time_in_us | householdincome | age | race | race2 | water | caffiene | niacin | protein | diabp | sysbp | creatinine | urinecreatinine | uacr | albumin | bloodnitro | failingkidney | stonesboolean | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <int> | <int> | <dbl> | <int> | <int> | <int> | <int> | <int> | <int> | <dbl> | <dbl> | <int> | <int> | <dbl> | <int> | <dbl> | <int> | <int> | <int> | <int> | |
| 1 | 73557 | 1 | 0 | 4 | 69 | 4 | 4 | 2 | 203 | 11.804 | 43.63 | 72 | 122 | 106.96 | 77 | 11.03 | 41 | 10 | 2 | 2 |
| 2 | 73558 | 1 | 0 | 7 | 54 | 3 | 3 | 1 | 240 | 65.396 | 338.13 | 62 | 156 | 69.84 | 93 | 306.00 | 47 | 16 | 2 | 2 |
| 3 | 73559 | 1 | 0 | 10 | 72 | 3 | 3 | 6 | 45 | 18.342 | 64.61 | 90 | 140 | 107.85 | 59 | 10.53 | 37 | 14 | 1 | 2 |
| 4 | 73560 | 1 | 0 | 9 | 9 | 3 | 3 | 3 | 0 | 21.903 | 77.75 | 38 | 108 | 88.40 | 247 | 21.05 | 43 | 9 | 2 | 2 |
| 5 | 73561 | 1 | 0 | 15 | 73 | 3 | 3 | 1 | 24 | 15.857 | 55.24 | 86 | 136 | 64.53 | 58 | 173.47 | 43 | 31 | 1 | 2 |
| 6 | 73562 | 1 | 0 | 9 | 56 | 1 | 1 | 3 | 144 | 17.119 | 55.11 | 84 | 160 | 78.68 | 216 | 166.22 | 43 | 18 | 2 | 2 |
Feature Selection¶
Now that we have our completed data, I ran a summary to check that it resembles the original dataset. The set has similar values and roughly a 2.6% incidence of kidney failure, matching the original data; however, the minimums of blood nitrogen and urine creatinine came out as large negative values. We need to switch imputation styles so that only positive values come through.
After going back, setting the seed, and switching from the normal-distribution method to PMM, I retrieved sensible values with a 2.5% incidence of kidney failure. That is a bit lower than the roughly 3% reported in the original data, but it is still workable for prediction: the general-population rate is estimated at 15%, and this is a sample of generally healthy individuals willing to respond to a survey.
With the completed data in hand, let's check the remaining features for candidates to drop. With 19 features currently, it is best to start with a correlation coefficient matrix to check for high correlations, follow with a variance check, and then do a final redundancy pass to get the set closer to 10-15 strong predictors.
Correlation Matrix¶
First, we dropped stones for the number of nulls present, and then education for its lack of variance and excessive NA's as well. Let's plot the correlation matrix to see whether any features are too strongly correlated with one another (usually above 0.75).
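A minimal sketch of this check in caret, assuming the completed frame is named `kidney_complete` and all remaining columns are numeric:

```r
# Pairwise correlations, then caret's findCorrelation to flag any feature
# whose correlation with another exceeds the 0.75 cutoff.
library(caret)
cor_mat <- cor(kidney_complete)
high    <- findCorrelation(cor_mat, cutoff = 0.75, names = TRUE)
print(cor_mat)
print(high)
```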
gender time_in_us householdincome age
gender 1.000000000 -0.070957298 -0.046330235 -0.082621030
time_in_us -0.070957298 1.000000000 0.070156206 0.250952173
householdincome -0.046330235 0.070156206 1.000000000 0.016968739
age -0.082621030 0.250952173 0.016968739 1.000000000
race 0.037428882 0.006511826 -0.024476058 0.033181697
race2 0.018964993 0.069458756 -0.009630431 0.003366239
water 0.006417464 0.044031784 0.025873527 -0.090197910
caffiene -0.048288863 0.023033037 -0.010892899 0.333588149
niacin -0.064593314 0.038004931 0.020672666 0.104217369
protein -0.054792397 0.086996195 0.030088611 0.133614525
diabp -0.051381249 0.102191198 0.012517244 0.246988021
sysbp -0.041467206 0.079805562 -0.020232133 0.469085841
creatinine -0.072448777 -0.028408331 -0.010199230 0.245916420
urinecreatinine 0.001079895 -0.086689244 -0.034533930 -0.094212102
uacr -0.013128156 0.016765898 -0.007049332 0.088496324
albumin -0.058780716 0.021094993 0.042399173 -0.215474694
bloodnitro -0.066532038 0.086720558 0.035360495 0.470218255
failingkidney 0.003707065 0.003385810 0.006910452 -0.094526714
stonesboolean 0.010721949 0.024160363 0.013036832 -0.018995284
race race2 water caffiene
gender 0.037428882 0.018964993 0.006417464 -0.048288863
time_in_us 0.006511826 0.069458756 0.044031784 0.023033037
householdincome -0.024476058 -0.009630431 0.025873527 -0.010892899
age 0.033181697 0.003366239 -0.090197910 0.333588149
race 1.000000000 0.968422645 0.018338264 -0.029153653
race2 0.968422645 1.000000000 0.029382992 -0.040223845
water 0.018338264 0.029382992 1.000000000 -0.069299776
caffiene -0.029153653 -0.040223845 -0.069299776 1.000000000
niacin -0.010211471 -0.014517053 -0.017193084 0.241651730
protein -0.020089961 -0.019974394 -0.021333536 0.163381075
diabp 0.024428765 0.020131955 -0.054758404 0.146285281
sysbp 0.027894218 0.008789197 -0.091343046 0.138699937
creatinine 0.068712539 0.038795467 -0.044335298 0.076117292
urinecreatinine 0.099162535 0.063330033 -0.014658323 -0.066392552
uacr -0.004316759 -0.006724843 -0.017678085 0.005019089
albumin -0.038297602 -0.012912400 -0.020615979 -0.017004475
bloodnitro -0.007119886 -0.017275327 -0.066284352 0.121141917
failingkidney 0.002493181 0.001898238 0.025811126 -0.011766013
stonesboolean 0.006597954 0.006142984 -0.003506018 -0.023592773
niacin protein diabp sysbp
gender -0.0645933137 -0.054792397 -0.051381249 -0.041467206
time_in_us 0.0380049310 0.086996195 0.102191198 0.079805562
householdincome 0.0206726655 0.030088611 0.012517244 -0.020232133
age 0.1042173689 0.133614525 0.246988021 0.469085841
race -0.0102114706 -0.020089961 0.024428765 0.027894218
race2 -0.0145170528 -0.019974394 0.020131955 0.008789197
water -0.0171930844 -0.021333536 -0.054758404 -0.091343046
caffiene 0.2416517295 0.163381075 0.146285281 0.138699937
niacin 1.0000000000 0.748998797 0.049268557 0.023011049
protein 0.7489987969 1.000000000 0.055128755 0.033175259
diabp 0.0492685569 0.055128755 1.000000000 0.431022979
sysbp 0.0230110492 0.033175259 0.431022979 1.000000000
creatinine 0.0522865710 0.057378289 0.044675209 0.146271368
urinecreatinine 0.0005421783 -0.008769935 0.012549420 -0.006899802
uacr -0.0018907787 -0.008434662 0.026052555 0.151907488
albumin 0.0937018082 0.083605458 -0.046720161 -0.122186102
bloodnitro 0.1019835607 0.152868831 0.037840350 0.220577543
failingkidney -0.0057558518 -0.012458321 0.047009889 -0.044141730
stonesboolean 0.0026000087 -0.006544445 0.009182641 -0.003186755
creatinine urinecreatinine uacr albumin
gender -0.07244878 0.0010798949 -0.013128156 -0.05878072
time_in_us -0.02840833 -0.0866892438 0.016765898 0.02109499
householdincome -0.01019923 -0.0345339303 -0.007049332 0.04239917
age 0.24591642 -0.0942121017 0.088496324 -0.21547469
race 0.06871254 0.0991625351 -0.004316759 -0.03829760
race2 0.03879547 0.0633300332 -0.006724843 -0.01291240
water -0.04433530 -0.0146583234 -0.017678085 -0.02061598
caffiene 0.07611729 -0.0663925522 0.005019089 -0.01700447
niacin 0.05228657 0.0005421783 -0.001890779 0.09370181
protein 0.05737829 -0.0087699347 -0.008434662 0.08360546
diabp 0.04467521 0.0125494199 0.026052555 -0.04672016
sysbp 0.14627137 -0.0068998019 0.151907488 -0.12218610
creatinine 1.00000000 0.0555276970 0.445551908 -0.07669201
urinecreatinine 0.05552770 1.0000000000 -0.045688757 0.06350680
uacr 0.44555191 -0.0456887571 1.000000000 -0.17442082
albumin -0.07669201 0.0635068036 -0.174420819 1.00000000
bloodnitro 0.54673934 -0.0404925656 0.276554972 -0.07161760
failingkidney -0.30758735 0.0647795285 -0.232206084 0.07911537
stonesboolean -0.02766800 0.0611988893 -0.029704753 0.03025162
bloodnitro failingkidney stonesboolean
gender -0.066532038 0.003707065 0.010721949
time_in_us 0.086720558 0.003385810 0.024160363
householdincome 0.035360495 0.006910452 0.013036832
age 0.470218255 -0.094526714 -0.018995284
race -0.007119886 0.002493181 0.006597954
race2 -0.017275327 0.001898238 0.006142984
water -0.066284352 0.025811126 -0.003506018
caffiene 0.121141917 -0.011766013 -0.023592773
niacin 0.101983561 -0.005755852 0.002600009
protein 0.152868831 -0.012458321 -0.006544445
diabp 0.037840350 0.047009889 0.009182641
sysbp 0.220577543 -0.044141730 -0.003186755
creatinine 0.546739342 -0.307587353 -0.027667998
urinecreatinine -0.040492566 0.064779528 0.061198889
uacr 0.276554972 -0.232206084 -0.029704753
albumin -0.071617599 0.079115367 0.030251624
bloodnitro 1.000000000 -0.242272649 -0.030230486
failingkidney -0.242272649 1.000000000 0.040316542
stonesboolean -0.030230486 0.040316542 1.000000000
So most of these features are only weakly correlated with failing kidneys. The strongest correlations were negative: creatinine, blood nitrogen, and the urine albumin-creatinine ratio (uacr), which is consistent with the literature. Surprisingly, in our data diastolic and systolic blood pressure were barely correlated at all, even though kidney stress or failure typically accompanies higher blood pressure. But with direct measures of how the kidneys are functioning through creatinine and nitrogen output, I think we may be on to something.
Variance ranking from MLR¶
Let's rank the importance of each variable with respect to the estimation of kidney failure.
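A ROC-based ranking like the output below can be produced with caret's `filterVarImp()`, which scores each predictor by the area under its ROC curve against a two-class outcome. The frame name and exact call are assumptions.

```r
# Sketch: ROC-curve importance of each predictor for the binary outcome.
library(caret)
y <- factor(kidney_complete$failingkidney)
x <- kidney_complete[, setdiff(names(kidney_complete), "failingkidney")]
roc_imp <- filterVarImp(x, y)
roc_imp[order(-roc_imp[, 1]), , drop = FALSE]   # rank by AUC, descending
```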
ROC curve variable importance
Importance
creatinine 0.7098
bloodnitro 0.6922
age 0.6613
uacr 0.6469
urinecreatinine 0.6232
albumin 0.6035
diabp 0.5699
householdincome 0.5481
water 0.5481
sysbp 0.5480
stonesboolean 0.5476
id 0.5142
gender 0.5059
race 0.5056
time_in_us 0.5042
race2 0.5038
protein 0.5034
niacin 0.5009
caffiene 0.5008
After this regression completes, it may not come as a surprise: if a feature's prediction power is on par with a randomly assigned ID, it is probably not a good predictor. The features at and below id also share a common thread: they are indicators of long-term kidney health, not guarantees of current kidney failure. Gender; race (Hispanic/non-Hispanic); race2 (Black, Asian, White); protein (intake in the last week); niacin (unless supplemented, who knows their weekly consumption?); caffeine consumption (surprising that this one does not predict); and time in the U.S. (if race did not predict, American vs. non-American would not add information). These are lifestyle habits that accumulate over time to scar and damage the kidneys, or, in niacin's case, to support kidney function. But this is not time-series data: at a glance, anyone who has already suffered kidney failure, as asked in the questionnaire, would probably not still be engaging in most of these habits. Drop them.
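The drop itself is a simple column subset. The column names match the table that follows; the frame names are assumptions.

```r
# Sketch: keep the 11 surviving predictors plus the outcome.
keep <- c("householdincome", "age", "water", "diabp", "sysbp",
          "creatinine", "urinecreatinine", "uacr", "albumin",
          "bloodnitro", "stonesboolean", "failingkidney")
kidney_model <- kidney_complete[, keep]
```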
| householdincome | age | water | diabp | sysbp | creatinine | urinecreatinine | uacr | albumin | bloodnitro | stonesboolean | failingkidney | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <int> | <int> | <int> | <int> | <int> | <dbl> | <int> | <dbl> | <int> | <int> | <int> | <fct> | |
| 1 | 4 | 69 | 2 | 72 | 122 | 106.96 | 77 | 11.03 | 41 | 10 | 2 | 0 |
| 2 | 7 | 54 | 1 | 62 | 156 | 69.84 | 93 | 306.00 | 47 | 16 | 2 | 0 |
| 3 | 10 | 72 | 6 | 90 | 140 | 107.85 | 59 | 10.53 | 37 | 14 | 2 | 1 |
| 4 | 9 | 9 | 3 | 38 | 108 | 88.40 | 247 | 21.05 | 43 | 9 | 2 | 0 |
| 5 | 15 | 73 | 1 | 86 | 136 | 64.53 | 58 | 173.47 | 43 | 31 | 2 | 1 |
| 6 | 9 | 56 | 3 | 84 | 160 | 78.68 | 216 | 166.22 | 43 | 18 | 2 | 0 |
| 7 | 15 | 0 | 3 | 66 | 116 | 47.74 | 127 | 16.00 | 43 | 13 | 2 | 0 |
| 8 | 10 | 61 | 7 | 80 | 118 | 81.33 | 115 | 7.85 | 39 | 17 | 2 | 0 |
| 9 | 15 | 42 | 6 | 66 | 116 | 65.42 | 237 | 5.00 | 43 | 13 | 2 | 0 |
| 10 | 4 | 56 | 2 | 74 | 128 | 48.62 | 114 | 7.22 | 41 | 9 | 2 | 0 |
| 11 | 3 | 65 | 7 | 78 | 140 | 85.75 | 19 | 16.28 | 40 | 15 | 2 | 0 |
| 12 | 15 | 26 | 7 | 60 | 106 | 65.42 | 31 | 80.65 | 45 | 12 | 2 | 0 |
| 13 | 77 | 0 | 6 | 66 | 116 | 64.53 | 30 | 4.36 | 43 | 7 | 2 | 0 |
| 14 | 5 | 9 | 2 | 44 | 102 | 54.81 | 116 | 23.45 | 43 | 8 | 2 | 0 |
| 15 | 14 | 76 | 7 | 68 | 124 | 105.20 | 177 | 14.58 | 43 | 17 | 1 | 0 |
| 16 | 2 | 10 | 4 | 54 | 88 | 93.70 | 73 | 35.42 | 43 | 9 | 1 | 0 |
| 17 | 8 | 10 | 6 | 62 | 94 | 70.72 | 86 | 8.29 | 43 | 12 | 1 | 0 |
| 18 | 8 | 33 | 1 | 56 | 122 | 52.16 | 173 | 7.51 | 43 | 11 | 1 | 0 |
| 19 | 3 | 1 | 7 | 66 | 116 | 34.48 | 153 | 7.61 | 43 | 12 | 1 | 0 |
| 20 | 8 | 16 | 6 | 68 | 108 | 78.68 | 166 | 3.01 | 51 | 14 | 1 | 0 |
| 21 | 2 | 32 | 1 | 74 | 118 | 60.11 | 191 | 8.90 | 45 | 17 | 2 | 0 |
| 22 | 5 | 18 | 4 | 58 | 120 | 46.85 | 201 | 5.27 | 43 | 12 | 2 | 0 |
| 23 | 10 | 12 | 1 | 72 | 108 | 50.39 | 143 | 6.32 | 47 | 10 | 2 | 0 |
| 24 | 12 | 38 | 4 | 84 | 124 | 63.65 | 120 | 2.79 | 38 | 10 | 2 | 0 |
| 25 | 15 | 50 | 7 | 80 | 138 | 83.98 | 101 | 4.95 | 43 | 11 | 2 | 0 |
| 26 | 3 | 23 | 6 | 56 | 98 | 92.82 | 205 | 7.61 | 43 | 10 | 2 | 0 |
| 27 | 15 | 7 | 7 | 66 | 116 | 64.53 | 120 | 19.21 | 43 | 6 | 2 | 0 |
| 28 | 9 | 13 | 1 | 54 | 108 | 52.16 | 106 | 4.72 | 42 | 14 | 2 | 0 |
| 29 | 7 | 28 | 4 | 70 | 106 | 106.96 | 206 | 2.70 | 47 | 23 | 2 | 0 |
| 30 | 6 | 4 | 7 | 66 | 116 | 61.00 | 120 | 6.53 | 43 | 8 | 2 | 0 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 10146 | 7 | 80 | 3 | 86 | 154 | 90.17 | 56 | 13.33 | 41 | 18 | 2 | 0 |
| 10147 | 99 | 22 | 1 | 64 | 128 | 64.53 | 89 | 9.21 | 46 | 13 | 2 | 0 |
| 10148 | 1 | 15 | 4 | 38 | 108 | 63.65 | 88 | 4.85 | 41 | 12 | 2 | 0 |
| 10149 | 4 | 35 | 4 | 64 | 100 | 70.72 | 201 | 7.03 | 44 | 16 | 2 | 0 |
| 10150 | 6 | 6 | 6 | 66 | 116 | 45.97 | 123 | 10.00 | 43 | 7 | 2 | 0 |
| 10151 | 3 | 18 | 3 | 54 | 106 | 74.26 | 156 | 13.83 | 46 | 11 | 2 | 0 |
| 10152 | 5 | 64 | 5 | 74 | 94 | 176.80 | 239 | 10.55 | 39 | 28 | 2 | 0 |
| 10153 | 15 | 24 | 1 | 62 | 116 | 86.63 | 93 | 7.57 | 46 | 19 | 2 | 0 |
| 10154 | 1 | 2 | 1 | 66 | 116 | 63.65 | 381 | 35.66 | 43 | 10 | 2 | 0 |
| 10155 | 7 | 38 | 6 | 76 | 110 | 68.95 | 62 | 13.87 | 39 | 6 | 2 | 0 |
| 10156 | 9 | 61 | 1 | 70 | 124 | 91.05 | 85 | 3.41 | 43 | 13 | 2 | 0 |
| 10157 | 6 | 34 | 5 | 66 | 118 | 97.24 | 80 | 10.25 | 49 | 12 | 2 | 0 |
| 10158 | 6 | 19 | 5 | 64 | 112 | 65.42 | 81 | 4.48 | 45 | 9 | 2 | 0 |
| 10159 | 3 | 58 | 7 | 74 | 118 | 78.68 | 45 | 6.09 | 39 | 12 | 2 | 0 |
| 10160 | 15 | 17 | 5 | 60 | 104 | 112.27 | 91 | 3.25 | 44 | 12 | 2 | 0 |
| 10161 | 6 | 80 | 2 | 66 | 116 | 69.84 | 125 | 29.60 | 39 | 20 | 2 | 0 |
| 10162 | 6 | 60 | 6 | 72 | 114 | 90.17 | 659 | 10.00 | 44 | 11 | 9 | 0 |
| 10163 | 8 | 3 | 6 | 66 | 116 | 89.28 | 149 | 5.53 | 43 | 7 | 9 | 0 |
| 10164 | 8 | 36 | 6 | 88 | 130 | 114.04 | 367 | 6.77 | 48 | 10 | 2 | 0 |
| 10165 | 15 | 52 | 1 | 70 | 108 | 98.12 | 434 | 6.48 | 44 | 12 | 2 | 0 |
| 10166 | 7 | 0 | 6 | 66 | 116 | 51.27 | 120 | 6.60 | 43 | 10 | 2 | 0 |
| 10167 | 10 | 61 | 7 | 66 | 116 | 71.60 | 157 | 10.83 | 41 | 17 | 2 | 0 |
| 10168 | 8 | 80 | 3 | 70 | 164 | 114.04 | 192 | 5.98 | 38 | 26 | 2 | 0 |
| 10169 | 6 | 7 | 6 | 66 | 116 | 76.02 | 88 | 11.36 | 43 | 7 | 2 | 0 |
| 10170 | 9 | 40 | 6 | 66 | 116 | 73.37 | 114 | 4.56 | 43 | 10 | 2 | 0 |
| 10171 | 77 | 26 | 7 | 68 | 110 | 97.24 | 173 | 4.04 | 49 | 13 | 2 | 0 |
| 10172 | 8 | 2 | 5 | 66 | 116 | 59.23 | 91 | 14.27 | 43 | 13 | 2 | 0 |
| 10173 | 7 | 42 | 6 | 82 | 136 | 72.49 | 143 | 5.13 | 41 | 10 | 2 | 0 |
| 10174 | 6 | 7 | 6 | 66 | 116 | 82.21 | 280 | 5.23 | 43 | 10 | 2 | 0 |
| 10175 | 15 | 11 | 6 | 68 | 94 | 72.49 | 61 | 4.65 | 43 | 7 | 2 | 0 |
Random Forest RFE¶
Although we reached our goal of 10-15 predictive features (11 remain), I think it would be beneficial to run a Random Forest Recursive Feature Elimination to see whether any remaining features are obvious eliminations that would speed up the machine learning process (on a laptop, the previous regression took about 10 minutes).
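The RF-based RFE with 10-fold cross-validation can be sketched as follows; subset sizes 1-11 match the resampling table below, while the frame name and seed are assumptions.

```r
# Sketch: recursive feature elimination with random-forest ranking functions.
library(caret)
set.seed(1)
ctrl    <- rfeControl(functions = rfFuncs, method = "cv", number = 10)
rfe_fit <- rfe(x = kidney_model[, setdiff(names(kidney_model), "failingkidney")],
               y = factor(kidney_model$failingkidney),
               sizes = 1:11, rfeControl = ctrl)
predictors(rfe_fit)   # the selected subset of variables
```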
Recursive feature selection
Outer resampling method: Cross-Validated (10 fold)
Resampling performance over subset size:
Variables Accuracy Kappa AccuracySD KappaSD Selected
1 0.9759 0.2283 0.002891 0.09754
2 0.9752 0.2188 0.003363 0.12320
3 0.9760 0.2334 0.002497 0.10780
4 0.9760 0.2507 0.002499 0.10997
5 0.9764 0.2492 0.002618 0.12167 *
6 0.9763 0.2412 0.002514 0.10984
7 0.9761 0.2412 0.002272 0.08857
8 0.9758 0.2253 0.002586 0.09853
9 0.9761 0.2508 0.002226 0.08428
10 0.9762 0.2436 0.002733 0.09494
11 0.9760 0.2415 0.002503 0.08900
The top 5 variables (out of 5):
creatinine, bloodnitro, age, diabp, urinecreatinine
- 'creatinine'
- 'bloodnitro'
- 'age'
- 'diabp'
- 'urinecreatinine'
This recursion says that creatinine, blood nitrogen, age, diastolic blood pressure, and urine creatinine are the strongest predictors for the model, and that adding more features can detract from prediction power. However, accuracy across every subset size sits within 97-98%, so keeping the weaker predictors for their variance should still benefit the final model. We will therefore keep all 11 predictors for the next steps of the process. Let's offload a .csv of the complete data and move notebooks.
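Exporting the cleaned frame for the modeling notebook is one line; the frame and file names here are assumptions.

```r
# Sketch: write the cleaned, imputed dataset out for the modeling phase.
write.csv(kidney_model, "kidney_cleaned.csv", row.names = FALSE)
```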
Conclusion¶
That concludes this first portion of the model. According to the medical literature I referenced, the final feature set aligns closely with the strongest known predictors of kidney disease, with one notable exception: I chose not to include diabetes in this specific model. While diabetes is a primary driver of renal issues, the comorbidities associated with chronic illness are vast and complex. I wanted to focus this model on features that are largely intervenable or actionable, with the exception of age. By narrowing the scope this way, the model becomes a more focused tool for looking at physiological markers directly.
At this stage, the notebook runtime is nearing the one-hour mark on my laptop, and the dataset is now fully cleaned, imputed, and tapered down to its most essential predictors. The next logical step is to transition to a fresh notebook to perform a final check for feature collinearity, split the data into training and test sets, and begin running the actual model predictions.
Thank you for taking the time to read through this extensive preprocessing and cleaning phase! If you have any tips for optimizing R code or suggestions for machine learning techniques that could help speed up the runtime of the MICE or Caret packages in this environment, please share them in the comments.
References¶
https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Questionnaire&CycleBeginYear=2013
https://www.cdc.gov/kidneydisease/publications-resources/kidney-tests.html
https://www.niddk.nih.gov/health-information/health-statistics/kidney-disease
https://machinelearningmastery.com/feature-selection-with-the-caret-r-package/