Project Overview¶

My interest in kidney disease prediction stems from a personal passion for health and wellness, as well as a family history of Polycystic Kidney Disease (PKD). With kidney failure rates rising across the United States, I wanted to explore how demographic data, lab results, and dietary factors can be used to predict the likelihood of an individual requiring dialysis or a transplant.

Objective¶

The goal of this phase is to prepare a high-quality dataset for predictive modeling. I drew upon medical literature to select relevant features, followed by a rigorous pipeline of cleaning, transformation, and imputation.

Methodology¶

  • Data Preprocessing & Imputation: Cleaned the raw data and handled missing values to ensure a complete dataset.
  • Feature Engineering & Selection:
    • Correlation Analysis: Identified and addressed multicollinearity within the feature matrix.
    • Multiple Linear Regression: Evaluated variance and ranked feature importance.
    • Recursive Feature Elimination (RFE): Systematically narrowed the dataset down to the 11 most impactful predictors.
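As a rough illustration of the correlation step, the sketch below flags highly correlated feature pairs in a toy matrix using base R. The data, the column names, and the 0.8 cutoff are all assumptions made for this example; the notebook does not state which threshold was used.

```r
# Toy feature matrix; names and values are illustrative, not NHANES data.
set.seed(42)
n <- 200
sysbp <- rnorm(n, mean = 118, sd = 18)
features <- data.frame(
  sysbp      = sysbp,
  diabp      = 0.5 * sysbp + rnorm(n, sd = 2),  # deliberately collinear with sysbp
  creatinine = rnorm(n, mean = 78, sd = 20),
  albumin    = rnorm(n, mean = 43, sd = 3)
)

# Pairwise Pearson correlations, tolerating missing values.
cor_mat <- cor(features, use = "pairwise.complete.obs")

# Keep each pair once (upper triangle) and flag |r| above the cutoff.
cutoff <- 0.8  # assumed threshold
idx <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  feature_a = rownames(cor_mat)[idx[, "row"]],
  feature_b = colnames(cor_mat)[idx[, "col"]],
  r         = cor_mat[idx]
)
print(high_pairs)
```

One feature from each flagged pair is then a candidate for removal before regression or RFE.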

Output: The result is a refined .csv dataset ready for the modeling phase, where I will evaluate various bagging and boosting techniques.

Note: This model is designed to assess the likelihood of current kidney failure rather than forecasting future onset.


Data¶

Where does the data come from?¶

Data Source: NHANES (CDC/NCHS)

The data for this model is sourced from the National Health and Nutrition Examination Survey (NHANES), a flagship program of the Centers for Disease Control and Prevention (CDC).

Why is this data significant?

  • National Representation: The survey examines a representative sample of approximately 5,000 individuals across the U.S. annually.
  • Gold-Standard Metrics: Unlike many datasets that rely solely on self-reported surveys, NHANES combines in-home interviews with physical examinations and laboratory tests conducted in mobile examination centers.
  • Multidimensional Features: It provides a comprehensive view of a patient by linking demographic and socioeconomic factors with clinical lab results.

The Data Ecosystem¶

The power of this prediction model lies in its multi-dimensional approach, merging six distinct datasets from the NHANES 2013-2014 cycle:

  1. Demographics: Key Variables include Age, Gender, Race/Ethnicity, Education, and Income. Establishes baseline risk groups.
  2. Examinations: Key Variables include Blood Pressure, BMI, and Grip Strength. Identifies physical indicators of hypertension.
  3. Laboratory: Key Variables include Albumin, Creatinine, and Cholesterol. Provides direct chemical evidence of kidney efficiency.
  4. Dietary Data: Key Variables include Sodium and protein consumption.
  5. Questionnaires: Key Variables include Alcohol use and cardiovascular history.
  6. Medication: Identifies individuals already being treated for related conditions.

Variable Dictionary


Data Processing & Methodology¶

The Technical Choice: Why R? For this analysis, I utilized R and the Tidyverse ecosystem. While Python is noted for speed, R’s specialized packages—specifically mice for imputation and caret for recursive feature selection—offered the high-precision tools necessary for complex medical data.

library(tidyverse) # Data manipulation
library(mice)      # Imputation
library(caret)     # Model training & RFE

Overcoming Data Sparsity: Imputation Strategy¶

A significant challenge in the NHANES dataset is missing values. A "listwise deletion" approach (removing any row with an NA) would have resulted in the loss of nearly the entire sample, as medical exams are often incomplete for various participants.

To solve this, I used the MICE (Multivariate Imputation by Chained Equations) algorithm. This allowed me to preserve the statistical power of the dataset by predicting missing values based on the relationships between other observed variables, ensuring the final 11 features remained robust and representative.
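To make the cost of listwise deletion concrete, here is a small base R sketch on made-up data. The column names mirror the notebook's features, but the values and the 30% missingness rate per column are assumptions chosen only for illustration.

```r
# Toy data: three measurements, each independently missing for 30% of rows.
set.seed(1)
n <- 1000
toy <- data.frame(
  id         = seq_len(n),
  sysbp      = rnorm(n, 118, 18),
  creatinine = rnorm(n, 78, 20),
  uacr       = rexp(n, 1 / 40)
)
for (col in c("sysbp", "creatinine", "uacr")) {
  toy[sample(n, size = 0.3 * n), col] <- NA
}

# Listwise deletion keeps only rows with no NA in any column.
rows_total    <- nrow(toy)
rows_complete <- sum(complete.cases(toy))
cat("rows:", rows_total, " complete rows:", rows_complete, "\n")
```

Even with only three incomplete columns, a large share of rows disappears; with dozens of sparsely filled columns the surviving sample shrinks much further.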


Next we can load in the datasets¶

There are 6 tables in this dataset. After a deeper dive into which components matter for kidney disease prediction, I will drop the tables that the bagging and boosting models do not need.

Verifying Data¶

Next, it helps to verify the data. I will start with the head(), glimpse(), and dim() functions, as those are my favorites. I will also open the files in Excel to make sure the response and participant IDs match. With this much data, it may have been cleaner to confirm matching IDs (SEQN) with a join and then run dim() again.
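The join-based ID check mentioned above could look something like this sketch. It uses two tiny data frames built from the first few SEQN values in the previews, and base R's str() stands in for glimpse() so that no packages are needed; the object names are hypothetical.

```r
# Two small tables keyed by SEQN (a handful of rows for illustration only).
demo <- data.frame(
  SEQN     = 73557:73566,
  RIDAGEYR = c(69, 54, 72, 9, 73, 56, 0, 61, 42, 56)
)
exam <- data.frame(
  SEQN     = c(73557:73564, 73566L),
  PEASCTM1 = c(620, 766, 665, 803, 949, 1064, 90, 954, 625)
)

# Quick structural checks.
head(demo, 5)
str(demo)
dim(demo)

# IDs present in demographics but absent from the exam table.
only_in_demo <- setdiff(demo$SEQN, exam$SEQN)

# Inner join on SEQN, then compare dimensions with the inputs.
joined <- merge(demo, exam, by = "SEQN")
dim(joined)
```

If the joined row count is lower than either input, the difference is exactly the set of IDs that do not appear in both tables.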

A data.frame: 5 × 5
   SEQN  SDDSRVYR  RIDSTATR  RIAGENDR  RIDAGEYR
  <int>     <int>     <int>     <int>     <int>
1 73557         8         2         1        69
2 73558         8         2         1        54
3 73559         8         2         1        72
4 73560         8         2         1         9
5 73561         8         2         2        73
Rows: 10,175
Columns: 5
$ SEQN     <int> 73557, 73558, 73559, 73560, 73561, 73562, 73563, 73564, 73565…
$ SDDSRVYR <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8…
$ RIDSTATR <int> 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2…
$ RIAGENDR <int> 1, 1, 1, 1, 2, 1, 1, 2, 1, 2, 1, 2, 2, 2, 1, 2, 1, 2, 1, 1, 1…
$ RIDAGEYR <int> 69, 54, 72, 9, 73, 56, 0, 61, 42, 56, 65, 26, 0, 9, 76, 10, 1…
A data.frame: 5 × 5
   SEQN     WTDRD1     WTDR2D  DR1DRSTZ  DR1EXMER
  <int>      <dbl>      <dbl>     <int>     <int>
1 73557   16888.33   12930.89         1        49
2 73558   17932.14   12684.15         1        59
3 73559   59641.81   39394.24         1        49
4 73560  142203.07  125966.37         1        54
5 73561   59052.36   39004.89         1        63
Rows: 9,813
Columns: 5
$ SEQN     <int> 73557, 73558, 73559, 73560, 73561, 73562, 73563, 73564, 73566…
$ WTDRD1   <dbl> 16888.328, 17932.144, 59641.813, 142203.070, 59052.357, 49890…
$ WTDR2D   <dbl> 12930.89, 12684.15, 39394.24, 125966.37, 39004.89, 0.00, 4073…
$ DR1DRSTZ <int> 1, 1, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1, 1, 1…
$ DR1EXMER <int> 49, 59, 49, 54, 63, 49, 54, 54, 49, 61, 87, 22, 25, 61, NA, 4…
A data.frame: 5 × 5
   SEQN  PEASCST1  PEASCTM1  PEASCCT1  BPXCHR
  <int>     <int>     <int>     <int>   <int>
1 73557         1       620        NA      NA
2 73558         1       766        NA      NA
3 73559         1       665        NA      NA
4 73560         1       803        NA      NA
5 73561         1       949        NA      NA
Rows: 9,813
Columns: 5
$ SEQN     <int> 73557, 73558, 73559, 73560, 73561, 73562, 73563, 73564, 73566…
$ PEASCST1 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ PEASCTM1 <int> 620, 766, 665, 803, 949, 1064, 90, 954, 625, 932, 585, 710, 1…
$ PEASCCT1 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ BPXCHR   <int> NA, NA, NA, NA, NA, NA, 152, NA, NA, NA, NA, NA, NA, NA, NA, …
A data.frame: 5 × 5
   SEQN  URXUMA  URXUMS  URXUCR.x   URXCRS
  <int>   <dbl>   <dbl>     <int>    <dbl>
1 73557     4.3     4.3        39   3447.6
2 73558   153.0   153.0        50   4420.0
3 73559    11.9    11.9       113   9989.2
4 73560    16.0    16.0        76   6718.4
5 73561   255.0   255.0       147  12994.8
Rows: 9,813
Columns: 5
$ SEQN     <int> 73557, 73558, 73559, 73560, 73561, 73562, 73563, 73564, 73566…
$ URXUMA   <dbl> 4.3, 153.0, 11.9, 16.0, 255.0, 123.0, NA, 19.0, 1.3, 35.0, 25…
$ URXUMS   <dbl> 4.3, 153.0, 11.9, 16.0, 255.0, 123.0, NA, 19.0, 1.3, 35.0, 25…
$ URXUCR.x <int> 39, 50, 113, 76, 147, 74, NA, 242, 18, 215, 31, 116, 177, 144…
$ URXCRS   <dbl> 3447.6, 4420.0, 9989.2, 6718.4, 12994.8, 6541.6, NA, 21392.8,…
A data.frame: 5 × 5
   SEQN  RXDUSE  RXDDRUG           RXDDRGID  RXQSEEN
  <int>   <int>  <chr>             <chr>       <int>
1 73557       1  99999                            NA
2 73557       1  INSULIN           d00262          2
3 73558       1  GABAPENTIN        d03182          1
4 73558       1  INSULIN GLARGINE  d04538          1
5 73558       1  OLMESARTAN        d04801          1
Rows: 20,194
Columns: 5
$ SEQN     <int> 73557, 73557, 73558, 73558, 73558, 73558, 73559, 73559, 73559…
$ RXDUSE   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ RXDDRUG  <chr> "99999", "INSULIN", "GABAPENTIN", "INSULIN GLARGINE", "OLMESA…
$ RXDDRGID <chr> "", "d00262", "d03182", "d04538", "d04801", "d00746", "d04697…
$ RXQSEEN  <int> NA, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, NA, 1, 1, 1, 1, 2, 2, 2, 2,…
A data.frame: 5 × 5
   SEQN  ACD011A  ACD011B  ACD011C  ACD040
  <int>    <int>    <int>    <int>   <int>
1 73557        1       NA       NA      NA
2 73558        1       NA       NA      NA
3 73559        1       NA       NA      NA
4 73560        1       NA       NA      NA
5 73561        1       NA       NA      NA
Rows: 10,175
Columns: 5
$ SEQN    <int> 73557, 73558, 73559, 73560, 73561, 73562, 73563, 73564, 73565,…
$ ACD011A <int> 1, 1, 1, 1, 1, NA, NA, 1, NA, 1, 1, 1, NA, 1, 1, 1, 1, NA, NA,…
$ ACD011B <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ACD011C <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ ACD040  <int> NA, NA, NA, NA, NA, 4, NA, NA, 5, NA, NA, NA, NA, NA, NA, NA, …
dim() output (rows × columns), one line per table:
  10175 × 47
  9813 × 168
  9813 × 224
  9813 × 424
  20194 × 13
  10175 × 953

I notice that the majority of the data is in int and dbl format. There are a lot of NAs, which is expected because not every participant will qualify for every examination or test (Missing Not At Random data). The missingness is therefore still useful information: the absence of a test carries a meaningful signal of its own.

The column headers are hard to read because of the shorthand codes used to format the long data. When I open the files in Excel, however, a description of each column is available, and the data card links to documentation for every column. I will still need to do quite a bit of transformation and verification before the data is clear enough for me to relay to a stakeholder.

As you can see, this is a massive and complex dataset. I have no formal medical training and am not focused on how individual participants use specific medications, so I will not be manipulating the medication set. I will also use only a few widely known measures from the labs: I have a strong background in biology and understand that lab reports can be multifaceted. High or low readings can mean different things depending on symptomology, so without an EDA and a clear direction they would not be useful and could even be misleading. The majority of the data I use will come from the demographics, examinations, and diet datasets. The questionnaire portion will be useful towards the end, when I have more insight into the sample population.

Checking IDs¶

My initial plan was to open the files in Excel and create pivot tables. However, this dataset has so many entries that Excel cannot handle it appropriately, so all analysis will take place within the notebook.

10175
9813
9813
9813
10175

It looks like we have 10,175 participants in the demographics and questionnaire tables, of whom 9,813 also appear in the dietary, examination, and laboratory tables, meaning they were interviewed and then given a physical examination.

0
0
0
0
0

There are no duplicate IDs in these tables, so every row is a unique respondent.
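A minimal sketch of both ID checks (distinct participants and duplicated IDs) in base R is shown below. The two toy tables are assumptions for illustration. The medication-style table is included because the preview above shows it holding several rows per participant, so repeated SEQN values are expected there.

```r
# Toy tables: one row per participant vs. one row per reported drug.
demo <- data.frame(SEQN = c(73557L, 73558L, 73559L))
meds <- data.frame(SEQN = c(73557L, 73557L, 73558L, 73558L, 73558L))
tables <- list(demographics = demo, medication = meds)

# Number of distinct participants in each table.
unique_ids <- sapply(tables, function(tbl) length(unique(tbl$SEQN)))

# Number of rows whose SEQN already appeared earlier in the same table.
dup_rows <- sapply(tables, function(tbl) sum(duplicated(tbl$SEQN)))

print(unique_ids)
print(dup_rows)
```

A duplicate count of zero means the table can be joined one-to-one on SEQN; a non-zero count signals a long-format table that needs aggregating first.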

Dealing with NA¶

If I were to explore medication or do a deep dive into diet, the NAs would be very useful for comparing what individual participants are and are not doing. For my prediction, however, NAs do not provide useful information and need to be remediated for the model to run at all.

Combine datasets and Transform useful columns¶

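There is too much data to go through each individual table and column to draw correlations and comparisons. Instead, I first combine the tables with an inner join on the participant ID (SEQN) and then rename the columns that are useful for understanding who was surveyed. (As stated previously, I will explore only 4 of the sets initially.) Note that the combined table shown below has 10,175 rows, the same as the demographics table, which suggests the join kept every demographic row (a left join); a strict inner join would keep at most the 9,813 IDs present in the examination tables.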
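The sketch below shows one way to chain joins on SEQN and rename a few columns in base R. The tables are toy stand-ins with a handful of rows, and the readable names are taken from the renamed summary later in the notebook. It also contrasts an inner join with a left join so the difference in row counts is visible.

```r
# Toy stand-ins for three NHANES tables (illustrative rows only).
demo <- data.frame(SEQN = 73557:73561, RIDAGEYR = c(69, 54, 72, 9, 73))
exam <- data.frame(SEQN = c(73557L, 73558L, 73559L, 73561L),
                   BPXSY1 = c(122, 156, 140, 136))
labs <- data.frame(SEQN = c(73557L, 73558L, 73561L),
                   LBDSCRSI = c(106.96, 69.84, 64.53))
tables <- list(demo, exam, labs)

# Inner join: only IDs present in every table survive.
inner <- Reduce(function(x, y) merge(x, y, by = "SEQN"), tables)

# Left join: every demographics row survives; gaps become NA.
left <- Reduce(function(x, y) merge(x, y, by = "SEQN", all.x = TRUE), tables)

# Rename shorthand codes to readable names with a lookup vector.
new_names <- c(SEQN = "id", RIDAGEYR = "age",
               BPXSY1 = "sysbp", LBDSCRSI = "creatinine")
names(left) <- unname(new_names[names(left)])

dim(inner)
dim(left)
```

Which join is appropriate depends on whether participants without exam or lab data should remain in the frame for later imputation.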

The specific kidney data I needed was not available in the standard set of data. I followed the link to the original data and downloaded the matching survey cycle from the National Health and Nutrition Examination Survey pages of the CDC. The download is a file in .XPT format, the SAS transport file format. I do not have access to SAS, so I will use the 'foreign' package to read the file and then export and reimport the data as a .csv.
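A minimal sketch of that conversion, assuming the transport file has already been downloaded to the working directory. The helper name, the input file name, and the output file name are all hypothetical; read.xport() is the reader provided by the foreign package.

```r
library(foreign)  # recommended package bundled with standard R installations

# Hypothetical helper: read a SAS transport (.XPT) file and save it as .csv.
xpt_to_csv <- function(xpt_path, csv_path) {
  dat <- read.xport(xpt_path)  # a data.frame when the file holds one dataset
  write.csv(dat, csv_path, row.names = FALSE)
  invisible(dat)
}

# Assumed file names; adjust to wherever the download was saved.
xpt_file <- "KIQ_U_H.XPT"
if (file.exists(xpt_file)) {
  kidney <- xpt_to_csv(xpt_file, "kidney_questionnaire.csv")
  print(dim(kidney))
}
```

The file.exists() guard simply keeps the sketch from failing when the download is not present.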

A data.frame: 5769 × 16
SEQN KIQ022 KIQ025 KIQ026 KID028 KIQ005 KIQ010 KIQ042 KIQ430 KIQ044 KIQ450 KIQ046 KIQ470 KIQ050 KIQ052 KIQ480
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
735572NA2NA 4 2 1 2 2NA 2NA 4 4 3
735582NA2NA 1NA 2NA 2NA 2NANANA 2
735591 22NA 2 3 2NA 2NA 2NANANA 2
735611 22NA 1NA 2NA 2NA 2NANANA 2
735622NA2NA 1NA 2NA 2NA 2NANANA 0
735642NA2NA 3 1 1 2 2NA 2NA 3 1 2
735652NA2NANANANANANANANANANANANA
735662NA2NA 1NA 2NA 2NA 2NANANA 1
735672NA2NA 1NA 2NA 2NA 2NANANA 1
735682NA2NA 1NA 2NA 2NA 2NANANA 0
735712NA1 3 1NA 2NA 2NA 2NANANA 1
735742NA1 1 1NA 2NA 2NA 2NANANA 1
735772NA2NA 1NA 2NA 2NA 2NANANA 0
735802NA2NA 2 1 1 1 1 1 2NA 1 1 1
735812NA2NA 2 1 2NA 1 1 2NA 2 2 0
735822NA2NANANANANANANANANANANANA
735852NA2NA 1NA 2NA 2NA 2NANANA 1
735892NA1 1 1NA 2NA 2NA 2NANANA 1
735922NA2NA 1NA 2NA 2NA 2NANANA 1
735942NA2NANANANANANANANANANANANA
735952NA2NA 1NA 2NA 2NA 2NANANA 0
735962NA2NA 2 1 2NA 1 1 2NA 2 2 1
735971 12NA 1NA 2NA 2NA 2NANANA 0
735982NA2NA 1NA 2NA 2NA 2NANANA 1
736002NA2NA 1NA 2NA 2NA 2NANANA 0
736032NA2NA 1NA 1 1 2NA 1 1 1 1 0
736042NA2NA 5 1 2NA 1 1 1 4 3 1 0
736072NA2NA 4 2 2NA 1 3 2NA 1 1 1
736102NA2NA 1NA 2NA 2NA 2NANANA 1
736132NA2NA 2 1 2NA 1 1 2NA 3 1 3
⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮
836782NA2NA 2 1 1 2 1 1 2NA 2 1 1
836832NA2NA 1NA 2NA 1 1 2NA 5 2 2
836842NA1 1 1NA 2NA 1 1 2NA 2 1 0
836872NA2NA 3 1 1 1 1 1 1 4 2 1 3
836882NA2NA 1NA 2NA 2NA 2NANANA 1
836892NA2NA 3 1 1 2 2NA 2NA 2 1 1
836902NA2NA 2 1 1 1 2NA 2NA 1 1 0
836922NA2NA 4 2 2NA 2NA 1 4 1 1 5
836942NA2NA 3 1 1 2 2NA 2NA 2 1 1
836992NA2NA 1NA 2NA 2NA 2NANANA 0
837002NA2NA 4 2 1 3 2NA 2NA 3 2 1
837012NA2NA 2 2 2NA 1 1 2NA 5 1 1
837022NA2NA 4 2 2NA 1 3 2NA 3 1 4
837032NA2NA 1NA 2NA 2NA 2NANANA 1
837052NA2NANANANANANANANANANANANA
837082NA2NA 1NA 1 1 2NA 2NA 1 1 1
837092NA2NA 1NA 2NA 2NA 2NANANA 0
837112NA2NA 1NA 1 4 2NA 2NA 1 1 2
837122NA2NA 2 2 2NA 1 1 2NA 2 1 2
837132NA2NANANANANANANANANANANANA
837152NA2NA 1NA 2NA 1 1 2NA 1 1 2
837172NA2NA 5 3 2NA 1 4 2NA 4 2 2
837182NA9NA 3 2 2NA 1 1 2NA 2 2 2
837202NA2NA 1NA 2NA 2NA 2NANANA 1
837212NA2NA 1NA 2NA 2NA 2NANANA 1
837232NA2NA 1NA 2NA 2NA 2NANANA 3
837242NA2NA 1NA 2NA 2NA 2NANANA 3
837262NA2NANANANANANANANANANANANA
837272NA2NA 1NA 2NA 2NA 2NANANA 0
837292NA2NANANANANANANANANANANANA
A data.frame: 10175 × 22
SEQN DMDHRGND DMDEDUC3 DMDYRSUS INDFMIN2 RIDAGEYR RIDRETH1 RIDRETH3 DR1DAY DR1TCAFF ⋯ BPXDI1 BPXSY1 LBDSCRSI URXUCR URDACT LBDSALSI LBXSBU KID028 KIQ022 KIQ026
<int> <int> <int> <int> <int> <int> <int> <int> <int> <int> ⋯ <int> <int> <dbl> <int> <dbl> <int> <int> <int> <int> <int>
735571NANA 46944 2203⋯72122106.96 NA 11.034110NA 2 2
735581NANA 75433 1240⋯62156 69.84 NA306.004716NA 2 2
735591NANA107233 6 45⋯90140107.85 NA 10.533714NA 1 2
735601 3NA 9 933 3 0⋯38108 NA NA 21.05NANANANANA
735611NANA157333 1 24⋯86136 64.53 NA173.474331NA 1 2
735621NANA 95611 3144⋯84160 78.68 NA166.224318NA 2 2
735631NANA15 033 3 NA⋯NA NA NA NA NANANANANANA
735642NANA106133 7 4⋯80118 81.33 NA 7.853917NA 2 2
735651NANA154222NA NA⋯NA NA NA NA NANANANA 2 2
735662NANA 45633 2266⋯74128 48.62 NA 7.2241 9NA 2 2
735671NANA 36533 7 43⋯78140 85.75 NA 16.284015NA 2 2
735682NANA152633 7199⋯60106 65.42 31 80.654512NA 2 2
735692NANA77 057NA NA⋯NA NA NA NA NANANANANANA
735702 2NA 5 957 2 47⋯44102 NA116 23.45NANANANANA
735711NANA147633 7264⋯68124105.20177 14.584317 3 2 1
735722 3NA 21044 4 0⋯54 88 NA NA 35.42NANANANANA
735731 4NA 81044NA NA⋯62 94 NA NA 8.29NANANANANA
735742NA 4 83356 1872⋯56122 52.16173 7.514311 1 2 1
735752NANA 3 144 7 3⋯NA NA NA NA NANANANANANA
735762 9NA 81644 6 0⋯68108 78.68166 3.015114NANANA
735772NA 4 23211 1210⋯74118 60.11191 8.904517NA 2 2
73578215NA 51811 4 0⋯58120 NA201 5.27NANANANANA
735792 6NA101233 1 0⋯72108 50.39 NA 6.324710NANANA
735802NANA123844 4 36⋯84124 63.65 NA 2.793810NA 2 2
735811NA 4155056 7 24⋯80138 83.98 NA 4.954311NA 2 2
735822NANA 32344NA NA⋯56 98 NA NA 7.61NANANA 2 2
735831 1NA15 733 7 0⋯NA NA NA NA 19.21NANANANANA
735842 7NA 91333 1 0⋯54108 52.16106 4.724214NANANA
735851NA 6 72856 4 96⋯70106106.96 NA 2.704723NA 2 2
735861NANA 6 456 7 0⋯NA NA NA NA NANANANANANA
⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋱⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮
837022NANA 78033 3 95⋯86154 90.17 NA13.334118NA 2 2
837032NANA992211 1 0⋯64128 64.53 89 9.214613NA 2 2
837042 8NA 11533 4 5⋯38108 63.65 NA 4.854112NANANA
837051NA 4 43522 4125⋯64100 70.72 NA 7.034416NA 2 2
837062 0NA 6 644NA NA⋯NA NA NA NA10.00NANANANANA
83707213 2 31811 3 5⋯54106 74.26 NA13.834611NANANA
837081NANA 56433 5 0⋯74 94176.80 NA10.553928NA 2 2
837092NANA152433 1177⋯62116 86.63 NA 7.574619NA 2 2
837102NANA 1 244 1 0⋯NA NA NA NA NANANANANANA
837111NANA 73833 6 0⋯76110 68.95 6213.8739 6NA 2 2
837121NANA 96133 1160⋯70124 91.05 85 3.414313NA 2 2
837132NA 4 63456 5 66⋯66118 97.24 8010.254912NA 2 2
83714113NA 61933 5145⋯64112 65.42 NA 4.4845 9NANANA
837151NANA 35833 7189⋯74118 78.68 NA 6.093912NA 2 2
83716211NA151733 5 62⋯60104112.27 NA 3.254412NANANA
837172NA 6 68011 2 0⋯NA NA 69.8412529.603920NA 2 2
837182NANANA6044NA NA⋯72114 90.17 NA10.004411NA 2 9
837191NANA 8 311NA NA⋯NA NA NA NA NANANANANANA
837201NANA 83644NA NA⋯88130114.04 NA 6.774810NA 2 2
837212NA 3155233 1560⋯70108 98.12 NA 6.484412NA 2 2
837222NANA 7 057 6 0⋯NA NA NA NA NANANANANANA
837231NANA106111 7239⋯NA NA 71.6015710.834117NA 2 2
837241NANA 88033 3 8⋯70164114.04 NA 5.983826NA 2 2
837251 1NA 6 711NA NA⋯NA NA NA 8811.36NANANANANA
837261NA 5 94011NA NA⋯NA NA NA114 4.56NANANA 2 2
837272NANA772622 7 37⋯68110 97.24 NA 4.044913NA 2 2
837281NANA 8 211 5 0⋯NA NA NA NA NANANANANANA
837292NA 6 74244 6 12⋯82136 72.49 NA 5.134110NA 2 2
837302 0NA 6 722NA NA⋯NA NA NA NA 5.23NANANANANA
837311 5NA151156 6 12⋯68 94 NA NA 4.65NANANANANA

The summary of our feature data shows that much of it is categorical, with some numerical columns, so let's rename the columns to get a better idea of what we are dealing with. The shorthand codes are documented on the NHANES website.

       id            gender      education        time_in_us    
 Min.   :73557   Min.   :1.0   Min.   : 0.000   Min.   : 1.000  
 1st Qu.:76100   1st Qu.:1.0   1st Qu.: 2.000   1st Qu.: 4.000  
 Median :78644   Median :1.0   Median : 5.000   Median : 5.000  
 Mean   :78644   Mean   :1.5   Mean   : 6.162   Mean   : 8.838  
 3rd Qu.:81188   3rd Qu.:2.0   3rd Qu.: 9.000   3rd Qu.: 7.000  
 Max.   :83731   Max.   :2.0   Max.   :99.000   Max.   :99.000  
                               NA's   :7372     NA's   :8267    
 householdincome      age             race           race2          water      
 Min.   : 1.00   Min.   : 0.00   Min.   :1.000   Min.   :1.00   Min.   :1.000  
 1st Qu.: 5.00   1st Qu.:10.00   1st Qu.:2.000   1st Qu.:2.00   1st Qu.:3.000  
 Median : 7.00   Median :26.00   Median :3.000   Median :3.00   Median :5.000  
 Mean   :10.51   Mean   :31.48   Mean   :3.092   Mean   :3.29   Mean   :4.501  
 3rd Qu.:14.00   3rd Qu.:52.00   3rd Qu.:4.000   3rd Qu.:4.00   3rd Qu.:6.000  
 Max.   :99.00   Max.   :80.00   Max.   :5.000   Max.   :7.00   Max.   :7.000  
 NA's   :123                                                    NA's   :1392   
    caffiene           niacin           protein           diabp       
 Min.   :   0.00   Min.   :  0.215   Min.   :  0.00   Min.   :  0.00  
 1st Qu.:   0.00   1st Qu.: 13.583   1st Qu.: 45.78   1st Qu.: 58.00  
 Median :  25.00   Median : 20.196   Median : 66.05   Median : 66.00  
 Mean   :  93.34   Mean   : 23.509   Mean   : 74.54   Mean   : 65.77  
 3rd Qu.: 130.00   3rd Qu.: 29.152   3rd Qu.: 93.86   3rd Qu.: 76.00  
 Max.   :2448.00   Max.   :379.852   Max.   :869.49   Max.   :122.00  
 NA's   :1644      NA's   :1644      NA's   :1644     NA's   :3003    
     sysbp         creatinine      urinecreatinine      uacr        
 Min.   : 66.0   Min.   :  25.64   Min.   :  8.0   Min.   :   0.21  
 1st Qu.:106.0   1st Qu.:  61.00   1st Qu.: 65.0   1st Qu.:   5.02  
 Median :116.0   Median :  72.49   Median :112.0   Median :   7.78  
 Mean   :118.1   Mean   :  77.81   Mean   :127.6   Mean   :  41.91  
 3rd Qu.:128.0   3rd Qu.:  86.63   3rd Qu.:171.0   3rd Qu.:  15.29  
 Max.   :228.0   Max.   :1539.04   Max.   :659.0   Max.   :9000.00  
 NA's   :3003    NA's   :3622      NA's   :7485    NA's   :2123     
    albumin        bloodnitro        stones       failingkidney  
 Min.   :24.00   Min.   : 1.00   Min.   :  0.00   Min.   :1.000  
 1st Qu.:41.00   1st Qu.: 9.00   1st Qu.:  1.00   1st Qu.:2.000  
 Median :43.00   Median :12.00   Median :  1.00   Median :2.000  
 Mean   :42.82   Mean   :12.86   Mean   : 29.84   Mean   :1.977  
 3rd Qu.:45.00   3rd Qu.:15.00   3rd Qu.:  2.00   3rd Qu.:2.000  
 Max.   :56.00   Max.   :95.00   Max.   :999.00   Max.   :9.000  
 NA's   :3622    NA's   :3622    NA's   :9640     NA's   :4406   
 stonesboolean  
 Min.   :1.000  
 1st Qu.:2.000  
 Median :2.000  
 Mean   :1.921  
 3rd Qu.:2.000  
 Max.   :9.000  
 NA's   :4406   

Feature Selection¶

Now that we have a dataframe set with our features used to predict the kidney failure rate we should look at all of the NA values within the columns to see how to handle them on a feature by feature basis.
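Counting NAs per column can be done in one line of base R. The sketch below uses a toy frame (values are illustrative) and adds the share of missing rows, which is an extra not shown in the notebook output.

```r
# Toy frame with different amounts of missingness per column.
toy <- data.frame(
  id         = 1:6,
  sysbp      = c(122, 156, NA, 108, NA, 160),
  creatinine = c(106.96, NA, NA, NA, 64.53, 78.68),
  stones     = c(NA, NA, NA, NA, NA, 3)
)

# Number of NA values in each column.
na_counts <- colSums(is.na(toy))

# Same information as a small table, sorted from most to least missing.
na_table <- data.frame(
  n_missing     = na_counts,
  share_missing = round(na_counts / nrow(toy), 2)
)
na_table <- na_table[order(-na_table$n_missing), ]
print(na_table)
```

Sorting by missingness makes it easy to spot columns that are candidates for dropping rather than imputing.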

A data.frame: 22 × 1
                     .
                 <dbl>
id                   0
gender               0
education         7372
time_in_us        8267
householdincome    123
age                  0
race                 0
race2                0
water             1392
caffiene          1644
niacin            1644
protein           1644
diabp             3003
sysbp             3003
creatinine        3622
urinecreatinine   7485
uacr              2123
albumin           3622
bloodnitro        3622
stones            9640
failingkidney     4406
stonesboolean     4406
dim() output (rows × columns): 10175 × 22

It looks like a lot of data is missing for education level, time spent in the US, and how frequently people experience kidney stones. I believe the respective reasons are: people not wanting to disclose educational information (not having a high school degree, or only a GED); participants born in the US, for whom time in the US is not applicable; and participants who have not experienced kidney stones often enough to report a frequency. Since we still have a boolean value for having previously experienced kidney stones, I will drop the extra stones feature and fill education and time in the US with other methods.

A data.frame: 10175 × 20
id gender time_in_us householdincome age race race2 water caffiene niacin protein diabp sysbp creatinine urinecreatinine uacr albumin bloodnitro failingkidney stonesboolean
<int> <int> <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <int> <int> <dbl> <int> <dbl> <int> <int> <int> <int>
1735571NA 46944 220311.804 43.6372122106.96 NA 11.034110 2 2
2735581NA 75433 124065.396338.1362156 69.84 NA306.004716 2 2
3735591NA107233 6 4518.342 64.6190140107.85 NA 10.533714 1 2
4735601NA 9 933 3 021.903 77.7538108 NA NA 21.05NANANANA
5735611NA157333 1 2415.857 55.2486136 64.53 NA173.474331 1 2
6735621NA 95611 314417.119 55.1184160 78.68 NA166.224318 2 2
7735631NA15 033 3 NA NA NANA NA NA NA NANANANANA
8735642NA106133 7 429.342 91.1580118 81.33 NA 7.853917 2 2
9735651NA154222NA NA NA NANA NA NA NA NANANA 2 2
10735662NA 45633 226613.148 42.2674128 48.62 NA 7.2241 9 2 2
11735671NA 36533 7 4319.301 38.0978140 85.75 NA 16.284015 2 2
12735682NA152633 719923.003139.2160106 65.42 31 80.654512 2 2
13735692NA77 057NA NA NA NANA NA NA NA NANANANANA
14735702NA 5 957 2 4718.372 76.4044102 NA116 23.45NANANANA
15735711NA147633 726419.075 39.4068124105.20177 14.584317 2 1
16735722NA 21044 4 0 9.963 30.6554 88 NA NA 35.42NANANANA
17735731NA 81044NA NA NA NA62 94 NA NA 8.29NANANANA
18735742 4 83356 187281.974274.7256122 52.16173 7.514311 2 1
19735752NA 3 144 7 3 6.656 21.60NA NA NA NA NANANANANA
20735762NA 81644 6 014.930 48.9168108 78.68166 3.015114NANA
21735772 4 23211 121076.601144.9274118 60.11191 8.904517 2 2
22735782NA 51811 4 021.266 81.6158120 NA201 5.27NANANANA
23735792NA101233 1 020.340 81.5472108 50.39 NA 6.324710NANA
24735802NA123844 4 3621.680 87.3984124 63.65 NA 2.793810 2 2
25735811 4155056 7 2424.026 96.4280138 83.98 NA 4.954311 2 2
26735822NA 32344NA NA NA NA56 98 NA NA 7.61NANA 2 2
27735831NA15 733 7 0 9.927 25.81NA NA NA NA 19.21NANANANA
28735842NA 91333 1 0 7.823 13.1154108 52.16106 4.724214NANA
29735851 6 72856 4 9670.313285.8370106106.96 NA 2.704723 2 2
30735861NA 6 456 7 0 7.143 24.92NA NA NA NA NANANANANA
⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮
10146837022NA 78033 3 9511.648 56.0286154 90.17 NA13.334118 2 2
10147837032NA992211 1 016.870 58.9764128 64.53 89 9.214613 2 2
10148837042NA 11533 4 515.008 81.3638108 63.65 NA 4.854112NANA
10149837051 4 43522 412528.719 63.4064100 70.72 NA 7.034416 2 2
10150837062NA 6 644NA NA NA NANA NA NA NA10.00NANANANA
10151837072 2 31811 3 511.780 34.1854106 74.26 NA13.834611NANA
10152837081NA 56433 5 038.044 92.3874 94176.80 NA10.553928 2 2
10153837092NA152433 117715.413 47.7062116 86.63 NA 7.574619 2 2
10154837102NA 1 244 1 0 8.022 11.90NA NA NA NA NANANANANA
10155837111NA 73833 6 012.883 28.4376110 68.95 6213.8739 6 2 2
10156837121NA 96133 116016.420 66.1370124 91.05 85 3.414313 2 2
10157837132 4 63456 5 6626.604119.6366118 97.24 8010.254912 2 2
10158837141NA 61933 514515.292 65.3164112 65.42 NA 4.4845 9NANA
10159837151NA 35833 718922.995 85.3874118 78.68 NA 6.093912 2 2
10160837162NA151733 5 6221.254181.7460104112.27 NA 3.254412NANA
10161837172 6 68011 2 011.092 30.03NA NA 69.8412529.603920 2 2
10162837182NANA6044NA NA NA NA72114 90.17 NA10.004411 2 9
10163837191NA 8 311NA NA NA NANA NA NA NA NANANANANA
10164837201NA 83644NA NA NA NA88130114.04 NA 6.774810 2 2
10165837212 3155233 156035.698 35.0570108 98.12 NA 6.484412 2 2
10166837222NA 7 057 6 0 8.043 17.36NA NA NA NA NANANANANA
10167837231NA106111 723919.643 70.52NA NA 71.6015710.834117 2 2
10168837241NA 88033 3 848.232 77.0970164114.04 NA 5.983826 2 2
10169837251NA 6 711NA NA NA NANA NA NA 8811.36NANANANA
10170837261 5 94011NA NA NA NANA NA NA114 4.56NANA 2 2
10171837272NA772622 7 3768.311223.3268110 97.24 NA 4.044913 2 2
10172837281NA 8 211 5 011.309 47.55NA NA NA NA NANANANANA
10173837292 6 74244 6 1231.590 89.3782136 72.49 NA 5.134110 2 2
10174837302NA 6 722NA NA NA NANA NA NA NA 5.23NANANANA
10175837311NA151156 6 1219.119 96.0668 94 NA NA 4.65NANANANA

Imputation of NA's¶

Since we can see there are quite a few missing values within this dataframe, it is important to fill these values with various techniques to complete the set and allow the algorithm to function effectively.

  • Simple Imputation: We start with a normal fill for data where I would not assume much variance. For example, since the majority of people have not experienced kidney stones (resulting in many 0's) and household income was frequently unreported, this fill method suffices.

  • Median Imputation: I use the median() function to fill variables that should follow a normal distribution. Given the sample size of 9,800 participants, this is a statistically sound approach. I also applied the median for caffeine, niacin, and protein, as these vary widely; achieving a "middling-distribution" is key here.

  • Mode Imputation: I wrote a function to calculate the Mode to impute Race and Water Consumption. As this is a U.S. survey, the majority of participants are Non-Hispanic White. While water consumption could use mean imputation, the mode is more appropriate here as the discrete options for daily consumption align better with the population's most common answers.

  • Predictive Mean Matching (PMM): Finally, I use PMM—derived from the already imputed values—to estimate the "hard-hitting" features.
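The median and mode fills described above can be sketched in base R as follows. The helper names stat_mode() and impute_with() are hypothetical (base R has no built-in statistical mode function), and the toy values are for illustration only.

```r
# Hypothetical helper: most frequent non-missing value (first one wins ties).
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  values <- unique(x)
  values[which.max(tabulate(match(x, values)))]
}

# Hypothetical helper: replace NA entries with a single fill value.
impute_with <- function(x, fill) {
  x[is.na(x)] <- fill
  x
}

toy <- data.frame(
  protein = c(43.63, 338.13, NA, 77.75, 55.24, NA),  # continuous, skewed
  race    = c(3L, 3L, 4L, NA, 3L, 1L)                # coded category
)

# Median for the continuous column, mode for the categorical code.
toy$protein <- impute_with(toy$protein, median(toy$protein, na.rm = TRUE))
toy$race    <- impute_with(toy$race, stat_mode(toy$race))
print(toy)
```

The median resists outliers such as the 338.13 value, whereas a mean would be pulled toward it; for coded categories only the mode yields a value that is a valid code.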

Industry Tip: For solid prediction models, PMM or KNN should be used if you cannot drop NA values. Standard mean imputation is often considered a "lazy" way just to get a model moving. I have used a combination of methods to demonstrate versatility across different data types.


Understanding Data Missingness¶

In professional data science, the reason data is missing dictates the method of repair. My approach accounts for the three primary types:

  1. MCAR (Missing Completely at Random): Handled via Median/Mode.
  2. MAR (Missing at Random): Addressed through the multi-variable relationships in PMM.
  3. MNAR (Missing Not at Random): Acknowledged in medical contexts where a test wasn't performed because the patient didn't meet specific health criteria.
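A minimal sketch of a PMM run with the mice package on toy data follows. The iteration log in the notebook shows 10 iterations and 5 imputations, which is consistent with m = 5 and maxit = 10, but the exact call is not shown, so every argument value here (including the seed) is an assumption.

```r
library(mice)

# Toy data (not NHANES): two lab-like columns that depend on age.
set.seed(2024)
n <- 300
age <- round(runif(n, 20, 80))
toy <- data.frame(
  age        = age,
  sysbp      = 100 + 0.4 * age + rnorm(n, sd = 10),
  creatinine = 50 + 0.5 * age + rnorm(n, sd = 12)
)
toy$sysbp[sample(n, 60)]      <- NA
toy$creatinine[sample(n, 90)] <- NA

# m = 5 imputed sets, 10 iterations each, predictive mean matching.
# These values are inferred from the log, not taken from the author's code.
imp <- mice(toy, m = 5, maxit = 10, method = "pmm",
            seed = 123, printFlag = FALSE)

# One row per originally missing cell, one column per imputed set.
creatinine_draws <- imp$imp$creatinine
head(creatinine_draws)

# Extract the first completed data set.
completed <- mice::complete(imp, 1)
```

Because PMM borrows values from observed donors, every imputed number is one that actually occurs in the data, which keeps imputations within a plausible range.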
 iter imp variable
  1   1  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  1   2  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  1   3  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  1   4  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  1   5  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  2   1  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  2   2  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  2   3  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  2   4  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  2   5  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  3   1  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  3   2  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  3   3  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  3   4  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  3   5  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  4   1  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  4   2  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  4   3  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  4   4  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  4   5  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  5   1  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  5   2  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  5   3  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  5   4  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  5   5  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  6   1  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  6   2  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  6   3  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  6   4  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  6   5  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  7   1  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  7   2  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  7   3  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  7   4  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  7   5  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  8   1  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  8   2  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  8   3  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  8   4  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  8   5  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  9   1  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  9   2  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  9   3  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  9   4  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  9   5  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  10   1  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  10   2  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  10   3  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  10   4  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
  10   5  creatinine  urinecreatinine  uacr  bloodnitro  failingkidney
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  25.64   61.00   72.49   77.81   86.63 1539.04    3622 
A data.frame: 3622 × 5
             1        2        3        4        5
         <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
    4    88.40    76.02    83.98    61.88    56.58
    7    47.74    65.42    67.18    38.90    55.69
    9    65.42   114.92    89.28    75.14    62.76
   13    64.53    68.07    68.95    68.95    37.13
   14    54.81    55.69    54.81    91.94    51.27
   16    93.70    58.34    79.56    77.79    92.82
   17    70.72   103.43    68.07    86.63   108.73
   19    34.48    66.30    74.26    80.44    53.92
   22    46.85    69.84    70.72    74.26    40.66
   26    92.82    77.79    45.97    66.30    61.88
   27    64.53    67.18    90.17    61.00    60.11
   30    61.00    38.01    74.26    63.65    44.20
   32    82.21    83.10    72.49    83.10    62.76
   34    52.16    40.66    58.34    64.53    58.34
   35    80.44    82.21    43.32    61.88   106.08
   37    66.30    68.95    68.07    83.98    71.60
   41   282.88   126.41   127.30    65.42   100.78
   46    71.60    78.68    71.60    68.95    59.23
   50   106.96    76.02    54.81    48.62    72.49
   52    75.14    64.53    94.59    83.10    71.60
   53    68.95    64.53    56.58    57.46    55.69
   55    88.40    69.84    53.92    62.76    57.46
   56    74.26    83.10    76.91    70.72    50.39
   61    64.53    68.07    57.46    82.21    38.90
   64    80.44    61.88    54.81    57.46    76.02
   69    34.48    53.92    38.90    54.81    64.53
   70    86.63    67.18    68.07    72.49   102.54
   71    56.58    63.65   100.78    58.34    75.14
   74    52.16    58.34    98.12    89.28    74.26
   78   114.92    90.17    70.72    75.14    81.33
    ⋮        ⋮        ⋮        ⋮        ⋮        ⋮
10079   119.34   100.78    78.68    44.20    62.76
10085    68.95    87.52   114.92    76.02    57.46
10088    61.88    89.28    63.65    67.18    60.11
10089    83.10    38.90    45.97    66.30    57.46
10093   102.54    61.88    51.27    60.11    51.27
10094   282.88   139.67   556.92  1103.23   146.74
10095    45.97    60.11    61.00    93.70    76.91
10096    88.40    55.69    68.95    83.10    79.56
10097    89.28    79.56    68.07    83.98    99.01
10098    95.47    90.17    56.58    59.23    68.95
10100    87.52    70.72    69.84    44.20    99.89
10101    56.58    87.52    76.02    38.01   102.54
10105    90.17    66.30    70.72    73.37   100.78
10114    61.00    45.97    70.72    48.62    57.46
10125    39.78    67.18    79.56    71.60    59.23
10129    54.81    60.11    62.76    53.92    40.66
10130    87.52    54.81    91.05    79.56    91.05
10134    65.42    56.58    85.75    63.65    35.36
10139    36.24    67.18    60.11    71.60    86.63
10141    61.00    64.53    72.49    57.46    61.00
10142    70.72    53.04    64.53    57.46    75.14
10150    45.97    65.42    40.66    69.84    55.69
10154    63.65    68.95    61.88    85.75    65.42
10163    89.28    74.26    57.46    46.85    78.68
10166    51.27    83.10    58.34    42.43    40.66
10169    76.02    59.23    61.00    71.60    98.12
10170    73.37    72.49    59.23    52.16    45.97
10172    59.23    74.26    45.97    56.58    55.69
10174    82.21    45.08    50.39    83.98    68.07
10175    72.49    76.02    81.33    49.50    39.78

Cleaned data¶

From this output, the imputed dataset I'm going with is set 1. Its values hover around the mean and median of the observed creatinine values (mean 77.81, median 72.49), and it keeps a high variance without the very low and very high values that shift the mean in set 3. I think it will serve both the training and test sets well for our purposes.
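The comparison behind that choice can be sketched in base R. This is a minimal illustration, not the notebook's original code: it uses only the first five rows of the imputed creatinine table above and the observed mean and median from the summary output, and the names `imputed` and `set_summary` are hypothetical. Five rows say nothing reliable about the full 3,622 imputed rows; the point is only the shape of the check.

```r
# Observed creatinine summary, copied from the summary output above.
observed_mean   <- 77.81
observed_median <- 72.49

# First five rows (4, 7, 9, 13, 14) of the five imputed creatinine sets,
# copied from the table above. Illustrative subset only.
imputed <- data.frame(
  set1 = c(88.40, 47.74,  65.42, 64.53, 54.81),
  set2 = c(76.02, 65.42, 114.92, 68.07, 55.69),
  set3 = c(83.98, 67.18,  89.28, 68.95, 54.81),
  set4 = c(61.88, 38.90,  75.14, 68.95, 91.94),
  set5 = c(56.58, 55.69,  62.76, 37.13, 51.27)
)

# One row of summary statistics per imputed set.
set_summary <- data.frame(
  mean   = sapply(imputed, mean),
  median = sapply(imputed, median),
  sd     = sapply(imputed, sd),
  max    = sapply(imputed, max)
)

# Distance of each set's mean and median from the observed values.
set_summary$mean_gap   <- abs(set_summary$mean - observed_mean)
set_summary$median_gap <- abs(set_summary$median - observed_median)

print(round(set_summary, 2))
```

On the full data the same summary would be run over all imputed rows, and a set with a small gap to the observed centre and no extreme maximum is the natural pick.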

       id            gender      time_in_us     householdincome      age       
 Min.   :73557   Min.   :1.0   Min.   :0.0000   Min.   : 1.00   Min.   : 0.00  
 1st Qu.:76100   1st Qu.:1.0   1st Qu.:0.0000   1st Qu.: 5.00   1st Qu.:10.00  
 Median :78644   Median :1.0   Median :0.0000   Median : 7.00   Median :26.00  
 Mean   :78644   Mean   :1.5   Mean   :0.1875   Mean   :10.48   Mean   :31.48  
 3rd Qu.:81188   3rd Qu.:2.0   3rd Qu.:0.0000   3rd Qu.:14.00   3rd Qu.:52.00  
 Max.   :83731   Max.   :2.0   Max.   :1.0000   Max.   :99.00   Max.   :80.00  
      race           race2          water          caffiene     
 Min.   :1.000   Min.   :1.00   Min.   :1.000   Min.   :   0.0  
 1st Qu.:2.000   1st Qu.:2.00   1st Qu.:3.000   1st Qu.:   2.0  
 Median :3.000   Median :3.00   Median :6.000   Median :  25.0  
 Mean   :3.092   Mean   :3.29   Mean   :4.706   Mean   :  82.3  
 3rd Qu.:4.000   3rd Qu.:4.00   3rd Qu.:6.000   3rd Qu.: 102.0  
 Max.   :5.000   Max.   :7.00   Max.   :7.000   Max.   :2448.0  
     niacin           protein           diabp            sysbp      
 Min.   :  0.215   Min.   :  0.00   Min.   :  0.00   Min.   : 66.0  
 1st Qu.: 14.829   1st Qu.: 49.63   1st Qu.: 62.00   1st Qu.:110.0  
 Median : 20.196   Median : 66.05   Median : 66.00   Median :116.0  
 Mean   : 22.974   Mean   : 73.17   Mean   : 65.84   Mean   :117.5  
 3rd Qu.: 26.979   3rd Qu.: 87.37   3rd Qu.: 72.00   3rd Qu.:122.0  
 Max.   :379.852   Max.   :869.49   Max.   :122.00   Max.   :228.0  
   creatinine      urinecreatinine      uacr            albumin     
 Min.   :  25.64   Min.   :  8.0   Min.   :   0.21   Min.   :24.00  
 1st Qu.:  59.23   1st Qu.: 68.0   1st Qu.:   5.00   1st Qu.:42.00  
 Median :  70.72   Median :117.0   Median :   7.78   Median :43.00  
 Mean   :  74.75   Mean   :131.9   Mean   :  42.15   Mean   :42.88  
 3rd Qu.:  83.98   3rd Qu.:177.0   3rd Qu.:  15.17   3rd Qu.:44.00  
 Max.   :1539.04   Max.   :659.0   Max.   :9000.00   Max.   :56.00  
   bloodnitro    failingkidney   stonesboolean  
 Min.   : 1.00   Min.   :1.000   Min.   :1.000  
 1st Qu.: 8.00   1st Qu.:2.000   1st Qu.:2.000  
 Median :11.00   Median :2.000   Median :2.000  
 Mean   :11.77   Mean   :1.974   Mean   :1.921  
 3rd Qu.:14.00   3rd Qu.:2.000   3rd Qu.:2.000  
 Max.   :95.00   Max.   :2.000   Max.   :9.000  
A data.frame: 6 × 20
      id  gender  time_in_us  householdincome  age  race  race2  water  caffiene  niacin
   <int>   <int>       <dbl>            <int> <int> <int>  <int>  <int>     <int>   <dbl>
1  73557       1           0                4   69     4      4      2       203  11.804
2  73558       1           0                7   54     3      3      1       240  65.396
3  73559       1           0               10   72     3      3      6        45  18.342
4  73560       1           0                9    9     3      3      3         0  21.903
5  73561       1           0               15   73     3      3      1        24  15.857
6  73562       1           0                9   56     1      1      3       144  17.119

   protein  diabp  sysbp  creatinine  urinecreatinine    uacr  albumin  bloodnitro  failingkidney  stonesboolean
     <dbl>  <int>  <int>       <dbl>            <int>   <dbl>    <int>       <int>          <int>          <int>
1    43.63     72    122      106.96               77   11.03       41          10              2              2
2   338.13     62    156       69.84               93  306.00       47          16              2              2
3    64.61     90    140      107.85               59   10.53       37          14              1              2
4    77.75     38    108       88.40              247   21.05       43           9              2              2
5    55.24     86    136       64.53               58  173.47       43          31              1              2
6    55.11     84    160       78.68              216  166.22       43          18              2              2

Feature Selection¶

Now that we have our test data, I ran a summary to see whether it resembles the original dataset. The set has similar values and about a 2.6% incidence of kidney failure, like the original data. However, the minimums of bloodnitro and urinecreatinine are large negative values, which are impossible for lab measurements. We need to switch the imputation method so that only positive values come through.

After going back, setting the seed, and switching from normal-distribution imputation to predictive mean matching (pmm), I retrieved good values with a 2.5% incidence of kidney failure. Because pmm fills each gap with a value borrowed from an observed record, it cannot produce negative lab values. The incidence is a bit lower than the roughly 3% in the original data, but it is still good for prediction: kidney disease in the general population is estimated at about 15%, and this is just a sample of generally healthy individuals willing to respond to a survey.
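The switch to pmm with a fixed seed can be sketched with the mice package as follows. This is a sketch under stated assumptions rather than the notebook's original call: it runs on `nhanes`, the small example dataset bundled with mice (not the full NHANES extract used here), and the seed value and the number of imputations are arbitrary choices for illustration.

```r
library(mice)  # assumes the mice package is installed

# `nhanes` ships with mice: 25 rows with NAs in bmi, hyp and chl.
# It stands in for the much larger data frame used in this notebook.
data(nhanes, package = "mice")

# Five imputed sets, predictive mean matching, fixed seed so the
# imputations are reproducible. The seed value is arbitrary.
imp <- mice(nhanes, m = 5, method = "pmm", seed = 123, printFlag = FALSE)

# Extract the first completed dataset (imputation set 1).
completed <- mice::complete(imp, 1)

# pmm borrows values from observed donors, so every imputed chl value
# should already occur among the observed chl values.
observed_chl <- nhanes$chl[!is.na(nhanes$chl)]
imputed_chl  <- completed$chl[is.na(nhanes$chl)]

print(all(imputed_chl %in% observed_chl))
print(summary(completed$chl))
```

The donor check is the reason pmm removes the negative minimums: an imputed value can never fall outside the range of what was actually measured.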

Now that we have our test data, let's look for other features we can drop. With 19 features currently, it is best to start with a correlation matrix to check for high correlations, follow with a variance check, and finish with a final redundancy pass to get the data closer to 10-15 strong predictors.

Correlation Matrix¶

First, we dropped stones because of the number of nulls present, then education for its lack of variance and excessive NAs. Let's compute the correlation matrix to see whether any pair of features is too strongly correlated (usually above 0.75 in absolute value).

                      gender   time_in_us householdincome          age
gender           1.000000000 -0.070957298    -0.046330235 -0.082621030
time_in_us      -0.070957298  1.000000000     0.070156206  0.250952173
householdincome -0.046330235  0.070156206     1.000000000  0.016968739
age             -0.082621030  0.250952173     0.016968739  1.000000000
race             0.037428882  0.006511826    -0.024476058  0.033181697
race2            0.018964993  0.069458756    -0.009630431  0.003366239
water            0.006417464  0.044031784     0.025873527 -0.090197910
caffiene        -0.048288863  0.023033037    -0.010892899  0.333588149
niacin          -0.064593314  0.038004931     0.020672666  0.104217369
protein         -0.054792397  0.086996195     0.030088611  0.133614525
diabp           -0.051381249  0.102191198     0.012517244  0.246988021
sysbp           -0.041467206  0.079805562    -0.020232133  0.469085841
creatinine      -0.072448777 -0.028408331    -0.010199230  0.245916420
urinecreatinine  0.001079895 -0.086689244    -0.034533930 -0.094212102
uacr            -0.013128156  0.016765898    -0.007049332  0.088496324
albumin         -0.058780716  0.021094993     0.042399173 -0.215474694
bloodnitro      -0.066532038  0.086720558     0.035360495  0.470218255
failingkidney    0.003707065  0.003385810     0.006910452 -0.094526714
stonesboolean    0.010721949  0.024160363     0.013036832 -0.018995284
                        race        race2        water     caffiene
gender           0.037428882  0.018964993  0.006417464 -0.048288863
time_in_us       0.006511826  0.069458756  0.044031784  0.023033037
householdincome -0.024476058 -0.009630431  0.025873527 -0.010892899
age              0.033181697  0.003366239 -0.090197910  0.333588149
race             1.000000000  0.968422645  0.018338264 -0.029153653
race2            0.968422645  1.000000000  0.029382992 -0.040223845
water            0.018338264  0.029382992  1.000000000 -0.069299776
caffiene        -0.029153653 -0.040223845 -0.069299776  1.000000000
niacin          -0.010211471 -0.014517053 -0.017193084  0.241651730
protein         -0.020089961 -0.019974394 -0.021333536  0.163381075
diabp            0.024428765  0.020131955 -0.054758404  0.146285281
sysbp            0.027894218  0.008789197 -0.091343046  0.138699937
creatinine       0.068712539  0.038795467 -0.044335298  0.076117292
urinecreatinine  0.099162535  0.063330033 -0.014658323 -0.066392552
uacr            -0.004316759 -0.006724843 -0.017678085  0.005019089
albumin         -0.038297602 -0.012912400 -0.020615979 -0.017004475
bloodnitro      -0.007119886 -0.017275327 -0.066284352  0.121141917
failingkidney    0.002493181  0.001898238  0.025811126 -0.011766013
stonesboolean    0.006597954  0.006142984 -0.003506018 -0.023592773
                       niacin      protein        diabp        sysbp
gender          -0.0645933137 -0.054792397 -0.051381249 -0.041467206
time_in_us       0.0380049310  0.086996195  0.102191198  0.079805562
householdincome  0.0206726655  0.030088611  0.012517244 -0.020232133
age              0.1042173689  0.133614525  0.246988021  0.469085841
race            -0.0102114706 -0.020089961  0.024428765  0.027894218
race2           -0.0145170528 -0.019974394  0.020131955  0.008789197
water           -0.0171930844 -0.021333536 -0.054758404 -0.091343046
caffiene         0.2416517295  0.163381075  0.146285281  0.138699937
niacin           1.0000000000  0.748998797  0.049268557  0.023011049
protein          0.7489987969  1.000000000  0.055128755  0.033175259
diabp            0.0492685569  0.055128755  1.000000000  0.431022979
sysbp            0.0230110492  0.033175259  0.431022979  1.000000000
creatinine       0.0522865710  0.057378289  0.044675209  0.146271368
urinecreatinine  0.0005421783 -0.008769935  0.012549420 -0.006899802
uacr            -0.0018907787 -0.008434662  0.026052555  0.151907488
albumin          0.0937018082  0.083605458 -0.046720161 -0.122186102
bloodnitro       0.1019835607  0.152868831  0.037840350  0.220577543
failingkidney   -0.0057558518 -0.012458321  0.047009889 -0.044141730
stonesboolean    0.0026000087 -0.006544445  0.009182641 -0.003186755
                 creatinine urinecreatinine         uacr     albumin
gender          -0.07244878    0.0010798949 -0.013128156 -0.05878072
time_in_us      -0.02840833   -0.0866892438  0.016765898  0.02109499
householdincome -0.01019923   -0.0345339303 -0.007049332  0.04239917
age              0.24591642   -0.0942121017  0.088496324 -0.21547469
race             0.06871254    0.0991625351 -0.004316759 -0.03829760
race2            0.03879547    0.0633300332 -0.006724843 -0.01291240
water           -0.04433530   -0.0146583234 -0.017678085 -0.02061598
caffiene         0.07611729   -0.0663925522  0.005019089 -0.01700447
niacin           0.05228657    0.0005421783 -0.001890779  0.09370181
protein          0.05737829   -0.0087699347 -0.008434662  0.08360546
diabp            0.04467521    0.0125494199  0.026052555 -0.04672016
sysbp            0.14627137   -0.0068998019  0.151907488 -0.12218610
creatinine       1.00000000    0.0555276970  0.445551908 -0.07669201
urinecreatinine  0.05552770    1.0000000000 -0.045688757  0.06350680
uacr             0.44555191   -0.0456887571  1.000000000 -0.17442082
albumin         -0.07669201    0.0635068036 -0.174420819  1.00000000
bloodnitro       0.54673934   -0.0404925656  0.276554972 -0.07161760
failingkidney   -0.30758735    0.0647795285 -0.232206084  0.07911537
stonesboolean   -0.02766800    0.0611988893 -0.029704753  0.03025162
                  bloodnitro failingkidney stonesboolean
gender          -0.066532038   0.003707065   0.010721949
time_in_us       0.086720558   0.003385810   0.024160363
householdincome  0.035360495   0.006910452   0.013036832
age              0.470218255  -0.094526714  -0.018995284
race            -0.007119886   0.002493181   0.006597954
race2           -0.017275327   0.001898238   0.006142984
water           -0.066284352   0.025811126  -0.003506018
caffiene         0.121141917  -0.011766013  -0.023592773
niacin           0.101983561  -0.005755852   0.002600009
protein          0.152868831  -0.012458321  -0.006544445
diabp            0.037840350   0.047009889   0.009182641
sysbp            0.220577543  -0.044141730  -0.003186755
creatinine       0.546739342  -0.307587353  -0.027667998
urinecreatinine -0.040492566   0.064779528   0.061198889
uacr             0.276554972  -0.232206084  -0.029704753
albumin         -0.071617599   0.079115367   0.030251624
bloodnitro       1.000000000  -0.242272649  -0.030230486
failingkidney   -0.242272649   1.000000000   0.040316542
stonesboolean   -0.030230486   0.040316542   1.000000000
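
The 0.75 screen can be sketched in base R. The sub-matrix below copies four features from the output above, rounded to three decimals; the names `cm`, `cutoff` and `high_pairs` are hypothetical, and on the full data frame the matrix would come from `cor()` instead of being typed in.

```r
# Sub-matrix copied (rounded to 3 decimals) from the correlation output
# above for four of the features.
vars <- c("race", "race2", "niacin", "protein")
cm <- matrix(
  c( 1.000,  0.968, -0.010, -0.020,
     0.968,  1.000, -0.015, -0.020,
    -0.010, -0.015,  1.000,  0.749,
    -0.020, -0.020,  0.749,  1.000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

cutoff <- 0.75

# Look only above the diagonal so each pair is reported once and the
# trivial self-correlations of 1 are skipped.
hits <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)

high_pairs <- data.frame(
  feature_a = rownames(cm)[hits[, "row"]],
  feature_b = colnames(cm)[hits[, "col"]],
  r         = cm[hits]
)

print(high_pairs)
```

With these values only race and race2 exceed the cutoff; niacin and protein sit just under it at 0.749. The caret package also offers `findCorrelation()` for the same kind of screen.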

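Most of these features are only weakly correlated with failing kidneys. The strongest correlations are negative: creatinine (-0.31), blood nitrogen (bloodnitro, -0.24), and the urine albumin-to-creatinine ratio (uacr, -0.23). The negative sign follows from the coding: failingkidney runs from 1 to 2, and the lower value marks the roughly 2.6% of respondents with kidney failure, so higher lab values go with failure. That is consistent with the literature. Surprisingly, diastolic and systolic blood pressure are barely correlated with the outcome in our data (0.05 and -0.04), even though kidney stress or failure typically comes with higher blood pressure. But with creatinine and nitrogen output giving direct measures of how the kidneys are functioning, I think we may be on to something.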

Variance ranking from MLR¶

Let's plot the importance of each variable with respect to the estimation of kidney failure.

ROC curve variable importance

                Importance
creatinine          0.7098
bloodnitro          0.6922
age                 0.6613
uacr                0.6469
urinecreatinine     0.6232
albumin             0.6035
diabp               0.5699
householdincome     0.5481
water               0.5481
sysbp               0.5480
stonesboolean       0.5476
id                  0.5142
gender              0.5059
race                0.5056
time_in_us          0.5042
race2               0.5038
protein             0.5034
niacin              0.5009
caffiene            0.5008
[Figure: variable importance plot]

Now that this regression has completed, one result may not come as a surprise: if a feature's predictive power is on par with a randomly assigned ID, it is probably not a good predictor. An importance near 0.5 is no better than chance, and id itself scores 0.5142. The features at and below id also share a common thread: they are indicators of long-term kidney health, not necessarily guarantees of kidney failure.

  • gender
  • race (Hispanic/non-Hispanic) and race2 (Black, Asian, White)
  • protein (intake in the last week)
  • niacin (unless it is supplemented, who knows the weekly consumption?)
  • caffiene (caffeine consumption; it is surprising that this one does not predict)
  • time_in_us (if race does not predict, American versus non-American is unlikely to give us additional information)

Most of these are lifestyle habits that accumulate over time to create kidney scarring and damage, or to support kidney function (niacin). But this is not time-series data. At a glance, anyone who has already suffered kidney failure, as asked in the questionnaire, would probably not be participating in most of these habits. Drop 'em.
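The "ROC curve variable importance" heading suggests that each score is the area under the ROC curve (AUC) for a single predictor. Assuming that reading is right, the base R sketch below shows why a random identifier lands near 0.5 while a genuinely informative marker scores much higher. The data are synthetic and the function names `auc` and `auc_importance` are hypothetical helpers, not part of caret.

```r
# AUC via the rank-sum (Mann-Whitney) identity: the probability that a
# randomly chosen positive case has a larger value than a negative one.
auc <- function(x, is_positive) {
  r  <- rank(x)                      # average ranks handle ties
  n1 <- sum(is_positive)
  n0 <- sum(!is_positive)
  (sum(r[is_positive]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Direction-free importance: 0.5 is chance, 1 is perfect separation.
auc_importance <- function(x, is_positive) {
  a <- auc(x, is_positive)
  max(a, 1 - a)
}

# Synthetic data, for illustration only (not NHANES values).
set.seed(42)
n       <- 2000
failing <- runif(n) < 0.05
toy <- data.frame(
  id     = sample(n),                              # arbitrary identifier
  marker = rnorm(n, mean = ifelse(failing, 2, 0))  # shifted in failing cases
)

importance <- sapply(toy, auc_importance, is_positive = failing)
print(round(sort(importance, decreasing = TRUE), 3))
```

Any feature whose score is indistinguishable from that of `id` carries no more ranking information about the outcome than a label assigned at random.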

A data.frame: 10175 × 12
       householdincome  age  water  diabp  sysbp  creatinine  urinecreatinine    uacr  albumin  bloodnitro  stonesboolean  failingkidney
                 <int><int>  <int>  <int>  <int>       <dbl>            <int>   <dbl>    <int>       <int>          <int>          <fct>
    1                4   69      2     72    122      106.96               77   11.03       41          10              2              0
    2                7   54      1     62    156       69.84               93  306.00       47          16              2              0
    3               10   72      6     90    140      107.85               59   10.53       37          14              2              1
    4                9    9      3     38    108       88.40              247   21.05       43           9              2              0
    5               15   73      1     86    136       64.53               58  173.47       43          31              2              1
    6                9   56      3     84    160       78.68              216  166.22       43          18              2              0
    7               15    0      3     66    116       47.74              127   16.00       43          13              2              0
    8               10   61      7     80    118       81.33              115    7.85       39          17              2              0
    9               15   42      6     66    116       65.42              237    5.00       43          13              2              0
   10                4   56      2     74    128       48.62              114    7.22       41           9              2              0
   11                3   65      7     78    140       85.75               19   16.28       40          15              2              0
   12               15   26      7     60    106       65.42               31   80.65       45          12              2              0
   13               77    0      6     66    116       64.53               30    4.36       43           7              2              0
   14                5    9      2     44    102       54.81              116   23.45       43           8              2              0
   15               14   76      7     68    124      105.20              177   14.58       43          17              1              0
   16                2   10      4     54     88       93.70               73   35.42       43           9              1              0
   17                8   10      6     62     94       70.72               86    8.29       43          12              1              0
   18                8   33      1     56    122       52.16              173    7.51       43          11              1              0
   19                3    1      7     66    116       34.48              153    7.61       43          12              1              0
   20                8   16      6     68    108       78.68              166    3.01       51          14              1              0
   21                2   32      1     74    118       60.11              191    8.90       45          17              2              0
   22                5   18      4     58    120       46.85              201    5.27       43          12              2              0
   23               10   12      1     72    108       50.39              143    6.32       47          10              2              0
   24               12   38      4     84    124       63.65              120    2.79       38          10              2              0
   25               15   50      7     80    138       83.98              101    4.95       43          11              2              0
   26                3   23      6     56     98       92.82              205    7.61       43          10              2              0
   27               15    7      7     66    116       64.53              120   19.21       43           6              2              0
   28                9   13      1     54    108       52.16              106    4.72       42          14              2              0
   29                7   28      4     70    106      106.96              206    2.70       47          23              2              0
   30                6    4      7     66    116       61.00              120    6.53       43           8              2              0
    ⋮
10146                7   80      3     86    154       90.17               56   13.33       41          18              2              0
10147               99   22      1     64    128       64.53               89    9.21       46          13              2              0
10148                1   15      4     38    108       63.65               88    4.85       41          12              2              0
10149                4   35      4     64    100       70.72              201    7.03       44          16              2              0
10150                6    6      6     66    116       45.97              123   10.00       43           7              2              0
10151                3   18      3     54    106       74.26              156   13.83       46          11              2              0
10152                5   64      5     74     94      176.80              239   10.55       39          28              2              0
10153               15   24      1     62    116       86.63               93    7.57       46          19              2              0
10154                1    2      1     66    116       63.65              381   35.66       43          10              2              0
10155                7   38      6     76    110       68.95               62   13.87       39           6              2              0
10156                9   61      1     70    124       91.05               85    3.41       43          13              2              0
10157                6   34      5     66    118       97.24               80   10.25       49          12              2              0
10158                6   19      5     64    112       65.42               81    4.48       45           9              2              0
10159                3   58      7     74    118       78.68               45    6.09       39          12              2              0
10160               15   17      5     60    104      112.27               91    3.25       44          12              2              0
10161                6   80      2     66    116       69.84              125   29.60       39          20              2              0
10162                6   60      6     72    114       90.17              659   10.00       44          11              9              0
10163                8    3      6     66    116       89.28              149    5.53       43           7              9              0
10164                8   36      6     88    130      114.04              367    6.77       48          10              2              0
10165               15   52      1     70    108       98.12              434    6.48       44          12              2              0
10166                7    0      6     66    116       51.27              120    6.60       43          10              2              0
10167               10   61      7     66    116       71.60              157   10.83       41          17              2              0
10168                8   80      3     70    164      114.04              192    5.98       38          26              2              0
10169                6    7      6     66    116       76.02               88   11.36       43           7              2              0
10170                9   40      6     66    116       73.37              114    4.56       43          10              2              0
10171               77   26      7     68    110       97.24              173    4.04       49          13              2              0
10172                8    2      5     66    116       59.23               91   14.27       43          13              2              0
10173                7   42      6     82    136       72.49              143    5.13       41          10              2              0
10174                6    7      6     66    116       82.21              280    5.23       43          10              2              0
10175               15   11      6     68     94       72.49               61    4.65       43           7              2              0

Random Forest RFE¶

Although we reached our goal of 10-15 predictive features (11), I think it would be beneficial to run a random forest Recursive Feature Elimination (RFE) to see whether any of the remaining features are obvious eliminations, which would speed up the machine learning process (on a laptop, the previous regression took about 10 minutes to run).

Recursive feature selection

Outer resampling method: Cross-Validated (10 fold) 

Resampling performance over subset size:

 Variables Accuracy  Kappa AccuracySD KappaSD Selected
         1   0.9759 0.2283   0.002891 0.09754         
         2   0.9752 0.2188   0.003363 0.12320         
         3   0.9760 0.2334   0.002497 0.10780         
         4   0.9760 0.2507   0.002499 0.10997         
         5   0.9764 0.2492   0.002618 0.12167        *
         6   0.9763 0.2412   0.002514 0.10984         
         7   0.9761 0.2412   0.002272 0.08857         
         8   0.9758 0.2253   0.002586 0.09853         
         9   0.9761 0.2508   0.002226 0.08428         
        10   0.9762 0.2436   0.002733 0.09494         
        11   0.9760 0.2415   0.002503 0.08900         

The top 5 variables (out of 5):
   creatinine, bloodnitro, age, diabp, urinecreatinine

  1. 'creatinine'
  2. 'bloodnitro'
  3. 'age'
  4. 'diabp'
  5. 'urinecreatinine'
[Figure: plot of the RFE results]

This recursion says that creatinine, bloodnitro, age, diastolic blood pressure (diabp), and urinecreatinine are the strongest predictors for the model, and that adding more features can reduce predictive power. However, every subset size scores roughly 97.5% to 97.6% accuracy, and those differences are smaller than the cross-validation standard deviation (about 0.2 to 0.3 percentage points). Adding the weaker predictors for their variance should still benefit the final model, so we will keep all 11 predictors for the next steps of the process. Let's write the complete data out to a .csv and move to the next notebook.
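For reference, an RFE run of this shape could be set up with caret roughly as follows. This is a sketch, not the notebook's original call: the data are synthetic, the variable names are hypothetical, and it assumes the caret and randomForest packages are installed. The 10-fold cross-validation matches the "Cross-Validated (10 fold)" line in the output above.

```r
library(caret)  # assumes caret and randomForest are installed

# Synthetic data, for illustration only: two informative predictors
# and two pure-noise predictors.
set.seed(1)
n <- 300
x <- data.frame(
  signal1 = rnorm(n),
  signal2 = rnorm(n),
  noise1  = rnorm(n),
  noise2  = rnorm(n)
)
y <- factor(ifelse(x$signal1 + x$signal2 + rnorm(n, sd = 0.5) > 0, "yes", "no"))

# Random forest helper functions with 10-fold cross-validation.
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 10)

# Evaluate subsets of 1 to 4 predictors.
rfe_fit <- rfe(x, y, sizes = 1:4, rfeControl = ctrl)

print(rfe_fit)              # accuracy and kappa per subset size
print(predictors(rfe_fit))  # variables in the selected subset
```

The per-size table is also available as `rfe_fit$results`, which makes it easy to compare the accuracy gaps between subset sizes against their standard deviations before deciding how many predictors to keep.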

Conclusion¶

That concludes this first portion of the model. According to the medical literature I referenced, the final feature set aligns closely with the strongest known predictors of kidney disease, with one notable exception: I chose not to include diabetes in this specific model. While diabetes is a primary driver of renal issues, the comorbidities associated with chronic illness are vast and complex. I wanted to focus this model on features that are largely intervenable or actionable, with the exception of age. By narrowing the scope this way, the model becomes a more focused tool for looking at physiological markers directly.

At this stage, the notebook runtime is nearing the one-hour mark on my laptop, and the dataset is now fully cleaned, imputed, and tapered down to its most essential predictors. The next logical step is to transition to a fresh notebook to perform a final check for feature collinearity, split the data into training and test sets, and begin running the actual model predictions.

Thank you for taking the time to read through this extensive preprocessing and cleaning phase! If you have any tips for optimizing R code or suggestions for machine learning techniques that could help speed up the runtime of the MICE or Caret packages in this environment, please share them in the comments.

References¶

  1. https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Questionnaire&CycleBeginYear=2013

  2. https://www.cdc.gov/kidneydisease/publications-resources/kidney-tests.html

  3. https://www.niddk.nih.gov/health-information/health-statistics/kidney-disease

  4. https://machinelearningmastery.com/feature-selection-with-the-caret-r-package/