Kidney Failure Prediction

Chronic kidney disease (CKD) and kidney failure affect millions of people each year, and early prediction can help improve treatment and outcomes. In a previous notebook, I combined multiple datasets from the CDC and National Center for Health Statistics (NCHS) to create a unified, clean dataset for predicting kidney failure.

The final dataset included key health indicators such as:

Demographics & Lifestyle: Household income, age, daily water intake

Vitals: Systolic and diastolic blood pressure

Lab Measurements: Blood creatinine, urine creatinine, urine albumin-creatinine ratio (UACR), blood albumin, blood nitrogen levels

Medical History: Kidney stone history (boolean indicator)

I brought the data together and cleaned/imputed it in RStudio using packages such as janitor, ggplot2, mice, and the almighty tidyverse. I also ran the Variance Inflation Factor (VIF) over the final set, using it as a multicollinearity measure to remove redundant features.

The features went through different forms of imputation, from basic mean/median/mode fills to Predictive Mean Matching (PMM), to fill in missing values. These datasets were very sparsely populated for the outcomes of interest, and smaller still for individuals who actually had kidney failure reported.

Using imputation, and PMM in particular, over the whole set in the first notebook is a form of data leakage: PMM borrowed values across rows before the data was split into Train/Test/Valid. However, given the model results seen here and the scarcity of the set, it was necessary to push the modelling forward while acknowledging the limitation.
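A leakage-safe alternative can be sketched with scikit-learn: fit the imputer inside a pipeline so fill values are learned from training rows only. This is a minimal sketch with synthetic data, and median imputation stands in for PMM.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the sparse dataset described above.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
X[rng.random(X.shape) < 0.2] = np.nan          # ~20% missing values
y = (rng.random(500) < 0.1).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# The imputer is fit on the training split only, so test rows never
# influence the fill values -- avoiding the leakage described above.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
print("held-out accuracy:", pipe.score(X_test, y_test))
```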

Statistical plan

As the outcome is binary, three main models came to mind: logistic regression, random forest, and XGBoost. I also wanted to use an SVM, but it is unlikely to outperform tree-based models on structured, overlapping clinical data.

Goals

Produce well-calibrated probability estimates for kidney-failure risk.

Maximize clinically-relevant sensitivity (recall) for the positive class while managing false positives.

Demonstrate robustness using cross-validation.

Modelling

Logistic regression serves as the baseline and the interpretable model. Preprocessing bins and log-transforms the features to give the model the best opportunity to draw linear boundaries between them. An elastic net then produces a reduced model, which is compared against the full one.

The results are then compared to a basic and a tuned random forest. Once the random forest is complete, an XGBoost model tests whether tuned boosted trees outperform the random forest and logistic regression.

Finally, the training data is upsampled and the best models re-run before a final model comparison, focusing on PR-AUC and ROC-AUC.
householdincome      int64
age                  int64
water                int64
diabp                int64
sysbp                int64
creatinine         float64
urinecreatinine      int64
uacr               float64
albumin              int64
bloodnitro           int64
stonesboolean        int64
failingkidney        int64
dtype: object

Splitting into Train/Test/Valid and checking that the class proportions stay consistent across splits, since few positive cases are available.

Rows before: 10175
Rows after: 10175
Rows before: 10175
Rows after: 10175
Train set:      6105 samples
Validation set: 2035 samples
Test set:       2035 samples

Train set:
  Class 0: 5948 samples (97.43%)
  Class 1: 157 samples (2.57%)
------------------------------

Test set:
  Class 0: 1983 samples (97.44%)
  Class 1: 52 samples (2.56%)
------------------------------

Validation set:
  Class 0: 1982 samples (97.40%)
  Class 1: 53 samples (2.60%)
------------------------------
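The split above can be reproduced with two stratified calls to `train_test_split`; this is a sketch with synthetic data matching the row counts, not the notebook's actual loading code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the ~2.6%-positive outcome column.
rng = np.random.default_rng(1)
X = rng.normal(size=(10175, 5))
y = (rng.random(10175) < 0.026).astype(int)

# 60/20/20 split; stratify keeps the rare positive class at the
# same rate in every partition, as in the counts above.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.40, stratify=y, random_state=42)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)

for name, yy in [("train", y_train), ("valid", y_valid), ("test", y_test)]:
    print(name, len(yy), f"{yy.mean():.3%} positive")
```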

Checking Log Reg Assumptions

Running Box-Tidwell to test whether each variable is linearly related to the log-odds. Several features failed, so we move to transformations to try and bring them more in line for the Log Reg to pick up. (The binary stonesboolean has no meaningful Box-Tidwell term, hence the NaN.)

creatinine        0.000000
uacr              0.000000
householdincome   0.014020
sysbp             0.041797
urinecreatinine   0.079348
age_shifted       0.113579
albumin           0.120395
bloodnitro        0.364051
diabp_shifted     0.635525
water             0.808855
stonesboolean          NaN
Name: Box-Tidwell p-values, dtype: float64

Checking which transformations would work best for the Log Reg outcomes. To prevent further data leakage through 'peeking', it is best to wrap the transformations and model build into a pipeline with cross-validation.
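Such a pipeline can be sketched with scikit-learn (hypothetical column assignments, synthetic data): the binning and log transforms are refit inside every CV fold, so no held-out rows leak into the preprocessing.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import KBinsDiscretizer, FunctionTransformer, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic positive-valued features standing in for the real columns.
rng = np.random.default_rng(3)
X = np.abs(rng.normal(size=(1500, 4))) + 0.1
y = (rng.random(1500) < 0.1).astype(int)

# Columns 0-1 get 5 quantile bins, column 2 gets log1p, the rest are scaled.
prep = ColumnTransformer([
    ("bins", KBinsDiscretizer(n_bins=5, encode="ordinal"), [0, 1]),
    ("log", FunctionTransformer(np.log1p), [2]),
], remainder=StandardScaler())

pipe = Pipeline([("prep", prep), ("clf", LogisticRegression(max_iter=1000))])
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
print("mean CV ROC-AUC:", scores.mean())
```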

                     raw      log    bin_5 best_category
householdincome 0.479194 0.479194 0.555281         bin_5
age             0.657161 0.657161 0.652746           raw
water           0.565570 0.565570 0.548484           raw
diabp           0.587725 0.587725 0.567551           raw
sysbp           0.556482 0.556482 0.558819         bin_5
creatinine      0.722856 0.722856 0.664630           raw
urinecreatinine 0.613126 0.613126 0.608567           raw
uacr            0.613696 0.613696 0.571270           raw
albumin         0.603902 0.603902 0.588164           raw
bloodnitro      0.692000 0.692000 0.657949           raw
stonesboolean   0.541439 0.541439 0.540544           raw

This wrapper shows that with a standard set of 5 bins, two features are best represented by binning; the next step is finding the bin count that best explains the data.

householdincome: 7 bins, score = 0.5566
sysbp: 7 bins, score = 0.5749

Once the bins were set, two features were also best log-transformed, creatinine and UACR, to better fit the linearity assumption.

We need to scale the data to a standardized range for good model prediction, as the set contains values from 0 to 500 while we are trying to predict a 0/1 outcome.

Checking the outliers, household income has the most on the high side, but with over 6,000 data points I would like to keep the outliers in and model accordingly, acknowledging they may skew the Log Reg line toward the high side since high outliers far outnumber low ones.

                 high  low
householdincome   192    0
age                 0    0
water               0    0
diabp              16   66
sysbp              96    1
creatinine         34    0
urinecreatinine    77    0
uacr               48    0
albumin            18   59
bloodnitro         84    0
stonesboolean      13    0
UACR_log          124    0
Creatinine_log     43    3
Optimization terminated successfully.
         Current function value: 0.095486
         Iterations 9
                           Logit Regression Results                           
==============================================================================
Dep. Variable:          failingkidney   No. Observations:                 6105
Model:                          Logit   Df Residuals:                     6089
Method:                           MLE   Df Model:                           15
Date:                Fri, 27 Feb 2026   Pseudo R-squ.:                  0.2011
Time:                        15:41:46   Log-Likelihood:                -582.94
converged:                       True   LL-Null:                       -729.68
Covariance Type:            nonrobust   LLR p-value:                 1.265e-53
==========================================================================================
                             coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------
const                     -3.8618      0.330    -11.688      0.000      -4.509      -3.214
householdincome            0.0925      0.100      0.923      0.356      -0.104       0.289
age                        0.0797      0.108      0.736      0.462      -0.133       0.292
water                     -0.2184      0.086     -2.553      0.011      -0.386      -0.051
diabp                     -0.3465      0.067     -5.168      0.000      -0.478      -0.215
sysbp                      0.1028      0.137      0.753      0.452      -0.165       0.370
creatinine                -0.0390      0.140     -0.279      0.780      -0.313       0.235
urinecreatinine           -0.5497      0.112     -4.915      0.000      -0.769      -0.331
uacr                      -0.0291      0.052     -0.563      0.574      -0.131       0.072
albumin                   -0.1508      0.087     -1.736      0.083      -0.321       0.019
bloodnitro                 0.0504      0.080      0.627      0.531      -0.107       0.208
stonesboolean             -0.2464      0.107     -2.313      0.021      -0.455      -0.038
householdincome_binned    -0.0827      0.055     -1.506      0.132      -0.190       0.025
sysbp_binned              -0.0528      0.097     -0.544      0.586      -0.243       0.137
UACR_log                   0.1526      0.084      1.807      0.071      -0.013       0.318
Creatinine_log             0.7904      0.173      4.563      0.000       0.451       1.130
==========================================================================================
Accuracy (TP+TN)/(TP+TN+FP+FN): 0.9778869778869779
Precision (TP)/(TP+FP): 0.7333333333333333
Recall (TP)/(TP+FN): 0.21153846153846154
F1 2(precision*recall)/(precision+recall): 0.32835820895522383
Confusion matrix:
 [[1979    4]
 [  41   11]]
ROC-AUC: 0.7585825672058653
Report:               precision    recall  f1-score   support

           0       0.98      1.00      0.99      1983
           1       0.73      0.21      0.33        52

    accuracy                           0.98      2035
   macro avg       0.86      0.60      0.66      2035
weighted avg       0.97      0.98      0.97      2035

Running the initial Log Reg, we can see that systolic blood pressure, household income, and UACR (among others) do not meet the p < 0.05 threshold. That is to be expected: the binned versions of household income and sysbp absorb the variance the raw features would explain, and UACR is better described by its log transform.

We also have very strong accuracy and precision, but this is due to the heavily imbalanced dataset. Focusing on recall, we did terribly on the class of interest: positive kidney-failure cases were caught at only 0.21, meaning roughly 79% of true positives (41 of 52) were missed.

The ROC-AUC is decent at 0.7586, but it is primarily driven by the correct classification of the abundant negative cases.

With a basic 0.5 threshold we get an F1 score of 0.328, which is very low and suggests there may be a better threshold for this data, one that trades some precision for more recall.
C values: [0.35938137]
Best l1 ratio: [0.9]
householdincome           0.058643
age                       0.082588
water                    -0.199325
diabp                    -0.336035
sysbp                     0.035424
creatinine                0.000000
urinecreatinine          -0.513618
uacr                     -0.009090
albumin                  -0.134157
bloodnitro                0.056873
stonesboolean            -0.213313
householdincome_binned   -0.071297
sysbp_binned             -0.006125
UACR_log                  0.133400
Creatinine_log            0.726119
dtype: float64
Intercept:  [-3.96877316]
Non-zero coefficients from Elastic Net (90% lasso, 10% ridge):
 Creatinine_log            0.726119
urinecreatinine          -0.513618
diabp                    -0.336035
stonesboolean            -0.213313
water                    -0.199325
albumin                  -0.134157
UACR_log                  0.133400
age                       0.082588
householdincome_binned   -0.071297
householdincome           0.058643
bloodnitro                0.056873
sysbp                     0.035424
uacr                     -0.009090
sysbp_binned             -0.006125
dtype: float64

We ran some code testing different splits of the lasso/ridge mix in an elastic net; the search settled on a lasso-heavy mix (l1_ratio = 0.9), since the L1 penalty is what actually shrinks coefficients to zero and prunes features.
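The mix search can be sketched with `LogisticRegressionCV`, which cross-validates both the penalty strength C and the l1_ratio (1.0 would be pure lasso, 0.0 pure ridge). Synthetic data here; the notebook's grid may differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic imbalanced data standing in for the kidney dataset.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           weights=[0.95], random_state=5)

# saga is the solver that supports the elastic-net penalty; CV picks
# the (C, l1_ratio) pair with the best ROC-AUC.
enet = LogisticRegressionCV(
    penalty="elasticnet", solver="saga", l1_ratios=[0.1, 0.5, 0.9],
    Cs=3, cv=3, scoring="roc_auc", max_iter=5000, random_state=5,
).fit(X, y)

print("chosen l1_ratio:", enet.l1_ratio_[0])
print("coefficients shrunk to zero:", int((enet.coef_ == 0).sum()))
```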

Accuracy (TP+TN)/(TP+TN+FP+FN): 0.9778869778869779
Precision (TP)/(TP+FP): 0.7333333333333333
Recall (TP)/(TP+FN): 0.21153846153846154
F1 2(precision*recall)/(precision+recall): 0.32835820895522383
Confusion matrix:
 [[1979    4]
 [  41   11]]
ROC-AUC: 0.7589122929516273
Report:               precision    recall  f1-score   support

           0       0.98      1.00      0.99      1983
           1       0.73      0.21      0.33        52

    accuracy                           0.98      2035
   macro avg       0.86      0.60      0.66      2035
weighted avg       0.97      0.98      0.97      2035

Comparing the reduced model, we maintain essentially the same ROC-AUC with only one coefficient zeroed out (raw creatinine).

Checking for a better threshold value to capture more positive cases: around a 0.2 threshold would work better than a flat 0.5.
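The threshold search can be sketched with `precision_recall_curve`; the scores below are synthetic stand-ins for the model's validation-set probabilities.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Synthetic rare-positive labels and imperfect predicted probabilities,
# higher on average for the positive class.
rng = np.random.default_rng(4)
y_val = (rng.random(2000) < 0.03).astype(int)
proba = np.clip(rng.normal(0.1 + 0.25 * y_val, 0.1), 0, 1)

# Pick the cutoff that maximizes F1 instead of defaulting to 0.5.
prec, rec, thresh = precision_recall_curve(y_val, proba)
f1 = 2 * prec * rec / np.clip(prec + rec, 1e-12, None)
best = np.argmax(f1[:-1])          # last curve point has no threshold
print("best threshold:", thresh[best], "F1 there:", f1[best])
```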

Accuracy (TP+TN)/(TP+TN+FP+FN): 0.9754299754299754
Precision (TP)/(TP+FP): 0.5294117647058824
Recall (TP)/(TP+FN): 0.34615384615384615
F1 2(precision*recall)/(precision+recall): 0.41860465116279066
Confusion matrix:
 [[1967   16]
 [  34   18]]
ROC-AUC: 0.7589122929516273
Report:               precision    recall  f1-score   support

           0       0.98      0.99      0.99      1983
           1       0.53      0.35      0.42        52

    accuracy                           0.98      2035
   macro avg       0.76      0.67      0.70      2035
weighted avg       0.97      0.98      0.97      2035

After lowering the cutoff to 0.2, we obtain improved metrics on the positive class: recall of 0.35, precision of 0.53, and a stronger F1 of 0.42. Accuracy stays high at 0.975 because true negatives still drive the majority of predictions. This is a good trade for model performance.

Random Forest Baseline

Now moving on to the Random Forest build. It looks like there is no clean linear boundary between the variables for a Log Reg, so trees will likely perform better.

The first random forest uses a standard balanced baseline, weighting the outcomes inversely to their frequency. It performs very poorly, even by Log Reg standards, so we go directly into tuning.
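That baseline can be sketched as follows (synthetic data; `class_weight="balanced"` applies the inverse-frequency weights described above):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic ~3%-positive data standing in for the kidney dataset.
X, y = make_classification(n_samples=4000, n_features=8, weights=[0.97],
                           random_state=6)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=6)

# "balanced" reweights each class inversely to its frequency,
# so the rare positive class is not ignored during training.
rf = RandomForestClassifier(n_estimators=300, class_weight="balanced",
                            random_state=6).fit(X_tr, y_tr)
print("ROC-AUC:", roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1]))
```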

--- Random Forest Baseline Model ---
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      1983
           1       0.57      0.23      0.33        52

    accuracy                           0.98      2035
   macro avg       0.78      0.61      0.66      2035
weighted avg       0.97      0.98      0.97      2035

ROC_AUC:  0.7416695760114822
Confusion Matrix: 
 [[1974    9]
 [  40   12]]

It looks like our ROC curve is predicting better than chance overall, but just like the Log Reg model we need to move our threshold value (closer to 0.2-0.39). I am also running multiple model comparisons at once to find the best mix of class weights and thresholds.
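A sketch of the joint sweep over class weights and thresholds (synthetic data; the notebook's actual grid may differ): train once per weighting, then re-score the same probabilities at several cutoffs.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score

# Synthetic imbalanced data standing in for the kidney dataset.
X, y = make_classification(n_samples=3000, n_features=8, weights=[0.97],
                           random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=7)

rows = []
for w in [{0: 1, 1: 10}, {0: 1, 1: 25}]:
    rf = RandomForestClassifier(n_estimators=200, class_weight=w,
                                random_state=7).fit(X_tr, y_tr)
    proba = rf.predict_proba(X_te)[:, 1]
    # Thresholding is free once probabilities exist: no retraining needed.
    for t in [0.1, 0.2, 0.3, 0.5]:
        pred = (proba >= t).astype(int)
        rows.append({"class_weight": str(w), "threshold": t,
                     "precision": precision_score(y_te, pred, zero_division=0),
                     "recall": recall_score(y_te, pred),
                     "f1": f1_score(y_te, pred)})
print(pd.DataFrame(rows).sort_values("f1", ascending=False).head())
```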

    class_weight  threshold  precision   recall       f1  roc_auc
5  {0: 1, 1: 25}   0.100000   0.240964 0.384615 0.296296 0.768276
3  {0: 1, 1: 25}   0.200000   0.562500 0.346154 0.428571 0.768276
4  {0: 1, 1: 25}   0.150000   0.367347 0.346154 0.356436 0.768276
2  {0: 1, 1: 25}   0.250000   0.592593 0.307692 0.405063 0.768276
1  {0: 1, 1: 25}   0.300000   0.590909 0.250000 0.351351 0.768276
0  {0: 1, 1: 25}   0.500000   0.800000 0.153846 0.258065 0.768276
--- Random Forest Tuned Model ---
              precision    recall  f1-score   support

           0       0.98      0.99      0.99      1983
           1       0.57      0.33      0.41        52

    accuracy                           0.98      2035
   macro avg       0.77      0.66      0.70      2035
weighted avg       0.97      0.98      0.97      2035

ROC_AUC:  0.7884809340936421
Confusion Matrix: 
 [[1970   13]
 [  35   17]]

The tuned random forest is similar to the Log Reg in performance, picking up 5 more true positives over the RF baseline at the cost of 4 more false positives. That only bumped F1 by about 0.08, with a higher ROC-AUC of 0.7885, so we will move to the XGBoosted trees.

XGBoost Model Build

After observing similar performance from both the final Log Reg and the tuned Random Forest, the classes likely overlap heavily in feature space. XGBoost can carve more well-defined splits during tree building, but at the cost of lower interpretability. This hints that the data carries a limited or hidden signal for prediction, compounded by the low outcome balance, so we need stronger prediction methods, which typically put the power in the hands of the algorithm rather than the analyst: harder to explain, but often stronger performance.


The base XGBoost build shows a potential overfitting problem: the model is matching the noise of the data rather than the few unique positive cases. While the training score keeps rising, the test curve starts falling quickly after roughly the 21st round, with round ~15 showing some of the quickest and strongest performance.

Following the initial discovery of rapid overfitting, we are now performing a Randomized Search to optimize the model's architecture. We are tuning the learning pace, tree complexity (depth and node size), and regularization penalties to ensure the model captures the unique minority cases without memorizing the noise in the data.

Fitting 5 folds for each of 40 candidates, totalling 200 fits

Best CV PR-AUC:  0.2313511425184655
Best Params: 
 subsample: 0.9
 reg_lambda: 1.0
 n_estimators: 1000
 min_child_weight: 1
 max_depth: 5
 learning_rate: 0.1
 gamma: 1.0
 colsample_bytree: 0.9

[Tuned] ROC_AUC: 0.762 | PR-AUC: 0.342

From the best parameter fits we can see that there is a lot of noise in this dataset. The best build uses a min_child_weight of 1 with a maximum tree depth of 5, plus a gamma of 1.0 that prunes splits which do not improve the loss enough. We also had a relatively quick learning rate of 0.1, suggesting the model was pushing to simplify its search and move on to avoid noise and overfitting. The divergence between the cross-validation PR-AUC (23.1%) and the final test PR-AUC (34.2%) suggests that the model is 'data-hungry'; it performs significantly better when it has access to the full training distribution to validate its simple rules.

[0]	validation_0-aucpr:0.18418
[30]	validation_0-aucpr:0.22533
[48]	validation_0-aucpr:0.23679

[Final Model] ROC_AUC:  0.743 | PR-AUC: 0.243
Best threshold for recall≈0.7: 0.021
[[1178  607]
 [  14   33]]
              precision    recall  f1-score   support

           0      0.988     0.660     0.791      1785
           1      0.052     0.702     0.096        47

    accuracy                          0.661      1832
   macro avg      0.520     0.681     0.444      1832
weighted avg      0.964     0.661     0.774      1832

Using the best model and forcing a target recall of 70% (correctly classifying roughly 70% of positive cases), we end up with a very low threshold of 0.021: any case with at least a 2.1% predicted probability gets flagged. That yields roughly 1 true positive for every 18 false positives (33 TP vs 607 FP). The precision of 0.052 against a base rate of 47/1832 ≈ 2.6% shows we about doubled the odds over randomly guessing whether an individual has kidney failure, while also flagging 607 people incorrectly. This would not be a very useful test.
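Picking a threshold for a target recall can be sketched like this; the scores below are synthetic stand-ins for the tuned model's validation-set probabilities.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Synthetic rare-positive labels and noisy predicted probabilities.
rng = np.random.default_rng(9)
y_val = (rng.random(2000) < 0.026).astype(int)
proba = np.clip(rng.normal(0.05 + 0.1 * y_val, 0.05), 0, 1)

prec, rec, thresh = precision_recall_curve(y_val, proba)
# Recall is non-increasing along the thresholds, so the last index
# still meeting the target gives the highest usable cutoff.
ok = np.where(rec[:-1] >= 0.7)[0]
t = thresh[ok[-1]]
print(f"threshold for recall>=0.7: {t:.3f}  precision there: {prec[ok[-1]]:.3f}")
```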

Upsampling into XGBoost and Random Forest

Now that we have seen some lower-performing models on the base dataset, it is time to upsample the positive cases in the training data and hopefully build a more robust model.
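The upsampling step can be sketched with `sklearn.utils.resample`, shown here on a synthetic stand-in with the same 157/5948 training counts as above; note it is applied to the training split only.

```python
import numpy as np
from sklearn.utils import resample

# Synthetic training split with the same class counts as above.
rng = np.random.default_rng(10)
X_tr = rng.normal(size=(6105, 4))
y_tr = np.zeros(6105, dtype=int)
y_tr[:157] = 1

X_pos, X_neg = X_tr[y_tr == 1], X_tr[y_tr == 0]
# Sample the minority class WITH replacement up to the majority count.
X_pos_up = resample(X_pos, replace=True, n_samples=len(X_neg), random_state=10)
X_bal = np.vstack([X_neg, X_pos_up])
y_bal = np.r_[np.zeros(len(X_neg), dtype=int), np.ones(len(X_pos_up), dtype=int)]
print("class counts:", np.bincount(y_bal))    # -> [5948 5948]
```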

Class distribution after upsampling:  Counter({0: 5948, 1: 5948})
Confusion Matrix: 
 [[1629  156]
 [   5   42]]

Classification Report: 
               precision    recall  f1-score   support

           0      0.997     0.913     0.953      1785
           1      0.212     0.894     0.343        47

    accuracy                          0.912      1832
   macro avg      0.605     0.903     0.648      1832
weighted avg      0.977     0.912     0.937      1832


ROC-AUC: 
 0.9031110316467013

PR-AUC: 
 0.19228438336725986

With the manually upsampled data, the final model produced 0.212 precision and 0.894 recall, with 156 false positives: roughly one true positive for every four false positives, far better than the ~1:18 ratio of the previous model at 70% forced recall. The PR-AUC dropped drastically with the balanced upsample approach, and we still have 5 false negatives to account for. This suggests some data points may simply be inseparable as positive or negative by this method.

Fitting 5 folds for each of 40 candidates, totalling 200 fits

Best CV PR-AUC: 0.998762727969751
Best Params:
 subsample: 0.8
 reg_lambda: 1.0
 n_estimators: 1200
 min_child_weight: 2
 max_depth: 7
 learning_rate: 0.05
 gamma: 0
 colsample_bytree: 1.0
Final ROC-AUC:  1.000 | PR-AUC:  0.997

Tuning the upsampled XGBoost model, we get a mirage of good cross-validation: because we upsampled before cross-validating, duplicated minority rows land in both the training and validation folds. The 0.997 CV PR-AUC mostly reflects the model re-recognizing its own clones, and compared to the base model it is overfitting to the upsampling very strongly, even with early stopping and regularization in place.
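A leakage-free version upsamples inside each fold, so clones of a minority row can never sit on both sides of a split. A sketch with synthetic data:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import resample
from sklearn.metrics import average_precision_score
from sklearn.datasets import make_classification

# Synthetic imbalanced data standing in for the kidney dataset.
X, y = make_classification(n_samples=3000, n_features=8, weights=[0.97],
                           random_state=11)

scores = []
for tr, va in StratifiedKFold(5, shuffle=True, random_state=11).split(X, y):
    X_tr, y_tr = X[tr], y[tr]
    # Upsample the minority class using TRAINING rows of this fold only.
    pos, neg = X_tr[y_tr == 1], X_tr[y_tr == 0]
    pos_up = resample(pos, replace=True, n_samples=len(neg), random_state=11)
    X_bal = np.vstack([neg, pos_up])
    y_bal = np.r_[np.zeros(len(neg)), np.ones(len(pos_up))]
    clf = RandomForestClassifier(n_estimators=200, random_state=11)
    clf.fit(X_bal, y_bal)
    # Validate on untouched, un-duplicated rows.
    scores.append(average_precision_score(y[va], clf.predict_proba(X[va])[:, 1]))
print("fold PR-AUCs:", np.round(scores, 3))
```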

Moving to the Random Forest

Threshold Analysis:
   threshold  precision   recall       f1  roc_auc  avg_precision
0   0.500000   0.886792 1.000000 0.940000 0.999928       0.997260
1   0.400000   0.534091 1.000000 0.696296 0.999928       0.997260
2   0.300000   0.256831 1.000000 0.408696 0.999928       0.997260
3   0.250000   0.163763 1.000000 0.281437 0.999928       0.997260
4   0.200000   0.101732 1.000000 0.184676 0.999928       0.997260
5   0.150000   0.060960 1.000000 0.114914 0.999928       0.997260
==============================
FINAL REPORT (Threshold: 0.39)
==============================
              precision    recall  f1-score   support

           0       1.00      0.98      0.99      1785
           1       0.52      1.00      0.69        47

    accuracy                           0.98      1832
   macro avg       0.76      0.99      0.84      1832
weighted avg       0.99      0.98      0.98      1832

ROC-AUC: 0.9999
Average Precision (AP): 0.9973

Confusion Matrix:
[[1742   43]
 [   0   47]]

The performance shown from both model summaries is a classic case of a model that has memorized the exam rather than learning the subject, likely due to the data leakage from the mismatched sets. While a 1.00 Recall and 0.9999 ROC-AUC look like a win, they indicate that the model is perfectly identifying the 47 positive cases because it either saw them (or their upsampled clones) during the training phase. The most telling metric is the 0.52 Precision, which reveals that even with "insider information," the model is still firing off 43 false alarms for every 47 correct hits, suggesting that once you fix the leakage, the real-world performance will likely drop significantly as the model loses its unfair advantage.

Model Comparisons


From the final model validation, it looks like we have hit a signal ceiling in the predictive power of our data: three models all land at 0.24 precision. Compared to the 0.026 base rate, though, that is still a 9.2x lift. The upsampling hurt model performance; those models learned the repeated rows, overfit to noise, and failed when presented with new, novel data.

Misclassification is especially costly in medical diagnosis, particularly in America, where a false positive can be financially ruinous and a false negative means a failing organ goes untreated.

Most of this is driven by the severe overlap between features, which makes positive and negative cases hard to separate. A more standardized cleaning and imputation method, more direct data collection rather than discovery from already-collected data, and a dataset with a higher prevalence of kidney failure would all help produce a more effective model if this were to be productionized.