Kidney Failure Prediction
Chronic kidney disease (CKD) and kidney failure affect millions of people each year, and early prediction can help improve treatment and outcomes. In a previous notebook, I combined multiple datasets from the CDC and National Center for Health Statistics (NCHS) to create a unified, clean dataset for predicting kidney failure.
The final dataset included key health indicators such as:
Demographics & Lifestyle: Household income, age, daily water intake
Vitals: Systolic and diastolic blood pressure
Lab Measurements: Blood creatinine, urine creatinine, urine albumin-creatinine ratio (UACR), blood albumin, blood nitrogen levels
Medical History: Kidney stone history (boolean indicator)
I brought the data together and cleaned/imputed it in RStudio using packages such as janitor, ggplot2, mice, and the almighty tidyverse. I also ran the Variance Inflation Factor (VIF) over the final set, which helped remove redundant features by measuring multicollinearity.
The features went through several rounds of imputation, from basic mean/median/mode fills to Predictive Mean Matching (PMM), to fill in missing values. The source datasets were very sparsely populated for these outcomes, and sparser still for individuals who actually had kidney failure reported.
Using imputation, and PMM in particular, over the whole set in the first notebook is a form of data leakage: PMM borrowed values across rows before the Train/Test/Valid split. Given the scarcity of the data, however, it was necessary to push the modelling forward while acknowledging the limitation.
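The leakage-safe ordering is to split first and fit the imputer on the training rows only. A minimal sketch with synthetic data, using scikit-learn's IterativeImputer as a model-based stand-in for PMM (the actual notebook used R's mice; the array shapes here are invented):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[rng.random(X.shape) < 0.2] = np.nan          # ~20% missing at random
y = (rng.random(200) < 0.1).astype(int)        # rare positive outcome

# Split FIRST, then fit the imputer on the training rows only,
# so no information from test rows leaks into the imputed values.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)
imputer = IterativeImputer(random_state=0)     # model-based stand-in for PMM
X_tr_imp = imputer.fit_transform(X_tr)         # fit + transform on train
X_te_imp = imputer.transform(X_te)             # transform-only on test
assert not np.isnan(X_tr_imp).any() and not np.isnan(X_te_imp).any()
```

Fitting on the full matrix before splitting would let test-row values shape the imputation model, inflating every downstream metric.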
Statistical plan
As the outcome is binary, three main models came to mind: Logistic Regression, Random Forest, and XGBoost. I also considered an SVM, but it is unlikely to outperform tree-based models on structured, overlapping clinical data.
Goals
Produce well-calibrated probability estimates for kidney-failure risk.
Maximize clinically-relevant sensitivity (recall) for the positive class while managing false positives.
Demonstrate robustness using cross-validation
Modelling
Logistic Regression serves as the baseline and the interpretable model. The data is preprocessed into bins and log transformations to give the model the best opportunity to draw linear boundaries between the features, then Elastic Net produces a reduced model that is compared against the full one.
Next, results are compared to a basic and a tuned Random Forest. Once the Random Forest is complete, an XGBoost model tests whether tuned boosted trees outperform the Random Forest and Logistic Regression.
Finally, the training data is upsampled on the outcome and the best models are re-run before a final model comparison, focusing on PR-AUC and ROC-AUC.
householdincome      int64
age                  int64
water                int64
diabp                int64
sysbp                int64
creatinine         float64
urinecreatinine      int64
uacr               float64
albumin              int64
bloodnitro           int64
stonesboolean        int64
failingkidney        int64
dtype: object
Splitting into Train/Test/Valid with stratification, and checking that the class proportions match across the sets given the few available positive cases.
Rows before: 10175  Rows after: 10175
Train set: 6105 samples
Validation set: 2035 samples
Test set: 2035 samples
Train set:       Class 0: 5948 samples (97.43%)   Class 1: 157 samples (2.57%)
Test set:        Class 0: 1983 samples (97.44%)   Class 1: 52 samples (2.56%)
Validation set:  Class 0: 1982 samples (97.40%)   Class 1: 53 samples (2.60%)
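A 60/20/20 stratified split like the one above can be produced with two calls to scikit-learn's train_test_split. The data here is synthetic, but the row count and positive rate mirror this notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(10175, 5))
y = (rng.random(10175) < 0.026).astype(int)    # ~2.6% positive, as in this data

# First carve off 40%, then halve it into validation and test; stratify
# keeps the rare-positive proportion nearly identical in all three sets.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y)
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp)

for name, yy in [("train", y_tr), ("valid", y_va), ("test", y_te)]:
    print(name, len(yy), round(yy.mean() * 100, 2), "% positive")
```

With only ~260 positives total, an unstratified split could easily leave one partition with a materially different positive rate.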
Checking Log Reg Assumptions
Running the Box-Tidwell test to see whether the continuous variables are linearly related to the log-odds. Several failed, so we move to transformations to try to bring them in line for Logistic Regression.
Box-Tidwell p-values:
creatinine         0.000000
uacr               0.000000
householdincome    0.014020
sysbp              0.041797
urinecreatinine    0.079348
age_shifted        0.113579
albumin            0.120395
bloodnitro         0.364051
diabp_shifted      0.635525
water              0.808855
stonesboolean           NaN
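The Box-Tidwell check works by adding an x·ln(x) term for each predictor and testing whether that term is significant. A sketch on synthetic data, implemented here as a likelihood-ratio test with scikit-learn and SciPy (not the notebook's actual code, which may use a different library):

```python
import numpy as np
from scipy.stats import chi2
from sklearn.linear_model import LogisticRegression

def box_tidwell_pvalue(x, y):
    """Likelihood-ratio Box-Tidwell check: is the x*ln(x) term significant?

    x must be strictly positive (shift first if needed, as the notebook's
    age_shifted/diabp_shifted columns suggest); a small p-value means the
    linearity-in-the-logit assumption is violated for x."""
    x = np.asarray(x, dtype=float)
    X_base = x.reshape(-1, 1)
    X_full = np.column_stack([x, x * np.log(x)])   # augment with x*ln(x)

    def neg2ll(X):
        # Large C approximates an unpenalized fit
        m = LogisticRegression(C=1e10, max_iter=1000).fit(X, y)
        p = m.predict_proba(X)[:, 1]
        return -2 * np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

    lr_stat = neg2ll(X_base) - neg2ll(X_full)      # nested-model LRT
    return chi2.sf(lr_stat, df=1)

rng = np.random.default_rng(2)
x = rng.uniform(0.5, 5.0, 2000)
# Outcome generated linearly in the logit, so no violation is expected
y_lin = (rng.random(2000) < 1 / (1 + np.exp(-(x - 2.5)))).astype(int)
print(round(box_tidwell_pvalue(x, y_lin), 3))      # typically non-significant here
```

A NaN, as for stonesboolean above, is the expected result for a binary indicator, where the test does not apply.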
Checking which transformations work best for the Logistic Regression outcome. To prevent further data leakage through 'peeking', the transformation and model build are wrapped into a pipeline with cross-validation.
                      raw       log     bin_5  best_category
householdincome  0.479194  0.479194  0.555281  bin_5
age              0.657161  0.657161  0.652746  raw
water            0.565570  0.565570  0.548484  raw
diabp            0.587725  0.587725  0.567551  raw
sysbp            0.556482  0.556482  0.558819  bin_5
creatinine       0.722856  0.722856  0.664630  raw
urinecreatinine  0.613126  0.613126  0.608567  raw
uacr             0.613696  0.613696  0.571270  raw
albumin          0.603902  0.603902  0.588164  raw
bloodnitro       0.692000  0.692000  0.657949  raw
stonesboolean    0.541439  0.541439  0.540544  raw
This wrapper shows that with a standard set of 5 bins, two features are best represented by binning; the next step is finding the bin count that best explains the data.
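One way to wrap the transformation search into a leak-free cross-validated pipeline, sketched on a synthetic skewed feature (the data and scores are illustrative, not the notebook's):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, KBinsDiscretizer, StandardScaler

rng = np.random.default_rng(3)
x = rng.lognormal(size=(3000, 1))                       # skewed "lab value"
y = (rng.random(3000) < 1 / (1 + np.exp(2 - np.log(x[:, 0])))).astype(int)

# Each candidate transformation lives INSIDE the pipeline, so it is
# refit on each training fold and never sees the held-out fold.
candidates = {
    "raw":   FunctionTransformer(),                     # identity
    "log":   FunctionTransformer(np.log1p),
    "bin_5": KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile"),
}
for name, tf in candidates.items():
    pipe = make_pipeline(tf, StandardScaler(), LogisticRegression(max_iter=1000))
    auc = cross_val_score(pipe, x, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean CV ROC-AUC = {auc:.3f}")
```

Note that for a single feature, raw and log scores can coincide: ROC-AUC is rank-based and a monotonic transform preserves the ranking, which is consistent with the identical raw/log columns in the table above.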
householdincome: 7 bins, score = 0.5566 sysbp: 7 bins, score = 0.5749
With the bins set, there were also two features, 'Creatinine' and 'UACR', best log-transformed to better satisfy the linearity assumption.
We also need to scale the data into a standardized range for stable model fitting, as raw values in the set span roughly 0-500 while we are trying to predict a 0/1 outcome.
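The standard pattern is to fit the scaler on the training set only and reuse its statistics everywhere else. A minimal sketch with made-up vitals-scale numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training rows: one column on a ~0-500 scale, one near 1
X_train = np.array([[120.0, 0.9], [500.0, 1.4], [90.0, 0.7], [310.0, 2.1]])
X_test = np.array([[150.0, 1.0]])

scaler = StandardScaler().fit(X_train)      # learn mean/std from train only
Z_train = scaler.transform(X_train)
Z_test = scaler.transform(X_test)           # reuse the train statistics
print(Z_train.mean(axis=0).round(6))        # each column centered near 0
```

After scaling, both columns contribute on comparable scales, which also matters for the Elastic Net penalty used later: unscaled features would be penalized unevenly.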
Checking the outliers, household income has the most, but with over 6,000 training points I would rather keep the outliers in and model accordingly, acknowledging they may skew the Logistic Regression fit toward the high side, since high outliers far outnumber low ones.
                 high  low
householdincome   192    0
age                 0    0
water               0    0
diabp              16   66
sysbp              96    1
creatinine         34    0
urinecreatinine    77    0
uacr               48    0
albumin            18   59
bloodnitro         84    0
stonesboolean      13    0
UACR_log          124    0
Creatinine_log     43    3
Optimization terminated successfully.
Current function value: 0.095486
Iterations 9
Logit Regression Results
==============================================================================
Dep. Variable: failingkidney No. Observations: 6105
Model: Logit Df Residuals: 6089
Method: MLE Df Model: 15
Date: Fri, 27 Feb 2026 Pseudo R-squ.: 0.2011
Time: 15:41:46 Log-Likelihood: -582.94
converged: True LL-Null: -729.68
Covariance Type: nonrobust LLR p-value: 1.265e-53
==========================================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------------------
const -3.8618 0.330 -11.688 0.000 -4.509 -3.214
householdincome 0.0925 0.100 0.923 0.356 -0.104 0.289
age 0.0797 0.108 0.736 0.462 -0.133 0.292
water -0.2184 0.086 -2.553 0.011 -0.386 -0.051
diabp -0.3465 0.067 -5.168 0.000 -0.478 -0.215
sysbp 0.1028 0.137 0.753 0.452 -0.165 0.370
creatinine -0.0390 0.140 -0.279 0.780 -0.313 0.235
urinecreatinine -0.5497 0.112 -4.915 0.000 -0.769 -0.331
uacr -0.0291 0.052 -0.563 0.574 -0.131 0.072
albumin -0.1508 0.087 -1.736 0.083 -0.321 0.019
bloodnitro 0.0504 0.080 0.627 0.531 -0.107 0.208
stonesboolean -0.2464 0.107 -2.313 0.021 -0.455 -0.038
householdincome_binned -0.0827 0.055 -1.506 0.132 -0.190 0.025
sysbp_binned -0.0528 0.097 -0.544 0.586 -0.243 0.137
UACR_log 0.1526 0.084 1.807 0.071 -0.013 0.318
Creatinine_log 0.7904 0.173 4.563 0.000 0.451 1.130
==========================================================================================
Accuracy (TP+TN)/(TP+TN+FP+FN): 0.9778869778869779
Precision (TP)/(TP+FP): 0.7333333333333333
Recall (TP)/(TP+FN): 0.21153846153846154
F1 2(precision*recall)/(precision+recall): 0.32835820895522383
Confusion matrix:
[[1979 4]
[ 41 11]]
ROC-AUC: 0.7585825672058653
Report: precision recall f1-score support
0 0.98 1.00 0.99 1983
1 0.73 0.21 0.33 52
accuracy 0.98 2035
macro avg 0.86 0.60 0.66 2035
weighted avg 0.97 0.98 0.97 2035
Running the initial Log Reg, we can see that systolic blood pressure, household income, and uacr do not meet the p < 0.05 threshold. That is to be expected: the binned versions of household income and sysbp absorb the variance those raw variables would explain, and UACR is better described by its log transform.
We also have very strong accuracy and precision, but this is due to the heavily imbalanced dataset. Focusing on recall, we did terribly on our class of interest, positive kidney failure cases, at 0.21, a 21% true positive rate, meaning roughly 79% of positive cases (41 of 52) were missed.
The ROC-AUC is decent at 0.7586, but this is primarily driven by the correct classification of the abundant negative cases and the resulting low false positive rate.
With a basic 0.5 threshold, we get an F1 score of 0.328, which is very low and suggests there may be a better threshold for this data, one that trades some precision to drive up recall.
C values: [0.35938137]  Best l1 ratio: [0.9]
householdincome           0.058643
age                       0.082588
water                    -0.199325
diabp                    -0.336035
sysbp                     0.035424
creatinine                0.000000
urinecreatinine          -0.513618
uacr                     -0.009090
albumin                  -0.134157
bloodnitro                0.056873
stonesboolean            -0.213313
householdincome_binned   -0.071297
sysbp_binned             -0.006125
UACR_log                  0.133400
Creatinine_log            0.726119
dtype: float64
Intercept: [-3.96877316]
Non-zero coefficients from Elastic Net (90% lasso / 10% ridge), by magnitude:
Creatinine_log            0.726119
urinecreatinine          -0.513618
diabp                    -0.336035
stonesboolean            -0.213313
water                    -0.199325
albumin                  -0.134157
UACR_log                  0.133400
age                       0.082588
householdincome_binned   -0.071297
householdincome           0.058643
bloodnitro                0.056873
sysbp                     0.035424
uacr                     -0.009090
sysbp_binned             -0.006125
dtype: float64
We ran some code testing different lasso/ridge mixes in the elastic net; the lasso-dominant mix (l1_ratio = 0.9) did the most to shrink coefficients, though in the end it zeroed out only one feature.
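That search can be sketched with scikit-learn's LogisticRegressionCV on synthetic imbalanced data; the Cs grid and l1_ratios list here are illustrative, not the notebook's exact settings (note that in scikit-learn, l1_ratio = 0.9 means 90% lasso):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 15 features, ~3% positive class
X, y = make_classification(n_samples=2000, n_features=15, n_informative=5,
                           weights=[0.97], random_state=4)
Xs = StandardScaler().fit_transform(X)

# Search the C grid and several lasso/ridge mixes at once; a high l1_ratio
# (mostly lasso) is what drives weak coefficients to exactly zero.
enet = LogisticRegressionCV(Cs=10, penalty="elasticnet", solver="saga",
                            l1_ratios=[0.1, 0.5, 0.9], scoring="roc_auc",
                            max_iter=5000, cv=5).fit(Xs, y)
print("best C:", enet.C_, "best l1_ratio:", enet.l1_ratio_)
print("coefficients zeroed out:", int((enet.coef_ == 0).sum()))
```

Scaling beforehand matters here: the penalty treats all coefficients equally, so unscaled features would be shrunk inconsistently.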
Accuracy (TP+TN)/(TP+TN+FP+FN): 0.9778869778869779
Precision (TP)/(TP+FP): 0.7333333333333333
Recall (TP)/(TP+FN): 0.21153846153846154
F1 2(precision*recall)/(precision+recall): 0.32835820895522383
Confusion matrix:
[[1979 4]
[ 41 11]]
ROC-AUC: 0.7589122929516273
Report: precision recall f1-score support
0 0.98 1.00 0.99 1983
1 0.73 0.21 0.33 52
accuracy 0.98 2035
macro avg 0.86 0.60 0.66 2035
weighted avg 0.97 0.98 0.97 2035
Comparing the reduced model, we maintain essentially the same ROC-AUC with only one coefficient reduced to zero (raw creatinine).
Checking for a better threshold value to capture more positives, it looks like a cutoff around 0.2 would work better than a flat 0.5.
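Candidate thresholds can be read directly off the precision-recall curve. A sketch on synthetic validation scores (the score distribution here is invented for illustration):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical validation-set scores from a fitted classifier:
# negatives cluster near 0.05, positives shift higher but overlap.
rng = np.random.default_rng(5)
y_val = (rng.random(2000) < 0.026).astype(int)
scores = np.clip(0.05 + 0.4 * y_val + rng.normal(0, 0.1, 2000), 0, 1)

prec, rec, thr = precision_recall_curve(y_val, scores)
f1 = 2 * prec * rec / np.maximum(prec + rec, 1e-12)
best = np.argmax(f1[:-1])                   # the final PR point has no threshold
print(f"best threshold={thr[best]:.3f}  "
      f"precision={prec[best]:.2f}  recall={rec[best]:.2f}")
```

Any threshold chosen this way should come from the validation set, not the test set, or the reported test metrics become optimistically biased.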
Accuracy (TP+TN)/(TP+TN+FP+FN): 0.9754299754299754
Precision (TP)/(TP+FP): 0.5294117647058824
Recall (TP)/(TP+FN): 0.34615384615384615
F1 2(precision*recall)/(precision+recall): 0.41860465116279066
Confusion matrix:
[[1967 16]
[ 34 18]]
ROC-AUC: 0.7589122929516273
Report: precision recall f1-score support
0 0.98 0.99 0.99 1983
1 0.53 0.35 0.42 52
accuracy 0.98 2035
macro avg 0.76 0.67 0.70 2035
weighted avg 0.97 0.98 0.97 2035
After lowering the classification threshold to 0.2, we obtain improved metrics: a recall of 0.35, a precision of 0.53 on the positive class, and a stronger F1 score of 0.42. Accuracy stays high since the true negatives drive the majority of cases, dipping only slightly to 0.975. This is a worthwhile trade for model performance.
Random Forest Baseline
Now moving on to the Random Forest build. There does not appear to be a clean linear boundary between the variables for Logistic Regression, so trees will likely perform better.
The first Random Forest uses a standard balanced baseline with inverse class weights on the outcome. It performs very poorly, even by Log Reg standards, so we go directly into tuning.
--- Random Forest Baseline Model ---
precision recall f1-score support
0 0.98 1.00 0.99 1983
1 0.57 0.23 0.33 52
accuracy 0.98 2035
macro avg 0.78 0.61 0.66 2035
weighted avg 0.97 0.98 0.97 2035
ROC_AUC: 0.7416695760114822
Confusion Matrix:
[[1974 9]
[ 40 12]]
It looks like our ROC curve predicts better than chance overall, but just as with the Log Reg model we need to move the threshold (closer to 0.2-0.39). I am also running multiple model comparisons at once to find the best mix of class weights and thresholds.
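The weight-and-threshold sweep might look like the following sketch on synthetic imbalanced data; the weight dictionaries and threshold grid are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, weights=[0.97], random_state=6)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=6)

rows = []
for w in [{0: 1, 1: 10}, {0: 1, 1: 25}]:
    rf = RandomForestClassifier(n_estimators=200, class_weight=w,
                                random_state=6).fit(X_tr, y_tr)
    p = rf.predict_proba(X_va)[:, 1]
    for t in [0.1, 0.2, 0.3, 0.5]:
        rows.append((w, t,
                     f1_score(y_va, (p >= t).astype(int)),
                     roc_auc_score(y_va, p)))   # AUC ignores the threshold
for w, t, f1, auc in rows:
    print(w, t, round(f1, 3), round(auc, 3))
```

This also explains why every row of the results table above shares one ROC-AUC per weight setting: the threshold changes the confusion matrix, but not the underlying score ranking.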
class_weight threshold precision recall f1 roc_auc
5 {0: 1, 1: 25} 0.100000 0.240964 0.384615 0.296296 0.768276
3 {0: 1, 1: 25} 0.200000 0.562500 0.346154 0.428571 0.768276
4 {0: 1, 1: 25} 0.150000 0.367347 0.346154 0.356436 0.768276
2 {0: 1, 1: 25} 0.250000 0.592593 0.307692 0.405063 0.768276
1 {0: 1, 1: 25} 0.300000 0.590909 0.250000 0.351351 0.768276
0 {0: 1, 1: 25} 0.500000 0.800000 0.153846 0.258065 0.768276
--- Random Forest Tuned Model ---
precision recall f1-score support
0 0.98 0.99 0.99 1983
1 0.57 0.33 0.41 52
accuracy 0.98 2035
macro avg 0.77 0.66 0.70 2035
weighted avg 0.97 0.98 0.97 2035
ROC_AUC: 0.7884809340936421
Confusion Matrix:
[[1970 13]
[ 35 17]]
The tuned Random Forest is similar to the final Log Reg in performance; relative to its own baseline it gains 5 more true positives at the cost of 4 more false positives. That only bumped F1 by 0.08, with a higher ROC-AUC at 0.7885, so we will move on to the XGBoosted trees.
XGBoost Model Build
After observing similar performance from both the final Log Reg and the tuned Random Forest, it is likely the classes overlap heavily in feature space. XGBoost can carve more well-defined splits during tree building, though at the cost of lower interpretability. This clues me in that the data may carry a limited or hidden signal for prediction, compounded by the low outcome balance, so we need stronger prediction methods, which typically shift power from the analyst to the algorithm: harder to explain, but with stronger performance.
From the base XGBoost build, there appears to be an overfitting problem: the model matches the noise of the data rather than the few unique positive cases. While the training score keeps rising, the test score starts falling quickly after about the 21st round, with round ~15 showing some of the quickest and strongest performance.
Following the initial discovery of rapid overfitting, we are now performing a Randomized Search to optimize the model's architecture. We are tuning the learning pace, tree complexity (depth and node size), and regularization penalties to ensure the model captures the unique minority cases without memorizing the noise in the data.
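A sketch of that randomized search; to keep the example self-contained and runnable it uses scikit-learn's GradientBoostingClassifier as a stand-in, but the same RandomizedSearchCV pattern applies directly to xgboost.XGBClassifier with the parameters reported below:

```python
import numpy as np
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

X, y = make_classification(n_samples=3000, weights=[0.97], random_state=7)

# Distributions analogous to the XGBoost search described in the text:
param_dist = {
    "learning_rate": uniform(0.01, 0.19),   # learning pace
    "max_depth": randint(2, 7),             # tree complexity
    "min_samples_leaf": randint(1, 20),     # node size
    "subsample": uniform(0.7, 0.3),         # row subsampling as regularization
}
search = RandomizedSearchCV(
    GradientBoostingClassifier(n_estimators=100, random_state=7),
    param_dist, n_iter=8,
    scoring="average_precision",            # PR-AUC suits rare positives
    cv=StratifiedKFold(3, shuffle=True, random_state=7),
    random_state=7,
).fit(X, y)
print("best CV PR-AUC:", round(search.best_score_, 3))
print("best params:", search.best_params_)
```

Scoring on average precision rather than accuracy is the key choice here: with a 97/3 split, accuracy would reward models that never predict the positive class.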
Fitting 5 folds for each of 40 candidates, totalling 200 fits Best CV PR-AUC: 0.2313511425184655 Best Params: subsample: 0.9 reg_lambda: 1.0 n_estimators: 1000 min_child_weight: 1 max_depth: 5 learning_rate: 0.1 gamma: 1.0 colsample_bytree: 0.9 [Tuned] ROC_AUC: 0.762 | PR-AUC: 0.342
From the best parameter fit across nodes and leaves, we can see there is a lot of noise in this dataset. The best build has a min_child_weight of 1 and a max depth of 5 levels per tree, with a relatively quick learning rate of 0.1, suggesting the search favored comparatively shallow, simple trees that avoid memorizing noise. The divergence between the cross-validation PR-AUC (23.1%) and the final test PR-AUC (34.2%) suggests the model is 'data-hungry': it performs significantly better when it can learn its simple rules from the full training distribution.
[0] validation_0-aucpr:0.18418 [30] validation_0-aucpr:0.22533 [48] validation_0-aucpr:0.23679 [Final Model] ROC_AUC: 0.743 | PR-AUC: 0.243
Best threshold for recall≈0.7: 0.021
[[1178 607]
[ 14 33]]
precision recall f1-score support
0 0.988 0.660 0.791 1785
1 0.052 0.702 0.096 47
accuracy 0.661 1832
macro avg 0.520 0.681 0.444 1832
weighted avg 0.964 0.661 0.774 1832
Using the best model and forcing a target recall of 70% (correctly classifying about 70% of positive cases), we end up with a very low threshold of 0.021, flagging any case with at least a 2.1% predicted probability. That yields roughly 1 true positive for every 18 false positives: a precision of 0.052 against a base rate of 47/1832 = 2.6%, so we only about doubled the odds over randomly guessing whether an individual has kidney failure, while incorrectly flagging 607 negatives. This would not be a very useful test.
Upsampling into XGBoost and Random Forest
Now that we have seen some lower-performing models on the base dataset, it is time to do some upsampling to increase the number of positive cases in the data and hopefully build a more robust model.
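Upsampling the minority class with replacement, restricted to the training set only, can be sketched with scikit-learn's resample (the arrays here are synthetic but sized to match this split):

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(8)
X_tr = rng.normal(size=(6105, 3))
y_tr = np.zeros(6105, dtype=int)
y_tr[:157] = 1                                  # 157 positives, as in this split

# Upsample ONLY the training positives, with replacement, until the
# classes match; the validation and test sets stay untouched.
X_pos, X_neg = X_tr[y_tr == 1], X_tr[y_tr == 0]
X_pos_up = resample(X_pos, replace=True, n_samples=len(X_neg), random_state=8)
X_bal = np.vstack([X_neg, X_pos_up])
y_bal = np.concatenate([np.zeros(len(X_neg), dtype=int),
                        np.ones(len(X_neg), dtype=int)])
print("balanced counts:", {0: int((y_bal == 0).sum()),
                           1: int((y_bal == 1).sum())})
```

Each of the 157 originals is duplicated about 38 times on average, which is exactly why any evaluation done on resampled rows, rather than the untouched hold-out sets, will look artificially strong.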
Class distribution after upsampling: Counter({0: 5948, 1: 5948})
Confusion Matrix:
[[1629 156]
[ 5 42]]
Classification Report:
precision recall f1-score support
0 0.997 0.913 0.953 1785
1 0.212 0.894 0.343 47
accuracy 0.912 1832
macro avg 0.605 0.903 0.648 1832
weighted avg 0.977 0.912 0.937 1832
ROC-AUC:
0.9031110316467013
PR-AUC:
0.19228438336725986
With the manually upsampled training data, the final model produced a 0.212 precision and a 0.894 recall, with 156 false positives, roughly a 1:4 true-positive to false-positive ratio and about four times the precision of the earlier model forced to 70% recall. The PR-AUC dropped drastically with the balanced upsampling approach, and we still have 5 false negatives to account for, suggesting some positive cases may simply be inseparable from negatives with this method.
Fitting 5 folds for each of 40 candidates, totalling 200 fits Best CV PR-AUC: 0.998762727969751 Best Params: subsample: 0.8 reg_lambda: 1.0 n_estimators: 1200 min_child_weight: 2 max_depth: 7 learning_rate: 0.05 gamma: 0 colsample_bytree: 1.0 Final ROC-AUC: 1.000 | PR-AUC: 0.997
Tuning the upsampled XGBoost model gives a mirage of good cross-validation: because we upsampled before cross-validating, duplicated positives leak across folds, and the 0.997 CV PR-AUC reflects memorization of the upsampled rows rather than real signal. Compared to the base model, it is overfitting to the upsampling very strongly, even with early stopping and a regularization of 5 in the confirmed best fit.
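The leak-free version of this experiment upsamples inside each cross-validation fold, after the split, so no duplicated positive ever straddles train and validation. A sketch on synthetic data with a Random Forest (the data and scores are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

X, y = make_classification(n_samples=3000, weights=[0.97], random_state=9)

# Correct ordering: split first, then upsample ONLY the training fold.
scores = []
for tr_idx, va_idx in StratifiedKFold(5, shuffle=True, random_state=9).split(X, y):
    X_tr, y_tr = X[tr_idx], y[tr_idx]
    pos = np.where(y_tr == 1)[0]
    neg = np.where(y_tr == 0)[0]
    pos_up = resample(pos, replace=True, n_samples=len(neg), random_state=9)
    idx = np.concatenate([neg, pos_up])         # balanced training fold
    clf = RandomForestClassifier(n_estimators=100, random_state=9)
    clf.fit(X_tr[idx], y_tr[idx])
    # Validate on the untouched fold, which contains no duplicated rows
    scores.append(average_precision_score(
        y[va_idx], clf.predict_proba(X[va_idx])[:, 1]))
print("fold-wise PR-AUC:", [round(s, 3) for s in scores])
```

Done this way, the CV PR-AUC stays an honest estimate; the inflated 0.997 above is the signature of the reversed ordering.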
Moving to the Random Forest
Threshold Analysis:
   threshold  precision    recall        f1   roc_auc  avg_precision
0   0.500000   0.886792  1.000000  0.940000  0.999928       0.997260
1   0.400000   0.534091  1.000000  0.696296  0.999928       0.997260
2   0.300000   0.256831  1.000000  0.408696  0.999928       0.997260
3   0.250000   0.163763  1.000000  0.281437  0.999928       0.997260
4   0.200000   0.101732  1.000000  0.184676  0.999928       0.997260
5   0.150000   0.060960  1.000000  0.114914  0.999928       0.997260
==============================
FINAL REPORT (Threshold: 0.39)
==============================
precision recall f1-score support
0 1.00 0.98 0.99 1785
1 0.52 1.00 0.69 47
accuracy 0.98 1832
macro avg 0.76 0.99 0.84 1832
weighted avg 0.99 0.98 0.98 1832
ROC-AUC: 0.9999
Average Precision (AP): 0.9973
Confusion Matrix:
[[1742 43]
[ 0 47]]
The performance shown from both model summaries is a classic case of a model that has memorized the exam rather than learning the subject, likely due to the data leakage from the mismatched sets. While a 1.00 Recall and 0.9999 ROC-AUC look like a win, they indicate that the model is perfectly identifying the 47 positive cases because it either saw them (or their upsampled clones) during the training phase. The most telling metric is the 0.52 Precision, which reveals that even with "insider information," the model is still firing off 43 false alarms for every 47 correct hits, suggesting that once you fix the leakage, the real-world performance will likely drop significantly as the model loses its unfair advantage.
Model Comparisons
From the final model validation, it looks like we have hit a signal ceiling on the real predictive power of our data, shown by three models all landing around 0.24 precision. Compared to a baseline of 0.026, that is still a 9.2x lift. The upsampling, however, hurt generalization: the models learned the repeated rows, overfitted to noise, and then failed to perform well on new, novel data.
Misclassification is especially costly in medical diagnosis, particularly in America, where a false positive can be life-ruining through bills and a false negative through a failing organ.
Most of this is driven by the severe overlap in features, which makes it hard to separate positive from negative cases. A more standardized data cleaning and imputation method, more purpose-built data collection rather than discovery within already-collected data, and a dataset with a higher prevalence of kidney failure would all help produce a more effective model if this were to be productionized.