Kidney Failure Prediction

Chronic kidney disease (CKD) and kidney failure affect millions of people each year, and early prediction can help improve treatment and outcomes. In a previous notebook, I combined multiple datasets from the CDC and National Center for Health Statistics (NCHS) to create a unified, clean dataset for predicting kidney failure.

The final dataset included key health indicators such as:

Demographics & Lifestyle: Household income, age, daily water intake

Vitals: Systolic and diastolic blood pressure

Lab Measurements: Blood creatinine, urine creatinine, urine albumin-creatinine ratio (UACR), blood albumin, blood nitrogen levels

Medical History: Kidney stone history (boolean indicator)

I brought the data together and cleaned/imputed it in RStudio using packages such as janitor, ggplot2, mice, and the almighty tidyverse. I also ran the Variance Inflation Factor (VIF) over the final set, using it as a multicollinearity measure to remove redundant features.

The features went through different forms of imputation, from basic mean/median/mode fills to Predictive Mean Matching (PMM), to fill in missing values. These datasets were very sparsely populated for the outcomes of interest, and smaller still for individuals who actually had kidney failure reported.

Using imputation, and PMM in particular, over the whole set in the first notebook is a form of data leakage: PMM borrowed values across rows before the data was split into Train/Test/Valid. However, given the model results seen here and the scarcity of the set, it was necessary to push the modelling forward while acknowledging the limitation.
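A leakage-safe alternative can be sketched with scikit-learn: fit the imputer inside a pipeline so fill values are learned from training rows only. This is a minimal sketch with synthetic data, and median imputation stands in for PMM.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the sparse dataset described above.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
X[rng.random(X.shape) < 0.2] = np.nan          # ~20% missing values
y = (rng.random(500) < 0.1).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# The imputer is fit on the training split only, so test rows never
# influence the fill values -- avoiding the leakage described above.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
print("held-out accuracy:", pipe.score(X_test, y_test))
```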

Statistical plan

As the outcome is binary, three main models came to mind: logistic regression, random forest, and XGBoost. I also wanted to use an SVM, but it is unlikely to outperform tree-based models on structured, overlapping clinical data.

Goals

Produce well-calibrated probability estimates for kidney-failure risk.

Maximize clinically-relevant sensitivity (recall) for the positive class while managing false positives.

Demonstrate robustness using cross-validation.

Modelling

Logistic regression serves as the baseline and the interpretable model. Preprocessing bins and log-transforms the features to give the model the best opportunity to draw linear boundaries between them. An elastic net then produces a reduced model, which is compared against the full one.

The results are then compared to a basic and a tuned random forest. Once the random forest is complete, an XGBoost model tests whether tuned boosted trees outperform the random forest and logistic regression.

Finally, the training data is upsampled and the best models re-run before a final model comparison, focusing on PR-AUC and ROC-AUC.
householdincome      int64
age                  int64
water                int64
diabp                int64
sysbp                int64
creatinine         float64
urinecreatinine      int64
uacr               float64
albumin              int64
bloodnitro           int64
stonesboolean        int64
failingkidney        int64
dtype: object

Splitting into Train/Test/Valid and checking that the class proportions stay consistent across splits, since few positive cases are available.

Rows before: 10175
Rows after: 10175
Rows before: 10175
Rows after: 10175
Train set:      6105 samples
Validation set: 2035 samples
Test set:       2035 samples

Train set:
  Class 0: 5948 samples (97.43%)
  Class 1: 157 samples (2.57%)
------------------------------

Test set:
  Class 0: 1983 samples (97.44%)
  Class 1: 52 samples (2.56%)
------------------------------

Validation set:
  Class 0: 1982 samples (97.40%)
  Class 1: 53 samples (2.60%)
------------------------------
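The split above can be reproduced with two stratified calls to `train_test_split`; this is a sketch with synthetic data matching the row counts, not the notebook's actual loading code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the ~2.6%-positive outcome column.
rng = np.random.default_rng(1)
X = rng.normal(size=(10175, 5))
y = (rng.random(10175) < 0.026).astype(int)

# 60/20/20 split; stratify keeps the rare positive class at the
# same rate in every partition, as in the counts above.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.40, stratify=y, random_state=42)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)

for name, yy in [("train", y_train), ("valid", y_valid), ("test", y_test)]:
    print(name, len(yy), f"{yy.mean():.3%} positive")
```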

Checking Log Reg Assumptions

Running Box-Tidwell to test whether each variable is linearly related to the log-odds. Several features failed, so we move to transformations to try and bring them more in line for the Log Reg to pick up. (The binary stonesboolean has no meaningful Box-Tidwell term, hence the NaN.)

creatinine        0.000000
uacr              0.000000
householdincome   0.014020
sysbp             0.041797
urinecreatinine   0.079348
age_shifted       0.113579
albumin           0.120395
bloodnitro        0.364051
diabp_shifted     0.635525
water             0.808855
stonesboolean          NaN
Name: Box-Tidwell p-values, dtype: float64

Checking which transformations would work best for the Log Reg outcomes. To prevent further data leakage through 'peeking', it is best to wrap the transformations and model build into a pipeline with cross-validation.
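Such a pipeline can be sketched with scikit-learn (hypothetical column assignments, synthetic data): the binning and log transforms are refit inside every CV fold, so no held-out rows leak into the preprocessing.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import KBinsDiscretizer, FunctionTransformer, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic positive-valued features standing in for the real columns.
rng = np.random.default_rng(3)
X = np.abs(rng.normal(size=(1500, 4))) + 0.1
y = (rng.random(1500) < 0.1).astype(int)

# Columns 0-1 get 5 quantile bins, column 2 gets log1p, the rest are scaled.
prep = ColumnTransformer([
    ("bins", KBinsDiscretizer(n_bins=5, encode="ordinal"), [0, 1]),
    ("log", FunctionTransformer(np.log1p), [2]),
], remainder=StandardScaler())

pipe = Pipeline([("prep", prep), ("clf", LogisticRegression(max_iter=1000))])
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
print("mean CV ROC-AUC:", scores.mean())
```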

                     raw      log    bin_5 best_category
householdincome 0.479194 0.479194 0.555281         bin_5
age             0.657161 0.657161 0.652746           raw
water           0.565570 0.565570 0.548484           raw
diabp           0.587725 0.587725 0.567551           raw
sysbp           0.556482 0.556482 0.558819         bin_5
creatinine      0.722856 0.722856 0.664630           raw
urinecreatinine 0.613126 0.613126 0.608567           raw
uacr            0.613696 0.613696 0.571270           raw
albumin         0.603902 0.603902 0.588164           raw
bloodnitro      0.692000 0.692000 0.657949           raw
stonesboolean   0.541439 0.541439 0.540544           raw

This wrapper shows that with a standard set of 5 bins, two features are best represented by binning; the next step is finding the bin count that best explains the data.

householdincome: 7 bins, score = 0.5566
sysbp: 7 bins, score = 0.5749

Once the bins were set, two features were also best log-transformed, creatinine and UACR, to better fit the linearity assumption.

We need to scale the data to a standardized range for good model prediction, as the set contains values from 0 to 500 while we are trying to predict a 0/1 outcome.

Checking the outliers, household income has the most on the high side, but with over 6,000 data points I would like to keep the outliers in and model accordingly, acknowledging they may skew the Log Reg line toward the high side since high outliers far outnumber low ones.

                 high  low
householdincome   192    0
age                 0    0
water               0    0
diabp              16   66
sysbp              96    1
creatinine         34    0
urinecreatinine    77    0
uacr               48    0
albumin            18   59
bloodnitro         84    0
stonesboolean      13    0
UACR_log          124    0
Creatinine_log     43    3
Optimization terminated successfully.
         Current function value: 0.095486
         Iterations 9
                           Logit Regression Results                           
==============================================================================
Dep. Variable:          failingkidney   No. Observations:                 6105
Model:                          Logit   Df Residuals:                     6089
Method:                           MLE   Df Model:                           15
Date:                Fri, 27 Feb 2026   Pseudo R-squ.:                  0.2011
Time:                        15:41:46   Log-Likelihood:                -582.94
converged:                       True   LL-Null:                       -729.68
Covariance Type:            nonrobust   LLR p-value:                 1.265e-53
==========================================================================================
                             coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------
const                     -3.8618      0.330    -11.688      0.000      -4.509      -3.214
householdincome            0.0925      0.100      0.923      0.356      -0.104       0.289
age                        0.0797      0.108      0.736      0.462      -0.133       0.292
water                     -0.2184      0.086     -2.553      0.011      -0.386      -0.051
diabp                     -0.3465      0.067     -5.168      0.000      -0.478      -0.215
sysbp                      0.1028      0.137      0.753      0.452      -0.165       0.370
creatinine                -0.0390      0.140     -0.279      0.780      -0.313       0.235
urinecreatinine           -0.5497      0.112     -4.915      0.000      -0.769      -0.331
uacr                      -0.0291      0.052     -0.563      0.574      -0.131       0.072
albumin                   -0.1508      0.087     -1.736      0.083      -0.321       0.019
bloodnitro                 0.0504      0.080      0.627      0.531      -0.107       0.208
stonesboolean             -0.2464      0.107     -2.313      0.021      -0.455      -0.038
householdincome_binned    -0.0827      0.055     -1.506      0.132      -0.190       0.025
sysbp_binned              -0.0528      0.097     -0.544      0.586      -0.243       0.137
UACR_log                   0.1526      0.084      1.807      0.071      -0.013       0.318
Creatinine_log             0.7904      0.173      4.563      0.000       0.451       1.130
==========================================================================================
Accuracy (TP+TN)/(TP+TN+FP+FN): 0.9778869778869779
Precision (TP)/(TP+FP): 0.7333333333333333
Recall (TP)/(TP+FN): 0.21153846153846154
F1 2(precision*recall)/(precision+recall): 0.32835820895522383
Confusion matrix:
 [[1979    4]
 [  41   11]]
ROC-AUC: 0.7585825672058653
Report:               precision    recall  f1-score   support

           0       0.98      1.00      0.99      1983
           1       0.73      0.21      0.33        52

    accuracy                           0.98      2035
   macro avg       0.86      0.60      0.66      2035
weighted avg       0.97      0.98      0.97      2035

Running the initial Log Reg, we can see that systolic blood pressure, household income, and UACR (among others) do not meet the p < 0.05 threshold. That is to be expected: the binned versions of household income and sysbp absorb the variance the raw features would explain, and UACR is better described by its log transform.

We also have very strong accuracy and precision, but this is due to the heavily imbalanced dataset. Focusing on recall, we did terribly on the class of interest: positive kidney-failure cases were caught at only 0.21, meaning roughly 79% of true positives (41 of 52) were missed.

The ROC-AUC is decent at 0.7586, but it is primarily driven by the correct classification of the abundant negative cases.

With a basic 0.5 threshold we get an F1 score of 0.328, which is very low and suggests there may be a better threshold for this data, one that trades some precision for more recall.
C values: [0.35938137]
Best l1 ratio: [0.9]
householdincome           0.058643
age                       0.082588
water                    -0.199325
diabp                    -0.336035
sysbp                     0.035424
creatinine                0.000000
urinecreatinine          -0.513618
uacr                     -0.009090
albumin                  -0.134157
bloodnitro                0.056873
stonesboolean            -0.213313
householdincome_binned   -0.071297
sysbp_binned             -0.006125
UACR_log                  0.133400
Creatinine_log            0.726119
dtype: float64
Intercept:  [-3.96877316]
Non-zero coefficients from Elastic Net (90% lasso, 10% ridge):
 Creatinine_log            0.726119
urinecreatinine          -0.513618
diabp                    -0.336035
stonesboolean            -0.213313
water                    -0.199325
albumin                  -0.134157
UACR_log                  0.133400
age                       0.082588
householdincome_binned   -0.071297
householdincome           0.058643
bloodnitro                0.056873
sysbp                     0.035424
uacr                     -0.009090
sysbp_binned             -0.006125
dtype: float64

We ran some code testing different splits of the lasso/ridge mix in an elastic net; the search settled on a lasso-heavy mix (l1_ratio = 0.9), since the L1 penalty is what actually shrinks coefficients to zero and prunes features.
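The mix search can be sketched with `LogisticRegressionCV`, which cross-validates both the penalty strength C and the l1_ratio (1.0 would be pure lasso, 0.0 pure ridge). Synthetic data here; the notebook's grid may differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic imbalanced data standing in for the kidney dataset.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           weights=[0.95], random_state=5)

# saga is the solver that supports the elastic-net penalty; CV picks
# the (C, l1_ratio) pair with the best ROC-AUC.
enet = LogisticRegressionCV(
    penalty="elasticnet", solver="saga", l1_ratios=[0.1, 0.5, 0.9],
    Cs=3, cv=3, scoring="roc_auc", max_iter=5000, random_state=5,
).fit(X, y)

print("chosen l1_ratio:", enet.l1_ratio_[0])
print("coefficients shrunk to zero:", int((enet.coef_ == 0).sum()))
```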

Accuracy (TP+TN)/(TP+TN+FP+FN): 0.9778869778869779
Precision (TP)/(TP+FP): 0.7333333333333333
Recall (TP)/(TP+FN): 0.21153846153846154
F1 2(precision*recall)/(precision+recall): 0.32835820895522383
Confusion matrix:
 [[1979    4]
 [  41   11]]
ROC-AUC: 0.7589122929516273
Report:               precision    recall  f1-score   support

           0       0.98      1.00      0.99      1983
           1       0.73      0.21      0.33        52

    accuracy                           0.98      2035
   macro avg       0.86      0.60      0.66      2035
weighted avg       0.97      0.98      0.97      2035

Comparing the reduced model, we maintain essentially the same ROC-AUC with only one coefficient zeroed out (raw creatinine).

Checking for a better threshold value to capture more positive cases: around a 0.2 threshold would work better than a flat 0.5.
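The threshold search can be sketched with `precision_recall_curve`; the scores below are synthetic stand-ins for the model's validation-set probabilities.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Synthetic rare-positive labels and imperfect predicted probabilities,
# higher on average for the positive class.
rng = np.random.default_rng(4)
y_val = (rng.random(2000) < 0.03).astype(int)
proba = np.clip(rng.normal(0.1 + 0.25 * y_val, 0.1), 0, 1)

# Pick the cutoff that maximizes F1 instead of defaulting to 0.5.
prec, rec, thresh = precision_recall_curve(y_val, proba)
f1 = 2 * prec * rec / np.clip(prec + rec, 1e-12, None)
best = np.argmax(f1[:-1])          # last curve point has no threshold
print("best threshold:", thresh[best], "F1 there:", f1[best])
```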

Accuracy (TP+TN)/(TP+TN+FP+FN): 0.9754299754299754
Precision (TP)/(TP+FP): 0.5294117647058824
Recall (TP)/(TP+FN): 0.34615384615384615
F1 2(precision*recall)/(precision+recall): 0.41860465116279066
Confusion matrix:
 [[1967   16]
 [  34   18]]
ROC-AUC: 0.7589122929516273
Report:               precision    recall  f1-score   support

           0       0.98      0.99      0.99      1983
           1       0.53      0.35      0.42        52

    accuracy                           0.98      2035
   macro avg       0.76      0.67      0.70      2035
weighted avg       0.97      0.98      0.97      2035

After lowering the cutoff to 0.2, we obtain improved metrics on the positive class: recall of 0.35, precision of 0.53, and a stronger F1 of 0.42. Accuracy stays high at 0.975 because true negatives still drive the majority of predictions. This is a good trade for model performance.

Random Forest Baseline

Now moving on to the Random Forest build. It looks like there is no clean linear boundary between the variables for a Log Reg, so trees will likely perform better.

The first random forest uses a standard balanced baseline, weighting the outcomes inversely to their frequency. It performs very poorly, even by Log Reg standards, so we go directly into tuning.
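That baseline can be sketched as follows (synthetic data; `class_weight="balanced"` applies the inverse-frequency weights described above):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic ~3%-positive data standing in for the kidney dataset.
X, y = make_classification(n_samples=4000, n_features=8, weights=[0.97],
                           random_state=6)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=6)

# "balanced" reweights each class inversely to its frequency,
# so the rare positive class is not ignored during training.
rf = RandomForestClassifier(n_estimators=300, class_weight="balanced",
                            random_state=6).fit(X_tr, y_tr)
print("ROC-AUC:", roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1]))
```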

--- Random Forest Baseline Model ---
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      1983
           1       0.57      0.23      0.33        52

    accuracy                           0.98      2035
   macro avg       0.78      0.61      0.66      2035
weighted avg       0.97      0.98      0.97      2035

ROC_AUC:  0.7416695760114822
Confusion Matrix: 
 [[1974    9]
 [  40   12]]

It looks like our ROC curve is predicting better than chance overall, but just like the Log Reg model we need to move our threshold value (closer to 0.2-0.39). I am also running multiple model comparisons at once to find the best mix of class weights and thresholds.
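A sketch of the joint sweep over class weights and thresholds (synthetic data; the notebook's actual grid may differ): train once per weighting, then re-score the same probabilities at several cutoffs.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score

# Synthetic imbalanced data standing in for the kidney dataset.
X, y = make_classification(n_samples=3000, n_features=8, weights=[0.97],
                           random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=7)

rows = []
for w in [{0: 1, 1: 10}, {0: 1, 1: 25}]:
    rf = RandomForestClassifier(n_estimators=200, class_weight=w,
                                random_state=7).fit(X_tr, y_tr)
    proba = rf.predict_proba(X_te)[:, 1]
    # Thresholding is free once probabilities exist: no retraining needed.
    for t in [0.1, 0.2, 0.3, 0.5]:
        pred = (proba >= t).astype(int)
        rows.append({"class_weight": str(w), "threshold": t,
                     "precision": precision_score(y_te, pred, zero_division=0),
                     "recall": recall_score(y_te, pred),
                     "f1": f1_score(y_te, pred)})
print(pd.DataFrame(rows).sort_values("f1", ascending=False).head())
```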

    class_weight  threshold  precision   recall       f1  roc_auc
5  {0: 1, 1: 25}   0.100000   0.240964 0.384615 0.296296 0.768276
3  {0: 1, 1: 25}   0.200000   0.562500 0.346154 0.428571 0.768276
4  {0: 1, 1: 25}   0.150000   0.367347 0.346154 0.356436 0.768276
2  {0: 1, 1: 25}   0.250000   0.592593 0.307692 0.405063 0.768276
1  {0: 1, 1: 25}   0.300000   0.590909 0.250000 0.351351 0.768276
0  {0: 1, 1: 25}   0.500000   0.800000 0.153846 0.258065 0.768276
--- Random Forest Tuned Model ---
              precision    recall  f1-score   support

           0       0.98      0.99      0.99      1983
           1       0.57      0.33      0.41        52

    accuracy                           0.98      2035
   macro avg       0.77      0.66      0.70      2035
weighted avg       0.97      0.98      0.97      2035

ROC_AUC:  0.7884809340936421
Confusion Matrix: 
 [[1970   13]
 [  35   17]]

The tuned random forest is similar to the Log Reg in performance, picking up 5 more true positives over the RF baseline at the cost of 4 more false positives. That only bumped F1 by about 0.08, with a higher ROC-AUC of 0.7885, so we will move to the XGBoosted trees.

XGBoost Model Build

After observing similar performance from both the final Log Reg and the tuned Random Forest, the classes likely overlap heavily in feature space. XGBoost can carve more well-defined splits during tree building, but at the cost of lower interpretability. This hints that the data carries a limited or hidden signal for prediction, compounded by the low outcome balance, so we need stronger prediction methods, which typically put the power in the hands of the algorithm rather than the analyst: harder to explain, but often stronger performance.


The base XGBoost build shows a potential overfitting problem: the model is matching the noise of the data rather than the few unique positive cases. While the training score keeps rising, the test curve starts falling quickly after roughly the 21st round, with round ~15 showing some of the quickest and strongest performance.

Following the initial discovery of rapid overfitting, we are now performing a Randomized Search to optimize the model's architecture. We are tuning the learning pace, tree complexity (depth and node size), and regularization penalties to ensure the model captures the unique minority cases without memorizing the noise in the data.

Fitting 5 folds for each of 40 candidates, totalling 200 fits

Best CV PR-AUC:  0.2313511425184655
Best Params: 
 subsample: 0.9
 reg_lambda: 1.0
 n_estimators: 1000
 min_child_weight: 1
 max_depth: 5
 learning_rate: 0.1
 gamma: 1.0
 colsample_bytree: 0.9

[Tuned] ROC_AUC: 0.762 | PR-AUC: 0.342

From the best parameter fits we can see that there is a lot of noise in this dataset. The best build uses a min_child_weight of 1 with a maximum tree depth of 5, plus a gamma of 1.0 that prunes splits which do not improve the loss enough. We also had a relatively quick learning rate of 0.1, suggesting the model was pushing to simplify its search and move on to avoid noise and overfitting. The divergence between the cross-validation PR-AUC (23.1%) and the final test PR-AUC (34.2%) suggests that the model is 'data-hungry'; it performs significantly better when it has access to the full training distribution to validate its simple rules.

[0]	validation_0-aucpr:0.18418
[30]	validation_0-aucpr:0.22533
[48]	validation_0-aucpr:0.23679

[Final Model] ROC_AUC:  0.743 | PR-AUC: 0.243
Best threshold for recall≈0.7: 0.021
[[1178  607]
 [  14   33]]
              precision    recall  f1-score   support

           0      0.988     0.660     0.791      1785
           1      0.052     0.702     0.096        47

    accuracy                          0.661      1832
   macro avg      0.520     0.681     0.444      1832
weighted avg      0.964     0.661     0.774      1832

Using the best model and forcing a target recall of 70% (correctly classifying roughly 70% of positive cases), we end up with a very low threshold of 0.021: any case with at least a 2.1% predicted probability gets flagged. That yields roughly 1 true positive for every 18 false positives (33 TP vs 607 FP). The precision of 0.052 against a base rate of 47/1832 ≈ 2.6% shows we about doubled the odds over randomly guessing whether an individual has kidney failure, while also flagging 607 people incorrectly. This would not be a very useful test.
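Picking a threshold for a target recall can be sketched like this; the scores below are synthetic stand-ins for the tuned model's validation-set probabilities.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Synthetic rare-positive labels and noisy predicted probabilities.
rng = np.random.default_rng(9)
y_val = (rng.random(2000) < 0.026).astype(int)
proba = np.clip(rng.normal(0.05 + 0.1 * y_val, 0.05), 0, 1)

prec, rec, thresh = precision_recall_curve(y_val, proba)
# Recall is non-increasing along the thresholds, so the last index
# still meeting the target gives the highest usable cutoff.
ok = np.where(rec[:-1] >= 0.7)[0]
t = thresh[ok[-1]]
print(f"threshold for recall>=0.7: {t:.3f}  precision there: {prec[ok[-1]]:.3f}")
```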

Upsampling into XGBoost and Random Forest

Now that we have seen some lower-performing models on the base dataset, it is time to upsample the positive cases in the training data and hopefully build a more robust model.
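The upsampling step can be sketched with `sklearn.utils.resample`, shown here on a synthetic stand-in with the same 157/5948 training counts as above; note it is applied to the training split only.

```python
import numpy as np
from sklearn.utils import resample

# Synthetic training split with the same class counts as above.
rng = np.random.default_rng(10)
X_tr = rng.normal(size=(6105, 4))
y_tr = np.zeros(6105, dtype=int)
y_tr[:157] = 1

X_pos, X_neg = X_tr[y_tr == 1], X_tr[y_tr == 0]
# Sample the minority class WITH replacement up to the majority count.
X_pos_up = resample(X_pos, replace=True, n_samples=len(X_neg), random_state=10)
X_bal = np.vstack([X_neg, X_pos_up])
y_bal = np.r_[np.zeros(len(X_neg), dtype=int), np.ones(len(X_pos_up), dtype=int)]
print("class counts:", np.bincount(y_bal))    # -> [5948 5948]
```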

Class distribution after upsampling:  Counter({0: 5948, 1: 5948})
Confusion Matrix: 
 [[1629  156]
 [   5   42]]

Classification Report: 
               precision    recall  f1-score   support

           0      0.997     0.913     0.953      1785
           1      0.212     0.894     0.343        47

    accuracy                          0.912      1832
   macro avg      0.605     0.903     0.648      1832
weighted avg      0.977     0.912     0.937      1832


ROC-AUC: 
 0.9031110316467013

PR-AUC: 
 0.19228438336725986

With the manually upsampled data, the final model produced 0.212 precision and 0.894 recall, with 156 false positives: roughly one true positive for every four false positives, far better than the ~1:18 ratio of the previous model at 70% forced recall. The PR-AUC dropped drastically with the balanced upsample approach, and we still have 5 false negatives to account for. This suggests some data points may simply be inseparable as positive or negative by this method.

Fitting 5 folds for each of 40 candidates, totalling 200 fits

Best CV PR-AUC: 0.998762727969751
Best Params:
 subsample: 0.8
 reg_lambda: 1.0
 n_estimators: 1200
 min_child_weight: 2
 max_depth: 7
 learning_rate: 0.05
 gamma: 0
 colsample_bytree: 1.0
Final ROC-AUC:  1.000 | PR-AUC:  0.997

Tuning the upsampled XGBoost model, we get a mirage of good cross-validation: because we upsampled before cross-validating, duplicated minority rows land in both the training and validation folds. The 0.997 CV PR-AUC mostly reflects the model re-recognizing its own clones, and compared to the base model it is overfitting to the upsampling very strongly, even with early stopping and regularization in place.
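A leakage-free version upsamples inside each fold, so clones of a minority row can never sit on both sides of a split. A sketch with synthetic data:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import resample
from sklearn.metrics import average_precision_score
from sklearn.datasets import make_classification

# Synthetic imbalanced data standing in for the kidney dataset.
X, y = make_classification(n_samples=3000, n_features=8, weights=[0.97],
                           random_state=11)

scores = []
for tr, va in StratifiedKFold(5, shuffle=True, random_state=11).split(X, y):
    X_tr, y_tr = X[tr], y[tr]
    # Upsample the minority class using TRAINING rows of this fold only.
    pos, neg = X_tr[y_tr == 1], X_tr[y_tr == 0]
    pos_up = resample(pos, replace=True, n_samples=len(neg), random_state=11)
    X_bal = np.vstack([neg, pos_up])
    y_bal = np.r_[np.zeros(len(neg)), np.ones(len(pos_up))]
    clf = RandomForestClassifier(n_estimators=200, random_state=11)
    clf.fit(X_bal, y_bal)
    # Validate on untouched, un-duplicated rows.
    scores.append(average_precision_score(y[va], clf.predict_proba(X[va])[:, 1]))
print("fold PR-AUCs:", np.round(scores, 3))
```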

Moving to the Random Forest

Threshold Analysis:
   threshold  precision   recall       f1  roc_auc  avg_precision
0   0.500000   0.886792 1.000000 0.940000 0.999928       0.997260
1   0.400000   0.534091 1.000000 0.696296 0.999928       0.997260
2   0.300000   0.256831 1.000000 0.408696 0.999928       0.997260
3   0.250000   0.163763 1.000000 0.281437 0.999928       0.997260
4   0.200000   0.101732 1.000000 0.184676 0.999928       0.997260
5   0.150000   0.060960 1.000000 0.114914 0.999928       0.997260
==============================
FINAL REPORT (Threshold: 0.39)
==============================
              precision    recall  f1-score   support

           0       1.00      0.98      0.99      1785
           1       0.52      1.00      0.69        47

    accuracy                           0.98      1832
   macro avg       0.76      0.99      0.84      1832
weighted avg       0.99      0.98      0.98      1832

ROC-AUC: 0.9999
Average Precision (AP): 0.9973

Confusion Matrix:
[[1742   43]
 [   0   47]]

The performance shown from both model summaries is a classic case of a model that has memorized the exam rather than learning the subject, likely due to the data leakage from the mismatched sets. While a 1.00 Recall and 0.9999 ROC-AUC look like a win, they indicate that the model is perfectly identifying the 47 positive cases because it either saw them (or their upsampled clones) during the training phase. The most telling metric is the 0.52 Precision, which reveals that even with "insider information," the model is still firing off 43 false alarms for every 47 correct hits, suggesting that once you fix the leakage, the real-world performance will likely drop significantly as the model loses its unfair advantage.

Model Comparisons


From the final model validation, it looks like we have hit a signal ceiling in the predictive power of our data: three models all land at 0.24 precision. Compared to the 0.026 base rate, though, that is still a 9.2x lift. The upsampling hurt model performance; those models learned the repeated rows, overfit to noise, and failed when presented with new, novel data.

Misclassification is especially costly in medical diagnosis, particularly in America, where a false positive can be financially ruinous and a false negative means a failing organ goes untreated.

Most of this is driven by the severe overlap between features, which makes positive and negative cases hard to separate. A more standardized cleaning and imputation method, more direct data collection rather than discovery from already-collected data, and a dataset with a higher prevalence of kidney failure would all help produce a more effective model if this were to be productionized.