Abstract
I scoured the web for a fun dataset that would be relatable to real life. I had a couple of close brushes with death over the summer with my grandma passing away, which got me interested in when I might die myself. That led to the question of which variables come into play when estimating life expectancy. The Kaggle dataset combines data from WHO and many other sources into a somewhat comprehensive set of potential covariates recorded alongside life expectancy in various parts of the world. In this statistical analysis, I look at the various features and trim them down to a model that estimates the life expectancy of an individual, no matter their place in the world, with a handful of variables. The final model trimmed 20 potential Xi's down to an easy-to-manage 5 covariates while suffering only a ~2% decrease in R^2 from the full model to the reduced one. The reduced model uses Adult_mortality, Schooling, Incidents_HIV, Alcohol_consumption, and Hepatitis_B to estimate average longevity within the population, and these 5 covariates explain 95.44% of the variance in Y.
Introduction
This project started when I found the Life Expectancy (WHO) Fixed dataset on Kaggle. Having always wondered to what degree each variable in life contributes to estimated life expectancy throughout the world, I set to work. The dataset has the following variables:
"Country" - which of the 179 countries the data was collected from
"Region" - one of 9 geographic regions each country falls into
"Year" - the year the data was originally collected from the country (2000-2015)
"Infant_deaths" - infant deaths per 1000 population
"Under_five_deaths" - deaths of children under age 5 per 1000 population
"Adult_mortality" - death rate of adults per 1000 population
"Alcohol_consumption" - liters of pure alcohol per capita, 15+ years old
"Hepatitis_B" - % coverage of vaccine in 1 year olds
"Measles" - % coverage of vaccine in 1 year olds
"BMI" - weight in kg / (height in m)^2
"Polio" - % coverage of vaccine in 1 year olds
"Diphtheria" - % coverage of vaccine in 1 year olds
"Incidents_HIV" - incidents of HIV per 1000 population aged 15-49
"GDP_per_capita" - GDP per capita in USD
"Population_mln" - population in millions
"Thinness_ten_nineteen_years" - 2 standard deviations below average BMI in 10-19 year olds
"Thinness_five_nine_years" - 2 standard deviations below average BMI in 5-9 year olds
"Schooling" - average years spent in formal education for 25+ year olds
"Economy_status_Developed" - binary indicator for a developed country
"Economy_status_Developing" - binary indicator for a developing country
Armed with these data points, all I had to do was build out and then interpret the model. Through the exploratory portion I found that many of the countries with the highest life expectancies were already developed and considered culturally part of the "West".
Using the above-mentioned data points as Xi values, I hoped to build a somewhat accurate model of the relationship between these factors and Yi, "Life_expectancy": the average age an adult could expect to live to given their background and country or region, while controlling for other sources of variance through the model's assumptions and coefficients.
Methods
The initial full model used all variables (except Economy_status_Developing) to create a prediction.
Yi - "Life_expectancy" - average life expectancy of an adult
Xi1 - "Country" - which of the 179 countries the data was collected from for i-th observation
Xi2 - "Region" - one of 9 geographic regions each country falls into for i-th observation
Xi3 - "Year" - the year the data was originally collected from the country (2000-2015) for i-th observation
Xi4 - "Infant_deaths" - infant deaths per 1000 population for i-th observation
Xi5 - "Under_five_deaths" - deaths of children under age 5 per 1000 population for i-th observation
Xi6 - "Adult_mortality" - death rate of adults per 1000 population for i-th observation
Xi7 - "Alcohol_consumption" - liters of pure alcohol per capita, 15+ years old, for i-th observation
Xi8 - "Hepatitis_B" - % coverage of vaccine in 1 year olds for i-th observation
Xi9 - "Measles" - % coverage of vaccine in 1 year olds for i-th observation
Xi10 - "BMI" - weight in kg / (height in m)^2 for i-th observation
Xi11 - "Polio" - % coverage of vaccine in 1 year olds for i-th observation
Xi12 - "Diphtheria" - % coverage of vaccine in 1 year olds for i-th observation
Xi13 - "Incidents_HIV" - incidents of HIV per 1000 population aged 15-49 for i-th observation
Xi14 - "GDP_per_capita" - GDP per capita in USD for i-th observation
Xi15 - "Population_mln" - population in millions for i-th observation
Xi16 - "Thinness_ten_nineteen_years" - 2 standard deviations below average BMI in 10-19 year olds for i-th observation
Xi17 - "Thinness_five_nine_years" - 2 standard deviations below average BMI in 5-9 year olds for i-th observation
Xi18 - "Schooling" - average years spent in formal education for 25+ year olds for i-th observation
Xi19 - "Economy_status_Developed" - binary indicator for a developed country for i-th observation
Once I had the initial results, I removed the Economy_status_Developing column, as it was redundant with the binary Economy_status_Developed feature. Then I narrowed down the variables, primarily by checking for multicollinearity. To find the potential problem variables, I could have produced a pairs plot of every variable; however, with this many variables the panels were too small to read and not very useful. As I was not interested in parsing out and investigating each plot individually, I instead used a heatmap of the correlations, which was not very helpful either. VIF was not possible because some of the variables were perfectly multicollinear, and with the plots being fickle before I had even started exploring the data, I was somewhat limited by working on a laptop.
So I set up a bit of a black box and hardcoded any variables that had a correlation coefficient higher than 0.7 to be segmented into their own data frame, then removed Life_expectancy from the data set. The variables that were pulled were "Infant_deaths", "Under_five_deaths", "Polio", "Diphtheria", and "Thinness_ten_nineteen_years", and it made sense that these were highly correlated with other variables. Comparing these columns to the heatmap, along with some research, made the multicollinearity clear: infant deaths are related to deaths under five; polio and diphtheria are both routine vaccines, and if a population is not vaccinated for one it typically is not vaccinated for the other either, which says more about the healthcare in a particular country than anything else; and Thinness_ten_nineteen_years (a measure of BMI for 10-19 year olds) overlaps directly with BMI and the other thinness measure. Once I removed these variables and separated the data by region rather than down to the country level, I was able to use more useful tools such as the Variance Inflation Factor, paired scatterplots, and honing the model with the regsubsets function (from the leaps package).
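The 0.7 cutoff described above can be sketched in base R. This is a minimal illustration on synthetic data, not the code used for the analysis: the helper `flag_high_cor`, the toy columns `x1` to `x3`, and the tie-breaking rule (drop whichever member of a correlated pair has the larger mean absolute correlation with everything else) are all assumptions.

```r
# Hypothetical helper: from each pair of variables whose absolute correlation
# exceeds `cutoff`, flag the one with the larger mean absolute correlation.
flag_high_cor <- function(cor_mat, cutoff = 0.7) {
  abs_cor <- abs(cor_mat)
  diag(abs_cor) <- 0                      # ignore self-correlation
  mean_cor <- rowMeans(abs_cor)           # overall "connectedness" per variable
  pairs <- which(abs_cor > cutoff & upper.tri(abs_cor), arr.ind = TRUE)
  flagged <- character(0)
  for (k in seq_len(nrow(pairs))) {
    a <- rownames(abs_cor)[pairs[k, 1]]
    b <- colnames(abs_cor)[pairs[k, 2]]
    if (a %in% flagged || b %in% flagged) next   # pair already handled
    flagged <- c(flagged, if (mean_cor[a] >= mean_cor[b]) a else b)
  }
  flagged
}

# Synthetic data (assumption): x2 is nearly a copy of x1, x3 is independent.
set.seed(1)
n <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
toy <- data.frame(x1, x2, x3)

flagged <- flag_high_cor(cor(toy), cutoff = 0.7)
kept <- toy[, setdiff(names(toy), flagged), drop = FALSE]
print(flagged)
```

On the real data the same function would be applied to the correlation matrix of the numeric predictors; which member of a pair gets dropped depends on the tie-breaking rule, so the result should be checked against subject-matter reasoning as done in the text.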
Using VIF, I found that the Region variable (which separates the world into 9 distinct areas) was causing multicollinearity issues as well. Year was also not indicative of the length of an average person's life, as it only records when the data was collected. With these 12 covariates swiftly removed, I was able to move on to the actual model-building portion.
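For reference, the variance inflation factor of a predictor is 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the other predictors. The sketch below computes it by hand in base R on synthetic data; the helper name `vif_manual` and the toy columns are hypothetical, and the source does not say which VIF implementation was actually used.

```r
# Hypothetical helper: VIF for every column of a numeric data frame,
# computed as 1 / (1 - R^2) from regressing each column on the others.
vif_manual <- function(data) {
  sapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    fit <- lm(reformulate(others, response = v), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Synthetic predictors (assumption): x2 is almost a copy of x1.
set.seed(1)
n <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
toy <- data.frame(x1, x2, x3)

vifs <- vif_manual(toy)
print(round(vifs, 2))
```

A predictor that is an exact linear combination of the others gives R^2 = 1, so the ratio is undefined; that is the situation in which VIF "was not possible" before the perfectly collinear variables were dropped.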
The final model used for estimation cut the variables down to a base of 5 Xi.
Life_expectancy ~ Adult_mortality + Schooling + Incidents_HIV + Alcohol_consumption + Hepatitis_B
Yi - "Life_expectancy" - average life expectancy of an adult
Xi1 - "Adult_mortality" - death rate of adults per 1000 population for i-th observation
Xi2 - "Schooling"- Average years spent in formal education for 25+ year olds for i-th observation
Xi3 - "Incidents_HIV" - incidents of HIV per 1000 population aged 15-49 for i-th observation
Xi4 - "Alcohol_consumption"- Liters of pure alcohol per capita 15+ years old for i-th observation
Xi5 - "Hepatitis_B"-% coverage of vaccine in 1 year olds for i-th observation
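The reduced formula above can be fitted with `lm` and summarized with 99% confidence intervals as in the following sketch. The data frame `life_toy` is synthetic and exists only so the example runs on its own; its values and the coefficients used to generate it are made up and say nothing about the real dataset.

```r
# Synthetic stand-in for the real data (assumption): same column names,
# arbitrary values, and an arbitrary linear relationship plus noise.
set.seed(42)
n <- 300
life_toy <- data.frame(
  Adult_mortality     = runif(n, 50, 500),
  Schooling           = runif(n, 1, 14),
  Incidents_HIV       = runif(n, 0, 5),
  Alcohol_consumption = runif(n, 0, 15),
  Hepatitis_B         = runif(n, 20, 99)
)
life_toy$Life_expectancy <- 80 - 0.05 * life_toy$Adult_mortality +
  0.5 * life_toy$Schooling + rnorm(n)

# Reduced model with the five covariates named in the text.
final_fit <- lm(
  Life_expectancy ~ Adult_mortality + Schooling + Incidents_HIV +
    Alcohol_consumption + Hepatitis_B,
  data = life_toy
)

r_squared <- summary(final_fit)$r.squared   # share of variance explained
ci_99 <- confint(final_fit, level = 0.99)   # 99% CI for each coefficient
print(r_squared)
print(ci_99)
```

With the real data, replacing `life_toy` with the cleaned data frame would give the R^2 and interval table discussed in the Results.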
Results
The final fitted model output is below. I used a 99% confidence level (alpha = 0.01) because there is a ton of variance in the data, yet a large portion of the life expectancies clump together around ~60-75 no matter the Region or other factors. With the null hypothesis H0: β1 = ... = β5 = 0 (none of the Xi has a linear relationship with Yi) and the alternative hypothesis Ha: at least one βk ≠ 0, we were able to reject the null at 99% confidence even with the fully reduced final model. As the output shows, the p-value was well below 0.01, and the F statistic only had to exceed the critical value of 3.026, which further supports rejecting the null hypothesis. The 99% CIs for all coefficients are also below, and the coefficient estimates fall well within them.
As you probably suspected from reading through the list of variables, adult mortality in a given country is a key indicator of the lifespan of any given individual. Many factors go into adult mortality rates, but it is an overwhelmingly comprehensive measure of how well a country takes care of its people and of the expected lifespan of an individual within it, accounting for 93.84% of the variance in Yi no matter the model. Next comes Schooling with 5.15%. The occurrence of HIV within a population, which gives a good look at safe-sex and personal health practices and the availability of contraception, accounts for 0.568% of the variance in Yi. Liters of alcohol consumed, or how well entertained a population is, accounts for 0.267%. Finally, the Hepatitis B vaccination rate, which can also be read as the availability of preventative health services for the population, accounts for 0.174%.
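The F critical value quoted above can be reproduced approximately from the F distribution. The sketch assumes the overall F-test of a 5-covariate model fitted to all 2,864 rows, so the degrees of freedom are 5 and 2864 - 5 - 1; if the final model was fitted to a different number of rows the second value would change slightly.

```r
# Assumptions: 5 covariates and 2,864 observations, as reported in the text.
alpha <- 0.01
n_obs <- 2864
p <- 5

df_model <- p
df_resid <- n_obs - p - 1

# Critical value: reject H0 when the model's F statistic exceeds this.
f_crit <- qf(1 - alpha, df1 = df_model, df2 = df_resid)
print(f_crit)
```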
This dataset was pretty interesting, and I wanted to do a bit of exploratory analysis on it first to get a direction for building the model. So I first cleaned up the dataset further: with 20 variables available to predict Y across 2,864 data points, I had to narrow things down as much as possible before using any model-building or feature-selection techniques.
I removed the Economy_status_Developing column, as it carried the same binary information as the Economy_status_Developed column, and converted Region into a factor so that R would dummy-code the 9 regions for me.
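These cleaning steps might look like the following in base R. The four-row data frame `life_toy` is a made-up stand-in with the same column names, used only so the snippet runs on its own; the real data has 2,864 rows and 9 regions.

```r
# Tiny synthetic stand-in for the dataset (assumption): values are invented.
life_toy <- data.frame(
  Region                    = c("Asia", "Africa", "Asia", "Oceania"),
  Economy_status_Developed  = c(0, 0, 1, 1),
  Economy_status_Developing = c(1, 1, 0, 0),
  Life_expectancy           = c(70.1, 58.3, 81.0, 79.5)
)

# The two status columns are complements, so one of them is redundant.
stopifnot(all(life_toy$Economy_status_Developed +
                life_toy$Economy_status_Developing == 1))
life_toy$Economy_status_Developing <- NULL

# Storing Region as a factor lets lm() create the dummy variables itself.
life_toy$Region <- factor(life_toy$Region)

# Completeness check: number of missing values per column.
n_missing <- colSums(is.na(life_toy))
print(n_missing)
```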
Then I began building the model after visualizing some of the Life_expectancy trends, such as developed countries having higher values and a heavy left skew with life expectancy clustering at about 70-75 globally, and after checking the completeness of the dataset to prevent imputation errors.
Call:
lm(formula = Life_expectancy ~ Country + Region + Year + Infant_deaths +
Under_five_deaths + Adult_mortality + Alcohol_consumption +
Hepatitis_B + Measles + BMI + Polio + Diphtheria + Incidents_HIV +
GDP_per_capita + Population_mln + Thinness_ten_nineteen_years +
Thinness_five_nine_years + Schooling + Economy_status_Developed,
data = life)
Residuals:
Min 1Q Median 3Q Max
-2.0594 -0.2239 -0.0128 0.2094 5.2254
Coefficients: (9 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.030e+02 1.248e+01 -16.262 < 2e-16
CountryAlbania 5.694e+00 3.867e-01 14.724 < 2e-16
CountryAlgeria 4.940e+00 3.054e-01 16.177 < 2e-16
CountryAngola -3.614e+00 2.175e-01 -16.616 < 2e-16
CountryAntigua and Barbuda 7.061e+00 3.875e-01 18.222 < 2e-16
CountryArgentina 6.694e+00 4.236e-01 15.805 < 2e-16
CountryArmenia 5.511e+00 4.080e-01 13.507 < 2e-16
CountryAustralia 8.968e+00 5.758e-01 15.575 < 2e-16
CountryAustria 8.062e+00 4.823e-01 16.716 < 2e-16
CountryAzerbaijan 3.928e+00 4.018e-01 9.776 < 2e-16
CountryBahamas, The 5.655e+00 5.111e-01 11.064 < 2e-16
CountryBahrain 3.817e+00 3.872e-01 9.857 < 2e-16
CountryBangladesh 1.113e+00 2.704e-01 4.116 3.97e-05
CountryBarbados 9.306e+00 4.477e-01 20.786 < 2e-16
CountryBelarus 5.817e+00 4.251e-01 13.684 < 2e-16
CountryBelgium 8.061e+00 4.875e-01 16.535 < 2e-16
CountryBelize 6.318e+00 4.819e-01 13.112 < 2e-16
CountryBenin -1.233e+00 2.092e-01 -5.895 4.22e-09
CountryBhutan 3.151e+00 2.099e-01 15.009 < 2e-16
CountryBolivia 3.111e+00 3.272e-01 9.508 < 2e-16
CountryBosnia and Herzegovina 5.597e+00 3.433e-01 16.302 < 2e-16
CountryBotswana 9.806e-01 3.575e-01 2.743 0.006138
CountryBrazil 5.612e+00 3.771e-01 14.881 < 2e-16
CountryBrunei Darussalam 3.318e+00 4.514e-01 7.351 2.61e-13
CountryBulgaria 5.587e+00 4.049e-01 13.800 < 2e-16
CountryBurkina Faso -2.520e+00 2.599e-01 -9.696 < 2e-16
CountryBurundi -2.656e+00 2.384e-01 -11.139 < 2e-16
CountryCabo Verde 2.101e+00 2.649e-01 7.930 3.18e-15
CountryCambodia -3.525e-01 2.080e-01 -1.695 0.090181
CountryCameroon -2.524e-01 2.322e-01 -1.087 0.277207
CountryCanada 9.169e+00 5.386e-01 17.024 < 2e-16
CountryCentral African Republic -1.745e+00 2.235e-01 -7.809 8.22e-15
CountryChad -2.455e+00 2.481e-01 -9.896 < 2e-16
CountryChile 8.566e+00 4.497e-01 19.051 < 2e-16
CountryChina 2.631e+00 1.757e+00 1.498 0.134357
CountryColombia 7.318e+00 3.254e-01 22.490 < 2e-16
CountryComoros 3.750e-01 2.101e-01 1.785 0.074382
CountryCongo, Dem. Rep. -2.517e+00 2.157e-01 -11.669 < 2e-16
CountryCongo, Rep. -2.148e-01 2.250e-01 -0.955 0.339679
CountryCosta Rica 8.398e+00 4.021e-01 20.885 < 2e-16
CountryCote d'Ivoire -2.638e-01 2.153e-01 -1.225 0.220520
CountryCroatia 6.292e+00 4.384e-01 14.351 < 2e-16
CountryCuba 7.540e+00 3.821e-01 19.733 < 2e-16
CountryCyprus 7.248e+00 4.844e-01 14.964 < 2e-16
CountryCzechia 7.231e+00 4.841e-01 14.937 < 2e-16
CountryDenmark 6.661e+00 5.369e-01 12.407 < 2e-16
CountryDjibouti -9.279e-02 2.302e-01 -0.403 0.686894
CountryDominican Republic 5.716e+00 3.311e-01 17.263 < 2e-16
CountryEcuador 6.901e+00 3.661e-01 18.847 < 2e-16
CountryEgypt, Arab Rep. 4.600e+00 4.525e-01 10.167 < 2e-16
CountryEl Salvador 5.732e+00 3.694e-01 15.518 < 2e-16
CountryEquatorial Guinea -8.483e-01 2.665e-01 -3.183 0.001474
CountryEritrea -7.392e-01 2.743e-01 -2.695 0.007086
CountryEstonia 6.837e+00 4.646e-01 14.716 < 2e-16
CountryEswatini -3.575e+00 4.791e-01 -7.461 1.15e-13
CountryEthiopia -7.916e-01 2.588e-01 -3.059 0.002242
CountryFiji 1.999e+00 4.160e-01 4.807 1.62e-06
CountryFinland 8.029e+00 5.113e-01 15.703 < 2e-16
CountryFrance 9.347e+00 4.545e-01 20.563 < 2e-16
CountryGabon 6.503e-01 2.860e-01 2.274 0.023056
CountryGambia, The -1.818e+00 2.226e-01 -8.167 4.84e-16
CountryGeorgia 5.508e+00 4.412e-01 12.485 < 2e-16
CountryGermany 8.244e+00 5.281e-01 15.611 < 2e-16
CountryGhana -8.546e-01 2.299e-01 -3.717 0.000206
CountryGreece 8.723e+00 4.617e-01 18.894 < 2e-16
CountryGrenada 5.678e+00 3.737e-01 15.197 < 2e-16
CountryGuatemala 5.264e+00 3.089e-01 17.042 < 2e-16
CountryGuinea -2.174e+00 2.020e-01 -10.765 < 2e-16
CountryGuinea-Bissau -4.292e+00 2.137e-01 -20.081 < 2e-16
CountryGuyana 3.714e+00 3.322e-01 11.180 < 2e-16
CountryHaiti 4.362e-01 2.434e-01 1.792 0.073194
CountryHonduras 5.595e+00 3.168e-01 17.659 < 2e-16
CountryHungary 6.724e+00 4.529e-01 14.847 < 2e-16
CountryIceland 8.515e+00 5.241e-01 16.247 < 2e-16
CountryIndia 1.201e+00 1.598e+00 0.752 0.452362
CountryIndonesia 1.950e+00 3.573e-01 5.459 5.23e-08
CountryIran, Islamic Rep. 4.321e+00 3.410e-01 12.670 < 2e-16
CountryIraq 4.097e+00 4.059e-01 10.095 < 2e-16
CountryIreland 7.894e+00 5.641e-01 13.995 < 2e-16
CountryIsrael 9.012e+00 5.201e-01 17.325 < 2e-16
CountryItaly 8.762e+00 4.401e-01 19.909 < 2e-16
CountryJamaica 6.719e+00 3.924e-01 17.123 < 2e-16
CountryJapan 8.825e+00 4.345e-01 20.311 < 2e-16
CountryJordan 5.494e+00 4.912e-01 11.183 < 2e-16
CountryKazakhstan 5.000e+00 4.084e-01 12.241 < 2e-16
CountryKenya -5.759e-01 2.219e-01 -2.595 0.009510
CountryKiribati 3.976e+00 5.072e-01 7.838 6.55e-15
CountryKuwait 3.054e+00 5.550e-01 5.503 4.09e-08
CountryKyrgyz Republic 4.932e+00 3.660e-01 13.476 < 2e-16
CountryLao PDR 3.858e-01 2.161e-01 1.786 0.074271
CountryLatvia 6.806e+00 4.412e-01 15.427 < 2e-16
CountryLebanon 7.025e+00 4.017e-01 17.490 < 2e-16
CountryLesotho -2.819e+00 3.763e-01 -7.492 9.19e-14
CountryLiberia -4.736e-01 2.380e-01 -1.990 0.046692
CountryLibya 4.087e+00 4.189e-01 9.756 < 2e-16
CountryLithuania 7.145e+00 4.499e-01 15.881 < 2e-16
CountryLuxembourg 6.303e+00 7.731e-01 8.153 5.41e-16
CountryMadagascar 1.302e-01 2.415e-01 0.539 0.590038
CountryMalawi -9.598e-01 2.331e-01 -4.118 3.93e-05
CountryMalaysia 4.649e+00 3.437e-01 13.527 < 2e-16
CountryMaldives 3.619e+00 2.734e-01 13.237 < 2e-16
CountryMali -3.656e+00 2.203e-01 -16.594 < 2e-16
CountryMalta 8.423e+00 4.639e-01 18.157 < 2e-16
CountryMauritania 6.668e-01 2.381e-01 2.801 0.005136
CountryMauritius 4.908e+00 3.217e-01 15.255 < 2e-16
CountryMexico 6.821e+00 4.306e-01 15.840 < 2e-16
CountryMicronesia, Fed. Sts. 1.587e+00 4.904e-01 3.237 0.001224
CountryMoldova 4.677e+00 4.143e-01 11.289 < 2e-16
CountryMongolia 3.221e+00 3.435e-01 9.378 < 2e-16
CountryMontenegro 5.990e+00 4.171e-01 14.363 < 2e-16
CountryMorocco 3.633e+00 2.910e-01 12.485 < 2e-16
CountryMozambique -2.513e+00 2.132e-01 -11.784 < 2e-16
CountryMyanmar -9.138e-01 1.980e-01 -4.615 4.12e-06
CountryNamibia -6.691e-01 2.709e-01 -2.470 0.013560
CountryNepal 6.133e-01 2.087e-01 2.938 0.003331
CountryNetherlands 7.522e+00 5.022e-01 14.979 < 2e-16
CountryNew Zealand 9.163e+00 5.421e-01 16.902 < 2e-16
CountryNicaragua 5.663e+00 3.517e-01 16.102 < 2e-16
CountryNiger -2.616e+00 2.933e-01 -8.918 < 2e-16
CountryNigeria -2.359e+00 2.793e-01 -8.445 < 2e-16
CountryNorth Macedonia 4.920e+00 3.826e-01 12.857 < 2e-16
CountryNorway 7.456e+00 6.467e-01 11.529 < 2e-16
CountryOman 4.609e+00 3.844e-01 11.989 < 2e-16
CountryPakistan 1.102e+00 2.873e-01 3.835 0.000128
CountryPanama 8.037e+00 3.956e-01 20.316 < 2e-16
CountryPapua New Guinea -7.103e-01 2.582e-01 -2.751 0.005986
CountryParaguay 5.427e+00 3.365e-01 16.125 < 2e-16
CountryPeru 5.928e+00 3.504e-01 16.916 < 2e-16
CountryPhilippines 4.326e+00 2.881e-01 15.017 < 2e-16
CountryPoland 7.206e+00 4.317e-01 16.693 < 2e-16
CountryPortugal 7.707e+00 3.988e-01 19.325 < 2e-16
CountryQatar 6.504e+00 6.388e-01 10.182 < 2e-16
CountryRomania 5.615e+00 3.965e-01 14.160 < 2e-16
CountryRussian Federation 5.453e+00 4.387e-01 12.428 < 2e-16
CountryRwanda -6.380e-01 2.513e-01 -2.539 0.011174
CountrySamoa 5.919e+00 6.322e-01 9.362 < 2e-16
CountrySao Tome and Principe 1.231e+00 2.526e-01 4.873 1.16e-06
CountrySaudi Arabia 4.200e+00 4.756e-01 8.831 < 2e-16
CountrySenegal -5.244e-01 2.101e-01 -2.496 0.012637
CountrySerbia 5.262e+00 3.942e-01 13.347 < 2e-16
CountrySeychelles 5.067e+00 3.924e-01 12.915 < 2e-16
CountrySierra Leone -1.847e+00 2.292e-01 -8.055 1.18e-15
CountrySingapore 6.620e+00 4.605e-01 14.377 < 2e-16
CountrySlovak Republic 6.088e+00 4.348e-01 14.003 < 2e-16
CountrySlovenia 8.062e+00 4.603e-01 17.515 < 2e-16
CountrySolomon Islands 3.879e+00 3.039e-01 12.765 < 2e-16
CountrySomalia -7.450e-01 2.092e-01 -3.562 0.000375
CountrySouth Africa 3.524e+00 4.236e-01 8.318 < 2e-16
CountrySpain 9.240e+00 4.378e-01 21.103 < 2e-16
CountrySri Lanka 5.106e+00 3.378e-01 15.114 < 2e-16
CountrySt. Lucia 8.232e+00 4.700e-01 17.514 < 2e-16
CountrySt. Vincent and the Grenadines 4.992e+00 3.748e-01 13.318 < 2e-16
CountrySuriname 4.244e+00 3.458e-01 12.273 < 2e-16
CountrySweden 8.180e+00 5.313e-01 15.397 < 2e-16
CountrySwitzerland 7.845e+00 6.503e-01 12.064 < 2e-16
CountrySyrian Arab Republic 6.000e+00 3.935e-01 15.251 < 2e-16
CountryTajikistan 1.318e+00 3.596e-01 3.666 0.000252
CountryTanzania -7.783e-01 2.252e-01 -3.456 0.000558
CountryThailand 5.500e+00 2.812e-01 19.556 < 2e-16
CountryTimor-Leste -6.180e-02 2.301e-01 -0.269 0.788320
CountryTogo -2.133e+00 2.047e-01 -10.424 < 2e-16
CountryTonga 4.654e+00 6.688e-01 6.959 4.29e-12
CountryTrinidad and Tobago 5.753e+00 4.265e-01 13.491 < 2e-16
CountryTunisia 4.951e+00 3.340e-01 14.824 < 2e-16
CountryTurkiye 5.427e+00 3.972e-01 13.662 < 2e-16
CountryTurkmenistan 2.443e+00 3.633e-01 6.724 2.15e-11
CountryUganda -7.137e-01 2.593e-01 -2.753 0.005946
CountryUkraine 5.967e+00 4.054e-01 14.717 < 2e-16
CountryUnited Arab Emirates 4.693e+00 5.600e-01 8.380 < 2e-16
CountryUnited Kingdom 8.280e+00 5.484e-01 15.100 < 2e-16
CountryUnited States 8.232e+00 7.246e-01 11.360 < 2e-16
CountryUruguay 7.016e+00 4.039e-01 17.370 < 2e-16
CountryUzbekistan 3.392e+00 3.719e-01 9.121 < 2e-16
CountryVanuatu 2.007e+00 3.312e-01 6.059 1.56e-09
CountryVenezuela, RB 5.717e+00 3.888e-01 14.706 < 2e-16
CountryVietnam 4.589e+00 2.860e-01 16.042 < 2e-16
CountryYemen, Rep. 1.723e+00 2.083e-01 8.272 < 2e-16
CountryZambia -1.042e+00 2.605e-01 -4.000 6.50e-05
CountryZimbabwe 1.304e-01 2.972e-01 0.439 0.660929
RegionAsia NA NA NA NA
RegionCentral America and Caribbean NA NA NA NA
RegionEuropean Union NA NA NA NA
RegionMiddle East NA NA NA NA
RegionNorth America NA NA NA NA
RegionOceania NA NA NA NA
RegionRest of Europe NA NA NA NA
RegionSouth America NA NA NA NA
Year 1.435e-01 6.719e-03 21.352 < 2e-16
Infant_deaths -8.570e-03 7.114e-03 -1.205 0.228456
Under_five_deaths -4.410e-02 3.835e-03 -11.499 < 2e-16
Adult_mortality -4.163e-02 6.555e-04 -63.503 < 2e-16
Alcohol_consumption -2.697e-02 1.197e-02 -2.253 0.024326
Hepatitis_B 1.869e-03 1.375e-03 1.359 0.174179
Measles 2.034e-03 1.374e-03 1.480 0.139041
BMI -4.152e-01 6.403e-02 -6.484 1.06e-10
Polio 7.760e-04 2.736e-03 0.284 0.776735
Diphtheria 8.859e-03 2.746e-03 3.226 0.001270
Incidents_HIV 1.555e-01 2.458e-02 6.323 2.99e-10
GDP_per_capita 2.896e-05 6.078e-06 4.766 1.98e-06
Population_mln 1.954e-04 1.369e-03 0.143 0.886510
Thinness_ten_nineteen_years -1.400e-02 7.219e-03 -1.940 0.052538
Thinness_five_nine_years -1.285e-02 7.160e-03 -1.794 0.072870
Schooling -9.822e-02 2.869e-02 -3.423 0.000628
Economy_status_Developed NA NA NA NA
(Intercept) ***
CountryAlbania ***
CountryAlgeria ***
CountryAngola ***
CountryAntigua and Barbuda ***
CountryArgentina ***
CountryArmenia ***
CountryAustralia ***
CountryAustria ***
CountryAzerbaijan ***
CountryBahamas, The ***
CountryBahrain ***
CountryBangladesh ***
CountryBarbados ***
CountryBelarus ***
CountryBelgium ***
CountryBelize ***
CountryBenin ***
CountryBhutan ***
CountryBolivia ***
CountryBosnia and Herzegovina ***
CountryBotswana **
CountryBrazil ***
CountryBrunei Darussalam ***
CountryBulgaria ***
CountryBurkina Faso ***
CountryBurundi ***
CountryCabo Verde ***
CountryCambodia .
CountryCameroon
CountryCanada ***
CountryCentral African Republic ***
CountryChad ***
CountryChile ***
CountryChina
CountryColombia ***
CountryComoros .
CountryCongo, Dem. Rep. ***
CountryCongo, Rep.
CountryCosta Rica ***
CountryCote d'Ivoire
CountryCroatia ***
CountryCuba ***
CountryCyprus ***
CountryCzechia ***
CountryDenmark ***
CountryDjibouti
CountryDominican Republic ***
CountryEcuador ***
CountryEgypt, Arab Rep. ***
CountryEl Salvador ***
CountryEquatorial Guinea **
CountryEritrea **
CountryEstonia ***
CountryEswatini ***
CountryEthiopia **
CountryFiji ***
CountryFinland ***
CountryFrance ***
CountryGabon *
CountryGambia, The ***
CountryGeorgia ***
CountryGermany ***
CountryGhana ***
CountryGreece ***
CountryGrenada ***
CountryGuatemala ***
CountryGuinea ***
CountryGuinea-Bissau ***
CountryGuyana ***
CountryHaiti .
CountryHonduras ***
CountryHungary ***
CountryIceland ***
CountryIndia
CountryIndonesia ***
CountryIran, Islamic Rep. ***
CountryIraq ***
CountryIreland ***
CountryIsrael ***
CountryItaly ***
CountryJamaica ***
CountryJapan ***
CountryJordan ***
CountryKazakhstan ***
CountryKenya **
CountryKiribati ***
CountryKuwait ***
CountryKyrgyz Republic ***
CountryLao PDR .
CountryLatvia ***
CountryLebanon ***
CountryLesotho ***
CountryLiberia *
CountryLibya ***
CountryLithuania ***
CountryLuxembourg ***
CountryMadagascar
CountryMalawi ***
CountryMalaysia ***
CountryMaldives ***
CountryMali ***
CountryMalta ***
CountryMauritania **
CountryMauritius ***
CountryMexico ***
CountryMicronesia, Fed. Sts. **
CountryMoldova ***
CountryMongolia ***
CountryMontenegro ***
CountryMorocco ***
CountryMozambique ***
CountryMyanmar ***
CountryNamibia *
CountryNepal **
CountryNetherlands ***
CountryNew Zealand ***
CountryNicaragua ***
CountryNiger ***
CountryNigeria ***
CountryNorth Macedonia ***
CountryNorway ***
CountryOman ***
CountryPakistan ***
CountryPanama ***
CountryPapua New Guinea **
CountryParaguay ***
CountryPeru ***
CountryPhilippines ***
CountryPoland ***
CountryPortugal ***
CountryQatar ***
CountryRomania ***
CountryRussian Federation ***
CountryRwanda *
CountrySamoa ***
CountrySao Tome and Principe ***
CountrySaudi Arabia ***
CountrySenegal *
CountrySerbia ***
CountrySeychelles ***
CountrySierra Leone ***
CountrySingapore ***
CountrySlovak Republic ***
CountrySlovenia ***
CountrySolomon Islands ***
CountrySomalia ***
CountrySouth Africa ***
CountrySpain ***
CountrySri Lanka ***
CountrySt. Lucia ***
CountrySt. Vincent and the Grenadines ***
CountrySuriname ***
CountrySweden ***
CountrySwitzerland ***
CountrySyrian Arab Republic ***
CountryTajikistan ***
CountryTanzania ***
CountryThailand ***
CountryTimor-Leste
CountryTogo ***
CountryTonga ***
CountryTrinidad and Tobago ***
CountryTunisia ***
CountryTurkiye ***
CountryTurkmenistan ***
CountryUganda **
CountryUkraine ***
CountryUnited Arab Emirates ***
CountryUnited Kingdom ***
CountryUnited States ***
CountryUruguay ***
CountryUzbekistan ***
CountryVanuatu ***
CountryVenezuela, RB ***
CountryVietnam ***
CountryYemen, Rep. ***
CountryZambia ***
CountryZimbabwe
RegionAsia
RegionCentral America and Caribbean
RegionEuropean Union
RegionMiddle East
RegionNorth America
RegionOceania
RegionRest of Europe
RegionSouth America
Year ***
Infant_deaths
Under_five_deaths ***
Adult_mortality ***
Alcohol_consumption *
Hepatitis_B
Measles
BMI ***
Polio
Diphtheria **
Incidents_HIV ***
GDP_per_capita ***
Population_mln
Thinness_ten_nineteen_years .
Thinness_five_nine_years .
Schooling ***
Economy_status_Developed
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.486 on 2669 degrees of freedom
Multiple R-squared: 0.9975, Adjusted R-squared: 0.9973
F-statistic: 5513 on 194 and 2669 DF, p-value: < 2.2e-16
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| Country | 178 | 2.409342e+05 | 1.353563e+03 | 5.730201e+03 | 0.000000e+00 |
| Year | 1 | 7.699873e+03 | 7.699873e+03 | 3.259680e+04 | 0.000000e+00 |
| Infant_deaths | 1 | 2.070251e+03 | 2.070251e+03 | 8.764246e+03 | 0.000000e+00 |
| Under_five_deaths | 1 | 4.835294e+02 | 4.835294e+02 | 2.046983e+03 | 0.000000e+00 |
| Adult_mortality | 1 | 1.408345e+03 | 1.408345e+03 | 5.962118e+03 | 0.000000e+00 |
| Alcohol_consumption | 1 | 1.714765e+00 | 1.714765e+00 | 7.259320e+00 | 7.097574e-03 |
| Hepatitis_B | 1 | 5.448582e+00 | 5.448582e+00 | 2.306614e+01 | 1.651718e-06 |
| Measles | 1 | 7.636172e-01 | 7.636172e-01 | 3.232713e+00 | 7.229398e-02 |
| BMI | 1 | 1.343557e+01 | 1.343557e+01 | 5.687844e+01 | 6.316136e-14 |
| Polio | 1 | 4.908968e+00 | 4.908968e+00 | 2.078173e+01 | 5.378995e-06 |
| Diphtheria | 1 | 3.135116e+00 | 3.135116e+00 | 1.327227e+01 | 2.744917e-04 |
| Incidents_HIV | 1 | 8.630708e+00 | 8.630708e+00 | 3.653742e+01 | 1.706632e-09 |
| GDP_per_capita | 1 | 4.760226e+00 | 4.760226e+00 | 2.015204e+01 | 7.455780e-06 |
| Population_mln | 1 | 7.435504e-02 | 7.435504e-02 | 3.147762e-01 | 5.748111e-01 |
| Thinness_ten_nineteen_years | 1 | 3.637577e+00 | 3.637577e+00 | 1.539940e+01 | 8.921391e-05 |
| Thinness_five_nine_years | 1 | 7.228397e-01 | 7.228397e-01 | 3.060085e+00 | 8.035240e-02 |
| Schooling | 1 | 2.768456e+00 | 2.768456e+00 | 1.172004e+01 | 6.276284e-04 |
| Residuals | 2669 | 6.304593e+02 | 2.362156e-01 | NA | NA |
| | Year | Infant_deaths | Under_five_deaths | Adult_mortality | Alcohol_consumption | Hepatitis_B | Measles | BMI | Polio | Diphtheria | Incidents_HIV | GDP_per_capita | Population_mln | Thinness_ten_nineteen_years | Thinness_five_nine_years | Schooling | Economy_status_Developed | Life_expectancy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Year | 1.0000000000 | -0.17240170 | -0.176392621 | -0.15865958 | -0.0006105222 | 0.17682407 | 0.08594472 | 0.1614225 | 0.13985840 | 0.14514293 | -0.08174257 | 0.04099817 | 0.015157618 | -0.04490053 | -0.04803775 | 0.15053937 | 0.00000000 | 0.17435894 |
| Infant_deaths | -0.1724016970 | 1.00000000 | 0.985651346 | 0.79466086 | -0.4545261472 | -0.51256224 | -0.52628201 | -0.6619883 | -0.74079046 | -0.72187465 | 0.34945826 | -0.51228611 | 0.007621990 | 0.49119174 | 0.47763934 | -0.78851253 | -0.47586620 | -0.92003192 |
| Under_five_deaths | -0.1763926213 | 0.98565135 | 1.000000000 | 0.80236112 | -0.4093673971 | -0.50742741 | -0.51297174 | -0.6652550 | -0.74298347 | -0.72535503 | 0.36961773 | -0.46968167 | -0.005234231 | 0.46697846 | 0.45075570 | -0.77319598 | -0.42713418 | -0.92041913 |
| Adult_mortality | -0.1586595781 | 0.79466086 | 0.802361123 | 1.00000000 | -0.2447937555 | -0.34488221 | -0.41615254 | -0.5228655 | -0.52422554 | -0.51380270 | 0.69911938 | -0.51012141 | -0.053847680 | 0.38214030 | 0.37979229 | -0.58103548 | -0.42937477 | -0.94536036 |
| Alcohol_consumption | -0.0006105222 | -0.45452615 | -0.409367397 | -0.24479376 | 1.0000000000 | 0.16843582 | 0.31860293 | 0.2840319 | 0.30192623 | 0.29901592 | -0.03411801 | 0.44396595 | -0.039118659 | -0.44636618 | -0.43302972 | 0.61572804 | 0.67036609 | 0.39915911 |
| Hepatitis_B | 0.1768240714 | -0.51256224 | -0.507427407 | -0.34488221 | 0.1684358238 | 1.00000000 | 0.42916779 | 0.3454209 | 0.72434526 | 0.76178009 | -0.07578195 | 0.15937504 | -0.082396398 | -0.20845350 | -0.21379442 | 0.34764345 | 0.11353405 | 0.41780443 |
| Measles | 0.0859447214 | -0.52628201 | -0.512971742 | -0.41615254 | 0.3186029309 | 0.42916779 | 1.00000000 | 0.4163214 | 0.51409629 | 0.49405877 | -0.15058000 | 0.31372372 | -0.098221891 | -0.34070533 | -0.36696995 | 0.49839128 | 0.29869329 | 0.49001859 |
| BMI | 0.1614224541 | -0.66198827 | -0.665255042 | -0.52286551 | 0.2840319455 | 0.34542091 | 0.41632141 | 1.0000000 | 0.45720604 | 0.42650090 | -0.16114208 | 0.33617960 | -0.166482004 | -0.59648328 | -0.59911219 | 0.63547517 | 0.24328705 | 0.59842332 |
| Polio | 0.1398583960 | -0.74079046 | -0.742983474 | -0.52422554 | 0.3019262324 | 0.72434526 | 0.51409629 | 0.4572060 | 1.00000000 | 0.95317790 | -0.14795220 | 0.31378567 | -0.033485888 | -0.31268545 | -0.30699811 | 0.55276511 | 0.28326012 | 0.64121746 |
| Diphtheria | 0.1451429275 | -0.72187465 | -0.725355032 | -0.51380270 | 0.2990159210 | 0.76178009 | 0.49405877 | 0.4265009 | 0.95317790 | 1.00000000 | -0.14693191 | 0.31332094 | -0.027335977 | -0.30446625 | -0.29559745 | 0.53562097 | 0.28941718 | 0.62754139 |
| Incidents_HIV | -0.0817425731 | 0.34945826 | 0.369617726 | 0.69911938 | -0.0341180147 | -0.07578195 | -0.15058000 | -0.1611421 | -0.14795220 | -0.14693191 | 1.00000000 | -0.16958972 | -0.058039708 | 0.18876454 | 0.19384734 | -0.20124620 | -0.17563524 | -0.55302746 |
| GDP_per_capita | 0.0409981721 | -0.51228611 | -0.469681668 | -0.51012141 | 0.4439659537 | 0.15937504 | 0.31372372 | 0.3361796 | 0.31378567 | 0.31332094 | -0.16958972 | 1.00000000 | -0.040838867 | -0.37526974 | -0.38103211 | 0.58062592 | 0.66754691 | 0.58308972 |
| Population_mln | 0.0151576184 | 0.00762199 | -0.005234231 | -0.05384768 | -0.0391186595 | -0.08239640 | -0.09822189 | -0.1664820 | -0.03348589 | -0.02733598 | -0.05803971 | -0.04083887 | 1.000000000 | 0.25632201 | 0.25848584 | -0.03356182 | -0.03530183 | 0.02629788 |
| Thinness_ten_nineteen_years | -0.0449005325 | 0.49119174 | 0.466978458 | 0.38214030 | -0.4463661789 | -0.20845350 | -0.34070533 | -0.5964833 | -0.31268545 | -0.30446625 | 0.18876454 | -0.37526974 | 0.256322009 | 1.00000000 | 0.93875710 | -0.57148516 | -0.41609766 | -0.46782450 |
| Thinness_five_nine_years | -0.0480377469 | 0.47763934 | 0.450755699 | 0.37979229 | -0.4330297154 | -0.21379442 | -0.36696995 | -0.5991122 | -0.30699811 | -0.29559745 | 0.19384734 | -0.38103211 | 0.258485836 | 0.93875710 | 1.00000000 | -0.55137635 | -0.41486734 | -0.45816623 |
| Schooling | 0.1505393684 | -0.78851253 | -0.773195983 | -0.58103548 | 0.6157280403 | 0.34764345 | 0.49839128 | 0.6354752 | 0.55276511 | 0.53562097 | -0.20124620 | 0.58062592 | -0.033561816 | -0.57148516 | -0.55137635 | 1.00000000 | 0.59943940 | 0.73248447 |
| Economy_status_Developed | 0.0000000000 | -0.47586620 | -0.427134181 | -0.42937477 | 0.6703660889 | 0.11353405 | 0.29869329 | 0.2432870 | 0.28326012 | 0.28941718 | -0.17563524 | 0.66754691 | -0.035301833 | -0.41609766 | -0.41486734 | 0.59943940 | 1.00000000 | 0.52379098 |
| Life_expectancy | 0.1743589433 | -0.92003192 | -0.920419134 | -0.94536036 | 0.3991591076 | 0.41780443 | 0.49001859 | 0.5984233 | 0.64121746 | 0.62754139 | -0.55302746 | 0.58308972 | 0.026297880 | -0.46782450 | -0.45816623 | 0.73248447 | 0.52379098 | 1.00000000 |
To narrow down the reduced model, I had to decide how much correlation between the predictors was too much. Because pair plots of 20 variables would be unreadable, I removed Country and Year first. Year is only the time the data were collected and should have little real predictive power on another dataset, since values from outside the collected date range could break the model. Country offers a level of granularity that would be useful for a long-term, in-depth project, but for a project of a couple of weeks Region conveys much the same information, especially when coupled with the other data collected by WHO. Since the rows already carry many data points from different countries, the Country label is redundant.
The heatmap was hard to read, but it showed many solid blocks of color, which points to strong correlations between several of the variables. Here I took a more sweeping approach and hard-coded a cutoff: any correlation above 0.7 was treated as too collinear for the scope of this project.
| Column index | Variable |
|---|---|
| 2 | Infant_deaths |
| 3 | Under_five_deaths |
| 9 | Polio |
| 10 | Diphtheria |
| 14 | Thinness_ten_nineteen_years |
It makes sense that these features exceeded the 0.7 cutoff against the other variables. Infant_deaths directly affects Under_five_deaths. If a country is not well vaccinated against polio, an almost extinct disease, it most likely lacks the preventative care for diphtheria as well, and this coverage is better reflected by a more common vaccine such as Hepatitis B. Thinness from ages 10 to 19 is probably explained by BMI directly or by the continuation of thinness in the younger population. Starvation in the younger population carries a higher incidence of long-term health effects, since malnutrition during development can affect hormones through puberty, brain growth, learning, lifestyle prioritization in Maslow's pyramid, and potential disability, all of which would stunt life expectancy (NCBI, 3).
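As a rough illustration of the 0.7 cutoff, the sketch below scans a small slice of the correlation matrix printed above and lists every pair whose absolute correlation exceeds the cutoff. It is written in Python purely for illustration (the analysis itself was run in R), the helper name `high_corr_pairs` is hypothetical, and it only reports the pairs rather than deciding which member of each pair to drop.

```python
# Subset of the correlation matrix printed above (values copied from the output).
names = ["Polio", "Diphtheria", "Thinness_ten_nineteen_years",
         "Thinness_five_nine_years", "Schooling"]
corr = [
    [1.00000000, 0.95317790, -0.31268545, -0.30699811, 0.55276511],
    [0.95317790, 1.00000000, -0.30446625, -0.29559745, 0.53562097],
    [-0.31268545, -0.30446625, 1.00000000, 0.93875710, -0.57148516],
    [-0.30699811, -0.29559745, 0.93875710, 1.00000000, -0.55137635],
    [0.55276511, 0.53562097, -0.57148516, -0.55137635, 1.00000000],
]


def high_corr_pairs(corr, names, cutoff=0.7):
    """Return (name_a, name_b, r) for every pair with |r| above the cutoff."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):  # upper triangle only
            if abs(corr[i][j]) > cutoff:
                pairs.append((names[i], names[j], corr[i][j]))
    return pairs


flagged = high_corr_pairs(corr, names)
for a, b, r in flagged:
    print(f"{a} ~ {b}: r = {r:.3f}")
```

On this slice, only the two vaccine columns and the two thinness columns pair up above the cutoff, while Schooling stays below it against all four.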
Call:
lm(formula = Life_expectancy ~ ., data = life_nocor)
Residuals:
Min 1Q Median 3Q Max
-8.0277 -1.2763 0.0598 1.2788 9.2552
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.911e+01 1.682e+01 2.325 0.0202 *
Year 1.736e-02 8.401e-03 2.066 0.0389 *
Adult_mortality -7.052e-02 7.155e-04 -98.560 <2e-16 ***
Alcohol_consumption 1.416e-01 1.437e-02 9.855 <2e-16 ***
Hepatitis_B 2.685e-02 2.773e-03 9.683 <2e-16 ***
Measles 4.836e-03 2.492e-03 1.940 0.0525 .
BMI 2.841e-02 2.752e-02 1.032 0.3020
Incidents_HIV 3.838e-01 2.562e-02 14.980 <2e-16 ***
GDP_per_capita 7.853e-06 3.303e-06 2.378 0.0175 *
Population_mln 4.591e-05 2.937e-04 0.156 0.8758
Thinness_five_nine_years -3.387e-03 1.170e-02 -0.289 0.7723
Schooling 5.220e-01 2.214e-02 23.573 <2e-16 ***
Economy_status_Developed 1.272e-01 1.576e-01 0.807 0.4199
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.006 on 2851 degrees of freedom
Multiple R-squared: 0.9547, Adjusted R-squared: 0.9545
F-statistic: 5009 on 12 and 2851 DF, p-value: < 2.2e-16
| | 0.5 % | 99.5 % |
|---|---|---|
| (Intercept) | -4.252978e+00 | 8.247596e+01 |
| Year | -4.298335e-03 | 3.901067e-02 |
| Adult_mortality | -7.236575e-02 | -6.867719e-02 |
| Alcohol_consumption | 1.045465e-01 | 1.786086e-01 |
| Hepatitis_B | 1.970273e-02 | 3.399666e-02 |
| Measles | -1.588618e-03 | 1.125969e-02 |
| BMI | -4.252833e-02 | 9.934511e-02 |
| Incidents_HIV | 3.177369e-01 | 4.498011e-01 |
| GDP_per_capita | -6.591380e-07 | 1.636592e-05 |
| Population_mln | -7.111973e-04 | 8.030276e-04 |
| Thinness_five_nine_years | -3.355376e-02 | 2.677985e-02 |
| Schooling | 4.649087e-01 | 5.790576e-01 |
| Economy_status_Developed | -2.790990e-01 | 5.334208e-01 |
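The 99% interval table above can be reproduced approximately by hand as estimate plus or minus a critical value times the standard error. The sketch below is a Python illustration, not the code used for the analysis. It uses the normal quantile as a large-sample stand-in for the t quantile on 2851 degrees of freedom, and the helper name `approx_ci` is hypothetical.

```python
from statistics import NormalDist


def approx_ci(estimate, std_error, level=0.99):
    """Large-sample confidence interval: estimate +/- z * SE.

    Assumption: with 2851 residual degrees of freedom the t quantile is
    very close to the normal quantile used here.
    """
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return estimate - z * std_error, estimate + z * std_error


# Estimate and standard error for Schooling, copied from the summary above.
low, high = approx_ci(5.220e-01, 2.214e-02)
print(f"Schooling 99% interval: ({low:.4f}, {high:.4f})")

# Year: the approximate interval straddles zero, as it does in the table.
year_low, year_high = approx_ci(1.736e-02, 8.401e-03)
print(f"Year 99% interval: ({year_low:.4f}, {year_high:.4f})")
```

An interval that contains zero, as for Year, means the coefficient cannot be distinguished from zero at the 1% level.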
| Variable | VIF |
|---|---|
| Year | 1.0678269348341 |
| Adult_mortality | 4.81133232608706 |
| Alcohol_consumption | 2.3292571693254 |
| Hepatitis_B | 1.40001787949871 |
| Measles | 1.53934169307944 |
| BMI | 2.59460750860656 |
| Incidents_HIV | 2.64889458720439 |
| GDP_per_capita | 2.22623310236142 |
| Population_mln | 1.14390513757467 |
| Thinness_five_nine_years | 1.99632484687208 |
| Schooling | 3.51011901647122 |
| Economy_status_Developed | 2.90026350069789 |
- 'Adult_mortality'
- 'Alcohol_consumption'
- 'Hepatitis_B'
- 'Measles'
- 'BMI'
- 'Incidents_HIV'
- 'GDP_per_capita'
- 'Population_mln'
- 'Thinness_five_nine_years'
- 'Schooling'
- 'Economy_status_Developed'
- 'Life_expectancy'
I then computed Variance Inflation Factors (VIF) to make sure I was removing the appropriate variables. After removing the highly correlated variables in the previous step, I could actually run the VIF and found that the majority of the values fell below the threshold of 10 for removal. The exception was Region, which had high multicollinearity with the other variables at 40.608.
So although Region seemed like a strong variable, much of the information in the other variables already drew the same lines and borders in the data, which made Region redundant, and I removed it as well. Nothing would be gained from knowing that the country was Malaysia and the region Asia if we already know all of the health outcomes and measures for that data point.
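To make the threshold of 10 concrete: the VIF of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on all the others. The Python sketch below (an illustration only; the function name is hypothetical) inverts that formula for a few of the VIF values quoted above.

```python
def r_squared_from_vif(vif):
    """Invert VIF_j = 1 / (1 - R_j^2) to recover R_j^2."""
    return 1.0 - 1.0 / vif


# VIF values quoted in the text and output above.
for name, vif in [("Adult_mortality", 4.81133232608706),
                  ("Schooling", 3.51011901647122),
                  ("Region", 40.608)]:
    print(f"{name}: VIF = {vif:.3f}, R_j^2 = {r_squared_from_vif(vif):.3f}")
```

A VIF of 10 corresponds to R_j^2 = 0.9, so a value of 40.608 means the other predictors explain more than 97% of the variation in that variable.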
I used a Q-Q plot to examine the residuals. The lower tail leaves the fit line a bit, while the points hold steady the rest of the way to the top. This could reflect some of the left skew seen in the first exploratory pass, as life expectancy throughout the world should be around ~70-75.
To be sure about the fit, I plotted the studentized residuals against the fitted values and found them well distributed about zero. A few values sat at the extremes of +/- 4 standard deviations, so I ran a cursory check and found 34 values outside the +/- 3 standard deviation range. With almost 3,000 data points, that is only about 1% of the observations, so I believe the model fits the data well. To be thorough, I also computed the hat matrix to pull the leverage values and check for any overpowering outliers. Hat values closer to 1 indicate high leverage.
Then I computed Cook's distance to be sure the high-leverage points fall below the 10th and 20th percentile cutoffs and are not exerting too much influence on the fit. The Cook's distances came out fine: no observation was an outlier with too much influence on the model.
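For readers who want to see how these three diagnostics relate, here is a minimal Python sketch for a regression with a single predictor. The data are a made-up toy example (not the WHO data), the function name is hypothetical, and the formulas are the standard ones: leverage h_i, internally studentized residual r_i = e_i / sqrt(MSE (1 - h_i)), and Cook's distance D_i = (r_i^2 / p) * h_i / (1 - h_i).

```python
import math


def simple_ols_diagnostics(x, y):
    """Leverage, studentized residuals and Cook's distance for y = a + b*x."""
    n, p = len(x), 2  # p counts the intercept and the slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in resid) / (n - p)
    # Diagonal of the hat matrix for a one-predictor model.
    hat = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    stud = [e / math.sqrt(mse * (1 - h)) for e, h in zip(resid, hat)]
    cooks = [r ** 2 / p * h / (1 - h) for r, h in zip(stud, hat)]
    return hat, stud, cooks


# Toy data: the last point sits far from the others in x (high leverage).
x = [1, 2, 3, 4, 5, 6, 7, 20]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 25.0]
hat, stud, cooks = simple_ols_diagnostics(x, y)
for i, (h, r, d) in enumerate(zip(hat, stud, cooks)):
    print(f"obs {i}: leverage={h:.3f} studentized={r:+.2f} cooks={d:.3f}")
```

The leverages always sum to the number of fitted coefficients, which is why a single value close to 1 signals a point that nearly determines its own fitted value.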
61, 168, 172, 262, 297, 342, 384, 392, 550, 562, 620, 682, 723, 849, 932, 939, 940, 996, 1135, 1176, 1470, 1546, 1827, 1971, 1979, 2136, 2194, 2203, 2254, 2416, 2516, 2556, 2590, 2600, 2749, 2828, 2830, 2860
3 33 41 61 85 94 122 141 172 173 187 220 238 279 296 391 403 405 422 485 528 565 580 610 611 636 640 642 650 682 683 690 692 761 767 771 794 805 819 829 849 858 866 889 929 934 939 974 978 1015 1019 1048 1055 1073 1074 1093 1129 1140 1217 1219 1223 1237 1269 1276 1281 1286 1375 1394 1401 1403 1407 1428 1440 1445 1449 1463 1506 1518 1520 1567 1573 1576 1585 1592 1614 1624 1661 1687 1740 1756 1767 1786 1796 1799 1806 1813 1842 1880 1887 1906 1918 1939 1951 1961 1967 1968 1986 2025 2044 2059 2081 2084 2087 2119 2129 2160 2186 2203 2204 2257 2260 2263 2276 2282 2295 2298 2310 2318 2325 2371 2378 2411 2427 2460 2466 2484 2516 2521 2527 2529 2560 2586 2625 2660 2679 2680 2745 2749 2761 2788 2801 2819 2862
Now that the model was trimmed and checked to be viable, I could use the powerful regsubsets function (from the leaps package) to rank my variables. To this point I had ruthlessly removed 12 Xi's to limit complexity while keeping the highest R^2 available. We moved from 0.998 to 0.962 in the reduced model, so how much further can we go? So far it seems like the model is overfit to the dataset.
Subset selection object
Call: regsubsets.formula(Life_expectancy ~ Adult_mortality + Alcohol_consumption +
Hepatitis_B + Measles + BMI + Incidents_HIV + GDP_per_capita +
Population_mln + Thinness_five_nine_years + Schooling + Economy_status_Developed,
data = life_nocor)
11 Variables (and intercept)
Forced in Forced out
Adult_mortality FALSE FALSE
Alcohol_consumption FALSE FALSE
Hepatitis_B FALSE FALSE
Measles FALSE FALSE
BMI FALSE FALSE
Incidents_HIV FALSE FALSE
GDP_per_capita FALSE FALSE
Population_mln FALSE FALSE
Thinness_five_nine_years FALSE FALSE
Schooling FALSE FALSE
Economy_status_Developed FALSE FALSE
1 subsets of each size up to 8
Selection Algorithm: exhaustive
Adult_mortality Alcohol_consumption Hepatitis_B Measles BMI
1 ( 1 ) "*" " " " " " " " "
2 ( 1 ) "*" " " " " " " " "
3 ( 1 ) "*" " " " " " " " "
4 ( 1 ) "*" "*" " " " " " "
5 ( 1 ) "*" "*" "*" " " " "
6 ( 1 ) "*" "*" "*" " " " "
7 ( 1 ) "*" "*" "*" "*" " "
8 ( 1 ) "*" "*" "*" "*" "*"
Incidents_HIV GDP_per_capita Population_mln Thinness_five_nine_years
1 ( 1 ) " " " " " " " "
2 ( 1 ) " " " " " " " "
3 ( 1 ) "*" " " " " " "
4 ( 1 ) "*" " " " " " "
5 ( 1 ) "*" " " " " " "
6 ( 1 ) "*" "*" " " " "
7 ( 1 ) "*" "*" " " " "
8 ( 1 ) "*" "*" " " " "
Schooling Economy_status_Developed
1 ( 1 ) " " " "
2 ( 1 ) "*" " "
3 ( 1 ) "*" " "
4 ( 1 ) "*" " "
5 ( 1 ) "*" " "
6 ( 1 ) "*" " "
7 ( 1 ) "*" " "
8 ( 1 ) "*" " "
- 'which'
- 'rsq'
- 'rss'
- 'adjr2'
- 'cp'
- 'bic'
- 'outmat'
- 'obj'
Mallow's Cp: 4.040904 PRESS: 11584.94 AIC: 12129.17 BIC: 12212.6 R-squared: 0.9547192
regsubsets had spoken and printed the best subsets of up to 8 covariates. Before I reorganized the model formula, I compared the candidate sizes using Mallows' Cp, AIC, BIC, R^2, and RSS to find how many variables I really needed. As seen in the results above, the two main variables were simply Adult_mortality and Schooling. But the other factors pulled their weight as well!
These criteria told me that Adult_mortality is doing the bulk of the work, but that adding 4 other covariates from the list would retain the majority of the predictive power of the original full model. So I reordered the variables as recommended by regsubsets above and arrived at the complete final model.
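One reason to look beyond plain R^2 when comparing model sizes is that R^2 can only go up as variables are added, whereas the adjusted version printed in the summaries penalizes each extra predictor. The Python sketch below (illustrative only; the function name is hypothetical) applies the usual formula to the 12-predictor fit shown earlier.

```python
def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# 12-predictor fit printed earlier: R^2 = 0.9547 on 2851 residual degrees
# of freedom, so n = 2851 + 12 + 1 = 2864 observations.
adj = adjusted_r_squared(0.9547, 2864, 12)
print(round(adj, 4))
```

With nearly 3,000 rows the penalty per variable is tiny, so the drop from R^2 to adjusted R^2 is only visible in the fourth decimal place.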
Call:
lm(formula = Life_expectancy ~ Adult_mortality + Schooling +
Incidents_HIV + Alcohol_consumption + Hepatitis_B, data = life)
Residuals:
Min 1Q Median 3Q Max
-8.1193 -1.2899 0.0812 1.2435 9.4031
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 74.9456604 0.3089373 242.59 <2e-16 ***
Adult_mortality -0.0717077 0.0006147 -116.66 <2e-16 ***
Schooling 0.5578440 0.0191709 29.10 <2e-16 ***
Incidents_HIV 0.4071339 0.0240272 16.95 <2e-16 ***
Alcohol_consumption 0.1522213 0.0121663 12.51 <2e-16 ***
Hepatitis_B 0.0277708 0.0025997 10.68 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.01 on 2858 degrees of freedom
Multiple R-squared: 0.9544, Adjusted R-squared: 0.9544
F-statistic: 1.197e+04 on 5 and 2858 DF, p-value: < 2.2e-16
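Read as an equation, the coefficient table above gives a predicted life expectancy of the intercept plus each coefficient times its covariate. The Python sketch below simply evaluates that equation; the coefficients are copied from the summary, while the function name and the input row are hypothetical values chosen for illustration.

```python
# Coefficients copied from the final model summary above.
COEFS = {
    "(Intercept)": 74.9456604,
    "Adult_mortality": -0.0717077,
    "Schooling": 0.5578440,
    "Incidents_HIV": 0.4071339,
    "Alcohol_consumption": 0.1522213,
    "Hepatitis_B": 0.0277708,
}


def predict_life_expectancy(row):
    """Evaluate the fitted linear equation for one observation."""
    total = COEFS["(Intercept)"]
    for name, beta in COEFS.items():
        if name != "(Intercept)":
            total += beta * row[name]  # KeyError if a covariate is missing
    return total


# Hypothetical population, not a row from the dataset.
example = {
    "Adult_mortality": 150,      # adult deaths per 1000 population
    "Schooling": 8,              # average years of formal education
    "Incidents_HIV": 0.5,        # incidents per 1000 population aged 15-49
    "Alcohol_consumption": 3,    # liters of pure alcohol per capita
    "Hepatitis_B": 90,           # % vaccine coverage in 1 year olds
}
prediction = predict_life_expectancy(example)
print(f"Predicted life expectancy: {prediction:.2f} years")
```

Because the model is fit on country-level averages, the result is best read as an estimate for a population with these characteristics.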
Call:
lm(formula = Life_expectancy ~ Country + Region + Year + Infant_deaths +
Under_five_deaths + Adult_mortality + Alcohol_consumption +
Hepatitis_B + Measles + BMI + Polio + Diphtheria + Incidents_HIV +
GDP_per_capita + Population_mln + Thinness_ten_nineteen_years +
Thinness_five_nine_years + Schooling + Economy_status_Developed,
data = life)
Residuals:
Min 1Q Median 3Q Max
-2.0594 -0.2239 -0.0128 0.2094 5.2254
Coefficients: (9 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.030e+02 1.248e+01 -16.262 < 2e-16
CountryAlbania 5.694e+00 3.867e-01 14.724 < 2e-16
CountryAlgeria 4.940e+00 3.054e-01 16.177 < 2e-16
CountryAngola -3.614e+00 2.175e-01 -16.616 < 2e-16
CountryAntigua and Barbuda 7.061e+00 3.875e-01 18.222 < 2e-16
CountryArgentina 6.694e+00 4.236e-01 15.805 < 2e-16
CountryArmenia 5.511e+00 4.080e-01 13.507 < 2e-16
CountryAustralia 8.968e+00 5.758e-01 15.575 < 2e-16
CountryAustria 8.062e+00 4.823e-01 16.716 < 2e-16
CountryAzerbaijan 3.928e+00 4.018e-01 9.776 < 2e-16
CountryBahamas, The 5.655e+00 5.111e-01 11.064 < 2e-16
CountryBahrain 3.817e+00 3.872e-01 9.857 < 2e-16
CountryBangladesh 1.113e+00 2.704e-01 4.116 3.97e-05
CountryBarbados 9.306e+00 4.477e-01 20.786 < 2e-16
CountryBelarus 5.817e+00 4.251e-01 13.684 < 2e-16
CountryBelgium 8.061e+00 4.875e-01 16.535 < 2e-16
CountryBelize 6.318e+00 4.819e-01 13.112 < 2e-16
CountryBenin -1.233e+00 2.092e-01 -5.895 4.22e-09
CountryBhutan 3.151e+00 2.099e-01 15.009 < 2e-16
CountryBolivia 3.111e+00 3.272e-01 9.508 < 2e-16
CountryBosnia and Herzegovina 5.597e+00 3.433e-01 16.302 < 2e-16
CountryBotswana 9.806e-01 3.575e-01 2.743 0.006138
CountryBrazil 5.612e+00 3.771e-01 14.881 < 2e-16
CountryBrunei Darussalam 3.318e+00 4.514e-01 7.351 2.61e-13
CountryBulgaria 5.587e+00 4.049e-01 13.800 < 2e-16
CountryBurkina Faso -2.520e+00 2.599e-01 -9.696 < 2e-16
CountryBurundi -2.656e+00 2.384e-01 -11.139 < 2e-16
CountryCabo Verde 2.101e+00 2.649e-01 7.930 3.18e-15
CountryCambodia -3.525e-01 2.080e-01 -1.695 0.090181
CountryCameroon -2.524e-01 2.322e-01 -1.087 0.277207
CountryCanada 9.169e+00 5.386e-01 17.024 < 2e-16
CountryCentral African Republic -1.745e+00 2.235e-01 -7.809 8.22e-15
CountryChad -2.455e+00 2.481e-01 -9.896 < 2e-16
CountryChile 8.566e+00 4.497e-01 19.051 < 2e-16
CountryChina 2.631e+00 1.757e+00 1.498 0.134357
CountryColombia 7.318e+00 3.254e-01 22.490 < 2e-16
CountryComoros 3.750e-01 2.101e-01 1.785 0.074382
CountryCongo, Dem. Rep. -2.517e+00 2.157e-01 -11.669 < 2e-16
CountryCongo, Rep. -2.148e-01 2.250e-01 -0.955 0.339679
CountryCosta Rica 8.398e+00 4.021e-01 20.885 < 2e-16
CountryCote d'Ivoire -2.638e-01 2.153e-01 -1.225 0.220520
CountryCroatia 6.292e+00 4.384e-01 14.351 < 2e-16
CountryCuba 7.540e+00 3.821e-01 19.733 < 2e-16
CountryCyprus 7.248e+00 4.844e-01 14.964 < 2e-16
CountryCzechia 7.231e+00 4.841e-01 14.937 < 2e-16
CountryDenmark 6.661e+00 5.369e-01 12.407 < 2e-16
CountryDjibouti -9.279e-02 2.302e-01 -0.403 0.686894
CountryDominican Republic 5.716e+00 3.311e-01 17.263 < 2e-16
CountryEcuador 6.901e+00 3.661e-01 18.847 < 2e-16
CountryEgypt, Arab Rep. 4.600e+00 4.525e-01 10.167 < 2e-16
CountryEl Salvador 5.732e+00 3.694e-01 15.518 < 2e-16
CountryEquatorial Guinea -8.483e-01 2.665e-01 -3.183 0.001474
CountryEritrea -7.392e-01 2.743e-01 -2.695 0.007086
CountryEstonia 6.837e+00 4.646e-01 14.716 < 2e-16
CountryEswatini -3.575e+00 4.791e-01 -7.461 1.15e-13
CountryEthiopia -7.916e-01 2.588e-01 -3.059 0.002242
CountryFiji 1.999e+00 4.160e-01 4.807 1.62e-06
CountryFinland 8.029e+00 5.113e-01 15.703 < 2e-16
CountryFrance 9.347e+00 4.545e-01 20.563 < 2e-16
CountryGabon 6.503e-01 2.860e-01 2.274 0.023056
CountryGambia, The -1.818e+00 2.226e-01 -8.167 4.84e-16
CountryGeorgia 5.508e+00 4.412e-01 12.485 < 2e-16
CountryGermany 8.244e+00 5.281e-01 15.611 < 2e-16
CountryGhana -8.546e-01 2.299e-01 -3.717 0.000206
CountryGreece 8.723e+00 4.617e-01 18.894 < 2e-16
CountryGrenada 5.678e+00 3.737e-01 15.197 < 2e-16
CountryGuatemala 5.264e+00 3.089e-01 17.042 < 2e-16
CountryGuinea -2.174e+00 2.020e-01 -10.765 < 2e-16
CountryGuinea-Bissau -4.292e+00 2.137e-01 -20.081 < 2e-16
CountryGuyana 3.714e+00 3.322e-01 11.180 < 2e-16
CountryHaiti 4.362e-01 2.434e-01 1.792 0.073194
CountryHonduras 5.595e+00 3.168e-01 17.659 < 2e-16
CountryHungary 6.724e+00 4.529e-01 14.847 < 2e-16
CountryIceland 8.515e+00 5.241e-01 16.247 < 2e-16
CountryIndia 1.201e+00 1.598e+00 0.752 0.452362
CountryIndonesia 1.950e+00 3.573e-01 5.459 5.23e-08
CountryIran, Islamic Rep. 4.321e+00 3.410e-01 12.670 < 2e-16
CountryIraq 4.097e+00 4.059e-01 10.095 < 2e-16
CountryIreland 7.894e+00 5.641e-01 13.995 < 2e-16
CountryIsrael 9.012e+00 5.201e-01 17.325 < 2e-16
CountryItaly 8.762e+00 4.401e-01 19.909 < 2e-16
CountryJamaica 6.719e+00 3.924e-01 17.123 < 2e-16
CountryJapan 8.825e+00 4.345e-01 20.311 < 2e-16
CountryJordan 5.494e+00 4.912e-01 11.183 < 2e-16
CountryKazakhstan 5.000e+00 4.084e-01 12.241 < 2e-16
CountryKenya -5.759e-01 2.219e-01 -2.595 0.009510
CountryKiribati 3.976e+00 5.072e-01 7.838 6.55e-15
CountryKuwait 3.054e+00 5.550e-01 5.503 4.09e-08
CountryKyrgyz Republic 4.932e+00 3.660e-01 13.476 < 2e-16
CountryLao PDR 3.858e-01 2.161e-01 1.786 0.074271
CountryLatvia 6.806e+00 4.412e-01 15.427 < 2e-16
CountryLebanon 7.025e+00 4.017e-01 17.490 < 2e-16
CountryLesotho -2.819e+00 3.763e-01 -7.492 9.19e-14
CountryLiberia -4.736e-01 2.380e-01 -1.990 0.046692
CountryLibya 4.087e+00 4.189e-01 9.756 < 2e-16
CountryLithuania 7.145e+00 4.499e-01 15.881 < 2e-16
CountryLuxembourg 6.303e+00 7.731e-01 8.153 5.41e-16
CountryMadagascar 1.302e-01 2.415e-01 0.539 0.590038
CountryMalawi -9.598e-01 2.331e-01 -4.118 3.93e-05
CountryMalaysia 4.649e+00 3.437e-01 13.527 < 2e-16
CountryMaldives 3.619e+00 2.734e-01 13.237 < 2e-16
CountryMali -3.656e+00 2.203e-01 -16.594 < 2e-16
CountryMalta 8.423e+00 4.639e-01 18.157 < 2e-16
CountryMauritania 6.668e-01 2.381e-01 2.801 0.005136
CountryMauritius 4.908e+00 3.217e-01 15.255 < 2e-16
CountryMexico 6.821e+00 4.306e-01 15.840 < 2e-16
CountryMicronesia, Fed. Sts. 1.587e+00 4.904e-01 3.237 0.001224
CountryMoldova 4.677e+00 4.143e-01 11.289 < 2e-16
CountryMongolia 3.221e+00 3.435e-01 9.378 < 2e-16
CountryMontenegro 5.990e+00 4.171e-01 14.363 < 2e-16
CountryMorocco 3.633e+00 2.910e-01 12.485 < 2e-16
CountryMozambique -2.513e+00 2.132e-01 -11.784 < 2e-16
CountryMyanmar -9.138e-01 1.980e-01 -4.615 4.12e-06
CountryNamibia -6.691e-01 2.709e-01 -2.470 0.013560
CountryNepal 6.133e-01 2.087e-01 2.938 0.003331
CountryNetherlands 7.522e+00 5.022e-01 14.979 < 2e-16
CountryNew Zealand 9.163e+00 5.421e-01 16.902 < 2e-16
CountryNicaragua 5.663e+00 3.517e-01 16.102 < 2e-16
CountryNiger -2.616e+00 2.933e-01 -8.918 < 2e-16
CountryNigeria -2.359e+00 2.793e-01 -8.445 < 2e-16
CountryNorth Macedonia 4.920e+00 3.826e-01 12.857 < 2e-16
CountryNorway 7.456e+00 6.467e-01 11.529 < 2e-16
CountryOman 4.609e+00 3.844e-01 11.989 < 2e-16
CountryPakistan 1.102e+00 2.873e-01 3.835 0.000128
CountryPanama 8.037e+00 3.956e-01 20.316 < 2e-16
CountryPapua New Guinea -7.103e-01 2.582e-01 -2.751 0.005986
CountryParaguay 5.427e+00 3.365e-01 16.125 < 2e-16
CountryPeru 5.928e+00 3.504e-01 16.916 < 2e-16
CountryPhilippines 4.326e+00 2.881e-01 15.017 < 2e-16
CountryPoland 7.206e+00 4.317e-01 16.693 < 2e-16
CountryPortugal 7.707e+00 3.988e-01 19.325 < 2e-16
CountryQatar 6.504e+00 6.388e-01 10.182 < 2e-16
CountryRomania 5.615e+00 3.965e-01 14.160 < 2e-16
CountryRussian Federation 5.453e+00 4.387e-01 12.428 < 2e-16
CountryRwanda -6.380e-01 2.513e-01 -2.539 0.011174
CountrySamoa 5.919e+00 6.322e-01 9.362 < 2e-16
CountrySao Tome and Principe 1.231e+00 2.526e-01 4.873 1.16e-06
CountrySaudi Arabia 4.200e+00 4.756e-01 8.831 < 2e-16
CountrySenegal -5.244e-01 2.101e-01 -2.496 0.012637
CountrySerbia 5.262e+00 3.942e-01 13.347 < 2e-16
CountrySeychelles 5.067e+00 3.924e-01 12.915 < 2e-16
CountrySierra Leone -1.847e+00 2.292e-01 -8.055 1.18e-15
CountrySingapore 6.620e+00 4.605e-01 14.377 < 2e-16
CountrySlovak Republic 6.088e+00 4.348e-01 14.003 < 2e-16
CountrySlovenia 8.062e+00 4.603e-01 17.515 < 2e-16
CountrySolomon Islands 3.879e+00 3.039e-01 12.765 < 2e-16
CountrySomalia -7.450e-01 2.092e-01 -3.562 0.000375
CountrySouth Africa 3.524e+00 4.236e-01 8.318 < 2e-16
CountrySpain 9.240e+00 4.378e-01 21.103 < 2e-16
CountrySri Lanka 5.106e+00 3.378e-01 15.114 < 2e-16
CountrySt. Lucia 8.232e+00 4.700e-01 17.514 < 2e-16
CountrySt. Vincent and the Grenadines 4.992e+00 3.748e-01 13.318 < 2e-16
CountrySuriname 4.244e+00 3.458e-01 12.273 < 2e-16
CountrySweden 8.180e+00 5.313e-01 15.397 < 2e-16
CountrySwitzerland 7.845e+00 6.503e-01 12.064 < 2e-16
CountrySyrian Arab Republic 6.000e+00 3.935e-01 15.251 < 2e-16
CountryTajikistan 1.318e+00 3.596e-01 3.666 0.000252
CountryTanzania -7.783e-01 2.252e-01 -3.456 0.000558
CountryThailand 5.500e+00 2.812e-01 19.556 < 2e-16
CountryTimor-Leste -6.180e-02 2.301e-01 -0.269 0.788320
CountryTogo -2.133e+00 2.047e-01 -10.424 < 2e-16
CountryTonga 4.654e+00 6.688e-01 6.959 4.29e-12
CountryTrinidad and Tobago 5.753e+00 4.265e-01 13.491 < 2e-16
CountryTunisia 4.951e+00 3.340e-01 14.824 < 2e-16
CountryTurkiye 5.427e+00 3.972e-01 13.662 < 2e-16
CountryTurkmenistan 2.443e+00 3.633e-01 6.724 2.15e-11
CountryUganda -7.137e-01 2.593e-01 -2.753 0.005946
CountryUkraine 5.967e+00 4.054e-01 14.717 < 2e-16
CountryUnited Arab Emirates 4.693e+00 5.600e-01 8.380 < 2e-16
CountryUnited Kingdom 8.280e+00 5.484e-01 15.100 < 2e-16
CountryUnited States 8.232e+00 7.246e-01 11.360 < 2e-16
CountryUruguay 7.016e+00 4.039e-01 17.370 < 2e-16
CountryUzbekistan 3.392e+00 3.719e-01 9.121 < 2e-16
CountryVanuatu 2.007e+00 3.312e-01 6.059 1.56e-09
CountryVenezuela, RB 5.717e+00 3.888e-01 14.706 < 2e-16
CountryVietnam 4.589e+00 2.860e-01 16.042 < 2e-16
CountryYemen, Rep. 1.723e+00 2.083e-01 8.272 < 2e-16
CountryZambia -1.042e+00 2.605e-01 -4.000 6.50e-05
CountryZimbabwe 1.304e-01 2.972e-01 0.439 0.660929
RegionAsia NA NA NA NA
RegionCentral America and Caribbean NA NA NA NA
RegionEuropean Union NA NA NA NA
RegionMiddle East NA NA NA NA
RegionNorth America NA NA NA NA
RegionOceania NA NA NA NA
RegionRest of Europe NA NA NA NA
RegionSouth America NA NA NA NA
Year 1.435e-01 6.719e-03 21.352 < 2e-16
Infant_deaths -8.570e-03 7.114e-03 -1.205 0.228456
Under_five_deaths -4.410e-02 3.835e-03 -11.499 < 2e-16
Adult_mortality -4.163e-02 6.555e-04 -63.503 < 2e-16
Alcohol_consumption -2.697e-02 1.197e-02 -2.253 0.024326
Hepatitis_B 1.869e-03 1.375e-03 1.359 0.174179
Measles 2.034e-03 1.374e-03 1.480 0.139041
BMI -4.152e-01 6.403e-02 -6.484 1.06e-10
Polio 7.760e-04 2.736e-03 0.284 0.776735
Diphtheria 8.859e-03 2.746e-03 3.226 0.001270
Incidents_HIV 1.555e-01 2.458e-02 6.323 2.99e-10
GDP_per_capita 2.896e-05 6.078e-06 4.766 1.98e-06
Population_mln 1.954e-04 1.369e-03 0.143 0.886510
Thinness_ten_nineteen_years -1.400e-02 7.219e-03 -1.940 0.052538
Thinness_five_nine_years -1.285e-02 7.160e-03 -1.794 0.072870
Schooling -9.822e-02 2.869e-02 -3.423 0.000628
Economy_status_Developed NA NA NA NA
(Intercept) ***
CountryAlbania ***
CountryAlgeria ***
CountryAngola ***
CountryAntigua and Barbuda ***
CountryArgentina ***
CountryArmenia ***
CountryAustralia ***
CountryAustria ***
CountryAzerbaijan ***
CountryBahamas, The ***
CountryBahrain ***
CountryBangladesh ***
CountryBarbados ***
CountryBelarus ***
CountryBelgium ***
CountryBelize ***
CountryBenin ***
CountryBhutan ***
CountryBolivia ***
CountryBosnia and Herzegovina ***
CountryBotswana **
CountryBrazil ***
CountryBrunei Darussalam ***
CountryBulgaria ***
CountryBurkina Faso ***
CountryBurundi ***
CountryCabo Verde ***
CountryCambodia .
CountryCameroon
CountryCanada ***
CountryCentral African Republic ***
CountryChad ***
CountryChile ***
CountryChina
CountryColombia ***
CountryComoros .
CountryCongo, Dem. Rep. ***
CountryCongo, Rep.
CountryCosta Rica ***
CountryCote d'Ivoire
CountryCroatia ***
CountryCuba ***
CountryCyprus ***
CountryCzechia ***
CountryDenmark ***
CountryDjibouti
CountryDominican Republic ***
CountryEcuador ***
CountryEgypt, Arab Rep. ***
CountryEl Salvador ***
CountryEquatorial Guinea **
CountryEritrea **
CountryEstonia ***
CountryEswatini ***
CountryEthiopia **
CountryFiji ***
CountryFinland ***
CountryFrance ***
CountryGabon *
CountryGambia, The ***
CountryGeorgia ***
CountryGermany ***
CountryGhana ***
CountryGreece ***
CountryGrenada ***
CountryGuatemala ***
CountryGuinea ***
CountryGuinea-Bissau ***
CountryGuyana ***
CountryHaiti .
CountryHonduras ***
CountryHungary ***
CountryIceland ***
CountryIndia
CountryIndonesia ***
CountryIran, Islamic Rep. ***
CountryIraq ***
CountryIreland ***
CountryIsrael ***
CountryItaly ***
CountryJamaica ***
CountryJapan ***
CountryJordan ***
CountryKazakhstan ***
CountryKenya **
CountryKiribati ***
CountryKuwait ***
CountryKyrgyz Republic ***
CountryLao PDR .
CountryLatvia ***
CountryLebanon ***
CountryLesotho ***
CountryLiberia *
CountryLibya ***
CountryLithuania ***
CountryLuxembourg ***
CountryMadagascar
CountryMalawi ***
CountryMalaysia ***
CountryMaldives ***
CountryMali ***
CountryMalta ***
CountryMauritania **
CountryMauritius ***
CountryMexico ***
CountryMicronesia, Fed. Sts. **
CountryMoldova ***
CountryMongolia ***
CountryMontenegro ***
CountryMorocco ***
CountryMozambique ***
CountryMyanmar ***
CountryNamibia *
CountryNepal **
CountryNetherlands ***
CountryNew Zealand ***
CountryNicaragua ***
CountryNiger ***
CountryNigeria ***
CountryNorth Macedonia ***
CountryNorway ***
CountryOman ***
CountryPakistan ***
CountryPanama ***
CountryPapua New Guinea **
CountryParaguay ***
CountryPeru ***
CountryPhilippines ***
CountryPoland ***
CountryPortugal ***
CountryQatar ***
CountryRomania ***
CountryRussian Federation ***
CountryRwanda *
CountrySamoa ***
CountrySao Tome and Principe ***
CountrySaudi Arabia ***
CountrySenegal *
CountrySerbia ***
CountrySeychelles ***
CountrySierra Leone ***
CountrySingapore ***
CountrySlovak Republic ***
CountrySlovenia ***
CountrySolomon Islands ***
CountrySomalia ***
CountrySouth Africa ***
CountrySpain ***
CountrySri Lanka ***
CountrySt. Lucia ***
CountrySt. Vincent and the Grenadines ***
CountrySuriname ***
CountrySweden ***
CountrySwitzerland ***
CountrySyrian Arab Republic ***
CountryTajikistan ***
CountryTanzania ***
CountryThailand ***
CountryTimor-Leste
CountryTogo ***
CountryTonga ***
CountryTrinidad and Tobago ***
CountryTunisia ***
CountryTurkiye ***
CountryTurkmenistan ***
CountryUganda **
CountryUkraine ***
CountryUnited Arab Emirates ***
CountryUnited Kingdom ***
CountryUnited States ***
CountryUruguay ***
CountryUzbekistan ***
CountryVanuatu ***
CountryVenezuela, RB ***
CountryVietnam ***
CountryYemen, Rep. ***
CountryZambia ***
CountryZimbabwe
RegionAsia
RegionCentral America and Caribbean
RegionEuropean Union
RegionMiddle East
RegionNorth America
RegionOceania
RegionRest of Europe
RegionSouth America
Year ***
Infant_deaths
Under_five_deaths ***
Adult_mortality ***
Alcohol_consumption *
Hepatitis_B
Measles
BMI ***
Polio
Diphtheria **
Incidents_HIV ***
GDP_per_capita ***
Population_mln
Thinness_ten_nineteen_years .
Thinness_five_nine_years .
Schooling ***
Economy_status_Developed
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.486 on 2669 degrees of freedom
Multiple R-squared: 0.9975, Adjusted R-squared: 0.9973
F-statistic: 5513 on 194 and 2669 DF, p-value: < 2.2e-16
The final model retained an R^2 of 0.9544, which is still fantastic, but by this point it may reflect overfitting to the dataset. To double-check, and to prevent any misuse of my life expectancy prediction in the future, I compared the full and final models again. This time I separated the data into training and test sets, then retrained the full model and the reorganized reduced model to compare their MSEs on the test data.
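The hold-out procedure described above can be sketched in a few lines. The Python example below is an illustration on synthetic data with a single predictor, not the WHO data or the original code. The 70/30 split is an assumption (the training summary's 2004 rows out of 2864 suggest roughly that ratio), and all function names are hypothetical.

```python
import random


def train_test_split(rows, train_frac=0.7, seed=42):
    """Shuffle row indices with a fixed seed and cut them into two parts."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = round(train_frac * len(rows))
    return [rows[i] for i in idx[:cut]], [rows[i] for i in idx[cut:]]


def fit_line(pairs):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(pairs)
    x_bar = sum(x for x, _ in pairs) / n
    y_bar = sum(y for _, y in pairs) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in pairs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def mse(pairs, intercept, slope):
    """Mean squared prediction error on the given (x, y) pairs."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in pairs) / len(pairs)


# Synthetic stand-in data (NOT the WHO data): one predictor plus noise.
rng = random.Random(0)
data = [(x, 75 - 0.07 * x + rng.gauss(0, 2)) for x in range(100)]

train, test = train_test_split(data)
a, b = fit_line(train)  # the model only ever sees the training rows
print("train MSE:", round(mse(train, a, b), 3))
print("test MSE: ", round(mse(test, a, b), 3))
```

A test MSE far above the training MSE would be the warning sign for overfitting; similar values suggest the model generalizes.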
Warning message in predict.lm(full_life_model, newdata = test): "prediction from a rank-deficient fit may be misleading"
Call:
lm(formula = Life_expectancy ~ Adult_mortality + Schooling +
Incidents_HIV + Alcohol_consumption + Hepatitis_B, data = train)
Residuals:
Min 1Q Median 3Q Max
-8.1298 -1.3151 0.0925 1.2884 9.5692
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 75.1840217 0.3685935 203.975 <2e-16 ***
Adult_mortality -0.0718642 0.0007395 -97.180 <2e-16 ***
Schooling 0.5475069 0.0232085 23.591 <2e-16 ***
Incidents_HIV 0.3971186 0.0286688 13.852 <2e-16 ***
Alcohol_consumption 0.1585384 0.0146170 10.846 <2e-16 ***
Hepatitis_B 0.0260564 0.0030714 8.483 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.033 on 1998 degrees of freedom
Multiple R-squared: 0.9539, Adjusted R-squared: 0.9538
F-statistic: 8276 on 5 and 1998 DF, p-value: < 2.2e-16
The full model MSE was 0.1946, which is insanely accurate... or a little too overfit, as it follows the data almost exactly.
The initial reduced model had an MSE of 4.1044, which is much higher than the full model but still believable. Because MSE is measured in squared years, this corresponds to a typical error of about 2 years of life expectancy, which is still a great estimate.
The final model MSE was 4.1203, which is strong for a model using 5 of the 20 original variables. Even when fit on the training data alone, it has an R^2 explaining about 95.39% of the variance in Yi = Life expectancy. So this model is fairly accurate even with data it has not previously seen!
| | 0.5 % | 99.5 % |
|---|---|---|
| (Intercept) | 74.23368010 | 76.13436339 |
| Adult_mortality | -0.07377079 | -0.06995753 |
| Schooling | 0.48766859 | 0.60734528 |
| Incidents_HIV | 0.32320198 | 0.47103516 |
| Alcohol_consumption | 0.12085163 | 0.19622517 |
| Hepatitis_B | 0.01813737 | 0.03397552 |
- 'Country'
- 'Region'
- 'Year'
- 'Infant_deaths'
- 'Under_five_deaths'
- 'Adult_mortality'
- 'Alcohol_consumption'
- 'Hepatitis_B'
- 'Measles'
- 'BMI'
- 'Polio'
- 'Diphtheria'
- 'Incidents_HIV'
- 'GDP_per_capita'
- 'Population_mln'
- 'Thinness_ten_nineteen_years'
- 'Thinness_five_nine_years'
- 'Schooling'
- 'Economy_status_Developed'
- 'Life_expectancy'
To double-check the model, I then ran partial residual plots (prplots) for the remaining X variables and found the points well distributed along the line with tight grouping. The data did have some outliers, but as the hat values above showed, nothing was over-leveraging the fit by a huge margin.
References
1. https://www.who.int/data/gho/data/themes/mortality-and-global-health-estimates
2. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4435622/
3. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8517826/