Cancer County EDA with PandaSQL
Summary
Cancer is the second leading cause of death in the United States, after heart disease. Nearly everyone in the US today knows someone affected by cancer, whether through a death in the family, watching a friend recover with chemotherapy, or meeting a survivor who prays for continued remission. Although the cancer diagnosis rate has decreased by 27% over the last 20 years (1, CDC), it remains a constant, looming fear for most people, especially as they approach their 50s.
In this exploratory data analysis I hope to uncover some insight into cancer rates across the US, perhaps giving someone more in tune with cultural, political, or geographic causes and datasets a place to start their own discoveries. And maybe create a comprehensive list of states and counties to avoid for the sake of your personal health...
Kidding aside, I hope this EDA, or its use of Pandasql, is useful to you. I used three slightly different syntaxes for running SQL against pandas dataframes throughout this notebook. I will point them out as they come up and go over the strengths and weaknesses I found in each while doing this project.
Preparation
Dataset used
This data comes from cancer.gov and the US Census American Community Survey, and was compiled and released by another Kaggle user, Noah Rippner. Other users are free to use, manipulate, and transform the data at will with appropriate credit. Thank you!
Dataset Organization
There are two tables in this dataset: one reporting incidence rates of cancer across US counties, and another reporting death rates from cancer-related causes. The initial data was collected on 05/09/2017 and was added to until 2020, when it was published from data.world to Kaggle. The data includes the county, state, FIPS code, age-adjusted death rate per 100k, average deaths per year per county, recent trend of aa_death_rate, upper and lower 95% confidence intervals of the trends, 5-year trends in death and incidence, average annual count of diagnoses per county, and, interestingly enough, a Met45.5 objective flag. In short: did this county or state meet the objective of an age-adjusted death rate of 45.5/100,000? That is incredibly low compared to the US average of 144.1/100,000 (1, CDC).
Dataset Integrity
There are 3,141 rows of data, each representing a different county. There are 3,143 counties in the 50 states, and 3,243 once territories are included. This data does include D.C. and Rhode Island, along with various Alaskan boroughs and census areas, to bring the count up. Even with a county listed in every row, there are quite a few nulls that we will get into later. The data creator scraped various sources to complete the incidence data, and no incidence data is available for Nevada.
Processing of the Data
Analysis Tools
Throughout this notebook I tried to use SQL through Pandasql as much as possible. A few steps had to be done in plain Python to manipulate the dataframes before analysis with SQL. I am used to R and RStudio syntax and tools, so please help me build my understanding of Python by critiquing my code in the comments! After the initial publication of this EDA I will create a dashboard in Tableau Public as soon as possible to visualize the insights found in the hard numbers. This would be a great dataset for practicing matplotlib or plotly; however, within the scope of this project I think SQL and Tableau better serve my purposes.
Packages for Analysis
To use SQL in Python I imported:
- NumPy, for manipulations not covered by pandas
- pandas, to read the .csv files into dataframes that can be queried like database tables
- Pandasql, to query the pandas dataframes with basic SQL
The import and read steps load each .csv into a pandas dataframe without having to connect to a database or convert the .csv files into SQL tables, which was very helpful. I then set the maximum number of displayed rows to 100, because at times a table would print 3,000+ rows, which DRASTICALLY slowed down the system and was not all that helpful for data processing. By default the printout is truncated to roughly a dozen rows; however, I found that a lot of the really interesting findings were about 50 rows down.
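As a rough sketch of the loading step: the inline CSV text, its header names, and the `death` variable below are stand-ins chosen for illustration, not the notebook's actual file. With a real file, the `io.StringIO` wrapper would be replaced by a path.

```python
import io

import pandas as pd

# Stand-in for the real file; header names here are assumptions.
csv_text = """County,FIPS,Met Objective of 45.5?,Age-Adjusted Death Rate
United States,0,No,46
"Perry County, Kentucky",21193,No,125.6
"""

# read_csv turns the CSV straight into a dataframe, no database needed.
death = pd.read_csv(io.StringIO(csv_text))

# Allow up to 100 rows to be printed before pandas truncates the output.
pd.set_option("display.max_rows", 100)

print(death.shape)
```

Note that FIPS is read as an integer here, which is why leading zeros go missing later on.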
After loading the data I knew I had to rename the columns to be able to use SQL, which is particular about column names. So I made them more concise, using '_', avoiding '.', and refraining from capitalization, while hopefully preserving the general meaning of each column throughout the analysis.
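A minimal sketch of the renaming step, assuming raw headers like the ones below (they are hypothetical; the real CSV headers may differ). An explicit mapping keeps each short name traceable to its original column.

```python
import pandas as pd

# Hypothetical raw headers; only the target names come from the notebook output.
raw = pd.DataFrame(
    columns=["County", "FIPS", "Met Objective of 45.5?",
             "Age-Adjusted Death Rate", "Recent Trend"]
)

# Map awkward headers to short names that need no quoting in SQL.
new_names = {
    "County": "county",
    "Met Objective of 45.5?": "met45",
    "Age-Adjusted Death Rate": "aa_deathrate",
    "Recent Trend": "trending",
}
death = raw.rename(columns=new_names)

print(list(death.columns))
```

Columns left out of the mapping, such as FIPS here, keep their original names.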
| | index | county | FIPS | met45 | aa_deathrate | Lower95 | upper95 | avgdeath_year | trending | five_year | lower95_trend | upper95_trend |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | United States | 0 | No | 46 | 45.9 | 46.1 | 157,376 | falling | -2.4 | -2.6 | -2.2 |
| 1 | 1 | Perry County, Kentucky | 21193 | No | 125.6 | 108.9 | 144.2 | 43 | stable | -0.6 | -2.7 | 1.6 |
| 2 | 2 | Powell County, Kentucky | 21197 | No | 125.3 | 100.2 | 155.1 | 18 | stable | 1.7 | 0 | 3.4 |
| 3 | 3 | North Slope Borough, Alaska | 2185 | No | 124.9 | 73 | 194.7 | 5 | ** | ** | ** | ** |
| 4 | 4 | Owsley County, Kentucky | 21189 | No | 118.5 | 83.1 | 165.5 | 8 | stable | 2.2 | -0.4 | 4.8 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3136 | 3136 | Yakutat City and Borough, Alaska<sup>3</sup> | 2282 | * | * | * | * | * | ** | ** | ** | ** |
| 3137 | 3137 | Yukon-Koyukuk Census Area, Alaska | 2290 | * | * | * | * | * | ** | ** | ** | ** |
| 3138 | 3138 | Zapata County, Texas | 48505 | * | * | * | * | * | * | * | * | * |
| 3139 | 3139 | Zavala County, Texas | 48507 | * | * | * | * | * | ** | ** | ** | ** |
| 3140 | 3140 | Ziebach County, South Dakota | 46137 | * | * | * | * | * | ** | ** | ** | ** |
3141 rows × 12 columns
| | index | county | FIPS | aa_deathrate_per_100k | lower95 | upper95 | avg_cases_annually | trending | five_year | lower95_trend | upper95_trend |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | US (SEER+NPCR)(1,10) | 0 | 62.4 | 62.3 | 62.6 | 214614 | falling | -2.5 | -3 | -2 |
| 1 | 1 | Autauga County, Alabama(6,10) | 1001 | 74.9 | 65.1 | 85.7 | 43 | stable | 0.5 | -14.9 | 18.6 |
| 2 | 2 | Baldwin County, Alabama(6,10) | 1003 | 66.9 | 62.4 | 71.7 | 170 | stable | 3 | -10.2 | 18.3 |
| 3 | 3 | Barbour County, Alabama(6,10) | 1005 | 74.6 | 61.8 | 89.4 | 25 | stable | -6.4 | -18.3 | 7.3 |
| 4 | 4 | Bibb County, Alabama(6,10) | 1007 | 86.4 | 71 | 104.2 | 23 | stable | -4.5 | -31.4 | 32.9 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3136 | 3136 | Sweetwater County, Wyoming(6,10) | 56037 | 39.9 | 30.5 | 51.1 | 14 | stable | 12.6 | -18.1 | 54.9 |
| 3137 | 3137 | Teton County, Wyoming(6,10) | 56039 | 23.7 | 14.7 | 36.1 | 5 | stable | -19.6 | -35.5 | 0.1 |
| 3138 | 3138 | Uinta County, Wyoming(6,10) | 56041 | 31.7 | 20.8 | 46.1 | 6 | stable | -0.1 | -18.3 | 22 |
| 3139 | 3139 | Washakie County, Wyoming(6,10) | 56043 | 50 | 33.8 | 72.2 | 6 | stable | 13.5 | -12.2 | 46.7 |
| 3140 | 3140 | Weston County, Wyoming(6,10) | 56045 | 44.9 | 27.9 | 69.6 | 4 | stable | -26.2 | -65.4 | 57.4 |
3141 rows × 11 columns
After printing the tables I could see that the changes took effect, and I was ready to go! The first thing I tried was dropping the upper and lower 95% confidence bounds of the trends, as they were not really useful for my EDA. To my knowledge Pandasql does not have a way to drop whole columns, but I left the typical syntax in, in case the functionality comes online, as the package is always being updated.
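Two ways to get the same result are sketched below, using one sample row from the incidence printout. Pandasql runs its queries through SQLite by default, so this sketch talks to an in-memory SQLite database directly via the standard `sqlite3` module; that substitution is my assumption to keep the example self-contained, not the notebook's own call.

```python
import sqlite3

import pandas as pd

incd = pd.DataFrame({
    "county": ["Autauga County, Alabama(6,10)"],
    "five_year": ["0.5"],
    "lower95_trend": ["-14.9"],
    "upper95_trend": ["18.6"],
})

# pandas route: drop the two trend-bound columns directly.
incd_small = incd.drop(columns=["lower95_trend", "upper95_trend"])

# SQL route: a SELECT cannot drop columns, so list the ones to keep.
con = sqlite3.connect(":memory:")
incd.to_sql("incd", con, index=False)
incd_sql = pd.read_sql_query("SELECT county, five_year FROM incd", con)

print(list(incd_small.columns), list(incd_sql.columns))
```

Either route leaves the original `incd` dataframe untouched and returns a narrower copy.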
| | index | county | FIPS | aa_deathrate_per_100k | lower95 | upper95 | avg_cases_annually | trending | five_year |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | US (SEER+NPCR)(1,10) | 0 | 62.4 | 62.3 | 62.6 | 214614 | falling | -2.5 |
| 1 | 1 | Autauga County, Alabama(6,10) | 1001 | 74.9 | 65.1 | 85.7 | 43 | stable | 0.5 |
| 2 | 2 | Baldwin County, Alabama(6,10) | 1003 | 66.9 | 62.4 | 71.7 | 170 | stable | 3 |
| 3 | 3 | Barbour County, Alabama(6,10) | 1005 | 74.6 | 61.8 | 89.4 | 25 | stable | -6.4 |
| 4 | 4 | Bibb County, Alabama(6,10) | 1007 | 86.4 | 71 | 104.2 | 23 | stable | -4.5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3136 | 3136 | Sweetwater County, Wyoming(6,10) | 56037 | 39.9 | 30.5 | 51.1 | 14 | stable | 12.6 |
| 3137 | 3137 | Teton County, Wyoming(6,10) | 56039 | 23.7 | 14.7 | 36.1 | 5 | stable | -19.6 |
| 3138 | 3138 | Uinta County, Wyoming(6,10) | 56041 | 31.7 | 20.8 | 46.1 | 6 | stable | -0.1 |
| 3139 | 3139 | Washakie County, Wyoming(6,10) | 56043 | 50 | 33.8 | 72.2 | 6 | stable | 13.5 |
| 3140 | 3140 | Weston County, Wyoming(6,10) | 56045 | 44.9 | 27.9 | 69.6 | 4 | stable | -26.2 |
3141 rows × 9 columns
Next I parsed the state out of the county field as best I could using Python. In a full SQL dialect this would have been easy using LIKE with RIGHT, LEFT, or a string_split function. In Python I created new columns by splitting at the first space, deleted the piece that was just the word 'County', then split again and renamed the remainder to state. It was not the most elegant solution, but it allowed me to group in SQL fairly accurately.
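An alternative to splitting at the first space, offered as a sketch rather than the approach used in this notebook: split once from the right on ", " so multi-word names such as "North Slope Borough" stay in one piece, then strip trailing footnote markers. The `county_name` column and the regular expression are my own choices and assume markers only appear at the end of the string.

```python
import pandas as pd

# Sample values copied from the printed tables.
death = pd.DataFrame({"county": [
    "United States",
    "Perry County, Kentucky",
    "North Slope Borough, Alaska",
    "Yakutat City and Borough, Alaska<sup>3</sup>",
    "Autauga County, Alabama(6,10)",
]})

# Split once from the right: everything before the last ", " is the county.
parts = death["county"].str.rsplit(", ", n=1, expand=True)
death["county_name"] = parts[0]

# Remove trailing markers such as "(6,10)" or "<sup>3</sup>" from the state.
death["state"] = (
    parts[1]
    .str.replace(r"(\(.*\)|<sup>.*</sup>)$", "", regex=True)
    .str.strip()
)

print(death[["county_name", "state"]])
```

Rows with no ", " at all, such as the national row, end up with a missing state, which makes them easy to exclude from state-level grouping.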
| | index | county | FIPS | met45 | aa_deathrate | Lower95 | upper95 | avgdeath_year | trending | five_year | state |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | United | 0 | No | 46 | 45.9 | 46.1 | 157,376 | falling | -2.4 | None |
| 1 | 1 | Perry | 21193 | No | 125.6 | 108.9 | 144.2 | 43 | stable | -0.6 | Kentucky |
| 2 | 2 | Powell | 21197 | No | 125.3 | 100.2 | 155.1 | 18 | stable | 1.7 | Kentucky |
| 3 | 3 | North | 2185 | No | 124.9 | 73 | 194.7 | 5 | ** | ** | Borough, Alaska |
| 4 | 4 | Owsley | 21189 | No | 118.5 | 83.1 | 165.5 | 8 | stable | 2.2 | Kentucky |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3136 | 3136 | Yakutat | 2282 | * | * | * | * | * | ** | ** | and Borough, Alaska<sup>3</sup> |
| 3137 | 3137 | Yukon-Koyukuk | 2290 | * | * | * | * | * | ** | ** | Area, Alaska |
| 3138 | 3138 | Zapata | 48505 | * | * | * | * | * | * | * | Texas |
| 3139 | 3139 | Zavala | 48507 | * | * | * | * | * | ** | ** | Texas |
| 3140 | 3140 | Ziebach | 46137 | * | * | * | * | * | ** | ** | South Dakota |
3141 rows × 11 columns
Then I applied the same processing to the incd table to keep the two consistent, which allows me to JOIN the tables.
| | index | county | FIPS | aa_deathrate_per_100k | lower95 | upper95 | avg_cases_annually | trending | five_year | state |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | US | 0 | 62.4 | 62.3 | 62.6 | 214614 | falling | -2.5 | None |
| 1 | 1 | Autauga | 1001 | 74.9 | 65.1 | 85.7 | 43 | stable | 0.5 | Alabama(6,10) |
| 2 | 2 | Baldwin | 1003 | 66.9 | 62.4 | 71.7 | 170 | stable | 3 | Alabama(6,10) |
| 3 | 3 | Barbour | 1005 | 74.6 | 61.8 | 89.4 | 25 | stable | -6.4 | Alabama(6,10) |
| 4 | 4 | Bibb | 1007 | 86.4 | 71 | 104.2 | 23 | stable | -4.5 | Alabama(6,10) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3136 | 3136 | Sweetwater | 56037 | 39.9 | 30.5 | 51.1 | 14 | stable | 12.6 | Wyoming(6,10) |
| 3137 | 3137 | Teton | 56039 | 23.7 | 14.7 | 36.1 | 5 | stable | -19.6 | Wyoming(6,10) |
| 3138 | 3138 | Uinta | 56041 | 31.7 | 20.8 | 46.1 | 6 | stable | -0.1 | Wyoming(6,10) |
| 3139 | 3139 | Washakie | 56043 | 50 | 33.8 | 72.2 | 6 | stable | 13.5 | Wyoming(6,10) |
| 3140 | 3140 | Weston | 56045 | 44.9 | 27.9 | 69.6 | 4 | stable | -26.2 | Wyoming(6,10) |
3141 rows × 10 columns
A FIPS code identifies a state and county: the first two digits correspond to the state and the last three to the county. During the analysis some FIPS codes had only 4 digits, because the leading zero was dropped when the data was read in as integers. So I converted FIPS to a string type and padded the column with leading zeros until every value had 5 digits, then pulled the first 2 digits as the state code. This was a much cleaner grouping key than the messy state/county split.
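The padding step can be sketched with the pandas string methods `zfill` and slicing; the three sample FIPS values come from the printed tables, and `state_code` matches the column name that appears in the later output.

```python
import pandas as pd

incd = pd.DataFrame({
    "county": ["US", "Autauga", "Weston"],
    "FIPS": [0, 1001, 56045],  # read in as integers, so leading zeros are gone
})

# Convert to text first, then left-pad with zeros to five characters.
incd["FIPS"] = incd["FIPS"].astype(str).str.zfill(5)

# The first two characters identify the state.
incd["state_code"] = incd["FIPS"].str[:2]

print(incd)
```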
death dtypes before and after the FIPS conversion:
| column | before | after |
|---|---|---|
| index | int64 | int64 |
| county | object | object |
| FIPS | int64 | object |
| met45 | object | object |
| aa_deathrate | object | object |
| Lower95 | object | object |
| upper95 | object | object |
| avgdeath_year | object | object |
| trending | object | object |
| five_year | object | object |
| state | object | object |
| state_code | (not present) | object |

incd dtypes before and after the FIPS conversion:
| column | before | after |
|---|---|---|
| index | int64 | int64 |
| county | object | object |
| FIPS | int64 | object |
| aa_deathrate_per_100k | object | object |
| lower95 | object | object |
| upper95 | object | object |
| avg_cases_annually | object | object |
| trending | object | object |
| five_year | object | object |
| state | object | object |
| state_code | (not present) | object |

| | index | county | FIPS | aa_deathrate_per_100k | lower95 | upper95 | avg_cases_annually | trending | five_year | state | state_code |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | US | 00000 | 62.4 | 62.3 | 62.6 | 214614 | falling | -2.5 | None | 00 |
| 1 | 1 | Autauga | 01001 | 74.9 | 65.1 | 85.7 | 43 | stable | 0.5 | Alabama(6,10) | 01 |
| 2 | 2 | Baldwin | 01003 | 66.9 | 62.4 | 71.7 | 170 | stable | 3 | Alabama(6,10) | 01 |
| 3 | 3 | Barbour | 01005 | 74.6 | 61.8 | 89.4 | 25 | stable | -6.4 | Alabama(6,10) | 01 |
| 4 | 4 | Bibb | 01007 | 86.4 | 71 | 104.2 | 23 | stable | -4.5 | Alabama(6,10) | 01 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3136 | 3136 | Sweetwater | 56037 | 39.9 | 30.5 | 51.1 | 14 | stable | 12.6 | Wyoming(6,10) | 56 |
| 3137 | 3137 | Teton | 56039 | 23.7 | 14.7 | 36.1 | 5 | stable | -19.6 | Wyoming(6,10) | 56 |
| 3138 | 3138 | Uinta | 56041 | 31.7 | 20.8 | 46.1 | 6 | stable | -0.1 | Wyoming(6,10) | 56 |
| 3139 | 3139 | Washakie | 56043 | 50 | 33.8 | 72.2 | 6 | stable | 13.5 | Wyoming(6,10) | 56 |
| 3140 | 3140 | Weston | 56045 | 44.9 | 27.9 | 69.6 | 4 | stable | -26.2 | Wyoming(6,10) | 56 |
3141 rows × 11 columns
In a full SQL dialect it would again have been easier to use LEFT(FIPS, 2) AS state_code to pull the state into its own column and name it. That function is not supported by Pandasql yet.
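For what it is worth, SQLite (the default engine behind Pandasql) has no LEFT function but does provide `substr`, which can play the same role. The sketch below queries an in-memory SQLite table directly with the standard `sqlite3` module; the two sample rows come from the printed tables.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE incd (county TEXT, FIPS TEXT)")
con.executemany(
    "INSERT INTO incd VALUES (?, ?)",
    [("Autauga", "01001"), ("Weston", "56045")],
)

# substr(text, start, length) uses 1-based positions in SQLite.
rows = con.execute(
    "SELECT county, substr(FIPS, 1, 2) AS state_code FROM incd ORDER BY FIPS"
).fetchall()

print(rows)
```

This only works once FIPS is stored as zero-padded text; on the integer column the first two digits of a 4-digit code would be wrong.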
Now that the data has been organized and made easy to query with pandasql, it is time to fill in the missing values with one consistent value. I found that NaN is the usual Python representation for missing values and initially plugged in NumPy's np.NaN (Not a Number), but selecting and excluding it in SQL did not work well, so I used the string 'nan' instead to get the message across and filter it easily from the results. As the dtypes above show, every column besides the index was already a string, so this was a workable solution specific to this dataset.
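A small sketch of the placeholder step, assuming the missing entries are the '*' and '**' markers visible in the printed tables. It also shows an alternative, `pd.to_numeric` with `errors="coerce"`, which is not what this notebook does but yields real numeric columns if those are wanted later.

```python
import pandas as pd

death = pd.DataFrame({
    "county": ["Perry", "North", "Zapata"],
    "aa_deathrate": ["125.6", "124.9", "*"],
    "five_year": ["-0.6", "**", "*"],
})

# Replace both suppression markers with one consistent placeholder string.
death = death.replace({"*": "nan", "**": "nan"})

# Alternative: coerce to numbers, turning anything non-numeric into NaN.
rate_numeric = pd.to_numeric(death["aa_deathrate"], errors="coerce")

print(death)
print(rate_numeric)
```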
Analysis
Now the SQL queries begin. I started with the total row count of each table and then counted the nulls. Typically you would want to drop null values, but since they were already marked with 'nan' and there is still good data in those rows, I think it would be a waste to drop entire counties from the analysis when I can simply add != 'nan' conditions.
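The counting queries might look roughly like the sketch below. The four sample rows are illustrative, and the queries run against an in-memory SQLite database through `sqlite3`, standing in for Pandasql, which uses SQLite by default.

```python
import sqlite3

import pandas as pd

death = pd.DataFrame({
    "county": ["Perry", "Powell", "Zapata", "Zavala"],
    "aa_deathrate": ["125.6", "125.3", "nan", "nan"],
})
con = sqlite3.connect(":memory:")
death.to_sql("death", con, index=False)

# Total number of rows in the table.
total = pd.read_sql_query(
    'SELECT COUNT(*) AS "counties in death count" FROM death', con
)

# Rows where the rate carries the 'nan' placeholder.
missing = pd.read_sql_query(
    """SELECT COUNT(*) AS "age adjusted nulls"
       FROM death WHERE aa_deathrate = 'nan'""",
    con,
)

# Keep the rows but skip the placeholder when the value is needed.
usable = pd.read_sql_query(
    "SELECT county, aa_deathrate FROM death WHERE aa_deathrate != 'nan'", con
)

print(total, missing, usable, sep="\n")
```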
| | counties in death count |
|---|---|
| 0 | 3141 |

| | counties in incidence count |
|---|---|
| 0 | 3141 |
There are 3,141 counties reported, and both tables have the same number of rows.
| | age adjusted nulls |
|---|---|
| 0 | 328 |

| | total death nulls |
|---|---|
| 0 | 328 |
328 of the reporting counties have no average death count or age-adjusted death rate.
| | total counties missing avg cases annually |
|---|---|
| 0 | 209 |
209 counties have no information about average cases diagnosed annually in the incidence table.
| | county count | state |
|---|---|---|
| 0 | 1 | None |
| 1 | 67 | Alabama |
| 2 | 15 | Arizona |
| 3 | 75 | Arkansas |
| 4 | 27 | Borough, Alaska |
| 5 | 64 | Colorado |
| 6 | 1 | Columbia (State) |
| 7 | 8 | Connecticut |
| 8 | 58 | County, California |
| 9 | 39 | County, Washington |
| 10 | 3 | Delaware |
| 11 | 67 | Florida |
| 12 | 159 | Georgia |
| 13 | 5 | Hawaii |
| 14 | 44 | Idaho |
| 15 | 102 | Illinois |
| 16 | 92 | Indiana |
| 17 | 99 | Iowa |
| 18 | 105 | Kansas |
| 19 | 120 | Kentucky |
| 20 | 16 | Maine |
| 21 | 24 | Maryland |
| 22 | 14 | Massachusetts |
| 23 | 83 | Michigan |
| 24 | 87 | Minnesota |
| 25 | 82 | Mississippi |
| 26 | 115 | Missouri |
| 27 | 56 | Montana |
| 28 | 93 | Nebraska |
| 29 | 17 | Nevada |
| 30 | 10 | New Hampshire |
| 31 | 21 | New Jersey |
| 32 | 33 | New Mexico |
| 33 | 62 | New York |
| 34 | 100 | North Carolina |
| 35 | 53 | North Dakota |
| 36 | 88 | Ohio |
| 37 | 77 | Oklahoma |
| 38 | 36 | Oregon |
| 39 | 64 | Parish, Louisiana |
| 40 | 67 | Pennsylvania |
| 41 | 5 | Rhode Island |
| 42 | 46 | South Carolina |
| 43 | 66 | South Dakota |
| 44 | 95 | Tennessee |
| 45 | 254 | Texas |
| 46 | 29 | Utah |
| 47 | 14 | Vermont |
| 48 | 133 | Virginia |
| 49 | 55 | West Virginia |
| 50 | 72 | Wisconsin |
| 51 | 23 | Wyoming |
| | total average deaths in US per year |
|---|---|
| 0 | 146841.0 |
According to this dataset, there are a total of 146,841 deaths from cancer-related causes in the US per year.
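Because the death counts are stored as text, and the national row even contains a thousands separator ("157,376"), a sum needs a cast. The sketch below is one way to write it in SQLite; the filter on the national row and the comma stripping are my assumptions about what such a query has to handle, not the notebook's actual query.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE death (county TEXT, avgdeath_year TEXT)")
con.executemany(
    "INSERT INTO death VALUES (?, ?)",
    [("United States", "157,376"), ("Perry", "43"),
     ("Powell", "18"), ("Zapata", "nan")],
)

# Strip commas, cast text to a number, skip placeholders and the national row.
query = """
SELECT SUM(CAST(REPLACE(avgdeath_year, ',', '') AS REAL)) AS total
FROM death
WHERE avgdeath_year != 'nan' AND county != 'United States'
"""
total = con.execute(query).fetchone()[0]

print(total)
```

Leaving the national row in would roughly double the total, since it already aggregates the counties.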
| | Avg across all counties |
|---|---|
| 0 | 134.54 |
On average, 134.5 cases per 100,000 population are diagnosed per county per year across the United States. This data includes all cancer types.
| | county | county with least diagnosis |
|---|---|---|
| 0 | Choctaw | 10 |

| | county | county with most diagnosis |
|---|---|---|
| 0 | Nassau | 992 |
Choctaw County, Alabama has the fewest average annual cases diagnosed, while Nassau County, New York has the most. After a quick Google search, Choctaw County has a population of about 12,400, while Nassau County has a population of about 1,383,000, so this is an unfair comparison but a good look at cases on the whole. One caveat: avg_cases_annually is stored as text, so if MIN and MAX compared the values as strings rather than numbers, these extremes are worth re-checking after a cast. Let's group the counties by state and see which states have the highest average annual cases.
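Since avg_cases_annually has dtype object (text), it is worth checking how MIN and MAX behave on it. The toy table below uses made-up county labels and values to show that SQLite compares text character by character, so '992' sorts above '4000', while a cast gives the numeric extremes. Whether this affected the result above is not something the output alone can confirm.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE incd (county TEXT, avg_cases_annually TEXT)")
con.executemany(
    "INSERT INTO incd VALUES (?, ?)",
    [("A", "10"), ("B", "992"), ("C", "4000"), ("D", "5")],  # made-up rows
)

# Text comparison: ordered character by character.
as_text = con.execute(
    "SELECT MIN(avg_cases_annually), MAX(avg_cases_annually) FROM incd"
).fetchone()

# Numeric comparison after an explicit cast.
as_number = con.execute(
    """SELECT MIN(CAST(avg_cases_annually AS INTEGER)),
              MAX(CAST(avg_cases_annually AS INTEGER))
       FROM incd"""
).fetchone()

print(as_text, as_number)
```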
| | cases_per_county_per_year | state |
|---|---|---|
| 0 | 152.46 | Maryland(6,10) |
| 1 | 24.13 | Iowa(7,8) |
| 2 | 159.07 | Pennsylvania(6,10) |
| 3 | 59.52 | Tennessee(6,10) |
| 4 | 13.32 | Montana(6,10) |
| 5 | 294.40 | California(7,8) |
| 6 | 91.25 | Illinois(6,10) |
| 7 | 110.28 | Washington(6,10) |
| 8 | 332.88 | Connecticut(7,8) |
| 9 | 105.60 | New Hampshire(6,10) |
| 10 | 54.45 | Louisiana(7,9) |
| 11 | 51.48 | Texas(6,10) |
| 12 | 39.82 | Virginia(6,10) |
| 13 | 59.60 | Alabama(6,10) |
| 14 | 8.91 | South Dakota(6,10) |
| 15 | 172.80 | Rhode Island(6,10) |
| 16 | 351.00 | Columbia(6,10) |
| 17 | 55.89 | Wisconsin(6,10) |
| 18 | 109.67 | Ohio(6,10) |
| 19 | 37.36 | Vermont(6,10) |
| 20 | 22.24 | Utah(7) |
| 21 | 9.09 | North Dakota(6,10) |
| 22 | 13.48 | Borough, Alaska(6,10) |
| 23 | 30.49 | Mississippi(6,10) |
| 24 | 29.42 | New Mexico(7,8) |
| 25 | 219.53 | New York(6,10) |
| 26 | 358.79 | Massachusetts(6,10) |
| 27 | 81.41 | South Carolina(6,10) |
| 28 | 13.47 | Nebraska(6,10) |
| 29 | 280.90 | New Jersey(7,8) |
| 30 | 214614.00 | None |
| 31 | 40.08 | Kentucky(7,9) |
| 32 | 35.73 | Arkansas(6,10) |
| 33 | 19.50 | Idaho(6,10) |
| 34 | 57.77 | Indiana(6,10) |
| 35 | 35.28 | Colorado(6,10) |
| 36 | 74.69 | Oregon(6,10) |
| 37 | 39.27 | Oklahoma(6,10) |
| 38 | 245.46 | Florida(6,10) |
| 39 | 252.60 | Arizona(6,10) |
| 40 | 46.24 | Missouri(6,10) |
| 41 | 95.01 | Michigan(6,10) |
| 42 | 39.82 | Georgia(7,9) |
| 43 | 257.67 | Delaware(6,10) |
| 44 | 75.58 | North Carolina(6,10) |
| 45 | 36.44 | West Virginia(6,10) |
| 46 | 12.74 | Wyoming(6,10) |
| 47 | 155.80 | Hawaii(7,9) |
| 48 | 82.75 | Maine(6,10) |
Subtracting the summed county deaths from the US total in the 'None' row of the previous query gives a rough look at recoveries per year. This is a very rough estimate: some treatment may last years, we don't know how recurring diagnoses are counted, the deaths could have been cancer-related but not caused by cancer, and the incidence or death averages could be skewed by other confounding variables. It will be a fun look nonetheless. (214,614 - 146,841 = 67,773 average cases cured or recovered in a year.) Dividing that by the deaths, while grim, gives 67,773 / 146,841 = 0.4615, or about 46 recoveries for every 100 deaths. As a share of everyone diagnosed it is 67,773 / 214,614 = 0.316, or roughly a 31.6% chance on average to recover from all-cause cancer in any given year according to these numbers.
| | state | total average cases diagnosed per state per year |
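The arithmetic in this estimate is easy to check. The sketch below uses only the two totals reported by the queries and computes the difference, its ratio to deaths, and its share of all diagnoses (the last being the figure that reads as a chance of recovery).

```python
diagnosed = 214_614  # US total from the 'None' row of the incidence query
deaths = 146_841     # summed average deaths per year across counties

difference = diagnosed - deaths

# Recoveries relative to deaths (a ratio, not a probability).
ratio_to_deaths = difference / deaths

# Recoveries as a share of everyone diagnosed.
share_of_diagnosed = difference / diagnosed

print(difference, round(ratio_to_deaths, 4), round(share_of_diagnosed, 3))
```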
|---|---|---|
| 0 | None | 214614.0 |
| 1 | Alabama(6,10) | 3993.0 |
| 2 | Arizona(6,10) | 3789.0 |
| 3 | Arkansas(6,10) | 2680.0 |
| 4 | Borough, Alaska(6,10) | 364.0 |
| 5 | California(7,8) | 17075.0 |
| 6 | Colorado(6,10) | 2258.0 |
| 7 | Columbia(6,10) | 351.0 |
| 8 | Connecticut(7,8) | 2663.0 |
| 9 | Delaware(6,10) | 773.0 |
| 10 | Florida(6,10) | 16446.0 |
| 11 | Georgia(7,9) | 6332.0 |
| 12 | Hawaii(7,9) | 779.0 |
| 13 | Idaho(6,10) | 858.0 |
| 14 | Illinois(6,10) | 9308.0 |
| 15 | Indiana(6,10) | 5315.0 |
| 16 | Iowa(7,8) | 2389.0 |
| 17 | Kansas(6) | 0.0 |
| 18 | Kentucky(7,9) | 4810.0 |
| 19 | Louisiana(7,9) | 3485.0 |
| 20 | Maine(6,10) | 1324.0 |
| 21 | Maryland(6,10) | 3659.0 |
| 22 | Massachusetts(6,10) | 5023.0 |
| 23 | Michigan(6,10) | 7886.0 |
| 24 | Minnesota(6) | 0.0 |
| 25 | Mississippi(6,10) | 2500.0 |
| 26 | Missouri(6,10) | 5318.0 |
| 27 | Montana(6,10) | 746.0 |
| 28 | Nebraska(6,10) | 1253.0 |
| 29 | Nevada(6) | 0.0 |
| 30 | New Hampshire(6,10) | 1056.0 |
| 31 | New Jersey(7,8) | 5899.0 |
| 32 | New Mexico(7,8) | 971.0 |
| 33 | New York(6,10) | 13611.0 |
| 34 | North Carolina(6,10) | 7558.0 |
| 35 | North Dakota(6,10) | 482.0 |
| 36 | Ohio(6,10) | 9651.0 |
| 37 | Oklahoma(6,10) | 3024.0 |
| 38 | Oregon(6,10) | 2689.0 |
| 39 | Pennsylvania(6,10) | 10658.0 |
| 40 | Rhode Island(6,10) | 864.0 |
| 41 | South Carolina(6,10) | 3745.0 |
| 42 | South Dakota(6,10) | 588.0 |
| 43 | Tennessee(6,10) | 5654.0 |
| 44 | Texas(6,10) | 13076.0 |
| 45 | Utah(7) | 645.0 |
| 46 | Vermont(6,10) | 523.0 |
| 47 | Virginia(6,10) | 5296.0 |
| 48 | Washington(6,10) | 4301.0 |
| 49 | West Virginia(6,10) | 2004.0 |
| 50 | Wisconsin(6,10) | 4024.0 |
| 51 | Wyoming(6,10) | 293.0 |
| | county | aa_deathrate | avgdeath_year |
|---|---|---|---|
| 0 | United | 46 | 157,376 |
| 1 | Perry | 125.6 | 43 |
| 2 | Powell | 125.3 | 18 |
| 3 | North | 124.9 | 5 |
| 4 | Owsley | 118.5 | 8 |
| ... | ... | ... | ... |
| 2808 | Eagle | 14.9 | 5 |
| 2809 | Summit | 14.4 | 4 |
| 2810 | Utah | 12.4 | 37 |
| 2811 | McKinley | 11.6 | 7 |
| 2812 | Cache | 9.2 | 7 |
2813 rows × 3 columns
| | state | total average deaths per state per year |
|---|---|---|
| 0 | Alabama | 3186.0 |
| 1 | Arizona | 1278.0 |
| 2 | Arkansas | 2141.0 |
| 3 | Borough, Alaska | 242.0 |
| 4 | Colorado | 1563.0 |
| 5 | Columbia (State) | 240.0 |
| 6 | Connecticut | 1735.0 |
| 7 | County, California | 8780.0 |
| 8 | County, Washington | 3139.0 |
| 9 | Delaware | 565.0 |
| 10 | Florida | 11928.0 |
| 11 | Georgia | 4506.0 |
| 12 | Hawaii | 537.0 |
| 13 | Idaho | 597.0 |
| 14 | Illinois | 4262.0 |
| 15 | Indiana | 4010.0 |
| 16 | Iowa | 1751.0 |
| 17 | Kansas | 1437.0 |
| 18 | Kentucky | 3452.0 |
| 19 | Maine | 956.0 |
| 20 | Maryland | 2738.0 |
| 21 | Massachusetts | 3465.0 |
| 22 | Michigan | 4747.0 |
| 23 | Minnesota | 2355.0 |
| 24 | Mississippi | 1942.0 |
| 25 | Missouri | 3914.0 |
| 26 | Montana | 472.0 |
| 27 | Nebraska | 842.0 |
| 28 | Nevada | 1302.0 |
| 29 | New Hampshire | 736.0 |
| 30 | New Jersey | 4097.0 |
| 31 | New Mexico | 719.0 |
| 32 | New York | 9094.0 |
| 33 | North Carolina | 5484.0 |
| 34 | North Dakota | 276.0 |
| 35 | Ohio | 7410.0 |
| 36 | Oklahoma | 2433.0 |
| 37 | Oregon | 2058.0 |
| 38 | Parish, Louisiana | 2720.0 |
| 39 | Pennsylvania | 7697.0 |
| 40 | Rhode Island | 623.0 |
| 41 | South Carolina | 2798.0 |
| 42 | South Dakota | 374.0 |
| 43 | Tennessee | 4358.0 |
| 44 | Texas | 8266.0 |
| 45 | Utah | 421.0 |
| 46 | Vermont | 371.0 |
| 47 | Virginia | 3967.0 |
| 48 | West Virginia | 1509.0 |
| 49 | Wisconsin | 2963.0 |
| 50 | Wyoming | 228.0 |
Met 45.5
In this section I got curious about the met 45.5 objective, as it would be a great indicator of healthy states overall. The CDC's current goal is 122.7 deaths per 100,000 by 2030, so already being at 45.5/100,000 is a fantastic achievement. It is also worth further study to look at the specific states that can tout an assumed survival rate at that level.
| | # of counties that reached standard 45.5/100,000 age adjusted deaths |
|---|---|
| 0 | 828 |
In total, 828 of the 3,141 reported counties (~26.4%) reached the 45.5/100,000 objective. Only about a quarter of US counties are smashing the survival objective for cancer. So let's dive a little deeper and see why that may be.
Starting with which states have even one county within that objective...
| | states that have counties that reached 45.5 |
|---|---|
| 0 | Alabama |
| 1 | Alaska |
| 2 | Arizona |
| 3 | County, California |
| 4 | Colorado |
| 5 | County, Connecticut |
| 6 | Columbia (State) |
| 7 | Florida |
| 8 | Georgia |
| 9 | Hawaii |
| 10 | Idaho |
| 11 | Illinois |
| 12 | Indiana |
| 13 | Iowa |
| 14 | Kansas |
| 15 | Louisiana |
| 16 | Maine |
| 17 | Maryland |
| 18 | Massachusetts |
| 19 | Michigan |
| 20 | Minnesota |
| 21 | Mississippi |
| 22 | Missouri |
| 23 | Montana |
| 24 | Nebraska |
| 25 | Nevada |
| 26 | New Hampshire |
| 27 | New Jersey |
| 28 | New Mexico |
| 29 | New York |
| 30 | North Carolina |
| 31 | North Dakota |
| 32 | Ohio |
| 33 | Oklahoma |
| 34 | Oregon |
| 35 | Pennsylvania |
| 36 | Rhode Island |
| 37 | South Carolina |
| 38 | South Dakota |
| 39 | Tennessee |
| 40 | County, Texas |
| 41 | Utah |
| 42 | County, Vermont |
| 43 | and County, Virginia |
| 44 | Washington |
| 45 | West Virginia |
| 46 | Wisconsin |
| 47 | Wyoming |
It looks like the majority of states have at least one county that met the 45.5 objective. The missing states could simply have NaN values for the state as a whole. So let's turn it into a percentage of each state's reported counties to see which ones are actually doing well overall.
| | state | Percent_counties_met45 | Percent_counties_not_met45 |
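One way to compute such percentages in SQL is conditional aggregation with CASE inside SUM, as sketched below. The rows are made up, and the 'Yes' value is an assumption (only 'No' and the missing marker are visible in the printed output). The query runs on SQLite via `sqlite3`, standing in for Pandasql.

```python
import sqlite3

import pandas as pd

death = pd.DataFrame({
    "state": ["Utah", "Utah", "Kentucky", "Kentucky", "Kentucky",
              "Texas", "Texas", "Texas"],
    "met45": ["Yes", "Yes", "No", "No", "No", "Yes", "No", "nan"],
})
con = sqlite3.connect(":memory:")
death.to_sql("death", con, index=False)

# Count matching rows per state and divide by the reported rows in that state.
query = """
SELECT state,
       ROUND(100.0 * SUM(CASE WHEN met45 = 'Yes' THEN 1 ELSE 0 END) / COUNT(*), 2)
           AS Percent_counties_met45,
       ROUND(100.0 * SUM(CASE WHEN met45 = 'No' THEN 1 ELSE 0 END) / COUNT(*), 2)
           AS Percent_counties_not_met45
FROM death
WHERE met45 != 'nan'
GROUP BY state
ORDER BY state
"""
percent = pd.read_sql_query(query, con)

print(percent)
```

Filtering out 'nan' before grouping means each percentage is relative to the counties that actually reported, not to every county in the state.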
|---|---|---|---|
| 0 | Alabama | 8.96 | 91.04 |
| 1 | Arizona | 93.33 | 6.67 |
| 2 | Arkansas | 0.00 | 100.00 |
| 3 | Borough, Alaska | 15.38 | 84.62 |
| 4 | Colorado | 95.00 | 5.00 |
| 5 | Columbia (State) | 100.00 | 0.00 |
| 6 | Connecticut | 87.50 | 12.50 |
| 7 | County, California | 78.18 | 21.82 |
| 8 | County, Washington | 54.05 | 45.95 |
| 9 | Delaware | 0.00 | 100.00 |
| 10 | Florida | 28.36 | 71.64 |
| 11 | Georgia | 17.22 | 82.78 |
| 12 | Hawaii | 100.00 | 0.00 |
| 13 | Idaho | 75.86 | 24.14 |
| 14 | Illinois | 13.73 | 86.27 |
| 15 | Indiana | 10.87 | 89.13 |
| 16 | Iowa | 55.67 | 44.33 |
| 17 | Kansas | 33.33 | 66.67 |
| 18 | Kentucky | 0.00 | 100.00 |
| 19 | Maine | 6.25 | 93.75 |
| 20 | Maryland | 20.83 | 79.17 |
| 21 | Massachusetts | 57.14 | 42.86 |
| 22 | Michigan | 17.07 | 82.93 |
| 23 | Minnesota | 71.95 | 28.05 |
| 24 | Mississippi | 6.17 | 93.83 |
| 25 | Missouri | 11.40 | 88.60 |
| 26 | Montana | 57.14 | 42.86 |
| 27 | Nebraska | 57.14 | 42.86 |
| 28 | Nevada | 25.00 | 75.00 |
| 29 | New Hampshire | 30.00 | 70.00 |
| 30 | New Jersey | 61.90 | 38.10 |
| 31 | New Mexico | 92.31 | 7.69 |
| 32 | New York | 22.58 | 77.42 |
| 33 | North Carolina | 13.00 | 87.00 |
| 34 | North Dakota | 60.87 | 39.13 |
| 35 | Ohio | 13.64 | 86.36 |
| 36 | Oklahoma | 8.33 | 91.67 |
| 37 | Oregon | 54.55 | 45.45 |
| 38 | Parish, Louisiana | 7.81 | 92.19 |
| 39 | Pennsylvania | 40.91 | 59.09 |
| 40 | Rhode Island | 40.00 | 60.00 |
| 41 | South Carolina | 13.04 | 86.96 |
| 42 | South Dakota | 57.14 | 42.86 |
| 43 | Tennessee | 2.11 | 97.89 |
| 44 | Texas | 38.38 | 61.62 |
| 45 | Utah | 100.00 | 0.00 |
| 46 | Vermont | 28.57 | 71.43 |
| 47 | Virginia | 23.85 | 76.15 |
| 48 | West Virginia | 10.91 | 89.09 |
| 49 | Wisconsin | 47.89 | 52.11 |
| 50 | Wyoming | 85.71 | 14.29 |
Sure enough, the states that stick out to me are Arizona, Colorado, Connecticut, New Mexico, and Wyoming. Potentially Hawaii and Utah as well, though a full 100% makes me skeptical (Hawaii has 5 counties and Utah has 29). Other than that, the main commonality I see between these states is the prevalence of an outdoor culture. They all offer many opportunities for outdoor activities thanks to nature, government infrastructure, and cultural lifestyle, which could lead to a healthier outlook in recovering from cancer. Most also have a decent-sized population (besides Wyoming), which I assume brings better-than-average healthcare from doctors with world-class training.
On the other side, many of the states that are failing to meet the objective are in the South: Alabama, Louisiana, Mississippi, and Tennessee, plus one odd one out, Maine. I am again ignoring Arkansas, Delaware, and Kentucky, as 100% of their counties did not meet the objective. They are absent from the first query about having any met45 counties and, besides Delaware, are 'in the South', but their figures may not be reliable and require further research.
Trending in Incidence
Now that we have seen the met45.5 objective, it might be interesting to look at how individual states are trending in their diagnosis of cancer. The trends are given as a numeric value for the most recent year, a numeric value over 5 years, and a string for the recent trend: rising, falling, or stable. I thought it would be fun to encode the string first, before dealing with the numeric values, to see what is trending in the trends!
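The encoding can be sketched with a dictionary and `Series.map`. The output below shows falling as 1.0 and stable as 2.0; mapping rising to 3 is my assumption to complete the scale. Values not in the dictionary, such as the 'nan' placeholder, become missing.

```python
import pandas as pd

incd = pd.DataFrame({
    "county": ["US", "Autauga", "Example", "Unknown"],  # last two are made up
    "trending": ["falling", "stable", "rising", "nan"],
})

# Ordered codes: lower means falling, higher means rising.
codes = {"falling": 1, "stable": 2, "rising": 3}
incd["trending_encoded"] = incd["trending"].map(codes)

print(incd)
```

Because unmatched values turn into NaN, the new column is stored as floats, which matches the 1.0 and 2.0 seen in the printed table.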
| | index | county | FIPS | aa_deathrate_per_100k | lower95 | upper95 | avg_cases_annually | trending | five_year | state | state_code | trending_encoded |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | US | 00000 | 62.4 | 62.3 | 62.6 | 214614 | falling | -2.5 | None | 00 | 1.0 |
| 1 | 1 | Autauga | 01001 | 74.9 | 65.1 | 85.7 | 43 | stable | 0.5 | Alabama(6,10) | 01 | 2.0 |
| 2 | 2 | Baldwin | 01003 | 66.9 | 62.4 | 71.7 | 170 | stable | 3 | Alabama(6,10) | 01 | 2.0 |
| 3 | 3 | Barbour | 01005 | 74.6 | 61.8 | 89.4 | 25 | stable | -6.4 | Alabama(6,10) | 01 | 2.0 |
| 4 | 4 | Bibb | 01007 | 86.4 | 71 | 104.2 | 23 | stable | -4.5 | Alabama(6,10) | 01 | 2.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3136 | 3136 | Sweetwater | 56037 | 39.9 | 30.5 | 51.1 | 14 | stable | 12.6 | Wyoming(6,10) | 56 | 2.0 |
| 3137 | 3137 | Teton | 56039 | 23.7 | 14.7 | 36.1 | 5 | stable | -19.6 | Wyoming(6,10) | 56 | 2.0 |
| 3138 | 3138 | Uinta | 56041 | 31.7 | 20.8 | 46.1 | 6 | stable | -0.1 | Wyoming(6,10) | 56 | 2.0 |
| 3139 | 3139 | Washakie | 56043 | 50 | 33.8 | 72.2 | 6 | stable | 13.5 | Wyoming(6,10) | 56 | 2.0 |
| 3140 | 3140 | Weston | 56045 | 44.9 | 27.9 | 69.6 | 4 | stable | -26.2 | Wyoming(6,10) | 56 | 2.0 |
3141 rows × 12 columns
Now that we have encoded the recent trend (whether each county is rising, falling, or stable in its diagnosis of cancer year to year), we can group the counties and analyze state by state or the US as a whole.
| | total average of trend | avg five year trend |
|---|---|---|
| 0 | 1.941595 | -1.243453 |
Here we averaged over the entire US to see what the trend is. With falling encoded as 1 and stable as 2, an average of 1.94 means the diagnosis of cancer is slightly below stable, which is good news! There have been fewer diagnoses than in previous years overall, so the diagnostic rate is "falling". I put the average five-year trend next to the average recent trend to see how diagnosis was trending through the dataset. Incidence of cancer is falling consistently over the previous 5 years and in the most recent year as well, so it is a continuing trend in this glimpse.
Next let's look at the trend on a state-by-state basis.
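A state-level version of the averages can be sketched with GROUP BY. The five sample rows reuse values from the printed table with footnote markers removed from the state names; the cast on `five_year` is needed because that column is stored as text. As elsewhere, `sqlite3` stands in for Pandasql here.

```python
import sqlite3

import pandas as pd

incd = pd.DataFrame({
    "state": ["Alabama", "Alabama", "Alabama", "Wyoming", "Wyoming"],
    "trending_encoded": [2.0, 2.0, 1.0, 2.0, 2.0],
    "five_year": ["0.5", "3", "-6.4", "12.6", "-19.6"],
})
con = sqlite3.connect(":memory:")
incd.to_sql("incd", con, index=False)

# Average the encoded trend and the numeric five-year trend per state.
query = """
SELECT state,
       ROUND(AVG(trending_encoded), 2) AS "avg state trend",
       ROUND(AVG(CAST(five_year AS REAL)), 2) AS "avg 5 yr state trend"
FROM incd
WHERE five_year != 'nan'
GROUP BY state
ORDER BY state
"""
by_state = pd.read_sql_query(query, con)

print(by_state)
```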
| | state | avg state trend | avg 5 yr state trend |
|---|---|---|---|
| 0 | Alabama(6,10) | 1.93 | -2.79 |
| 1 | Alaska(6,10) | 2.00 | -6.27 |
| 2 | Arizona(6,10) | 1.93 | -3.51 |
| 3 | Arkansas(6,10) | 1.96 | 1.48 |
| 4 | California(7,8) | 1.73 | -5.71 |
| 5 | Colorado(6,10) | 1.93 | -4.91 |
| 6 | Columbia(6,10) | 2.00 | -1.70 |
| 7 | Connecticut(7,8) | 1.63 | -0.74 |
| 8 | County, Utah(7,8) | 1.87 | -1.57 |
| 9 | Delaware(6,10) | 2.00 | -0.80 |
| 10 | Florida(6,10) | 1.87 | -1.77 |
| 11 | Georgia(7,9) | 1.94 | -0.35 |
| 12 | Hawaii(7,9) | 2.00 | -1.43 |
| 13 | Idaho(6,10) | 2.00 | -1.59 |
| 14 | Illinois(6,10) | 1.88 | -1.38 |
| 15 | Indiana(6,10) | 2.00 | -2.08 |
| 16 | Iowa(7,8) | 2.03 | -0.36 |
| 17 | Kentucky(7,9) | 1.98 | -1.27 |
| 18 | Louisiana(7,9) | 1.97 | -2.28 |
| 19 | Maine(6,10) | 2.00 | 0.49 |
| 20 | Maryland(6,10) | 1.92 | -0.89 |
| 21 | Massachusetts(6,10) | 1.86 | 1.54 |
| 22 | Michigan(6,10) | 1.93 | -3.34 |
| 23 | Mississippi(6,10) | 1.96 | -1.89 |
| 24 | Missouri(6,10) | 2.02 | -0.34 |
| 25 | Montana(6,10) | 2.03 | -0.99 |
| 26 | Nebraska(6,10) | 2.02 | 2.74 |
| 27 | New Hampshire(6,10) | 1.90 | -2.34 |
| 28 | New Jersey(7,8) | 1.33 | -3.48 |
| 29 | New Mexico(7,8) | 1.85 | -1.13 |
| 30 | New York(6,10) | 1.94 | -1.40 |
| 31 | North Carolina(6,10) | 1.97 | -1.73 |
| 32 | North Dakota(6,10) | 2.04 | 3.30 |
| 33 | Ohio(6,10) | 1.89 | -2.17 |
| 34 | Oklahoma(6,10) | 1.96 | -1.33 |
| 35 | Oregon(6,10) | 1.94 | -3.85 |
| 36 | Pennsylvania(6,10) | 2.00 | -1.36 |
| 37 | Rhode Island(6,10) | 2.00 | -2.96 |
| 38 | South Carolina(6,10) | 1.96 | -0.69 |
| 39 | South Dakota(6,10) | 1.95 | -0.76 |
| 40 | Tennessee(6,10) | 1.95 | -1.27 |
| 41 | Texas(6,10) | 1.97 | -1.64 |
| 42 | Vermont(6,10) | 1.86 | -4.60 |
| 43 | Virginia(6,10) | 1.90 | -2.26 |
| 44 | Washington(6,10) | 1.71 | -2.30 |
| 45 | West Virginia(6,10) | 2.00 | -0.15 |
| 46 | Wisconsin(6,10) | 1.97 | -0.48 |
| 47 | Wyoming(6,10) | 1.95 | -3.42 |
From this look, only a handful of states are collectively above a 'stable' trend relative to previous years, and New Jersey somehow has the fastest-falling rate of diagnosis. After a quick Google search, New Jersey has had a sharp increase in population in the last 5 years and only a mild -.04% to -.07% decrease within the last year that this data was collected. So New Jersey may have implemented other environmental, health, or diagnostic criteria that are decreasing its cancer incidence from previous years. (After doing further research, New Jersey had some of the highest rates of cancer incidence until about 2018, when new toxic waste protocols were put into place, so this one-year glimpse was not a very accurate look at the data.)
Next, it might be worthwhile to put the two tables together and see how the average cases diagnosed, the incidence trend, and the death rate trend look per state.
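The state-by-state averages above are the kind of result a `GROUP BY` produces. Here is a minimal sketch under assumed table and column names, with placeholder states and values, run through `sqlite3` rather than pandasql (the SQL itself uses the same SQLite syntax):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE incidence "
    "(state TEXT, county TEXT, trend_code INTEGER, five_year_trend REAL)"
)
# Placeholder rows: two counties in each of two hypothetical states.
con.executemany(
    "INSERT INTO incidence VALUES (?, ?, ?, ?)",
    [
        ("State X", "County A", 2, -1.0),
        ("State X", "County B", 1, -3.0),
        ("State Y", "County C", 2, 0.5),
        ("State Y", "County D", 3, 1.5),
    ],
)

# One output row per state, averaging over that state's counties.
by_state_sql = """
SELECT state,
       ROUND(AVG(trend_code), 2)      AS avg_state_trend,
       ROUND(AVG(five_year_trend), 2) AS avg_5yr_state_trend
FROM incidence
GROUP BY state
ORDER BY state
"""
by_state = con.execute(by_state_sql).fetchall()
for row in by_state:
    print(row)
```

Because each county counts equally, a state with many small counties and one large city can look very different here than it would in a population-weighted average.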
| | state | state_incdtrend_1_year | state_avg_cases | overall_deathtrend |
|---|---|---|---|---|
| 0 | Alabama | 1.93 | 59.60 | 1.72 |
| 1 | Arizona | 1.93 | 270.21 | 1.29 |
| 2 | Arkansas | 1.96 | 35.73 | 1.83 |
| 3 | Colorado | 1.89 | 54.59 | 1.44 |
| 4 | Columbia (State) | 2.00 | 351.00 | 1.00 |
| 5 | Connecticut | 1.63 | 332.88 | 1.00 |
| 6 | County, California | 1.73 | 310.29 | 1.11 |
| 7 | County, Washington | 1.69 | 119.17 | 1.36 |
| 8 | Delaware | 2.00 | 257.67 | 1.00 |
| 9 | Florida | 1.87 | 245.46 | 1.30 |
| 10 | Georgia | 1.95 | 43.20 | 1.71 |
| 11 | Idaho | 2.00 | 32.29 | 1.83 |
| 12 | Illinois | 1.88 | 92.96 | 1.81 |
| 13 | Indiana | 2.00 | 57.77 | 1.84 |
| 14 | Iowa | 2.03 | 25.31 | 1.92 |
| 15 | Kansas | NaN | 0.00 | 1.92 |
| 16 | Kentucky | 1.97 | 40.39 | 1.94 |
| 17 | Maine | 2.00 | 82.75 | 1.44 |
| 18 | Maryland | 1.92 | 152.46 | 1.25 |
| 19 | Massachusetts | 1.86 | 358.79 | 1.14 |
| 20 | Michigan | 1.93 | 96.13 | 1.71 |
| 21 | Minnesota | NaN | 0.00 | 1.93 |
| 22 | Mississippi | 1.96 | 31.48 | 1.81 |
| 23 | Missouri | 2.02 | 48.12 | 1.86 |
| 24 | Montana | 1.96 | 23.07 | 1.63 |
| 25 | Nebraska | 2.02 | 24.32 | 1.89 |
| 26 | Nevada | NaN | 0.00 | 1.70 |
| 27 | New Hampshire | 1.90 | 105.60 | 1.30 |
| 28 | New Jersey | 1.33 | 280.90 | 1.00 |
| 29 | New Mexico | 1.85 | 36.50 | 1.62 |
| 30 | New York | 1.94 | 219.53 | 1.45 |
| 31 | North Carolina | 1.97 | 76.28 | 1.62 |
| 32 | North Dakota | 2.05 | 19.11 | 1.95 |
| 33 | Ohio | 1.89 | 109.67 | 1.69 |
| 34 | Oklahoma | 1.96 | 43.39 | 1.77 |
| 35 | Oregon | 1.94 | 83.56 | 1.31 |
| 36 | Parish, Louisiana | 1.97 | 55.24 | 1.71 |
| 37 | Pennsylvania | 2.00 | 161.41 | 1.62 |
| 38 | Rhode Island | 2.00 | 172.80 | 1.00 |
| 39 | South Carolina | 1.96 | 81.41 | 1.61 |
| 40 | South Dakota | 1.93 | 16.29 | 1.82 |
| 41 | Star, Alaska | 2.00 | 53.20 | 1.40 |
| 42 | Tennessee | 1.95 | 60.09 | 1.84 |
| 43 | Texas | 1.97 | 68.72 | 1.53 |
| 44 | Utah | 1.85 | 41.93 | 1.64 |
| 45 | Vermont | 1.85 | 39.62 | 1.77 |
| 46 | Virginia | 1.90 | 40.90 | 1.67 |
| 47 | West Virginia | 2.00 | 36.44 | 1.76 |
| 48 | Wisconsin | 1.97 | 57.36 | 1.87 |
| 49 | Wyoming | 1.95 | 14.10 | 1.75 |
| | state | state_code |
|---|---|---|
| 0 | Kansas(6) | 20 |
| 1 | Minnesota(6) | 27 |
| 2 | Nevada(6) | 32 |
The previous two queries show that incidence and death have been falling at a decent pace in most states: the incidence trend is at or just below 2 (stable) almost everywhere, with only Iowa, Missouri, Nebraska, and North Dakota marginally above it, and the death rate trend over the last year is below 2 in every state, which signifies that deaths and diagnoses are decreasing from the previous year. There were 3 NaN values in the table, though, so I ran a query on the incidence table alone and found 3 states that are missing all of their county information: Kansas, Minnesota, and Nevada. Sadly, I have lived my whole life in Kansas and was hoping to do a deeper exploration into its data. However, Georgia, being the Peak South, will probably have some interesting findings given its rising obesity rate. And New Jersey, with its swiftly falling incidence rate, may produce vastly different-looking tables for comparison.
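The NaN rows are what a `LEFT JOIN` produces when a state exists in the death table but has no rows in the incidence table. The sketch below shows that behavior and a way to list the missing states. Table names, column names, and rows are all placeholders, and it uses `sqlite3` (the same SQLite syntax that pandasql uses). Each table is aggregated to one row per state before joining, so counties are not multiplied by the join.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE death (state TEXT, county TEXT, death_code INTEGER)")
con.execute(
    "CREATE TABLE incidence "
    "(state TEXT, county TEXT, trend_code INTEGER, avg_cases REAL)"
)
# "State Z" has death data but no incidence rows at all.
con.executemany(
    "INSERT INTO death VALUES (?, ?, ?)",
    [("State X", "County A", 1), ("State X", "County B", 2), ("State Z", "County E", 2)],
)
con.executemany(
    "INSERT INTO incidence VALUES (?, ?, ?, ?)",
    [("State X", "County A", 2, 40.0), ("State X", "County B", 1, 60.0)],
)

# Aggregate each table per state, then keep every state from the death side.
join_sql = """
SELECT d.state, i.avg_trend, i.avg_cases, d.avg_death
FROM (SELECT state, AVG(death_code) AS avg_death
      FROM death GROUP BY state) AS d
LEFT JOIN (SELECT state, AVG(trend_code) AS avg_trend, AVG(avg_cases) AS avg_cases
           FROM incidence GROUP BY state) AS i
       ON i.state = d.state
ORDER BY d.state
"""
joined = con.execute(join_sql).fetchall()

# States whose incidence side came back empty (NULL here, NaN in pandas).
missing = [state for state, avg_trend, _, _ in joined if avg_trend is None]
print(joined)
print(missing)
```

If the state labels differ between tables (for example a trailing footnote marker on one side), the join will also come back empty, so it is worth checking the keys before concluding the data is absent.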
Georgia¶
| | five year incidence trend | death_trend | county |
|---|---|---|---|
| 0 | -0.1 | -0.9 | Jackson |
| 1 | -0.1 | -0.9 | Thomas |
| 2 | -0.2 | -0.1 | Bulloch |
| 3 | -0.4 | -0.5 | Carroll |
| 4 | -0.5 | -0.6 | Brantley |
| ... | ... | ... | ... |
| 132 | 8.3 | 9.6 | Habersham |
| 133 | 8.5 | -0.2 | Peach |
| 134 | 8.6 | -1.3 | Rabun |
| 135 | 9.1 | -4.8 | Ben |
| 136 | 9.2 | -2.1 | Camden |
137 rows × 3 columns
| | county | age_adjusted_rate | AverageDeaths_per_year | Average_Diagnoses_per_year |
|---|---|---|---|---|
| 0 | Appling | 56.3 | 12 | 15 |
| 1 | Atkinson | 75.8 | 6 | 5 |
| 2 | Bacon | 66.5 | 8 | 11 |
| 3 | Baldwin | 47.3 | 23 | 36 |
| 4 | Banks | 33.8 | 7 | 13 |
| ... | ... | ... | ... | ... |
| 146 | Whitfield | 64.5 | 64 | 79 |
| 147 | Wilcox | 45.7 | 5 | 7 |
| 148 | Wilkes | 57.7 | 9 | 10 |
| 149 | Wilkinson | 56.9 | 8 | 11 |
| 150 | Worth | 51.3 | 13 | 19 |
151 rows × 4 columns
According to the last two queries, 137 of Georgia's 159 counties have reported a recent 5-year trend. I ordered the incidence trend ascending and showed the whole table, then joined the death rate to it and grouped by county name to eliminate duplicates. This query shows that 82 of the 137 reporting counties have a negative trend in incidence, so in the majority of counties both death and incidence are slowly decreasing. Georgia's age-adjusted death rates are also fairly low but still hover around the 60-70 mark, which beats the national average but is nowhere near the Met45.5 objective.
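The join, de-duplication, and count described above can be sketched as follows. This is an approximation with placeholder names and values, run through `sqlite3` (pandasql uses the same SQLite syntax). One assumption deserves attention: the trend tables in this section appear to be sorted as text (for example, -0.4 is listed before -1), which suggests the trend column is stored as strings. The sketch therefore stores the trend as text and casts it to `REAL` so that ordering and comparisons are numeric.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE incidence (county TEXT, five_year_trend TEXT)")
con.execute("CREATE TABLE death (county TEXT, death_trend REAL)")
# Placeholder rows; County D has no reported trend, County A is duplicated
# in the death table.
con.executemany(
    "INSERT INTO incidence VALUES (?, ?)",
    [("County A", "-0.5"), ("County B", "1.2"), ("County C", "-2"), ("County D", None)],
)
con.executemany(
    "INSERT INTO death VALUES (?, ?)",
    [("County A", -0.9), ("County A", -0.9), ("County B", 0.3), ("County C", -1.1)],
)

# GROUP BY collapses duplicate matches; CAST makes the sort numeric.
trend_sql = """
SELECT i.county,
       CAST(i.five_year_trend AS REAL) AS trend,
       MIN(d.death_trend)              AS death_trend
FROM incidence AS i
JOIN death AS d ON d.county = i.county
WHERE i.five_year_trend IS NOT NULL
GROUP BY i.county, i.five_year_trend
ORDER BY trend
"""
result = con.execute(trend_sql).fetchall()

# Count the reporting counties whose incidence trend is negative.
falling = sum(1 for _, trend, _ in result if trend < 0)
print(result)
print(f"{falling} of {len(result)} reporting counties have a negative trend")
```

Grouping by county name alone only removes duplicates safely within a single state, since county names repeat across states.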
New Jersey¶
| | five year incidence trend | death_trend | county |
|---|---|---|---|
| 0 | -0.4 | -1.1 | Cape |
| 1 | -1 | -1.4 | Warren |
| 2 | -1.1 | -1.6 | Salem |
| 3 | -1.2 | -1.9 | Mercer |
| 4 | -1.2 | -1.7 | Passaic |
| 5 | -1.3 | -2.1 | Somerset |
| 6 | -1.3 | -1.6 | Sussex |
| 7 | -1.6 | -1.8 | Union |
| 8 | -1.7 | -2.2 | Hunterdon |
| 9 | -2.1 | -2.7 | Hudson |
| 10 | -2.6 | -2.9 | Essex |
| 11 | -3.3 | -2.2 | Bergen |
| 12 | -4 | -4.3 | Ocean |
| 13 | -5.4 | -2.9 | Camden |
| 14 | -5.6 | -3.2 | Gloucester |
| 15 | -5.7 | -1.8 | Burlington |
| 16 | -5.8 | -3.8 | Morris |
| 17 | -6.4 | -5.5 | Atlantic |
| 18 | -6.9 | -1 | Cumberland |
| 19 | -7.1 | -3.2 | Monmouth |
| 20 | -7.3 | -3.1 | Middlesex |
Unsurprisingly, New Jersey's numbers are all falling sharply in both incidence and death, with 0 being stable and a positive (+) value showing an increase. This is great for New Jersey residents, and it may also mean that the state had higher rates previously but was able to bring them down. Perhaps other states could follow suit with legislation or public intervention in a way that is reliable and replicable?
| | county | age_adjusted_rate | AverageDeaths_per_year | Average_Diagnoses_per_year |
|---|---|---|---|---|
| 0 | Atlantic | 47.9 | 156 | 230 |
| 1 | Bergen | 34.7 | 402 | 580 |
| 2 | Burlington | 44.2 | 232 | 342 |
| 3 | Camden | 48.9 | 275 | 406 |
| 4 | Cape | 54.9 | 90 | 136 |
| 5 | Cumberland | 50.7 | 84 | 122 |
| 6 | Essex | 37.1 | 289 | 399 |
| 7 | Gloucester | 55.5 | 172 | 250 |
| 8 | Hudson | 36.5 | 206 | 279 |
| 9 | Hunterdon | 37.8 | 55 | 80 |
| 10 | Mercer | 38.2 | 152 | 235 |
| 11 | Middlesex | 37 | 319 | 459 |
| 12 | Monmouth | 42.8 | 317 | 475 |
| 13 | Morris | 34.8 | 201 | 287 |
| 14 | Ocean | 47.7 | 442 | 645 |
| 15 | Passaic | 39.3 | 202 | 276 |
| 16 | Salem | 48.1 | 41 | 62 |
| 17 | Somerset | 35.3 | 122 | 171 |
| 18 | Sussex | 45.2 | 74 | 106 |
| 19 | Union | 35.8 | 207 | 274 |
| 20 | Warren | 45.6 | 59 | 85 |
New Jersey's counties also have very low age-adjusted death rates, many hovering around the 45.5 objective discussed earlier.
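To put a number on "many", the rates from the table above can be checked against the objective directly. This sketch copies the age-adjusted rates as displayed and assumes the objective is met when a rate is at or below 45.5 (the Kansas table later in this notebook marks a county at exactly 45.5 as "Yes", which is consistent with that reading). The variable names are my own.

```python
OBJECTIVE = 45.5  # age-adjusted deaths per 100,000

# Age-adjusted rates copied from the New Jersey table above
# (county names as they appear there).
nj_rates = {
    "Atlantic": 47.9, "Bergen": 34.7, "Burlington": 44.2, "Camden": 48.9,
    "Cape": 54.9, "Cumberland": 50.7, "Essex": 37.1, "Gloucester": 55.5,
    "Hudson": 36.5, "Hunterdon": 37.8, "Mercer": 38.2, "Middlesex": 37.0,
    "Monmouth": 42.8, "Morris": 34.8, "Ocean": 47.7, "Passaic": 39.3,
    "Salem": 48.1, "Somerset": 35.3, "Sussex": 45.2, "Union": 35.8,
    "Warren": 45.6,
}

# Assumption: "met" means at or below the objective.
met = sorted(county for county, rate in nj_rates.items() if rate <= OBJECTIVE)
share = len(met) / len(nj_rates)

print(f"{len(met)} of {len(nj_rates)} counties ({share:.0%}) are at or below {OBJECTIVE}")
```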
Kansas¶
And now on to Kansas, for my own personal interest... Sadly, the earlier queries on the incidence table showed that Kansas is one of the three states with no incidence data to report. The state also has very limited data on the Met45.5 goal, which narrowed the insights available.
| | index | county | FIPS | met45 | aa_deathrate | Lower95 | upper95 | avgdeath_year | trending | five_year | state | state_code | met45_percentage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 105 | Chautauqua | 20019 | No | 80.1 | 51.7 | 123 | 5 | nan | nan | Kansas | 20 | 0 |
| 1 | 174 | Woodson | 20207 | No | 75.6 | 44.9 | 123.4 | 4 | stable | 2.1 | Kansas | 20 | 0 |
| 2 | 260 | Kearny | 20093 | No | 71.7 | 40.6 | 117.8 | 3 | nan | nan | Kansas | 20 | 0 |
| 3 | 269 | Geary | 20061 | No | 71.4 | 56.4 | 88.9 | 16 | stable | 0.5 | Kansas | 20 | 0 |
| 4 | 403 | Wyandotte | 20209 | No | 67.4 | 61.5 | 73.7 | 99 | stable | -0.3 | Kansas | 20 | 0 |
| 5 | 503 | Linn | 20107 | No | 65.1 | 47.5 | 88 | 10 | stable | 1 | Kansas | 20 | 0 |
| 6 | 543 | Montgomery | 20125 | No | 64.4 | 54.4 | 75.9 | 30 | stable | -0.3 | Kansas | 20 | 0 |
| 7 | 601 | Cherokee | 20021 | No | 63.2 | 50.5 | 78.5 | 17 | stable | 0.4 | Kansas | 20 | 0 |
| 8 | 626 | Rice | 20159 | No | 62.7 | 45.3 | 85.4 | 9 | rising | 1.9 | Kansas | 20 | 0 |
| 9 | 666 | Smith | 20183 | No | 61.9 | 38.2 | 100.1 | 4 | nan | nan | Kansas | 20 | 0 |
| 10 | 674 | Osage | 20139 | No | 61.8 | 47.9 | 78.9 | 14 | stable | 0.2 | Kansas | 20 | 0 |
| 11 | 859 | Crawford | 20037 | No | 59 | 49.2 | 70.3 | 26 | stable | -0.3 | Kansas | 20 | 0 |
| 12 | 927 | Doniphan | 20043 | No | 58.1 | 38.1 | 85.3 | 6 | stable | -0.1 | Kansas | 20 | 0 |
| 13 | 1072 | Phillips | 20147 | No | 56.3 | 35.1 | 87.3 | 5 | nan | nan | Kansas | 20 | 0 |
| 14 | 1098 | Elk | 20049 | No | 56.1 | 32 | 99.9 | 3 | stable | -0.7 | Kansas | 20 | 0 |
| 15 | 1099 | Labette | 20099 | No | 56.1 | 44.1 | 70.4 | 16 | stable | 1.2 | Kansas | 20 | 0 |
| 16 | 1100 | Jackson | 20085 | No | 56.1 | 41.2 | 75.1 | 10 | stable | 0 | Kansas | 20 | 0 |
| 17 | 1109 | Barber | 20007 | No | 56 | 33.4 | 89.9 | 4 | stable | 0.9 | Kansas | 20 | 0 |
| 18 | 1172 | Jewell | 20089 | No | 55.2 | 31.5 | 97.6 | 3 | nan | nan | Kansas | 20 | 0 |
| 19 | 1204 | Dickinson | 20041 | No | 54.8 | 42.9 | 69.3 | 15 | stable | 1.2 | Kansas | 20 | 0 |
| 20 | 1232 | Coffey | 20031 | No | 54.4 | 37.3 | 77.8 | 7 | stable | 0 | Kansas | 20 | 0 |
| 21 | 1310 | Brown | 20013 | No | 53.6 | 37.9 | 74.4 | 8 | nan | nan | Kansas | 20 | 0 |
| 22 | 1330 | Lyon | 20111 | No | 53.3 | 43 | 65.4 | 19 | stable | -0.4 | Kansas | 20 | 0 |
| 23 | 1353 | Franklin | 20059 | No | 53 | 42 | 66.1 | 16 | stable | 0.7 | Kansas | 20 | 0 |
| 24 | 1354 | Butler | 20015 | No | 53 | 45.6 | 61.2 | 38 | stable | -0.5 | Kansas | 20 | 0 |
| 25 | 1355 | Harper | 20077 | No | 53 | 34.6 | 80 | 5 | stable | 1.3 | Kansas | 20 | 0 |
| 26 | 1369 | Clay | 20027 | No | 52.9 | 35.9 | 76 | 7 | nan | nan | Kansas | 20 | 0 |
| 27 | 1422 | Greenwood | 20073 | No | 52.3 | 35.4 | 76.8 | 6 | stable | -0.8 | Kansas | 20 | 0 |
| 28 | 1433 | Allen | 20001 | No | 52.2 | 38.2 | 70.1 | 10 | stable | -1 | Kansas | 20 | 0 |
| 29 | 1464 | Saline | 20169 | No | 51.8 | 44.3 | 60.2 | 35 | stable | -0.6 | Kansas | 20 | 0 |
| 30 | 1490 | Sherman | 20181 | No | 51.6 | 32.4 | 79.5 | 5 | stable | 0.6 | Kansas | 20 | 0 |
| 31 | 1508 | Reno | 20155 | No | 51.4 | 44.7 | 58.8 | 45 | stable | 0.6 | Kansas | 20 | 0 |
| 32 | 1516 | Morris | 20127 | No | 51.3 | 33.3 | 78.1 | 5 | stable | -0.1 | Kansas | 20 | 0 |
| 33 | 1532 | Atchison | 20005 | No | 51.1 | 37.8 | 67.8 | 10 | stable | 0 | Kansas | 20 | 0 |
| 34 | 1569 | Anderson | 20003 | No | 50.6 | 33.9 | 73.7 | 6 | stable | -0.6 | Kansas | 20 | 0 |
| 35 | 1596 | Sedgwick | 20173 | No | 50.3 | 47.5 | 53.2 | 251 | falling | -1 | Kansas | 20 | 0 |
| 36 | 1631 | Sumner | 20191 | No | 49.9 | 39.2 | 62.8 | 15 | stable | 0.2 | Kansas | 20 | 0 |
| 37 | 1632 | Shawnee | 20177 | No | 49.9 | 45.8 | 54.4 | 109 | falling | -0.9 | Kansas | 20 | 0 |
| 38 | 1646 | Neosho | 20133 | No | 49.7 | 37.5 | 65 | 11 | stable | 0.4 | Kansas | 20 | 0 |
| 39 | 1688 | Republic | 20157 | No | 49 | 29.6 | 79.9 | 4 | stable | -0.3 | Kansas | 20 | 0 |
| 40 | 1735 | Cloud | 20029 | No | 48.4 | 33.4 | 68.7 | 7 | stable | 0.3 | Kansas | 20 | 0 |
| 41 | 1745 | Leavenworth | 20103 | No | 48.3 | 41.4 | 56 | 37 | falling | -1.5 | Kansas | 20 | 0 |
| 42 | 1797 | Jefferson | 20087 | No | 47.7 | 36.2 | 62.2 | 12 | stable | -1.2 | Kansas | 20 | 0 |
| 43 | 1817 | Bourbon | 20011 | No | 47.5 | 34.9 | 63.7 | 10 | stable | -1.3 | Kansas | 20 | 0 |
| 44 | 1912 | Ottawa | 20143 | No | 46.4 | 27.6 | 74.6 | 4 | nan | nan | Kansas | 20 | 0 |
| 45 | 1923 | Pratt | 20151 | No | 46.2 | 31.7 | 66 | 7 | stable | 0.2 | Kansas | 20 | 0 |
| 46 | 1987 | Barton | 20009 | Yes | 45.5 | 36.3 | 56.6 | 17 | stable | -0.4 | Kansas | 20 | 1 |
| 47 | 1990 | Mitchell | 20123 | Yes | 45.4 | 28.4 | 70.9 | 5 | stable | 1.2 | Kansas | 20 | 1 |
| 48 | 2008 | Riley | 20161 | Yes | 45.2 | 36.5 | 55.2 | 20 | stable | 1.4 | Kansas | 20 | 1 |
| 49 | 2148 | Cowley | 20035 | Yes | 43.1 | 35 | 52.6 | 20 | falling | -4.5 | Kansas | 20 | 1 |
| 50 | 2166 | Ellsworth | 20053 | Yes | 42.8 | 26.6 | 67.2 | 4 | nan | nan | Kansas | 20 | 1 |
| 51 | 2174 | Washington | 20201 | Yes | 42.7 | 26.2 | 68.3 | 4 | stable | -1.5 | Kansas | 20 | 1 |
| 52 | 2259 | Harvey | 20079 | Yes | 41.4 | 33.5 | 50.7 | 20 | stable | 0.5 | Kansas | 20 | 1 |
| 53 | 2274 | Douglas | 20045 | Yes | 41.2 | 35.2 | 47.8 | 36 | stable | -0.9 | Kansas | 20 | 1 |
| 54 | 2275 | Ford | 20057 | Yes | 41.2 | 31.3 | 53.1 | 12 | stable | -1 | Kansas | 20 | 1 |
| 55 | 2289 | Miami | 20121 | Yes | 41 | 32.3 | 51.5 | 15 | stable | -1.1 | Kansas | 20 | 1 |
| 56 | 2350 | Finney | 20055 | Yes | 40.2 | 30.1 | 52.4 | 11 | falling | -1.9 | Kansas | 20 | 1 |
| 57 | 2363 | Russell | 20167 | Yes | 39.9 | 25.6 | 61.4 | 5 | stable | -1.2 | Kansas | 20 | 1 |
| 58 | 2375 | Johnson | 20091 | Yes | 39.6 | 37.2 | 42.2 | 210 | falling | -1.3 | Kansas | 20 | 1 |
| 59 | 2396 | Pottawatomie | 20149 | Yes | 39.2 | 28.5 | 52.7 | 9 | stable | -1 | Kansas | 20 | 1 |
| 60 | 2468 | Ellis | 20051 | Yes | 37.7 | 28.4 | 49 | 12 | stable | -1.5 | Kansas | 20 | 1 |
| 61 | 2473 | Marshall | 20117 | Yes | 37.5 | 25.4 | 54.5 | 6 | stable | 0.4 | Kansas | 20 | 1 |
| 62 | 2497 | Seward | 20175 | Yes | 37.1 | 25.4 | 52.2 | 7 | stable | -1.6 | Kansas | 20 | 1 |
| 63 | 2531 | Wilson | 20205 | Yes | 36.5 | 23.7 | 54.9 | 5 | stable | -1.9 | Kansas | 20 | 1 |
| 64 | 2607 | McPherson | 20113 | Yes | 34.4 | 26.7 | 43.7 | 15 | stable | 0.4 | Kansas | 20 | 1 |
| 65 | 2648 | Pawnee | 20145 | Yes | 33.2 | 19.1 | 55.1 | 3 | stable | -2.2 | Kansas | 20 | 1 |
| 66 | 2727 | Nemaha | 20131 | Yes | 29.6 | 18.7 | 45.6 | 5 | stable | -0.6 | Kansas | 20 | 1 |
| 67 | 2754 | Marion | 20115 | Yes | 26.8 | 17 | 41.1 | 5 | stable | -0.9 | Kansas | 20 | 1 |
| 68 | 2757 | Kingman | 20095 | Yes | 26.5 | 15.1 | 44.8 | 3 | stable | -1.3 | Kansas | 20 | 1 |
There are 105 counties in Kansas, yet the table above lists only 69 of them, so this data is missing quite a bit of the information I would want for a reliable measure of the state of the State. Many of the age-adjusted death rates hover around 50-55, but a few are very low, in the 20s. The majority of counties are stable to slightly falling. Let's summarize this information in a table so it's easier to understand.
| | Average deaths per year per county | avg upper 95th for death per year | avg lower 95th for death per year | avg age adjusted death rate | avg five year trend | State total deaths per year |
|---|---|---|---|---|---|---|
| 0 | 13.69 | 45.47 | 23.9 | 32.95 | -0.18 | 1437.0 |
The population of the entire state of Kansas is relatively small, at about 2.94 million spread across a fairly large swath of land, roughly 82,000 sq mi (~213,000 sq km). So an average of about 14 cancer deaths per year per county is fantastic! Also, the trend for dying of cancer-related causes has been decreasing over the last 5 years. Although the decrease is slow, Kansas as a whole has a largely aging population, so that is quite the accomplishment. The state averages about 1,437 deaths from cancer per year. That is roughly 0.05% of the population, well below 1%, and could also be due to the larger farming population encouraging more outdoor activity and less sedentary lifestyles.
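The arithmetic behind these figures is short enough to check by hand. The sketch below uses only numbers quoted in this section (1,437 deaths per year, a population of about 2.94 million, 105 counties); the variable names are my own. Note that the resulting share is a crude figure, not an age-adjusted rate, so it is not directly comparable to the 45.5 objective.

```python
state_deaths_per_year = 1437   # "State total deaths per year" from the summary table
population = 2_940_000         # approximate Kansas population quoted above
total_counties = 105           # number of Kansas counties

# Average deaths per county: dividing by all 105 counties reproduces
# the 13.69 shown in the summary table.
per_county = state_deaths_per_year / total_counties

# Crude share of the population dying of cancer each year.
crude_share = state_deaths_per_year / population
per_100k = crude_share * 100_000

print(f"Deaths per county per year: {per_county:.2f}")
print(f"Crude share of population:  {crude_share:.3%}")
print(f"Crude deaths per 100,000:   {per_100k:.1f}")
```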
Conclusion¶
Thank you so much for reading my EDA of the Cancer County dataset. I hope I can inspire you to plug in your own state and discover more through this data. I have really only scratched the surface of what could be uncovered through the information provided by the CDC and Cancer society. In closing, if you hope to better your own state, maybe take a page from New Jersey, Wyoming, or Vermont for their incredible progress over the last 5 years in incidence and death trends. And if you or a loved one is diagnosed with any form of cancer, please do your own research. This data suggests that moving to a medium-size-population state with mountainous regions is your best bet for a good recovery: fresh air and wonderful healthcare provided.
If you have any suggestions on improving my analysis or want to show off your own EDA, Regression model, or other training/prediction model please leave it in the comments so I can grow my own knowledge in this field/language!
External Analysis¶
Further in-depth data visualization and analysis can be found at the following Tableau dashboard:
Project Deep-Dive: Tableau Public
References¶