{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Solution - Multidimensional Linear Regression\n",
"\n",
""
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"## Goals\n",
"\n",
"After completing this notebook, you will be able to:\n",
"1. Read in a fixed width data set and assign column names\n",
"2. Clean missing data from a data set\n",
"3. Construct a set of linear regression models using `scikit-learn`\n",
"4. Evaluate the quality of fit for a set of models using adjusted $R^2$ and by comparing true and predicted values\n",
"5. Explain why that model is the best fit for this data"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 0. Our imports"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import pandas as pd\n",
"\n",
"from IPython.display import HTML\n",
"\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.linear_model import LinearRegression\n",
"import sklearn.metrics as metrics\n",
"\n",
"%matplotlib inline"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Working with example data\n",
"\n",
"We are going to work with data collected by U.N.E.S.C.O. (United Nations Educational, Scientific and Cultural Organization) relating to poverty and inequality in the world. There are two files you need to do the work:\n",
"\n",
"- `unesco.dat` which is the data file itself\n",
"- `unesco.txt` which describes the data columns as **fixed width column** data. That is, this file describes the columns of the data for each category. For example, the data in columns 1-6 of `unesco.dat` contain the \"live birth rates per 1,000 population\".\n",
"\n",
"[https://raw.githubusercontent.com/dannycab/MSU_REU_ML_course/main/notebooks/day-3/unesco.dat](https://raw.githubusercontent.com/dannycab/MSU_REU_ML_course/main/notebooks/day-3/unesco.dat)\n",
"\n",
"[https://raw.githubusercontent.com/dannycab/MSU_REU_ML_course/main/notebooks/day-3/unesco.txt](https://raw.githubusercontent.com/dannycab/MSU_REU_ML_course/main/notebooks/day-3/unesco.txt)\n",
"\n",
"Conveniently, pandas has a fixed width column data reader called `read_fwf` ([Documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_fwf.html)). You will also need to specify column names. No column headers appear in this data file, so look at the `unesco.txt` file and give the columns short but useful names.\n",
"\n",
"✎ Do This - Read in the data into a DataFrame and print the `head()`."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"## your code here"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" birth rate death rate infant mortality male LE female LE GNP \\\n",
"0 24.7 5.7 30.8 69.6 75.5 600 \n",
"1 12.5 11.9 14.4 68.3 74.7 2250 \n",
"2 13.4 11.7 11.3 71.8 77.7 2980 \n",
"3 12.0 12.4 7.6 69.8 75.9 * \n",
"4 11.6 13.4 14.8 65.4 73.8 2780 \n",
"\n",
" country group country \n",
"0 1 Albania \n",
"1 1 Bulgaria \n",
"2 1 Czechoslovakia \n",
"3 1 Former_E._Germany \n",
"4 1 Hungary "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"### ANSWER ###\n",
"\n",
"poverty_df = pd.read_fwf('https://raw.githubusercontent.com/dannycab/MSU_REU_ML_course/main/notebooks/day-3/unesco.dat', \n",
" names = ['birth rate',\n",
" 'death rate',\n",
" 'infant mortality',\n",
" 'male LE',\n",
" 'female LE',\n",
" 'GNP',\n",
" 'country group',\n",
" 'country'])\n",
"poverty_df.head()"
]
},
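{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of how fixed-width parsing works, the example below reads a tiny inline sample instead of the real file. The three rows reuse values shown in the output above, but the column widths, the `colspecs` boundaries, and the names `rows`, `sample_text`, and `sample_df` are assumptions made for illustration; the real layout is the one described in `unesco.txt`.\n",
"\n",
"```python\n",
"import io\n",
"\n",
"import pandas as pd\n",
"\n",
"# Hypothetical layout: birth rate in characters 0-3, GNP in 5-9, country from 11 on\n",
"rows = [('24.7', '600', 'Albania'), ('12.5', '2250', 'Bulgaria'), ('11.6', '2780', 'Hungary')]\n",
"sample_text = '\\n'.join(f'{b:>4} {g:>5} {c}' for b, g, c in rows)\n",
"\n",
"# colspecs are half-open [start, end) character intervals, one per column\n",
"sample_df = pd.read_fwf(io.StringIO(sample_text), colspecs=[(0, 4), (5, 10), (11, 19)], names=['birth rate', 'GNP', 'country'])\n",
"print(sample_df)\n",
"```\n",
"\n",
"Leaving `colspecs` at its default asks pandas to infer the column boundaries from the whitespace in the first rows, which is what the answer cell above relies on."
]
},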
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.1 Type of the data\n",
"\n",
"Now look at the `.dtypes` of your DataFrame and describe to me anything unusual. Can you explain why? Please write below. Don't skimp and look ahead, think about it and answer! We'll all wait ⏲"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"✎ Answer here"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.2 Handling missing data - Imputation\n",
"\n",
"Let's face it, sometimes data is bad. Values are not recorded, or are mis-recorded, or are so far out of expectation that you expect there is something wrong. On the other hand, just **changing** the data seems like cheating. We have to work with what we have, and if we have to make changes it would be good to do that programmatically so that it is recorded for others to see. \n",
"\n",
"The process of imputation is the statistical replacement of missing/bad data with substitute values. We have that problem here. In the **GNP** column some of the values are set to \" \\* \" indicating missing data. When pandas read in the column the only type that makes sense for both characters and numbers is a string. Therefore it set the type to `object` instead of the expected `int64` or `float64`.\n",
"\n",
"#### Using numpy.nan\n",
"\n",
"For better or worse, pandas assumes that \"bad values\" are marked in the data as numpy **NaN**. NaN is short for \"Not a Number\". If they are so marked we have access to some of the imputation methods, replacing NaN with various values (mean, median, specific value, etc.). \n",
"\n",
"There are two ways to do this:\n",
"1. You can do a `.replace` on the column using a dictionary of {value to replace : new value, ...} pairs. Remember to save the result. You then still have to change the column type using `.astype`, and you will have to convert to a float (`\"float64\"` is a good choice): `np.nan` cannot be stored in an integer column, but it can be stored in a float column.\n",
"2. You can convert everything that can be converted to a number using `pd.to_numeric`. Conveniently, if you pass `errors=\"coerce\"`, any value that cannot be converted is set to `np.nan`.\n",
"\n",
"✎ Do This - Convert the missing entries in the GNP column to `np.nan` and show the head of your modified DataFrame. Also print the `dtypes` to show that the column has changed type."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"## your code here"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"birth rate float64\n",
"death rate float64\n",
"infant mortality float64\n",
"male LE float64\n",
"female LE float64\n",
"GNP float64\n",
"country group int64\n",
"country object\n",
"dtype: object"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"### ANSWER ###\n",
"\n",
"# Assign the result back instead of using a chained inplace replace,\n",
"# which raises a FutureWarning and will not work in pandas 3.0\n",
"poverty_df['GNP'] = poverty_df['GNP'].replace('*', np.nan).astype('float64')\n",
"\n",
"poverty_df.dtypes"
]
},
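{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The two conversion routes described above can be compared side by side on a small stand-in column. This is only a sketch: `gnp_raw` is a made-up Series that mimics the string-typed `GNP` column, not the real data.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical stand-in for the raw GNP column, with '*' marking a missing value\n",
"gnp_raw = pd.Series(['600', '2250', '*', '2780'], name='GNP')\n",
"\n",
"# Way 1: replace the marker with NaN, then cast the strings to float\n",
"gnp_replaced = gnp_raw.replace('*', np.nan).astype('float64')\n",
"\n",
"# Way 2: let pandas coerce anything non-numeric to NaN\n",
"gnp_coerced = pd.to_numeric(gnp_raw, errors='coerce')\n",
"\n",
"print(gnp_replaced.equals(gnp_coerced))\n",
"```\n",
"\n",
"Note that `errors='coerce'` turns every unparseable entry into `np.nan`, not only `'*'`, so it is worth counting the missing values afterwards to see how many entries were coerced."
]
},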
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Changing numpy.nan\n",
"\n",
"Now that \"bad values\" are marked as `numpy.nan`, we can use the DataFrame method `fillna` to change those values. For example:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 600.0\n",
"1 2250.0\n",
"2 2980.0\n",
"3 0.0\n",
"4 2780.0\n",
" ... \n",
"92 220.0\n",
"93 110.0\n",
"94 220.0\n",
"95 420.0\n",
"96 640.0\n",
"Name: GNP, Length: 97, dtype: float64"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## Uncomment to run\n",
"\n",
"poverty_df[\"GNP\"].fillna(0)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"returns a new Series where all the `np.nan` values in the GNP column are replaced with 0. You can do other things as well, for example:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" birth rate death rate infant mortality male LE female LE GNP \\\n",
"0 24.7 5.7 30.8 69.6 75.5 600.000000 \n",
"1 12.5 11.9 14.4 68.3 74.7 2250.000000 \n",
"2 13.4 11.7 11.3 71.8 77.7 2980.000000 \n",
"3 12.0 12.4 7.6 69.8 75.9 5741.252747 \n",
"4 11.6 13.4 14.8 65.4 73.8 2780.000000 \n",
".. ... ... ... ... ... ... \n",
"92 52.2 15.6 103.0 49.9 52.7 220.000000 \n",
"93 50.5 14.0 106.0 51.3 54.7 110.000000 \n",
"94 45.6 14.2 83.0 50.3 53.7 220.000000 \n",
"95 51.1 13.7 80.0 50.4 52.5 420.000000 \n",
"96 41.7 10.3 66.0 56.5 60.1 640.000000 \n",
"\n",
" country group country \n",
"0 1 Albania \n",
"1 1 Bulgaria \n",
"2 1 Czechoslovakia \n",
"3 1 Former_E._Germany \n",
"4 1 Hungary \n",
".. ... ... \n",
"92 6 Uganda \n",
"93 6 Tanzania \n",
"94 6 Zaire \n",
"95 6 Zambia \n",
"96 6 Zimbabwe \n",
"\n",
"[97 rows x 8 columns]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## Uncomment to run\n",
"poverty_df[\"GNP\"].fillna(poverty_df[\"GNP\"].mean() )\n",
"poverty_df.fillna({\"GNP\": poverty_df[\"GNP\"].mean() })"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The first version changes any `np.nan` in the `GNP` column to the mean of the column. The second takes a dictionary where the key is the column to change and the value is what to replace the `np.nan` with. Note that you could replace with other values such as the median, min, max, or some other fixed value.\n",
"\n",
"Remember that all of these examples return either a new Series (when working with just a column) or a new DataFrame (when working with the entire DataFrame). Nothing is changed in the original unless you assign the result or use `inplace=True` when calling the method on the DataFrame itself.\n",
"\n",
"Finally, if you decide that the right thing to do is **remove** any row with a `np.nan` value, we can use the `.dropna` method of DataFrames as shown below:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"97 91\n"
]
}
],
"source": [
"## Uncomment to run\n",
"len(poverty_df)\n",
"poverty_df_dropped = poverty_df.dropna()\n",
"print(len(poverty_df), len(poverty_df_dropped))"
]
},
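{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the trade-offs concrete, here is a small sketch that applies the options above to a made-up frame. `toy_df` and its values are hypothetical; only `fillna`, `dropna`, `mean`, and `median` are real pandas calls.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical four-row frame with one missing GNP value\n",
"toy_df = pd.DataFrame({'GNP': [600.0, np.nan, 2980.0, 220.0], 'country': ['A', 'B', 'C', 'D']})\n",
"\n",
"# Each call returns a new object; toy_df itself is left untouched\n",
"filled_mean = toy_df.fillna({'GNP': toy_df['GNP'].mean()})\n",
"filled_median = toy_df.fillna({'GNP': toy_df['GNP'].median()})\n",
"dropped = toy_df.dropna()\n",
"\n",
"print(filled_mean['GNP'].iloc[1], filled_median['GNP'].iloc[1], len(dropped))\n",
"```\n",
"\n",
"Because the toy values are skewed, the mean and median fills differ noticeably, which is one reason to look at the distribution of a column before choosing an imputation value."
]
},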
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"#### What do you think\n",
"\n",
"In the cell below, discuss with your group what you think is the best thing to do with the \"bad values\" in the DataFrame given the discussion above. Write your result below."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"✎ Answer here"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Multiple Regression\n",
"\n",
"In the past, we have limited ourselves to either a single feature or, in the pre-class, doing polynomial regression with other features we created. However, we can just as easily use all, or some combination of all, the features available to make an ordinary least squares (OLS) model. The question is, is it a good idea to just use all the possible features available to make a model?\n",
"\n",
"Please discuss that idea with your group and record your answer below."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"✎ Answer here"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.1 Infant Mortality model\n",
"\n",
"Using the U.N.E.S.C.O. data, we can make a model of \"Infant Mortality\" as the dependent variable against all the other available features. As a hint, an easy way to do this is to make the model with \"Infant Mortality\" as the prediction (the dependent variable) and the entire DataFrame with \"Infant Mortality\" dropped as the data (the independent variables). **You should also drop the \"Country\" column as unique strings don't play well in basic linear models.**\n",
"\n",
"✎ Do This - Make a linear model (did you split your data?) that predicts \"Infant Mortality\" using the other variables (dropping the \"Country\" column) and print the `r2_score` of that process."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"# your code here"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"r2: 0.9401\n"
]
}
],
"source": [
"### ANSWER ###\n",
"\n",
"X = poverty_df_dropped.drop(columns = ['infant mortality', 'country'])\n",
"y = poverty_df_dropped['infant mortality']\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
"\n",
"linear = LinearRegression()\n",
"\n",
"linear.fit(X_train,y_train)\n",
"\n",
"y_pred = linear.predict(X_test)\n",
"\n",
"r2=metrics.r2_score(y_test, y_pred)\n",
"print('r2: ', round(r2,4))"
]
},
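{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The same split, fit, predict, and score workflow can be checked on synthetic data where the true coefficients are known. This is a sketch under stated assumptions: the data, the coefficients 3 and -2, the noise level, and all `toy` names are invented for illustration and have nothing to do with the U.N.E.S.C.O. data.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.metrics import r2_score\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Synthetic data: y depends linearly on two features plus a little noise\n",
"rng = np.random.default_rng(0)\n",
"X_toy = rng.normal(size=(200, 2))\n",
"y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)\n",
"\n",
"# Hold out 20% of the rows so the score reflects unseen data\n",
"X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)\n",
"\n",
"toy_model = LinearRegression().fit(X_tr, y_tr)\n",
"toy_r2 = r2_score(y_te, toy_model.predict(X_te))\n",
"print(toy_model.coef_, round(toy_r2, 4))\n",
"```\n",
"\n",
"On real data the true coefficients are unknown, so the held-out score and the predicted-versus-true plot are the main checks available."
]
},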
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.2 Visualizing your fit\n",
"\n",
"We can check how well justified we are in using this model by comparing the actual and predicted values. Plot the predicted values against the real values. In a perfect model, they would fall on a line through the origin with a slope of 1.\n",
"\n",
"\n",
"✎ Do This - Make the plot mentioned above. How well does your model fit your data? What can you conclude from this graph?"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"### your code here"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 500x500 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"### ANSWER ###\n",
"\n",
"plt.figure(figsize=(5,5))\n",
"plt.scatter(y_test,y_pred)\n",
"plt.xlabel('True Values')\n",
"plt.ylabel('Predicted Values')\n",
"plt.plot([-50,200],[-50,200], color='k', lw=3)\n",
"plt.show()\n",
"\n",
"# res = y_test-y_pred\n",
"\n",
"# plt.figure(figsize=(5,5))\n",
"# plt.scatter(X_test['male LE'], res)\n",
"# plt.xlabel('x')\n",
"# plt.ylabel('Residuals')"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.3 A \"reduced\" model using only the \"significant\" features\n",
"\n",
"Modeling data is as much a craft as it is a science. We often seek the simplest models that explain our data well because they are typically more interpretable, easier to explain, and provide information on the main influences of the system we are studying. There are reasons we might want a more complex model to capture the details and the nuance of the system. But for the U.N.E.S.C.O. data that we have, we are likely able to capture most of the system using a smaller number of features. \n",
"\n",
"✎ Do This - Use the built-in `pandas` correlation function (`.corr()`) to find the top 3 variables that correlate strongly with \"Infant Mortality\""
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"# your code here"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" birth rate death rate infant mortality male LE \\\n",
"birth rate 1.000000 0.505626 0.856537 -0.866200 \n",
"death rate 0.505626 1.000000 0.677741 -0.754032 \n",
"infant mortality 0.856537 0.677741 1.000000 -0.935238 \n",
"male LE -0.866200 -0.754032 -0.935238 1.000000 \n",
"female LE -0.894440 -0.714751 -0.954225 0.981957 \n",
"GNP -0.629059 -0.302754 -0.601647 0.642963 \n",
"country group 0.710212 0.344168 0.631913 -0.642059 \n",
"\n",
" female LE GNP country group \n",
"birth rate -0.894440 -0.629059 0.710212 \n",
"death rate -0.714751 -0.302754 0.344168 \n",
"infant mortality -0.954225 -0.601647 0.631913 \n",
"male LE 0.981957 0.642963 -0.642059 \n",
"female LE 1.000000 0.650040 -0.699097 \n",
"GNP 0.650040 1.000000 -0.283399 \n",
"country group -0.699097 -0.283399 1.000000 "
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"### ANSWER ###\n",
"\n",
"# Drop the 'country' column\n",
"poverty_df_dropped_no_country = poverty_df_dropped.drop('country', axis=1)\n",
"\n",
"# Now you can calculate correlation\n",
"poverty_df_dropped_no_country.corr()"
]
},
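{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Rather than reading the top correlations off the table by eye, they can be ranked programmatically. The sketch below copies the `infant mortality` column of the correlation table above into a Series by hand (the names `corr_with_target` and `top3` are assumptions for illustration) and ranks by absolute value, since a strong negative correlation is as informative as a strong positive one.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Correlations with 'infant mortality', copied from the table above\n",
"corr_with_target = pd.Series({\n",
"    'birth rate': 0.856537,\n",
"    'death rate': 0.677741,\n",
"    'male LE': -0.935238,\n",
"    'female LE': -0.954225,\n",
"    'GNP': -0.601647,\n",
"    'country group': 0.631913,\n",
"})\n",
"\n",
"# Rank by magnitude so the sign of the correlation does not affect the ordering\n",
"top3 = corr_with_target.abs().nlargest(3).index.tolist()\n",
"print(top3)\n",
"```\n",
"\n",
"With the real data, the same Series could be obtained from `poverty_df_dropped_no_country.corr()['infant mortality'].drop('infant mortality')`."
]
},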
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"✎ Do This - Redo the model with only the top three features you found above vs \"Infant Mortality\". Print the `r2_score`. How does it compare to the full model?"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"### your code here"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"r2: 0.8723\n"
]
}
],
"source": [
"### ANSWER ###\n",
"\n",
"X1 = poverty_df_dropped.drop(columns = ['infant mortality', 'country', 'country group', 'GNP', 'death rate'])\n",
"y1 = poverty_df_dropped['infant mortality']\n",
"\n",
"X_train1, X_test1, y_train1, y_test1 = train_test_split(X1, y1, test_size=0.2)\n",
"\n",
"linear = LinearRegression()\n",
"\n",
"linear.fit(X_train1,y_train1)\n",
"\n",
"y_pred1 = linear.predict(X_test1)\n",
"\n",
"r2=metrics.r2_score(y_test1, y_pred1)\n",
"print('r2: ', round(r2,4))"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"✎ Do This - Make the same comparison plot mentioned above. How well does your model fit your data? What can you conclude from this graph? Can you compare it to the previous fit?"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"### your code here"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"### ANSWER ###\n",
"\n",
"plt.figure(figsize=(5,5))\n",
"plt.scatter(y_test,y_pred)\n",
"plt.scatter(y_test,y_pred1)\n",
"plt.xlabel('True Values')\n",
"plt.ylabel('Predicted Values')\n",
"plt.legend(['Full Model', 'Reduced Model'])\n",
"\n",
"plt.plot([-50,200],[-50,200], color='k', lw=3)\n",
"plt.show()\n",
"\n",
"# res = y_test-y_pred\n",
"\n",
"# plt.figure(figsize=(5,5))\n",
"# plt.scatter(X_test['male LE'], res)\n",
"# plt.xlabel('x')\n",
"# plt.ylabel('Residuals')"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Review this model and the one you constructed earlier in the notebook. Report how the adjusted $R^2$ value changed when using only the top three features versus all the available features. How well does this reduced model appear to fit your data?"
]
},
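{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The `r2_score` function in `scikit-learn` returns the plain $R^2$, so the adjusted value has to be computed separately. One common definition is $R^2_{adj} = 1 - (1 - R^2)\\frac{n - 1}{n - p - 1}$, where $n$ is the number of samples scored and $p$ is the number of features. The sketch below is a hypothetical helper, not part of `scikit-learn`; it plugs in the two $R^2$ values printed above and assumes a 19-row test split (20% of the 91 rows kept after `dropna`, rounded up).\n",
"\n",
"```python\n",
"def adjusted_r2(r2, n_samples, n_features):\n",
"    # Penalize R^2 for the number of features used to achieve it\n",
"    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)\n",
"\n",
"# Assumed inputs: the R^2 values printed above and a 19-row test split\n",
"n_test = 19\n",
"adj_full = adjusted_r2(0.9401, n_test, n_features=6)\n",
"adj_reduced = adjusted_r2(0.8723, n_test, n_features=3)\n",
"print(round(adj_full, 4), round(adj_reduced, 4))\n",
"```\n",
"\n",
"Keep in mind that the reduced model above was scored on an unseeded split, so its $R^2$ (and therefore its adjusted $R^2$) will vary from run to run."
]
},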
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"✎ Answer here"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "ml-short-course",
"language": "python",
"name": "ml-short-course"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}