OLS is doing the fit that I want. You will get more practice with this today.
We will use quintic data: $y = x^5 - 3x^4 - 6x^3 + 2x^2 - 3x + 2$ from $x = -2$ to $+2$.
import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt
x_ary = np.arange(-2, 2, 0.05)  # 80 points on [-2, 2)
y_ary = x_ary**5 - 3*x_ary**4 - 6*x_ary**3 + 2*x_ary**2 - 3*x_ary + 2
y_noisy = y_ary + np.random.uniform(-15, 15.0, len(y_ary))  # add uniform noise
fig = plt.figure(figsize=(4,4))
plt.plot(x_ary, y_ary, 'x')
plt.plot(x_ary, y_noisy, 'o')
plt.xlabel('x', fontsize = 14)
plt.ylabel('y', fontsize = 14)
plt.legend(['Original', 'Noisy'])
Our model will try to fit

$$y = c_0 + c_1 x + c_2 x^2 + c_3 x^3 + c_4 x^4 + c_5 x^5 + c_6 x^6 + c_7 x^7 + c_8 x^8 + c_9 x^9 + c_{10} x^{10} + \epsilon$$

to the noisy data. statsmodels will estimate the $c_i$ each time we make the model. We will be able to set different $c_i$'s to zero to reduce the complexity of the model.
import pandas as pd
column_names = ['x0','x1','x2','x3','x4','x5','x6','x7','x8','x9','x10']
const_values = np.ones(len(x_ary))
data = np.array([const_values, x_ary, x_ary**2, x_ary**3, x_ary**4,
                 x_ary**5, x_ary**6, x_ary**7, x_ary**8, x_ary**9,
                 x_ary**10])
df = pd.DataFrame(data.T, columns = column_names)
df.head()
 | x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10
---|---|---|---|---|---|---|---|---|---|---|---
0 | 1.0 | -2.00 | 4.0000 | -8.000000 | 16.000000 | -32.000000 | 64.000000 | -128.000000 | 256.000000 | -512.000000 | 1024.000000
1 | 1.0 | -1.95 | 3.8025 | -7.414875 | 14.459006 | -28.195062 | 54.980371 | -107.211724 | 209.062862 | -407.672580 | 794.961532
2 | 1.0 | -1.90 | 3.6100 | -6.859000 | 13.032100 | -24.760990 | 47.045881 | -89.387174 | 169.835630 | -322.687698 | 613.106626
3 | 1.0 | -1.85 | 3.4225 | -6.331625 | 11.713506 | -21.669987 | 40.089475 | -74.165529 | 137.206229 | -253.831523 | 469.588318
4 | 1.0 | -1.80 | 3.2400 | -5.832000 | 10.497600 | -18.895680 | 34.012224 | -61.222003 | 110.199606 | -198.359290 | 357.046723
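As an aside, NumPy can build this power matrix directly. A minimal sketch using np.vander (with increasing=True so the columns run from $x^0$ up to $x^{10}$), which should reproduce the DataFrame we just built by hand:

vander = np.vander(x_ary, 11, increasing=True)  # columns: x^0, x^1, ..., x^10
df_v = pd.DataFrame(vander, columns=column_names)
print(np.allclose(df_v.values, df.values))  # sanity check against the hand-built df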
model = sm.OLS(y_noisy, df['x0']) # constant-only model: y = c0 + eps
results = model.fit()
results.summary()
Dep. Variable: | y | R-squared: | 0.000 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.000 |
Method: | Least Squares | F-statistic: | nan |
Date: | Thu, 24 Sep 2020 | Prob (F-statistic): | nan |
Time: | 10:11:12 | Log-Likelihood: | -344.10 |
No. Observations: | 80 | AIC: | 690.2 |
Df Residuals: | 79 | BIC: | 692.6 |
Df Model: | 0 | |
Covariance Type: | nonrobust | |
 | coef | std err | t | P>\|t\| | [0.025 | 0.975]
---|---|---|---|---|---|---
x0 | -3.8682 | 2.009 | -1.926 | 0.058 | -7.867 | 0.130
Omnibus: | 13.787 | Durbin-Watson: | 0.539 |
---|---|---|---|
Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 14.995 |
Skew: | -1.011 | Prob(JB): | 0.000554 |
Kurtosis: | 3.642 | Cond. No. | 1.00 |
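With only the constant column, the model can do no better than fit the mean of the data, which is why R-squared is 0.000 and the F-statistic is nan (there are no regressors to test). If we just want the fitted coefficients rather than the full summary, the results object exposes them directly; a quick sketch using standard statsmodels attributes:

print(results.params)      # fitted coefficients, indexed by column name
print(results.conf_int())  # 95% confidence interval on each coefficient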
model = sm.OLS(y_noisy, df[['x0','x1']]) # linear model: y = c0 + c1*x + eps
results = model.fit()
results.summary()
Dep. Variable: | y | R-squared: | 0.469 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.462 |
Method: | Least Squares | F-statistic: | 68.96 |
Date: | Thu, 24 Sep 2020 | Prob (F-statistic): | 2.42e-12 |
Time: | 10:11:14 | Log-Likelihood: | -318.76 |
No. Observations: | 80 | AIC: | 641.5 |
Df Residuals: | 78 | BIC: | 646.3 |
Df Model: | 1 | |
Covariance Type: | nonrobust | |
 | coef | std err | t | P>\|t\| | [0.025 | 0.975]
---|---|---|---|---|---|---
x0 | -4.1330 | 1.473 | -2.806 | 0.006 | -7.066 | -1.200
x1 | -10.5928 | 1.276 | -8.304 | 0.000 | -13.132 | -8.053
Omnibus: | 4.286 | Durbin-Watson: | 1.014 |
---|---|---|---|
Prob(Omnibus): | 0.117 | Jarque-Bera (JB): | 4.015 |
Skew: | -0.483 | Prob(JB): | 0.134 |
Kurtosis: | 2.480 | Cond. No. | 1.16 |
model = sm.OLS(y_noisy, df[['x0','x1', 'x2']]) # quadratic model: y = c0 + c1*x + c2*x^2 + eps
results = model.fit()
results.summary()
Dep. Variable: | y | R-squared: | 0.714 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.706 |
Method: | Least Squares | F-statistic: | 96.07 |
Date: | Thu, 24 Sep 2020 | Prob (F-statistic): | 1.19e-21 |
Time: | 10:11:15 | Log-Likelihood: | -294.04 |
No. Observations: | 80 | AIC: | 594.1 |
Df Residuals: | 77 | BIC: | 601.2 |
Df Model: | 2 | |
Covariance Type: | nonrobust | |
 | coef | std err | t | P>\|t\| | [0.025 | 0.975]
---|---|---|---|---|---|---
x0 | 5.7385 | 1.632 | 3.515 | 0.001 | 2.488 | 8.989
x1 | -10.9633 | 0.944 | -11.617 | 0.000 | -12.842 | -9.084
x2 | -7.4083 | 0.913 | -8.115 | 0.000 | -9.226 | -5.590
Omnibus: | 13.192 | Durbin-Watson: | 1.870 |
---|---|---|---|
Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 3.971 |
Skew: | 0.143 | Prob(JB): | 0.137 |
Kurtosis: | 1.947 | Cond. No. | 3.21 |
Rather than adding one power at a time by hand, we can loop over the columns, fitting a model with powers $0$ through $k$ and recording the adjusted $R^2$ for each:

rsqr_arry = []
columns = []
for column in column_names:
    columns.append(column)                  # add the next power to the model
    model = sm.OLS(y_noisy, df[columns])    # make the model
    results = model.fit()                   # run the OLS fit
    rsqr_arry.append(results.rsquared_adj)  # store adjusted R^2
fig = plt.figure(figsize=(4,4))
plt.plot(np.arange(0, 11, 1), rsqr_arry, 'x-')
plt.xlabel('Power', fontsize = 14)
plt.ylabel(r'Adjusted $R^2$', fontsize = 14);
As we might have expected, $y_5 = c_0 + c_1 x + c_2 x^2 + c_3 x^3 + c_4 x^4 + c_5 x^5 + \epsilon$ produces the "best" model.
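Adjusted $R^2$ is not the only selection criterion; the summaries above also report AIC and BIC, which penalize extra parameters more heavily. A sketch of the same loop tracking results.aic (lower is better), which we would expect to reach its minimum near the same model order:

aic_arry = []
cols = []
for column in column_names:
    cols.append(column)
    results = sm.OLS(y_noisy, df[cols]).fit()
    aic_arry.append(results.aic)  # Akaike information criterion
fig = plt.figure(figsize=(4,4))
plt.plot(np.arange(0, 11, 1), aic_arry, 's-')
plt.xlabel('Power', fontsize = 14)
plt.ylabel('AIC', fontsize = 14);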
best_fit = ['x0', 'x1', 'x2', 'x3', 'x4', 'x5']
model = sm.OLS(y_noisy, df[best_fit]) # make the model
results_b = model.fit()
fig = plt.figure(figsize=(4,4))
plt.plot(x_ary, y_noisy, 'o')
plt.plot(x_ary, results_b.predict(), '-x')
plt.xlabel('x', fontsize = 14)
plt.ylabel('y', fontsize = 14)
plt.legend(['data', 'best fit'], fontsize = 14)
Linear fit: $y_1 = c_1 x + c_0 +\epsilon$
This is underfitting!
linear_fit = ['x0','x1']
model = sm.OLS(y_noisy, df[linear_fit]) # make the model
results_l = model.fit()
fig = plt.figure(figsize=(4,4))
plt.plot(x_ary, y_noisy, 'o')
plt.plot(x_ary, results_l.predict(), '-o')
plt.xlabel('x', fontsize = 14)
plt.ylabel('y', fontsize = 14)
plt.legend(['data', 'linear fit'], fontsize = 14)
Full fit: $y = c_0 + c_1 x + c_2 x^2 + c_3 x^3 + c_4 x^4 + c_5 x^5 + c_6 x^6 + c_7 x^7 + c_8 x^8 + c_9 x^9 + c_{10} x^{10} + \epsilon$
This is overfitting!
full_fit = ['x0','x1','x2','x3','x4','x5','x6','x7','x8','x9','x10']
model = sm.OLS(y_noisy, df[full_fit]) # make the model
results_f = model.fit()
fig = plt.figure(figsize=(4,4))
plt.plot(x_ary, y_noisy, 'o')
plt.plot(x_ary, results_f.predict(), '-s')
plt.xlabel('x', fontsize = 14)
plt.ylabel('y', fontsize = 14)
plt.legend(['data', 'full fit'], fontsize = 14)
fig,(ax1, ax2) = plt.subplots(1, 2, figsize = (8,4))
ax1.plot(x_ary, y_ary, '-', lw = 2)
ax1.plot(x_ary, y_noisy, 'o')
ax1.plot(x_ary, results_b.predict(), '-s')
ax2.plot(x_ary, y_ary, '-', lw = 2)
ax2.plot(x_ary, y_noisy, 'o')
ax2.plot(x_ary, results_f.predict(), '-d')
ax1.set_xlabel('x', fontsize = 14)
ax2.set_xlabel('x', fontsize = 14)
ax1.set_ylabel('y', fontsize = 14)
ax2.set_ylabel('y', fontsize = 14)
ax1.legend(['original', 'data', 'best_fit'], fontsize = 14)
ax2.legend(['original', 'data', 'full_fit'], fontsize = 14)
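We can also compare the fits numerically: because we generated the data ourselves, we know the noise-free curve y_ary and can measure how far each fitted curve strays from it. A minimal sketch (only possible here because the true curve is known):

# root-mean-square error of each fitted curve against the noise-free quintic
for name, res in [('linear', results_l), ('best', results_b), ('full', results_f)]:
    rmse = np.sqrt(np.mean((res.predict() - y_ary)**2))
    print(f'{name} fit: RMSE vs true curve = {rmse:.2f}')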
Finally, a parity plot of fitted values against the data: we would expect a perfect model to place every point on the diagonal line.
fig = plt.figure(figsize = (4,4))
plt.plot(y_noisy, results_l.predict(), 'd')
plt.plot(y_noisy, results_b.predict(), 's')
plt.plot([-60,20], [-60,20], '-r', lw=2)
plt.xlabel('data', fontsize = 14)
plt.ylabel('fitted values', fontsize = 14)
plt.legend(['linear fit', 'best fit'], fontsize = 14)