Day 12 - Principal Component Analysis

Oct. 20, 2020

Administrative

  • Midterm will be given Thursday 10/29 in class
    • Study Guide posted on D2L
    • Thursday: Discussion of questions for review
    • Tuesday: Midterm review assignment and discussion
  • Thursday's Class:
    • We will spend the first ~1/3 of class checking in with you (re: mental health and strategies that are working)
    • We will spend the last ~2/3 brainstorming detailed questions and tasks for the review on Tuesday

Sec 003 Midterm

Classification Problem

In this midterm you will be asked to:

  • Read in a data set and describe different properties of it (counts, means, etc.)
  • Investigate the data for less relevant features and drop them
  • Visualize feature spaces and discuss the plots
  • Build a classification model using the train/test paradigm
  • Evaluate and discuss the fit of the model using the testing data

Nearly everything we have done so far matters for your success on the midterm, but the midterm itself focuses on classification and on modeling with the train/test split.

Assignments to definitely study: Day 09, Day 10, Day 11, and Day 11.5

From Pre-Class Assignment

Useful bits

  • Most folks got the code working
  • Videos were helpful in understanding the conceptual aspects of PCA

Challenging bits

Some really great questions:

  • Why do we need to use a PCA?
  • When do we use a PCA?
  • What is the PCA doing with the iris data set?

Principal Component Analysis (PCA)

Why do we need PCA?

There are lots of reasons, but two major ones are below.

  • Consider a data set with many, many features. Analyzing such a large data set can be computationally intensive, so instead we use PCA to extract the components that capture most of the variance in the features and analyze those components instead. Benefit: less computationally intensive; quicker work
  • Consider a data set whose features overlap significantly (they are strongly correlated with each other). That is, it's hard to tell what's important and what isn't. PCA can produce a better basis of uncorrelated components with similar (sometimes the same) information for modeling. Benefit: more meaningful features; often more accurate models

Let's dive into the iris data set to see this

In [14]:
# imports
import numpy as np
import scipy.linalg
import sklearn.decomposition as dec
import sklearn.datasets as ds
import matplotlib.pyplot as plt
import pandas as pd

iris = ds.load_iris()
data = pd.DataFrame(iris.data, columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])
target = pd.DataFrame(iris.target, columns=['species'])

Let's look at the data

In [15]:
plt.figure(figsize=(8,5));
plt.scatter(data['sepal_length'],data['sepal_width'], c=target['species'], s=30, cmap=plt.cm.rainbow);
plt.xlabel('sepal_length'); plt.ylabel('sepal_width')  # label axes with the plotted feature names
plt.axis([4, 8, 2, 4.5])
Out[15]:
(4.0, 8.0, 2.0, 4.5)
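
To make the "overlap between features" idea concrete, here is a minimal sketch (not part of the original notebook) that compares correlations between the original iris features with correlations between the principal components. The variable names `corr`, `pcs`, and `pc_corr` are just illustrative choices.

```python
import numpy as np
import pandas as pd
import sklearn.datasets as ds
import sklearn.decomposition as dec

iris = ds.load_iris()
cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
data = pd.DataFrame(iris.data, columns=cols)

# Pairwise correlations between the original features;
# values near +1 or -1 mean two features carry overlapping information
corr = data.corr()
print(corr.round(2))

# Project onto the principal components and check their correlations;
# the off-diagonal entries should be zero up to round-off
pcs = dec.PCA().fit_transform(data)
pc_corr = np.corrcoef(pcs, rowvar=False)
print(np.round(pc_corr, 2))
```

The petal measurements are strongly correlated with each other, while the principal components are uncorrelated by construction. That is what we mean by PCA producing a "better basis".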

Let's make a KNN classifier

In [21]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix  # only the confusion matrix is used below

train_features, test_features, train_labels, test_labels = train_test_split(data,
                                                                            target['species'],
                                                                            train_size = 0.75,
                                                                            random_state=3)
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(train_features, train_labels)

y_predict = neigh.predict(test_features)
print(confusion_matrix(test_labels, y_predict))
print(neigh.score(test_features, test_labels))
[[15  0  0]
 [ 0 10  2]
 [ 0  0 11]]
0.9473684210526315
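
As a quick check on how the score relates to the confusion matrix, this sketch recomputes the accuracy by hand from the matrix printed above. It assumes the usual scikit-learn layout (rows are true classes, columns are predicted classes); the array `cm` is simply typed in from the output.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class)
cm = np.array([[15, 0, 0],
               [0, 10, 2],
               [0, 0, 11]])

# Accuracy: correctly classified samples (the diagonal) over all test samples
accuracy = np.trace(cm) / cm.sum()
print(accuracy)

# Per-class recall: correct predictions for each true class over that class's row total
recall = np.diag(cm) / cm.sum(axis=1)
print(recall)
```

Here 36 of the 38 test samples sit on the diagonal, which reproduces the score of about 0.947. The two misses are both class 1 samples predicted as class 2.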

What happens if we use fewer features?

In [22]:
train_features, test_features, train_labels, test_labels = train_test_split(data.drop(columns=['petal_length','petal_width']),
                                                                            target['species'],
                                                                            train_size = 0.75,
                                                                            random_state=3)
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(train_features, train_labels)

y_predict = neigh.predict(test_features)
print(confusion_matrix(test_labels, y_predict))
print(neigh.score(test_features, test_labels))
[[14  1  0]
 [ 0  7  5]
 [ 0  7  4]]
0.6578947368421053

Let's do a PCA to find the principal components

In [23]:
pca = dec.PCA()
pca_data = pca.fit_transform(data)
print(pca.explained_variance_)

pca_data = pd.DataFrame(pca_data, columns=['PC1', 'PC2', 'PC3', 'PC4'])
plt.figure(figsize=(8,3));
plt.scatter(pca_data['PC1'], pca_data['PC2'], c=target['species'], s=30, cmap=plt.cm.rainbow);
plt.xlabel('PC1'); plt.ylabel('PC2')
plt.axis([-4, 4, -1.5, 1.5])
[4.22824171 0.24267075 0.0782095  0.02383509]
Out[23]:
(-4.0, 4.0, -1.5, 1.5)
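
What are those `explained_variance_` numbers? One way to see it, sketched below under the assumption that scikit-learn's defaults are used, is that they match the eigenvalues of the covariance matrix of the features. The names `cov`, `eigvals`, and `cumulative` are illustrative and not from the original notebook.

```python
import numpy as np
import scipy.linalg
import sklearn.datasets as ds
import sklearn.decomposition as dec

X = ds.load_iris().data
pca = dec.PCA().fit(X)

# Covariance matrix of the four features (columns are variables)
cov = np.cov(X, rowvar=False)

# eigh returns eigenvalues in ascending order, so reverse to get largest first
eigvals = scipy.linalg.eigh(cov, eigvals_only=True)[::-1]
print(eigvals)
print(pca.explained_variance_)

# Fraction of the total variance captured by the first k components
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(cumulative)
```

The cumulative fractions show that the first two components already account for most of the variance in the iris features, which is why dropping PC3 and PC4 is a reasonable thing to try.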

Let's train a KNN model on all four principal components

In [24]:
train_features, test_features, train_labels, test_labels = train_test_split(pca_data,
                                                                            target['species'],
                                                                            train_size = 0.75,
                                                                            random_state=3)
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(train_features, train_labels)

y_predict = neigh.predict(test_features)
print(confusion_matrix(test_labels, y_predict))
print(neigh.score(test_features, test_labels))
[[15  0  0]
 [ 0 10  2]
 [ 0  0 11]]
0.9473684210526315

Let's use only the first two principal components

In [25]:
train_features, test_features, train_labels, test_labels = train_test_split(pca_data.drop(columns=['PC3','PC4']),
                                                                            target['species'],
                                                                            train_size = 0.75,
                                                                            random_state=3)
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(train_features, train_labels)

y_predict = neigh.predict(test_features)
print(confusion_matrix(test_labels, y_predict))
print(neigh.score(test_features, test_labels))
[[15  0  0]
 [ 0 10  2]
 [ 0  0 11]]
0.9473684210526315
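
One thing to be aware of: in the cells above, the PCA was fit on the full data set before the train/test split, so the test rows helped determine the components. A possible alternative, sketched here as an assumption about good practice rather than as the notebook's method, is to chain PCA and KNN in a scikit-learn `Pipeline` so the PCA is fit on the training data only. The name `pipe` is illustrative, and the score is not guaranteed to match the one above.

```python
import sklearn.datasets as ds
import sklearn.decomposition as dec
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

iris = ds.load_iris()

# Split first, using the same settings as the cells above
train_features, test_features, train_labels, test_labels = train_test_split(
    iris.data, iris.target, train_size=0.75, random_state=3)

# PCA (keeping two components) followed by KNN;
# fit() learns the components from the training rows only
pipe = Pipeline([
    ('pca', dec.PCA(n_components=2)),
    ('knn', KNeighborsClassifier(n_neighbors=3)),
])
pipe.fit(train_features, train_labels)

# score() applies the already-fitted PCA to the test rows before classifying
score = pipe.score(test_features, test_labels)
print(score)
```

On a small, clean data set like iris the difference is likely to be minor, but keeping the transform inside the pipeline means the test data stays unseen until evaluation.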

Questions, Comments, Concerns?