## What `make_classification` is doing

We will be doing classification tasks for a few weeks, so we will get lots of practice.
We will learn Logistic Regression, KNN, and SVM, but sklearn provides access to the other three methods as well.
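As a preview, each of the three methods we will focus on lives in its own sklearn submodule; a minimal sketch of the imports we will rely on later (nothing is fit here yet):

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC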
`make_classification` lets us make fake data and control the kind of data we get. Its key parameters:
- `n_features` - the total number of features that can be used in the model
- `n_informative` - the number of features that provide unique information for the classes
- `n_redundant` - the number of features that are built from informative features (i.e., have redundant information)
- `n_classes` - the number of class labels (default 2: 0/1)
- `n_clusters_per_class` - the number of clusters per class

import matplotlib.pyplot as plt
plt.style.use('seaborn-colorblind')
from sklearn.datasets import make_classification
features, class_labels = make_classification(n_samples=1000,
                                             n_features=3,
                                             n_informative=2,
                                             n_redundant=1,
                                             n_clusters_per_class=1,
                                             random_state=201)
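Before plotting, it is worth a quick sanity check on what came back; a minimal sketch using numpy (assumed available alongside sklearn):

import numpy as np

print(features.shape)             # (1000, 3): one row per sample, one column per feature
print(class_labels.shape)         # (1000,): one 0/1 label per sample
print(np.bincount(class_labels))  # counts per class; roughly balanced by default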
## Let's look at these 3D data
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(8,8))
# On recent matplotlib, Axes3D(fig, ...) no longer attaches itself to the
# figure automatically, so create the 3D axes via add_axes instead
ax = fig.add_axes([0, 0, .95, 1], projection='3d')
ax.view_init(elev=30, azim=135)
xs = features[:, 0]
ys = features[:, 1]
zs = features[:, 2]
ax.scatter3D(xs, ys, zs, c=class_labels, ec='k')
ax.set_xlabel('feature 0')
ax.set_ylabel('feature 1')
ax.set_zlabel('feature 2')
## From a different angle, we see the 2D nature of the data
fig = plt.figure(figsize=(8,8))
ax = fig.add_axes([0, 0, .95, 1], projection='3d')
ax.view_init(elev=15, azim=90)
xs = features[:, 0]
ys = features[:, 1]
zs = features[:, 2]
ax.scatter3D(xs, ys, zs, c=class_labels, ec='k')
ax.set_xlabel('feature 0')
ax.set_ylabel('feature 1')
ax.set_zlabel('feature 2')
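We can check this 2D structure numerically. With n_informative=2 and n_redundant=1, the feature matrix should be (numerically) rank 2: one column is a linear combination of the others. A minimal check with numpy (note that make_classification shuffles columns by default, so treating feature 2 as the redundant one below is just an illustration):

import numpy as np

print(np.linalg.matrix_rank(features))  # expect 2: only two linearly independent columns

# Try to express feature 2 as a linear combination of features 0 and 1
coef, residual, rank, _ = np.linalg.lstsq(features[:, :2], features[:, 2], rcond=None)
print(coef)      # the fitted combination weights
print(residual)  # sum of squared residuals; near zero because the data is rank 2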
For higher dimensions, we have to take 2D slices of the data (called "projections" or "subspaces").
plt.figure(figsize=(15,4))
plt.subplot(131)
plt.scatter(features[:, 0], features[:, 1], marker='o', c=class_labels, ec='k')
plt.xlabel('feature 0')
plt.ylabel('feature 1')
plt.subplot(132)
plt.scatter(features[:, 0], features[:, 2], marker='o', c=class_labels, ec='k')
plt.xlabel('feature 0')
plt.ylabel('feature 2')
plt.subplot(133)
plt.scatter(features[:, 1], features[:, 2], marker='o', c=class_labels, ec='k')
plt.xlabel('feature 1')
plt.ylabel('feature 2')
plt.tight_layout()
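Building each panel by hand gets tedious as the number of features grows. As an aside, pandas can draw all pairwise projections at once; a sketch assuming pandas is installed:

import pandas as pd

df = pd.DataFrame(features, columns=['feature 0', 'feature 1', 'feature 2'])
# Every pairwise 2D projection, with per-feature histograms on the diagonal
pd.plotting.scatter_matrix(df, c=class_labels, figsize=(10, 10), marker='o',
                           hist_kwds={'bins': 20}, s=40, alpha=0.8)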
Logistic Regression attempts to fit a sigmoid (S-shaped) function to your data. This shape encodes the assumption that the probability of belonging to class 1 versus class 0 changes monotonically as the feature changes value.
plt.figure(figsize=(15,4))
plt.subplot(131)
plt.scatter(features[:, 0], class_labels, c=class_labels, ec='k')
plt.xlabel('feature 0')
plt.ylabel('class label')
plt.subplot(132)
plt.scatter(features[:, 1], class_labels, c=class_labels, ec='k')
plt.xlabel('feature 1')
plt.ylabel('class label')
plt.subplot(133)
plt.scatter(features[:, 2], class_labels, c=class_labels, ec='k')
plt.xlabel('feature 2')
plt.ylabel('class label')
plt.tight_layout()
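To make that sigmoid visible, we can fit sklearn's LogisticRegression on a single feature and overlay the predicted probability of class 1; a minimal sketch using feature 0 (the fit is one-dimensional purely for visualization):

import numpy as np
from sklearn.linear_model import LogisticRegression

X0 = features[:, [0]]                        # single-feature design matrix
clf = LogisticRegression().fit(X0, class_labels)

grid = np.linspace(X0.min(), X0.max(), 200).reshape(-1, 1)
prob_class1 = clf.predict_proba(grid)[:, 1]  # P(class 1) along the feature axis

plt.scatter(features[:, 0], class_labels, c=class_labels, ec='k')
plt.plot(grid, prob_class1, 'r-', lw=2)      # the fitted sigmoid
plt.xlabel('feature 0')
plt.ylabel('class label / P(class 1)')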