{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Using PCA" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Goals\n", "\n", "After completing this notebook, you will be able to:\n", "1. Standardize data with `scikit-learn`\n", "2. Perform Principal Component Analysis (PCA) on data\n", "3. Evaluate the influence of different principal components by seeing how much variance they explain\n", "4. Transform data into lower dimensions using PCA\n", "5. Use KernelPCA to separate nonlinearly separable data" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## 0. Our Imports" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Standard Imports\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "\n", "# PCA Imports\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.decomposition import PCA\n", "from sklearn.decomposition import KernelPCA\n", "\n", "# Import for 3d plotting\n", "from mpl_toolkits import mplot3d\n", "\n", "# For making nonlinear data\n", "from sklearn.datasets import make_circles\n", "\n", "%matplotlib inline" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Getting Example Data\n", "\n", "Today we'll be looking at sklearn's breast cancer identification dataset. You could get this directly from sklearn with `from sklearn.datasets import load_breast_cancer`, but it's good to practice reading in data, so we'll do it by hand. There are two files you'll need for this data:\n", "\n", "- `cancer.csv` contains the cell measurements\n", "- `target.csv` records whether each cell is malignant (1) or benign (0).\n", "\n", "cancer.csv link here\n", "\n", "target.csv link here\n", "\n", "✎ Do This - Read in these files as separate DataFrames with `pd.read_csv()` and print their `head()`s." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n",
"| | Unnamed: 0 | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | 0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | ... | 25.38 | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 |\n",
"| 1 | 1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | ... | 24.99 | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 |\n",
"| 2 | 2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | ... | 23.57 | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 |\n",
"| 3 | 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | 0.2597 | ... | 14.91 | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 |\n",
"| 4 | 4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | 0.1809 | ... | 22.54 | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 |\n",
"5 rows × 31 columns
\n", "\n",
"| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 |\n",
"| 1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 |\n",
"| 2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 |\n",
"| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | 0.2597 | 0.09744 | ... | 14.91 | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 |\n",
"| 4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 |\n",
"5 rows × 30 columns
\n", "\n",
"| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | 1.097064 | -2.073335 | 1.269934 | 0.984375 | 1.568466 | 3.283515 | 2.652874 | 2.532475 | 2.217515 | 2.255747 | ... | 1.886690 | -1.359293 | 2.303601 | 2.001237 | 1.307686 | 2.616665 | 2.109526 | 2.296076 | 2.750622 | 1.937015 |\n",
"| 1 | 1.829821 | -0.353632 | 1.685955 | 1.908708 | -0.826962 | -0.487072 | -0.023846 | 0.548144 | 0.001392 | -0.868652 | ... | 1.805927 | -0.369203 | 1.535126 | 1.890489 | -0.375612 | -0.430444 | -0.146749 | 1.087084 | -0.243890 | 0.281190 |\n",
"| 2 | 1.579888 | 0.456187 | 1.566503 | 1.558884 | 0.942210 | 1.052926 | 1.363478 | 2.037231 | 0.939685 | -0.398008 | ... | 1.511870 | -0.023974 | 1.347475 | 1.456285 | 0.527407 | 1.082932 | 0.854974 | 1.955000 | 1.152255 | 0.201391 |\n",
"| 3 | -0.768909 | 0.253732 | -0.592687 | -0.764464 | 3.283553 | 3.402909 | 1.915897 | 1.451707 | 2.867383 | 4.910919 | ... | -0.281464 | 0.133984 | -0.249939 | -0.550021 | 3.394275 | 3.893397 | 1.989588 | 2.175786 | 6.046041 | 4.935010 |\n",
"| 4 | 1.750297 | -1.151816 | 1.776573 | 1.826229 | 0.280372 | 0.539340 | 1.371011 | 1.428493 | -0.009560 | -0.562450 | ... | 1.298575 | -1.466770 | 1.338539 | 1.220724 | 0.220556 | -0.313395 | 0.613179 | 0.729259 | -0.868353 | -0.397100 |\n",
"5 rows × 30 columns
\n", "\n",
"| | Principal Component 1 | Principal Component 2 |\n",
"|---|---|---|\n",
"| 0 | 9.192837 | 1.948583 |\n",
"| 1 | 2.387802 | -3.768172 |\n",
"| 2 | 5.733896 | -1.075174 |\n",
"| 3 | 7.122953 | 10.275589 |\n",
"| 4 | 3.935302 | -1.948072 |\n",
"