Iris Dataset

In the 1930s, the botanist Edgar Anderson collected data on the morphological variation of three related species of iris. In 1936, the statistician and biologist Ronald Fisher used this data to introduce a model called linear discriminant analysis, which he used to describe how the iris species could be correctly classified from their measured features.

Iris Picture

Iris Dataset

  • 150 samples, each with 4 measurements (sepal length, sepal width, petal length, petal width)
  • 50 samples from each of three iris species (setosa, versicolor, and virginica)
In [94]:
import seaborn
import matplotlib
import matplotlib.pyplot as pyplot
import pandas
%matplotlib inline
seaborn.set(style="white", color_codes=True)
# import load_iris function from datasets module
from sklearn.datasets import load_iris

Import Data

In [95]:
# save "bunch" object containing iris dataset and its attributes
iris = load_iris()
type(iris)
Out[95]:
sklearn.utils.Bunch
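A Bunch behaves like a dictionary whose keys are also exposed as attributes, so `iris.data` and `iris['data']` refer to the same array. A minimal sketch (not part of the original notebook):

```python
from sklearn.datasets import load_iris

iris = load_iris()

# attribute access and key access return the very same object
print(iris.data is iris['data'])

# the main keys used throughout this notebook
print([k for k in ('data', 'target', 'feature_names', 'target_names') if k in iris.keys()])
```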
In [96]:
# print the iris data
print(iris.data)
[[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]
 [ 5.4  3.9  1.7  0.4]
 [ 4.6  3.4  1.4  0.3]
 [ 5.   3.4  1.5  0.2]
 [ 4.4  2.9  1.4  0.2]
 [ 4.9  3.1  1.5  0.1]
 [ 5.4  3.7  1.5  0.2]
 [ 4.8  3.4  1.6  0.2]
 [ 4.8  3.   1.4  0.1]
 [ 4.3  3.   1.1  0.1]
 [ 5.8  4.   1.2  0.2]
 [ 5.7  4.4  1.5  0.4]
 [ 5.4  3.9  1.3  0.4]
 [ 5.1  3.5  1.4  0.3]
 [ 5.7  3.8  1.7  0.3]
 [ 5.1  3.8  1.5  0.3]
 [ 5.4  3.4  1.7  0.2]
 [ 5.1  3.7  1.5  0.4]
 [ 4.6  3.6  1.   0.2]
 [ 5.1  3.3  1.7  0.5]
 [ 4.8  3.4  1.9  0.2]
 [ 5.   3.   1.6  0.2]
 [ 5.   3.4  1.6  0.4]
 [ 5.2  3.5  1.5  0.2]
 [ 5.2  3.4  1.4  0.2]
 [ 4.7  3.2  1.6  0.2]
 [ 4.8  3.1  1.6  0.2]
 [ 5.4  3.4  1.5  0.4]
 [ 5.2  4.1  1.5  0.1]
 [ 5.5  4.2  1.4  0.2]
 [ 4.9  3.1  1.5  0.1]
 [ 5.   3.2  1.2  0.2]
 [ 5.5  3.5  1.3  0.2]
 [ 4.9  3.1  1.5  0.1]
 [ 4.4  3.   1.3  0.2]
 [ 5.1  3.4  1.5  0.2]
 [ 5.   3.5  1.3  0.3]
 [ 4.5  2.3  1.3  0.3]
 [ 4.4  3.2  1.3  0.2]
 [ 5.   3.5  1.6  0.6]
 [ 5.1  3.8  1.9  0.4]
 [ 4.8  3.   1.4  0.3]
 [ 5.1  3.8  1.6  0.2]
 [ 4.6  3.2  1.4  0.2]
 [ 5.3  3.7  1.5  0.2]
 [ 5.   3.3  1.4  0.2]
 [ 7.   3.2  4.7  1.4]
 [ 6.4  3.2  4.5  1.5]
 [ 6.9  3.1  4.9  1.5]
 [ 5.5  2.3  4.   1.3]
 [ 6.5  2.8  4.6  1.5]
 [ 5.7  2.8  4.5  1.3]
 [ 6.3  3.3  4.7  1.6]
 [ 4.9  2.4  3.3  1. ]
 [ 6.6  2.9  4.6  1.3]
 [ 5.2  2.7  3.9  1.4]
 [ 5.   2.   3.5  1. ]
 [ 5.9  3.   4.2  1.5]
 [ 6.   2.2  4.   1. ]
 [ 6.1  2.9  4.7  1.4]
 [ 5.6  2.9  3.6  1.3]
 [ 6.7  3.1  4.4  1.4]
 [ 5.6  3.   4.5  1.5]
 [ 5.8  2.7  4.1  1. ]
 [ 6.2  2.2  4.5  1.5]
 [ 5.6  2.5  3.9  1.1]
 [ 5.9  3.2  4.8  1.8]
 [ 6.1  2.8  4.   1.3]
 [ 6.3  2.5  4.9  1.5]
 [ 6.1  2.8  4.7  1.2]
 [ 6.4  2.9  4.3  1.3]
 [ 6.6  3.   4.4  1.4]
 [ 6.8  2.8  4.8  1.4]
 [ 6.7  3.   5.   1.7]
 [ 6.   2.9  4.5  1.5]
 [ 5.7  2.6  3.5  1. ]
 [ 5.5  2.4  3.8  1.1]
 [ 5.5  2.4  3.7  1. ]
 [ 5.8  2.7  3.9  1.2]
 [ 6.   2.7  5.1  1.6]
 [ 5.4  3.   4.5  1.5]
 [ 6.   3.4  4.5  1.6]
 [ 6.7  3.1  4.7  1.5]
 [ 6.3  2.3  4.4  1.3]
 [ 5.6  3.   4.1  1.3]
 [ 5.5  2.5  4.   1.3]
 [ 5.5  2.6  4.4  1.2]
 [ 6.1  3.   4.6  1.4]
 [ 5.8  2.6  4.   1.2]
 [ 5.   2.3  3.3  1. ]
 [ 5.6  2.7  4.2  1.3]
 [ 5.7  3.   4.2  1.2]
 [ 5.7  2.9  4.2  1.3]
 [ 6.2  2.9  4.3  1.3]
 [ 5.1  2.5  3.   1.1]
 [ 5.7  2.8  4.1  1.3]
 [ 6.3  3.3  6.   2.5]
 [ 5.8  2.7  5.1  1.9]
 [ 7.1  3.   5.9  2.1]
 [ 6.3  2.9  5.6  1.8]
 [ 6.5  3.   5.8  2.2]
 [ 7.6  3.   6.6  2.1]
 [ 4.9  2.5  4.5  1.7]
 [ 7.3  2.9  6.3  1.8]
 [ 6.7  2.5  5.8  1.8]
 [ 7.2  3.6  6.1  2.5]
 [ 6.5  3.2  5.1  2. ]
 [ 6.4  2.7  5.3  1.9]
 [ 6.8  3.   5.5  2.1]
 [ 5.7  2.5  5.   2. ]
 [ 5.8  2.8  5.1  2.4]
 [ 6.4  3.2  5.3  2.3]
 [ 6.5  3.   5.5  1.8]
 [ 7.7  3.8  6.7  2.2]
 [ 7.7  2.6  6.9  2.3]
 [ 6.   2.2  5.   1.5]
 [ 6.9  3.2  5.7  2.3]
 [ 5.6  2.8  4.9  2. ]
 [ 7.7  2.8  6.7  2. ]
 [ 6.3  2.7  4.9  1.8]
 [ 6.7  3.3  5.7  2.1]
 [ 7.2  3.2  6.   1.8]
 [ 6.2  2.8  4.8  1.8]
 [ 6.1  3.   4.9  1.8]
 [ 6.4  2.8  5.6  2.1]
 [ 7.2  3.   5.8  1.6]
 [ 7.4  2.8  6.1  1.9]
 [ 7.9  3.8  6.4  2. ]
 [ 6.4  2.8  5.6  2.2]
 [ 6.3  2.8  5.1  1.5]
 [ 6.1  2.6  5.6  1.4]
 [ 7.7  3.   6.1  2.3]
 [ 6.3  3.4  5.6  2.4]
 [ 6.4  3.1  5.5  1.8]
 [ 6.   3.   4.8  1.8]
 [ 6.9  3.1  5.4  2.1]
 [ 6.7  3.1  5.6  2.4]
 [ 6.9  3.1  5.1  2.3]
 [ 5.8  2.7  5.1  1.9]
 [ 6.8  3.2  5.9  2.3]
 [ 6.7  3.3  5.7  2.5]
 [ 6.7  3.   5.2  2.3]
 [ 6.3  2.5  5.   1.9]
 [ 6.5  3.   5.2  2. ]
 [ 6.2  3.4  5.4  2.3]
 [ 5.9  3.   5.1  1.8]]
In [97]:
# print the names of the four features
print(iris.feature_names)
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
In [98]:
# print integers representing the species of each observation
print(iris.target)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
In [99]:
# print the encoding scheme for species: 0 = setosa, 1 = versicolor, 2 = virginica
print(iris.target_names)
['setosa' 'versicolor' 'virginica']
In [100]:
# check the types of the features and response
print(type(iris.data))
print(type(iris.target))
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
In [101]:
# check the shape of the features (first dimension = number of observations, second dimension = number of features)
print(iris.data.shape)
(150, 4)
In [102]:
# check the shape of the response (single dimension matching the number of observations)
print(iris.target.shape)
(150,)
In [103]:
# store feature matrix in "X"
X = iris.data

# store response vector in "y"
y = iris.target
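With `X` and `y` in hand, a quick sanity check is possible with a single train/test split before the fuller cross-validated comparison. This sketch (not in the original notebook; the split parameters are arbitrary choices) holds out 25% of the data and scores a 5-nearest-neighbors baseline:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# stratify keeps the 50/50/50 class balance in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))  # typically well above 0.9 on this dataset
```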

Convert to dataframe

In [104]:
# Create a dataframe from iris data (useful for plotting)
iris_structure = {iris.feature_names[0]: iris.data[:, 0],
                  iris.feature_names[1]: iris.data[:, 1],
                  iris.feature_names[2]: iris.data[:, 2],
                  iris.feature_names[3]: iris.data[:, 3],
                  'Species': iris.target_names[iris.target]}
iris_df = pandas.DataFrame(iris_structure)
iris_df
Out[104]:
Species petal length (cm) petal width (cm) sepal length (cm) sepal width (cm)
0 setosa 1.4 0.2 5.1 3.5
1 setosa 1.4 0.2 4.9 3.0
2 setosa 1.3 0.2 4.7 3.2
3 setosa 1.5 0.2 4.6 3.1
4 setosa 1.4 0.2 5.0 3.6
5 setosa 1.7 0.4 5.4 3.9
6 setosa 1.4 0.3 4.6 3.4
7 setosa 1.5 0.2 5.0 3.4
8 setosa 1.4 0.2 4.4 2.9
9 setosa 1.5 0.1 4.9 3.1
10 setosa 1.5 0.2 5.4 3.7
11 setosa 1.6 0.2 4.8 3.4
12 setosa 1.4 0.1 4.8 3.0
13 setosa 1.1 0.1 4.3 3.0
14 setosa 1.2 0.2 5.8 4.0
15 setosa 1.5 0.4 5.7 4.4
16 setosa 1.3 0.4 5.4 3.9
17 setosa 1.4 0.3 5.1 3.5
18 setosa 1.7 0.3 5.7 3.8
19 setosa 1.5 0.3 5.1 3.8
20 setosa 1.7 0.2 5.4 3.4
21 setosa 1.5 0.4 5.1 3.7
22 setosa 1.0 0.2 4.6 3.6
23 setosa 1.7 0.5 5.1 3.3
24 setosa 1.9 0.2 4.8 3.4
25 setosa 1.6 0.2 5.0 3.0
26 setosa 1.6 0.4 5.0 3.4
27 setosa 1.5 0.2 5.2 3.5
28 setosa 1.4 0.2 5.2 3.4
29 setosa 1.6 0.2 4.7 3.2
... ... ... ... ... ...
120 virginica 5.7 2.3 6.9 3.2
121 virginica 4.9 2.0 5.6 2.8
122 virginica 6.7 2.0 7.7 2.8
123 virginica 4.9 1.8 6.3 2.7
124 virginica 5.7 2.1 6.7 3.3
125 virginica 6.0 1.8 7.2 3.2
126 virginica 4.8 1.8 6.2 2.8
127 virginica 4.9 1.8 6.1 3.0
128 virginica 5.6 2.1 6.4 2.8
129 virginica 5.8 1.6 7.2 3.0
130 virginica 6.1 1.9 7.4 2.8
131 virginica 6.4 2.0 7.9 3.8
132 virginica 5.6 2.2 6.4 2.8
133 virginica 5.1 1.5 6.3 2.8
134 virginica 5.6 1.4 6.1 2.6
135 virginica 6.1 2.3 7.7 3.0
136 virginica 5.6 2.4 6.3 3.4
137 virginica 5.5 1.8 6.4 3.1
138 virginica 4.8 1.8 6.0 3.0
139 virginica 5.4 2.1 6.9 3.1
140 virginica 5.6 2.4 6.7 3.1
141 virginica 5.1 2.3 6.9 3.1
142 virginica 5.1 1.9 5.8 2.7
143 virginica 5.9 2.3 6.8 3.2
144 virginica 5.7 2.5 6.7 3.3
145 virginica 5.2 2.3 6.7 3.0
146 virginica 5.0 1.9 6.3 2.5
147 virginica 5.2 2.0 6.5 3.0
148 virginica 5.4 2.3 6.2 3.4
149 virginica 5.1 1.8 5.9 3.0

150 rows × 5 columns
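One immediate payoff of the DataFrame form is that per-species summaries are a single `groupby` away. A minimal sketch (self-contained, rebuilding the DataFrame with a simpler constructor than the one above):

```python
import pandas
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pandas.DataFrame(iris.data, columns=iris.feature_names)
iris_df['Species'] = iris.target_names[iris.target]

# mean of each measurement, per species
means = iris_df.groupby('Species').mean()
print(means)
```

The means already hint at the structure the plots below reveal: petal measurements grow steadily from setosa to versicolor to virginica.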

Exploratory Data Analysis

In [105]:
# One piece of information missing in the plots above is what species each plant is
# We'll use seaborn's FacetGrid to color the scatterplot by species
seaborn.FacetGrid(iris_df, hue="Species", size=5) \
   .map(pyplot.scatter, "sepal length (cm)", "sepal width (cm)") \
   .add_legend()
Out[105]:
<seaborn.axisgrid.FacetGrid at 0x1bf27718dd8>
In [106]:
iris.target_names
Out[106]:
array(['setosa', 'versicolor', 'virginica'], 
      dtype='<U10')
In [107]:
seaborn.pairplot( iris_df, hue="Species", size=3, diag_kind="kde")
Out[107]:
<seaborn.axisgrid.PairGrid at 0x1bf288d4748>

From the data visualization, the setosa species looks linearly separable from versicolor and virginica. However, versicolor and virginica do not appear to be linearly separable.
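The separability of setosa is easy to verify numerically: a single threshold on petal length splits setosa from the other two species perfectly. A quick sketch (the 2.5 cm cutoff is our own choice, picked from the gap visible in the pairplot):

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

petal_length = X[:, 2]  # third column = petal length (cm)

# every setosa petal is shorter than every non-setosa petal
print(petal_length[y == 0].max())   # 1.9
print(petal_length[y != 0].min())   # 3.0
print(((petal_length < 2.5) == (y == 0)).all())  # True
```

No such single threshold exists for versicolor versus virginica, which is why the models below still make a few errors on that boundary.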

Models and Model Comparison

In [108]:
from sklearn import model_selection
from sklearn.model_selection import RepeatedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.neural_network import MLPClassifier
In [114]:
# prepare configuration for cross validation test harness
seed = 1
# prepare models
models = []
models.append(('LR',    'Logistic Regression', LogisticRegression()))
models.append(('LDA',   'Linear Discriminant Analysis', LinearDiscriminantAnalysis()))
models.append(('QDA',   'Quadratic Discriminant Analysis', QuadraticDiscriminantAnalysis()))
models.append(('KNN1',  'K-Nearest Neighbors (K=1)', KNeighborsClassifier(n_neighbors=1)))
models.append(('KNN3',  'K-Nearest Neighbors (K=3)', KNeighborsClassifier(n_neighbors=3)))
models.append(('KNN5',  'K-Nearest Neighbors (K=5)', KNeighborsClassifier(n_neighbors=5)))
models.append(('GP',    'Gaussian Process Classifier', GaussianProcessClassifier()))
models.append(('CART',  'Decision Tree Classifier', DecisionTreeClassifier()))
models.append(('RF',    'Random Forest', RandomForestClassifier(n_estimators=10, max_features=2)))
models.append(('GB',    'AdaBoost Gradient Boosting', AdaBoostClassifier()))
models.append(('NB',    'Naive Bayes', GaussianNB()))
models.append(('SVM_L', 'Support Vector Machine (Linear Kernel)', SVC(kernel='linear')))
models.append(('SVM_R', 'Support Vector Machine (Radial Kernel)', SVC(gamma=2,C=1)))
models.append(('NN',    'Multilayer Perceptron', MLPClassifier(alpha=1,learning_rate_init=0.01)))
# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
print( "%40s  %8s (%8s)" % ("Model", "Accuracy", "Std Dev") )
for shortname, longname, model in models:
	kfold = model_selection.RepeatedKFold(n_splits=10, n_repeats=3, random_state=seed)
	cv_results = model_selection.cross_val_score(model, X, y, cv=kfold, scoring=scoring)
	results.append(cv_results)
	names.append(shortname)
	msg = "%40s: %f (%f)" % (longname, cv_results.mean(), cv_results.std())
	print(msg)
# boxplot algorithm comparison
font = {'family' : 'DejaVu Sans',
        'weight' : 'bold',
        'size'   : 22}
matplotlib.rc('font', **font)
fig = pyplot.figure(figsize=(12,10))
fig.suptitle('Model Accuracy (10-fold cross-validation, 3 repeats)')
ax = fig.add_subplot(111)
pyplot.boxplot(results)
ax.set_xticklabels(names)
pyplot.show()
                                   Model  Accuracy ( Std Dev)
                     Logistic Regression: 0.953333 (0.054840)
            Linear Discriminant Analysis: 0.977778 (0.039752)
         Quadratic Discriminant Analysis: 0.968889 (0.056394)
               K-Nearest Neighbors (K=1): 0.957778 (0.053008)
               K-Nearest Neighbors (K=3): 0.962222 (0.050723)
               K-Nearest Neighbors (K=5): 0.964444 (0.047868)
             Gaussian Process Classifier: 0.953333 (0.052068)
                Decision Tree Classifier: 0.942222 (0.063790)
                           Random Forest: 0.946667 (0.060614)
              AdaBoost Gradient Boosting: 0.935556 (0.060817)
                             Naive Bayes: 0.953333 (0.049141)
  Support Vector Machine (Linear Kernel): 0.973333 (0.040734)
  Support Vector Machine (Radial Kernel): 0.960000 (0.050479)
C:\Users\nk_wh\Anaconda3\lib\site-packages\sklearn\neural_network\multilayer_perceptron.py:564: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
  % self.max_iter, ConvergenceWarning)
                   Multilayer Perceptron: 0.964444 (0.076465)

Perhaps not surprisingly, given that Fisher introduced linear discriminant analysis on this very dataset, the LDA model performed best: it achieved the highest mean cross-validated accuracy with one of the smallest standard deviations, and it is far more interpretable than the multilayer perceptron.
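To see where LDA's few remaining errors fall, its out-of-fold predictions can be tallied in a confusion matrix. A sketch (not part of the original comparison; it uses plain 10-fold CV rather than the repeated folds above):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

iris = load_iris()
X, y = iris.data, iris.target

# one out-of-fold prediction per sample under 10-fold CV
y_pred = cross_val_predict(LinearDiscriminantAnalysis(), X, y, cv=10)
cm = confusion_matrix(y, y_pred)
print(cm)
```

The misclassifications sit in the versicolor/virginica cells, consistent with the pairplot: those two species overlap, while setosa is classified perfectly.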