Iris Dataset

In the 1930s, the botanist Edgar Anderson collected data on the morphological variation of three related species of iris. In 1936, the statistician and biologist Ronald Fisher used this data to introduce a model called linear discriminant analysis, which he used to describe how the iris species could be correctly classified from their measured features.

Iris Picture

Iris Dataset

  • 150 samples, each with 4 measurements (sepal length, sepal width, petal length, petal width)
  • 50 samples from each of three iris species (setosa, versicolor, and virginica)
In [94]:
import seaborn
import matplotlib
import matplotlib.pyplot as pyplot
import pandas
%matplotlib inline
seaborn.set(style="white", color_codes=True)
# import load_iris function from datasets module
from sklearn.datasets import load_iris

Import Data

In [95]:
# save "bunch" object containing iris dataset and its attributes
iris = load_iris()
type(iris)
Out[95]:
sklearn.utils.Bunch
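A Bunch behaves like a dictionary whose keys are also exposed as attributes, so `iris.data` and `iris['data']` refer to the same array. A minimal sketch (not part of the original notebook):

```python
from sklearn.datasets import load_iris

iris = load_iris()

# attribute access and key access return the very same object
print(iris.data is iris['data'])

# the main keys used throughout this notebook
print([k for k in ('data', 'target', 'feature_names', 'target_names') if k in iris.keys()])
```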
In [96]:
# print the iris data
print(iris.data)
[[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]
 [ 5.4  3.9  1.7  0.4]
 [ 4.6  3.4  1.4  0.3]
 [ 5.   3.4  1.5  0.2]
 [ 4.4  2.9  1.4  0.2]
 [ 4.9  3.1  1.5  0.1]
 [ 5.4  3.7  1.5  0.2]
 [ 4.8  3.4  1.6  0.2]
 [ 4.8  3.   1.4  0.1]
 [ 4.3  3.   1.1  0.1]
 [ 5.8  4.   1.2  0.2]
 [ 5.7  4.4  1.5  0.4]
 [ 5.4  3.9  1.3  0.4]
 [ 5.1  3.5  1.4  0.3]
 [ 5.7  3.8  1.7  0.3]
 [ 5.1  3.8  1.5  0.3]
 [ 5.4  3.4  1.7  0.2]
 [ 5.1  3.7  1.5  0.4]
 [ 4.6  3.6  1.   0.2]
 [ 5.1  3.3  1.7  0.5]
 [ 4.8  3.4  1.9  0.2]
 [ 5.   3.   1.6  0.2]
 [ 5.   3.4  1.6  0.4]
 [ 5.2  3.5  1.5  0.2]
 [ 5.2  3.4  1.4  0.2]
 [ 4.7  3.2  1.6  0.2]
 [ 4.8  3.1  1.6  0.2]
 [ 5.4  3.4  1.5  0.4]
 [ 5.2  4.1  1.5  0.1]
 [ 5.5  4.2  1.4  0.2]
 [ 4.9  3.1  1.5  0.1]
 [ 5.   3.2  1.2  0.2]
 [ 5.5  3.5  1.3  0.2]
 [ 4.9  3.1  1.5  0.1]
 [ 4.4  3.   1.3  0.2]
 [ 5.1  3.4  1.5  0.2]
 [ 5.   3.5  1.3  0.3]
 [ 4.5  2.3  1.3  0.3]
 [ 4.4  3.2  1.3  0.2]
 [ 5.   3.5  1.6  0.6]
 [ 5.1  3.8  1.9  0.4]
 [ 4.8  3.   1.4  0.3]
 [ 5.1  3.8  1.6  0.2]
 [ 4.6  3.2  1.4  0.2]
 [ 5.3  3.7  1.5  0.2]
 [ 5.   3.3  1.4  0.2]
 [ 7.   3.2  4.7  1.4]
 [ 6.4  3.2  4.5  1.5]
 [ 6.9  3.1  4.9  1.5]
 [ 5.5  2.3  4.   1.3]
 [ 6.5  2.8  4.6  1.5]
 [ 5.7  2.8  4.5  1.3]
 [ 6.3  3.3  4.7  1.6]
 [ 4.9  2.4  3.3  1. ]
 [ 6.6  2.9  4.6  1.3]
 [ 5.2  2.7  3.9  1.4]
 [ 5.   2.   3.5  1. ]
 [ 5.9  3.   4.2  1.5]
 [ 6.   2.2  4.   1. ]
 [ 6.1  2.9  4.7  1.4]
 [ 5.6  2.9  3.6  1.3]
 [ 6.7  3.1  4.4  1.4]
 [ 5.6  3.   4.5  1.5]
 [ 5.8  2.7  4.1  1. ]
 [ 6.2  2.2  4.5  1.5]
 [ 5.6  2.5  3.9  1.1]
 [ 5.9  3.2  4.8  1.8]
 [ 6.1  2.8  4.   1.3]
 [ 6.3  2.5  4.9  1.5]
 [ 6.1  2.8  4.7  1.2]
 [ 6.4  2.9  4.3  1.3]
 [ 6.6  3.   4.4  1.4]
 [ 6.8  2.8  4.8  1.4]
 [ 6.7  3.   5.   1.7]
 [ 6.   2.9  4.5  1.5]
 [ 5.7  2.6  3.5  1. ]
 [ 5.5  2.4  3.8  1.1]
 [ 5.5  2.4  3.7  1. ]
 [ 5.8  2.7  3.9  1.2]
 [ 6.   2.7  5.1  1.6]
 [ 5.4  3.   4.5  1.5]
 [ 6.   3.4  4.5  1.6]
 [ 6.7  3.1  4.7  1.5]
 [ 6.3  2.3  4.4  1.3]
 [ 5.6  3.   4.1  1.3]
 [ 5.5  2.5  4.   1.3]
 [ 5.5  2.6  4.4  1.2]
 [ 6.1  3.   4.6  1.4]
 [ 5.8  2.6  4.   1.2]
 [ 5.   2.3  3.3  1. ]
 [ 5.6  2.7  4.2  1.3]
 [ 5.7  3.   4.2  1.2]
 [ 5.7  2.9  4.2  1.3]
 [ 6.2  2.9  4.3  1.3]
 [ 5.1  2.5  3.   1.1]
 [ 5.7  2.8  4.1  1.3]
 [ 6.3  3.3  6.   2.5]
 [ 5.8  2.7  5.1  1.9]
 [ 7.1  3.   5.9  2.1]
 [ 6.3  2.9  5.6  1.8]
 [ 6.5  3.   5.8  2.2]
 [ 7.6  3.   6.6  2.1]
 [ 4.9  2.5  4.5  1.7]
 [ 7.3  2.9  6.3  1.8]
 [ 6.7  2.5  5.8  1.8]
 [ 7.2  3.6  6.1  2.5]
 [ 6.5  3.2  5.1  2. ]
 [ 6.4  2.7  5.3  1.9]
 [ 6.8  3.   5.5  2.1]
 [ 5.7  2.5  5.   2. ]
 [ 5.8  2.8  5.1  2.4]
 [ 6.4  3.2  5.3  2.3]
 [ 6.5  3.   5.5  1.8]
 [ 7.7  3.8  6.7  2.2]
 [ 7.7  2.6  6.9  2.3]
 [ 6.   2.2  5.   1.5]
 [ 6.9  3.2  5.7  2.3]
 [ 5.6  2.8  4.9  2. ]
 [ 7.7  2.8  6.7  2. ]
 [ 6.3  2.7  4.9  1.8]
 [ 6.7  3.3  5.7  2.1]
 [ 7.2  3.2  6.   1.8]
 [ 6.2  2.8  4.8  1.8]
 [ 6.1  3.   4.9  1.8]
 [ 6.4  2.8  5.6  2.1]
 [ 7.2  3.   5.8  1.6]
 [ 7.4  2.8  6.1  1.9]
 [ 7.9  3.8  6.4  2. ]
 [ 6.4  2.8  5.6  2.2]
 [ 6.3  2.8  5.1  1.5]
 [ 6.1  2.6  5.6  1.4]
 [ 7.7  3.   6.1  2.3]
 [ 6.3  3.4  5.6  2.4]
 [ 6.4  3.1  5.5  1.8]
 [ 6.   3.   4.8  1.8]
 [ 6.9  3.1  5.4  2.1]
 [ 6.7  3.1  5.6  2.4]
 [ 6.9  3.1  5.1  2.3]
 [ 5.8  2.7  5.1  1.9]
 [ 6.8  3.2  5.9  2.3]
 [ 6.7  3.3  5.7  2.5]
 [ 6.7  3.   5.2  2.3]
 [ 6.3  2.5  5.   1.9]
 [ 6.5  3.   5.2  2. ]
 [ 6.2  3.4  5.4  2.3]
 [ 5.9  3.   5.1  1.8]]
In [97]:
# print the names of the four features
print(iris.feature_names)
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
In [98]:
# print integers representing the species of each observation
print(iris.target)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
In [99]:
# print the encoding scheme for species: 0 = setosa, 1 = versicolor, 2 = virginica
print(iris.target_names)
['setosa' 'versicolor' 'virginica']
In [100]:
# check the types of the features and response
print(type(iris.data))
print(type(iris.target))
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
In [101]:
# check the shape of the features (first dimension = number of observations, second dimension = number of features)
print(iris.data.shape)
(150, 4)
In [102]:
# check the shape of the response (single dimension matching the number of observations)
print(iris.target.shape)
(150,)
In [103]:
# store feature matrix in "X"
X = iris.data

# store response vector in "y"
y = iris.target
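With `X` and `y` in hand, a quick sanity check is possible with a single train/test split before the fuller cross-validated comparison. This sketch (not in the original notebook; the split parameters are arbitrary choices) holds out 25% of the data and scores a 5-nearest-neighbors baseline:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# stratify keeps the 50/50/50 class balance in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))  # typically well above 0.9 on this dataset
```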

Convert to dataframe

In [104]:
# Create a dataframe from iris data (useful for plotting)
iris_structure = {iris.feature_names[0]: iris.data[:, 0],
                  iris.feature_names[1]: iris.data[:, 1],
                  iris.feature_names[2]: iris.data[:, 2],
                  iris.feature_names[3]: iris.data[:, 3],
                  'Species': iris.target_names[iris.target]}
iris_df = pandas.DataFrame(iris_structure)
iris_df
Out[104]:
Species petal length (cm) petal width (cm) sepal length (cm) sepal width (cm)
0 setosa 1.4 0.2 5.1 3.5
1 setosa 1.4 0.2 4.9 3.0
2 setosa 1.3 0.2 4.7 3.2
3 setosa 1.5 0.2 4.6 3.1
4 setosa 1.4 0.2 5.0 3.6
5 setosa 1.7 0.4 5.4 3.9
6 setosa 1.4 0.3 4.6 3.4
7 setosa 1.5 0.2 5.0 3.4
8 setosa 1.4 0.2 4.4 2.9
9 setosa 1.5 0.1 4.9 3.1
10 setosa 1.5 0.2 5.4 3.7
11 setosa 1.6 0.2 4.8 3.4
12 setosa 1.4 0.1 4.8 3.0
13 setosa 1.1 0.1 4.3 3.0
14 setosa 1.2 0.2 5.8 4.0
15 setosa 1.5 0.4 5.7 4.4
16 setosa 1.3 0.4 5.4 3.9
17 setosa 1.4 0.3 5.1 3.5
18 setosa 1.7 0.3 5.7 3.8
19 setosa 1.5 0.3 5.1 3.8
20 setosa 1.7 0.2 5.4 3.4
21 setosa 1.5 0.4 5.1 3.7
22 setosa 1.0 0.2 4.6 3.6
23 setosa 1.7 0.5 5.1 3.3
24 setosa 1.9 0.2 4.8 3.4
25 setosa 1.6 0.2 5.0 3.0
26 setosa 1.6 0.4 5.0 3.4
27 setosa 1.5 0.2 5.2 3.5
28 setosa 1.4 0.2 5.2 3.4
29 setosa 1.6 0.2 4.7 3.2
... ... ... ... ... ...
120 virginica 5.7 2.3 6.9 3.2
121 virginica 4.9 2.0 5.6 2.8
122 virginica 6.7 2.0 7.7 2.8
123 virginica 4.9 1.8 6.3 2.7
124 virginica 5.7 2.1 6.7 3.3
125 virginica 6.0 1.8 7.2 3.2
126 virginica 4.8 1.8 6.2 2.8
127 virginica 4.9 1.8 6.1 3.0
128 virginica 5.6 2.1 6.4 2.8
129 virginica 5.8 1.6 7.2 3.0
130 virginica 6.1 1.9 7.4 2.8
131 virginica 6.4 2.0 7.9 3.8
132 virginica 5.6 2.2 6.4 2.8
133 virginica 5.1 1.5 6.3 2.8
134 virginica 5.6 1.4 6.1 2.6
135 virginica 6.1 2.3 7.7 3.0
136 virginica 5.6 2.4 6.3 3.4
137 virginica 5.5 1.8 6.4 3.1
138 virginica 4.8 1.8 6.0 3.0
139 virginica 5.4 2.1 6.9 3.1
140 virginica 5.6 2.4 6.7 3.1
141 virginica 5.1 2.3 6.9 3.1
142 virginica 5.1 1.9 5.8 2.7
143 virginica 5.9 2.3 6.8 3.2
144 virginica 5.7 2.5 6.7 3.3
145 virginica 5.2 2.3 6.7 3.0
146 virginica 5.0 1.9 6.3 2.5
147 virginica 5.2 2.0 6.5 3.0
148 virginica 5.4 2.3 6.2 3.4
149 virginica 5.1 1.8 5.9 3.0

150 rows × 5 columns
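One immediate payoff of the DataFrame form is that per-species summaries are a single `groupby` away. A minimal sketch (self-contained, rebuilding the DataFrame with a simpler constructor than the one above):

```python
import pandas
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pandas.DataFrame(iris.data, columns=iris.feature_names)
iris_df['Species'] = iris.target_names[iris.target]

# mean of each measurement, per species
means = iris_df.groupby('Species').mean()
print(means)
```

The means already hint at the structure the plots below reveal: petal measurements grow steadily from setosa to versicolor to virginica.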

Exploratory Data Analysis

In [105]:
# One piece of information missing in the plots above is what species each plant is
# We'll use seaborn's FacetGrid to color the scatterplot by species
seaborn.FacetGrid(iris_df, hue="Species", size=5) \
   .map(pyplot.scatter, "sepal length (cm)", "sepal width (cm)") \
   .add_legend()
Out[105]:
<seaborn.axisgrid.FacetGrid at 0x1bf27718dd8>
In [106]:
iris.target_names
Out[106]:
array(['setosa', 'versicolor', 'virginica'], 
      dtype='<U10')
In [107]:
seaborn.pairplot( iris_df, hue="Species", size=3, diag_kind="kde")
Out[107]:
<seaborn.axisgrid.PairGrid at 0x1bf288d4748>

From the data visualization, the setosa species looks linearly separable from versicolor and virginica. However, versicolor and virginica do not appear to be linearly separable.
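The separability of setosa is easy to verify numerically: a single threshold on petal length splits setosa from the other two species perfectly. A quick sketch (the 2.5 cm cutoff is our own choice, picked from the gap visible in the pairplot):

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

petal_length = X[:, 2]  # third column = petal length (cm)

# every setosa petal is shorter than every non-setosa petal
print(petal_length[y == 0].max())   # 1.9
print(petal_length[y != 0].min())   # 3.0
print(((petal_length < 2.5) == (y == 0)).all())  # True
```

No such single threshold exists for versicolor versus virginica, which is why the models below still make a few errors on that boundary.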

Models and Model Comparison

In [108]:
from sklearn import model_selection
from sklearn.model_selection import RepeatedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.neural_network import MLPClassifier
In [114]:
# prepare configuration for cross validation test harness
seed = 1
# prepare models
models = []
models.append(('LR',    'Logistic Regression', LogisticRegression()))
models.append(('LDA',   'Linear Discriminant Analysis', LinearDiscriminantAnalysis()))
models.append(('QDA',   'Quadratic Discriminant Analysis', QuadraticDiscriminantAnalysis()))
models.append(('KNN1',  'K-Nearest Neighbors (K=1)', KNeighborsClassifier(n_neighbors=1)))
models.append(('KNN3',  'K-Nearest Neighbors (K=3)', KNeighborsClassifier(n_neighbors=3)))
models.append(('KNN5',  'K-Nearest Neighbors (K=5)', KNeighborsClassifier(n_neighbors=5)))
models.append(('GP',    'Gaussian Process Classifier', GaussianProcessClassifier()))
models.append(('CART',  'Decision Tree Classifier', DecisionTreeClassifier()))
models.append(('RF',    'Random Forest', RandomForestClassifier(n_estimators=10, max_features=2)))
models.append(('GB',    'AdaBoost Gradient Boosting', AdaBoostClassifier()))
models.append(('NB',    'Naive Bayes', GaussianNB()))
models.append(('SVM_L', 'Support Vector Machine (Linear Kernel)', SVC(kernel='linear')))
models.append(('SVM_R', 'Support Vector Machine (Radial Kernel)', SVC(gamma=2,C=1)))
models.append(('NN',    'Multilayer Perceptron', MLPClassifier(alpha=1,learning_rate_init=0.01)))
# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
print( "%40s  %8s (%8s)" % ("Model", "Accuracy", "Std Dev") )
for shortname, longname, model in models:
	kfold = model_selection.RepeatedKFold(n_splits=10, n_repeats=3, random_state=seed)
	cv_results = model_selection.cross_val_score(model, X, y, cv=kfold, scoring=scoring)
	results.append(cv_results)
	names.append(shortname)
	msg = "%40s: %f (%f)" % (longname, cv_results.mean(), cv_results.std())
	print(msg)
# boxplot algorithm comparison
font = {'family' : 'DejaVu Sans',
        'weight' : 'bold',
        'size'   : 22}
matplotlib.rc('font', **font)
fig = pyplot.figure(figsize=(12,10))
fig.suptitle('Model Accuracy (10-fold cross-validation, 3 repeats)')
ax = fig.add_subplot(111)
pyplot.boxplot(results)
ax.set_xticklabels(names)
pyplot.show()
                                   Model  Accuracy ( Std Dev)
                     Logistic Regression: 0.953333 (0.054840)
            Linear Discriminant Analysis: 0.977778 (0.039752)
         Quadratic Discriminant Analysis: 0.968889 (0.056394)
               K-Nearest Neighbors (K=1): 0.957778 (0.053008)
               K-Nearest Neighbors (K=3): 0.962222 (0.050723)
               K-Nearest Neighbors (K=5): 0.964444 (0.047868)
             Gaussian Process Classifier: 0.953333 (0.052068)
                Decision Tree Classifier: 0.942222 (0.063790)
                           Random Forest: 0.946667 (0.060614)
              AdaBoost Gradient Boosting: 0.935556 (0.060817)
                             Naive Bayes: 0.953333 (0.049141)
  Support Vector Machine (Linear Kernel): 0.973333 (0.040734)
  Support Vector Machine (Radial Kernel): 0.960000 (0.050479)
C:\Users\nk_wh\Anaconda3\lib\site-packages\sklearn\neural_network\multilayer_perceptron.py:564: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
  % self.max_iter, ConvergenceWarning)
                   Multilayer Perceptron: 0.964444 (0.076465)

Perhaps not surprisingly, given that Fisher introduced linear discriminant analysis on this very dataset, the LDA model performed best: it achieved the highest mean cross-validated accuracy with one of the smallest standard deviations, and it is far more interpretable than the multilayer perceptron.
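To see where LDA's few remaining errors fall, its out-of-fold predictions can be tallied in a confusion matrix. A sketch (not part of the original comparison; it uses plain 10-fold CV rather than the repeated folds above):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

iris = load_iris()
X, y = iris.data, iris.target

# one out-of-fold prediction per sample under 10-fold CV
y_pred = cross_val_predict(LinearDiscriminantAnalysis(), X, y, cv=10)
cm = confusion_matrix(y, y_pred)
print(cm)
```

The misclassifications sit in the versicolor/virginica cells, consistent with the pairplot: those two species overlap, while setosa is classified perfectly.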