import pandas as pd
= "https://raw.githubusercontent.com/PhilChodrow/ml-notes/main/data/palmer-penguins/train.csv"
train_url = pd.read_csv(train_url) train
Classifying Palmer Penguins
Abstract
In this blog post, I analyze the Palmer Penguins dataset to identify the best features for determining penguin species based on their measurements. First, I create two figures and a table to explore the relationships between different features. Next, I utilize Scikit-learn’s feature selection methods, employing chi-squared tests to select two numerical features and one categorical feature. With these features, I train and test a logistic regression model. The model demonstrates reasonable accuracy; however, to gain a better understanding of the results, I visualize the decision regions and present a confusion matrix.
train.head()
 | studyName | Sample Number | Species | Region | Island | Stage | Individual ID | Clutch Completion | Date Egg | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Sex | Delta 15 N (o/oo) | Delta 13 C (o/oo) | Comments |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | PAL0809 | 31 | Chinstrap penguin (Pygoscelis antarctica) | Anvers | Dream | Adult, 1 Egg Stage | N63A1 | Yes | 11/24/08 | 40.9 | 16.6 | 187.0 | 3200.0 | FEMALE | 9.08458 | -24.54903 | NaN |
1 | PAL0809 | 41 | Chinstrap penguin (Pygoscelis antarctica) | Anvers | Dream | Adult, 1 Egg Stage | N74A1 | Yes | 11/24/08 | 49.0 | 19.5 | 210.0 | 3950.0 | MALE | 9.53262 | -24.66867 | NaN |
2 | PAL0708 | 4 | Gentoo penguin (Pygoscelis papua) | Anvers | Biscoe | Adult, 1 Egg Stage | N32A2 | Yes | 11/27/07 | 50.0 | 15.2 | 218.0 | 5700.0 | MALE | 8.25540 | -25.40075 | NaN |
3 | PAL0708 | 15 | Gentoo penguin (Pygoscelis papua) | Anvers | Biscoe | Adult, 1 Egg Stage | N38A1 | Yes | 12/3/07 | 45.8 | 14.6 | 210.0 | 4200.0 | FEMALE | 7.79958 | -25.62618 | NaN |
4 | PAL0809 | 34 | Chinstrap penguin (Pygoscelis antarctica) | Anvers | Dream | Adult, 1 Egg Stage | N65A2 | Yes | 11/24/08 | 51.0 | 18.8 | 203.0 | 4100.0 | MALE | 9.23196 | -24.17282 | NaN |
Data Preparation
This code is from Professor Phil’s website. It removes unused columns and NA values, converts categorical feature columns into “one-hot encoded” 0-1 columns, and saves the resulting DataFrame as X_train. Additionally, the “Species” column is encoded using LabelEncoder and stored as y_train.
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(train["Species"])

def prepare_data(df):
    df = df.drop(["studyName", "Sample Number", "Individual ID", "Date Egg", "Comments", "Region"], axis = 1)
    df = df[df["Sex"] != "."]
    df = df.dropna()
    y = le.transform(df["Species"])
    df = df.drop(["Species"], axis = 1)
    df = pd.get_dummies(df)
    return df, y

X_train, y_train = prepare_data(train)
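As a quick sanity check (not part of the original code), we can confirm how LabelEncoder numbered the species, since later outputs like the confusion matrix depend on this ordering. LabelEncoder sorts class labels alphabetically, so a minimal sketch:

# le.classes_ holds the fitted labels in alphabetical order,
# so 0 = Adelie, 1 = Chinstrap, 2 = Gentoo.
for code, name in enumerate(le.classes_):
    print(code, name)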
We can check what the columns look like now.
X_train.head()
 | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Delta 15 N (o/oo) | Delta 13 C (o/oo) | Island_Biscoe | Island_Dream | Island_Torgersen | Stage_Adult, 1 Egg Stage | Clutch Completion_No | Clutch Completion_Yes | Sex_FEMALE | Sex_MALE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 40.9 | 16.6 | 187.0 | 3200.0 | 9.08458 | -24.54903 | False | True | False | True | False | True | True | False |
1 | 49.0 | 19.5 | 210.0 | 3950.0 | 9.53262 | -24.66867 | False | True | False | True | False | True | False | True |
2 | 50.0 | 15.2 | 218.0 | 5700.0 | 8.25540 | -25.40075 | True | False | False | True | False | True | False | True |
3 | 45.8 | 14.6 | 210.0 | 4200.0 | 7.79958 | -25.62618 | True | False | False | True | False | True | True | False |
4 | 51.0 | 18.8 | 203.0 | 4100.0 | 9.23196 | -24.17282 | False | True | False | True | False | True | False | True |
Data Visualization
Figures
I created two plots, each using two quantitative columns and one qualitative column. Plot 1 shows the relationship between body mass and flipper length across penguin species. Plot 2 shows how culmen length and culmen depth differ across species.
# Get the unencoded columns for easier graphing.
qual = train[["Island", "Sex", "Species"]].dropna()

# Shorten species labels for the legend
qual["Species"] = qual["Species"].apply(lambda x: "Chinstrap" if x == "Chinstrap penguin (Pygoscelis antarctica)"
                                        else ("Gentoo" if x == "Gentoo penguin (Pygoscelis papua)"
                                        else "Adelie"))
from matplotlib import pyplot as plt
import seaborn as sns
plt.style.use('seaborn-v0_8-whitegrid')

fig, ax = plt.subplots(1, 2, figsize = (10, 4))

p1 = sns.scatterplot(X_train, x = "Body Mass (g)", y = "Flipper Length (mm)", hue = qual["Species"], ax = ax[0])
p2 = sns.scatterplot(X_train, x = "Culmen Length (mm)", y = "Culmen Depth (mm)", hue = qual["Species"], ax = ax[1])
Plot 1 (Left): This graph shows the relationship between body mass and flipper length among different penguin species. Gentoo penguins are the largest, while Chinstrap and Adelie penguins overlap considerably in size. Adelie penguins show slightly more variation in mass for a given flipper length compared to Chinstraps.
Plot 2 (Right): This graph illustrates the differences in culmen length and depth among the species. Adelie penguins have the deepest but shortest culmen, Gentoo penguins have longer but less deep culmens, and Chinstraps are in between. These differences in beak size are significant for distinguishing penguin species.
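The visual impression that Gentoo penguins are the largest can be checked numerically. A minimal sketch using the raw train DataFrame:

# Median size measurements per species; Gentoo should stand out
# as heavier with longer flippers.
train.groupby("Species")[["Body Mass (g)", "Flipper Length (mm)"]].median()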
Table
Now I create a summary table of the penguin measurements grouped by clutch completion.
= X_train[["Clutch Completion_Yes", "Culmen Length (mm)", "Culmen Depth (mm)", "Flipper Length (mm)", "Body Mass (g)"]]
table "Clutch Completion_Yes").aggregate(['min', 'median', 'max']) table.groupby(
Clutch Completion_Yes | Culmen Length (mm) min | median | max | Culmen Depth (mm) min | median | max | Flipper Length (mm) min | median | max | Body Mass (g) min | median | max |
---|---|---|---|---|---|---|---|---|---|---|---|---|
False | 35.9 | 43.35 | 58.0 | 13.7 | 17.85 | 19.9 | 172.0 | 195.0 | 225.0 | 2700.0 | 3737.5 | 5700.0 |
True | 34.0 | 45.10 | 55.9 | 13.1 | 17.20 | 21.5 | 176.0 | 198.0 | 230.0 | 2850.0 | 4100.0 | 6300.0 |
Table 1: This table shows that the most significant difference between penguins that completed a full clutch and those that did not is their weight: penguins that produced two eggs were roughly 350 grams heavier at the median. While there may be a correlation between clutch completion and weight, it is unlikely that there is a direct causation. Since clutch completion does not separate the measurements strongly, it may not be a feature worth further investigation.
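The median weight gap cited in Table 1 can be computed directly from the table DataFrame defined above; a minimal sketch:

# Difference in median body mass between complete and incomplete clutches
# (about 350 g on this training split, per the medians in Table 1).
medians = table.groupby("Clutch Completion_Yes")["Body Mass (g)"].median()
medians.iloc[1] - medians.iloc[0]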
Feature selection
Here I use the SelectKBest function from the scikit-learn library to choose the three features that I will include in my model. I ran feature selection separately for the numerical and categorical columns, because running it on all columns at once would have selected three numerical features. SelectKBest identifies the k best features according to a user-specified scoring function. I chose the chi-squared scoring function, as my features are intended for classification and are non-negative.
from sklearn.feature_selection import SelectKBest, chi2

# Selecting 2 numerical features
quant = ['Culmen Length (mm)', 'Culmen Depth (mm)', 'Flipper Length (mm)', 'Body Mass (g)']
sel1 = SelectKBest(chi2, k=2)
sel1.fit_transform(X_train[quant], y_train)
f1 = sel1.get_feature_names_out()

# Selecting 1 categorical feature
qual = ["Clutch Completion_Yes", "Clutch Completion_No", "Island_Biscoe", "Island_Dream", "Island_Torgersen", "Sex_FEMALE", "Sex_MALE"]
sel2 = SelectKBest(chi2, k=1)
sel2.fit_transform(X_train[qual], y_train)
f2 = sel2.get_feature_names_out()
This helper function collects the selected numerical features together with all one-hot encoded variants of the selected categorical feature.
def get_feat(f1, cat):
    cols = list(f1)
    clutch = ["Clutch Completion_Yes", "Clutch Completion_No"]
    island = ["Island_Biscoe", "Island_Dream", "Island_Torgersen"]
    sex = ["Sex_FEMALE", "Sex_MALE"]

    if cat in clutch: return cols + clutch
    if cat in island: return cols + island
    if cat in sex: return cols + sex

cols = get_feat(f1, f2[0])
cols
['Flipper Length (mm)',
'Body Mass (g)',
'Island_Biscoe',
'Island_Dream',
'Island_Torgersen']
Although the culmen sizes initially appeared to be better features, the statistical tests indicated that flipper length and body mass were, in fact, the more significant features.
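To see why the test preferred these features, the fitted selectors expose a scores_ attribute; a minimal sketch, assuming sel1 from above is still in scope:

# Higher chi-squared scores indicate a stronger association with the label.
for name, score in zip(quant, sel1.scores_):
    print(f"{name}: {score:.1f}")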
Training
The model is trained on the features selected above. I had to use StandardScaler to avoid a convergence error, and I chose logistic regression since it is a good fit for classification.
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train[cols], y_train)
pipe.score(X_train[cols], y_train)
0.8984375
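Because this score is computed on the same data the model was fit on, it can be optimistic. A minimal sketch of a cross-validated check, using scikit-learn's cross_val_score:

from sklearn.model_selection import cross_val_score

# Mean accuracy over 5 train/validation splits; a value close to the
# training score above suggests the model is not overfitting.
cross_val_score(pipe, X_train[cols], y_train, cv = 5).mean()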
Testing
= "https://raw.githubusercontent.com/PhilChodrow/ml-notes/main/data/palmer-penguins/test.csv"
test_url = pd.read_csv(test_url)
test
= prepare_data(test)
X_test, y_test pipe.score(X_test[cols], y_test)
0.8970588235294118
Results
Plotting Decision Regions
Most of this code is adapted from Prof. Phil’s website.
from matplotlib import pyplot as plt
import numpy as np
from matplotlib.patches import Patch

def plot_regions(model, X, y):
    x0 = X[X.columns[0]]
    x1 = X[X.columns[1]]
    qual_features = X.columns[2:]

    fig, axarr = plt.subplots(1, len(qual_features), figsize = (8, 3))

    # create a grid
    grid_x = np.linspace(x0.min(), x0.max(), 501)
    grid_y = np.linspace(x1.min(), x1.max(), 501)
    xx, yy = np.meshgrid(grid_x, grid_y)

    XX = xx.ravel()
    YY = yy.ravel()

    for i in range(len(qual_features)):
        XY = pd.DataFrame({
            X.columns[0] : XX,
            X.columns[1] : YY
        })

        for j in qual_features:
            XY[j] = 0

        XY[qual_features[i]] = 1

        p = model.predict(XY)
        p = p.reshape(xx.shape)

        # use contour plot to visualize the predictions
        axarr[i].contourf(xx, yy, p, cmap = "jet", alpha = 0.2, vmin = 0, vmax = 2)

        ix = X[qual_features[i]] == 1
        # plot the data
        axarr[i].scatter(x0[ix], x1[ix], c = y[ix], cmap = "jet", vmin = 0, vmax = 2)

        axarr[i].set(xlabel = X.columns[0],
                     ylabel = X.columns[1],
                     title = qual_features[i])

    patches = []
    for color, spec in zip(["red", "green", "blue"], ["Adelie", "Chinstrap", "Gentoo"]):
        patches.append(Patch(color = color, label = spec))

    plt.legend(title = "Species", handles = patches, loc = "best")
    plt.tight_layout()
Regions for training set:
plot_regions(pipe, X_train[cols], y_train)
Regions for testing set:
plot_regions(pipe, X_test[cols], y_test)
Looking at the decision plots, we can see that our model is quite successful in distinguishing between Gentoo and Adelie penguins on Biscoe and Torgersen islands. However, on Dream Island, where there is a mixture of Adelie and Chinstrap penguins, the model struggles to differentiate between the two species.
Confusion Matrix
from sklearn.metrics import confusion_matrix

y_test_pred = pipe.predict(X_test[cols])
confusion_matrix(y_test, y_test_pred)
array([[26, 5, 0],
[ 2, 9, 0],
[ 0, 0, 26]])
Once again, this shows that the model struggled the most with Adelie and Chinstrap: all off-diagonal entries are Adelie/Chinstrap misclassifications, while every Gentoo penguin was classified correctly.
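A per-class breakdown makes this more precise; a minimal sketch using classification_report, mapping rows back to species names with le.classes_:

from sklearn.metrics import classification_report

# Per-species precision and recall; Adelie and Chinstrap should score
# lowest, matching the off-diagonal entries of the confusion matrix.
print(classification_report(y_test, y_test_pred, target_names = le.classes_))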
Discussion
The model achieved a training accuracy of about 0.90, essentially the same as its testing accuracy. This indicates that the model generalizes well when predicting penguin species from flipper length, body mass, and island. However, the decision regions and confusion matrix show that the model had difficulty distinguishing between Adelie and Chinstrap penguins on Dream Island. These two species are similar in size, which makes them challenging to differentiate.
Acknowledgements
Code adapted from Professor Phil Chodrow's course CSCI 0451: Machine Learning at Middlebury College.