import pandas as pd
= "https://raw.githubusercontent.com/PhilChodrow/ml-notes/main/data/palmer-penguins/train.csv"
train_url = pd.read_csv(train_url) train
Classifying Palmer Penguins
Abstract
In this blog post, I analyze the Palmer Penguins dataset to identify the best features for determining penguin species based on their measurements. First, I create two figures and a table to explore the relationships between different features. Next, I utilize Scikit-learn’s feature selection methods, employing chi-squared tests to select two numerical features and one categorical feature. With these features, I train and test a logistic regression model. The model demonstrates reasonable accuracy; however, to gain a better understanding of the results, I visualize the decision regions and present a confusion matrix.
train.head()
 | studyName | Sample Number | Species | Region | Island | Stage | Individual ID | Clutch Completion | Date Egg | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Sex | Delta 15 N (o/oo) | Delta 13 C (o/oo) | Comments |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | PAL0809 | 31 | Chinstrap penguin (Pygoscelis antarctica) | Anvers | Dream | Adult, 1 Egg Stage | N63A1 | Yes | 11/24/08 | 40.9 | 16.6 | 187.0 | 3200.0 | FEMALE | 9.08458 | -24.54903 | NaN |
1 | PAL0809 | 41 | Chinstrap penguin (Pygoscelis antarctica) | Anvers | Dream | Adult, 1 Egg Stage | N74A1 | Yes | 11/24/08 | 49.0 | 19.5 | 210.0 | 3950.0 | MALE | 9.53262 | -24.66867 | NaN |
2 | PAL0708 | 4 | Gentoo penguin (Pygoscelis papua) | Anvers | Biscoe | Adult, 1 Egg Stage | N32A2 | Yes | 11/27/07 | 50.0 | 15.2 | 218.0 | 5700.0 | MALE | 8.25540 | -25.40075 | NaN |
3 | PAL0708 | 15 | Gentoo penguin (Pygoscelis papua) | Anvers | Biscoe | Adult, 1 Egg Stage | N38A1 | Yes | 12/3/07 | 45.8 | 14.6 | 210.0 | 4200.0 | FEMALE | 7.79958 | -25.62618 | NaN |
4 | PAL0809 | 34 | Chinstrap penguin (Pygoscelis antarctica) | Anvers | Dream | Adult, 1 Egg Stage | N65A2 | Yes | 11/24/08 | 51.0 | 18.8 | 203.0 | 4100.0 | MALE | 9.23196 | -24.17282 | NaN |
Data Preparation
This code is from Professor Phil’s website. It removes unused columns and NA values, converts categorical feature columns into “one-hot encoded” 0-1 columns, and saves the resulting DataFrame as X_train. Additionally, the “Species” column is encoded using LabelEncoder and stored as y_train.
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(train["Species"])

def prepare_data(df):
    df = df.drop(["studyName", "Sample Number", "Individual ID", "Date Egg", "Comments", "Region"], axis = 1)
    df = df[df["Sex"] != "."]
    df = df.dropna()
    y = le.transform(df["Species"])
    df = df.drop(["Species"], axis = 1)
    df = pd.get_dummies(df)
    return df, y

X_train, y_train = prepare_data(train)
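As a quick sanity check (not part of the original code), we can confirm how LabelEncoder numbered the species, since later outputs like the confusion matrix depend on this ordering. LabelEncoder sorts class labels alphabetically, so a minimal sketch:

# le.classes_ holds the fitted labels in alphabetical order,
# so 0 = Adelie, 1 = Chinstrap, 2 = Gentoo.
for code, name in enumerate(le.classes_):
    print(code, name)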
We can check what the columns look like now.
X_train.head()
 | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Delta 15 N (o/oo) | Delta 13 C (o/oo) | Island_Biscoe | Island_Dream | Island_Torgersen | Stage_Adult, 1 Egg Stage | Clutch Completion_No | Clutch Completion_Yes | Sex_FEMALE | Sex_MALE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 40.9 | 16.6 | 187.0 | 3200.0 | 9.08458 | -24.54903 | False | True | False | True | False | True | True | False |
1 | 49.0 | 19.5 | 210.0 | 3950.0 | 9.53262 | -24.66867 | False | True | False | True | False | True | False | True |
2 | 50.0 | 15.2 | 218.0 | 5700.0 | 8.25540 | -25.40075 | True | False | False | True | False | True | False | True |
3 | 45.8 | 14.6 | 210.0 | 4200.0 | 7.79958 | -25.62618 | True | False | False | True | False | True | True | False |
4 | 51.0 | 18.8 | 203.0 | 4100.0 | 9.23196 | -24.17282 | False | True | False | True | False | True | False | True |
Data Visualization
Figures
I created two plots, each using two quantitative columns and one qualitative column. Plot 1 shows the relationship between body mass and flipper length across penguin species. Plot 2 shows how culmen length and culmen depth differ across species.
# Get the unencoded columns for easier graphing.
qual = train[["Island", "Sex", "Species"]].dropna()

# Shorten species labels for the legend
qual["Species"] = qual["Species"].apply(lambda x: "Chinstrap" if x == "Chinstrap penguin (Pygoscelis antarctica)"
                                        else ("Gentoo" if x == "Gentoo penguin (Pygoscelis papua)"
                                        else "Adelie"))
from matplotlib import pyplot as plt
import seaborn as sns
plt.style.use('seaborn-v0_8-whitegrid')

fig, ax = plt.subplots(1, 2, figsize = (10, 4))

p1 = sns.scatterplot(X_train, x = "Body Mass (g)", y = "Flipper Length (mm)", hue = qual["Species"], ax = ax[0])
p2 = sns.scatterplot(X_train, x = "Culmen Length (mm)", y = "Culmen Depth (mm)", hue = qual["Species"], ax = ax[1])
Plot 1 (Left): This graph shows the relationship between body mass and flipper length among different penguin species. Gentoo penguins are the largest, while Chinstrap and Adelie penguins overlap considerably in size. Adelie penguins show slightly more variation in mass for a given flipper length compared to Chinstraps.
Plot 2 (Right): This graph illustrates the differences in culmen length and depth among the species. Adelie penguins have the deepest but shortest culmen, Gentoo penguins have longer but less deep culmens, and Chinstraps are in between. These differences in beak size are significant for distinguishing penguin species.
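The visual impression that Gentoo penguins are the largest can be checked numerically. A minimal sketch using the raw train DataFrame:

# Median size measurements per species; Gentoo should stand out
# as heavier with longer flippers.
train.groupby("Species")[["Body Mass (g)", "Flipper Length (mm)"]].median()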
Table
Now I create a summary table of the penguin measurements grouped by clutch completion.
= X_train[["Clutch Completion_Yes", "Culmen Length (mm)", "Culmen Depth (mm)", "Flipper Length (mm)", "Body Mass (g)"]]
table "Clutch Completion_Yes").aggregate(['min', 'median', 'max']) table.groupby(
Clutch Completion_Yes | Culmen Length (mm) min | median | max | Culmen Depth (mm) min | median | max | Flipper Length (mm) min | median | max | Body Mass (g) min | median | max |
---|---|---|---|---|---|---|---|---|---|---|---|---|
False | 35.9 | 43.35 | 58.0 | 13.7 | 17.85 | 19.9 | 172.0 | 195.0 | 225.0 | 2700.0 | 3737.5 | 5700.0 |
True | 34.0 | 45.10 | 55.9 | 13.1 | 17.20 | 21.5 | 176.0 | 198.0 | 230.0 | 2850.0 | 4100.0 | 6300.0 |
Table 1: This table shows that the most significant difference between penguins that completed a full clutch and those that did not is their weight: penguins that produced two eggs were roughly 350 grams heavier at the median. While there may be a correlation between clutch completion and weight, it is unlikely that there is a direct causation. Since clutch completion does not separate the measurements strongly, it may not be a feature worth further investigation.
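The median weight gap cited in Table 1 can be computed directly from the table DataFrame defined above; a minimal sketch:

# Difference in median body mass between complete and incomplete clutches
# (about 350 g on this training split, per the medians in Table 1).
medians = table.groupby("Clutch Completion_Yes")["Body Mass (g)"].median()
medians.iloc[1] - medians.iloc[0]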
Feature selection
Here I use the SelectKBest function from the scikit-learn library to choose the three features that I will include in my model. I ran feature selection separately for the numerical and categorical columns, because running it on all columns at once would have selected three numerical features. SelectKBest identifies the k best features according to a user-specified scoring function. I chose the chi-squared scoring function, as my features are intended for classification and are non-negative.
from sklearn.feature_selection import SelectKBest, chi2

# Selecting 2 numerical features
quant = ['Culmen Length (mm)', 'Culmen Depth (mm)', 'Flipper Length (mm)', 'Body Mass (g)']
sel1 = SelectKBest(chi2, k=2)
sel1.fit_transform(X_train[quant], y_train)
f1 = sel1.get_feature_names_out()

# Selecting 1 categorical feature
qual = ["Clutch Completion_Yes", "Clutch Completion_No", "Island_Biscoe", "Island_Dream", "Island_Torgersen", "Sex_FEMALE", "Sex_MALE"]
sel2 = SelectKBest(chi2, k=1)
sel2.fit_transform(X_train[qual], y_train)
f2 = sel2.get_feature_names_out()
This helper function collects the selected numerical features together with all one-hot encoded variants of the selected categorical feature.
def get_feat(f1, cat):
    cols = list(f1)
    clutch = ["Clutch Completion_Yes", "Clutch Completion_No"]
    island = ["Island_Biscoe", "Island_Dream", "Island_Torgersen"]
    sex = ["Sex_FEMALE", "Sex_MALE"]

    if cat in clutch: return cols + clutch
    if cat in island: return cols + island
    if cat in sex: return cols + sex

cols = get_feat(f1, f2[0])
cols
['Flipper Length (mm)',
'Body Mass (g)',
'Island_Biscoe',
'Island_Dream',
'Island_Torgersen']
Although the culmen sizes initially appeared to be better features, the statistical tests indicated that flipper length and body mass were, in fact, the more significant features.
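To see why the test preferred these features, the fitted selectors expose a scores_ attribute; a minimal sketch, assuming sel1 from above is still in scope:

# Higher chi-squared scores indicate a stronger association with the label.
for name, score in zip(quant, sel1.scores_):
    print(f"{name}: {score:.1f}")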
Training
The model is trained on the features selected above. I had to use StandardScaler to avoid a convergence error, and I chose logistic regression since it is a good fit for classification.
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train[cols], y_train)
pipe.score(X_train[cols], y_train)
0.8984375
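Because this score is computed on the same data the model was fit on, it can be optimistic. A minimal sketch of a cross-validated check, using scikit-learn's cross_val_score:

from sklearn.model_selection import cross_val_score

# Mean accuracy over 5 train/validation splits; a value close to the
# training score above suggests the model is not overfitting.
cross_val_score(pipe, X_train[cols], y_train, cv = 5).mean()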
Testing
= "https://raw.githubusercontent.com/PhilChodrow/ml-notes/main/data/palmer-penguins/test.csv"
test_url = pd.read_csv(test_url)
test
= prepare_data(test)
X_test, y_test pipe.score(X_test[cols], y_test)
0.8970588235294118
Results
Plotting Decision Regions
Most of this code is adapted from Prof. Phil’s website.
from matplotlib import pyplot as plt
import numpy as np
from matplotlib.patches import Patch

def plot_regions(model, X, y):
    x0 = X[X.columns[0]]
    x1 = X[X.columns[1]]
    qual_features = X.columns[2:]

    fig, axarr = plt.subplots(1, len(qual_features), figsize = (8, 3))

    # create a grid
    grid_x = np.linspace(x0.min(), x0.max(), 501)
    grid_y = np.linspace(x1.min(), x1.max(), 501)
    xx, yy = np.meshgrid(grid_x, grid_y)

    XX = xx.ravel()
    YY = yy.ravel()

    for i in range(len(qual_features)):
        XY = pd.DataFrame({
            X.columns[0] : XX,
            X.columns[1] : YY
        })

        for j in qual_features:
            XY[j] = 0

        XY[qual_features[i]] = 1

        p = model.predict(XY)
        p = p.reshape(xx.shape)

        # use contour plot to visualize the predictions
        axarr[i].contourf(xx, yy, p, cmap = "jet", alpha = 0.2, vmin = 0, vmax = 2)

        ix = X[qual_features[i]] == 1
        # plot the data
        axarr[i].scatter(x0[ix], x1[ix], c = y[ix], cmap = "jet", vmin = 0, vmax = 2)

        axarr[i].set(xlabel = X.columns[0],
                     ylabel = X.columns[1],
                     title = qual_features[i])

    patches = []
    for color, spec in zip(["red", "green", "blue"], ["Adelie", "Chinstrap", "Gentoo"]):
        patches.append(Patch(color = color, label = spec))

    plt.legend(title = "Species", handles = patches, loc = "best")
    plt.tight_layout()
Regions for training set:
plot_regions(pipe, X_train[cols], y_train)
Regions for testing set:
plot_regions(pipe, X_test[cols], y_test)
Looking at the decision plots, we can see that our model is quite successful in distinguishing between Gentoo and Adelie penguins on Biscoe and Torgersen islands. However, on Dream Island, where there is a mixture of Adelie and Chinstrap penguins, the model struggles to differentiate between the two species.
Confusion Matrix
from sklearn.metrics import confusion_matrix

y_test_pred = pipe.predict(X_test[cols])
confusion_matrix(y_test, y_test_pred)
array([[26, 5, 0],
[ 2, 9, 0],
[ 0, 0, 26]])
Once again, this shows that the model struggled the most with Adelie and Chinstrap: all off-diagonal entries are Adelie/Chinstrap misclassifications, while every Gentoo penguin was classified correctly.
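A per-class breakdown makes this more precise; a minimal sketch using classification_report, mapping rows back to species names with le.classes_:

from sklearn.metrics import classification_report

# Per-species precision and recall; Adelie and Chinstrap should score
# lowest, matching the off-diagonal entries of the confusion matrix.
print(classification_report(y_test, y_test_pred, target_names = le.classes_))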
Discussion
The model achieved a training accuracy of about 0.90, essentially the same as its testing accuracy. This indicates that the model generalizes well when predicting penguin species from flipper length, body mass, and island. However, the decision regions and confusion matrix show that the model had difficulty distinguishing between Adelie and Chinstrap penguins on Dream Island. These two species are similar in size, which makes them challenging to differentiate.
Acknowledgements
Code adapted from Professor Phil Chodrow's course CSCI 0451: Machine Learning at Middlebury College.