Auditing Bias

In this blog post, my goal was to predict employment status from various demographic features, excluding race, using a subset of data from the American Community Survey focused on Massachusetts residents in 2023. After training a model on this data, I evaluated its performance and examined potential biases in its predictions.
Author

Emmanuel Towner

Published

March 12, 2025


Abstract

My goal was to predict employment status from various demographic features, excluding race, using a subset of data from the American Community Survey focused on Massachusetts residents in 2023. Of the 58,500 residents in the training data, only about half were employed. Men, people without disabilities, and people born abroad or without citizenship generally had higher proportions of employment. The model I used was scikit-learn's decision tree classifier, because its results are easy to interpret. I tuned complexity using GridSearchCV, which cross-validated the depths I provided and selected a best depth of 10, giving an overall test accuracy of 0.82. Accuracy did not differ much across groups. Auditing my model showed that Asian residents had the highest PPV, TPR, and FPR, while White residents had the highest FNR. I left the SPAA and NPI groups out of this summary because their small samples often produced missing or degenerate values. Based on those values, my model failed approximate error rate balance and statistical parity but satisfied calibration. Finally, I used fixed PPV and prevalence values to plot the feasible FPR and FNR combinations for Black and White residents.

Code
from folktables import ACSDataSource, ACSEmployment, BasicProblem, adult_filter
import numpy as np

STATE = "MA" 

data_source = ACSDataSource(survey_year='2023', # Get more recent data
                            horizon='1-Year', 
                            survey='person')

acs_data = data_source.get_data(states=[STATE], download=True)

acs_data.head()
RT SERIALNO DIVISION SPORDER PUMA REGION STATE ADJINC PWGTP AGEP ... PWGTP71 PWGTP72 PWGTP73 PWGTP74 PWGTP75 PWGTP76 PWGTP77 PWGTP78 PWGTP79 PWGTP80
0 P 2023GQ0000077 1 1 503 1 25 1019518 11 89 ... 12 13 13 13 13 13 13 12 13 13
1 P 2023GQ0000098 1 1 613 1 25 1019518 11 20 ... 3 4 2 20 13 9 2 20 13 2
2 P 2023GQ0000109 1 1 613 1 25 1019518 80 68 ... 28 34 78 73 34 68 82 15 17 79
3 P 2023GQ0000114 1 1 801 1 25 1019518 69 21 ... 60 74 161 11 127 57 11 12 11 12
4 P 2023GQ0000135 1 1 1201 1 25 1019518 27 84 ... 27 28 27 29 27 29 27 27 28 27

5 rows × 287 columns

Code
# No RELP available
possible_features=['AGEP', 'SCHL', 'MAR', 'DIS', 'ESP', 'CIT', 'MIG', 'MIL', 'ANC', 'NATIVITY', 'DEAR', 'DEYE', 'DREM', 'SEX', 'RAC1P', 'ESR']
acs_data[possible_features].head()
AGEP SCHL MAR DIS ESP CIT MIG MIL ANC NATIVITY DEAR DEYE DREM SEX RAC1P ESR
0 89 16.0 2 1 NaN 1 1.0 4.0 3 1 1 1 1.0 2 1 6.0
1 20 16.0 5 2 NaN 1 1.0 4.0 1 1 2 2 2.0 1 9 6.0
2 68 18.0 5 1 NaN 1 1.0 4.0 1 1 1 2 2.0 1 1 6.0
3 21 19.0 5 2 NaN 1 1.0 4.0 1 1 2 2 2.0 2 1 1.0
4 84 16.0 1 1 NaN 1 3.0 4.0 3 1 2 1 1.0 2 1 6.0

Model

Feature Selection and Data Split

Adapted from Professor Phil’s code.

Code
features_to_use = [f for f in possible_features if f not in ["ESR", "RAC1P"]]
Code
EmploymentProblem = BasicProblem(
    features=features_to_use,
    target='ESR',
    target_transform=lambda x: x == 1,
    group='RAC1P',
    preprocess=lambda x: x,
    postprocess=lambda x: np.nan_to_num(x, -1),
)

features, label, group = EmploymentProblem.df_to_numpy(acs_data)

for obj in [features, label, group]:
  print(obj.shape)
(73126, 14)
(73126,)
(73126,)
Code
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test, group_train, group_test = train_test_split(
    features, label, group, test_size=0.2, random_state=0)
Code

import pandas as pd
df = pd.DataFrame(X_train, columns = features_to_use)
df["race"] = group_train
df["label"] = y_train

This function converts the numeric race codes to more helpful categorical labels. Two of them have been shortened because the full names were too long:

1. SPAA - "American Indian and Alaska Native tribes specified; or American Indian or Alaska Native, not specified and no other races"
2. NPI - "Native Hawaiian and Other Pacific Islander alone"

Code
def convert_race(df: pd.DataFrame):
    df = df.sort_values(by='race')
    df['race'] = df['race'].replace({1: "White", 2: "Black", 3: "N. American", 4:"N. Alaskan", 
                        5:"SPAA", 
                        6:'Asian', 7: 'NPI', 8:'Other', 9: 'Multi'})

    df['race'] = pd.Categorical(df['race'])
    return df
    

Basic Descriptives

Code
# Save to the original for calculations and copying
relabeled = df.copy() 
relabeled = convert_race(relabeled)
relabeled
AGEP SCHL MAR DIS ESP CIT MIG MIL ANC NATIVITY DEAR DEYE DREM SEX race label
0 69.0 19.0 1.0 2.0 0.0 1.0 1.0 4.0 1.0 1.0 2.0 2.0 2.0 2.0 White False
35997 12.0 9.0 5.0 2.0 1.0 1.0 1.0 0.0 1.0 1.0 2.0 2.0 2.0 2.0 White False
35999 70.0 24.0 1.0 2.0 0.0 1.0 1.0 4.0 2.0 1.0 2.0 2.0 2.0 1.0 White True
36000 74.0 22.0 1.0 2.0 0.0 1.0 1.0 4.0 2.0 1.0 2.0 2.0 2.0 1.0 White False
36001 68.0 23.0 2.0 2.0 0.0 1.0 1.0 4.0 1.0 1.0 2.0 2.0 2.0 2.0 White True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
51260 6.0 3.0 5.0 2.0 2.0 1.0 1.0 0.0 1.0 1.0 2.0 2.0 2.0 2.0 Multi False
51257 7.0 4.0 5.0 2.0 7.0 1.0 1.0 0.0 1.0 1.0 2.0 2.0 2.0 2.0 Multi False
5335 14.0 11.0 5.0 2.0 3.0 1.0 1.0 0.0 3.0 1.0 2.0 2.0 2.0 1.0 Multi False
34217 18.0 14.0 5.0 2.0 0.0 1.0 1.0 4.0 1.0 1.0 2.0 2.0 2.0 2.0 Multi False
40473 18.0 18.0 5.0 2.0 0.0 1.0 3.0 4.0 4.0 1.0 2.0 2.0 2.0 1.0 Multi True

58500 rows × 16 columns

  1. The number of individuals in this dataframe is:
Code
print("The number of indiviuals in this df are", df.shape[0])
The number of indiviuals in this df are 58500
  2. The proportion of employed individuals is:
Code
emp = df[df['label'] == True][["label"]].size # 29703
total = df['label'].size # 58500
emp_prop = emp / total
print("The proportion of employed indiviuals are", emp_prop)
The proportion of employed indiviuals are 0.5077435897435898
  3. The number of employed individuals in each group is:
Code
relabeled.groupby('race')[['label']].aggregate('sum')
label
race
Asian 2399
Black 1529
Multi 2478
N. American 52
NPI 15
Other 1284
SPAA 26
White 21920

Based on this, I might consider whether groups with fewer than 30 employed labels provide an adequate sample size.
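As a quick check, the snippet below flags the groups that fall under that threshold. This is a minimal sketch assuming the relabeled dataframe from above; the 30-label cutoff mirrors the rule of thumb just mentioned.

Code
# Flag groups with fewer than 30 employed individuals, since rates
# computed from such small counts are unstable
employed_counts = relabeled.groupby('race')['label'].sum()
print(employed_counts[employed_counts < 30])  # SPAA and NPI, per the table above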

  4. The proportion of employed people in each group is:
Code
relabeled.groupby('race')[['label']].aggregate('mean')
label
race
Asian 0.540315
Black 0.473668
Multi 0.456270
N. American 0.530612
NPI 0.625000
Other 0.486364
SPAA 0.520000
White 0.514687
  5. Intersectional Trends
Code
import seaborn as sns
from matplotlib import pyplot as plt

fig, ax = plt.subplots(1, 2, figsize = (14, 4))

colors = ["#1f77b4", "#ff7f0e"]
plt1 = sns.barplot(relabeled, x = "race", y = "label", hue="SEX", palette=colors, ax=ax[0])
plt1.set_xlabel("Race")
plt1.set_ylabel("Employment")
handles, _ = plt1.get_legend_handles_labels()
plt1.legend(handles=handles, title='Sex', labels=['Male', 'Female'])

plt2 = sns.barplot(relabeled, x = "race", y = "label", hue="DIS", palette=colors, ax=ax[1])
plt2.set_xlabel("Race")
plt2.set_ylabel("Employment")
handles, _ = plt2.get_legend_handles_labels()
plt2.legend(handles=handles, title='Disability', labels=['Has disability', 'No disability'])
plt.show()

In these graphs, we observe the intersectionality of employment rates based on sex and disability status. Generally, men and individuals without disabilities tend to be more likely to secure jobs. Several factors contribute to this disparity. For one, women and people with disabilities often face discrimination from employers. Additionally, the number of applicants from these groups could also play a role; even though more women are entering the workforce, many continue to take on childcare responsibilities at home. Moreover, "disability" encompasses a wide range of conditions, making it difficult to determine which individuals are capable of working and which are not. This uncertainty likely contributes to the lower employment rates among these groups. Interestingly, for Black Americans, Native Americans, and SPAA, more women are employed than men.

Code
plt3 = sns.barplot(relabeled, x = "race", y = "label", hue="CIT")
plt3.set_xlabel("Race")
plt3.set_ylabel("Employment")
handles, _ = plt3.get_legend_handles_labels()
status = ['Born in US', 'Born in US Territories', 'Born Abroad', 'US by naturalization', 'Not a citizen']
plt3.legend(handles=handles, title='Citizenship', labels=status, loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()

This graph shows that individuals born in the U.S. or its territories have a lower percentage of employment compared to those born abroad or naturalized. I suspect this may be due to the smaller population sizes of those groups and the fact that immigrants often arrive with specific job opportunities in mind.
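To sanity-check the population-size point, it helps to see how many residents fall into each citizenship category. A minimal sketch, assuming the relabeled dataframe from earlier:

Code
# Number of residents per citizenship category
# (1 = born in US, 2 = US territories, 3 = born abroad, 4 = naturalized, 5 = not a citizen)
print(relabeled.groupby('CIT').size())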

Training Model

The decision tree classifier was one of the recommendations given by the professor, and the scikit-learn documentation notes that its results are easy to interpret. To find the best depth for the model, I used GridSearchCV, which cross-validates the candidate parameters and selects the best one.

Code
from sklearn.tree import DecisionTreeClassifier 
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix

param_grid = {
    'max_depth': [1, 3, 5, 10]
}

model = DecisionTreeClassifier() 
grid = GridSearchCV(model, param_grid, scoring="accuracy")
grid.fit(X_train, y_train)

print("Best params", grid.best_params_)

best_dtree = grid.best_estimator_

best_dtree.predict(X_train)
best_dtree.score(X_train, y_train)
Best params {'max_depth': 10}
0.8346666666666667

The best depth was 10, which gave a solid training accuracy of about 0.83.
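To see how the candidate depths compared, the cross-validated scores can be read off grid.cv_results_. A minimal sketch assuming the fitted grid object from above:

Code
# Mean cross-validated accuracy for each candidate max_depth
cv_scores = pd.DataFrame({
    "max_depth": grid.cv_results_["param_max_depth"],
    "mean_cv_accuracy": grid.cv_results_["mean_test_score"],
})
print(cv_scores)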

Testing Model

Code
pred = best_dtree.predict(X_test)
score = best_dtree.score(X_test, y_test)

print("Test score", score)
Test score 0.821892520169561

Bias Audit

Overall Measures

The overall accuracy:

Code
(pred == y_test).mean()
np.float64(0.821892520169561)

Here we calculate the PPV, the false negative rate, and the false positive rate.
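In terms of the confusion-matrix counts, these quantities are

\[
\mathrm{PPV} = \frac{TP}{TP + FP}, \qquad
\mathrm{FNR} = \frac{FN}{FN + TP}, \qquad
\mathrm{FPR} = \frac{FP}{FP + TN}.
\]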

Code
tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()

ppv = tp / (tp + fp)
fnr = fn / (fn + tp)
fpr = fp / (fp + tn)

print("Overall PPV:", ppv)
print("Overall false negative rate:", fnr)
print("Overall false positive rate:", fpr)
Overall PPV: 0.7938385705483673
Overall false negative rate: 0.12639001898562516
Overall false positive rate: 0.2306949806949807

Based on this, the model catches most people who are employed (the FNR is around 0.13), but it still overestimates employment, predicting employment for roughly 23% of people who are not employed.

By-Group Measures

This is the accuracy by group:

Code
groups = df["race"].unique() # Need race as numbers in order to compare with group_test
audit = pd.DataFrame(groups, columns=["race"])

accuracies = []
for group in groups:
   accuracy = (pred == y_test)[group_test == group].mean()
   accuracies.append(accuracy)

audit["accuracy"] = accuracies
audit = convert_race(audit)
audit
race accuracy
0 White 0.824150
3 Black 0.799747
7 N. American 0.812500
6 SPAA 1.000000
4 Asian 0.821859
5 NPI 0.500000
1 Other 0.807092
2 Multi 0.825095

Besides NPI and SPAA, the accuracies are rather similar to each other.

These are the PPV, TPR, FPR, and FNR by group:

Code
ppvs = []
tprs = []
fnrs = []
fprs = []

for group in groups:
    # Tally confusion-matrix counts for this group
    tp = fp = tn = fn = 0
    for actual, predicted, grp in zip(y_test, pred, group_test):
        if grp != group:
            continue
        if predicted == actual:
            if actual:
                tp += 1
            else:
                tn += 1
        else:
            if actual:
                fn += 1
            else:
                fp += 1
    # Guard against division by zero for groups missing a class
    ppvs.append(tp / (tp + fp) if (tp + fp) > 0 else 0)
    fnrs.append(fn / (fn + tp) if (fn + tp) > 0 else 0)
    fprs.append(fp / (fp + tn) if (fp + tn) > 0 else 0)
    tprs.append(tp / (tp + fn) if (tp + fn) > 0 else 0)

audit["ppv"] = ppvs
audit["tpr"] = tprs
audit["fpr"] = fprs
audit["fnr"] = fnrs
audit
race accuracy ppv tpr fpr fnr
0 White 0.824150 0.802322 0.868922 0.222350 0.131078
3 Black 0.799747 0.750630 0.889552 0.267568 0.110448
7 N. American 0.812500 0.769345 0.873311 0.214385 0.126689
6 SPAA 1.000000 0.739631 0.877049 0.267139 0.122951
4 Asian 0.821859 0.807588 0.901664 0.283433 0.098336
5 NPI 0.500000 0.000000 0.000000 0.000000 1.000000
1 Other 0.807092 0.000000 0.000000 0.000000 0.000000
2 Multi 0.825095 0.625000 1.000000 0.272727 0.000000

Bias Measures

Using code adapted from Machine Learning Mastery and the scikit-learn documentation, I used scikit-learn's calibration curve to diagnose the calibration:

Code
from sklearn.calibration import calibration_curve 

y_prob = best_dtree.predict_proba(X_test)[:, 1] # Gets the probability of positive values
prob_true, prob_pred = calibration_curve(y_test, y_prob, n_bins=10)

avgs = []
actuals = []
for group in groups:
    # P(Y = 1 | S = {0,1}, R = r) 
    avg_pred_prob = y_prob[group_test == group].mean()
    avgs.append(avg_pred_prob)
    actual_employment_rate = (y_test == 1)[group_test == group].mean()
    actuals.append(actual_employment_rate)

calibrate= pd.DataFrame(groups, columns=["race"])
calibrate["employed"] = actuals
calibrate["avg_predicted_probs"] = avgs
print(calibrate)

plt.plot([0, 1], [0, 1], linestyle='--') # Perfectly calibrated reference line
plt.plot(prob_pred, prob_true, marker='.') # Model's calibration curve
plt.title("Model Calibration")
plt.xlabel("Average Predicted Probability")
plt.ylabel("Fraction of Positives")
plt.show()
   race  employed  avg_predicted_probs
0     1  0.509466             0.510189
1     8  0.475177             0.486277
2     9  0.450190             0.449267
3     2  0.463878             0.482898
4     6  0.568847             0.571642
5     7  0.500000             0.336508
6     5  0.000000             0.094724
7     3  0.312500             0.424606

The above graph and table show that my model is nearly calibrated: within each group, the average predicted probability is close to the actual employment rate.
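One way to quantify this is the per-group gap between the average predicted probability and the actual employment rate. A minimal sketch assuming the calibrate table built above:

Code
# Absolute gap between average predicted probability and actual rate, per group
calibrate["gap"] = (calibrate["avg_predicted_probs"] - calibrate["employed"]).abs()
print(calibrate.sort_values("gap", ascending=False))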

Approximate error rate balance: looking at the table above, we can see that the model does not meet approximate error rate balance across groups; the groups differ in their true and false positive rates. The code below double-checks.

Code
for i, row1 in audit.iterrows():
    equal = True
    race1 = row1['race']
    tpr1 = row1['tpr']
    fpr1 = row1['fpr']
    for j, row2 in audit.iterrows():
        race2 = row2['race']
        tpr2 = row2['tpr']
        fpr2 = row2['fpr']
        if tpr1 != tpr2 or fpr1 != fpr2:
            equal = False
            print(f"{race1} did not have an equal TPR or FPR as {race2}")
            break
    if not equal:
        break
White did not have an equal TPR or FPR as Black
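Since the criterion is approximate balance, exact equality is a very strict test. A hedged alternative is to compare the spread of rates against a tolerance; the 0.05 threshold below is my own assumption, not part of the original audit:

Code
# Assumed tolerance for "approximately equal" error rates
TOL = 0.05
tpr_spread = audit["tpr"].max() - audit["tpr"].min()
fpr_spread = audit["fpr"].max() - audit["fpr"].min()
print("Approximately balanced:", tpr_spread <= TOL and fpr_spread <= TOL)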

Statistical parity:

Code
probs = []
for group in groups:
    prob = (pred == True)[group_test == group].mean()
    probs.append(prob)

parity= pd.DataFrame(groups, columns=["race"])
parity["prob"] = probs
parity
race prob
0 1 0.551757
1 8 0.563121
2 9 0.511027
3 2 0.550063
4 6 0.635112
5 7 0.000000
6 5 0.000000
7 3 0.500000

The model does not meet statistical parity, which means that not all groups have an equal chance of receiving a favorable prediction. Therefore, the probability of predicting employment is not independent of race.
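Formally, statistical parity would require the rate of positive predictions to be identical across groups:

\[
P(\hat{Y} = 1 \mid R = r) = P(\hat{Y} = 1 \mid R = r') \quad \text{for all groups } r, r',
\]

whereas the table above shows these probabilities ranging from 0 to about 0.64.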

Feasible FNR and FPR Rates

Add prevalence to the data table
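The prevalence \(p\) comes from the identity relating PPV to prevalence and the error rates, rearranged to the expression used in the code below:

\[
\mathrm{PPV} = \frac{p \cdot \mathrm{TPR}}{p \cdot \mathrm{TPR} + (1 - p) \cdot \mathrm{FPR}}
\quad \Longrightarrow \quad
p = \left(1 + \frac{\mathrm{TPR}}{\mathrm{FPR}} \cdot \frac{1 - \mathrm{PPV}}{\mathrm{PPV}}\right)^{-1}.
\]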

Code
audit["p"] = (1 + (audit["tpr"] / audit["fpr"]) * ((1 - audit["ppv"])/(audit["ppv"]))) ** -1
audit["p"] = audit["p"].fillna(0)
audit
print(audit)
          race  accuracy       ppv       tpr       fpr       fnr         p
0        White  0.824338  0.802492  0.869107  0.222158  0.130893  0.509466
3        Black  0.798479  0.752525  0.889552  0.264865  0.110448  0.475177
7  N. American  0.843750  0.769001  0.871622  0.214385  0.128378  0.450190
6         SPAA  1.000000  0.739030  0.874317  0.267139  0.125683  0.463878
4        Asian  0.821859  0.807588  0.901664  0.283433  0.098336  0.568847
5          NPI  0.500000  0.000000  0.000000  0.000000  1.000000  0.000000
1        Other  0.808511  0.000000  0.000000  0.000000  0.000000  0.000000
2        Multi  0.824335  0.666667  1.000000  0.227273  0.000000  0.312500

Here I plot the feasible FPR and FNR combinations for the Black and White groups.
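The plotted lines come from the same identity: holding PPV fixed at the Black group's value and substituting \(\mathrm{TPR} = 1 - \mathrm{FNR}\) gives the feasible combinations

\[
\mathrm{FPR} = \frac{p}{1 - p} \cdot \frac{1 - \mathrm{PPV}}{\mathrm{PPV}} \cdot (1 - \mathrm{FNR}).
\]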

Code
import seaborn as sns

# Filter to only include Black and White groups
filtered = audit[audit["race"].isin(["Black", "White"])]

orange_color = "#E69F00"
black_color = "#000000"

# A cleaner table
feasible = filtered[["race", "fpr", "fnr", "p"]].copy() 
lines = []

# Make fixed ppv based on the black ppv
fixed_ppv = filtered.loc[filtered["race"] == "Black", "ppv"].values[0]
fnr_range = np.linspace(0, 1, 100)

# Compute feasible FPR for different FNR values
for i, row in feasible.iterrows():
    race = row["race"]
    p = row["p"]  
    
    fprs = (p / (1 - p)) * ((1 - fixed_ppv) / fixed_ppv) * (1 - fnr_range)

    for fnr, fpr in zip(fnr_range, fprs):
        lines.append({"race": race, "fnr": fnr, "fpr": fpr})

lines_df = pd.DataFrame(lines)

plt.figure(figsize=(7, 5))
sns.set_style("whitegrid")

# Plot observed (fnr, fpr)
for i, row in feasible.iterrows():
    color = orange_color if row["race"] == "White" else black_color
    plt.scatter(row["fnr"], row["fpr"], color=color)

# Plot feasible (fnr, fpr) line
for race, color in zip(["White", "Black"], [orange_color, black_color]):
    line = lines_df[lines_df["race"] == race]
    plt.plot(line["fnr"], line["fpr"], color=color)

plt.xlabel("False Negative Rate")
plt.ylabel("False Positive Rate")
plt.title("Feasible (FNR, FPR) combinations")
plt.show()

Based on this plot, to get equal false positive rates we would need to reduce \(\mathrm{FNR}_w\) by about 0.04.

Acknowledgements

Adapted code from Professor Phil Chodrow at Middlebury College in his class CSCI 0451: Machine Learning.