Triplebyte Blog


Racist Robots: Auditing Machine Learning Systems for Bias

By Jen Ciarochi on Jul 1, 2020


“People worry that computers will get too smart and take over the world, but the real problem is that they’re too stupid and they’ve already taken over the world.” -Pedro Domingos, The Master Algorithm

First of all, “Biased Bots” is a better description of the current technological landscape than “Racist Robots.” Fortunately, we can deal with biases in AI before robots can actually walk around spouting offensive nonsense. Think of “Racist Robots” as a preventable future dystopian scenario.


A dictionary entry from a dystopian future.

In fact, many engineers are optimistic that AI can help identify and even diminish the effects of human prejudices, but the road ahead is convoluted. Studies show that some AI systems disproportionately disadvantage groups that are already marginalized due to ethnicity, gender, or socioeconomic status.

Evidence of bias is, in part, what drove Amazon, Microsoft, and IBM to backpedal on their controversial facial recognition businesses1, a retreat catalyzed by mounting pressure following George Floyd’s death at the hands of police officers.

Biased decision-making certainly isn’t unique to AI systems, but in many ways, it is uniquely discoverable in these systems. Biases in AI systems have been detected in law enforcement, banking, insurance, hiring, and healthcare. Most of these issues stem from training data that are not representative (e.g., a data set containing only white people) or are unintentionally embedded with prejudices (e.g., historical hiring data comparing males and females).

Biased data beget biased models, which beget biased data, and so on (notably, under certain conditions, biased decision-making can also result in “algorithmic affirmative action”2). Machine learning models can get caught in feedback loops that exacerbate biases. Consider the Strategic Subject List (SSL), which Chicago used from 2012 to late 2019 to identify likely victims or perpetrators of violent crimes3. Predictive policing systems, like SSL, often rely on historical crime and arrest data to pinpoint neighborhoods that require more policing.

These predictions become self-fulfilling prophecies: police patrol the flagged areas more heavily and, in turn, discover more crimes and make more arrests there than in less-patrolled areas. As a result, more records from those neighborhoods are fed into crime databases, while similar crimes in other areas are overlooked. When the existing model is retrained (or new algorithms are trained) on the updated data, the cycle deepens the bias and can lead to over-policing. SSL was scrutinized by many groups for its reliance on self-generated data—among other issues—and ultimately did not reduce crime. In short, even the most accurate model can’t reduce crime if what it accurately predicts is social injustice. Predictive policing is hardly the only place such biases have surfaced:

• COMPAS, a system used by judges to inform decisions about pretrial inmate release, incorrectly flags black defendants as probable repeat offenders nearly twice as often as their white counterparts4.
• Google adjusted its Google Photos image recognition system after it classified black people as gorillas5.
• Facial recognition software in Nikon cameras erroneously warned Asian users that they were blinking6.
• Google Translate debuted gender-specific translations7 after researchers found that it defaults to masculine pronouns (he, him), even when translating texts that specifically refer to females8.
• The German job portal Xing ranks female candidates below less-qualified male candidates9.
• Researchers from MIT and Microsoft tested three commercial gender classification systems, and found that the error rate was the highest for darker-skinned women (maximum error rate: 34.7%) and the lowest for lighter-skinned men (maximum error rate: 0.8%)10.
• Amazon discovered that an internal recruiting tool was discriminating against female job candidates, and traced the problem to its training on historical hiring data—which favor men11. After engineers reprogrammed the model to ignore explicitly gendered words, it began making decisions based on gender-correlated words instead. Ultimately, Amazon scrapped the system altogether, illustrating how difficult it is to retroactively eliminate bias.

What Makes a Model Fair?

One difficulty of designing a fair model is defining what exactly constitutes a fair model. There is no consensus on the best definition or mathematical formulation of fairness. Fairness is defined in many ways12—and some of these definitions contradict one another. For example, predictive parity (equal fractions of correct positive predictions among groups), equal false positive rates, and equal false negative rates are all definitions of fairness. However, it is mathematically impossible to satisfy all three definitions when there are strong group differences in prevalence (positive outcomes)13.

Prioritizing Bias and Fairness Metrics

Since there is no consensus on the definition of fairness, a promising approach is to use tools that test for several bias and fairness metrics and, on a case-by-case basis, focus on the metrics that are most appropriate for a given situation.

The first factor to consider is the availability of outcome (i.e., label) data. A model can be tested for bias after it makes predictions but before the actual outcomes are known; in these cases, unsupervised bias metrics, which are calculated without outcome data, can be used. Unsupervised metrics assess the distribution of predictions across groups; common examples include Predicted Positive Rate (PPR) and Predicted Prevalence (PPrev).

A model can also be tested for bias after the outcomes are known, at which point supervised bias metrics can be leveraged. Supervised metrics are error-based, and are calculated using outcomes as well as predictions. Examples of common supervised metrics include False Discovery Rate (FDR), False Omission Rate (FOR), False Positive Rate (FPR), and False Negative Rate (FNR).

The relationship between these fairness metrics and different definitions of fairness is outlined in the table below.

| If this metric is equal between groups… | …then this definition of fairness is satisfied | Mathematically equivalent metric |
| --- | --- | --- |
| False Discovery Rate (FDR) | Predictive parity | If the FDR of a classifier is equal for two groups, the Positive Predictive Value (PPV) is also equal |
| False Negative Rate (FNR) | Equal opportunity | If the FNR of a classifier is equal for two groups, the True Positive Rate (TPR) is also equal |
| False Positive Rate (FPR) | Predictive equality | If the FPR of a classifier is equal for two groups, the True Negative Rate (TNR) is also equal |
| Predicted Positive Rate (PPR) | Statistical (or demographic) parity | - |
| Predicted Prevalence (PPrev) | Impact parity | - |

Aside from the availability of outcome data, the appropriateness of a fairness metric essentially depends on whether it's more important to minimize false positives or false negatives. In other words, interventions can be harmful or beneficial—and policymakers want to avoid disproportionately punishing or withholding benefits from particular groups.

If an intervention is costly or harmful, false positives are the primary concern; it is definitely inappropriate to punish someone because they were flagged incorrectly. In these cases, FDR and FPR are particularly important fairness metrics.

On the other hand, if an intervention is beneficial, false negatives are a bigger concern. It is inappropriate to exclude someone who is in need, but assisting someone who isn't in need will not harm them. In these cases, FOR and FNR are very relevant metrics.

These principles are conveniently captured by the “Fairness Tree” below, developed by the creators of the open-source, bias-detecting toolkit Aequitas14; this diagram links six common bias and fairness metrics to real-world applications.


Detecting Bias: Auditing AI

By comparing fairness metrics across groups, AI “auditors” like Aequitas can test machine learning models for bias. Other AI-auditing toolkits include FairTest15, FairML16, Google’s ML-fairness-gym17, and IBM’s AI Fairness 36018. The table below summarizes the terms in the mathematical formulations of six bias and fairness metrics used in AI audits19.

| Notation | Name | Definition and Equation |
| --- | --- | --- |
| A | Attribute | An attribute with multiple values (e.g., gender = {female, male, other}); A = {a1, a2, …, an} |
| ai | Group | A group that shares the same attribute value (e.g., gender = female) |
| ar | Reference group | A group defined by A that is used as a reference to calculate bias and fairness disparity measures |
| FNg | False negatives | Type II errors; the number of predicted negatives in a group that are actually positive; (Ŷ = 0 ∧ Y = 1) |
| FPg | False positives | Type I errors; the number of predicted positives in a group that are actually negative; (Ŷ = 1 ∧ Y = 0) |
| K | Total predicted positives | The total number of predicted positives across all groups defined by A |
| LNg | Labeled negatives | The number of labeled negatives (i.e., negative outcomes) in a group |
| LPg | Labeled positives | The number of labeled positives (i.e., positive outcomes) in a group |
| PNg | Predicted negatives | The number of people in a group that the model predicts are negative; (Ŷ = 0) |
| PPg | Predicted positives | The number of people in a group that the model predicts are positive; (Ŷ = 1) |
| Y | Outcome | The true binary classification (i.e., label)—negative (0) or positive (1); Y ∈ {0,1} |
| Ŷ | Decision | A binary prediction—negative (0) or positive (1)—assigned based on thresholding (e.g., top K); Ŷ ∈ {0,1} |

Below are equations for and graphic representations of these six bias and fairness metrics, along with corresponding Python code (Aequitas source code20 is used as an example). In the Python code, each lambda function includes the following arguments:

  • rank_col contains the model’s predictions, expressed as ranks (the highest-scoring entities have the lowest rank values).
  • label_col contains the actual outcomes, where 0 is negative and 1 is positive.
  • thres is a classification threshold: any rank at or below the threshold results in positive classification.
  • k is a predefined number specifying how many people the model should classify as positive[1].

Additionally, divide = lambda x, y: x / y if y != 0 else np.nan is used throughout—a helper that simply returns NaN instead of dividing when the denominator is 0.

1. Predicted Prevalence (PPrev)
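In the notation defined above, predicted prevalence is the fraction of a group that the model predicts positive:

$$PPrev_{g} = \frac{PP_{g}}{|g|}$$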


Python Code

# (x[rank_col] <= thres).sum() corresponds to predicted positives (PPg), and adds up all instances 
# of rank_col (predictions) that fall within the threshold for positive classification. 
# len(x) + 0.0 corresponds to |g|, and returns the number of entities in a group.
predicted_pos_ratio_g = lambda rank_col, label_col, thres, k: lambda x: \
 divide((x[rank_col] <= thres).sum(), len(x) + 0.0)


2. Predicted Positive Rate (PPR)
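In the notation above, the predicted positive rate is the share of all predicted positives that belongs to a group:

$$PPR_{g} = \frac{PP_{g}}{K}$$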


Python Code

# (x[rank_col] <= thres).sum() corresponds to predicted positives (PPg). 
# k + 0.0 corresponds to K, and returns the number of entities the model 
# predicts are positive (across all groups). 

predicted_pos_ratio_k = lambda rank_col, label_col, thres, k: lambda x: \
 divide((x[rank_col] <= thres).sum(), k + 0.0)


3. False Discovery Rate (FDR)
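In the notation above, the false discovery rate is the fraction of a group’s predicted positives that are actually negative:

$$FDR_{g} = \frac{FP_{g}}{PP_{g}}$$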


Python Code

# ((x[rank_col] <= thres) & (x[label_col] == 0)).sum() corresponds to false positives (FPg), 
# and adds up all entities that fall within the threshold for positive classification (rank_col predictions), 
# but are actually negatives (i.e., have a label_col value of 0). 
# (x[rank_col] <= thres).sum() corresponds to predicted positives (PPg).

fdr = lambda rank_col, label_col, thres, k: lambda x: \
 divide(((x[rank_col] <= thres) & (x[label_col] == 0)).sum(), 
       (x[rank_col] <= thres).sum().astype(float))


4. False Omission Rate (FOR)
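In the notation above, the false omission rate is the fraction of a group’s predicted negatives that are actually positive:

$$FOR_{g} = \frac{FN_{g}}{PN_{g}}$$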


Python Code

# ((x[rank_col] > thres) & (x[label_col] == 1)).sum() corresponds to false negatives (FNg), 
# and adds up all entities that fall within the negative classification threshold, but are actually positives. 
# (x[rank_col] > thres).sum() corresponds to predicted negatives (PNg), and adds up all 
# entities that fall within the threshold for negative classification.

fomr = lambda rank_col, label_col, thres, k: lambda x: \
 divide(((x[rank_col] > thres) & (x[label_col] == 1)).sum(),
        (x[rank_col] > thres).sum().astype(float))


5. False Positive Rate (FPR)
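In the notation above, the false positive rate is the fraction of a group’s labeled negatives that are incorrectly predicted positive:

$$FPR_{g} = \frac{FP_{g}}{LN_{g}}$$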


Python Code

# ((x[rank_col] <= thres) & (x[label_col] == 0)).sum() corresponds to false positives (FPg). 
# (x[label_col] == 0).sum() corresponds to labeled negatives (LNg), 
# and adds up all entities with a label_col value of 0.

fpr = lambda rank_col, label_col, thres, k: lambda x: \
 divide(((x[rank_col] <= thres) & (x[label_col] == 0)).sum(),
        (x[label_col] == 0).sum().astype(float))


6. False Negative Rate (FNR)
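In the notation above, the false negative rate is the fraction of a group’s labeled positives that are incorrectly predicted negative:

$$FNR_{g} = \frac{FN_{g}}{LP_{g}}$$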


Python Code

# ((x[rank_col] > thres) & (x[label_col] == 1)).sum() corresponds to false negatives (FNg). 
# (x[label_col] == 1).sum() corresponds to labeled positives (LPg), 
# and adds up all entities with a label_col value of 1.

fnr = lambda rank_col, label_col, thres, k: lambda x: \
 divide(((x[rank_col] > thres) & (x[label_col] == 1)).sum(),
        (x[label_col] == 1).sum().astype(float))
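As a minimal, self-contained sketch of how these functions are applied in practice, the snippet below computes FPR per group with pandas. The toy data and the group labels are invented for illustration; Aequitas itself wraps this logic in a groupby over each attribute column.

```python
import numpy as np
import pandas as pd

# Division guard, as in the Aequitas source: return NaN for a zero denominator.
divide = lambda x, y: x / y if y != 0 else np.nan

# FPR from above: false positives divided by labeled negatives.
fpr = lambda rank_col, label_col, thres, k: lambda x: \
    divide(((x[rank_col] <= thres) & (x[label_col] == 0)).sum(),
           (x[label_col] == 0).sum().astype(float))

# Toy data: 8 people ranked 1 (highest risk) to 8; ranks <= 4 are predicted positive.
df = pd.DataFrame({
    "rank":  [1, 2, 3, 4, 5, 6, 7, 8],
    "label": [1, 0, 1, 0, 0, 1, 0, 0],
    "group": ["a", "a", "b", "b", "a", "b", "a", "b"],
})

# Compute the metric separately for each group.
fpr_by_group = df.groupby("group")[["rank", "label"]].apply(fpr("rank", "label", 4, 4))
print(fpr_by_group["a"])  # 1 FP (rank 2) out of 3 labeled negatives -> 0.333...
print(fpr_by_group["b"])  # 1 FP (rank 4) out of 2 labeled negatives -> 0.5
```

A difference like this (0.33 vs. 0.5) is exactly what the disparity measures in the next section quantify.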


Bias and Fairness Disparity Measures

Once bias metrics—like PPrev, PPR, FDR, FOR, FPR, and FNR—are calculated for each group, they can be compared to those of a reference group to calculate disparity measures. The reference group can be selected based on different criteria, such as majority status (i.e., the largest population) or historical favoritism.

For example, predicted prevalence disparity is defined as:

$$PPrev_{g}disp = \frac{PPrev_{a_{i}}} {PPrev_{a_{r}}}$$

Similarly, false positive rate disparity is defined as:

$$FPR_{g}disp = \frac{FPR_{a_{i}}} {FPR_{a_{r}}}$$

The disparity metrics can then be tested for fairness against the flexible parameter τ ∈ (0,1] to provide a range of disparity values that are considered fair.

$$ τ ≤ DisparityMeasure_{g_{i}} ≤ \frac {1} {τ} $$

τ could, for example, be set to 0.8 to adhere to the 80% rule—a threshold adopted in 1970s US fair-employment guidelines to assess adverse impact on minority groups. In essence, the rule states that companies should hire applicants from minority groups at no less than 80% of the rate at which they hire applicants from non-minority groups. For instance, a business hiring 50% of its male applicants should also hire at least 40% of its female applicants.
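This test can be sketched in a few lines of Python; the hiring rates below are the ones from the 80% rule example.

```python
def passes_fairness_test(metric_group, metric_reference, tau=0.8):
    """True if the group's disparity (its metric divided by the reference
    group's metric) falls within the fair range [tau, 1/tau]."""
    disparity = metric_group / metric_reference
    return tau <= disparity <= 1 / tau

# 80% rule: hiring 40% of female applicants vs. 50% of male applicants
# yields a disparity of 0.8, which just passes at tau = 0.8.
print(passes_fairness_test(0.40, 0.50))  # True  (disparity = 0.8)
print(passes_fairness_test(0.30, 0.50))  # False (disparity = 0.6 < tau)
print(passes_fairness_test(0.70, 0.50))  # False (disparity = 1.4 > 1/tau)
```

Note that the test is two-sided: a group can fail by being disadvantaged (disparity below τ) or by being favored (disparity above 1/τ).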

Example: an Aequitas Bias Report Using COMPAS Data

COMPAS is controversial risk-assessment software used across the United States to predict the likelihood that a defendant will reoffend. ProPublica reported that COMPAS incorrectly labels black defendants as high-risk reoffenders at nearly twice the rate of white defendants, while white defendants are much more likely to be incorrectly labeled as low-risk reoffenders21. ProPublica's COMPAS data22 include recidivism risk scores, recidivism outcomes (2-year), and demographic variables from over 7,000 people. Using the same data, does Aequitas reveal similar biases?

In short, yes. COMPAS helps judges make punitive decisions, so FPR and FDR parity are the most relevant bias metrics (i.e., the most unfavorable outcome is unfairly punishing people). When Aequitas audits COMPAS on these two metrics for the race attribute alone (with Caucasian as the reference group), COMPAS fails the FPR disparity test. As ProPublica reported, the FPR is nearly twice as high for black defendants as for white defendants. While COMPAS doesn't appear biased against black people based on FDR parity, it still fails the FDR disparity test for race due to disproportionately lower rates for Asians and Native Americans. Note that the value for the reference group (in this case, Caucasians) is 1, because the disparity metrics are comparisons with the reference group.


In fact, COMPAS fails based on every fairness metric for race, not just the two most relevant ones. COMPAS also fails all but one test for age bias and half the tests for sex bias.


Disparity metrics for age, sex, and race.


Bias metrics for age, sex, and race.

The Model Is Biased. Now What?

Bias in machine learning systems can be corrected before training, during training, or after the model makes predictions. These methods are briefly introduced here, and will be explored in more detail in a forthcoming Triplebyte article.

Correcting Bias Before Training

Information that can lead to unfair decisions can be removed from the training data before training; this is sometimes called preprocessing. Importantly, preprocessing is not as simple as removing sensitive variables, because other variables can be highly correlated with them; one approach that addresses this issue uses a learning algorithm that finds the best representation of the data while simultaneously obscuring sensitive information (e.g., gender, ethnicity, income) and any information correlated with it23.

Another example of a preprocessing algorithm is reweighing. Reweighing compensates for bias by assigning lower weights to favored individuals and higher weights to unfavored ones24. Imbalance in a data set can also be mitigated by resampling, wherein instances of an underrepresented group are added (oversampling) or instances of an overrepresented group are removed (undersampling).
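In the Kamiran and Calders formulation cited above, each (group, label) combination receives the weight P(group) × P(label) / P(group, label), which up-weights combinations that are rarer than independence would predict. Here is a minimal sketch with invented toy data:

```python
import pandas as pd

def reweighing_weights(df, group_col, label_col):
    """Weight for each (group, label) combination:
    P(group) * P(label) / P(group, label).
    Under an unbiased (independent) distribution every weight is 1;
    deviations from 1 mark combinations the weights must compensate for."""
    n = len(df)
    p_group = df[group_col].value_counts() / n
    p_label = df[label_col].value_counts() / n
    p_joint = df.groupby([group_col, label_col]).size() / n
    return {(g, y): p_group[g] * p_label[y] / p_joint[(g, y)]
            for (g, y) in p_joint.index}

# Toy hiring data in which positive outcomes are concentrated in one group.
df = pd.DataFrame({
    "gender": ["m", "m", "m", "m", "f", "f", "f", "f"],
    "hired":  [1, 1, 1, 0, 1, 0, 0, 0],
})
weights = reweighing_weights(df, "gender", "hired")
print(weights[("f", 1)])  # 2.0  (hired women are underrepresented -> up-weighted)
print(weights[("m", 1)])  # 0.666... (hired men are overrepresented -> down-weighted)
```

Training on these instance weights makes the weighted data look as if group and outcome were independent, without dropping or duplicating any rows.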

Pros:
• The classifier doesn’t need to be modified.
• The preprocessed data can be used for any machine learning task.

Cons:
• Other methods often achieve better accuracy and fairness.

Correcting Bias During Training

Bias can also be corrected during training. One approach uses one or more “fairness” constraints to guide the model and ensure that similar individuals are treated equitably across subpopulations25. For example, the optimization objective of the algorithm could include the condition that the false positive rate of the protected group is equal to that of the other individuals in the data set.

Another method is adversarial debiasing, which simultaneously trains two classifiers—a predictor and an adversary26. The predictor aims to predict a target variable by minimizing some loss function, while the adversary tries to predict a sensitive variable (given the raw output of the predictor) by minimizing a different loss function. The goal is for the predictor to minimize the first loss function while maximizing the second, such that the adversary fails to predict the sensitive variable.

Pros:
• Fairness is improved without compromising accuracy.
• The programmer can focus on improving specific fairness metrics.

Cons:
• The classifier code must be modified, which is not always feasible.

Correcting Bias After Training

Finally, bias can be corrected after training by adjusting the classifier’s outputs to improve fairness. One approach plots the true positive rate against the false positive rate (a receiver operating characteristic, or ROC, curve) for each group, then selects thresholds at which these rates are equal between protected and unprotected groups.
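Here is a minimal sketch of the idea (the scores, labels, and target rate are invented for illustration): each group gets its own threshold, chosen where its ROC curve comes closest to a shared target false positive rate.

```python
import numpy as np

def rates_at_threshold(scores, labels, thres):
    """TPR and FPR when entities scoring at or above thres are classified positive."""
    pred = scores >= thres
    tpr = (pred & (labels == 1)).sum() / max((labels == 1).sum(), 1)
    fpr = (pred & (labels == 0)).sum() / max((labels == 0).sum(), 1)
    return tpr, fpr

def threshold_for_target_fpr(scores, labels, target_fpr):
    """Pick the group-specific threshold whose FPR is closest to the target,
    i.e., walk the group's ROC curve and stop at the chosen operating point."""
    candidates = np.unique(scores)
    gaps = [abs(rates_at_threshold(scores, labels, t)[1] - target_fpr)
            for t in candidates]
    return candidates[int(np.argmin(gaps))]

# Toy scores and labels for two groups.
scores_a = np.array([0.9, 0.8, 0.6, 0.4, 0.2]); labels_a = np.array([1, 1, 0, 1, 0])
scores_b = np.array([0.85, 0.7, 0.55, 0.35, 0.15]); labels_b = np.array([1, 0, 1, 0, 0])

# Each group gets its own threshold, aligned to the same target FPR.
t_a = threshold_for_target_fpr(scores_a, labels_a, target_fpr=0.5)
t_b = threshold_for_target_fpr(scores_b, labels_b, target_fpr=0.5)
print(t_a, t_b)  # 0.4 0.35
```

The same search could target equal TPR (equal opportunity) instead; the trade-off, as noted below, is that the protected attribute must be available at decision time to apply the right threshold.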

Pros:
• The classifier doesn’t need to be modified.
• Fairness is improved.

Cons:
• There is less flexibility for balancing accuracy and fairness.
• Protected attributes must be accessed during test time.

If All Else Fails, There’s Always the Big Red Button?

The ability to interrogate machine learning systems to uncover bias is incredibly valuable, and we should avail ourselves of the opportunity.

Hopefully, we will eradicate biases from AI long before we arrive at “Racist Robots.” Barring that, researchers at Google DeepMind and Oxford’s “Future of Humanity Institute”[2] have developed a framework involving a big red button that can interrupt wayward AI (as well as prevent said AI from learning how to thwart these interruptions)27.

Have you ever experienced or worked on bias in AI systems? (Maybe you’re a “guerrilla auditor,” like the guy who used a crawler to simulate a recruiter and test resume search engines for gender bias28). What do you think are the toughest problems and the most promising solutions? Leave a comment and let us know!

[1] While other machine learning models often use accuracy or AUC-ROC as the metric of interest (i.e., the metric being maximized), many public policy models instead use precision at top k; this is often a consequence of limited resources, which require that the top k individuals receive an assistive or punitive intervention.

[2] The future of humanity seems like a heavy, ambitious goal for one institute, so I’m glad they’re at least collaborating.


  1. Hale, Kori. “Amazon, Microsoft & IBM Slightly Social Distancing From The $8 Billion Facial Recognition Market,” June 15, 2020.
  2. Sweeney, Annie, and Jeremy Gorner. “For Years Chicago Police Rated the Risk of Tens of Thousands Being Caught up in Violence. That Controversial Effort Has Quietly Been Ended.” Chicago Tribune, January 25, 2020.
  3. Angwin, Julia, Jeff Larson, Surya Mattu, and Lauren Kirchner. “Machine Bias.” ProPublica, May 23, 2016.
  4. Vincent, James. “Google 'Fixed' Its Racist Algorithm by Removing Gorillas from Its Image-Labeling Tech.” The Verge, January 12, 2018.
  5. Crawford, Kate. “Artificial Intelligence's White Guy Problem.” The New York Times. The New York Times, June 25, 2016.
  6. Wiggers, Kyle. “Google Debuts AI in Google Translate That Addresses Gender Bias.” VentureBeat. VentureBeat, April 22, 2020.
  7. “Machine Translation: Analyzing Gender.” Machine Translation | Gendered Innovations. Stanford University. Accessed July 27, 2020.
  8. Lahoti, Preethi, Krishna P. Gummadi, and Gerhard Weikum. “IFair: Learning Individually Fair Data Representations for Algorithmic Decision Making.” 2019 IEEE 35th International Conference on Data Engineering (ICDE), 2019.
  9. Buolamwini, Joy, and Timnit Gebru. “Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification.” Proceedings of the 1st Conference on Fairness, Accountability and Transparency, 2018.
  10. Dastin, Jeffrey. “Amazon Scraps Secret AI Recruiting Tool That Showed Bias against Women.” Reuters. Thomson Reuters, October 10, 2018.
  11. Narayanan, Arvind. “TL;DS - 21 Fairness Definition and Their Politics by Arvind Narayanan.” TL;DS - 21 fairness definition and their politics , July 19, 2019.
  12. Chouldechova, Alexandra. “Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments.” Big Data 5, no. 2 (2017): 153–63.
  13. “Aequitas.” Center for Data Science and Public Policy, February 12, 2020.
  14. “Columbia/Fairtest.” GitHub, May 29, 2017.
  15. “Adebayoj/Fairml.” GitHub, March 23, 2017.
  16. “Google/Ml-Fairness-Gym.” GitHub. Google, June 17, 2020.
  17. “AI Fairness 360 Open Source Toolkit.” AI Fairness 360. IBM, n.d.
  18. Saleiro, Pedro, Benedict Kuester, Loren Hinkson, Jesse London, Abby Stevens, Ari Anisfeld, Kit T. Rodolfa, and Rayid Ghani. “Aequitas: A Bias and Fairness Audit Toolkit.”, April 29, 2019.
  19. “Source Code for” - aequitas documentation, 2018.
  20. “Propublica/Compas-Analysis.” GitHub. Propublica, 2017.
  21. Zemel, Richard, Yu Wu, Kevin Swersky, Toniann Pitassi, and Cynthia Dwork. “Learning Fair Representations.” ICML’13: Proceedings of the 30th International Conference on Machine Learning, 2013.
  22. Kamiran, Faisal, and Toon Calders. “Data Preprocessing Techniques for Classification without Discrimination.” Knowledge and Information Systems 33, no. 1 (2011): 1–33.
  23. Dwork, Cynthia, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. “Fairness through Awareness.” Proceedings of the 3rd Innovations in Theoretical Computer Science Conference on - ITCS '12, 2012.
  24. Zhang, Brian Hu, Blake Lemoine, and Margaret Mitchell. “Mitigating Unwanted Biases with Adversarial Learning.” Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, 2018.
  25. Orseau, Laurent, and Stuart Armstrong. “Safely Interruptible Agents.” DeepMind. Google DeepMind, 2016.
  26. Chen, Le, Ruijun Ma, Anikó Hannák, and Christo Wilson. “Investigating the Impact of Gender on Rank in Resume Search Engines.” Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems - CHI '18, 2018.
