# Bayesian Inference for Hiring Engineers

## By Mike Robbins on May 30, 2018

To get an interview for a technical position, an engineer must run a gauntlet. Their resume has to get past a recruiter or hiring manager. They have to sound excited and competent on a culture fit phone call. And they need to complete either an online test or a technical phone screen. Often, there’s a substantial take-home project as well. All this comes before the final on-site interview.

At every step in this process, the company is legitimately incentivized to reject. Each stage is significantly more expensive than the last for both the candidate and the company. And most applicants for most jobs ARE underqualified. Companies want to screen them out early.

The signals that go into every screening decision, however, are noisy. This is a messy world of overlapping signals and biases. And the recruiters and hiring managers making these decisions are guided primarily by pattern matching.

We can do better. And Bayesian statistics (updating probabilistic beliefs using evidence) can help! Rather than thinking of each technical screening step as a crisp yes/no test, we should think of them as evidence updating our prior over a candidate’s skill. This can't totally replace the role of pattern matching and gut calls in recruiting. But I think it can produce some valuable insights that lead to a more fair and efficient hiring process. Bear with me...

## Who Moves Forward?

I’ll define an indicator variable $z$ such that:

$$z_{A,P} = \begin{cases}1 & \text {if applicant } A \text{ would get an offer after interviewing for position } P \\ 0 & \text{otherwise}\end{cases}$$

Consciously or not, recruiters and hiring managers use whatever evidence they have available at the time to create an estimate of whether they’d ultimately give this person an offer. My pre-interview estimate $\hat{z}$ can be framed as a conditional probability:

$$\hat{z} = P(z_{A,P} = 1 | E)$$

Given whatever evidence $E$ I’ve collected so far, how likely is it they’d get an offer if I brought them all the way through our interview process? Intuitively, I won’t spend any more time with applicants for whom $\hat{z} = 0.001$ (just 0.1% chance of getting an offer), but will definitely continue with the $\hat{z} = 0.8$ applicant (80% chance).

More formally, if $\hat{z}$ meets or exceeds some threshold, I’ll proceed to the next step, gathering more evidence, from which I’ll calculate a better estimate. My decision to interview is:

$$d_{A,P} = \begin{cases}1 & \text {if } \hat{z} \ge z_\text{threshold} \\ 0 & \text{otherwise}\end{cases}$$
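
As a minimal sketch of this decision rule in Python: the function name and the default threshold of 0.1 are illustrative assumptions, since no specific value is fixed for $z_\text{threshold}$ here.

```python
def should_interview(z_hat: float, z_threshold: float = 0.1) -> bool:
    """Return True when the estimate is at or above the threshold.

    The default threshold of 0.1 is an illustrative assumption, not a
    value taken from the post.
    """
    if not 0.0 <= z_hat <= 1.0:
        raise ValueError("z_hat must be a probability in [0, 1]")
    return z_hat >= z_threshold


print(should_interview(0.001))  # False
print(should_interview(0.8))    # True
```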

These estimates and decisions happen as “instincts” or “gut feelings” in the minds of recruiters and hiring managers. Nobody whips out a calculator to compute probabilities as they scroll through their ATS.

This evidence-to-prediction process maps closely to Bayesian inference, where we start from assumed prior probabilities, collect data, and use the new data to update our beliefs about a probability distribution. After the final interview, I’ll know the true binary value of $z_{A,P}$, either $0$ or $1$. But at each step before the interview, I can quantify my current beliefs as a likelihood curve — a probability distribution on the domain $[0,1]$ — which represents my state of knowledge about the true value.

Rejecting people pre-interview saves valuable time — but my argument is that’s true **if and only if I’m actually rejecting the right people.**

## Costs of Misclassification Errors

If my probability estimate $\hat{z}$ is noisy or inaccurate enough to cross the decision threshold, I’ll make the wrong screening decision:

False positives on screening are expensive and obvious because I’m spending time interviewing and deciding not to extend an offer. Recruiters keep applying stricter screening because false positives are so loud and visible to their team, but there’s a mathematical tradeoff: stricter screening to reduce false positives has the hidden cost of increasing the false negative rate.

False negatives on screening may be less obvious to companies because they never actually do the interview, but they’re often just as expensive. The cost to find the next promising candidate is substantial. False negatives are especially painful to applicants who would have been a great fit but never get the chance to show it because of a 30-second glance at a resume.

**Bad estimates of $\hat{z}$ explain much of why the job search process is so onerous.**
Job-seekers get rejected by companies where they would have been a good fit, and they waste time on interviews where they aren’t. Recruiter and hiring manager gut calls create noisy estimates, biasing toward low-fidelity pattern matching, overweighting credentials and underweighting ability. This is why the interview-to-offer ratio is only about 20% for software engineers industry-wide.

Let’s see how I can use the Bayesian approach to build a better hiring process.

## One Screening Question

I’ll start with the simplest case: I’ll consider just one piece of binary evidence about a job applicant, which I’ll use to decide who to interview. For the first question, I’ll ask the applicant to write a small, straightforward program to filter and average some values from a CSV file, and set $E_1 = 1$ if they can build a basically working program, or $E_1 = 0$ if they can’t.

If the engineer does well on this coding exercise question, what’s $\hat{z}$ — how likely are they to be someone I’d give an offer to? And what about someone who gets it wrong?

To compute this, I’ll assume:

- People who I’d hire have a **50%** chance of answering this question correctly. (Even for something straightforward, it’s never 100%! Smart people get stuck sometimes, especially under pressure, and that’s OK. That’s why I combine multiple pieces of data, not just one.)
- That drops to about **5%** for people I wouldn’t give an offer to.
- **2%** of all applicants are people I’d give offers to if they went through the final interview.

Before you scroll down, make your best guess at these two probabilities in your head:

How likely is it that someone who passes the coding exercise is someone I’d make an offer to if they interviewed?

How many times greater is that than the person who can’t write the program?

(scroll for answers)

.

.

.

.

.

For the person who successfully writes the program, the math looks like this, using Bayes’ rule:

$$ \begin{align} \hat{z} & = P(z_{A, P} = 1 | E_1 = 1) \\ \\ \hat{z} & = \frac {P(z_{A, P} = 1) P(E_1 = 1 | z_{A, P} = 1)} {P(E_1 = 1)} \\ \\ \hat{z} & = \frac {0.02 \cdot 0.5} {0.02 \cdot 0.5 + (1 - 0.02) \cdot 0.05} \\ \\ \hat{z} & = 0.169 = 16.9\% \end{align} $$

The intuition behind the fraction is that some mix of hireable and non-hireable people will pass the screening question. Out of all people who’ll get the question right (denominator), what fraction is actually hireable (numerator)? Since non-hireable candidates are so prevalent in the applicant pool, they still make up 83.1% of people who can pass the exercise!

For the applicant who can’t build the program:

$$ \begin{align} \hat{z} & = P(z_{A, P} = 1 | E_1 = 0) \\ \\ \hat{z} & = \frac {P(z_{A, P} = 1) P(E_1 = 0 | z_{A, P} = 1)} {P(E_1 = 0)} \\ \\ \hat{z} & = \frac {0.02 \cdot (1 - 0.5)} {0.02 \cdot (1 - 0.5) + (1 - 0.02) \cdot (1 - 0.05)} \\ \\ \hat{z} & = 0.011 = 1.1\% \\ \end{align} $$

Did you correctly guess that a person who can build the program is about 16 times more likely to get an offer than an applicant answering incorrectly, but still has only a 1-in-6 shot of getting an offer? If you did, great: I’m about to make the problem harder for you. If not, you’ve just demonstrated that correctly predicting $\hat{z}$ is tough even when you precisely know the priors!
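
The two calculations above can be reproduced with a few lines of Python. This is a minimal sketch using the three assumed probabilities from this section; the function and constant names are illustrative.

```python
def posterior(prior: float, p_given_hire: float, p_given_no_hire: float) -> float:
    """P(z = 1 | evidence) via Bayes' rule for one piece of evidence.

    p_given_hire    = P(evidence | z = 1)
    p_given_no_hire = P(evidence | z = 0)
    """
    numerator = prior * p_given_hire
    denominator = numerator + (1.0 - prior) * p_given_no_hire
    return numerator / denominator


PRIOR = 0.02          # P(z = 1): share of applicants who would get an offer
PASS_IF_HIRE = 0.50   # P(E1 = 1 | z = 1)
PASS_IF_NOT = 0.05    # P(E1 = 1 | z = 0)

z_pass = posterior(PRIOR, PASS_IF_HIRE, PASS_IF_NOT)
z_fail = posterior(PRIOR, 1 - PASS_IF_HIRE, 1 - PASS_IF_NOT)

print(f"passed: {z_pass:.3f}")           # passed: 0.169
print(f"failed: {z_fail:.3f}")           # failed: 0.011
print(f"ratio:  {z_pass / z_fail:.1f}")  # ratio:  15.9
```

Changing `PRIOR` is a quick way to see how strongly the base rate of hireable applicants drives both estimates.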

## Two Screening Questions

Let’s make the model a bit more complicated. After the coding exercise, I’ll also ask everyone to describe how HTTP cookies work, and set $E_2 = 1$ if they give a basically satisfactory description, and $E_2 = 0$ if they don’t.

I’ll additionally assume:

- People who I’d hire have a **70%** chance of answering the cookies question correctly (i.e. $P(E_2 = 1 | z_{A, P} = 1) = 0.7$), and
- That drops to about **40%** for people I wouldn’t give an offer to (i.e. $P(E_2 = 1 | z_{A, P} = 0) = 0.4$).

With two binary questions, there are now $2^2 = 4$ possible states of data about an applicant, and for each state, I want to calculate their $\hat{z}$.

For this toy model, I’m going to make a simplifying assumption: the coding exercise and the cookies question are conditionally independent given whether the applicant is someone I’d hire. This makes the math simple enough to work out by hand, and is known as a Naive Bayes classifier.

Before scrolling, can you guess the probability of making an offer to the person who gets both questions right? Both wrong? One of each?

(scroll for answers)

.

.

.

.

.

For the person who gets both questions right:

$$ \begin{align} \hat{z} & = P(z_{A,P} = 1 | E_1 = 1, E_2 = 1) \\ \\ \hat{z} & = \frac {P(z_{A,P} = 1) P(E_1 = 1, E_2 = 1 | z_{A,P} = 1)} {P(E_1 = 1, E_2 = 1)} \\ \\ \hat{z} & = \frac {P(z_{A,P} = 1) P(E_1 = 1 | z_{A,P} = 1) P(E_2 = 1 | z_{A,P} = 1)} {P(E_1 = 1, E_2 = 1)} \\ \\ \hat{z} & = \frac{0.02 \cdot 0.5 \cdot 0.7} {0.02 \cdot 0.5 \cdot 0.7 + (1 - 0.02) \cdot 0.05 \cdot 0.4} \\ \\ \hat{z} & = 0.263 = 26.3\% \end{align} $$

For the person who gets both questions wrong:

$$ \begin{align} \hat{z} & = P(z_{A,P} = 1 | E_1 = 0, E_2 = 0) \\ \\ \hat{z} & = \frac {P(z_{A,P} = 1) P(E_1 = 0, E_2 = 0 | z_{A,P} = 1) } {P(E_1 = 0, E_2 = 0)} \\ \\ \hat{z} & = \frac {P(z_{A,P} = 1) P(E_1 = 0 | z_{A,P} = 1) P(E_2 = 0 | z_{A,P} = 1)} {P(E_1 = 0, E_2 = 0)} \\ \\ \hat{z} & = \frac {0.02 \cdot (1 - 0.5) \cdot (1 - 0.7)} {0.02 \cdot (1 - 0.5) \cdot (1 - 0.7) + (1 - 0.02) \cdot (1 - 0.05) \cdot (1 - 0.4)} \\ \\ \hat{z} & = 0.005 = 0.5\% \end{align} $$

The applicant who gets both right is about 49 times more likely to be someone I’d give an offer to than the person who gets both wrong. Wow! That’s exactly why companies do screening before inviting people to interview.

What about applicants who correctly answer only one of two questions? Take a guess at the $\hat{z}$ for those cases. Intuitively, it’ll be somewhere between 0.5% and 26.3%, but do you know which way it will skew?

(scroll for answers)

.

.

.

.

.

If someone gets the cookies question right but can’t do the coding exercise:

$$P(z_{A,P} = 1 | E_1 = 0, E_2 = 1) = 0.018 = 1.8\%$$

And if they can do the coding exercise, but not the cookies:

$$P(z_{A,P} = 1 | E_1 = 1, E_2 = 0) = 0.093 = 9.3\%$$
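
For reference, both values follow from the same Naive Bayes expansion used for the both-right and both-wrong cases, with the assumed probabilities substituted in:

$$ \begin{align} P(z_{A,P} = 1 | E_1 = 0, E_2 = 1) & = \frac {0.02 \cdot (1 - 0.5) \cdot 0.7} {0.02 \cdot (1 - 0.5) \cdot 0.7 + (1 - 0.02) \cdot (1 - 0.05) \cdot 0.4} = \frac {0.007} {0.3794} \approx 0.018 \\ \\ P(z_{A,P} = 1 | E_1 = 1, E_2 = 0) & = \frac {0.02 \cdot 0.5 \cdot (1 - 0.7)} {0.02 \cdot 0.5 \cdot (1 - 0.7) + (1 - 0.02) \cdot 0.05 \cdot (1 - 0.4)} = \frac {0.003} {0.0324} \approx 0.093 \end{align} $$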

Did you expect there to be a 5x difference between these two screening outcomes?

This simple example shows that most people’s intuition for guessing these probabilities isn’t great, even when all the inputs are known precisely.

Also note that these probabilities are independent of the order in which we ask the two questions. In reality, most recruiters put low-informativeness screening steps first and make sequential decisions, so although the coding exercise carries much more information, a bad recruiter could reject prematurely based on a low-informativeness feature like a resume.
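
The order-independence claim can be checked with a short sketch that applies one Bayesian update per question, in each possible order. It relies on the conditional independence assumption of the toy model; the names `QUESTIONS`, `update`, and `sequential_estimate` are illustrative.

```python
from itertools import permutations

PRIOR = 0.02  # P(z = 1)

# (P(E = 1 | z = 1), P(E = 1 | z = 0)) for each question, from the post.
QUESTIONS = {
    "coding": (0.5, 0.05),
    "cookies": (0.7, 0.4),
}


def update(belief: float, passed: bool, p_hire: float, p_no_hire: float) -> float:
    """One Bayesian update of P(z = 1) after a single pass/fail answer."""
    like_hire = p_hire if passed else 1.0 - p_hire
    like_no_hire = p_no_hire if passed else 1.0 - p_no_hire
    numerator = belief * like_hire
    return numerator / (numerator + (1.0 - belief) * like_no_hire)


def sequential_estimate(order, answers) -> float:
    """Apply the updates one question at a time, in the given order."""
    belief = PRIOR
    for name in order:
        belief = update(belief, answers[name], *QUESTIONS[name])
    return belief


# An applicant who passes the coding exercise but misses the cookies question.
answers = {"coding": True, "cookies": False}
for order in permutations(QUESTIONS):
    # Both orders print 0.093.
    print(order, round(sequential_estimate(order, answers), 3))
```

The estimate is the same either way; what changes in a sequential process is which applicants are rejected before the more informative question is ever asked.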

## N Screening Questions

The computational complexity of estimating $\hat{z}$ rises rapidly as our screening becomes more rigorous. If I ask 30 questions, I’ll have $2^{30} = \text{1,073,741,824}$ different probabilities to calculate, and I’ll require $2 \times 30 = 60$ question-level assumed probabilities to start from.

I also don’t have to restrict myself to purely yes-no questions: for example, I can grade their program on a “strong yes”, “weak yes”, “weak no”, “strong no” 4-point scale. Using categorical variables instead of indicator variables will make my predictions more accurate, but for 30 questions, I’ll now have $4^{30} = 1.2 \times 10^{18}$ probabilities to calculate — impossibly huge. If this simple model is beyond our ability to precompute, it’s clear to me that hiring managers and recruiters are just making a guess and moving on to the next resume in the pile.
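
The counts above are easy to verify, and they describe the full table of outcomes. For any one applicant, the Naive Bayes assumption still allows an estimate to be computed on demand with one term per question. The sketch below uses the standard log-odds form of Naive Bayes; it is an illustration under the toy model's assumptions, not a description of any production system, and the function name is hypothetical.

```python
import math


def naive_bayes_estimate(prior: float, likelihoods, answers) -> float:
    """P(z = 1 | answers) for N pass/fail questions under Naive Bayes.

    likelihoods: list of (P(E_i = 1 | z = 1), P(E_i = 1 | z = 0)) pairs
    answers:     list of booleans, one per question
    """
    # Work in log-odds so that many small factors do not underflow.
    log_odds = math.log(prior / (1.0 - prior))
    for (p_hire, p_no_hire), passed in zip(likelihoods, answers):
        if passed:
            log_odds += math.log(p_hire / p_no_hire)
        else:
            log_odds += math.log((1.0 - p_hire) / (1.0 - p_no_hire))
    return 1.0 / (1.0 + math.exp(-log_odds))


n_questions = 30
print(2 ** n_questions)           # 1073741824 possible pass/fail outcomes
print(2 * n_questions)            # 60 question-level probabilities
print(f"{4 ** n_questions:.1e}")  # 1.2e+18 outcomes on a 4-point scale

# Check against the two-question example from the post.
two_questions = [(0.5, 0.05), (0.7, 0.4)]
print(round(naive_bayes_estimate(0.02, two_questions, [True, True]), 3))  # 0.263
```

Doing this arithmetic in one's head while scanning a resume is another matter, which is the point of the paragraph above.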

I can also get more accuracy by going beyond the Naive Bayes model. A few tools we use here at Triplebyte are allowing for possible correlations between questions, considering the variance of the estimate to know our confidence, and building hierarchical models that extend beyond a single position or single company.

## Learning Priors

The most interesting part of the Bayesian framework is that after gathering data, I can update my priors for the next applicant. While the initial values of $P(E_i = 1 | z_{A,P} = j)$ might just be my guesses about the informativeness of certain questions (perhaps based on my current engineering team), over time my initial priors will be overwhelmed by actual data. Bayes’ theorem can tell me how much to update my priors for the next applicant.
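
One common way to make this update concrete is to hold a Beta distribution over each question-level probability and add one count per observed answer. The sketch below assumes that approach for illustration only; the `BetaBelief` class, the pseudo-counts, and the sample outcomes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BetaBelief:
    """Beta(alpha, beta) belief about one pass probability P(E_i = 1 | z = j)."""
    alpha: float  # pseudo-count of passes
    beta: float   # pseudo-count of failures

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

    def observe(self, passed: bool) -> "BetaBelief":
        """Return the updated belief after one applicant's answer."""
        if passed:
            return BetaBelief(self.alpha + 1, self.beta)
        return BetaBelief(self.alpha, self.beta + 1)


# Initial guess: hireable applicants pass the coding exercise 50% of the time,
# held about as firmly as 10 earlier observations (an illustrative choice).
belief = BetaBelief(alpha=5.0, beta=5.0)

# Hypothetical outcomes for eight applicants who later received offers.
for passed in [True, True, False, True, True, True, False, True]:
    belief = belief.observe(passed)

print(round(belief.mean, 3))  # 0.611
```

With only eight observations the estimate moves from 0.5 to about 0.61; as more outcomes arrive, the initial pseudo-counts matter less and less.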

Hiring managers and recruiters should be updating their priors too, but it’s hard to get right. If I see a bunch of applicants with Python and Postgres experience on their resumes pass my interview, I should probably put more weight on those signals going forward. In practice, it’s hard to know how much to update my priors, and hard to pay attention to more than a few signals at once. Human priors can be “sticky” based on cultural beliefs about name-brand institutions, and this can lead to bias. Similarly, if I just recently “took a chance” on interviewing a .NET programmer and got burned, I may overweight my mental update to be unfairly biased against the next resume with .NET on it.

Getting this numerical update process right is crucial for making accurate predictions of whether an applicant is worth interviewing.

## Centralized Data Aggregation

Estimating the prior probabilities accurately requires a substantial dataset.

At Triplebyte, we’re applying this to hiring software engineers, and under the hood we calculate a $\hat{z}$ estimate for every applicant-position pair, exposing this to companies as a “Fit Score.”

We create a massive speedup in the hiring process by moving $N$ repeated screening steps into a single, more predictive one that we run with our own team of interviewers.

Companies hiring through us must agree to skip their own technical screening (such as technical phone screens or take-home assignments) for our applicants. Apple, Dropbox, Mixpanel, Instacart, and hundreds of others are willing to skip their own screening steps because they trust ours.

By aggregating company and candidate demand, we’re quantitatively advantaged in our ability to accurately estimate $\hat{z}$ because:

- **More rows.** We can learn from the entire population of software engineers, not just those applying to any one company.
- **More columns.** We have more features (and more informative features) to consider for fit score estimation than any company’s individual pre-screening. Our feature vectors are based on demonstrated technical ability, not on words appearing on a resume.
- **A better mathematical model.** We don’t rely on subjective gut calls. We calculate probabilities of an offer being made for a specific applicant and a specific position, and we can tune this prediction for each company’s hiring process. The Naive Bayes model presented in this post is a simple conceptual model. Our production prediction engine is far more advanced.
- **A better experience for companies.** Companies know that applicants we send them start with a very high $\hat{z}$ even before they know their name! Sourcing and screening time drop as pre-screened applicants are delivered automatically.
- **A better experience for applicants.** Different companies are looking for different things. We identify specific positions where applicants are the strongest fit based on their technical skills. Centralized screening means job seekers don’t have to jump through $N$ screening hoops to apply to $N$ companies, and getting multiple offers at once improves negotiated outcomes.

Deciding which applicants to interview is a core job function for recruiters and hiring managers, and I’ve shown that this maps to a Bayesian classification problem. Hopefully I’ve demonstrated that this is challenging to “gut call” even in a two-question toy model with known parameters, and the exponential complexity means that combining more questions is even more error-prone! Beyond that, it’s hard for recruiters to do all of this without a big dataset and the right tools for updating their prior probabilities as they collect more data.

Triplebyte’s statistically informed process is far more predictive of interview results than purely human-based evaluations. We can show sizeable improvements in outcomes versus humans making gut calls. The result is a better hiring experience for companies and candidates.

If you’re interested in signing up as a company to hire software engineers on the Triplebyte platform, sign up here. We work on a fee-per-hire basis with a six-month guarantee; there are no upfront costs.

If you’re interested in seeing which companies you’re matched to as an engineer, sign up here.
