If you like our assessment approach, you can evaluate your own applicants with it through Triplebyte Screen. Screen allows you to administer our assessments as an adjunct to, or replacement for, resume screens, so that you can choose who to move forward with based on proven technical skills. This helps to reduce bias in your hiring process and to find skilled engineers from underrepresented backgrounds. Screen is completely free to use with your applicants.

It’s easy to underestimate how difficult it is to evaluate a potential hire if you’ve never done so yourself.

A typical job receives orders of magnitude more applicants than qualified candidates. I’ve personally hired for roles where the ratio of applicants to offers was more than 100:1. The situation is even worse for engineering roles, where the competition is stiff and the salaries are high; in some cases, the ratio can be 1000:1 or worse.

No employer can reasonably devote significant resources to every one of a thousand applicants, so most of them filter aggressively in the early stages of their process. This is a practical necessity for most employers today, but it means relying on weak signals of skill that can introduce bias or produce a poor experience for engineers.

Filtering on years of experience, for example, discards skilled applicants who lack credentials and necessarily reproduces the existing biases of the industry (since it couples current hiring decisions to past ones). Filtering on degrees or high-prestige schools has obvious class bias, among other things. Filtering on fizzbuzz-like exercises is low-signal and somewhat demeaning to skilled applicants. And so on.

At Triplebyte, our goal is to gather enough information to make this process smoother for engineers and less error- and bias-prone for companies. We do this by building the best engineering assessments around, using a huge data set and the power of modern testing theory to wring every last bit of data out of a short quiz. Better data helps employers make better decisions about who to reach out to in the first place, and a centralized screening process helps engineers spend less time repeating tedious near-identical application processes at dozens of different employers.

But an assessment is only as good as its development, and we’ve never published a proper breakdown of the theory, the data, and the process behind Triplebyte’s assessments. So today, let’s take a deep dive into how our assessments work – how we model what a score means, how we incorporate data from thousands of past assessments to calibrate our estimates, and how we apply testing best practices to make sure we’re building a test that gives everyone a fair chance.

A Bayesian Take On Tests

Traditional tests are simple: a test-taker answers a list of questions, and their score is the number of questions they answer correctly. Some tests may weight questions differently from one another, but those weights are often subjective rather than empirical. This approach – what our in-house expert would call classical test theory – can be a reasonable foundation for test development, but it turns out there are alternative frameworks that (given sufficient data) can extract more information from test data.

Triplebyte’s assessments use a more modern framework called item response theory, or IRT for short. IRT is an umbrella term for a range of testing models that view the result of each question as a probabilistic function of underlying skill. These models are now the standard for many high-stakes tests, most notably the GMAT graduate school entrance exam. Our specific implementation is a fundamentally Bayesian variant of IRT: that is, we think of ourselves as having some baseline beliefs about a test-taker’s abilities at the beginning of a test, and we update those beliefs based on the result of each question.

Formally, we assume that each engineer on our platform has some (unknown) underlying skill, often denoted in IRT literature by the Greek letter \(\theta\) (theta). We begin with a prior belief about how likely an engineer is to be at any given skill level. This prior is represented by a distribution \(f_{prior}\) over possible values of \(\theta\), which we update each time we gain new information. (In fact, we recognize that engineering is a collection of many different skills, and our assessments reflect that by estimating several different skill levels, but we’ll come back to this later.)

Each of these updates is driven by Bayes’ rule, which tells us how to update our prior as we gain new information. (Readers entirely unfamiliar with Bayes’ rule may want to watch this longer video introduction by the excellent mathematics channel 3Blue1Brown.) For any two events \(A\) and \(B\), Bayes’ rule states that: $$P(A|B) = {P(A)P(B|A) \over P(B)}$$, where \(P(A|B)\) is the conditional probability of \(A\) given \(B\). For example, if \(A\) is the event that it rains today, and \(B\) is the event that it is cloudy today, we would expect \(P(A|B)\) to be much higher than \(P(A)\) alone, because cloudiness significantly increases the chance that it will rain.
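To make the rule concrete with some made-up numbers (purely illustrative, not real weather statistics): if \(P(rain) = 0.2\), \(P(cloudy|rain) = 0.9\), and \(P(cloudy) = 0.4\), then $$P(rain|cloudy) = {0.2 \times 0.9 \over 0.4} = 0.45$$ so observing clouds raises our estimate of rain from 20% to 45%.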

In our case, we’re trying to evaluate the probability of an engineer having some skill level \(\theta\) given the observed evidence that they answered a question in a particular way. Because our beliefs form a continuous curve (rather than a discrete true/false outcome), the formula looks a little different, with continuous density functions \(f_{prior}(\theta)\) and \(f_{posterior}(\theta)\) replacing discrete probabilities. The structure remains the same, however. Intuitively, it looks like this: $$updated\:chance\:of\:skill\:\theta = {prior\:chance\:of\:skill\:\theta\:*\:chance\:of\:observing\:actual\:response\:given\:skill\:\theta\:\over\:total\:chance\:of\:observing\:actual\:response\:at\:all}$$or more symbolically:$$f_{posterior}(\theta) = {f_{prior}(\theta)P(response|skill = \theta) \over P(response)}$$ where \(f_{prior}\) represents our beliefs before observing an engineer’s response, and \(f_{posterior}\) represents our updated beliefs after observing it.
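Here's a minimal numerical sketch of this update in Python. The grid, the standard-normal prior, and the placeholder response function are all illustrative choices for this article, not our production values:

```python
import numpy as np

# Discretize skill onto a grid; the range and spacing are arbitrary choices.
theta = np.linspace(-4, 4, 801)
d_theta = theta[1] - theta[0]

# Illustrative prior: a standard-normal density, normalized over the grid.
prior = np.exp(-theta**2 / 2)
prior /= prior.sum() * d_theta

def p_correct(theta):
    """Hypothetical P(correct | skill = theta) for one question.
    The real curve has to be estimated from data, as discussed below."""
    return 1.0 / (1.0 + np.exp(-(theta - 0.5)))

def bayes_update(prior, theta, d_theta, correct):
    """Return the posterior density after observing a single response."""
    # Likelihood of the observed response at each candidate skill level.
    likelihood = p_correct(theta) if correct else 1.0 - p_correct(theta)
    unnormalized = prior * likelihood
    # Dividing by P(response) -- the integral of prior * likelihood -- normalizes it.
    return unnormalized / (unnormalized.sum() * d_theta)

posterior = bayes_update(prior, theta, d_theta, correct=True)
print("updated mean skill estimate:", (theta * posterior).sum() * d_theta)
```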

It turns out that the interesting term in this equation is \(P(response|skill = \theta)\). This term captures the chances of a correct answer at different skill levels, which is what we'll ultimately use to differentiate weaker engineers from stronger ones. First principles can't tell us what its value is: it has to be estimated using real data, an estimation we'll get to shortly.

Visually, this updating process looks something like this. The red curve below represents our prior beliefs about an engineer's skill level. After they answer a few questions correctly, however, our estimates shift to the right. They also become more confident, which is visible as the blue curve narrowing around a single value.

An animation showing one distribution shifting rightward and narrowing to form a new distribution.

We repeat this prior-updating process for the next question, using the posterior from the first question as the prior for the second. (Statistically-astute readers may note that the joint results of two questions may carry information not carried by either individually. One of our model assumptions is that this isn’t the case – we’ll get to why that is near the end of this article.) Over time, we integrate information from correct and incorrect answers into a more-and-more-refined estimate of an engineer’s abilities.
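Reusing the `bayes_update` sketch from above, the sequential version is just a loop in which each posterior becomes the next prior. (The response sequence here is made up, and in a real assessment each question would contribute its own response curve \(p_i\) rather than sharing one.)

```python
# Hypothetical sequence of graded responses: True = correct, False = incorrect.
responses = [True, True, False, True]

belief = prior
for correct in responses:
    belief = bayes_update(belief, theta, d_theta, correct)

# The final belief is our refined estimate of the engineer's skill.
print("refined mean estimate:", (theta * belief).sum() * d_theta)
```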

The Item Response Curve

We said a moment ago that the interesting value in this prior-updating formula is the term \(P(response|skill = \theta)\), which represents the chances of particular responses from engineers of various skill levels. Why is this term significant? Because we already know, or can closely estimate, the other two terms. We know what our prior beliefs \(f_{prior}(\theta)\) are before we know whether someone answers a question correctly or not, and we can compute the expected chance of a particular response by summing (or more formally, integrating) over the chances of that response at each possible skill level, weighted by how likely we think each skill level is.
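Written out, that last computation is just the response probability averaged over our prior: $$P(response) = \int f_{prior}(\theta)\,P(response|skill = \theta)\,d\theta$$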

\(P(response|skill = \theta)\) should intuitively depend on \(\theta\): we would expect that a skilled engineer would answer a question correctly more often than an unskilled one. This is a requirement for a good question, since a question that is answered correctly equally often by engineers of all skill levels would provide us no ability to differentiate their skills at all.

An ideal question would perfectly differentiate engineers above and below a fixed skill level: engineers above that skill level would always answer correctly; engineers below it would always answer incorrectly. In practice, things aren’t so clear-cut: a high-skill engineer could conceivably answer even a very easy question incorrectly, and a low-skill engineer could conceivably answer even a very hard question correctly.

The idea of Item Response Theory is to formalize this intuitive notion so that we can compute with it. To do this, we define a function \(p_i\)  (for probability [of a correct answer] on the \(i\)th question) for each question on our assessment: $$p_i(\theta) = P(correct\:response\:to\:ith\:question|skill=\theta)$$

This function \(p_i\) goes by a number of names in testing literature, but for the purposes of this article we’ll call it (or more specifically, its graph) the response curve. It turns out that this curve encapsulates almost everything you might want to know about a question’s difficulty and differentiating power. For example, let’s consider the following response curve, where engineer skill increases from left to right and chance of answering the question correctly increases from bottom to top.

A drawing of a mathematical graph. It starts nearly flat at the lower left, then increases more rapidly, reaching maximum slope near the middle. It then tapers off and becomes nearly flat again at top right.

The difficulty of the question can be seen in the left-right position of the curve – it’s the point where the curve crosses \(y = 0.5\). This is the skill level at which a test-taker has a 50-50 chance of answering the question correctly; test-takers above this skill level are more likely than not to answer correctly, and test-takers below this skill level are more likely than not to answer incorrectly. The differentiating power of the question can be seen in how quickly the curve increases – in other words, how quickly it goes from test-takers usually getting it wrong to usually getting it right as their individual skill increases. Our hypothetical perfect-differentiation question from a few paragraphs ago (where engineers above a threshold always answer correctly and those below it always answer incorrectly) would have a response curve that jumps discontinuously from 0 to 1 at that threshold; real questions only approach this as an idealized limit as \(differentiation \rightarrow \infty\). For our purposes, we constrain our response curves to a family of curves characterized by these two parameters (difficulty and differentiating power) – more on this choice later.
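For the programmatically inclined, here is a small sketch of the two-parameter curve family we're describing. The parameter values are invented, and our production curves involve a few extra wrinkles covered in the Limitations section:

```python
import numpy as np

def response_curve(theta, difficulty, differentiation):
    """Two-parameter logistic response curve.

    difficulty      -- skill level at which P(correct) crosses 0.5
    differentiation -- how sharply the curve rises around that point
    """
    return 1.0 / (1.0 + np.exp(-differentiation * (theta - difficulty)))

skills = np.linspace(-3, 3, 7)
print(np.round(response_curve(skills, difficulty=-1.0, differentiation=3.0), 2))  # easy, sharp
print(np.round(response_curve(skills, difficulty=1.0, differentiation=0.8), 2))   # hard, shallow
```

Cranking the differentiation parameter up toward infinity recovers the idealized step function described above.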

Here are a few more curves to drive the point home. From left to right, the questions increase in difficulty (shifting the graphs rightward). From top to bottom, they increase in differentiating power (increasing their maximum slope at their center).

An image showing a range of response curves of questions with different difficulty and differentiating power.

The effect of all of this is that correct answers to hard questions adjust our prior beliefs about an engineer’s skill level upward by more than correct answers to easy questions do, and incorrect answers to easy questions adjust our prior downward by more than incorrect answers to hard questions do. This makes intuitive sense: you might not adjust your estimate of someone’s knowledge of history down very much if they couldn’t tell you who the Doge of Venice was in 1104 A.D. (the equivalent of a very-high-difficulty question), but you would probably adjust your estimate of their historical knowledge up a great deal if they could do so off the top of their head.

The variant of IRT we’ve discussed so far, which treats skill as a one-dimensional quantity, is a bit simpler than the one we actually use, which measures skill along several different axes. These axes correspond to the scores engineers can display to companies through our platform. Because a question tells us about both an engineer’s overall skill and about their knowledge of particular sub-areas, our response curves are really functions of two underlying skill variables: a general factor \(\theta_{overall}\) and a sub-area proficiency \(\theta_{subskill}\). Our prior is also a bit more complex than represented here, covering distributions over \(\theta_{overall}\), \(\theta_{backend}\), \(\theta_{algorithms}\), and so on. The math is not too different from the simpler case presented above, with single integrals replaced by multiple integrals and 1-dimensional curves replaced by \((n-1)\)-dimensional surfaces, but it’s omitted here because the 1-dimensional case is a lot easier to visualize.
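For readers who want the flavor of the multidimensional version, one common compensatory form (shown purely as a sketch; our production parameterization differs in its details) replaces the single skill with a weighted combination of factors: $$p_i(\theta_{overall}, \theta_{subskill}) = \sigma(a_1\theta_{overall} + a_2\theta_{subskill} - b_i)$$ where \(\sigma\) is the same s-shaped curve used in the one-dimensional case.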

We’ll come back to response curves in a moment. But before we can start to calibrate response curves for our question bank, we need to create the questions themselves.

Building The Quiz

Content for each of our quizzes arises from an underlying test plan, which details which skills we’re trying to test and how we intend to use the results that the test produces. This plan starts with a rough outline that is refined by subject-matter experts in the area we’re trying to test. For example, our recently-launched language quizzes began with the broad idea that they should test areas like core syntax and passing familiarity with the purposes of major libraries (but not necessarily with how to actually use them).

This test plan is operationalized into a specification that takes these broad claims and breaks them down into individual skills that are granular enough to be the subject of specific questions. For example, “core syntax” in Python is a broad topic that breaks into granular skills like “understand the output of this list comprehension”. We pass this specification and some guidelines on test question best practices on to a network of experts; these experts (who are sometimes contracted and sometimes in-house, depending on the skillset required) actually write the questions themselves.

With the exception of the initial plan, this entire process is handled by Larry, our in-house psychometrics expert. Larry earned his Master’s in quantitative methodology and psychometrics from UCLA, and he’s worked in educational research and testing for the last decade. He is pictured below in a bunny suit, because we are a Silicon Valley firm and are therefore obligated to include at least one ridiculous image per serious technical article (it’s in the Y Combinator contract somewhere, I think).

Larry, our psychometrics expert, wearing a bunny suit and smiling for the camera.

Larry works with our subject-matter experts to make sure the questions they’re writing follow a long list of testing best practices. For example, we check newly-submitted questions to make sure they don’t give away the answer by repeating a word between the question stem and the correct answer choice. Each question is then labeled with a specific subskill; for example, we might label a question with "back-end web development" as its general field. (We also label questions with a sub-subskill like "back-end web" -> "database indexes" to ensure we cover subskills evenly, but we don't ultimately score our assessments at that level of granularity.)

The result of this work is a pile of questions that we’re confident do in fact test what we’re trying to test. These questions then act as a baseline for our initial beta launch of each assessment. Beta launches let us collect some real-world data with which to calibrate our assessments in preparation for a final launch.

Calibration Station

When we add a new question, we don’t know what its response curve looks like yet. We need to estimate it empirically, but we can only do so once we have data from real responses. All of our initial calibration was done using data from our technical interviews, but in some newer cases we’re interested in calibrating questions that don’t correspond to any of our interviews directly, so we’ve branched out into other methods. One simple approach is to integrate a new question (which won’t actually count towards scoring) into a quiz, surrounded by other questions whose response curves are known; we can then calibrate the new curve with reference to the old ones. The range of new question types we develop sometimes requires more robust methods, and the exact approach we use depends on the specific context. Ultimately, though, we refer back to downstream interview outcomes as a source of ground truth whenever we can, which is most of the time.
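To give a feel for what estimating a response curve actually involves, here is a bare-bones maximum-likelihood fit of the two curve parameters by gradient ascent, assuming we already have skill estimates for past test-takers (from surrounding calibrated questions, say). This is a toy stand-in, not our real procedure, which is described in the next paragraph:

```python
import numpy as np

def fit_response_curve(skill_estimates, corrects, steps=5000, lr=0.05):
    """Toy maximum-likelihood fit of (difficulty, differentiation) for one question.

    skill_estimates -- estimated theta for each past test-taker
    corrects        -- 1.0 if that test-taker answered correctly, else 0.0
    """
    difficulty, differentiation = 0.0, 1.0
    for _ in range(steps):
        z = differentiation * (skill_estimates - difficulty)
        p = 1.0 / (1.0 + np.exp(-z))   # predicted P(correct) for each test-taker
        error = corrects - p           # gradient of the Bernoulli log-likelihood w.r.t. z
        differentiation += lr * np.mean(error * (skill_estimates - difficulty))
        difficulty += lr * np.mean(error * -differentiation)
    return difficulty, differentiation

# Synthetic check: 2,000 simulated test-takers answering a question whose true
# difficulty is 0.7 and true differentiation is 1.5.
rng = np.random.default_rng(0)
skills = rng.normal(size=2000)
true_p = 1.0 / (1.0 + np.exp(-1.5 * (skills - 0.7)))
outcomes = (rng.random(2000) < true_p).astype(float)
print(fit_response_curve(skills, outcomes))   # should land near (0.7, 1.5)
```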

(This calibration, by the way, turns out to be the most statistically-challenging part of our model, because our interviews themselves have to be calibrated, which adds additional uncertainty to the model. We use variational Bayesian methods, which are well beyond the technical scope of this article, to account for the added uncertainty.)

Once we feel that a quiz is sufficiently well-calibrated, we can launch it as a full part of our platform. What qualifies as “sufficiently” well-calibrated depends on the specific claims we’re trying to make – the more granular, confident, and high-stakes the result we’re trying to produce, the better our calibration needs to be. In an ideal world, we would calibrate to a practically-unrealistic degree of accuracy. In practice, we are (like any business) constrained by time and resources, and we do want to make assessments fully available once we have reasonable confidence in them. We also continue to calibrate assessments after their launch as we gather more data.

The result of this process is our quiz set and question bank, which as of this writing contains 14 quizzes and well over 1,000 questions tuned on nearly ten million past answers from engineers. This combined data set is behind our scoring of every quiz an engineer takes on our platform.

A picture of a Triplebyte score report, showing scores on a 1-5 scale for several different skills.

Why Our Tests Are Adaptive

One of the major reasons to use IRT is that it enables adaptive testing.

In a classical test, everyone takes the same set of questions, which means those questions need to cover the entire gamut of possible question difficulties. As a result, high-skill test-takers waste time answering low-difficulty questions, and low-skill test-takers are frustrated by questions far beyond their ability. A traditional fixed-question test can’t avoid this problem, because it doesn’t have any way to account for variable question sets or (beyond weighting) variable difficulties, but a Bayesian IRT model has absolutely no difficulty using information from any question to update a running estimate of underlying skill.

Why does this matter? Because it’s a general fact of information theory that the expected information from making an observation increases the more uncertain the result of that observation is. For a binary outcome (like an engineer taking a question on a quiz and answering it either correctly or incorrectly), expected information is maximized when the outcome is 50-50. In other words, we expect to learn the most about an engineer’s skills when we ask them a question whose difficulty is equal to their skill level.
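A quick sanity check of this fact: the expected information (Shannon entropy) of a yes/no outcome with success probability \(p\) is \(-p\log_2 p - (1-p)\log_2 (1-p)\), which peaks at \(p = 0.5\):

```python
import numpy as np

p = np.linspace(0.01, 0.99, 99)
entropy_bits = -p * np.log2(p) - (1 - p) * np.log2(1 - p)
print("most informative p:", p[np.argmax(entropy_bits)])               # 0.5
print("bits at p=0.5 vs p=0.95:", entropy_bits[49], entropy_bits[94])  # 1.0 vs ~0.29
```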

In order to leverage this effect, most Triplebyte assessments are adaptive: the more questions an engineer answers correctly, the more difficult the questions become; conversely, the more they answer incorrectly, the easier the questions become. This keeps us from asking questions with highly certain outcomes, improving the information gained from each question. (In fact, we don't target exactly 50-50 chances on every question, in part because that's a poor experience for test takers. But the underlying principle is the same.)
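In code, the selection step can be sketched like this: for each candidate question, average its response curve over our current skill belief, then pick the question whose predicted chance of a correct answer is closest to 50%. (Our production selection weighs other factors too, such as content coverage, so treat this purely as an illustration.)

```python
import numpy as np

def predicted_p_correct(belief, theta, d_theta, difficulty, differentiation):
    """Chance of a correct answer, averaged over our current skill belief."""
    curve = 1.0 / (1.0 + np.exp(-differentiation * (theta - difficulty)))
    return (belief * curve).sum() * d_theta

def pick_next_question(belief, theta, d_theta, question_bank):
    """Pick the question whose predicted outcome is closest to a coin flip."""
    predictions = [
        predicted_p_correct(belief, theta, d_theta, q["difficulty"], q["differentiation"])
        for q in question_bank
    ]
    return int(np.argmin(np.abs(np.array(predictions) - 0.5)))

# Tiny illustrative bank and a belief centered slightly above average skill.
theta = np.linspace(-4, 4, 801)
d_theta = theta[1] - theta[0]
belief = np.exp(-(theta - 0.8) ** 2 / 2)
belief /= belief.sum() * d_theta
bank = [{"difficulty": d, "differentiation": 1.5} for d in (-2.0, 0.0, 1.0, 2.5)]
print("next question index:", pick_next_question(belief, theta, d_theta, bank))
```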

This has a larger effect than you might think. The exact impact is tricky to quantify empirically for a number of reasons, but I ran a quick-and-dirty simulation based on two hypothetical quizzes with an underlying skill distribution similar to our real pool of engineers. In the first quiz, an engineer of average skill is given questions that are uniformly distributed across a typical range of skills, which by nature include questions much too easy and much too hard for them. In the second, the questions track the engineer’s estimated skill fairly closely, keeping within one standard deviation of the engineer’s true skill. As it turns out, the adaptive quiz in this hypothetical gives approximately four times as much average information per question, meaning that a non-adaptive version of our (currently 45-question) main quiz could need to be nearly 200 questions long (and take as much as three hours for some users) to capture the same level of information!

Limitations

No testing model is perfect, and as much as we might like it to be, ours isn’t either. For the sake of completeness, let’s examine a few of the assumptions necessary to our testing model.

First, it’s possible that a test-taker’s answers carry information about one another that has nothing to do with underlying skill. In other words, the response curve might depend on the results of the rest of the test, even though we’d like to think of each response curve as inherent to its parent question. For an extreme example, imagine that we asked the same question twice. Knowing whether the first answer was correct or not would tell us with near-certainty whether the second answer would be, regardless of the test-taker’s underlying skill. (Note that this same problem arises in a classical fixed-question test, too: duplicating a question 100 times would throw off test results quite badly!)

Unfortunately, it turns out to be impractical to model potential dependency between questions. Modeling even pairwise dependencies among \(n\) questions means adding \(O(n^2)\) parameters to our testing model; modeling all possible dependencies adds \(O(2^n)\) parameters (which, for our question bank, would be a number of parameters well above the number of particles in the Universe). Thus, we assume that underlying skill explains all the dependency between questions, leaving it up to our test development to ensure questions don’t have strong skill-independent dependencies. This assumption reduces the number of parameters to a manageable \(O(n)\), and is a standard underlying assumption of IRT testing models in general.
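To put rough numbers on that: with a bank of roughly \(n \approx 1{,}000\) questions, pairwise dependencies alone would add on the order of \(n(n-1)/2 \approx 500{,}000\) parameters, while modeling every possible dependency would require $$2^{1000} \approx 10^{301}$$ parameters, dwarfing the roughly \(10^{80}\) particles in the observable Universe.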

Second, in principle the response curve could take on any shape we like, but for theoretical reasons (e.g. to avoid overfitting) it’s best to estimate it with a small number of parameters. We therefore limit our response curve to the shape of a sigmoid (“s-shaped”) curve defined by two parameters: the difficulty and differentiation factors mentioned earlier. This is absolutely an oversimplification – real response curves don’t work this way – but it’s a numerical necessity that is standard in IRT-based tests. In fact, our production model does use a few additional parameters specific to our tests, but the basic response curve is still governed by the usual two parameters.

And finally, we are subject to the inherent limitations of testing, which no assessment format (including interviews!) can overcome. Any test measures proxies for on-the-job ability, and any test is something of an artificial environment in which individuals’ behavior may vary. Some people are good at interviewing as an atomic skill (as opposed to being good at the skills the interview is supposed to test), and I have no doubt that some people are good at taking multiple-choice tests in a way that comes through on our assessments.

That being said, one advantage of using our screening process as an adjunct to traditional hiring processes is that those "test-gaming" skills differ between different kinds of assessment. The interpersonal confidence that helps some candidates perform well in human interviews won’t help them at all on an automated test, while engineers who struggle with focus during an automated quiz may perform better in the more stimulating environment of an onsite interview. Using different screening methods with different drawbacks helps avoid correlated errors, producing a more robust process overall. 

Putting It All Together

With all the pieces in place, let's zoom back out to see the Triplebyte assessment process as a whole.

When an engineer starts an assessment, we assign them a prior. We then repeatedly select informative questions from our underlying pool, based on the known difficulty and differentiating power of each one – specifically, questions close to our current best estimate of the engineer’s skill level, because those extract the most information. Each time they answer a question, we update our estimate of their skills using Bayes’ rule and that question’s estimated response curve. We repeat this process enough times to narrow our estimate down, and finally output a distribution over the skill levels they might plausibly have. All of these steps are driven by our underlying model, which contains about 6,000 parameters tuned on results from hundreds of thousands of engineers.
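In rough code (reusing the `pick_next_question` sketch from the adaptive-testing section, collapsing skill down to one dimension, and glossing over all the production details), the whole loop looks something like this:

```python
import numpy as np

def run_assessment(question_bank, ask, num_questions=45):
    """Toy end-to-end assessment loop.

    question_bank -- calibrated questions, each with difficulty/differentiation
    ask           -- callback that shows a question and returns True if answered correctly
    """
    theta = np.linspace(-4, 4, 801)
    d_theta = theta[1] - theta[0]
    belief = np.exp(-theta**2 / 2)          # start from the prior
    belief /= belief.sum() * d_theta

    for _ in range(num_questions):
        # 1. Pick the most informative question given current beliefs.
        q = question_bank[pick_next_question(belief, theta, d_theta, question_bank)]
        # 2. Observe the engineer's response.
        correct = ask(q)
        # 3. Bayes update: multiply by the question's response curve and renormalize.
        curve = 1.0 / (1.0 + np.exp(-q["differentiation"] * (theta - q["difficulty"])))
        likelihood = curve if correct else 1.0 - curve
        belief = belief * likelihood
        belief /= belief.sum() * d_theta

    return theta, belief                    # the final skill distribution
```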

Conclusion

We trust this process. It absolutely has limitations, but we stake our business and our collective expertise on the quality of our assessments. Compared to traditional resume screens, which perpetuate both class and demographic bias and which are unnecessarily frustrating and laborious even for engineers who do fit their mold, we think our model provides an important step forward. We think it’s more pleasant for everyone involved; that it provides better opportunity to people who, through accident of birth, lack of a prestigious education, or both, would not be considered in traditional processes; and that it ultimately results in a more equitable and skilled workplace overall.

We hope, with better insight into the energy and process that we put into our tests, that you’ll agree.


Special thanks to several of my co-workers, particularly psychometrics expert Larry Thomas, product writer Jen Rose, and visual designer Mary Ngo, for their help on this article.
