Accelerating Ranking Experimentation at Thumbtack with Interleaving



While A/B tests are rightly considered the gold standard for causal inference, they can also be costly. A typical ranking experiment takes many weeks to complete. This wouldn’t be a big problem if we only had a handful of ideas to try, but Thumbtack’s rankers are powered by ML models that could be improved through any combination of new features, model architectures, and training techniques. In other words, there’s a vast space of potential improvements to evaluate, and with A/B testing alone, we don’t have the time to explore it systematically. That’s why we have turned to interleaving, an experimentation technique that lets us run tests up to 100X faster than A/B testing. Interleaving is specifically designed to accelerate experiments involving ranked lists by quickly identifying the better of two possible rankings, allowing us to evaluate many more ranking ideas in a much shorter period of time.

To evaluate the impact of a new ranker in an A/B test, we randomly split Thumbtack’s consumers into a control group that sees ordered search results generated by our existing production ranker and a treatment group that sees ordered search results generated by the new ranker. We then perform a hypothesis test to check whether there is a statistically significant difference in engagement (e.g., click rate) between the control and treatment groups. In contrast, an interleaving test does not split consumers into separate groups. Instead, every consumer sees a single combined list in which search results from the control ranker and the treatment ranker are interleaved.
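To make the idea of a combined list concrete, here is a minimal Python sketch of team-draft interleaving, one well-known way to merge two rankings. This is an illustration only: the text above does not say which interleaving method Thumbtack uses, and the function names (`team_draft_interleave`, `credit_clicks`) and the sample pro IDs are hypothetical.

```python
import random


def team_draft_interleave(control, treatment, rng=None):
    """Merge two rankings into one list, recording which ranker placed each item.

    Assumption: team-draft interleaving, used here purely as an illustration.
    """
    rng = rng or random.Random()
    rankings = {"control": control, "treatment": treatment}
    counts = {"control": 0, "treatment": 0}
    interleaved, teams, seen = [], [], set()
    total = len(set(control) | set(treatment))

    while len(interleaved) < total:
        # The ranker with fewer picks goes next; ties are broken by a coin flip.
        if counts["control"] < counts["treatment"]:
            team = "control"
        elif counts["treatment"] < counts["control"]:
            team = "treatment"
        else:
            team = rng.choice(["control", "treatment"])

        # Take that ranker's highest-ranked result not already in the list.
        pick = next((x for x in rankings[team] if x not in seen), None)
        if pick is None:
            # This ranker has nothing left to add, so the other one picks.
            team = "treatment" if team == "control" else "control"
            pick = next(x for x in rankings[team] if x not in seen)

        interleaved.append(pick)
        teams.append(team)
        seen.add(pick)
        counts[team] += 1

    return interleaved, teams


def credit_clicks(teams, clicked_positions):
    """Count clicks per ranker, given the 0-based positions that were clicked."""
    wins = {"control": 0, "treatment": 0}
    for pos in clicked_positions:
        wins[teams[pos]] += 1
    return wins


if __name__ == "__main__":
    control = ["pro_a", "pro_b", "pro_c", "pro_d"]      # hypothetical result IDs
    treatment = ["pro_b", "pro_a", "pro_d", "pro_c"]
    merged, teams = team_draft_interleave(control, treatment, random.Random(0))
    print(merged, teams)
    print(credit_clicks(teams, clicked_positions=[0, 2]))
```

Because each consumer sees results from both rankers in the same list, every search can contribute a within-user comparison: the ranker whose picks attract more clicks is credited with the win for that search.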