Personalizing UX: Why Zillow Group moved beyond AB testing
Zillow Group is embracing bandits testing to scale and improve its personalization outcomes.
Zillow Group faced a problem. The online real estate database company had too many personalized user experience tests it wanted to run on the website. The typical AB testing framework it had in place meant it would take years to test all of those ideas.
“Let’s say we have 50 AB tests running. How many AB tests do we need to run to get the independent effect?” asked Aaron Wroblewski, an AI engineering manager at Zillow, during a talk at MarTech Conference in San Jose this month. “Two to the power of 50. The length of the universe in seconds is 2 to the power of 44.” In other words, impossible.
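The count in that quote is easy to check. The sketch below assumes each of the 50 tests has exactly two variants, which the talk did not specify, and counts the combinations a full factorial design would need to cover. The function name is illustrative only.

```python
def full_factorial_cells(num_tests: int, variants_per_test: int = 2) -> int:
    """Number of distinct variant combinations across independent tests.

    Assumes every test has the same number of variants (hypothetical
    simplification, not something stated in the talk).
    """
    return variants_per_test ** num_tests


if __name__ == "__main__":
    # 50 two-variant tests crossed with one another
    print(f"{full_factorial_cells(50):,} combinations")
```

Each extra test doubles the total, which is why crossing tests to isolate independent effects stops being practical so quickly.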
It had to shift away from AB testing so its marketing teams could run multiple UX personalization tests simultaneously on audience segments at scale. Enter multi-armed bandit testing.
“I love the control AB testing gives, but it left a lot to be desired when applied for personalizing UX,” said Wroblewski. Zillow found several problems with AB tests: they can take a long time and often run one at a time, and they decide what’s best for most users at a particular point in time, ignoring both seasonality and the segments that don’t respond well. That lack of finesse meant there were too many variables the team couldn’t account for. “We know that user preferences shift with seasonality,” said Wroblewski, “and we can’t test that with AB testing.”
Benefits of bandits for Zillow Group
With bandits, you can run multiple tests simultaneously and, importantly, Wroblewski said, minimize regret, meaning the reward lost by showing users a weaker option while the test is still learning. The team can iterate in months, not years.
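To make the idea of regret concrete, here is a minimal simulation, not Zillow's system. It compares a fixed even traffic split, as in a classic AB test, with Thompson sampling, one common bandit algorithm chosen here purely for illustration. The conversion rates, round count and function names are all made-up assumptions.

```python
import random


def thompson_regret(rates, rounds, seed=0):
    """Expected regret of Thompson sampling on Bernoulli arms.

    `rates` are the true (hidden) conversion rates, invented for this sketch.
    """
    rng = random.Random(seed)
    wins = [1] * len(rates)    # Beta prior: alpha = 1 per arm
    losses = [1] * len(rates)  # Beta prior: beta = 1 per arm
    best = max(rates)
    regret = 0.0
    for _ in range(rounds):
        # Sample a plausible rate for each arm and play the highest sample
        samples = [rng.betavariate(wins[i], losses[i]) for i in range(len(rates))]
        arm = samples.index(max(samples))
        if rng.random() < rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
        regret += best - rates[arm]  # reward given up by not playing the best arm
    return regret


def even_split_regret(rates, rounds, seed=0):
    """Expected regret when traffic is split uniformly at random for the whole test."""
    rng = random.Random(seed)
    best = max(rates)
    return sum(best - rates[rng.randrange(len(rates))] for _ in range(rounds))


if __name__ == "__main__":
    rates = [0.05, 0.10]  # hypothetical conversion rates
    print("even split regret:", round(even_split_regret(rates, 10_000), 1))
    print("Thompson regret:  ", round(thompson_regret(rates, 10_000), 1))
```

The even split keeps sending half the traffic to the weaker option until the test ends, while the bandit shifts traffic as evidence accumulates, so it gives up less reward along the way.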
Wroblewski worked on a team of three that consulted with product teams and marketers on strategy and then built a stack to run bandit testing in three months.
With bandits, Zillow can automate the analyzing and optimizing phases, and humans can instead focus on planning, developing and learning from tests.
The team is now running contextual bandits, which randomly expose eligible users to tactics or test elements. The algorithms take the environmental state and historical data into account to achieve the maximum reward over time. A model trains daily to predict the combination of context and tactic that will yield the desired KPI. The model decides which tactic to iterate on based on performance.
It then retrains the algorithm. “The more confident it gets with more data, the more traffic we can assign to it,” explained Wroblewski, “but it always keeps a traffic segment to continue testing if that learning is in fact correct.”
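The loop described above (pick a tactic for a context, observe the result, keep some traffic exploring) can be sketched with an epsilon-greedy contextual bandit. This is a deliberately simple stand-in, not the model Zillow described: it keeps a running average per context and tactic instead of training a predictive model, and it holds the exploration share fixed. The class, the `explore_share` parameter and the tactic names are all hypothetical.

```python
import random
from collections import defaultdict


class EpsilonGreedyContextualBandit:
    """Tabular contextual bandit with a reserved share of exploration traffic."""

    def __init__(self, tactics, explore_share=0.1, seed=0):
        self.tactics = list(tactics)
        self.explore_share = explore_share  # traffic kept for continued testing
        self.rng = random.Random(seed)
        self.shows = defaultdict(int)       # (context, tactic) -> impressions
        self.rewards = defaultdict(float)   # (context, tactic) -> summed reward

    def estimate(self, context, tactic):
        """Average observed reward for this context and tactic (0.0 if unseen)."""
        n = self.shows[(context, tactic)]
        return self.rewards[(context, tactic)] / n if n else 0.0

    def choose(self, context):
        # With probability explore_share, test a random tactic;
        # otherwise exploit the best current estimate for this context.
        if self.rng.random() < self.explore_share:
            return self.rng.choice(self.tactics)
        return max(self.tactics, key=lambda t: self.estimate(context, t))

    def update(self, context, tactic, reward):
        self.shows[(context, tactic)] += 1
        self.rewards[(context, tactic)] += reward


if __name__ == "__main__":
    bandit = EpsilonGreedyContextualBandit(["map_first", "list_first"])
    bandit.update("mobile", "map_first", 1.0)
    bandit.update("mobile", "list_first", 0.0)
    print(bandit.choose("mobile"))
```

A production system would more likely shrink the exploration share as confidence grows, which is closer to what Wroblewski describes, but the fixed share shows the basic trade-off.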
The team’s eventual goal, said Wroblewski, is to be able to run hundreds of UX experiments simultaneously with tens of treatments per experiment.
Zillow’s learning curve
Starting from scratch meant the team encountered several pitfalls. One of the challenges of its first bandits test, said Wroblewski, was that each day they looked, they saw a different treatment winning. They had to correct for allocation, because different tactics are served with different probabilities. Out-of-order reward events are also a common problem in machine learning, so the team had to learn to order training samples in time.
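Both fixes can be illustrated in a few lines. The sketch below assumes each logged event records the probability with which its tactic was served. It sorts events by time before training and uses inverse propensity weighting, one standard way to correct for uneven allocation. The talk did not say which correction Zillow used, and the `Event` fields and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Event:
    timestamp: float   # when the tactic was shown (e.g. epoch seconds)
    tactic: str
    propensity: float  # probability with which this tactic was served
    reward: float      # observed outcome, e.g. 1.0 for a conversion


def prepare_training_samples(events):
    """Order events in time and attach an inverse-propensity weight to each."""
    ordered = sorted(events, key=lambda e: e.timestamp)
    return [(e.tactic, e.reward, 1.0 / e.propensity) for e in ordered]


def ips_mean_reward(events, tactic):
    """Inverse-propensity estimate of the mean reward if `tactic` were shown to everyone."""
    total = sum(e.reward / e.propensity for e in events if e.tactic == tactic)
    return total / len(events)


if __name__ == "__main__":
    log = [
        Event(3.0, "a", 0.8, 1.0),  # arrived out of order
        Event(1.0, "b", 0.2, 1.0),
        Event(2.0, "a", 0.8, 0.0),
        Event(4.0, "a", 0.8, 1.0),
    ]
    for sample in prepare_training_samples(log):
        print(sample)
```

Without the weights, a tactic that received most of the traffic dominates the raw totals simply because it was shown more often, which can make the apparent winner flip from day to day as allocations change.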
Another consideration is that user behavior changes and evolves. A model trained on today’s data may not be effective on next week’s data, said Wroblewski. The team implemented a strategy to feed data to the model, measure the learning rate and adjust continuously. The team has also improved monitoring and debugging processes.
Among its 2019 goals is to improve education and evangelism for other groups in the organization, said Wroblewski. He puts a lot of effort into talking to and educating marketing teams about the technology and how it differs from what they’re used to. It doesn’t help to map AB and marketing testing frameworks onto bandits because of the differences. “We want to highlight how it’s different,” he said.
Another goal for this year is to launch a self-serve UI from which teams can create and deploy bandits tests themselves. This is an area of growth for Zillow, and Wroblewski’s team is hiring.
What other companies use bandits testing?
Microsoft, Google, Amazon and Netflix are just some of the companies also employing bandits. McDonald’s acquisition of Dynamic Yield is a big bet on bandits, said Wroblewski. In one use case, McDonald’s will use the technology to recognize past drive-through customers and customize the menu board based on previous order history.
Opinions expressed in this article are those of the guest author and not necessarily MarTech Today.