Last month, I hosted Dylan Lewis, Experimentation Leader at Atlassian, for a virtual fireside chat on building a culture of experimentation. Dylan brings over two decades of experience in the domain and had a lot of great anecdotes to share.
Back in 2005, when Dylan was working at Intuit-TurboTax as the first Data Analyst on their web team, they had a learning window during tax season, from January through April.
This period essentially provided them one quarter to try out ideas, learn as much as they could, and help customers.
Leadership proposed ideas each Monday morning. The team would then build and launch those experiments by Friday and review the early results the following Monday.
The outcome at the end of the tax season was revealing:
Out of the 40 experiments they ran, 38 didn’t win. Side note: The two winning experiments came from marketers. ;)
The Highest Paid Opinion (HIPO) was not always correct.
The customers—the ones actually using the product and experiencing the treatment variants—helped them understand what would ultimately succeed.
Dylan shared, “The term HIPO was modified to 'HIPPO'. Avinash Kaushik presented it at an Emetrics conference, and Ronny Kohavi published this.” It has since become commonplace in the vocabulary of experimentation. Dylan noted that these symbols added a lot of fun and excitement.
“We loved it, and as teams began experimenting, we sent a (stuffed) hippo to the team with a winning experiment for that week. It moved from one place to another depending on which team was winning, and they got to decorate it. By the end of tax season, the hippo would be covered in souvenirs from the teams.”
It didn't stop with the hippos; they also introduced skunks, awarded for experiments that didn't win. Engineers would write the experiment ID on a skunk and hand it to the person whose experiment didn't fully succeed. By the end of the tax season, engineers would have collected plenty of skunks, proudly displayed on their tables in intricate dioramas!
Now at Atlassian, Dylan is working to scale a mature experimentation program. Modern-day experimentation platforms have become more robust in terms of metric trustworthiness and statistical capabilities, enabling greater experimentation velocity.
Yet Dylan noted that culture remains the biggest challenge for most organizations. To show how culture can make a difference, he shared an example involving one of the key metrics on his dashboard: the percentage of failed or restarted experiments, a figure that should ideally be low.
One of their experimentation teams was experiencing a 40% restart rate. To address this, they organized a launch party, during which the experiment was made available to those in the room. This allowed them to verify that the experience worked as expected before it went live.
One of the critical factors for success here was including someone who wasn’t part of the experiment to ensure an unbiased perspective.
The results were striking: the team's restart rate dropped from 40% to 5%.
Our conversation was filled with valuable takeaways for operationalizing a culture of experimentation, with themes including identifying roadblocks, conducting reviews, prioritization, and ensuring trustworthiness.
This fireside chat is one you won’t want to miss! Watch below. 👇