Last month, I hosted Dylan Lewis, Experimentation Leader at Atlassian, for a virtual fireside chat on building a culture of experimentation. Dylan brings over two decades of experience in the domain and had a lot of great anecdotes to share.
Back in 2005, when Dylan was working at Intuit-TurboTax as the first Data Analyst on their web team, they had a learning window during tax season, from January through April.
This period essentially provided them one quarter to try out ideas, learn as much as they could, and help customers.
Leadership proposed ideas each Monday morning. The team would then build and launch those experiments by Friday and review the early results the following Monday.
The outcome at the end of the tax season was revealing:
Out of the 40 experiments they ran, 38 didn’t win. Side note: The two winning experiments came from marketers. ;)
The Highest Paid Opinion (HIPO) was not always correct.
The customers—the ones actually using the product and experiencing the treatment variants—helped them understand what would ultimately succeed.
Dylan shared, “The term HIPO was modified to 'HIPPO'. Avinash Kaushik presented it at an Emetrics conference, and Ronny Kohavi published this.” It has since become commonplace in the vocabulary of experimentation. Dylan noted that these symbols added a lot of fun and excitement.
“We loved it, and as teams began experimenting, we sent a (stuffed) hippo to the team with a winning experiment for that week. It moved from one place to another depending on which team was winning, and they got to decorate it. By the end of tax season, the hippo would be covered in souvenirs from the teams.”
It didn't stop with the hippos; they also introduced skunks, awarded for experiments that didn't win. Engineers would write the experiment ID on a skunk and give it to the person whose experiment hadn't fully succeeded. By the end of the tax season, engineers would have collected plenty of skunks, proudly displayed on their tables in intricate dioramas!
Now at Atlassian, Dylan is working to scale a mature experimentation program. Modern-day experimentation platforms have become more robust in terms of metric trustworthiness and statistical capabilities, enabling greater experimentation velocity.
Yet Dylan noted that culture remains the biggest challenge for most organizations. One example he shared of how culture can make a difference concerned a key metric on his dashboard: the percentage of failed or restarted experiments, a figure that should ideally be low.
One of their experimentation teams was experiencing a 40% restart rate. To address this, they organized a launch party, during which the experiment was made available to those in the room. This allowed them to verify that the experience worked as expected before launch.
One of the critical factors for success here was including someone who wasn’t part of the experiment to ensure an unbiased perspective.
The impact was significant: the team's restart rate dropped from 40% to 5%.
Our conversation was filled with valuable takeaways for operationalizing a culture of experimentation, with themes including identifying roadblocks, conducting reviews, prioritization, and ensuring trustworthiness.
This fireside chat is one you won’t want to miss! Watch below. 👇