Five important details during A/B tests

In the age of data-driven decision-making, companies run hundreds of tests so that their decisions rest on data rather than opinion. The problem is that if you ignore the subtleties of running those tests, all the time spent on them can be wasted, and the decisions you make will not actually be supported by anything. After surveying the analysts on our team, I put together the top five oversights they run into in their everyday work.

Many people know that you need to look for statistical significance, but not everyone calculates confidence intervals.

Statistical significance means that the result of the experiment is unlikely to be due to chance alone, suggesting that the feature you are testing really did affect the selected metrics.

The confidence interval gives you even more information, namely how much the metric changed and how precisely you were able to measure that change. And please don’t use the simple difference between the group metrics in your interpretations; it makes the analysis very imprecise and will lead to inconsistencies down the road.

Let’s say you’ve added a new feature to your app, and a test shows that more people are using it now. That sounds great. But what if the confidence interval shows that user engagement may increase by 10% at best and by only 1% at worst? If so, is the feature worth your effort?

Examining confidence intervals allows you to weigh the potential benefits, costs, and effort required for implementation. They will give you more confidence that the changes you are investing in will really make a difference.
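For example, here is a minimal sketch (in Python, with made-up group sizes and conversion counts) of how you could compute both the significance test and the confidence interval for the difference in conversion rates between a control group A and a treatment group B:

```python
import numpy as np
from scipy import stats

# Made-up example data: visitors and conversions in each group
n_a, conv_a = 10_000, 1_100   # control
n_b, conv_b = 10_000, 1_180   # treatment

p_a, p_b = conv_a / n_a, conv_b / n_b
diff = p_b - p_a

# Two-proportion z-test (pooled standard error) for statistical significance
p_pool = (conv_a + conv_b) / (n_a + n_b)
se_pooled = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = diff / se_pooled
p_value = 2 * (1 - stats.norm.cdf(abs(z)))

# 95% confidence interval for the difference (unpooled standard error)
se = np.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se

print(f"p-value: {p_value:.4f}")
print(f"uplift: {diff:.4f}, 95% CI: [{ci_low:.4f}, {ci_high:.4f}]")
```

If the whole interval sits above the smallest uplift that would justify the implementation cost, the change is worth shipping; if the interval also covers values close to zero, the “win” may not pay for itself.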

Some teams ignore corrections for multiple testing, which often leads to hypotheses being accepted incorrectly, i.e. to false positives.

When you test multiple metrics and groups, the risk of false positives increases. For example, suppose you run four groups: a control and three experimental groups. Each comparison against the control carries a 5% chance of a false positive, so across the three comparisons the probability of at least one type 1 error is 1 − 0.95³ ≈ 14.3% instead of the expected 5%, and adding more metrics per group inflates it further.
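As a quick back-of-the-envelope illustration, here is how that family-wise error rate can be computed for m independent comparisons at α = 0.05:

```python
# Probability of at least one false positive across m independent comparisons
alpha = 0.05
for m in (1, 3, 9, 20):
    fwer = 1 - (1 - alpha) ** m
    print(f"{m:2d} comparisons -> {fwer:.1%} chance of at least one type 1 error")
```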

There are various techniques that can be used to reduce the chance of error in multiple testing (a short code sketch applying them follows the list):

Bonferroni method: the easiest way. Divide your α by the number of comparisons. For example, if you run 4 tests, the significance level for each will be α/4.

Holm method: a sequential correction that is a modification of the Bonferroni correction. The tests are sorted by their p-values, and the most significant test is held to the same threshold as under the Bonferroni correction (α divided by the number of tests). The next most significant test is held to a threshold adjusted by one fewer test, and so on.

FDR (False Discovery Rate) control: instead of controlling the probability of at least one type 1 error, you control the expected proportion of falsely rejected hypotheses, for example with the Benjamini-Hochberg procedure.

Plan experiments: instead of testing many hypotheses at once, design experiments so that each one tests only one or a small number of hypotheses at a time.
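If you already have the raw p-values from your comparisons, the statsmodels library can apply these corrections for you. The p-values below are invented purely for illustration:

```python
from statsmodels.stats.multitest import multipletests

# Invented raw p-values from several comparisons within one experiment
p_values = [0.012, 0.034, 0.041, 0.20]

for method, label in [("bonferroni", "Bonferroni"),
                      ("holm", "Holm"),
                      ("fdr_bh", "Benjamini-Hochberg (FDR)")]:
    reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method=method)
    print(f"{label}: adjusted p-values {p_adj.round(3)}, reject {reject}")
```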

Usually, teams fall into two camps: those who choose very high-level metrics (turnover, number of bids, conversion to activation) and those who choose very low-level ones (CTR, session time, feature engagement). The right approach is a combination of both. To understand how the tested change affects users, start by analyzing the low-level metrics, but then don’t forget to check how the change affects the product as a whole and whether it was worth it.

For example, an e-commerce team tests a new, more prominent “Add to Cart” button to increase purchases. The primary metric (clicks on the button) improves significantly, but the secondary metrics tell a different story: a higher abandoned-cart rate and a lower average session time. As a result, more users click the button, fewer actually buy, and overall engagement on the site decreases.
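A minimal sketch of what a combined readout of the primary metric and the guardrail (secondary) metrics could look like; the metric names, lifts, and intervals here are invented for the example:

```python
import pandas as pd

# Invented per-metric results of an experiment: lift and 95% CI bounds
results = pd.DataFrame({
    "metric": ["add_to_cart_ctr", "purchase_conversion", "avg_session_minutes"],
    "role": ["primary", "guardrail", "guardrail"],
    "lift": [0.08, -0.02, -0.5],
    "ci_low": [0.05, -0.035, -0.9],
    "ci_high": [0.11, -0.005, -0.1],
})

# One possible decision rule: ship only if the primary metric clearly improves
# (its CI entirely above zero) and no guardrail metric clearly degrades
# (no guardrail CI entirely below zero).
primary_ok = (results.loc[results.role == "primary", "ci_low"] > 0).all()
guardrails_ok = (results.loc[results.role == "guardrail", "ci_high"] >= 0).all()

print("Ship it" if primary_ok and guardrails_ok else "Investigate before shipping")
```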

Segmentation in A/B tests ensures that you understand how changes affect specific types of users, rather than just seeing the average effect for everyone.

Many people find segmentation in tests difficult and sometimes expensive, especially with limited resources. In addition, there is a tendency to focus on broader test results, looking for general conclusions that apply to the entire user base.

Different segments may respond differently to the same change: what works for one group may not work for another.

Imagine you’re on an online streaming product team and you’re A/B testing a new feature that recommends movies based on watch history. By segmenting users, your team learns that while the feature is popular with younger people, it’s less effective for users over 50. Can you then introduce it only for young people?

Maybe, but you might need more data to be sure. To avoid false positives, you should re-run the experiment targeting young people to see if you actually found a positive effect on the metrics.

If complete segmentation is not possible, at least compare your results to your most important user segment. Ask yourself: Are the results of this test an accurate representation of your most important users? This approach gives you an idea of how well your test meets the needs of your target audience.
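For illustration, a segment-level readout can be as simple as a pandas groupby; the column names and age buckets below are assumptions made for the example:

```python
import pandas as pd

# Invented per-user experiment data: group assignment, age bucket, engagement flag
df = pd.DataFrame({
    "group":      ["A", "A", "B", "B", "A", "B", "A", "B"],
    "age_bucket": ["18-29", "50+", "18-29", "50+", "18-29", "18-29", "50+", "50+"],
    "engaged":    [0, 1, 1, 0, 1, 1, 0, 1],
})

# Engagement rate and sample size per segment and group
readout = (
    df.groupby(["age_bucket", "group"])["engaged"]
      .agg(rate="mean", users="count")
      .reset_index()
)
print(readout)
```

Keep in mind that per-segment sample sizes shrink quickly, so treat segment-level wins as hypotheses to re-test, as noted above.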

Every A/B test is valuable not only as an optimization of your product, but also as a lesson.

Proper documentation will create a knowledge base from which your teams can learn. That way, team members can understand what worked and what didn’t and why.

Think of user behavior data as raw coffee beans. An A/B test is the roasted coffee together with a written-down description of how it was roasted. From it, you can either brew good coffee right away or spot subtleties in the roasting process that will let you roast even tastier coffee next time.

So start a Notion page, a Confluence space, or at least a Google Sheet where you will record the following (a minimal template sketch follows the list):

  • Hypothesis: If … , then …

  • What does the user experience look like?

  • Which users are we testing on?

  • Key metrics

  • Expected effect

  • Was the test statistically significant?

  • Test results with confidence intervals
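As a sketch of what such a record might look like if you keep it in code rather than a document, here is one possible structure; all field names are assumptions that you would adapt to your own template:

```python
from dataclasses import dataclass

@dataclass
class ExperimentRecord:
    """One entry in the team's A/B test knowledge base."""
    hypothesis: str                       # "If ..., then ..."
    user_experience: str                  # what the user sees in each variant
    audience: str                         # which users the test runs on
    key_metrics: list[str]
    expected_effect: str
    stat_significant: bool | None = None  # filled in after the test
    results_with_ci: str = ""             # e.g. "+3.2% conversion, 95% CI [1.1%, 5.3%]"

record = ExperimentRecord(
    hypothesis="If we make the Add to Cart button more prominent, then purchases will grow",
    user_experience="Variant B shows a larger, high-contrast button on the product page",
    audience="New users on mobile web",
    key_metrics=["add_to_cart_ctr", "purchase_conversion"],
    expected_effect="+2% purchase conversion",
)
print(record)
```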

If you have a lot of teams, you’ll be surprised how useful this becomes when you come back to generating hypotheses.

Subscribe to my Telegram channel, where I talk about product development and analytics, and add this article to your favorites!

We are actively expanding, so we need product analysts! If you are a fan of data, want to be the most useful person on the team, and want results, write to [email protected] or send your resume via hh.
