**The Peeking Problem 2.0: Challenges with Sequential Tests in Longitudinal Data Analysis**
At Spotify, our data infrastructure is constantly evolving to improve our online experimentation process. One important aspect of this process is obtaining early feedback on experiments in a risk-managed manner. To achieve this, we use sequential tests to monitor our experiments for regressions. However, when we measure the same units repeatedly over shorter time frames, we get longitudinal data, which presents new challenges for sequential tests. In this article, we discuss these challenges and present our approach to addressing them.
*Part 1: The Peeking Problem 2.0 and Challenges in Sequential Testing*
Sequential testing is widely used for continuous monitoring of A/B tests in online experimentation. It offers a solution to the “peeking problem,” which occurs when a fixed-sample statistical test is applied repeatedly while data are still accumulating and the experiment is stopped at the first significant result. Every additional look is another chance for a false positive, so the overall false positive risk is inflated above the nominal level and the statistical assumptions of the test are violated.
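The classic peeking problem can be reproduced in a few lines. The sketch below is an illustration under assumed settings, not code from our platform: it simulates A/A experiments (no true effect) with unit-variance normal outcomes, applies an ordinary two-sided z-test with known variance at every look, and compares the false positive rate of testing once at the end with the rate of rejecting at any look. The function name and its parameters (`reps`, `looks`, `batch`, `seed`) are hypothetical.

```python
import math
import random

Z_CRIT = 1.96  # two-sided 5% critical value for a single fixed-sample z-test


def peeking_false_positive_rates(reps=2000, looks=10, batch=20, seed=7):
    """Simulate A/A tests with known unit variance.

    Returns (rate when testing once at the end, rate when testing at every look).
    """
    rng = random.Random(seed)
    final_only = any_look = 0
    for _ in range(reps):
        diff_sum = 0.0        # running sum of (treatment - control) outcomes
        n = 0                 # observations per arm so far
        rejected_any = False
        z = 0.0
        for _ in range(looks):
            for _ in range(batch):
                diff_sum += rng.gauss(0, 1) - rng.gauss(0, 1)
            n += batch
            # Difference in means divided by its standard error sqrt(2 / n).
            z = (diff_sum / n) / math.sqrt(2 / n)
            if abs(z) > Z_CRIT:
                rejected_any = True
        final_only += abs(z) > Z_CRIT   # decision from the last look only
        any_look += rejected_any        # decision from stopping at first rejection
    return final_only / reps, any_look / reps


if __name__ == "__main__":
    single, peeking = peeking_false_positive_rates()
    print(f"test once at the end: {single:.3f}")
    print(f"test at every look:   {peeking:.3f}")
```

The single-look rate should stay close to the nominal 5%, while the any-look rate ends up well above it. Sequential tests exist to close exactly this gap.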
However, we have observed a new problem, which we refer to as the “peeking problem 2.0,” that can still inflate false positive rates even when sequential tests are used. It arises from “within-unit peeking”: analyzing a participant’s results before all of that participant’s measurements have been collected.
*Types of Metrics: Cohort-based and Open-ended*
To understand the challenges of longitudinal data, we will examine two common types of metrics: cohort-based metrics and open-ended metrics.
Cohort-based metrics involve measuring units over the same fixed time window after exposure to the experiment. These metrics do not suffer from the peeking problem 2.0 but may require waiting longer for results or using less available data.
Open-ended metrics, on the other hand, use all data available per unit at the time of analysis. Using all the data is appealing, but standard sequential tests are typically invalid for these metrics, which makes them highly susceptible to the peeking problem 2.0. Despite this, open-ended metrics are commonly used in practice and supported by online experimentation vendors.
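The difference between the two metric types is easiest to see on a toy dataset. The following is a minimal sketch, assuming measurements arrive as `(unit_id, exposure_day, measurement_day, value)` tuples and that the per-unit aggregation is a sum; the data, the function names, and the `window_days` parameter are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical longitudinal data: (unit_id, exposure_day, measurement_day, value)
MEASUREMENTS = [
    ("u1", 1, 1, 10.0), ("u1", 1, 2, 12.0), ("u1", 1, 3, 8.0),
    ("u2", 2, 2, 5.0), ("u2", 2, 3, 7.0),
    ("u3", 3, 3, 9.0),
]


def open_ended(measurements, analysis_day):
    """Per-unit sum of everything observed up to and including analysis_day."""
    totals = defaultdict(float)
    for unit, _exposed, day, value in measurements:
        if day <= analysis_day:
            totals[unit] += value
    return dict(totals)


def cohort_based(measurements, analysis_day, window_days):
    """Per-unit sum over the first window_days days after exposure.

    Units whose window is not yet complete on analysis_day are left out.
    """
    totals = defaultdict(float)
    for unit, exposed, day, value in measurements:
        window_complete = analysis_day - exposed + 1 >= window_days
        in_window = day - exposed < window_days
        if window_complete and in_window:
            totals[unit] += value
    return dict(totals)


if __name__ == "__main__":
    for analysis_day in (2, 3):
        print(analysis_day, open_ended(MEASUREMENTS, analysis_day),
              cohort_based(MEASUREMENTS, analysis_day, window_days=2))
```

In this toy data, the open-ended value of `u1` moves from 22.0 on day 2 to 30.0 on day 3, whereas its two-day cohort value is 22.0 on both days. The cohort-based metric pays for that stability by excluding `u2` on day 2 and `u3` on both days, because their windows are not yet complete.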
*The Need for Precision in Statistical Goals*
When dealing with multiple measurements per unit, it is crucial to be clear about the specific treatment effect we aim to learn about. Precise definition of the statistical goal enables us to select appropriate estimators and statistical tests.
*Cohort-based Metrics vs. Open-ended Metrics*
The choice between the two is a trade-off. Cohort-based metrics avoid the peeking problem 2.0, at the cost of waiting longer for results or using less of the available data. Open-ended metrics use all available data, but a unit’s value keeps changing as new measurements arrive, and that is what makes standard sequential analysis of them challenging.
*A Monte Carlo Simulation Study*
To illustrate the inflated false positive rates resulting from using standard sequential tests for open-ended metrics, we conducted a small Monte Carlo simulation study. The results highlight the importance of employing appropriate sequential tests for such metrics to avoid erroneous conclusions.
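To give a feel for the mechanism, here is a much-simplified sketch of such a simulation. It is not the study referred to above, and every setting in it is an assumption: a fixed batch of units enters each arm per day, daily measurements are independent standard normal with no unit-level effect, variances are treated as known, and the sequential test is a classic group sequential design with a Pocock boundary (constant 2.413 for five equally spaced looks at a two-sided 5% level). The cohort-based metric is the first-day value; the open-ended metric is the per-unit mean over all days observed so far.

```python
import math
import random

POCOCK_Z = 2.413  # Pocock constant: 5 equally spaced looks, two-sided alpha = 0.05
LOOKS = 5
BATCH = 10        # hypothetical number of units entering each arm per day


def run_experiment(rng):
    """Simulate one A/A experiment; return (cohort_rejects, open_ended_rejects)."""
    # Each unit is stored as [first_day_value, running_sum, days_observed].
    arms = {"control": [], "treatment": []}
    cohort_rejects = open_rejects = False
    for look in range(1, LOOKS + 1):
        for units in arms.values():
            for unit in units:              # existing units get one more day
                unit[1] += rng.gauss(0, 1)
                unit[2] += 1
            for _ in range(BATCH):          # a new batch is exposed today
                x = rng.gauss(0, 1)
                units.append([x, x, 1])
        n = look * BATCH                    # units per arm at this look

        # Cohort-based: first-day value only, fixed once observed.
        cohort = {a: sum(u[0] for u in us) / n for a, us in arms.items()}
        z_cohort = (cohort["treatment"] - cohort["control"]) / math.sqrt(2 / n)

        # Open-ended: per-unit mean over all days observed so far.
        # A unit observed for a days has variance 1 / a, so the standard
        # error involves the harmonic sum 1 + 1/2 + ... + 1/look.
        open_ = {a: sum(u[1] / u[2] for u in us) / n for a, us in arms.items()}
        harmonic = sum(1 / a for a in range(1, look + 1))
        se_open = math.sqrt(2 * harmonic / (look ** 2 * BATCH))
        z_open = (open_["treatment"] - open_["control"]) / se_open

        cohort_rejects = cohort_rejects or abs(z_cohort) > POCOCK_Z
        open_rejects = open_rejects or abs(z_open) > POCOCK_Z
    return cohort_rejects, open_rejects


def false_positive_rates(reps=4000, seed=11):
    """Return (cohort-based rate, open-ended rate) over reps A/A experiments."""
    rng = random.Random(seed)
    cohort_hits = open_hits = 0
    for _ in range(reps):
        c, o = run_experiment(rng)
        cohort_hits += c
        open_hits += o
    return cohort_hits / reps, open_hits / reps


if __name__ == "__main__":
    cohort_rate, open_rate = false_positive_rates()
    print(f"cohort-based: {cohort_rate:.3f}")
    print(f"open-ended:   {open_rate:.3f}")
```

In this setup each open-ended z-statistic is correctly standardized at every look, yet the sequence of statistics is less correlated across looks than the boundary assumes, because every existing unit’s value is refreshed with new noise each day. The cohort-based rate should sit near the nominal 5%, and the open-ended rate above it. How large the gap is depends heavily on these assumptions, in particular on how strongly measurements within a unit are correlated.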
*Longitudinal Data and Measurement Frequency*
Advancements in data collection infrastructure have enabled more frequent measurements and analysis during experiments. However, the literature on sequential testing has paid little attention to how these more granular measurements can be incorporated in a valid manner. More frequent measurement is also not automatically more informative: for example, measuring the difference in music consumption within the first few seconds of exposure to an experiment may not yield meaningful results, as users may not have had sufficient time to exhibit changed behavior.
The question arises: Should we measure units for a short time window after they enter the experiment to detect changes early, or should we measure them for a longer time window to obtain a more comprehensive understanding of their response? The solution lies in incorporating repeated measurements per unit in sequential analysis.
*Separating Concepts for Efficient Sequential Tests*
To derive valid and efficient sequential tests for more complex data settings, it is essential to separate metrics, estimands, estimators, and statistical tests. It is common for online experimenters to conflate these concepts, leading to confusion. It’s crucial to clearly define the behavior or aspect to be measured per unit and select appropriate treatment effects, estimators, and statistical tests accordingly.
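As one illustration of this separation, the four concepts can be written down for a simple case. The notation here is assumed for the example and does not come from a specific paper: $Y_i(d)$ is unit $i$'s outcome aggregated over its first $d$ days after exposure, $e_i$ is the day unit $i$ was exposed, $t$ is the day of analysis, and $T$ and $C$ denote the treatment and control groups.

```latex
% Metric: what is measured per unit
Y_i(d) = \text{outcome of unit } i \text{ aggregated over its first } d \text{ days}

% Estimand for a cohort-based metric with fixed window w
\tau(w) = \mathbb{E}\left[ Y_i(w) \mid T \right] - \mathbb{E}\left[ Y_i(w) \mid C \right]

% Estimator at analysis day t, using only units whose window is complete
\hat{\tau}_t(w) = \bar{Y}_{T,t}(w) - \bar{Y}_{C,t}(w)

% Open-ended counterpart: unit i contributes whatever has been observed so far
\hat{\tau}^{\text{open}}_t
  = \frac{1}{n_{T,t}} \sum_{i \in T} Y_i(t - e_i)
  - \frac{1}{n_{C,t}} \sum_{i \in C} Y_i(t - e_i)
```

A statistical test is then a decision rule applied to the sequence of estimates over $t$. Written this way, the cohort-based estimator targets the same $\tau(w)$ at every analysis, whereas the open-ended estimator targets a quantity that depends on the mix of exposure durations $t - e_i$ and therefore shifts as the experiment runs.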
In conclusion, sequential testing is a valuable tool for continuous monitoring of A/B tests in online experimentation. However, when dealing with longitudinal data, challenges such as the peeking problem 2.0 arise. Understanding the different types of metrics and utilizing appropriate sequential tests can help mitigate these challenges and enable reliable analysis of experiments.