When flaky tests fail together: Empirical evidence for systemic flakiness
Introduction
Have you ever noticed that when one flaky test fails in your continuous integration pipeline, several others seem to fail at the same time? Well, you’re not imagining things! My colleagues and I recently completed a comprehensive study showing that flaky tests often exist in clusters, failing together due to shared root causes. We call this phenomenon systemic flakiness, and we believe it calls for a major shift in how we think about and address flaky tests.
Our research paper, “Systemic Flakiness: An Empirical Analysis of Co-Occurring Flaky Test Failures” (Parry et al. 2025), presents the full study; the key results are summarized below.
Summary of Key Results
The Prevalence of Systemic Flakiness: Our analysis of 810 flaky tests across 24 Java projects revealed that 75% of flaky tests belong to clusters, with an average of 13.5 tests per cluster. In other words, when one of these clustered flaky tests fails, around 12 or 13 others typically fail alongside it due to shared root causes.
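The clusters themselves come from observing which flaky tests fail together across many test suite runs. As a rough illustration of the idea (not the exact procedure from the paper), the sketch below groups tests by the similarity of their historical failure patterns using hierarchical clustering; the failure matrix and test names are invented for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical failure history: rows are test suite runs, columns are flaky
# tests; 1 means the test failed in that run, 0 means it passed.
failures = np.array([
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [1, 1, 0, 0, 0],
])
test_names = ["testA", "testB", "testC", "testD", "testE"]

# Pairwise Jaccard distance between failure patterns: 0 means two tests
# always fail in the same runs, 1 means they never fail together.
n_tests = failures.shape[1]
distance = np.zeros((n_tests, n_tests))
for i in range(n_tests):
    for j in range(n_tests):
        if i != j:
            both = np.sum(failures[:, i] & failures[:, j])
            either = np.sum(failures[:, i] | failures[:, j])
            distance[i, j] = 1 - both / either if either else 1.0

# Hierarchically cluster the tests and cut the tree at a distance threshold.
condensed = squareform(distance, checks=False)
labels = fcluster(linkage(condensed, method="average"), t=0.5, criterion="distance")

for cluster_id in sorted(set(labels)):
    members = [name for name, label in zip(test_names, labels) if label == cluster_id]
    print(f"cluster {cluster_id}: {members}")
```

Tests whose failure histories are nearly identical end up in the same cluster, which is exactly the trace that systemic flakiness leaves behind in CI history.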
Predicting Systemic Flakiness with Machine Learning: Rather than requiring thousands of test suite runs to identify which tests fail together, we found that machine learning models can predict systemic flakiness using simple static analysis. Extra Trees models achieved the best performance with 74% accuracy, and, surprisingly, the structure of test names (such as the package and class hierarchy) was more predictive than features extracted from the test source code itself.
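For a flavor of how static features derived from test names could feed such a model, here is a minimal sketch built around scikit-learn’s ExtraTreesClassifier. The pairwise framing (predicting whether two flaky tests belong to the same cluster), the example test names, and the features are assumptions made for illustration, not the paper’s actual pipeline.

```python
from itertools import combinations

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical training data: fully qualified test names mapped to the
# cluster each test was observed to belong to in historical failure data.
tests = {
    "com.example.net.HttpClientTest.testTimeout": 0,
    "com.example.net.HttpClientTest.testRetry": 0,
    "com.example.net.DnsResolverTest.testLookup": 0,
    "com.example.db.MigrationTest.testSchema": 1,
    "com.example.db.ConnectionPoolTest.testAcquire": 1,
    "com.example.ui.RenderTest.testLayout": 2,
}

def pair_features(name_a, name_b):
    """Static features derived only from the two test names: how much of
    the package/class hierarchy the tests share."""
    parts_a, parts_b = name_a.split("."), name_b.split(".")
    shared_prefix = 0
    for a, b in zip(parts_a, parts_b):
        if a != b:
            break
        shared_prefix += 1
    same_class = parts_a[:-1] == parts_b[:-1]
    return [shared_prefix, int(same_class)]

# Each sample is a pair of flaky tests; the label is whether they co-fail.
X, y = [], []
for (name_a, cluster_a), (name_b, cluster_b) in combinations(tests.items(), 2):
    X.append(pair_features(name_a, name_b))
    y.append(int(cluster_a == cluster_b))

model = ExtraTreesClassifier(n_estimators=100, random_state=0)
print("mean accuracy:", cross_val_score(model, X, y, cv=3).mean())
```

The appeal of name-based features is that they require nothing beyond the test identifiers themselves, which is far cheaper than rerunning the suite thousands of times.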
Root Causes of Systemic Flakiness: Through manual inspection of error messages and stack traces, we discovered that systemic flakiness is primarily caused by networking issues (e.g., Domain Name System (DNS) failures or connection timeouts) and external dependencies (e.g., unstable services or library version conflicts). This differs significantly from individual flaky tests, which are usually caused by concurrency or timing issues.
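To illustrate the kind of signals that separate these categories, here is a simple keyword-based triage over error messages and stack traces. The categories and patterns are illustrative guesses, not the paper’s methodology, which relied on manual inspection.

```python
import re

# Illustrative (and far from exhaustive) mapping from root-cause category
# to tell-tale patterns in error messages or stack traces.
CATEGORY_PATTERNS = {
    "networking": [r"UnknownHostException", r"SocketTimeoutException",
                   r"Connection refused", r"Read timed out"],
    "external dependency": [r"HTTP 5\d\d", r"NoClassDefFoundError",
                            r"NoSuchMethodError", r"Service Unavailable"],
    "concurrency": [r"ConcurrentModificationException", r"deadlock",
                    r"InterruptedException"],
}

def triage(stack_trace: str) -> str:
    """Return a coarse root-cause guess for a single failure."""
    for category, patterns in CATEGORY_PATTERNS.items():
        if any(re.search(pattern, stack_trace) for pattern in patterns):
            return category
    return "unknown"

print(triage("java.net.UnknownHostException: api.example.com"))  # networking
```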
Future Research Suggestions
This paper’s results suggest several exciting avenues for future investigation:
Expanding Beyond Java: The current study focused on Java projects, but systemic flakiness likely occurs across all programming languages and testing frameworks. Research should examine whether similar clustering patterns exist in other ecosystems.
Automated Root Cause Analysis: While our manual inspection revealed important patterns, the process was time-consuming and sometimes inconclusive. Developing artificial intelligence (AI)-powered techniques to automatically analyze test code and stack traces could significantly accelerate the identification of systemic flakiness causes.
Integration with CI/CD Tools: Current flaky test detection tools normally focus on individual tests without considering failure co-occurrence. Research should explore how to integrate systemic flakiness prediction into existing continuous integration (CI) and continuous delivery (CD) pipelines on platforms like GitHub Actions.
Developer Studies: Our research was based on quantitative analysis, but we need to understand how developers currently perceive and handle systemic flakiness. Surveys and interviews could reveal whether practitioners are already aware of this phenomenon.
Realistic Testing Impact: Previous studies evaluated how flaky tests affect fault localization and mutation testing using simulated individual failures. These assessments should be revisited using realistic models that account for clustered failures.
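As a toy illustration of why this matters, the sketch below compares an independent-failure model with a clustered one. Both produce the same expected number of flaky failures per run, but the clustered model produces bursty runs in which many tests fail at once; every parameter here is invented.

```python
import random

random.seed(0)

N_TESTS = 100       # flaky tests in the suite (hypothetical)
CLUSTER_SIZE = 10   # tests per systemic-flakiness cluster (hypothetical)
FAIL_PROB = 0.05    # probability of a flaky failure event per run
N_RUNS = 1000

def independent_model():
    """Each flaky test fails on its own with probability FAIL_PROB."""
    return sum(random.random() < FAIL_PROB for _ in range(N_TESTS))

def clustered_model():
    """Failures happen cluster-by-cluster: a shared root cause either
    manifests (the whole cluster fails) or it does not."""
    n_clusters = N_TESTS // CLUSTER_SIZE
    return sum(CLUSTER_SIZE for _ in range(n_clusters)
               if random.random() < FAIL_PROB)

for name, model in [("independent", independent_model), ("clustered", clustered_model)]:
    failures_per_run = [model() for _ in range(N_RUNS)]
    print(name,
          "mean failures per run:", sum(failures_per_run) / N_RUNS,
          "max failures in a run:", max(failures_per_run))
```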
Benchmark Datasets: The research community needs datasets that explicitly capture failure co-occurrence patterns and the phenomenon of systemic flakiness. These benchmarks would enable comparative evaluation of systemic flakiness detection tools.
This research paper represents a fundamental shift in how we understand flaky tests. If you’re interested in learning more about systemic flakiness or have experiences with clustered test failures, I’d love to hear from you! Please contact me with your insights. To stay updated on the latest developments in flaky test research, consider subscribing to my mailing list.