When flaky tests fail together: Empirical evidence for systemic flakiness

Categories: post, flaky tests, empirical study

Flaky tests often cluster together with shared root causes!

Author: Gregory M. Kapfhammer

Published: 2025

Introduction

Have you ever noticed that when one flaky test fails in your continuous integration pipeline, several others seem to fail at the same time? Well, you’re not imagining things! My colleagues and I recently completed a comprehensive study revealing that flaky tests often exist in clusters, failing together due to shared root causes. We call this phenomenon systemic flakiness, and we believe it represents a major shift in how we should understand and address flaky tests.

Our research paper “Systemic Flakiness: An Empirical Analysis of Co-Occurring Flaky Test Failures” (Parry et al. 2025) reveals that 75% of flaky tests belong to clusters with an average of 13.5 tests per cluster. As explained in this post, our discovery has profound implications for how developers can reduce the time and cost of fixing flaky tests!
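
To see what systemic flakiness looks like in practice, consider a minimal, hypothetical sketch: two tests that never touch each other’s code still fail together whenever a shared external service misbehaves. The service URL and test names below are invented for illustration and do not come from the study.

```python
import urllib.request

# Hypothetical shared dependency: if this service is slow or
# unreachable, every test that touches it fails in the same run,
# producing a cluster of co-occurring failures.
SERVICE_URL = "https://api.example.com/status"  # invented for illustration

def fetch_status(url: str = SERVICE_URL, timeout: float = 2.0) -> int:
    # Returns the HTTP status code; raises on timeouts and DNS errors.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status

def test_service_is_healthy():
    # Flaky: fails whenever the shared service times out or DNS fails.
    assert fetch_status() == 200

def test_report_uses_service():
    # Also flaky, for the same root cause: it contacts the same endpoint,
    # so its failures co-occur with the test above.
    assert fetch_status() == 200
```

Rerunning either test in isolation may well pass, which is why failure co-occurrence across entire test suite runs, rather than the behavior of any single test, is the signal to look for.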

Summary of Key Results

  • The Prevalence of Systemic Flakiness: Our analysis of 810 flaky tests across 24 Java projects revealed that 75% of flaky tests belong to clusters with an average of 13.5 tests per cluster. This means that when one flaky test fails, it’s likely that 12 or 13 other test cases will also fail at the same time due to shared root causes (see the co-failure clustering sketch after this list).

  • Predicting Systemic Flakiness with Machine Learning: Rather than requiring thousands of test suite runs to identify which tests fail together, we found that machine learning models can predict systemic flakiness using simple static analysis. Extra trees models achieved the best performance with 74% accuracy, and surprisingly, the structure of test names (like package and class hierarchy) was more predictive than analyzing the actual source code (see the extra trees sketch after this list).

  • Root Causes of Systemic Flakiness: Through manual inspection of error messages and stack traces, we discovered that systemic flakiness is primarily caused by networking issues (e.g., domain name service failures or connection timeouts) and external dependencies (e.g., unstable services or library version conflicts). This differs significantly from individual flaky tests, which are usually caused by concurrency or timing issues.
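
How might co-failure clusters be identified from a failure history? The following sketch shows one plausible approach, assuming a binary test-by-run failure matrix, Jaccard distances, and hierarchical clustering; these choices are illustrative and not the exact pipeline from the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical failure matrix: rows are flaky tests, columns are test
# suite runs, and a 1 means the test failed in that run.
failures = np.array([
    [1, 0, 1, 1, 0],  # test_a
    [1, 0, 1, 1, 0],  # test_b: always fails alongside test_a
    [0, 1, 0, 0, 1],  # test_c
    [0, 1, 0, 0, 1],  # test_d: always fails alongside test_c
    [1, 1, 0, 1, 1],  # test_e: fails for its own reasons
])

# Jaccard distance treats two tests as close when they tend to fail in
# the same runs; average-linkage clustering then groups them, and the
# threshold t controls how similar co-failure patterns must be.
distances = pdist(failures.astype(bool), metric="jaccard")
clusters = fcluster(linkage(distances, method="average"),
                    t=0.4, criterion="distance")
print(clusters)  # e.g., three clusters: {test_a, test_b}, {test_c, test_d}, {test_e}
```

In the study, cluster membership was derived from observing many real test suite runs; the point of this sketch is only that failure co-occurrence, not source code similarity, is what defines a cluster.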
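
The prediction result can also be sketched. Below, a pair of tests is classified as belonging to the same cluster using only static features of their fully qualified names. The pairwise framing, the specific features, and the toy training data are assumptions made for illustration; they are not the paper’s exact feature set or evaluation.

```python
from sklearn.ensemble import ExtraTreesClassifier

def pair_features(test_a: str, test_b: str) -> list[int]:
    """Static features for a pair of fully qualified test names,
    e.g. 'com.example.net.HttpTest.testGet'. Purely illustrative."""
    parts_a, parts_b = test_a.split("."), test_b.split(".")
    shared = 0
    for x, y in zip(parts_a, parts_b):
        if x != y:
            break
        shared += 1
    return [
        shared,                             # shared package/class prefix length
        int(parts_a[:-1] == parts_b[:-1]),  # declared in the same class?
        abs(len(parts_a) - len(parts_b)),   # difference in name depth
    ]

# Hypothetical labeled pairs: 1 = same co-failure cluster, 0 = not.
pairs = [
    ("com.example.net.HttpTest.testGet", "com.example.net.HttpTest.testPost", 1),
    ("com.example.net.HttpTest.testGet", "com.example.db.QueryTest.testJoin", 0),
    ("com.example.db.QueryTest.testJoin", "com.example.db.QueryTest.testScan", 1),
    ("com.example.util.MathTest.testAdd", "com.example.net.DnsTest.testLookup", 0),
]
X = [pair_features(a, b) for a, b, _ in pairs]
y = [label for _, _, label in pairs]

# Extra (extremely randomized) trees: the ensemble family that performed
# best in the study, here trained on toy data purely for illustration.
model = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
print(model.predict([pair_features(
    "com.example.net.HttpTest.testDelete",
    "com.example.net.HttpTest.testGet")]))
```

The takeaway from the study’s version of this experiment is that cheap, name-based features alone carried most of the signal, so developers would not need to run a test suite thousands of times to anticipate which tests will fail together.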

Future Research Suggestions

This paper’s results suggest several exciting avenues for future investigation:

  • Expanding Beyond Java: The current study focused on Java projects, but systemic flakiness likely occurs across all programming languages and testing frameworks. Research should examine whether similar clustering patterns exist in other ecosystems.

  • Automated Root Cause Analysis: While our manual inspection revealed important patterns, the process was time-consuming and sometimes inconclusive. Developing artificial intelligence (AI)-powered techniques to automatically analyze test code and stack traces could significantly accelerate the identification of systemic flakiness causes (a toy version of this kind of stack trace triage appears after this list).

  • Integration with CI/CD Tools: Current flaky test detection tools normally focus on individual tests without considering failure co-occurrence. Research should explore how to integrate systemic flakiness prediction into existing continuous integration (CI) and continuous delivery (CD) pipelines on platforms like GitHub Actions.

  • Developer Studies: Our research was based on quantitative analysis, but we need to understand how developers currently perceive and handle systemic flakiness. Surveys and interviews could reveal whether practitioners are already aware of this phenomenon.

  • Realistic Testing Impact: Previous studies evaluated how flaky tests affect fault localization and mutation testing using simulated individual failures. These assessments should be revisited using realistic models that account for clustered failures.

  • Benchmark Datasets: The research community needs datasets that explicitly capture failure co-occurrence patterns and the phenomenon of systemic flakiness. These benchmarks would enable comparative evaluation of systemic flakiness detection tools.
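
As a starting point for the automated root cause analysis suggested above, here is a deliberately simple sketch that groups co-failing tests by the exception type named at the top of each stack trace. The traces and the regular expression are illustrative assumptions; a real tool would need far more sophisticated analysis.

```python
import re
from collections import Counter

# Hypothetical stack traces gathered from one CI run in which a
# cluster of flaky tests failed together.
stack_traces = {
    "test_fetch_status": "java.net.UnknownHostException: api.example.com\n\tat ...",
    "test_report_generation": "java.net.UnknownHostException: api.example.com\n\tat ...",
    "test_cache_eviction": "java.util.ConcurrentModificationException\n\tat ...",
}

# The first line of a Java stack trace names the exception type; this
# crude pattern is a stand-in for the kind of triage that an
# AI-assisted tool could perform at scale.
EXCEPTION_PATTERN = re.compile(r"^([\w.]+(?:Exception|Error))")

def exception_type(trace: str) -> str:
    match = EXCEPTION_PATTERN.match(trace)
    return match.group(1) if match else "unknown"

causes = Counter(exception_type(trace) for trace in stack_traces.values())
print(causes.most_common())
# [('java.net.UnknownHostException', 2), ('java.util.ConcurrentModificationException', 1)]
```

Even this crude grouping separates the network-related failures from an unrelated concurrency failure, mirroring the distinction between systemic and individual flakiness described earlier in this post.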

Further Details

This research paper represents a fundamental shift in how we understand flaky tests. If you’re interested in learning more about systemic flakiness or have experiences with clustered test failures, I’d love to hear from you! Please contact me with your insights. To stay updated on the latest developments in flaky test research, consider subscribing to my mailing list.

References

Parry, Owain, Gregory M. Kapfhammer, Michael Hilton, and Phil McMinn. 2025. “Systemic Flakiness: An Empirical Analysis of Co-Occurring Flaky Test Failures.” In Proceedings of the 29th International Conference on Evaluation and Assessment in Software Engineering.
