Systemic flakiness: An empirical analysis of co-occurring flaky test failures

Topics: empirical study, flaky tests, machine learning

Venue: Proceedings of the 29th International Conference on Evaluation and Assessment in Software Engineering

Authors: Owain Parry, Gregory M. Kapfhammer, Michael Hilton, and Phil McMinn

Published: 2025
Abstract
Flaky tests produce inconsistent outcomes without code changes, creating major challenges for software developers. An industrial case study reported that developers spend 1.28% of their time repairing flaky tests, at a monthly cost of $2,250. This paper reveals that flaky tests often exist in clusters, with co-occurring failures that share the same root causes, a phenomenon we call systemic flakiness. This result suggests that developers can reduce test repair costs by addressing shared root causes, fixing multiple flaky tests at once rather than tackling them individually. This study represents an inflection point by challenging the deep-seated assumption that flaky test failures are isolated occurrences. We used an established dataset of 10,000 test suite runs from 24 Java projects on GitHub, spanning domains from data orchestration to job scheduling. Using this dataset, which contains 810 flaky tests, we performed a mixed-method empirical analysis of co-occurring flaky test failures, revealing that systemic flakiness is significant and widespread.
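To make the idea of co-occurring failures concrete, here is a minimal pure-Python sketch of one plausible way to quantify how often two tests fail together across a series of test suite runs, using Jaccard similarity over the sets of runs in which each test failed. The test names and run outcomes are hypothetical illustrations, not data from the paper.

```python
# Hypothetical data: each test suite run maps to the set of tests that failed.
runs = [
    {"testA", "testB"},
    {"testA", "testB", "testC"},
    {"testC"},
    {"testA", "testB"},
    set(),
]

def jaccard(test1, test2, runs):
    """Jaccard similarity of the sets of runs in which two tests failed."""
    fails1 = {i for i, failed in enumerate(runs) if test1 in failed}
    fails2 = {i for i, failed in enumerate(runs) if test2 in failed}
    union = fails1 | fails2
    return len(fails1 & fails2) / len(union) if union else 0.0

print(jaccard("testA", "testB", runs))  # → 1.0 (they always co-fail)
print(jaccard("testA", "testC", runs))  # → 0.25 (one shared failing run)
```

A pair with similarity near 1.0 is a candidate for a shared root cause, while a pair near 0.0 fails independently.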

We applied agglomerative clustering to the flaky tests based on their failure co-occurrence, showing that 75% of flaky tests across all projects belong to a cluster, with a mean cluster size of 13.5 flaky tests. Instead of requiring 10,000 test suite runs to identify systemic flakiness, this paper demonstrates a lightweight alternative: training machine learning models based on static test case distance measures. Through manual inspection of stack traces, conducted independently by the paper's four authors and resolved through negotiated agreement, we identified intermittent networking issues and instabilities in external dependencies as the predominant causes of systemic flakiness in the chosen open-source projects.
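The clustering step can be sketched with SciPy's hierarchical clustering routines, applied to a distance matrix derived from failure co-occurrence (for example, one minus a pairwise similarity). The test names, matrix values, average linkage, and the 0.5 cut threshold below are illustrative assumptions, not the paper's actual parameters.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical distances (1 - co-occurrence similarity) among four flaky tests.
tests = ["testA", "testB", "testC", "testD"]
dist = np.array([
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
])

# Agglomerative (average-linkage) clustering on the condensed distance matrix,
# cutting the dendrogram at an illustrative distance threshold of 0.5.
labels = fcluster(linkage(squareform(dist), method="average"),
                  t=0.5, criterion="distance")

clusters = {}
for test, label in zip(tests, labels):
    clusters.setdefault(label, []).append(test)
print(sorted(clusters.values()))  # → [['testA', 'testB'], ['testC', 'testD']]
```

Each resulting cluster groups tests whose failures tend to occur together, so a single repair of the shared root cause could fix every test in the cluster.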
Details

Paper

Reference
@inproceedings{Parry2025,
  author    = {Owain Parry and Gregory M. Kapfhammer and Michael Hilton and Phil McMinn},
  booktitle = {Proceedings of the 29th International Conference on Evaluation and Assessment in Software Engineering},
  title     = {Systemic flakiness: An empirical analysis of co-occurring flaky test failures},
  year      = {2025}
}
