What do developer-repaired flaky tests tell us about the effectiveness of automated flaky test detection?
Keywords: empirical study, flaky tests, program repair
Proceedings of the 3rd International Conference on Automation of Software Test
Abstract
Because they pass or fail without code changes, flaky tests cause serious problems such as spuriously failing builds and eroding developers' trust in tests. Many previous evaluations of automated flaky test detection techniques do not accurately assess their usefulness for the developers who identify flaky tests to repair. This is because researchers evaluate detection techniques against baselines that are not derived from past developer behavior, or against no baselines at all. To study the effectiveness of an automated test rerunning technique, a common baseline for other approaches to detection, this paper uses 75 commits, authored by human software developers, that repair test flakiness in 31 real-world Python projects. Surprisingly, automated rerunning detects the developer-repaired flaky tests in only 40% of the studied commits. This result suggests that automated rerunning does not often find the flaky tests that developers fix, implying that it is an unsuitable baseline for assessing a detection technique's usefulness for developers.
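The rerunning baseline studied in the paper repeatedly executes tests without any code changes and flags those whose verdicts differ across runs. As a rough illustration only, not the paper's actual tooling, a minimal Python sketch of this idea might look like the following; the use of pytest, the test identifiers, and the run count are all assumptions:

import subprocess

def detect_flaky(test_ids, runs=10):
    """Rerun each test several times; a test that both passes and fails is flagged flaky."""
    outcomes = {test_id: set() for test_id in test_ids}
    for _ in range(runs):
        for test_id in test_ids:
            # Run a single test in a fresh process; exit code 0 means it passed.
            result = subprocess.run(["pytest", "-q", test_id], capture_output=True)
            outcomes[test_id].add(result.returncode == 0)
    # Differing outcomes across identical reruns indicate flakiness.
    return [t for t, seen in outcomes.items() if len(seen) > 1]

if __name__ == "__main__":
    # Hypothetical test identifier; substitute tests from your own project.
    print(detect_flaky(["tests/test_example.py::test_network_timeout"]))

Even many reruns in a fixed environment can miss flaky behavior that manifests only rarely or under particular conditions, which is consistent with the paper's finding that rerunning detects the developer-repaired flaky tests in only 40% of the studied commits.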
Code: flake-it/showflakes
Reference
@inproceedings{Parry2022c,
  author    = {Owain Parry and Gregory M. Kapfhammer and Michael Hilton and Phil McMinn},
  booktitle = {Proceedings of the 3rd International Conference on Automation of Software Test},
  title     = {What do developer-repaired flaky tests tell us about the effectiveness of automated flaky test detection?},
  year      = {2022}
}