What do developer-repaired flaky tests tell us about the effectiveness of automated flaky test detection?
Keywords: empirical study, flaky tests, program repair
Proceedings of the 3rd International Conference on Automation of Software Test
Abstract
Because they pass or fail without code changes, flaky tests cause serious problems such as spuriously failing builds and eroding developers' trust in tests. Many previous evaluations of automated flaky test detection techniques do not accurately assess their usefulness for the developers who identify flaky tests to repair. This is because researchers evaluate detection techniques against baselines that are not derived from past developer behavior, or against no baselines at all. To study the effectiveness of an automated test rerunning technique, a common baseline for other approaches to detection, this paper uses 75 commits, authored by human software developers, that repair test flakiness in 31 real-world Python projects. Surprisingly, automated rerunning detects the developer-repaired flaky tests in only 40% of the studied commits. This result suggests that automated rerunning does not often find the flaky tests that developers fix, implying that it is an unsuitable baseline for assessing a detection technique's usefulness for developers.
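The rerunning baseline studied in the paper repeatedly executes tests without any code changes and flags those whose verdicts differ across runs. As a rough illustration only, not the paper's actual tooling, a minimal Python sketch of this idea might look like the following; the use of pytest, the test identifiers, and the run count are all assumptions:

import subprocess

def detect_flaky(test_ids, runs=10):
    """Rerun each test several times; a test that both passes and fails is flagged flaky."""
    outcomes = {test_id: set() for test_id in test_ids}
    for _ in range(runs):
        for test_id in test_ids:
            # Run a single test in a fresh process; exit code 0 means it passed.
            result = subprocess.run(["pytest", "-q", test_id], capture_output=True)
            outcomes[test_id].add(result.returncode == 0)
    # Differing outcomes across identical reruns indicate flakiness.
    return [t for t, seen in outcomes.items() if len(seen) > 1]

if __name__ == "__main__":
    # Hypothetical test identifier; substitute tests from your own project.
    print(detect_flaky(["tests/test_example.py::test_network_timeout"]))

Even many reruns in a fixed environment can miss flaky behavior that manifests only rarely or under particular conditions, which is consistent with the paper's finding that rerunning detects the developer-repaired flaky tests in only 40% of the studied commits.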
Code: flake-it/showflakes
Reference
@inproceedings{Parry2022c,
  author    = {Owain Parry and Gregory M. Kapfhammer and Michael Hilton and Phil McMinn},
  booktitle = {Proceedings of the 3rd International Conference on Automation of Software Test},
  title     = {What do developer-repaired flaky tests tell us about the effectiveness of automated flaky test detection?},
  year      = {2022}
}