
Test flimsiness: Characterizing flakiness induced by mutation to the code under test

empirical study
flaky tests
mutation testing
Proceedings of the 48th International Conference on Software Engineering
Authors

Owain Parry

Gregory M. Kapfhammer

Michael Hilton

Phil McMinn

Published

2026

Abstract
Flaky tests, which fail non-deterministically against the same version of code, pose a well-established challenge to software developers. In this paper, we characterize the overlooked phenomenon of test FLIMsiness: FLakiness Induced by Mutations to the code under test. These mutations are generated by the same operators found in standard mutation testing tools. Flimsiness has profound implications for software testing researchers. Previous studies quantified the impact of pre-existing flaky tests on mutation testing, but we reveal that mutations themselves can induce flakiness, exposing a previously neglected threat. This has serious effects beyond mutation testing, calling into question the reliability of any technique that relies on deterministic test outcomes in response to mutations.

On the other hand, flimsiness presents an opportunity to surface potential flakiness that may otherwise remain hidden. Prior work perturbed the execution environment to augment rerunning-based detection and the test code to support benchmarking. We advance these efforts by perturbing a third major source of flakiness: the code under test. We conducted an empirical study on over half a million test suite executions across 28 Python projects. Our statistical analysis on over 30 million mutant-test pairs unveiled flimsiness in 54% of projects. We found that extending the standard rerunning flaky test detection strategy with code-under-test mutations detects a substantially larger number of flaky tests (median 740 vs. 163) and uncovers many that the standard strategy is unlikely to detect.
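To illustrate the phenomenon the abstract describes, here is a hypothetical Python sketch (not taken from the paper): a statement-deletion mutation, of the kind applied by standard mutation testing tools, removes a sort from the code under test, so a previously deterministic test outcome now depends on set iteration order, which varies between Python processes due to string hash randomization. The function names and data are invented for illustration.

```python
def unique_names_original(raw):
    """Deduplicate names and return them in a deterministic (sorted) order."""
    names = list(set(raw))
    names.sort()          # <-- the statement targeted by the mutation
    return names

def unique_names_mutant(raw):
    """Statement-deletion mutant of unique_names_original: the sort is gone,
    so the result follows set iteration order, which can change between
    Python processes because of string hash randomization."""
    names = list(set(raw))
    # names.sort()        # deleted by the mutation operator
    return names

raw = ["eve", "ada", "eve", "bob"]

# Against the original code, this assertion passes on every run.
assert unique_names_original(raw) == ["ada", "bob", "eve"]

# The analogous test on the mutant,
#     assert unique_names_mutant(raw) == ["ada", "bob", "eve"],
# passes in some runs and fails in others: the mutation has induced
# flakiness (the test is "flimsy") even though the test code and the
# execution environment are unchanged.
assert sorted(unique_names_mutant(raw)) == ["ada", "bob", "eve"]
```

A rerunning-based detection strategy that executes tests only against the unmutated code would never surface this behavior, which is the opportunity the paper's extended detection strategy exploits.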
Details

Paper

Reference
@inproceedings{Parry2026,
 author = {Owain Parry and Gregory M. Kapfhammer and Michael Hilton and Phil McMinn},
 booktitle = {Proceedings of the 48th International Conference on Software Engineering},
 title = {Test flimsiness: Characterizing flakiness induced by mutation to the code under test},
 year = {2026}
}
