Finding flaky tests with machine learning and the FLAKE16 feature set

post

flaky tests

developer productivity

How to best find flaky tests with machine learning?

Author

Gregory M. Kapfhammer

Published

2022

Introduction

Isn’t it frustrating when you run your test suite and some tests pass and others fail, even though you haven’t changed any code? These tests are often called flaky tests — and they present a serious and pernicious challenge for software engineers. My colleagues and I recently published (Parry et al. 2022) that introduces FLAKE16, a new feature set that significantly improves the detection of flaky tests with machine learning. Keep reading to learn more about this paper!

Results

This paper’s study evaluates the performance of 54 machine learning pipelines for detecting both non-order-dependent (NOD) and order-dependent (OD) flaky tests. Using real-world Python projects, we found that FLAKE16 outperformed a well-established feature set in detecting both types of flaky tests. Specifically, FLAKE16 offered a 13% increase in the overall F1-score when detecting NOD flaky tests and a 17% increase when detecting OD flaky tests.

One of the key insights from our research is the identification of the most impactful features for detecting flaky tests. For NOD flaky tests, the peak number of concurrently running threads during test case execution was the most impactful. For OD flaky tests, the number of read- and write-related system calls had the greatest impact on the detector’s overall effectiveness.

Future

Our research is a significant step forward in the field of flaky test detection. Since there is evidence of the promise of this approach, we plan to extend our work to include test cases from projects implemented in different programming languages and evaluate the performance of machine learning models for detecting flaky tests in additional specific categories. You can explore (Parry et al. 2023) for a detailed introduction and evaluation of a both feature set that extends FLAKE16 and a hybrid approach that combines machine learning with test re-running. Guess what? It turns out that re-running and machine learning work better together!

Further Details

As my colleagues and I continue to explore how machine learning can support the detection and repair of flaky tests, your insights and suggestions are appreciated! If you have ideas or experiences related to this pervasive issue in software testing, please contact me. Or, if you want to stay informed about new developments and blog posts related to flaky tests and other software testing topics, consider subscribing to my mailing list. Finally, you can hear me discuss this paper on a podcast by listening to the episode "Dr. Gregory Kapfhammer Wants to Stop Flaky Tests"!

Return to Blog Post Listing

References

Parry, Owain, Gregory M. Kapfhammer, Michael Hilton, and Phil McMinn. 2022. “Evaluating Features for Machine Learning Detection of Order- and Non-Order-Dependent Flaky Tests.” In Proceedings of the 15th International Conference on Software Testing, Verification and Validation.

———. 2023. “Empirically Evaluating Flaky Test Detection Techniques Combining Test Case Rerunning and Machine Learning Models.”Empirical Software Engineering Journal 28 (72).