Is big data a big deal? Not without correct software!

database testing
position statement
Data integrity is central to big data initiatives!

Gregory M. Kapfhammer



Important Context

This post was written to support my participation in a panel at the 27th International Conference on Software Engineering and Knowledge Engineering. To view the accompanying slides for this presentation, please refer to (Kapfhammer 2015). If you want to learn more about the new work that my colleagues, students, and I are conducting in the area of efficiently testing data-centric applications, then please read (Kinneer et al. 2015a) and (Kinneer et al. 2015b), two papers that were presented at the same conference.

Big data analytics software allows researchers and practitioners to create descriptive models and make predictions. Often characterized by the “three Vs” of volume, velocity, and variety, big data systems must respectively handle large amounts of data that arrive rapidly and take many different forms. In fields such as evidence-based medicine and the detection of financial fraud, big data software is poised to make, and indeed is already making, important contributions.

However, there is an additional “V” that both researchers and practitioners often overlook: veracity. That is, if the software and data that make up a big data analytics system are not correct, then the data models and the resulting predictions may be compromised, with serious consequences. For instance, The Data Warehousing Institute reports that North American organizations experience a $611 billion annual loss due to poor data quality. Scott W. Ambler argues that the “virtual absence” of software and data testing is the primary cause of this loss. Although this example is not specifically tied to big data systems, it clearly illustrates the risks associated with a lack of veracity in any data-rich field.

The challenge for software testing researchers is to develop and empirically evaluate new methods that can accommodate the volume, velocity, and variety that are characteristic of big data systems. While some preliminary work (e.g., the testing of both data mining systems and database applications) has recently been published, few software engineering researchers have focused on big data testing. Since veracity is not always considered by big data researchers, the challenge for these individuals is to create and assess new techniques that, whenever possible, holistically consider all of the “four Vs”. If they are not already doing so, practitioners in both of these fields should start to establish confidence in the correctness of both their software and data through the disciplined use of testing.
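As a small illustration of what testing for the correctness of data can look like in practice, the following sketch checks that a relational schema's integrity constraints accept a valid row and reject three kinds of invalid rows. The `readings` table, its columns, and the helper names are hypothetical and chosen only for this example; it uses Python's built-in `sqlite3` module and is not a depiction of any specific tool or technique from the papers cited in this post.

```python
import sqlite3

# Hypothetical schema, invented for illustration only.
SCHEMA = """
CREATE TABLE readings (
    id INTEGER PRIMARY KEY,
    sensor TEXT NOT NULL,
    value REAL NOT NULL CHECK (value >= 0)
);
"""


def is_accepted(connection, row):
    """Return True if the schema accepts the row, False if a constraint rejects it."""
    try:
        # The connection context manager commits on success and rolls back on error.
        with connection:
            connection.execute(
                "INSERT INTO readings (id, sensor, value) VALUES (?, ?, ?)", row
            )
        return True
    except sqlite3.IntegrityError:
        return False


def run_checks():
    """Exercise each integrity constraint with one valid and three invalid rows."""
    connection = sqlite3.connect(":memory:")
    connection.executescript(SCHEMA)
    results = {
        "valid_row": is_accepted(connection, (1, "a", 2.5)),
        "duplicate_key": is_accepted(connection, (1, "b", 3.0)),
        "missing_sensor": is_accepted(connection, (2, None, 1.0)),
        "negative_value": is_accepted(connection, (3, "c", -1.0)),
    }
    connection.close()
    return results


if __name__ == "__main__":
    for name, accepted in run_checks().items():
        print(f"{name}: {'accepted' if accepted else 'rejected'}")
```

A check like this only establishes that the schema guards against the specific bad rows that were tried, so the choice of test rows matters as much as the constraints themselves.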

The title of my position statement poses the question “is big data a big deal?” Of course, the answer to this question is “yes”. With that said, the growth of big data’s importance and impact will be accelerated, and even sustained, if researchers and practitioners in fields such as software engineering, software testing, and big data collaborate with each other to develop efficient and effective data analytics systems that construct high-quality models and make accurate predictions. Let’s collaborate across the fields of software engineering and big data to ensure that we have a positive influence on society, thus proving to be a “bigger deal” together than we would have been on our own.

Further Details

Interested in learning more about this topic? Since this blog post was first written, my colleagues, students, and I have published several additional papers about the testing of relational database schemas, with the most noteworthy one being (McMinn, Wright, and Kapfhammer 2015). If you are interested in using SchemaAnalyst to test your own database schema, then please download and use the tool, which is now available on GitHub at schemaanalyst/schemaanalyst.



Kapfhammer, Gregory M. 2015. “Is Big Data a Big Deal? Not Without Correct Software!” 27th International Conference on Software Engineering and Knowledge Engineering – Panel.
Kinneer, Cody, Gregory M. Kapfhammer, Chris J. Wright, and Phil McMinn. 2015a. “Automatically Evaluating the Efficiency of Search-Based Test Data Generation for Relational Database Schemas.” In Proceedings of the 27th International Conference on Software Engineering and Knowledge Engineering.
———. 2015b. “ExpOse: Inferring Worst-Case Time Complexity by Automatic Empirical Study.” In Proceedings of the 27th International Conference on Software Engineering and Knowledge Engineering.
McMinn, Phil, Chris J. Wright, and Gregory M. Kapfhammer. 2015. “The Effectiveness of Test Coverage Criteria for Relational Database Schema Integrity Constraints.” Transactions on Software Engineering and Methodology 25 (1).