Many research analysts and industry observers have called 2016 a breakout year for Big Data. The belief is that organizations already leveraging big data will surge ahead, while those that have not considered it will lag behind in the digital world. IDC predicts that revenue from sales of big data and business analytics applications, tools, and services will increase by more than 50%, from nearly $122 billion in 2015 to more than $187 billion in 2019. The analyst firm estimates revenue by technology, industry, and geography in its Worldwide Semiannual Big Data and Analytics Spending Guide 2015.
According to a recent study, 76% of organizations in the US plan to increase or maintain their investments in Big Data implementations over the next 2-3 years. Data is mounting every second from social networks, mobile devices, CRM applications, IoT, and other sources. This data can reveal hidden patterns that help organizations shape their success story. The volume of this data is sometimes expected to reach zettabytes. Such quantities of data from varied sources must be processed at a speed that keeps the results relevant to the organization, and the results must then be presented to the respective users through applications.
As with many other IT solutions, QA has a significant role in Big Data applications. Testing Big Data is more about verifying ETL processes than about testing individual features of an application. When testing an enterprise Big Data system, a few primary challenges need to be addressed.
Because data comes from many different sources, it needs live integration to be useful. End-to-end testing of the data sources helps ensure that the data is clean. Testers should check that the data sampling and data techniques are correct, and check for other issues as well. Only a thoroughly tested application is ready for live deployment.
A tester working on Big Data systems needs to dig deep into unstructured or semi-structured data with changing schemas, because these systems cannot be tested by "sampling" the way Data Warehousing applications can. Big Data applications come with massive data sets, and testing them calls for an R&D approach. This approach demands a unique set of skills from the tester.
Testing Big Data systems requires testers to verify large volumes of data from various sources using a clustering method. The data needs to be processed systematically, in real time or in batches. Quality checks become essential to verify accuracy, duplication, validity, consistency, and so on. Based on the area being tested, we can group the testing of enterprise applications into three buckets:
The first bucket involves extracting the right data from the right sources. The extracted data is pushed as input into the Big Data system and compared with the source data to make sure the two match. Once they match, the data is pushed to a specified location.
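As a rough illustration of this first bucket, the comparison between source data and the data loaded into the new system can be sketched as a key-by-key reconciliation. This is only a minimal sketch under stated assumptions: the function names `record_fingerprint` and `reconcile`, the `id` key field, and the in-memory lists of dictionaries are all hypothetical stand-ins, not part of any particular Big Data tool. A real system would read from the actual source and staging stores.

```python
import hashlib
import json


def record_fingerprint(record: dict) -> str:
    """Return a SHA-256 digest of a record that does not depend on key order."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def reconcile(source_records, staged_records, key="id"):
    """Compare source and staged records by key (hypothetical 'id' field) and content."""
    source = {r[key]: record_fingerprint(r) for r in source_records}
    staged = {r[key]: record_fingerprint(r) for r in staged_records}
    shared = source.keys() & staged.keys()
    return {
        # Present in the source but never loaded.
        "missing": sorted(source.keys() - staged.keys()),
        # Loaded but not present in the source.
        "unexpected": sorted(staged.keys() - source.keys()),
        # Present on both sides with different content.
        "mismatched": sorted(k for k in shared if source[k] != staged[k]),
    }


# Made-up sample data for illustration only.
source_rows = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}, {"id": 3, "amount": 30}]
staged_rows = [{"amount": 10, "id": 1}, {"id": 2, "amount": 25}, {"id": 4, "amount": 40}]
report = reconcile(source_rows, staged_rows)
print(report)
```

An empty report in all three categories would suggest the staged data matches the source. At real volumes the same idea is usually applied per partition or per batch rather than on a single in-memory list.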
In the second bucket, the tester verifies the business logic at every node and then checks it again across multiple nodes. The same steps are repeated at all nodes to make sure that data segregation and aggregation rules are applied correctly and that key values are generated accurately.
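One way to picture this second bucket is to run the same aggregation rule on each partition of the data, merge the partial results, and compare them with a single-pass reference calculation. The sketch below assumes a simple "sum of amounts per key" rule and models nodes as plain Python lists. The names `aggregate`, `merge`, and `check_aggregation` are hypothetical, and a real system would run the rule on its actual cluster framework.

```python
def aggregate(rows):
    """Sum amounts per key for one set of (key, amount) rows."""
    totals = {}
    for key, amount in rows:
        totals[key] = totals.get(key, 0) + amount
    return totals


def merge(partials):
    """Combine per-partition totals into overall totals."""
    merged = {}
    for partial in partials:
        for key, amount in partial.items():
            merged[key] = merged.get(key, 0) + amount
    return merged


def check_aggregation(partitions):
    """Return (matches, totals): whether per-partition results agree with a single pass."""
    distributed = merge(aggregate(p) for p in partitions)
    reference = aggregate(row for p in partitions for row in p)
    return distributed == reference, distributed


# Made-up partitions standing in for data held on three nodes.
partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3)],
    [("b", 4), ("c", 5)],
]
ok, totals = check_aggregation(partitions)
print(ok, totals)
```

The example uses integer amounts on purpose. With floating-point values, the order of summation can change the result slightly, so an exact equality check would need to be replaced by a tolerance-based comparison.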
In the final bucket, the output data files are created and then moved to the required system or Data Warehouse. The tester then checks data integrity to make sure the data has loaded successfully into the target system, and also checks for any data corruption.
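The integrity checks in this final bucket could look something like the following sketch, which compares the rows written to the output files against the rows found in the target system. The function name `validate_load`, the `id` key, and the list of required fields are assumptions made for illustration. The checks shown (row counts, duplicate keys, missing values) are only a small subset of what a real validation suite would cover.

```python
def validate_load(output_rows, target_rows, key="id", required=("id", "amount")):
    """Return a list of human-readable issues found in the loaded target rows."""
    issues = []

    # The number of rows loaded should match the number of rows written.
    if len(output_rows) != len(target_rows):
        issues.append(
            f"row count mismatch: {len(output_rows)} written, {len(target_rows)} loaded"
        )

    seen = set()
    for row in target_rows:
        row_key = row.get(key)
        # The same key appearing twice suggests a duplicated load.
        if row_key in seen:
            issues.append(f"duplicate key: {row_key}")
        seen.add(row_key)
        # A required field that is absent or None suggests corruption or truncation.
        for field in required:
            if row.get(field) is None:
                issues.append(f"missing {field} in row {row_key}")

    return issues


# Made-up sample data for illustration only.
written = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
loaded = [{"id": 1, "amount": 10}, {"id": 1, "amount": 10}, {"id": 2, "amount": None}]
problems = validate_load(written, loaded)
for problem in problems:
    print(problem)
```

An empty list means none of these particular checks failed. It does not prove the load is correct, so checks like this are best combined with content comparisons against the source data.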
From the discussion so far, it is evident that Big Data systems hold significant promise in today's business environment. To unlock their full potential, testers have to employ the right strategies to improve test quality and identify bugs at early stages. The right strategies include testing the datasets using a variety of tools, techniques, and frameworks. Testing across data processing stages such as data creation, storage, retrieval, and analysis demands a broad range of skills, including database and automation skills, applied at a rapid pace. It is a tedious effort, but when executed systematically and successfully, it pays large dividends.