In a world that produces more than 2.5 quintillion bytes of data per day, it is a matter not of ’if’ but of ’when’ a software firm will face a project involving enormous datasets, or “Big Data”. Along with the challenges it poses to mainstream programming, Big Data demands radically different approaches to testing because of the sheer volume and heterogeneity of the data involved. Traditional testing methods that simply execute read/write queries are no longer sufficient, and a more involved process needs to take their place.
Big Data testing is performed in three stages. Because a traditional RDBMS is poorly suited to datasets of this size, this example uses the Hadoop platform instead. Hadoop is a better fit for our requirements because it was built specifically to handle large, heterogeneous, non-tabular datasets that demand computationally intensive operations.
1. Pre-Hadoop processing
The first stage of Big Data testing involves processing and validating input data. First, all necessary datasets are extracted from their respective sources. Next, the extracted data is compared with the source data to ensure that only the correct data has been pulled. Finally, the data is pushed into the Hadoop Distributed File System (HDFS). HDFS is used for storage because Big Data demands scalability, a feature that is notably limited in traditional RDBMS. Once the data has been pushed into HDFS, verification is performed to ensure that it has been loaded into the right HDFS location(s). Another important step at this stage is to confirm that input datasets are replicated across different data nodes, so that data is not lost if a node fails.
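The comparison between source data and staged data can be sketched as a record count plus an order-independent checksum. This is a minimal illustration, not a prescribed method: the function names `dataset_fingerprint` and `validate_extraction` are hypothetical, and the sketch sorts hashes in memory, which assumes a sample small enough to fit on one machine. At full scale the same idea would typically be applied per partition or per file.

```python
import hashlib


def dataset_fingerprint(records):
    """Return (record_count, digest) for an iterable of text records.

    The digest is order-independent: per-record SHA-256 hashes are
    sorted before being folded into one final hash.
    """
    hashes = sorted(hashlib.sha256(r.encode("utf-8")).hexdigest() for r in records)
    combined = hashlib.sha256("".join(hashes).encode("ascii")).hexdigest()
    return len(hashes), combined


def validate_extraction(source_records, staged_records):
    """Compare source and staged data; return a list of problems found."""
    src_count, src_digest = dataset_fingerprint(source_records)
    stg_count, stg_digest = dataset_fingerprint(staged_records)
    problems = []
    if src_count != stg_count:
        problems.append(
            f"record count mismatch: source={src_count}, staged={stg_count}"
        )
    if src_digest != stg_digest:
        problems.append("content mismatch: checksums differ")
    return problems
```

An empty list from `validate_extraction` means the staged copy holds the same records as the source, regardless of the order in which they arrived.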
2. MapReduce Validation
MapReduce is a technique in which input data is first converted into key-value tuples (mapping), after which the tuples are combined into a smaller set of tuples (reducing). In the second stage of Big Data testing, MapReduce programs are run against the data stored in HDFS. Business logic is first validated on a standalone node, and a higher-level validation is then performed once MapReduce has run successfully against multiple nodes. This stage involves ensuring that the overall MapReduce process has worked correctly, that key-value pairs have been generated correctly, and that the data is validated after MapReduce completes. Another important step in this stage is validating the output data to ensure it has been shaped and stored in the right format.
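To make the map and reduce steps concrete, here is a small single-process sketch in plain Python using word counting as the business logic. It does not use Hadoop itself; the function names and the word-count example are assumptions chosen for illustration, and `validate_output` shows one possible check on the generated key-value pairs.

```python
from collections import defaultdict


def map_phase(lines):
    """Mapping: emit a (word, 1) tuple for every word in the input."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reducing: combine each key's values into one (word, total) entry."""
    return {key: sum(values) for key, values in grouped.items()}


def run_job(lines):
    """Run the full map -> shuffle -> reduce pipeline on one node."""
    return reduce_phase(shuffle(map_phase(lines)))


def validate_output(lines, result):
    """Check that the reduced totals account for every input word."""
    expected_total = sum(len(line.split()) for line in lines)
    return sum(result.values()) == expected_total
```

Running the logic on a single node like this mirrors the standalone validation step: the same input can later be run on the cluster and the two outputs compared.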
3. HDFS Result Extraction & Validation
By this stage, output files have been generated and are ready to be moved into an Enterprise Data Warehouse. In the third and final stage, the main priority is to ensure that transformation rules have been applied correctly and that the data has not been corrupted. Further steps here include validating that data has been loaded and aggregated successfully and that data integrity is preserved in the target system.
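One way to check transformation rules and aggregation is to re-apply the rule independently to the source rows and reconcile the results with what was loaded. The sketch below assumes a made-up rule (amounts converted from cents to dollars, country codes upper-cased) and hypothetical field names; a real warehouse would have its own rules and schema.

```python
from decimal import Decimal


def expected_transform(row):
    """Hypothetical transformation rule: cents -> dollars, upper-case country."""
    return {
        "id": row["id"],
        "amount": Decimal(row["amount_cents"]) / Decimal(100),
        "country": row["country"].upper(),
    }


def validate_load(source_rows, target_rows):
    """Return a list of discrepancies between expected and loaded rows."""
    issues = []
    target_by_id = {row["id"]: row for row in target_rows}
    for src in source_rows:
        expected = expected_transform(src)
        actual = target_by_id.get(src["id"])
        if actual is None:
            issues.append(f"missing row id={src['id']}")
        elif actual != expected:
            issues.append(f"rule mismatch for id={src['id']}")
    # Aggregation check: totals must reconcile between source and target.
    src_total = sum(Decimal(r["amount_cents"]) for r in source_rows) / Decimal(100)
    tgt_total = sum((r["amount"] for r in target_rows), Decimal(0))
    if src_total != tgt_total:
        issues.append(f"aggregate mismatch: expected {src_total}, got {tgt_total}")
    return issues
```

Using `Decimal` rather than floating point keeps the aggregate comparison exact, which matters when totals are reconciled across systems.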
Although “Big Data Testing” is synonymous with the functional testing process discussed above, ignoring the non-functional aspects would simply result in bad QA. Some of the important non-functional testing methodologies for Big Data are discussed below.
1. Performance Testing
Because of the sheer volume of test data, poor performance is a real risk and could jeopardize the service level agreement (SLA). Performance testing is therefore carried out with very large data volumes on infrastructure similar to production. Metrics captured include memory utilization, throughput, and job completion time.
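As a rough illustration of the three metrics named above, the following sketch times a job in a local Python process and records its peak memory. It is an assumption-laden stand-in: on a real cluster these figures would normally come from the cluster's own monitoring rather than from `tracemalloc`, and `measure_job` and `meets_sla` are hypothetical helper names.

```python
import time
import tracemalloc


def measure_job(job, records):
    """Run job(records) and return basic performance metrics."""
    tracemalloc.start()
    started = time.perf_counter()
    job(records)
    elapsed = time.perf_counter() - started
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "completion_seconds": elapsed,
        # Records processed per second; guard against a zero-length timing.
        "throughput_rps": len(records) / elapsed if elapsed > 0 else float("inf"),
        "peak_memory_bytes": peak_bytes,
    }


def meets_sla(metrics, max_seconds, max_memory_bytes):
    """Compare measured metrics with limits taken from the SLA."""
    return (
        metrics["completion_seconds"] <= max_seconds
        and metrics["peak_memory_bytes"] <= max_memory_bytes
    )
```

For example, `measure_job(sorted, list(range(1_000_000)))` returns the three metrics for sorting one million integers, which can then be passed to `meets_sla` together with the agreed limits.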
2. Failover Testing
Node failure in the Hadoop architecture can render some components unusable, so the architecture needs to be designed to handle such failures. Failover testing verifies that node failure does not cause data corruption, that the data recovery process is initiated appropriately, and that data replication takes place when a node fails. It also captures metrics such as Recovery Point Objective (RPO) and Recovery Time Objective (RTO).
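The recovery metrics can be derived from three timestamps recorded during a failover test. The sketch below is one simple way to do that, assuming the test harness logs when the last successful replication happened, when the node failed, and when service was restored; the function names are hypothetical.

```python
from datetime import datetime, timedelta


def failover_metrics(last_replicated_at, failed_at, recovered_at):
    """Derive achieved recovery figures from three event timestamps."""
    return {
        # Window of writes that could be lost: failure minus last good copy.
        "data_loss_window": failed_at - last_replicated_at,
        # Downtime: from the failure until service is restored.
        "recovery_time": recovered_at - failed_at,
    }


def meets_objectives(metrics, rpo, rto):
    """Check the achieved figures against the RPO and RTO targets."""
    return metrics["data_loss_window"] <= rpo and metrics["recovery_time"] <= rto
```

Here the data loss window is compared with the RPO and the recovery time with the RTO, both expressed as `timedelta` values.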
Even before testing begins, the test environment should be verified as having sufficient storage space, minimal CPU and resource utilization, and a well-defined cluster with distributed nodes. Furthermore, while the processes above give a general view of how testers can perform Big Data testing, firms should also be quick to implement test automation wherever applicable. Given the enormous size of “Big Data” datasets, an automated framework would go a long way toward achieving a faster, more comprehensive testing process.
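A pre-flight check of the kind described above could be automated along these lines. This sketch only inspects the local machine using the standard library, so it covers one node rather than a whole cluster, and the thresholds and the name `check_environment` are assumptions for illustration.

```python
import os
import shutil


def check_environment(path, min_free_bytes, min_cpus):
    """Return a list of reasons the local node is not ready for a test run."""
    problems = []
    free_bytes = shutil.disk_usage(path).free
    if free_bytes < min_free_bytes:
        problems.append(f"only {free_bytes} bytes free at {path}")
    cpus = os.cpu_count() or 0  # cpu_count() may return None on some platforms
    if cpus < min_cpus:
        problems.append(f"only {cpus} CPUs available")
    return problems
```

An empty result means the node passed both checks; an automated framework could run the same check on every node before a test cycle starts.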
Pyramid Consulting helps implement testing of massively scalable solutions for big data infrastructures. Our QA designers bring innovative new testing solutions to performance, security, and data quality that provide fast feedback within development iterations. Pyramid Consulting’s robust Big Data Testing Strategy (BDTS) is built to mitigate rapidly evolving data integrity challenges and ensure robust quality assurance processes for big data implementations. To address the dynamic changes in big data ecosystems, Pyramid Consulting helps organizations streamline their processes for data warehouse testing, performance testing, and test data management.
Contact us to learn more about our Big Data capabilities.