What is Hadoop?
Hadoop is an Apache open-source project that develops software for scalable, distributed computing. It is a top-level Apache project, built and used by a global community of contributors and users, and licensed under the Apache License 2.0.
- Hadoop is a framework for the distributed processing of large data sets across clusters of computers using simple programming models.
- Hadoop easily deals with the complexities of high-volume, high-velocity and high-variety data.
- Hadoop scales from single servers to thousands of machines, each offering local computation and storage.
- Hadoop detects and handles failures at the application layer.
The processing of Big Data, and its software testing process, can be split into three basic steps. The process is illustrated below by an example based on the open-source Apache Hadoop software framework:
- Loading the initial data into the HDFS (Hadoop Distributed File System)
- Execution of Map-Reduce operations
- Rolling out the output results from the HDFS
Loading the initial data into HDFS (Hadoop Distributed File System)
In this first step, the data is retrieved from various sources (social media, web logs, social networks etc.) and uploaded into the HDFS, being split into multiple files.
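A minimal sketch of this loading step, using the Hadoop FileSystem Java API, is shown below. The cluster URI and the staging and target paths are illustrative assumptions, not part of the original example:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsLoader {
    public static void main(String[] args) throws Exception {
        // Assumed NameNode address -- adjust for your environment.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // Copy the extracted source files (e.g. web logs) from a local staging
        // area into HDFS; HDFS splits each file into blocks and replicates them.
        Path localLogs = new Path("/staging/weblogs/");
        Path hdfsTarget = new Path("/data/raw/weblogs/");
        fs.copyFromLocalFile(false, true, localLogs, hdfsTarget);

        // Quick sanity check used later during validation: list what actually landed.
        for (FileStatus status : fs.listStatus(hdfsTarget)) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}
```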
Validations:
- Verifying that the required data was extracted from the original system and there was no data corruption;
- Validating that the data files were loaded into the HDFS correctly;
- Checking that the files are partitioned and replicated across the data nodes;
- Determining the most complete set of data that needs to be checked.
For step-by-step validation, you can use tools such as Datameer, Talend or Informatica.
Execution of Map-Reduce operations
In this step, the initial data is processed with a Map-Reduce operation to obtain the desired result. Map-Reduce is a data processing concept for condensing large volumes of data into useful aggregated results.
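The sketch below illustrates such an aggregation with a classic word-count style job written against the Hadoop MapReduce Java API. It is only an illustration of the map and reduce phases described above; the input/output paths and job configuration are assumptions for the example:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit a <word, 1> key-value pair for every token in the input.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: aggregate all counts for the same key into one total.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Job driver: wires the mapper and reducer together and runs them on the cluster.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/raw/weblogs/"));
        FileOutputFormat.setOutputPath(job, new Path("/data/out/wordcount/"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```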
Validations:
- Checking the required business logic on a single node and then on a set of nodes;
- Validating the Map-Reduce process to ensure that the ‘key-value’ pairs are generated correctly (a minimal test sketch follows this list);
- Checking the aggregation and consolidation of data after the ‘reduce’ operation;
- Comparing the output data with the initial files to make sure that the output file was generated and that its format meets all the requirements.
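One common way to check the key-value output of a mapper in isolation, as mentioned in the list above, is a driver-based unit test. The sketch below uses Apache MRUnit against the TokenizerMapper from the previous sketch; the input record and the expected pairs are assumptions chosen purely for illustration:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class TokenizerMapperTest {

    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
        // Drive the mapper standalone, without a running cluster.
        mapDriver = MapDriver.newMapDriver(new WordCount.TokenizerMapper());
    }

    @Test
    public void mapperEmitsOneKeyValuePairPerToken() throws Exception {
        // One sample input record (key = byte offset, value = record text).
        mapDriver.withInput(new LongWritable(0), new Text("error warn error"));

        // Expected <key, value> pairs, in emission order.
        mapDriver.withOutput(new Text("error"), new IntWritable(1));
        mapDriver.withOutput(new Text("warn"), new IntWritable(1));
        mapDriver.withOutput(new Text("error"), new IntWritable(1));

        // Fails the test if the actual output pairs differ.
        mapDriver.runTest();
    }
}
```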
Rolling out the output results from HDFS
This final step includes unloading the data generated in the second step and loading it into the downstream system, which may be a repository used to generate reports or a transactional analysis system for further processing.
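A minimal sketch of this unloading step, again using the Hadoop FileSystem Java API, is shown below. It copies the part-r-* result files from the (assumed) job output directory of the previous sketch to a local export area where a downstream system could pick them up:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsUnloader {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // Reducers write their results as part-r-NNNNN files under the job's output directory.
        Path jobOutput = new Path("/data/out/wordcount/");
        Path localExport = new Path("/export/wordcount/");

        for (FileStatus status : fs.listStatus(jobOutput)) {
            if (status.getPath().getName().startsWith("part-")) {
                // Copy each result file to the local export area for the downstream system.
                fs.copyToLocalFile(status.getPath(), localExport);
            }
        }
        fs.close();
    }
}
```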
Validations:
- Inspecting the aggregated data to make sure that it has been loaded into the target system without being distorted;
- Validating that the reports include all the required data, that each indicator refers to a concrete measure and is displayed correctly, and that the reports operate on the latest data.
Big Data testers have to learn the components of the Big Data ecosystem from scratch. Until the market evolves and fully automated testing tools become available for Big Data validation, the tester has no option but to acquire the same skill set as the Big Data developer when it comes to leveraging Big Data technologies like Hadoop. This requires a tremendous mindset shift for both the testers and the testing units within the organization.
Reference: www.softwaretestingmagazine.com
