Thursday, 10 July 2014

Validating BigData

What is Hadoop?
Hadoop is an Apache open source project that develops software for scalable, distributed computing. It is an Apache top-level project being built and used by a global community of contributors and users, licensed under the Apache License 2.0.
  • Hadoop is a framework for distributed processing of large data sets across clusters of computers using simple programming models.
  • Hadoop easily deals with the complexities of high volume, velocity and variety of data.
  • Hadoop scales from single servers to thousands of machines, each offering local computation and storage.
  • Hadoop detects and handles failures at the application layer.
The processing of Big Data, and its software testing process, can be split into three basic stages. The process is illustrated below with an example based on the open source Apache Hadoop software framework:
  1. Loading the initial data into the HDFS (Hadoop Distributed File System)
  2. Execution of Map-Reduce operations
  3. Rolling out the output results from the HDFS
Loading the initial data into HDFS (Hadoop Distributed File System)
In this first step, the data is retrieved from various sources (social networks, web logs, etc.) and uploaded into HDFS, where it is split across multiple files.
Validations:
  • Verifying that the required data was extracted from the original system and there was no data corruption;
  • Validating that the data files were loaded into the HDFS correctly;
  • Checking that the files are partitioned and replicated correctly across the data nodes;
Determine the most complete set of data that needs to be checked. For step-by-step validation, tools such as Datameer, Talend or Informatica can be used.
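The data-completeness check in this step can be prototyped with a small script. The sketch below is plain Python under the assumption that both the source extract and the loaded parts are available as local files (in a real cluster you would first pull the HDFS copies down, e.g. with `hdfs dfs -get`); all paths and function names here are illustrative, not part of any Hadoop API:

```python
import hashlib


def checksum(path, algo="md5"):
    """Return the hex digest of a file, read in chunks,
    to detect corruption of an individual file."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()


def validate_load(source_path, loaded_paths):
    """Compare the record count of the source extract against the
    combined record count of the files loaded into HDFS."""
    with open(source_path, "rb") as f:
        source_records = sum(1 for _ in f)
    loaded_records = 0
    for p in loaded_paths:
        with open(p, "rb") as f:
            loaded_records += sum(1 for _ in f)
    return {
        "source_records": source_records,
        "loaded_records": loaded_records,
        "counts_match": source_records == loaded_records,
    }
```

Record counts survive re-partitioning, so they are a safe first check even when the load splits one extract into many part files; per-file checksums catch corruption within a single file.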
Execution of Map-Reduce operations
In this step you process the initial data using a Map-Reduce operation to obtain the desired result. Map-Reduce is a data processing concept for condensing large volumes of data into useful aggregated results.
Validations:
  • Checking the required business logic, first on a standalone node and then on the full set of nodes;
  • Validating the Map-Reduce process to ensure that the ‘key-value’ pair is generated correctly;
  • Checking the aggregation and consolidation of data after performing the ‘reduce’ operation;
  • Comparing the output data with the initial files to make sure that the output file was generated and its format meets all the requirements.
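The ‘key-value’ and aggregation checks above can be exercised on a toy word-count job. The sketch below is plain Python standing in for a real Map-Reduce run, with the reduced totals cross-checked against an independent recount of the raw input (the function names are illustrative):

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) key-value pair for every word."""
    for line in lines:
        for word in line.split():
            yield (word, 1)


def reduce_phase(pairs):
    """Reduce: aggregate the counts for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def validate_against_input(lines, reduced):
    """Validation: the sum of all reduced counts must equal an
    independent count of the words in the original input."""
    expected = sum(len(line.split()) for line in lines)
    return sum(reduced.values()) == expected
```

The same pattern scales to real jobs: recompute a cheap invariant (row counts, column sums) directly from the input and compare it with the job's output, rather than trusting the job to check itself.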
Rolling out the output results from HDFS
This final step includes unloading the data that was generated by the second step and loading it into the downstream system, which may be a repository for data to generate reports or a transactional analysis system for further processing.
Validations:
  • Inspecting the aggregated data to make sure it has been loaded into the target system without distortion;
  • Validating that the reports include all the required data, that every indicator refers to a concrete measure and is displayed correctly, and that the reports operate on the latest data.
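A downstream reconciliation of this kind can be as simple as comparing the aggregates unloaded from HDFS with what the reporting system actually stores. A minimal sketch, assuming both sides can be exported as key/value dictionaries (the names here are hypothetical):

```python
def reconcile(hdfs_output, report_data):
    """Compare the aggregates unloaded from HDFS with the data the
    downstream reporting system holds; return all discrepancies."""
    issues = {}
    for key, value in hdfs_output.items():
        if key not in report_data:
            issues[key] = "missing in report"
        elif report_data[key] != value:
            issues[key] = f"mismatch: {value} vs {report_data[key]}"
    for key in report_data:
        if key not in hdfs_output:
            issues[key] = "unexpected in report"
    return issues
```

An empty result means the two systems agree; anything else points the tester straight at the distorted or dropped keys.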
Big Data testers have to learn the components of the Big Data ecosystem from scratch. Until the market evolves and fully automated testing tools become available for Big Data validation, the tester has no option but to acquire the same skill set as the Big Data developer in the context of leveraging Big Data technologies like Hadoop. This requires a tremendous mindset shift, both for testers and for the testing units within the organization.

Wednesday, 14 May 2014

All About BigData

Free Book on Big Data
https://s3.amazonaws.com/leada/handbook/Handbook_Pt1.pdf


BigData Jobs
https://datajobs.com/big-data-jobs-recruiting


Some Misconception about Jobs Profile:
  • Misconception: Only search for Mathematics Ph.D.s. This is a misguided approach, given that data scientists are multidisciplinary, not experts in one field only. We also find that data scientists tend to be motivated autodidacts who constantly learn through experience and investigation, not necessarily through a degree program. Some even argue that five years of industry experience learning data science is more valuable than five years in academia (of course, others may disagree).
  • Misconception: Must be a Hadoop expert. A data scientist definitely needs technical skills, but many recruiters confuse this with being an infrastructure engineer. A data scientist must be comfortable interacting with various types of systems including possibly the Hadoop/MapReduce framework, but it shouldn't be a filter when trying to find leads. However, data scientists should be proficient with SQL and R/SAS.
  • Caution: 'Fake' data scientists. Many people call themselves data scientists because they may have run a regression in Excel at some point, but lack real technical and quantitative depth. When interviewing, make sure skills are vetted thoroughly. Do they know Bayesian statistics? Can they write R? Do they have a keen eye for business strategy?


The Technology:


What is NoSQL?

NoSQL (commonly referred to as "Not Only SQL") represents a completely different framework of databases that allows for high-performance, agile processing of information at massive scale. In other words, it is a database infrastructure that has been very well adapted to the heavy demands of big data.
The efficiency of NoSQL can be achieved because, unlike relational databases, which are highly structured, NoSQL databases are unstructured in nature, trading off stringent consistency requirements for speed and agility. NoSQL centers on the concept of distributed databases, where unstructured data may be stored across multiple processing nodes, and often across multiple servers. This distributed architecture allows NoSQL databases to be horizontally scalable: as data continues to explode, you just add more hardware to keep up, with no slowdown in performance. The NoSQL distributed database infrastructure has been the solution to handling some of the biggest data warehouses on the planet – for the likes of Google, Amazon, and the CIA.
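The horizontal scaling described above rests on partitioning keys across nodes. A minimal illustration of hash-based sharding in plain Python (this is the general idea only, not any particular database's actual placement scheme):

```python
import hashlib


def node_for_key(key, nodes):
    """Pick the owning node for a key by hashing it -- the core idea
    behind spreading data across many servers."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


def put(store, nodes, key, value):
    """Write the value to whichever node owns the key."""
    store.setdefault(node_for_key(key, nodes), {})[key] = value


def get(store, nodes, key):
    """Read the value back from the owning node."""
    return store.get(node_for_key(key, nodes), {}).get(key)
```

Note that with this naive modulo scheme, adding a node reshuffles almost every key; production systems typically use consistent hashing so that growing the cluster moves only a small fraction of the data.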

State of Big Data


What is Hadoop?

Hadoop is not a type of database, but rather a software ecosystem that allows for massively parallel computing. It is an enabler of certain types of NoSQL distributed databases (such as HBase), which can allow data to be spread across thousands of servers with little reduction in performance.
A staple of the Hadoop ecosystem is MapReduce, a computational model that takes intensive data processes and spreads the computation across a potentially endless number of servers (generally referred to as a Hadoop cluster). It has been a game-changer in supporting the enormous processing needs of big data; a large data procedure that might take 20 hours of processing time on a centralized relational database system may take only 3 minutes when distributed across a large Hadoop cluster of commodity servers, all processing in parallel.
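The split-the-work-across-servers idea can be mimicked on one machine: partition the input into chunks, process each chunk independently, and merge the partial results. In the sketch below a thread pool stands in for the Hadoop cluster; the shape of the computation (split, compute in parallel, merge) is the point, not the speedup:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_chunk(lines):
    """Process one partition independently -- the 'map' side."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def parallel_count(lines, workers=4):
    """Split the input, fan out to workers, merge the partial counts --
    the same split/compute/merge shape as a MapReduce job."""
    size = max(1, len(lines) // workers)
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_chunk, chunks):
            total.update(partial)
    return total
```

Because each chunk is processed without touching the others, adding more workers (or, in Hadoop's case, more commodity servers) shortens the wall-clock time without changing the result.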