BigData: May 2014

Free Book on Big Data
https://s3.amazonaws.com/leada/handbook/Handbook_Pt1.pdf

BigData Jobs
https://datajobs.com/big-data-jobs-recruiting

Some Misconceptions about the Job Profile:

Misconception: Only search for Ph.D.s in mathematics. This is a misguided approach, because data scientists are multidisciplinary rather than experts in a single field. Data scientists also tend to be motivated autodidacts who learn constantly through experience and investigation, not necessarily through a degree program. Some even argue that five years of industry experience in data science is more valuable than five years in academia (of course, others may disagree).
Misconception: Must be a Hadoop expert. A data scientist definitely needs technical skills, but many recruiters confuse this role with that of an infrastructure engineer. A data scientist must be comfortable working with various types of systems, possibly including the Hadoop/MapReduce framework, but Hadoop expertise shouldn't be used as a filter when trying to find leads. Data scientists should, however, be proficient with SQL and R/SAS.
Caution: 'Fake' data scientists. Many people call themselves data scientists because they once ran a regression in Excel, yet they lack real technical and quantitative depth. When interviewing, make sure skills are vetted thoroughly. Do they know Bayesian statistics? Can they write R? Do they have a keen eye for business strategy?

The Technology:

What is NoSQL?

NoSQL (commonly referred to as "Not Only SQL") represents a completely different framework of databases that allows for high-performance, agile processing of information at massive scale. In other words, it is a database infrastructure that has been very well adapted to the heavy demands of big data.

NoSQL achieves this efficiency because, unlike relational databases, which are highly structured, NoSQL databases impose little or no fixed schema and trade off stringent consistency requirements for speed and agility. NoSQL centers around the concept of distributed databases, where unstructured data may be stored across multiple processing nodes, and often across multiple servers. This distributed architecture allows NoSQL databases to be horizontally scalable: as data continues to explode, you add more hardware to keep up, with little slowdown in performance. The NoSQL distributed database infrastructure has been the solution to handling some of the biggest data warehouses on the planet, e.g. the likes of Google, Amazon, and the CIA.
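
To make the idea of spreading data across nodes more concrete, here is a minimal Python sketch of hash-based partitioning. It is not how any particular NoSQL product works: the `ShardedStore` class, the `node_for_key` helper, and the use of in-memory dictionaries as stand-ins for servers are all assumptions made for illustration only.

```python
import hashlib


def node_for_key(key: str, num_nodes: int) -> int:
    """Pick a node index for a key by hashing it (hypothetical helper).

    MD5 is used only to get a hash that is stable across runs,
    not for any security purpose.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes


class ShardedStore:
    """Toy key-value store; each dict stands in for one server."""

    def __init__(self, num_nodes: int):
        self.nodes = [dict() for _ in range(num_nodes)]

    def put(self, key: str, value) -> None:
        # Route the write to the node that owns this key.
        self.nodes[node_for_key(key, len(self.nodes))][key] = value

    def get(self, key: str):
        # Route the read to the same node; no other node is consulted.
        return self.nodes[node_for_key(key, len(self.nodes))].get(key)


store = ShardedStore(num_nodes=4)
store.put("user:1", {"name": "Ada"})
store.put("user:2", {"name": "Grace"})
print(store.get("user:1"))
```

Each read or write touches only one node, which is why adding nodes can add capacity. Note that with simple modulo placement, changing the number of nodes moves most keys; real distributed databases typically use schemes such as consistent hashing to limit that reshuffling.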

What is Hadoop?

Hadoop is not a type of database, but rather a software ecosystem that allows for massively parallel computing. It is an enabler of certain types of NoSQL distributed databases (such as HBase), which can allow data to be spread across thousands of servers with little reduction in performance.

A staple of the Hadoop ecosystem is MapReduce, a computational model that takes intensive data processes and spreads the computation across a very large number of servers (generally referred to as a Hadoop cluster). It has been a game-changer in supporting the enormous processing needs of big data: a large data procedure that might take 20 hours of processing time on a centralized relational database system may take only 3 minutes when distributed across a large Hadoop cluster of commodity servers, all processing in parallel.
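
The MapReduce model described above can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle, and reduce steps using the classic word-count task; it does not use Hadoop or its API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this example. In a real cluster, the map and reduce calls would run on many servers in parallel.

```python
from collections import defaultdict


def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield (word, 1)


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each document could be mapped on a different server.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(pairs)
    # Each key could be reduced on a different server.
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

Here "big", "data", and "is" each end up with a count of 2 and "everywhere" with 1. Because every map call depends only on its own document and every reduce call only on its own key, the work can be divided among as many machines as are available.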

BigData

Wednesday, 14 May 2014

BigData All About
