Thursday 19 November 2015

R + HADOOP STEP BY STEP


A Quick Look at Hadoop


  • Hadoop’s parallel processing muscle is well suited to large amounts of data, but it is equally useful for problems that involve large amounts of computation (sometimes known as “processor-intensive” or “CPU-intensive” work). Consider a program that, based on a handful of input values, runs for tens of minutes or even several hours: if you needed to test several variations of those input values, you would certainly benefit from a parallel solution (a plain-R sketch of this idea follows this list).

  • Hadoop’s parallelism is based on the MapReduce model. To understand how Hadoop can boost your R performance, then, let’s first take a quick look at MapReduce.
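To make the first point concrete, here is a minimal sketch of such a parameter sweep in plain R, using only the base parallel package (no Hadoop involved). The function slow_model and the input values are invented for illustration:

    library(parallel)

    slow_model <- function(x) {   # stand-in for a long-running computation
      Sys.sleep(1)                # pretend this step takes minutes or hours
      x^2
    }

    inputs  <- c(10, 20, 30, 40)  # hypothetical input values to test
    # Run the variations in parallel; mc.cores > 1 needs a Unix-like OS.
    results <- mclapply(inputs, slow_model, mc.cores = 2)
    unlist(results)               # 100 400 900 1600

Hadoop applies the same divide-and-conquer idea, but across many machines rather than the cores of one.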

 A Quick Look at MapReduce


  • The MapReduce model outlines a way to perform work across a cluster built of inexpensive commodity machines.
  • It is divided into two phases: Map and Reduce.
MAP PHASE
In the Map phase, you divide the input and group the pieces into smaller, independent piles of related material.


1. Each cluster node takes a piece of the initial mountain of data and runs a Map task on each record (item) of input. You supply the code for the Map task.

2. The Map tasks all run in parallel, creating a key/value pair for each record. The key identifies the item’s pile for the reduce operation. The value can be the record itself or some derivation thereof.
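To make this concrete, here is the Map phase of a minimal word-count sketch in plain R (no Hadoop involved); the sample records and the name map_task are invented for illustration:

    # Toy input: three "records", one line of text each.
    lines <- c("r and hadoop", "hadoop runs map tasks", "map and reduce")

    # The Map task you supply: for one record, emit a (word -> 1)
    # key/value pair for every word it contains.
    map_task <- function(line) {
      words <- strsplit(line, " ")[[1]]
      setNames(rep(1, length(words)), words)  # names act as keys
    }

    # On a real cluster these calls would run in parallel across nodes.
    map_output <- unlist(lapply(lines, map_task))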

SHUFFLE
At the end of the Map phase, the machines all pool their results. Every key/value pair is assigned to a pile, based on the key. 
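Continuing the word-count sketch, the shuffle amounts to grouping the mapped values into piles by key:

    # Shuffle: pool all key/value pairs and group the values by key.
    piles <- split(unname(map_output), names(map_output))
    # piles$hadoop is c(1, 1): one entry per record containing "hadoop".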

REDUCE PHASE
1. The cluster machines then switch roles and run the Reduce task on each pile. You supply the code for the Reduce task, which gets the entire pile (that is, all of the key/value pairs for a given key) at once.

2. The Reduce task typically, but not necessarily, emits some output for each pile.
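Finishing the sketch, the Reduce task receives one whole pile at a time and emits a single result for it, here the sum of the counts:

    # The Reduce task you supply: collapse one pile into one result.
    reduce_task <- function(pile) sum(pile)

    # Apply it to every pile; a cluster would parallelize this step too.
    counts <- sapply(piles, reduce_task)
    counts  # named vector: and = 2, hadoop = 2, map = 2, r = 1, ...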
