Friday 20 November 2015

INTRODUCING RHADOOP


RHadoop is a collection of R packages that bring large-scale data operations into the R environment. It was developed by Revolution Analytics, the leading commercial provider of R-based software. RHadoop ships with three main R packages:

rhdfs, rmr, and rhbase

Each of them exposes a different Hadoop feature.


    • rhdfs is an R interface that provides HDFS usability from the R console. Since Hadoop MapReduce programs write their output to HDFS, it is very easy to access that output by calling the rhdfs methods. The R programmer can easily perform read and write operations on distributed data files; under the hood, the rhdfs package calls the HDFS API to operate on data sources stored in HDFS.
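A minimal sketch of the rhdfs workflow, assuming a running Hadoop cluster and that the HADOOP_CMD environment variable points at your hadoop binary (the paths below are hypothetical):

```r
# Tell rhdfs where the hadoop binary lives (path is an example)
Sys.setenv(HADOOP_CMD = "/usr/local/hadoop/bin/hadoop")

library(rhdfs)
hdfs.init()                                   # connect to HDFS

# Copy a local file into HDFS and list the target directory
hdfs.put("local_data.csv", "/user/hduser/data.csv")
hdfs.ls("/user/hduser")

# Low-level read access to the distributed file
f   <- hdfs.file("/user/hduser/data.csv", "r")
raw <- hdfs.read(f)                           # returns raw bytes
hdfs.close(f)
```

The same session can write results back with hdfs.put or hdfs.write, so intermediate MapReduce output never has to leave the cluster.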


    • rmr is an R interface that provides the Hadoop MapReduce facility inside the R environment. The R programmer only needs to divide the application logic into map and reduce phases and submit it with the rmr methods. rmr then calls the Hadoop streaming MapReduce API with several job parameters, such as the input directory, output directory, mapper, and reducer, to run the R MapReduce job over the Hadoop cluster.
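As an illustration, here is the classic word-count job written with rmr2 (the current incarnation of rmr); the input path is hypothetical and the job assumes a configured Hadoop cluster:

```r
library(rmr2)

# Word count: the map phase emits (word, 1) pairs,
# the reduce phase sums the counts for each word.
wordcount <- function(input, output = NULL) {
  mapreduce(
    input        = input,
    output       = output,
    input.format = "text",
    map = function(k, lines) {
      words <- unlist(strsplit(lines, "\\s+"))
      keyval(words, 1)
    },
    reduce = function(word, counts) {
      keyval(word, sum(counts))
    })
}

# Run the job and pull the result back into the R session
result <- from.dfs(wordcount("/user/hduser/books"))
```

Note that mapreduce() returns a reference to the output on HDFS; from.dfs() is what materializes it as an R object.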

    • rhbase is an R interface for operating on Hadoop HBase data sources stored across the distributed network, accessed via a Thrift server. The rhbase package provides methods for initialization, read/write, and table manipulation operations.
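A short sketch of an rhbase session, assuming an HBase Thrift server is listening on localhost:9090 (table and column names are illustrative, and the exact method signatures may vary between rhbase versions):

```r
library(rhbase)

# Connect to the HBase Thrift gateway
hb.init(host = "localhost", port = 9090)

# Create a table with one column family, insert a cell, read it back
hb.new.table("students", "info")
hb.insert("students",
          list(list("row1", "info:name", "Ann")))
hb.get("students", "row1")

hb.list.tables()                  # confirm the table exists
```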
    Apart from these, there are two other packages:



    • plyrmr: This is a higher-level abstraction over MapReduce that allows users to perform common data manipulation with a plyr-like syntax. This package greatly lowers the learning curve of big-data manipulation.
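To show the flavor of the plyr-like syntax, here is a sketch that groups a data frame and computes a per-group average; it assumes plyrmr is installed on top of a working rmr2 setup (with small in-memory input, plyrmr can also run in local mode):

```r
library(plyrmr)

# Average mpg per cylinder count, expressed as plyr-style verbs
# rather than explicit map and reduce functions.
avg.by.cyl <-
  transmute(
    group(input(mtcars), cyl),
    avg.mpg = mean(mpg))

as.data.frame(avg.by.cyl)   # collect the result into the R session
```

The point of the package is exactly this: group() and transmute() compile down to MapReduce jobs, so the user never writes keyval() pairs by hand.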

    • ravro: This allows users to read and write Avro files in R, which lets R exchange data with HDFS.
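A minimal ravro round trip might look like the following (file name is illustrative, and this assumes the ravro package is installed):

```r
library(ravro)

# Serialize a data frame to an Avro container file,
# then deserialize it back into R.
write.avro(mtcars, "mtcars.avro")
df <- read.avro("mtcars.avro")
```

Because Avro is a standard serialization format across the Hadoop ecosystem, a file written this way can be consumed directly by Hive, Pig, or MapReduce jobs.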

