Tuesday, 24 November 2015

Data Preparation & Exploration Using R

Before starting our exploration you must know about the types of variables:

TYPES OF VARIABLES:

  • An independent variable, sometimes called an experimental or predictor variable, is a variable that is being manipulated in an experiment in order to observe the effect on a dependent variable, sometimes called an outcome variable.
  • For example, in a study of exam performance:
    Dependent variable: test mark (measured from 0 to 100).
    Independent variables: revision time (measured in hours) and intelligence (measured using IQ score).
Categorical and Continuous Variables:
  • Nominal variables are variables that have two or more categories, but which do not have an intrinsic order. For example, a real estate agent could classify their types of property into distinct categories such as houses, condos, co-ops or bungalows.
  • Dichotomous variables are nominal variables which have only two categories or levels. For example, if we were looking at gender, we would most probably categorize somebody as either "male" or "female". 
  • Ordinal variables are variables that have two or more categories, just like nominal variables, except that the categories can also be ordered or ranked. For example, if you asked someone whether they liked the policies of the Democratic Party and they could answer "Not very much", "They are OK" or "Yes, a lot", then you have an ordinal variable.
DATASET:
You can download the dataset from the following link:
https://drive.google.com/drive/folders/0B5g0WZIzJLuEQ1daVWR5aFVERjg

IMPORTING THE DATA AND EXPLORING IT


Here ‘as.is=T’ is added so that dates are read as character, which can later be converted to dates. Note that variable names containing spaces are converted to names with dots in R on import; R also converts any special character in a variable name to a dot (.). One prominent inconsistency is that missing values are represented both as “No Info” and as blanks. So the first step should be to replace “No Info” and blanks with NA to get a better sense of the missing values in the data.
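The import step described above might look like the following minimal sketch. The inline CSV text, its column names and the object name customers are hypothetical stand-ins for the downloaded file; with the real data you would pass the file path to read.csv instead of the text argument.

```r
# Hypothetical sample standing in for the downloaded file. The header
# contains spaces and special characters to show how R renames them.
csv_text <- 'Customer ID,Join Date,Income ($),Segment
1,2015-01-10,52000,Retail
2,No Info,,No Info
3,2015-03-02,No Info,Corporate'

# as.is = TRUE keeps text (including dates) as character vectors.
# na.strings turns both "No Info" and blank fields into NA on import.
customers <- read.csv(text = csv_text, as.is = TRUE,
                      na.strings = c("No Info", ""))

print(names(customers))          # spaces and special characters became dots
print(colSums(is.na(customers))) # number of NA values per variable
```

Handling the two missing-value markers at import time avoids a separate replacement pass over every column afterwards.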


While importing the file, many of the continuous values and dates were imported as character vectors, which need to be converted to their respective data types.
The next step is to understand the amount of missing values in the data. Any variable with more than 50% of its values missing should not be used in modeling, as it can give a false impression of its relationship with the dependent variable and pollute the model. To be conservative, we generally keep only variables with less than 40% missing values for treatment; variables with more than 40% missing values are used in testing only, to give additional insights.
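As a rough sketch of these two steps, the snippet below converts character columns to their intended types and then computes the percentage of missing values per variable. The data frame df and its columns are made up for illustration.

```r
# Hypothetical data frame in which everything was imported as character.
df <- data.frame(
  amount  = c("10.5", "20", NA, "7.25"),
  opened  = c("2015-01-10", NA, NA, "2015-03-02"),
  comment = c(NA, NA, NA, "ok"),
  stringsAsFactors = FALSE
)

# Convert character columns to their intended data types.
df$amount <- as.numeric(df$amount)
df$opened <- as.Date(df$opened, format = "%Y-%m-%d")

# Percentage of missing values in each variable.
missing_pct <- colMeans(is.na(df)) * 100
print(round(missing_pct, 1))
```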



Now we can filter out the variables with more than 40% missing values, which we can later use for testing and additional insights, and keep only the variables with less than 40% missing values for modeling.
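One way to perform this split is sketched below on a made-up data frame. Sending a variable with exactly 40% missing values to the insights set is an assumption of this sketch, since the text does not say which side that boundary belongs to.

```r
# Hypothetical data frame with different amounts of missing data.
df <- data.frame(
  a = c(1, 2, NA, 4, 5),          # 20% missing
  b = c(NA, NA, NA, "x", "y"),    # 60% missing
  c = c(10, 20, 30, 40, 50)       # 0% missing
)

# Fraction of missing values per variable.
missing_frac <- colMeans(is.na(df))

# Variables below the 40% threshold are kept for modeling;
# the rest are set aside for testing and additional insights.
model_vars   <- df[, missing_frac <  0.40, drop = FALSE]
insight_vars <- df[, missing_frac >= 0.40, drop = FALSE]

print(names(model_vars))
print(names(insight_vars))
```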


It is better to separate the numeric and character variables in the data frame, which helps with the operations that follow.
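A minimal sketch of this separation, using a small made-up data frame, could look like this:

```r
# Hypothetical data frame mixing numeric and character variables.
df <- data.frame(age    = c(25, 40, 31),
                 income = c(50, 62.5, 48),
                 city   = c("Pune", "Delhi", "Pune"),
                 stringsAsFactors = FALSE)

# Flag the numeric columns, then split the data frame in two.
is_num   <- sapply(df, is.numeric)
num_vars <- df[, is_num, drop = FALSE]
chr_vars <- df[, !is_num, drop = FALSE]

str(num_vars)
str(chr_vars)
```

Using drop = FALSE keeps each result a data frame even when only one column matches.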


We will examine the values and inconsistencies in each variable to understand it better and rectify any problems. There are two ways to examine a variable, depending on its type: for continuous variables you can use the ‘summary’ and ‘quantile’ functions, and for categorical variables the ‘table’ function.
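The three functions can be tried on toy vectors as follows. The values of income and segment are invented for illustration and do not come from the dataset.

```r
# Hypothetical continuous and categorical variables.
income  <- c(12, 15, 18, 20, 22, 25, 27, 30, 35, 400)
segment <- c("A", "B", "A", "C", "A", "B", "A", "C", "B", "A")

# Continuous variable: overall summary and deciles.
print(summary(income))
print(quantile(income, probs = seq(0, 1, by = 0.1)))

# Categorical variable: frequency of each level, including NA if present.
segment_counts <- table(segment, useNA = "ifany")
print(segment_counts)
```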



Here a big jump in values seems to occur after the 90th percentile. It can be examined further to identify the cut-off point where the jump becomes relatively large, and values above it can be capped before analysis. Capping means that very large, outlier-type values are replaced with relatively meaningful values.
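Capping can be sketched with pmin, as below. Using the 90th percentile itself as the cap is an assumption made for illustration; in practice the cut-off would be chosen after inspecting where the jump actually happens.

```r
# Hypothetical variable with one very large value.
x <- c(12, 15, 18, 20, 22, 25, 27, 30, 35, 400)

# Assumed cut-off: the 90th percentile of the variable.
cap <- quantile(x, probs = 0.90, names = FALSE)

# Replace every value above the cap with the cap itself.
x_capped <- pmin(x, cap)

print(cap)
print(x_capped)
```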


Overfitting and Underfitting

  • Overfitting manifests itself when you have a model that does not generalize well. Say that you achieve a classification accuracy rate on your training data of 95 percent, but when you test its accuracy on another set of data, the accuracy falls to 50 percent. This would be considered high variance. If we had a case of 60 percent accuracy on the train data and 59 percent accuracy on the test data, we would have low variance but high bias. This bias-variance trade-off is fundamental to machine learning and model complexity.
  • A bias error is the difference between the value or class that we predict and the actual value or class in our training data.
  • A variance error is the amount by which the predicted value or class on our training set differs from the predicted value or class on other datasets.
  • Our goal is to minimize the total error (bias + variance).
  • Let’s say that we are trying to predict a value and we build a simple linear model with our train data. As this is a simple model, we could expect a high bias, while on the other hand, it would have a low variance between the train and test data. Now, let’s try including polynomial terms in the linear model or build decision trees. The models are more complex and should reduce the bias. However, as the bias decreases, the variance, at some point, begins to expand and generalizability is diminished. You can see this phenomenon in the following illustration. Any machine learning effort should strive to achieve the optimal trade-off between the bias and variance, which is easier said than done.
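The trade-off can be explored with a small simulation. The sketch below fits polynomials of increasing degree to simulated data and compares the error on the training half with the error on a held-out half. The data-generating function, the noise level and the chosen degrees are all assumptions made for illustration.

```r
set.seed(42)

# Simulated data: a smooth signal plus noise (hypothetical example).
n   <- 200
x   <- runif(n, -3, 3)
dat <- data.frame(x = x, y = sin(x) + rnorm(n, sd = 0.3))
train <- dat[1:100, ]
test  <- dat[101:200, ]

# Mean squared error of a fitted model on a given data set.
mse <- function(model, data) {
  mean((data$y - predict(model, newdata = data))^2)
}

# Fit models of increasing complexity and record both errors.
degrees <- c(1, 3, 15)
results <- data.frame(degree = degrees, train_mse = NA_real_, test_mse = NA_real_)
for (i in seq_along(degrees)) {
  degree <- degrees[i]
  fit <- lm(y ~ poly(x, degree), data = train)
  results$train_mse[i] <- mse(fit, train)
  results$test_mse[i]  <- mse(fit, test)
}
print(results)
```

The training error can only go down as the degree grows, so the gap between the two columns is the thing to watch when judging whether a model has started to overfit.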

Saturday, 21 November 2015

A Simple Recommendation System in R



What is a Recommendation System?


  • Recommender Systems (RSs) are software tools and techniques providing suggestions for items to be of use to a user. The suggestions relate to various decision-making processes, such as what items to buy, what music to listen to, or what online news to read.
  • “Item” is the general term used to denote what the system recommends to users.
  • A RS normally focuses on a specific type of item (e.g., CDs, or news) and accordingly its design, its graphical user interface, and the core recommendation technique used to generate the recommendations are all customized to provide useful and effective suggestions for that specific type of item.

What is the function of a recommendation system?

  • Increase the number of items sold.
  • Sell more diverse items.
  • Increase the user satisfaction.
  • Increase user fidelity.

Recommendation Techniques


In order to implement its core function, identifying the useful items for the user, a RS must predict that an item is worth recommending.

Recommender systems model this degree of utility of the user u for the item i as a (real-valued) function R(u, i), as is normally done in collaborative filtering by considering the ratings of users for items. The fundamental task of a collaborative filtering RS is then to predict the value of R over pairs of users and items.

Recommender systems can take six different approaches:

Content-based: The system learns to recommend items that are similar to the ones that the user liked in the past. The similarity of items is calculated based on the features associated with the compared items.

Collaborative filtering: The simplest and original implementation of this approach recommends to the active user the items that other users with similar tastes liked in the past. The similarity in taste of two users is calculated based on the similarity in the rating history of the users.


Demographic: This type of system recommends items based on the demographic profile of the user. The assumption is that different recommendations should be generated for different demographic niches.



Knowledge-based: Knowledge-based systems recommend items based on specific domain knowledge about how certain item features meet users' needs and preferences and, ultimately, how the item is useful for the user.



Community-based: This type of system recommends items based on the preferences of the user's friends. This technique follows the epigram “Tell me who your friends are, and I will tell you who you are.”



Hybrid recommender systems: These RSs are based on a combination of the above-mentioned techniques. A hybrid system combining techniques A and B tries to use the advantages of A to fix the disadvantages of B.



DATASET:
The data set we’ll use is the data that was available to participants in the R package recommendation contest on Kaggle. In this contest, they provided contestants with the complete package installation record for approximately 50 R programmers. This is quite a small data set, but it was sufficient to start learning things about the popularity of different packages and their degree of similarity.

You can download the dataset from the following link:

https://drive.google.com/drive/folders/0B5g0WZIzJLuEQ1daVWR5aFVERjg

PROBLEM STATEMENT:
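Our goal is to predict whether a user would install a package for which the installation data had been withheld, by exploiting the fact that we know which other R packages the user has installed. Because the similarity between packages is computed here from installation records rather than from package features, this is an item-based form of collaborative filtering rather than content-based recommendation.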

UNDERSTANDING DATA :

As you can see, user 1 for the contest had installed the abind package, but not the AcceptanceSampling package. This raw information will give us a measure of similarity between packages after we transform it into a different form. 


Now we will convert this “long form” data into a “wide form” of data in which each row corresponds to a user and each column corresponds to a package. We can do that using the cast function from the reshape package.
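The reshaping step can be sketched without extra packages. The snippet below uses base R's tapply as a dependency-free stand-in for cast, on a tiny made-up installation table; the column names Package, User and Installed are assumptions about the layout of the long-form data.

```r
# Hypothetical long-form data: one row per (user, package) pair.
installations <- data.frame(
  Package   = rep(c("abind", "AcceptanceSampling", "ggplot2"), times = 3),
  User      = rep(1:3, each = 3),
  Installed = c(1, 0, 1,
                0, 0, 1,
                1, 1, 0)
)

# Wide form: one row per user, one column per package.
# With the reshape package the comparable call would be
#   cast(installations, User ~ Package, value = "Installed")
user_package_matrix <- with(installations,
                            tapply(Installed, list(User, Package), sum))

print(user_package_matrix)
```

Unlike the cast result, the tapply result already carries the user IDs as row names, so there is no ID column to strip off afterwards.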


Once we inspect the first column, we realize that it just stores the user IDs, so we remove it after storing that information in the row.names of our matrix. Now we have a proper user-package matrix that makes it trivial to compute similarity measures. For simplicity, we’ll use the correlation between columns as our measure of similarity.

To compute that, we can simply call the cor function from R:

Now we have the similarity between all pairs of packages. As you can see, package 1 is perfectly similar to package 1 and somewhat dissimilar from package 2. But kNN doesn’t use similarities; it uses distances. So we need to translate similarities into distances. Our approach here is to use some clever algebra so that a similarity of 1 becomes a distance of 0 and a similarity of –1 becomes an infinite distance. The code we’ve written here does this. If it’s not intuitively clear why it works, try to spend some time thinking about how the algebra moves points around.
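Since the original code is not shown here, the following is only a sketch of one transform with the stated properties: taking the negative logarithm of the similarity rescaled to the range 0 to 1. The toy user-package matrix and its names are hypothetical.

```r
# Hypothetical user-package matrix (1 = installed, 0 = not installed).
user_package_matrix <- matrix(
  c(1, 0, 1,
    1, 0, 0,
    0, 1, 1,
    1, 1, 0,
    0, 1, 1),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("user", 1:5), c("pkgA", "pkgB", "pkgC"))
)

# Similarity between packages: correlation between columns.
similarities <- cor(user_package_matrix)

# Assumed transform: rescale similarity from [-1, 1] to [0, 1], then take
# the negative log. A similarity of 1 maps to log(1) = 0 and a similarity
# of -1 maps to -log(0), which is infinite.
distances <- -log((similarities / 2) + 0.5)

print(round(similarities, 2))
print(round(distances, 2))
```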


With that computation run, we have our distance measure and can start implementing kNN. We’ll use k = 25 for example purposes, but you’d want to try multiple values of k in production to see which works best.
To make our recommendations, we’ll estimate how likely a package is to be installed by simply counting how many of its neighbors are installed. Then we’ll sort packages by this count value and suggest packages with the highest score.


Using the nearest neighbors, we’ll predict the probability of installation by simply counting how many of the neighbors are installed:



Here we see that for user 1, package 1 has probability 0.76 of being installed. So what we’d like to do is find the packages that are most likely to be installed and recommend those. We do that by looping over all of the packages, finding their installation probabilities, and displaying the most probable:
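The whole procedure, from neighbors to ranked recommendations, might be sketched as follows. The installation matrix is simulated, the helper function names are hypothetical, and k = 3 is used only because the toy data has 8 packages; the distance transform is the assumed negative-log rescaling of the correlation.

```r
set.seed(1)

# Simulated installation matrix: 30 users by 8 packages (hypothetical data).
n_users <- 30
n_pkgs  <- 8
installed <- matrix(rbinom(n_users * n_pkgs, 1, 0.5), nrow = n_users,
                    dimnames = list(NULL, paste0("pkg", 1:n_pkgs)))

# Package-to-package distances derived from column correlations.
distances <- -log((cor(installed) / 2) + 0.5)

# Indices of the k packages closest to package i, excluding i itself.
k_nearest_neighbors <- function(i, distances, k) {
  head(setdiff(order(distances[i, ]), i), k)
}

# Share of a package's k nearest neighbors that the user has installed.
installation_probability <- function(user, package, installed, distances, k) {
  neighbors <- k_nearest_neighbors(package, distances, k)
  mean(installed[user, neighbors])
}

# Score every package for one user and sort from most to least probable.
most_probable_packages <- function(user, installed, distances, k) {
  probs <- sapply(seq_len(ncol(installed)), function(p) {
    installation_probability(user, p, installed, distances, k)
  })
  names(probs) <- colnames(installed)
  sort(probs, decreasing = TRUE)
}

ranking <- most_probable_packages(1, installed, distances, k = 3)
print(head(ranking, 5))
```

In a real recommender you would probably drop the packages the user already has before showing the top of the list.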


One of the things that’s nice about this approach is that we can easily justify our predictions to end users. We can say that we recommended package P to a user because they had already installed packages X, Y, and Z. This transparency can be a real virtue in some applications.

Friday, 20 November 2015

UNDERSTANDING PCA WITH R


One place where this type of dimensionality reduction is particularly helpful is when dealing with stock market data.



Date          ADC      UTR
2002-01-02    17.7     39.34
2002-01-03    23.78    23.52
...
2011-05-25    16.14    49.3



For example, we might have data that looks like the real historical prices shown above for 25 stocks over the period from January 2, 2002 until May 25, 2011.


Though we’ve only shown 3 columns, there are actually 25, which is far too many columns to deal with in a thoughtful way.



We want to create a single column that tells us how the market is doing on each day by combining information in the 25 columns that we have access to.



LOADING THE DATASET


You can download the dataset from the following link:

https://drive.google.com/drive/folders/0B5g0WZIzJLuEQ1daVWR5aFVERjg



Now let's see the structure of our data frame:





BASIC PREPROCESSING 

The first step is to translate all of the raw datestamps in our data set to properly encoded date variables. To do that, we use the lubridate package from CRAN. This package provides a nice function called ymd that translates strings in year-month-day format into date objects:
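A minimal sketch of the conversion, assuming the lubridate package is installed and using a made-up prices data frame with hypothetical values:

```r
library(lubridate)

# Hypothetical long-form price data with dates stored as strings.
prices <- data.frame(
  Date  = c("2002-01-02", "2002-01-03", "2011-05-25"),
  Stock = "ADC",
  Close = c(10.0, 10.5, 12.0),
  stringsAsFactors = FALSE
)

# ymd parses year-month-day strings into Date objects.
prices$Date <- ymd(prices$Date)

print(class(prices$Date))
print(range(prices$Date))
```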

The cast function has you specify which column should be used to define the rows in the output matrix on the left hand side of the tilde, and the columns of the result are specified after the tilde. The actual entries in the result are specified using value.
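To make the rows-versus-columns idea concrete, here is a sketch that uses base R's reshape function as a stand-in for cast, so that it runs without extra packages. The column names Date, Stock and Close and the price values are assumptions for illustration.

```r
# Hypothetical long-form price data: one row per (date, stock) pair.
prices <- data.frame(
  Date  = as.Date(rep(c("2002-01-02", "2002-01-03"), each = 2)),
  Stock = rep(c("ADC", "UTR"), times = 2),
  Close = c(10.0, 30.0, 10.5, 29.5)
)

# Date defines the rows, Stock defines the columns, Close fills the cells.
# With the reshape package the comparable call would be
#   cast(prices, Date ~ Stock, value = "Close")
wide <- reshape(prices, idvar = "Date", timevar = "Stock", direction = "wide")

# reshape prefixes the new columns with "Close."; strip that prefix.
names(wide) <- sub("^Close\\.", "", names(wide))

print(wide)
```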




CHECKING FOR MISSING VALUES AND IMPUTING THEM





Here we check each column for missing values and replace them with the mean of that particular column.

Let's check again whether there are any missing values now:




As you can see, we now have no missing values.
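Mean imputation followed by a re-check could be sketched as below. The stocks data frame and its values are hypothetical.

```r
# Hypothetical wide-form price data with a few missing values.
stocks <- data.frame(ADC = c(10, NA, 20),
                     UTR = c(30, 40, NA))

# Number of missing values per column before imputation.
print(colSums(is.na(stocks)))

# Replace each missing value with the mean of its own column.
for (col in names(stocks)) {
  missing <- is.na(stocks[[col]])
  stocks[[col]][missing] <- mean(stocks[[col]], na.rm = TRUE)
}

# Check again: the total should now be zero.
print(sum(is.na(stocks)))
```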


CHECKING FOR CORRELATION BETWEEN VARIABLES


Table: Correlation between variables


Looking at the correlations between variables, we find that most of them are positive, so PCA will work fine here.
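One quick way to check this claim is to collect the pairwise correlations and see what share of them is positive. The sketch below does so on simulated series that share a common factor; the data and column names are made up for illustration.

```r
set.seed(7)

# Simulated series driven by one shared "market" factor (hypothetical data).
n <- 250
market <- rnorm(n)
stocks <- data.frame(
  S1 = market + rnorm(n, sd = 0.5),
  S2 = market + rnorm(n, sd = 0.5),
  S3 = market + rnorm(n, sd = 0.5),
  S4 = market + rnorm(n, sd = 0.5)
)

# Keep each pairwise correlation once (lower triangle, no diagonal).
cor_matrix   <- cor(stocks)
correlations <- cor_matrix[lower.tri(cor_matrix)]

print(summary(correlations))
print(mean(correlations > 0))   # share of positive correlations
```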



APPLYING PCA







Here we get a total of 25 components. The first component is 32.87, meaning it accounts for about 33% of the variance. Similarly, the last component accounts for less than 1% of the total variance.
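The share of variance per component can be computed directly from a fitted PCA object. The sketch below assumes princomp from base R and uses simulated prices for 5 hypothetical stocks rather than the 25 in the dataset.

```r
set.seed(7)

# Simulated prices for 5 stocks that follow a common trend (hypothetical data).
n <- 250
market <- cumsum(rnorm(n))
prices <- sapply(1:5, function(j) 50 + 2 * market + rnorm(n))
colnames(prices) <- paste0("S", 1:5)

# Fit the PCA: one component per input column.
pca <- princomp(prices)

# Share of the total variance explained by each component.
variance_share <- pca$sdev^2 / sum(pca$sdev^2)
print(round(100 * variance_share, 2))
```

Note that the values printed by princomp itself are standard deviations, so squaring them before taking shares is what turns them into proportions of variance.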


We can look at the first component more clearly by looking at its loadings.









Looking at the loadings, we find that most of the loadings for the first component are negative, but they are quite evenly distributed.


So let's use the first component to predict the market index.
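Inspecting the loadings and turning the first component into a one-column index might look like the following sketch. It assumes princomp from base R, and the simulated prices and object names are hypothetical.

```r
set.seed(7)

# Simulated prices for 5 stocks that follow a common trend (hypothetical data).
n <- 250
market <- cumsum(rnorm(n))
prices <- sapply(1:5, function(j) 50 + 2 * market + rnorm(n))
colnames(prices) <- paste0("S", 1:5)

pca <- princomp(prices)

# Loadings of the first component: the weight given to each stock.
first_loadings <- pca$loadings[, 1]
print(round(first_loadings, 3))

# Scores on the first component: one summary value per day.
market_index <- predict(pca)[, 1]

# The sign of a principal component is arbitrary. If the loadings come out
# negative, multiply the index by -1 so that it rises when prices rise.
if (all(first_loadings < 0)) {
  market_index <- -market_index
}
print(head(market_index))
```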





As you can see, we can use PCA to predict a market index from the prices of the different stocks.

Installing RHadoop


To get RHadoop installed on our system, we need Hadoop with either a single-node or a multi-node installation, depending on the size of our data.


  • Installing the R packages:
We can install the required R packages by running the following command in the R console:

install.packages(c('rJava','RJSONIO','itertools','digest','Rcpp','httr','functional','devtools', 'plyr','reshape2'))

  • Setting environment variables:
We can set these from the R console using the following code:

## Setting HADOOP_CMD

Sys.setenv(HADOOP_CMD="/usr/local/hadoop/bin/hadoop")

## Setting up HADOOP_STREAMING

Sys.setenv(HADOOP_STREAMING="/usr/local/hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar")

Alternatively, we can set these variables from the command line as follows:

export HADOOP_CMD=/usr/local/hadoop/bin/hadoop

export HADOOP_STREAMING=/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.1.1.jar


  • Installing RHadoop [rhdfs, rmr, rhbase]
1. Download the RHadoop packages from the GitHub repository of Revolution Analytics:

https://github.com/RevolutionAnalytics/RHadoop

   • rmr: [rmr-2.2.2.tar.gz]
   • rhdfs: [rhdfs-1.6.0.tar.gz]
   • rhbase: [rhbase-1.2.0.tar.gz]

2. Install these packages in R via:

packages -> install packages from local zip files


3. Check the Installation

Once we complete the installation of RHadoop, we can test the setup by running the MapReduce job with the rmr2 and rhdfs libraries in the RHadoop sample program as follows:

## loading the libraries


library(rhdfs)
library(rmr2)

## initializing the RHadoop

hdfs.init()

ARCHITECTURE OF RHADOOP




If our input data is stored in HBase, we need to install rhbase; otherwise we require the rhdfs and rmr packages.


INTRODUCING RHADOOP


RHadoop is a collection of three R packages for providing large data operations with an R environment. It was developed by Revolution Analytics, which is the leading commercial provider of software based on R. RHadoop is available with three main R packages: 

rhdfs, rmr, and rhbase

Each of them offers different Hadoop features.


    • rhdfs is an R interface that makes HDFS usable from the R console. As Hadoop MapReduce programs write their output to HDFS, it is very easy to access that output by calling the rhdfs methods. The R programmer can easily perform read and write operations on distributed data files. Basically, the rhdfs package calls the HDFS API in the backend to operate on data sources stored on HDFS.


    • rmr is an R interface that provides the Hadoop MapReduce facility inside the R environment. The R programmer just needs to divide the application logic into map and reduce phases and submit it with the rmr methods. After that, rmr calls the Hadoop streaming MapReduce API with several job parameters, such as the input directory, output directory, mapper, and reducer, to perform the R MapReduce job over the Hadoop cluster.

    • rhbase is an R interface for operating on Hadoop HBase data sources stored on the distributed network via a Thrift server. The rhbase package is designed with several methods for initialization, read/write, and table manipulation operations.
    Apart from these, there are two other packages:



    • plyrmr: This is a higher-level abstraction of MapReduce, which allows users to perform common data manipulation in a plyr-like syntax. This package greatly lowers the learning curve of big-data manipulation.

    • ravro: This allows users to read and write Avro files in R, which lets R exchange data with HDFS.


      Thursday, 19 November 2015

      R + HADOOP STEP BY STEP


      A Quick Look at HADOOP


      • While Hadoop’s parallel processing muscle is suitable for large amounts of data, it is equally useful for problems that involve large amounts of computation (sometimes known as “processor-intensive” or “CPU-intensive” work). Consider a program that, based on a handful of input values, runs for some tens of minutes or even a number of hours: if you needed to test several variations of those input values, then you would certainly benefit from a parallel solution.

      • Hadoop’s parallelism is based on the MapReduce model. To understand how Hadoop can boost your R performance, then, let’s first take a quick look at MapReduce.

      A Quick Look at MapReduce


      • The MapReduce model outlines a way to perform work across a cluster built of inexpensive commodity machines.
      • It is divided into two phases: Map and Reduce.
      MAP PHASE
      In the Map phase, you divide the input and group the pieces into smaller, independent piles of related material.


      1. Each cluster node takes a piece of the initial mountain of data and runs a Map task on each record (item) of input. You supply the code for the Map task.

      2. The Map tasks all run in parallel, creating a key/value pair for each record. The key identifies the item’s pile for the reduce operation. The value can be the record itself or some derivation thereof.

      SHUFFLE
      At the end of the Map phase, the machines all pool their results. Every key/value pair is assigned to a pile, based on the key. 

      REDUCE PHASE
      1. The cluster machines then switch roles and run the Reduce task on each pile. You supply the code for the Reduce task, which gets the entire pile (that is, all of the key/value pairs for a given key) at once.

      2. The Reduce task typically, but not necessarily, emits some output for each pile.
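The three stages can be imitated in plain R to see how the pieces fit together. The sketch below is a word count that runs sequentially in a single R process, so it only mirrors the shape of a MapReduce job and does not use Hadoop; the input records and the function names map_task and reduce_task are hypothetical.

```r
# Hypothetical input records (one line of text each).
records <- c("r hadoop r", "hadoop mapreduce", "r mapreduce r")

# Map: emit one key/value pair (word, 1) for every word in a record.
map_task <- function(record) {
  words <- strsplit(record, " ", fixed = TRUE)[[1]]
  data.frame(key = words, value = 1, stringsAsFactors = FALSE)
}
pairs <- do.call(rbind, lapply(records, map_task))

# Shuffle: assign every pair to a pile according to its key.
piles <- split(pairs$value, pairs$key)

# Reduce: each task receives one whole pile and emits a single result.
reduce_task <- function(values) sum(values)
counts <- sapply(piles, reduce_task)

print(counts)
```

On a real cluster the lapply and sapply calls are the parts that would be spread across machines, while the split step corresponds to the shuffle that Hadoop performs between the two phases.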