A Simple Recommendation System in R
What is a Recommendation System?
- Recommender Systems (RSs) are software tools and techniques that suggest items likely to be of use to a user. The suggestions relate to various decision-making processes, such as what items to buy, what music to listen to, or what online news to read.
- “Item” is the general term used to denote what the system recommends to users.
- An RS normally focuses on a specific type of item (e.g., CDs or news). Accordingly, its design, its graphical user interface, and the core technique used to generate the recommendations are all customized to provide useful and effective suggestions for that specific type of item.
What is the function of a recommendation system?
- Increase the number of items sold.
- Sell more diverse items.
- Increase user satisfaction.
- Increase user fidelity.
Recommendation Techniques
To implement its core function, identifying the items that are useful to the user, an RS must predict whether an item is worth recommending.
The degree of utility of item i for user u is modeled as a real-valued function R(u, i), as is normally done in collaborative filtering by considering the ratings users give to items. The fundamental task of a collaborative filtering RS is then to predict the value of R over pairs of users and items.
RSs can be grouped into six different approaches:
Content-based: The system learns to recommend items that are similar to the ones that the user liked in the past. The similarity of items is calculated based on the features associated with the compared items.
Collaborative filtering: The simplest and original implementation of this approach recommends to the active user the items that other users with similar tastes liked in the past. The similarity in taste of two users is calculated based on the similarity in the rating history of the users.
Demographic: This type of system recommends items based on the demographic profile of the user. The assumption is that different recommendations should be generated for different demographic niches.
Knowledge-based: Knowledge-based systems recommend items based on specific domain knowledge about how certain item features meet users’ needs and preferences and, ultimately, how the item is useful for the user.
Community-based: This type of system recommends items based on the preferences of the user’s friends. This technique follows the epigram “Tell me who your friends are, and I will tell you who you are”.
Hybrid recommender systems: These RSs are based on a combination of the above-mentioned techniques. A hybrid system combining techniques A and B tries to use the advantages of A to fix the disadvantages of B.
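To make the collaborative filtering entry above more concrete, here is a minimal sketch of comparing users by their rating history. The ratings, user names, and item names are invented for illustration, and Pearson correlation is only one common choice of similarity measure.

```r
# Hypothetical ratings: rows are users, columns are items (1 = disliked, 5 = loved).
ratings <- rbind(
  alice = c(5, 4, 1, 1),
  bob   = c(4, 5, 2, 1),
  carol = c(1, 2, 5, 4)
)
colnames(ratings) <- c("item1", "item2", "item3", "item4")

# cor() correlates columns, so transpose to compare users instead of items.
user_similarity <- cor(t(ratings))
print(round(user_similarity, 2))

# The other user whose rating history is most similar to alice's.
others <- setdiff(rownames(ratings), "alice")
nearest <- others[which.max(user_similarity["alice", others])]
print(nearest)
```

A user-based system would then suggest to alice the items that her nearest user rated highly and that she has not yet rated.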
DATASET:
The data set we’ll use was made available to participants in the R package recommendation contest on Kaggle. The contest organizers provided contestants with the complete package installation record for approximately 50 R programmers. This is quite a small data set, but it is sufficient to start learning about the popularity of different packages and their degree of similarity.
You can download the dataset from the following link:
https://drive.google.com/drive/folders/0B5g0WZIzJLuEQ1daVWR5aFVERjg
PROBLEM STATEMENT:
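Our goal is to predict whether a user would install a package for which the installation data had been withheld, by exploiting our knowledge of which other R packages that user had installed. Because the similarity between packages is computed from users’ installation records rather than from features of the packages themselves, this is item-based collaborative filtering.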
UNDERSTANDING DATA:
In the raw data, user 1 had installed the abind package but not the AcceptanceSampling package. This raw information will give us a measure of similarity between packages once we transform it into a different form.
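The shape of this raw data can be sketched with a tiny stand-in. The values below are invented, and the column names Package, User, and Installed are assumptions made for this sketch rather than something confirmed by the contest file.

```r
# Hypothetical toy data in "long form": one row per (package, user) pair.
# Column names are assumptions for this sketch.
installations <- data.frame(
  Package   = rep(c("abind", "AcceptanceSampling", "ada"), times = 2),
  User      = rep(1:2, each = 3),
  Installed = c(1, 0, 0, 1, 1, 0)
)
print(installations)

# Look up a single (user, package) pair.
user1_abind <- with(installations, Installed[User == 1 & Package == "abind"])
print(user1_abind)
```

With the real data you would read the downloaded CSV file with read.csv instead of building the data frame by hand.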
Next we convert this “long form” data into a “wide form” in which each row corresponds to a user and each column corresponds to a package. We can do that using the cast function from the reshape package.
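The long-to-wide conversion can be sketched as follows. To keep the example runnable without extra packages it uses base R's tapply instead of cast, and it works on invented toy data with assumed column names (Package, User, Installed).

```r
# Hypothetical toy data in long form; column names are assumptions.
installations <- data.frame(
  Package   = rep(c("abind", "AcceptanceSampling", "ada"), times = 2),
  User      = rep(1:2, each = 3),
  Installed = c(1, 0, 0, 1, 1, 0)
)

# One row per user, one column per package; each cell holds the Installed flag.
# Every (user, package) pair occurs exactly once, so sum() just copies the value.
user_package_matrix <- with(
  installations,
  tapply(Installed, list(User = User, Package = Package), sum)
)
print(user_package_matrix)

# With the reshape package, the equivalent call would look roughly like:
#   cast(installations, User ~ Package, value = "Installed")
```

Unlike the tapply result, the data frame returned by cast keeps the user IDs in its first column, which is why that column has to be dealt with in the next step.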
Once we inspect the first column, we realize that it just stores the user IDs, so we remove it after storing that information in the row.names of our matrix. Now we have a proper user-package matrix that makes it trivial to compute similarity measures. For simplicity, we’ll use the correlation between columns as our measure of similarity.
To compute that, we can simply call the cor function from R:
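A minimal sketch of these two steps, moving the user IDs into the row names and then correlating the package columns, might look like this. The wide-form data and the package names pkgA, pkgB, and pkgC are invented for illustration.

```r
# Hypothetical wide-form data: the first column is the user ID,
# the remaining columns are 0/1 installation flags.
user_package <- data.frame(
  User = 1:6,
  pkgA = c(1, 1, 1, 0, 0, 0),
  pkgB = c(1, 1, 0, 0, 0, 1),
  pkgC = c(0, 0, 0, 1, 1, 1)
)

# Drop the ID column and keep the IDs as row names of the matrix.
user_package_matrix <- as.matrix(user_package[, -1])
rownames(user_package_matrix) <- user_package[, 1]

# cor() correlates every pair of columns, i.e. every pair of packages.
similarities <- cor(user_package_matrix)
print(round(similarities, 2))
```

In this toy data pkgA and pkgC are never installed together, so their correlation is -1, while every package has correlation 1 with itself. Note that cor returns NA for a package that every user (or no user) has installed, because such a column has zero variance.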
Now we have the similarity between all pairs of packages. As you can see, package 1 is perfectly similar to package 1 and somewhat dissimilar from package 2. But kNN doesn’t use similarities; it uses distances. So we need to translate similarities into distances. Our approach here is to use some clever algebra so that a similarity of 1 becomes a distance of 0 and a similarity of –1 becomes an infinite distance. The code we’ve written here does this. If it’s not intuitively clear why it works, try to spend some time thinking about how the algebra moves points around.
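One transformation with exactly this property is -log(s / 2 + 0.5), where s is the similarity. The sketch below applies it to an invented similarity matrix; that it matches the code originally shown in this post is an assumption.

```r
# Hypothetical similarity (correlation) matrix for three packages.
similarities <- matrix(
  c( 1.0, 0.0, -1.0,
     0.0, 1.0,  0.5,
    -1.0, 0.5,  1.0),
  nrow = 3, byrow = TRUE
)

# s / 2 + 0.5 rescales similarities from [-1, 1] to [0, 1];
# -log() then maps 1 to 0 and 0 to Inf.
distances <- -log((similarities / 2) + 0.5)
print(round(distances, 3))
```

The rescaling step explains the algebra: a similarity of 1 becomes 1 and then a distance of 0, a similarity of 0 becomes 0.5 and then a distance of log(2), and a similarity of -1 becomes 0 and then an infinite distance.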
With that computation run, we have our distance measure and can start implementing kNN. We’ll use k = 25 for example purposes, but you’d want to try multiple values of k in production to see which works best.
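Finding the nearest neighbors of a package can be sketched as below. The function name k_nearest_neighbors and the small distance matrix are hypothetical; the example uses k = 2 only because the toy matrix has four packages.

```r
# Hypothetical symmetric distance matrix for four packages (0 on the diagonal).
distances <- matrix(
  c(0.0, 0.2, 1.5, 0.7,
    0.2, 0.0, 0.9, 0.4,
    1.5, 0.9, 0.0, 0.3,
    0.7, 0.4, 0.3, 0.0),
  nrow = 4, byrow = TRUE
)

# Indices of the k packages closest to package i, excluding package i itself.
k_nearest_neighbors <- function(i, distances, k = 25) {
  ordered <- order(distances[i, ])
  ordered <- ordered[ordered != i]
  head(ordered, k)
}

print(k_nearest_neighbors(1, distances, k = 2))
```

Dropping package i explicitly, rather than skipping the first sorted position, keeps the result correct even when another package is at distance 0 from i.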
To make our recommendations, we’ll estimate how likely a package is to be installed by simply counting how many of its neighbors are installed. Then we’ll sort packages by this count value and suggest packages with the highest score.
Using the nearest neighbors, we’ll predict the probability of installation as the fraction of a package’s neighbors that the user has installed:
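A sketch of that estimate, under the same assumptions as before (hypothetical function names, an invented distance matrix, and an invented 0/1 user-package matrix):

```r
# Hypothetical distance matrix for four packages.
distances <- matrix(
  c(0.0, 0.2, 1.5, 0.7,
    0.2, 0.0, 0.9, 0.4,
    1.5, 0.9, 0.0, 0.3,
    0.7, 0.4, 0.3, 0.0),
  nrow = 4, byrow = TRUE
)

# Hypothetical installation flags: rows are users, columns are packages.
user_package_matrix <- matrix(
  c(0, 1, 0, 0,
    1, 1, 0, 1),
  nrow = 2, byrow = TRUE
)

# Indices of the k packages closest to package i, excluding package i itself.
k_nearest_neighbors <- function(i, distances, k = 25) {
  ordered <- order(distances[i, ])
  head(ordered[ordered != i], k)
}

# Fraction of the package's k nearest neighbors that the user has installed.
installation_probability <- function(user, package, user_package_matrix,
                                     distances, k = 25) {
  neighbors <- k_nearest_neighbors(package, distances, k = k)
  mean(user_package_matrix[user, neighbors])
}

print(installation_probability(1, 1, user_package_matrix, distances, k = 2))
print(installation_probability(2, 1, user_package_matrix, distances, k = 2))
```

Here the two nearest neighbors of package 1 are packages 2 and 4. User 1 has installed one of them, giving 0.5, and user 2 has installed both, giving 1.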
Here we see that for user 1, package 1 has probability 0.76 of being installed. So what we’d like to do is find the packages that are most likely to be installed and recommend those. We do that by looping over all of the packages, finding their installation probabilities, and displaying the most probable:
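The loop over all packages can be sketched as follows. All names and data are hypothetical, and the toy matrices are far smaller than the contest data.

```r
# Hypothetical distance matrix for four packages.
distances <- matrix(
  c(0.0, 0.2, 1.5, 0.7,
    0.2, 0.0, 0.9, 0.4,
    1.5, 0.9, 0.0, 0.3,
    0.7, 0.4, 0.3, 0.0),
  nrow = 4, byrow = TRUE
)

# Hypothetical installation flags: rows are users, columns are packages.
user_package_matrix <- matrix(
  c(0, 1, 0, 0,
    1, 1, 0, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("1", "2"), c("pkgA", "pkgB", "pkgC", "pkgD"))
)

k_nearest_neighbors <- function(i, distances, k = 25) {
  ordered <- order(distances[i, ])
  head(ordered[ordered != i], k)
}

installation_probability <- function(user, package, user_package_matrix,
                                     distances, k = 25) {
  neighbors <- k_nearest_neighbors(package, distances, k = k)
  mean(user_package_matrix[user, neighbors])
}

# Estimate the installation probability of every package for user 1.
probabilities <- sapply(seq_len(ncol(user_package_matrix)), function(package) {
  installation_probability(1, package, user_package_matrix, distances, k = 2)
})
names(probabilities) <- colnames(user_package_matrix)

# Sort packages from most to least probable.
ranking <- order(probabilities, decreasing = TRUE)
print(probabilities[ranking])
```

In practice you would probably also drop packages the user has already installed before presenting the top of this list.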
One of the things that’s nice about this approach is that we can easily justify our predictions to end users. We can say that we recommended package P to a user because they had already installed packages X, Y, and Z. This transparency can be a real virtue in some applications.
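Such a justification can be produced by listing which of the recommended package's neighbors the user already has. The helper explain_recommendation below is hypothetical, as are the toy matrices it runs on.

```r
# Hypothetical distance matrix for four packages.
distances <- matrix(
  c(0.0, 0.2, 1.5, 0.7,
    0.2, 0.0, 0.9, 0.4,
    1.5, 0.9, 0.0, 0.3,
    0.7, 0.4, 0.3, 0.0),
  nrow = 4, byrow = TRUE
)

# Hypothetical installation flags: rows are users, columns are packages.
user_package_matrix <- matrix(
  c(0, 1, 0, 1,
    1, 0, 0, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("1", "2"), c("pkgA", "pkgB", "pkgC", "pkgD"))
)

k_nearest_neighbors <- function(i, distances, k = 25) {
  ordered <- order(distances[i, ])
  head(ordered[ordered != i], k)
}

# Names of the package's nearest neighbors that the user has installed.
explain_recommendation <- function(user, package, user_package_matrix,
                                   distances, k = 25) {
  neighbors <- k_nearest_neighbors(package, distances, k = k)
  installed <- neighbors[user_package_matrix[user, neighbors] == 1]
  colnames(user_package_matrix)[installed]
}

# Why would pkgA (column 1) be recommended to user 1?
print(explain_recommendation(1, 1, user_package_matrix, distances, k = 2))
```

For user 1 this returns pkgB and pkgD, the installed packages that are closest to pkgA.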