Friday 20 November 2015

UNDERSTANDING PCA WITH R


One place where this type of dimensionality reduction is particularly helpful is when dealing with stock market data.



Date          ADC      UTR
2002-01-02    17.7     39.34
2002-01-03    23.78    23.52
...
2011-05-25    16.14    49.3



For example, we might have data that looks like the real historical prices shown above for 25 stocks over the period from January 2, 2002 until May 25, 2011.


Though we’ve only shown 3 columns, there are actually 25, which is far too many columns to deal with in a thoughtful way.



We want to create a single column that tells us how the market is doing on each day by combining the information in the 25 columns that we have access to.



LOADING THE DATASET


You can download the dataset from the following link:

https://drive.google.com/drive/folders/0B5g0WZIzJLuEQ1daVWR5aFVERjg



Now let's look at the structure of our data frame.
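As a rough sketch of this step, the snippet below reads a CSV into a data frame and inspects it with str. The column names (Date, Stock, Close), the tickers AAA and BBB, and the prices are assumptions made up for illustration, and a tiny temporary file stands in for the downloaded dataset, so adjust the path and names to match the real file.

```r
# Stand-in for the downloaded file: a tiny CSV written to a temporary path.
# Column names, tickers, and prices are made up for illustration.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Date,Stock,Close",
  "2002-01-02,AAA,10.0",
  "2002-01-02,BBB,20.0",
  "2002-01-03,AAA,11.0",
  "2002-01-03,BBB,21.0"
), csv_path)

# Read the file; keep the dates as plain strings for now.
prices <- read.csv(csv_path, stringsAsFactors = FALSE)

# Show the column types and the first few values of each column.
str(prices)
```

With the real dataset, replace csv_path with the location of the downloaded file.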





BASIC PREPROCESSING 

The first step is to translate all of the raw datestamps in our data set to properly encoded date variables. To do that, we use the lubridate package from CRAN. This package provides a nice function called ymd that translates strings in year-month-day format into date objects:
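A minimal sketch of the date conversion, assuming the dates sit in a column called Date (a made-up name, like the prices below). It uses lubridate's ymd when the package is installed and otherwise falls back to base R's as.Date, which handles the same year-month-day strings.

```r
# Toy data frame; the column names and prices are made up for illustration.
prices <- data.frame(
  Date  = c("2002-01-02", "2002-01-03", "2011-05-25"),
  Close = c(10.0, 10.5, 12.0),
  stringsAsFactors = FALSE
)

if (requireNamespace("lubridate", quietly = TRUE)) {
  # ymd parses strings in year-month-day order into Date objects.
  prices$Date <- lubridate::ymd(prices$Date)
} else {
  # Base R fallback for the same format.
  prices$Date <- as.Date(prices$Date, format = "%Y-%m-%d")
}

# The column is now a proper date type, so dates can be compared and sorted.
class(prices$Date)
range(prices$Date)
```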

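Next, we reshape the data with the cast function from the reshape package so that each row is a date and each column is a stock. The cast function has you specify which column should define the rows of the output matrix on the left-hand side of the tilde, and which column should define the columns of the result on the right-hand side. The actual entries in the result are specified using value.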
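Here is a hedged sketch of the reshaping step on toy data. The column names Date, Stock and Close are assumptions, as are the tickers and prices. If the reshape package is installed, the snippet calls its cast function with the formula described above. Otherwise it builds the same date-by-stock layout with base R's tapply, which is a stand-in and not what the post used.

```r
# Long-format toy data: one row per (date, stock) pair. All values are made up.
prices <- data.frame(
  Date  = as.Date(rep(c("2002-01-02", "2002-01-03"), each = 2)),
  Stock = rep(c("AAA", "BBB"), times = 2),
  Close = c(10, 20, 11, 21)
)

if (requireNamespace("reshape", quietly = TRUE)) {
  # Rows come from Date (left of the tilde), columns from Stock (right of it),
  # and the cell entries come from the Close column.
  wide <- reshape::cast(prices, Date ~ Stock, value = "Close")
} else {
  # Base R stand-in: one cell per (Date, Stock) combination.
  m <- tapply(prices$Close, list(prices$Date, prices$Stock), mean)
  wide <- data.frame(Date = as.Date(rownames(m)), m,
                     row.names = NULL, check.names = FALSE)
}

wide
```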




CHECKING FOR MISSING VALUES AND IMPUTING THEM





Here we check for missing values and replace each one with the mean of its column.

Let's check again whether there are any missing values now.
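The check, the imputation and the re-check can be sketched in base R roughly as follows. The data frame name wide, the tickers and the prices are made up for illustration, and the code assumes every column except Date holds prices.

```r
# Toy wide-format data with one missing price per stock. All values are made up.
wide <- data.frame(
  Date = as.Date(c("2002-01-02", "2002-01-03", "2002-01-04", "2002-01-07")),
  AAA  = c(10, NA, 12, 14),
  BBB  = c(20, 22, NA, 24)
)
stock_cols <- setdiff(names(wide), "Date")

# Count the missing values in each stock column before imputing.
na_before <- colSums(is.na(wide[stock_cols]))

# Replace each missing price with the mean of its own column.
for (col in stock_cols) {
  col_mean <- mean(wide[[col]], na.rm = TRUE)
  wide[[col]][is.na(wide[[col]])] <- col_mean
}

# Count again: every column should now report zero missing values.
na_after <- colSums(is.na(wide[stock_cols]))
print(na_before)
print(na_after)
```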




As you can check, we now have no missing values.


CHECKING FOR CORRELATION BETWEEN VARIABLES


Table: Correlation between variables


Looking at the correlations between the variables, we find that most of them are positive, so PCA should work well here: when the stocks tend to move together, a single component can capture much of their shared movement.
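To illustrate the correlation check, the sketch below uses simulated prices rather than the real dataset: five hypothetical stocks (S1 to S5) that all follow one shared random-walk "market" factor plus their own noise. It computes the correlation matrix with cor and then the share of stock pairs that are positively correlated.

```r
# Simulated prices: five hypothetical stocks driven by one shared factor.
set.seed(1)
n_days <- 250
market <- cumsum(rnorm(n_days))
stocks <- sapply(1:5, function(i) 20 + 2 * market + rnorm(n_days))
colnames(stocks) <- paste0("S", 1:5)

# Pairwise correlations between the stock columns.
cor_matrix <- cor(stocks)
round(cor_matrix, 2)

# Keep each pair once (lower triangle) and see what share is positive.
pairwise <- cor_matrix[lower.tri(cor_matrix)]
mean(pairwise > 0)
```

With the real data, pass the numeric stock columns of the date-by-stock table to cor in place of stocks.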



APPLYING PCA







Here we get a total of 25 components. The first component is 32.87, which means it accounts for about 32% of the variance. Similarly, the last component accounts for less than 1% of the total variance.
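A sketch of how the components and their share of the variance can be obtained, assuming princomp from base R is the PCA function in use and running on simulated prices for five hypothetical stocks instead of the real 25. Note that princomp prints a standard deviation for each component; the proportion of variance is the squared standard deviation divided by the sum of all the squared standard deviations, which is what summary reports as "Proportion of Variance".

```r
# Simulated prices: five hypothetical stocks driven by one shared factor.
set.seed(1)
n_days <- 250
market <- cumsum(rnorm(n_days))
stocks <- sapply(1:5, function(i) 20 + 2 * market + rnorm(n_days))
colnames(stocks) <- paste0("S", 1:5)

# Fit the PCA: one component per input column.
pca <- princomp(stocks)

# Standard deviation of each component, largest first.
sdev <- pca$sdev

# Share of the total variance explained by each component.
prop_var <- sdev^2 / sum(sdev^2)
round(prop_var, 4)

# summary() prints the same proportions alongside the standard deviations.
summary(pca)
```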


We can examine the first component more closely by looking at its loadings.









Looking at the loadings, we find that most of the loadings for the first component are negative, but they are fairly evenly spread across the stocks.
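The loadings of the first component can be pulled out as in the sketch below, again assuming a princomp fit and using simulated prices for five hypothetical stocks. The overall sign of a principal component is arbitrary, so loadings that are all negative carry the same information as loadings that are all positive.

```r
# Simulated prices: five hypothetical stocks driven by one shared factor.
set.seed(1)
n_days <- 250
market <- cumsum(rnorm(n_days))
stocks <- sapply(1:5, function(i) 20 + 2 * market + rnorm(n_days))
colnames(stocks) <- paste0("S", 1:5)

pca <- princomp(stocks)

# First column of the loadings matrix: the weight each stock gets
# in the first component.
first_loadings <- pca$loadings[, 1]
round(first_loadings, 3)

# How many loadings are negative and how many are positive.
table(sign(first_loadings))
```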


So let's use the first component to predict the market index.
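One way to sketch this step, assuming a princomp fit: calling predict on the fitted object returns the component scores for each day, and the first column of those scores serves as the index. The data here is simulated (five hypothetical stocks following a shared factor called market), so the result can be compared with the factor that generated it. Because the sign of a component is arbitrary, the sketch flips the index if it moves against the average price.

```r
# Simulated prices: five hypothetical stocks driven by one shared factor.
set.seed(1)
n_days <- 250
market <- cumsum(rnorm(n_days))
stocks <- sapply(1:5, function(i) 20 + 2 * market + rnorm(n_days))
colnames(stocks) <- paste0("S", 1:5)

pca <- princomp(stocks)

# Scores on the first component: one index value per day.
market_index <- predict(pca)[, 1]

# Orient the index so that it rises when prices rise on average.
if (cor(market_index, rowMeans(stocks)) < 0) {
  market_index <- -market_index
}

# In this simulation the index should track the hidden market factor closely.
cor(market_index, market)
```

With real prices there is no hidden factor to compare against, so the index would instead be checked against a published market index for the same dates.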





So, as you can see, PCA lets us predict a market index from the prices of the individual stocks.
