UNDERSTANDING PCA WITH R
One place where this type of dimensionality reduction is particularly helpful is when dealing with stock market data.
Date | ADC | UTR |
---|---|---|
2002-01-02 | 17.7 | 39.34 |
2002-01-03 | 23.78 | 23.52 |
2011-05-25 | 16.14 | 49.3 |
For example, we might have data that looks like the real historical prices shown above for 25 stocks over the period from January 2, 2002 until May 25, 2011.
Though we’ve only shown 3 columns, there are actually 25, which is far too many columns to deal with in a thoughtful way.
We want to create a single column that tells us how the market is doing on each day by combining the information in the 25 columns that we have access to.
LOADING THE DATASET
You can download the dataset from the following link:
https://drive.google.com/drive/folders/0B5g0WZIzJLuEQ1daVWR5aFVERjg
Now let's look at the structure of our data frame.
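A minimal sketch of loading and inspecting the data. It assumes the download is a CSV in long format with Date, Stock, and Close columns; the file name and column names are assumptions, not confirmed by the link above. A tiny stand-in file is written first so the snippet runs on its own.

```r
# Stand-in for the downloaded file: a tiny CSV in "long" format.
# The column names Date, Stock, and Close are assumptions for illustration.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Date,Stock,Close",
  "2002-01-02,ADC,17.70",
  "2002-01-02,UTR,39.34",
  "2011-05-25,ADC,16.14",
  "2011-05-25,UTR,49.30"
), csv_path)

# Read the file, keeping dates as plain strings for now.
prices <- read.csv(csv_path, stringsAsFactors = FALSE)

# str() prints each column's name, type, and first few values.
str(prices)
```

With the real file, only the path passed to read.csv would change.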
BASIC PREPROCESSING
The first step is to translate all of the raw datestamps in our data set to properly encoded date variables. To do that, we use the lubridate package from CRAN. This package provides a nice function called ymd that translates strings in year-month-day format into date objects.

Next, we reshape the data with the cast function from the reshape package. The cast function has you specify which column should be used to define the rows in the output matrix on the left-hand side of the tilde, and the columns of the result are specified after the tilde. The actual entries in the result are specified using value.
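The two preprocessing steps can be sketched as follows, assuming the lubridate and reshape packages are installed. The data frame and its column names (Date, Stock, Close) are toy assumptions for illustration, not the real dataset.

```r
library(lubridate)  # provides ymd()
library(reshape)    # provides cast()

# Toy long-format data; names and values are illustrative only.
prices <- data.frame(
  Date  = c("2002-01-02", "2002-01-02", "2002-01-03", "2002-01-03"),
  Stock = c("ADC", "UTR", "ADC", "UTR"),
  Close = c(10, 20, 11, 21),
  stringsAsFactors = FALSE
)

# Convert year-month-day strings into Date objects.
prices$Date <- ymd(prices$Date)

# Rows come from Date (left of the tilde), columns from Stock (right of it),
# and the cell entries are taken from the Close column.
date_stock_matrix <- cast(prices, Date ~ Stock, value = "Close")
print(date_stock_matrix)
```

The result has one row per date and one column per stock, which is the shape PCA needs.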
CHECKING FOR MISSING VALUES AND IMPUTING THEM
Here we check for missing values and replace each one with the mean of its column.
Let's check again whether there are any missing values now.
As the check confirms, we now have no missing values.
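A minimal sketch of column-mean imputation in base R. The data frame below is a toy stand-in with made-up values, not the real stock matrix.

```r
# Toy wide-format data with gaps; values are illustrative only.
stocks <- data.frame(
  ADC = c(10, NA, 12, 14),
  UTR = c(20, 22, NA, 24)
)

# Count missing values per column before imputing.
print(colSums(is.na(stocks)))

# Replace each NA with the mean of the observed values in its column.
for (col in names(stocks)) {
  missing <- is.na(stocks[[col]])
  stocks[[col]][missing] <- mean(stocks[[col]], na.rm = TRUE)
}
print(stocks)
```

Running colSums(is.na(stocks)) again afterwards should report zero for every column.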
CHECKING FOR CORRELATION BETWEEN VARIABLES
Table: Correlation between variables
Looking at the correlations between variables, we find that most of them are positive. The stocks tend to move together, so a single component can capture much of their shared movement, and PCA should work well here.
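One way to run this check is sketched below. The prices are simulated: three hypothetical stocks driven by one shared factor, an assumption made only so the snippet runs without the real data.

```r
set.seed(1)

# Simulated prices: three hypothetical stocks driven by one shared market factor.
market <- cumsum(rnorm(200))
stocks <- data.frame(
  A = 50 + market + rnorm(200, sd = 0.5),
  B = 30 + 0.8 * market + rnorm(200, sd = 0.5),
  C = 20 + 0.5 * market + rnorm(200, sd = 0.5)
)

# Pairwise Pearson correlations between the columns.
cor_matrix <- cor(stocks)
print(round(cor_matrix, 2))

# Share of distinct stock pairs whose correlation is positive.
pairwise <- cor_matrix[upper.tri(cor_matrix)]
print(mean(pairwise > 0))
```

Using upper.tri keeps each pair once and leaves out the diagonal, where every correlation is trivially 1.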
APPLYING PCA
Here we get a total of 25 components. The first component comes in at 32.87, meaning it alone accounts for roughly a third of the variance. Similarly, the last component accounts for less than 1% of the total variance.
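A sketch of how such a variance breakdown can be produced. It assumes princomp from base R's stats package is the PCA routine (the post does not show which function was used) and runs on simulated three-stock data rather than the real 25 columns.

```r
set.seed(1)

# Simulated prices: three hypothetical stocks driven by one shared market factor.
market <- cumsum(rnorm(200))
stocks <- data.frame(
  A = 50 + market + rnorm(200, sd = 0.5),
  B = 30 + 0.8 * market + rnorm(200, sd = 0.5),
  C = 20 + 0.5 * market + rnorm(200, sd = 0.5)
)

# princomp() centers the columns and computes the principal components.
pca <- princomp(stocks)

# summary() lists each component's standard deviation and share of variance.
print(summary(pca))

# The same shares computed by hand from the component standard deviations.
var_share <- pca$sdev^2 / sum(pca$sdev^2)
print(round(100 * var_share, 2))
```

There is one component per input column, and the shares sum to 1 in decreasing order.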
We can look at the first component more clearly by looking at its loadings.
Looking at the loadings, we find that most of the loadings for the first component are negative, but they are fairly evenly spread across the stocks.
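The loadings can be inspected as sketched here, again assuming princomp and simulated data. Note that the sign of a principal component is arbitrary, so loadings that are all negative carry the same information as loadings that are all positive.

```r
set.seed(1)

# Simulated prices: three hypothetical stocks driven by one shared market factor.
market <- cumsum(rnorm(200))
stocks <- data.frame(
  A = 50 + market + rnorm(200, sd = 0.5),
  B = 30 + 0.8 * market + rnorm(200, sd = 0.5),
  C = 20 + 0.5 * market + rnorm(200, sd = 0.5)
)

pca <- princomp(stocks)

# Loadings of the first component: one weight per stock.
first_loadings <- pca$loadings[, 1]
print(first_loadings)

# The weights form a unit-length vector, so their squares sum to 1.
print(sum(first_loadings^2))
```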
So let's use the first component to predict the market index.
As you can see, the first principal component of the different stocks gives us a prediction of the market index.
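A sketch of turning the first component into a one-column index. It assumes princomp and simulated data; the comparison against the simulated market factor stands in for a comparison with a real index, which this snippet does not have.

```r
set.seed(1)

# Simulated prices: three hypothetical stocks driven by one shared market factor.
market <- cumsum(rnorm(200))
stocks <- data.frame(
  A = 50 + market + rnorm(200, sd = 0.5),
  B = 30 + 0.8 * market + rnorm(200, sd = 0.5),
  C = 20 + 0.5 * market + rnorm(200, sd = 0.5)
)

pca <- princomp(stocks)

# predict() returns the component scores; column 1 is the first component.
market_index <- predict(pca)[, 1]

# The sign of a component is arbitrary, so flip it when the loadings are
# negative so that a higher index means higher prices.
if (sum(pca$loadings[, 1]) < 0) market_index <- -market_index

# Compare against the simulated market factor that drove the toy prices.
print(cor(market_index, market))
```

The sign flip matters in practice: without it, an index built from negative loadings would fall when prices rise.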