Data Preparation & Exploration Using R
Before we start exploring the data, it helps to know the types of variables:
TYPES OF VARIABLES:
- An independent variable, sometimes called an experimental or predictor variable, is a variable that is being manipulated in an experiment in order to observe the effect on a dependent variable, sometimes called an outcome variable.
- Example: the dependent variable is a test mark (measured from 0 to 100); the independent variables are revision time (measured in hours) and intelligence (measured using IQ score).
- Nominal variables are variables that have two or more categories, but which do not have an intrinsic order. For example, a real estate agent could classify their types of property into distinct categories such as houses, condos, co-ops or bungalows.
- Dichotomous variables are nominal variables which have only two categories or levels. For example, if we were looking at gender, we would most probably categorize somebody as either "male" or "female".
- Ordinal variables are variables that have two or more categories, just like nominal variables, except that the categories can also be ordered or ranked. For example, if you asked someone whether they liked the policies of the Democratic Party and they could answer "Not very much", "They are OK" or "Yes, a lot", then you have an ordinal variable.
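As a rough illustration of how these types can be represented in R, the sketch below stores nominal and dichotomous variables as plain factors and the ordinal variable as an ordered factor. The values are made up to mirror the examples above; they are not taken from the dataset.

```r
# Toy values (hypothetical), mirroring the examples in the list above
property <- factor(c("house", "condo", "co-op", "bungalow", "condo"))  # nominal
gender   <- factor(c("male", "female", "female"))                       # dichotomous
opinion  <- factor(c("They are OK", "Yes, a lot", "Not very much"),
                   levels  = c("Not very much", "They are OK", "Yes, a lot"),
                   ordered = TRUE)                                       # ordinal

nlevels(gender)          # number of categories
is.ordered(property)     # a nominal factor carries no order
is.ordered(opinion)      # an ordinal factor does
opinion[1] < opinion[2]  # so comparisons between levels are meaningful
```

Only the ordered factor supports comparisons such as `<`, which is the practical difference between nominal and ordinal variables in R.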
DATASET:
You can download the dataset from the following link:
https://drive.google.com/drive/folders/0B5g0WZIzJLuEQ1daVWR5aFVERjg
IMPORTING THE DATA AND EXPLORING IT
Here ‘as.is=T’ is added so that dates are read as character strings, which can later be converted to dates. Also note that variable names containing spaces are converted to names with dots after importing. R also converts any special character in a variable name to a dot (.) after importing.
One prominent inconsistency is that missing values are represented both as “No Info” and as blanks. So, the first step should be to replace “No Info” and blanks with NA to get a better sense of the missing values in the data.
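A minimal sketch of the import and the NA replacement is shown here. Because the real file is not reproduced in this article, it reads a small inline CSV instead; the column names and values are hypothetical stand-ins.

```r
# Hypothetical stand-in for the real file; column names are made up
csv_text <- 'Customer ID,Join Date,Annual Income ($),City
1,2015-01-10,52000,Pune
2,No Info,,Delhi
3,2016-03-05,No Info,
'

# as.is = TRUE keeps text columns (including dates) as character
dat <- read.csv(text = csv_text, as.is = TRUE)
names(dat)  # spaces and special characters have become dots

# Replace "No Info" and blank strings with NA in every character column
is_chr <- sapply(dat, is.character)
dat[is_chr] <- lapply(dat[is_chr], function(x) {
  x <- trimws(x)
  x[x %in% c("No Info", "")] <- NA
  x
})

colSums(is.na(dat))  # missing values per column
```

An alternative, if you prefer to handle this at import time, is to pass `na.strings = c("No Info", "")` to `read.csv`.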
While importing the file, many of the continuous values and dates were imported as character vectors that need to be converted to their respective data types.
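The conversion could look something like the sketch below. The column names, the date format and the thousands separator are assumptions made for illustration; check them against the actual file before reusing this.

```r
# Toy data frame (hypothetical columns), as it might look right after import
dat <- data.frame(
  Join.Date     = c("2015-01-10", NA, "2016-03-05"),
  Annual.Income = c("52000", "61,500", NA),
  stringsAsFactors = FALSE
)

# Character -> Date; the format string is an assumption about the file
dat$Join.Date <- as.Date(dat$Join.Date, format = "%Y-%m-%d")

# Character -> numeric; strip thousands separators first
dat$Annual.Income <- as.numeric(gsub(",", "", dat$Annual.Income))

str(dat)
```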
The next step is to understand the amount of missing values in the data. Any variable with more than 50% of its values missing should not be used in modeling, as it can give a false impression of a relationship with the dependent variable and can pollute the model. To be on the conservative side, we generally keep for treatment only the variables with less than 40% missing values; anything with more than 40% missing values is used in testing only, to give additional insights.
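One simple way to get the percentage of missing values per variable is to take the column means of `is.na()`. The data frame below is a toy example, not the real dataset.

```r
# Toy data frame (hypothetical) with different amounts of missingness
dat <- data.frame(
  a = c(1, NA, 3, 4, 5),
  b = c(NA, NA, NA, "x", "y"),
  c = c("p", "q", NA, NA, "r"),
  stringsAsFactors = FALSE
)

# Percentage of missing values in each column
miss_pct <- colMeans(is.na(dat)) * 100
sort(miss_pct, decreasing = TRUE)
```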
Now we can filter out the variables with more than 40% missing values, which we can later use for testing and additional insights, and keep only the variables with less than 40% missing values for modeling.
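A sketch of that split follows, again on toy data. The text does not say what to do with a variable at exactly 40%, so sending it to the held-out set here is an assumption; the names `model_dat` and `holdout_dat` are also just illustrative.

```r
# Toy data frame (hypothetical)
dat <- data.frame(
  a = c(1, NA, 3, 4, 5),
  b = c(NA, NA, NA, "x", "y"),
  c = c("p", "q", "r", "s", "t"),
  stringsAsFactors = FALSE
)

miss_pct  <- colMeans(is.na(dat)) * 100
threshold <- 40  # cut-off from the text, in percent

# Variables kept for modeling (less than 40% missing)
model_dat <- dat[, miss_pct < threshold, drop = FALSE]

# Variables set aside for testing and additional insights
# (assumption: exactly 40% missing is treated as "too many")
holdout_dat <- dat[, miss_pct >= threshold, drop = FALSE]

names(model_dat)
names(holdout_dat)
```

Using `drop = FALSE` keeps the result a data frame even when only one column is selected.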
It is better to separate the numeric and character variables in the data frame. This helps in performing operations going forward.
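The separation can be sketched with `sapply` and `is.numeric`, as shown here on a toy data frame with hypothetical columns.

```r
# Toy data frame (hypothetical columns)
dat <- data.frame(
  age    = c(25, 40, 31),
  income = c(52000, 61500, NA),
  city   = c("Pune", "Delhi", NA),
  stringsAsFactors = FALSE
)

# TRUE for numeric columns, FALSE for everything else
is_num <- sapply(dat, is.numeric)

num_dat <- dat[, is_num, drop = FALSE]   # numeric variables
chr_dat <- dat[, !is_num, drop = FALSE]  # character variables

names(num_dat)
names(chr_dat)
```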
We will examine the values and inconsistencies in each variable to better understand it and correct any problems. There are two ways to examine a variable, depending on its type: for continuous variables you can use the ‘summary’ and ‘quantile’ functions, and for categorical variables the ‘table’ function.
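For instance, the three functions might be used as follows. The vectors are made-up examples of one continuous and one categorical variable.

```r
# Toy vectors (hypothetical): one continuous, one categorical
income <- c(20, 25, 30, 32, 35, 38, 40, 45, 50, 400)
city   <- c("Pune", "Delhi", "Pune", NA, "Mumbai",
            "Pune", "Delhi", "Pune", NA, "Mumbai")

# Continuous variable: five-number summary plus mean, then deciles
summary(income)
quantile(income, probs = seq(0, 1, by = 0.1), na.rm = TRUE)

# Categorical variable: frequency of each level, including missing values
city_counts <- table(city, useNA = "ifany")
city_counts
```

Passing `useNA = "ifany"` makes `table` report missing values as their own category, which it does not do by default.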
Here a big jump in values seems to occur after the 90th percentile. This can be examined further to identify the cut-off point where the relatively larger jump occurs, so that values can be capped before analysis. Capping means that very large, outlier-type values are replaced with relatively meaningful values.
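A minimal sketch of this idea is to look at the upper percentiles in finer steps and then cap at a chosen percentile. The data below is a toy vector, and the 95th percentile is an arbitrary choice for illustration, not the cut-off used for the real dataset.

```r
# Toy vector (hypothetical) with one very large value
x <- c(20, 25, 30, 32, 35, 38, 40, 45, 50, 400)

# Look at the tail in 1% steps to see where the jump happens
quantile(x, probs = seq(0.90, 1, by = 0.01), na.rm = TRUE)

# Assumed cut-off: the 95th percentile (choose yours from the output above)
cap <- quantile(x, probs = 0.95, na.rm = TRUE, names = FALSE)

# Replace anything above the cap with the cap itself
x_capped <- pmin(x, cap)

summary(x_capped)
```

Values at or below the cap are left untouched, so only the extreme tail changes.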