Data Analysis And Visualization.
Suppose your goal is to build a model to predict which of your customers don’t have health insurance; perhaps you want to market inexpensive health insurance packages to them. You’ve collected a dataset of customers whose health insurance status you know. You’ve also identified some customer properties that you believe help predict the probability of insurance coverage: age, employment status, income, information about residence and vehicles, and so on.
In this assignment we’ll address issues that you can discover during the data exploration/visualization phase. First you’ll treat missing values. Then you’ll apply some common data transformations and see when they’re appropriate: converting continuous variables to discrete, normalization and rescaling, and logarithmic transformations.
Customer data can be downloaded from: custdata.RDS
1. Load the data into a data frame named custData using the readRDS() function.
If you saved the file custdata.RDS in the folder C:/tmp, load the data as
custData <- readRDS("C:/tmp/custdata.RDS")
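The following is a minimal round-trip sketch of readRDS(). It uses a small made-up data frame and a temporary file rather than the actual custdata.RDS, so the names toy, rdsPath, and custToy and all the values are assumptions for illustration only.

```r
# Hypothetical stand-in for the real customer data.
toy <- data.frame(age = c(24, 0, 67), income = c(22000, -500, 41000))

# Write it to a temporary .RDS file, then read it back the same way
# custdata.RDS would be read.
rdsPath <- tempfile(fileext = ".RDS")
saveRDS(toy, rdsPath)
custToy <- readRDS(rdsPath)

# readRDS() returns the stored object, so the result has to be assigned
# to a name before it can be used.
identical(toy, custToy)
```

Using forward slashes in the path, as in C:/tmp, avoids having to escape backslashes on Windows.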
2. Print the number of rows and columns in the data frame. Use the dim() function.
3. Print column names.
4. Print the number of NAs in each column.
Hint: One way to count NAs is to pass the column to is.na() and then apply sum() to the result.
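Steps 2 to 4 can be sketched on a small made-up data frame. The data frame toy and its values are hypothetical, and colSums() is just one of several ways to count NAs per column.

```r
# Hypothetical toy data frame; the real custData has many more columns.
toy <- data.frame(
  age       = c(24, NA, 67, 35),
  income    = c(22000, 38000, NA, NA),
  gas_usage = c(3, 40, NA, 120)
)

dim(toy)        # number of rows, then number of columns
colnames(toy)   # column names

# is.na() on a data frame gives a logical matrix; colSums() counts the
# TRUE entries in each column.
naCounts <- colSums(is.na(toy))
naCounts

# The same count for a single column, as the hint suggests.
sum(is.na(toy$income))
```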
5. Adding New Columns to a Data Frame
The variable gas_usage mixes numeric and symbolic data: values greater than 3 are
monthly gas bills, but values from 1 to 3 are special codes. In addition, gas_usage has
some missing values.
The value 1 means “Gas bill included in rent or condo fee”.
The value 2 means “Gas bill included in electricity payment”.
The value 3 means “No charge or gas not used”.
One way to treat gas_usage is to convert all the special codes (1, 2, 3) to NA, and to add three new indicator variables, one for each code. For example, the indicator variable gas_with_electricity will have the value 1 whenever the original gas_usage variable had the value 2, and the value 0 otherwise.
A) Create the three new indicator variables, gas_with_rent, gas_with_electricity, and no_gas_bill. Add these indicators to the data frame custData.
Hint: Use the ifelse() function. Check textbook pages 66-67 for examples.
B) Print the column names of custData to check if these new columns are added.
6. Convert Invalid Values to NA
The variable age has the problematic value 0, which probably means that the age is unknown. In addition, there are a few customers with age greater than 100, which may also be an error. However, for this project you decide to only treat the value 0 as invalid, and to assume ages greater than one hundred years are valid.
The variable income has negative values. We’ll assume for this project those values are invalid.
A) Convert invalid age and income values to NA, so that they are treated as missing values.
B) Convert all values of gas_usage that are less than 4 to NA. (We already created three indicators for the codes 1, 2, and 3 in the gas_usage column, so these entries should be labeled as missing values because they don’t represent a gas bill amount.)
Hint: Use the ifelse() function. Check textbook pages 66-67 for examples.
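The rules in step 6 can be sketched on a few made-up rows. The data frame toy and its values are hypothetical and were picked so that each rule is exercised once.

```r
# Hypothetical toy values chosen to cover each case.
toy <- data.frame(
  age       = c(0, 34, 110),
  income    = c(-6900, 52000, 0),
  gas_usage = c(2, 3, 80)
)

# Age 0 is treated as unknown; ages over 100 are kept as valid.
toy$age <- ifelse(toy$age == 0, NA, toy$age)

# Negative incomes are treated as invalid; an income of 0 is kept.
toy$income <- ifelse(toy$income < 0, NA, toy$income)

# Values below 4 are special codes, not bill amounts.
toy$gas_usage <- ifelse(toy$gas_usage < 4, NA, toy$gas_usage)

toy
```

Entries that were already NA stay NA, because a comparison with NA yields NA and ifelse() passes that through.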
7. Bar Charts, Histograms, Scatter Plots
A) Plot bar charts of the predictors num_vehicles, recent_move, health_ins, marital_status, is_employed, and housing_type.
The following is the bar chart of the housing_type:
B) Plot histograms of age and income. Comment on the distribution and skewness of the data for these predictors.
C) Print the scatter plot of age versus income:
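The three plot types in step 7 can be sketched with base R graphics on made-up data. The data frame toy, its category labels, and the simulated values are all hypothetical. Base graphics are used here only to avoid a package dependency; the textbook's listings may use a different plotting package.

```r
# Hypothetical toy data; the real columns come from custData.
set.seed(1)
toy <- data.frame(
  housing_type = sample(c("Rented", "Owned", "Other"), 200, replace = TRUE),
  age          = round(runif(200, 18, 90)),
  income       = round(rlnorm(200, meanlog = 10.5, sdlog = 0.8))
)

# Bar chart: table() counts each category, barplot() draws the counts.
typeCounts <- table(toy$housing_type)
barplot(typeCounts, main = "housing_type", ylab = "count")

# Histogram of a continuous variable.
ageHist <- hist(toy$age, main = "age", xlab = "age")

# Scatter plot of age versus income.
plot(toy$age, toy$income, xlab = "age", ylab = "income")
```

A bar chart suits a categorical variable such as housing_type, while a histogram bins a continuous variable such as age.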
8. Density Plot and Transformation to Eliminate Skew
A) Print the density plots of income and age.
B) Is the data right-skewed or left-skewed?
C) If data is skewed, apply a transformation to remove the skewness as much as possible.
Hint: Check textbook pages 74-75.
The following is the density plot of income:
And the following is the density plot after log10() is used to transform income:
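The effect of the log10() transformation can be sketched on simulated data. The vector income below is hypothetical (drawn from a log-normal distribution so that it is right-skewed and strictly positive) and does not reproduce the real income column.

```r
set.seed(42)
# Hypothetical right-skewed incomes, all strictly positive.
income <- rlnorm(1000, meanlog = 10.5, sdlog = 0.9)

# Right skew: the long upper tail pulls the mean above the median.
mean(income) > median(income)

plot(density(income), main = "income")

# log10() is only defined for positive values, so zero or negative
# incomes would have to be handled before transforming.
logIncome <- log10(income)
plot(density(logIncome), main = "log10(income)")
```

If the real column contains NAs, density() needs na.rm = TRUE, since it stops with an error on missing values by default.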
9. Convert Continuous Variable to Discrete
We would like to create the following ranges for the age predictor.
[0,25], (25,65], (65,130]
A) Use the cut() function to cut the age predictor data into the ranges given above. Add the result to the data frame custData as a new predictor named ageRange.
Hint: Listing 4.6 in the textbook, page 71.
B) Plot the bar chart of the ageRange, as shown below:
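A minimal sketch of cut() with the ranges from step 9, applied to a made-up vector. The vector age is hypothetical and includes the boundary values 25 and 65 to show which range each one lands in.

```r
# Hypothetical ages, including the boundary values 25 and 65.
age <- c(0, 18, 25, 26, 65, 66, 101, NA)

# right = TRUE (the default) closes each interval on the right;
# include.lowest = TRUE also closes the first interval on the left,
# so that 0 falls into [0,25].
ageRange <- cut(age, breaks = c(0, 25, 65, 130), include.lowest = TRUE)

levels(ageRange)
table(ageRange, useNA = "ifany")

# Bar chart of the counts per range.
barplot(table(ageRange), main = "ageRange", ylab = "count")
```

Ages that are NA stay NA after cut(), so they are not counted in any range unless they are imputed first.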
10. Imputed Value for the age Predictor
You might believe that the data is missing because the data collection failed at random, independent of the situation and of the other values. In this case, you can replace the missing values with “a reasonable estimate,” or imputed value. Statistically, one commonly used estimate is the expected value, or mean.
For the age predictor, replace all NAs with the mean of the age values that are not NA.
Caution: The R mean() function returns a double, not an integer. Make sure that you convert it to an integer using the as.integer() function.
A) Print the mean value you found.
B) After replacing the NAs with the mean value, repeat the process from part 9 above to print the bar chart of ageRange:
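Mean imputation can be sketched on a short made-up vector. The vector age and the names meanAge and meanAgeInt are hypothetical; the values were picked so that the mean is not a whole number.

```r
# Hypothetical ages with two missing values.
age <- c(20L, NA, 41L, 51L, NA)

# Mean of the non-missing values; na.rm = TRUE drops the NAs first.
meanAge <- mean(age, na.rm = TRUE)
meanAge

# as.integer() truncates toward zero, so 37.33... becomes 37.
meanAgeInt <- as.integer(meanAge)

# Replace only the missing entries.
age[is.na(age)] <- meanAgeInt
age
```

Because as.integer() truncates rather than rounds, applying round() before the conversion is an option when rounding to the nearest integer is preferred.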
In part 5 of the week 4 assignment, one of the questions is about adding indicator variables (new columns) to the data frame. The following statement describes the indicator variable gas_with_electricity:
For example, the indicator variable gas_with_electricity will have the value 1 whenever the original gas_usage variable had the value 2, and the value 0 otherwise.
Assume that the data frame is named custData. To add the indicator variable gas_with_electricity to custData, use ifelse() as shown below:
custData$gas_with_electricity<-ifelse(custData$gas_usage==2,1,0)
The statement above adds a new column named gas_with_electricity to custData. Its value is 1 wherever gas_usage is 2, and 0 otherwise.
The other two columns will be added similarly.
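As a sketch, the same pattern applied to all three codes on a made-up column looks like this. The data frame toy and its values are hypothetical; they cover each code, one real bill amount, and one missing value.

```r
# Hypothetical gas_usage values: the three codes, a real bill, and an NA.
toy <- data.frame(gas_usage = c(1, 2, 3, 60, NA))

# Same pattern as gas_with_electricity, applied to each code.
toy$gas_with_rent        <- ifelse(toy$gas_usage == 1, 1, 0)
toy$gas_with_electricity <- ifelse(toy$gas_usage == 2, 1, 0)
toy$no_gas_bill          <- ifelse(toy$gas_usage == 3, 1, 0)

# Where gas_usage is NA the comparison is NA, so ifelse() returns NA too.
toy
```

The indicators have to be created before the codes in gas_usage are overwritten with NA in part 6 B, because after that step the codes can no longer be recovered from the column.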