Monday, April 20, 2015

Dealing with NAs in your dataset (R)

Content Summary

  1. Testing for NA and NaN using is.na and is.nan
  2. Replacing the NA or NaN with mean or median.
  3. Removing rows with NA or NaN in Age from the dataset
  4. Removing rows with NA or NaN in every columns from the dataset
  5. na.omit

Introduction

In some situations, you may find yourself with a dataset littered with NAs or NaNs and they are preventing you from carrying out the next step of your analysis. NA stands for “Not Available” while NaN stands for “Not a Number”. It could be due to missing data in your survey results or maybe some illegal operations.
In this post, I will be covering some of the common ways to deal with them in a dataset. I will be using a portion of the famous Titanic dataset to illustrate my examples.

Quick look at the dataset

First we read in our data:
setwd("~/Downloads")        # setting working directory
titanic <- read.csv("test.csv", header=TRUE)

A quick glance at the first 6 rows of the dataset:
head(titanic, 6)
##   PassengerId Pclass                                         Name    Sex
## 1         892      3                             Kelly, Mr. James   male
## 2         893      3             Wilkes, Mrs. James (Ellen Needs) female
## 3         894      2                    Myles, Mr. Thomas Francis   male
## 4         895      3                             Wirz, Mr. Albert   male
## 5         896      3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female
## 6         897      3                   Svensson, Mr. Johan Cervin   male
##    Age SibSp Parch  Ticket    Fare Cabin Embarked
## 1 34.5     0     0  330911  7.8292              Q
## 2 47.0     1     0  363272  7.0000              S
## 3 62.0     0     0  240276  9.6875              Q
## 4 27.0     0     0  315154  8.6625              S
## 5 22.0     1     1 3101298 12.2875              S
## 6 14.0     0     0    7538  9.2250              S

Seems that everything is alright. However if we try to calculate the mean of the Age variable, we face some problems:
mean(titanic$Age)
## [1] NA

Looking at the first 100 entries of the Age variable, you can see that there are a lot of NAs inside that column. This results in the NA output when we try to calculate its mean.
titanic$Age[1:100]
##   [1] 34.5 47.0 62.0 27.0 22.0 14.0 30.0 26.0 18.0 21.0   NA 46.0 23.0 63.0
##  [15] 47.0 24.0 35.0 21.0 27.0 45.0 55.0  9.0   NA 21.0 48.0 50.0 22.0 22.5
##  [29] 41.0   NA 50.0 24.0 33.0   NA 30.0 18.5   NA 21.0 25.0   NA 39.0   NA
##  [43] 41.0 30.0 45.0 25.0 45.0   NA 60.0 36.0 24.0 27.0 20.0 28.0   NA 10.0
##  [57] 35.0 25.0   NA 36.0 17.0 32.0 18.0 22.0 13.0   NA 18.0 47.0 31.0 60.0
##  [71] 24.0 21.0 29.0 28.5 35.0 32.5   NA 55.0 30.0 24.0  6.0 67.0 49.0   NA
##  [85]   NA   NA 27.0 18.0   NA  2.0 22.0   NA 27.0   NA 25.0 25.0 76.0 29.0
##  [99] 20.0 33.0

Dealing with the NAs

  1. Testing for NA and NaN using is.na and is.nan This is a basic function to test if an object is NA or NaN. is.na can be used for NAs and NaNs. However is.nan can only be used for NaNs.
vect <- c(1,2,3, NA, NaN)
is.na(vect)
## [1] FALSE FALSE FALSE  TRUE  TRUE
is.nan(vect)
## [1] FALSE FALSE FALSE FALSE  TRUE

  1. Replacing the NA or NaN with mean or median. A common way to deal with these numbers is to replace them with the mean or median of the other available data.
# first we calculate the mean of the available data in Age
age.mean <- mean(titanic$Age, na.rm=TRUE)
titanic$Age[is.na(titanic$Age)] = age.mean
age.mean
## [1] 30.27259
mean(titanic$Age)
## [1] 30.27259
We have successfully replaced all NAs by 30.27259. As a result, the output of mean(titanic$Age) is no longer NA.

  1. Removing rows with NA or NaN in Age from the dataset While removing data from your dataset is generally not a good approach, it is still useful in some case.
NROW(titanic)               # number of rows in the titanic dataset
## [1] 418
titanic <- titanic[!is.na(titanic$Age),]
NROW(titanic)
## [1] 418
This effectively removes all the rows with NA present in the Age column. Do note the use of exclaimation mark before is.na. In total 418 - 332 = 86 rows of data were removed.

  1. Removing rows with NA or NaN in every columns from the dataset Here we use another function complete.cases which returns TRUE if there are no NA at all in the entire row.
titanic <- titanic[complete.cases(titanic),]
NROW(titanic)
## [1] 417
Notice that the remaining rows left is 331 instead of 332. There is another NA present in another column (other than Age).

  1. na.omit This function returns a data frame with the rows containing NA in any columns removed. This is basically the same as 4.
titanic <- na.omit(titanic)
NROW(titanic)
## [1] 417

No comments:

Post a Comment