Data Preparation


Missing values


The logical constant NA (not available) is used to represent missing values in R. There are various methods in R that can handle missing values directly.

As a very simple example, we consider the mean value.We create a data set with one attribute with four missing values and try to compute the mean:

x <- c(3,2,NA,4,NA,1,NA,NA,5)
mean(x)
<NA>

The mean value is in this case also a missing value, since R has no information about the missing values and how to handle them. But if we explicitly say that missing values should simply be ignored for the computation of the mean value (na.rm=T), then R returns the mean value of all nonmissing values:

mean(x,na.rm=T)
3

Note that this computation of the mean value implicitly assumes that the values are missing completely at random (MCAR).

Normalization


Normalization and standardization of numeric attributes can be achieved in the following way. The function is.factor returns true if the corresponding attribute is categorical (or ordinal), so that we can ensure with this function that normalization is only applied to all numerical attributes, but not to the categorical ones. With the following R-code, z-score standardization is applied to all numerical attributes:

iris.norm <- iris

# for loop over each coloumn
for (i in c(1:length(iris.norm))){
    if (!is.factor(iris.norm[,i])){
        attr.mean <- mean(iris.norm[,i])
        attr.sd <- sd(iris.norm[,i])
        iris.norm[,i] <- (iris.norm[,i]-attr.mean)/attr.sd
    }
}

Other normalization and standardization techniques can carried out in a similar manner. Of course, instead of the functions mean (for the mean value) and sd (for the standard deviation), other functions like min (for the minimum), max (for the maximum), median (for the median), or IQR (for the interquartile range) have to be used.