The logical constant NA (not available) is used to represent missing values in R.
Various functions in R can handle missing values directly.
As a very simple example, we consider the mean value. We create a data set with one attribute with four missing values and try to compute the mean:
x <- c(3, 2, NA, 4, NA, 1, NA, NA, 5)
mean(x)
The mean value is in this case also a missing value, since R has no information
about the missing values and how to handle them. But if we explicitly say that
missing values should simply be ignored for the computation of the mean value
(na.rm=TRUE), then R returns the mean value of all nonmissing values:
mean(x, na.rm=TRUE)
Note that this computation of the mean value implicitly assumes that the values are missing completely at random (MCAR).
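The behavior described above can be checked with a minimal sketch. The variable names n.missing, mean.all, mean.obs, and mean.manual are illustrative and do not appear in the original example; is.na is the base R function that flags missing entries.

```r
# Same data set as above: nine entries, four of them missing.
x <- c(3, 2, NA, 4, NA, 1, NA, NA, 5)

# is.na() flags each missing entry; summing the flags counts them.
n.missing <- sum(is.na(x))

# Without na.rm, the missing values propagate into the result.
mean.all <- mean(x)

# With na.rm=TRUE, the missing values are dropped before averaging.
mean.obs <- mean(x, na.rm = TRUE)

# Equivalent manual version: keep only the nonmissing values.
mean.manual <- mean(x[!is.na(x)])

print(c(n.missing = n.missing, mean.all = mean.all,
        mean.obs = mean.obs, mean.manual = mean.manual))
```

Dropping the missing entries by hand gives the same result as na.rm=TRUE, which makes explicit that the mean is computed from the five observed values only.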
Normalization and standardization of numeric attributes can be achieved in the following
way. The function is.factor returns TRUE if the corresponding attribute is stored as a
factor, i.e., is categorical (or ordinal), so that we can use it to ensure that normalization
is applied only to the numerical attributes and not to the categorical ones. With the
following R code, z-score standardization is applied to all numerical attributes:
iris.norm <- iris
# loop over each column
for (i in seq_along(iris.norm)) {
  # standardize only the numerical attributes
  if (!is.factor(iris.norm[, i])) {
    attr.mean <- mean(iris.norm[, i])
    attr.sd <- sd(iris.norm[, i])
    iris.norm[, i] <- (iris.norm[, i] - attr.mean) / attr.sd
  }
}
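As an aside, the same z-score standardization can probably be written more compactly with the base R function scale, which by default centers each column by its mean and divides by its standard deviation. The following is only a sketch of that alternative, not the method used in the text; the names num.cols and iris.scaled are illustrative.

```r
# Identify the numerical attributes (everything that is not a factor).
num.cols <- !sapply(iris, is.factor)

# scale() centers each column by its mean and divides by its standard
# deviation; as.data.frame() turns the resulting matrix back into columns.
iris.scaled <- iris
iris.scaled[num.cols] <- as.data.frame(scale(iris[num.cols]))

# After standardization, each numerical attribute should have
# mean (approximately) 0 and standard deviation 1.
print(round(colMeans(iris.scaled[num.cols]), 10))
print(sapply(iris.scaled[num.cols], sd))
```

Checking the column means and standard deviations afterwards is a quick way to confirm that the standardization was applied to the intended attributes.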
Other normalization and standardization techniques can be carried out in a similar
manner. Of course, instead of the functions mean (for the mean value) and sd (for
the standard deviation), other functions such as min (for the minimum), max (for the
maximum), median (for the median), or IQR (for the interquartile range) have to
be used.
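To illustrate, the loop from the z-score example can be adapted as follows. This is a sketch under the assumption that min-max normalization rescales each attribute to the interval [0, 1] and that robust standardization subtracts the median and divides by the interquartile range; the names iris.minmax and iris.robust are illustrative.

```r
# Min-max normalization: rescale every numerical attribute to [0, 1].
iris.minmax <- iris
for (i in seq_along(iris.minmax)) {
  if (!is.factor(iris.minmax[, i])) {
    attr.min <- min(iris.minmax[, i])
    attr.max <- max(iris.minmax[, i])
    iris.minmax[, i] <- (iris.minmax[, i] - attr.min) / (attr.max - attr.min)
  }
}

# Robust standardization: center by the median, scale by the
# interquartile range instead of mean and standard deviation.
iris.robust <- iris
for (i in seq_along(iris.robust)) {
  if (!is.factor(iris.robust[, i])) {
    attr.median <- median(iris.robust[, i])
    attr.iqr <- IQR(iris.robust[, i])
    iris.robust[, i] <- (iris.robust[, i] - attr.median) / attr.iqr
  }
}

summary(iris.minmax)
summary(iris.robust)
```

Note that the min-max variant divides by zero for a constant attribute, so such attributes would need to be handled separately.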