If we can assume that values are missing completely at random (MCAR), then we can use a number of imputation strategies available in the sklearn
library. To demonstrate, we create some toy data sets.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
# Creating toy data sets (numerical)
X_train = pd.DataFrame()
X_train['X'] = [3, 2, 1, 4, 5, np.nan, np.nan, 5, 2]
X_test = pd.DataFrame()
X_test['X'] = [3, np.nan, np.nan]
Here, we created data frames X_train and X_test, both containing a numerical column X. There are some missing values in these data sets, defined by np.nan, the missing-value marker available in the numpy library. If we examine these data frames, the missing values are indicated by NaN.
X_train
X_test
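As a quick sanity check (an addition not shown in the original code), we can count the missing values in each data frame with the isna method available in pandas.
# Counting missing values per column (a quick check)
X_train.isna().sum()
X_test.isna().sum()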
We also create data frames S_train and S_test with a string column S, with some missing values indicated by np.nan.
# Creating toy data sets (categorical)
S_train = pd.DataFrame()
S_train['S'] = ['Hi', 'Med', 'Med', 'Hi', 'Low', 'Med', np.nan, 'Med', 'Hi']
S_test = pd.DataFrame()
S_test['S'] = [np.nan, np.nan, 'Low']
S_train
S_test
We impute missing values by creating an imputation object, SimpleImputer, available in the sklearn.impute library. We define a numerical imputation object imp_mean as follows.
# Imputing numerical data with mean
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
Here, the parameter missing_values defines which value is considered missing. The parameter strategy defines the imputation method. In this example, we use 'mean', meaning that the mean of all non-missing values will be imputed. We use the fit_transform method and provide the X_train data to calculate the mean to be imputed, in addition to actually imputing the missing values.
X_train_imp = imp_mean.fit_transform(X_train)
X_train_imp
As you can see, the missing values are imputed with the mean. We can apply this imputation strategy (with the mean calculated on the training data) to the second data set, X_test, using the transform method.
X_test_imp = imp_mean.transform(X_test)
X_test_imp
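If we want to confirm the value used for imputation, the fitted imputer stores it in its statistics_ attribute; inspecting it is a small addition to the example above.
# The mean learned from X_train and reused for X_test
imp_mean.statistics_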
For categorical or string data, we can impute the most frequent value by setting the parameter strategy to 'most_frequent' in the SimpleImputer object. Here, we define the imputation object imp_mode, which imputes the most frequent category found in the S_train data into both the S_train and S_test data sets.
# Imputing categorical data with mode
imp_mode = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
S_train_imp = imp_mode.fit_transform(S_train)
S_test_imp = imp_mode.transform(S_test)
S_train_imp
S_test_imp
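Likewise, we can verify which category was imputed by inspecting the statistics_ attribute of the fitted imp_mode object; this check is an addition to the example.
# The most frequent category learned from S_train
imp_mode.statistics_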
For this example, we shall use the Iris data set again. We load the Iris data and split it into training and testing data, with the testing data comprising 1/3 of all observations.
from sklearn import datasets
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
# loading the Iris data
iris = datasets.load_iris()
X = iris.data # array for the features
y = iris.target # array for the target
feature_names = iris.feature_names # feature names
target_names = iris.target_names # target names
# splitting the data into training and testing data sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.333,
random_state=2020)
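As a quick check (not part of the original example), we can confirm that roughly one third of the 150 observations ended up in the testing set.
# Checking the sizes of the resulting splits
X_train.shape, X_test.shape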
We can normalize the data to Z-scores using the StandardScaler transformation object, as we have seen in a previous chapter. The transformation object is trained with the training data X_train using the fit_transform method. Then the trained transformation is applied to the testing data with the transform method.
# z-score normalization
normZ = StandardScaler()
X_train_Z = normZ.fit_transform(X_train)
X_test_Z = normZ.transform(X_test)
The resulting means and standard deviations for the normalized training data set are:
X_train_Z.mean(axis=0)
X_train_Z.std(axis=0)
Likewise, the means and standard deviations of the normalized testing data set are:
X_test_Z.mean(axis=0)
X_test_Z.std(axis=0)
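The means and standard deviations used in the transformation were learned from the training data; as a small addition, we can inspect them through the mean_ and scale_ attributes of the fitted StandardScaler object.
# Per-feature mean and standard deviation learned from X_train
normZ.mean_
normZ.scale_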
To apply a min-max scaling, thus scaling all features to the [0, 1] interval, we can use the MinMaxScaler object available in the sklearn.preprocessing library.
# min-max normalization
normMinMax = MinMaxScaler()
X_train_MinMax = normMinMax.fit_transform(X_train)
X_test_MinMax = normMinMax.transform(X_test)
Let's examine the minimum and the maximum of the normalized training data.
X_train_MinMax.min(axis=0)
X_train_MinMax.max(axis=0)
Likewise, we examine the minimum and the maximum of the normalized testing data. It should be noted that the minimum and the maximum used in the transformation were determined based on the training data. Thus, there is no guarantee that the testing values fall within the interval [0, 1].
X_test_MinMax.min(axis=0)
X_test_MinMax.max(axis=0)
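If values outside [0, 1] in the testing data are a concern, one possible remedy (assuming scikit-learn 0.24 or later; this is not part of the original example) is to clip the transformed values to the interval with the clip parameter of MinMaxScaler.
# Optional: clip transformed testing values to [0, 1] (requires scikit-learn >= 0.24)
normMinMaxClip = MinMaxScaler(clip=True)
X_train_clip = normMinMaxClip.fit_transform(X_train)
X_test_clip = normMinMaxClip.transform(X_test)
X_test_clip.min(axis=0)
X_test_clip.max(axis=0)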