Preprocessing Data with Scikit-Learn

Data Setup

The prediction target for this analysis is the sale price of the home, so we separate the data into X and y accordingly:

# prediction target (DV)
y = df["SalePrice"]

# features to predict sale price (IV)
X = df.drop("SalePrice", axis=1)

from sklearn.model_selection import train_test_split

# Declare relevant columns
relevant_columns = [
    'LotFrontage',  # Linear feet of street connected to property
    'LotArea',      # Lot size in square feet
    'Street',       # Type of road access to property
    'OverallQual',  # Rates the overall material and finish of the house
    'OverallCond',  # Rates the overall condition of the house
    'YearBuilt',    # Original construction date
    'YearRemodAdd', # Remodel date (same as construction date if no remodeling or additions)
    'GrLivArea',    # Above grade (ground) living area square feet
    'FullBath',     # Full bathrooms above grade
    'BedroomAbvGr', # Bedrooms above grade (does NOT include basement bedrooms)
    'TotRmsAbvGrd', # Total rooms above grade (does not include bathrooms)
    'Fireplaces',   # Number of fireplaces
    'FireplaceQu',  # Fireplace quality
    'MoSold',       # Month Sold (MM)
    'YrSold'        # Year Sold (YYYY)
]

# separate data (X = features, y = outcome/prediction) into train set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Reassign X_train so that it only contains relevant columns
# (this must happen after the split, since X_train doesn't exist until then)
X_train = X_train.loc[:, relevant_columns]
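As a quick sanity check on the split itself: with its defaults, `train_test_split` holds out 25% of the rows for the test set, and `random_state` makes the shuffle reproducible. A minimal sketch on toy arrays (the data below is made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 rows, 2 feature columns
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Default test_size=0.25 splits 10 rows into 7 train / 3 test;
# random_state=42 makes the shuffle reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
```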

Missing Indicator for LotFrontage

First, we import sklearn.impute.MissingIndicator (see the scikit-learn documentation). The goal of using a MissingIndicator is to create a new column that records which values were NaN (or some other "missing" placeholder) in the original dataset, in case missingness turns out to be a meaningful signal rather than randomly absent data.

A MissingIndicator is a scikit-learn transformer, meaning that we will use the standard steps for any scikit-learn transformer:

  1. Identify data to be transformed (typically not every column is passed to every transformer)
  2. Instantiate the transformer object
  3. Fit the transformer object (on training data only)
  4. Transform data using the transformer object
  5. Add the transformed data to the other data that was not transformed
from sklearn.impute import MissingIndicator

# (1) Identify data to be transformed
# We only want missing indicators for LotFrontage
frontage_train = X_train[["LotFrontage"]]

# (2) Instantiate the transformer object
missing_indicator = MissingIndicator()

# (3) Fit the transformer object on frontage_train
missing_indicator.fit(frontage_train)

# (4) Transform frontage_train and assign the result
# to frontage_missing_train
frontage_missing_train = missing_indicator.transform(frontage_train)

# Visually inspect frontage_missing_train
frontage_missing_train
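Step (5), attaching the transformed data back onto the untransformed data, is not shown in the cell above, but later sections refer to a LotFrontage_Missing column, so the indicator presumably gets added to X_train along these lines. A self-contained sketch on made-up data (the column name LotFrontage_Missing is an assumption based on how it is referenced later):

```python
import numpy as np
import pandas as pd
from sklearn.impute import MissingIndicator

# Made-up stand-in for X_train with some missing LotFrontage values
X_train = pd.DataFrame({"LotFrontage": [65.0, np.nan, 80.0, np.nan]})

frontage_train = X_train[["LotFrontage"]]
missing_indicator = MissingIndicator()

# fit_transform combines steps (3) and (4); result is an (n, 1) boolean array
frontage_missing_train = missing_indicator.fit_transform(frontage_train)

# (5) Attach the indicator to X_train as a new boolean column
X_train["LotFrontage_Missing"] = frontage_missing_train.ravel()
```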

Imputing Missing Values for LotFrontage

Now that we have noted where missing values were originally present, let’s use a SimpleImputer (see the scikit-learn documentation) to fill in those NaNs in the LotFrontage column.

The process is very similar to the MissingIndicator process, except that we want to replace the original LotFrontage column with the transformed version instead of just adding a new column on.

In the cell below, create and use a SimpleImputer with strategy="median" to transform the values of frontage_train (declared above).

from sklearn.impute import SimpleImputer

# (1) frontage_train was created previously, so we don't
# need to extract the relevant data again

# (2) Instantiate a SimpleImputer with strategy="median"
imputer = SimpleImputer(strategy="median")

# (3) Fit the imputer on frontage_train
imputer.fit(frontage_train)

# (4) Transform frontage_train using the imputer and
# assign the result to frontage_imputed_train
frontage_imputed_train = imputer.transform(frontage_train)

# Visually inspect frontage_imputed_train
frontage_imputed_train
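Step (5) here means overwriting the original LotFrontage column with the imputed values, rather than adding a new column. A self-contained sketch on made-up data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up stand-in for X_train; the median of the observed values is 70.0
X_train = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0, 70.0]})
frontage_train = X_train[["LotFrontage"]]

imputer = SimpleImputer(strategy="median")
frontage_imputed_train = imputer.fit_transform(frontage_train)

# (5) Replace the original column with the imputed version
X_train["LotFrontage"] = frontage_imputed_train.ravel()
```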

Binary Categories

For binary categories, we will use an OrdinalEncoder (see the scikit-learn documentation) to convert the categories of Street and LotFrontage_Missing into binary values (0s and 1s).

Just like in Step 2 when we used the MissingIndicator and SimpleImputer, we will follow these steps:

  1. Identify data to be transformed
  2. Instantiate the transformer object
  3. Fit the transformer object (on training data only)
  4. Transform data using the transformer object
  5. Add the transformed data to the other data that was not transformed

Let’s start with transforming Street:

from sklearn.preprocessing import OrdinalEncoder

# (1) Create a variable street_train that contains the
# relevant column from X_train
# (Use double brackets [[]] to get the appropriate shape)
street_train = X_train[['Street']]

# (2) Instantiate an OrdinalEncoder
encoder_street = OrdinalEncoder()

# (3) Fit the encoder on street_train
encoder_street.fit(street_train)

# (4) Transform street_train using the encoder and
# assign the result to street_encoded_train
street_encoded_train = encoder_street.transform(street_train)

# Flatten for appropriate shape
street_encoded_train = street_encoded_train.flatten()
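The flattened array then replaces the original Street column (step 5). A self-contained sketch on made-up data; note that OrdinalEncoder sorts categories alphabetically, so here "Grvl" maps to 0 and "Pave" to 1:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up stand-in for X_train
X_train = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"]})
street_train = X_train[["Street"]]

encoder_street = OrdinalEncoder()
street_encoded_train = encoder_street.fit_transform(street_train).flatten()

# (5) Overwrite the Street column with its 0/1 encoding
X_train["Street"] = street_encoded_train
```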

Multiple Categories

Unlike Street and LotFrontage_Missing, FireplaceQu has more than two categories. Therefore the process for encoding it numerically is a bit more involved: we will need to create multiple “dummy” columns, each of which represents one category.

To do this, we can use a OneHotEncoder from sklearn.preprocessing (see the scikit-learn documentation).

The first several steps are very similar to those for the other transformers we’ve used so far, although the process of combining the transformed data with the original data differs.

In the cells below, complete steps (0)-(5) of preprocessing the FireplaceQu column using a OneHotEncoder:

# (0) import OneHotEncoder from sklearn.preprocessing
from sklearn.preprocessing import OneHotEncoder

# (1) Create a variable fireplace_qu_train
# extracted from X_train
# (double brackets due to shape expected by OHE)
fireplace_qu_train = X_train[["FireplaceQu"]]

# (2) Instantiate a OneHotEncoder with categories="auto",
# sparse=False, and handle_unknown="ignore"
# (note: in scikit-learn >= 1.2 the sparse parameter is named sparse_output)
ohe = OneHotEncoder(categories="auto", sparse=False, handle_unknown="ignore")

# (3) Fit the encoder on fireplace_qu_train
ohe.fit(fireplace_qu_train)

# (4) Transform fireplace_qu_train using the encoder and
# assign the result to fireplace_qu_encoded_train
fireplace_qu_encoded_train = ohe.transform(fireplace_qu_train)

# (5a) Make the transformed data into a dataframe
fireplace_qu_encoded_train = pd.DataFrame(
    # Pass in NumPy array
    fireplace_qu_encoded_train,
    # Set the column names to the categories found by OHE
    columns=ohe.categories_[0],
    # Set the index to match X_train's index
    index=X_train.index
)

# (5b) Drop original FireplaceQu column
X_train.drop("FireplaceQu", axis=1, inplace=True)

# (5c) Concatenate the new dataframe with current X_train
X_train = pd.concat([X_train, fireplace_qu_encoded_train], axis=1)
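One reason for handle_unknown="ignore" is that the test set may contain a FireplaceQu category the encoder never saw during fitting; instead of raising an error at transform time, the unseen category encodes as a row of all zeros. A minimal sketch on made-up data (using .toarray() to densify, which sidesteps the sparse/sparse_output parameter rename across scikit-learn versions):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up training data with two categories: "Gd" and "TA"
train = pd.DataFrame({"FireplaceQu": ["Gd", "TA", "Gd"]})
# Test data containing a category unseen during fitting
test = pd.DataFrame({"FireplaceQu": ["Ex"]})

ohe = OneHotEncoder(categories="auto", handle_unknown="ignore")
ohe.fit(train)

# The unseen "Ex" category becomes all zeros rather than raising an error
encoded_test = ohe.transform(test).toarray()
```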

Attribution: The Flatiron School

Rebecca Frost-Brewer
Data Scientist