Nonparametric Machine Learning Model Example

Nonparametric ML Models - Cumulative Lab

Introduction

This lab demonstrates my application of two nonparametric models, k-nearest neighbors and decision trees, to the forest cover dataset.

Here I will be using an adapted version of the forest cover dataset from the UCI Machine Learning Repository. Each record represents a 30 x 30 meter cell of land within Roosevelt National Forest in northern Colorado, which has been labeled as Cover_Type 1 for “Cottonwood/Willow” and Cover_Type 0 for “Ponderosa Pine”. (The original dataset contained 7 cover types but this has been simplified.)

The task is to predict Cover_Type from the available cartographic variables.

There are over 38,000 rows, each with 52 feature columns and 1 target column:

  • Elevation: Elevation in meters
  • Aspect: Aspect in degrees azimuth
  • Slope: Slope in degrees
  • Horizontal_Distance_To_Hydrology: Horizontal dist to nearest surface water features in meters
  • Vertical_Distance_To_Hydrology: Vertical dist to nearest surface water features in meters
  • Horizontal_Distance_To_Roadways: Horizontal dist to nearest roadway in meters
  • Hillshade_9am: Hillshade index at 9am, summer solstice
  • Hillshade_Noon: Hillshade index at noon, summer solstice
  • Hillshade_3pm: Hillshade index at 3pm, summer solstice
  • Horizontal_Distance_To_Fire_Points: Horizontal dist to nearest wildfire ignition points, meters
  • Wilderness_Area_x: Wilderness area designation (3 columns)
  • Soil_Type_x: Soil Type designation (39 columns)
  • Cover_Type: 1 for cottonwood/willow, 0 for ponderosa pine

This is also an imbalanced dataset, since cottonwood/willow trees are relatively rare in this forest.
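A quick check of the shape and class balance could look like the sketch below; the read_csv path is a hypothetical placeholder, since df is loaded before this point in the notebook.

import pandas as pd

# Hypothetical path; adjust to wherever the adapted forest cover CSV lives
df = pd.read_csv("data/forest_cover.csv")

# Expect 38,000+ rows and 53 columns (52 features plus Cover_Type)
print(df.shape)

# Class balance: Cover_Type 1 (cottonwood/willow) is the rare class
print(df["Cover_Type"].value_counts(normalize=True))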

1. Set Up Modeling

# Import train_test_split 
from sklearn.model_selection import train_test_split

# Split the data
X = df.drop("Cover_Type", axis=1)
y = df["Cover_Type"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify = y)
# Import StandardScaler
from sklearn.preprocessing import StandardScaler

# Instantiate StandardScaler
scaler = StandardScaler()

# Transform the training and test sets
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

2. Build a Baseline kNN Model

Build a scikit-learn kNN model with default hyperparameters. Then use cross_val_score with scoring="neg_log_loss" to find the mean log loss for this model (passing in X_train_scaled and y_train to cross_val_score). Find the mean of the cross-validated scores, and negate the value (either put a - at the beginning or multiply by -1) so that the answer is a log loss rather than a negative log loss.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

knn_baseline_model = KNeighborsClassifier()

knn_baseline_log_loss = -cross_val_score(knn_baseline_model, X_train_scaled, y_train, scoring="neg_log_loss").mean()
knn_baseline_log_loss
0.1255288892455634

The best logistic regression model completed previously had a log loss of 0.13, so the log loss of this vanilla kNN model is slightly better; however, it was also much slower, taking around a minute to complete the cross-validation on this machine.
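As a rough illustration of how that timing could be measured (a sketch, not how it was originally timed), the cross-validation call can be wrapped with time.perf_counter:

import time

start = time.perf_counter()
knn_cv_scores = cross_val_score(knn_baseline_model, X_train_scaled, y_train, scoring="neg_log_loss")
elapsed = time.perf_counter() - start

print("mean log loss:", -knn_cv_scores.mean())
print(f"cross-validation took {elapsed:.1f} seconds")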

3. Build Iterative Models to Find the Best kNN Model

knn_model2 = KNeighborsClassifier(n_neighbors = 25)

knn_model2_log_loss = -cross_val_score(knn_model2, X_train_scaled, y_train, scoring="neg_log_loss").mean()
knn_model2_log_loss
0.06425722742416393
knn_model3 = KNeighborsClassifier(n_neighbors = 50)

knn_model3_log_loss = -cross_val_score(knn_model3, X_train_scaled, y_train, scoring="neg_log_loss").mean()
knn_model3_log_loss
0.078613760394212
knn_model4 = KNeighborsClassifier(n_neighbors = 50, metric = 'manhattan')

knn_model4_log_loss = -cross_val_score(knn_model4, X_train_scaled, y_train, scoring="neg_log_loss").mean()
knn_model4_log_loss
0.07621145166565102
knn_model5 = KNeighborsClassifier(n_neighbors = 25, metric = 'manhattan')

knn_model5_log_loss = -cross_val_score(knn_model5, X_train_scaled, y_train, scoring="neg_log_loss").mean()
knn_model5_log_loss
0.06519186918054885

The second model (n_neighbors=25 with the default Euclidean metric) has the smallest log loss, about 0.0643, narrowly beating the fifth model (n_neighbors=25 with the Manhattan metric), and is therefore our best performing kNN model.
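The iterations above only cover a handful of hand-picked combinations. A more systematic alternative (not used in this lab) would be a grid search over n_neighbors and metric with scikit-learn's GridSearchCV; a minimal sketch:

from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_neighbors": [5, 10, 25, 50, 100],
    "metric": ["euclidean", "manhattan"],
}

# Same scoring and 5-fold cross-validation as the manual iterations above
knn_search = GridSearchCV(KNeighborsClassifier(), param_grid, scoring="neg_log_loss", cv=5)
knn_search.fit(X_train_scaled, y_train)

print(knn_search.best_params_)
print(-knn_search.best_score_)  # negate to report log loss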

4. Build a Baseline Decision Tree Model

Now, start investigating decision tree models.

from sklearn.tree import DecisionTreeClassifier

dtree_baseline_model = DecisionTreeClassifier(random_state=42)

dtree_baseline_log_loss = -cross_val_score(dtree_baseline_model, X_train, y_train, scoring="neg_log_loss").mean()
dtree_baseline_log_loss
0.7045390124149022

This is much worse than either the logistic regression (0.1303) or the best of the kNN model iterations (0.0643). We can probably assume that the model is badly overfitting, since we have not “pruned” it at all.
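One way to check the overfitting claim (a sketch added for illustration, not part of the original lab) is to compare the unpruned tree's log loss on the training data it was fit to against its cross-validated log loss:

from sklearn.metrics import log_loss

# Fit the unpruned tree on the full training set
dtree_baseline_model.fit(X_train, y_train)

# Log loss on the training data itself; an unpruned tree typically scores near 0 here
train_probs = dtree_baseline_model.predict_proba(X_train)
print("train log loss:", log_loss(y_train, train_probs))

# Compare against the cross-validated log loss computed above (~0.70)
print("cv log loss:   ", dtree_baseline_log_loss)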

5. Build Iterative Models to Find the Best Decision Tree Model

Build and evaluate several more decision tree models to find the best one.

dtree_model2 = DecisionTreeClassifier(random_state=42, criterion='entropy')

dtree_model2_log_loss = -cross_val_score(dtree_model2, X_train, y_train, scoring="neg_log_loss").mean()
dtree_model2_log_loss
0.6543002106787194
from sklearn.metrics import roc_curve, auc
import numpy as np
import matplotlib.pyplot as plt

# Identify the optimal tree depth for given data
max_depths = np.arange(1, 33)  # integer depths 1 through 32 (max_depth must be an int)
train_results = []
test_results = []
for max_depth in max_depths:
   dt = DecisionTreeClassifier(criterion='entropy', max_depth=max_depth, random_state=10)
   dt.fit(X_train, y_train)
   train_pred = dt.predict(X_train)
   false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train, train_pred)
   roc_auc = auc(false_positive_rate, true_positive_rate)
   # Add auc score to previous train results
   train_results.append(roc_auc)
   y_pred = dt.predict(X_test)
   false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
   roc_auc = auc(false_positive_rate, true_positive_rate)
   # Add auc score to previous test results
   test_results.append(roc_auc)

plt.figure(figsize=(12,6))
plt.plot(max_depths, train_results, 'b', label='Train AUC')
plt.plot(max_depths, test_results, 'r', label='Test AUC')
plt.ylabel('AUC score')
plt.xlabel('Tree depth')
plt.legend()
plt.show()

[Figure: train vs. test AUC as a function of tree depth]

Based on this plot, a max_depth of 8 is used in the next iteration.

dtree_model3 = DecisionTreeClassifier(random_state=42,
                                      criterion='entropy',
                                      max_depth = 8)

dtree_model3_log_loss = -cross_val_score(dtree_model3, X_train, y_train, scoring="neg_log_loss").mean()
dtree_model3_log_loss
0.1775668139810499
dtree_model4 = DecisionTreeClassifier(random_state=42,
                                      criterion='entropy',
                                      max_depth = 8,
                                      min_samples_leaf = 10)

dtree_model4_log_loss = -cross_val_score(dtree_model4, X_train, y_train, scoring="neg_log_loss").mean()
dtree_model4_log_loss
0.1407497348442239
dtree_model5 = DecisionTreeClassifier(random_state=42,
                                      criterion='entropy',
                                      max_depth = 8,
                                      min_samples_leaf = 100)

dtree_model5_log_loss = -cross_val_score(dtree_model5, X_train, y_train, scoring="neg_log_loss").mean()
dtree_model5_log_loss
0.10910816433690831
# Identify the optimal max_features value for the given data
max_features = list(range(1, X_train.shape[1]))
train_results = []
test_results = []
for max_feature in max_features:
   dt = DecisionTreeClassifier(criterion='entropy', max_features=max_feature, random_state=10)
   dt.fit(X_train, y_train)
   train_pred = dt.predict(X_train)
   false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train, train_pred)
   roc_auc = auc(false_positive_rate, true_positive_rate)
   train_results.append(roc_auc)
   y_pred = dt.predict(X_test)
   false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
   roc_auc = auc(false_positive_rate, true_positive_rate)
   test_results.append(roc_auc)

plt.figure(figsize=(12,6))
plt.plot(max_features, train_results, 'b', label='Train AUC')
plt.plot(max_features, test_results, 'r', label='Test AUC')
plt.ylabel('AUC score')
plt.xlabel('max features')
plt.legend()
plt.show()

[Figure: train vs. test AUC as a function of max_features]

Based on this plot, max_features=25 is used in the next iteration.

dtree_model6 = DecisionTreeClassifier(random_state=42,
                                      criterion='entropy',
                                      max_depth = 8,
                                      min_samples_leaf = 100,
                                      max_features = 25)

dtree_model6_log_loss = -cross_val_score(dtree_model6, X_train, y_train, scoring="neg_log_loss").mean()
dtree_model6_log_loss
0.10486292521511258

6. Choose and Evaluate an Overall Best Model

The kNN model with n_neighbors=25 and the default Euclidean metric was the best performing model overall, so final_model is instantiated with those hyperparameters.

# Instantiate the best model found above (kNN with n_neighbors=25)
final_model = KNeighborsClassifier(n_neighbors = 25)

# Fit the model on the full scaled training data
# (scaled, since the final model is distance-based kNN)
final_model.fit(X_train_scaled, y_train)

Evaluate the log loss, accuracy, precision, and recall.

from sklearn.metrics import accuracy_score, precision_score, recall_score, log_loss

preds = final_model.predict(X_test_scaled)
probs = final_model.predict_proba(X_test_scaled)

print("log loss: ", log_loss(y_test, probs))
print("accuracy: ", accuracy_score(y_test, preds))
print("precision:", precision_score(y_test, preds))
print("recall:   ", recall_score(y_test, preds))
log loss:  0.08075852922963977
accuracy:  0.9754830666943695
precision: 0.9033989266547406
recall:    0.735080058224163

This model has 97.5% accuracy, meaning that it assigns the correct label 97.5% of the time. This is definitely an improvement over a “dummy” model, which would have about 92% accuracy.
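The ~92% figure is what a majority-class baseline would score; a quick way to confirm it, as a sketch, is scikit-learn's DummyClassifier:

from sklearn.dummy import DummyClassifier

# Always predict the majority class (Cover_Type 0, ponderosa pine)
dummy_model = DummyClassifier(strategy="most_frequent")
dummy_model.fit(X_train, y_train)

# Accuracy equals the share of ponderosa pine in the test set (~92%)
print("dummy accuracy:", dummy_model.score(X_test, y_test))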

If the model labels a given forest area a 1, there is about a 90% chance that it really is class 1, compared to about a 67% chance with the logistic regression (precision).

The recall score is also improved over the logistic regression model: if a given cell of forest really is class 1, there is about a 73.5% chance that our model will label it correctly, versus about 48% for the logistic regression. That is better, but still doesn't instill a lot of confidence. If the business cared more about avoiding “false negatives” (labeling cottonwood/willow as ponderosa pine) than about avoiding “false positives” (labeling ponderosa pine as cottonwood/willow), we might want to adjust the decision threshold, as sketched below.
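As a sketch of what that adjustment could look like (the 0.3 cutoff below is purely illustrative and not tuned), the positive class could be assigned whenever its predicted probability clears a lower threshold than the default 0.5:

# Probability of class 1 (cottonwood/willow) from the fitted final model
probs_class1 = final_model.predict_proba(X_test_scaled)[:, 1]

# Lower the cutoff from the default 0.5 to catch more true cottonwood/willow cells
threshold = 0.3  # illustrative value, not tuned
threshold_preds = (probs_class1 >= threshold).astype(int)

# Recall should rise and precision should fall relative to the default threshold
print("precision:", precision_score(y_test, threshold_preds))
print("recall:   ", recall_score(y_test, threshold_preds))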

Conclusion

In this lab, I demonstrated the end-to-end machine learning process with multiple model algorithms, including hyperparameter tuning for each. Nonparametric models are more flexible than linear models: they can learn non-linear relationships between variables, which reduces underfitting but increases the risk of overfitting. There is also a tradeoff between speed and performance; here, the best-scoring models were also the slowest to evaluate.