Linear Regression Analysis
Business Understanding
Develop a pricing algorithm to help set a target price for new LEGO sets that are released to market. The goal is to save the company some time and to help ensure consistency in pricing between new products and past products.
The main purpose of this algorithm is predictive, meaning that the model should be able to take in the attributes of a LEGO set that does not yet have a price and predict an appropriate list price.
The secondary purpose of this algorithm is inferential, meaning that the model should be able to tell us something about the relationship between the attributes of a LEGO set and its price.
Data Understanding
The dataset contains over 700 LEGO sets released in the past, including attributes of those sets as well as their prices. The data have already been split into train and test files.
import pandas as pd

train = pd.read_csv("data/lego_train.csv")
test = pd.read_csv("data/lego_test.csv")
X_train = train.drop("list_price", axis=1)
y_train = train["list_price"]
X_test = test.drop("list_price", axis=1)
y_test = test["list_price"]
X_train
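A quick look at the training features' data types and non-null counts (a minimal inspection sketch) helps show which columns can be passed directly to a linear regression later on:

# Check column dtypes and non-null counts
X_train.info()

# Summary statistics for the numeric columns
X_train.describe()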
Baseline Model
from sklearn.linear_model import LinearRegression
baseline_model = LinearRegression()
from sklearn.model_selection import cross_validate, ShuffleSplit
# Perform 3 separate train-test splits within the X_train and y_train sets
splitter = ShuffleSplit(n_splits=3, test_size=0.25, random_state=0)

# Use the single numeric feature most strongly correlated with the target
# as the baseline predictor
most_correlated_feature = X_train.select_dtypes("number").corrwith(y_train).abs().idxmax()

# Find the train and validation scores for each split
baseline_scores = cross_validate(
    estimator=baseline_model,
    X=X_train[[most_correlated_feature]],
    y=y_train,
    return_train_score=True,
    cv=splitter
)
print("Train score: ", baseline_scores["train_score"].mean())
# Train score: 0.7785726407224942
print("Validation score:", baseline_scores["test_score"].mean())
# Validation score: 0.7793473618106956
Build a Model with All Numeric Features
Numeric Feature Selection
Create a dataframe X_train_numeric that is a copy of X_train and contains only the numeric columns.
X_train_numeric = X_train.select_dtypes("number").copy()
# Drop the product ID column and the review-based columns before modeling
X_train_second_model = X_train_numeric.drop(["prod_id", "num_reviews", "star_rating"], axis=1).copy()
second_model = LinearRegression()
second_model_scores = cross_validate(
    estimator=second_model,
    X=X_train_second_model,
    y=y_train,
    return_train_score=True,
    cv=splitter
)
print("Current Model")
print("Train score: ", second_model_scores["train_score"].mean())
print("Validation score:", second_model_scores["test_score"].mean())
# Current Model
# Train score: 0.7884552982196166
# Validation score: 0.755820363666055
print("Baseline Model")
print("Train score: ", baseline_scores["train_score"].mean())
print("Validation score:", baseline_scores["test_score"].mean())
# Baseline Model
# Train score: 0.7785726407224942
# Validation score: 0.7793473618106956
Selecting Features Based on p-values
In the previous model, both piece_count and min_age had p-values less than 0.05. This model is built using just those two features.
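scikit-learn's LinearRegression does not report p-values directly; one way to obtain them for the previous model's features, sketched here under the assumption that statsmodels is available, is to refit the same data with statsmodels OLS:

import statsmodels.api as sm

# Refit the all-numeric model with statsmodels to inspect per-feature p-values
ols_results = sm.OLS(y_train, sm.add_constant(X_train_second_model)).fit()
ols_results.pvalues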
significant_features = ["piece_count", "min_age"]
third_model = LinearRegression()
X_train_third_model = X_train[significant_features]
third_model_scores = cross_validate(
    estimator=third_model,
    X=X_train_third_model,
    y=y_train,
    return_train_score=True,
    cv=splitter
)
print("Current Model")
print("Train score: ", third_model_scores["train_score"].mean())
print("Validation score:", third_model_scores["test_score"].mean())
# Train score: 0.7869252233899845
# Validation score: 0.7638761794341223
print("Second Model")
print("Train score: ", second_model_scores["train_score"].mean())
print("Validation score:", second_model_scores["test_score"].mean())
# Train score: 0.7884552982196166
# Validation score: 0.755820363666055
print("Baseline Model")
print("Train score: ", baseline_scores["train_score"].mean())
print("Validation score:", baseline_scores["test_score"].mean())
# Train score: 0.7785726407224942
# Validation score: 0.7793473618106956
Selecting Features with sklearn.feature_selection
This model uses RFECV (see the scikit-learn documentation). “RFE” stands for “recursive feature elimination”: the model is fit, the feature with the lowest “importance” (here, the smallest scaled coefficient magnitude) is removed, and the model is fit again, repeating until the minimum number of features is reached. “CV” stands for “cross-validation”: the cross-validated score at each step is used to choose how many features to keep, and we can use the same splitter we have been using so far.
from sklearn.feature_selection import RFECV
from sklearn.preprocessing import StandardScaler
# Importances are based on coefficient magnitude, so
# we need to scale the data to normalize the coefficients
X_train_for_RFECV = StandardScaler().fit_transform(X_train_second_model)
model_for_RFECV = LinearRegression()
# Instantiate and fit the selector
selector = RFECV(model_for_RFECV, cv=splitter)
selector.fit(X_train_for_RFECV, y_train)
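The fitted selector records which columns were kept; a short sketch for reading off the selected feature names (selector.support_ is a boolean mask aligned with the columns of X_train_second_model):

# Map the boolean support mask back to column names
X_train_second_model.columns[selector.support_]

# ranking_ assigns 1 to selected features and higher ranks to eliminated ones
selector.ranking_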
Build and Evaluate a Final Predictive Model
Create a list best_features which contains the names of the best model features, based on the findings of the previous step:
best_features = ["piece_count", "max_age", "difficulty_level"]
X_train_final = X_train[best_features]
X_test_final = X_test[best_features]
final_model = LinearRegression()
# Fit the model on X_train_final and y_train
final_model.fit(X_train_final, y_train)
# Score the model on X_test_final and y_test
# (use the built-in .score method)
final_model.score(X_test_final, y_test)
# 0.6542913715071492
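For the secondary, inferential purpose, the fitted coefficients can be paired with the feature names to see how each attribute relates to list price; a minimal sketch:

# Pair each feature with its fitted coefficient, plus the intercept
pd.Series(final_model.coef_, index=best_features)
final_model.intercept_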
from sklearn.metrics import mean_squared_error

# Root mean squared error on the test set (squared=False), in list-price units
mean_squared_error(y_test, final_model.predict(X_test_final), squared=False)
# 47.403687974333
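Finally, to illustrate the primary, predictive purpose, the fitted model can be asked to price a new set; a sketch using made-up attribute values (not taken from the dataset):

# Hypothetical new set: 500 pieces, maximum recommended age 14, difficulty level 2
new_set = pd.DataFrame([[500, 14, 2]], columns=best_features)
final_model.predict(new_set)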
Attribution: The Flatiron School