Using Multiple Linear Regression for Housing Sale Price Analysis
Housing Sale Price Analysis in King County, WA
Business Understanding
Emerald City Realtors serves the King County community, providing prospective home sellers with guidance on how to improve the value of their home prior to listing.
- Stakeholder: Emerald City Realtors
- Business Problem: Emerald City Realtors needs to provide prospective home sellers with guidance on how to improve the value of their home prior to listing, including the predicted increase in value expected from improvements to particular features.
- Business Question: What features of their home can prospective home sellers change or improve to increase its value, and by what amount is each feature-specific improvement expected to increase it?
These recommendations will be valuable to Emerald City Realtors because they will help prospective home sellers confidently determine how they can improve the value of their home, and whether the investment is worth the cost.
Data Understanding
This project uses the King County House Sales dataset because Emerald City Realtors and its prospective home sellers are all based in King County. The dataset covers single-family home sales from 2014-2015. The dataset itself can be found in kc_house_data.csv in the data folder of this GitHub repository, along with descriptions of the features in column_names.md.
Further information about the features can be found on the King County Assessor website.
The original dataset includes sales data for 21,597 homes with 20 different features, which include:
- date: Date house was sold
- price: Sale price (prediction target)
- bedrooms: Number of bedrooms
- bathrooms: Number of bathrooms
- sqft_living: Square footage of living space in the home
- sqft_lot: Square footage of the lot
- floors: Number of floors (levels) in house
- waterfront: Whether the house is on a waterfront
- view: Quality of view from house
- condition: How good the overall condition of the house is. Related to maintenance of house
- grade: Overall grade of the house. Related to the construction and design of the house
- sqft_above: Square footage of house apart from basement
- sqft_basement: Square footage of the basement
- yr_built: Year when house was built
- yr_renovated: Year when house was renovated
- zipcode: ZIP Code used by the United States Postal Service
Data Processing
To assist with creating sound models, we completed some data cleaning including:
- Dropping features unrelated to our business question (ID, sale date, zipcode, latitude, longitude, lot size, and the lot size and living space of a home’s 15 closest neighbors)
- Dummy-encoding categorical variables (condition and grade)
- Creating binary variables for waterfront, view, and renovation status
# Let's remove the 'price' outliers in the top 5% of the observations
kcdf = kcdf.query('price < price.quantile(.95)')
# Create our dummy variables for the categorical features
cond_dummies = pd.get_dummies(kcdf['condition'], prefix = 'cond', drop_first = True)
grade_dummies = pd.get_dummies(kcdf['grade'], prefix = 'grade', drop_first = True)
# The category that's dropped is the reference level the coefficients are measured against
# (e.g. going from Grade 10 to Grade 5 shows up as a loss)
# No diff between a grade 10 and grade 13
# Drop the original column and concatenate our dummy variable columns with our original dataframe
kcdf = kcdf.drop(['condition', 'grade'], axis = 1)
kcdf_clean = pd.concat([kcdf, cond_dummies, grade_dummies], axis = 1)
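The snippet above covers the dummy encoding but not the binary variables mentioned in the cleaning list. A minimal sketch of that step follows, run on a hypothetical three-row frame. The raw encodings are assumptions, not taken from the notebook: waterfront stored as 'YES'/'NO', view stored as 'NONE' or a quality label, yr_renovated stored as 0 or missing for homes that were never renovated, and missing values treated as "no". The output names view_YES and reno_status match the column names used later in this README.

```python
import pandas as pd

# Hypothetical rows; the raw encodings here are assumptions, not the real data
toy = pd.DataFrame({
    "waterfront": ["NO", "YES", None],
    "view": ["NONE", "GOOD", None],
    "yr_renovated": [0.0, 1991.0, None],
})

# 1 if the home is on a waterfront, 0 otherwise (missing treated as "no")
toy["waterfront"] = (toy["waterfront"] == "YES").astype(int)

# 1 if the home has any recorded view other than 'NONE'
toy["view_YES"] = (toy["view"].notna() & (toy["view"] != "NONE")).astype(int)

# 1 if a renovation year is recorded, 0 if it is zero or missing
toy["reno_status"] = (toy["yr_renovated"].fillna(0) > 0).astype(int)

print(toy[["waterfront", "view_YES", "reno_status"]])
```

Treating missing values as "no" is only one possible choice; dropping those rows instead would change the sample the model is fit on.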
Exploratory Correlations
We use correlations and regression coefficients in this analysis to show the relationship between one or more features and sale price.
Regression is effective for this business problem because its coefficients let us determine which features affect sale price and to what degree.
Building models with multiple features allows us to make more accurate, data-driven predictions.
# Let's check the correlations in an easy-to-read table
# 0.7-0.9 highly correlated
# 0.5-0.7 moderately correlated
kcdf_corrs = kcdf_clean.corr()['price'].map(abs).sort_values(ascending = False)
# sqft_living = 0.62
# sqft_above = 0.53
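To illustrate why the ranking takes absolute values, here is a small sketch on a hypothetical five-row frame (the numbers are made up and are not from the King County data). A negatively correlated feature is ranked by its magnitude rather than pushed to the bottom, and the target's trivial correlation with itself is dropped.

```python
import pandas as pd

# Made-up values chosen only to show the ranking behaviour
toy = pd.DataFrame({
    "price": [200, 300, 400, 500, 600],
    "sqft_living": [1000, 1400, 2100, 2400, 3000],
    "yr_built": [1990, 1950, 2005, 1960, 1980],
})

# Signed correlations of every column with the target
signed = toy.corr()["price"]

# Rank by magnitude and drop the target's correlation with itself
ranked = signed.abs().drop("price").sort_values(ascending=False)

print(signed.round(2))
print(ranked.round(2))
```

In this toy frame yr_built is weakly and negatively correlated with price, so it appears last in the ranking with a positive magnitude.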
Modeling
import matplotlib.pyplot as plt
import statsmodels.api as sm
from IPython.display import display
from sklearn import metrics

def reg_qq_sced(y, X, add_constant=True, qq=True, sced=True):
    """
    Display the summary output of a linear regression model, with predictors X and target y.
    Also displays a QQ plot and residual plot by default. These can be toggled off.
    The function will add a constant to the predictors by default, and this can be toggled off.
    """
    # Run a linear regression and display the summary
    if add_constant:
        X_sm = sm.add_constant(X, has_constant='add')
    else:
        X_sm = X
    model = sm.OLS(y, X_sm).fit()
    display(model.summary())
    # Display a QQ plot
    if qq:
        fig_qq = sm.graphics.qqplot(model.resid, line='45', fit=True)
        fig_qq.suptitle('QQ plot for residual normality check')
    # Display a plot of predicted values vs. residuals
    preds = model.predict(X_sm)
    if sced:
        residuals = model.resid
        fig_resid, ax = plt.subplots(figsize=(10, 5))
        fig_resid.suptitle('Predicted vs. residual plot for homoscedasticity check')
        ax.scatter(preds, residuals, alpha=0.2)
        # Horizontal reference line at zero residual
        ax.plot(preds, [0] * len(preds))
        ax.set_xlabel("Predicted Value")
        ax.set_ylabel("Actual - Predicted Value")
    # Report fit metrics from the same OLS model that produced the summary
    print(f'Model adjusted R-squared: {model.rsquared_adj}')
    print(f'Model Mean Absolute Error: {metrics.mean_absolute_error(y, preds)}')
# Code reference: https://github.com/zshoorbajee/King-County-House-Sales-Flatiron-Project2/
# blob/main/King_County_analysis.ipynb
Regression Results
X2_preds = kcdf_clean[['sqft_living',
'sqft_basement',
'bedrooms',
'bathrooms',
'floors',
'yr_built',
'cond_Fair',
'cond_Good',
'cond_Very Good',
'grade_11 Excellent',
'grade_3 Poor',
'grade_4 Low',
'grade_5 Fair',
'grade_6 Low Average',
'grade_7 Average',
'grade_8 Good',
'grade_9 Better',
'waterfront',
'view_YES']]
reg_qq_sced(y_target, X2_preds)
# Model adjusted R-squared: 0.575158941672169
# Model Mean Absolute Error: 106248.25002570756
In our final model, which includes all features except cond_Poor, grade_12 Luxury, and reno_status, adjusted R-squared improved from 38.98 percent to 57.5 percent.
Further, the Mean Absolute Error fell from our baseline score of 131,878.02 dollars to 106,248.25 dollars, a reduction of about 25,630 dollars (roughly 19 percent).
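As a rough illustration of how the two metrics quoted here are defined, the sketch below computes adjusted R-squared and Mean Absolute Error by hand with NumPy. The function names and the four-row toy arrays are hypothetical and are not taken from the project notebook.

```python
import numpy as np

def adjusted_r_squared(y_true, y_pred, n_predictors):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y_true)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    r_squared = 1 - ss_res / ss_tot
    # Penalise R^2 for the number of predictors (constant not counted)
    return float(1 - (1 - r_squared) * (n - 1) / (n - n_predictors - 1))

def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Toy values, not real sale prices
toy_actual = [1.0, 2.0, 3.0, 4.0]
toy_predicted = [1.5, 2.0, 2.5, 4.0]

toy_adj_r2 = adjusted_r_squared(toy_actual, toy_predicted, n_predictors=1)
toy_mae = mean_absolute_error(toy_actual, toy_predicted)
print(toy_adj_r2, toy_mae)
```

Because Mean Absolute Error is in the same units as the target, the model's value of 106,248.25 can be read directly as an average miss in dollars.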
In our final model, all features have a statistically significant linear relationship with sale price.
- While holding all other variables constant, the addition of a bathroom is predicted to increase sale price by 29,020 dollars
- While holding all other variables constant, the addition of one floor level is predicted to increase sale price by 41,040 dollars
- While holding all other variables constant, improving a home’s condition from Average to Very Good is predicted to increase sale price by 38,810 dollars
- While holding all other variables constant, improving a home’s grade from Better to High Quality is predicted to increase sale price by 82,180 dollars
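The condition and grade figures above are read off dummy-variable coefficients, which are always measured relative to the category that drop_first removed. The sketch below shows that mechanism on a hypothetical, noise-free six-row dataset. The prices and the 15,000 and 40,000 bumps are invented for illustration, and NumPy's least-squares solver stands in for the statsmodels OLS call used in this project.

```python
import numpy as np
import pandas as pd

# Invented data: price is built exactly from sqft_living plus a condition bump
toy = pd.DataFrame({
    "sqft_living": [1000, 1500, 2000, 1200, 1800, 2200],
    "condition": ["Average", "Good", "Very Good", "Very Good", "Average", "Good"],
})
bump = {"Average": 0, "Good": 15_000, "Very Good": 40_000}
toy["price"] = 300_000 + 100 * toy["sqft_living"] + toy["condition"].map(bump)

# drop_first removes the alphabetically first level ('Average'),
# which becomes the reference category
dummies = pd.get_dummies(toy["condition"], prefix="cond", drop_first=True, dtype=float)
X = pd.concat([toy[["sqft_living"]], dummies], axis=1)
X.insert(0, "const", 1.0)

# Ordinary least squares via NumPy
coefs, *_ = np.linalg.lstsq(
    X.to_numpy(dtype=float), toy["price"].to_numpy(dtype=float), rcond=None
)
coef_by_name = dict(zip(X.columns, coefs))

for name, value in coef_by_name.items():
    print(f"{name}: {value:,.2f}")
```

Here the cond_Very Good coefficient recovers the 40,000 bump relative to Average, which is how a statement such as "from Average to Very Good" is obtained from the model summary.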
Recommendations
- Improve the grade (construction quality) of your home to at least High Quality. An improvement from Better to High Quality is predicted to increase the sale price by 82,180 dollars
- Adding a bathroom to your home is predicted to increase its sale price by 29,020 dollars
- Each additional square foot of living space is predicted to add 81.12 dollars to the sale price; a 600-square-foot addition would be predicted to increase the sale price by 48,672 dollars
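The square-footage arithmetic in the last recommendation can be reproduced with a few lines of Python. The coefficient of 81.12 dollars per square foot comes from this README. The helper names and the 40,000-dollar construction cost are hypothetical, included only to show how a seller might weigh the predicted gain against the cost of the work.

```python
def predicted_increase(coef_per_unit, units_added):
    """Predicted change in sale price for a change in one feature,
    holding all other variables constant."""
    return coef_per_unit * units_added

def net_gain(increase, cost):
    """Predicted price increase minus what the improvement costs."""
    return increase - cost

SQFT_LIVING_COEF = 81.12      # dollars per square foot, from the final model
HYPOTHETICAL_COST = 40_000    # made-up construction cost, for illustration only

addition_value = round(predicted_increase(SQFT_LIVING_COEF, 600), 2)
addition_net = round(net_gain(addition_value, HYPOTHETICAL_COST), 2)

print(f"Predicted increase: {addition_value:,.2f} dollars")
print(f"Net of hypothetical cost: {addition_net:,.2f} dollars")
```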
Limitations and Next Steps
Our model explains only 57.5 percent of the variation in sale price, so we ought to be cautious with our predictions and conclusions. Further, the residuals of our final model show a high level of heteroscedasticity, which violates one of the assumptions of linear regression, so our conclusions may be premature until the data are transformed further.
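The predicted-vs-residual plot produced by reg_qq_sced is a visual check for heteroscedasticity. A rough numeric companion is sketched below: it compares the spread of residuals for the upper half of predictions with the lower half. This is an informal check on synthetic data, not a formal statistical test and not part of the original analysis, and the function name is hypothetical.

```python
import numpy as np

def residual_spread_ratio(preds, residuals):
    """Std. dev. of residuals for the upper half of predictions divided by
    the std. dev. for the lower half. Values near 1 suggest constant spread."""
    preds = np.asarray(preds, dtype=float)
    residuals = np.asarray(residuals, dtype=float)
    order = np.argsort(preds)
    half = len(order) // 2
    low = residuals[order[:half]]
    high = residuals[order[half:]]
    return float(high.std(ddof=1) / low.std(ddof=1))

def fit_line(x, y):
    """Fit a straight line and return its predictions and residuals."""
    slope, intercept = np.polyfit(x, y, 1)
    preds = slope * x + intercept
    return preds, y - preds

# Synthetic data: one target with constant noise, one whose noise grows with x
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=2000)
noise = rng.normal(0, 1, size=2000)
y_homo = 3 * x + noise
y_hetero = 3 * x + noise * x

ratio_homo = residual_spread_ratio(*fit_line(x, y_homo))
ratio_hetero = residual_spread_ratio(*fit_line(x, y_hetero))
print(f"constant-noise ratio: {ratio_homo:.2f}")
print(f"growing-noise ratio:  {ratio_hetero:.2f}")
```

A ratio well above 1 corresponds to the fan shape in a residual plot, where errors grow with the predicted price.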
Next Steps:
- Collect more recent sales data for more accurate representation of the market
- Investigate influence of zipcode on sale price
Attribution: The Flatiron School