Improving Projects — Part I: Regressors (Price)

Qingchuan Lyu
5 min read · Dec 20, 2020

In this post, I discuss how I reworked the feature engineering and modeling parts of this project. This post is the first part of the “Redoing Projects” series. You can find Part II: Time Series (Sales) and Part III: Tweet Sentiment Extraction (NLP) here.

Brief Background

This project predicts house prices in Ames, Iowa from 79 features describing homes sold between 2006 and 2010. The available dataset has 1,460 observations.

Updates (12/20/2020)

  • Engineered features for tree algorithms and linear regression separately. Used Label Encoder for trees and One-Hot Encoder for linear regression with regularization
  • Checked whether missing values are missing (completely) at random
  • Built PCA pipelines within cross-validation and gradient descent frameworks

These updates decreased RMSE from 0.42 to 0.24 with Random Forest, and from 0.35 to 0.33 with Elastic Net. If you’re interested, my code is on GitHub: https://github.com/QingchuanLyu/Predicting-House-Prices

Update I: missing value imputation

My first mistake was replacing missing values with summary statistics before checking whether they were missing completely at random. There are three types of missing values:

  • Missing at random: whether a value is missing is correlated with other variables, but not with the variable itself. For example, the variable “total area of a house” may be missing only in a few neighborhoods. In that case, you might fill in the missing areas with summary statistics (mean/median/max) computed within those neighborhoods.
  • Missing completely at random (ideal): whether a value is missing has nothing to do with any variable, observed or unobserved. If only a few values are missing in this way, just drop them.
  • Missing not at random (worst): whether a value is missing depends on the variable itself. For example, the variable “parking lot area” may be recorded only when the parking area is small; in other words, none of the large parking lots have recorded areas. Selection bias is a common cause of this type of missingness.

How can you check which type your missing values fall into? The answer is data exploration! Admittedly, it’s almost impossible to be 100% sure that values are missing completely at random, but the other two types can be checked with relatively straightforward approaches.

  • To check “missing at random,” draw plots and build tables comparing the count of missing values across groups (e.g., by “Neighborhood”). If one group has a very high number of missing values, or missing values appear only in a few groups, consider filling them in with summary statistics computed within each group.
  • To check “missing not at random,” choose a highly correlated variable that has few missing values. In the parking lot example above, a good choice is “total area on the ground floor,” because houses with larger ground floors very likely have larger parking lots (you can confirm this with a plot and a correlation coefficient). Compare the ground floor areas of houses with missing parking lot areas to those of houses whose parking lot areas are recorded: this tells you what kind of parking lots have missing areas. Are they attached to larger houses, i.e., probably larger parking lots? If you do have this type of missing values, you might drop the variable if it isn’t important, or use a two-stage approach: first predict the parking lot areas of houses with large ground floors (using regression or KNN), then feed the estimated areas into the model that predicts house prices. (A sketch of both checks follows this list.)
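
As a concrete sketch of both checks, here is what the data exploration might look like in pandas, assuming Ames-style column names (Neighborhood, GarageArea for the parking/garage area, GrLivArea for the above-ground living area); the exact names in your copy of the dataset may differ.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

df = pd.read_csv("train.csv")  # assumed file name

# Check "missing at random": count missing garage areas per neighborhood.
# If missing values cluster in a few neighborhoods, group-level summary
# statistics are a reasonable fill-in.
missing_by_group = (
    df.assign(garage_missing=df["GarageArea"].isna())
      .groupby("Neighborhood")["garage_missing"]
      .sum()
      .sort_values(ascending=False)
)
print(missing_by_group.head(10))

# Check "missing not at random": compare a correlated, mostly complete
# variable between rows where GarageArea is missing and rows where it
# is observed.
print(df["GrLivArea"].corr(df["GarageArea"]))                  # strength of the relationship
print(df.loc[df["GarageArea"].isna(), "GrLivArea"].describe())
print(df.loc[df["GarageArea"].notna(), "GrLivArea"].describe())

# Two-stage option if the missingness looks "not at random": predict the
# missing areas from the correlated variable, then feed the filled column
# into the house price model.
known = df[df["GarageArea"].notna()]
missing = df[df["GarageArea"].isna()]
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(known[["GrLivArea"]], known["GarageArea"])
df.loc[df["GarageArea"].isna(), "GarageArea"] = knn.predict(missing[["GrLivArea"]])
```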

Update II: tree algorithms without one-hot encoder

My second mistake was feeding sparse data to tree algorithms (Random Forest and Decision Tree). Once categorical features are one-hot encoded, especially when a feature has many categories, the sparse data confuses tree algorithms and biases them toward zero, i.e., trees are more likely to split a node at zero. To see why, think about a simple and extreme case after one-hot encoding a categorical feature:
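
For instance, one-hot encoding a single four-category feature produces four mostly-zero columns A, B, C and D, one per category. A toy illustration with pandas (the values here are made up):

```python
import pandas as pd

# A hypothetical categorical feature with four levels.
cat = pd.DataFrame({"category": ["A", "B", "C", "D", "A", "C"]})

# After one-hot encoding, every row contains a single 1 and three 0s,
# so the matrix is mostly zeros.
print(pd.get_dummies(cat["category"]).astype(int))
#    A  B  C  D
# 0  1  0  0  0
# 1  0  1  0  0
# 2  0  0  1  0
# 3  0  0  0  1
# 4  1  0  0  0
# 5  0  0  1  0
```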

Without even computing Gini impurity scores, your tree will keep splitting at A = 0, B = 0, C = 0 and D = 0, because once the tree hits a 1 at a node it doesn’t have to grow any further: all the other columns are 0. This was the case in the previous version of this project. By removing the One-Hot Encoder step and using only label-encoded categorical features, the root mean squared error of Random Forest decreased from 0.19 to 0.06.
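
For the tree models, the categorical columns are instead mapped to dense integer codes. A minimal sketch with scikit-learn’s LabelEncoder (the column names and values below are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical columns; each becomes a single integer-coded
# column instead of a block of 0/1 indicator columns.
X = pd.DataFrame({"Neighborhood": ["OldTown", "NAmes", "OldTown"],
                  "GarageType": ["Attchd", "Detchd", "Attchd"]})
X_encoded = X.apply(lambda col: LabelEncoder().fit_transform(col))
print(X_encoded)
#    Neighborhood  GarageType
# 0             1           0
# 1             0           1
# 2             1           0
```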

Update III: using or not using PCA

My third mistake was using PCA without checking its impact on model performance. PCA is a common technique for reducing feature dimensionality and collinearity (collinear features are combined into one principal component). However, it can decrease prediction accuracy when features depend on each other non-linearly. PCA is flexible in that it lets users choose how much of the original features’ variance the principal components should retain. The downside of that flexibility is that important features with small variance might get ignored. Think about an extreme case:

You have three predictors x, w and z, and want to predict y. Assume w = y and that y has very little variance. If you apply PCA to the features and specify a proportion of variance to retain, you will probably end up losing w from your model. This is not ideal.
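
A quick numeric sketch of this failure mode on synthetic data (my own construction, not from the project): when w equals the low-variance target and the other predictors are high-variance noise, a 95% variance threshold keeps only the noise directions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n = 1000
y = rng.normal(scale=0.01, size=n)     # target with very little variance
w = y.copy()                           # perfect predictor of y, tiny variance
x = rng.normal(scale=5.0, size=n)      # high-variance noise
z = rng.normal(scale=5.0, size=n)      # high-variance noise

X = np.column_stack([x, w, z])
pca = PCA(n_components=0.95)           # keep components explaining 95% of variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape[1])              # 2: only the two noise directions survive
print(pca.components_.round(3))        # loadings on w (middle column) are ~0
```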

By building PCA pipelines within a cross-validation framework and testing the models on out-of-sample data, I found that both the tree algorithms and linear regression achieved slightly lower root mean squared error (on the order of 0.01) without PCA in this project. Here, PCA mainly helps avoid overfitting by combining collinear features into a single principal component (a new feature).
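
For reference, here is a minimal sketch of that comparison, assuming an encoded feature matrix X and a (log-transformed) price target y already exist; the 95% variance threshold and the plain ElasticNet settings are illustrative choices, not necessarily the ones used in the project. Wrapping PCA in a Pipeline ensures it is refit inside every cross-validation fold and never sees the held-out data.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import cross_val_score

with_pca = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95)),   # retain 95% of the variance
    ("model", ElasticNet()),
])
without_pca = Pipeline([
    ("scale", StandardScaler()),
    ("model", ElasticNet()),
])

for name, pipe in [("with PCA", with_pca), ("without PCA", without_pca)]:
    scores = cross_val_score(pipe, X, y, cv=5,
                             scoring="neg_root_mean_squared_error")
    print(name, round(-scores.mean(), 3))
```

Swapping ElasticNet() for a tree model such as RandomForestRegressor() gives the same with/without-PCA comparison for the tree algorithms.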

If you’re interested, my code is on GitHub: https://github.com/QingchuanLyu/Predicting-House-Prices

Thanks for reading:)

