Improving Projects — Part II: Time Series (Sales)

This post is part of my effort to improve my technical projects. Each post describes how I identified suspicious spots and updated them. You can find Part I: Regressors and Part III: NLP here. If you want to discuss anything, please feel free to leave a comment or message me on LinkedIn.

Brief Background

Updates (12/24/2020)

  • Explored trends for each state, category and store
  • Trained ARIMA, Decision Tree, Random Forest and LightGBM


Impacts of Holidays

Dashed vertical lines in orange are indicators of holidays

The above image shows the aggregated daily total sales of the Hobbies_1 category at a CA Walmart store (store id: CA_1). Holidays coincided with both local minima and local maxima. There is also seasonality, so I can’t tell whether the local extremes were part of the seasonal pattern, a real effect of holidays, or both. Zooming into a different part of the graph shows the same thing:

Zooming into the latest part

The above graph zooms into the rightmost part of the first graph, which shows the most recent sales records. It confirms that both local maxima and local minima in sales fell on holidays. When holidays were very close together, their total daily sales were close, too. Therefore, I dropped the original plan of also treating the days around holidays as holidays and relied more on seasonality analysis.

Similar and Dissimilar Trend

After plotting the trend of daily total sales at the state, category and store levels, I chose the category level as the basis for modeling, because the trends of the three categories were clearly separable from one another and had obvious seasonality:
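A minimal sketch of this kind of aggregation, assuming a long-format table with hypothetical columns `date`, `state_id`, `store_id`, `cat_id` and `sales` (the actual data layout isn't shown in this post, and the numbers below are made up):

```python
import pandas as pd

# Toy long-format sales table; column names and values are illustrative assumptions.
sales = pd.DataFrame({
    "date": pd.to_datetime(["2015-12-24", "2015-12-24", "2015-12-24",
                            "2015-12-26", "2015-12-26", "2015-12-26"]),
    "state_id": ["CA", "CA", "TX", "CA", "CA", "TX"],
    "store_id": ["CA_1", "CA_1", "TX_1", "CA_1", "CA_1", "TX_1"],
    "cat_id": ["FOODS", "HOBBIES", "FOODS", "FOODS", "HOBBIES", "FOODS"],
    "sales": [120, 15, 90, 100, 20, 80],
})

def daily_totals(df, level):
    """Sum daily sales at one aggregation level, one column per group."""
    return df.groupby(["date", level])["sales"].sum().unstack(level).fillna(0)

by_category = daily_totals(sales, "cat_id")
by_state = daily_totals(sales, "state_id")
by_store = daily_totals(sales, "store_id")
print(by_category)
```

Each resulting frame has one row per day and one column per group, so calling `.plot()` on it draws one line per state, store or category for a side-by-side comparison.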

The last three vertical lines are around Christmas holidays 2015.

In the above graph, dashed vertical lines represent holidays/special events around Thanksgiving and Christmas in 2015. There’s a significant decrease in sales during Christmas, especially in the Foods category.

Train Models

I focused on the Foods category.


The classic ARIMA came to mind, because extra variables, such as holidays and prices, turned out to be less useful than expected. The impacts of holidays were explained in the first section. As for prices, missing prices all corresponded to zero demand (the item was not sold). Since only sold items had prices recorded, it’s not appropriate to use prices as a predictor of demand.
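The price observation can be checked with a few lines of pandas. This is only a sketch on toy rows; the column names `sales` and `sell_price` are assumptions, and a NaN price stands in for "no price recorded":

```python
import numpy as np
import pandas as pd

# Toy item-day rows; a NaN price stands for "no price recorded" (assumed layout).
rows = pd.DataFrame({
    "item_id": ["A", "A", "B", "B"],
    "sales": [3, 0, 0, 5],
    "sell_price": [2.5, np.nan, np.nan, 4.0],
})

missing_price = rows["sell_price"].isna()
# True when every row without a price also has zero units sold.
missing_implies_zero = bool((rows.loc[missing_price, "sales"] == 0).all())
print(missing_implies_zero)
```

If this holds on the real data, price is only observed when demand is positive, which is why it leaks information about the target rather than predicting it.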

I first used the Dickey-Fuller test to check stationarity. Stationarity means the mean, variance and covariance of a series don’t change over time. If there’s a trend, the autoregressive part will suffer from omitted (time) variable bias; if the average changes over time, using a moving average around a constant overall mean no longer makes sense.

It turned out that only the time series of the Hobbies category was stationary (p-value < 0.05). All the others (Foods, Households) needed positive differencing orders. Which order? I created first-, second- and third-differenced sales variables and found that their p-values jumped up and down in a cycle. This led me to train ARIMA only on the Hobbies category without spending too much time on differencing orders.

Below are the ACF and PACF plots of the Hobbies category. The PACF plot was close to ideal: there’s a cutoff at 1 or 2 (setting p=1 or 2 in ARIMA(p, d, q) gives similar results). The ACF plot, however, was the problem: starting from the first lag, it neither decreased gradually nor had a sharp cutoff.

I used AIC to select the orders of ARIMA. As expected, the predictive results of ARIMA were not good:

Forecasting the ending part of the Hobbies series (ARIMA)

Does SARIMA (ARIMA with seasonality) improve prediction at all? Yes, indeed! Incorporating seasonality improved predictive power a lot, but the result was still not ideal. I decided to try more complicated models.

Forecasting the ending part of the Hobbies series (SARIMA)
  • Tree Algorithms

I love forests. They’re forgiving. You usually just need to throw a lot of features at them and they will split nodes at the appropriate points and generate good results for you. In this case, I created the first and second lags of sales and threw them in together with the other features.
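Creating the first and second lags can be sketched with `shift` in pandas. The frame, the column name `sales` and the helper `add_lags` are hypothetical:

```python
import pandas as pd

# Toy daily totals (illustrative values).
daily = pd.DataFrame(
    {"sales": [10, 12, 9, 14, 11, 13]},
    index=pd.date_range("2016-01-01", periods=6, freq="D"),
)

def add_lags(df, column="sales", lags=(1, 2)):
    """Add one column per lag and drop the leading rows that have no history."""
    out = df.copy()
    for k in lags:
        out[f"{column}_lag{k}"] = out[column].shift(k)
    return out.dropna()

features = add_lags(daily)
print(features)
```

Because `shift(k)` only looks backward, each row’s lag features use information available before that day, which keeps the tree models from peeking at the target.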

Not surprisingly, Random Forest outperformed Decision Tree (0.25 vs 0.27 mean absolute percentage error). That’s because Random Forest aggregates many Decision Trees trained on bootstrapped samples (bagging), so it is basically an upgraded version built upon a single tree. I used mean absolute percentage error (MAPE) because I wanted to know the difference between predicted and true values relative to the true values. The only reason I didn’t use the popular root-mean-square error (RMSE) was that daily sales fall within a range with seasonality, so there isn’t really an extremely large or small predicted value to worry about.
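For reference, here are the two metrics side by side, as a small NumPy sketch with made-up numbers:

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, returned as a fraction (0.25 means 25%)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))

def rmse(y_true, y_pred):
    """Root-mean-square error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

actual = [100, 200, 400]
predicted = [75, 250, 300]   # each prediction is off by 25% of the true value

print(mape(actual, predicted))
print(rmse(actual, predicted))
```

MAPE weighs every day’s relative error equally, while RMSE is dominated by the largest absolute miss. Note that MAPE is undefined when a true value is zero, so it only suits aggregated series like these daily totals.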

Finally, I tuned the hyperparameters of LightGBM (Light Gradient Boosting Machine). Its performance was very similar to Random Forest’s (both 0.25 MAPE). LightGBM is not my favorite model, because you have to tune way more hyperparameters. I do love the idea of growing trees leaf-wise and having each new tree focus on the errors of the previous stage, though!

Thanks for reading :)

