This post is part of a series on improving my technical projects. Each post describes how I identified suspicious spots in a project and fixed them. You can find Part I: Regressors and Part III: NLP here. If you want to discuss anything, feel free to leave a comment or message me on LinkedIn.
This project predicted daily sales for product categories at Walmart stores in California. The input data (06/19/2015–06/19/2016) covered item IDs, item sales, item prices, departments, product categories, store IDs, and holidays/special events.
- Investigated the impact of holidays on daily sales
- Explored trends for each state, category, and store
- Trained ARIMA, Decision Tree, Random Forest and Light GBM
Impacts of Holidays
Before editing this project, I believed people started stocking up on goods a few days before holidays/special events and might keep shopping for a few days afterwards, so I planned to change the feature engineering by treating the days around holidays as "special" too. However, this is NOT the case here.
The above image shows the aggregated sum of total sales for the Hobbies_1 category at a CA Walmart store (store ID: CA_1). You can see that holidays overlapped with both local minima and local maxima. Furthermore, there is seasonality, and I couldn't tell whether the local extrema were just part of the seasonality, the real impact of holidays, or both. This observation is confirmed by zooming into different parts of the graph:
The above graph zooms into the rightmost part of the first graph, showing the most recent sales records. It confirms that both local maxima and local minima fell on holidays. When holidays were close together, their total daily sales were close, too. Therefore, I dropped the original plan of treating the days around holidays as holidays and relied more on seasonality analysis.
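The check above can be sketched with a simple group-by comparison. This is a minimal illustration on toy data, not the original dataset: the column names (`date`, `sales`, `is_holiday`) and the numbers are assumptions. The point is that holidays can land on both dips and spikes, leaving the holiday-day average close to the ordinary-day average.

```python
import pandas as pd

# Toy daily sales series with a hypothetical boolean holiday flag;
# the schema ("date", "sales", "is_holiday") is an assumption, not the
# original dataset's.
df = pd.DataFrame({
    "date": pd.date_range("2015-06-19", periods=10, freq="D"),
    "sales": [100, 95, 110, 40, 105, 98, 160, 102, 97, 101],
    "is_holiday": [False, False, False, True, False, False, True,
                   False, False, False],
})

# Compare average sales on holidays vs. ordinary days. Here one holiday
# is a dip (40) and one a spike (160), so the means end up similar.
means = df.groupby("is_holiday")["sales"].mean()
print(means)
```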
Similar and Dissimilar Trends
With a massive dataset of one year of daily sale records for each item at each Walmart store, I obviously had to choose the level of aggregation for this prediction task. Not every level is workable: as I found half a year ago, predicting the daily sales of each individual item with basic machine learning algorithms simply doesn't work, both because of the sparsity of the data (many items were not sold on many days) and because of the lack of pattern in item-level sales. What could I do with this dataset, then?
By plotting the trends of daily total sales at the state, category, and store level, I chose the category level as the basis for my models: the trends of the three categories were clearly separable from one another and showed obvious seasonality:
In the above graph, dashed vertical lines mark holidays/special events around Thanksgiving and Christmas in 2015. You can see a significant drop in sales around Christmas, especially in the Foods category.
Half a year ago, the only model I could train on the original massive dataset was LightGBM, because of the long runtime. This time, working at the category level (Foods, Households, Hobbies), I could explore more options. Standard cross-validation is not appropriate for time-series data, because randomly shuffled folds would train on the future to predict the past. Instead, I split the data chronologically into 80%-20% training-test sets.
I focused on the Foods category.
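A chronological split is simple to express in code. The sketch below uses a toy array standing in for the daily category-level sales series; the key point is that the last 20% of days form the test set, so the model never trains on the future.

```python
import numpy as np

# Toy stand-in for one year of daily category-level sales.
sales = np.arange(100)          # 100 "days" of toy sales

# Chronological 80/20 split: everything before the cut is training data,
# everything after is the held-out future.
split = int(len(sales) * 0.8)   # index of the first test day
train, test = sales[:split], sales[split:]
print(len(train), len(test))    # 80 20
```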
The classic ARIMA came to mind, because the extra variables, such as holidays and prices, turned out not to be as useful as they were supposed to be. The impact of holidays was explained in the first section. As for prices, missing prices all corresponded to zero demand — the item wasn't sold. Since only sold items had prices recorded, it's not appropriate to use price as a predictor of demand.
I first used the Dickey-Fuller test to check stationarity. Stationarity means the mean, variance, and autocovariance of a series don't change over time. If there's a trend, the autoregressive part suffers from omitted (time) variable bias; if the average changes over time, a moving average built around one constant overall mean no longer makes sense.

It turned out that only the Hobbies series was stationary (p-value < 0.05). The others (Foods, Households) needed a positive differencing order. Which order? I created first-, second-, and third-differenced sales variables and found that their p-values jumped up and down in a cycle. This led me to train ARIMA only on the Hobbies category rather than spending too much time on differencing orders.

Below are the ACF and PACF plots for the Hobbies category. The PACF plot was pretty ideal: there's a cutoff at 1 or 2 (setting p=1 or p=2 in ARIMA(p, d, q) gives similar results). The ACF, however, was the problem: starting from the first lag, it neither decayed gradually nor showed a sharp cutoff.
I used AIC to select the orders of ARIMA. As guessed, ARIMA's predictions were not good:
Does SARIMA (ARIMA with seasonality) improve the prediction at all? Indeed it does! Incorporating seasonality improved the predictive power a lot, though the result was still not ideal. I decided to try more complicated models.
Tree Algorithms
I love forests. They're forgiving: you usually just need to throw a lot of features at them, and they will split nodes at the appropriate points and produce good results. In this case, I created first and second lags of sales and threw them in together with the other features.
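Building lag features is a one-liner with pandas `shift()`. A minimal sketch on a toy frame (the column names are hypothetical, not the project's schema); the first rows, which have no lag history, are dropped.

```python
import pandas as pd

# Toy daily sales frame; in the project, more features (calendar,
# category, etc.) would sit alongside the lags.
df = pd.DataFrame({"sales": [10.0, 12.0, 11.0, 13.0, 14.0]})

df["lag_1"] = df["sales"].shift(1)   # yesterday's sales
df["lag_2"] = df["sales"].shift(2)   # sales two days ago

# The first two rows have NaN lags and are dropped.
df = df.dropna().reset_index(drop=True)
print(df)
```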
Not surprisingly, Random Forest outperformed the single Decision Tree (0.25 vs. 0.27 mean absolute percentage error). This is expected because a Random Forest is a bagged ensemble of decision trees (each trained on a bootstrapped sample, with random feature subsets at each split) — essentially an upgraded version built upon a single tree. I used mean absolute percentage error (MAPE) because I wanted to know the difference between predicted and true values relative to the true values. The only reason I didn't use the popular root-mean-square error (RMSE) was that daily sales fall into a seasonal range — there weren't extremely large or small values to worry about.
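For reference, the metric used above is easy to write down directly. A minimal implementation (assuming no true values are zero, which holds for daily category-level sales):

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error: mean of |y - yhat| / |y|."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

# Toy check: predictions off by 10% and 20% -> MAPE of about 0.15.
print(mape([100, 200], [110, 160]))
```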
Finally, I tuned the hyperparameters of LightGBM (Light Gradient Boosting Machine). Its performance was very similar to Random Forest's (both 0.25 MAPE). LightGBM is not my favorite model, because you have to tune many more hyperparameters. I do love the ideas of growing trees leaf-wise and focusing on the errors of the previous stage, though!
Thanks for reading :)