Multivariate Time Series Forecasting in Python
- supriyamalla

- Mar 6, 2023
- 3 min read
Updated: Mar 14, 2023
Are you tired of making business decisions based on guesses and assumptions? Do you wish you could have a crystal ball that could accurately predict future trends and values? Well, while we may not have a mystical crystal ball, we do have something just as powerful – time series forecasting.

Recently, I came across a problem where there was a need to forecast sales quantity for each store and product with each store/plant selling multiple products.
Here is a quick description of the sales data:
1. 40 weeks of data from July 2021 to April 2022
2. 11 stores/plants
3. 50 products per store
Objective: Forecast sales for individual store/plant and product for the next 12 weeks.
Challenge: Running an iterative for loop for 50*11=550 combinations for 40 weeks is a makeshift solution but not an efficient one. What if we want to run different ML models to train on each combination?
Solution: Is there an easy way to do this? Fortunately, yes, using the Scalecast library. It is easy to use and provides a unified interface to quickly prototype and test different models with minimal code.
Michael Keith’s posts on the Scalecast library (he is its creator) and its implementation were super helpful! A lot of the content I am sharing below is inspired by his posts.
In this blog post, we will further explore the Scalecast library and its capabilities for time series forecasting.
Here are the steps to get started:
1. Prepare data to eliminate any missing values
This is the first step in any time series forecasting project: load the data, check for missing values, and convert the data to a time series format. I also recommend removing spaces from column names (not a necessity, but it avoids conflicts later).
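A minimal sketch of this step, using made-up sample rows whose column names mirror the rest of the post (PlantID, Material Code, Date, Sales Qty); in practice the data would come from pd.read_csv or pd.read_excel:

```python
import pandas as pd

# Hypothetical raw sales rows standing in for the real dataset
data = pd.DataFrame({
    'PlantID': [101, 101, 102],
    'Material Code': ['A1', 'A2', 'A1'],
    'Date': ['2021-07-04', '2021-07-04', '2021-07-11'],
    'Sales Qty': [30.0, None, 12.0],
})

# Convert Date to a proper datetime type and check for missing values
data['Date'] = pd.to_datetime(data['Date'])
print(data.isna().sum())  # reveals one missing Sales Qty

# Remove spaces from column names to avoid conflicts later
data.columns = [c.replace(' ', '_') for c in data.columns]
```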
2. Create a Forecaster object which takes into account all variables
input_data = {}  # dictionary of Forecaster objects, one per store-product combination
for plant in data.PlantID.unique():
    for mat in data.Material_Code.unique():
        data_slice = data.loc[(data['PlantID'] == plant) & (data['Material_Code'] == mat)]
        # build the full weekly calendar; for missing weeks, assume 0
        load_dates = pd.date_range(data_slice.Date.min(), data_slice.Date.max(), freq='7D')
        data_load = pd.DataFrame({'Date': load_dates})
        data_load['Vol'] = data_load.merge(data_slice, how='left', on='Date')['Sales_Qty'].fillna(0).values
        f = Forecaster(y=data_load['Vol'], current_dates=data_load['Date'],
                       PlantID=plant, Material_Code=mat)  # add additional variables if you have them
        input_data[f"{plant}-{mat}"] = f
Explanation for each line of the code snippet:
- input_data = {}: creates an empty dictionary that will hold one forecasting object per store-product combination.
- for plant in data.PlantID.unique():: loops over every unique PlantID (store) in the data.
- for mat in data.Material_Code.unique():: within the plant loop, loops over every unique Material_Code (product), since every store sells multiple products.
- data_slice = data.loc[...]: subsets the original data to only the rows matching the current plant and product.
- load_dates / data_load: the weekly dates for this combination are collected and placed into a new DataFrame, data_load, with a Date column.
- data_load['Vol'] = ...: adds a Vol column by left-merging data_load with data_slice on Date and extracting the Sales_Qty values; weeks with no recorded sales are treated as 0.
- f = Forecaster(...): creates a Forecaster object with y set to the Vol column and current_dates set to the Date column; PlantID and Material_Code are attached as extra attributes so each combination can be identified later. A separate model is thus fit for each combination of PlantID and Material_Code.
- input_data[f"{plant}-{mat}"] = f: stores the object in input_data under a key that concatenates the current plant and product values.
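The "assume 0 for missing weeks" idea can be illustrated with a self-contained toy slice (the column names mirror the post; the 7-day frequency is an assumption about the data):

```python
import pandas as pd

# Toy slice for one plant-product pair: the week of 2021-07-18 is missing
data_slice = pd.DataFrame({
    'Date': pd.to_datetime(['2021-07-04', '2021-07-11', '2021-07-25']),
    'Sales_Qty': [5.0, 3.0, 7.0],
})

# Build the full weekly calendar, left-merge the observed sales onto it,
# and treat the unobserved week as zero demand
load_dates = pd.date_range(data_slice['Date'].min(), data_slice['Date'].max(), freq='7D')
data_load = pd.DataFrame({'Date': load_dates})
data_load['Vol'] = data_load.merge(data_slice, how='left', on='Date')['Sales_Qty'].fillna(0).values
print(data_load)  # 4 weekly rows; the missing week carries Vol = 0
```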
3. Download template validation grids, which contain predefined hyperparameter grids for the models you want to run
models = ('mlr','elasticnet','knn','rf','gbt','xgboost','mlp')
GridGenerator.get_example_grids()  # writes Grids.py to the working directory
GridGenerator.get_mv_grids()       # writes MVGrids.py for multivariate models
Explanation for each line of the code snippet:
models = ('mlr','elasticnet','knn','rf','gbt','xgboost','mlp'): This line defines a tuple models containing the names of several machine learning models: multiple linear regression, elastic net, k-nearest neighbors, random forest, gradient boosted trees, XGBoost, and a multilayer perceptron neural network. The two GridGenerator calls download validation grid files (predefined hyperparameter grids) into the working directory; f.tune() later reads the grid that matches each estimator's name.
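For context, the downloaded grid files define one hyperparameter dictionary per estimator name, which tuning later looks up. The grids below are illustrative stand-ins with made-up values, not the actual contents of Scalecast's Grids.py:

```python
# Illustrative only -- the real Grids.py / MVGrids.py contain fuller grids.
# Each dict is named after the estimator it tunes; during f.tune() the grid
# whose name matches the current estimator is pulled.
mlr = {
    'normalizer': ['scale', 'minmax', None],  # how to scale the regressors
}
knn = {
    'n_neighbors': list(range(2, 20)),
    'weights': ['uniform', 'distance'],
}
elasticnet = {
    'alpha': [0.1, 0.5, 1.0, 2.0],
    'l1_ratio': [0.0, 0.25, 0.5, 0.75, 1.0],
}
# Total candidate combinations for knn: len(n_neighbors) * len(weights)
n_knn_candidates = len(knn['n_neighbors']) * len(knn['weights'])
print(n_knn_candidates)
```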
4. Using the Forecaster objects, generate future dates and iterate over individual store-product combinations
for k, f in input_data.items():  # k, f iterate over the dictionary's keys and values
    f.generate_future_dates(12)  # predict the next 12 weeks
    f.set_test_length(10)
    f.set_validation_length(5)
    f.add_ar_terms(3)
    f.add_AR_terms((1, 10))
    if not f.adf_test():  # returns True if it thinks the series is stationary, False otherwise
        f.diff()
    f.add_seasonal_regressors('week', 'month', 'quarter', raw=False, sincos=True)
    f.add_seasonal_regressors('year')
    f.add_time_trend()
Explanation for each line of the code snippet:
The code iterates over each key-value pair in the input_data dictionary, where the keys are strings representing a unique combination of PlantID and Material_Code and the values are instances of the Forecaster class. For each Forecaster instance f, the following steps are performed:
f.generate_future_dates(12): This generates a sequence of future dates for the next 12 weeks that the forecaster will use to make predictions.
f.set_test_length(10): This sets the length of the test set to 10, which means that the last 10 weeks of data will be held out for evaluation purposes.
f.set_validation_length(5): This sets the validation set length to 5, meaning the 5 weeks immediately before the test set are used for hyperparameter tuning.
f.add_ar_terms(3): This adds autoregressive terms up to lag 3 to the model.
f.add_AR_terms((1,10)): This adds one seasonal autoregressive term with a spacing of 10 periods, i.e., the value from 10 weeks back.
if not f.adf_test(): f.diff(): This performs an Augmented Dickey-Fuller test to check if the time series is stationary. If the test indicates that the series is not stationary, the code applies differencing to transform the series into a stationary one.
f.add_seasonal_regressors('week','month','quarter',raw=False,sincos=True): This adds seasonal regressors based on the week, month, and quarter of the year, with the option to use sine and cosine functions to capture cyclical patterns.
f.add_seasonal_regressors('year'): This adds seasonal regressors based on the year.
f.add_time_trend(): This adds a linear time trend to the model.
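To see why sincos=True helps: a raw week-of-year integer makes week 52 and week 1 look maximally distant, while a sine/cosine pair puts them next to each other on the unit circle. A minimal sketch of the encoding idea (not Scalecast's internal code):

```python
import numpy as np
import pandas as pd

# Week-of-year for a few weekly dates
dates = pd.date_range('2021-07-05', periods=5, freq='7D')
week = dates.isocalendar().week.to_numpy().astype(float)

# Map the cyclical week-of-year (1..52) onto the unit circle
week_sin = np.sin(2 * np.pi * week / 52)
week_cos = np.cos(2 * np.pi * week / 52)

# Weeks 52 and 1 are 51 apart as integers, but close on the circle
w52 = np.array([np.sin(2 * np.pi * 52 / 52), np.cos(2 * np.pi * 52 / 52)])
w01 = np.array([np.sin(2 * np.pi * 1 / 52),  np.cos(2 * np.pi * 1 / 52)])
print(np.linalg.norm(w52 - w01))  # a small distance, unlike |52 - 1|
```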
5. Run pre-defined models over individual plant-product combinations
from tqdm.notebook import tqdm as log_progress  # progress bar for the loop

models = ('mlr', 'knn', 'svr', 'xgboost', 'gbt', 'elasticnet', 'mlp', 'rf')
for k, f in log_progress(input_data.items()):  # k, f iterate over input_data
    for m in models:
        f.set_estimator(m)
        f.tune()  # by default pulls the grid with the same name as the estimator (mlr pulls the mlr grid, etc.)
        f.auto_forecast()
    # combine models and run manually specified models of other varieties
    f.set_estimator('combo')
    f.manual_forecast(how='weighted', models=models, determine_best_by='ValidationMetricValue', call_me='weighted')
    f.manual_forecast(how='simple', models='top_5', determine_best_by='ValidationMetricValue', call_me='avg')
Explanation for each line of the code snippet:
This code is using the tqdm library to add a progress bar to the loop that iterates over each key-value pair in the input_data dictionary. Inside the loop, the code then performs the following steps:
It sets the estimator attribute of the Forecaster instance f to each model in the models tuple.
It calls the tune() method of f to perform hyperparameter tuning for the specified estimator using the default grid of hyperparameters with the same name as the estimator.
It calls the auto_forecast() method of f to generate forecasts for the test set.
It sets the estimator attribute of f to 'combo' and calls the manual_forecast() method twice to generate two sets of forecasts:
The first call uses the models argument to specify a list of models to combine and the how argument to specify a weighted combination. The determine_best_by argument specifies the metric to use for determining the weights, and call_me specifies a name for the resulting combination.
The second call uses the models argument to specify the top 5 models to average and the how argument to specify a simple average. The determine_best_by argument specifies the metric to use for selecting the top 5 models, and call_me specifies a name for the resulting average.
Overall, this code is using the Forecaster class to automate the process of time series forecasting, with the ability to select from several different models and hyperparameters and combine them in different ways. The tqdm library is used to provide a visual progress indicator for the loop.
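To build intuition for the two combo forecasts, here is a toy sketch of a simple average and an inverse-error weighted average over hypothetical model predictions. Scalecast computes its own weights internally; the inverse-error scheme below is just one illustrative choice:

```python
import numpy as np

# Hypothetical 3-week forecasts from three tuned models,
# plus each model's validation-metric value (lower = better)
preds = {
    'mlr':     np.array([10.0, 12.0, 11.0]),
    'knn':     np.array([ 9.0, 13.0, 10.0]),
    'xgboost': np.array([11.0, 11.0, 12.0]),
}
val_err = {'mlr': 2.0, 'knn': 4.0, 'xgboost': 1.0}

# Simple average: equal weight for every model (the 'avg' idea)
avg = np.mean(list(preds.values()), axis=0)

# Weighted average: better-validating models get more weight (the 'weighted' idea)
inv = {m: 1.0 / e for m, e in val_err.items()}
total = sum(inv.values())
weighted = sum(preds[m] * (inv[m] / total) for m in preds)
print(avg, weighted)
```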
6. Finally extract your data!
forecast_info = pd.DataFrame()   # final forecasted values
forecast_info1 = pd.DataFrame()  # model summaries
for k, f in input_data.items():
    df = f.export(dfs=['lvl_fcsts'], determine_best_by='LevelTestSetMAPE')
    df1 = f.export(dfs=['model_summaries'], determine_best_by='LevelTestSetMAPE')
    df['Name'] = k
    df['Plant'] = f.PlantID
    df['Material Code'] = f.Material_Code
    df1['Name'] = k
    df1['Plant'] = f.PlantID
    df1['Material Code'] = f.Material_Code
    forecast_info = pd.concat([forecast_info, df], ignore_index=True)
    forecast_info1 = pd.concat([forecast_info1, df1], ignore_index=True)

with pd.ExcelWriter('model_summaries.xlsx') as writer:  # context manager replaces the deprecated writer.save()
    forecast_info.to_excel(writer, sheet_name='Sheet1', index=False)
    forecast_info1.to_excel(writer, sheet_name='Sheet2', index=False)
Explanation for each line of the code snippet:
This code iterates through the input_data dictionary and, for each item, exports the level forecasts and model summaries using the export method of the Forecaster class. It appends the exported data to two pandas DataFrames, forecast_info and forecast_info1, and finally writes them to two sheets of an Excel workbook.
That’s pretty much it! The final file you extract will contain both the model summaries (with the best model’s stats) and the final forecasted values!
Find the complete notebook here.
And done!

