Time series are wonderful. I love them. They are everywhere. And we even get to brag about being able to predict the future! Maybe I should put it on my Tinder profile?

As a follow-up to the article on predicting multiple time series, I received lots of messages asking about predicting more than a single step.

A step can be any period of time: a day, a week, a minute, a year… So this is called multi-step forecasting.

I want to show you how to do it with neural networks.

This article is a simple tutorial on how to set up a basic code/model structure to get you up and running. Feel free to add improvement suggestions in the comments!

The data we will use is the same sales data, but now we will try to predict 3 weeks in advance. Each week can be considered a “step”.

I will use steps and weeks interchangeably in the article to get you used to the idea.

Keep in mind that this method can be used to predict more steps. I chose 3 only because this is a tutorial. In a real project, check with your stakeholders how many steps are most useful to them.

You will need scikit-learn, Keras, pandas and numpy. Get the dataset here.

## Let’s load the data

We have data for 811 products and 52 weeks. To make the processing easier, let’s create a column with only the numeric part of the product code.

```python
import pandas as pd
import numpy as np

data = pd.read_csv("Sales_Transactions_Dataset_Weekly.csv")
# Keep only the product code and the weekly sales columns
data = data.filter(regex="Product|W").copy()
# Extract the numeric part of the product code (e.g. "P12" -> 12)
data['Product_Code_NUM'] = data['Product_Code'].str.extract(r"(\d+)").astype(int)
data.head()
```

We see that, although we have 811 products, the maximum product code is 819. We need to remember this when setting up the one-hot encoding for this information.

```python
print("Max Product_Code: {} - Unique Product_Code: {}".format(
    data['Product_Code_NUM'].max(), data['Product_Code_NUM'].nunique()))
# Max Product_Code: 819 - Unique Product_Code: 811
```

## Preparing Inputs and Outputs

Here we are going to use the previous 7 weeks to predict the next 3. So let’s say we are in the last day of January and want to predict the first 3 weeks of February. We get the previous 7 weeks spanning January and part of December.

These numbers are, again, taken out of thin air to demonstrate the method. You could use 6 weeks, 12 weeks… whatever number makes sense. Oh, and it doesn’t need to be weeks. Steps can be any unit of time.
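To make the sliding-window idea concrete, here is a toy sketch on a single series (the function name `make_windows` is mine, not from the article): each example takes 7 past values as input and the following 3 as targets.

```python
import numpy as np

def make_windows(series, n_lags=7, n_steps=3):
    """Slide over the series: each example uses n_lags past values
    as inputs and the following n_steps values as targets."""
    X, Y = [], []
    for i in range(n_lags, len(series) - n_steps + 1):
        X.append(series[i - n_lags:i])
        Y.append(series[i:i + n_steps])
    return np.array(X), np.array(Y)

sales = np.arange(12)    # a toy weekly series: 0, 1, ..., 11
X, Y = make_windows(sales)
print(X.shape, Y.shape)  # (3, 7) (3, 3)
print(X[0], Y[0])        # [0 1 2 3 4 5 6] [7 8 9]
```

The real code below does the same thing per product, with the one-hot product code appended to each input.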

We need to transform our data. Every example in the training set (and, here, in our test set) needs to have the previous 7 weeks of sales values as inputs, and the next 3 weeks as outputs.

Think of every row as having 800+ zeros, with a one in the position corresponding to the respective product code, plus 7 other values: the previous steps.

```python
from keras.utils import to_categorical

X_train = []
Y_train = []
X_test = []
Y_test = []

for ix, row in data.iterrows():
    for w in range(8, 50):
        product_code_num = row['Product_Code_NUM']
        # Previous 7 weeks as inputs, next 3 weeks as outputs
        x = row.iloc[w-7:w].values.astype(int)
        y = row.iloc[w:w+3].values.astype(int)

        # One-hot encode the product code (max code is 819, so 820 classes)
        product_code_num_ohe = to_categorical(product_code_num, num_classes=820)
        x = np.append(x, product_code_num_ohe)

        # Earlier weeks go to training, later weeks to testing
        if w < 30:
            X_train.append(x)
            Y_train.append(y)
        else:
            X_test.append(x)
            Y_test.append(y)

X_train = np.array(X_train)
Y_train = np.array(Y_train)
X_test = np.array(X_test)
Y_test = np.array(Y_test)

X_train.shape, X_test.shape, Y_train.shape, Y_test.shape
#((17842, 827), (16220, 827), (17842, 3), (16220, 3))
```

For completeness: you will find these “previous step values” called lags, because they form a lagged version of the time series.
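In pandas, lag features are typically built with `shift`; a quick sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"sales": [10, 12, 15, 14, 18]})
# Each lag column holds the value from k steps earlier
for k in range(1, 3):
    df[f"lag_{k}"] = df["sales"].shift(k)
print(df)
#    sales  lag_1  lag_2
# 0     10    NaN    NaN
# 1     12   10.0    NaN
# 2     15   12.0   10.0
# 3     14   15.0   12.0
# 4     18   14.0   15.0
```

The first rows have NaN lags because there is no earlier data, which is why the loop above only starts at week 8.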

But, Mario, what do I do in production, when I don’t have the next 3 weeks? You will have only your 7 previous steps, and the model will generate a prediction for the next 3. The only difference is that you will need to wait 3 weeks to know how well your model predicted.

Here we use historical data to simulate this scenario.
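In production you would assemble the same 827-dimensional input from the 7 most recent weeks plus the product's one-hot vector, and feed it to the fitted model. A sketch (the `build_input` helper is my name, not from the article; it mirrors the lags-then-one-hot layout of the training code):

```python
import numpy as np

def build_input(last_7_weeks, product_code, num_classes=820):
    """Concatenate the 7 most recent sales values with the product
    one-hot vector, matching the 7 + 820 = 827 training layout."""
    ohe = np.zeros(num_classes)
    ohe[product_code] = 1.0
    return np.append(np.asarray(last_7_weeks, dtype=float), ohe)

x_prod = build_input([3, 5, 2, 4, 6, 1, 0], product_code=12)
print(x_prod.shape)  # (827,)
# Then, with the fitted scaler and model from this article:
# mdl.predict(scaler.transform(x_prod.reshape(1, -1)))  -> next 3 weeks
```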

## Setting up Training and Test

As in any supervised learning task, we need a training and a test (validation) set. Here I am doing a simple split, taking roughly half of the weeks for training and the other half for testing, so that we have about the same number of data points in each.

For each product, we separate FUTURE steps for testing and PAST steps for training. If we did a simple random split we would mix past and future, and it would not be a reliable simulation of what we will find in production. After all, this is not a tutorial about creating time machines!
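The split above is hand-rolled, but scikit-learn also ships `TimeSeriesSplit`, which formalizes the same past-for-training, future-for-testing idea if you want cross-validation over several temporal folds:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

weeks = np.arange(10)  # pretend each element is one week of data
tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(weeks):
    # Training indices always come strictly before test indices
    print(train_idx, test_idx)
```

Every fold trains on a prefix of the timeline and validates on the weeks right after it, so past and future never mix.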

## Preprocessing and Modeling

```python
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```

Neural networks are very sensitive to the scale of the data, so I decided to use the RobustScaler from scikit-learn, as we have 800+ one-hot columns. That doesn’t mean it’s the best option; feel free to try more sophisticated strategies.
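One such strategy (my suggestion, not from the article) is to scale only the 7 lag columns and pass the one-hot block through untouched, using scikit-learn's `ColumnTransformer`. This assumes the lags-first column layout from the preparation code above:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import RobustScaler

# Scale columns 0-6 (the lags); leave the one-hot block as-is
ct = ColumnTransformer(
    [("lags", RobustScaler(), list(range(7)))],
    remainder="passthrough",
)

# Toy stand-in for the real matrix: 7 lag columns + 820 one-hot columns
X_demo = np.hstack([np.random.rand(5, 7) * 100, np.eye(5, 820)])
X_scaled = ct.fit_transform(X_demo)
print(X_scaled.shape)  # (5, 827)
```

This keeps the one-hot columns exactly 0/1 while still normalizing the sales values.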

Keras is a fantastic library for rapid development of neural networks. I chose a simple single-hidden-layer network, but this could be a recurrent net, like an LSTM, or a convolutional network, which has been showing good results on time-series problems across papers. Or even both! Try different architectures and share the results with us.

Now we just fit the model and look at the results. The model is clearly underfitting (the training error is higher than the validation error).

```python
from keras import Model
from keras.layers import Dense, Input, Dropout

inp = Input(shape=(827,))
hid1 = Dense(64, activation='relu')(inp)
out = Dense(3, activation='linear')(hid1)

mdl = Model(inputs=inp, outputs=out)
mdl.compile(loss='mse', optimizer='adam')

mdl.fit(X_train, Y_train, shuffle=True,
        validation_data=(X_test, Y_test),
        epochs=20, batch_size=32)
```

To have a better idea of how well the model is predicting, let’s calculate the mean absolute error over the log of the values. This is a smooth approximation of the mean absolute percentage error.

```python
p = mdl.predict(X_test)

# sqrt of the squared log difference is just the absolute log error
np.sqrt(((np.log1p(p) - np.log1p(Y_test))**2)).mean(axis=0)
# array([0.29417287, 0.29790673, 0.30176874])
```

We get an error of about 29-30% for each of the steps. This is actually good, given that 90% of our output values are less than 30.
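The reason log differences read like percentages is that, when prediction and truth are not too far apart, log(p) − log(y) ≈ (p − y) / y. A quick numerical check:

```python
import numpy as np

y = np.array([20.0, 50.0, 100.0])
p = y * 1.10                     # predictions 10% too high
log_err = np.abs(np.log1p(p) - np.log1p(y))
pct_err = np.abs(p - y) / y
print(log_err)  # roughly 0.09 each, close to the 10% relative error
print(pct_err)  # [0.1 0.1 0.1]
```

Unlike a raw percentage error, the log version also behaves smoothly when the true value is zero, thanks to `log1p`.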

## Final Comments

Does it mean we beat the solution from the other article? We can’t really tell, as we have a different test set, but if anyone wants to check, comment below!

There are many possible improvements: feature engineering, scaling, neural network architecture, hyperparameter tuning, even ensembling. But this should give you an idea of how to model this type of problem.

This is not the only method for multiple step forecasting, and it will not be the best for all problems. Keep it as another tool in your set.

## Deroy

Thanks for the article. How far back should we take the steps? Like, the more previous steps we take, the better the model will perform?

Also why do we need to convert the labels into a binary matrix using to_categorical()? Can I skip this?

For underfitting, will increasing the model size or using other models like LSTMs help?

## Mario Filho

1) The number of steps depends on the data. Some problems will require many steps, others are influenced by only 2 or 3. You should add them and see up to which point your validation error decreases.

2) Neural nets are extremely sensitive to the data representation you use. So if you keep labels as numbers in a column, it will likely capture a relationship in the order of the numbers. This will hurt the performance.

3) Yes. Usually it’s recommended to increase the number of layers and units per layer. In my experience, for tabular data, 3-5 layer networks work well. LSTMs can help.

## Ricardo

Very nice article Mario! Would you say that NNs work well with seasonal data? I graduated last week and had a problem to solve for a company as my final assignment where we had to predict the amount of products that would be manufactured, wish I came across your website before. Great talk on the Data Hackers podcast by the way.

## Mario Filho

Thanks!

Yes, NNs work well with seasonal data; it’s all about adding the right features for them to capture it. I will put this on my list of articles to write.
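One example of such a feature (my illustration, not from the discussion above) is a cyclical sine/cosine encoding of the week of the year, so that week 51 lands next to week 0 instead of at the opposite end of a scale:

```python
import numpy as np

weeks = np.arange(52)
# Cyclical encoding: the year wraps around, so week 51 maps next to week 0
sin_week = np.sin(2 * np.pi * weeks / 52)
cos_week = np.cos(2 * np.pi * weeks / 52)
# These two columns can be appended to the lag + one-hot inputs
print(sin_week[0], sin_week[51])  # close in value across the year boundary
```

With a plain integer week number, the model would see weeks 51 and 0 as maximally distant, even though they are adjacent in the seasonal cycle.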

I still prefer Tree Ensembles for this type of data but have seen some specific cases where NNs modeled it better.