When we deal with time series prediction, a widely used model is linear regression. Although simple, it has proved quite useful in real applications.

A very simple way to build such a model is to use past values of the variable of interest to predict the current one. Models can also draw on other attributes, which in some cases improves their accuracy.

In this article I want to demonstrate the simplest approach: a linear regression on the historical values of the variable we are trying to predict. The code is written so that the reader can follow what is happening, so some parts could be optimized for a production environment; I left them as they are to serve the educational goal of the article.

The full code (prime_english.py) and the data are available here.

## Data Description

The data used correspond to the prime rate in Brazil. The prime rate is the bank interest rate for preferred customers, applied to clients with low risk of default in high value transactions. We will use the values of this rate from the last six months to predict the value for the next month.

We have monthly data from January 2005 to November 2014. They are originally released by the Central Bank of Brazil, but were obtained on the Quandl platform.

**Important note: Do not use the information in this article as the basis for making any decisions, including financial or investment decisions. This is an educational article and serves only to demonstrate the use of a machine learning tool for time series forecasting.**

## Models Used as Benchmarks

To compare the performance of linear regression in this problem, I will use two other valid methods for forecasting time series:

**Last month value**: the forecast for the next month is just the value of the variable in the last month.

**Moving average**: the forecast for the next month is the average of the last 6 months.
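As a minimal sketch of these two benchmarks, assuming a short made-up series (the numbers are illustrative, not the prime rate data) and the hypothetical helper names `last_value_forecast` and `moving_average_forecast`:

```python
import numpy as np

def last_value_forecast(history):
    """Forecast the next point as the most recent observed value."""
    return history[-1]

def moving_average_forecast(history, window=6):
    """Forecast the next point as the mean of the last `window` values."""
    return np.mean(history[-window:])

# Illustrative values only (not the real prime rate series)
history = np.array([10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0])

naive = last_value_forecast(history)        # 16.0
average = moving_average_forecast(history)  # mean of 11..16 = 13.5
print(naive, average)
```

On a steadily rising series like this one, the moving average lags behind the last value, which hints at why the two benchmarks can behave quite differently.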

## Evaluation Metrics

We will use the Mean Absolute Percentage Error (MAPE), a metric widely used in time series forecasting. It is the average of the absolute errors expressed as a percentage of the true values, so it disregards the direction of the error (above or below the true value).

In addition to this error, we will also evaluate the Mean Absolute Error (MAE), which is the mean of the absolute values of the errors. This way we know how much we are deviating from the real values in the original units.
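Written out, both metrics look as follows. The notation is introduced here for illustration: $y_t$ is the true value, $\hat{y}_t$ the forecast, and $n$ the number of forecasts.

$$
\mathrm{MAPE} = \frac{100}{n} \sum_{t=1}^{n} \left| \frac{y_t - \hat{y}_t}{y_t} \right|,
\qquad
\mathrm{MAE} = \frac{1}{n} \sum_{t=1}^{n} \left| y_t - \hat{y}_t \right|
$$

Note that MAPE is undefined whenever a true value $y_t$ is zero, which is not a concern for an interest rate series that stays well above zero.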

From a practical standpoint, in order to justify the use of linear regression instead of simpler methods, it should present an average error smaller than the error of the other options.

## Python Modules

To run the script from this article you need the following Python modules: NumPy, pandas and scikit-learn. To reproduce the plot (optional), you also need Matplotlib.

## Defining the evaluation metric functions

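For MAE we will use the implementation available in scikit-learn. At the time of writing it had no function to compute MAPE, so we need to create one. (Newer releases include `mean_absolute_percentage_error`, which returns a fraction rather than a percentage.)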

```python
import numpy as np
def mape(y_pred, y_true):
    # Mean of the absolute errors relative to the true values, in percent
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
```
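To check the function by hand, here is a small sketch with made-up numbers (not taken from the prime rate data). The MAE is computed directly with NumPy here, which should match what scikit-learn's `mean_absolute_error` returns for the same inputs.

```python
import numpy as np

def mape(y_pred, y_true):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# Illustrative values only
y_true = np.array([10.0, 20.0, 40.0])
y_pred = np.array([11.0, 18.0, 40.0])

# Relative errors: 1/10 = 10%, 2/20 = 10%, 0/40 = 0%; their mean is about 6.67%
mape_value = mape(y_pred, y_true)

# Absolute errors: 1, 2, 0; their mean is 1.0
mae_value = np.mean(np.abs(y_true - y_pred))

print(mape_value, mae_value)
```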

## Loading and formatting the data

The data are in a CSV in a table format. After loading the CSV into memory using Pandas we need to organize the data to make the prediction. At the time of this writing, scikit-learn doesn’t get along well with Pandas, so in addition to preparing the data in the correct format to feed the model, we put them in numpy arrays.

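```python
import numpy as np
import pandas as pd

data = pd.read_csv('prime.csv', header=0, index_col=0).sort_index()

x_data = []
y_data = []
for d in range(6, data.shape[0]):
    x = data.iloc[d-6:d].values.ravel()  # the six previous months
    y = data.iloc[d].values[0]           # the month to predict
    x_data.append(x)
    y_data.append(y)

# Put the lists into numpy arrays before feeding the model
x_data = np.array(x_data)
y_data = np.array(y_data)
```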

The features matrix will have 6 columns, one for each month prior to the one we want to predict. The vector with the dependent variable will have the value to be predicted (next month).

For this we start with the seventh month available, which is number six in the loop because Python indexing starts at zero.
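To make the windowing concrete, here is a minimal sketch on a toy series, using a hypothetical helper `make_windows` that is not part of the original script:

```python
import numpy as np

def make_windows(series, n_lags=6):
    """Return (X, y): each row of X holds n_lags consecutive values,
    and y holds the value that immediately follows each row."""
    X, y = [], []
    for d in range(n_lags, len(series)):
        X.append(series[d - n_lags:d])
        y.append(series[d])
    return np.array(X), np.array(y)

# Toy series 1.0, 2.0, ..., 10.0 (illustrative only)
series = np.arange(1, 11, dtype=float)
X, y = make_windows(series)

print(X.shape)      # (4, 6): ten points yield four examples with six lags each
print(X[0], y[0])   # [1. 2. 3. 4. 5. 6.] 7.0
```

The first six points have no complete history behind them, so a series of length `n` yields `n - 6` training examples.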

## Training the model and making predictions

Financial data often change regime, so we train a new model every month, using all the examples available up to that point. The first month to be forecast is the 31st available after the data transformation, so the first model has 30 training examples. In theory, the more training data, the better. For each month we store the value predicted by each of the three methods along with the true value.

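```python
from sklearn.linear_model import LinearRegression

# Forecasts from each method, plus the true values
y_pred, y_pred_last, y_pred_ma, y_true = [], [], [], []

end = y_data.shape[0]
for i in range(30, end):
    # Train on every example before month i
    x_train = x_data[:i, :]
    y_train = y_data[:i]

    # Month i is the one we forecast
    x_test = x_data[i, :]
    y_test = y_data[i]

    # Older scikit-learn releases accepted normalize=True here;
    # the argument was removed in version 1.2
    model = LinearRegression()
    model.fit(x_train, y_train)

    # predict() expects a 2-D array, so reshape the single example
    y_pred.append(model.predict(x_test.reshape(1, -1))[0])
    y_pred_last.append(x_test[-1])   # last month benchmark
    y_pred_ma.append(x_test.mean())  # moving average benchmark
    y_true.append(y_test)
```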
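The same retrain-every-month idea can be tried without the CSV file. The sketch below assumes a synthetic random walk in place of the real series and wraps the loop in a hypothetical function `walk_forward`; it is meant to illustrate the procedure, not to reproduce the results of this article.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def walk_forward(X, y, min_train=30):
    """Refit a linear regression before each forecast, using only earlier rows."""
    preds = []
    for i in range(min_train, len(y)):
        model = LinearRegression()
        model.fit(X[:i], y[:i])
        preds.append(model.predict(X[i].reshape(1, -1))[0])
    return np.array(preds)

# Synthetic random walk standing in for the real series (an assumption)
rng = np.random.RandomState(0)
series = 10 + np.cumsum(rng.normal(scale=0.1, size=80))

# Six lagged values per row, as in the article
n_lags = 6
X = np.array([series[d - n_lags:d] for d in range(n_lags, len(series))])
y = series[n_lags:]

preds = walk_forward(X, y)
print(len(preds))  # 74 examples minus 30 reserved for the first fit = 44 forecasts
```

Because each model only sees rows that come before the month being forecast, no information from the future leaks into training.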

## Evaluation of results

After the forecasts are completed, we turn the lists into numpy arrays again and compute the metrics.

```python
from sklearn.metrics import mean_absolute_error

# Convert the lists to numpy arrays so mape() can do element-wise math
y_pred = np.array(y_pred)
y_pred_last = np.array(y_pred_last)
y_pred_ma = np.array(y_pred_ma)
y_true = np.array(y_true)

print('\nMean Absolute Percentage Error')
print('MAPE Linear Regression', mape(y_pred, y_true))
print('MAPE Last Value Benchmark', mape(y_pred_last, y_true))
print('MAPE Moving Average Benchmark', mape(y_pred_ma, y_true))

print('\nMean Absolute Error')
print('MAE Linear Regression', mean_absolute_error(y_true, y_pred))
print('MAE Last Value Benchmark', mean_absolute_error(y_true, y_pred_last))
print('MAE Moving Average Benchmark', mean_absolute_error(y_true, y_pred_ma))
```

```
Mean Absolute Percentage Error
MAPE Linear Regression 1.87042556874
MAPE Last Value Benchmark 2.76774390378
MAPE Moving Average Benchmark 7.90386089172

Mean Absolute Error
MAE Linear Regression 0.284087187881
MAE Last Value Benchmark 0.427831325301
MAE Moving Average Benchmark 1.19851405622
```

Linear regression performs best on both metrics, followed by the last-month benchmark; the moving average does considerably worse.

## Suggestions to improve the model

We have created the simplest machine learning model possible, based only on the historical values of the series. Below are some suggestions that can be implemented to possibly reduce the error.

### Boosting

Boosting is a technique that trains several models on the same data, with each model trained on the residual errors of the previous ones. Although there is a risk of overfitting if good validation is not done, this technique is quite successful in practice.
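The residual-fitting idea can be sketched in two stages. This is a hand-rolled illustration on synthetic data, not a full boosting algorithm: a linear model is fitted first, then a shallow decision tree is fitted to what the linear model got wrong. The data, the tree depth, and the name `boosted_predict` are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a linear trend plus a non-linear wiggle (an assumption)
rng = np.random.RandomState(0)
X = rng.uniform(-2, 2, size=(200, 1))
y = 3 * X[:, 0] + np.sin(3 * X[:, 0]) + rng.normal(scale=0.05, size=200)

# Stage 1: fit a linear model to the target
stage1 = LinearRegression().fit(X, y)
residuals = y - stage1.predict(X)

# Stage 2: fit a shallow tree to the residuals of stage 1
stage2 = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, residuals)

def boosted_predict(X_new):
    """Sum the predictions of both stages."""
    return stage1.predict(X_new) + stage2.predict(X_new)

# In-sample errors only; a real comparison needs held-out data
mse_stage1 = np.mean((y - stage1.predict(X)) ** 2)
mse_boosted = np.mean((y - boosted_predict(X)) ** 2)
print(mse_stage1, mse_boosted)
```

The training error cannot get worse after the second stage, which is exactly why out-of-sample validation matters: a lower training error alone says nothing about overfitting.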

### Creating more features

We used only the values of the last six months as independent variables. If we add more variables that are correlated with the dependent variable, we can probably improve the model. In addition, we can add features based on transformations, such as squaring the values, to capture non-linear relationships.
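As a minimal sketch of the squared-features idea, assuming a small made-up feature matrix and a hypothetical helper `add_squared_features`:

```python
import numpy as np

def add_squared_features(X):
    """Append the square of every column to the feature matrix."""
    return np.hstack([X, X ** 2])

# Illustrative values only
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
X_aug = add_squared_features(X)

print(X_aug.shape)  # (2, 6): three original columns plus three squared ones
print(X_aug[0])     # [1. 2. 3. 1. 4. 9.]
```

The augmented matrix can be passed to the same linear regression, which then fits a quadratic relationship in the original variables.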

### Using other models

Linear regression is not the only option in these cases. Models like neural networks, SVMs, and decision trees can perform better.
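Since scikit-learn estimators share the same `fit` and `predict` interface, trying another model is mostly a matter of swapping the estimator. The sketch below assumes a synthetic random walk instead of the real data and uses untuned, arbitrarily chosen settings, so it says nothing about which model would win on the prime rate series.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic random walk standing in for the real series (an assumption)
rng = np.random.RandomState(0)
series = 10 + np.cumsum(rng.normal(scale=0.1, size=120))

n_lags = 6
X = np.array([series[d - n_lags:d] for d in range(n_lags, len(series))])
y = series[n_lags:]

# Keep the time order: earlier rows for training, later rows for testing
split = 80
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]

# Candidate models with untuned settings (illustrative choices)
candidates = {
    'linear': LinearRegression(),
    'svr': SVR(),
    'tree': DecisionTreeRegressor(max_depth=3, random_state=0),
}

mae = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    mae[name] = np.mean(np.abs(y_test - model.predict(X_test)))

for name in sorted(mae):
    print(name, mae[name])
```

Any candidate should be judged against the same benchmarks and metrics used for linear regression above.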