Category: Time Series

Can a Machine Learning Model Predict the SP500 by Looking at Candlesticks?

Candlestick chart patterns are one of the most widely known techniques that claim to “predict” the market direction inside technical analysis circles.

The development of this technique goes back to 18th-century Japan, and it’s attributed to a Japanese rice trader.

It consists of finding patterns in charts built from candlesticks, each one summarizing the price movement over a period of time. There are many patterns, old and new, available on the internet. But the question is: do they really work as described?

Let’s use machine learning to find out.

There are multiple academic papers, books and websites testing these patterns in different instruments with statistical methods. Results are mixed.

If the patterns can really predict the market, training a machine learning model using them as features should make it able to predict the returns of a traded financial instrument.

Let’s dive into the experiment!


Getting the Data

I found a very nice module called pandas-finance that makes it easy to get data from Yahoo Finance and returns it directly as a pandas DataFrame.

Stock market price data is very noisy.

If we have a pattern that has an edge, it doesn’t mean that 100% of the time the price will go straight into the predicted direction. So let’s consider both returns one and three days after we see the candlestick pattern.

We have price data since 1990. We compute the returns from the Adjusted Close. The last 5 rows are removed, as we do not have return data for them.
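A minimal sketch of the return computation described above, using a small made-up price series instead of the Yahoo Finance download. The column names (`adj_close`, `ret_1d`, `ret_3d`) are assumptions for illustration, not the names used in the original code.

```python
import pandas as pd

# Hypothetical adjusted-close prices; the real data comes from Yahoo Finance.
prices = pd.DataFrame(
    {"adj_close": [100.0, 101.0, 99.0, 102.0, 103.0, 101.0, 104.0, 105.0]},
    index=pd.date_range("1990-01-02", periods=8, freq="B"),
)

# Return from today's close to the close 1 and 3 trading days ahead.
for horizon in (1, 3):
    future = prices["adj_close"].shift(-horizon)
    prices[f"ret_{horizon}d"] = future / prices["adj_close"] - 1.0

# The most recent rows have no future price yet, so their returns are NaN.
labeled = prices.dropna(subset=["ret_1d", "ret_3d"])
```

Shifting by a negative number of rows aligns each day with a price from its own future, which is exactly what we want for the target and exactly what we must avoid for the features.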

Computing the Features

Here I use the wonderful Python wrapper for TA-Lib. It contains many functions to compute technical analysis indicators, including functions to compute candlestick patterns.

All the candlestick functions begin with “CDL”, so we will find all the available functions, pass the price data, and then store the resulting signals in a DataFrame.

For each day we generally get -100, 100 or 0 as values, indicating whether it’s a bearish signal, a bullish signal, or the pattern is not present. Some patterns can return -200/200 when they check for a “confirmation”. Looking at the code, this doesn’t seem to leak information from the future.

Besides this, there are helper functions for us to compute and plot the error. The metric chosen is Root-Mean-Square Error (RMSE).

Let’s Put a Random Forest to Work.

If there is a pattern in the data, a Random Forest can very likely find it. As it’s a model that doesn’t need much tuning to work, here I just set it to create 10,000 trees, which is a lot: enough to curb the noise and find any existing patterns.

The implementation is from scikit-learn.

The training set consists of daily price data from 1990 to 2008, and the validation set covers 2009 onward. It’s important to split the validation set by time here, as we literally want to use this model in future, unseen years.

The specific date (2009) was chosen so that we have a long validation period to account for noise. We are talking about a technique that is more than 100 years old, so a decade should be fine.
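As a rough illustration of a split by time, the sketch below slices a date-indexed DataFrame at the end of 2008. The frame, its column names and its random values are placeholders, not the real SP500 features.

```python
import numpy as np
import pandas as pd

# Hypothetical daily frame; the values are random placeholders.
dates = pd.date_range("2007-01-01", "2010-12-31", freq="B")
rng = np.random.default_rng(0)
data = pd.DataFrame(
    {
        "signal": rng.integers(-1, 2, len(dates)) * 100,
        "ret_1d": rng.normal(0, 0.01, len(dates)),
    },
    index=dates,
)

# Everything up to the end of 2008 is training; 2009 onward is validation.
train = data.loc[:"2008-12-31"]
valid = data.loc["2009-01-01":]
```

No row from the validation period can end up in training this way, which a random shuffle would not guarantee.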

As baselines we can take two sources:

  • predict every return as zero (base_zero)
  • predict the average return of training for each day in the test period (base_avg)

The full code can be found here.

If the model works, it will have a smaller error than these two approaches.


As the errors from the three solutions are very close, here we look at the differences from the zero prediction baseline. Negative values mean the approach does a better job than predicting everything as zero.
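The comparison can be sketched as follows. The return values and the `model_pred` array are invented for illustration; only the two baselines and the RMSE metric come from the text.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error between two arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical returns; the real ones come from the SP500 data.
y_train = np.array([0.010, -0.020, 0.005, 0.000, 0.015])
y_valid = np.array([0.020, -0.010, 0.000, 0.010])
model_pred = np.array([0.001, 0.002, -0.001, 0.000])  # stand-in for model output

base_zero = np.zeros_like(y_valid)                # always predict 0
base_avg = np.full_like(y_valid, y_train.mean())  # always predict the training mean

err_zero = rmse(y_valid, base_zero)
diff_avg = rmse(y_valid, base_avg) - err_zero     # negative = better than zero
diff_model = rmse(y_valid, model_pred) - err_zero
```

Note that `base_avg` uses the mean of the training returns only, so the baseline sees no more of the future than the model does.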

Returns One Day After

Returns Three Days After

The Random Forest was worse than both the average and the zero prediction. This suggests the model fit something in the training data, but whatever it learned didn’t generalize to the validation period.

Basically, the patterns that worked on the training period didn’t seem to keep working on future years.

But let’s give it another chance…

Opportunities Don’t Show Up Every Day

About 30% of our data consists of days without any patterns showing up. So an idea is: let’s only use this model when we have a pattern. This is fine, as in real life we would know if we had a pattern today or not, and could use the model accordingly.

One of the reasons to try this idea is that we can reduce some of the noise, and help the model identify the characteristics of the patterns better, leading to a predictive edge.

So here let’s select, both on training and validation, only the days when we see any pattern showing up.
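A small sketch of this filter, assuming the signals live in a DataFrame with one column per pattern. The three column names are real TA-Lib pattern functions, but the values and the `returns` series are made up.

```python
import pandas as pd

# Hypothetical signal table: one column per pattern, values in {-100, 0, 100}.
signals = pd.DataFrame({
    "CDLDOJI":        [0, 100, 0, 0],
    "CDLSPINNINGTOP": [0, 0, -100, 0],
    "CDLHAMMER":      [0, 100, 0, 0],
})
returns = pd.Series([0.01, -0.02, 0.005, 0.0], name="ret_1d")

# Keep only the days on which at least one pattern fired.
has_pattern = (signals != 0).any(axis=1)
X = signals[has_pattern]
y = returns[has_pattern]
```

The same mask is applied to features and targets so the rows stay aligned.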

One Day After

Three Days After

Still bad. The model can’t beat a simple average. The model can be very powerful, but if the data doesn’t have signal, it will simply not work.

The Answer is…


Can we claim candlestick patterns don’t work at all? Again, no. We can say it’s very likely they don’t work consistently on SP500, daily data.

Before we reach a more general conclusion there are open avenues to explore:

  • What about other instruments? Can it work on low volume or low liquidity stocks?
  • What happens in the lower (or higher) time periods: seconds to months?
  • Would individual models for specific patterns work?

Just remember that we need to be careful: if we test enough ideas, some will work just by chance!

There is More to a Model than Predictions

The model predictions were bad, but we can look further into it to get more research ideas. Let’s see the feature importance for each pattern.

First, the native feature importance from Scikit-learn.

Some patterns are more important. This could mean at least two lines of work:

  • The model focused too much on patterns that did well on training, but stopped working (or were just a data fluke). In this case, we should discard the top features.
  • There are strong patterns in the data, that we can explore individually and may give an edge. In this case, more attention should go to the top.

A very good way to peek inside a model is using Shapley values.

A richer visualization changes the importance ranking slightly. Here we can see how the points with specific feature values contribute to the model output.

Higher SHAP value means it contributes to a positive return.

The Spinning Top candle appears as the most important feature. According to Wikipedia, it can be a signal of a trend reversal.

Let’s see how it relates to the model outputs.

In this plot, we see a weak suggestion that, when the bearish version of this pattern appears (close less than open price), the returns on the third day may be higher than today. And the contrary when the bullish version happens.

This may or may not be an interesting investigation line.

The most important part is that we saw how we can use machine learning to validate these patterns and suggest areas for future research.

How To Use Neural Networks to Forecast Multiple Steps of a Time Series

Time series are wonderful. I love them. They are everywhere. And we even get to brag about being able to predict the future! Maybe I should put it on my Tinder profile?

As a follow-up to the article on predicting multiple time series, I have received lots of messages asking about prediction for more than a single step.

A step can be any period of time: a day, a week, a minute, a year… Predicting several of them ahead is called multi-step forecasting.

I want to show you how to do it with neural networks.

This article is a simple tutorial on how to set up a basic code/model structure to get you up and running. Feel free to add improvement suggestions in the comments!

The data we will use is the same sales data, but now we will try to predict 3 weeks in advance. Each week can be considered a “step”.

I will use steps and weeks interchangeably in the article to get you used to the idea.

Keep in mind that this method can be used to predict more steps. I chose 3 only because it’s a tutorial. In a real case you should check with stakeholders how many steps are most useful to them.

You will need scikit-learn, Keras, pandas and numpy. Get the dataset here.

Let’s load the data

We have data for 811 products and 52 weeks. To make the processing easier, let’s create a column with only the numeric part of the product code.


We see that, although we have 811 products, the maximum product code is 819. We need to remember this when setting up the one-hot encoding for this information.

Preparing Inputs and Outputs

Here we are going to use the previous 7 weeks to predict the next 3. So let’s say we are on the last day of January and want to predict the first 3 weeks of February. We get the previous 7 weeks, spanning January and part of December.

These numbers are, again, pulled out of thin air to demonstrate the method. You could use 6, 12 weeks… Whatever number makes sense. Oh, and it doesn’t need to be weeks. Steps can be any unit of time.

We need to transform our data. Every example in the training set (and, here, in our test set) needs to have the previous 7 weeks of sales values as inputs and the next 3 weeks as outputs.

Think of every row as having 800+ zeros, with a one in the position corresponding to the respective product code, plus 7 other values: the sales of the previous 7 steps.

For completeness, these “previous step values” are usually called lags, because they are lagged versions of the time series.
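One way to build these rows is sketched below with NumPy. The helper names `make_windows` and `one_hot`, the product code 5 and the toy sales series are all assumptions for illustration; the 820-wide one-hot block assumes codes are used directly as positions, with the maximum code being 819.

```python
import numpy as np

N_LAGS, N_STEPS = 7, 3

def make_windows(series, n_lags=N_LAGS, n_steps=N_STEPS):
    """Pair each run of n_lags past values with the n_steps values that follow."""
    X, Y = [], []
    for t in range(n_lags, len(series) - n_steps + 1):
        X.append(series[t - n_lags:t])   # previous 7 weeks
        Y.append(series[t:t + n_steps])  # next 3 weeks
    return np.array(X), np.array(Y)

def one_hot(code, size):
    """Vector of zeros with a single one at position `code`."""
    vec = np.zeros(size)
    vec[code] = 1.0
    return vec

# Hypothetical product with code 5 and 52 weeks of sales 0..51.
sales = np.arange(52, dtype=float)
lags, targets = make_windows(sales)

# Codes go up to 819, so the one-hot block needs 820 positions.
product = np.tile(one_hot(5, 820), (len(lags), 1))
inputs = np.hstack([product, lags])
```

With 52 weeks, 7 lags and 3 targets, each product yields 43 rows. Repeating this for every product and stacking the results gives the full input and output matrices.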

But, Mario, what do I do in production, when I don’t have the next 3 weeks? You will have only your 7 previous steps and the model will generate a prediction for the next 3. The only difference is that you will need to wait 3 weeks to know how well your model predicted.

Here we use historical data to simulate this scenario.

Setting up Training and Test

As in any supervised learning task, we need a training and a test (validation) set. Here I am doing a simple split, taking about half of the weeks for training and the other half for testing, so that we have about the same number of data points in each.

For each product, we separate FUTURE steps for testing and PAST steps for training. If we did a simple random split we would mix past and future, and it would not be a reliable simulation of what we will find in production. After all, this is not a tutorial about creating time machines!

Preprocessing and Modeling

Neural networks are very sensitive to the scale of the data, so I decided to use the RobustScaler from scikit-learn, as we have 800+ one-hot columns. This doesn’t mean it’s the best option; feel free to try more sophisticated strategies.

Keras is a fantastic library for rapid development of neural networks. I chose a simple network with a single hidden layer, but this could be a sequence model, like an LSTM, or a convolutional network, which has been showing good results in time series problems across papers. Or even both! Try different architectures and share the results with us.

Now we just fit the model and get the results. The model is clearly underfitting (training error is higher than validation).


To have a better idea of how well the model is predicting, let’s calculate the root mean squared error over the log of the values. This is a smooth approximation of the mean absolute percentage error.
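A sketch of this metric computed separately for each forecast step. Using `log1p` (the log of 1 plus the value) is an assumption made here so that zero sales don’t break the logarithm; the arrays are invented.

```python
import numpy as np

def rmse_log(y_true, y_pred):
    """RMSE computed on log(1 + value), one score per forecast step (column)."""
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return np.sqrt(np.mean(log_diff ** 2, axis=0))

# Hypothetical true and predicted sales for 3 products x 3 weeks ahead.
y_true = np.array([[10.0, 12.0, 9.0], [0.0, 1.0, 2.0], [30.0, 28.0, 31.0]])
y_pred = np.array([[11.0, 10.0, 9.0], [1.0, 1.0, 1.0], [27.0, 30.0, 29.0]])

per_step = rmse_log(y_true, y_pred)  # array with one error per week ahead
```

Averaging over rows (`axis=0`) keeps the three steps separate, so we can see whether the error grows as we predict further ahead.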

We get 29-30% for each of the steps. This is actually good, given that 90% of our output values are less than 30.

Final Comments

Does it mean we beat the solution from the other article? We can’t really tell, as we have a different test set, but if anyone wants to check, comment below!

There are many possible improvements: feature engineering, scaling, neural network architecture, hyperparameter tuning, even ensembling. But this should give you an idea of how to model this type of problem.

This is not the only method for multiple step forecasting, and it will not be the best for all problems. Keep it as another tool in your set.

Get a notebook with the full code here.

How To Predict Multiple Time Series With Scikit-Learn (With a Sales Forecasting Example)

You have a lot of time series and want to predict the next step (or steps). What should you do now? Train a model for each series? Is there a way to fit a model for all the series together? Which is better?

I have seen many data scientists think about approaching this problem by creating a single model for each product. Although this is one of the possible solutions, it’s not likely to be the best.

Here I will demonstrate how to train a single model to predict multiple time series at the same time. This technique usually creates powerful models that help teams win machine learning competitions and can be used in your project.

Machine Learning for Sales Forecasting Using Weather Data

Walmart is a company with thousands of stores in 27 countries. It is possible to find several articles on the technology used to manage the logistics and distribution of its products. This is the second time it has hosted a contest on Kaggle with the intention of finding interview candidates for data scientist jobs.

A major advantage of this type of competition is that we get access to data from large companies and can understand what problems they are trying to solve with probabilistic models.

The objective of the competition was to create a model that could predict the sales of some products in specific stores in the days before and after blizzards and storms. The example given in the task description was the sale of umbrellas, which intuitively should increase before a big storm.


Two files were made available to train the model: one of them contained information about the identification of the stores, products, and the nearest weather stations. The other contained weather data for each station.

In total, data were available on 111 products whose sales could be affected by weather conditions, 45 stores, and 20 weather stations. The goal was to predict how many units of each product would be sold in each store in the 3 days before, the 3 days after, and on the day of the weather event.

A weather event means that more than an inch of rain, or more than two inches of snow, was recorded that day.

The combination of stores and products gave us about 4.6 million examples for training, and about 500,000 for testing. Each example referred to a day, store and product.

Evaluation Metric

The metric used was the Root Mean Squared Logarithmic Error (RMSLE). It is basically the RMSE applied to the log(x + 1) transformation of the predictions and of the true values. This means that errors in predictions that should be close to 0 are punished more severely than errors of the same size in predictions with higher numbers. For example, predicting 5 items when it should be 0 carries a greater penalty than predicting 105 when it should be 100.
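The example in the paragraph above can be checked with a few lines of NumPy. The `rmsle` function is a minimal sketch of the metric as described, not the competition’s official scoring code.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Same absolute miss of 5 units, very different penalties.
low = rmsle([0], [5])       # log(6) - log(1), about 1.79
high = rmsle([100], [105])  # log(106) - log(101), about 0.048
```

The miss near zero costs more than 30 times as much as the same miss around 100, which is why getting the many low-sales rows right matters so much under this metric.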

Transforming the Data

Since I only had 10 days to work on this competition, I decided to check how far I could get with a model based only on the original variables and, who knows, make an ensemble.

One difference between this and other competitions is that, even though the data was organized, you were responsible for linking the weather variables to the product and store data. It makes perfect sense, because Walmart would not want a data scientist who does not know how to manipulate data.

For this I used pandas. This Python library for data manipulation is one of the most widely used, and strongly resembles the data structures available in R.

At first I treated all the variables as numerical and trained an XGBoost model with slight tuning, excluding the variables that coded special alerts and using 10% of the dataset to determine the number of trees. As expected, the result was poor: about 0.1643 on the leaderboard (LB).

Binarizing variables

After testing the first model, I coded the categorical variables with one-hot encoding. That is, for each level of the variable, a column was created with a 0 or 1 indicating whether that level was present in the example. Normally the number of columns should be the number of levels minus one, so that you do not have problems with collinearity. Since I planned to use models that were not sensitive to this problem, I did not bother deleting one of the columns.
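In pandas, both variants can be sketched with `get_dummies`. The tiny frame and the choice of `station_nbr` as the categorical column are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows; `station_nbr` stands in for any categorical column.
df = pd.DataFrame({"station_nbr": [1, 2, 3, 2], "tavg": [50.0, 61.0, 47.0, 58.0]})

# One indicator column per level (what the article describes doing).
full = pd.get_dummies(df, columns=["station_nbr"])

# Levels minus one: dropping a column avoids collinearity in linear models.
reduced = pd.get_dummies(df, columns=["station_nbr"], drop_first=True)
```

Tree-based models such as the ones used here are not hurt by the redundant column, so `full` is fine for them.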

After tuning using a subset with 10% of the data, and getting the RMSLE of 0.1216 in the validation, I sent the solution, and got the value of 0.1204 in the LB, a good improvement over the previous one.


A lot of weather data were missing, so I decided to test a simple imputation method: replace the NaNs by the mean of the column values. After tuning again, now with parameters for these new values, I obtained the RMSLE of 0.1140 in the 10% validation and 0.1095 in the LB.
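Mean imputation takes one line in pandas. The column names and values below are made up for the sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical weather columns with gaps.
weather = pd.DataFrame({
    "tavg": [50.0, np.nan, 70.0],
    "preciptotal": [0.0, 1.2, np.nan],
})

# Replace each NaN with the mean of its own column.
imputed = weather.fillna(weather.mean())
```

`weather.mean()` skips NaNs by default, so each column is filled with the average of its observed values.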

Time Variables

I did not explore the temporal dimension of the data very much. In one of the attempts I added the previous day’s meteorological information, which reduced the error to 0.1083.
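A possible way to add the previous day’s weather is a grouped shift, sketched below. The frame is invented, and the sketch assumes one row per station per consecutive day; with gaps in the dates, a merge on the date minus one day would be safer.

```python
import pandas as pd

# Hypothetical weather table: one row per station and day.
weather = pd.DataFrame({
    "station_nbr": [1, 1, 1, 2, 2],
    "date": pd.to_datetime(["2012-01-01", "2012-01-02", "2012-01-03",
                            "2012-01-01", "2012-01-02"]),
    "tavg": [40.0, 42.0, 45.0, 60.0, 58.0],
})

# Sort by station and date, then shift within each station
# so that values never cross from one station to another.
weather = weather.sort_values(["station_nbr", "date"])
weather["tavg_prev"] = weather.groupby("station_nbr")["tavg"].shift(1)
```

The first day of each station has no previous value and stays NaN.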

Data Subsets

One method that worked very well in my first competition, and that I always try, is to divide the training set into small subsets related to some variable and train a specific model for each one. In this case, I decided to create 19 models, one for each of the 19 weather stations present in the test set. The RMSLE on the LB was 0.1101, worse than with a model that treats the stations as variables in the entire dataset.

A serious problem with this approach is using the same model, with the same parameters, for different datasets. Knowing this, I decided to do a small tuning of the parameters for each dataset, which reduced the LB RMSLE to 0.1069.

Despite the small difference, it seems the division into individual models for each station captured some information that was not present in the model with all considered together.


Of the models I tested, two stood out: Gradient Boosted Trees (XGBoost) and Random Forests.

Random Forest

I had used Random Forest for regression only once, in a job, but never with a large amount of data. After tuning the parameters and applying the model to the imputed data, it resulted in an RMSLE of 0.1166 on the LB.


XGBoost presented the best error of an individual model. In addition to adjusting parameters, such as the depth of the trees, using a subset of the data, it was necessary to adjust the number of trees and the learning rate. Usually a small learning rate and a lot of trees is the safe recipe to improve performance, in exchange for more time to train the model.

Since XGBoost is sensitive to the seed of the RNG, I decided to make an ensemble of XGBs by just changing this value. This method had only marginally improved my score in other competitions, but here its impact was greater for the following reason: XGBoost lets you hold out part of the data and uses it to determine the number of trees that minimizes the error. I decided to use this feature and held out 5% of the data. In addition to varying the seed of the XGB itself, I varied the seed used to split the data, which made the models more diverse, and diversity is essential for a good ensemble.

The most I got to train was 10 XGBoosts in this scheme. Although it was a very stable model, the RMSLE of the ensemble was 0.1041, presenting a reduction compared to 0.1095 of the single model.

Final Solution

In the end, I put together all the solutions that I had submitted, and ended up getting an RMSLE of 0.1028, guaranteeing a position among the top 20%.

Possible Improvements

One day after the end of the competition I reviewed the variables of my XGBoost and found that the variables identifying the products (item_nbr) were not in binary format, and that the model considered them the most important. With the correct coding I believe it would have been possible to reduce the error further and achieve a better final position, although trees are very good at capturing patterns even with categoricals in this format.

How to Create a Simple Machine Learning Model to Predict Time Series

When we deal with time series prediction, a widely used model is linear regression. Although simple, it has proved quite useful in real applications.

A very simple way to create a model for this case is to use the previous values of the variable of interest itself to predict the current one. It is also possible to create models that predict these series using other attributes, which in some cases will improve their accuracy.

In this article I want to demonstrate the simplest way, using a linear regression with the historical values of the variable that we are trying to predict. The code is written so that the reader can follow what is happening, so some parts could be optimized for a production environment; leaving them as they are was a deliberate choice, in line with the educational goal of the article.

The full code (prime_english.py) and the data are available here.

Data Description

The data used correspond to the prime rate in Brazil. The prime rate is the bank interest rate for preferred customers, applied to clients with low risk of default in high value transactions. We will use the values of this rate in the last 6 months to predict the next month’s value.

We have monthly data from January 2005 to November 2014. They are originally released by the Central Bank of Brazil, but were obtained on the Quandl platform.

Important note: Do not use the information in this article as the basis for making any decisions, including financial, or investment decisions. This is an educational article and serves only to demonstrate the use of a machine learning tool for time series forecasting.

Models Used as Benchmarks

To compare the performance of linear regression in this problem, I will use two other valid methods for forecasting time series:

Last month value: the forecast for the next month is just the value of the variable in the last month.

Moving Average: the forecast for the next month is the average of the last 6 months.
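Both benchmarks are one-liners. The sketch below uses invented monthly values rather than the real prime rate series.

```python
import numpy as np

# Hypothetical monthly rates; the real series is the Brazilian prime rate.
history = np.array([10.0, 10.5, 11.0, 10.8, 11.2, 11.5])

last_value_forecast = history[-1]               # last month's value
moving_average_forecast = history[-6:].mean()   # mean of the last 6 months
```

Neither benchmark has anything to fit, which is what makes them a useful floor for the regression.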

Evaluation Metrics

We will use the Mean Absolute Percentage Error (MAPE). It is a metric widely used in the field of time series forecasting, and refers to the average percentage of errors in the forecasts, disregarding the direction (above or below the true value).

In addition to this error, we will also evaluate the Mean Absolute Error (MAE), which is the mean of the absolute values of the errors. This way we know how much we are deviating from the real values in the original units.

From a practical standpoint, in order to justify the use of linear regression instead of simpler methods, it should present an average error smaller than the error of the other options.

Python Modules

To run the script of this article you need the following Python modules: Numpy, Pandas and Scikit-learn. To reproduce the plot (optional), you need Matplotlib.

Defining the functions of evaluation metrics

For MAE we will use the implementation available in scikit-learn. Because it did not have a function to compute MAPE at the time of writing, we need to create it.
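A minimal version of such a function might look like the sketch below. The name `mape` and the choice to return a percentage (rather than a fraction) are assumptions; the function also assumes that no true value is zero, which holds for an interest rate series like this one.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent. Assumes no true value is zero."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

# Both forecasts miss by 10% of the true value, in opposite directions.
example = mape([100.0, 200.0], [110.0, 180.0])
```

Taking the absolute value before averaging is what makes the metric ignore the direction of the error.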

Loading and formatting the data

The data are in a CSV in a table format. After loading the CSV into memory using Pandas we need to organize the data to make the prediction. At the time of this writing, scikit-learn doesn’t get along well with Pandas, so in addition to preparing the data in the correct format to feed the model, we put them in numpy arrays.

The features matrix will have 6 columns, one for each month prior to the one we want to predict. The vector with the dependent variable will have the value to be predicted (next month).

For this we start with the seventh month available, which is number six in the loop because, in Python, the first element has index zero.
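The loop described above can be sketched like this, with a toy series of ten values standing in for the prime rate.

```python
import numpy as np

N_LAGS = 6
# Hypothetical series of 10 monthly values.
rates = np.arange(10, dtype=float)

X, y = [], []
# Start at index 6 (the seventh month), the first with 6 months of history.
for i in range(N_LAGS, len(rates)):
    X.append(rates[i - N_LAGS:i])  # the six previous months
    y.append(rates[i])             # the month to predict
X, y = np.array(X), np.array(y)
```

Each row of `X` holds the six months before the corresponding entry of `y`, oldest first.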

Training the model and making predictions

Financial data usually changes regime often, so let’s train a new model every month. The first month to be forecast will be the 31st available after data transformation. So we have at least 30 examples to train the first model. In theory, the more data to train, the better. For each month we will store the predicted value by the three methods and the true value.
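The monthly retraining loop can be sketched as below. To keep the example dependency-light it fits ordinary least squares with NumPy instead of scikit-learn’s LinearRegression, and it runs on a synthetic sine-shaped series; the helper name `fit_predict_linear` is hypothetical.

```python
import numpy as np

def fit_predict_linear(X_train, y_train, x_next):
    """Ordinary least squares with an intercept, via NumPy."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return float(coef[0] + x_next @ coef[1:])

# Hypothetical smooth series standing in for the prime rate.
rates = 10 + np.sin(np.arange(60) / 5.0)
N_LAGS, MIN_TRAIN = 6, 30

X = np.array([rates[i - N_LAGS:i] for i in range(N_LAGS, len(rates))])
y = rates[N_LAGS:]

preds_lr, preds_last, preds_ma, truth = [], [], [], []
for t in range(MIN_TRAIN, len(y)):
    # Refit on every example seen so far, then forecast month t.
    preds_lr.append(fit_predict_linear(X[:t], y[:t], X[t]))
    preds_last.append(X[t, -1])    # last month benchmark
    preds_ma.append(X[t].mean())   # 6-month moving average benchmark
    truth.append(y[t])
```

At every step the model only sees rows with an index smaller than `t`, so each forecast is made with the information that would have been available at that time.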

Evaluation of results

After the forecasts are completed, we turn the lists into numpy arrays again and compute the metrics.

Mean Absolute Percentage Error
MAPE Linear Regression 1.87042556874
MAPE Last Value Benchmark 2.76774390378
MAPE Moving Average Benchmark 7.90386089172

Mean Absolute Error
MAE Linear Regression 0.284087187881
MAE Last Value Benchmark 0.427831325301
MAE Moving Average Benchmark 1.19851405622

We see that the best model is linear regression, followed by using the value of the last month, and a rather poor result with the moving average.

Prime Rate Linear Regression

Suggestions to improve the model

We have created the simplest machine learning model possible, based only on the historical values of the series. Below are some suggestions that can be implemented to possibly reduce the error.


Boosting is a technique that trains several models on the same data, the difference being that each model is trained on the residual errors of the previous ones. Although there is a risk of overfitting if good validation is not done, this technique is quite successful in practice.

Creating more features

We used only the values of the last six months as independent variables. If we add more variables that are correlated with the dependent variable, we can probably improve the model. In addition, we can add features based on transformations, such as squaring the values, to capture non-linear relationships.

Use other models

Linear regression is not the only option in these cases. Models like neural networks, SVMs, and decision trees can perform better.