
Month: January 2019

Can a Machine Learning Model Predict the SP500 by Looking at Candlesticks?

Candlestick chart patterns are one of the most widely known techniques in technical analysis circles that claim to “predict” market direction.

The development of this technique goes back to 18th-century Japan, and it is attributed to a Japanese rice trader.

It consists of finding patterns in charts built from candlestick figures that summarize prices over a period of time. There are many patterns, old and new, available on the internet. But the question is: do they really work as described?

Let’s use machine learning to find out.

There are multiple academic papers, books and websites testing these patterns in different instruments with statistical methods. Results are mixed.

If the patterns can really predict the market, training a machine learning model using them as features should make it able to predict the returns of a traded financial instrument.

Let’s dive into the experiment!

THIS IS NOT INVESTMENT ADVICE! THE PURPOSE OF THE ARTICLE IS TO BE AN EDUCATIONAL MACHINE LEARNING EXPERIMENT. YOU ARE SOLELY RESPONSIBLE FOR ANY DECISIONS (AND THEIR RESULTS) YOU MAKE BASED ON THIS INFORMATION.

Getting the Data

I found this very nice module called pandas-finance that makes getting information from Yahoo Finance very easy and outputs it directly to a pandas DataFrame.

Stock market price data is very noisy.

If a pattern has an edge, it doesn’t mean the price will move straight in the predicted direction 100% of the time. So let’s consider the returns both one and three days after we see the candlestick pattern.

We have price data since 1990, and we compute the returns over the Adjusted Close. The last 5 rows are removed, as we do not have return data for them.
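Below is a minimal sketch of this step. The original post uses pandas-finance; here I substitute yfinance, a similar Yahoo Finance downloader, so the ticker and column names (such as “Adj Close”) are assumptions for illustration.

```python
# Minimal sketch: download daily S&P 500 data and build the forward returns.
import yfinance as yf

prices = yf.download("^GSPC", start="1990-01-01", auto_adjust=False)

# Returns one and three days after each candle, computed over the Adjusted Close.
prices["ret_1d"] = prices["Adj Close"].shift(-1) / prices["Adj Close"] - 1
prices["ret_3d"] = prices["Adj Close"].shift(-3) / prices["Adj Close"] - 1

# The last rows have no future prices for the targets, so drop them.
prices = prices.dropna(subset=["ret_1d", "ret_3d"])
```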

Computing the Features

Here I use the wonderful Python wrapper for TA-Lib. It contains a lot of functions to compute technical analysis indicators, including functions to compute candlestick patterns.

All the candlestick functions begin with “CDL”, so we will find all the available functions, pass the price data, and then store the resulting signals in a DataFrame.
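A sketch of that loop, assuming the `prices` DataFrame from the previous step with Open/High/Low/Close columns:

```python
# Compute every TA-Lib candlestick pattern ("CDL*") as a feature column.
import pandas as pd
import talib

ohlc = [prices[c].astype(float).values for c in ["Open", "High", "Low", "Close"]]

patterns = pd.DataFrame(index=prices.index)
for name in (f for f in talib.get_functions() if f.startswith("CDL")):
    # Each pattern function takes the OHLC arrays and returns a signal per day.
    patterns[name] = getattr(talib, name)(*ohlc)
```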

For each day we generally get -100, 100, or 0 as values, indicating a bearish signal, a bullish signal, or the absence of the pattern. Some patterns can return -200/200 when they check for a “confirmation”. Looking at the code, this doesn’t seem to leak information from the future.

Besides this, there are helper functions for us to compute and plot the error. The metric chosen is Root-Mean-Square Error (RMSE).

Let’s Put a Random Forest to Work

If there is a pattern in the data, a Random Forest can very likely find it. As it’s a model that doesn’t need much tuning to work, here I just set it to create 10,000 trees, which is a lot, enough to curb the noise and find any existing patterns.

I used the implementation from scikit-learn.

The training set consists of daily price data from 1990 to 2008, and the validation set covers 2009 onward. It’s important to split the validation by time here, as we literally want to use this model on future, unseen years.

The specific date (2009) was chosen so that we have a long period of time in validation to account for noise. We are talking about a technique that is more than 100 years old, so a decade should be fine.

As baselines we can take two sources:

  • predict every return as zero (base_zero)
  • predict the average return of training for each day in the test period (base_avg)

The full code can be found here.

If the model works, it will have a smaller error than these two approaches.
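To make the comparison concrete, here is an illustrative sketch that reuses the names from the earlier snippets (shown for the one-day return; the three-day case is analogous):

```python
# Time split, Random Forest, and the two baselines, compared by RMSE.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

X, y = patterns, prices["ret_1d"]

# Train on 1990-2008, validate on 2009 onwards.
train = X.index < "2009-01-01"
X_train, y_train, X_valid, y_valid = X[train], y[train], X[~train], y[~train]

model = RandomForestRegressor(n_estimators=10000, n_jobs=-1, random_state=0)
model.fit(X_train, y_train)

def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

print("model    ", rmse(y_valid, model.predict(X_valid)))
print("base_zero", rmse(y_valid, np.zeros(len(y_valid))))
print("base_avg ", rmse(y_valid, np.full(len(y_valid), y_train.mean())))
```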

Results

As the errors from the three solutions are very close, here we look at the differences from the zero-prediction baseline. Negative values mean the approach does a better job than predicting everything as zero.

Returns One Day After

Returns Three Days After

The Random Forest was worse than both the average and the zero prediction. This means the model actually learned something from the data, but it didn’t generalize to the validation period.

Basically, the patterns that worked on the training period didn’t seem to keep working on future years.

But let’s give it another chance…

Opportunities Don’t Show Up Every Day

About 30% of our data consists of days without any patterns showing up. So an idea is: let’s only use this model when we have a pattern. This is fine, as in real life we would know if we had a pattern today or not, and could use the model accordingly.

One of the reasons to try this idea is that we can reduce some of the noise, and help the model identify the characteristics of the patterns better, leading to a predictive edge.

So here let’s select, both on training and validation, only the days when we see any pattern showing up.
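A small sketch of that filter, again reusing the names from the earlier snippets:

```python
# Keep only the days where at least one candlestick pattern fired.
has_pattern = (patterns != 0).any(axis=1)

X_pat = patterns[has_pattern]
y_pat = prices.loc[has_pattern, "ret_1d"]
# ...then repeat the same time split, model fit and baselines as before.
```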

One Day After

Three Days After

Still bad. The model can’t beat a simple average. Random Forest can be very powerful, but if the data doesn’t carry a signal, it simply will not work.

The Answer is…

NO. 

Can we claim candlestick patterns don’t work at all? Again, no. We can say it’s very likely they don’t work consistently on daily S&P 500 data.

Before we reach a more general conclusion there are open avenues to explore:

  • What about other instruments? Can it work on low-volume or low-liquidity stocks?
  • What happens in the lower (or higher) time periods: seconds to months?
  • Would individual models for specific patterns work?

Just remember that we need to be careful: if we test enough ideas, some will work just by chance!

There is More to a Model than Predictions

The model predictions were bad, but we can look further into it to get more research ideas. Let’s see the feature importance for each pattern.

First, the native feature importance from Scikit-learn.
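For reference, this is the kind of one-liner behind that ranking (names follow the earlier sketches):

```python
# Impurity-based importance of each candlestick pattern, highest first.
import pandas as pd

importances = pd.Series(model.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))
```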

Some patterns are more important. This could mean at least two lines of work:

  • The model focused too much on patterns that did well on training, but stopped working (or were just a data fluke). In this case, we should discard the top features.
  • There are strong patterns in the data, that we can explore individually and may give an edge. In this case, more attention should go to the top.

A very good way to peek inside a model is using Shapley values.
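A sketch of how such a plot can be produced with the shap package, assuming the fitted Random Forest from before:

```python
# SHAP values for the validation period, summarized per candlestick pattern.
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_valid)

shap.summary_plot(shap_values, X_valid)
```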

A richer visualization changes the importance ranking slightly. Here we can see how the points with specific feature values contribute to the model output.

A higher SHAP value means the feature value pushes the prediction toward a positive return.

The Spinning Top candle appears as the most important feature. According to Wikipedia, it can be a signal of a trend reversal.

Let’s see how it relates to the model outputs.

In this plot, we see a weak suggestion that, when the bearish version of this pattern appears (close lower than open), the return on the third day tends to be positive (price higher than today), and the contrary when the bullish version happens.

This may or may not be an interesting investigation line.

The most important part is that we saw how we can use machine learning to validate these patterns and suggest areas for future research.

Can Gradient Boosting Learn Simple Arithmetic?

During a technical meeting a few weeks ago, we had a discussion about feature interactions, and how far we have to go with them so that we can capture possible relationships with our targets.

Should we create (and select) arithmetic interactions between our features?

A few years ago I remember visiting a website that showed how different models approximated these simple operations. It went from linear models to a complex Random Forest.

One of the most powerful and widely deployed complex models we have today is Gradient Boosted Decision Trees. We have LightGBM, XGBoost, CatBoost, scikit-learn’s GBM, etc. Can this model find these interactions by itself?

As a rule of thumb I heard from a fellow Kaggle Grandmaster years ago: GBMs can approximate these interactions, but if they are very strong, we should explicitly add them as another column in our input matrix.

It’s time to upgrade this rule, do an experiment, and shed some light on how effective this class of models really is when dealing with arithmetic interactions.

What Do The Interactions Look Like?

Let’s use simple two-variable interactions for this experiment, and visually inspect how the model does. Here I want to try the four operations: addition, subtraction, multiplication and division.

This is what they look like on a -1 to 1 grid.

[Reference plots on the -1 to 1 grid: Addition, Subtraction, Multiplication, Division]

As expected, Addition and Subtraction are linear. Multiplication and Division have non-linear parts. These will serve as a reference.

Bring on XGBoost

Because of its popularity and a mechanism close to the original GBM formulation, I chose XGBoost. The results should not differ much with other implementations.

I generated a dataset with 10,000 points that covers the grid we plotted above. As we are studying the behavior of the model here, I didn’t care about a validation split. After all, if the model can’t capture the pattern in training, the test is doomed anyway.
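A sketch of the setup, using multiplication as the example target (hyperparameters are illustrative):

```python
# 10,000 (x1, x2) points on the [-1, 1] grid and a purely arithmetic target.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(42)
X = rng.uniform(-1, 1, size=(10_000, 2))
y = X[:, 0] * X[:, 1]            # swap for +, - or / to test the other operations

model = xgb.XGBRegressor(n_estimators=500, max_depth=6)
model.fit(X, y)

preds = model.predict(X)         # predictions to compare against the true surface
```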

Zero Noise

Let’s start with an “easy” scenario. Zero noise, Y is exactly the operation applied on X1 and X2. E.g.: X1 + X2 = Y.

Addition and Subtraction have some wiggles, because we are approximating the interaction, not perfectly modeling it, but it seems fine. I would say the model did a good job.

Now… what the heck is going on with Multiplication and Division?

The model was not able to capture the pattern. Predictions are constant. I tried the following, without success:

  • Changing the range
  • Multiplying Y by a big constant
  • Using XGBRegressor implementation vs xgb.train
  • Changing hyperparameters

We should find out why this happens. If anyone knows, please comment. I opened an issue on GitHub.

ANSWER: Jiaming Yuan and Vadim Khotilovich from the XGB team investigated the issue, and Vadim wrote: “it’s not a bug. But it’s a nice example of data with “perfect symmetry” with unstable balance. With such perfect dataset, when the algorithm is looking for a split, say in variable X1, the sums of residuals at each x1 location of it WRT X2 are always zero, thus it cannot find any split and is only able to approximate the total average. Any random disturbance, e.g., some noise or subsample=0.8 would help to kick it off the equilibrium and to start learning.”

Thank you for the answer. Now we can keep this in mind if the model is acting stubbornly.

Is There Life Without Noise?

It’s very, very hard for us to have a dataset without noise (whether we know about it or not). So let’s see how the model performs when we add a bit of random normal noise.
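Continuing the sketch above, the noise is simply added to the target before fitting; the standard deviations below mirror the two scenarios that follow:

```python
# Same grid as before, now with Gaussian noise on the target.
y_noisy = y + rng.normal(0.0, 0.5, size=len(y))   # try 1.0 for the second scenario

model = xgb.XGBRegressor(n_estimators=500, max_depth=6)
model.fit(X, y_noisy)
preds = model.predict(X)
```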

Gaussian Noise: Mean 0, StDev 0.5

With this amount of noise we see that the model can approximate the interactions, even division and multiplication. Back to our topic, although wiggly, we see that the model indeed can capture such patterns.

Gaussian Noise: Mean 0, StDev 1

Even more wiggly! But again, it can approximate the interactions.

Updated Rule of Thumb

So, what does all this mean for our day-to-day modeling? Can GBMs do simple arithmetic? The answer is YES.

The original comment still applies. If you want to really capture all the performance, and you know the interactions really apply, you should EXPLICITLY create these features. This way the model doesn’t have to struggle so much to find the splits and fight through the noise.

Now, if you really don’t want to create the interactions, here are some ideas that may help the model approximate them better:

  • Adding more examples
  • Tuning the hyperparameters
  • Using a model that is better suited to the known form of the interaction

How To Use Neural Networks to Forecast Multiple Steps of a Time Series

Time series are wonderful. I love them. They are everywhere. And we even get to brag about being able to predict the future! Maybe I should put it on my Tinder profile?

As a follow-up to the article on predicting multiple time series, I received lots of messages asking about predicting more than a single step.

A step can be any period of time: a day, a week, a minute, a year… So this is called multi-step forecasting.

I want to show you how to do it with neural networks.

This article is a simple tutorial on how to set up a basic code/model structure to get you up and running. Feel free to add improvement suggestions in the comments!

The data we will use is the same sales data, but now we will try to predict 3 weeks in advance. Each week can be considered a “step”.

I will use steps and weeks interchangeably in the article to get you used to the idea.

Keep in mind that this method can be used to predict more steps. I chose 3 only because it’s a tutorial. In a real case you should check with stakeholders what number of steps is most useful to them.

You will need scikit-learn, Keras, pandas and numpy. Get the dataset here.

Let’s load the data

We have data for 811 products and 52 weeks. To make the processing easier, let’s create a column with only the numeric part of the product code.
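A sketch of this step; the file name and the “Product_Code” column (values like “P12”) are assumptions about the dataset layout:

```python
# Load the weekly sales data and extract the numeric part of the product code.
import pandas as pd

data = pd.read_csv("Sales_Transactions_Dataset_Weekly.csv")
data["product_number"] = (
    data["Product_Code"].str.extract(r"(\d+)", expand=False).astype(int)
)
```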


We see that, although we have 811 products, the maximum product code is 819. We need to remember this when setting up the one-hot encoding for this information.

Preparing Inputs and Outputs

Here we are going to use the previous 7 weeks to predict the next 3. So let’s say we are in the last day of January and want to predict the first 3 weeks of February. We get the previous 7 weeks spanning January and part of December.

These numbers are, again, taken out of thin air to demonstrate the method. You could use 6, 12 weeks… whatever number makes sense. Oh, and it doesn’t need to be weeks. Steps can be any unit of time.

We need to transform our data. Every example in the training set (and, here, in our test set) needs to have as inputs the previous 7 weeks’ sales values, and as outputs, the next 3 weeks.

Think of every row having 800+ zeros, with a one in the position corresponding to the respective product code, plus 7 other values holding the previous steps’ sales.

For completeness, you will find these “previous step values” called lags, because they are a lagged version of the time series.
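One illustrative way to build those examples, assuming weekly columns named “W0” to “W51” and the numeric product code from before:

```python
# Turn each product's weekly series into supervised examples:
# inputs = one-hot product id + 7 lagged weeks, outputs = the next 3 weeks.
import numpy as np

LAGS, STEPS, MAX_CODE = 7, 3, 820            # product codes go up to 819

week_cols = [c for c in data.columns if c.startswith("W")]
X_rows, y_rows, first_target_week = [], [], []

for sales, code in zip(data[week_cols].values, data["product_number"]):
    one_hot = np.zeros(MAX_CODE)
    one_hot[code] = 1.0
    for start in range(len(sales) - LAGS - STEPS + 1):
        X_rows.append(np.concatenate([one_hot, sales[start:start + LAGS]]))
        y_rows.append(sales[start + LAGS:start + LAGS + STEPS])
        first_target_week.append(start + LAGS)   # kept for the time split below

X = np.array(X_rows)
y = np.array(y_rows)
first_target_week = np.array(first_target_week)
```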

But, Mario, what do I do in production, when I don’t have the next 3 weeks? You will have only your 7 previous steps and the model will generate a prediction for the next 3. The only difference is that you will need to wait 3 weeks to know how well your model predicted.

Here we use historical data to simulate this scenario.

Setting up Training and Test

As in any supervised learning task, we need a training and a test (validation) set. Here I am doing a simple split, taking about half of the weeks for training and the other half for testing, so that we have about the same number of data points in each.

For each product, we separate FUTURE steps for testing and PAST steps for training. If we did a simple random split we would mix past and future, and it would not be a reliable simulation of what we will find in production. After all, this is not a tutorial about creating time machines!
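Continuing the sketch above, `first_target_week` lets us send earlier windows to training and later windows to testing:

```python
# Time-based split: forecasts that end in the first half of the year are training,
# the rest are testing.
cutoff = 26                                   # roughly half of the 52 weeks

train_mask = first_target_week + STEPS <= cutoff
X_train, y_train = X[train_mask], y[train_mask]
X_test, y_test = X[~train_mask], y[~train_mask]
```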

Preprocessing and Modeling

Neural networks are very sensitive to the scale of the data, so I decided to use the RobustScaler from scikit-learn, as we have 800+ one-hot columns. It doesn’t mean it’s the best option, feel free to try more sophisticated strategies.
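For example (fitting the scaler on the training windows only, to keep the simulation honest):

```python
# Scale the inputs with RobustScaler, fitted only on the training data.
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
```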

Keras is a fantastic library for rapid development of neural networks. I chose a simple single-hidden-layer network, but this could be a recurrent net like an LSTM, or a convolutional network, which has been showing good results on time-series problems in papers. Or even both! Try different architectures and share the results with us.
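A minimal sketch of such a network; the layer size, number of epochs, and the tf.keras import are illustrative choices:

```python
# Single hidden layer, 3 outputs: one prediction per week ahead.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(64, activation="relu", input_shape=(X_train_s.shape[1],)),
    Dense(3),
])
model.compile(optimizer="adam", loss="mean_squared_error")
model.fit(X_train_s, y_train, epochs=20, batch_size=64,
          validation_data=(X_test_s, y_test))
```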

Now we just fit the model and get the results. The model is clearly underfitting (training error is higher than validation).


To have a better idea of how well the model is predicting, let’s calculate the root mean squared error over the log of the values. This is a smooth approximation of the mean absolute percentage error.
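One way to compute it, step by step (predictions are clipped at zero before the log, an extra assumption here):

```python
# RMSE over log1p of actuals vs. predictions, for each of the 3 steps.
import numpy as np

preds = model.predict(X_test_s)
for step in range(3):
    log_true = np.log1p(y_test[:, step])
    log_pred = np.log1p(np.clip(preds[:, step], 0, None))
    print(f"step {step + 1}: {np.sqrt(np.mean((log_true - log_pred) ** 2)):.3f}")
```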

We get 29-30% for each of the steps. This is actually good, given that 90% of our output values are less than 30.

Final Comments

Does it mean we beat the solution from the other article? We can’t really tell, as we have a different test set, but if anyone wants to check, comment below!

There are many possible improvements: feature engineering, scaling, neural network architecture, hyperparameter tuning, even ensembling. But this should give you an idea of how to model this type of problem.

This is not the only method for multiple step forecasting, and it will not be the best for all problems. Keep it as another tool in your set.

Get a notebook with the full code here.
