# Ensemble Methods for E-Mini S&P 500 Futures Long/Short Strategy: Harnessing the Potential of Multiple Methods

### Motivation

Ensemble methods are learning algorithms that construct a set of classifiers and then classify new data points by taking a (weighted) vote of their predictions. The intention is that the ensemble achieves better prediction accuracy than its individual classifiers. Most machine learning research papers, however, focus on evaluating the performance of single algorithms.

In recent years, academic research into applying ensemble methods to financial markets has yielded promising results. Yang, Rao, Hong, and Ding (2016) studied the performance of ensemble methods on Chinese stocks and showed that their ensemble model (combining support vector machine (SVM), random forest (RF), and AdaBoost (AB)) obtained higher prediction accuracy than SVM alone, but not necessarily higher annualized returns. Van den Poel, Chesterman, Koppen, and Ballings (2016) used random forest as an ensemble method, based on technical indicators, to beat a buy-and-hold strategy.

This post explores the power of ensemble methods to predict price movement trends on CME E-Mini S&P 500 Futures, one of the most popular futures products in the world. First, 20 different indicators were chosen for predicting forward returns, and PCA was applied to reduce the dimensionality of the features. Second, SVM, Neural Network, Logistic Regression, Random Forest, and K-nearest Neighbors models were fitted on training data to forecast the direction of returns. Third, ensemble methods were constructed by combining the individual models. Finally, backtesting on the testing data was conducted to compare the performance of the individual algorithms and the ensemble methods.

### Data and Methodology

Data

Our study uses about four years’ worth of indicator data for E-mini S&P 500 front month futures, obtained from Bloomberg. The data comprises daily values for 20 indicators for the period from 23rd September 2013 to 05th October 2017. Input variables are summarized below.

Fig 1 – Table of Input Variables

For the classification problem, we split the data into a 3:1 ratio, and use three years of data for training (classification) and one year (06th October 2016 to 05th October 2017) of data for testing.
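A chronological split of this kind can be sketched as follows. This is a minimal illustration, assuming each row is a `(date, value)` pair; the function name `chronological_split` and the toy rows are hypothetical and not taken from the study's code.

```python
from datetime import date

def chronological_split(rows, test_start):
    """Split date-stamped rows into train (before test_start) and test (on or after)."""
    train = [r for r in rows if r[0] < test_start]
    test = [r for r in rows if r[0] >= test_start]
    return train, test

# Toy rows: (observation date, placeholder feature value); hypothetical data.
rows = [
    (date(2016, 10, 4), 0.1),
    (date(2016, 10, 5), -0.2),
    (date(2016, 10, 6), 0.3),
    (date(2016, 10, 7), 0.0),
]
train, test = chronological_split(rows, test_start=date(2016, 10, 6))
```

Splitting by date rather than at random keeps every training observation strictly earlier than every testing observation, which matters for time-series data.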

Principal Component Analysis

Our study applies Principal Component Analysis (PCA) to the features to reduce dimensionality. The cumulative proportion variance graph below shows the top 6 principal components can explain 82% of the variance and they are used to train the models.

Fig 2 – Principal Component Analysis of Input Variables – Cumulative Proportion of Variance by Number of Components
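The cumulative proportion of variance can be computed directly from the eigenvalues of the feature covariance matrix. The sketch below assumes those eigenvalues are already available; the helper names and the eigenvalues shown are hypothetical and are not the study's values.

```python
def cumulative_variance(eigenvalues):
    """Return the cumulative share of variance explained by the top-k components."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running, shares = 0.0, []
    for ev in ordered:
        running += ev
        shares.append(running / total)
    return shares

def components_needed(eigenvalues, threshold):
    """Smallest number of leading components whose cumulative share reaches threshold."""
    for k, share in enumerate(cumulative_variance(eigenvalues), start=1):
        if share >= threshold:
            return k
    return len(eigenvalues)

# Hypothetical eigenvalues, for illustration only.
eigenvalues = [5.0, 3.0, 1.0, 1.0]
shares = cumulative_variance(eigenvalues)
k = components_needed(eigenvalues, threshold=0.75)
```

With these toy numbers the first two components carry 80% of the variance, so `k` is 2.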

Logistic Regression

Logistic regression is a technique borrowed by machine learning from statistics and is used for classification problems. The binary logistic model estimates the probability of a binary response (whether an event happens) based on one or more predictor variables, so its output is a number between 0 and 1. The main formulas used in logistic regression are as follows:

$\operatorname{logit}(p) = b_{0} + b_{1} X_{1} + b_{2} X_{2} + b_{3} X_{3} + \dots + b_{k} X_{k}$

$\operatorname{logit}(p) = \ln\left( \frac{p}{1-p} \right)$

where $X_{1}, X_{2}, X_{3}, \dots, X_{k}$ are the predictor variables and $p$ is the probability that the event happens. If $p$ is higher than a given threshold, logistic regression predicts that the event will happen; otherwise, it predicts the opposite. In this paper, $p$ is the probability that tomorrow’s price increases.
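The two formulas can be inverted to recover the probability, $p = 1 / (1 + e^{-z})$ with $z = b_{0} + b_{1} X_{1} + \dots + b_{k} X_{k}$. The sketch below illustrates this together with the threshold rule; the coefficients, inputs, and the 0.5 threshold are assumptions for illustration, not fitted values from the study.

```python
import math

def logit(p):
    """Log-odds of probability p, for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def predict_probability(intercept, coefs, x):
    """Invert the logit: p = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    z = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-z))

def predict_direction(p, threshold=0.5):
    """Return 1 (price up) if p exceeds the threshold, otherwise -1 (price down)."""
    return 1 if p > threshold else -1

# Hypothetical coefficients and predictor values, for illustration only.
p = predict_probability(intercept=0.1, coefs=[0.5, -0.25], x=[1.0, 2.0])
signal = predict_direction(p)
```

Here the linear term is 0.1, so `logit(p)` recovers 0.1 and the predicted direction is up.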

K-nearest Neighbors/ Random Forest/ Neural Networks/Support Vector Machine

KNN is a non-parametric algorithm for both classification and regression problems. It finds a test case’s K nearest neighbors according to a chosen distance measure and records their classes. It then uses majority voting to determine the test case’s class.
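The KNN procedure can be sketched in a few lines. This is a minimal illustration that assumes Euclidean distance and uses toy data; the function name `knn_predict` is hypothetical and this is not the implementation used in the study.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Classify query by majority vote among its k nearest training points (Euclidean)."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy two-dimensional data with labels 1 (up) and -1 (down); hypothetical values.
points = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1), (1.2, 0.8)]
labels = [-1, -1, 1, 1, 1]
prediction = knn_predict(points, labels, query=(1.0, 0.9), k=3)
```

Using an odd `k` with two classes avoids tied votes.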

Random forest is a robust machine learning algorithm that expands on decision tree models, by averaging the output of many decision trees.

Modelled after the human brain, neural networks serve to capture associations in large amounts of noisy data. A neural network’s shallow layers extract simple features from the input variables, and its deeper layers build a more complex representation from the shallow layers’ output. Finally, the network produces a response for the input.

SVM can be used to solve both classification and regression problems. To obtain good performance, researchers usually try different kernel functions to generate the best hyperplane that separates data from different classes in classification problems.

To learn more about K-nearest Neighbors, Random Forest, Neural Networks, and Support Vector Machine, please refer to our previous papers, “Comparing Supervised Learning Methods for Hang Seng Index Futures Long/Short Strategy” and “SVM Trend Strategy on Nikkei 225 Mini Futures”.

Ensemble Methods

Ensemble methods are simple techniques that aggregate the outputs of individual models. Ensemble learning is widely used in classification, prediction, and function approximation problems, and an ensemble usually produces more accurate results than a single model. Two popular aggregation mechanisms are majority voting and weighted voting.

In majority voting, every single model makes an equal-weight vote for each test case and the ensemble method’s prediction is the choice that receives more than half of the votes.

Unlike majority voting, where each model has the same influence on the final choice, weighted voting assigns a weight to each single algorithm. The final output is the choice receiving the most weight.
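The two voting mechanisms can be sketched as follows for predictions coded as 1 and -1. The function names and the weights are hypothetical; the experiments in this post use majority voting only.

```python
def majority_vote(predictions):
    """Equal-weight vote over +1/-1 predictions; assumes an odd number of voters."""
    return 1 if sum(predictions) > 0 else -1

def weighted_vote(predictions, weights):
    """Return the class (+1 or -1) that accumulates the larger total weight.

    A tie (total score of exactly 0) falls to -1 in this sketch.
    """
    score = sum(w * p for p, w in zip(predictions, weights))
    return 1 if score > 0 else -1

# Hypothetical votes from three models for one trading day.
votes = [1, -1, -1]
equal = majority_vote(votes)
weighted = weighted_vote(votes, weights=[0.6, 0.2, 0.2])
```

With equal weights the two -1 votes win, while a weight of 0.6 on the first model lets its single +1 vote outweigh the other two.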

In our experiment, two majority voting functions were implemented and analysed. Ensemble method 1 includes all five individual machine learning algorithms. Ensemble method 2 includes three: logistic regression, neural network, and k-nearest neighbors.

Classification Approach

Across all approaches, the output of the PCA is used as the predictor variables for the classification problem, and the target variable is the sign of the return on the next trading day. In the training data, if the next day’s return is positive, we classify its target value as 1. If it is negative, the target value is -1.
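The labelling rule can be sketched as below, so that the features observed on day t are paired with the direction of the return from t to t+1. The function name and prices are hypothetical, and mapping a zero return to -1 is an assumption, since the text only specifies the positive and negative cases.

```python
def direction_labels(prices):
    """Label day t with 1 if the return from t to t+1 is positive, else -1.

    The last day has no next-day return, so it receives no label.
    Zero returns are mapped to -1 here (an assumption of this sketch).
    """
    labels = []
    for today, tomorrow in zip(prices, prices[1:]):
        next_return = tomorrow / today - 1.0
        labels.append(1 if next_return > 0 else -1)
    return labels

# Hypothetical closing prices.
prices = [100.0, 101.0, 100.5, 100.5, 102.0]
labels = direction_labels(prices)
```

The resulting list is one element shorter than the price series, so the final row of features must be dropped before training.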

### Computational Results

Logistic Regression

A generalized linear model with the binomial family was used. After quick tuning, the prediction accuracy on the testing set was 53.17%.

Fig 3 – Confusion Matrix for Logistic Regression

KNN

Tuning the KNN model was simple because only the number of neighbors can be adjusted. In our study, 14 neighbors were selected. The prediction accuracy was 54.76%.

Fig 4 – Confusion Matrix for KNN

Random Forest

Because random forest training is very fast (it does not involve complicated mathematical computations), our study tested various combinations of two parameters: the number of decision trees and the sample size. The best accuracy we obtained was 57.54%.

Fig 5 – Confusion Matrix for Random Forest

Neural Network

A few combinations of tuning parameters were tested. A reasonable accuracy obtained was 54.76%. The corresponding confusion matrix is shown below.

Fig 6 – Confusion Matrix for Neural Networks

Support Vector Machine

After tuning, a polynomial kernel function was found to outperform other kernel types. The accuracy was 59.13%.

Fig 7 – Confusion Matrix for SVM

Ensemble method 1

All 5 methods’ results were collected and considered to have the same voting weight in this ensemble method. The accuracy on the testing set is 59.13%.

Fig 8 – Confusion Matrix for Ensemble Method 1

Ensemble method 2

Only three algorithms (Logistic regression, neural network, and KNN) are combined in this majority voting method. The accuracy on the testing set is 59.52%.

Fig 9 – Confusion Matrix for Ensemble Method 2

### Strategy Backtests

Logistic Regression

Over the one-year testing period, the strategy posted a return of 12.7%. The daily Sharpe ratio was 0.107. The largest daily profit was 2.3% while the largest daily loss was 1.3%. A total of 88 trades were conducted, comprising 44 longs and 44 shorts, with an average holding period of 3.3 days. Based on the P&L of the trades, the win rate was 57.5% and the expectancy was 12bps.

Fig 10 – Representative Logistic Regression Strategy Backtest
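The statistics quoted for each backtest can be computed along the following lines. This is a sketch under stated assumptions: the daily Sharpe ratio is taken as the mean daily return over its sample standard deviation with a zero risk-free rate, and expectancy is taken as the average P&L per trade. The post does not spell out its exact definitions, and the input numbers are hypothetical.

```python
import statistics

def daily_sharpe(daily_returns):
    """Mean daily return divided by its sample standard deviation (risk-free rate assumed 0)."""
    return statistics.mean(daily_returns) / statistics.stdev(daily_returns)

def win_rate(trade_pnls):
    """Fraction of trades that closed with a positive P&L."""
    return sum(1 for pnl in trade_pnls if pnl > 0) / len(trade_pnls)

def expectancy(trade_pnls):
    """Average P&L per trade (one common definition; assumed here)."""
    return statistics.mean(trade_pnls)

# Hypothetical daily returns and per-trade returns, expressed as fractions.
daily_returns = [0.004, -0.002, 0.001, 0.003, -0.001]
trade_pnls = [0.010, -0.004, 0.006, -0.002]

sharpe = daily_sharpe(daily_returns)
wins = win_rate(trade_pnls)
edge = expectancy(trade_pnls)
```

In this toy example half of the trades are winners and the average trade earns 25bps.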

KNN

Over the one-year testing period, the strategy posted a return of 2.4%. The daily Sharpe ratio was 0.023. The largest daily profit was 1.7% while the largest daily loss was 2.3%. A total of 103 trades were conducted, comprising 51 longs and 52 shorts, with an average holding period of 3.3 days. Based on the P&L of the trades, the win rate was 56.3% and the expectancy was 3bps.

Fig 11 – Representative KNN Strategy Backtest

Random Forest

Over the one-year testing period, the strategy posted a return of 16.3%. The daily Sharpe ratio was 0.136. The largest daily profit was 2.3% while the largest daily loss was 1.4%. A total of 82 trades were conducted, comprising 41 longs and 41 shorts, with an average holding period of 4.3 days. Based on the P&L of the trades, the win rate was 61.3% and the expectancy was 16bps.

Fig 12 – Representative Random Forest Strategy Backtest

Neural Network

Over the one-year testing period, the strategy posted a return of 7.9%. The daily Sharpe ratio was 0.069. The largest daily profit was 1.6% while the largest daily loss was 2.3%. A total of 73 trades were conducted, comprising 37 longs and 36 shorts, with an average holding period of 4.8 days. Based on the P&L of the trades, the win rate was 53.4% and the expectancy was 7bps.

Fig 13 – Representative Neural Network Strategy Backtest

Support Vector Machine

Over the one-year testing period, the strategy posted a return of 20.1%. The daily Sharpe ratio was 0.165. The largest daily profit was 1.3% while the largest daily loss was 2.3%. A total of 16 trades were conducted, comprising 8 longs and 8 shorts, with an average holding period of 18.9 days. Based on the P&L of the trades, the win rate was 62.5% and the expectancy was 85bps.

Fig 14 – Representative SVM Strategy Backtest

Ensemble method 1

Over the one-year testing period, the strategy posted a return of 18.4%. The daily Sharpe ratio was 0.152. The largest daily profit was 1.7% while the largest daily loss was 2.3%. A total of 80 trades were conducted, comprising 40 longs and 40 shorts, with an average holding period of 4.4 days. Based on the P&L of the trades, the win rate was 59.5% and the expectancy was 19bps.

Fig 15 – Representative Strategy Backtest for Ensemble Method 1

Ensemble method 2

Over the one-year testing period, the strategy posted a return of 13.3%. The daily Sharpe ratio was 0.113. The largest daily profit was 1.7% while the largest daily loss was 2.3%. A total of 102 trades were conducted, comprising 51 longs and 52 shorts, with an average holding period of 3.5 days. Based on the P&L of the trades, the win rate was 58.4% and the expectancy was 14bps.

Fig 16 – Representative Strategy Backtest for Ensemble Method 2

### Analysis

Fig 17 – Performance Comparisons of All Approaches

Ensemble method 1 achieved the same accuracy as the best single algorithm it leveraged (SVM), but its annual return was lower than SVM’s. A possible reason is that SVM and Random Forest performed much better than the other three algorithms. For example, if a majority voting method uses 3 individual algorithms, with one algorithm’s accuracy being 100% and the other two being 50%, the majority vote’s accuracy can be lower than 100%: whenever the two weaker algorithms are both wrong, they outvote the perfect one.
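The three-model example can be checked by enumerating every combination of right and wrong predictions. The sketch below assumes the models' errors are independent, which is a simplification (real models trained on the same data are usually correlated); the function name is hypothetical.

```python
from itertools import product

def majority_accuracy(accuracies):
    """Exact accuracy of a majority vote over independent models with the given accuracies."""
    total = 0.0
    # Enumerate every pattern of correct (True) / incorrect (False) models.
    for outcome in product([True, False], repeat=len(accuracies)):
        prob = 1.0
        for correct, acc in zip(outcome, accuracies):
            prob *= acc if correct else 1.0 - acc
        # The ensemble is right when more than half of the models are right.
        if sum(outcome) > len(accuracies) / 2:
            total += prob
    return total

ensemble_accuracy = majority_accuracy([1.0, 0.5, 0.5])
```

Under the independence assumption the two 50% models are both wrong a quarter of the time, so the ensemble's accuracy is 75% even though one member is always right.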

Ensemble method 2’s accuracy and returns were higher than those of any of its three individual algorithms. Judging by the Cumulative Equity Holdings graphs, most of the three algorithms incurred losses in the first half of the testing period and gained more profit in the second half. The voting mechanism allowed the ensemble to outperform each of them.

Ensemble method 1 had a lower accuracy but higher annual returns compared to Ensemble method 2. When the SVM and random forest algorithms were added to Ensemble method 2, they turned more of its correct predictions into incorrect ones than they turned incorrect predictions into correct ones. However, the new correct predictions gained more profit than the new incorrect predictions lost. Therefore, annual returns in Ensemble method 1 were higher.

### Conclusion

Ensemble methods can obtain better results in both prediction accuracy and annual returns if the individual algorithms’ performances are similar. If one individual algorithm is clearly outstanding, using it alone may be the better choice. In the meantime, feel free to test out our code for your own research!

### References

• Yang, R. Rao, P. Hong and P. Ding, “Ensemble Model for Stock Price Movement Trend Prediction on Different Investing Periods,” 2016 12th International Conference on Computational Intelligence and Security (CIS), Wuxi, 2016, pp. 358-361.
• Van den Poel, C. Chesterman, M. Koppen and M. Ballings, “Equity price direction prediction for day trading: Ensemble classification using technical analysis indicators with interaction effects,” 2016 IEEE Congress on Evolutionary Computation (CEC), Vancouver, BC, 2016, pp. 3455-3462.
• MedCalc, Logistic regression, https://www.medcalc.org/manual/logistic_regression.php
• Necati Demir, Ensemble Methods: Elegant Techniques to Produce Improved Machine Learning Results, https://www.toptal.com/machine-learning/ensemble-methods-machine-learning
6 replies
1. raimund says:

Thanks for sharing this article, but without code the results are hard to follow.
If you are using code from your last example “SVM Trend Strategy on Nikkei 225 Mini Futures”
it’s likely that you have some kind of time bias though, ’cause you’re using lagged with non-lagged data.

2. Giuseppe says:

Very interesting. Would you please share the source code?

3. raimund says:

Thanks again for your reply and sharing the source. After a quick look – the attributes are now lagged behind the target without any conflicting bias.
Great stuff.

4. raimund says:

Hi again,
on closer inspection and after running the code of the svmEnsemble there still remain questions.
The tuning function with the for loop provides good results, but tuning the model with the validation
of the testset is very questionable.
The two other methods for tuning provide only poor results (I’m using the SPY instead of futures).
So please can you help me to solve the problem ?