# Statistical Arbitrage on a Cross-border Soybean Crush Spread: Backtesting an Intraday Pairs Trading Strategy on Soy Futures

**Motivation**

Pairs trading is one of the simplest forms of statistical arbitrage: it exploits relative mispricings between two similar assets. It rests on the law of one price, the assumption that valuation anomalies between securities occur in the short run but are corrected by market efficiency in the long run.

In the academic literature, studies such as Bogomolov (2010) and Harlacher (2012) have shown that for pairs trading to succeed, the difference or spread between the traded pair needs to follow a stochastic mean reverting relationship over time. According to Montana et al. (2008), this means the spread will fluctuate around an equilibrium level, and any temporary mispricings are likely to be corrected over time. For example, when the spread is abnormally wide, the natural trading decision is to short sell the overpriced asset and go long the undervalued asset according to a predetermined ratio. Profits are then made when the spread reverts to equilibrium. If a pair of assets is cointegrated, a linear combination of their prices is stationary, so the spread is genuinely mean reverting. Pairs trading therefore requires selecting two assets that have a significant cointegrating relationship over time.

This week, we showcase the application of a pairs trading strategy to two related commodity futures: CBOT Soybean and DCE Soymeal. We begin by testing the hypothesis of a cointegrating relationship between the two assets via the Augmented Dickey-Fuller (ADF) and Johansen tests. Next, because the structural relationship between the two futures may vary over time, we use a state space model in the form of the Kalman filter to dynamically estimate parameters such as hedge ratios from the price data. Finally, we backtest the pairs trading strategy to evaluate its performance.


**Data and Hypothesis Testing**

*Data*

Our study uses approximately six months of 1-minute bars for CBOT soybean and DCE soymeal front month futures, obtained from Bloomberg. The data comprise closing prices for every minute from 7 February 2017 to 22 August 2017. The price data are aligned by timestamp across both assets, and missing bars are discarded. In addition, prices were converted to USD/MT equivalents.
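The alignment and conversion step can be sketched in base R as follows. This is a minimal illustration on made-up bars, not the code used for the study: the column names, sample prices and the fixed FX rate are hypothetical, and the conversion assumes CBOT soybeans are quoted in US cents per 60 lb bushel and DCE soymeal in CNY per metric ton.

```r
# Hypothetical 1-minute closes; names and values are illustrative only.
stamp <- function(x) as.POSIXct(x, tz = "UTC")

cbot <- data.frame(
  time          = stamp(c("2017-02-07 09:00", "2017-02-07 09:01", "2017-02-07 09:02")),
  cbot_cents_bu = c(1040.25, 1040.50, 1041.00)   # US cents per bushel
)
dce <- data.frame(
  time       = stamp(c("2017-02-07 09:00", "2017-02-07 09:02", "2017-02-07 09:03")),
  dce_cny_mt = c(2850, 2852, 2851)               # CNY per metric ton
)
fx <- data.frame(
  time   = stamp(c("2017-02-07 09:00", "2017-02-07 09:01",
                   "2017-02-07 09:02", "2017-02-07 09:03")),
  usdcny = rep(6.87, 4)                          # assumed flat FX rate
)

# A 60 lb soybean bushel implies roughly 36.74 bushels per metric ton.
BU_PER_MT <- 2204.62262 / 60

# Inner joins keep only timestamps present in every series,
# so bars missing on either exchange are dropped.
bars <- merge(merge(cbot, dce, by = "time"), fx, by = "time")

bars$cbot_usd_mt <- bars$cbot_cents_bu / 100 * BU_PER_MT
bars$dce_usd_mt  <- bars$dce_cny_mt / bars$usdcny
```

Only the 09:00 and 09:02 bars survive the join here, because those are the only timestamps present in both price series.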

*Hypothesis Testing*

China is the world’s largest soybean importer, and its main supplier is the US, which accounts for around 38% of imports. The purchase of US soybeans is likely a substantial cost driver for soymeal producers in China, so it is reasonable to expect Chinese soymeal and US soybean prices to be highly correlated. CBOT soybean and DCE soymeal futures are the most influential price markers for their underlyings in the US and China respectively. Our hypothesis is that front month futures prices are cointegrated and thus have the potential for a mean reverting pairs trading strategy.

We test this hypothesis with the Augmented Dickey-Fuller (ADF) and Johansen tests. These allow us to determine whether our pair forms a cointegrating relationship and can therefore be used in a mean reverting strategy. For more information on the theoretical basis of these tests for pairs trading, please refer to the Quantstart articles on the ADF and Johansen tests.

At first glance, CBOT and DCE prices have followed a similar pattern over the last 6 months.


In addition, a regression plot shows signs of strong positive correlation between prices.


Linear regression using CBOT prices as a predictor of DCE prices suggested a significant relationship, with a very low p-value and an R-squared of 66.8%. The fitted slope gives a static hedge ratio of 1.026.
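For readers who want to reproduce this kind of regression, here is a minimal sketch on simulated prices. The data below are synthetic stand-ins (a random-walk price and a noisy linear function of it), so the numbers will not match those reported above; only the use of `lm` and the way the hedge ratio and residuals are read off are the point.

```r
set.seed(1)
n    <- 500
cbot <- 350 + cumsum(rnorm(n))              # simulated random-walk price, USD/MT
dce  <- 50 + 0.9 * cbot + rnorm(n, sd = 1)  # simulated partner price tied to cbot

fit <- lm(dce ~ cbot)

hedge_ratio  <- unname(coef(fit)["cbot"])   # slope = static hedge ratio
r_squared    <- summary(fit)$r.squared
resid_spread <- residuals(fit)              # residuals serve as the static spread
```

The residual series `resid_spread` is what a residual-based cointegration test would then be run on.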


Next, an ADF test on the regression residuals produced a strongly negative statistic of -5.68 and a p-value below 0.01. This is sufficient evidence to reject the null hypothesis of no cointegrating relationship.
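To make the mechanics of the test concrete, the sketch below hand-rolls a simplified Dickey-Fuller regression in base R. It is an illustration under stated assumptions, not the routine used for the result above: there is no constant or trend term, the lag order is fixed by the caller, and `adf_stat` is a hypothetical helper name. In practice a packaged routine such as `adf.test` from the `tseries` package would be used, and residual-based cointegration tests call for their own critical values rather than the standard Dickey-Fuller tables.

```r
# Regress the change in e on its lagged level plus lagged changes:
#   diff(e)[t] = gamma * e[t-1] + sum_i phi_i * diff(e)[t-i] + u[t]
# and return the t-statistic on gamma.
adf_stat <- function(e, lags = 1) {
  stopifnot(lags >= 1, length(e) > lags + 2)
  de <- diff(e)
  m  <- embed(de, lags + 1)                  # col 1: current change, rest: lagged changes
  dy           <- m[, 1]
  lagged_diffs <- m[, -1, drop = FALSE]
  level_lag    <- e[(lags + 1):(length(e) - 1)]  # e[t-1], aligned with dy
  reg <- lm(dy ~ 0 + level_lag + lagged_diffs)
  unname(summary(reg)$coefficients["level_lag", "t value"])
}

set.seed(42)
stationary  <- as.numeric(stats::filter(rnorm(1000), 0.5, method = "recursive"))
random_walk <- cumsum(rnorm(1000))

stat_stationary  <- adf_stat(stationary)    # strongly negative for a mean reverting series
stat_random_walk <- adf_stat(random_walk)   # much closer to zero for a unit-root series
```

A strongly negative statistic, like the -5.68 reported above, is what points towards a stationary residual.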


In addition, the first hypothesis of Johansen’s test, which tests the null of no cointegration, produced a test statistic of 40.45, well above the 1% critical value of 23.52.


Overall, we have clear evidence to support our hypothesis that cointegration exists between CBOT and DCE prices.


**Time Varying Model**

Although linear regression is helpful in understanding the dynamics between the two assets, it is unrealistic to assume that this relationship would be constant over time. Therefore, we use a Kalman filter to calculate time varying slopes and intercepts. More on applying Kalman filters in pairs trading can be found in Ernie Chan’s book, Algorithmic Trading: Winning Strategies and Their Rationale.

We used the dlm package in R to construct a Kalman filter. Based on the estimated model, we plot the slope and intercept to show how they change over time:
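For intuition, the recursion behind a time varying regression can be written out by hand. The sketch below does not use dlm and is not the model fitted for this article: it assumes the slope and intercept follow a random walk, and the noise settings `delta` and `ve`, the function name `kalman_regression` and the simulated prices are all illustrative assumptions.

```r
# State = (slope, intercept), evolving as a random walk.
# Observation: y[i] = slope * x[i] + intercept + noise.
kalman_regression <- function(x, y, delta = 1e-4, ve = 1e-3) {
  n     <- length(y)
  vw    <- delta / (1 - delta) * diag(2)   # state noise covariance (tuning assumption)
  state <- c(0, 0)                         # current (slope, intercept) estimate
  P     <- matrix(0, 2, 2)                 # state covariance
  out   <- matrix(NA_real_, n, 4,
                  dimnames = list(NULL, c("slope", "intercept", "spread", "spread_sd")))
  for (i in seq_len(n)) {
    h  <- c(x[i], 1)                       # observation vector
    R  <- P + vw                           # predicted state covariance
    e  <- y[i] - sum(h * state)            # one-step forecast error
    q  <- drop(t(h) %*% R %*% h) + ve      # forecast error variance
    Rh <- drop(R %*% h)
    state <- state + Rh / q * e            # update with Kalman gain Rh / q
    P  <- R - outer(Rh, Rh) / q            # updated state covariance
    out[i, ] <- c(state, e, sqrt(q))
  }
  as.data.frame(out)
}

set.seed(3)
n <- 1000
x <- 350 + cumsum(rnorm(n, sd = 0.5))      # simulated CBOT-like price
y <- 2 + 0.8 * x + rnorm(n)                # simulated DCE-like price
kf <- kalman_regression(x, y)
```

Here `kf$slope` and `kf$intercept` are the filtered parameter paths, and `kf$spread` is the one-step forecast error, which is one natural candidate for a tradable spread. With dlm, `dlmModReg` and `dlmFilter` would play the corresponding roles.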


It can be seen that the time varying slope changes dramatically over the 6-month period, dropping from above 0.85 in February to below 0.75 in June and July. Therefore, adopting a static hedge ratio for this pairs trading strategy would not be optimal.

Now that we have a time varying model for the relationship between CBOT and DCE prices, a spread can be constructed by applying the estimated slope and intercept to the CBOT price to obtain a fitted DCE price, and then taking the difference between that fitted value and the actual DCE price. A histogram of the spread values shows an approximately normal distribution.


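We also conduct a Shapiro-Wilk test for normality on a random portion of our data as an additional check. The W statistic is very high (close to 1), which indicates that the sample is close to normal. Note that the null hypothesis of the Shapiro-Wilk test is that the data are normally distributed, so the low p-value is, strictly speaking, evidence against exact normality rather than for it. With a sample this large the test flags even small departures, so we rely on the high W statistic and the histogram to treat the spread as approximately normal.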
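The direction of the Shapiro-Wilk hypotheses is easy to check on synthetic data. The sketch below uses simulated series as stand-ins for the spread; the subsample of 5000 points reflects the upper limit that base R's `shapiro.test` accepts.

```r
set.seed(7)
spread_like <- rnorm(20000)                 # stand-in for the spread series
sub         <- sample(spread_like, 5000)    # shapiro.test accepts 3 to 5000 points

sw_normal <- shapiro.test(sub)
w_normal  <- unname(sw_normal$statistic)    # W close to 1 for near-normal data
p_normal  <- sw_normal$p.value              # null hypothesis: the data ARE normal

# A clearly non-normal sample for contrast: W drops and the p-value collapses.
sw_skewed <- shapiro.test(rexp(5000))
w_skewed  <- unname(sw_skewed$statistic)
p_skewed  <- sw_skewed$p.value
```

A small p-value therefore rejects normality; a W statistic near 1 is what supports an approximately normal shape.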


We will now outline a trading system based on the spread obtained by our state space model.


**Strategy Development**

To generate trade signals for our mean reversion trading system, we use a Bollinger bands style strategy based on the distribution of spread values. Bollinger bands provide signals for entering and exiting trades at standard deviation thresholds based on rolling simple moving averages and standard deviations. Since the spread is a mean reverting stochastic process, it is expected to return to its mean after deviating from it. Z-scores of the current spread value are computed at each timestamp. Average forward returns from trading the CBOT – DCE spread, grouped by decile of z-score, are presented below:
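A rolling z-score and a decile breakdown of forward changes can be sketched in base R as follows. This is an illustration on a simulated mean reverting series, not the study's code: `rolling_zscore` is a hypothetical helper, the window is assumed to include the current bar, and the forward "return" is simply the next-bar change in the spread.

```r
# Z-score of the latest value against the trailing window (current bar included).
rolling_zscore <- function(spread, lookback = 30) {
  n <- length(spread)
  stopifnot(n >= lookback, lookback >= 2)
  z <- rep(NA_real_, n)
  for (i in lookback:n) {
    w    <- spread[(i - lookback + 1):i]
    z[i] <- (spread[i] - mean(w)) / sd(w)
  }
  z
}

set.seed(11)
spread <- as.numeric(stats::filter(rnorm(5000), 0.8, method = "recursive"))  # simulated AR(1)
z      <- rolling_zscore(spread, lookback = 30)

fwd    <- c(diff(spread), NA)                # next-bar change in the spread
decile <- cut(z, breaks = quantile(z, probs = seq(0, 1, 0.1), na.rm = TRUE),
              include.lowest = TRUE, labels = FALSE)
fwd_by_decile <- tapply(fwd, decile, mean, na.rm = TRUE)
```

For a mean reverting spread, the lowest z-score deciles should show positive average forward changes and the highest deciles negative ones, which is the pattern the strategy tries to capture.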


It is evident from the graph that extreme values capture the most positive and negative forward returns and that a mean reverting strategy is suitable for this pair. Based on this, we define the trading system as follows:


*Long Entry:* $Z_{score} \leq -Z_{entry}$

*Short Entry:* $Z_{score} \geq +Z_{entry}$

*Long Close:* $Z_{score} \geq -Z_{exit}$

*Short Close:* $Z_{score} \leq +Z_{exit}$


Here $Z_{score}$ is the latest standardized spread value, $Z_{entry}$ is the trade entry threshold and $Z_{exit}$ is the trade exit threshold. Based on CBOT and DCE contract values of USD 47,225 and USD 27,340 respectively as of the 22 August 2017 close, and using a beta hedge ratio of 0.856, a long position means buying approximately three lots of DCE soymeal and selling two lots of CBOT soybean. Proportions based on hedge ratios can be matched more closely with larger lot quantities, although larger orders would also carry market impact implications at execution.
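One reading of the lot arithmetic that reproduces the roughly three-to-two ratio is to match lots on beta-scaled contract value. The formula is our assumption, since the text does not state it; the contract values and beta are the ones quoted above.

```r
beta     <- 0.856
cbot_val <- 47225    # CBOT soybean contract value quoted in the text
dce_val  <- 27340    # DCE soymeal contract value quoted in the text

# Assumed sizing rule: DCE lots per CBOT lot = beta * CBOT value / DCE value.
dce_per_cbot <- beta * cbot_val / dce_val    # about 1.48

# Smallest whole-lot pair near that ratio, taking two CBOT lots as the base.
cbot_lots <- 2
dce_lots  <- round(dce_per_cbot * cbot_lots) # 3
```

Under this assumption the exact ratio is about 1.48 DCE lots per CBOT lot, so three against two is a close whole-lot approximation, and larger base quantities would allow a tighter match.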

For our strategy, we select a lookback period of 30 bars for the z-scores. We take $Z_{entry}$ to be the 95th percentile of z-values and $Z_{exit}$ to be 0.
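The entry and exit rules amount to a small state machine over the z-score series. The sketch below is one way to express them, with a position of +1 for long the spread, -1 for short and 0 for flat. The function name and the sample z-scores are hypothetical, and it assumes at most one open position at a time with no re-entry on the bar that closes a trade.

```r
generate_positions <- function(z, z_entry, z_exit = 0) {
  pos     <- integer(length(z))
  current <- 0L
  for (i in seq_along(z)) {
    if (!is.na(z[i])) {
      if (current == 0L) {
        if (z[i] <= -z_entry) {
          current <- 1L                      # long entry
        } else if (z[i] >= z_entry) {
          current <- -1L                     # short entry
        }
      } else if (current == 1L && z[i] >= -z_exit) {
        current <- 0L                        # long close
      } else if (current == -1L && z[i] <= z_exit) {
        current <- 0L                        # short close
      }
    }
    pos[i] <- current
  }
  pos
}

# Hand-made z-scores to trace the rules; a fixed entry threshold of 2 is used here.
# On real data one might set z_entry <- quantile(abs(z), 0.95, na.rm = TRUE),
# where taking the absolute value is our assumption.
z_demo   <- c(0, -2.5, -1, 0.2, 2.2, 1, -0.1, NA, 0)
pos_demo <- generate_positions(z_demo, z_entry = 2, z_exit = 0)
```

In this trace the strategy goes long at -2.5, closes when the z-score crosses back above 0, goes short at 2.2 and closes again when the z-score falls below 0.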


**Strategy Backtest**

Our backtest spans the final week of our dataset, from 15 to 22 August 2017, and does not consider transaction costs. The pause in the time series (seen as the straight diagonal line) is due to the weekend of 19 to 20 August. The top chart shows entry (green) and exit (red) markers throughout the week.


Because of the high frequency of our data and trade signals, the strategy made many trades, each with a small return. The average holding period was 29 minutes and the average time between trades was 42 minutes. The strategy posted a return of 195 bps on CBOT + DCE gross value over the period, with a largest profit of 33 bps and a largest loss of 35 bps. A total of 255 trades were made: 132 longs and 123 shorts. The win rate was 57.6%, and expectancy was 6 bps. These trade statistics are not surprising given the high frequency nature of our data, but acting on them would require HFT-level execution capabilities, and broker and exchange fees would have to be taken into account. Given the low return magnitude and high turnover, exchange and broker fees cannot exceed a few dollars for the strategy to work.
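Summary statistics of this kind can be computed from a vector of per-trade returns. The sketch below uses made-up returns in basis points and a hypothetical helper, `trade_stats`; it takes expectancy to mean the win rate times the average win plus the loss rate times the average loss, which equals the mean return per trade.

```r
trade_stats <- function(ret_bps) {
  wins     <- ret_bps[ret_bps > 0]
  losses   <- ret_bps[ret_bps <= 0]
  win_rate <- length(wins) / length(ret_bps)
  avg_win  <- if (length(wins) > 0) mean(wins) else 0
  avg_loss <- if (length(losses) > 0) mean(losses) else 0
  list(
    n_trades   = length(ret_bps),
    win_rate   = win_rate,
    expectancy = win_rate * avg_win + (1 - win_rate) * avg_loss,
    total      = sum(ret_bps),
    best       = max(ret_bps),
    worst      = min(ret_bps)
  )
}

ret_demo <- c(5, -3, 2, -1, 4)               # made-up per-trade returns in bps
gross    <- trade_stats(ret_demo)

# Sensitivity to costs: subtract an assumed flat 1 bp round-trip cost per trade.
net      <- trade_stats(ret_demo - 1)
```

Subtracting even a small fixed cost per trade shifts expectancy down one for one, which is why fees matter so much for a high-turnover strategy with small per-trade returns.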


**Conclusion**

In summary, our study finds pairs trading to be a strong possibility on CBOT soybean and DCE soymeal at the minute level. We think it would be interesting to examine the pair at lower frequencies, such as hourly or daily bars, which could make the strategy more viable for self-directed traders. In the meantime, feel free to test out our code for your own research!


**References**

- Bogomolov, T. (2010). Pairs Trading in the Land Down Under. Finance and Corporate Governance Conference 2011 Paper. Available at SSRN: https://ssrn.com/abstract=1717295 or http://dx.doi.org/10.2139/ssrn.1717295
- Harlacher, M. (2012). Cointegration Based Statistical Arbitrage. Master’s thesis, ETH Zurich.
- Montana, G., Triantafyllopoulos, K., & Tsagaris, T. (2008). Data stream mining for market-neutral algorithmic trading. Proceedings of the ACM Symposium on Applied Computing.

This is an interesting trading idea. However, I encourage you to test your strategy out of sample to see if you get equally promising results. I ran your code and got similar results to those reported here (I should note that you and I may have different pricing services for Bloomberg). I also updated the model to be trained through Aug. 14 and tested on Aug. 15 to Aug. 22. These results were not as promising: a success rate of 56.2%, mean return of 0.0045%, min return of -0.35%, max return of 0.33%, and cumulative return of 0.18%.

Have you also considered a KPSS test to supplement the ADF test? Generally, both tests agree. However, here it seems they do not.

Thanks nnb0317 for your kind feedback. In addition, here are our KPSS test results.

> kpss.test(fit$residuals, null = "Trend")

KPSS Test for Trend Stationarity

data: fit$residuals

KPSS Trend = 10.663, Truncation lag parameter = 83, p-value = 0.01

The ADF procedure tests the null hypothesis of non-stationarity, whereas the KPSS procedure tests the null hypothesis of stationarity. With the KPSS test a failure to reject the null results in a failure to reject stationarity. The results here would lead you to reject the null in favor of the alternative and conclude that the series is non-stationary at a 1% level of significance. To see this empirically, first difference the spread series, run the KPSS test again, and look at the p-value.

Great research on statistical arbitrage, and thanks for generously sharing your thoughts.

The R code you posted is greatly appreciated.

Could you also post Cointegration Data.RData, so that self-directed traders can reproduce your work?

Thanks.

Thanks cccnj for your kind feedback. We’re not able to share the RData file with the OHLC data, as that would be a violation of data permissions. However, we can give you the exact specifications of our data: it comprises 1-minute OHLC data for CBOT Soybean (S 1 Comdty), DCE Soymeal (AE1 Comdty) and USDCNY Curncy (for FX conversion) from 7 February 2017 to 22 August 2017. Thanks for your understanding.