Intratrade

Synopsis:

This article analyzes the intra-arrival times of futures contract trades, with the goals of studying market behaviour and identifying other participants’ trading strategies. Intra-arrival time is closely related to the quantity of each trade. One reason is that many participants use high-speed algorithmic trading to reduce price impact, for example by dividing one large order into a number of smaller orders. It is therefore very likely that a smaller intra-arrival time delta, meaning that trades occur more frequently, comes with a smaller quantity per trade.

One important assumption is that if every trade arrives randomly, the intra-arrival time should follow some distribution. We do not assume any particular distribution prior to our exploratory analysis, but we expect it to conform to a simple one, for example the exponential distribution. Our analysis is therefore constructed in two parts. The first is to find out which distribution the intra-arrival time follows. If we think of the order book as a queue, we hypothesize that the intra-arrival time is exponentially distributed, because the exponential distribution describes the time between events in a Poisson process. The second part analyzes the difference between the historical data and the theoretical distribution. Under our assumption, this difference should be white noise; otherwise, we suspect there may be algorithmic trading.
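
For reference, if trades arrive as a Poisson process with rate $\lambda$, the waiting time $T$ between two consecutive trades satisfies

$P(T > t) = P(\text{no trade in } (0, t]) = e^{-\lambda t}$, so its density is $f_T(t) = \lambda e^{-\lambda t}$.

In other words, purely random arrivals would show up as an exponentially distributed $\Delta t$.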

Table of contents:

Part 1:

1.1 Data

1.2 The reason why we exclude $\Delta t$ = 0 and 1

1.3 Historical distribution of $\Delta t$

1.3.1 $\Delta t$ plot

1.3.2 qqplot

1.4 Exploration of fitted distribution

1.4.1 exponential distribution

1.4.2 power law distribution

1.4.3 linear combination of exponential distribution and power law distribution

Part 2:

2.1 $\epsilon$ given fitted distribution

2.2 Identification of significant $\epsilon$ for each hour

Part 3:

Conclusion

Part 4:

Improvement

Part 5:

Appendix

Part 1

1.1 Data

To analyze intra-trade arrival time, we take the $\Delta t$ between two consecutive trades and explore its distribution. In the following report, we use data on GC Z6 from 2016-09-12 to 2016-11-01 as the main example. After finding a candidate distribution for $\Delta t$, we also use CL data to test whether this distribution is suitable for other securities.

Our data is structured like the following example:

t = Workhorse()
t.select_sec('GC')
t.select_exp('Z6', '2016-09-12', '2016-11-01')
df = t.to_trades(t.df)
print ("Table 1")
df = df[df['EventType']!=12]
df.head()
Table 1
EventType Price Quantity OrderCount Flags
date
2016-09-12 00:00:01.122 10 133.31 2 3 0
2016-09-12 00:00:01.520 11 133.31 2 2 0
2016-09-12 00:00:01.520 11 133.31 1 2 0
2016-09-12 00:00:01.520 11 133.31 1 2 0
2016-09-12 00:00:01.571 10 133.32 1 2 0

As the event_dict below shows, event types 9, 10, and 11 together comprise all types of trade. The DataFrame df contains the trade date, price, quantity, order count, and flags for events of type 9, 10, and 11. What we need is just the trade date.

t.event_dict
{'EMPTY BOOK FINAL': 0,
 'FIXING PRICE': 13,
 'IMPLIED EMPTY BOOK FINAL': 1,
 'IMPLIED QUOTE BID': 2,
 'IMPLIED QUOTE SELL': 3,
 'OPEN INTEREST': 4,
 'OPENING PRICE': 5,
 'QUOTE BID': 6,
 'QUOTE SELL': 7,
 'SETTLEMENT PRICE': 8,
 'TRADE': 9,
 'TRADE AGRESSOR ON BUY': 10,
 'TRADE AGRESSOR ON SELL': 11,
 'TRADE VOLUME': 12}
set(df['EventType'])
{9, 10, 11}

Please note that the current df index is in UTC. After converting the timezone, we have trade dates in Singapore/Beijing time.

df.index = df.index.tz_localize('UTC').tz_convert('Asia/Singapore')
print("Table 2")
df.head()
Table 2
EventType Price Quantity OrderCount Flags
date
2016-09-12 08:00:01.122000+08:00 10 133.31 2 3 0
2016-09-12 08:00:01.520000+08:00 11 133.31 2 2 0
2016-09-12 08:00:01.520000+08:00 11 133.31 1 2 0
2016-09-12 08:00:01.520000+08:00 11 133.31 1 2 0
2016-09-12 08:00:01.571000+08:00 10 133.32 1 2 0

Next, we calculate $\Delta t$ from df and record the hour of each trade.

sample = pd.DataFrame(data={"diff": df.index.to_series().diff().dt.total_seconds().fillna(0)*1000.0,
                          "hour": df.index.hour}) 
print("Table 3")
sample.head()
Table 3
diff hour
date
2016-09-12 08:00:01.122000+08:00 0.0 8
2016-09-12 08:00:01.520000+08:00 398.0 8
2016-09-12 08:00:01.520000+08:00 0.0 8
2016-09-12 08:00:01.520000+08:00 0.0 8
2016-09-12 08:00:01.571000+08:00 51.0 8

The whole process of extracting $\Delta t$ can be done with the helper function epcilon_data(security, expiry, startDate, endDate), through which you can choose any security with any expiry date within a certain date range, as long as we have data for it.

1.2 The reason why we exclude $\Delta t$ = 0 and 1

Our data records the time of each trade in milliseconds. This granularity limitation introduces some inaccuracy into $\Delta t$. First of all, $\Delta t = 0$ does not mean that the two trades happened simultaneously; it simply means that they happened within the same millisecond, so the actual $\Delta t$ can range from 0 to 1 millisecond. Similarly, $\Delta t = 1$ only tells us that the actual $\Delta t$ ranges from 0 to 2 milliseconds, and $\Delta t = 2$ tells us that the actual $\Delta t$ ranges from 1 to 3 milliseconds. In this example, the $\Delta t = 2$ bin loses some trades to $\Delta t = 1$ and mistakenly gains some that belong to $\Delta t = 3$. We will see in the next part that the number of trades decreases as $\Delta t$ increases. These two facts lead to the following conclusions:

First, $\Delta t = 0$ or $1$ gives us the most inaccurate raw data to analyze, so this report uses $\Delta t > 1$.

Second, the actual number of trades may decrease faster than the data suggests. The short simulation below illustrates this truncation effect.
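
A minimal sketch, not part of the report's pipeline; the exponential rate of 13.7 ms is arbitrary and chosen only for illustration:

# Sketch: simulate exponential arrival gaps in milliseconds, floor the
# arrival timestamps to whole milliseconds as the feed does, and compare
# the first few delta-t bins before and after truncation.
import numpy as np

rng = np.random.default_rng(0)
true_dt = rng.exponential(scale=13.7, size=100_000)   # "true" gaps in ms
arrival = np.cumsum(true_dt)                          # arrival times in ms
observed_dt = np.diff(np.floor(arrival))              # gaps after ms truncation

for k in range(4):
    in_true_bin = np.sum((true_dt >= k) & (true_dt < k + 1))
    in_observed_bin = np.sum(observed_dt == k)
    print(k, in_true_bin, in_observed_bin)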

1.3 Historical distribution of $\Delta t$

1.3.1 $\Delta t$ plot

Calculate the frequency of each $\Delta t$. GC_hist is the histogram of $\Delta t$. It is obvious that the number of trades decreases quickly as $\Delta t$ increases.

GC_diff = epcilon_data('GC', 'Z6', '2016-09-12', '2016-11-01')
GC_hist = histo_over1(GC_diff)
print("Table 4")
GC_hist.head()
Table 4
count frequency
diff
2.0 18444 0.032634
3.0 8983 0.015894
4.0 4947 0.008753
5.0 3345 0.005918
6.0 2859 0.005059


The $\Delta t$ plot shows that the frequency decreases at a non-linear rate and that $\Delta t = 2$ has the highest frequency. Based on the shape of the $\Delta t$ plot and the theory of Poisson processes, we suspect that the frequency may be close to an exponential distribution, which describes the time between events in a Poisson process, or to a power law distribution.

We can also see from the graph that there is a spike between 20 and 30 milliseconds. After fitting the distribution, spikes like this one will become our significant epsilon.

print("Figure 1")
GC_hist['frequency'].plot()
plt.ylim(ymin = 0, ymax = 0.04)
plt.xlim(xmin = 0, xmax = 50)
plt.title('GC-Z6-160912-161101-delta t')
plt.xlabel('millisecond')
plt.ylabel('frequency')
Figure 1

1.3.2 qqplot

Furthermore, we use qqplots to get a first idea of the historical distribution. A qqplot is a good way to see whether the sample quantiles and the theoretical quantiles are close to each other; if they are, the points lie on a 45-degree straight line. To be clear, the red line in each plot is the reference line. Since we do not know which power law parameters should be used, we can only produce qqplots against the normal and exponential distributions.

Compare the historical distribution with the normal distribution. To make it easier to read, we also provide a plot of the central part. We can see from the graph that the historical distribution is far away from the normal distribution, which is not a surprise.

print ("Figure 2")
#qqplot_normal
sm.qqplot(GC_hist.iloc[:,0], line='s')
plt.title('GC-Z6-160912-161101-qqplot-normal')
plt.grid(True)

sm.qqplot(GC_hist.iloc[:,0], line='s')
plt.title('GC-Z6-160912-161101-qqplot-normal')
plt.ylim(-100,500)
plt.grid(True)
Figure 2

Compare the historical distribution with the exponential distribution. To make it easier to read, we also provide a plot of the central part. It seems that the historical distribution has a fatter tail than the exponential distribution and that they are not as close as we supposed, which leads us to check more possible distributions.

print("Figure 3")
#qqplot_exponential
sm.qqplot(GC_hist.iloc[:,0], dist='expon', line='s')
plt.title('GC-Z6-160912-161101-qqplot-exponential')
plt.grid(True)

sm.qqplot(GC_hist.iloc[:,0], dist='expon', line='s')
plt.title('GC-Z6-160912-161101-qqplot-exponential')
plt.ylim(0,800)
plt.grid(True)
Figure 3

1.4 Exploration of fitted distribution

In this part, we fit the historical distribution with the exponential distribution, the power law distribution, and a linear combination of the two. The fit is computed by maximizing a log-likelihood function, with a penalty applied for samples outside the range of the distribution. The length of the DataFrame sample shows that we have $857600$ data points in our example, which is a large sample size, and the maximum likelihood method gives asymptotically unbiased estimators as the sample size increases.

len(sample)
857600
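
The helper functions in the appendix rely on scipy.stats for this maximum-likelihood fit. A minimal illustration on synthetic data (not the report's data; the 13.7 ms rate is arbitrary):

# Sketch: scipy's .fit performs the maximum-likelihood estimation described
# above.  Synthetic exponential data is used here purely for illustration.
import numpy as np
from scipy.stats import expon

rng = np.random.default_rng(1)
data = rng.exponential(scale=13.7, size=10_000)
loc_hat, scale_hat = expon.fit(data)   # returns (loc, scale)
print(loc_hat, scale_hat)              # scale_hat should be close to 13.7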

1.4.1 Exponential Distribution

Although we already know that the exponential distribution alone may not describe the historical distribution well, we do not want to rule out the possibility that it plays a role. In SciPy, the probability density function for expon is $expon.pdf(x) = e^{-x}$ for $x \ge 0$.

The probability density above is defined in the “standardized” form. To shift and/or scale the distribution, use the loc and scale parameters. Specifically, $expon.pdf(x, loc, scale)$ is identically equivalent to $expon.pdf(y) / scale$ with $y = (x - loc) / scale$. The expon_fit function is listed in the appendix; it returns the fitted frequency of each $\Delta t$ and the parameters loc and scale.

In our example, the fitted exponential parameters give us $pdf(\Delta t) = \frac{1}{13.74} \, e^{-\frac{\Delta t - 1}{13.74}}$
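
As a quick sanity check (a sketch using the fitted values quoted just below, loc $\approx 1$ and scale $\approx 13.73$), SciPy's loc/scale form agrees with this formula:

# Sketch: verify that scipy's expon.pdf with loc/scale matches the written
# formula (1/scale) * exp(-(dt - loc)/scale), using the fitted parameters
# reported in this section (loc ~ 1, scale ~ 13.73).
import numpy as np
from scipy.stats import expon

loc, scale = 1.0, 13.73
dt = np.array([2.0, 5.0, 20.0])
print(np.allclose(expon.pdf(dt, loc=loc, scale=scale),
                  (1.0 / scale) * np.exp(-(dt - loc) / scale)))   # True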

ExponFit, ExponParam = expon_fit(GC_hist)
print ("Table 5")
ExponFit.head()
Table 5
fitted
diff
2.0 0.067695
3.0 0.062941
4.0 0.058521
5.0 0.054412
6.0 0.050591
ExponParam #loc, scale
(0.99999999916765892, 13.734893532550709)
# MSE of exponential
MSE_exp = np.sum((ExponFit['fitted']-GC_hist['frequency'])**2)/len(GC_hist)
MSE_exp
5.9009389854141781e-07

1.4.2 Power Law Distribution

According to the shape of the $\Delta t$ plot, we suspect it may be fitted well by a power law distribution. The probability density function for powerlaw is $powerlaw.pdf(x, a) = a x^{a-1}$ for $0 \le x \le 1$, $a > 0$; powerlaw takes a as a shape parameter.

The probability density above is defined in the “standardized” form. To shift and/or scale the distribution, use the loc and scale parameters. Specifically, $powerlaw.pdf(x, a, loc, scale)$ is identically equivalent to $powerlaw.pdf(y, a) / scale$ with $y = (x - loc) / scale$.

In our example, the powerlaw parameters give us $pdf(\Delta t) = \frac{1}{199567.82} \left(\frac{\Delta t - 1}{33937}\right)^{0.17 - 1}$, where $\frac{1}{199567.82} = \frac{a}{scale}$.
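
As with the exponential case, a short sketch (using the fitted values quoted below, $a \approx 0.17$, loc $\approx 1$, scale $\approx 33937$) confirms SciPy's loc/scale form agrees with this formula:

# Sketch: check that scipy's powerlaw.pdf with loc/scale matches
# (a/scale) * ((dt - loc)/scale)**(a - 1) for the fitted parameters
# reported in this section (a ~ 0.17, loc ~ 1, scale ~ 33937).
import numpy as np
from scipy.stats import powerlaw

a, loc, scale = 0.17, 1.0, 33937.0
dt = np.array([2.0, 5.0, 20.0])
print(np.allclose(powerlaw.pdf(dt, a, loc=loc, scale=scale),
                  (a / scale) * ((dt - loc) / scale) ** (a - 1)))   # True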

PowerFit, PowerParam = powerlaw_fit(GC_hist)
print ("Table 6")
PowerFit.head()
Table 6
fitted
diff
2.0 0.028849
3.0 0.016229
4.0 0.011592
5.0 0.009130
6.0 0.007586
PowerParam #a, loc, scale
(0.17005246003634178, 0.99999999999999978, 33936.999368579651)
PowerParam[2]/PowerParam[0]
199567.82372526103
# MSE of Powerlaw
MSE_powerlaw = np.sum((PowerFit['fitted']-GC_hist['frequency'])**2)/len(GC_hist)
MSE_powerlaw
2.9399762460128807e-09

1.4.3 linear combination of exponential distribution and power law distribution

We also try a linear combination of the exponential and power law distributions in order to compare the fit of the three methods. The linear regression uses the OLS method, and the function power_expon_fit is defined to return the result of the linear combination. The logic behind this is that we divide all trades into two types, one being high-frequency or algorithmic trading and the other not; if these two types of trading produce different $\Delta t$ distributions, say an exponential distribution and a power law distribution, the actual $\Delta t$ may be better described by a linear combination of the two.
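
Concretely, writing $f_e$ and $f_p$ for the two fitted densities, the model estimated below is

$frequency(\Delta t) = \beta_e \, f_e(\Delta t) + \beta_p \, f_p(\Delta t) + \epsilon(\Delta t)$,

with $\beta_e$ and $\beta_p$ estimated by OLS without an intercept, as in the power_expon_fit function in the appendix.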

The summary below shows that the R-squared can be as high as 0.965, which means that the data is fitted well. The significant coefficients of the two fitted distributions show that both the exponential and the power law distribution are useful for fitting the data. What’s more, the sign of the exponential coefficient is negative while the sign of the power law coefficient is positive. The different signs of the two coefficients are in accordance with our prior hypothesis that the $\Delta t$ of algorithmic and non-algorithmic trading follow different distributions. From the sizes of the coefficients, we further hypothesize that the $\Delta t$ of algorithmic trading may be dominated by the power law distribution and that the $\Delta t$ of non-algorithmic trading may be dominated by the exponential distribution. Besides, among the mean squared errors (MSE) of the three fitted distributions, 5.9e-07 for exponential, 2.94e-09 for power law, and 1.93e-09 for the linear combination, the MSE of the linear combination is the smallest.

Given analysis as above, we decide to use linear combination as our model to fit historical distribution.

Linear Combination for GC Z6:

fitted, param_expon, param_powerlaw, result = power_expon_fit(GC_hist)
print ("Table 7")
fitted.head()
Table 7
expon powerlaw mix
diff
2.0 0.067695 0.028849 0.033188
3.0 0.062941 0.016229 0.017094
4.0 0.058521 0.011592 0.011349
5.0 0.054412 0.009130 0.008411
6.0 0.050591 0.007586 0.006648
print ("Table 8")
result.summary()
Table 8
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              frequency   R-squared:                       0.965
Model:                            OLS   Adj. R-squared:                  0.965
Method:                 Least Squares   F-statistic:                 5.229e+05
Date:                Wed, 26 Apr 2017   Prob (F-statistic):               0.00
Time:                        08:18:15   Log-Likelihood:             3.3036e+05
No. Observations:               38357   AIC:                        -6.607e+05
Df Residuals:                   38355   BIC:                        -6.607e+05
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
expon         -0.0634      0.000   -129.768      0.000        -0.064    -0.062
powerlaw       1.2992      0.002    609.734      0.000         1.295     1.303
==============================================================================
Omnibus:                    60234.471   Durbin-Watson:                   0.261
Prob(Omnibus):                  0.000   Jarque-Bera (JB):       3842824400.477
Skew:                           8.460   Prob(JB):                         0.00
Kurtosis:                    1553.539   Cond. No.                         9.27
==============================================================================
MSE_combo = result.mse_resid
MSE_combo
1.9344471490980199e-09
MSE_exp, MSE_powerlaw, MSE_combo
(5.9009389854141781e-07, 2.9399762460128807e-09, 1.9344471490980199e-09)

Below is a plot of the three fitted series together with the historical data. It also shows that the linear combination is the best fit.

print ("Figure 4")
fig = plt.figure(figsize=(7, 5))
plt.plot(np.log(GC_hist.iloc[:50,1]))
plt.plot(np.log(fitted.iloc[:50,:] ))
plt.grid(True)
plt.title('GC-Z6-160912-161101')
plt.xlabel('millisecond')
plt.ylabel('log(frequency)')
plt.ylim (ymax = 0, ymin = -10)
plt.legend(['data'] + list(fitted.columns), loc='upper right')
Figure 4

We can also apply the linear combination to other securities and expiry dates; the results below show that it works well in these cases too.

Linear Combination for other securities & expiry date:

print ("Figure 5")
fit_plot('CL','G7','2016-11-21','2016-12-19')
Figure 5
expon_fit:  (0.99999999995716227, 4.3911945417615446)
powerlaw_fit:  (0.1868301014118921, 0.99999999999999978, 13527.179416344828)
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              frequency   R-squared:                       0.991
Model:                            OLS   Adj. R-squared:                  0.991
Method:                 Least Squares   F-statistic:                 1.480e+06
Date:                Wed, 26 Apr 2017   Prob (F-statistic):               0.00
Time:                        08:21:04   Log-Likelihood:             2.3435e+05
No. Observations:               27014   AIC:                        -4.687e+05
Df Residuals:                   27012   BIC:                        -4.687e+05
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
expon          0.0625      0.000    181.000      0.000         0.062     0.063
powerlaw       1.1387      0.002    516.676      0.000         1.134     1.143
==============================================================================
Omnibus:                    44221.501   Durbin-Watson:                   0.583
Prob(Omnibus):                  0.000   Jarque-Bera (JB):       4604392909.861
Skew:                           9.185   Prob(JB):                         0.00
Kurtosis:                    2025.458   Cond. No.                         16.3
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
print ("Figure 6")
fit_plot('CL','Z6','2016-10-20','2016-11-18')
Figure 6
expon_fit:  (0.99999999986774735, 17.050170911952957)
powerlaw_fit:  (0.16426097728101163, 0.99999999999999978, 57038.776180856628)
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              frequency   R-squared:                       0.992
Model:                            OLS   Adj. R-squared:                  0.992
Method:                 Least Squares   F-statistic:                 1.744e+06
Date:                Wed, 26 Apr 2017   Prob (F-statistic):               0.00
Time:                        08:21:55   Log-Likelihood:             2.3656e+05
No. Observations:               27416   AIC:                        -4.731e+05
Df Residuals:                   27414   BIC:                        -4.731e+05
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
expon         -0.1445      0.000   -291.242      0.000        -0.145    -0.144
powerlaw       2.5402      0.002   1216.240      0.000         2.536     2.544
==============================================================================
Omnibus:                    21448.955   Durbin-Watson:                   0.201
Prob(Omnibus):                  0.000   Jarque-Bera (JB):        165717149.847
Skew:                           2.038   Prob(JB):                         0.00
Kurtosis:                     383.857   Cond. No.                         8.35
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Part 2

After fitting the historical data with a certain distribution, it is time to analyze the residuals. If the intra-trade arrival time is purely random, the residuals will be white noise. This part identifies the residuals that are not white noise. We will still use GC Z6 as the example.

2.1 $\epsilon$ given fitted distribution

Trading does not happen with equal likelihood in every hour; it happens more frequently in some hours and less frequently in others. This means that intra-trade arrival time may be fitted with different parameters for different hours. So we fit the data of each hour separately and plot the residuals using the function “epcilon_plot”, which is listed in the appendix.
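
For a static view of what the animation below computes, here is a minimal sketch of the per-hour residual calculation for a single hour (hour 8 is chosen arbitrarily), reusing the helper functions from the appendix:

# Sketch: compute epsilon for one hour (Beijing hour 8) by fitting the
# exponential/power-law mixture to that hour's delta-t histogram and
# subtracting the fitted 'mix' column from the historical frequency.
sample = epcilon_data('GC', 'Z6', '2016-09-12', '2016-11-01')
hour8 = histo_over1(sample[sample['hour'] == 8])
fitted, param_expon, param_powerlaw, result = power_expon_fit(hour8)
epsilon_hour8 = hour8['frequency'] - fitted.loc[hour8.index, 'mix']
print(epsilon_hour8.head())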

print ("Figure 7")
epcilon_GCZ6 = epcilon_plot('GC', 'Z6', '2016-09-12', '2016-11-01')

2.2 Identification of significant $\epsilon$ for each hour

We can identify significant $\epsilon$ for each hour from the animated graph above. It is obvious that, in any hour, the first several milliseconds have a higher frequency than the theoretical distribution, and from roughly 5 to 10 milliseconds the frequency is lower than the theoretical distribution. The large deviation from 0 leads us to suspect that there is a lot of algorithmic trading producing very small time lags between trades. During most hours, there are also some significant $\epsilon$ between 10 and 100 milliseconds. A brief comparison of $\epsilon$ across hours shows that there are fewer significant $\epsilon$ from Beijing time 16:00 to 23:59, corresponding to Chicago time 02:00 to 09:59.

The residuals for intra-trade times larger than 10 milliseconds fluctuate around 0, and sometimes there are spikes between 10 and 1000 milliseconds. It seems that most trades happen within 1 second of the last trade and that those spikes can be used to further analyze other participants’ algorithms.

Here is one example of how to analyze $\epsilon$. We assume that algorithmic trading affects intra-trade time; if so, volumes should be higher when $\epsilon$ is higher.

First, we need to calculate the trading quantity for each intra-trade time. Quantity_GCZ6 is a list of series, one per hour, where each series gives the trading quantity for each intra-trade time, as shown in Table 10.

TradeInfo = pd.DataFrame(data={"diff": df.index.to_series().diff().dt.total_seconds().fillna(0)*1000.0,
                          "hour": df.index.hour,
                           "Quantity":df["Quantity"]}) 
print("Table 9")
TradeInfo.head()
Table 9
Quantity diff hour
date
2016-09-12 08:00:01.122000+08:00 2 0.0 8
2016-09-12 08:00:01.520000+08:00 2 398.0 8
2016-09-12 08:00:01.520000+08:00 1 0.0 8
2016-09-12 08:00:01.520000+08:00 1 0.0 8
2016-09-12 08:00:01.571000+08:00 1 51.0 8
Quantity_GCZ6 = []
for i in range(24):
    houri = TradeInfo[TradeInfo["hour"] == i].copy()
    houri = houri.groupby("diff").agg('sum')
    houri = houri["Quantity"]
    houri = houri[houri.index>1].copy()
    Quantity_GCZ6.append(houri)
print ("Table 10")
Quantity_GCZ6[0].head()
Table 10
diff
2.0    1437
3.0     624
4.0     342
5.0     165
6.0     197
Name: Quantity, dtype: int64

Second, we calculate and plot the correlation coefficient between $\epsilon$ and trading quantity. The NumPy array corr records this coefficient for each hour in the following example, and we also visualize it in Figure 8. The correlation coefficients are positive in every hour, which is consistent with our assumption.

corr = np.array([])
for i in range(24):
    corr = np.append(corr, np.corrcoef(Quantity_GCZ6[i], epcilon_GCZ6[i])[1][0])
corr
array([ 0.23777122,  0.25621964,  0.19172442,  0.19952615,  0.28031444,
        0.15936543,  0.2386831 ,  0.20444672,  0.12279202,  0.20533769,
        0.16034664,  0.26799055,  0.20964773,  0.26711179,  0.22605318,
        0.208607  ,  0.14482114,  0.20142458,  0.35109041,  0.11231135,
        0.16494916,  0.17333267,  0.17636492,  0.17753732])
print("Figure 8")
plt.plot(corr)
plt.axhline(corr.mean(), color = 'black')
plt.title('GC-Z6-160912-161101 corrcoef between intratrade time and Quantity')
plt.xlabel('Hour')
Figure 8

Conclusion

If we treat all algorithmic trading as a single system, the intra-arrival times produced by many algorithms may just look like random numbers from some distribution. Among them, we want to identify the major impact of this algorithmic trading. Although we reject our initial hypothesis that the intra-trade arrival time is exponentially distributed, we find, somewhat surprisingly, a good model for the historical frequency: a linear combination of the exponential and power law distributions, in which the power law has the dominant influence. This model also suggests that those trades do not really arrive randomly; instead, they come from some algorithm. By implementing this model, the residuals can be extracted, and the significant residuals become our new objects of analysis.

Improvement

  1. The inaccuracy caused by the millisecond resolution of the data remains.
  2. Other possible distributions have not been tested.
  3. No financial-theoretical support is given for fitting algorithmic-trading intra-trade times with a power law distribution.
  4. The implications for market behaviour of the type of distribution are not discussed.

Appendix

Help Function

# Imports assumed by the helper functions below; Workhorse comes from the
# internal data library used throughout this report.
import numpy as np
import pandas as pd
import scipy.stats
import statsmodels.api as sm
import matplotlib.pyplot as plt
from matplotlib import animation
from IPython.display import HTML

def epcilon_data(security , expiry, startDate, endDate): #Asia/Singapore time after this function
    t = Workhorse()
    t.select_sec(security)
    t.select_exp(expiry,startDate,endDate)
    df = t.to_trades(t.df)
    df.index = df.index.tz_localize('UTC').tz_convert('Asia/Singapore')
    df = df[df['EventType']!=12]
    sample = pd.DataFrame(data={"diff": df.index.to_series().diff().dt.total_seconds().fillna(0)*1000.0,
                          "hour": df.index.hour}) 
    return sample

def histo_over1(df): 
    histo = pd.DataFrame(df.groupby('diff').size().rename('count'))
    histo = histo[histo.index>1].copy()
    histo['frequency'] = histo['count']/np.sum(histo['count'])
    return histo

def powerlaw_fit(df):
    dist = getattr(scipy.stats, 'powerlaw')
    param = dist.fit(df['count'])
    pdf_fitted = dist.pdf(df.index, *param[:-2], loc=param[-2], scale=param[-1])
    fitted = pd.DataFrame(pdf_fitted, index=df.index, columns=['fitted'])
    return fitted, param

def expon_fit(df):
    dist = getattr(scipy.stats, 'expon')
    param = dist.fit(df['count'])
    pdf_fitted = dist.pdf(df.index, *param[:-2], loc=param[-2], scale=param[-1])
    fitted = pd.DataFrame(pdf_fitted, index=df.index, columns=['fitted'])
    return fitted, param

def power_expon_fit(df):
    fit1, param_expon = expon_fit(df)
    fit2, param_powerlaw = powerlaw_fit(df)
    fit_single = pd.DataFrame(fit1)
    fit_single['1'] = fit2
    fit_single.index = df.index
    fit_single.columns = ['expon','powerlaw']
    result = sm.OLS(df['frequency'],fit_single).fit()
    fit_mix = pd.DataFrame(fit_single.dot(result.params), columns = ['mix'])
    fitted = pd.concat([fit_single, fit_mix], axis = 1)
    return (fitted, param_expon, param_powerlaw, result)
def fit_plot(security , expiry, startDate, endDate):
    sample = epcilon_data(security,expiry,startDate,endDate)
    Title = '%s-%s||%s||%s||'%(security,expiry,startDate,endDate)
    if (sample['diff'].max() == 0):
        return 'Sorry, no trade data for %s'%Title 
    histogram = histo_over1(sample)
    
    fig = plt.figure(figsize=(7, 5))
    plt.plot(np.log(histogram.iloc[:50,1]))
    pdf_fitted, param_expon, param_powerlaw, result = power_expon_fit(histogram)

    plt.plot(np.log(pdf_fitted.iloc[:50,:] ))
    plt.grid(True)
    ax = plt.gca()
    plt.title(Title)
    plt.xlabel('millisecond')
    plt.ylabel('log(frequency)')
    plt.ylim (ymax = 0, ymin = -10)
    plt.legend(['data'] + list(pdf_fitted.columns), loc='upper right')
    print ('expon_fit: ', param_expon)
    print ('powerlaw_fit: ', param_powerlaw)
    print (result.summary())
# note that x ranges from 2 ms to 10^8 ms, log-spaced
# fit each hour separately
def epcilon_plot(security, expiry, startDate, endDate):
    
    from IPython.display import display
    #animation function
    def init():
        line.set_data([], [])
        return line,
    def animate(i, data):
        line.set_data(data[i].index, data[i].values)
        Label = 'Beijing %d:00-%d:59' %(i,i)
        line.set_label(Label)
        plt.legend([line], [Label], loc='upper left')
        return line,
    
    #plot data calculation
    sample = epcilon_data(security,expiry,startDate,endDate)
    if (sample['diff'].max() == 0):
        # Title is not defined yet at this point, so build the message directly
        return 'Sorry, no trade data for %s-%s||%s||%s||'%(security,expiry,startDate,endDate)
    epcilon = []
    for i in range(24):
        histo_hour = histo_over1(sample[sample['hour']==i])
        if (histo_hour.empty):
            epcilon.append(histo_hour['frequency'])
        else:
            fitted, param_expon, param_powerlaw, result = power_expon_fit(histo_hour)
            epcilon.append(histo_hour['frequency'] - fitted.loc[histo_hour.index,'mix'])
    
    #animation plot
    fig = plt.figure(figsize=(7,5))
    Title = '%s-%s||%s||%s||'%(security,expiry,startDate,endDate)
    ax = plt.axes(xlim=(2,np.power(10,8)), ylim=(-0.015,0.03))
    ax.set_xscale('log')
    plt.axhline(0, color='k', ls='--', lw=2.0)
    line, = ax.plot(0,0, label = 'Beijing 0:00-0:59' )
    plt.legend(loc='upper left')
    plt.xlabel('10^x millisecond')
    plt.ylabel('epcilon of frequency')
    plt.grid(True)
    plt.title(Title)

    anim = animation.FuncAnimation(fig, animate, init_func=init,
                                   frames = 24, fargs = (epcilon,), interval=400, blit=True, repeat=True)
    chart = HTML(anim.to_html5_video())
    display(chart)
    return epcilon