Computing the Inside and Per Symbol Statistics
14 Sep 2015

I have been silent for a while as I was having fun writing code. I finally implemented a simple (and hopelessly slow) book builder for the ITCH-5.0 feed. With it we can estimate the event rate for changes to the inside, as opposed to all the changes in the book that do not affect the best bid and offer.
This data is interesting as we design our library for time delay estimation, because it is the number of changes at the inside that will (or may) trigger new delay computations. We are also interested in breaking down the statistics per symbol: we have to perform time delay analysis for only a subset of the symbols in the feed, and it is trivial to parallelize the problem across different symbols on different servers.
The code is already available on github.com; the tool is called itch5inside (clearly I lack imagination).
It outputs the inside changes as an ASCII file (which can be compressed on the fly), so we can use it later as sample input to our analysis. Optionally, it also outputs the per-symbol statistics to stdout. I have made the statistics available on this very blog in case you want to analyze them. In this post we will just make some observations about the expected message rates.
First we load the data. You can set public.dir to https://coryan.github.io/public to download the data directly from the Internet, and then create a separate data set that does not include the aggregate statistics.
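The loading step can be sketched as follows. This is a Python illustration, not the blog's actual analysis code; the sample rows, the "Symbol" and "NSamples" column names, and the "__aggregate__" marker for the all-symbols row are all assumptions standing in for the real file under public.dir.

```python
import csv
import io

# Hypothetical sample shaped like the itch5inside per-symbol statistics.
# The real data is a (possibly compressed) file under public.dir; the
# column names and the "__aggregate__" marker are assumptions.
raw = """Symbol,NSamples
__aggregate__,1000000
AAPL,250000
MSFT,180000
XYZ,1200
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Create a separate data set without the aggregate statistics.
symbols = [r for r in rows if r["Symbol"] != "__aggregate__"]

print(len(symbols))  # per-symbol rows only
```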
It is a more or less well-known fact that some symbols are much more active than others. For example, if we plot a histogram of the number of events at the inside against the number of symbols with that many messages, we get:
That graph appears exponential at first sight, but one can easily be fooled; this is what goodness-of-fit tests were invented for:
That p-value is extremely low, so we must reject the null hypothesis: the symbol$NSamples values do not come from the geometric distribution that best fits the data.
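The idea behind the test can be sketched in a few lines of Python. This is only an illustration on synthetic, heavy-tailed data (most symbols quiet, a few very active), not the actual test or data from the post: we fit a single geometric distribution by maximum likelihood and compute a Pearson chi-squared statistic against it, which comes out enormous.

```python
import math
import random

random.seed(42)

def geometric(p):
    # Inverse-transform sampling for a geometric distribution on 1, 2, ...
    u = 1.0 - random.random()  # u in (0, 1]
    return int(math.log(u) / math.log(1.0 - p)) + 1

# Heavy-tailed synthetic sample: most symbols are quiet, a few are very
# active.  This mimics the shape of symbol$NSamples without the real data.
sample = [geometric(0.5) if random.random() < 0.9 else geometric(0.005)
          for _ in range(5000)]

# Maximum-likelihood fit of a single geometric: p_hat = 1 / mean.
p_hat = len(sample) / sum(sample)

# Pearson chi-squared statistic against the fitted geometric, pooling
# everything past bucket K into one tail bucket.
n, K = len(sample), 8
chi2 = 0.0
for k in range(1, K + 1):
    observed = sum(1 for x in sample if x == k)
    expected = n * p_hat * (1.0 - p_hat) ** (k - 1)
    chi2 += (observed - expected) ** 2 / expected
observed_tail = sum(1 for x in sample if x > K)
expected_tail = n * (1.0 - p_hat) ** K
chi2 += (observed_tail - expected_tail) ** 2 / expected_tail

print("p_hat=%.3f chi2=%.1f" % (p_hat, chi2))
```

A chi-squared statistic this far above its degrees of freedom (K here) corresponds to a vanishingly small p-value, so the single-geometric model is rejected.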
Also notice that your values might be different; in fact, they should be different, as the reference data was generated at random. We can also visualize the distributions in log scale to make the mismatch more apparent:
Clearly they are not the same distribution, but using math to verify this makes me feel better. In any case, whether the distribution is geometric or not does not matter much: clearly there is a small subset of the securities with much higher message rates than the rest. We want to find out if any security has rates so high as to make it unusable, or very challenging.
The itch5inside tool collects per-symbol message rates; actually, it collects message rates at different timescales (second, millisecond, and microsecond), as well as the distribution of processing times to build the book, and the distribution of time in between messages (which I call “inter-arrival time”, because I cannot think of a cooler name). Well, the tool does not exactly report the distributions; it reports key quantiles, such as the minimum, the first quartile, and the 90th percentile.
The actual quantiles vary per metric, but for our immediate purposes we can use the p99.9 of the event rate at the millisecond timescale. A brief apology for the p99.9 notation: we used this shorthand at a previous job for “the 99.9th percentile”, and I got used to it.
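To make the metric concrete, here is a small Python sketch of how a per-millisecond p99.9 event rate can be computed: bucket event timestamps by millisecond, then take the 99.9th percentile of the bucket counts. The timestamps are synthetic; itch5inside computes its quantiles while parsing the feed, not from a post-hoc array like this.

```python
import random
from collections import Counter

random.seed(0)

# Synthetic event timestamps, in microseconds, over one simulated second.
timestamps = [random.randrange(1_000_000) for _ in range(50_000)]

# Bucket by millisecond and count events per bucket.
per_ms = Counter(ts // 1000 for ts in timestamps)
counts = sorted(per_ms.values())

def quantile(sorted_xs, q):
    # Nearest-rank quantile; good enough for a sketch.
    idx = min(len(sorted_xs) - 1, int(q * len(sorted_xs)))
    return sorted_xs[idx]

print("median=%d p99.9=%d events/msec"
      % (quantile(counts, 0.5), quantile(counts, 0.999)))
```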
Let’s plot the p99.9 events/msec rate against the total number of messages per second:
There is a clear outlier. Argh, the itch5inside tool also reports the metrics aggregated across all symbols; that must be the problem:
Of course it was. Let’s just plot the symbols without the aggregate metrics:
Finally, it would be nice to know what the top symbols are, so we filter the top 50 and plot them against this data:
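The "top 50" selection itself is just a sort by message count. A minimal Python sketch, with made-up symbols and counts standing in for the NSamples column of the itch5inside output:

```python
# Made-up per-symbol totals; the real ranking uses the NSamples column.
stats = {
    "AAPL": 250_000, "MSFT": 180_000, "SPY": 310_000,
    "XYZ": 1_200, "ABC": 90_000,
}

top_n = 3  # the post uses 50; 3 keeps this toy example readable
top = sorted(stats, key=stats.get, reverse=True)[:top_n]
print(top)  # prints ['SPY', 'AAPL', 'MSFT']
```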
What Does this Mean?
We have established that inside changes are far more manageable than the full feed: we probably need to track fewer than the top 50 symbols, and those do not seem to see more than 10 messages in a millisecond very often. We still need to be fairly efficient; in those peak milliseconds we may have as little as 100 microseconds to process a single event.
What Next?
I did write an FFTW-based time delay estimator and benchmarked it; the results are recorded in the github issue. On my (somewhat old) workstation it takes 27 microseconds to analyze a sequence of 2,048 points, which I think would be the minimum for market data feeds. Considering that we need to analyze 4 signals per symbol (price and size, for both bid and offer), and that we only have a 100 microsecond budget per symbol (see above), we find ourselves without any headroom, even before we consider analyzing 20 or more symbols at a time to monitor all the different channels that some market feeds use.
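For readers unfamiliar with the technique, the estimator's core idea is cross-correlation computed via FFT: the lag that maximizes the circular cross-correlation of the two signals is the delay estimate. Here is a toy pure-Python version of that idea (the real benchmark used FFTW; a hand-rolled radix-2 FFT stands in for it here):

```python
import cmath
import random

def fft(x, inverse=False):
    # Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two.
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out

def estimate_delay(a, b):
    # Circular cross-correlation r[k] = sum_n a[n+k] * b[n], computed as
    # IFFT(FFT(a) * conj(FFT(b))); the argmax of r is the delay estimate.
    n = len(a)
    fa, fb = fft(a), fft(b)
    prod = [x * y.conjugate() for x, y in zip(fa, fb)]
    corr = [c.real / n for c in fft(prod, inverse=True)]
    return max(range(n), key=lambda k: corr[k])

random.seed(1)
n, delay = 256, 17
base = [random.gauss(0, 1) for _ in range(n)]
shifted = [base[(i - delay) % n] for i in range(n)]

delay_hat = estimate_delay(shifted, base)
print(delay_hat)  # prints 17, the injected delay
```

The FFTW version is the same computation with production-quality transforms; the 27 microseconds quoted above is for 2,048-point sequences, not this 256-point toy.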
So we need to continue exploring GPUs as a solution to the problem. I recently ran across VexCL, which may save me a lot of tricky code-writing: VexCL uses template expressions to create complex kernels and optimize the computations at a high level. I think my future involves benchmarking this library against some fairly raw OpenCL pipelines to see how those template expressions perform in this case.