
Optimizing the ITCH Order Books

It has been a while since I posted something, between laziness and not having anything interesting to report. Well, maybe a lot of negative results, which are interesting and should be published, but see above about laziness.

Recently my friend Gabriel implemented a really cool optimization on the ITCH-5.0 order book I had written, and I thought it should be documented somewhere. His work inspired me to look further into benchmarking, and into how to decide whether a code change has actually improved anything.

But first let’s start with Gabriel’s work. It rests on two observations. First, prices are discrete: they only appear in penny increments (1/100 of a penny for prices below $1.00, but still discrete). The second, and more critical, observation is that most market feed updates happen close to the inside. Gabriel and I wrote a program to confirm this; here are the results, as percentiles of the event depth:

 NSamples   min  p25  p50  p75  p90  p99  p99.9    p99.99       max
489064030     0    0    1    6   14  203   2135  15489331  20009799

One should expect, therefore, that 99% of the events affect a price no more than 203 levels from the inside. How can one use that? Well, building a book is basically keeping a map between prices and quantities (it needs to stay sorted, so hash tables typically do not work). Since prices are basically discrete values, one starts thinking of vectors, arrays, or circular buffers indexed by price level.
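To make the comparison concrete, a minimal sketch of such a map-based book, the baseline the array version is measured against, might look like the following. The names `Price`, `Quantity`, and `map_buy_book` are illustrative only, not the types used in the actual implementation; prices are in the ITCH-5.0 resolution of $0.0001.

#include <cstdint>
#include <map>

// Illustrative sketch only: a map keeps the price levels sorted, so the
// best bid is simply the highest key.  Prices are in units of $0.0001
// (the ITCH-5.0 price resolution), quantities are share counts.
using Price = std::uint64_t;
using Quantity = std::uint64_t;

class map_buy_book {
 public:
  void add(Price px, Quantity qty) { levels_[px] += qty; }
  void reduce(Price px, Quantity qty) {
    auto i = levels_.find(px);
    if (i == levels_.end()) return;          // unknown price, ignore
    i->second = (i->second > qty) ? i->second - qty : 0;
    if (i->second == 0) levels_.erase(i);    // drop empty levels
  }
  // The inside (best bid) is the largest price present in the map.
  Price best_price() const {
    return levels_.empty() ? 0 : levels_.rbegin()->first;
  }

 private:
  std::map<Price, Quantity> levels_;  // sorted by price, O(log N) updates
};

A sell-side book would be identical except that the best price is the smallest key.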

Since most of the changes are close to the inside, what if we kept the quantities in an array, with the inside somewhere close to the center of the array? Most updates would then touch only a handful of elements of this array, and indexing an array is fast! Furthermore, even as the price moves, the change will be “slow”, only affecting a few price levels at a time.

Unfortunately we cannot keep all the prices in an array; in some cases the arrays would need to be far too large. We could explore ways to keep everything in arrays, for example by growing them dynamically, since only a few would ever get very large. But a more robust approach is to simply use a map for anything that does not fit in the array.
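That combination might look something like the sketch below. The 8,192-level window, the first-event centering trick, and the `array_buy_book` name are assumptions made for exposition; the real implementation almost certainly differs in the details.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

using Price = std::uint64_t;
using Quantity = std::uint64_t;

class array_buy_book {
 public:
  explicit array_buy_book(std::size_t window_size = 8192)
      : window_(window_size, 0), base_(0) {}

  void add(Price px, Quantity qty) {
    if (base_ == 0) {
      // Sketch-only simplification: center the window around the first
      // observed price (real ITCH prices are far from zero).
      base_ = px > window_.size() / 2 ? px - window_.size() / 2 : 1;
    }
    if (in_window(px)) {
      window_[px - base_] += qty;   // O(1) update, the common case
    } else {
      overflow_[px] += qty;         // rare: fall back to the sorted map
    }
  }

 private:
  bool in_window(Price px) const {
    return px >= base_ && px - base_ < window_.size();
  }

  std::vector<Quantity> window_;        // quantities for [base_, base_ + size)
  Price base_;                          // lowest price covered by the window
  std::map<Price, Quantity> overflow_;  // prices outside the window
};

Finding the inside can then be done by tracking the index of the best occupied level, or by a short scan near the previous best; either way it stays cheap precisely because most updates land near the inside.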

Naturally prices change during the day, and sometimes the price changes will be so large that the range covered by the array is no longer usable. That is okay: as long as that happens rarely enough, the overall performance will still be better.
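One way to handle that case, continuing the `array_buy_book` sketch above, is to recenter the window around the new inside and redistribute the known levels. Again, this is just an illustration of the idea, not the actual implementation:

  // Continuing the array_buy_book sketch: recenter the window around a new
  // inside price when updates start falling outside the current range.
  void recenter(Price new_inside) {
    // Collect every known level, from both the window and the overflow map.
    std::map<Price, Quantity> all(overflow_);
    for (std::size_t i = 0; i != window_.size(); ++i) {
      if (window_[i] != 0) all[base_ + i] = window_[i];
    }
    // Re-seat the window around the new inside and redistribute the levels.
    std::fill(window_.begin(), window_.end(), 0);
    overflow_.clear();
    base_ = new_inside > window_.size() / 2 ? new_inside - window_.size() / 2 : 1;
    for (auto const& kv : all) {
      if (in_window(kv.first)) {
        window_[kv.first - base_] = kv.second;
      } else {
        overflow_[kv.first] = kv.second;
      }
    }
  }

The recentering pass is relatively expensive, but as long as it is rare its cost is amortized away by the many O(1) updates near the inside.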

Our benchmarks show that this optimization works extremely well in practice. With real data the array-based order book processes its events significantly faster than the map-based book: the total CPU time decreases from 199.5 to 187.5 seconds. But more interestingly, the per-event processing latencies decrease at every percentile up to p99 (times in nanoseconds):

Book Type  min  p10  p25  p50   p75   p90   p99  p99.9  p99.99       max         N
Map        193  345  421  680  1173  1769  2457   3599   13491  65673473  17560405
Array      193  288  332  517   886  1328  2093   5807   12657  63389010  17560405

The difference is even more obvious in a synthetic benchmark:

[Boxplot of iteration latency by book type: the ‘array’ book shows much lower values than ‘map’ from p0 through p75; both distributions have numerous outliers.]

Finally, a teaser: the array book type shows more variability; it is still faster than the map, but with a larger spread between quartiles. We use the less familiar (but more visually pleasing) violin plot to break down the data by book side as well as book type. There is nothing remarkable for the ‘map’ book type, but why are the latencies so different for the two sides of the book in the ‘array’ case? The answer to that and other questions in a future post.

[Violin plot of iteration latency by book type, with separate ‘buy’ and ‘sell’ distributions (distinguished by color) for each of ‘array’ and ‘map’; the ‘array’ distributions show much lower values from p0 to p75, and both book types have numerous outliers.]

Notes

The script used to generate the plots is shown below. The data is available for download, and committed to my GitHub repository. The data was generated with a specific version of the benchmark, on a Google Compute Engine (GCE) virtual machine (2 vCPUs, 13GiB RAM, Ubuntu 16.04 image); additional configuration information is captured in the CSV file, and the system configuration beyond the base image is captured in the Dockerfile.

require(ggplot2)
baseline.file <- 'http://coryan.github.io/public/2017-01-03-bm_order_book.baseline.csv'
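# Read the raw benchmark results; each row is 'testcase,nanoseconds' and
# lines starting with '#' capture the configuration of the run.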
data <- read.csv(
    baseline.file, header=FALSE, col.names=c('testcase', 'nanoseconds'),
    comment.char='#')
data$run <- factor('baseline')
data$microseconds <- data$nanoseconds / 1000.0
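# The test case name encodes the book type and the side as 'booktype:side';
# split it into two separate factors.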
data$booktype <- factor(sapply(
    data$testcase,
    function(x) strsplit(as.character(x), split=':')[[1]][1]))
data$side <- factor(sapply(
    data$testcase,
    function(x) strsplit(as.character(x), split=':')[[1]][2]))

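# Boxplot of iteration latency (in microseconds) by book type.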
ggplot(data=data, aes(x=booktype, y=microseconds)) +
  geom_boxplot(color="blue") +
  ylab("Iteration Latency (us)") +
  xlab("Book Type") +
  theme(legend.position="bottom")
ggsave(filename="2017-01-03-array_vs_map.boxplot.svg",
       width=8.0, height=8.0/1.61)
ggsave(filename="2017-01-03-array_vs_map.boxplot.png",
       width=8.0, height=8.0/1.61)

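# Violin plot of iteration latency by book type, broken down by book side.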
ggplot(data=data, aes(x=booktype, y=microseconds, color=side)) +
  geom_violin() +
  ylab("Iteration Latency (us)") +
  xlab("Book Type") +
  theme(legend.position="bottom")
ggsave(filename="2017-01-03-array_vs_map.violin.svg",
       width=8.0, height=8.0/1.61)
ggsave(filename="2017-01-03-array_vs_map.violin.png",
       width=8.0, height=8.0/1.61)