
On Benchmarking, Part 2

This is a long series of posts where I try to teach myself how to run rigorous, reproducible microbenchmarks on Linux. You may want to start from the first one and learn with me as I go along. I am certain to make mistakes; please write me back in this bug when I do.

In my previous post I discussed a class (array_based_order_book_side<> a/k/a abobs<>) which serves as a motivation to learn how to create really good benchmarks. Because testing the class inside a full program introduces too much variation in the results, and is too slow and too cumbersome, I decided to go with a small, purpose-built benchmark program. In that post I pointed out ([I5]) that any good description of a benchmark must include how the benchmark measures time, why that is a good approach to measure time, and how the benchmark controls for variables that affect the results, such as system load. This post is largely written to address those questions.

I will describe a small framework I built to write benchmarks for JayBeams. Mostly this was born of my frustration at not getting the same results when running the same benchmark twice. If I cannot reproduce the results myself, for the same code, on the same server, what hope do I have of creating reproducible results for others? I need to get the environment where the benchmarks run under control before I have any hope of embarking on a rigorous analysis of them.

It turns out you need to do a lot of fine tuning in the operating system to get consistency out of your benchmarks, and you also need to select your clock carefully. I find this kind of fine tuning interesting in itself, so I am taking the opportunity to talk about it.

Most folks call such small benchmarks a microbenchmark, and I like the name because it makes a clear distinction from large benchmarks as those for database systems (think TPC-E and its cousins). I plan to use that name hereafter.

Anatomy of a Microbenchmark

Most microbenchmarks follow a similar pattern. First, you set up the environment necessary to run whatever it is you are testing. This is similar to how you set up your mocks in a unit test, and in fact you may set up mocks for your benchmark too. In the case of our example (abobs<>) one wants to build a synthetic list of operations before the system starts measuring the performance of the class. Building that list is expensive, and you certainly do not want to include it in the measurement of the class itself; in general, microbenchmarks should not measure or report the time required to set up their test environment.

Then, you run multiple iterations of the test and capture how long it took to run each iteration. How exactly you measure the time is a complicated question; modern operating systems and programming languages offer multiple different ways to measure time. I will discuss which one I picked, and why, later in the post. Sometimes you want to run a number of iterations at the beginning of the benchmark and discard the results from them. The argument is usually that you are interested in the steady state of the system, not what happens while the system is “warming up”. Whether that is sensible or not depends on the context, of course.

At the end most microbenchmarks report some kind of aggregate of the results, typically the average time across all iterations. Most microbenchmarks stop there, though rarely they include additional statistics, such as the minimum, maximum, standard deviation, or the median.

One of my frustrations is that I rarely see any justification for the choice of statistics: why is the mean the right statistic to consider in the conditions observed during the benchmark? How are you going to deal with outliers if you use the mean? Why not median? Does your data show any kind of central tendency? If not, then neither median nor mean are good ideas, so why report them? Likewise, why is the standard deviation the right measurement of statistical dispersion? Is something like the interquartile range a better statistic of dispersion for your tests?

The other mistake that folks often make is to pick a number of iterations because “it seems high enough”. There are good statistical techniques to decide how many iterations are needed to draw valid conclusions, why not use them?

An even worse mistake is to not consider whether the effect observed by your benchmark even makes sense: if your results indicate that option A is better than option B by less than one machine instruction per iteration, do you really think that is meaningful? I think it is not: no matter how many statistical tests prove the difference is real, it has a barely measurable impact. I want rigorous results, I do not want to join “The Cult of Statistical Significance”.

The JayBeams Microbenchmark Framework

In the JayBeams microbenchmark framework the user just needs to provide a fixture template parameter. The constructor of this fixture must set up the environment for the microbenchmark. The fixture::run member function must run the test. The framework takes care of the rest: it reads the configuration parameters for the test from the command line, calls your constructor, runs the desired number of warm-up and test iterations, captures the results, and finally reports all the data.
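To make that concrete, here is a minimal sketch of what a fixture could look like. All the names below (example_fixture, operation, and the synthetic data it builds) are hypothetical, invented for illustration; they are not the actual JayBeams API.

#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

struct operation {
  bool is_add;        // true for add_order(), false for reduce_order()
  std::uint32_t px;   // price level
  std::uint32_t qty;  // quantity
};

class example_fixture {
public:
  // Setup: build the synthetic sequence of operations here, so its cost
  // is not included in the measurements.
  explicit example_fixture(std::size_t size) {
    std::mt19937_64 gen(42);
    std::uniform_int_distribution<std::uint32_t> px(1, 10000);
    ops_.reserve(size);
    for (std::size_t i = 0; i != size; ++i) {
      ops_.push_back(operation{i % 4 != 0, px(gen), 100});
    }
  }

  // One iteration of the code under test, timed by the framework.
  void run() {
    std::uint64_t checksum = 0;
    for (auto const& op : ops_) {
      checksum += op.is_add ? op.px : op.qty;  // stand-in for the real work
    }
    volatile std::uint64_t sink = checksum;  // keep the loop from being optimized away
    (void)sink;
  }

private:
  std::vector<operation> ops_;
};

The framework would then construct the fixture once and call run() for the configured number of warm-up and measured iterations.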

The time measurements for each iteration are captured in memory, since you do not want to contaminate the performance of the test with additional I/O. All the memory necessary to capture the results is allocated before the test starts, because I do not want to contaminate the arena either.

Reporting the Results

I chose to make no assumptions in the JayBeams microbenchmark framework as to what are good statistics to report for a given microbenchmark. The choice of statistic depends on the nature of the underlying distribution, and the statistical tests that you are planning to use. Worse, even if I knew the perfect statistics to use, there are some complicated numerical and semi-numerical algorithms involved. Such tasks are best left to specialized software, such as R, or if you are a Python fan, Pandas.

In general, the JayBeams microbenchmark framework will dump all the results to stdout, and expects you to give them to a script (I use R) to perform the analysis there. However, sometimes you just want quick results to guide the modify-compile-test cycles.

The microbenchmark framework also outputs a summary of the results. This summary includes: the number of iterations, the minimum time, the maximum time, and the p25, p50, p75, p90 and p99.9 percentiles of the time across all iterations. BTW, I picked up p99.9 as a notation for “the 99.9th percentile”, or the even worse “the 0.999 quantile of the sample”, in one of my jobs; I am not sure who invented it, and I think it is not very standard, but it should be. The choice of percentiles is based on the fact that most latency measurements are skewed to the right (so we have more percentiles above 90% than below 10%), but the choice is admittedly arbitrary. The system intentionally omits the mean, because the distributions rarely have any central tendency, which is what the mean intuitively represents, and I fear folks would draw wrong conclusions if it were included.
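As an illustration of the kind of summary computation involved, a percentile such as p99.9 can be obtained from the captured samples with a partial sort. This is only a sketch using the nearest-rank definition; the real framework may use a different interpolation rule.

#include <algorithm>
#include <cstddef>
#include <vector>

// Nearest-rank percentile of the captured per-iteration times, e.g.
// percentile(samples, 99.9) for "p99.9".  Takes the vector by value so
// the caller's sample order is preserved.
double percentile(std::vector<double> samples, double pct) {
  if (samples.empty()) return 0.0;
  auto rank = static_cast<std::size_t>(pct / 100.0 * (samples.size() - 1) + 0.5);
  std::nth_element(samples.begin(), samples.begin() + rank, samples.end());
  return samples[rank];
}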

Clock Selection

I mentioned that there are many clocks available in C++ on Linux, and promised to tell you how I chose one. The JayBeams microbenchmark framework uses std::chrono::steady_clock. This is a guaranteed monotonic clock; its resolution depends on your hardware, but any modern x86 computer is likely to have an HPET circuit with at least 100ns resolution. The Linux kernel can also drive the time measurements using the TSC register, which has sub-nanosecond resolution (but many other problems). In short, this is a good clock for a stopwatch (monotonic), and while the resolution is not guaranteed to be sub-microsecond, it is likely to be. That meets my requirements, but why not any of the alternatives?
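The measurement loop then looks more or less like the following sketch (simplified, not the actual framework code): the clock is read before and after each iteration, and the raw durations are kept in a pre-allocated vector and dumped at the end.

#include <chrono>
#include <cstdio>
#include <vector>

int main() {
  using clock = std::chrono::steady_clock;
  int const iterations = 1000;
  std::vector<clock::duration> samples;
  samples.reserve(iterations);  // pre-allocate, results stay in memory until the end

  for (int i = 0; i != iterations; ++i) {
    auto start = clock::now();
    // ... the fixture's run() member function would be called here ...
    samples.push_back(clock::now() - start);
  }

  // Dump the raw results, to be analyzed by an external script (e.g. R).
  for (auto d : samples) {
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(d);
    std::printf("%lld\n", static_cast<long long>(ns.count()));
  }
  return 0;
}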

getrusage(2): this system call returns the resource utilization counters that the system tracks for every process (and in some systems, each thread). The counters include CPU time, system time, page faults, context switches, and many others. Using CPU time instead of wall clock time is good, because the amount of CPU used should not change while the program is waiting to be scheduled. However, the precision of getrusage is too low for my purposes: traditionally it was updated 100 times a second, but even on modern Linux kernels the counters are only incremented around 1,000 times per second [1] [2]. So at best you get millisecond resolution, while the improvements I am trying to measure may be a few microseconds. This system call would introduce measurement errors many times larger than the effects I want to measure, and therefore it is not appropriate for my purposes.

std::chrono::high_resolution_clock: so C++11 introduced a number of different clocks, and this one potentially has a higher resolution than std::chrono::steady_clock. That is good, right? Unfortunately, high_resolution_clock is not guaranteed to be monotonic: it might go back in time, or some seconds may be shorter than others. I decided to check, maybe I was lucky and it was actually monotonic on my combination of operating system and compilers. No such luck: in all the Linux implementations I tried, this clock is based on clock_gettime(CLOCK_REALTIME,...) [3], which is subject to changes in the system clock, such as ntp adjustments. So this one is rejected because it does not make for a good stopwatch.

clock_gettime(2): this is the underlying function used in the implementation of std::chrono::steady_clock. One could argue that using it directly would be more efficient; however, the C++ classes around it add very little overhead, and offer a much superior interface. Candidly, I would have ended up writing a wrapper around this call, and my wrapper would have been worse than the one provided by the standard, so why bother?

gettimeofday(2) is a POSIX call with similar semantics to clock_gettime(CLOCK_REALTIME, ...). Even the POSIX standard no longer recommends using it [4], and recommends clock_gettime instead. So this one is rejected because it is basically obsolete, and it is also not monotonic, so not a good stopwatch.

time(2) only has second resolution, and it is not monotonic. Clearly not adequate for my purposes.

rdtscp / rdtsc: Okay, this is the one that all the low-level programmers go after. It is an x86 instruction that essentially returns the number of ticks since the CPU started. You cannot ask for lower overhead than “single instruction”, right? I have used this approach in the past, but you do need to calibrate the counter; it is never correct to just take the count and divide it by the clock rate of your CPU. But there are a number of other pitfalls too. Furthermore, clock_gettime is implemented using vDSO, which greatly reduces the overhead of these system calls. My little benchmark indicates that the difference between them is about 30ns (that is nanoseconds) on my workstation. In my opinion, its use is no longer justified on modern Linux systems; it carries a lot of extra complexity that you only need if you are measuring things in the nanosecond range. I may need it eventually, if I start measuring computations that take less than one microsecond, but until I do I think std::chrono::steady_clock is much easier to use.

System Configuration

Running benchmarks on a typical Linux workstation or server can be frustrating because the results vary so much. Run the test once and it takes 20ms; run it again and it takes 50ms; run it a third time and it takes 25ms; run it again and you get 45ms. Where is all this variation coming from? And how can we control it? I have found that you need to control at least the following to get consistent results: (1) the scheduling class for the process, (2) the percentage of the CPU reserved for non real-time processes, (3) the CPU frequency scaling governor in the system, and (4) the overall system load.

I basically tested all the different combinations of these parameters, removing the combinations that produced bad results until I found the one (or the few) that works well. Well, when I say “all combinations” I do not quite mean that: there are 99 different real-time priorities, do I really need to test all of them? What I actually mean is:

scheduling class: I ran the microbenchmark in both the default scheduling class (SCHED_OTHER), and at the maximum priority in the real-time scheduling class (SCHED_FIFO). If you want a detailed description of the scheduling classes and how they work I recommend the man page: sched(7). A sketch showing how to request the real-time class from a program appears after these descriptions.

non-real-time CPU reservation: this one is not as well known, so a brief intro: real-time tasks can starve the non-real-time tasks if they go haywire. That might be Okay, but they can also starve the interrupts, and that can be a “Bad Thing”[tm]. So by default the Linux kernel is configured to reserve a percentage of the CPU for non-real-time workloads. Most systems set this to 5%, but it can be changed by writing into /proc/sys/kernel/sched_rt_runtime_us [5]. For benchmarks this is awful: if your system has more cores than the benchmark is going to use, why not run with 0% reserved for the non-real-time tasks? So I tried both a 0% and a 5% reservation for non-real-time tasks to see how this affects the predictability of the results when using real-time scheduling (it should make no difference when running in the default scheduling class, so I skipped that case).

CPU frequency scaling governor: modern CPUs can change their frequency dynamically to tradeoff power efficiency against performance. The system provides different governors that offer distinct strategies to balance this tradeoff. I ran the tests with both the ondemand CPU governor, which attempts to increase the CPU frequency as soon as it is needed, and with the performance CPU frequency governor which always runs the CPU at the highest available frequency [6].

system load: we all know that performance results are affected by external load, so to simulate the effect of a loaded vs. idle system I ran the tests with the system basically idle (called ‘unloaded’ in the graphs). Then I repeated the benchmarks while running N processes, one for each core, each of them trying to consume 100% of a core.
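Here is the sketch promised above for the scheduling class: a process can request SCHED_FIFO at the maximum priority with sched_setscheduler(2). This is only an illustration; it requires CAP_SYS_NICE or an rtprio entry in /etc/security/limits.conf, and the JayBeams driver script and framework may do this differently.

#include <sched.h>
#include <cerrno>
#include <cstdio>
#include <cstring>

// Move the calling process to SCHED_FIFO at the maximum priority.
// Returns false (and prints the error) if the caller lacks permission.
bool set_max_realtime_priority() {
  sched_param param{};
  param.sched_priority = sched_get_priority_max(SCHED_FIFO);
  if (sched_setscheduler(0 /* this process */, SCHED_FIFO, &param) != 0) {
    std::fprintf(stderr, "sched_setscheduler: %s\n", std::strerror(errno));
    return false;
  }
  return true;
}

int main() {
  if (!set_max_realtime_priority()) return 1;
  // ... run the benchmark iterations here ...
  return 0;
}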

Finally, for all possible combinations of the configuration parameters and load described above, I ran the microbenchmark four times. I actually used mbobs<> in these benchmarks, but the results apply regardless of what you are testing. Let’s look at the pretty graph:

A series of boxplot graphs showing how the performance varies with scheduling parameters, load, and the system frequency scaling governor used.

The first thing that jumps at you is the large number of outliers in the default scheduling class when the system is under load. That is not a surprise: the poor benchmark is competing with the load generator, and they both run at the same priority. We probably do not want to run benchmarks on loaded systems in the default scheduling class; we kind of knew that, but now it is confirmed.

We can also eliminate the ondemand governor when using the real-time scheduling class. When there is no load in the system the results are quite variable under this governor. However it seems that it performs well when the system is loaded. That is weird, right? Well if you think about it, under high system load the ondemand governor pushes the CPU frequency to its highest value because the load generator is asking for the CPU all the time. That actually improves the consistency of the results because when the benchmark gets to run the CPU is already running as fast as possible. In effect, running with the ondemand governor under high load is equivalent to running under the performance governor under any load.

Before Going Further: Fix the Input

Okay, so the ondemand governor is a bad idea, and running in the default scheduling class with high load is a bad idea too. The following graph shows different results for the microbenchmark when the system is always configured to use the performance CPU frequency governor, and excluding the default scheduling class when the system is under load:

A series of boxplot graphs showing how the performance varies with the PRNG seed selection and system load.

I have broken down the times by the different inputs to the test: either using the same seed for the pseudo-random number generator (PRNG) used to create the synthetic sequence of operations we have discussed, labeled fixed, or using /dev/urandom to initialize the PRNG, labeled urandom, of course. Obviously some of the difference in execution time can be attributed to the input, but there are noticeable differences even when the input is always the same.

Notice that I do not recommend running the benchmarks for abobs<> with a fixed input. We want the class to work more efficiently for a large class of inputs, not simply for the one we happen to measure with. We are fixing the input in this post in an effort to better tune our operating system.

Effect of the System Load and Scheduling Limits

By now this has become a quest: how to configure the system and test to get the most consistent results possible? Let’s look at the results again, but removing the ondemand governor, and using the same input in all tests:

A series of boxplot graphs showing how the performance varies with system load and the real-time scheduling system-wide limits.

Okay, we knew from the beginning that running on a loaded system was a bad idea. We can control the outliers with suitable scheduling parameters, but there is a lot more variation between the first and third quartiles (those are the ends of the boxes in these graphs, btw) with load than without it.

I still have not examined the effect, if any, of that /proc/sys/kernel/sched_rt_runtime_us parameter:

A series of boxplot graphs showing how the performance varies with system load and the real-time scheduling system-wide limits.

The effects seem to be very small, and hard to see in the graph. Time for some number crunching: I chose the interquartile range (IQR) because it is robust and non-parametric; it captures 50% of the data no matter what the distribution or the outliers are:

Governor Scheduling Class Real-time Scheduling Limits IQR
ondemand default N/A 148
performance default N/A 89
performance FIFO default (95%) 40
performance FIFO unlimited 28

That is enough to convince me that there is a clear benefit to using all the settings I have discussed above. Just for fun, here is how bad those IQRs were to start with, even when restricted to the same seed for all runs:

loaded seed scheduling governor Interquartile Range
loaded fixed default ondemand 602
loaded fixed rt:default ondemand 555
loaded fixed default performance 540
loaded fixed rt:default performance 500
loaded fixed rt:unlimited performance 492
loaded fixed rt:unlimited ondemand 481

I believe the improvement is quite significant; I will show in a future post that this is the difference between having to run a few hundred iterations of the test vs. tens of thousands of iterations to obtain enough statistical power in the microbenchmark.

I should mention that all these pretty graphs and tables are compelling, but do not have sufficient statistical rigor to draw quantitative conclusions. I feel comfortable asserting that changing these system configuration parameters has an effect on the consistency of the performance results. I cannot assert with any confidence what is the size of the effect, or whether the results are statistically significant, or to what population of tests they apply. Those things are possible to do, but distract us from the objective of describing benchmarks rigorously.

Summary

The system configuration seems to have a very large effect on how consistent are your benchmark results. I recommend you run microbenchmarks on the SCHED_FIFO scheduling class, at the highest available priority, on a lightly loaded system, where the CPU frequency scaling governor has been set to performance, and where the system has been configured to dedicate up to 100% of the CPU to real-time tasks.

The microbenchmark framework automatically sets all these configuration parameters. Well, at the moment I use a driver script to set the CPU frequency scaling governor and to change the CPU reservation for non-real-time tasks. I prefer to keep this in a separate script because otherwise one needs superuser privileges to run the benchmark. Setting the scheduling class and priority is fine: you only need the right capabilities via /etc/security/limits.conf. The script is small and easier to inspect for security problems; in fact, it just relies on sudo, so a simple grep finds all the relevant lines. If you do not like these settings, the framework can be configured to not set them. It can also be configured to either ignore errors when changing the scheduling parameters (the default), or to raise an exception if setting any of the scheduling parameters fails.

I think you should use std::chrono::steady_clock if you are running C++ microbenchmarks on Linux. Using rdtsc is probably the only option if you need to measure things in the [100,1000] nanoseconds range, but there are many pitfalls and caveats; read about them online before jumping into coding.

Even with all this effort to control the consistency of the benchmark results, and even with a very simple, purely CPU bound test as used in this post the results exhibit some variability. These benchmarks live in a world where only rigorous statistics can help you draw the right conclusions, or at least help you avoid the wrong conclusions.

As I continue to learn how to run rigorous, reproducible microbenchmarks in Linux I keep having to pick up more and more statistical tools. I would like to talk about them in my next post.

Future Work

There are a few other things about locking down the system configuration that I left out and should not go unmentioned. Page faults can introduce latency in the benchmark, and can be avoided by using the mlockall(2) system call. I do not think these programs suffer too much from it, but changing the framework to make this system call is not too hard and sounds like fun.
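For reference, this is roughly what that change would look like; it is just the mlockall(2) call described above, not something the framework does today.

#include <sys/mman.h>
#include <cerrno>
#include <cstdio>
#include <cstring>

int main() {
  // Lock all current and future pages into RAM to avoid page faults
  // during the measurement loop.
  if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
    std::fprintf(stderr, "mlockall: %s\n", std::strerror(errno));
    return 1;
  }
  // ... run the benchmark iterations here ...
  munlockall();
  return 0;
}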

Similar to real-time CPU scheduling, the Linux kernel offers facilities to schedule I/O operations at different priorities via the ioprio_set(2) system call. Since these microbenchmarks perform no I/O I am assuming this will not matter, but it possibly could. I should extend the framework to make these calls, even if only when some configuration flag is set.

I have not found any evidence that setting the CPU affinity for the process (also known as pinning) helps in this case. It might do so, but I have no pretty graph to support that. It also wreaks havoc with non-maskable interrupts when the benchmark runs at higher priority than the interrupt priority.

In some circles it is popular to create a container restricted to a single core and move all system daemons to that core. Then you run your system (or your microbenchmark) in any of the other cores. That probably would help, but it is so tedious to setup that I have not bothered.

Linux 4.7 and higher include schedutil, a CPU frequency scaling governor based on information provided by the scheduler [7]. Initial reports indicate that it performs almost as well as the performance governor. For our purposes the performance governor is a better choice, as we are willing to forgo the power efficiency of an intelligent governor in favor of the predictability of our results.

My friend Gonzalo points out that we assume std::chrono::steady_clock has good enough resolution; we should at least check if this is the case at runtime and warn if necessary. Unfortunately, there is no guarantee in C++ as to the resolution of any of the clocks, nor is there an API to expose the expected resolution of the clock in the language. This means any check on the resolution must be platform specific. On Linux, clock_getres(2) seems promising at first, but it turns out to always return 1ns for all the “high-resolution” clocks, regardless of their actual resolution. I do not have, at the moment, a good idea on how to approach this problem beyond relying on the documentation of the system.
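For completeness, this is the kind of check I mean. It assumes std::chrono::steady_clock is backed by CLOCK_MONOTONIC, which is the case in the Linux implementations I use, and as noted it reports 1ns regardless of the actual hardware resolution, so it is of limited value:

#include <time.h>
#include <cstdio>

int main() {
  // Query the advertised resolution of CLOCK_MONOTONIC, the clock that
  // typically backs std::chrono::steady_clock on Linux.
  timespec res;
  if (clock_getres(CLOCK_MONOTONIC, &res) != 0) {
    std::perror("clock_getres");
    return 1;
  }
  std::printf("advertised resolution: %ld.%09ld seconds\n",
              static_cast<long>(res.tv_sec), res.tv_nsec);
  return 0;
}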

Notes

The data in this post was generated using a shell script, available here; it should be executed in a directory where you have compiled JayBeams.

Likewise, the graphs for this post were generated using an R script, which is also available.

The detailed configuration of the server hardware and software used to run these scripts is included in the comments of the csv file that accompanies this post.

Updates

I completely reworded this post; the first version read like a not-very-good paper. Same content, just a better tone, I think. I beg forgiveness from my readers: I am still trying to find a good style for blogs.

On Benchmarking, Part 1

My previous post included a performance comparison between two implementations of a data structure. I have done this type of comparison many times, and have grown increasingly uncomfortable with the lack of scientific and statistical rigor we use in our field (I mean software engineering, or computer science, or computer engineering, or any other name you prefer) when reporting such results.

Allow me to criticize my own post to illustrate what I mean. The post reported on the results of benchmarking two implementations of a data structure.

I1: I did not provide any context: what domain does this problem arise in? Why is this data structure interesting at all?

I2: I did not provide a description of the problem the data structure is trying to solve. Maybe there is a much better solution that I am not aware of; a reader may point that potential solution out to me, and all the analysis would be a waste of time.

I3: I did not provide any justification for why the data structure efficiency would be of interest to anybody.

I4: I did not include a description of the data structure. With such a description the reader can understand why it is likely that the data structure will perform better. Better yet, they can identify the weaknesses in the data structure, and evaluate whether the analysis considers those weaknesses.

I assumed that a reader of this blog, where I largely write about JayBeams, would be familiar with these topics, but that is not necessarily the case. The gentle reader may land on this post as the result of a search, or because a friend recommends a link. In any case, it would take a very persistent reader to find the actual version of the JayBeams project used to prepare the post, and that information was nowhere to be found either.

Furthermore, and this is less forgivable, my post did not answer any of these other questions:

I5: How exactly is the benchmark constructed? How does it measure time? Why is that a good choice for time measurement? What external variables, such as system load, impact the results, and how did I control for them?

I6: What exactly is the definition of success and do the results meet that definition? Or in the language of statistics: Was the effect interesting or too small to matter?

I7: What exactly do I mean by “faster”, or “better performance”, or “more efficient”? How is that operationalized?

I8: At least I was explicit in declaring that I only expected array_based_order_book_side<> to be faster for the inputs that one sees in a normal feed, but that it was worse with suitably constructed inputs. However, this leaves other questions unanswered: How do you characterize those inputs for which it is expected to be faster? Even from the brief description, it sounds like there is a very large space of such inputs. Why do I think the results apply for most of the acceptable inputs if I only tested with a few? How many inputs would you need to sample to be confident in the results?

I9: How unlucky would I have to be to miss the effect if it is there? Or if you prefer: how likely is it that we will detect the effect if it is there? In the language of statistics: Did the test have enough statistical power to measure what I wanted to measure? Is that number of iterations (a/k/a samples) high enough? For that matter, how many iterations of the test did I run?

I10: How confident am I that the results cannot be explained by luck alone? Or in the language of statistics: Was the result statistically significant? At what level?

I11: I reported the median latency (and a number of other percentiles): why are those statistics relevant or appropriate for this problem?

I12: How could anybody reproduce these tests? What was the hardware and software platform used for it? What version of JayBeams did I use for the test? What version of the compiler, libraries, kernel, etc. did I use?

If those questions are of no interest to you, then this is not the series of posts that you want to read. If you think those questions are important, or they pique your interest, or you think they may be applicable to similar benchmarks you have performed in the past, or plan to perform in the future, then I hope to answer them in this series of posts.

I propose to measure the performance improvements of array_based_order_book_side<> vs. map_based_order_book_side<> as an example of how to rigorously answer those questions. Later it will become evident that this example is too easy, so I will also measure some small improvement to array_based_order_book_side<> as a further example of how to rigorously benchmark and compare code changes. As I go along, I will describe the pitfalls that I try to avoid, and how I avoid them. And finally, I will report the results in enough detail that readers can decide if they agree with my conclusions.

It is possible, indeed likely, that readers more knowledgeable than myself will find fault in my methods, my presentation, or my interpretation of the results. They may be able to point out other pitfalls that I did not consider, or descriptions that are imprecise, or places where my math was wrong, or even simple spelling or grammar mistakes. If so, I invite them to enter their comments into a bug I have created for this purpose.

The Problem of Building an Order Book

In this section I will try to address the issues raised in I1, I2, and I3, and give the reader some context about the domain where the problem of building a book arises, what specifically this problem is, and why it is important to solve it very efficiently.

If you are familiar with market feeds and order books you may want to skip to the next section; it will be either boring or irritating to you. If you are not familiar, then this is a very rushed overview; there is an earlier post with a slightly longer introduction to the topic, but the content here should suffice.

Market data feeds are network protocols used to describe features of interest in an exchange (or more generally a trading venue, but that distinction is irrelevant here). These protocols are often custom to each exchange, though their specifications are public, and have large differences in the information they provide, as well as in the actual protocol used. Having said that, I have found that many of them can be modeled as a stream of messages that tell you whether an order was added, modified, or deleted in the exchange. The messages include a lot of information about the order, but here we will just concern ourselves with the most salient bits, namely: the side of the order – is this an order to buy or sell securities; the price of the order – the maximum (for a buy) or minimum (for a sell) price that the investor placing the order will accept; and the quantity of the order – the number of shares (or more generally units) the investor is willing to trade.

With this stream of messages one can build a full picture of all the activity in the exchange: how many orders are active at any point, at what prices, how much liquidity is available at each price point, which orders arrived first, sometimes “where is my order in the queue”, and much more. The process of building this picture of the current state of the exchange is called Building the Book, and your Book is the data structure (or collection of data structures) that maintains that picture. Some of the most common questions your book must be able to answer are: what is the price of the highest buy order active right now in a security? And also: how many total shares are available at that best buy price for that security? And obviously: what about the best sell price and the number of shares at that best price for that security?

If those are all the questions you are going to ask, and this is often the case, you only need to keep a tally of how many shares are available to buy at each price level, and how many shares are available to sell at each price level. The tally must be sorted by price (in opposite orders for buys and sells), because when the best price level disappears (all active orders at the best price are modified or canceled) you need to quickly find the next best price. You must naturally keep a different instance of such a tally for each security; after all, you are being asked about the best price for GOOG, not for MSFT.

The challenge is that these feeds can have very high throughput requirements: it is not rare to see hundreds of thousands of messages in a second. That would not be too bad if one could shard the problem across multiple machines, or at least multiple cores. Alas! this is not always the case; for example, the Nasdaq ITCH-5.0 feed does not include symbol information in either the modify or the delete messages, so one must process such messages in order until the order they refer to is found and the symbol is identified, at which point one may shard to a different server or core.

Furthermore, the messages are not nicely and evenly spaced in these feeds. In my measurements I have seen that up to 1% of the messages arrive less than 300ns after the previous one. Yes, that is nanoseconds. If you want to control the latency of your book builder to the p99 level – and you almost always want to in the trading space – it is desirable to completely process 99% of the messages before the next one arrives. In other words, you roughly have 300ns to process most messages.

Suffice it to say that the data structures involved in processing a book must be extremely efficient to deal with peak throughput without introducing additional latency.

Detailed Design

In this section I will try to address the issue identified in (I4) and give a more detailed description of the classes involved, so the reader can understand the tradeoffs they make and any weaknesses they might have. If the reader finds this section too detailed they can skip it.

The Map Based Order Book

In JayBeams, a relatively straightforward implementation of the order book tally I described above is provided by map_based_order_book_side<> (hereafter mbobs<>) [1]. It is implemented using a std::map<>, indexed by price and containing the total quantity available for that price.

The class has two operations that are used to process all add, modify, and delete messages from the market feed. reduce_order() processes both delete and modify messages, because in ITCH-5.0 all modifications reduce the size of an order. Naturally, add_order() processes messages that create a new order. The class provides member functions to access the best price and quantity, which simply call begin() and rbegin() on the internal std::map<> member variable.

Therefore, finding the best and worst prices are O(1) operations, as C++ makes this complexity guarantee for the begin() and rbegin() member functions. Likewise, the add_order() and reduce_order() member functions are O(log(N)) on the number of existing price levels, since those are the complexity guarantees for the find and erase member functions. Alternatively, we can bound this with O(log(M)) where M is the number of previous orders received, as each order creates at most one price level.
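A heavily simplified sketch of the idea, for the buy side only, may help; this is not the JayBeams class, just an illustration of the std::map<> approach and its complexity:

#include <algorithm>
#include <cstdint>
#include <map>
#include <utility>

// Simplified buy-side tally: price (in ticks) -> total quantity.
class map_buy_side_sketch {
public:
  void add_order(std::uint64_t px, std::uint64_t qty) {
    levels_[px] += qty;  // O(log N) lookup or insert
  }
  void reduce_order(std::uint64_t px, std::uint64_t qty) {
    auto i = levels_.find(px);  // O(log N)
    if (i == levels_.end()) return;
    i->second -= std::min(i->second, qty);
    if (i->second == 0) levels_.erase(i);  // drop empty price levels
  }
  // Best buy price and its quantity: the highest key, O(1) via rbegin().
  // Precondition: the side is not empty.
  std::pair<std::uint64_t, std::uint64_t> best() const {
    return *levels_.rbegin();
  }

private:
  std::map<std::uint64_t, std::uint64_t> levels_;
};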

The Array Based Order Book

In the previous post, I made a number of observations on the characteristics of a market feed: (1) I reported that market data messages exhibit great locality of reference, with the vast majority of add/modify/delete messages affecting prices very close to the current best price, (2) prices only appear in discrete increments, which makes them suitable to represent as integers, (3) integers that appear in a small range can be used as indices into an array, and (4) access to such arrays is O(1), and also makes efficient use of the cache. One can try to exploit all of this by keeping an array for the prices contiguous with the best price. Because the total range of prices is rather large, we cannot keep all prices in an array, but we can fall back on the std::map<> implementation for the less common cases.

My friend Gabriel used all the previous observations to design and implement array_based_order_book_side<> (hereafter abobs<>). The class offers an identical interface to mbobs<>, with add_order() and remove_order() member functions, and with member functions to access the best and worst available prices.

The implementation is completely different though. The class maintains two data structures: (1) a vector representing the best (highest, in the case of buy) prices, and (2) a std::map<> – which is typically some kind of balanced tree – to represent the prices that do not fit in the vector. In addition, it maintains tk_begin_top, the tick level of the first price in the vector, and tk_inside, the index of the best price. All elements in the vector past that index have zero values. The vector size is initialized when a class instance is constructed, and does not change dynamically.

A diagram representing the abobs<> template class.

The value of tk_inside changes when a new order with a better price than the current best is received. Analyzing the complexity of this is better done by cases:

Complexity of add_order()

There are basically three cases:

  1. We expect the majority of the calls to the add_order() member function to affect a price that is close to the current tk_inside. In this case the member function simply updates one of the values of the array, an O(1) operation.
  2. Sometimes the new price will be past the capacity assigned to the array. In this case the current values in the array need to be flushed into the map. This is an O(K) operation, where K is the capacity assigned to the array, because all K insertions happen at the same location, and std::map::emplace_hint is guaranteed to be amortized constant time.
  3. Sometimes the price is worse than tk_begin_top, in which case the complexity is O(log(N)), the same as the map-based class.

Complexity of reduce_order()

There are also basically three cases:

  1. We also expect the majority of the calls to the reduce_order() member function to affect a price that is close to the current tk_inside. As long as the new total quantity is not zero, the function simply updates one of the values of the array, an O(1) operation.
  2. If the new quantity is zero the class needs to find the new best price. In most cases this is a short search backwards through the array. But if the search through the array is exhausted, the implementation needs to move vector.size()/2 prices from the map to the vector. Again, this is an O(vector.size()) operation, because erasing a range of prices in the map is guaranteed to be linear on the number of elements erased.
  3. If the price is worse than tk_begin_top then this is similar to the map-based class and the complexity is O(log(N)).

Summary of the Complexity

In short, we expect abobs<> to behave as an O(1) data structure for the majority of the messages. This expectation is based on receiving very few messages affecting prices far away from the inside. In those rare cases the class actually behaves worse than mbobs<>.
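To make the add_order() cases above concrete, here is a heavily reduced sketch of the fast path for a buy side. The real abobs<> also re-centers the array, flushing it into the map or backfilling from it, which is omitted here:

#include <cstdint>
#include <map>
#include <vector>

class array_buy_side_sketch {
public:
  explicit array_buy_side_sketch(std::size_t capacity)
      : top_(capacity, 0), tk_begin_top_(0), tk_inside_(0) {}

  void add_order(std::uint64_t px_ticks, std::uint64_t qty) {
    if (px_ticks < tk_begin_top_) {
      bottom_[px_ticks] += qty;  // uncommon: far below the inside, O(log N)
      return;
    }
    if (px_ticks >= tk_begin_top_ + top_.size()) {
      // Uncommon: past the array capacity.  The real class flushes top_
      // into the map and re-centers the window, an O(K) operation.
      return;
    }
    // Common case: price close to the inside, O(1) array update.
    auto idx = static_cast<std::size_t>(px_ticks - tk_begin_top_);
    top_[idx] += qty;
    if (idx > tk_inside_) tk_inside_ = idx;  // new best price
  }

private:
  std::vector<std::uint64_t> top_;                 // quantities near the best price
  std::map<std::uint64_t, std::uint64_t> bottom_;  // everything far from the inside
  std::uint64_t tk_begin_top_;                     // tick level of top_[0]
  std::size_t tk_inside_;                          // index of the best price in top_
};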

Benchmark Design

I tried running an existing program, such as itch5inside, and measuring how its performance changes with each implementation. I ran into a number of problems:

  1. The program needs to read the source data from a file; this file is about 6 GiB compressed, so any execution of the full program is also measuring the performance of the I/O subsystem.

  2. The program needs to decompress the file, parse it, and filter the events by symbol, directing different events to different instances of array_order_book_side<>. So once more, an execution of the full program is measuring many other components beyond the one I want to optimize.

  3. The program takes around 35 minutes to process a day’s worth of market data. We expect that multiple runs will be needed to “average out” the variation caused by all the operating system noise, so completing a test could take hours.

Such long elapsed times will severely limit my ability to iterate quickly. More importantly, running the full program is poor experimental design. There are too many variables that we will need to control: the I/O subsystem, the size of the buffer cache, any programs that are competing for that cache, any changes to the decompression and/or I/O libraries, etc.

I would prefer to run an experiment that isolates only the component that I want to measure. I do not see any way to do this other than using a synthetic benchmark. Admittedly this is a proxy for the performance of the overall program, but at least I can iterate quickly and test the big program again at the end.

This is more or less standard practice, and I do not think too controversial.

Obtaining Representative Data for the Benchmark

The tricky bit here is that the abobs<> class was specifically designed to take advantage of the distribution characteristics of the input data. I need to create test data that has this locality property. And it must hit some kind of Goldilocks medium: not so much locality of prices that it unfairly favors the abobs<> design, but not so little that it unfairly disadvantages that design either.

I thought about two alternatives for this problem:

  • I could use traces from the existing data set. Somehow save a trace, load it into memory and run the something-something-based-book against this data. However, we would need to capture multiple traces to ensure we have a representative sample of the data. And I have a limited amount of market data at my disposal, so this is all very limiting.

  • I can try to generate data with similar statistical distribution. The data may not have the same fidelity of a trace, but it can be easily varied by regenerating the data.

Because it is easier to set up and run the test with synthetic data, I chose the second approach. In addition to the lack of fidelity, there may be characteristics of the data that I did not think about that make the class behave differently in production. I think I am addressing those limitations by planning to run the full test separately.

By the way, if the reader disagrees with my assumptions about how this (and other) decisions impact the validity of the results, please do let me know in the bug. I am trying to be super transparent about my assumptions, if I am doing my work right you should be able to reproduce the experiment with different assumptions, show why I am wrong, and we both learn something. That sounds suspiciously close to the scientific method.

Next Up

The next post will include a more detailed description of the benchmark, to test these classes.

Notes

In this post I have set all links to a specific version (eabc035fc23db078e7e6b6adbc26c08e762f37b3) of the JayBeams project. Links to the current version are more succinct, and future readers may find that bugs in the old code have been fixed in newer versions. We prefer to use links to fixed versions because it makes the references and links reproducible, partially addressing the problem I highlighted earlier.

Updates: Gabriel caught a mistake in my complexity analysis; the O() bounds were correct, but I was playing fast and loose with the constant factors. Updates: Major rewording so I do not sound like a pompous ass.

Optimizing the ITCH Order Books

It has been a while since I posted something: laziness, and not having anything interesting to report. Well, maybe a lot of negative reports, which are interesting and should be published, but see above about laziness.

Recently my friend Gabriel implemented a really cool optimization on the ITCH-5.0 order book I had implemented, which I thought should be documented somewhere. His work inspired me to look further into benchmarking and how to decide if a code change has actually improved anything.

But first let’s start with Gabriel’s work. Two observations: first, prices are discrete, they only show in penny increments (1/100 of a penny for prices below $1.00, but still discrete). The second, and more critical, observation is that most market feed updates happen close to the inside. Gabriel and I wrote a program to confirm this; here are the results, as percentiles of the event depth:

NSamples min p25 p50 p75 p90 p99 p99.9 p99.99 max
489064030 0 0 1 6 14 203 2135 15489331 20009799

One should expect, therefore, that 99% of the events affect a price no more than 203 levels from the inside. How can one use that? Well, building a book is basically keeping a map (it needs to be sorted, so hashes typically do not work) between prices and quantities. Since prices are basically discrete values, one starts thinking of vectors, or arrays, or many circular buffers indexed by price levels.

Since most of the changes are close to the inside, what if we kept an array with the quantities, with the inside somewhere close to the center of the array? Most updates would happen in a few elements of this array, and indexing arrays is fast! Furthermore, even as the price moves, the change will be “slow”, only affecting a few price levels at a time.

Unfortunately we cannot keep all the prices in an array, the arrays would need to be too large in some cases. We could explore options to keep everything in arrays, for example, one could grow the arrays dynamically and only a few will be very large. But a more robust approach is to just use a map for anything that does not fit the array.

Naturally prices change during the day, and sometimes the price changes will be so large that the range in the array is no longer usable. That is Okay: as long as that happens rarely enough, the overall performance will be better.

Our benchmarks show that this optimization works extremely well in practice. With real data the array-based order book can process its events significantly faster than the map-based book; the total CPU time decreases from 199.5 seconds to 187.50 seconds. But more interestingly, processing latencies per event are reduced up to the p99 level (times in nanoseconds):

Book Type min p10 p25 p50 p75 p90 p99 p99.9 p99.99 max N
Map 193 345 421 680 1173 1769 2457 3599 13491 65673473 17560405
Array 193 288 332 517 886 1328 2093 5807 12657 63389010 17560405

More interestingly, using a synthetic benchmark we can see that the difference is very obvious:

A boxplot: the X axis is labeled 'Book Type', showing 'array' and 'map' labels for two boxplots. The Y axis is labeled 'Iteration Latency'. The array boxplot shows much lower values from p0 to p75. Both boxplots have numerous outliers.

Finally, a teaser: the array book type shows more variability: still faster than map, but with more difference between quartiles. We use the less familiar (but more visually pleasing) violin plot to break down the data by book side as well as book type. There is nothing remarkable for the ‘map’ book type, but why are latencies so different for different sides of the book in the ‘array’ case? The answer to that and other questions in a future post.

A violin plot: the X axis is labeled 'Book Type', showing 'array' and 'map' labels; each label has two plots, distinguished by color, one for 'buy' and another for 'sell'. The Y axis is labeled 'Iteration Latency'. The 'array' plots show much lower values from p0 to p75, and both book types have numerous outliers.

Notes

The script used to generate the plots is shown below. The data is available for download, and committed to my github repository. This data was generated with a specific version of the benchmark, on a Google Compute Engine (GCE) virtual machine (2 vCPUs, 13GiB RAM, Ubuntu 16.04 image); additional configuration information is captured in the CSV file. System configuration beyond the base image is captured in the Dockerfile.

require(ggplot2)
# Load the raw benchmark results; each row is 'testcase,nanoseconds'.
baseline.file <- 'http://coryan.github.io/public/2017-01-03-bm_order_book.baseline.csv'
data <- read.csv(
    baseline.file, header=FALSE, col.names=c('testcase', 'nanoseconds'),
    comment.char='#')
data$run <- factor('baseline')
data$microseconds <- data$nanoseconds / 1000.0
# The testcase field is encoded as 'booktype:side'; split it into two factors.
data$booktype <- factor(sapply(
    data$testcase,
    function(x) strsplit(as.character(x), split=':')[[1]][1]))
data$side <- factor(sapply(
    data$testcase,
    function(x) strsplit(as.character(x), split=':')[[1]][2]))

ggplot(data=data, aes(x=booktype, y=microseconds)) +
  geom_boxplot(color="blue") +
  ylab("Iteration Latency (us)") +
  xlab("Book Type") +
  theme(legend.position="bottom")
ggsave(filename="2017-01-03-array_vs_map.boxplot.svg",
       width=8.0, height=8.0/1.61)
ggsave(filename="2017-01-03-array_vs_map.boxplot.png",
       width=8.0, height=8.0/1.61)

ggplot(data=data, aes(x=booktype, y=microseconds, color=side)) +
  geom_violin() +
  ylab("Iteration Latency (us)") +
  xlab("Book Type") +
  theme(legend.position="bottom")
ggsave(filename="2017-01-03-array_vs_map.violin.svg",
       width=8.0, height=8.0/1.61)
ggsave(filename="2017-01-03-array_vs_map.violine.png",
       width=8.0, height=8.0/1.61)

Computing the Inside and Per Symbol Statistics

I have been silent for a while as I was having fun writing code. I finally implemented a simple (and hopelessly slow) book builder for the ITCH-5.0 feed. With it we can estimate the event rate for changes to the inside, as opposed to all the changes in the book that do not affect the best bid and offer.

This data is interesting as we try to design our library for time delay estimation, because it is the number of changes at the inside that will (or may) trigger new delay computations. We are also interested in breaking down the statistics per symbol, because we have to perform time delay analysis for only a subset of the symbols in the feed, and it is trivial to parallelize the problem across different symbols on different servers.

The code is already available on github.com; the tool is called itch5inside, clearly I lack imagination. It outputs the inside changes as an ASCII file (which can be compressed on the fly), so we can use it later as a sample input into our analysis. Optionally, it also outputs the per-symbol statistics to stdout. I have made the statistics available on this very blog in case you want to analyze them.

In this post we will just make some observations about the expected message rates. First we load the data; you can set public.dir to https://coryan.github.io/public to download the data directly from the Internet. Then we create a separate data set that does not include the aggregate statistics.

raw <- read.csv(paste0(public.dir, '/NASDAQ-ITCH.csv'))
symbols <- subset(raw, Name != '__aggregate__')

It is a more or less well-known fact that some symbols are much more active than others. For example, if we plot a histogram of the number of events at the inside against the number of symbols with that many messages we get:

require(ggplot2)
range <- range(symbols$NSamples)
bw <- (range[2] - range[1]) / 300
ggplot(data=symbols, aes(x=NSamples)) + geom_histogram(binwidth=bw) +
  theme(legend.position="bottom") +
  ylab("Number of Symbols") +
  xlab("Total Event Count")

A histogram plot.  The X axis is labeled 'Total Event Count' and varies from 0 to almost 3,000,000.  The Y axis is labeled 'Number of Symbols' and varies from 0 to 5000.  The highest values are at the beginning, and the values drop in a seemingly exponential curve.

That graph appears exponential at first sight, but one can easily be fooled; this is what goodness-of-fit tests were invented for:

require(MASS)
fit <- fitdistr(symbols$NSamples, "geometric")
ref <- rgeom(length(symbols$NSamples), prob=fit$estimate)
control <- abs(rnorm(length(symbols$NSamples)))
chisq.test(symbols$NSamples, p = ref/sum(ref))
	Chi-squared test for given probabilities

data:  symbols$NSamples
X-squared = 6330400000, df = 8171, p-value < 2.2e-16

Warning message:
In chisq.test(symbols$NSamples, p = ref/sum(ref)) :
  Chi-squared approximation may be incorrect

That p-value is extremely low, so we reject the Null Hypothesis: the symbols$NSamples values do not come from the geometric distribution that best fits the data. Also notice that your values might be different; in fact, they should be different, as the reference data was generated at random. We can also visualize the distributions in log scale to make the mismatch more apparent:

df.ref <- data.frame(count=log(ref, 10))
df.ref$variable <- 'Reference'
df.actual <- data.frame(count=log(symbols$NSamples, 10))
df.actual$variable <- 'Actual'
df <- rbind(df.ref, df.actual)
ggplot(data=df, aes(x=count, fill=variable)) +
  geom_histogram(alpha=0.5) +
  scale_fill_brewer(palette="Set1") +
  theme(legend.position="bottom") +
  ylab("Number of Symbols") +
  xlab("Total Event Count")

Two overlapping histograms. The X axis is labeled 'Total Event Count' and varies from 0 to 6. The Y axis is labeled 'Number of Symbols' and varies from 0 to 2,000. The histogram labeled 'Reference' has a much higher peak than the histogram labeled 'Actual'.

Clearly not the same distribution, but using math to verify this makes me feel better. In any case, whether the distribution is geometric or not does not matter much; clearly there is a small subset of the securities that have much higher message rates than the others. We want to find out if any security has such high rates as to make it unusable, or very challenging.

The itch5inside tool collects per-symbol message rates; actually, it collects message rates at different timescales (second, millisecond, and microsecond), as well as the distribution of processing times to build the book, and the distribution of time in between messages (which I call “inter-arrival time”, because I cannot think of a cooler name). Well, the tool does not exactly report the distributions; it reports key quantiles, such as the minimum, the first quartile, and the 90th percentile. The actual quantiles vary per metric, but for our immediate purposes maybe we can get the p99.9 of the event rate at the millisecond timescale. A brief apology for the p99.9 notation: we used this shorthand in a previous job for “the 99.9th percentile”, and I got used to it.

Let’s plot the p99.9 events/msec rate against the total number of messages per second:

ggplot(data=raw, aes(x=NSamples, y=p999RatePerMSec)) +
  geom_point(alpha=0.5) +
  theme(legend.position="bottom") +
  ylab("Events/msec @ p99.9") +
  xlab("Total Event Count")

A scatter plot.  The X axis is labeled 'Total Event Count' and varies from 0 to 200,000,000.  The Y axis is labeled 'Events/msec @ p99.9' and varies from 0 to 250.  All the points but one are clustered around the origin.

There is a clear outlier. Argh, the itch5inside tool also reports the metrics aggregated across all symbols; that must be the problem:

top5.raw <- raw[head(order(raw$NSamples, decreasing=TRUE), n=5),]
top5.raw[,c('Name', 'NSamples')]
          Name  NSamples
8173 __aggregate__ 206622013
6045           QQQ   2604819
7814           VXX   2589224
6840           SPY   2457820
3981           IWM   1638988

Of course it was, let’s just plot the symbols without the aggregate metrics:

ggplot(data=symbols, aes(x=NSamples, y=p999RatePerMSec)) +
  geom_point(alpha=0.5) +
  theme(legend.position="bottom") +
  ylab("Events/msec @ p99.9") +
  xlab("Total Event Count")

A scatter plot.  The X axis is labeled 'Total Event Count' and varies from 0 to around 2,500,000.  The Y axis is labeled 'Events/msec @ p99.9' and varies from 0 to 10.  All the points but one are clustered around the origin.

Finally it would be nice to know what the top symbols are, so we filter the top50 and plot them against this data:

top50 <- symbols[head(order(symbols$NSamples, decreasing=TRUE), n=50),]
ggplot(data=symbols, aes(x=NSamples, y=p999RatePerMSec)) +
  geom_point(alpha=0.5, size=1) +
  geom_text(data=top50, aes(x=NSamples, y=p999RatePerMSec, label=Name),
            angle=45, color="DarkRed", size=2.8) +
  theme(legend.position="bottom") +
  ylab("Events/msec @ p99.9") +
  xlab("Total Event Count")

A scatter plot.  The X axis is labeled 'Total Event Count' and varies from 0 to 2,500,000.  The Y axis is labeled 'Events/msec @ p99.9' and varies from 0 to 10.  The top points are labeled VXX, SPY, and QQQ.

What Does this Mean?

We have established that inside changes are far more manageable: we probably need to track fewer than the top 50 symbols, and those do not seem to see more than 10 messages in a millisecond very often. We still need to be fairly efficient; in those peak milliseconds we might have as little as 100 microseconds to process a single event.

What Next?

I did write an FFTW time delay estimator and benchmarked it; the results are recorded in the github issue. On my (somewhat old) workstation it takes 27 microseconds to analyze a sequence of 2,048 points, which I think would be the minimum for market data feeds. Considering that we need to analyze 4 signals per symbol (bid and offer, price and size for each), and we only have a 100 microsecond budget per symbol (see above), we find ourselves without any headroom, even if we did not have to analyze 20 or more symbols at a time to monitor all the different channels that some market feeds use.

So we need to continue exploring GPUs as a solution to the problem. I recently ran across VexCL, which may save me a lot of tricky code writing. VexCL uses template expressions to create complex kernels and optimize the computations at a high level. I think my future involves benchmarking this library against some pretty raw OpenCL pipelines to see how those template expressions do in this case.

Where I Try to Justify the Complexity I am About to Introduce

Before we start worrying about computing cross-correlations in GPUs we would like to have a well-tested version using the CPU. It is much easier to debug these, and they serve as reference data when testing the GPU version.

I chose the FFTW library because it is freely available and it is a well-known and highly regarded FFT implementation. It is a C library, which requires us to write some wrappers to avoid common programming errors.

My goal is to implement a simple example showing how to estimate the delay between two simple signals, something like this would be close to ideal:

std::size_t nsamples = ...;
std::size_t actual_delay = ...;
jb::fftw::delay_estimator estimator = ...;
auto signal_a = create_test_signal(nsamples);
auto signal_b = delay_signal(signal_a, actual_delay);

double estimated_delay = estimator.handle(signal_a, signal_b);
// assert std::abs(estimated_delay - actual_delay) < something_small

We want the estimator to be some kind of class because there is a need to keep state. For example, most FFT libraries store “execution plans” for the transforms. Likewise, we anticipate (or know, because I have prototyped this) that an OpenCL-based estimator would need to keep buffers to pass data to the GPU, the computational kernels that will be sent to the GPU, and many other data structures.

Finally, we expect that real-time estimators will combine the estimates obtained by analyzing across multiple securities and/or across different attributes – bid vs. offer for example – of the same security.

We also expect that an estimator may have multiple different compute units at its disposal. For example, my workstation at home has 3 distinct GPU devices: two discrete video cards and an APU which combines a CPU and a GPU on the same die. Such a configuration may be uncommon, but it is common to have multiple cores in the same server, and it would be desirable to schedule some of the FFT transforms on different cores to increase the processing throughput.

You Are Not Going To Need It

Despite the challenging throughput numbers in a market feed, we have not yet definitively concluded that this level of complexity is necessary. FFT, after all, stands for “Fast Fourier Transform”; why worry about GPUs, multicore programming, etc.?

I think I need to gather two pieces of information to demonstrate this complexity is required. First, we have established the throughput requirements for a feed such as ITCH-5.0, but that feed includes all changes in the Nasdaq book, across all symbols and across all price levels. One can reasonably argue that the throughput requirements for a single security (or a small number of securities) would be sufficiently low to estimate the time delay in real-time. And even if that does not reduce the throughput enough, the inside prices and quantities provide some more relief.

Alas! Experience shows that most changes do occur in the inside quantities. And, while the number of messages in a single security is hardly ever as high as the messages in all securities, some securities have message rates several orders of magnitude higher than others. This wide variation in message rates can be used to our advantage, as we can select securities that are not so active as to slow down our system, but active enough to provide a constant signal to the delay estimator.

Furthermore, experience shows that typical latencies for the consolidated feeds vary between one and eight milliseconds. To secure a long enough baseline for the delay estimator we will need at least 16 milliseconds of data. We need to determine the right value for our sampling rate. Higher sampling rates should produce more accurate results, but increase the computational costs.

For example, if we chose to take one sample per microsecond, each FFT would require at least N log(N) operations, where N is the number of samples in the window. A cross-correlation requires at least three FFTs, so we are looking at a lower bound of 680,000 operations. Assuming your server or workstation can handle 3 billion operations per second, and that we can use all these operations, the time delay estimation would require something in the order of 229 microseconds. That could easily prove too expensive if we want to read just 1% of the ITCH-5.0 market feed.
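For concreteness, here is the back-of-the-envelope arithmetic behind those numbers, assuming the 16 milliseconds of data mentioned above are padded to N = 16,384 samples (one sample per microsecond, rounded up to a power of two; the exact sample count is my assumption):

$$ N \log_2 N = 16{,}384 \times 14 \approx 2.3 \times 10^{5} \text{ operations per FFT} $$

$$ 3 \times 2.3 \times 10^{5} \approx 6.9 \times 10^{5} \text{ operations}, \qquad \frac{6.9 \times 10^{5}}{3 \times 10^{9} \text{ ops/s}} \approx 230 \ \mu\text{s} $$

which is consistent with the 680,000 operations and 229 microseconds quoted above.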

None of these figures are to be taken as extremely accurate predictions of performance. They just illustrate that the problem is not trivial and will require significant complexity to manage the computations efficiently.

What is Next?

We need to implement four different programs or components: (1) we need a way to compute the inside for the ITCH-5.0 feed (issue #8), (2) we need a way to compute per-symbol throughput statistics of just the inside changes (issue #9), (3) we need to prototype the time delay estimation based on cross-correlations using the GPU (issue #10), and (4) we need to prototype the time delay estimation based on cross-correlations and the FFTW library (issue #7).

I have decided to start with the last one, because I have been doing too much market data processing lately and too much posting lately.