This is a long series of posts where I try to teach myself how to
run rigorous, reproducible microbenchmarks on Linux. You may
want to start from the first one
and learn with me as I go along.
I am certain to make mistakes; please write back to me in
this bug when
I do.
In my previous post I discussed a class
(array_based_order_book_side<> a/k/a abobs<>)
which serves as a motivation to learn how to create really good
benchmarks.
Because testing the class inside a full program introduces too much
variation in the results, is too slow, and is too cumbersome, I
decided to go with a small purpose-built benchmark program.
In that post I pointed out ([I5]) that any good description
of a benchmark must include how the benchmark measures time, why
that is a good approach to measure time, and how the benchmark
controls for variables that affect the results, such as system load.
This post is largely written to address those questions.
I will describe a small framework I built to write
benchmarks for
JayBeams.
Mostly this was born of my frustration at not getting the same
results when running the same benchmark twice.
If I cannot reproduce the results myself, for the same code, on the
same server, what hope do I have of creating reproducible results for
others?
I need to get the environment where the benchmarks run under control
before I have any hope of embarking on a rigorous analysis of them.
It turns out you need to do a lot of fine tuning in the operating
system to get consistency out of your benchmarks, and you also
need to select your clock carefully.
I find this kind of fine tuning interesting in itself, so I am taking
the opportunity to talk about it.
Most folks call such a small benchmark a
microbenchmark,
and I like the name because it makes a clear distinction from large
benchmarks such as those for database systems (think TPC-E and its
cousins).
I plan to use that name hereafter.
Anatomy of a Microbenchmark
Most microbenchmarks follow a similar pattern.
First, you setup the environment necessary to run whatever it is you
are testing.
This is similar to how you setup your mocks in a unit test, and in
fact you may setup mocks for your benchmark too.
In the case of our example (abobs<>) one wants to build a
synthetic list of operations before the system starts measuring the
performance of the class.
Building that list is expensive, and you certainly do not want to include
it in the measurement of the class itself; in general,
microbenchmarks should not measure or report the time required
to set up their test environment.
Then, you run multiple iterations of the test and capture how long it
took to run each iteration.
How exactly you measure the time is a
complicated question, modern operating systems and programming
languages offer multiple different ways to measure time.
I will discuss which one I picked, and why, later in the post.
Sometimes you want to run a number of iterations at the beginning of
the benchmark, and discard the results from them.
The argument is usually that you are interested in the steady state of
the system, not what happens while the system is “warming up”.
Whether that is sensible or not depends on the context, of course.
At the end most microbenchmarks report some kind of aggregate of the
results, typically the average time across all iterations.
Most microbenchmarks stop there, though rarely they include additional
statistics, such as the minimum,
maximum, standard deviation, or the median.
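To make the pattern concrete, here is a minimal sketch of such a loop in C++. The names are purely illustrative, they are not the JayBeams code:

```cpp
#include <chrono>
#include <vector>

// A sketch of the typical microbenchmark structure: setup, warmup,
// timed iterations, and a dump of the raw results.
template <typename Fixture>
std::vector<std::chrono::steady_clock::duration> run_benchmark(
    int warmup_iterations, int iterations) {
  Fixture fixture;  // setup happens in the constructor, it is not timed

  for (int i = 0; i != warmup_iterations; ++i) {
    fixture.run();  // discard results while the system "warms up"
  }

  std::vector<std::chrono::steady_clock::duration> results;
  results.reserve(iterations);  // allocate before measuring
  for (int i = 0; i != iterations; ++i) {
    auto start = std::chrono::steady_clock::now();
    fixture.run();
    results.push_back(std::chrono::steady_clock::now() - start);
  }
  return results;  // aggregate and report elsewhere
}
```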
One of my frustrations is that I rarely see any justification for the
choice of statistics:
why is the mean the right statistic to consider in the
conditions observed during the benchmark?
How are you going to deal with outliers if you use the mean?
Why not median?
Does your data show any kind of
central tendency?
If not, then neither median nor mean are good ideas, so why report them?
Likewise, why is the standard deviation the right measurement of
statistical dispersion?
Is something like the
interquartile range
a better statistic of dispersion for your tests?
The other mistake that folks often make is to pick a number of
iterations because “it seems high enough”.
There are good statistical techniques to decide how many iterations
are needed to draw valid conclusions, why not use them?
An even worse mistake is to not consider whether the effect observed
by your benchmark even makes sense: if your results indicate that
option A is better than option B by less than one machine instruction
per iteration, do you really think that is meaningful?
I think it is not: no matter how many statistical tests you have to
prove it is true, the impact is barely measurable.
I want rigorous results, I do not want to join
“The Cult of Statistical Significance”.
The JayBeams Microbenchmark Framework
In JayBeams microbenchmark
framework
the user just needs to provide a fixture template parameter.
The constructor of this fixture must setup the environment for the
microbenchmark.
The fixture::run member function must run the test.
The framework takes care of the rest:
it reads the configuration parameters for the
test from the command line, calls your constructor,
runs the desired number of warm up and test iterations,
captures the results, and finally reports all the data.
The time measurements for each iteration are captured in memory,
since you do not want to contaminate the performance of the test with
additional I/O.
All the memory necessary to capture the results is allocated before
the test starts, because I do not want to contaminate the memory
allocator's arena either.
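As an illustration, a fixture could look something like the following. This is a simplified sketch with made-up names, not the actual JayBeams interface, and it measures std::sort instead of abobs<> to keep the example self-contained:

```cpp
#include <algorithm>
#include <cstdint>
#include <random>
#include <vector>

// A sketch of a user-provided fixture.  The constructor performs the
// expensive, untimed setup; run() executes only the code we want to
// measure.
class sort_fixture {
 public:
  explicit sort_fixture(std::size_t size) : data_(size) {
    std::mt19937_64 generator(std::random_device{}());
    std::generate(data_.begin(), data_.end(),
                  [&generator] { return generator(); });
  }

  void run() {
    auto copy = data_;  // the copy is (deliberately) part of the measured work
    std::sort(copy.begin(), copy.end());
  }

 private:
  std::vector<std::uint64_t> data_;
};

// The framework would then be instantiated along these lines
// (hypothetical names):
//   microbenchmark<sort_fixture> bm(config::parse(argc, argv));
//   bm.run();                    // warmup + timed iterations
//   bm.write_results(std::cout);
```

The abobs<> fixture follows the same shape: its constructor builds the synthetic sequence of operations and run() replays them against the book.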
Reporting the Results
I chose to make no assumptions in the JayBeams microbenchmark
framework as to what are good
statistics to report for a given microbenchmark.
The choice of statistic depends on the nature of the underlying
distribution, and the statistical tests that you are planning to use.
Worse, even if I knew the perfect statistics to use, there are some
complicated numerical and semi-numerical algorithms involved.
Such tasks are best left to specialized software, such as
R, or if you are a Python fan,
Pandas.
In general, the JayBeams microbenchmark framework will dump all the
results to stdout, and expects you to give them to a script (I use R)
to perform the analysis there.
However, sometimes you just want quick results to guide the
modify-compile-test cycles.
The microbenchmark framework also outputs a summary of the results.
This summary includes: the number of
iterations, the minimum time, the maximum time, and the p25, p50, p75,
p90 and p99.9 percentiles of the time across all iterations.
BTW, I picked up p99.9 as a notation for “the
99.9th percentile” (or the even more awkward “the 0.999 quantile of the
sample”) in one of my jobs. I am not sure who invented it; I do not
think it is very standard, but it should be.
The choice of percentiles is based on the fact that most latency
measurements are skewed to the right (so we have more percentiles
above 90% than below 10%), but the choice is admittedly arbitrary.
The summary intentionally omits the mean, because these distributions
rarely have any central tendency, which is what the mean intuitively
represents, and I fear folks would draw the wrong conclusions if it
were included.
Clock selection
I mentioned that there are many clocks available in C++ on Linux, and
promised to tell you how I chose.
The JayBeams microbenchmark framework uses std::chrono::steady_clock.
This is a guaranteed monotonic clock; its resolution depends on your
hardware, but any modern x86 computer is likely to have an
HPET
circuit with at least 100ns resolution.
The Linux kernel can also drive the time measurements using the
TSC register,
which has sub-nanosecond resolution (but many other problems).
In short, this is a good clock for a stopwatch (it is monotonic), and
while the resolution is not guaranteed to be sub-microsecond, it is
likely to be. That meets my requirements, but why not any of the alternatives?
getrusage(2): this system call returns the resource utilization
counters that the system tracks for every process (and in some
systems each thread).
The counters include cpu time, system time, page faults, context
switches, and many others.
Using CPU time instead of wall clock time is good,
because the amount of CPU used should not change while the program is
waiting to be scheduled.
However, the precision of getrusage is too low for my purposes:
traditionally it was updated 100 times a second, and even on modern
Linux kernels the counters are only incremented around 1,000 times
per second
[1][2].
So at best you get millisecond resolution,
while the improvements I am trying to measure may be a few
microseconds.
This system call would introduce measurement errors many times
larger than the effects I want to measure, and therefore it is not
appropriate for my purposes.
std::chrono::high_resolution_clock: C++11 introduced a number of
different clocks, and this one potentially has a higher resolution
than std::chrono::steady_clock.
That is good, right?
Unfortunately, high_resolution_clock is not guaranteed to be
monotonic: it might go back in time, or some seconds may be shorter
than others.
I decided to check whether I was lucky and it was actually monotonic
on my combination of operating system and compiler (the is_steady
check sketched after this list makes that easy).
based on
clock_gettime(CLOCK_REALTIME,...)[3],
which is subject to changes in the system clock, such as ntp
adjustments.
So this one is rejected because it does not make for a good
stopwatch.
clock_gettime(2): this is the underlying function used in the
implementation of std::chrono::steady_clock.
One could argue that using it directly would be more efficient;
however, the C++ classes around it add very little overhead and
offer a much superior interface.
Candidly, I would have ended up writing a wrapper around this call,
and my wrapper would have been worse than the one provided by the
standard, so why bother?
gettimeofday(2) is a POSIX call with similar semantics to
clock_gettime(CLOCK_REALTIME, ...).
Even the POSIX standard no longer recommends using it
[4],
and recommends clock_gettime instead.
So this one is rejected because it is basically obsolete, and it is
also not monotonic, so it does not make a good stopwatch.
time(2) only has second resolution, and it is not monotonic.
Clearly not adequate for my purposes.
rdtscp / rdtsc: Okay, this is the one that all the low-level
programmers go after.
It is an x86 instruction that essentially returns the number of
ticks since the CPU started.
You cannot ask for lower overhead than “a single instruction”, right?
I have used this approach in the past, but you do need to calibrate the
counter; it is never correct to just take the count and divide it by
the clock rate of your CPU.
And there are a number of other
pitfalls too.
Furthermore, clock_gettime is implemented using
vDSO,
which greatly reduces the overhead of these system calls.
My little
benchmark
indicates that the difference between them is about 30ns (yes,
nanoseconds) on my workstation.
In my opinion, the use of rdtsc is no longer justified on modern Linux
systems; it carries a lot of extra complexity that you only need if you are
measuring things in the nanosecond range.
I may need it eventually, if I start measuring computations that
take less than one microsecond, but until I do I think
std::chrono::steady_clock is much easier to use.
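As an aside, the standard already tells you which clocks can be trusted as stopwatches: every clock exposes an is_steady constant, so the monotonicity claims above are easy to verify at compile time on your own toolchain:

```cpp
#include <chrono>

// steady_clock is required to be monotonic, so this always holds.
static_assert(std::chrono::steady_clock::is_steady,
              "steady_clock must be monotonic");

// high_resolution_clock makes no such promise; on the Linux toolchains
// I tested this assertion fails.  Uncomment to check yours:
// static_assert(std::chrono::high_resolution_clock::is_steady,
//               "high_resolution_clock is monotonic on this platform");
```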
System Configuration
Running benchmarks on a typical Linux workstation or server can be
frustrating because the results vary so much.
Run the test once and it takes 20ms; run it again and it takes 50ms;
run it a third time and it takes 25ms; once more and you get 45ms.
Where is all this variation coming from? And how can we control it?
I have found that you need to control at least the following to
get consistent results:
(1) the scheduling class for the process,
(2) the percentage of the CPU reserved for non real-time processes,
(3) the CPU frequency scaling governor in the system,
and (4) the overall system load.
I basically tested the different combinations of these parameters,
removing the combinations that produce bad results until I found the
one (or the few) that works well.
Well, when I say “all combinations” I do not quite mean that: there are 99
different real-time priorities, do I need to test all of them?
What I actually mean is:
scheduling class: I ran the microbenchmark in both the default
scheduling class (SCHED_OTHER) and at the maximum priority in the
real-time scheduling class (SCHED_FIFO); a sketch of how a process
requests SCHED_FIFO appears after this list.
If you want a detailed description of the scheduling classes and how
they work I recommend the man page:
sched(7).
non-real-time CPU reservation: this one is not as well known, so here
is a brief intro. Real-time tasks can starve the non-real-time tasks if
they go haywire. That might be Okay, but they can also starve the
interrupts, and that can be a “Bad Thing”[tm].
So by default the Linux kernel is configured to reserve a percentage
of the CPU for non-real-time workloads.
Most systems set this to 5%, but it can be changed by writing
into /proc/sys/kernel/sched_rt_runtime_us [5].
For benchmarks this is awful: if your system has more cores than the
benchmark is going to use, why not run with 0% reserved for the non
real-time tasks?
So I try with both a 0% and a 5% reservation for non real-time tasks
and see how this affects the predictability of the results when using
real-time scheduling (it should make no difference when running in the
default scheduling class, so I skipped that).
CPU frequency scaling governor: modern CPUs can change their
frequency dynamically to tradeoff power efficiency against
performance. The system provides different governors that offer
distinct strategies to balance this tradeoff.
I ran the tests with both the ondemand CPU governor, which attempts
to increase the CPU frequency as soon as it is needed,
and with the performance CPU frequency governor which always runs
the CPU at the highest available frequency
[6].
system load: we all know that performance results are affected by
external load, so to simulate the effect of a loaded vs. idle system I
ran the tests with the system basically idle (called ‘unloaded’ in the
graphs). Then I repeated the benchmarks while I ran N processes, one
for each core, each of these processes tries to consume 100% of a core.
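As promised above, here is a sketch of how a process can request the real-time scheduling class for itself; this is roughly what a framework can do on its own, the governor and sched_rt_runtime_us changes require root, which is why I keep those in a separate driver script. This is an illustration, not the actual JayBeams code:

```cpp
#include <sched.h>
#include <cerrno>
#include <cstring>
#include <iostream>

// Request SCHED_FIFO at the maximum priority for the calling process.
// Requires CAP_SYS_NICE or a suitable rtprio entry in
// /etc/security/limits.conf.
bool set_max_fifo_priority() {
  struct sched_param param{};
  param.sched_priority = sched_get_priority_max(SCHED_FIFO);
  if (sched_setscheduler(0, SCHED_FIFO, &param) != 0) {
    std::cerr << "sched_setscheduler: " << std::strerror(errno) << "\n";
    return false;
  }
  return true;
}
```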
Finally, for all the
possible combinations of the configuration parameters and load I
described above I ran the microbenchmark four times.
I actually used mbobs<> in these benchmarks, but
the results apply regardless of what you are testing.
Let’s look at the pretty graph:
The first thing that jumps at you is the large number of outliers in
the default scheduling class when the system is under load.
That is not a surprise: the poor benchmark is competing with the load
generator, and they both have the same priority.
We probably do not want to run benchmarks on loaded systems at the
default scheduling class, we kind of knew that, but now it is confirmed.
We can also eliminate the ondemand governor when using the real-time
scheduling class.
When there is no load in the system the results are quite variable
under this governor.
However it seems that it performs well when the system is loaded.
That is weird, right?
Well if you think about it, under high system load the ondemand
governor pushes the CPU frequency to its highest value because the
load generator is asking for the CPU all the time.
That actually improves the consistency of the results because when the
benchmark gets to run the CPU is already running as fast as possible.
In effect, running with the ondemand governor under high load is
equivalent to running under the performance governor under any load.
Before Going Further: Fix the Input
Okay, so the ondemand governor is a bad idea, and running in the
default scheduling class with high load is a bad idea too.
The following graph shows different results for the microbenchmark
when the system is always configured to use the performance CPU
frequency governor, and excluding the default scheduling class when
the system is under load:
I have broken down the times by the different inputs to the test:
either using the same seed for the pseudo-random number generator (PRNG)
used to create the synthetic sequence of operations we have
discussed, labeled fixed, or using
/dev/urandom to initialize the PRNG, labeled urandom, of course.
Obviously some of the difference in execution time can be attributed to
the input, but there are noticeable differences even when the input is
always the same.
Notice that I do not recommend running the benchmarks for
abobs<> with a fixed input.
We want the class to work more efficiently for a large class of
inputs, not simply for the one we happen to measure with.
We are fixing the input in this post only in an effort to better tune
our operating system.
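For reference, the difference between the two configurations boils down to how the PRNG is seeded, roughly like this (a sketch; the JayBeams code is organized differently):

```cpp
#include <random>

// With a fixed seed every run of the benchmark replays exactly the
// same synthetic sequence of operations ...
std::mt19937_64 make_fixed_generator() {
  return std::mt19937_64(42);  // any constant works
}

// ... while seeding from std::random_device (backed by /dev/urandom on
// Linux) produces a different sequence on every run.
std::mt19937_64 make_urandom_generator() {
  std::random_device rd;
  return std::mt19937_64(rd());
}
```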
Effect of the System Load and Scheduling Limits
By now this has become a quest: how to configure the system and test
to get the most consistent results possible?
Let’s look at the results again, but remove the ondemand governor,
and only used the same input in all tests:
Okay, we knew from the beginning that running on a loaded system was a
bad idea.
We can control the outliers with suitable scheduling parameters, but
there is a lot more variation between the first and third quartiles
(those are the end of the boxes in these graphs btw) with load than
without it.
I still have not examined the effect, if any, of that
/proc/sys/kernel/sched_rt_runtime_us parameter:
The effects seem to be very small, but hard to see in the graph.
Time for some number crunching. I chose the interquartile range
(IQR) because it is robust and non-parametric: it captures the middle
50% of the data no matter what the distribution or the outliers are:
| Governor | Scheduling Class | Real-time Scheduling Limits | IQR |
| --- | --- | --- | --- |
| ondemand | default | N/A | 148 |
| performance | default | N/A | 89 |
| performance | FIFO | default (95%) | 40 |
| performance | FIFO | unlimited | 28 |
That is enough to convince me that there is a clear benefit to using
all the settings I have discussed above.
Just for fun, here is how bad those IQRs started, even if only
restricted to the same seed for all runs:
| Loaded | Seed | Scheduling | Governor | Interquartile Range |
| --- | --- | --- | --- | --- |
| loaded | fixed | default | ondemand | 602 |
| loaded | fixed | rt:default | ondemand | 555 |
| loaded | fixed | default | performance | 540 |
| loaded | fixed | rt:default | performance | 500 |
| loaded | fixed | rt:unlimited | performance | 492 |
| loaded | fixed | rt:unlimited | ondemand | 481 |
I believe the improvement is quite significant: I will show in a
future post that this is the difference between having to run a few
hundred iterations
of the test vs. tens of thousands of iterations to obtain enough
statistical power in the microbenchmark.
I should mention that all these pretty graphs and tables are
compelling, but do not have sufficient statistical rigor to draw
quantitative conclusions.
I feel comfortable asserting that changing these system configuration
parameters has an effect on the consistency of the performance
results.
I cannot assert with any confidence what is the size of the effect, or
whether the results are statistically significant, or to what
population of tests they apply.
Those things are possible to do, but distract us from the objective of
describing benchmarks rigorously.
Summary
The system configuration seems to have a very large effect on how
consistent your benchmark results are.
I recommend you run microbenchmarks on the SCHED_FIFO scheduling class,
at the highest available priority, on a lightly loaded system,
where the CPU frequency scaling governor has been set to
performance, and where the system has been configured to dedicate up
to 100% of the CPU to real-time tasks.
The microbenchmark framework automatically
sets most of these configuration parameters.
At the moment I use a driver script to set the CPU frequency
scaling governor and to change the CPU reservation for non-real-time
tasks.
I prefer to keep this in a separate script because otherwise one needs
superuser privileges to run the benchmark.
Setting the scheduling class and priority from the framework is fine;
you only need the right capabilities, granted via
/etc/security/limits.conf.
The script is small and easy to inspect for security problems;
in fact, it just relies on sudo, so a simple grep finds all the
relevant lines.
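For example, allowing a given user (here a hypothetical coryan account) to request real-time priorities only takes an entry like this in /etc/security/limits.conf:

```
coryan    -    rtprio    99
```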
If you do not like these settings, the framework can be configured to
not set them.
It can also be configured to either ignore errors when changing the
scheduling parameters (the default), or to raise an exception if
setting any of the scheduling parameters fails.
I think you should use std::chrono::steady_clock if you are running C++
microbenchmarks on Linux. Using rdtsc is probably the only option
if you need to measure things in the [100,1000] nanosecond range,
but there are many pitfalls and caveats; read about them online before
jumping into coding.
Even with all this effort to control the consistency of the benchmark
results, and even with a very simple, purely CPU bound test as used in
this post the results exhibit some variability.
These benchmarks live in a world where only rigorous statistics can
help you draw the right conclusions, or at least help you avoid the
wrong conclusions.
As I continue to learn how to run rigorous, reproducible
microbenchmarks in Linux I keep having to pick up more and more
statistical tools.
I would like to talk about them in my next post.
Future Work
There are a few other things about locking down the system
configuration that I left out and should not go unmentioned.
Page faults can introduce latency in the benchmark, and
can be avoided by using the mlockall(2) system call.
I do not think these programs suffer too much from it, but changing
the framework to make this system call is not too hard and sounds like
fun.
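The call itself is a one-liner, more or less like this:

```cpp
#include <sys/mman.h>
#include <cerrno>
#include <cstring>
#include <iostream>

// Lock all current and future pages in memory so the benchmark does
// not take page faults while measuring.  May fail if RLIMIT_MEMLOCK is
// too low or the process lacks the required capability.
bool lock_all_memory() {
  if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
    std::cerr << "mlockall: " << std::strerror(errno) << "\n";
    return false;
  }
  return true;
}
```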
Similar to real-time CPU scheduling, the Linux kernel offers
facilities to schedule I/O operations at different priorities via the
ioprio_set(2) system call.
Since these microbenchmarks perform no I/O I am assuming this will not
matter, but it possibly could.
I should extend the framework to make these calls, even if it is only
when some configuration flag is set.
I have not found any evidence that setting the CPU affinity for the
process (also known as pinning) helps in this case.
It might do so, but I have no pretty graph to support that.
It also wreaks havoc with non-maskable interrupts when the benchmark
runs at higher priority than the interrupt priority.
In some circles it is popular to create a container restricted to a
single core and move all system daemons to that core.
Then you run your system (or your microbenchmark) in any of the other
cores.
That probably would help, but it is so tedious to setup that I have
not bothered.
Linux 4.7 and higher include schedutil, a CPU frequency scaling
governor based on information provided by the scheduler
[7].
Initial reports indicate that it performs almost as well as the
performance governor.
For our purposes the performance governor is a
better choice, as we are willing to forgo the power efficiency of an
intelligent governor in favor of the predictability of our results.
My friend Gonzalo points out that we
assume std::chrono::steady_clock has good enough resolution;
we should at least check whether this is the case at runtime and warn
if necessary.
Unfortunately, there is no guarantee in C++ as to the resolution of
any of the clocks, nor is there an API to expose the expected
resolution of the clock in the language.
Unfortunately this means any check on resolution must be platform
specific.
On Linux the
clock_getres(2)
seems promising at first, but it turns out to always return 1ns for
all the “high-resolution” clocks, regardless of their actual
resolution.
I do not have, at the moment, a good idea on how to approach this
problem beyond relying on the documentation of the system.
Notes
The data in this post was generated using a shell script,
available
here;
it should be executed in a directory where you have compiled
JayBeams.
Likewise, the graphs for this post were generated using an R script,
which is also
available.
The detailed configuration of the server hardware and software used to
run these scripts is included in the comments of the
csv file that
accompanies this post.
Updates
I completely reworded this post; the first version read like a not
very good paper. Same content, just better tone, I think. I beg
forgiveness from my readers; I am still trying to find a good style
for blogs.
My previous post
included a performance comparison between two implementations of a
data structure.
I have done this type of comparison many times, and have grown
increasingly uncomfortable with the lack of scientific and statistical
rigor we use in our field (I mean software engineering, or computer
science, or computer engineering, or any other name you prefer)
when reporting such results.
Allow me to criticize my own post to illustrate what I mean.
The post reported on the results of benchmarking two implementations
of a data structure.
I1: I did not provide any context: what
domain does this problem arise in? Why is this data structure
interesting at all?
I2: I did not provide a
description of the problem the data structure is trying to solve.
Maybe there is a much better solution that I am not aware of; a
reader may point that solution out to me, and then all
the analysis would be a waste of time.
I3: I did not provide any
justification for why the data structure efficiency would be of
interest to anybody.
I4: I did not include a
description of the data structure. With such a description the reader
can understand why it is likely that the data structure will perform
better. Better yet, they can identify the weaknesses in the data
structure, and evaluate whether the analysis considers those weaknesses.
I assumed that a reader of this blog, where I largely write about
JayBeams,
should be familiar with these topics, but that is
not necessarily the case.
The gentle reader may land on this post as the result of a search, or
because a friend recommends a link.
In any case, it would have to be a very persistent reader if they were
to find the actual version of the JayBeams project used to prepare
the post, and that information was nowhere to be found either.
Furthermore, and this is less forgivable, my post did not answer any
of these other questions:
I5: How exactly is the
benchmark constructed?
How does it measure time?
Why is that a good choice for time measurement?
What external variables, such as system load, impact the results,
and how did I control for them?
I6: What exactly is the
definition of success and do the results meet that definition? Or in
the language of statistics:
Was the effect interesting or too small to matter?
I7: What exactly do I mean
by “faster”, or “better performance”, or “more efficient”? How is
that operationalized?
I8: At least I was
explicit in declaring that I only expected
array_based_order_book_side<> to be faster for the inputs that one
sees in a normal feed, but that it was worse with suitably constructed
inputs. However, this leaves other questions unanswered:
How do you characterize those inputs for which it is expected to be
faster?
Even from the brief description, it sounds like there is a very large
space of such inputs. Why do I think the results apply
to most of the acceptable inputs if I only tested with a few?
How many inputs would I need to sample to be confident in the results?
I9: How unlucky would I have to
be to miss the effect if it is there? Or if you prefer: how likely
is it that we will detect the effect if it is there? In the
language of statistics:
Did the test have enough statistical power to measure what I
wanted to measure?
Is that number of iterations (a/k/a samples) high enough?
For that matter, how many iterations of the test did I run?
I10: How confident
am I that the results cannot be explained by luck alone? Or in the
language of statistics: Was the result statistically significant?
At what level?
I11: I reported the median
latency (and a number of other percentiles):
why are those statistics relevant or appropriate for this problem?
I12: How could anybody reproduce
these tests?
What was the hardware and software platform used for it?
What version of JayBeams did I use for the test?
What version of the compiler, libraries, kernel, etc. did I use?
If those questions are of no interest to you, then this is not the
series of posts that you want to read.
If you think those questions are important, or they pique your interest,
or you think they may be applicable to similar benchmarks you have
performed in the past or plan to perform in the future,
then I hope to answer them in this series of posts.
I propose to
measure the performance improvements of
array_based_order_book_side<> vs. map_based_order_book_side<> as an
example of how to rigorously answer those questions.
Later it will become evident that this example is too easy, so I will
also
measure some small improvements to
array_based_order_book_side<> as a further example of how to
rigorously benchmark and compare code changes.
As I go along, I will describe the pitfalls that I try to avoid,
and how I avoid them.
And, finally, I will report the results in enough detail that readers
can decide if they agree with my conclusions.
It is possible, indeed likely, that readers more knowledgeable than
myself will find fault in my methods, my presentation, or my
interpretation of the results.
They may be able to point out other pitfalls that I did not consider,
or descriptions that are imprecise, or places where my math was wrong,
or even simple spelling or grammar mistakes.
If so, I invite them to enter their comments into a
bug
I have created for this purpose.
The Problem of Building an Order Book
In this section I will try to address the issues raised in
I1, I2, and
I3, and give the reader some
context about the domain where the problems of building a book arise,
what specifically is this problem, and why is it important to solve
this problem very efficiently.
If you are familiar with market feeds and order books you may want to
skip to the next section;
it will be either boring or irritating to you.
If you are not familiar, then this is a very rushed overview; there is
an earlier post with
a slightly longer introduction to the topic, but the content here
should suffice.
Market data feeds are network protocols used to describe features of
interest in an exchange (or more generally a trading venue, but that
distinction is irrelevant).
These protocols are often custom to each exchange, though their
specification is public, and have large differences in the information
they provide, as well as the actual protocol used.
Having said that, I have found that many of them can be modeled as a
stream of messages that tell you whether an order was added,
modified, or deleted in the exchange.
The messages include a lot of information about the order, but here we
will concern ourselves with only the most salient bits, namely:
the side of the order – is this an order to buy or sell securities;
the price of the order – what is the maximum (for buy) or minimum (for
sell) price that the investor placing the order will accept,
and the quantity of the order – the number of shares (or more
generally units) the investor is willing to trade.
With this stream of messages one can build a full picture of all the
activity in the exchange, how many orders are active at any point, at
what prices, how much liquidity is available at each price point,
which orders arrived first, sometimes “where is my order in the
queue”, and much more.
The process of building this picture of the current state of the
exchange is called Building the Book, and your Book is the data
structure (or collection of data structures) that maintains that
picture.
Among the most common questions your book must be
able to answer are: what is the price of the highest buy order active
right now for a security? How many total shares are
available at that best buy price for that security? And obviously:
what are the best sell price and the number of shares at that best
price for that security?
If those are all the questions you are going to ask, and this is often
the case, you only need to keep a tally of how many
shares are available to buy at each price level, and how many shares
are available to sell at each price level.
The tally must be sorted by price (in opposite orders for buys
and sells), because when the best price level disappears (all active
orders at the best price are modified or canceled) you need to quickly
find the next best price.
You must naturally keep a different instance of such tally for each
security, after all you are being asked about the best price for GOOG,
not for MSFT.
The challenge is that these feeds can have very high throughput
requirements; it is
not rare to see hundreds of thousands of messages in a second.
That would not be too bad if one could shard the problem
across multiple machines, or at least multiple cores.
Alas! This is not always the case: for example, the Nasdaq ITCH-5.0
feed does not include symbol
information in either the modify or the delete messages, so one must
process such messages in order until the order they refer to is found
and the symbol is identified, at which point one may shard to a different
server or core.
Furthermore, the messages are not nicely and evenly spaced in these
feeds.
In my measurements I have seen that up to 1% of the messages arrive
less than 300ns after the previous one.
Yes, that is nanoseconds.
If you want to control the latency of your book builder to the p99
level – and you almost always want to do in the trading space – it
is desirable to completely process 99% of the messages before
the next one arrives.
In other words, you roughly have 300ns to process most messages.
Suffice it to say that the data structures involved in processing a
book must be extremely efficient to deal with peak throughput without
introducing additional latency.
Detailed Design
In this section I will try to solve the problems identified in
(I4)
and give a more detailed description of the
classes involved, so the reader can understand the tradeoffs they make
and any weaknesses they might have.
If the reader finds this section too detailed they can
skip it.
The Map Based Order Book
In JayBeams, a relatively straightforward implementation of the order
book tally I described above
is provided by map_based_order_book_side<> (hereafter mbobs<>)
[1].
It is implemented using a std::map<>, indexed by price and
containing the total quantity available for that price.
The class has two operations that are used to process all add, modify,
and delete messages from the market feed.
reduce_order() processes both delete and modify messages, because in
ITCH-5.0 all modifications reduce the size of an order. Naturally
add_order() processes messages that create a new order.
The class provides member functions to access the best price and
quantity, which simply call the begin() and rbegin() in the
internal std::map<> member variable.
Therefore, finding the best and worst prices are O(1) operations, as
C++ makes this
complexity guarantee
for the begin() and rbegin() member functions.
Likewise, the add_order() and reduce_order() member functions are
O(log(N)) on the number of existing price levels, since those are
the
complexity guarantees
for the find and erase member functions.
Alternatively, we can bound this with O(log(M)) where M is the
number of previous orders received, as each order creates at most one
price level.
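A simplified sketch of the buy side conveys the idea; the actual JayBeams class is templated on the ordering and handles many more details, so treat this only as an illustration:

```cpp
#include <cstdint>
#include <map>
#include <utility>

// A simplified version of the map-based book side, buy side only:
// prices (in ticks) map to the total quantity available at that price.
class map_based_buy_side {
 public:
  void add_order(std::uint64_t px, int qty) {
    levels_[px] += qty;  // O(log N) lookup or insertion
  }

  void reduce_order(std::uint64_t px, int qty) {
    auto i = levels_.find(px);             // O(log N)
    if (i == levels_.end()) return;        // the real class reports this error
    i->second -= qty;
    if (i->second <= 0) levels_.erase(i);  // the price level is gone
  }

  // Best (highest) buy price and its quantity, O(1) via rbegin().
  std::pair<std::uint64_t, int> best_quote() const {
    return *levels_.rbegin();              // undefined if the book is empty
  }

 private:
  std::map<std::uint64_t, int> levels_;
};
```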
The Array Based Order Book
In the previous post, I made a number of observations on the
characteristics of a market feed:
(1) I reported that market data messages
exhibit great locality of references, with the vast majority of
add/modify/delete messages affecting prices very close to the current
best price,
(2) prices only appear in discrete increments, which makes them suitable to
represent as integers,
(3) integers that appear in a small range can be used as indices into an
array,
and (4) access to such arrays is O(1), and also makes efficient use
of the cache.
One can try to exploit all of this by keeping an array for the prices
contiguous with the best price.
Because the total range of prices is rather large, we cannot keep all
prices in an array, but we can fall back on the std::map<>
implementation for the less common case.
My friend Gabriel used all the previous observations to design and
implement array_based_order_book_side<> (hereafter abobs<>).
The class offers an interface identical to
mbobs<>,
with add_order() and reduce_order()
member functions, and with member functions to access the best and
worst available prices.
The implementation is completely different though.
The class maintains two data structures:
(1) a vector representing the best (highest in the case of
buy) prices,
and (2) a std::map<> – which is typically some kind of balanced
tree – to represent the prices that do not fit in the vector.
In addition, it maintains
tk_begin_top, the tick level of the first price in the vector,
and tk_inside, the index of the best price.
All elements in the vector past that index have zero values.
The vector size is initialized when a class instance is constructed,
and does not change dynamically.
The value of tk_inside changes when a new order with a better price
than the current best is received. Analyzing the complexity of this
is better done by cases:
Complexity of add_order()
There are basically three cases:
We expect the majority of the calls to the add_order()
member function to affect a price that is close to the current
tk_inside. In this case the member function simply updates one
of the values of the array, an O(1) operation.
Sometimes the new price will be past the capacity assigned to the
array. In this case the current values in the array need to be
flushed into the map. This is an O(K) operation, where K is the
capacity assigned to the array, because all K
insertions happen at the same location, and
std::map::emplace_hint is guaranteed to be amortized constant
time.
Sometimes the price is worse than tk_begin_top, in which case
the complexity is O(log(N)), the same as in the map-based class.
Complexity of reduce_order()
There are also basically three cases:
We also expect the majority of the calls to the reduce_order()
member function to affect a price that is close to the current
tk_inside. As long as the new total quantity is not zero, the member
function simply updates one of the values of the array, an O(1) operation.
If the new quantity is zero the class needs to find the new best
price. In most cases this is a short search backwards through the
array. But if the search through the array is exhausted the
implementation needs to move vector.size()/2 prices from the map
to the vector. Again this is an O(vector.size()) operation,
because erasing a range of prices in the map is guaranteed to be
linear in the number of elements erased.
If the price is worse than tk_begin_top then this is similar to
the map-based class and the complexity is O(log(N)).
Summary of the Complexity
In short, we expect abobs<> to behave as an
O(1) data structure for the majority of the messages.
This expectation is based on receiving very few messages that affect
prices far away from the inside; in those cases the class actually
behaves worse than
mbobs<>.
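To make the common case concrete, here is a rough sketch of the add_order() logic for the buy side. The real abobs<> implementation differs in many details (capacity management, the exact re-centering policy, buy vs. sell ordering), so treat this as an illustration of the idea rather than the actual code:

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Simplified sketch of the array-based buy side.  levels_ holds the
// quantities for prices in [tk_begin_top_, tk_begin_top_ + capacity),
// tk_inside_ is the index of the current best price, and prices worse
// than tk_begin_top_ fall back to a std::map<>.
class array_based_buy_side {
 public:
  explicit array_based_buy_side(std::size_t capacity) : levels_(capacity, 0) {}

  void add_order(std::uint64_t px, int qty) {
    if (px >= tk_begin_top_ && px < tk_begin_top_ + levels_.size()) {
      // Common case: the price falls inside the array, O(1).
      std::size_t index = px - tk_begin_top_;
      levels_[index] += qty;
      if (index > tk_inside_) tk_inside_ = index;
    } else if (px < tk_begin_top_) {
      // Price worse than anything tracked by the array, O(log N).
      overflow_[px] += qty;
    } else {
      // Price beyond the range covered by the array: flush the array
      // into the map and restart it around the new price, an
      // O(capacity) operation that we expect to be rare.
      for (std::size_t i = 0; i != levels_.size(); ++i) {
        if (levels_[i] != 0) overflow_[tk_begin_top_ + i] += levels_[i];
        levels_[i] = 0;
      }
      tk_begin_top_ = px >= levels_.size() ? px - levels_.size() + 1 : 0;
      tk_inside_ = 0;
      add_order(px, qty);
    }
  }

 private:
  std::vector<int> levels_;
  std::map<std::uint64_t, int> overflow_;
  std::uint64_t tk_begin_top_ = 0;
  std::size_t tk_inside_ = 0;
};
```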
Benchmark Design
I tried running an existing program,
such as
itch5inside,
and measuring how its performance changes with each implementation.
I ran into a number of problems:
The program needs to read the source data from a file, this file is
about 6 GiB compressed, so any execution of the full program is also
measuring the performance of the I/O subsystem.
The program needs to decompress the file, parse it, and filter the
events by symbol, directing different events to different instances of
array_order_book_side<>. So once more, an execution of the full
program is measuring many other components beyond the one I want to
optimize.
The program takes around 35 minutes to process a day's worth of market
data. We expect that multiple runs will be needed to “average out”
the variation caused by all the
operating system noise, so completing a test could take hours.
Such long elapsed times will severely limit my ability to iterate
quickly.
More importantly, running the full program is poor experimental
design.
There are too many variables that we will need to control: the I/O
subsystem, the size of the buffer cache, any programs that are
competing for that cache, any changes to the decompression and/or I/O
libraries, etc.
I would prefer to run an experiment that isolates only the component
that I want to measure.
I do not see any way to do this other than using a synthetic
benchmark.
Admittedly this is a proxy for the performance of the overall
program, but at least I can iterate quickly and test the big program
again at the end.
This is more or less standard practice, and I do not think too
controversial.
Obtaining Representative Data for the Benchmark
The tricky bit here is that the abobs<> class was specifically
designed to take advantage of the distribution characteristics of the
input data.
I need to create test data that has this locality property.
And it must hit some kind of Goldilocks medium:
not so much locality of prices that it unfairly favors the abobs<>
design, but not so little that it unfairly disadvantages that design
either.
I thought about two alternatives for this problem:
I could use traces from the existing data set. Somehow
save a trace, load it into memory and run the
something-something-based-book against this data.
However, we would need to capture multiple traces to
ensure we have a representative sample of the data.
And I have a limited amount of market data at my disposal, so this
is all very limiting.
I can try to generate data with similar statistical distribution. The
data may not have the same fidelity of a trace, but it can be easily
varied by regenerating the data.
Because it is easier to set up and run the test with synthetic data, I
chose the second approach.
In addition to the lack of fidelity, there may be characteristics
of the data that I did not think about and that make the class behave
differently in production.
I think I am addressing those limitations by planning to run the full
test separately.
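To give a flavor of what “similar statistical distribution” means in practice, the generator can draw most prices from within a few levels of the current inside. This is only a sketch of the idea; the actual JayBeams generator is more elaborate (it also moves the inside around, for example):

```cpp
#include <cstdint>
#include <random>
#include <vector>

struct operation {
  bool is_add;       // add_order() vs. reduce_order()
  std::uint64_t px;  // price level, in ticks
  int qty;
};

// Generate a sequence of operations where most prices land within a
// few levels of the (fixed, for simplicity) inside, mimicking the
// locality observed in real market feeds.  Illustrative only.
std::vector<operation> generate_synthetic_operations(
    std::size_t count, std::uint64_t initial_inside, unsigned seed) {
  std::mt19937_64 generator(seed);
  std::geometric_distribution<int> depth(0.1);  // most depths are small
  std::bernoulli_distribution is_add(0.5);
  std::uniform_int_distribution<int> qty(100, 1000);

  std::vector<operation> ops;
  ops.reserve(count);
  for (std::size_t i = 0; i != count; ++i) {
    int d = depth(generator);
    std::uint64_t px = initial_inside > static_cast<std::uint64_t>(d)
                           ? initial_inside - d
                           : 1;
    ops.push_back({is_add(generator), px, qty(generator)});
  }
  return ops;
}
```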
By the way, if the reader disagrees with my assumptions about how this
(and other) decisions impact the validity of the results,
please do let me know in the
bug.
I am trying to be super transparent about my assumptions,
if I am doing my work right you should be able to reproduce the
experiment with different assumptions, show why I am wrong, and we
both learn something.
That sounds suspiciously close to the scientific method.
Next Up
The next post will include a more detailed description of the
benchmark used to test these classes.
Notes
In this post I have set all links to a specific version
(eabc035fc23db078e7e6b6adbc26c08e762f37b3)
of the JayBeams project.
Links to the current version are more succinct, and future readers may
find that bugs in the old code have been fixed in newer versions.
We prefer to use links to fixed versions because it makes the
references and links reproducible, partially addressing
the problem I highlighted earlier.
Updates: Gabriel caught a mistake in
my complexity analysis; the O() bounds were correct, but I was
playing fast and loose with the constant factors.
Updates: Major rewording so I do not sound like a pompous ass.
It has been a while since I posted something: laziness, and not having
anything interesting to report. Well, maybe a lot of negative
results, which are interesting and should be published, but see above
about laziness.
Recently my friend Gabriel
implemented a really cool optimization on the ITCH-5.0 order book I
had implemented, which I thought should be documented somewhere.
His work inspired me to look further into
benchmarking and how to decide if a code change has actually improved
anything.
But first let’s start with Gabriel’s work. Two observations: first,
prices are discrete; they only appear in penny increments (1/100 of a
penny for prices below $1.00, but still discrete).
The second, and more critical, observation is that most market feed
updates happen close to the inside.
Gabriel and I wrote a
program
to confirm this;
here are the results, as percentiles of the event depth:
| NSamples | min | p25 | p50 | p75 | p90 | p99 | p99.9 | p99.99 | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 489064030 | 0 | 0 | 1 | 6 | 14 | 203 | 2135 | 15489331 | 20009799 |
One should expect, therefore, that 99% of the events affect a price no
more than 203 levels from the inside.
How can one use that? Well, building a book is basically
keeping a map (it needs to be sorted, so hashes typically do not
work), between prices and quantities.
Since prices are basically discrete values, one starts thinking of
vectors, or arrays, or many circular buffers indexed by price levels.
Since most of the changes are close to the inside, what if we kept an
array with the quantities, with the inside somewhere close to the
center of the array?
Most updates would happen in a few elements of this array, and
indexing arrays is fast!
Furthermore, even as the price moves, the change will be “slow”, only
affecting a few price levels at a time.
Unfortunately we cannot keep all the prices in an array, the
arrays would need to be too large in some cases.
We could explore options to keep everything in arrays, for example,
one could grow the arrays dynamically and only a few will be very
large.
But a more robust approach is to just use a map for anything that does
not fit the array.
Naturally prices change during the day, and sometimes the price
changes will be so large that the range in the array is no longer
usable.
That is Okay, as long as that happens rarely enough the overall
performance will be better.
Our benchmarks show that this optimization works extremely well in
practice.
With real data the array-based order book can process its events
significantly faster than the map-based book; the total CPU time
decreases from 199.5 seconds to 187.5 seconds. But more
interestingly, processing latencies per event are decreased up to the
p99 level (times in nanoseconds):
| Book Type | min | p10 | p25 | p50 | p75 | p90 | p99 | p99.9 | p99.99 | max | N |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Map | 193 | 345 | 421 | 680 | 1173 | 1769 | 2457 | 3599 | 13491 | 65673473 | 17560405 |
| Array | 193 | 288 | 332 | 517 | 886 | 1328 | 2093 | 5807 | 12657 | 63389010 | 17560405 |
More interestingly, using a synthetic benchmark we can see that the
difference is very obvious:
Finally a teaser: the array book type shows more variability, still
faster than map, but more difference between quartiles.
We use the less familiar (but more visually pleasing) violin plot to
break down the data by book side as well as book type. There
is nothing remarkable for the ‘map’ book type, but why are the latencies
so different for the two sides of the book in the ‘array’ case?
The answer to that and other questions in a future post.
Notes
The script used to generate the plots is shown below. The data is
available for download, and committed to my github
repository. This data was
generated with a specific
version
of the
benchmark,
on a Google Compute Engine (GCE) virtual machine (2 vCPUs, 13 GiB RAM,
Ubuntu 16.04 image); additional configuration information is captured
in the CSV file.
System configuration beyond the base image is captured in the
Dockerfile.
I have been silent for a while as I was having fun writing code. I
finally implemented a simple (and hopelessly slow) book builder for
the ITCH-5.0 feed. With it we can estimate the event rate for
changes to the inside, as opposed to all the changes in the book
that do not affect the best bid and offer.
This data is interesting as we try to design our library for time
delay estimation because it is the number of changes at the inside
that will (or may) trigger new delay computations. We are also
interested in breaking down the statistics per symbol, because we only
have to perform time delay analysis for a subset of the symbols in
the feed, and it is trivial to parallelize the problem across
different symbols in different servers.
The code is already available on
github.com;
the tool is called itch5inside, clearly I lack imagination.
It outputs the
inside changes as an ASCII file (which can be compressed on the
fly), so we can use it later as a sample input into our analysis.
Optionally, it also outputs to stdout the per-symbol statistics.
I have made the statistics available
on this very blog
in case you want to analyze them.
In this post we will just make some observations about the expected
message rates.
First we load the data; you can set public.dir to
https://coryan.github.io/public to download the data directly from
the Internet. Then we create a separate data set that does not
include the aggregate statistics.
It is a more or less well-known fact that some symbols are much more
active than others, for example, if we plot a histogram of number of
events at the inside and number of symbols with that many messages
we get:
That graph appears exponential at first sight, but one can easily be
fooled; this is what goodness-of-fit tests were invented for:
That p-value is extremely low, so we reject the Null
Hypothesis: the symbol$NSamples values do not come from the
geometric distribution that best fits the data.
Also notice that your values might be different; in fact, they
should be different, as the reference data was generated at random.
We can also visualize the distributions in log scale to make the
mismatch more apparent:
Clearly not the same distribution, but using math to verify this
makes me feel better.
In any case, whether the distribution is geometric or not does not
matter much, clearly there is a small subset of the securities that
have much higher message rates than others.
We want to find out if any security has such high rates as to make
them unusable, or very challenging.
The itch5inside tool collects per-symbol message rates; actually, it
collects message rates at different timescales (second, millisecond,
and microsecond), as well as the distribution of processing times to
build the book and the distribution of time between messages
(which I call “inter-arrival time”, because I cannot think of a
cooler name).
Well, the tool does not exactly report the distributions; it reports
key quantiles, such as the minimum, the first quartile, and the 90th
percentile.
The actual quantiles vary per metric, but for our immediate purposes
maybe we can get the p99.9 of the event rate at the millisecond
timescale.
A brief apology for the p99.9 notation, we used this shorthand in a
previous job for “the 99.9-th percentile”, and got used to it.
Let’s plot the p99.9 events/msec rate against the total number of
messages per second:
There is a clear outlier. Argh, the itch5inside tool also reports
the metrics aggregated across all symbols; that must be the
problem:
Of course it was; let’s just plot the symbols without the aggregate
metrics:
Finally, it would be nice to know what the top symbols are, so we
filter the top 50 and plot them against this data:
What Does this Mean?
We have established that inside changes are far more manageable, we
probably need to track less than the top 50 symbols and those do not
seem to see more than 10 messages in a millisecond very often.
We still need to be fairly efficient, in those peak milliseconds we
might have as little as 100 microseconds to process a single event.
What Next?
I did write an FFTW-based time delay estimator and benchmarked it; the
results are recorded in the github
issue.
On my
(somewhat old) workstation it takes 27 microseconds to analyze a
sequence of 2,048 points, which I think would be the minimum for
market data feeds. Considering that we need to analyze 4 signals
market data feeds. Considering that we need to analyze 4 signals
per symbol (bid and offer, price and size for each), and we only
have a 100 microsecond budget per symbol (see above), we find
ourselves without any headroom, even if we did not have to
analyze 20 or more symbols at a time to monitor all the different
channels that some market feeds use.
So we need to continue exploring GPUs as a solution to the problem,
I recently ran across VexCL,
which may save me a lot of tricky writing of code.
VexCL uses template expressions to create complex kernels and
optimize the computations at a high level.
I think my future involves benchmarking this library against some
pretty raw OpenCL pipelines to see how those template expressions do
in this case.
Before we start worrying about computing cross-correlations in GPUs we
would like to have a well-tested version using the CPU. It is much
easier to debug these, and they serve as reference data when testing
the GPU version.
I chose the FFTW library because it is freely
available and is a well-known and highly regarded FFT
implementation. It is a C library, which requires us to write some
wrappers to avoid common programming errors.
My goal is to implement a simple example showing how to estimate the
delay between two simple signals; something like the following would
be close to ideal:
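This is a wish list rather than working code; none of these names are real JayBeams APIs (yet), they just sketch the interface I am aiming for:

```cpp
// A wish list, not working code: all names below are hypothetical.
std::size_t const nsamples = 2048;
auto a = create_test_signal(nsamples);
auto b = delay_signal(a, 42 /* expected delay, in samples */);

time_delay_estimator estimator(nsamples);
auto estimated_delay = estimator.estimate(a, b);
// ... expect estimated_delay to be close to 42 ...
```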
We want the estimator to be some kind of class because there is a need
to keep state. For example, most FFT libraries store “execution
plans” for the transforms. Likewise, we anticipate (or know, because
I have prototyped this) that an
OpenCL-based estimator would need to keep buffers to pass data to the
GPU, the computational kernels that will be sent to the GPU, and many
other data structures.
Finally, we expect that real-time estimators will combine the
estimates obtained by analyzing across multiple securities and/or
across different attributes – bid vs. offer for example – of the
same security.
We also expect that an estimator may have multiple different compute
units at its disposal. For example, my workstation at home has 3
distinct GPU devices: two discrete video cards and an
APU
which combines a CPU and GPU into the same die.
Such configuration may be uncommon, but it is common to have multiple
cores in the same server, and it would be desirable to schedule some
of the FFT transforms in different cores to increase the processing
throughput.
You Are Not Going To Need It
Despite the challenging throughput numbers in a market feed, we have not
yet definitively concluded that this level of complexity is necessary.
FFT after all stands for “Fast Fourier Transform”, why worry about
GPUs, multicore programming, etc.?
I think I need to gather two pieces of information to demonstrate this
complexity is required.
First, we have established the throughput requirements for a feed such
as ITCH-5.0, but that feed includes all changes in the Nasdaq book,
across all symbols and across all price levels.
One can reasonably argue that the throughput requirements for a single
security (or a small number of securities) would be sufficiently low
to estimate the time delay in real-time.
And even if that does not reduce the throughput enough, the inside
prices and quantities provide some more relief.
Alas! Experience shows that most changes do occur in the inside
quantities.
And, while the number of messages in a single security is hardly ever
as high as the messages in all securities, some securities have
message rates several orders of magnitude higher than others.
This wide variation in message rates can be used to our advantage, as
we can select securities that are not so active as to slow down our
system, but active enough to provide a constant signal to the delay
estimator.
Furthermore, experience shows that typical latencies for the
consolidated feeds vary between one and eight milliseconds.
To secure a long enough baseline for the delay estimator we will need
at least 16 milliseconds of data.
We need to determine the right value for our sampling rate.
Higher sampling rates should produce more accurate results, but
increase the computational costs.
For example, if we choose to take one sample per microsecond, the 16
millisecond baseline contains roughly 16,384 samples, and each FFT
would require at least N log2(N), or about 230,000, operations.
A cross-correlation requires at least three FFTs, so we are looking at
a lower bound of roughly 680,000 operations.
Assuming your server or workstation can handle 3 billion operations
per second, and that we can use all these operations, the time delay
estimation would require something in the order of 229
microseconds.
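Putting that arithmetic in one place (rounding the sample count up to the next power of two, and treating all the constants as rough, order-of-magnitude estimates):

```latex
\begin{aligned}
N &\approx 16\,\text{ms} \times 1\,\text{sample}/\mu\text{s}
   \approx 16{,}384 = 2^{14} \\
\text{ops} &\approx 3 \, N \log_2 N
   \approx 3 \times 16{,}384 \times 14 \approx 6.9 \times 10^{5} \\
t &\approx \frac{6.9 \times 10^{5}\ \text{ops}}{3 \times 10^{9}\ \text{ops/s}}
   \approx 230\ \mu\text{s}
\end{aligned}
```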
That could easily prove too expensive if we want to read just 1% of
the ITCH-5.0 market feed.
None of these figures are to be taken as extremely accurate
predictions of performance. They just illustrate that the problem is
not trivial and will require significant complexity to manage the
computations efficiently.
What is Next?
We need to implement four different programs or components:
(1) we need a way to compute the inside for the ITCH-5.0 feed
(issue #8),
(2) we need a way to compute per-symbol throughput statistics of just
the inside changes
(issue #9),
(3) we need to prototype the time delay estimation based on
cross-correlations using the GPU
(issue #10),
and (4) we need to prototype the time delay estimation based on
cross-correlations and the FFTW library
(issue #7).
I have decided to start with the last one, because I have been doing
too much market data processing lately and too much posting about it lately.