On Benchmarking, Part 3
15 Jan 2017 - W. Edwards Deming
This is a long series of posts where I try to teach myself how to run rigorous, reproducible microbenchmarks on Linux. You may want to start from the first one and learn with me as I go along. I am certain to make mistakes; please write me a bug when I do.
In my previous post I discussed the JayBeams microbenchmark framework and how to configure a system to produce consistent results when benchmarking a CPU-bound component like `array_based_order_book_side<>` (a/k/a `abobs<>`).
In our first post we raised a number of issues that will be addressed
now, specifically:
I6: What exactly is the definition of success and do the results meet that definition? Or in the language of statistics: Was the effect interesting or too small to matter?
I7: What exactly do I mean by “faster”, or “better performance”, or “more efficient”? How is that operationalized?
I8: At least I was explicit in declaring that I only expected the `array_based_order_book_side<>` to be faster for the inputs that one sees in a normal feed, but that it was worse with suitably constructed inputs. However, this leaves other questions unanswered:
- How do you characterize those inputs for which it is expected to be faster?
- Even from the brief description, it sounds like there is a very large space of such inputs. Why do I think the results apply to most of the acceptable inputs if I only tested with a few?
- How many inputs would you need to sample to be confident in the results?
How Big of a Change do we Need?
If I told you that my benchmark “proves” that performance improved by one picosecond you would be correct in doubting me. That is such a small change that it can be more easily attributed to measurement error than anything else, and even if real it is so small as to be uninteresting: a thousand such improvements would amount to a nanosecond, and nanoseconds are pretty small already.
The question of how big a change needs to be to be “scientifically interesting” is, I think, one that depends on the opinion and objectives of the researcher. In the example I have been using, I think we are only interested in changes that improve the processing of add / modify / delete messages by at least a few machine instructions per message. That is a fairly minor improvement, but if present we have good reason to believe it is real.
Because instruction execution time is variable, and modern CPUs can execute multiple instructions per cycle, I am going to require that the improvement be at least one cycle per message. That is intuitively close to “a few instructions”, but with much simpler math.
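To get a feel for the sizes involved, here is a back-of-the-envelope sketch of what that threshold means in wall-clock time. The 3.0 GHz clock and the 20,000 messages per run are assumptions used only for illustration, not numbers measured in the benchmark:

```c++
// Translate "one cycle per message" into a wall-clock threshold.
// Both constants below are assumptions used only for illustration.
#include <cstdio>

int main() {
  double const clock_ghz = 3.0;             // assumed CPU frequency
  double const cycle_ns = 1.0 / clock_ghz;  // roughly 0.33 ns per cycle
  int const messages_per_run = 20000;       // assumed sequence length
  // The smallest improvement we would call interesting: one cycle per
  // message, accumulated over a whole run of the benchmark.
  double min_effect_ns = cycle_ns * messages_per_run;
  std::printf("minimum interesting effect: about %.0f ns per run\n",
              min_effect_ns);
  return 0;
}
```

Under those assumptions a real improvement needs to show up as several thousand nanoseconds per run before it clears the bar.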
Success Criteria: we declare success, or that `abobs<>` is faster than `mbobs<>`, if `mbobs<>` takes at least one more cycle per message to process the `add_order()` and `reduce_order()` calls generated by a sequence of order book messages.
Written like that, this is starting to sound like a classical statistical hypothesis testing problem. There are all sorts of great tools to compute the results of these tests. That is the easy part; the hard part is to agree on things like the accepted level of error, the population, etc.
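Just to make the framing explicit (the notation here is mine, not a formal treatment): if θ stands for some location parameter of the measured processing time, such as the mean or the median, and δ is the one-cycle-per-message threshold above expressed in the same units as θ, one way to write the test is:

$$
H_0\colon \theta_{\text{mbobs}} - \theta_{\text{abobs}} \le \delta
\qquad\text{vs.}\qquad
H_1\colon \theta_{\text{mbobs}} - \theta_{\text{abobs}} > \delta
$$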
In fact, some experts recommend that one start with even simpler questions, such as:
What do we need to decide? And Why?
This is one of those questions that is trying to challenge you to think about whether you really need to spend the money to run a rigorous test. In my case I want to run the statistics because it is fun, but there are generally good reasons to do benchmarking for a component like this. This is a component in the critical path for high-performance systems. If we make mistakes when evaluating changes we may either miss opportunities to accept good changes because the data was weak, or accept changes on poor evidence and slowly degrade the performance of the system over time.
Is a Data Driven Approach to Decision Making Necessary?
This is another one of those questions challenging the need to do all this statistical work. If we can make the decision through other means, say based on how readable the code is, then why collect all the data and spend the effort analyzing it?
I like data-driven decision making, and I think most software engineers prefer it too. I think the reader has likely witnessed or participated in debates between software engineers where the merits of two designs for a system are bandied back and forth (think: `vi` vs. `emacs`); having data can stop such debates before they start.
Another thing I like about a data-driven approach is that one must get specific about things like “better performance”, and very specific about “these are the assumptions about the system”. With clear definitions for such questions the conversations inside a team are also more productive.
What do we do if we have no data?
This is another question challenging us to think about why and how we are doing this. Implicitly, this question is asking: “What will you do if the data is inconclusive?” And also: “If you answer ‘proceed anyway’, then why should we work hard to collect valid statistics? Why waste the effort?” My answer tries to strike a balance between system performance considerations and other engineering constraints, such as readability:
Default Decision: Changes that make the code more readable or easier to understand are accepted, unless there is compelling evidence that they decrease performance. On the other hand, changes that bring no benefits in readability, or that actually make the code more complex or difficult to understand, are rejected unless there is compelling evidence that they improve performance.
How do we define “better performance”?
Effectively our design of the microbenchmark is an implicit answer to the question, but we state it explicitly:
Variable Definition: Pick a sequence of calls to `add_order()` and `reduce_order()`, call it S. We define the performance of an instantiation of `array_based_order_book_side<T>` on S as the total time required to initialize an instance of the class and process all the operations in S.
Notice that this is not the only way to operationalize performance. An alternative might read:
Pick a sequence of calls to `add_order()` and `reduce_order()`, call it S. We define the performance of `array_based_order_book_side<T>` on S as the 99.9 percentile of the time taken by the component to process each one of the operations in S.
I prefer the former definition (for the moment) because it neatly avoids the problem of trying to measure events that take only a few cycles to execute.
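To make the chosen definition concrete, here is a minimal sketch of how a single measurement of the variable could be taken. The `operation` struct and the `add_order()` / `reduce_order()` signatures are simplified assumptions for illustration; the real JayBeams classes differ in the details:

```c++
// One sample of the performance variable: the total time to construct
// the container and replay a sequence S of operations against it.
// The types and signatures here are illustrative, not the real ones.
#include <chrono>
#include <cstdint>
#include <vector>

struct operation {
  bool is_add;          // true -> add_order(), false -> reduce_order()
  std::uint32_t price;  // price level affected by the operation
  std::uint32_t qty;    // quantity added or reduced
};

template <typename book_side>
std::chrono::nanoseconds time_sequence(std::vector<operation> const& S) {
  auto const start = std::chrono::steady_clock::now();
  book_side side;  // initialization counts as part of the measurement
  for (auto const& op : S) {
    if (op.is_add) {
      side.add_order(op.price, op.qty);
    } else {
      side.reduce_order(op.price, op.qty);
    }
  }
  auto const end = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
}
```

A comparison would then call `time_sequence<>()` with both implementations over the same sequences and look at the differences in the recorded times.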
The Population
Essentially we have defined the variable that we are going to measure, analogous to deciding exactly how we are going to measure the weight of a person. But what is the group of persons we are going to measure this over? All humans? Only the people that ride the subway in New York? Likewise, we have not defined which inputs are acceptable; that is, we need to define the population of interest.
In our case I believe we need to pick two things:
- Most obviously, the population of sequences S; we know the `abobs<>` class is slower than `mbobs<>` for some of them.
- The template parameter T; there is a possibly infinite number of choices for it, but only two are really used. How often is each one used?
First, we recall our measurements of event depth, the main result of which is that the number of price levels between the inside and the operation changing the book has these sample percentiles:
| min | p25 | p50 | p75 | p90 | p99 | p99.9 | max |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 0 | 0 | 1 | 6 | 14 | 203 | 2135 | 20009799 |
In fact, we designed `abobs<>` to take advantage of this type of distribution; just writing the code is an implicit declaration that we believe any future measure of event depth will be similar. Naturally I do not expect every day to be identical, so I need to define how different is unacceptable. Somewhat arbitrarily, I will use any sequence of operations S as long as its p99 event depth percentile is between 2115 and 2155.
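With the rule written down it is easy to sketch a check for it. The nearest-rank percentile estimate below is my choice for illustration; the [2115, 2155] window comes from the paragraph above:

```c++
// Check whether a generated sequence belongs to the population: its p99
// event depth must fall inside the assumed [2115, 2155] window.
#include <algorithm>
#include <cstddef>
#include <vector>

bool acceptable_event_depths(std::vector<int> depths) {
  if (depths.empty()) return false;
  // nearest-rank estimate of the 99th sample percentile
  std::size_t const rank =
      static_cast<std::size_t>(0.99 * (depths.size() - 1));
  std::nth_element(depths.begin(), depths.begin() + rank, depths.end());
  return depths[rank] >= 2115 && depths[rank] <= 2155;
}
```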
As to the template parameter `T`, this is pretty obvious, but we should use both buy and sell sides of the book in our tests, and we expect that about half of the sequences are executed against the buy side, and the other half against the sell side. Let’s say exactly half to satisfy the lawyers in the room.
I have not said much about the operating conditions of the program: does a run of the program on the real-time class under no load count as a different sample than the program running on the default scheduling class with high load? I think that depends on what you are trying to measure. If you were trying to measure how resilient your code is against scheduling noise, that would be a fine definition of the population. I am interested in the narrower problem of how the code change improves the performance of processing a stream of add / modify / delete messages. For that purpose the operating system scheduling effects are noise; I want to minimize that noise, not include it in the analysis.
Therefore I think the following is a more realistic statement about the population of measurements:
Population Definition: The population under consideration is runs of our benchmark with any sequence of operations S where the p99 percentile of event depth is between 2115 and 2155. The population must have 50% of its runs for buy order books, and 50% for sell order books.
Oops, I have made a mistake in the design of the benchmarks. The benchmark reuses the same sequence S for all the iterations, but I am going to need random sampling over the population, which makes it desirable for the benchmark to run different iterations with different sequences. I think this is one of the advantages of writing down the population precisely: you realize the mistakes and implicit assumptions you are making in the benchmark design, and can correct them.
Also, the benchmark tries to create an input that matches the distribution, but no attempt is made to verify that it matches the restrictions we placed on the input. We need to correct for that.
Notice that this also means the population includes sequences of all different lengths, while the benchmark uses a single length in all the iterations. I think eventually we will need a linear model that describes how the performance varies with the input size. That is a more complicated set of statistical tests to run, and we will need to build up to it.
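By a linear model I mean, roughly, something of the following form (my notation), where n is the length of the sequence, β1 is the per-operation cost we ultimately care about, and ε captures the measurement noise:

$$
T(n) = \beta_0 + \beta_1 \, n + \epsilon
$$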
Next up
In my next post I will describe how we go about deciding how many samples are needed to draw statistically sound conclusions about the benchmark results.