On Benchmarking, Part 2
08 Jan 2017

This is a long series of posts where I try to teach myself how to run rigorous, reproducible microbenchmarks on Linux. You may want to start from the first one and learn with me as I go along. I am certain to make mistakes; please write me back in this bug when I do.
In my previous post I discussed a class
(array_based_order_book_side<>,
a/k/a abobs<>)
which serves as my motivation to learn how to create really good
benchmarks.
Testing the class inside a full program introduces too much variation
in the results, is too slow, and is too cumbersome, so I decided to go
with a small, purpose-built benchmark program.
In that post I pointed out ([I5]) that any good description
of a benchmark must include how the benchmark measures time, why
that is a good approach to measure time, and how the benchmark
controls for variables that affect the results, such as system load.
This post is largely written to address those questions.
I will describe a small framework I built to write benchmarks for JayBeams. Mostly this was born out of my frustration at not getting the same results when running the same benchmark twice. If I cannot reproduce the results myself, for the same code, on the same server, what hope do I have of creating reproducible results for others? I need to get the environment where the benchmarks run under control before I have any hope of embarking on a rigorous analysis of them.
It turns out you need to do a lot of fine tuning in the operating system to get consistency out of your benchmarks, and you also need to select your clock carefully. I find this kind of fine tuning interesting in itself, so I am taking the opportunity to talk about it.
Most folks call such a small benchmark a microbenchmark, and I like the name because it makes a clear distinction from large benchmarks such as those for database systems (think TPC-E and its cousins). I plan to use that name hereafter.
Anatomy of a Microbenchmark
Most microbenchmarks follow a similar pattern.
First, you set up the environment necessary to run whatever it is you
are testing.
This is similar to how you set up your mocks in a unit test, and in
fact you may set up mocks for your benchmark too.
In the case of our example (abobs<>)
one wants to build a
synthetic list of operations before the system starts measuring the
performance of the class.
Building that list is expensive, and you certainly do not want to
include it in the measurement of the class itself; in general,
microbenchmarks should not measure or report the time required
to set up their test environment.
Then, you run multiple iterations of the test and capture how long it took to run each iteration. How exactly you measure the time is a complicated question: modern operating systems and programming languages offer several different ways to measure time. I will discuss which one I picked, and why, later in the post. Sometimes you want to run a number of iterations at the beginning of the benchmark and discard their results. The argument is usually that you are interested in the steady state of the system, not what happens while the system is “warming up”. Whether that is sensible or not depends on the context, of course.
At the end most microbenchmarks report some kind of aggregate of the results, typically the average time across all iterations. Most microbenchmarks stop there, though rarely they include additional statistics, such as the minimum, maximum, standard deviation, or the median.
One of my frustrations is that I rarely see any justification for the choice of statistics: why is the mean the right statistic to consider in the conditions observed during the benchmark? How are you going to deal with outliers if you use the mean? Why not median? Does your data show any kind of central tendency? If not, then neither median nor mean are good ideas, so why report them? Likewise, why is the standard deviation the right measurement of statistical dispersion? Is something like the interquartile range a better statistic of dispersion for your tests?
The other mistake that folks often make is to pick a number of iterations because “it seems high enough”. There are good statistical techniques to decide how many iterations are needed to draw valid conclusions, why not use them?
An even worse mistake is to not consider whether the effect observed by your benchmark even makes sense: if your results indicate that option A is better than option B by less than one machine instruction per iteration, do you really think that is meaningful? I think it is not: no matter how many statistical tests you have to prove the difference is real, it has a barely measurable impact. I want rigorous results, but I do not want to join “The Cult of Statistical Significance”.
The JayBeams Microbenchmark Framework
In the JayBeams microbenchmark
framework
the user just needs to provide a fixture
template parameter.
The constructor of this fixture must set up the environment for the
microbenchmark, and the fixture::run
member function must run the test.
The framework takes care of the rest:
it reads the configuration parameters for the
test from the command line, calls your constructor,
runs the desired number of warm-up and test iterations,
captures the results, and finally reports all the data.
The time measurements for each iteration are captured in memory, since you do not want to contaminate the performance of the test with additional I/O. All the memory necessary to capture the results is allocated before the test starts, because I do not want to contaminate the memory arena either.
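To make the contract concrete, here is a minimal sketch of what a fixture and the measurement loop might look like. The names (example_fixture, run_benchmark) are illustrative only, not the actual JayBeams API:

```cpp
#include <chrono>
#include <vector>

// A fixture: the constructor performs all the expensive setup (e.g.
// building the synthetic list of operations), run() executes only the
// code under test for one iteration.
class example_fixture {
public:
  explicit example_fixture(int size) : data_(size, 42) {}

  void run() {
    long sum = 0;
    for (auto i : data_) sum += i;
    volatile long sink = sum;  // keep the compiler from optimizing the loop away
    (void)sink;
  }

private:
  std::vector<int> data_;
};

// The framework (sketched) drives the fixture roughly like this:
template <typename Fixture>
std::vector<std::chrono::steady_clock::duration> run_benchmark(
    int warmup, int iterations, int size) {
  Fixture fixture(size);  // setup happens here, outside the timed region
  for (int i = 0; i != warmup; ++i) fixture.run();
  std::vector<std::chrono::steady_clock::duration> results;
  results.reserve(iterations);  // allocate before measuring, not during
  for (int i = 0; i != iterations; ++i) {
    auto start = std::chrono::steady_clock::now();
    fixture.run();
    results.push_back(std::chrono::steady_clock::now() - start);
  }
  return results;
}
```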
Reporting the Results
I chose to make no assumptions in the JayBeams microbenchmark framework as to which statistics are good to report for a given microbenchmark. The choice of statistic depends on the nature of the underlying distribution and on the statistical tests you are planning to use. Worse, even if I knew the perfect statistics to use, there are some complicated numerical and semi-numerical algorithms involved. Such tasks are best left to specialized software, such as R, or if you are a Python fan, Pandas.
In general, the JayBeams microbenchmark framework will dump all the results to stdout, and expects you to give them to a script (I use R) to perform the analysis there. However, sometimes you just want quick results to guide the modify-compile-test cycles.
The microbenchmark framework also outputs a summary of the results. This summary includes: the number of iterations, the minimum time, the maximum time, and the p25, p50, p75, p90 and p99.9 percentiles of the time across all iterations. BTW, I picked up p99.9 as a notation for “the 99.9th percentile”, or the even worse “the 0.999 quantile of the sample”, in one of my jobs; I am not sure who invented it, and I think it is not very standard, but it should be. The choice of percentiles is based on the fact that most latency measurements are skewed to the right (so we have more percentiles above 90% than below 10%), but the choice is admittedly arbitrary. The summary intentionally omits the mean, because the distributions rarely have any central tendency, which is what the mean intuitively represents, and I fear folks would draw wrong conclusions if it were included.
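For the quick summary the percentiles do not need anything fancy; a simple selection-based computation like the sketch below is enough (this is an illustration, not necessarily the code the framework uses):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Returns the q-th quantile (0.0 <= q <= 1.0) of the samples using a
// simple index-based definition.  Takes the samples by value because
// nth_element reorders them.  Assumes the vector is not empty.
long quantile(std::vector<long> samples, double q) {
  std::size_t rank = static_cast<std::size_t>(q * (samples.size() - 1));
  std::nth_element(samples.begin(), samples.begin() + rank, samples.end());
  return samples[rank];
}

// Usage: auto p999 = quantile(times_in_ns, 0.999);
```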
Clock Selection
I mentioned that there are many clocks available in C++ on Linux, and
promised to tell you how I chose.
The JayBeams microbenchmark framework uses std::chrono::steady_clock,
which is guaranteed to be monotonic.
Its resolution depends on your hardware, but any modern x86 computer
is likely to have an
HPET
circuit with at least 100ns resolution.
The Linux kernel can also drive the time measurements using the
TSC register,
which has sub-nanosecond resolution (but many other problems).
In short, this is a good clock for a stopwatch (monotonic), and while
the resolution is not guaranteed to be sub-microsecond, it is likely
to be. That meets my requirements, but why not any of the alternatives?
getrusage(2)
: this system call returns the resource utilization
counters that the system tracks for every process (and in some
systems each thread).
The counters include cpu time, system time, page faults, context
switches, and many others.
Using CPU time instead of wall clock time is good,
because the amount of CPU used should not change while the program is
waiting to be scheduled.
However, the precision of getrusage
is too low for my purposes:
traditionally it was updated 100 times a second, and even on modern
Linux kernels the counters are only incremented around 1,000 times
per second
[1]
[2].
So at best you get millisecond resolution,
while the improvements I am trying to measure may be a few
microseconds.
This system call would introduce measurement errors many times
larger than the effects I want to measure, and therefore it is not
appropriate for my purposes.
std::chrono::high_resolution_clock
: C++11 introduced a number of
different clocks, and this one potentially has a higher resolution
than std::chrono::steady_clock.
That is good, right?
Unfortunately, high_resolution_clock
is not guaranteed to be
monotonic: it might go back in time, or some seconds may be shorter
than others.
I decided to check, maybe I was lucky and it was actually monotonic
on my combination of operating system and compiler.
No such luck: in all the Linux implementations I used, this clock is
based on
clock_gettime(CLOCK_REALTIME,...)
[3],
which is subject to changes in the system clock, such as ntp
adjustments.
So this one is rejected because it does not make for a good
stopwatch.
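If you want to run the same check on your own toolchain, the clock's traits give a quick answer; whether high_resolution_clock aliases one of the other clocks is implementation specific, so the sketch below only reports what your standard library does:

```cpp
#include <chrono>
#include <iostream>
#include <type_traits>

int main() {
  using hrc = std::chrono::high_resolution_clock;
  std::cout << std::boolalpha;
  std::cout << "high_resolution_clock::is_steady = " << hrc::is_steady << "\n";
  // Many standard libraries simply alias high_resolution_clock to one of
  // the other clocks; print which one, if any.
  std::cout << "same type as system_clock: "
            << std::is_same<hrc, std::chrono::system_clock>::value << "\n"
            << "same type as steady_clock: "
            << std::is_same<hrc, std::chrono::steady_clock>::value << "\n";
  return 0;
}
```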
clock_gettime(2)
: is the underlying function used in the
implementation of std::chrono::steady_clock.
One could argue that using it directly would be more efficient;
however, the C++ classes around it add very little overhead and
offer a much better interface.
Candidly, I would have ended up writing a wrapper around this call,
and my wrapper would have been worse than the one provided by the
standard, so why bother?
gettimeofday(2)
is a POSIX call with similar semantics to
clock_gettime(CLOCK_REALTIME, ...).
Even the POSIX standard no longer recommends using it
[4],
and recommends clock_gettime
instead.
So this one is rejected because it is effectively obsolete, and it is
also not monotonic, so not a good stopwatch.
time(2)
only has second resolution, and it is not monotonic.
Clearly not adequate for my purposes.
rdtscp
/ rdtsc
: Okay, this is the one that all the low-level
programmers go after.
It is an x86 instruction that essentially returns the number of
ticks since the CPU started.
You cannot ask for lower overhead than “a single instruction”, right?
I have used this approach in the past, but you do need to calibrate the
counter; it is never correct to just take the count and divide it by
the clock rate of your CPU.
There are a number of other
pitfalls too.
Furthermore, clock_gettime
is implemented using
vDSO,
which greatly reduces the overhead of these system calls.
My little
benchmark
indicates that the difference between them is about 30ns (yes,
nanoseconds) on my workstation.
In my opinion, its use is no longer justified on modern Linux
systems; it carries a lot of extra complexity that you only need if
you are measuring things in the nanosecond range.
I may need it eventually, if I start measuring computations that
take less than one microsecond, but until I do I think
std::chrono::steady_clock
is much easier to use.
System Configuration
Running benchmarks on a typical Linux workstation or server can be frustrating because the results vary so much. Run the test once, it takes 20ms; run it again, it takes 50ms; run it a third time, it takes 25ms; run it once more and you get 45ms. Where is all this variation coming from? And how can we control it? I have found that you need to control at least the following to get consistent results: (1) the scheduling class for the process, (2) the percentage of the CPU reserved for non-real-time processes, (3) the CPU frequency scaling governor in the system, and (4) the overall system load.
I basically tested all the different combinations of these parameters, removing the combinations that produce bad results until I found the one (or the few) that work well. Well, when I say “all combinations” I do not quite mean that: there are 99 different real-time priorities, do I need to test all of them? What I actually mean is:
scheduling class: I ran the microbenchmark in both the default
scheduling class (SCHED_OTHER
), and at the maximum priority in the
real-time scheduling class (SCHED_FIFO
).
If you want a detailed description of the scheduling classes and how
they work I recommend the man page:
sched(7).
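For reference, this is roughly how a process can move itself into SCHED_FIFO at the highest priority. This is a sketch, not necessarily what the JayBeams framework does, and it only works if the process has the right privileges (e.g. an rtprio entry in /etc/security/limits.conf or CAP_SYS_NICE):

```cpp
#include <sched.h>
#include <cerrno>
#include <cstring>
#include <iostream>

// Move the calling process to SCHED_FIFO at the maximum priority.
bool set_max_fifo_priority() {
  struct sched_param param;
  std::memset(&param, 0, sizeof(param));
  param.sched_priority = sched_get_priority_max(SCHED_FIFO);
  if (sched_setscheduler(0, SCHED_FIFO, &param) != 0) {
    std::cerr << "sched_setscheduler: " << std::strerror(errno) << "\n";
    return false;
  }
  return true;
}
```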
non-real-time CPU reservation: this one is not as well known, so here
is a brief intro: real-time tasks can starve the non-real-time tasks
if they go haywire. That might be okay, but they can also starve the
interrupts, and that can be a “Bad Thing”[tm].
So by default the Linux kernel is configured to reserve a percentage
of the CPU for non-real-time workloads.
Most systems set this to 5%, but it can be changed by writing
into /proc/sys/kernel/sched_rt_runtime_us
[5].
For benchmarks this is awful: if your system has more cores than the
benchmark is going to use, why not run with 0% reserved for the
non-real-time tasks?
So I tried with both a 0% and a 5% reservation for non-real-time tasks
to see how this affects the predictability of the results when using
real-time scheduling (it should make no difference when running in the
default scheduling class, so I skipped that case). A sketch of how to
change the reservation appears below.
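Here is what that change might look like from a (privileged) helper program; writing -1 removes the limit entirely. This is a sketch, the actual JayBeams driver script does the equivalent via sudo:

```cpp
#include <fstream>
#include <iostream>

// Allow real-time tasks to use up to 100% of the CPU by removing the
// reservation for non-real-time tasks.  Requires root privileges.
bool disable_rt_throttling() {
  std::ofstream f("/proc/sys/kernel/sched_rt_runtime_us");
  f << -1 << std::flush;
  if (!f) {
    std::cerr << "cannot update sched_rt_runtime_us (are you root?)\n";
    return false;
  }
  return true;
}
```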
CPU frequency scaling governor: modern CPUs can change their
frequency dynamically to trade off power efficiency against
performance. The system provides different governors that offer
distinct strategies to balance this tradeoff.
I ran the tests with both the ondemand
CPU governor, which attempts
to increase the CPU frequency as soon as it is needed,
and with the performance
CPU frequency governor which always runs
the CPU at the highest available frequency
[6].
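Switching governors is another privileged operation handled outside the benchmark itself; conceptually it is just a write to sysfs, along these lines (a sketch; the cpupower(1) utility is the more common way to do this):

```cpp
#include <fstream>
#include <sstream>
#include <thread>

// Switch every core to the "performance" frequency governor.
// Requires root privileges.
void set_performance_governor() {
  unsigned n = std::thread::hardware_concurrency();
  for (unsigned cpu = 0; cpu != n; ++cpu) {
    std::ostringstream path;
    path << "/sys/devices/system/cpu/cpu" << cpu
         << "/cpufreq/scaling_governor";
    std::ofstream f(path.str());
    f << "performance" << std::endl;
  }
}
```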
system load: we all know that performance results are affected by external load, so to simulate the effect of a loaded vs. idle system I ran the tests with the system basically idle (called ‘unloaded’ in the graphs). Then I repeated the benchmarks while running N processes, one for each core, each trying to consume 100% of a core.
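The load generator can be as simple as one busy-waiting worker per core, something along these lines (a sketch, not the exact program used for these measurements):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Trivial load generator: one thread per core, each spinning at 100%.
// Stop it with Ctrl-C; a real script would add a duration or a signal
// handler.
int main() {
  unsigned n = std::thread::hardware_concurrency();
  std::vector<std::thread> workers;
  for (unsigned i = 0; i != n; ++i) {
    workers.emplace_back([] {
      std::atomic<unsigned long> counter(0);
      for (;;) {
        counter.fetch_add(1, std::memory_order_relaxed);
      }
    });
  }
  for (auto& t : workers) t.join();
  return 0;
}
```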
Finally, for all the
possible combinations of the configuration parameters and load I
described above, I ran the microbenchmark four times.
I actually used abobs<>
in these benchmarks, but
the results apply regardless of what you are testing.
Let’s look at the pretty graph:
The first thing that jumps at you is the large number of outliers in the default scheduling class when the system is under load. That is not a surprise: the poor benchmark is competing with the load generator, and they both run at the same priority. We probably do not want to run benchmarks on loaded systems at the default scheduling class; we kind of knew that, but now it is confirmed.
We can also eliminate the ondemand
governor when using the real-time
scheduling class.
When there is no load in the system the results are quite variable
under this governor.
However it seems that it performs well when the system is loaded.
That is weird, right?
Well if you think about it, under high system load the ondemand
governor pushes the CPU frequency to its highest value because the
load generator is asking for the CPU all the time.
That actually improves the consistency of the results because when the
benchmark gets to run the CPU is already running as fast as possible.
In effect, running with the ondemand
governor under high load is
equivalent to running under the performance
governor under any load.
Before Going Further: Fix the Input
Okay, so the ondemand
governor is a bad idea, and running in the
default scheduling class with high load is a bad idea too.
The following graph shows different results for the microbenchmark
when the system is always configured to use the performance
CPU
frequency governor, and excluding the default scheduling class when
the system is under load:
I have broken down the times by the different inputs to the test.
The runs either use the same seed for the pseudo-random number
generator (PRNG) that creates the synthetic sequence of operations we
have discussed, labeled fixed, or initialize the PRNG from
/dev/urandom,
labeled urandom, of course.
Obviously some of the differences in execution time can be attributed
to the input, but there are noticeable differences even when the input
is always the same.
Notice that I do not recommend running the benchmarks for
abobs<>
with a fixed input.
We want the class to work efficiently for a large class of
inputs, not simply for the one we happen to measure with.
We are fixing the input in this post in an effort to better tune our
operating system.
Effect of the System Load and Scheduling Limits
By now this has become a quest: how do we configure the system and
the test to get the most consistent results possible?
Let’s look at the results again, but remove the ondemand
governor,
and keep only the runs that used the same input:
Okay, we knew from the beginning that running on a loaded system was a bad idea. We can control the outliers with suitable scheduling parameters, but there is a lot more variation between the first and third quartiles (those are the ends of the boxes in these graphs, btw) with load than without it.
I still have not examined the effect, if any, of that
/proc/sys/kernel/sched_rt_runtime_us
parameter:
The effects seem to be very small, and hard to see in the graph. Time for some number crunching. I chose the interquartile range (IQR) because it is robust and non-parametric; it captures 50% of the data no matter what the distribution or the outliers are:
Governor | Scheduling Class | Real-time Scheduling Limits | IQR |
---|---|---|---|
ondemand | default | N/A | 148 |
performance | default | N/A | 89 |
performance | FIFO | default (95%) | 40 |
performance | FIFO | unlimited | 28 |
That is enough to convince me that there is a clear benefit to using all the settings I have discussed above. Just for fun, here is how bad those IQRs were to start with, even when restricted to the same seed for all runs:
loaded | seed | scheduling | governor | Interquartile Range |
---|---|---|---|---|
loaded | fixed | default | ondemand | 602 |
loaded | fixed | rt:default | ondemand | 555 |
loaded | fixed | default | performance | 540 |
loaded | fixed | rt:default | performance | 500 |
loaded | fixed | rt:unlimited | performance | 492 |
loaded | fixed | rt:unlimited | ondemand | 481 |
I believe the improvement is quite significant; I will show in a future post that this is the difference between having to run a few hundred iterations of the test vs. tens of thousands of iterations to obtain enough statistical power in the microbenchmark.
I should mention that all these pretty graphs and tables are compelling, but they do not have sufficient statistical rigor to draw quantitative conclusions. I feel comfortable asserting that changing these system configuration parameters has an effect on the consistency of the performance results. I cannot assert with any confidence what the size of the effect is, whether the results are statistically significant, or to what population of tests they apply. Those things are possible to do, but they would distract us from the objective of describing benchmarks rigorously.
Summary
The system configuration seems to have a very large effect on how
consistent your benchmark results are.
I recommend you run microbenchmarks on the SCHED_FIFO
scheduling class,
at the highest available priority, on a lightly loaded system,
where the CPU frequency scaling governor has been set to
performance
, and where the system has been configured to dedicate up
to 100% of the CPU to real-time tasks.
The microbenchmark framework automatically
sets all these configuration parameters.
Well, almost: at the moment I use a driver script to set the CPU
frequency scaling governor and to change the CPU reservation for
non-real-time tasks.
I prefer to keep those in a separate script because otherwise one needs
superuser privileges to run the benchmark.
Setting the scheduling class and priority is fine;
you only need the right capabilities, granted via
/etc/security/limits.conf.
The script is small and easy to inspect for security problems;
in fact, it just relies on sudo, so a simple grep finds all the
relevant lines.
If you do not like these settings, the framework can be configured to
not set them.
It can also be configured to either ignore errors when changing the
scheduling parameters (the default), or to raise an exception if
setting any of the scheduling parameters fails.
I think you should use std::chrono::steady_clock
if you are running C++
microbenchmarks on Linux. Using rdtsc
is probably the only option
if you need to measure things in the [100,1000] nanosecond range,
but there are many pitfalls and caveats; read about them online before
jumping into coding.
Even with all this effort to control the consistency of the benchmark results, and even with the very simple, purely CPU-bound test used in this post, the results exhibit some variability. These benchmarks live in a world where only rigorous statistics can help you draw the right conclusions, or at least help you avoid the wrong conclusions.
As I continue to learn how to run rigorous, reproducible microbenchmarks on Linux, I keep having to pick up more and more statistical tools. I would like to talk about them in my next post.
Future Work
There are a few other things about locking down the system
configuration that I left out but that should not go unmentioned.
Page faults can introduce latency in the benchmark, and
can be avoided by using the mlockall(2)
system call.
I do not think these programs suffer too much from it, but changing
the framework to make this system call is not too hard and sounds like
fun.
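A minimal sketch of what that call looks like, in case the framework ever grows this option (locking memory may require CAP_IPC_LOCK or a suitable RLIMIT_MEMLOCK):

```cpp
#include <sys/mman.h>
#include <cerrno>
#include <cstring>
#include <iostream>

// Lock all current and future pages in memory so page faults do not
// perturb the measurements.
bool lock_all_memory() {
  if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
    std::cerr << "mlockall: " << std::strerror(errno) << "\n";
    return false;
  }
  return true;
}
```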
Similar to real-time CPU scheduling, the Linux kernel offers
facilities to schedule I/O operations at different priorities via the
ioprio_set(2)
system calls.
Since these microbenchmarks perform no I/O I am assuming this will not
matter, but it possibly could.
I should extend the framework to make these calls, even if it is only
when some configuration flag is set.
I have not found any evidence that setting the CPU affinity for the process (also known as pinning) helps in this case. It might do so, but I have no pretty graph to support that. It also wreaks havoc with non-maskable interrupts when the benchmark runs at higher priority than the interrupt priority.
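For completeness, pinning the calling process to a single core would look something like this (again a sketch; as noted, I have no data showing it helps here):

```cpp
#include <sched.h>
#include <cerrno>
#include <cstring>
#include <iostream>

// Pin the calling process to a single core ("pinning").  Compile with
// g++ on Linux, which defines _GNU_SOURCE and exposes cpu_set_t.
bool pin_to_core(int core) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(core, &set);
  if (sched_setaffinity(0, sizeof(set), &set) != 0) {
    std::cerr << "sched_setaffinity: " << std::strerror(errno) << "\n";
    return false;
  }
  return true;
}
```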
In some circles it is popular to create a container restricted to a single core and move all the system daemons to that core. Then you run your system (or your microbenchmark) on any of the other cores. That probably would help, but it is so tedious to set up that I have not bothered.
Linux 4.7 and higher include schedutil,
a CPU frequency scaling
governor based on information provided by the scheduler
[7].
Initial reports indicate that it performs almost as well as the
performance
governor.
For our purposes the performance
governor is a
better choice, as we are willing to forgo the power efficiency of a
more intelligent governor in favor of the predictability of our
results.
My friend Gonzalo points out that we
assume std::chrono::steady_clock
has good enough resolution;
we should at least check whether this is the case at runtime and warn
if not.
Unfortunately, there is no guarantee in C++ as to the resolution of
any of the clocks, nor is there an API to expose the expected
resolution of the clock in the language.
This means any check on resolution must be platform specific.
On Linux,
clock_getres(2)
seems promising at first, but it turns out to always return 1ns for
all the “high-resolution” clocks, regardless of their actual
resolution.
I do not have, at the moment, a good idea on how to approach this
problem beyond relying on the documentation of the system.
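The call itself is trivial, which is what makes its answer so disappointing. A sketch of the check (on older glibc you may need to link with -lrt):

```cpp
#include <time.h>
#include <iostream>

// Query the advertised resolution of CLOCK_MONOTONIC, the clock behind
// std::chrono::steady_clock on Linux.  As noted above, this typically
// reports 1ns even when the hardware cannot deliver that resolution.
int main() {
  struct timespec ts;
  if (clock_getres(CLOCK_MONOTONIC, &ts) != 0) {
    std::cerr << "clock_getres failed\n";
    return 1;
  }
  std::cout << "resolution: " << ts.tv_sec << "s " << ts.tv_nsec << "ns\n";
  return 0;
}
```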
Notes
The data in this post was generated using a shell script, available here; it should be executed in a directory where you have compiled JayBeams.
Likewise, the graphs for this post were generated using an R script, which is also available.
The detailed configuration of the server hardware and software used to run these scripts is included in the comments of the csv file that accompanies this post.
Updates
I completely reworded this post; the first version read like a not-very-good paper. Same content, just better tone, I think. I beg forgiveness from my readers; I am still trying to find a good style for blogs.