8-bit Assembly was Fun Ramblings of a not-so-old Programmer

Of Message Rates and Histograms

After pushing the code coverage to nearly 99% from the low 80’s, I have found exactly one bug. This was a known problem and a feature that I intentionally left unimplemented. Not a huge return on investment, but increasing code coverage is satisfying for its own sake.

The next chunk of code to release deals with measuring expected message rates. To design the system I want to know (within an order of magnitude) how many messages per second, per millisecond, and even per microsecond to expect. This is partly to satisfy my curiosity, partly to illustrate the technical problems when building real-time market data processing systems, and partly because that defines the design envelope.

Don’t write code if you can just go and look it up online

Before we write any code, why not simply look up the numbers online? They might not be accurate, but they would help with the initial estimation. My search-fu may not be the best, but this information is hard to find. It is not exactly secret: exchanges publish it so their users can provision their networks and software systems adequately. But it is not on their front page either; they reserve that space for information with more commercial value.

But some digging around can get you the basics.

BATS

For example, as of 2015-06-01 BATS informs its users that they can expect 21,883 messages in the peak millisecond for their BZX exchange. BATS owns several exchanges, and the numbers can be as “low” as 15,000 messages for the peak millisecond.

To illustrate how bursty this data is: in the same publication BATS notes that on the BYX exchange the peak minute may have the equivalent of 50,000 messages/second, while the peak millisecond carries the equivalent of 16,000,000 messages/second.

We will see later that the peak microsecond for some of these feeds can carry 270 messages, an implied rate of 270,000,000 messages/second. In other words, the bursts can be 3 orders of magnitude higher than the average. And they can be very high indeed.
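The implied rates quoted above are simple unit conversions. Here is a minimal sketch of the arithmetic; the function and constant names are made up for illustration and are not part of the code I am releasing.

```cpp
#include <cstdint>

// Convert a message count observed in a window of `window_ns` nanoseconds
// into the equivalent rate in messages per second.
// (Hypothetical helper, for illustration only.)
constexpr std::uint64_t implied_rate_per_second(
    std::uint64_t messages, std::uint64_t window_ns) {
  return messages * 1000000000ULL / window_ns;
}

constexpr std::uint64_t millisecond_ns = 1000000;
constexpr std::uint64_t microsecond_ns = 1000;

// 16,000 messages in the peak millisecond is 16,000,000 messages/second.
static_assert(
    implied_rate_per_second(16000, millisecond_ns) == 16000000,
    "peak millisecond rate");
// 270 messages in the peak microsecond is 270,000,000 messages/second.
static_assert(
    implied_rate_per_second(270, microsecond_ns) == 270000000,
    "peak microsecond rate");
// Ratio of the peak microsecond rate to the 50,000 messages/second
// quoted for the peak minute.
static_assert(
    implied_rate_per_second(270, microsecond_ns) / 50000 == 5400,
    "burst ratio");
```

The last assertion shows that the peak microsecond rate is 5,400 times the rate quoted for the peak minute, which is where the "3 orders of magnitude" comes from.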

NASDAQ

NASDAQ provides a report recommending 160 Mbps of bandwidth for their ITCH-5.0 feed. Since their message sizes are around 40 bytes (see the spec), we can estimate that this feed peaks at somewhere between 400,000 and 500,000 messages/second: 160 Mbps is 20 MB/s, which is 500,000 messages of exactly 40 bytes, and fewer once packet overhead is counted. However, we can do much better: NASDAQ provides sample data so their users can verify that their feed handlers are processing it correctly. We will use this data to generate some interesting stats, but more on that later!
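The bandwidth estimate is back-of-the-envelope arithmetic, sketched below. It ignores packet headers and framing, and the helper name is hypothetical. The 50-byte case is not from the NASDAQ report; it is only there to show how sensitive the result is to the assumed message size.

```cpp
#include <cstdint>

// Upper bound on messages/second for a link of the given bandwidth, assuming
// every message has the same size and there is no protocol overhead.
// (Hypothetical helper, for illustration only.)
constexpr std::uint64_t max_messages_per_second(
    std::uint64_t bandwidth_bits_per_second,
    std::uint64_t message_size_bytes) {
  return bandwidth_bits_per_second / (8 * message_size_bytes);
}

// 160 Mbps with 40-byte messages: 500,000 messages/second.
static_assert(
    max_messages_per_second(160000000, 40) == 500000, "40-byte messages");
// The same link with 50-byte messages: 400,000 messages/second.
static_assert(
    max_messages_per_second(160000000, 50) == 400000, "50-byte messages");
```

Either way the answer is a few hundred thousand messages per second, which is all the precision an order-of-magnitude estimate needs.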

So What about Some Code?

I just pushed a few new classes to github to keep histograms, that is, counts of events by bucket. Shortly I will push additional classes where the range of each bucket represents some observed message rate. With these two sets of classes in place we can estimate the min, max, mean, median, p90, p99 or any percentile of message rates we are interested in.
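To make the idea concrete, here is a minimal sketch of how a percentile can be estimated from bucket counts. This is not the jb::histogram interface: the class name, its members, and the one-bucket-per-integer layout are all simplifications invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified histogram: one bucket per integer in [0, size).
class simple_histogram {
public:
  explicit simple_histogram(std::size_t size) : counts_(size, 0), total_(0) {}

  // Record one observation; the caller guarantees value < size.
  void sample(std::size_t value) {
    assert(value < counts_.size());
    ++counts_[value];
    ++total_;
  }

  std::uint64_t total() const { return total_; }

  // Estimate the p-th percentile (0 < p <= 100) as the smallest bucket whose
  // cumulative count reaches p percent of all the samples.
  std::size_t estimated_percentile(double p) const {
    assert(total_ > 0 && p > 0 && p <= 100);
    double const target = total_ * p / 100.0;
    std::uint64_t cumulative = 0;
    for (std::size_t i = 0; i != counts_.size(); ++i) {
      cumulative += counts_[i];
      if (cumulative >= target) {
        return i;
      }
    }
    return counts_.size() - 1;
  }

private:
  std::vector<std::uint64_t> counts_;
  std::uint64_t total_;
};
```

The memory used depends only on the number of buckets, not on the number of samples, which is the reason to use a histogram instead of storing every observation. The price is that the estimate is only as precise as the bucket width.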

Of course the histograms will also be useful to later compute inter-arrival times, or latencies.

The jb::histogram class decomposes the problem of defining the bucket ranges and computing several statistical estimators into two separate classes. The bucket ranges are defined by a strategy, and I have implemented two simple ones:

  • jb::integer_range_binning: simply defines one bucket for each integer value between some user-prescribed minimum and maximum. In other words, it is about as simple as you can get.
  • jb::explicit_cuts_binning: allows the user to define the exact points for each bucket. Typically this is useful when you want to define buckets of variable size, such as [0,1,2,3,…,9,10,20,30…,100]
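A rough sketch of what these two strategies do follows. The class and member names are hypothetical and deliberately different from the real ones; see the repository for the actual jb:: classes. The sketch assumes integer samples and half-open buckets.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch: one bucket per integer in [min, max).
class integer_range_sketch {
public:
  integer_range_sketch(int min, int max) : min_(min), max_(max) {
    assert(min < max);
  }
  std::size_t bin_count() const {
    return static_cast<std::size_t>(max_ - min_);
  }
  // Map a sample in [min, max) to its bucket index.
  std::size_t bin_index(int x) const {
    assert(min_ <= x && x < max_);
    return static_cast<std::size_t>(x - min_);
  }

private:
  int min_;
  int max_;
};

// Hypothetical sketch: buckets delimited by sorted cut points, where bucket i
// covers the half-open range [cuts[i], cuts[i+1]).
class explicit_cuts_sketch {
public:
  explicit explicit_cuts_sketch(std::vector<int> cuts)
      : cuts_(std::move(cuts)) {
    assert(cuts_.size() >= 2);
    assert(std::is_sorted(cuts_.begin(), cuts_.end()));
  }
  std::size_t bin_count() const { return cuts_.size() - 1; }
  // Binary search for the first cut strictly greater than x; the bucket
  // that contains x starts at the cut just before it.
  std::size_t bin_index(int x) const {
    assert(cuts_.front() <= x && x < cuts_.back());
    auto next_cut = std::upper_bound(cuts_.begin(), cuts_.end(), x);
    return static_cast<std::size_t>(next_cut - cuts_.begin()) - 1;
  }

private:
  std::vector<int> cuts_;
};
```

With the cuts [0,1,2,…,9,10,20,30,…,100] there are 19 buckets: the values 0 through 9 each get their own bucket, while 10 through 19 share one. The lookup costs a binary search over the cuts, compared with a single subtraction for the integer range.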

Users can define additional binning strategies as long as they conform to the jb::binning_strategy_concept interface. The jb::histogram class enforces these requirements using compile-time assertions, which (hopefully) provide better error messages than whatever the compiler produces by default.
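One common way to get such compile-time checks in C++11 is a small detection trait combined with static_assert. The sketch below illustrates that technique only. The trait, the required member names (bin_count and bin_index), and the checked_histogram class are assumptions for this example, not the requirements of jb::binning_strategy_concept.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>
#include <vector>

// Detect a const member `bin_index(int)` whose result converts to size_t.
template <typename S, typename = void>
struct has_bin_index : std::false_type {};
template <typename S>
struct has_bin_index<
    S, typename std::enable_if<std::is_convertible<
           decltype(std::declval<S const&>().bin_index(0)),
           std::size_t>::value>::type> : std::true_type {};

// Detect a const member `bin_count()` whose result converts to size_t.
template <typename S, typename = void>
struct has_bin_count : std::false_type {};
template <typename S>
struct has_bin_count<
    S, typename std::enable_if<std::is_convertible<
           decltype(std::declval<S const&>().bin_count()),
           std::size_t>::value>::type> : std::true_type {};

// Hypothetical histogram that validates its strategy at compile time.
template <typename binning_strategy>
class checked_histogram {
  static_assert(
      has_bin_count<binning_strategy>::value,
      "binning_strategy must provide: std::size_t bin_count() const");
  static_assert(
      has_bin_index<binning_strategy>::value,
      "binning_strategy must provide: std::size_t bin_index(int) const");

public:
  explicit checked_histogram(binning_strategy s)
      : strategy_(std::move(s)), counts_(strategy_.bin_count(), 0) {}

  void sample(int x) {
    std::size_t const bin = strategy_.bin_index(x);
    assert(bin < counts_.size());
    ++counts_[bin];
  }
  std::size_t count(std::size_t bin) const { return counts_[bin]; }

private:
  binning_strategy strategy_;  // declared first: counts_ depends on it
  std::vector<std::size_t> counts_;
};

// A conforming strategy and a non-conforming type, to exercise the traits.
struct ten_buckets {
  std::size_t bin_count() const { return 10; }
  std::size_t bin_index(int x) const { return static_cast<std::size_t>(x); }
};
struct not_a_strategy {};

static_assert(has_bin_index<ten_buckets>::value, "ten_buckets conforms");
static_assert(!has_bin_index<not_a_strategy>::value, "detected as invalid");
```

Instantiating checked_histogram<not_a_strategy> stops the build with the message from the static_assert, instead of a longer error from inside the member functions.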

What is with the weird dates?

If you see a post with a strange date, it is because I am using UTC to date them. Sometimes I post late on US Eastern time, and that may make it appear as if the post is from the future.

Code Coverage Integration

Code coverage metrics are the next tool that I want to make available in my project. Poor coverage metrics are an easy way to look for potential bugs. Alas! The opposite is not true: even with 100% line coverage, and apparently even with 100% branch coverage, one can at best expect to find around 60% of the defects in the code. Or at least, that is what the relatively small amount of literature on the subject seems to suggest. Don’t take my word for it, go buy this book, and read the sections about testing. Before you organize a mob and start passing the pitchforks and torches, remember that this does not mean that unit testing (or automated testing) is useless. It is a relatively cheap way to filter many errors, just not an infallible one.

Build Matrix

At this point the number of builds is getting complicated. We want to build with clang and gcc, we want to get code coverage data, and we want to generate the doxygen documentation. Before the .travis.yml file gets out of hand we need to do some basic refactoring and take advantage of the environment matrix. As usual, that was much harder than anticipated, but now I have a build for clang dbg (without any optimizations), which also uploads the doxygen documentation; a build for gcc with code coverage, which also uploads to coveralls.io; and builds for both gcc and clang with all optimizations.

I added the coveralls.io badge, though the state is shameful at the moment. I have been trying to get lcov and llvm-cov to cooperate, without success. I am interested because clang promises to deliver branch coverage, a far more interesting metric than line coverage in my opinion. But this is a nice-to-have more than a required feature.

Ubuntu 12.04 is irritating me.

My experience with lcov is always more frustrating than it needs to be. Needless to say, the default version of gcov crashed miserably with code generated by gcc-4.9 on Ubuntu 12.04. It seemed reasonable enough to assume that the stock lcov with the right --gcov-tool option would work. Nope, no luck. So we need to build a more recent version of lcov in addition to installing more recent versions of gcov and gcc.

Lucky for me, somebody else had solved this problem before.

Automating Doxygen Documentation

I think of continuous integration, unit testing, code reviews, design documents, and documentation as practices that prevent or catch common errors. It is easy to see why that is the case for unit testing: you are making sure the code works as you expect as soon as possible. The same goes for continuous integration: you are making sure that defects do not go unnoticed for too long.

I believe (yes, this is one of those opinions I promised in the About page) that documentation is also a practice to prevent defects: it stops others from using your code incorrectly. It states, in words that humans can read, how you expect the code to be used and how others should expect the code to behave.

Others have said this better than I possibly could, but it is worth repeating: yes, by all means make your interfaces so obvious that very little documentation is needed; yes, by all means use the type system so it is hard to use the code incorrectly (but you can go too far on this); and yes, by all means write unit tests that describe the expected behaviors and uses of your code. Do all those things and then document your code, state how it is to be used, state what should happen when it is used. Yes, maintaining the documentation is hard, just do it and shut up will you?

This is why I am setting up automated generation of Doxygen documents. Writing Doxygen comments does not absolve me of all responsibilities: I still should write design documents of some sort, and nice pages describing how the code should be used, and examples. But it is a start, and it allows me to see when documentation is missing.

Forced to Compile from Scratch

I ran out of luck locating pre-built binaries for my dependencies. First, the autoconf-archive packages that I can locate for Ubuntu 12.04 do not have good support for Boost.Log, which I need. Second, there are no packages for yaml-cpp, which I need to parse (duh) YAML files. And last, but this was expected, JayBeams depends on Skye.

None of these packages are really big, so I simply resigned myself to compiling them from source and installing them. But that will be a drag if I want to use the Travis CI functionality for build matrices.

As I write this Travis CI is dutifully compiling the code. The first build was “successful”, but I purposefully set it up to just install all the dependencies and then run ./configure. No sense in getting more errors when I expect things to fail.

So, after some unsuccessful web searches I created a few more installation steps:

before_install:
# ... lots of stuff skipped see git repo for details ...
  - wget -q http://ftpmirror.gnu.org/autoconf-archive/autoconf-archive-2015.02.24.tar.xz
  - tar -xf autoconf-archive-2015.02.24.tar.xz
  - (cd autoconf-archive-2015.02.24 && ./configure --prefix=/usr && make && sudo make install)
  - sudo apt-get -qq -y install cmake
  - wget -q https://github.com/jbeder/yaml-cpp/archive/release-0.5.1.tar.gz
  - tar -xf release-0.5.1.tar.gz
  - (cd yaml-cpp-release-0.5.1 && mkdir build && cd build && cmake -DCMAKE_INSTALL_PREFIX=/usr .. && make && make test && sudo make install)
  - wget -q https://github.com/coryan/Skye/releases/download/v0.2/skye-0.2.tar.gz
  - tar -xf skye-0.2.tar.gz
  - (cd skye-0.2 && CXX=g++-4.9 CC=gcc-4.9 ./configure --with-boost-libdir=/usr/lib/x86_64-linux-gnu/ && make check && sudo make install)

The full gory details can be found in the repository.

The sheer complexity of the installation process is making it more and more tempting to try some kind of container-based solution. Simply pull the container and compile. Potentially it can get developers going faster too: install this container and develop in that environment. On the other hand, users may want to install editors, IDEs, debuggers and other tools that would not be in the container, so a lot of customization is unavoidable.

At this point I just cringe at the number of steps in before_install and keep going.

Configuration and Logging

Configuration and Logging are some of those things that all projects must choose how to do. Of the two, the least interesting to me was logging. I needed a solution, but I was not particularly interested in implementing one. I have done this in the past and I was unlikely to learn anything new. A good logging library will (amongst many other things) be able to filter by severity at run-time, and will also completely eliminate some severity levels at compile-time (if desired). It will help you identify the source of the messages, by filename and line number for example, but it can also include the process and thread that generated the message. It can send the log to multiple destinations. It can timestamp the messages. Of course, it uses the iostream interface to log the basic types and take advantage of any user-defined streaming operators. I have chosen Boost.Log simply because I was already using Boost and it seems to meet most of the requirements I can think of.
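To illustrate a few of these requirements (run-time filtering, compile-time elimination, source location, and the iostream interface), here is a toy sketch using only the standard library. It is emphatically not Boost.Log and not how Boost.Log is implemented. Every name in it, including the SKETCH_MIN_SEVERITY and SKETCH_LOG macros, is invented for this example.

```cpp
#include <cassert>
#include <iostream>
#include <sstream>
#include <string>

enum class severity { debug = 0, info = 1, warning = 2, error = 3 };

// Levels below this value are disabled at compile time; it can be changed
// with -DSKETCH_MIN_SEVERITY=... (hypothetical macro name).
#ifndef SKETCH_MIN_SEVERITY
#define SKETCH_MIN_SEVERITY 1
#endif

class logger {
public:
  explicit logger(std::ostream& sink)
      : sink_(&sink), threshold_(severity::info) {}

  // Run-time filter: messages below the threshold are dropped.
  void set_threshold(severity s) { threshold_ = s; }
  bool enabled(severity s) const { return s >= threshold_; }

  void write(
      severity s, char const* file, int line, std::string const& message) {
    *sink_ << "[" << static_cast<int>(s) << "] " << file << ":" << line
           << " " << message << "\n";
  }

private:
  std::ostream* sink_;
  severity threshold_;
};

// The first test compares against a compile-time constant, so an optimizing
// compiler can discard the whole statement for disabled levels. The second
// test is the run-time filter; the message is only formatted if it passes.
#define SKETCH_LOG(lg, sev, expr)                              \
  do {                                                         \
    if (static_cast<int>(sev) >= SKETCH_MIN_SEVERITY &&        \
        (lg).enabled(sev)) {                                   \
      std::ostringstream sketch_os_;                           \
      sketch_os_ << expr;                                      \
      (lg).write(sev, __FILE__, __LINE__, sketch_os_.str());   \
    }                                                          \
  } while (false)
```

A call looks like SKETCH_LOG(lg, severity::warning, "rate=" << 42); and any type with a streaming operator can appear in the message. A real library adds multiple destinations, timestamps, and thread identifiers on top of this, which is exactly the part I did not want to write again.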

Configuration

Application configuration is a more difficult topic. I wanted a configuration framework that allowed:

  • User-defined types as configuration options, e.g. time durations, or kernel scheduling parameters.
  • Recursively defined configuration options, that is, one can use a configuration object inside another.
  • The configuration objects have suitable default values, without requiring custom coding of special classes.
  • The default values can be defined at compile-time using -D options to the compiler, so one can change the defaults on a different platform, for example.
  • One should be able to override one configuration parameter without having to explicitly repeat the default values for the other parameters.
  • For tests and simple examples it should be possible to override parameters in the code, without having to modify argv or something similar.
  • Because the configuration can get quite complex, one should be able to read the configuration from files.
  • The location of these files should be configurable using some kind of environment variable.
  • The library should look at a set of standard locations for the configuration file, such as “/etc”, and then “wherever the binary is installed”, and then “whatever the value of $FOO_HOME is”.
  • The values set by the configuration files can be overridden by command-line arguments.

I defined a number of classes that achieve (I think) all these goals. To parse the configuration files I used YAML, because it is easy to handcraft configuration files in it, and I picked the yaml-cpp library because it seemed easy enough to use.
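A few of the requirements in the list can be shown in a stripped-down sketch: defaults that can be changed with -D, overriding one parameter without repeating the others, and layering the sources so that later ones win. This is not the jb::configuration design. The struct, the macro names, and the use of a std::map in place of a parsed YAML file or argv are all assumptions made to keep the example self-contained.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Compile-time defaults; change them with e.g. -DSKETCH_DEFAULT_PORT=9000
// (hypothetical macro names).
#ifndef SKETCH_DEFAULT_PORT
#define SKETCH_DEFAULT_PORT 8080
#endif
#ifndef SKETCH_DEFAULT_HOST
#define SKETCH_DEFAULT_HOST "localhost"
#endif

typedef std::map<std::string, std::string> overrides;

// Hypothetical configuration object with suitable defaults.
struct server_config {
  std::string host = SKETCH_DEFAULT_HOST;
  int port = SKETCH_DEFAULT_PORT;

  // Apply a set of overrides: only the keys that are present change,
  // every other parameter keeps its current value.
  void apply(overrides const& values) {
    for (auto const& kv : values) {
      if (kv.first == "host") {
        host = kv.second;
      } else if (kv.first == "port") {
        port = std::stoi(kv.second);
      } else {
        throw std::invalid_argument("unknown option: " + kv.first);
      }
    }
  }
};

// Layering: defaults first, then the configuration file, then the command
// line, so that each layer overrides the previous one.
inline server_config load(
    overrides const& from_file, overrides const& from_command_line) {
  server_config cfg;
  cfg.apply(from_file);
  cfg.apply(from_command_line);
  return cfg;
}
```

Tests can construct a server_config directly and call apply() with the parameters they care about, without touching argv. User-defined option types and nested configuration objects are left out of this sketch.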

Using jb::configuration

A full example of the configuration classes can be found in the examples/configuration.cpp file.

If the Doxygen documentation and the example are not enough, please reach out to me through the mailing list. I will be happy to write a longer document, but the main motivation to commit this code soon was to get the continuous integration and automatic documentation going.