25 Aug 2015
After pushing the code coverage to nearly 99% from the low 80s, I
have found exactly one bug.
This was a known problem and a feature that I intentionally left
unimplemented. Not a huge return on investment, but increasing code
coverage is satisfying for its own sake.
The next chunk of code to release deals with measuring expected
message rates. To design the system I want to know (within an order
of magnitude) how many messages per second, per millisecond, and even
per microsecond to expect. This is partially to satisfy my
curiosity, partially to illustrate the technical problems when
building real-time market data processing systems, and partly because
that defines the design envelope.
Don’t write code if you can just go and look it up online
Before we write any code, why not simply look up the numbers online?
They might not be accurate, but would help with the initial
estimation.
My search-fu may not be the best, but this information is hard to
find. It is not exactly secret: exchanges publish it so their users
can provision their networks and software systems adequately.
But it is not on their front page either; they reserve that space for
information with more commercial value.
Still, some digging around can get you the basics.
BATS
For example, as of 2015-06-01 BATS informs its users that they can
expect 21,883 messages in the peak millisecond for their BZX exchange.
BATS owns several exchanges, and the numbers can be as “low” as 15,000
messages for the peak millisecond.
To illustrate how bursty this data is: in the same publication BATS
notes that on the BYX exchange the peak minute may have the equivalent
of 50,000 messages/second, while the peak millisecond carries the
equivalent of 16,000,000 messages/second.
We will see later that the peak microsecond for some of these feeds
can carry 270 messages, an implied rate of 270,000,000
messages/second. In other words, the bursts can be 3 orders of
magnitude higher than the average. And they can be very high indeed.
NASDAQ
NASDAQ provides a report recommending 160 Mbps of bandwidth for their
ITCH-5.0 feed.
Since their message sizes are around 40 bytes (see the spec), we can
estimate that this feed peaks at roughly 500,000 messages/second
(160 Mbps is 20 MB/s, divided by 40 bytes per message).
However, we can do much better: NASDAQ provides sample data so their
users can verify that their feed handlers are processing it correctly.
We will use this data to generate some interesting stats, but that comes later!
So What about Some Code?
I just pushed a few new classes to GitHub to keep histograms,
that is, counts of events by bucket. Shortly I will push additional
classes where the range of each bucket represents some observed
message rate. With these two classes in place we can estimate the
min, max, mean, median, p90, p99 or any percentile of message rates we
are interested in.
Of course the histograms will also be useful to later compute
inter-arrival times, or latencies.
The jb::histogram class decomposes the problem of defining the
bucket ranges and computing several statistical estimators into two
separate classes. The bucket ranges are defined by a strategy, and I
have implemented two simple ones:
- jb::integer_range_binning: simply defines one bucket for each
integer value between some user-prescribed minimum and maximum. In
other words, it is about as simple as you can get.
- jb::explicit_cuts_binning: allows the user to define the exact
points for each bucket. Typically this is useful when you want to
define buckets of variable size, such as
[0,1,2,3,…,9,10,20,30…,100]
Users can define additional binning strategies as long as they conform
to the jb::binning_strategy_concept interface. The jb::histogram
class enforces these requirements using compile-time assertions, which
(hopefully) provide better error messages than whatever the compiler
produces by default.
What is with the weird dates?
If you see a post with a strange date, it is because I am using UTC
to date them. Sometimes I post late on US Eastern time, and that may
make it appear as if the post is from the future.
24 Aug 2015
Code coverage is the next tool that I want to make available
in my project. Poor coverage metrics are an easy way to look for
potential bugs.
Alas! The opposite is not true: even with 100% line coverage, and
apparently even with 100% branch coverage, one can at best expect to
find around 60% of the defects in the code.
Or at least, that is what the relatively small amount of literature on
the subject seems to suggest.
Don’t take my word for it, go buy this book, and read the sections
about testing.
Before you organize a mob and start passing the pitchforks and
torches, remember that this does not mean that unit testing (or
automated testing) is useless.
It is a relatively cheap way to filter many errors, just not an
infallible one.
Build Matrix
At this point the number of builds is getting complicated.
We want to build with clang and gcc, we want to get code coverage
data, and we want to generate the doxygen documentation.
Before the .travis.yml file gets out of hand we need to do some
basic refactoring and take advantage of the environment matrix.
As usual, that was much harder than anticipated, but now I have a
build for clang dbg (without any optimizations), which also uploads the
doxygen documentation; a build for gcc with code coverage, which also
uploads to coveralls.io; and builds for both gcc and clang with all
optimizations.
I added the coveralls.io badge, though the state is shameful at the
moment.
I have been trying to get lcov and llvm-cov to cooperate, without
success. I am interested because clang promises to deliver branch
coverage, a far more interesting metric than line coverage in my
opinion. But this is a nice-to-have more than a required feature.
Ubuntu 12.04 is irritating me.
My experience with lcov is always more frustrating than it needs to be.
Needless to say, the default version of gcov crashed
miserably with code generated by gcc-4.9 on Ubuntu 12.04.
It seemed reasonable enough to assume that the stock lcov with the
right --gcov-tool option would work. Nope, no luck.
So we need to build a more recent version of lcov in
addition to installing more recent versions of gcov and gcc.
Lucky for me, somebody else had solved this problem before.
23 Aug 2015
I think of continuous integration, unit testing, code reviews, design
documents, and documentation as practices that prevent or catch
common errors.
It is easy to see why that is the case for unit testing: you are
making sure the code works as you expect as soon as possible;
or for continuous integration: you are making sure that defects do
not go unnoticed for too long.
I believe (yes, this is one of those opinions I promised in the About
page) that documentation is also a practice to prevent defects: it stops
others from using your code incorrectly. It states, in words that humans
can read, how you expect the code to be used and how others should
expect the code to behave.
Others have said this better than I possibly could, but it is worth
repeating:
yes, by all means make your interfaces so obvious that very little
documentation is needed;
yes, by all means use the type system so it
is hard to use the code incorrectly (but you can go too far on this);
and yes, by all means write unit tests that describe the expected
behaviors and uses of your code.
Do all those things and then document your code, state how it is to be
used, state what should happen when it is used.
Yes, maintaining the documentation is hard, just do it and shut up
will you?
This is why I am setting up automated generation of Doxygen
documents.
Writing Doxygen comments does not absolve me of all
responsibilities: I should still write design documents of some sort,
and nice pages describing how the code should be used, and examples.
But it is a start, and it allows me to see when documentation is
missing.
23 Aug 2015
I ran out of luck locating pre-built binaries for my dependencies.
First, the autoconf-archive packages that I can locate for Ubuntu
12.04 do not have good support for Boost.Log, which I need.
Second, there are no packages for yaml-cpp, which I use to parse
(duh) YAML files, and which I also need.
And last, but this was expected, JayBeams depends on Skye.
None of these packages are really big, so I simply resigned myself to
compiling them from source and installing them. But that will be a drag
if I want to use the Travis CI functionality for build matrices.
As I write this Travis CI is dutifully compiling the code. The first
build was “successful”, but I purposefully set it up to just install
all the dependencies and then run ./configure. No sense in getting
more errors when I expect things to fail.
So, after some unsuccessful web searches I created a few more
installation steps:
before_install:
# ... lots of stuff skipped see git repo for details ...
- wget -q http://ftpmirror.gnu.org/autoconf-archive/autoconf-archive-2015.02.24.tar.xz
- tar -xf autoconf-archive-2015.02.24.tar.xz
- (cd autoconf-archive-2015.02.24 && ./configure --prefix=/usr && make && sudo make install)
- sudo apt-get -qq -y install cmake
- wget -q https://github.com/jbeder/yaml-cpp/archive/release-0.5.1.tar.gz
- tar -xf release-0.5.1.tar.gz
- (cd yaml-cpp-release-0.5.1 && mkdir build && cd build && cmake -DCMAKE_INSTALL_PREFIX=/usr .. && make && make test && sudo make install)
- wget -q https://github.com/coryan/Skye/releases/download/v0.2/skye-0.2.tar.gz
- tar -xf skye-0.2.tar.gz
- (cd skye-0.2 && CXX=g++-4.9 CC=gcc-4.9 ./configure --with-boost-libdir=/usr/lib/x86_64-linux-gnu/ && make check && sudo make install)
The full gory details can be found in the repository.
The sheer complexity of the installation process is making it more and
more tempting to try some kind of container-based solution. Simply
pull the container and compile.
Potentially it can get developers going faster too:
install this container and develop in that environment. On the other
hand, users may want to install editors, IDEs, debuggers and other
tools that would not be in the container, so a lot of customization is
unavoidable.
At this point I just cringe at the number of steps in before_install
and keep going.
23 Aug 2015
Configuration and logging are some of those things that all projects
must choose how to do. Of the two, the less interesting to me was
logging. I needed a solution, but I was not particularly interested
in implementing one. I have done this in the past and I was unlikely
to learn anything new.
A good logging library will (amongst many other things) be able to
filter by severity at run-time,
and will also completely eliminate some severity levels at
compile-time (if desired).
It will help you identify the source of the messages, by filename and
line number for example, but also can include the process and thread
that generated the message.
It can send the log to multiple destinations.
It can timestamp the messages.
Of course, it uses the iostream interface to log the basic types and
takes advantage of any user-defined streaming operators.
I have chosen Boost.Log simply because I was already using Boost and
it seems to meet most of the requirements I can think of.
Configuration
Application configuration is a more difficult topic. I wanted a
configuration framework that allowed:
- User-defined types as configuration options, e.g. time durations, or
kernel scheduling parameters.
- Recursively defined configuration options, that is, one can use a
configuration object inside another.
- The configuration objects have suitable default values, without
requiring custom coding of special classes.
- The default values can be defined at compile-time using -D options
to the compiler, so one can change the defaults on a different
platform, for example.
- One should be able to override one configuration parameter without
having to explicitly repeat the default values for the other
parameters.
- For tests and simple examples it should be possible to override
parameters in the code, without having to modify argv or something
similar.
- Because the configuration can get quite complex, one should be able
to read the configuration from files.
- The location of these files should be configurable using some kind
of environment variable.
- The library should look at a set of standard locations for the
configuration file, such as “/etc”, and then “wherever the binary is
installed”, and then “whatever the value of $FOO_HOME is”.
- The values set by the configuration files can be overridden by
command-line arguments.
I defined a number of classes that achieve (I think) all these goals.
To parse the configuration files I used YAML, because it was easy to
hand craft configuration files, and I picked the yaml-cpp library
because it seemed easy enough to use.
Using jb::configuration
A full example of the configuration classes can be found in the
examples/configuration.cpp file.
If the Doxygen documentation and the example are not enough, please
reach out to me through the mailing list.
I will be happy to write a longer document, but the
main motivation to commit this code soon was to get the continuous
integration and automatic documentation going.