In the previous post we showed how cross-correlation can be used to find the
time delay between two identical, very simple functions.
Now we want to explore what happens when one of the signals has some
noise.
In the last post we were considering two simple triangular signals
A and B, with B delayed some 13 microseconds from A.
We now modify B by adding some 5% noise to it:
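(A minimal sketch of this step; the sample count, triangle width, and noise amplitude are illustrative assumptions, and the signals are recreated here for completeness.)

```r
# Recreate a triangular signal A and a copy B delayed by 13 microseconds,
# then add roughly 5% uniform noise to B.  Sizes here are illustrative.
n <- 2048                                   # samples, one per microsecond
A <- pmax(0, 1 - abs(seq(0, n - 1) - n / 2) / 200)
delay <- 13
B <- c(rep(0, delay), A)[1:n]
B.noisy <- B + runif(n, min = 0, max = 0.05 * max(A))
```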
And compute the cross-correlation with A:
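(A sketch of the FFT-based computation; the `xcorr()` helper name is an assumption, not part of the original code.)

```r
# Circular cross-correlation via the FFT; the division by length() compensates
# for R's unnormalized inverse transform.
xcorr <- function(x, y) {
  Re(fft(Conj(fft(x)) * fft(y), inverse = TRUE)) / length(x)
}
corr.AB <- xcorr(A, B.noisy)
```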
We can also check the value of this cross-correlation:
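(For example, by locating the peak; a sketch assuming the vectors above.)

```r
which.max(corr.AB) - 1    # the estimated delay in microseconds, expect ~13
max(corr.AB)              # the value of the cross-correlation at the peak
```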
No changes! The cross-correlation can cope with a small amount of
noise without problems. To conclude the examples with triangular
functions, we add a lot of noise to the signal:
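(A sketch; the exact noise level is an assumption, here about 50% of the signal amplitude.)

```r
B.very.noisy <- B + runif(n, min = 0, max = 0.5 * max(A))
```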
And once more we compute the cross-correlation and obtain basic
statistics about its values:
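(Reusing the `xcorr()` sketch from above.)

```r
corr.AB.noisy <- xcorr(A, B.very.noisy)
which.max(corr.AB.noisy) - 1    # still expect the 13 microsecond delay
summary(corr.AB.noisy)
```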
Once more, there are no changes to the estimate! The cross-correlation
can deal with uniform noise without problems.
Quotes and Square functions
So far we have been using triangular functions because they were
easy to generate. Market signals more closely resemble square
functions: a quote value is valid until it changes. Moreover,
market data is not regularly sampled in time. One might receive no
updates for several milliseconds, and then receive multiple updates
in the same microsecond! But to illustrate how this would work we
can make our life easy. Suppose we have the best bid quantity
sampled every microsecond, and it had the following values:
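(The values used with the original data are not reproduced here; the following generates a hypothetical step-like signal in the same spirit, with random update times and piecewise-constant quantities.)

```r
set.seed(42)
n <- 2048                                       # one sample per microsecond
updates <- cumsum(1 + rpois(300, lambda = 9))   # times at which the quantity changes
updates <- updates[updates < n]
qty.levels <- sample(seq(100, 5000, by = 100), length(updates) + 1, replace = TRUE)
bid.qty <- qty.levels[findInterval(seq(0, n - 1), updates) + 1]   # hold each value until it changes
```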
We use a similar trick as before to create a time shifted version of
this signal, and add some noise to it:
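(Sketched with the same illustrative 13 microsecond delay and 5% noise as before.)

```r
delay <- 13
bid.qty.shifted <- c(rep(bid.qty[1], delay), bid.qty)[1:n]
bid.qty.noisy <- bid.qty.shifted + runif(n, min = 0, max = 0.05 * max(bid.qty))
```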
And as before we compute the cross-correlation and obtain basic
statistics about its values:
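(Again with the `xcorr()` sketch.)

```r
corr.qty <- xcorr(bid.qty, bid.qty.noisy)
which.max(corr.qty) - 1    # the estimated delay in microseconds, ideally ~13
summary(corr.qty)
```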
One problem is that the difference between the peak and the minimum
is not that large; in relative terms it is only 0.7%.
Conclusion
In these last three posts we have reviewed how cross-correlations
work for simple triangular functions, triangular functions with some
noise and finally for step functions with noise.
We observed that some FFT libraries save computation by not rescaling
their results, which can make the results harder to interpret.
We also observed that the result of the cross-correlation is a
measure of area, which can take very large values for some functions;
it would be desirable to rescale it as well.
If you are unfamiliar with the markets, or with how to interpret a market
feed as a real-valued function, you might want to check the previous
post on this topic.
Let’s start with a simple function and apply the Fourier transform and
its inverse. The R snippets are available in the repository
if you prefer to see or modify the code.
Here we will break after each bit of code to offer some
explanations.
first we write a simple function to generate triangular functions.
Nothing fancy really, but it will save us time later
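(A sketch of such a generator; the function and argument names are assumptions.)

```r
# Generate a triangular pulse over n samples: 0 outside the triangle, rising
# linearly to 1 at the center and back down to 0.
triangle.wave <- function(n, center = n / 2, width = n / 4) {
  pmax(0, 1 - abs(seq(0, n - 1) - center) / width)
}
```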
using the function we create a triangle
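(For example, with arbitrary sizes.)

```r
n <- 128
microseconds <- seq(0, n - 1)   # the sample index, think of it as microseconds
y <- triangle.wave(n)
```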
and wrap the triangle in a data.frame(), because ggplot2 really
likes data.frame()
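(Something along these lines.)

```r
df <- data.frame(microseconds = microseconds, value = y, series = "triangle")
```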
ggplot2 generates sensible and good-looking plots in most cases;
make sure it is loaded
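```r
library(ggplot2)
```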
then we can plot the triangle function
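(A sketch of the plot.)

```r
ggplot(df, aes(x = microseconds, y = value, color = series)) + geom_line()
```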
next, let’s apply the FFT transform and the inverse to the
triangular function
and save this into a new data.frame()
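(In R that could look like this.)

```r
y.fft <- fft(y)                        # forward transform
y.rt  <- fft(y.fft, inverse = TRUE)    # ... and back again
df.rt <- data.frame(microseconds = microseconds,
                    value = Re(y.rt), series = "round trip")
```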
let’s add the new data to the data.frame()
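(Combining and re-plotting; a sketch.)

```r
df <- rbind(df, df.rt)
ggplot(df, aes(x = microseconds, y = value, color = series)) + geom_line()
```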
What is going on here? To save computations, the “Fast” Fourier
Transform omits rescaling the function by 1/N, where N is
the number of samples. If we apply this rescaling manually things
match perfectly
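(A sketch of the manual rescaling.)

```r
df.scaled <- data.frame(microseconds = microseconds,
                        value = Re(y.rt) / n,    # undo the missing 1/N factor
                        series = "round trip / N")
ggplot(rbind(df, df.scaled),
       aes(x = microseconds, y = value, color = series)) + geom_line()
```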
the more or less obvious question is how well a function correlates
with itself; this is easy to compute
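(For example, via the FFT; the division by the length compensates for R's unnormalized inverse transform.)

```r
y.acorr <- Re(fft(Conj(fft(y)) * fft(y), inverse = TRUE)) / length(y)
```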
we are going to be wrapping these functions in a data.frame() a
lot, so let’s create a function for it
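(A sketch of such a helper; the name is an assumption.)

```r
# Wrap a numeric vector into a data.frame that ggplot2 can use directly
as.df <- function(value, series) {
  data.frame(microseconds = seq(0, length(value) - 1),
             value = value, series = series)
}
```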
and plot the results
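(For example.)

```r
ggplot(as.df(y.acorr, "auto-correlation"),
       aes(x = microseconds, y = value, color = series)) + geom_line()
```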
Notice the scale of the y-axis: the result of a cross-correlation is a
measure of area. That means that as we process data with larger values
the cross-correlation will grow with the square of those values too.
Let’s see how the correlation works with a time shifted signal
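(A sketch, delaying the triangle by 13 microseconds.)

```r
delay <- 13
y.shifted <- c(rep(0, delay), y)[1:n]
```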
Let’s see what the cross-correlation looks like; since we will be
doing several correlations, we write a helper function …
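(A sketch of such a helper, building on the as.df() sketch above.)

```r
# Cross-correlate two signals via the FFT and wrap the result for plotting
xcorr.df <- function(x, y, series) {
  value <- Re(fft(Conj(fft(x)) * fft(y), inverse = TRUE)) / length(x)
  as.df(value, series)
}
df.shift <- xcorr.df(y, y.shifted, "triangle vs. shifted")
ggplot(df.shift, aes(x = microseconds, y = value, color = series)) + geom_line()
```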
The graphs are pretty, but exactly where is the peak?
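(Easy enough to find.)

```r
which.max(df.shift$value) - 1    # expect 13, the delay we introduced
```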
That is a perfect match, but market (or other) signals rarely match
so perfectly. What happens if we add some noise?
I think it is time to validate the idea that cross-correlation is a
good way to estimate delays of market signals.
Originally I validated these notions using
R,
because the compile-test-debug cycle of C++ is too slow for these
purposes.
I do not claim that R is the best choice (or even a good choice) for
this purpose: any other scripting language with good support for
statistics would have done the trick, say Matlab, or Python.
I am familiar with R, and it generates pretty graphs easily, so I went
with it.
Market feeds and the inside
If you are familiar with the markets you can skip to the
next section.
If you are not, hopefully the following explanation gives
you enough background to understand what follows.
And if you are familiar with the markets and still chose to read it,
my apologies for the lack of precision or accuracy, and for the extremely
elementary treatment. It is intended as an extremely brief
introduction for those almost completely unfamiliar with the field.
Many, if not most, electronic markets operate as continuously
crossing markets of limit orders in independent books.
We need to expand on those terms a bit because some of the readers may
not be familiar with them.
Independent Books: by this we mean that the crossing algorithm in
the exchange, that is, the process by which buy and sell orders are
matched to each other, looks at a single security at a time.
The algorithm considers all the orders in Apple (for example), to
decide if a buyer matches a seller, but does not consider the orders
in Apple and Microsoft together.
The term “book” refers, as far as I know, to a ledger that in old days
was used to keep the list of buy orders and sell orders in the market.
There are markets that cross complex orders, that is, orders that want
to buy or sell more than one security at a time; other than noting
their existence, we will ignore these markets altogether.
Limit Orders: most markets support orders that specify a limit
price, that is the worst price they are willing to execute at.
For BUY orders, worst means the highest possible price they would
tolerate. For example, a BUY limit order at $10.00 would be willing
to transact at $9.00, or $9.99, and even at $10.00, but not at $10.01
nor even at $10.01000001.
Likewise, for SELL orders, worst means the lowest possible price
they would tolerate.
Continuously Crossing: this means that any order that could be
executed is immediately executed. For example, if the lowest SELL
order in a market is offering $10.01 and a new BUY order enters the
market at $10.02 then the two orders would be immediately executed.
The precise execution price depends on many things; generally
the orders would be matched at $10.01, but there are many exceptions to
that rule.
Most markets have periods where certain orders are not
immediately executable, for example, in the US markets DAY orders are
only executable between 09:30 and 16:00 Eastern.
Some kind of auction process is executed at 09:30 to clear all DAY
orders that are crossing.
Non-limit Orders: there are many more order types than limit
orders, a full treatment of which is outside the scope of this
introduction. But briefly, MARKET orders execute at the best
available price. They can be thought of as limit orders with an
extremely high (for BUY) or extremely low (for SELL) limit price.
There are also orders whose limit price is tied to some market
attribute (PEGGED orders),
orders that only become active if the market is trading below or
above a certain price (STOP orders),
orders that trade slowly during the day, orders that execute only at
the market midpoint, etc., etc., etc.
Markets as Seen by the Computer Scientist
If you are a computer scientist looking at these continuously crossing
markets, you will notice an obvious invariant:
at any point in time
the highest BUY order has a limit price strictly lower than the price
of the lowest SELL order.
If this were not the case, the best BUY order and the best SELL order
would match, execute, and be removed from the book.
So, in the time periods when this invariant holds, the highest BUY
limit price is referred to as the best bid price in the market.
Likewise, the lowest SELL order is referred to as the best offer
in the market.
We have not mentioned this, but the reader would not be surprised to
hear that each order defines a quantity of securities that it is
willing to trade. No rational market agent would be willing (or able)
to buy or sell an infinite amount of whatever securities are traded.
Because there may be multiple orders willing to buy at the same price,
the best bid and best offer are always annotated with the quantities
available at that price level. The combination of all these figures,
the best bid price, best bid quantity, best offer price and best offer
quantities are referred to as the inside of the market (implicitly,
the inside of the market in each specific security).
There are some amusing subtleties regarding how the quantity available
is represented (in the US markets it is in units of round lots, which
are almost always 100 shares). But we will ignore these details for
the moment.
As one should expect, the inside changes over time. What is often
surprising to new students of market microstructure is that there are
multiple data sources for the inside data.
One can obtain the data through direct market data feeds from the
exchanges (sometimes an exchange may offer several different versions
of the same feed!),
or obtain it through the
consolidated
feeds,
or through market data re-distributors.
These different feeds have different latency characteristics; JayBeams
is a library to measure the differences between these latencies in real time.
Market Feeds as Functions
The motivation for JayBeams is market data, but we can think of a
particular attribute in a market feed as a function of time.
For example, we could say that the best bid price for SPY on the
ITCH-5.0 feed is basically a real-valued function of time.
Whatever is left of the mathematician in me wants to talk about
families of such functions indexed by the security, but
this would not help us (yet).
In the next sections we will just think about a single book at a time,
that is, when we say “the inside for the market” we will mean “the
inside for the market in security X”.
We may need to consider multiple securities simultaneously later, and
when we do so we will be explicit.
Let us first examine a single attribute on the inside, say the best
bid quantity. We will represent this attribute in different feeds as
different functions.
For example, let us call D(t) the inside best bid quantity from a
direct feed, and C(t) the inside best bid quantity from a
consolidated feed.
Our expectation is that there is a delay τ such that:

C(t) ≈ D(t - τ)

Well, we actually expect τ to change over time because feeds
respond differently under different loads. But let’s start with the
simplifying assumption that τ is mostly constant.
As promised, I recently released a little program to estimate message
rates for the ITCH-5.0 data. It took me a while because I wanted to
refactor some code, but it is now done and available in my
github repository.
What makes market data processing fun is the extreme message rates you
have to deal with, for example, my estimates for ITCH-5.0 show that
you can expect 560,000 messages per second at peak.
Now, if the data were nicely
distributed over a second that would leave you more than a microsecond
to process each message. Not a lot, but manageable. However, nearly 1%
of the messages are generated just 297 nanoseconds after the previous
one, so you need to process the messages in under 300 nanoseconds or risk
some delays.
Even if you do not care about performance in the peak second (and you
should), 75% of the milliseconds contain over 1,000 messages, so you
must be able to sustain 1,000,000 messages per second.
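(The back-of-the-envelope arithmetic behind those budgets.)

```r
1 / 560000     # seconds per message in the peak second, about 1.8 microseconds
1 / 1000000    # seconds per message at a sustained 1,000,000 msg/s, 1 microsecond
```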
Just 3 main memory accesses will make it hard to keep up with such a
feed (ref).
A single memory allocation would probably ruin your day too
(ref).
If there were no other concerns, you would be writing most of this
code in assembly (or VHDL, as some people do).
Unfortunately, or fortunately depending on your perspective,
the requirements change quicker than
you can possibly develop assembly (or VHDL) code for such systems.
By the time your assembly code is done another feed has been created,
or there is a new version of the feed, or the feed handler needs to be
used in a different model of hardware, or another program wants to
have an embedded feed handler, or you want to reuse the code for
another purpose.
The ITCH-5.0 protocol suffers from a common confusion in the
securities industry. It calls its ticker field ‘Stock’. While it
is true that most securities traded on Nasdaq are common stock
securities, many are not: Nasdaq also trades Exchange Traded Funds,
Exchange Traded Notes, and even Warrants.
And the name of a security, i.e., the string used to identify the
security in the many electronic protocols used between exchanges and
participants, is not the same as the security. A more appropriate
name would have been ‘Security Identifier’, or ‘Ticker’, or ‘Symbol’.
In software we are used to not confusing an object with the multiple
references to it.
Likewise, we should get used to not confusing stock, which is a
security, that is, a contract granting specific rights to its owner,
with the name of the thing, such as a ticker: a short string used to
identify a security in some contexts.
Also, software engineers in the field should be aware that the same
security may be identified by different strings in different contexts,
e.g. many exchanges use different tickers for the same security (yes,
even in the US markets).
This is particularly obvious if you start
looking at securities with suffixes, such as preferred stock.
In addition, the same ticker refers to
different securities as time goes by, for example, after some
corporate actions.
And outside the US markets it is common to use fairly
long strings (ISINs) or just numbers (in Japan) to identify
securities in computer systems.
And the string used to identify securities in a
market feed may not be the same string used to identify them when
placing orders, or when clearing.
In short, it behooves software engineers in the field to keep these
things straight in their designs, and in their heads too I might add.