8-bit Assembly was Fun: Ramblings of a not-so-old Programmer

Validate Cross Correlation, Part 3

In the previous post we showed how cross-correlation could be used to find the time delay between identical and very simple functions. Now we want to explore what happens when one of the signals has some noise.

In the last post we were considering two simple triangular signals A and B, with B delayed some 13 microseconds from A.

A graph of two triangular functions; the x-axis is labeled 'usec', the y-axis is labeled 'value'. The triangular functions are labeled A and B. B is a time-shifted copy of A: it takes the value A had a few microseconds earlier.

We now modify B by adding some 5% noise to it:

c <- b + runif(length(b), -0.05, 0.05)
ac.df <- rbind(a.df, to.df(c, "C"))
qplot(x=usec, y=value, color=variable, data=ac.df) +
  theme(legend.position="bottom")

A graph of two triangular functions; the x-axis is labeled 'usec', the y-axis is labeled 'value'. The triangular functions are labeled A and C. C is a time-shifted copy of A with a small amount of noise: it takes the value A had a few microseconds earlier.

And compute the cross-correlation with A:

corr.ac.df <- correlation.df(a, c, "A * C")
qplot(x=usec, y=value, color=variable, data=corr.ac.df) +
  scale_y_continuous(name=expression(value^2)) +
  theme(legend.position="bottom")

Another sinusoidal graph. The x-axis is labeled 'usec', ranging from 0 to 128. The y-axis is labeled 'value squared', ranging from approximately -30 to 30. The sinusoid has a single period, peaking around 15 and bottoming out around 75.

We can also check the location and value of the peak of this cross-correlation:

> which.max(corr.ac.df$value)
[1] 13
> max(corr.ab.df$value)
[1] 42.6875

No changes! The cross-correlation can cope with a small amount of noise without problems. To conclude the examples with triangular functions, we add a lot of noise to the signal:

d <- b + runif(length(b), -0.2, 0.2)
ad.df <- rbind(a.df, to.df(d, "D (20% noise)"))
qplot(x=usec, y=value, color=variable, data=ad.df) +
  theme(legend.position="bottom")

A graph of two triangular functions; the x-axis is labeled 'usec', the y-axis is labeled 'value'. The triangular functions are labeled A and D. D is a time-shifted copy of A with a lot of noise: it takes the value A had a few microseconds earlier.

And once more we compute the cross-correlation:

corr.ad.df <- correlation.df(a, d, "A * D")
qplot(x=usec, y=value, color=variable, data=corr.ad.df) +
  scale_y_continuous(name=expression(value^2)) +
  theme(legend.position="bottom")

Another sinusoidal graph. The x-axis is labeled 'usec', ranging from 0 to 128. The y-axis is labeled 'value squared', ranging from approximately -30 to 30. The sinusoid has a single period, peaking around 15 and bottoming out around 75.

And obtain basic statistics about the cross-correlation values:

> which.max(corr.ad.df$value)
[1] 13
> summary(corr.ad.df$value)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-43.34  -29.57    0.00    0.00   29.57   43.34 

Once more, there are no changes to the estimate! The cross-correlation can deal with uniform noise without problems.

Quotes and Square Functions

So far we have been using triangular functions because they are easy to generate. Market signals more closely resemble square functions: a quote value is valid until it changes. Moreover, market data is not regularly sampled in time: one might receive no updates for several milliseconds, and then receive multiple updates in the same microsecond! But to illustrate how this would work we can make our life easy: suppose we have the best bid quantity sampled every microsecond, and that it has the following values:

S <- c(rep(1000, 100), rep(1100, 100), rep(1200, 56))
S.df <- to.df(S, "S")
qplot(x=usec, y=value, color=variable, data=S.df) +
  ylim(0, max(S.df$value)) +
  theme(legend.position="bottom")

A graph of a step function. The x-axis is labeled 'usec', it ranges from 0 to 256. The y-axis is labeled 'value', it ranges from 1000 to 1200. The values are labeled 'S'.
The function has constant value 1000 from 0 to 100, then constant value 1100 from 100 to 200, and then constant value 1200.
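As an aside, real feeds do not arrive on a regular grid like this. A minimal sketch of how one might resample irregular quote updates onto a one-microsecond grid, carrying the last value forward (the update times and quantities below are hypothetical, chosen so the result reproduces the S vector above):

update.usec <- c(0, 100, 200)        # hypothetical update timestamps (usec)
update.qty  <- c(1000, 1100, 1200)   # hypothetical best bid quantities
grid <- seq(0, 255)                  # regular one-microsecond grid
S.resampled <- approx(x=update.usec, y=update.qty, xout=grid,
                      method="constant", f=0, rule=2)$y
all(S.resampled == S)                # should be TRUE: same step function as S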

We use the same trick as before to create a time-shifted version of this signal, and add some noise to it:

T <- S[((seq(1, length(S)) - 27) %% length(S)) + 1]
T <- T + runif(length(T), -10, 10)
ST.df <- rbind(S.df, to.df(T, "T"))
qplot(x=usec, y=value, color=variable, data=ST.df) +
  ylim(0, max(ST.df$value)) +
  theme(legend.position="bottom")

A graph of two step functions. The x-axis is labeled 'usec', it ranges from 0 to 256. The y-axis is labeled 'value', it ranges from 1000 to 1200. The values are labeled 'S' and 'T'.
The 'S' values have constant value 1000 from 0 to 100, then constant value 1100 from 100 to 200, and then constant value 1200. The 'T' values are the 'S' values delayed by approximately 30 microseconds, with some amount of noise.

And as before we can compute the cross-correlation:

corr.ST.df <- correlation.df(S, T, "S * T")
qplot(x=usec, y=value, color=variable, data=corr.ST.df) +
  scale_y_continuous(name=expression(value^2)) +
  theme(legend.position="bottom")

A graph with a peak around 30. The x-axis is labeled 'usec', ranging from 0 to 256. The y-axis is labeled 'value squared', ranging from approximately 299500000 to over 301500000. The values grow linearly up to the peak, then decrease linearly for some time, then decrease in two apparently linear segments. The values are basically constant between 120 and 180, then grow again in two linear segments, finishing just below 301000000.

And obtain basic statistics about the cross-correlation values:

> which.max(corr.ST.df$value)
[1] 27
> summary(corr.ST.df$value)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
299500000 299600000 299900000 300200000 300700000 301600000 

One problem is that the difference between the peak and the minimum is not that high; in relative terms it is only about 0.7%.
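That 0.7% figure is easy to reproduce from the summary above; a quick sketch:

peak <- max(corr.ST.df$value)
trough <- min(corr.ST.df$value)
(peak - trough) / peak    # roughly 0.007, i.e. about 0.7%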

Conclusion

In these last three posts we have reviewed how cross-correlations work for simple triangular functions, for triangular functions with some noise, and finally for step functions with noise. We observed that some FFT libraries save computations by not rescaling their results, which can make the results harder to interpret. We also observed that the result of the cross-correlation is a measure of area, which can take very large values for some functions, so it would be desirable to rescale it as well.
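As a sketch of what such a rescaling might look like (this is not part of the code used in these posts, just one possible normalization): dividing the cross-correlation by the energy of the two signals bounds the result, so its scale no longer depends on the magnitude of the inputs.

normalized.correlation <- function(a, b) {
  # Remove the mean so constant offsets do not dominate the result.
  a <- a - mean(a)
  b <- b - mean(b)
  raw <- Re(fft(Conj(fft(a)) * fft(b), inverse=TRUE)) / length(a)
  # Rescale by the signal energies; the peak is now bounded by 1.
  return(raw / sqrt(sum(a^2) * sum(b^2)))
}

With this version the location of the peak is unchanged, but its value is comparable across signals of very different magnitudes.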

Validate Cross Correlation, Part 2

If you are unfamiliar with the markets, or with how to interpret a market feed as a real-valued function, you might want to check the previous post on this topic.

Let’s start with a simple function and apply the Fourier transform and its inverse. The R snippets are available in the repository if you prefer to see or modify the code. Here we will break after each bit of code to offer some explanations.

first we write a simple function to generate triangular functions. Nothing fancy really, but it will save us time later

triangle <- function(period) {
  # Build one period of a triangular wave in roughly the [-1, 1] range:
  # a rising ramp followed by a falling ramp.
  p4 <- period / 4
  up <- (seq(1, period/2) - p4) / p4
  dn <- (p4 - seq(1, period/2)) / p4
  return(c(up, dn))
}

using the function we create a triangle

t <- triangle(128)

and wrap the triangle in a data.frame(), because ggplot2 really likes data.frame()

df <- data.frame(usec=seq(0, length(t) - 1), value=t, variable="T(t)")

ggplot2 generates sensible and good-looking plots in most cases; make sure it is loaded

require(ggplot2)  # May need to install.packages("ggplot2")

then we can plot the triangle function

qplot(x=usec, y=value, color=variable, data=df)

A sampled function, the x-axis goes from 0 to 128, the values start at -1.0 and grow linearly to 1.0 just before x is equal to 64.  Then the values decrease linearly to -1.0 when x is equal to 128.

next, let’s apply the FFT and its inverse to the triangular function

fft.i <- fft(fft(t), inverse=TRUE)

and save this into a new data.frame()

tmp <- df
tmp$value <- Re(fft.i)
tmp$variable <- 'FFT^-1(FFT(t))'
df.i <- rbind(df, tmp)

and plot both data sets together

qplot(x=usec, y=value, color=variable, data=df.i)

Two triangular functions as before. One, labeled 'T(t)', has much smaller amplitude. The second, labeled FFT inverse applied to FFT of T(t), has much higher amplitude: it ranges from -128 to +128.

What is going on here? To save computations, the “Fast” Fourier Transform omits rescaling the result by 1/N, where N is the number of samples. If we apply this rescaling manually things match perfectly:

fft.i <- fft.i / length(fft.i)
tmp <- df
tmp$value <- Re(fft.i)
tmp$variable <- 'FFT^-1(FFT(T))(t)'
df.i <- rbind(df, tmp)
qplot(x=usec, y=value, color=variable, data=df.i)

Two triangular functions as before. One is labeled T(t), and a second is labeled FFT inverse applied to FFT of T(t). Both overlap perfectly; they vary in the -1 to +1 range.
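If you prefer a numerical check over eyeballing the plot, the residual error of the round trip can be inspected directly (not part of the original snippets, just a sanity check on the rescaled fft.i from above):

max(abs(Re(fft.i) - t))   # should be tiny, on the order of 1e-13 or smaller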

the more or less obvious question is how well a function correlates with itself; this is easy to compute

# Correlate t with itself: inverse FFT of Conj(FFT(t)) * FFT(t), rescaled by 1/N.
corr <- Re(fft( Conj(fft(t)) * fft(t), inverse=TRUE)) / length(t)

we are going to be wrapping these functions in a data.frame() a lot, so let’s create a function for it

to.df <- function(a, name) {
  return(data.frame(
    usec=seq(0, length(a) - 1), value=a, variable=name))
}

and plot the results

corr.df <- to.df(corr, "(T * T)(t)")
qplot(x=usec, y=value, color=variable, data=corr.df) +
  scale_y_continuous(name=expression(value^2)) +
  theme(legend.position="bottom")

A sinusoidal wave. The x-axis varies from 0 to 128. The wave starts with a high value of around 32, reaches a low value of -32 when x is approximately 64, and grows back to +32 when x is equal to 128.

Notice that the y-axis is labeled value^2; this is because the result of a cross-correlation is a measure of area. That means that as we process data with larger values, the cross-correlation will grow with the square of those values too.
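To make that concrete, a small sketch (reusing t and corr from above, not part of the original snippets): scaling the signal by 10 scales the cross-correlation by 100.

t10 <- 10 * t
corr10 <- Re(fft( Conj(fft(t10)) * fft(t10), inverse=TRUE)) / length(t10)
max(corr10) / max(corr)   # should be 100, i.e. 10^2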

Let’s see how the correlation works with a time shifted signal

a <- triangle(128)
a.df <- to.df(a, "A")
b <- a[((seq(1, length(a)) - 13) %% length(a)) + 1]
ab.df <- rbind(a.df, to.df(b, "B"))
qplot(x=usec, y=value, color=variable, data=ab.df) +
  theme(legend.position="bottom")

A graph of two triangular functions; the x-axis is labeled 'usec', the y-axis is labeled 'value'. The triangular functions are labeled A and B. B is a time-shifted copy of A: it takes the value A had a few microseconds earlier.

Let’s see what the cross-correlation looks like, but since we will be doing several correlations, we write a helper function first …

correlation <- function(a, b) {
  # Cross-correlation via the FFT, rescaled by 1/N as discussed above.
  inv <- Re(fft( Conj(fft(a)) * fft(b), inverse=TRUE))
  return(inv / length(a))
}
correlation.df <- function(a, b, name) {
  return(to.df(correlation(a, b), name))
}
corr.ab.df <- correlation.df(a, b, "A * B")
qplot(x=usec, y=value, color=variable, data=corr.ab.df) +
  scale_y_continuous(name=expression(value^2)) +
  theme(legend.position="bottom")

Another sinusoidal graph. The x-axis is labeled 'usec', ranging from 0 to 128. The y-axis is labeled 'value squared', ranging from approximately -30 to 30. The sinusoid has a single period, peaking around 15 and bottoming out around 75.

The graphs are pretty, but exactly where is the peak?

which.max(corr.ab.df$value)
[1] 13

That is a perfect match, but market (or other) signals rarely match so perfectly. What happens if we add some noise?

Validate Cross Correlation, Part 1

I think it is time to validate the idea that cross-correlation is a good way to estimate delays of market signals. Originally I validated these notions using R, because the compile-test-debug cycle of C++ is too slow for this kind of exploration. I do not claim that R is the best choice (or even a good choice) for this purpose: any other scripting language with good support for statistics would have done the trick, say Matlab, or Python. I am familiar with R, and it generates pretty graphs easily, so I went with it.

Market Feeds and the Inside

If you are familiar with the markets you can skip to the next section. If you are not, hopefully the following explanation gives you enough background to understand what follows. And if you are familiar with the markets and still choose to read it, my apologies for the lack of precision and accuracy, and for the extremely elementary treatment. It is intended as an extremely brief introduction for those almost completely unfamiliar with the field.

Many, if not most, electronic markets operate as continuously crossing markets of limit orders in independent books. We need to expand on those terms a bit because some of the readers may not be familiar with them.

Independent Books: by this we mean that the crossing algorithm in the exchange, that is, the process by which buy and sell orders are matched to each other, looks at a single security at a time. The algorithm considers all the orders in Apple (for example) to decide if a buyer matches a seller, but does not consider the orders in Apple and Microsoft together. The term “book” refers, as far as I know, to a ledger that in the old days was used to keep the list of buy orders and sell orders in the market. There are markets that cross complex orders, that is, orders that want to buy or sell more than one security at a time; other than citing their existence, we will ignore these markets altogether.

Limit Orders: most markets support orders that specify a limit price, that is, the worst price they are willing to execute at. For BUY orders, worst means the highest possible price they would tolerate. For example, a BUY limit order at $10.00 would be willing to transact at $9.00, or $9.99, and even at $10.00, but not at $10.01 nor even at $10.01000001. Likewise, for SELL orders, worst means the lowest possible price they would tolerate.

Continuously Crossing: this means that any order that could be executed is immediately executed. For example, if the lowest SELL order in a market is offering $10.01 and a new BUY order enters the market at $10.02 then the two orders would be immediately executed. The precise execution price depends on many things; though generally the orders would be matched at $10.01, there are many exceptions to that rule. Most markets have periods where certain orders are not immediately executable, for example, in the US markets DAY orders are only executable between 09:30 and 16:00 Eastern. Some kind of auction process is executed at 09:30 to clear all DAY orders that are crossing.

Non-limit Orders: there are many more order types than limit orders, a full treatment of which is outside the scope of this introduction. But briefly, MARKET orders execute at the best available price. They can be thought of as limit orders with an extremely high (for BUY) or extremely low (for SELL) limit price. There are also orders whose limit price is tied to some market attribute (PEGGED orders), orders that only become active if the market is trading below or above a certain price (STOP orders), orders that trade slowly during the day, orders that execute only at the market midpoint, etc., etc., etc.

Markets as Seen by the Computer Scientist

If you look at these continuously crossing markets as a computer scientist you will notice an obvious invariant: at any point in time the highest BUY order has a limit price strictly lower than the price of the lowest SELL order. If this were not the case the best BUY order and the best SELL order would match, execute, and be removed from the book.

So, in the time periods when this invariant holds, the highest BUY limit price is referred to as the best bid price in the market. Likewise, the lowest SELL order is referred to as the best offer in the market.

We have not mentioned this, but the reader would not be surprised to hear that each order defines a quantity of securities that it is willing to trade. No rational market agent would be willing (or able) to buy or sell an infinite number of whatever securities are traded.

Because there may be multiple orders willing to buy at the same price, the best bid and best offer are always annotated with the quantities available at that price level. The combination of all these figures, the best bid price, best bid quantity, best offer price, and best offer quantity, is referred to as the inside of the market (implicitly, the inside of the market in each specific security).
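To make these definitions concrete, here is a toy sketch in R (purely illustrative, not how JayBeams or any exchange represents a book) that computes the inside from a handful of limit orders and checks the invariant described above:

# A toy book: each row is a resting limit order with a side, price, and quantity.
book <- data.frame(
  side  = c("BUY", "BUY", "BUY", "SELL", "SELL"),
  price = c(9.99, 10.00, 10.00, 10.01, 10.02),
  qty   = c(100, 200, 300, 500, 100))
bids   <- book[book$side == "BUY", ]
offers <- book[book$side == "SELL", ]
best.bid.price   <- max(bids$price)                                    # 10.00
best.bid.qty     <- sum(bids$qty[bids$price == best.bid.price])        # 500
best.offer.price <- min(offers$price)                                  # 10.01
best.offer.qty   <- sum(offers$qty[offers$price == best.offer.price])  # 500
# The invariant for a continuously crossing market: best bid < best offer.
stopifnot(best.bid.price < best.offer.price)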

There are some amusing subtleties regarding how the quantity available is represented (in the US markets it is in units of round lots, which are almost always 100 shares). But we will ignore these details for the moment.

As one should expect, the inside changes over time. What is often surprising to new students of market microstructure is that there are multiple data sources for the inside data. One can obtain the data through direct market data feeds from the exchanges (sometimes an exchange may offer several different versions of the same feed!), or through the consolidated feeds, or through market data re-distributors. These different feeds have different latency characteristics; JayBeams is a library to measure the differences between these latencies in real time.

Market Feeds as Functions

The motivation for JayBeams is market data, but we can think of a particular attribute in a market feed as a function of time. For example, we could say that the best bid price for SPY on the ITCH-5.0 feed is basically a real-valued function of time. Whatever is left of the mathematician in me wants to talk about families of functions indexed by the security, but this would not help us (yet).

In the next sections we will just think about a single book at a time, that is, when we say “the inside for the market” we will mean “the inside for the market in security X”. We may need to consider multiple securities simultaneously later, and when we do so we will be explicit.

Let us first examine a single attribute on the inside, say the best bid quantity. We will represent this attribute in different feeds as different functions. For example, let us call $$f(t)$$ the inside best bid quantity from a direct feed, and $$g(t)$$ the inside best bid quantity from a consolidated feed.

Our expectation is that there is a delay $$\delta$$ such that:

$$g(t) \approx f(t - \delta)$$

Well, we actually expect $$\delta$$ to change over time because feeds respond differently under different loads. But let’s start with the simplifying assumption that $$\delta$$ is mostly constant.

Time Delay Estimation for Functions

We simply refer the reader to the Wikipedia article on Cross-correlation and the section therein on Time delay analysis.

As this article points out, one can estimate the delay as the location of the peak of the cross-correlation:

$$\delta = \underset{t}{\arg\max} \, (f \star g)(t)$$

The key observation is that we can compute the cross-correlation efficiently using the Fast Fourier Transform:

$$(f \star g)(t) = \mathrm{FFT}^{-1}\left( \overline{\mathrm{FFT}(f)} \cdot \mathrm{FFT}(g) \right)(t)$$

Enough Math! Show me how it Looks!

Will do, in the next post; this one is getting too long anyway.

Some Estimates for ITCH

As promised, I recently released a little program to estimate message rates for ITCH-5.0 data. It took me a while because I wanted to refactor some code, but it is now done and available in my GitHub repository.

What makes market data processing fun is the extreme message rates you have to deal with. For example, my estimates for ITCH-5.0 show that you can expect 560,000 messages per second at peak. Now, if the data were nicely distributed over a second, that would leave you more than a microsecond to process each message. Not a lot, but manageable. However, nearly 1% of the messages are generated just 297 nanoseconds after the previous one, so you need to process each message in under 300 nanoseconds or risk some delays.

Even if you do not care about performance in the peak second (and you should), 75% of the milliseconds contain over 1,000 messages, so you must be able to sustain 1,000,000 messages per second.
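The per-message budgets implied by those rates are easy to back out; a quick bit of arithmetic in R (using the estimates quoted above):

1e9 / 560000    # peak second: about 1786 nanoseconds per message on average
1e6 / 1000      # a millisecond with 1,000 messages: 1000 nanoseconds each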

Just 3 main memory accesses will make it hard to keep up with such a feed (ref). A single memory allocation would probably ruin your day too (ref). If there were no other concerns, you would be writing most of this code in assembly (or VHDL, as some people do). Unfortunately, or fortunately depending on your perspective, the requirements change quicker than you can possibly develop assembly (or VHDL) code for such systems. By the time your assembly code is done another feed has been created, or there is a new version of the feed, or the feed handler needs to be used in a different model of hardware, or another program wants to have an embedded feed handler, or you want to reuse the code for another purpose.

Securities vs. Symbols

The ITCH-5.0 protocol suffers from a common confusion in the securities industry. It calls its ticker field ‘Stock’. While it is true that most securities traded on Nasdaq are common stock, many are not: Nasdaq also trades Exchange Traded Funds, Exchange Traded Notes, and even Warrants.

And the name of a security, i.e., the string used to identify the security in the many electronic protocols used between exchanges and participants, is not the same as the security. A more appropriate name would have been ‘Security Identifier’, or ‘Ticker’, or ‘Symbol’.

In software we are used to not confusing an object with the multiple references to it. Likewise, we should get used to not confusing a stock, which is a security, that is, a contract granting specific rights to its owner, with the name of the thing, such as a ticker: a short string used to identify a security in some contexts.

Also, software engineers in the field should be aware that the same security may be identified by different strings in different contexts; for example, many exchanges use different tickers for the same security, yes, even in the US markets. This is particularly obvious if you start looking at securities with suffixes, such as preferred stock. In addition, the same ticker can refer to different securities as time goes by, for example after some corporate actions.

Outside the US markets it is common to use fairly long strings (ISINs) or just numbers (in Japan) to identify securities in computer systems. Note also that the string used to identify securities in a market feed may not be the same string used to identify them when placing orders, or when clearing.

In short, it behooves software engineers in the field to keep these things straight in their designs, and in their heads too I might add.
