Building a Lock-Free SPSC Queue for Market Data
How I designed a single-producer single-consumer queue that achieves sub-100ns latency for streaming tick data - the design decisions, cache line considerations, and benchmarks.
The most common latency bottleneck in a market data pipeline isn’t your network stack or your order book - it’s the handoff between threads. A naive std::mutex-protected queue adds 200-500ns per operation on modern hardware. For a tick processing pipeline where you need to react in under a microsecond, that’s a non-starter.
Here’s how I built the SPSC queue that underpins the market data feed in my trading interface.
Why SPSC?
Market data pipelines have a natural single-producer structure: one thread receives bytes off the wire and parses them into tick structs. A second thread consumes those ticks and updates the order book. Classic SPSC territory.
The constraint of one producer and one consumer allows a key simplification: only two indices need to be atomic, and each is written by exactly one thread (the producer owns tail_, the consumer owns head_). With no competing writers there is nothing to retry, which avoids the costly CAS loop of a general MPMC queue.
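To make the difference concrete, here is a minimal sketch of how a slot index gets claimed in each case. The helper names claim_slot_mpmc and claim_slot_spsc are hypothetical and exist only for this illustration; they are not part of the queue itself.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Multi-producer claim: several threads race for the same index, so each
// one must retry with compare-and-swap until its own update wins.
inline std::size_t claim_slot_mpmc(std::atomic<std::size_t>& tail) {
    std::size_t t = tail.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak reloads `t` with the current value.
    while (!tail.compare_exchange_weak(t, t + 1,
                                       std::memory_order_acq_rel,
                                       std::memory_order_relaxed)) {
    }
    return t;  // the slot this caller now owns
}

// Single-producer claim: only one thread ever writes `tail`, so a plain
// load followed by a store is enough. No retry loop, no read-modify-write.
inline std::size_t claim_slot_spsc(std::atomic<std::size_t>& tail) {
    std::size_t const t = tail.load(std::memory_order_relaxed);
    tail.store(t + 1, std::memory_order_release);
    return t;
}

// Usage sketch: both helpers hand out consecutive slots.
inline void claim_demo() {
    std::atomic<std::size_t> tail{0};
    std::size_t const first = claim_slot_mpmc(tail);
    std::size_t const second = claim_slot_spsc(tail);
    assert(first == 0 && second == 1 && tail.load() == 2);
    (void)first;
    (void)second;
}
```

The single-writer version is only safe because of the SPSC constraint. Call it from two producers and slots will be handed out twice.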
The Core Design
// Requires <array>, <atomic>, <bit> and <cstddef>; std::has_single_bit is C++20.
template <typename T, std::size_t N>
class SpscQueue {
    static_assert(std::has_single_bit(N), "N must be a power of two");
    // Each index gets its own cache line to avoid false sharing.
    alignas(64) std::atomic<std::size_t> head_{0};  // written only by the consumer
    alignas(64) std::atomic<std::size_t> tail_{0};  // written only by the producer
    std::array<T, N> buf_;  // one slot always stays empty, so capacity is N - 1
public:
    // Producer thread only.
    bool push(T const& val) noexcept {
        auto const t = tail_.load(std::memory_order_relaxed);
        auto const next = (t + 1) & (N - 1);
        if (next == head_.load(std::memory_order_acquire))
            return false; // full
        buf_[t] = val;
        tail_.store(next, std::memory_order_release);  // publish the slot
        return true;
    }
    // Consumer thread only.
    bool pop(T& val) noexcept {
        auto const h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
            return false; // empty
        val = buf_[h];
        head_.store((h + 1) & (N - 1), std::memory_order_release);  // free the slot
        return true;
    }
};
A few things are doing heavy lifting here.
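alignas(64) on the indices. Without this, head_ and tail_ would typically land on the same cache line. The producer writes tail_; the consumer writes head_. With false sharing, each write invalidates the other core’s copy of the line, so every operation pays a cross-core coherency round trip (tens of nanoseconds on the same die, more across dies or sockets) even though the two threads are writing different variables. Putting the indices on separate cache lines removes that false sharing; the only coherence traffic left is the true sharing needed to hand data across.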
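The layout effect is easy to check directly. The sketch below compares a packed and a padded pair of indices and reports whether they fall into the same 64-byte line. PackedIndices, PaddedIndices, line_of and shares_line are illustrative names, and the 64-byte line size is an assumption that holds on current x86-64 parts but not on every CPU.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // assumed line size, as in the queue

// Indices packed next to each other: both live in one 64-byte line.
struct alignas(kCacheLine) PackedIndices {
    std::atomic<std::size_t> head{0};
    std::atomic<std::size_t> tail{0};
};

// Indices padded apart: each one starts its own 64-byte line.
struct PaddedIndices {
    alignas(kCacheLine) std::atomic<std::size_t> head{0};
    alignas(kCacheLine) std::atomic<std::size_t> tail{0};
};

// Cache line number that an address falls into.
inline std::uintptr_t line_of(void const* p) {
    return reinterpret_cast<std::uintptr_t>(p) / kCacheLine;
}

template <typename Indices>
bool shares_line(Indices const& idx) {
    return line_of(&idx.head) == line_of(&idx.tail);
}

inline void layout_demo() {
    PackedIndices packed;
    PaddedIndices padded;
    assert(shares_line(packed));    // false sharing is possible
    assert(!shares_line(padded));   // each index has its own line
    static_assert(sizeof(PaddedIndices) == 2 * kCacheLine,
                  "padding costs one extra cache line");
}
```

The price of the padding is memory: the padded pair occupies two full lines instead of sixteen bytes, which is negligible for a single queue.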
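Memory order selection. push uses relaxed to load tail_ (only the producer touches it), acquire to load head_ (need to see the consumer’s write), and release to store the new tail_ (publish the data before advertising availability). pop mirrors this: relaxed to load head_, acquire to load tail_ (so the read of the slot sees the producer’s data write), and release to store the new head_ (so the producer cannot reuse a slot before the consumer has finished copying out of it). Each acquire/release pair establishes the happens-before edge that makes the data transfer safe.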
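A small two-thread check shows the pairing at work. This is a sketch, not a proof: the queue is restated under the name SpscQueueSketch so the block compiles on its own (with a C++17-friendly power-of-two check), and transfers_in_order is a hypothetical helper. Passing it does not demonstrate correctness on every architecture, but running it under ThreadSanitizer is a reasonable smoke test.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>

// Same algorithm as the queue above, restated so this sketch compiles alone.
template <typename T, std::size_t N>
class SpscQueueSketch {
    static_assert(N >= 2 && (N & (N - 1)) == 0, "N must be a power of two");
    alignas(64) std::atomic<std::size_t> head_{0};
    alignas(64) std::atomic<std::size_t> tail_{0};
    std::array<T, N> buf_{};
public:
    bool push(T const& val) noexcept {
        auto const t = tail_.load(std::memory_order_relaxed);
        auto const next = (t + 1) & (N - 1);
        if (next == head_.load(std::memory_order_acquire)) return false;
        buf_[t] = val;                                  // (1) write the data
        tail_.store(next, std::memory_order_release);   // (2) publish it
        return true;
    }
    bool pop(T& val) noexcept {
        auto const h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire)) return false;  // (3) pairs with (2)
        val = buf_[h];                                  // (4) sees the write from (1)
        head_.store((h + 1) & (N - 1), std::memory_order_release);
        return true;
    }
};

// Push 0..count-1 from one thread, pop on another, and report whether every
// value arrived exactly once and in order.
inline bool transfers_in_order(std::uint64_t count) {
    SpscQueueSketch<std::uint64_t, 1024> q;
    bool ok = true;
    std::thread consumer([&] {
        std::uint64_t v = 0;
        for (std::uint64_t expected = 0; expected < count; ++expected) {
            while (!q.pop(v)) { std::this_thread::yield(); }
            if (v != expected) ok = false;
        }
    });
    for (std::uint64_t i = 0; i < count; ++i) {
        while (!q.push(i)) { std::this_thread::yield(); }
    }
    consumer.join();  // join() makes the consumer's write to `ok` visible here
    return ok;
}

inline void transfer_demo() {
    bool const ok = transfers_in_order(100000);
    assert(ok);
    (void)ok;
}
```

The yield() calls are there to keep the test friendly on machines with few cores; a latency-sensitive consumer would spin instead.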
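Power-of-two capacity. The modulo becomes a bitwise AND, which matters when this code runs in a tight hot loop. Note that one slot is always left empty to tell a full queue from an empty one, so a queue declared with N slots holds at most N - 1 items.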
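The index arithmetic can be checked at compile time. In this sketch, next_index and size_of_ring are hypothetical helpers that mirror the expressions used in push and pop; the static_asserts show the wrap-around and the N - 1 capacity for an 8-slot ring.

```cpp
#include <cstddef>

// Advance a ring index by one slot; valid only when N is a power of two.
template <std::size_t N>
constexpr std::size_t next_index(std::size_t i) {
    static_assert(N >= 2 && (N & (N - 1)) == 0, "N must be a power of two");
    return (i + 1) & (N - 1);  // same result as (i + 1) % N
}

static_assert(next_index<8>(6) == 7, "no wrap in the middle of the ring");
static_assert(next_index<8>(7) == 0, "wraps back to the start");
static_assert(next_index<8>(7) == (7 + 1) % 8, "mask agrees with modulo");

// Number of items currently stored, given the two indices.
template <std::size_t N>
constexpr std::size_t size_of_ring(std::size_t head, std::size_t tail) {
    return (tail - head) & (N - 1);  // unsigned wraparound keeps this correct
}

static_assert(size_of_ring<8>(3, 3) == 0, "head == tail means empty");
static_assert(size_of_ring<8>(0, 7) == 7, "full at N - 1 items");
static_assert(size_of_ring<8>(5, 4) == 7, "full across the wrap point");
```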
Cache Warming
One issue I hit in practice: when the queue is empty and the consumer is spinning, it repeatedly loads tail_, whose cache line is owned by the producer. Under MESI, the consumer holds that line in the Shared state, so the spin itself is cheap. But the first push after a quiet period invalidates the consumer’s copy, and the next load has to fetch the line from the producer’s core, so the first message pays an extra coherence miss before the consumer sees it.
For ultra-low-latency scenarios, I added a prefetch hint in the consumer’s spin loop:
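while (!q.pop(tick)) {
    __builtin_ia32_pause();  // x86 PAUSE, GCC/Clang builtin
    // keep tail_ cache line warm; tail_ is private, so this loop has to
    // live in a member or friend of the queue
    __builtin_prefetch(&q.tail_, 0, 3);  // 0 = read, 3 = high temporal locality
}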
This shaved ~15ns off worst-case first-message latency in my benchmarks.
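For anyone who wants to try the same idea outside the queue class, here is a self-contained sketch of a spin-wait with a pause and a prefetch. cpu_relax and wait_for_change are hypothetical helpers, the non-x86 fallback to yield() is my assumption, and whether the prefetch helps at all depends on the CPU and workload, so measure before keeping it.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
inline void cpu_relax() noexcept { _mm_pause(); }  // x86 PAUSE instruction
#else
inline void cpu_relax() noexcept { std::this_thread::yield(); }  // portable fallback
#endif

// Spin until `index` moves away from `seen`, hinting the CPU each pass.
inline std::size_t wait_for_change(std::atomic<std::size_t> const& index,
                                   std::size_t seen) noexcept {
    std::size_t now;
    while ((now = index.load(std::memory_order_acquire)) == seen) {
        cpu_relax();
#if defined(__GNUC__) || defined(__clang__)
        // 0 = prefetch for read, 3 = keep in all cache levels.
        __builtin_prefetch(&index, 0, 3);
#endif
    }
    return now;
}

// Usage sketch: one thread bumps the index, the other waits for it.
inline void spin_demo() {
    std::atomic<std::size_t> tail{0};
    std::thread producer([&] { tail.store(1, std::memory_order_release); });
    std::size_t const got = wait_for_change(tail, 0);
    producer.join();
    assert(got == 1);
    (void)got;
}
```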
Benchmarks
Measured on a pinned core pair (no NUMA crossing), Ryzen 9 7950X:
| Queue type | p50 | p99 | p99.9 |
|---|---|---|---|
| std::mutex + std::queue | 380ns | 620ns | 2.1µs |
| boost::lockfree::spsc_queue | 68ns | 95ns | 140ns |
| This implementation | 41ns | 58ns | 82ns |
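The gap vs Boost is unlikely to come from memory ordering: boost::lockfree::spsc_queue uses the same relaxed/acquire/release orderings on its indices rather than seq_cst barriers, and it has no overwrite policy. The more plausible sources are layout and code-generation differences, which are worth confirming in the generated assembly before drawing conclusions.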
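For reference, this is one way the p50/p99/p99.9 columns can be derived from raw per-message latency samples. It is a sketch under an assumption: it uses the nearest-rank definition, which may differ from the method behind the table, and percentile_ns is a hypothetical helper.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Nearest-rank percentile over latency samples in nanoseconds.
// `permille` is the rank in tenths of a percent:
// 500 = p50, 990 = p99, 999 = p99.9.
inline std::uint64_t percentile_ns(std::vector<std::uint64_t> samples,
                                   std::size_t permille) {
    assert(!samples.empty() && permille >= 1 && permille <= 1000);
    std::sort(samples.begin(), samples.end());
    // ceil(n * permille / 1000) in integer arithmetic gives the 1-based rank.
    std::size_t const rank = (samples.size() * permille + 999) / 1000;
    return samples[rank - 1];
}

// Usage sketch: 1000 synthetic samples of 1ns..1000ns.
inline void percentile_demo() {
    std::vector<std::uint64_t> samples;
    for (std::uint64_t ns = 1000; ns >= 1; --ns) samples.push_back(ns);
    assert(percentile_ns(samples, 500) == 500);
    assert(percentile_ns(samples, 990) == 990);
    assert(percentile_ns(samples, 999) == 999);
}
```

Integer permille ranks avoid floating-point rounding at the boundaries, which matters when the tail percentile sits on only a handful of samples.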
One Gotcha: the Compiler Is Your Enemy
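If the indices were plain integers, or the index stores used relaxed ordering, the compiler could legally hoist the capacity check out of the spin loop or move the index store ahead of the data write. The std::atomic acquire/release operations above are what rule both out, so the queue does not need __attribute__((noinline)) or volatile-style barriers to be correct. Still, always inspect the assembly. godbolt.org with -O3 -march=native is your friend.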
The code for this queue - along with the rest of the market data pipeline - is part of the backend for my trading interface.