Building a Lock-Free SPSC Queue for Market Data

The most common latency bottleneck in a market data pipeline isn’t your network stack or your order book - it’s the handoff between threads. A naive std::mutex-protected queue can add 200-500ns per operation on modern hardware once the producer and consumer contend for the lock. For a tick processing pipeline where you need to react in under a microsecond, that’s a non-starter.

Here’s how I built the SPSC queue that underpins the market data feed in my trading interface.

Why SPSC?

Market data pipelines have a natural single-producer structure: one thread receives bytes off the wire and parses them into tick structs. A second thread consumes those ticks and updates the order book. Classic SPSC territory.

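The constraint of one producer and one consumer allows a key simplification: only two indices need to be atomic, and each is written by exactly one thread. The producer owns tail_ and the consumer owns head_, so a plain atomic load and store on each index is enough. This avoids the costly compare-and-swap (CAS) retry loop of a general MPMC queue.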
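To make the contrast concrete, here is a minimal sketch of how advancing a shared index differs between the two cases. It is not part of the queue itself: claim_slot_mp and publish_slot_sp are hypothetical names used only for illustration, and a real MPMC queue needs more than this (per-slot sequencing, for example).

```cpp
#include <atomic>
#include <cstddef>

// Multi-producer case: several threads race for the same index, so each one
// must retry a compare-exchange until its own increment wins.
inline std::size_t claim_slot_mp(std::atomic<std::size_t>& tail) {
    std::size_t t = tail.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak reloads the current value into t.
    while (!tail.compare_exchange_weak(t, t + 1,
                                       std::memory_order_acq_rel,
                                       std::memory_order_relaxed)) {
    }
    return t; // the slot this thread now owns
}

// Single-producer case: nobody else writes tail, so a relaxed load followed
// by a release store is enough. No retry loop, no read-modify-write.
inline std::size_t publish_slot_sp(std::atomic<std::size_t>& tail) {
    std::size_t const t = tail.load(std::memory_order_relaxed);
    tail.store(t + 1, std::memory_order_release);
    return t;
}
```

The single-producer version compiles down to an ordinary load and store on x86, while the multi-producer version needs a locked instruction and may loop under contention.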

The Core Design

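#include <array>
#include <atomic>
#include <bit>      // std::has_single_bit (C++20)
#include <cstddef>

// Bounded SPSC ring buffer. One slot always stays empty so that full and
// empty are distinguishable, which leaves N - 1 usable slots.
template <typename T, std::size_t N>
class SpscQueue {
    static_assert(std::has_single_bit(N), "N must be a power of two");
    static_assert(N >= 2, "N = 1 would leave zero usable slots");

    alignas(64) std::atomic<std::size_t> head_{0}; // read index, written only by the consumer
    alignas(64) std::atomic<std::size_t> tail_{0}; // write index, written only by the producer
    std::array<T, N> buf_;

public:
    // Producer thread only. Returns false when the queue is full.
    bool push(T const& val) noexcept {
        auto const t = tail_.load(std::memory_order_relaxed);
        auto const next = (t + 1) & (N - 1);
        if (next == head_.load(std::memory_order_acquire))
            return false; // full
        buf_[t] = val;
        tail_.store(next, std::memory_order_release);
        return true;
    }

    // Consumer thread only. Returns false when the queue is empty.
    bool pop(T& val) noexcept {
        auto const h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
            return false; // empty
        val = buf_[h];
        head_.store((h + 1) & (N - 1), std::memory_order_release);
        return true;
    }
};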

A few things are doing heavy lifting here.

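alignas(64) on the indices. Without this, head_ and tail_ would normally land on the same cache line. The producer writes tail_; the consumer writes head_. With that false sharing, each write invalidates the other core’s copy of the line, so every operation pays a cross-core coherency round trip (typically tens of nanoseconds within one socket, more across sockets). Separate cache lines remove the false sharing. What remains is the unavoidable true sharing: each side still has to read the other side’s index.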
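The layout can be checked at compile time. The sketch below uses Indices and PackedIndices as hypothetical stand-ins for the queue’s two index members, and assumes a 64-byte cache line, which holds for current x86-64 parts but is not universal.

```cpp
#include <atomic>
#include <cstddef>

inline constexpr std::size_t kCacheLine = 64; // assumed line size (x86-64)

// Hypothetical stand-in for the queue's two indices, each on its own line.
struct Indices {
    alignas(kCacheLine) std::atomic<std::size_t> head{0};
    alignas(kCacheLine) std::atomic<std::size_t> tail{0};
};

// The same two members without the alignment request, for comparison.
struct PackedIndices {
    std::atomic<std::size_t> head{0};
    std::atomic<std::size_t> tail{0};
};

// Cache-line number of a byte offset, counted from the start of the object.
constexpr std::size_t line_of(std::size_t offset) { return offset / kCacheLine; }

static_assert(alignof(Indices) == kCacheLine);
static_assert(sizeof(Indices) == 2 * kCacheLine);
static_assert(line_of(offsetof(Indices, head)) != line_of(offsetof(Indices, tail)));
// Packed: both members fall within the first 64 bytes of the object.
static_assert(line_of(offsetof(PackedIndices, head)) ==
              line_of(offsetof(PackedIndices, tail)));
```

C++17 also offers std::hardware_destructive_interference_size in <new> as a portable replacement for the literal 64. One thing worth checking in the queue itself: buf_ is declared right after tail_ with no alignment of its own, so its first elements likely share tail_’s cache line. Giving buf_ its own alignas(64) would be one way to separate them.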

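Memory order selection. push uses relaxed to load tail_ (only the producer writes it), acquire to load head_ (so the producer knows the consumer has finished reading the slot it is about to reuse), and release to store the new tail_ (publish the data before advertising availability). pop mirrors this: relaxed on head_, acquire on tail_ before reading the slot, and release on the new head_ once the copy out of buf_ is done. Each acquire/release pair establishes the happens-before edge that makes the data transfer safe.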
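The same release/acquire pairing can be seen in isolation in the sketch below, where a single flag plays the role of tail_. message_passing_demo is a hypothetical name, and the example is a stripped-down illustration rather than code from the queue.

```cpp
#include <atomic>
#include <thread>

// Publishes a payload through a release store and reads it after an acquire
// load. Returns the value the reader observed.
inline int message_passing_demo() {
    int payload = 0;                 // plain, non-atomic data (like buf_[t])
    std::atomic<bool> ready{false};  // plays the role of tail_
    int seen = -1;

    std::thread producer([&] {
        payload = 42;                                  // 1. write the data
        ready.store(true, std::memory_order_release);  // 2. then publish it
    });
    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire)) // 3. wait for the publish
            std::this_thread::yield();
        seen = payload;                                // 4. ordered after step 1
    });

    producer.join();
    consumer.join();
    return seen;
}
```

Because the acquire load that observes true synchronizes with the release store, the read in step 4 is guaranteed to see the write from step 1. With relaxed on both sides there would be no such guarantee, and the unsynchronized access to payload would be a data race.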

Power-of-two capacity. The modulo becomes a bitwise AND, which matters when this code runs in a tight hot loop.
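As a small worked example of the masking trick, the helpers below wrap an index and compute the fill level. next_index and size_of_ring are hypothetical names for illustration. The power-of-two check is written as a bit test so the snippet does not depend on <bit>; it is equivalent to std::has_single_bit for nonzero N.

```cpp
#include <cstddef>

// Advance a ring index by one slot, wrapping with a mask instead of %.
// Only valid when N is a power of two.
template <std::size_t N>
constexpr std::size_t next_index(std::size_t i) noexcept {
    static_assert(N != 0 && (N & (N - 1)) == 0, "N must be a power of two");
    return (i + 1) & (N - 1);
}

// Number of elements currently stored, given the two indices.
template <std::size_t N>
constexpr std::size_t size_of_ring(std::size_t head, std::size_t tail) noexcept {
    static_assert(N != 0 && (N & (N - 1)) == 0, "N must be a power of two");
    return (tail - head) & (N - 1); // unsigned wrap-around keeps this correct
}

static_assert(next_index<8>(6) == 7);
static_assert(next_index<8>(7) == 0);       // same result as (7 + 1) % 8
static_assert(size_of_ring<8>(6, 2) == 4);  // slots 6, 7, 0, 1 are occupied
```

Because push treats next == head as full, the queue in this post holds at most N - 1 elements, so a capacity of 1024 gives 1023 usable slots.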

Cache Warming

One issue I hit in practice: when the queue is empty and the consumer is spinning, it repeatedly loads tail_. Under MESI the line sits in the Shared state in the consumer’s cache, so those loads stay hot. But the first push after a quiet period invalidates that copy, and the consumer’s next load has to pull the line back from the producer’s core. Until the new value arrives there’s a short window where the consumer still sees the queue as empty and spins an extra iteration.

For ultra-low-latency scenarios, I added a prefetch hint in the consumer’s spin loop:

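// Note: tail_ is private in SpscQueue as written above, so this loop assumes
// it runs where tail_ is accessible (a member function or a friend).
while (!q.pop(tick)) {
    __builtin_ia32_pause();             // x86 PAUSE (GCC/Clang builtin)
    // keep tail_ cache line warm: read prefetch, highest temporal locality
    __builtin_prefetch(&q.tail_, 0, 3);
}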

This shaved ~15ns off worst-case first-message latency in my benchmarks.
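A slightly more portable shape for that spin loop is sketched below. cpu_relax and spin_until are hypothetical helpers, not part of the queue; the sketch assumes GCC or Clang for __builtin_prefetch and only issues PAUSE on x86.

```cpp
#include <cstdint>
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h> // _mm_pause
#endif

// Spin-wait hint: PAUSE on x86, a no-op elsewhere in this sketch.
inline void cpu_relax() noexcept {
#if defined(__x86_64__) || defined(__i386__)
    _mm_pause();
#endif
}

// Calls try_once() until it returns true. Between attempts it relaxes the
// core and issues a read prefetch for watch_addr.
// Returns the number of failed attempts.
template <typename TryOnce>
std::uint64_t spin_until(TryOnce&& try_once, void const* watch_addr) {
    std::uint64_t misses = 0;
    while (!try_once()) {
        ++misses;
        cpu_relax();
#if defined(__GNUC__) || defined(__clang__)
        __builtin_prefetch(watch_addr, 0, 3); // 0 = read, 3 = keep in all cache levels
#else
        (void)watch_addr;
#endif
    }
    return misses;
}
```

With the queue above, the call would look like spin_until([&] { return q.pop(tick); }, addr), where addr is the address of tail_. Since tail_ is private, getting that address would need an accessor or a friend declaration, which is an assumption here. The returned miss count is handy for checking how long the consumer actually spins.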

Benchmarks

Measured on a pinned core pair (no NUMA crossing), Ryzen 9 7950X:

Queue type                     p50     p99     p99.9
std::mutex + std::queue        380ns   620ns   2.1µs
boost::lockfree::spsc_queue    68ns    95ns    140ns
This implementation            41ns    58ns    82ns

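The gap vs Boost is unlikely to come from memory ordering: boost::lockfree::spsc_queue uses the same relaxed/acquire/release pattern on its indices rather than seq_cst, and it has no overwrite policy. More likely candidates are its support for capacities that are not a power of two, which means the index wrap can’t be a single AND, and its in-place construction and destruction of elements. Treat these as hypotheses to confirm in the generated assembly.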
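For anyone reproducing numbers like these, percentiles such as p50/p99/p99.9 can be computed from raw latency samples with a nearest-rank rule. The helper below is a sketch of that common method under a hypothetical name, percentile_ns, and is not necessarily how the table above was produced.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Nearest-rank percentile over latency samples in nanoseconds.
// per_mille is the percentile times ten (500 = p50, 990 = p99, 999 = p99.9),
// kept integral to avoid floating-point rounding at the rank boundary.
// Takes the samples by value because it sorts them. Requires a non-empty input.
inline std::uint64_t percentile_ns(std::vector<std::uint64_t> samples,
                                   std::size_t per_mille) {
    std::sort(samples.begin(), samples.end());
    // rank = ceil(n * per_mille / 1000), clamped to [1, n]
    std::size_t rank = (samples.size() * per_mille + 999) / 1000;
    rank = std::clamp<std::size_t>(rank, 1, samples.size());
    return samples[rank - 1];
}
```

Tail percentiles need enough samples to mean anything: p99.9 is determined by the slowest 0.1% of measurements, so a run of a few thousand operations leaves it resting on a handful of data points.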

One Gotcha: the Compiler Is Your Enemy

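This one only bites if the indices are plain variables rather than std::atomic. In that case the compiler is free to hoist the capacity check out of a spin loop, or to move the index store ahead of the data write, and sprinkling __attribute__((noinline)) or volatile over push/pop hides the problem rather than fixing it. With std::atomic and the acquire/release orders above, neither is needed: the compiler may not sink the write to buf_ below the release store, and in practice it does not hoist an atomic load out of a loop. Still, always inspect the assembly. godbolt.org with -O3 is your friend; pass an explicit -march for your target, since -march=native there describes Compiler Explorer’s build machine, not yours.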

The code for this queue - along with the rest of the market data pipeline - is part of the backend for my trading interface.