Tutorial: Compress the KV Cache with TurboQuant and Haystack


  • Level: Advanced
  • Time to complete: 20 min
  • Components Used: HuggingFaceLocalChatGenerator
  • Goal: Apply TurboQuant KV cache compression to a local LLM and measure its memory and throughput impact with Haystack.

Overview

Every time an LLM generates a token, it reads and writes a key-value (KV) cache - a growing table of intermediate activations that lets the model attend to previous tokens without recomputing them. On long contexts or large models, this cache becomes the dominant consumer of GPU memory.

TurboQuant is a KV cache compression algorithm from Google Research (ICLR 2026) that shrinks those vectors to 3–4 bits per coordinate without any retraining. It works in two stages:

  1. PolarQuant - a random orthogonal rotation maps cache vectors to a more uniform distribution, then quantizes them in polar coordinates using Lloyd-Max optimal centroids.
  2. QJL (Quantized Johnson-Lindenstrauss) - a single extra bit per vector corrects residual errors in attention score computation, preserving accuracy at extreme compression ratios.
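
The rotate-then-quantize idea behind stage 1 can be sketched in a few lines. This toy uses a random orthogonal rotation followed by plain uniform scalar quantization; the real PolarQuant quantizes in polar coordinates with Lloyd-Max centroids, so treat this only as an illustration of the round trip, not the actual algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random orthogonal rotation: Q from the QR decomposition of a Gaussian matrix.
d = 128
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

def quantize(v, bits=4):
    # Rotate, then uniformly quantize each coordinate to `bits` levels.
    # (PolarQuant instead quantizes in polar coordinates with
    # Lloyd-Max optimal centroids; uniform bins are a simplification.)
    rotated = Q @ v
    lo, hi = rotated.min(), rotated.max()
    scale = (hi - lo) / (2**bits - 1)
    codes = np.round((rotated - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    # Q is orthogonal, so its transpose undoes the rotation.
    return Q.T @ (codes * scale + lo)

v = rng.standard_normal(d)
v_hat = dequantize(*quantize(v))
err = np.linalg.norm(v - v_hat) / np.linalg.norm(v)
```

Because the rotation is orthogonal it preserves norms and dot products, which is what lets the quantized cache still produce accurate attention scores.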

The result: KV memory can drop from 1,639 MiB to 435 MiB (3.76x) on an RTX 4090, with ≥6x reduction validated on server hardware, and near-identical output quality.
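As a quick sanity check, those figures line up with the bit-widths involved. The overhead accounting in the comments is an assumption for illustration, not taken from the paper:

```python
# Figures quoted above: 1,639 MiB uncompressed vs 435 MiB compressed
# on an RTX 4090.
measured_ratio = 1639 / 435  # ~3.77x

# An idealized fp16 -> 4-bit conversion would give exactly 4x; the gap
# down to ~3.77x is plausibly metadata (QJL correction bits, per-block
# quantization scales) - the exact overhead layout is an assumption.
ideal_ratio = 16 / 4
```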

In this tutorial you will use turboquant-vllm's CompressedDynamicCache with Haystack's HuggingFaceLocalChatGenerator, run a generation, and measure time-to-first-token, throughput, and live VRAM usage.

Installing Haystack and TurboQuant

First, let’s install haystack-ai and turboquant-vllm, a community implementation of the TurboQuant algorithm that provides the CompressedDynamicCache wrapper.

%%bash

pip install -q haystack-ai turboquant-vllm

Setting Up a Streaming Callback

To measure time-to-first-token (TTFT) and throughput, we pass a streaming callback that timestamps each arriving token. The first call marks TTFT, while the last marks the end of generation.

import time

first_token_time = None
last_token_time = None

def timing_callback(chunk):
    global first_token_time, last_token_time
    now = time.perf_counter()
    if first_token_time is None:
        first_token_time = now  # first chunk marks time-to-first-token
    last_token_time = now       # last assignment marks end of generation
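
You can dry-run the bookkeeping before wiring the callback into the generator. This hypothetical check feeds the callback a few fake chunks, with a short sleep standing in for per-token decode latency (the callback is repeated here so the cell runs standalone):

```python
import time

first_token_time = last_token_time = None

def timing_callback(chunk):
    # Same callback as above, repeated so this cell is self-contained.
    global first_token_time, last_token_time
    now = time.perf_counter()
    if first_token_time is None:
        first_token_time = now
    last_token_time = now

start = time.perf_counter()
for _ in range(5):
    time.sleep(0.01)         # stand-in for per-token decode latency
    timing_callback(None)    # chunk content is ignored by the callback

ttft = first_token_time - start
gen_time = last_token_time - first_token_time
```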

Compressing the KV Cache

Next, let’s create the compressed cache. We start with HuggingFace’s standard DynamicCache and wrap it with CompressedDynamicCache, which intercepts cache writes and applies TurboQuant compression in place.

Two parameters control the compression:

  • head_dim - the dimensionality of each attention head’s key/value vectors
  • bits - the target bit-width per coordinate

Note: Pass the original `cache` object to the generator, not `compressed`. CompressedDynamicCache modifies the wrapped cache in place, so both variables point to the same compressed state.

from transformers import DynamicCache
from turboquant_vllm import CompressedDynamicCache

# CompressedDynamicCache modifies the wrapped DynamicCache in place,
# so we pass the original `cache` instance to the generator rather
# than `compressed` directly.
cache = DynamicCache()
compressed = CompressedDynamicCache(cache, head_dim=128, bits=4)
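
To see what `bits` buys you, here is a back-of-envelope per-token KV footprint. The layer and head counts are illustrative assumptions, not read from Qwen3-4B's actual config:

```python
# Illustrative (assumed) model dimensions - not Qwen3-4B's real config.
layers, kv_heads, head_dim = 36, 8, 128

def kv_bytes_per_token(bits):
    # One key vector and one value vector of head_dim per KV head per layer.
    return layers * 2 * kv_heads * head_dim * bits // 8

fp16_per_token = kv_bytes_per_token(16)  # uncompressed fp16 cache
q4_per_token = kv_bytes_per_token(4)     # 4-bit codes, metadata ignored
```

At these dimensions the fp16 cache costs 144 KiB per token, so long contexts quickly reach gigabytes; the 4-bit version cuts that by 4x before metadata overhead.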

Initializing the Generator

Now let’s set up HuggingFaceLocalChatGenerator with a selected model, like Qwen/Qwen3-4B-Thinking-2507. We pass the compressed cache via generation_kwargs so that every decoding step writes through TurboQuant.

from haystack.components.generators.chat import HuggingFaceLocalChatGenerator

generator = HuggingFaceLocalChatGenerator(
    model="Qwen/Qwen3-4B-Thinking-2507",
    task="text-generation",
    generation_kwargs={
        "past_key_values": cache,
        "use_cache": True,
    },
    streaming_callback=timing_callback,
)

Running the Generator

Let’s run a generation and record the total wall time.

from haystack.dataclasses import ChatMessage

start = time.perf_counter()
output = generator.run(messages=[
    ChatMessage.from_user("What is the capital of France?"),
])
total_time = time.perf_counter() - start
reply = output["replies"][0]
print(reply.text)

Reading the Metrics

Three metrics to check:

  • TTFT (time-to-first-token) - latency to the first output token - a proxy for perceived responsiveness.
  • Throughput (tok/s) - tokens decoded per second. TurboQuant’s memory savings reduce cache read pressure, which can improve this on memory-bandwidth-bound hardware.
  • Total time - end-to-end wall time including model loading overhead.

tokens = reply.meta["usage"]["completion_tokens"]
if first_token_time is not None and last_token_time is not None:
    generation_time = last_token_time - first_token_time
    print(f"TTFT: {first_token_time - start:.3f}s")
    print(f"Tokens: {tokens}")
    print(f"Speed: {tokens / generation_time:.1f} tok/s")
print(f"Total time: {total_time:.3f}s")

Checking VRAM Usage

vram_bytes() returns the byte footprint of all compressed KV tensors. Compare it against an uncompressed DynamicCache to verify the reduction reported in the TurboQuant paper.

compressed.vram_bytes()
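
For the uncompressed baseline, one way to get a comparable number is to sum the raw tensor sizes held by the cache. A sketch, assuming the cache exposes `key_cache` and `value_cache` lists of tensors, which is how transformers' DynamicCache stores them:

```python
def cache_nbytes(cache):
    # Byte footprint of all key/value tensors held by the cache:
    # elements per tensor times bytes per element, summed.
    tensors = list(cache.key_cache) + list(cache.value_cache)
    return sum(t.numel() * t.element_size() for t in tensors)
```

Run this on a plain DynamicCache after an uncompressed generation and divide by `compressed.vram_bytes()` to get the observed reduction.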

🎉 Congratulations! You’ve successfully run a local LLM with TurboQuant KV cache compression through Haystack and measured its real-world memory and throughput impact.