Scaling ClickHouse for Embedding Search: Schemas, Indexing and Cost Tips
Hands-on guide to schema, compression, TTL and vector-index strategies for efficient ClickHouse-based embedding search in 2026.
Why teams struggle to run embedding search in ClickHouse (and how to fix it)
Embedding search for retrieval-augmented generation (RAG) promises huge productivity wins — but many engineering teams hit three recurring problems: exploding storage and CPU costs, slow or inconsistent recall at scale, and operational complexity when combining ClickHouse with ANN engines. This guide gives you concrete schema patterns, TTL and compression tactics, and vector-indexing strategies you can apply in 2026 to run fast, cost-effective embedding search on ClickHouse-backed RAG pipelines.
Executive takeaways (apply these first)
- Two-stage retrieval is the default: use ClickHouse for fast metadata filtering and candidate reduction, then run ANN re-ranking on a compact candidate set.
- Store embeddings compactly: Float16 / quantized bytes + column compression saves 3–10x storage with minimal hit to recall.
- Use TTL + partitions for lifecycle: MergeTree TTLs and time-based partitions keep storage predictable and reduce cold data cost.
- Apply coarse filters in SQL: use tags, vector coarse hashing, or precomputed clustering in ClickHouse to limit ANN scope per query.
- Hybrid deployment is pragmatic: ClickHouse + external ANN (FAISS, HNSWlib, Milvus, or a managed vector DB) gives the best trade-offs at scale.
The 2026 context: why ClickHouse is a serious RAG contender
ClickHouse continues to expand as a general analytics backbone. By late 2025 and into 2026 we saw accelerated adoption across product and ML teams, driven by larger funding and ecosystem investments; the company’s growth (including major funding rounds reported in January 2026) indicates enterprise momentum for using ClickHouse beyond classic OLAP workloads. Practically, that means more tooling, connectors, and community patterns for embedding storage, metadata indexing, and hybrid ANN integrations — all of which we map below.
What to expect in 2026 operationally
- ClickHouse remains extremely cost-efficient per GB vs many managed vector DBs.
- Teams increasingly pair ClickHouse for filtering/analytics with a purpose-built ANN engine (open-source or managed) for the top-k search.
- Compression and quantization became standard practice in 2025–2026 to control storage and egress costs.
Core schema patterns for embeddings in ClickHouse
Choose a schema that matches your scale and retrieval pattern. Below are three battle-tested patterns: canonical row-per-chunk, compact bytes-first, and metadata-first — with CREATE TABLE snippets and trade-offs.
1) Canonical (developer-friendly)
Simple, readable: keep the embedding as an Array(Float32) for easy debugging and ad-hoc similarity queries. Best for small-to-medium datasets (up to low millions of vectors).
CREATE TABLE docs_embeddings
(
doc_id UUID,
chunk_id UInt64,
text String,
embedding Array(Float32) CODEC(ZSTD(5)),
dims UInt16 DEFAULT length(embedding),
tenant_id String, -- for multi-tenant filtering
created_at DateTime
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(created_at)
ORDER BY (tenant_id, doc_id, chunk_id)
TTL created_at + INTERVAL 90 DAY DELETE;
Pros: easy to inspect, simple ingestion. Cons: high storage per vector (Float32 * dim).
2) Compact bytes-first (production at scale)
Store quantized bytes or Float16 blobs. Use external libraries to encode/decode to/from bytes. Great when you want ClickHouse to be the canonical store but keep bytes small.
CREATE TABLE docs_embeddings_compact
(
doc_id UUID,
chunk_id UInt64,
text String,
embedding_blob String CODEC(ZSTD(9)), -- bytes: FP16, PQ, or OPQ compressed
embedding_method Enum8('fp16'=1,'pq'=2,'opq'=3),
dims UInt16,
tenant_id LowCardinality(String),
namespace LowCardinality(String),
created_at DateTime
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(created_at)
ORDER BY (tenant_id, namespace, doc_id)
TTL created_at + INTERVAL 180 DAY DELETE;
Pros: lower storage, network efficiency when transferring embeddings out for ANN. Cons: needs encoding/decoding step in application code.
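The encoding/decoding step can stay very small: Python's standard-library struct module supports IEEE half-precision (format code 'e') natively, so an FP16 blob round-trip needs no external dependencies. A minimal sketch (the function names encode_fp16_blob/decode_fp16_blob are illustrative, not from any library):

```python
import struct

def encode_fp16_blob(vec):
    """Pack a list of floats into little-endian IEEE half-precision bytes."""
    return struct.pack(f'<{len(vec)}e', *vec)

def decode_fp16_blob(blob):
    """Unpack an FP16 blob back into a list of Python floats."""
    n = len(blob) // 2  # 2 bytes per half-precision value
    return list(struct.unpack(f'<{n}e', blob))

# Round-trip: values exactly representable in FP16 survive unchanged;
# others are rounded to the nearest half-precision value.
blob = encode_fp16_blob([0.25, -1.5, 3.0])
restored = decode_fp16_blob(blob)
```

Store the resulting bytes in the embedding_blob String column; the blob is half the size of Float32 before ZSTD even touches it.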
3) Metadata-first (analytics + RAG)
When metadata filtering is the common path — e.g., tenant, domain, language — store small summary fields and push heavy text/embeddings to cold tiers or separate tables.
CREATE TABLE docs_metadata
(
doc_id UUID,
chunk_id UInt64,
tenant_id LowCardinality(String),
language Enum8('en'=1,'es'=2,'de'=3),
namespace String,
score_estimate Float32,
created_at DateTime
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(created_at)
ORDER BY (tenant_id, namespace, doc_id)
TTL created_at + INTERVAL 365 DAY DELETE;
-- Separate cold storage for full text and embeddings
CREATE TABLE docs_blob_cold ...
Pros: best for cost control — limit hot working set size for ANN. Cons: more complex joins and pipeline logic.
Indexing & search strategies (two-stage retrieval)
At scale, direct vector scanning in ClickHouse is costly. The practical, high-recall approach is two-stage retrieval:
- Use ClickHouse SQL to apply metadata filters and coarse candidate selection.
- Use a purpose-built ANN engine to re-rank candidates by exact or approximate similarity.
Why two-stage works
Filtering in SQL can eliminate 90–99% of data cheaply (tenant, time window, tags). ANN engines are optimized for vector math and provide latency and GPU support that ClickHouse does not prioritize. This division keeps costs down and latency predictable.
Practical candidate-selection patterns
- Metadata prefilter: SELECT candidates WHERE tenant_id = 'abc' AND language = 'en' AND created_at > now() - INTERVAL 180 DAY LIMIT 10000
- Coarse quantization buckets: Precompute k-means cluster ids for each embedding and store cluster_id. Query cluster_id of the query embedding and select only those clusters.
- Locality Sensitive Hashing (LSH) sketch columns: store a few hash tokens from an LSH function and use them in WHERE clauses to narrow candidates quickly.
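The LSH sketch idea above can be implemented with random hyperplanes: each hash table projects the vector onto a few fixed random directions and packs the sign bits into an integer token. A minimal, dependency-free sketch (lsh_tokens is a hypothetical helper; seed and sizes are tuning knobs):

```python
import random

def lsh_tokens(vec, num_tables=4, bits_per_table=8, seed=42):
    """Random-hyperplane LSH: per table, hash the vector to one integer token.
    Nearby vectors (by cosine) tend to produce colliding tokens."""
    rng = random.Random(seed)  # fixed seed so ingest-time and query-time hashes match
    tokens = []
    for _ in range(num_tables):
        token = 0
        for _ in range(bits_per_table):
            plane = [rng.gauss(0.0, 1.0) for _ in vec]
            bit = sum(p * x for p, x in zip(plane, vec)) >= 0.0
            token = (token << 1) | int(bit)
        tokens.append(token)
    return tokens
```

Store the tokens in an Array(UInt16) column and filter candidates with hasAny(lsh_tokens, [query_tokens...]) in the WHERE clause; because the hash is sign-based, it is invariant to positive scaling of the vector.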
Example: SQL + FAISS pipeline
-- ClickHouse: pick candidate rows (cheap)
SELECT doc_id, chunk_id, embedding_blob
FROM docs_embeddings_compact
WHERE tenant_id='acme'
AND namespace='support-kb'
AND created_at >= now() - INTERVAL 365 DAY
LIMIT 5000;
-- Application: decode embedding_blob -> float32 vectors
-- Build or query FAISS/HNSW index against these candidates and return top-k ids
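The coarse-quantization bucket pattern from the list above reduces to nearest-centroid assignment at ingest time (the centroids come from an offline k-means run over a sample of your embeddings; assign_cluster is an illustrative helper, not a library function):

```python
def assign_cluster(vec, centroids):
    """Return the index of the closest centroid by squared Euclidean distance."""
    best_id, best_dist = -1, float('inf')
    for cid, c in enumerate(centroids):
        d = sum((a - b) ** 2 for a, b in zip(vec, c))
        if d < best_dist:
            best_id, best_dist = cid, d
    return best_id

# At ingest: store assign_cluster(embedding, centroids) in a cluster_id column.
# At query: compute the query vector's cluster_id (or its few nearest clusters)
# and add `AND cluster_id IN (...)` to the candidate-selection SQL.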
Compression & quantization: how to reduce cost with minimal recall loss
2025–2026 brought wider adoption of embedding compression patterns. Use a combination of these techniques depending on your accuracy target:
- Float16: ~2x storage reduction vs Float32; often <2% recall drop for many models.
- Product Quantization (PQ): 4–16x reduction; good for large buckets but requires ANN support that can use PQ codes.
- OPQ / residual quantization: next-step improvements for high-dimensional vectors.
- PCA or random projection: reduce dims (e.g., from 1536 -> 256), then store compressed vectors.
- Column codec in ClickHouse: use CODEC(ZSTD(5..9)) for embedding columns or CODEC(LZ4) for lower CPU cost. Example: embedding Array(Float32) CODEC(ZSTD(7)).
Recommendation: start with Float16 + ZSTD(5) and run offline recall tests vs your golden dataset. If recall falls below threshold, try PCA->256 then Float16, or PQ with external ANN.
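The offline recall test reduces to top-k overlap: run the same golden queries through the full-precision baseline and the compressed pipeline, then compare result sets. A minimal metric (recall_at_k is an illustrative name):

```python
def recall_at_k(baseline_ids, candidate_ids, k=10):
    """Fraction of the baseline's top-k that the compressed pipeline also returned."""
    truth = set(baseline_ids[:k])
    found = set(candidate_ids[:k])
    return len(truth & found) / len(truth)

# Aggregate the mean over the golden query set; if it dips below your
# threshold (e.g. 0.95), step back to a gentler compression setting.
```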
TTL, partitioning and lifecycle management
Keeping your active working set small is the most effective cost control. ClickHouse MergeTree TTLs and partitions make lifecycle predictable.
- Time partitions: partition by month (toYYYYMM(created_at)) for predictable compaction and deletion cost.
- TTL policies: Use TTL ... DELETE to automatically drop old vectors, or TTL ... TO DISK('cold') to tier cold blobs to cheaper storage.
- Cold storage: separate the huge historical archive; keep only the last N months of vectors in hot ANN-ready tables.
-- Example: TTL to move embeddings to a 'slow' disk tier and then delete
CREATE TABLE docs_embeddings_tiered
(
..., -- schema
created_at DateTime
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(created_at)
ORDER BY (tenant_id, doc_id)
TTL created_at + INTERVAL 30 DAY TO DISK('slow'),
created_at + INTERVAL 365 DAY DELETE;
Operational and cost tips (real-world checklist)
- Benchmark end-to-end latency: measure SQL filtering + fetch + ANN search + re-ranking + LLM call. Optimize the slowest stage.
- Batch requests: fetch embeddings in larger batches where possible to amortize connection overhead.
- Use Materialized Views: precompute heavy filters and cluster assignments into MV tables to reduce query time.
- Monitor cardinality: use LowCardinality types for tags and enums to reduce memory and index size.
- Isolate search nodes: dedicating fast NVMe nodes for hot tables avoids IO contention with analytics workloads.
- Right-size ANN: GPUs for heavy throughput (500–1000 qps) or high-dimensional brute-force; CPU HNSW for medium QPS; managed vector DBs if you want SLA and lower ops.
Hybrid deployment patterns: when to use ClickHouse only vs ClickHouse + Vector DB
Decision factors: dataset size, latency requirements, ops budget, and model update frequency.
- ClickHouse-only (small-to-medium datasets, tight budget): store Float16 embeddings in ClickHouse and run re-ranking in process or via a lightweight ANN (HNSWlib). Pros: lower storage cost, simplified consistency. Cons: higher engineering work to tune ANN and scaling.
- ClickHouse + self-hosted ANN (growing scale): ClickHouse for metadata and candidate selection; FAISS/HNSWlib/Milvus on dedicated nodes for ANN. Pros: flexible, high performance. Cons: more infra to manage.
- ClickHouse + managed vector DB (SaaS) (speed to market): sync embeddings to Pinecone/Weaviate/Milvus Cloud while keeping analytics in ClickHouse. Pros: minimal ops, built-in indexing. Cons: higher per-GB cost and potential data egress costs.
End-to-end example: Query flow with code snippets
Below is a concise Python example showing the typical two-stage flow: query embedding -> ClickHouse filter -> decode embeddings -> FAISS search -> fetch top-K texts.
import clickhouse_connect
import numpy as np
import faiss
# ClickHouse client (clickhouse_connect exposes a get_client() factory, not a Client constructor)
ch = clickhouse_connect.get_client(host='clickhouse-host', username='user', password='pw', database='db')
# 1) make query embedding (from your embedding model)
q_vec = np.array(get_embedding('How to reset my password?'), dtype=np.float32)
# 2) prefilter candidates using ClickHouse; query() returns a QueryResult,
#    so iterate named_results() to get dict-shaped rows
rows = list(ch.query('''
    SELECT doc_id, chunk_id, embedding_blob
    FROM docs_embeddings_compact
    WHERE tenant_id='acme' AND namespace='support-kb'
    LIMIT 5000
''').named_results())
# 3) decode blobs -> matrix (app-specific decoding: FP16 -> float32 or PQ decode)
emb_matrix = np.stack([decode_blob(r['embedding_blob']) for r in rows]).astype(np.float32)
# 4) FAISS index on the candidate set; inner product equals cosine after L2 normalization
index = faiss.IndexFlatIP(emb_matrix.shape[1])
faiss.normalize_L2(emb_matrix)
index.add(emb_matrix)
q = q_vec.reshape(1, -1)
faiss.normalize_L2(q)
D, I = index.search(q, 10)
# 5) fetch selected docs and pass to LLM (quote UUID literals; parameterized queries are safer still)
selected_ids = [rows[i]['doc_id'] for i in I[0]]
id_list = ','.join(f"'{d}'" for d in selected_ids)
docs = ch.query(f"SELECT doc_id, text FROM docs_embeddings_compact WHERE doc_id IN ({id_list})")
Monitoring, reindexing and evaluation
Operational playbook:
- Maintain a golden query set for recall/precision checks; run nightly evaluation comparing full-scan baseline vs production two-stage pipeline.
- Recompute cluster assignments and quantization centroids monthly or whenever you update base embedding model.
- Autoscale ANN nodes based on 95th percentile QPS and tail latency requirements (measure heavy-tail effects).
Cost modeling — rules of thumb
- Storage: Float32 1536-dim ~ 6 MB per 1,000 vectors; Float16 ~3 MB per 1,000. PQ/OPQ can go <1MB per 1,000 depending on code size.
- Network egress: moving embeddings between services is expensive; prefer compact blobs and batch transfers.
- Compute: ANN re-ranking is the main cost driver as you increase candidate pool size; reduce candidates from 50k -> 5k for ~10x speedups.
- Operations: managed vector DBs cost more per GB but save engineering hours; ClickHouse scales cheaper if you have SRE bandwidth.
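Those storage rules of thumb fall straight out of bytes-per-dimension arithmetic; a quick sanity check on pre-compression sizes (ZSTD typically shaves more off on top):

```python
def storage_mb(num_vectors, dims, bytes_per_dim):
    """Raw (pre-compression) embedding storage in MB (1 MB = 1e6 bytes)."""
    return num_vectors * dims * bytes_per_dim / 1e6

# 1,000 vectors at 1536 dims:
fp32 = storage_mb(1000, 1536, 4)   # ~6.1 MB
fp16 = storage_mb(1000, 1536, 2)   # ~3.1 MB
pq   = 1000 * 64 / 1e6             # 64-byte PQ codes: ~0.06 MB
```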
Putting it together — recommended starting blueprint (2026)
- Store compressed Float16 embeddings in ClickHouse with CODEC(ZSTD(5-7)).
- Partition by month and apply TTL to remove cold vectors after 90–180 days unless needed for compliance.
- Compute k-means cluster ids and store cluster_id for coarse candidate reduction.
- Implement two-stage retrieval: ClickHouse candidate select (<=5k) -> FAISS/Milvus/managed vector DB ranking -> re-rank and pass to LLM.
- Monitor recall vs a full-scan baseline weekly; iterate compression strategy if recall drops.
In 2026, running embedding search is less about picking a single product and more about designing a predictable pipeline: cheap, auditable filtering in ClickHouse + best-in-class vector search for re-ranking.
Final checklist before ship
- Have a golden test set for RAG recall and latency SLAs.
- Validate compression (Float16, PQ) offline before rollout.
- Automate TTL and partition management in schema.
- Instrument end-to-end costs (storage, network, ANN compute) per tenant/namespace.
- Plan for reindexing when you change the embedding model (store original text or raw source to recompute).
Next steps & call to action
If you’re ready to pilot a production RAG system using ClickHouse, start with a 30-day experiment: ingest a sample of your documents, enable Float16 + ZSTD compression, implement metadata prefiltering, and run a FAISS-backed re-ranking on 5k candidates. Measure recall, costs, and P95 latency. If you want a reproducible checklist and scripts tailored to your data shape, download our ClickHouse + FAISS starter kit or reach out for a hands-on audit and architecture session.