Scaling ClickHouse for Embedding Search: Schemas, Indexing and Cost Tips
Hands-on guide to schema, compression, TTL and vector-index strategies for efficient ClickHouse-based embedding search in 2026.
Why teams struggle to run embedding search in ClickHouse (and how to fix it)
Embedding search for retrieval-augmented generation (RAG) promises huge productivity wins — but many engineering teams hit three recurring problems: exploding storage and CPU costs, slow or inconsistent recall at scale, and operational complexity when combining ClickHouse with ANN engines. This guide gives you concrete schema patterns, TTL and compression tactics, and vector-indexing strategies you can apply in 2026 to run fast, cost-effective embedding search on ClickHouse-backed RAG pipelines.
Executive takeaways (apply these first)
- Two-stage retrieval is the default: use ClickHouse for fast metadata filtering and candidate reduction, then run ANN re-ranking on a compact candidate set.
- Store embeddings compactly: Float16 / quantized bytes + column compression saves 3–10x storage with minimal hit to recall.
- Use TTL + partitions for lifecycle: MergeTree TTLs and time-based partitions keep storage predictable and reduce cold data cost.
- Apply coarse filters in SQL: use tags, vector coarse hashing, or precomputed clustering in ClickHouse to limit ANN scope per query.
- Hybrid deployment is pragmatic: ClickHouse + external ANN (FAISS, HNSWlib, Milvus, or a managed vector DB) gives the best trade-offs at scale.
The 2026 context: why ClickHouse is a serious RAG contender
ClickHouse continues to expand as a general analytics backbone. By late 2025 and into 2026 we saw accelerated adoption across product and ML teams, driven by larger funding and ecosystem investments; the company’s growth (including major funding rounds reported in January 2026) indicates enterprise momentum for using ClickHouse beyond classic OLAP workloads. Practically, that means more tooling, connectors, and community patterns for embedding storage, metadata indexing, and hybrid ANN integrations — all of which we map below.
What to expect in 2026 operationally
- ClickHouse remains extremely cost-efficient per GB vs many managed vector DBs.
- Teams increasingly pair ClickHouse for filtering/analytics with a purpose-built ANN engine (open-source or managed) for the top-k search.
- Compression and quantization became standard practice in 2025–2026 to control storage and egress costs.
Core schema patterns for embeddings in ClickHouse
Choose a schema that matches your scale and retrieval pattern. Below are three battle-tested patterns: canonical row-per-chunk, compact bytes-first, and metadata-first — with CREATE TABLE snippets and trade-offs.
1) Canonical (developer-friendly)
Simple, readable: keep the embedding as an Array(Float32) for easy debugging and ad-hoc similarity queries. Best for small-to-medium datasets (up to low millions of vectors).
CREATE TABLE docs_embeddings
(
doc_id UUID,
chunk_id UInt64,
text String,
embedding Array(Float32) CODEC(ZSTD(5)),
dims UInt16 DEFAULT length(embedding),
tenant_id String, -- for multi-tenant filtering
created_at DateTime
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(created_at)
ORDER BY (tenant_id, doc_id, chunk_id)
TTL created_at + INTERVAL 90 DAY DELETE;
Pros: easy to inspect, simple ingestion. Cons: high storage per vector (Float32 * dim).
2) Compact bytes-first (production at scale)
Store quantized bytes or Float16 blobs. Use external libraries to encode/decode to/from bytes. Great when you want ClickHouse to be the canonical store but keep bytes small.
CREATE TABLE docs_embeddings_compact
(
doc_id UUID,
chunk_id UInt64,
text String,
embedding_blob String CODEC(ZSTD(9)), -- bytes: FP16, PQ, or OPQ compressed
embedding_method Enum8('fp16'=1,'pq'=2,'opq'=3),
dims UInt16,
tenant_id LowCardinality(String),
namespace LowCardinality(String),
created_at DateTime
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(created_at)
ORDER BY (tenant_id, namespace, doc_id)
TTL created_at + INTERVAL 180 DAY DELETE;
Pros: lower storage, network efficiency when transferring embeddings out for ANN. Cons: needs encoding/decoding step in application code.
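The encoding/decoding step can stay very small: Python's standard-library struct module supports IEEE half-precision (format code 'e') natively, so an FP16 blob round-trip needs no external dependencies. A minimal sketch (the function names encode_fp16_blob/decode_fp16_blob are illustrative, not from any library):

```python
import struct

def encode_fp16_blob(vec):
    """Pack a list of floats into little-endian IEEE half-precision bytes."""
    return struct.pack(f'<{len(vec)}e', *vec)

def decode_fp16_blob(blob):
    """Unpack an FP16 blob back into a list of Python floats."""
    n = len(blob) // 2  # 2 bytes per half-precision value
    return list(struct.unpack(f'<{n}e', blob))

# Round-trip: values exactly representable in FP16 survive unchanged;
# others are rounded to the nearest half-precision value.
blob = encode_fp16_blob([0.25, -1.5, 3.0])
restored = decode_fp16_blob(blob)
```

Store the resulting bytes in the embedding_blob String column; the blob is half the size of Float32 before ZSTD even touches it.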
3) Metadata-first (analytics + RAG)
When metadata filtering is the common path — e.g., tenant, domain, language — store small summary fields and push heavy text/embeddings to cold tiers or separate tables.
CREATE TABLE docs_metadata
(
doc_id UUID,
chunk_id UInt64,
tenant_id LowCardinality(String),
language Enum8('en'=1,'es'=2,'de'=3),
namespace String,
score_estimate Float32,
created_at DateTime
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(created_at)
ORDER BY (tenant_id, namespace, doc_id)
TTL created_at + INTERVAL 365 DAY DELETE;
-- Separate cold storage for full text and embeddings
CREATE TABLE docs_blob_cold ...
Pros: best for cost control — limit hot working set size for ANN. Cons: more complex joins and pipeline logic.
Indexing & search strategies (two-stage retrieval)
At scale, direct vector scanning in ClickHouse is costly. The practical, high-recall approach is two-stage retrieval:
- Use ClickHouse SQL to apply metadata filters and coarse candidate selection.
- Use a purpose-built ANN engine to re-rank candidates by exact or approximate similarity.
Why two-stage works
Filtering in SQL can eliminate 90–99% of data cheaply (tenant, time window, tags). ANN engines are optimized for vector math and provide latency and GPU support that ClickHouse does not prioritize. This division keeps costs down and latency predictable.
Practical candidate-selection patterns
- Metadata prefilter: SELECT candidates WHERE tenant_id = 'abc' AND language = 'en' AND created_at > now() - INTERVAL 180 DAY LIMIT 10000
- Coarse quantization buckets: Precompute k-means cluster ids for each embedding and store cluster_id. Query cluster_id of the query embedding and select only those clusters.
- Locality Sensitive Hashing (LSH) sketch columns: store a few hash tokens from an LSH function and use them in WHERE clauses to narrow candidates quickly.
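The LSH sketch idea above can be implemented with random hyperplanes: each hash table projects the vector onto a few fixed random directions and packs the sign bits into an integer token. A minimal, dependency-free sketch (lsh_tokens is a hypothetical helper; seed and sizes are tuning knobs):

```python
import random

def lsh_tokens(vec, num_tables=4, bits_per_table=8, seed=42):
    """Random-hyperplane LSH: per table, hash the vector to one integer token.
    Nearby vectors (by cosine) tend to produce colliding tokens."""
    rng = random.Random(seed)  # fixed seed so ingest-time and query-time hashes match
    tokens = []
    for _ in range(num_tables):
        token = 0
        for _ in range(bits_per_table):
            plane = [rng.gauss(0.0, 1.0) for _ in vec]
            bit = sum(p * x for p, x in zip(plane, vec)) >= 0.0
            token = (token << 1) | int(bit)
        tokens.append(token)
    return tokens
```

Store the tokens in an Array(UInt16) column and filter candidates with hasAny(lsh_tokens, [query_tokens...]) in the WHERE clause; because the hash is sign-based, it is invariant to positive scaling of the vector.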
Example: SQL + FAISS pipeline
-- ClickHouse: pick candidate rows (cheap)
SELECT doc_id, chunk_id, embedding_blob
FROM docs_embeddings_compact
WHERE tenant_id='acme'
AND namespace='support-kb'
AND created_at >= now() - INTERVAL 365 DAY
LIMIT 5000;
-- Application: decode embedding_blob -> float32 vectors
-- Build or query FAISS/HNSW index against these candidates and return top-k ids
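The coarse-quantization bucket pattern from the list above reduces to nearest-centroid assignment at ingest time (the centroids come from an offline k-means run over a sample of your embeddings; assign_cluster is an illustrative helper, not a library function):

```python
def assign_cluster(vec, centroids):
    """Return the index of the closest centroid by squared Euclidean distance."""
    best_id, best_dist = -1, float('inf')
    for cid, c in enumerate(centroids):
        d = sum((a - b) ** 2 for a, b in zip(vec, c))
        if d < best_dist:
            best_id, best_dist = cid, d
    return best_id

# At ingest: store assign_cluster(embedding, centroids) in a cluster_id column.
# At query: compute the query vector's cluster_id (or its few nearest clusters)
# and add `AND cluster_id IN (...)` to the candidate-selection SQL.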
Compression & quantization: how to reduce cost with minimal recall loss
2025–2026 brought wider adoption of embedding compression patterns. Use a combination of these techniques depending on your accuracy target:
- Float16: ~2x storage reduction vs Float32; often <2% recall drop for many models.
- Product Quantization (PQ): 4–16x reduction; good for large buckets but requires ANN support that can use PQ codes.
- OPQ / residual quantization: next-step improvements for high-dimensional vectors.
- PCA or random projection: reduce dims (e.g., from 1536 -> 256), then store compressed vectors.
- Column codec in ClickHouse: use CODEC(ZSTD(5..9)) for embedding columns or CODEC(LZ4) for lower CPU cost. Example: embedding Array(Float32) CODEC(ZSTD(7)).
Recommendation: start with Float16 + ZSTD(5) and run offline recall tests vs your golden dataset. If recall falls below threshold, try PCA->256 then Float16, or PQ with external ANN.
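The offline recall test reduces to top-k overlap: run the same golden queries through the full-precision baseline and the compressed pipeline, then compare result sets. A minimal metric (recall_at_k is an illustrative name):

```python
def recall_at_k(baseline_ids, candidate_ids, k=10):
    """Fraction of the baseline's top-k that the compressed pipeline also returned."""
    truth = set(baseline_ids[:k])
    found = set(candidate_ids[:k])
    return len(truth & found) / len(truth)

# Aggregate the mean over the golden query set; if it dips below your
# threshold (e.g. 0.95), step back to a gentler compression setting.
```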
TTL, partitioning and lifecycle management
Keeping your active working set small is the most effective cost control. ClickHouse MergeTree TTLs and partitions make lifecycle predictable.
- Time partitions: partition by month (toYYYYMM(created_at)) for predictable compaction and deletion cost.
- TTL policies: Use TTL ... DELETE to automatically drop old vectors, or TTL ... TO DISK('cold') to tier cold blobs to cheaper storage.
- Cold storage: separate the huge historical archive; keep only the last N months of vectors in hot ANN-ready tables.
-- Example: TTL to move embeddings to a 'slow' disk tier and then delete
CREATE TABLE docs_embeddings_tiered
(
..., -- schema
created_at DateTime
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(created_at)
ORDER BY (tenant_id, doc_id)
TTL created_at + INTERVAL 30 DAY TO DISK('slow'),
created_at + INTERVAL 365 DAY DELETE;
Operational and cost tips (real-world checklist)
- Benchmark end-to-end latency: measure SQL filtering + fetch + ANN search + re-ranking + LLM call. Optimize the slowest stage.
- Batch requests: fetch embeddings in larger batches where possible to amortize connection overhead.
- Use Materialized Views: precompute heavy filters and cluster assignments into MV tables to reduce query time.
- Monitor cardinality: use LowCardinality types for tags and enums to reduce memory and index size.
- Isolate search nodes: dedicating fast NVMe nodes for hot tables avoids IO contention with analytics workloads.
- Right-size ANN: GPUs for heavy throughput (500–1000 qps) or high-dimensional brute-force; CPU HNSW for medium QPS; managed vector DBs if you want SLA and lower ops.
Hybrid deployment patterns: when to use ClickHouse only vs ClickHouse + Vector DB
Decision factors: dataset size, latency requirements, ops budget, and model update frequency.
- ClickHouse-only (small-to-medium datasets, tight budget): store Float16 embeddings in ClickHouse and run re-ranking in process or via a lightweight ANN (HNSWlib). Pros: lower storage cost, simplified consistency. Cons: higher engineering work to tune ANN and scaling.
- ClickHouse + self-hosted ANN (growing scale): ClickHouse for metadata and candidate selection; FAISS/HNSWlib/Milvus on dedicated nodes for ANN. Pros: flexible, high performance. Cons: more infra to manage.
- ClickHouse + managed vector DB (SaaS) (speed to market): sync embeddings to Pinecone/Weaviate/Milvus Cloud while keeping analytics in ClickHouse. Pros: minimal ops, built-in indexing. Cons: higher per-GB cost and potential data egress costs.
End-to-end example: Query flow with code snippets
Below is a concise Python example showing the typical two-stage flow: query embedding -> ClickHouse filter -> decode embeddings -> FAISS search -> fetch top-K texts.
import clickhouse_connect
import numpy as np
import faiss
# ClickHouse client (clickhouse_connect exposes a get_client() factory, not a Client constructor)
ch = clickhouse_connect.get_client(host='clickhouse-host', username='user', password='pw', database='db')
# 1) make query embedding (from your embedding model)
q_vec = np.array(get_embedding('How to reset my password?'), dtype=np.float32)
# 2) prefilter candidates using ClickHouse; query() returns a QueryResult,
#    so iterate named_results() to get dict-shaped rows
rows = list(ch.query('''
    SELECT doc_id, chunk_id, embedding_blob
    FROM docs_embeddings_compact
    WHERE tenant_id='acme' AND namespace='support-kb'
    LIMIT 5000
''').named_results())
# 3) decode blobs -> matrix (app-specific decoding: FP16 -> float32 or PQ decode)
emb_matrix = np.stack([decode_blob(r['embedding_blob']) for r in rows]).astype(np.float32)
# 4) FAISS index on the candidate set; inner product equals cosine after L2 normalization
index = faiss.IndexFlatIP(emb_matrix.shape[1])
faiss.normalize_L2(emb_matrix)
index.add(emb_matrix)
q = q_vec.reshape(1, -1)
faiss.normalize_L2(q)
D, I = index.search(q, 10)
# 5) fetch selected docs and pass to LLM (quote UUID literals; parameterized queries are safer still)
selected_ids = [rows[i]['doc_id'] for i in I[0]]
id_list = ','.join(f"'{d}'" for d in selected_ids)
docs = ch.query(f"SELECT doc_id, text FROM docs_embeddings_compact WHERE doc_id IN ({id_list})")
Monitoring, reindexing and evaluation
Operational playbook:
- Maintain a golden query set for recall/precision checks; run nightly evaluation comparing full-scan baseline vs production two-stage pipeline.
- Recompute cluster assignments and quantization centroids monthly or whenever you update base embedding model.
- Autoscale ANN nodes based on 95th percentile QPS and tail latency requirements (measure heavy-tail effects).
Cost modeling — rules of thumb
- Storage: Float32 1536-dim ~ 6 MB per 1,000 vectors; Float16 ~3 MB per 1,000. PQ/OPQ can go <1MB per 1,000 depending on code size.
- Network egress: moving embeddings between services is expensive; prefer compact blobs and batch transfers.
- Compute: ANN re-ranking is the main cost driver as you increase candidate pool size; reduce candidates from 50k -> 5k for ~10x speedups.
- Operations: managed vector DBs cost more per GB but save engineering hours; ClickHouse scales cheaper if you have SRE bandwidth.
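Those storage rules of thumb fall straight out of bytes-per-dimension arithmetic; a quick sanity check on pre-compression sizes (ZSTD typically shaves more off on top):

```python
def storage_mb(num_vectors, dims, bytes_per_dim):
    """Raw (pre-compression) embedding storage in MB (1 MB = 1e6 bytes)."""
    return num_vectors * dims * bytes_per_dim / 1e6

# 1,000 vectors at 1536 dims:
fp32 = storage_mb(1000, 1536, 4)   # ~6.1 MB
fp16 = storage_mb(1000, 1536, 2)   # ~3.1 MB
pq   = 1000 * 64 / 1e6             # 64-byte PQ codes: ~0.06 MB
```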
Putting it together — recommended starting blueprint (2026)
- Store compressed Float16 embeddings in ClickHouse with CODEC(ZSTD(5-7)).
- Partition by month and apply TTL to remove cold vectors after 90–180 days unless needed for compliance.
- Compute k-means cluster ids and store cluster_id for coarse candidate reduction.
- Implement two-stage retrieval: ClickHouse candidate select (<=5k) -> FAISS/Milvus/managed vector DB ranking -> re-rank and pass to LLM.
- Monitor recall vs a full-scan baseline weekly; iterate compression strategy if recall drops.
In 2026, running embedding search is less about picking a single product and more about designing a predictable pipeline: cheap, auditable filtering in ClickHouse + best-in-class vector search for re-ranking.
Final checklist before ship
- Have a golden test set for RAG recall and latency SLAs.
- Validate compression (Float16, PQ) offline before rollout.
- Automate TTL and partition management in schema.
- Instrument end-to-end costs (storage, network, ANN compute) per tenant/namespace.
- Plan for reindexing when you change the embedding model (store original text or raw source to recompute).
Next steps & call to action
If you’re ready to pilot a production RAG system using ClickHouse, start with a 30-day experiment: ingest a sample of your documents, enable Float16 + ZSTD compression, implement metadata prefiltering, and run a FAISS-backed re-ranking on 5k candidates. Measure recall, costs, and P95 latency. If you want a reproducible checklist and scripts tailored to your data shape, download our ClickHouse + FAISS starter kit or reach out for a hands-on audit and architecture session.