Per-`ef` query sweeps are short (2-4 seconds), so absolute QPS and latency values can be skewed by transient infrastructure noise. Treat these results as indicative for tier selection rather than as a vendor performance guarantee, and re-run with longer query durations before quoting any specific number externally.
Weaviate Benchmarks
Validated on 27 Apr 2026 • Last edited on 27 Apr 2026
DigitalOcean Managed Weaviate is a fully managed Weaviate vector database for retrieval-augmented generation, semantic search, and similarity-based AI workloads. Clusters are provisioned, secured, backed up, and patched by DigitalOcean.
These benchmarks compare the small, medium, and large Managed Weaviate tiers on a 100,000-vector approximate nearest-neighbor (ANN) workload, with a separate run on a 1,000,000-vector workload for the large tier.
Methodology
What These Benchmarks Measure
Weaviate’s core operation is approximate nearest-neighbor search: given a query vector, return the k most similar vectors from the indexed set. Three metrics matter:
- Recall@10: Of the 10 results returned, what fraction are actually in the true top 10. `1.0` means perfect agreement with brute-force search.
- Throughput (QPS): Queries per second the cluster can sustain under concurrent load.
- Latency: Wall-clock time per query, reported as the mean and the 99th percentile (p99).
There is a fundamental tradeoff: the same index can be tuned to favor recall or speed, but not both at the same time.
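Recall@k can be computed directly from the returned IDs and the precomputed ground truth. A minimal sketch (the function name and sample IDs are illustrative, not taken from the benchmark tool):

```python
def recall_at_k(returned_ids, true_ids, k=10):
    """Fraction of the true top-k neighbors that appear in the returned top-k."""
    return len(set(returned_ids[:k]) & set(true_ids[:k])) / k

# 9 of the 10 true neighbors were returned -> recall@10 = 0.9
print(recall_at_k([1, 2, 3, 4, 5, 6, 7, 8, 9, 99],
                  list(range(1, 11))))  # 0.9
```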
The HNSW Index
Weaviate’s default vector index is HNSW (Hierarchical Navigable Small World graph). It has two phases:
- Build-time parameters, set when the index is created:
  - `efConstruction`: How thoroughly the graph is built. Higher values mean slower indexing and a higher recall ceiling. These benchmarks use `256` in all runs.
  - `maxConnections` (`M`): How many neighbors each graph node retains. Higher values mean more memory and slightly better recall. These benchmarks use `16` in all runs.
- Query-time parameter, swept across each benchmark:
  - `ef`: Search depth at query time. Higher `ef` explores more of the graph per query, which increases recall and latency and decreases QPS.
By holding efConstruction and maxConnections fixed and sweeping ef, each tier traces a recall-versus-throughput curve.
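In Weaviate's schema, these parameters live in the class's vector index config. A minimal sketch of the values held fixed in every run, plus the swept query-time `ef` (treat the exact payload shape as an assumption to check against your server version):

```python
# HNSW parameters fixed at build time, plus the query-time ef that is swept.
vector_index_config = {
    "efConstruction": 256,  # build-time: candidate-list size during graph construction
    "maxConnections": 16,   # build-time: neighbors retained per node (M)
    "distance": "cosine",   # matches the dataset's angular ground truth
    "ef": 64,               # query-time search depth; swept from 16 to 512 here
}
```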
The dataset
| Property | Value |
|---|---|
| Name | dbpedia-openai-100k-angular.hdf5 |
| Source | ann-benchmarks.com |
| Vectors | 100,000 |
| Dimensions | 1,536 |
| Embedding model | OpenAI text-embedding-ada-002 |
| Distance metric | Cosine (angular) |
| Test queries | 1,000 (973 used after filtering) |
| Ground truth | Pre-computed exact top-100 neighbors per query |
Recall numbers from this dataset should generalize reasonably well to comparable semantic-search workloads.
The benchmark tool
These benchmarks use the open-source `weaviate-benchmarking` tool with the `ann-benchmark` subcommand. It performs:
- Schema setup: Drops and recreates the `Vector` class with the configured HNSW parameters.
- Ingest: Loads all vectors over gRPC using a producer-consumer pipeline with 8 worker goroutines and 100-vector batches.
- Quiesce: Waits 30 seconds after ingest so the index can settle.
- Query sweep: For each `ef` in `{16, 24, 32, 48, 64, 96, 128, 256, 512}`, runs the full test query set with 8 concurrent workers and records latency for every query.
- Recall computation: Compares returned neighbors against the ground truth.
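The query-sweep phase can be sketched as follows; `run_query` is a placeholder for a real gRPC search call, and all names here are illustrative rather than taken from the tool:

```python
import time
from concurrent.futures import ThreadPoolExecutor

EF_VALUES = [16, 24, 32, 48, 64, 96, 128, 256, 512]

def run_query(query_vector, ef):
    """Placeholder for a real Weaviate search; returns latency in seconds."""
    start = time.perf_counter()
    # a real client.search(query_vector, ef=ef, limit=10) call would go here
    return time.perf_counter() - start

def sweep(queries, workers=8):
    """Run the full query set at each ef value with `workers` concurrent workers."""
    results = {}
    for ef in EF_VALUES:
        t0 = time.perf_counter()
        with ThreadPoolExecutor(max_workers=workers) as pool:
            latencies = sorted(pool.map(lambda q: run_query(q, ef), queries))
        wall = time.perf_counter() - t0
        results[ef] = {
            "qps": len(latencies) / wall,                               # wall-clock throughput
            "mean_ms": 1000 * sum(latencies) / len(latencies),          # mean latency
            "p99_ms": 1000 * latencies[int(0.99 * (len(latencies) - 1))],  # tail latency
        }
    return results
```

QPS is measured against wall-clock time for the whole batch, not summed per-query latency, because the workers run concurrently.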
Test environment
- Load generator: A separate VM running the benchmarker CLI in the same region as the target cluster.
- Network: Traffic flows through the managed-cluster load balancer (TLS, gRPC over HTTPS on port 443). This adds non-trivial latency compared with localhost benchmarks; keep this in mind when comparing absolute numbers to leaderboards that run in-process.
- Concurrency: `--parallel 8` for both ingest and query phases.
- Top-k: `--limit 10`. Recall and NDCG are reported at 10.
Small Tier (100k vectors)
| Property | Value |
|---|---|
| vCPU | 1 |
| Memory | 2 GB |
| Disk | 3 GB |
| Region | TOR1 |
| Replicas | 1 |
| Shards | 1 |
Headline numbers
| Metric | Value |
|---|---|
| Import time (100k vectors) | 624.4 s (~10:24) |
| Peak QPS (ef=24) | 158.9 |
| Best recall (ef=512) | 0.9993 |
| Best mean latency (ef=24) | 48 ms |
| QPS at recall >= 0.98 | 129.8 |
Full results
| ef | QPS | Mean latency (ms) | p99 latency (ms) | Recall@10 | NDCG@10 |
|---|---|---|---|---|---|
| 16 | 141.9 | 55.5 | 446.2 | 0.9140 | 0.9654 |
| 24 | 158.9 | 48.4 | 251.9 | 0.9447 | 0.9788 |
| 32 | 110.4 | 70.1 | 428.5 | 0.9594 | 0.9847 |
| 48 | 135.7 | 51.8 | 284.1 | 0.9763 | 0.9912 |
| 64 | 129.8 | 59.4 | 233.5 | 0.9846 | 0.9944 |
| 96 | 108.1 | 70.9 | 238.5 | 0.9912 | 0.9968 |
| 128 | 86.4 | 88.2 | 300.7 | 0.9933 | 0.9976 |
| 256 | 56.4 | 135.8 | 418.7 | 0.9986 | 0.9995 |
| 512 | 40.1 | 190.6 | 567.1 | 0.9993 | 0.9998 |
Observations
- Throughput does not decrease monotonically with `ef`. The expected ordering would be `ef=16 > 24 > 32 > 48`, but QPS jumps around. The most likely cause is single-run noise: each `ef` sweep runs for only a few seconds, so a brief CPU contention spike can dominate the measurement.
- p99 is typically 3-8 times the mean, much higher than on the medium and large tiers. This is the signature of a resource-constrained instance where slow queries queue behind fast ones.
- Import takes more than 10 minutes for 100,000 vectors, which is slow enough to matter for production data refresh patterns.
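The tail-latency observation is easy to verify against the full-results table; for example, using the mean and p99 values copied from four of the rows above:

```python
# (mean_ms, p99_ms) for selected small-tier rows, keyed by ef.
rows = {16: (55.5, 446.2), 24: (48.4, 251.9), 64: (59.4, 233.5), 512: (190.6, 567.1)}
ratios = {ef: round(p99 / mean, 1) for ef, (mean, p99) in rows.items()}
print(ratios)  # {16: 8.0, 24: 5.2, 64: 3.9, 512: 3.0}
```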
Recommended operating point
For a target of recall >= 0.98, run with ef=64:
| Metric | Value |
|---|---|
| Recall@10 | 0.985 |
| QPS | 130 |
| Mean latency | 59 ms |
| p99 latency | 234 ms |
If the application can tolerate ~95% recall in exchange for higher throughput, drop to ef=24 for ~159 QPS at 48 ms mean latency.
Suitability
The small tier is appropriate for development, staging, and low-traffic production workloads (up to about 100 sustained QPS). Its high tail latency makes it a poor choice for latency-sensitive interactive applications. For predictable sub-100 ms p99, use the medium or large tier.
Medium Tier (100k vectors)
| Property | Value |
|---|---|
| vCPU | 2 |
| Memory | 4 GB |
| Disk | 11 GB |
| Region | TOR1 |
| Replicas | 1 |
| Shards | 1 |
Headline numbers
| Metric | Value |
|---|---|
| Import time (100k vectors) | 184.5 s (~3:04) |
| Peak QPS (ef=16) | 454.9 |
| Best recall (ef=512) | 0.9996 |
| Best mean latency (ef=16) | 17 ms |
| QPS at recall >= 0.98 | 322.5 |
Full results
| ef | QPS | Mean latency (ms) | p99 latency (ms) | Recall@10 | NDCG@10 |
|---|---|---|---|---|---|
| 16 | 454.9 | 17.1 | 54.2 | 0.9190 | 0.9687 |
| 24 | 330.8 | 22.1 | 111.6 | 0.9481 | 0.9799 |
| 32 | 271.0 | 25.4 | 160.2 | 0.9609 | 0.9853 |
| 48 | 414.5 | 17.6 | 90.9 | 0.9763 | 0.9913 |
| 64 | 322.5 | 23.4 | 102.2 | 0.9817 | 0.9934 |
| 96 | 337.4 | 22.7 | 99.8 | 0.9911 | 0.9968 |
| 128 | 348.4 | 21.9 | 54.4 | 0.9945 | 0.9980 |
| 256 | 267.5 | 29.0 | 60.1 | 0.9983 | 0.9994 |
| 512 | 178.6 | 43.2 | 95.4 | 0.9996 | 0.9999 |
Observations
- The QPS curve is strongly non-monotonic, with throughput rising at `ef=48` and `ef=128` even though more work is being done per query. This is not algorithmically plausible; the most likely cause is noisy-neighbor variance on shared infrastructure during the brief window each `ef` runs.
- Tail-latency ratios are healthier than on the small tier (typically 2-7 times the mean), which suggests this tier has the headroom to absorb concurrent load.
- Import is about 3.4 times faster than the small tier on the same workload.
Recommended operating point
For recall >= 0.98, run with ef=64:
| Metric | Value |
|---|---|
| Recall@10 | 0.982 |
| QPS | 322 |
| Mean latency | 23 ms |
| p99 latency | 102 ms |
If a 99.5%+ recall target is required, ef=128 gave 348 QPS at 22 ms mean and 54 ms p99 in this run. The result is anomalously good and should be re-validated with a longer query duration before being relied on.
Suitability
The medium tier fits moderate-traffic production workloads with sustained throughput in the low hundreds of QPS at sub-100 ms p99. The unstable QPS curve in this single run means absolute numbers should be re-validated with longer duration sweeps before being quoted in external materials.
Large Tier (100k vectors)
| Property | Value |
|---|---|
| vCPU | 8 |
| Memory | 32 GB |
| Disk | 230 GB |
| Region | TOR1 |
| Replicas | 1 |
| Shards | 1 |
Headline numbers
| Metric | Value |
|---|---|
| Import time (100k vectors) | 79.3 s (~1:19) |
| Peak QPS (ef=24) | 1304 |
| Best recall (ef=512) | 0.9994 |
| Best mean latency (ef=24) | 6.1 ms |
| QPS at recall >= 0.98 | 1109 |
Full results
| ef | QPS | Mean latency (ms) | p99 latency (ms) | Recall@10 | NDCG@10 |
|---|---|---|---|---|---|
| 16 | 1042 | 7.5 | 30.6 | 0.9201 | 0.9681 |
| 24 | 1304 | 6.1 | 13.9 | 0.9455 | 0.9786 |
| 32 | 1052 | 7.4 | 17.9 | 0.9581 | 0.9842 |
| 48 | 1025 | 7.6 | 15.3 | 0.9756 | 0.9909 |
| 64 | 1109 | 7.1 | 15.4 | 0.9828 | 0.9937 |
| 96 | 956 | 8.1 | 16.8 | 0.9888 | 0.9959 |
| 128 | 919 | 8.4 | 17.3 | 0.9936 | 0.9977 |
| 256 | 583 | 13.0 | 27.2 | 0.9974 | 0.9991 |
| 512 | 357 | 21.6 | 49.9 | 0.9994 | 0.9998 |
Observations
- Apart from a small QPS bump at `ef=24` (typical of warmup effects in the first measurement window), the QPS curve is well behaved and decreases monotonically.
- p99 stays under 30 ms for `ef <= 128` and under 50 ms even at `ef=512`. The p99-to-mean ratio is consistently 2-3 times, which indicates plenty of CPU and memory headroom.
- Import takes 79 seconds for 100,000 vectors, which is fast enough to support frequent re-indexing or larger datasets.
Recommended operating point
For recall >= 0.98, run with ef=64:
| Metric | Value |
|---|---|
| Recall@10 | 0.983 |
| QPS | 1109 |
| Mean latency | 7 ms |
| p99 latency | 15 ms |
For recall >= 0.99, step up to ef=128:
| Metric | Value |
|---|---|
| Recall@10 | 0.994 |
| QPS | 919 |
| Mean latency | 8 ms |
| p99 latency | 17 ms |
Suitability
The large tier handles production semantic-search workloads at scale: 1,000+ QPS with double-digit-millisecond p99 at high recall is suitable for latency-sensitive interactive applications such as chat, search, and recommendations. For 100,000-vector workloads the cluster shows substantial CPU and memory headroom, indicating that significantly larger datasets are viable on the same tier.
Tier Comparison
At the recall >= 0.98 operating point (ef=64) on the 100,000-vector workload:
| Tier | QPS | Mean latency | p99 latency | Import time |
|---|---|---|---|---|
| Small | 130 | 59 ms | 234 ms | 624 s |
| Medium | 322 | 23 ms | 102 ms | 184 s |
| Large | 1109 | 7 ms | 15 ms | 79 s |
The large tier delivered about 8.5 times the throughput of the small tier and about 3.4 times the throughput of the medium tier, with dramatically lower tail latency.
Iso-recall comparison
For each tier, the best operating point at common recall targets:
Recall >= 0.94
| Tier | Best ef | QPS | Mean (ms) | p99 (ms) |
|---|---|---|---|---|
| Small | 24 | 159 | 48 | 252 |
| Medium | 24 | 331 | 22 | 112 |
| Large | 24 | 1304 | 6 | 14 |
Recall >= 0.98
| Tier | Best ef | QPS | Mean (ms) | p99 (ms) |
|---|---|---|---|---|
| Small | 64 | 130 | 59 | 234 |
| Medium | 64 | 322 | 23 | 102 |
| Large | 64 | 1109 | 7 | 15 |
Recall >= 0.99
| Tier | Best ef | QPS | Mean (ms) | p99 (ms) |
|---|---|---|---|---|
| Small | 96 | 108 | 71 | 238 |
| Medium | 128 | 348 | 22 | 54 |
| Large | 128 | 919 | 8 | 17 |
Recall >= 0.999
| Tier | Best ef | QPS | Mean (ms) | p99 (ms) |
|---|---|---|---|---|
| Small | 512 | 40 | 191 | 567 |
| Medium | 512 | 179 | 43 | 95 |
| Large | 512 | 357 | 22 | 50 |
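The recipe behind these tables can be expressed as: take the lowest `ef` whose measured recall meets the target (on a well-behaved, monotonically decreasing QPS curve this is also the highest-QPS qualifying point). A sketch using the (ef, QPS, recall@10) columns of the small-tier full-results table:

```python
def lowest_ef_meeting(sweep, target_recall):
    """Return (ef, qps) for the lowest ef whose recall meets the target."""
    for ef, qps, recall in sorted(sweep):
        if recall >= target_recall:
            return ef, qps
    return None  # no operating point meets the target

# (ef, QPS, recall@10) rows from the small-tier full-results table.
small = [(16, 141.9, 0.9140), (24, 158.9, 0.9447), (32, 110.4, 0.9594),
         (48, 135.7, 0.9763), (64, 129.8, 0.9846), (96, 108.1, 0.9912),
         (128, 86.4, 0.9933), (256, 56.4, 0.9986), (512, 40.1, 0.9993)]
print(lowest_ef_meeting(small, 0.98))  # (64, 129.8)
print(lowest_ef_meeting(small, 0.99))  # (96, 108.1)
```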
When to pick which tier
| Use case | Recommended tier | Rationale |
|---|---|---|
| Development, staging, demos | Small | Cheapest tier, functionally complete. Latency variability is acceptable for non-production traffic. |
| Internal tools, batch jobs, low-traffic production | Medium | About 2.5 times the throughput of small at sub-100 ms p99 at the recommended operating point. Adequate headroom for traffic spikes. |
| Latency-sensitive production: chat, search UI, recommendations | Large | Sub-20 ms p99 at 99% recall and sustained 1000+ QPS. The only tested tier that delivered tight tail-latency guarantees. |
| Datasets significantly larger than 100,000 vectors | Large | Small and medium import times suggest those tiers would struggle with multi-million-vector indexing windows. |
Bonus: Large Tier on a 1,000,000-Vector Workload
This run uses the 1,000,000-vector variant dbpedia-openai-1000k-angular.hdf5 (about 990,000 vectors after filtering) on the same large-tier hardware. It is excluded from the cross-tier comparison above because the dataset differs.
Headline numbers
| Metric | Value |
|---|---|
| Import time (990k vectors) | 944.3 s (~15:44) |
| Peak QPS (ef=24) | 1090 |
| Best recall (ef=512) | 0.997 |
| Best mean latency (ef=24) | 7.2 ms |
| QPS at recall >= 0.98 | 873 |
Full results
| ef | QPS | Mean latency (ms) | p99 latency (ms) | Recall@10 | NDCG@10 |
|---|---|---|---|---|---|
| 16 | 539 | 14.4 | 111.3 | 0.8878 | 0.9537 |
| 24 | 1090 | 7.2 | 12.0 | 0.9241 | 0.9698 |
| 32 | 845 | 9.3 | 20.8 | 0.9430 | 0.9778 |
| 48 | 922 | 8.6 | 15.4 | 0.9628 | 0.9861 |
| 64 | 815 | 9.7 | 20.0 | 0.9721 | 0.9896 |
| 96 | 873 | 9.1 | 16.6 | 0.9817 | 0.9933 |
| 128 | 763 | 10.4 | 16.8 | 0.9866 | 0.9952 |
| 256 | 551 | 14.4 | 28.8 | 0.9935 | 0.9977 |
| 512 | 446 | 17.7 | 31.7 | 0.9969 | 0.9989 |
Observations
- `ef=16` is anomalously slow. The most likely cause is cold-cache effects on the very first iteration of the sweep; treat that row with caution.
- From `ef=24` onward the curve decreases roughly as expected, with minor jitter within ~10-15%.
- Recall is lower at the same `ef` than on the 100,000-vector workload, which is expected: with 10 times the vectors, the same `ef` explores a smaller fraction of the graph. To match a recall target on the 1,000,000-vector dataset, increase `ef` accordingly.
- p99 stays under 32 ms across the full sweep (excluding the anomalous `ef=16` row), well within typical interactive-query SLAs for a million-vector index.
- Import time scaled slightly worse than linearly: 944 seconds for ~1,000,000 vectors versus 79 seconds for 100,000, consistent with HNSW build cost growing super-linearly with vector count.
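The import-scaling claim is straightforward arithmetic on the two large-tier runs:

```python
# Large-tier import times from the two runs.
t_100k = 79.3   # seconds for 100k vectors
t_1m = 944.3    # seconds for ~1M vectors
# 11.9x the time for 10x the vectors -> slightly worse than linear scaling
print(round(t_1m / t_100k, 1))  # 11.9
```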
Recommended operating point
For recall >= 0.98, run with ef=96:
| Metric | Value |
|---|---|
| Recall@10 | 0.982 |
| QPS | 873 |
| Mean latency | 9 ms |
| p99 latency | 17 ms |
For recall >= 0.99, use ef=256:
| Metric | Value |
|---|---|
| Recall@10 | 0.994 |
| QPS | 551 |
| Mean latency | 14 ms |
| p99 latency | 29 ms |
Same hardware, different scale
Side-by-side, large tier, both datasets, at recall >= 0.98:
| Dataset | ef | QPS | Mean (ms) | p99 (ms) | Recall | Import |
|---|---|---|---|---|---|---|
| 100k | 64 | 1109 | 7 | 15 | 0.983 | 79 s |
| 1M | 96 | 873 | 9 | 17 | 0.982 | 944 s |
A 10 times increase in dataset size reduced throughput at the recall >= 0.98 operating point by roughly 20% (1109 to 873 QPS), which is a graceful degradation curve.
Caveats
- Single-run measurements. Each tier was benchmarked once. Per-`ef` query sweeps run for only 2-4 seconds, so they are vulnerable to transient noise; the clearest example is the medium tier's non-monotonic QPS curve. Re-run with `--queryDuration 60` or longer per `ef` value to get statistically stable numbers before quoting any of these results in customer-facing material.
- Noisy-neighbor variance. Managed multi-tenant infrastructure is subject to co-tenant interference. The unstable medium and small results may understate the steady-state performance of those tiers.
- No memory metrics. Weaviate’s Prometheus endpoint is not exposed on managed clusters during preview, so heap and resident-set-size measurements are not available.
- Single dataset, single dimensionality. All measurements are on 100,000 vectors at 1,536 dimensions (plus the bonus 1,000,000-vector run on the large tier). A smaller-dimension dataset would round out the picture.
- Network latency floor. Traffic flows through the managed-cluster load balancer in TOR1, so absolute latencies include several milliseconds of networking overhead that would not be present in an in-process or same-VPC benchmark.