Per-ef query sweeps are short (2-4 seconds), so absolute QPS and latency values can be affected by transient infrastructure noise. Treat these results as directional for tier selection, not as a performance guarantee. Re-run with longer query durations before quoting specific numbers externally.
Weaviate Benchmarks
Validated on 28 Apr 2026 • Last edited on 7 May 2026
DigitalOcean Managed Weaviate is a fully managed Weaviate vector database for retrieval-augmented generation, semantic search, and similarity-based AI workloads. Clusters are provisioned, secured, backed up, and patched by DigitalOcean.
Managed Weaviate performance varies by tier, dataset size, and query-time tuning. The comparison below shows how the small, medium, and large tiers behave under a 100,000-vector approximate nearest-neighbor (ANN) workload, with an additional 1,000,000-vector run on the large tier to illustrate how performance scales.
Methodology
What These Benchmarks Measure
Weaviate’s core operation is approximate nearest-neighbor search: given a query vector, return the k most similar vectors from the indexed set.
The benchmarks report four metrics:
- Recall@10: Fraction of returned results that are in the true top 10. 1.0 means perfect agreement with brute-force search.
- Throughput (QPS): Queries per second under concurrent load.
- Latency: Wall-clock time per query, reported as mean and p99.
- NDCG@10: Normalized discounted cumulative gain at rank 10. It scores how highly relevant results are ranked within the top 10, with more weight on positions closer to the top. Reported in the per-ef tables next to Recall@10.
There is a fundamental tradeoff: the same index can be tuned to favor recall or speed, but not both simultaneously.
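To make the two quality metrics concrete, here is a minimal Python sketch of Recall@10 and NDCG@10 for a single query. It assumes binary relevance for NDCG (a returned ID counts as relevant if it is in the true top 10) and that the search returns exactly k results. The benchmarking tool's exact NDCG definition is not stated here, so treat this as illustrative. The function names are hypothetical.

```python
import math

def recall_at_k(returned_ids, true_ids, k=10):
    """Fraction of the returned top-k that are in the true top-k (assumes k results)."""
    true_top = set(true_ids[:k])
    hits = sum(1 for doc_id in returned_ids[:k] if doc_id in true_top)
    return hits / k

def ndcg_at_k(returned_ids, true_ids, k=10):
    """NDCG@k with binary relevance (an assumption, not the tool's confirmed formula)."""
    true_top = set(true_ids[:k])
    # A hit at 0-based position i contributes 1 / log2(i + 2), so early hits weigh more.
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, doc_id in enumerate(returned_ids[:k])
        if doc_id in true_top
    )
    # The ideal ordering places every relevant ID at the top of the list.
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(true_top))))
    return dcg / ideal if ideal else 0.0

truth = list(range(10))                      # exact top 10 from brute-force search
perfect = list(range(10))                    # ANN result that matches exactly
one_miss = [0, 1, 2, 3, 4, 5, 6, 7, 8, 99]   # one wrong ID in the last position

print(recall_at_k(perfect, truth), ndcg_at_k(perfect, truth))
print(recall_at_k(one_miss, truth), ndcg_at_k(one_miss, truth))
```

A miss in the last position costs a full tenth of recall but less than a tenth of NDCG, which is consistent with NDCG@10 sitting above Recall@10 in every results table.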
The HNSW Index
Weaviate uses the HNSW (Hierarchical Navigable Small World) index.
Build-time parameters (fixed in these benchmarks):
- efConstruction = 256: Higher values improve recall at the cost of slower indexing.
- maxConnections (M) = 16: Controls graph density and memory usage.
Query-time parameter (swept):
- ef: Search depth. Higher values increase recall and latency, and reduce QPS.
At index build, efConstruction and M are fixed, and only ef is swept at query time to trace a recall-throughput curve for each tier.
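As a rough illustration, these settings correspond to the HNSW section of a Weaviate class definition. The sketch below builds that definition as a plain Python dictionary and derives one copy per swept ef value. The field names are assumed to follow Weaviate's vectorIndexConfig schema and should be checked against the version in use; the helper with_ef is hypothetical.

```python
import copy
import json

# Build-time settings fixed in these benchmarks; "ef" is the swept query-time value.
BASE_CLASS = {
    "class": "Vector",
    "vectorIndexType": "hnsw",
    "vectorIndexConfig": {
        "distance": "cosine",
        "efConstruction": 256,
        "maxConnections": 16,
        "ef": 16,
    },
}

def with_ef(class_def, ef):
    """Return a copy of the class definition with only the query-time ef changed."""
    updated = copy.deepcopy(class_def)
    updated["vectorIndexConfig"]["ef"] = ef
    return updated

swept = [with_ef(BASE_CLASS, ef) for ef in (16, 64, 512)]
print(json.dumps(swept[1]["vectorIndexConfig"], indent=2))
```

Only ef differs between the copies, which mirrors how the sweep keeps the built graph fixed and varies search depth alone.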
The Dataset
| Property | Value |
|---|---|
| Name | dbpedia-openai-100k-angular.hdf5 |
| Source | ann-benchmarks.com |
| Vectors | 100,000 |
| Dimensions | 1,536 |
| Embedding model | OpenAI text-embedding-ada-002 |
| Distance metric | Cosine (angular) |
| Test queries | 1,000 (973 used after filtering) |
| Ground truth | Pre-computed exact top-100 neighbors per query |
Results generalize reasonably well to similar semantic search workloads.
Benchmark Tool
These benchmarks use the open-source weaviate-benchmarking tool with the ann-benchmark subcommand. It performs:
- Schema setup: Drops and recreates the Vector class with the configured HNSW parameters.
- Ingest: Loads all vectors over gRPC using a producer-consumer pipeline with 8 worker goroutines and 100-vector batches.
- Quiesce: Waits 30 seconds after ingest so the index can settle.
- Query sweep: For each ef in {16, 24, 32, 48, 64, 96, 128, 256, 512}, runs the full test query set with 8 concurrent workers and records latency for every query.
- Recall computation: Compares returned neighbors against the ground truth.
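The query-sweep step above can be sketched in Python as follows. This is not the benchmarking tool's implementation (which is written in Go); it only shows how QPS, mean latency, and p99 fall out of per-query timings taken under a fixed worker pool. The search callable and fake_search are hypothetical stand-ins for a real vector search call, and the nearest-rank percentile is one of several common p99 definitions.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor

EF_VALUES = [16, 24, 32, 48, 64, 96, 128, 256, 512]

def nearest_rank_percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100.0 * len(sorted_values)))
    return sorted_values[rank - 1]

def timed_query(search, query, ef, limit):
    """Run one query and return its wall-clock latency in milliseconds."""
    start = time.perf_counter()
    search(query, ef=ef, limit=limit)
    return (time.perf_counter() - start) * 1000.0

def sweep(search, queries, ef_values=EF_VALUES, workers=8, limit=10):
    """Run the full query set once per ef value with a fixed worker pool."""
    rows = []
    for ef in ef_values:
        wall_start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=workers) as pool:
            latencies = sorted(
                pool.map(lambda q, ef=ef: timed_query(search, q, ef, limit), queries)
            )
        wall_seconds = time.perf_counter() - wall_start
        rows.append({
            "ef": ef,
            "qps": len(latencies) / wall_seconds,   # throughput under concurrency
            "mean_ms": sum(latencies) / len(latencies),
            "p99_ms": nearest_rank_percentile(latencies, 99),
        })
    return rows

def fake_search(query, ef, limit):
    """Hypothetical stand-in for a real search call; returns dummy IDs."""
    return list(range(limit))

for row in sweep(fake_search, list(range(100)), ef_values=[16, 64]):
    print(row)
```

Because each ef value is measured over a single pass of the query set, one slow stretch inside that pass shifts both QPS and p99 for that row.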
Test Environment
- Load generator: A separate VM running the benchmarker CLI in the same region as the target cluster.
- Network: Traffic flows through the managed-cluster load balancer (TLS, gRPC over HTTPS on port 443). That path adds non-trivial latency compared with localhost benchmarks, so absolute numbers can look slower than leaderboards that run in process.
- Concurrency: --parallel 8 for both ingest and query phases.
- Top-k: --limit 10. Recall and NDCG are reported at 10.
Small Tier (100k Vectors)
| Property | Value |
|---|---|
| vCPU | 1 |
| Memory | 2 GB |
| Disk | 3 GB |
| Region | TOR1 |
| Replicas | 1 |
| Shards | 1 |
Headline Numbers
| Metric | Value |
|---|---|
| Import time (100k vectors) | 624.4 s (~10:24) |
| Peak QPS (ef=24) | 158.9 |
| Best recall (ef=512) | 0.9993 |
| Best mean latency (ef=24) | 48 ms |
| QPS at recall >= 0.98 | 129.8 |
Full Results
| ef | QPS | Mean latency (ms) | p99 latency (ms) | Recall@10 | NDCG@10 |
|---|---|---|---|---|---|
| 16 | 141.9 | 55.5 | 446.2 | 0.9140 | 0.9654 |
| 24 | 158.9 | 48.4 | 251.9 | 0.9447 | 0.9788 |
| 32 | 110.4 | 70.1 | 428.5 | 0.9594 | 0.9847 |
| 48 | 135.7 | 51.8 | 284.1 | 0.9763 | 0.9912 |
| 64 | 129.8 | 59.4 | 233.5 | 0.9846 | 0.9944 |
| 96 | 108.1 | 70.9 | 238.5 | 0.9912 | 0.9968 |
| 128 | 86.4 | 88.2 | 300.7 | 0.9933 | 0.9976 |
| 256 | 56.4 | 135.8 | 418.7 | 0.9986 | 0.9995 |
| 512 | 40.1 | 190.6 | 567.1 | 0.9993 | 0.9998 |
Observations
- Throughput does not decrease monotonically with ef. The expected QPS ordering would be ef=16 > 24 > 32 > 48, but QPS jumps around. The most likely cause is single-run noise: each ef sweep runs for only a few seconds, so a brief CPU contention spike can dominate the measurement.
- p99 is roughly 3-8 times the mean, and far higher in absolute terms than on the medium and large tiers. This is the signature of a resource-constrained instance where queries queue behind slower ones.
- Import takes more than 10 minutes for 100,000 vectors, which is slow enough to matter for production data refresh patterns.
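The two checks behind these observations, the p99-to-mean ratio and the points where QPS rises instead of falling, can be reproduced from the full-results table. A minimal sketch, with hypothetical helper names:

```python
# (ef, QPS, mean latency ms, p99 latency ms) copied from the small-tier full results.
SMALL_TIER = [
    (16, 141.9, 55.5, 446.2),
    (24, 158.9, 48.4, 251.9),
    (32, 110.4, 70.1, 428.5),
    (48, 135.7, 51.8, 284.1),
    (64, 129.8, 59.4, 233.5),
    (96, 108.1, 70.9, 238.5),
    (128, 86.4, 88.2, 300.7),
    (256, 56.4, 135.8, 418.7),
    (512, 40.1, 190.6, 567.1),
]

def tail_ratios(rows):
    """Map each ef to its p99 / mean latency ratio."""
    return {ef: p99 / mean for ef, _qps, mean, p99 in rows}

def qps_increases(rows):
    """Return ef values whose QPS is higher than at the previous (smaller) ef."""
    return [cur[0] for prev, cur in zip(rows, rows[1:]) if cur[1] > prev[1]]

for ef, ratio in tail_ratios(SMALL_TIER).items():
    print(f"ef={ef:>3}  p99/mean = {ratio:.1f}")
print("QPS rises at ef =", qps_increases(SMALL_TIER))
```

The same two functions apply unchanged to the medium and large tables, which makes it easy to compare tail behavior across tiers.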
Recommended Operating Point
For most workloads targeting high recall, ef=64 provides the best balance:
| Metric | Value |
|---|---|
| Recall@10 | 0.985 |
| QPS | 130 |
| Mean latency | 59 ms |
| p99 latency | 234 ms |
If lower recall (~95%) is acceptable, ef=24 increases throughput to ~159 QPS.
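One way to express the selection rule used for the operating point is "highest QPS among the rows that meet the recall target". The sketch below applies that rule to the small-tier rows; the function name is hypothetical, and the rule is an interpretation of how the operating points appear to be chosen, not a documented procedure.

```python
# (ef, QPS, Recall@10) copied from the small-tier full results.
SMALL_TIER = [
    (16, 141.9, 0.9140),
    (24, 158.9, 0.9447),
    (32, 110.4, 0.9594),
    (48, 135.7, 0.9763),
    (64, 129.8, 0.9846),
    (96, 108.1, 0.9912),
    (128, 86.4, 0.9933),
    (256, 56.4, 0.9986),
    (512, 40.1, 0.9993),
]

def best_operating_point(rows, min_recall):
    """Return the highest-QPS row whose recall meets the target, or None."""
    eligible = [row for row in rows if row[2] >= min_recall]
    return max(eligible, key=lambda row: row[1]) if eligible else None

print(best_operating_point(SMALL_TIER, 0.98))
print(best_operating_point(SMALL_TIER, 0.99))
```

For a 0.98 target this returns the ef=64 row, matching the recommendation above. Because the measured QPS curve is noisy, the rule can pick a larger ef than expected when a higher-ef row happens to have been measured during a quieter window.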
Suitability
Best for development, staging, and low-traffic production (about 100 QPS). High tail latency makes it a poor fit for latency-sensitive applications.
Medium Tier (100k Vectors)
| Property | Value |
|---|---|
| vCPU | 2 |
| Memory | 4 GB |
| Disk | 11 GB |
| Region | TOR1 |
| Replicas | 1 |
| Shards | 1 |
Headline Numbers
| Metric | Value |
|---|---|
| Import time (100k vectors) | 184.5 s (~3:04) |
| Peak QPS (ef=16) | 454.9 |
| Best recall (ef=512) | 0.9996 |
| Best mean latency (ef=16) | 17 ms |
| QPS at recall >= 0.98 | 322.5 |
Full Results
| ef | QPS | Mean latency (ms) | p99 latency (ms) | Recall@10 | NDCG@10 |
|---|---|---|---|---|---|
| 16 | 454.9 | 17.1 | 54.2 | 0.9190 | 0.9687 |
| 24 | 330.8 | 22.1 | 111.6 | 0.9481 | 0.9799 |
| 32 | 271.0 | 25.4 | 160.2 | 0.9609 | 0.9853 |
| 48 | 414.5 | 17.6 | 90.9 | 0.9763 | 0.9913 |
| 64 | 322.5 | 23.4 | 102.2 | 0.9817 | 0.9934 |
| 96 | 337.4 | 22.7 | 99.8 | 0.9911 | 0.9968 |
| 128 | 348.4 | 21.9 | 54.4 | 0.9945 | 0.9980 |
| 256 | 267.5 | 29.0 | 60.1 | 0.9983 | 0.9994 |
| 512 | 178.6 | 43.2 | 95.4 | 0.9996 | 0.9999 |
Observations
- The QPS curve is strongly non-monotonic, with throughput rising at ef=48 and ef=128 even though more work is being done. This is not algorithmically plausible. The most likely cause is noisy-neighbor variance on shared infrastructure during the brief window each ef runs.
- Tail latency ratios are healthier than the small tier (typically 2-7 times the mean), which suggests this tier has the headroom to absorb concurrent load.
- Import is about 3.4 times faster than the small tier on the same workload.
Recommended Operating Point
For recall >= 0.98, ef=64 offers a strong balance:
| Metric | Value |
|---|---|
| Recall@10 | 0.982 |
| QPS | 322 |
| Mean latency | 23 ms |
| p99 latency | 102 ms |
If a 99.5%+ recall target is required, ef=128 gave 348 QPS at 22 ms mean and 54 ms p99 in this run. The result is anomalously good and should be re-validated with a longer query duration before being relied on.
Suitability
The medium tier fits moderate-traffic production workloads with sustained throughput in the low hundreds of QPS at a p99 of roughly 100 ms (102 ms at ef=64 in this run). The unstable QPS curve in this single run means absolute numbers should be re-validated with longer duration sweeps before being quoted in external materials.
Large Tier (100k Vectors)
| Property | Value |
|---|---|
| vCPU | 8 |
| Memory | 32 GB |
| Disk | 230 GB |
| Region | TOR1 |
| Replicas | 1 |
| Shards | 1 |
Headline Numbers
| Metric | Value |
|---|---|
| Import time (100k vectors) | 79.3 s (~1:19) |
| Peak QPS (ef=24) | 1304 |
| Best recall (ef=512) | 0.9994 |
| Best mean latency (ef=24) | 6.1 ms |
| QPS at recall >= 0.98 | 1109 |
Full Results
| ef | QPS | Mean latency (ms) | p99 latency (ms) | Recall@10 | NDCG@10 |
|---|---|---|---|---|---|
| 16 | 1042 | 7.5 | 30.6 | 0.9201 | 0.9681 |
| 24 | 1304 | 6.1 | 13.9 | 0.9455 | 0.9786 |
| 32 | 1052 | 7.4 | 17.9 | 0.9581 | 0.9842 |
| 48 | 1025 | 7.6 | 15.3 | 0.9756 | 0.9909 |
| 64 | 1109 | 7.1 | 15.4 | 0.9828 | 0.9937 |
| 96 | 956 | 8.1 | 16.8 | 0.9888 | 0.9959 |
| 128 | 919 | 8.4 | 17.3 | 0.9936 | 0.9977 |
| 256 | 583 | 13.0 | 27.2 | 0.9974 | 0.9991 |
| 512 | 357 | 21.6 | 49.9 | 0.9994 | 0.9998 |
Observations
- Apart from a QPS bump at ef=24 (typical of warmup effects in the first measurement window) and a smaller one at ef=64, the QPS curve is well-behaved and decreases as ef grows.
- p99 latency stays under 20 ms from ef=24 through ef=128 (30.6 ms at ef=16, the first measurement window) and under 50 ms at ef=512. The p99-to-mean ratio is 2-3 times from ef=24 onward, indicating available CPU and memory headroom.
- Import takes 79 seconds for 100,000 vectors, which is fast enough to support frequent re-indexing or larger datasets.
Recommended Operating Points
For recall >= 0.98, run with ef=64:
| Metric | Value |
|---|---|
| Recall@10 | 0.983 |
| QPS | 1109 |
| Mean latency | 7 ms |
| p99 latency | 15 ms |
For recall >= 0.99, step up to ef=128:
| Metric | Value |
|---|---|
| Recall@10 | 0.994 |
| QPS | 919 |
| Mean latency | 8 ms |
| p99 latency | 17 ms |
Suitability
On the 100,000-vector workload in this benchmark, the large tier’s per-ef table shows higher QPS and lower mean and p99 latencies than the small and medium tiers for most settings. At recall >= 0.98 with ef=64, the operating-point table lists 1,109 QPS, 7 ms mean, and 15 ms p99, with 79 s import time for that dataset. The 1,000,000-vector section below uses the same large-tier SKU for a follow-up run.
Tier Comparison
At recall >= 0.98 (ef=64):
| Tier | QPS | Mean latency | p99 latency | Import time |
|---|---|---|---|---|
| Small | 130 | 59 ms | 234 ms | 624 s |
| Medium | 322 | 23 ms | 102 ms | 184 s |
| Large | 1109 | 7 ms | 15 ms | 79 s |
In the tier comparison table above, each larger tier shows higher QPS, lower mean and p99 latency, and shorter import time than the next smaller tier. The largest absolute p99 drop (132 ms) is between small and medium, while the largest relative p99 drop (about 6.8 times) and the largest QPS gain (about 3.4 times) are between medium and large.
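The step-to-step changes can be worked out directly from the tier comparison rows. A short sketch, using hypothetical names and the rounded values from that table:

```python
# Values at ef=64 copied from the tier comparison table.
TIERS = {
    "small":  {"qps": 130,  "mean_ms": 59, "p99_ms": 234, "import_s": 624},
    "medium": {"qps": 322,  "mean_ms": 23, "p99_ms": 102, "import_s": 184},
    "large":  {"qps": 1109, "mean_ms": 7,  "p99_ms": 15,  "import_s": 79},
}

def step_change(smaller, larger):
    """Compare two tiers: multiplicative gains plus the absolute p99 drop in ms."""
    a, b = TIERS[smaller], TIERS[larger]
    return {
        "qps_gain": b["qps"] / a["qps"],
        "p99_reduction": a["p99_ms"] / b["p99_ms"],
        "p99_drop_ms": a["p99_ms"] - b["p99_ms"],
        "import_speedup": a["import_s"] / b["import_s"],
    }

for pair in (("small", "medium"), ("medium", "large")):
    change = step_change(*pair)
    print(pair, {name: round(value, 2) for name, value in change.items()})
```

Reporting both the ratio and the absolute difference avoids ambiguity about which step is "steepest", since the two measures can rank the steps differently.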
Iso-Recall Comparison
For each tier, the best operating point at common recall targets:
Recall ~0.95
At ef=24, measured Recall@10 is 0.945 to 0.948 on every tier, just under 0.95.
| Tier | Best ef | QPS | Mean (ms) | p99 (ms) |
|---|---|---|---|---|
| Small | 24 | 159 | 48 | 252 |
| Medium | 24 | 331 | 22 | 112 |
| Large | 24 | 1304 | 6 | 14 |
Recall >= 0.98
| Tier | Best ef | QPS | Mean (ms) | p99 (ms) |
|---|---|---|---|---|
| Small | 64 | 130 | 59 | 234 |
| Medium | 64 | 322 | 23 | 102 |
| Large | 64 | 1109 | 7 | 15 |
Recall >= 0.99
| Tier | Best ef | QPS | Mean (ms) | p99 (ms) |
|---|---|---|---|---|
| Small | 96 | 108 | 71 | 238 |
| Medium | 128 | 348 | 22 | 54 |
| Large | 128 | 919 | 8 | 17 |
Recall >= 0.999
| Tier | Best ef | QPS | Mean (ms) | p99 (ms) |
|---|---|---|---|---|
| Small | 512 | 40 | 191 | 567 |
| Medium | 512 | 179 | 43 | 95 |
| Large | 512 | 357 | 22 | 50 |
When to Pick Each Tier
| Use case | Tier | Rationale |
|---|---|---|
| Development, staging, demos | Small | Same benchmark setup as the other tiers (HNSW, ef sweep, ground truth). In these runs, headline import for 100k vectors exceeded 10 minutes and p99 latency was often several times the mean in the full results, which points to a resource-constrained node and more tail variance than on medium or large. |
| Low-traffic or internal production | Medium | At the recall >= 0.98 operating point in this run (ef=64), mean latency was 23 ms and p99 was 102 ms (see Medium tier tables). QPS versus ef was not monotonic across the short sweeps, so treat headline QPS as indicative until you rerun with longer query windows. |
| Latency-sensitive applications | Large | Full results show lower mean and p99 latencies across most ef values than in the small- and medium-tier tables for this workload. At recall >= 0.98 with ef=64, this run reported 7 ms mean and 15 ms p99 (see Large tier tables). |
| Multi-million vector datasets | Large | Shorter headline import on the 100k workload in this benchmark than on small or medium (see tier comparison import row). This doc also includes a follow-up run on the same large-tier SKU with about 990k vectors after filtering. |
Large Tier (1,000,000 Vectors)
This run uses the 1,000,000-vector variant dbpedia-openai-1000k-angular.hdf5 (about 990,000 vectors after filtering) on the same large-tier hardware. It is excluded from the cross-tier comparison above because the dataset differs.
Headline Numbers
| Metric | Value |
|---|---|
| Import time (990k vectors) | 944.3 s (~15:44) |
| Peak QPS (ef=24) | 1090 |
| Best recall (ef=512) | 0.997 |
| Best mean latency (ef=24) | 7.2 ms |
| QPS at recall >= 0.98 | 873 |
Full Results
| ef | QPS | Mean latency (ms) | p99 latency (ms) | Recall@10 | NDCG@10 |
|---|---|---|---|---|---|
| 16 | 539 | 14.4 | 111.3 | 0.8878 | 0.9537 |
| 24 | 1090 | 7.2 | 12.0 | 0.9241 | 0.9698 |
| 32 | 845 | 9.3 | 20.8 | 0.9430 | 0.9778 |
| 48 | 922 | 8.6 | 15.4 | 0.9628 | 0.9861 |
| 64 | 815 | 9.7 | 20.0 | 0.9721 | 0.9896 |
| 96 | 873 | 9.1 | 16.6 | 0.9817 | 0.9933 |
| 128 | 763 | 10.4 | 16.8 | 0.9866 | 0.9952 |
| 256 | 551 | 14.4 | 28.8 | 0.9935 | 0.9977 |
| 512 | 446 | 17.7 | 31.7 | 0.9969 | 0.9989 |
Observations
- ef=16 is anomalously slow. The most likely cause is cold-cache effects on the very first iteration of the sweep. Treat that row with caution.
- From ef=24 onward the curve decreases roughly as expected with minor jitter, all within about 10-15%.
- Recall is lower at the same ef than on the 100,000-vector workload, which is expected. With 10 times the vectors, the HNSW graph is larger and the same ef explores a smaller fraction of it. To match a recall target on the 1,000,000-vector dataset, increase ef accordingly.
- p99 stays under 32 ms across the full sweep (excluding the anomalous ef=16), well within typical interactive-query SLAs for a million-vector index.
- Import time scaled slightly worse than linear: 944 seconds for about 990,000 vectors versus 79 seconds for 100,000. HNSW build cost grows super-linearly with vector count.
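The import-time observation can be checked with a few lines of arithmetic. The sketch below compares the observed import-time ratio with linear scaling and with an N log N model, which is a commonly cited approximation for HNSW construction cost and is used here as an assumption, not as a measured property of these clusters. It takes the 100,000 and roughly 990,000 vector counts stated in this document at face value.

```python
import math

# Vector counts and import times taken from the two large-tier headline tables.
N_100K, IMPORT_100K_S = 100_000, 79.3
N_1M, IMPORT_1M_S = 990_000, 944.3

observed = IMPORT_1M_S / IMPORT_100K_S
linear = N_1M / N_100K
# Assumed model: build cost proportional to N * log(N).
n_log_n = (N_1M * math.log(N_1M)) / (N_100K * math.log(N_100K))

print(f"observed import ratio: {observed:.2f}")
print(f"linear prediction:     {linear:.2f}")
print(f"N log N prediction:    {n_log_n:.2f}")
```

The observed ratio lands above the linear prediction and close to the N log N prediction, which supports reading the slowdown as expected index-build behavior instead of a tier limitation. Two data points cannot confirm the model, so treat the agreement as suggestive only.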
Key Insight
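Increasing dataset size by 10 times reduced throughput by about 21% at recall >= 0.98 (1,109 to 873 QPS) and by about 40% at recall near 0.994 (919 to 551 QPS). The large tier remains usable at this scale, but it loses more headroom at higher recall targets.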
Recommended Operating Points
For recall >= 0.98, run with ef=96:
| Metric | Value |
|---|---|
| Recall@10 | 0.982 |
| QPS | 873 |
| Mean latency | 9 ms |
| p99 latency | 17 ms |
For recall >= 0.99, use ef=256:
| Metric | Value |
|---|---|
| Recall@10 | 0.994 |
| QPS | 551 |
| Mean latency | 14 ms |
| p99 latency | 29 ms |
Same Hardware, Different Scale
Side-by-side, large tier, both datasets, at recall >= 0.98:
| Dataset | ef | QPS | Mean (ms) | p99 (ms) | Recall | Import |
|---|---|---|---|---|---|---|
| 100k | 64 | 1109 | 7 | 15 | 0.983 | 79 s |
| 1M | 96 | 873 | 9 | 17 | 0.982 | 944 s |
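To read the side-by-side table as relative changes, the following sketch computes the percentage change for each metric between the two rows. The names are hypothetical, and the inputs are the rounded values from the table, so the results are approximate.

```python
# Rows copied from the side-by-side table (large tier, recall >= 0.98).
ROW_100K = {"qps": 1109, "mean_ms": 7, "p99_ms": 15, "import_s": 79}
ROW_1M = {"qps": 873, "mean_ms": 9, "p99_ms": 17, "import_s": 944}

def percent_change(before, after):
    """Signed percentage change from before to after."""
    return (after - before) / before * 100.0

changes = {
    metric: percent_change(ROW_100K[metric], ROW_1M[metric])
    for metric in ROW_100K
}

for metric, change in changes.items():
    print(f"{metric:>9}: {change:+.1f}%")
```

Note that the two rows use different ef values (64 and 96), so the comparison holds recall roughly constant and lets search depth vary.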
Notes on Results
- Single-run measurements: Each tier was benchmarked once. Per-ef query sweeps run for only 2-4 seconds, so they are vulnerable to transient noise. The clearest example is the medium tier’s non-monotonic QPS curve. Re-run with --queryDuration 60 or longer per ef value to get statistically stable numbers before quoting any of these results in customer-facing material.
- Noisy-neighbor variance: Managed multi-tenant infrastructure is subject to co-tenant interference. The unstable medium and small results may understate the steady-state performance of those tiers.
- No memory metrics: Weaviate’s Prometheus endpoint is not exposed on managed clusters during preview, so heap and resident-set-size measurements are not available.
- Single dataset and dimensionality: All measurements are on 100,000 vectors at 1,536 dimensions (plus the 1,000,000-vector run on the large tier). A smaller-dimension dataset would round out the picture.
- Network latency floor: Traffic flows through the managed-cluster load balancer in TOR1, so absolute latencies include several milliseconds of networking overhead that would not be present in an in-process or same-VPC benchmark.
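When re-running to address the single-run limitation noted above, repeated measurements at the same ef can be summarized with a median and a spread measure. The sketch below is one possible approach, using the coefficient of variation (standard deviation divided by mean). The function name is hypothetical, and the input readings are placeholders for illustration only, not measurements from any cluster.

```python
import statistics

def summarize_runs(qps_runs):
    """Median and coefficient of variation for repeated QPS readings at one ef."""
    if len(qps_runs) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return {
        "runs": len(qps_runs),
        "median_qps": statistics.median(qps_runs),
        "cv": statistics.stdev(qps_runs) / statistics.mean(qps_runs),
    }

# Placeholder readings for illustration only; these are NOT benchmark results.
example = summarize_runs([100.0, 110.0, 90.0, 105.0, 95.0])
print(example)
```

A low coefficient of variation across runs gives some confidence that a row reflects steady-state behavior. A high one suggests the window was still dominated by transient noise and the run should be lengthened or repeated.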