Per-`ef` query sweeps are short (2-4 seconds), so absolute QPS and latency values can be skewed by transient infrastructure noise. Treat these results as indicative for tier selection rather than as a vendor performance guarantee, and re-run with longer query durations before quoting any specific number externally.
Weaviate Benchmarks
Validated on 27 Apr 2026 • Last edited on 27 Apr 2026
DigitalOcean Managed Weaviate is a fully managed Weaviate vector database for retrieval-augmented generation, semantic search, and similarity-based AI workloads. Clusters are provisioned, secured, backed up, and patched by DigitalOcean.
These benchmarks compare the small, medium, and large Managed Weaviate tiers on a 100,000-vector approximate nearest-neighbor (ANN) workload, with a separate run on a 1,000,000-vector workload for the large tier.
Methodology
What These Benchmarks Measure
Weaviate’s core operation is approximate nearest-neighbor search: given a query vector, return the k most similar vectors from the indexed set. Three metrics matter:
- Recall@10: Of the 10 results returned, what fraction are actually in the true top 10. `1.0` means perfect agreement with brute-force search.
- Throughput (QPS): Queries per second the cluster can sustain under concurrent load.
- Latency: Wall-clock time per query, reported as the mean and the 99th percentile (p99).
There is a fundamental tradeoff: the same index can be tuned to favor recall or speed, but not both at the same time.
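Recall@k can be computed directly from the returned IDs and the precomputed ground truth. A minimal sketch (the function name and sample IDs are illustrative, not taken from the benchmark tool):

```python
def recall_at_k(returned_ids, true_ids, k=10):
    """Fraction of the true top-k neighbors that appear in the returned top-k."""
    return len(set(returned_ids[:k]) & set(true_ids[:k])) / k

# 9 of the 10 true neighbors were returned -> recall@10 = 0.9
print(recall_at_k([1, 2, 3, 4, 5, 6, 7, 8, 9, 99],
                  list(range(1, 11))))  # 0.9
```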
The HNSW Index
Weaviate’s default vector index is HNSW (Hierarchical Navigable Small World graph). It has two phases:
- Build-time parameters, set when the index is created:
  - `efConstruction`: How thoroughly the graph is built. Higher values mean slower indexing and a higher recall ceiling. These benchmarks use `256` in all runs.
  - `maxConnections` (`M`): How many neighbors each graph node retains. Higher values mean more memory and slightly better recall. These benchmarks use `16` in all runs.
- Query-time parameter, swept across each benchmark:
  - `ef`: Search depth at query time. Higher `ef` explores more of the graph per query, which increases recall and latency and decreases QPS.
By holding efConstruction and maxConnections fixed and sweeping ef, each tier traces a recall-versus-throughput curve.
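In Weaviate's schema, these parameters live in the class's vector index config. A minimal sketch of the values held fixed in every run, plus the swept query-time `ef` (treat the exact payload shape as an assumption to check against your server version):

```python
# HNSW parameters fixed at build time, plus the query-time ef that is swept.
vector_index_config = {
    "efConstruction": 256,  # build-time: candidate-list size during graph construction
    "maxConnections": 16,   # build-time: neighbors retained per node (M)
    "distance": "cosine",   # matches the dataset's angular ground truth
    "ef": 64,               # query-time search depth; swept from 16 to 512 here
}
```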
The dataset
| Property | Value |
|---|---|
| Name | dbpedia-openai-100k-angular.hdf5 |
| Source | ann-benchmarks.com |
| Vectors | 100,000 |
| Dimensions | 1,536 |
| Embedding model | OpenAI text-embedding-ada-002 |
| Distance metric | Cosine (angular) |
| Test queries | 1,000 (973 used after filtering) |
| Ground truth | Pre-computed exact top-100 neighbors per query |
Recall numbers from this dataset should generalize reasonably well to comparable semantic-search workloads.
The benchmark tool
These benchmarks use the open-source `weaviate-benchmarking` tool with the `ann-benchmark` subcommand. It performs:
- Schema setup: Drops and recreates the `Vector` class with the configured HNSW parameters.
- Ingest: Loads all vectors over gRPC using a producer-consumer pipeline with 8 worker goroutines and 100-vector batches.
- Quiesce: Waits 30 seconds after ingest so the index can settle.
- Query sweep: For each `ef` in `{16, 24, 32, 48, 64, 96, 128, 256, 512}`, runs the full test query set with 8 concurrent workers and records latency for every query.
- Recall computation: Compares returned neighbors against the ground truth.
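The query-sweep phase can be sketched as follows; `run_query` is a placeholder for a real gRPC search call, and all names here are illustrative rather than taken from the tool:

```python
import time
from concurrent.futures import ThreadPoolExecutor

EF_VALUES = [16, 24, 32, 48, 64, 96, 128, 256, 512]

def run_query(query_vector, ef):
    """Placeholder for a real Weaviate search; returns latency in seconds."""
    start = time.perf_counter()
    # a real client.search(query_vector, ef=ef, limit=10) call would go here
    return time.perf_counter() - start

def sweep(queries, workers=8):
    """Run the full query set at each ef value with `workers` concurrent workers."""
    results = {}
    for ef in EF_VALUES:
        t0 = time.perf_counter()
        with ThreadPoolExecutor(max_workers=workers) as pool:
            latencies = sorted(pool.map(lambda q: run_query(q, ef), queries))
        wall = time.perf_counter() - t0
        results[ef] = {
            "qps": len(latencies) / wall,                               # wall-clock throughput
            "mean_ms": 1000 * sum(latencies) / len(latencies),          # mean latency
            "p99_ms": 1000 * latencies[int(0.99 * (len(latencies) - 1))],  # tail latency
        }
    return results
```

QPS is measured against wall-clock time for the whole batch, not summed per-query latency, because the workers run concurrently.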
Test environment
- Load generator: A separate VM running the benchmarker CLI in the same region as the target cluster.
- Network: Traffic flows through the managed-cluster load balancer (TLS, gRPC over HTTPS on port 443). This adds non-trivial latency compared with localhost benchmarks; keep this in mind when comparing absolute numbers to leaderboards that run in-process.
- Concurrency: `--parallel 8` for both ingest and query phases.
- Top-k: `--limit 10`. Recall and NDCG are reported at 10.
Small Tier (100k vectors)
| Property | Value |
|---|---|
| vCPU | 1 |
| Memory | 2 GB |
| Disk | 3 GB |
| Region | TOR1 |
| Replicas | 1 |
| Shards | 1 |
Headline numbers
| Metric | Value |
|---|---|
| Import time (100k vectors) | 624.4 s (~10:24) |
| Peak QPS (ef=24) | 158.9 |
| Best recall (ef=512) | 0.9993 |
| Best mean latency (ef=24) | 48 ms |
| QPS at recall >= 0.98 | 129.8 |
Full results
| ef | QPS | Mean latency (ms) | p99 latency (ms) | Recall@10 | NDCG@10 |
|---|---|---|---|---|---|
| 16 | 141.9 | 55.5 | 446.2 | 0.9140 | 0.9654 |
| 24 | 158.9 | 48.4 | 251.9 | 0.9447 | 0.9788 |
| 32 | 110.4 | 70.1 | 428.5 | 0.9594 | 0.9847 |
| 48 | 135.7 | 51.8 | 284.1 | 0.9763 | 0.9912 |
| 64 | 129.8 | 59.4 | 233.5 | 0.9846 | 0.9944 |
| 96 | 108.1 | 70.9 | 238.5 | 0.9912 | 0.9968 |
| 128 | 86.4 | 88.2 | 300.7 | 0.9933 | 0.9976 |
| 256 | 56.4 | 135.8 | 418.7 | 0.9986 | 0.9995 |
| 512 | 40.1 | 190.6 | 567.1 | 0.9993 | 0.9998 |
Observations
- Throughput does not decrease monotonically with `ef`. The expected ordering would be `ef=16 > 24 > 32 > 48`, but QPS jumps around. The most likely cause is single-run noise: each `ef` sweep runs for only a few seconds, so a brief CPU contention spike can dominate the measurement.
- p99 is typically 3-8 times the mean, much higher than on the medium and large tiers. This is the signature of a resource-constrained instance where slow queries queue behind fast ones.
- Import takes more than 10 minutes for 100,000 vectors, which is slow enough to matter for production data refresh patterns.
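The tail-latency observation is easy to verify against the full-results table; for example, using the mean and p99 values copied from four of the rows above:

```python
# (mean_ms, p99_ms) for selected small-tier rows, keyed by ef.
rows = {16: (55.5, 446.2), 24: (48.4, 251.9), 64: (59.4, 233.5), 512: (190.6, 567.1)}
ratios = {ef: round(p99 / mean, 1) for ef, (mean, p99) in rows.items()}
print(ratios)  # {16: 8.0, 24: 5.2, 64: 3.9, 512: 3.0}
```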
Recommended operating point
For a target of recall >= 0.98, run with ef=64:
| Metric | Value |
|---|---|
| Recall@10 | 0.985 |
| QPS | 130 |
| Mean latency | 59 ms |
| p99 latency | 234 ms |
If the application can tolerate ~95% recall in exchange for higher throughput, drop to ef=24 for ~159 QPS at 48 ms mean latency.
Suitability
The small tier is appropriate for development, staging, and low-traffic production workloads (up to about 100 sustained QPS). Its high tail latency makes it a poor choice for latency-sensitive interactive applications. For predictable sub-100 ms p99, use the medium or large tier.
Medium Tier (100k vectors)
| Property | Value |
|---|---|
| vCPU | 2 |
| Memory | 4 GB |
| Disk | 11 GB |
| Region | TOR1 |
| Replicas | 1 |
| Shards | 1 |
Headline numbers
| Metric | Value |
|---|---|
| Import time (100k vectors) | 184.5 s (~3:04) |
| Peak QPS (ef=16) | 454.9 |
| Best recall (ef=512) | 0.9996 |
| Best mean latency (ef=16) | 17 ms |
| QPS at recall >= 0.98 | 322.5 |
Full results
| ef | QPS | Mean latency (ms) | p99 latency (ms) | Recall@10 | NDCG@10 |
|---|---|---|---|---|---|
| 16 | 454.9 | 17.1 | 54.2 | 0.9190 | 0.9687 |
| 24 | 330.8 | 22.1 | 111.6 | 0.9481 | 0.9799 |
| 32 | 271.0 | 25.4 | 160.2 | 0.9609 | 0.9853 |
| 48 | 414.5 | 17.6 | 90.9 | 0.9763 | 0.9913 |
| 64 | 322.5 | 23.4 | 102.2 | 0.9817 | 0.9934 |
| 96 | 337.4 | 22.7 | 99.8 | 0.9911 | 0.9968 |
| 128 | 348.4 | 21.9 | 54.4 | 0.9945 | 0.9980 |
| 256 | 267.5 | 29.0 | 60.1 | 0.9983 | 0.9994 |
| 512 | 178.6 | 43.2 | 95.4 | 0.9996 | 0.9999 |
Observations
- The QPS curve is strongly non-monotonic, with throughput rising at `ef=48` and `ef=128` even though more work is being done per query. This is not algorithmically plausible; the most likely cause is noisy-neighbor variance on shared infrastructure during the brief window each `ef` runs.
- Tail-latency ratios are healthier than on the small tier (typically 2-7 times the mean), which suggests this tier has the headroom to absorb concurrent load.
- Import is about 3.4 times faster than the small tier on the same workload.
Recommended operating point
For recall >= 0.98, run with ef=64:
| Metric | Value |
|---|---|
| Recall@10 | 0.982 |
| QPS | 322 |
| Mean latency | 23 ms |
| p99 latency | 102 ms |
If a 99.5%+ recall target is required, ef=128 gave 348 QPS at 22 ms mean and 54 ms p99 in this run. The result is anomalously good and should be re-validated with a longer query duration before being relied on.
Suitability
The medium tier fits moderate-traffic production workloads with sustained throughput in the low hundreds of QPS at sub-100 ms p99. The unstable QPS curve in this single run means absolute numbers should be re-validated with longer duration sweeps before being quoted in external materials.
Large Tier (100k vectors)
| Property | Value |
|---|---|
| vCPU | 8 |
| Memory | 32 GB |
| Disk | 230 GB |
| Region | TOR1 |
| Replicas | 1 |
| Shards | 1 |
Headline numbers
| Metric | Value |
|---|---|
| Import time (100k vectors) | 79.3 s (~1:19) |
| Peak QPS (ef=24) | 1304 |
| Best recall (ef=512) | 0.9994 |
| Best mean latency (ef=24) | 6.1 ms |
| QPS at recall >= 0.98 | 1109 |
Full results
| ef | QPS | Mean latency (ms) | p99 latency (ms) | Recall@10 | NDCG@10 |
|---|---|---|---|---|---|
| 16 | 1042 | 7.5 | 30.6 | 0.9201 | 0.9681 |
| 24 | 1304 | 6.1 | 13.9 | 0.9455 | 0.9786 |
| 32 | 1052 | 7.4 | 17.9 | 0.9581 | 0.9842 |
| 48 | 1025 | 7.6 | 15.3 | 0.9756 | 0.9909 |
| 64 | 1109 | 7.1 | 15.4 | 0.9828 | 0.9937 |
| 96 | 956 | 8.1 | 16.8 | 0.9888 | 0.9959 |
| 128 | 919 | 8.4 | 17.3 | 0.9936 | 0.9977 |
| 256 | 583 | 13.0 | 27.2 | 0.9974 | 0.9991 |
| 512 | 357 | 21.6 | 49.9 | 0.9994 | 0.9998 |
Observations
- Apart from a small QPS bump at `ef=24` (typical of warmup effects in the first measurement window), the QPS curve is well behaved and decreases monotonically.
- p99 stays under 30 ms for `ef <= 128` and under 50 ms even at `ef=512`. The p99-to-mean ratio is consistently 2-3 times, which indicates plenty of CPU and memory headroom.
- Import takes 79 seconds for 100,000 vectors, which is fast enough to support frequent re-indexing or larger datasets.
Recommended operating point
For recall >= 0.98, run with ef=64:
| Metric | Value |
|---|---|
| Recall@10 | 0.983 |
| QPS | 1109 |
| Mean latency | 7 ms |
| p99 latency | 15 ms |
For recall >= 0.99, step up to ef=128:
| Metric | Value |
|---|---|
| Recall@10 | 0.994 |
| QPS | 919 |
| Mean latency | 8 ms |
| p99 latency | 17 ms |
Suitability
The large tier handles production semantic-search workloads at scale: 1,000+ QPS with double-digit-millisecond p99 at high recall is suitable for latency-sensitive interactive applications such as chat, search, and recommendations. For 100,000-vector workloads the cluster shows substantial CPU and memory headroom, indicating that significantly larger datasets are viable on the same tier.
Tier Comparison
At the recall >= 0.98 operating point (ef=64) on the 100,000-vector workload:
| Tier | QPS | Mean latency | p99 latency | Import time |
|---|---|---|---|---|
| Small | 130 | 59 ms | 234 ms | 624 s |
| Medium | 322 | 23 ms | 102 ms | 184 s |
| Large | 1109 | 7 ms | 15 ms | 79 s |
The large tier delivered about 8.5 times the throughput of the small tier and about 3.4 times the throughput of the medium tier, with dramatically lower tail latency.
Iso-recall comparison
For each tier, the best operating point at common recall targets:
Recall >= 0.94
| Tier | Best ef | QPS | Mean (ms) | p99 (ms) |
|---|---|---|---|---|
| Small | 24 | 159 | 48 | 252 |
| Medium | 24 | 331 | 22 | 112 |
| Large | 24 | 1304 | 6 | 14 |
Recall >= 0.98
| Tier | Best ef | QPS | Mean (ms) | p99 (ms) |
|---|---|---|---|---|
| Small | 64 | 130 | 59 | 234 |
| Medium | 64 | 322 | 23 | 102 |
| Large | 64 | 1109 | 7 | 15 |
Recall >= 0.99
| Tier | Best ef | QPS | Mean (ms) | p99 (ms) |
|---|---|---|---|---|
| Small | 96 | 108 | 71 | 238 |
| Medium | 128 | 348 | 22 | 54 |
| Large | 128 | 919 | 8 | 17 |
Recall >= 0.999
| Tier | Best ef | QPS | Mean (ms) | p99 (ms) |
|---|---|---|---|---|
| Small | 512 | 40 | 191 | 567 |
| Medium | 512 | 179 | 43 | 95 |
| Large | 512 | 357 | 22 | 50 |
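The recipe behind these tables can be expressed as: take the lowest `ef` whose measured recall meets the target (on a well-behaved, monotonically decreasing QPS curve this is also the highest-QPS qualifying point). A sketch using the (ef, QPS, recall@10) columns of the small-tier full-results table:

```python
def lowest_ef_meeting(sweep, target_recall):
    """Return (ef, qps) for the lowest ef whose recall meets the target."""
    for ef, qps, recall in sorted(sweep):
        if recall >= target_recall:
            return ef, qps
    return None  # no operating point meets the target

# (ef, QPS, recall@10) rows from the small-tier full-results table.
small = [(16, 141.9, 0.9140), (24, 158.9, 0.9447), (32, 110.4, 0.9594),
         (48, 135.7, 0.9763), (64, 129.8, 0.9846), (96, 108.1, 0.9912),
         (128, 86.4, 0.9933), (256, 56.4, 0.9986), (512, 40.1, 0.9993)]
print(lowest_ef_meeting(small, 0.98))  # (64, 129.8)
print(lowest_ef_meeting(small, 0.99))  # (96, 108.1)
```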
When to pick which tier
| Use case | Recommended tier | Rationale |
|---|---|---|
| Development, staging, demos | Small | Cheapest tier, functionally complete. Latency variability is acceptable for non-production traffic. |
| Internal tools, batch jobs, low-traffic production | Medium | About 2.5 times the throughput of small at sub-100 ms p99 at the recommended operating point. Adequate headroom for traffic spikes. |
| Latency-sensitive production: chat, search UI, recommendations | Large | Sub-20 ms p99 at 99% recall and sustained 1000+ QPS. The only tested tier that delivered tight tail-latency guarantees. |
| Datasets significantly larger than 100,000 vectors | Large | Small and medium import times suggest those tiers would struggle with multi-million-vector indexing windows. |
Bonus: Large Tier on a 1,000,000-Vector Workload
This run uses the 1,000,000-vector variant dbpedia-openai-1000k-angular.hdf5 (about 990,000 vectors after filtering) on the same large-tier hardware. It is excluded from the cross-tier comparison above because the dataset differs.
Headline numbers
| Metric | Value |
|---|---|
| Import time (990k vectors) | 944.3 s (~15:44) |
| Peak QPS (ef=24) | 1090 |
| Best recall (ef=512) | 0.997 |
| Best mean latency (ef=24) | 7.2 ms |
| QPS at recall >= 0.98 | 873 |
Full results
| ef | QPS | Mean latency (ms) | p99 latency (ms) | Recall@10 | NDCG@10 |
|---|---|---|---|---|---|
| 16 | 539 | 14.4 | 111.3 | 0.8878 | 0.9537 |
| 24 | 1090 | 7.2 | 12.0 | 0.9241 | 0.9698 |
| 32 | 845 | 9.3 | 20.8 | 0.9430 | 0.9778 |
| 48 | 922 | 8.6 | 15.4 | 0.9628 | 0.9861 |
| 64 | 815 | 9.7 | 20.0 | 0.9721 | 0.9896 |
| 96 | 873 | 9.1 | 16.6 | 0.9817 | 0.9933 |
| 128 | 763 | 10.4 | 16.8 | 0.9866 | 0.9952 |
| 256 | 551 | 14.4 | 28.8 | 0.9935 | 0.9977 |
| 512 | 446 | 17.7 | 31.7 | 0.9969 | 0.9989 |
Observations
- `ef=16` is anomalously slow. The most likely cause is cold-cache effects on the very first iteration of the sweep; treat that row with caution.
- From `ef=24` onward the curve decreases roughly as expected, with minor jitter within ~10-15%.
- Recall is lower at the same `ef` than on the 100,000-vector workload, which is expected: with 10 times the vectors, the same `ef` explores a smaller fraction of the graph. To match a recall target on the 1,000,000-vector dataset, increase `ef` accordingly.
- p99 stays under 32 ms across the full sweep (excluding the anomalous `ef=16` row), well within typical interactive-query SLAs for a million-vector index.
- Import time scaled slightly worse than linearly: 944 seconds for ~1,000,000 vectors versus 79 seconds for 100,000, consistent with HNSW build cost growing super-linearly with vector count.
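The import-scaling claim is straightforward arithmetic on the two large-tier runs:

```python
# Large-tier import times from the two runs.
t_100k = 79.3   # seconds for 100k vectors
t_1m = 944.3    # seconds for ~1M vectors
# 11.9x the time for 10x the vectors -> slightly worse than linear scaling
print(round(t_1m / t_100k, 1))  # 11.9
```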
Recommended operating point
For recall >= 0.98, run with ef=96:
| Metric | Value |
|---|---|
| Recall@10 | 0.982 |
| QPS | 873 |
| Mean latency | 9 ms |
| p99 latency | 17 ms |
For recall >= 0.99, use ef=256:
| Metric | Value |
|---|---|
| Recall@10 | 0.994 |
| QPS | 551 |
| Mean latency | 14 ms |
| p99 latency | 29 ms |
Same hardware, different scale
Side-by-side, large tier, both datasets, at recall >= 0.98:
| Dataset | ef | QPS | Mean (ms) | p99 (ms) | Recall | Import |
|---|---|---|---|---|---|---|
| 100k | 64 | 1109 | 7 | 15 | 0.983 | 79 s |
| 1M | 96 | 873 | 9 | 17 | 0.982 | 944 s |
A 10 times increase in dataset size reduced throughput at the recall >= 0.98 operating point by roughly 20% (1109 to 873 QPS), which is a graceful degradation curve.
Caveats
- Single-run measurements. Each tier was benchmarked once. Per-`ef` query sweeps run for only 2-4 seconds, so they are vulnerable to transient noise; the clearest example is the medium tier's non-monotonic QPS curve. Re-run with `--queryDuration 60` or longer per `ef` value to get statistically stable numbers before quoting any of these results in customer-facing material.
- Noisy-neighbor variance. Managed multi-tenant infrastructure is subject to co-tenant interference. The unstable medium and small results may understate the steady-state performance of those tiers.
- No memory metrics. Weaviate’s Prometheus endpoint is not exposed on managed clusters during preview, so heap and resident-set-size measurements are not available.
- Single dataset, single dimensionality. All measurements are on 100,000 vectors at 1,536 dimensions (plus the bonus 1,000,000-vector run on the large tier). A smaller-dimension dataset would round out the picture.
- Network latency floor. Traffic flows through the managed-cluster load balancer in TOR1, so absolute latencies include several milliseconds of networking overhead that would not be present in an in-process or same-VPC benchmark.