Weaviate Benchmarks

Validated on 28 Apr 2026 • Last edited on 7 May 2026

DigitalOcean Managed Weaviate is a fully managed Weaviate vector database for retrieval-augmented generation, semantic search, and similarity-based AI workloads. Clusters are provisioned, secured, backed up, and patched by DigitalOcean.

Managed Weaviate performance varies by tier, dataset size, and query-time tuning. The comparison below shows how the small, medium, and large tiers behave under a 100,000-vector approximate nearest-neighbor (ANN) workload, with an additional 1,000,000-vector run on the large tier to illustrate how performance scales.

Note
The numbers below come from a single run per tier. Per-ef query sweeps are short (2-4 seconds), so absolute QPS and latency values can be affected by transient infrastructure noise. Treat these results as directional for tier selection, not as a performance guarantee. Re-run with longer query durations before quoting specific numbers externally.

Methodology

What These Benchmarks Measure

Weaviate’s core operation is approximate nearest-neighbor search: given a query vector, return the k most similar vectors from the indexed set.

The benchmarks report four metrics:

  • Recall@10: Fraction of returned results that are in the true top 10. 1.0 means perfect agreement with brute-force search.
  • Throughput (QPS): Queries per second under concurrent load.
  • Latency: Wall-clock time per query, reported as mean and p99.
  • NDCG@10: Normalized discounted cumulative gain at rank 10. It scores how highly relevant results are ranked within the top 10, with more weight on positions closer to the top. Reported in the per-ef tables next to Recall@10.
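As a rough illustration of the two ranking metrics above, the sketch below computes Recall@10 and a binary-relevance NDCG@10 for a single query. The function names, the toy ID lists, and the binary-relevance assumption are illustrative only; the benchmark tool's exact NDCG formula is not stated in this document and may differ.

```python
import math

def recall_at_k(returned, truth, k=10):
    """Fraction of the true top-k IDs that appear in the returned top-k."""
    true_top = set(truth[:k])
    hits = sum(1 for item in returned[:k] if item in true_top)
    return hits / k

def ndcg_at_k(returned, truth, k=10):
    """Binary-relevance NDCG: a hit at rank i (0-based) earns 1 / log2(i + 2)."""
    true_top = set(truth[:k])
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(returned[:k]) if item in true_top)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(k))
    return dcg / ideal

truth = list(range(10))                      # hypothetical exact top-10 IDs
returned = [0, 1, 2, 3, 4, 5, 6, 7, 8, 99]   # one miss, at the last rank
print(recall_at_k(returned, truth))
print(round(ndcg_at_k(returned, truth), 4))
```

Under this definition a miss near the bottom of the list costs less than a miss near the top, which is one plausible reason NDCG@10 sits above Recall@10 in every row of the per-ef tables.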

There is a fundamental tradeoff: the same index can be tuned to favor recall or speed, but not both simultaneously.

The HNSW Index

Weaviate uses the HNSW (Hierarchical Navigable Small World) index.

Build-time parameters (fixed in these benchmarks):

  • efConstruction = 256: Higher values improve recall at the cost of slower indexing.
  • maxConnections (M) = 16: Controls graph density and memory usage.

Query-time parameter (swept):

  • ef: Search depth. Higher values increase recall and latency, and reduce QPS.

At index build, efConstruction and M are fixed, and only ef is swept at query time to trace a recall-throughput curve for each tier.
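The split between fixed build-time settings and the swept query-time setting can be sketched as plain data. The key names below (efConstruction, maxConnections, ef, distance) follow Weaviate's HNSW index configuration naming as commonly documented, but treat the exact shape as an assumption; the helper index_config is hypothetical and is not part of the benchmark tool.

```python
import json

BUILD_TIME = {"efConstruction": 256, "maxConnections": 16}  # fixed for every run
EF_SWEEP = [16, 24, 32, 48, 64, 96, 128, 256, 512]          # query-time values

def index_config(ef):
    """Return an HNSW config dict: fixed build settings plus one swept ef."""
    return {**BUILD_TIME, "ef": ef, "distance": "cosine"}

configs = [index_config(ef) for ef in EF_SWEEP]
print(json.dumps(configs[0], indent=2))
```

Only the ef entry changes between configurations, so the index is built once per tier and each point on the recall-throughput curve reuses the same graph.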

The Dataset

| Property | Value |
| --- | --- |
| Name | dbpedia-openai-100k-angular.hdf5 |
| Source | ann-benchmarks.com |
| Vectors | 100,000 |
| Dimensions | 1,536 |
| Embedding model | OpenAI text-embedding-ada-002 |
| Distance metric | Cosine (angular) |
| Test queries | 1,000 (973 used after filtering) |
| Ground truth | Pre-computed exact top-100 neighbors per query |

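Because the vectors are real text embeddings, results should carry over reasonably well to semantic search workloads with similar dimensionality and a cosine distance metric. Workloads with much smaller or larger vectors may behave differently.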
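To put the dataset size in context against each tier's memory, the sketch below estimates the size of the raw vectors alone. It assumes uncompressed 32-bit floats and no quantization, neither of which is stated in this document, and it ignores the HNSW graph and any other overhead.

```python
def raw_vector_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Size of the raw vectors alone, assuming uncompressed float32 values."""
    return num_vectors * dimensions * bytes_per_value

for label, count in [("100k", 100_000), ("~990k", 990_000)]:
    gib = raw_vector_bytes(count, 1536) / 2**30
    print(f"{label} vectors: about {gib:.2f} GiB before graph overhead")
```

This is a lower bound on storage, not a measurement of what the cluster actually used.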

Benchmark Tool

These benchmarks use the open-source weaviate-benchmarking tool with the ann-benchmark subcommand. It performs:

  1. Schema setup: Drops and recreates the Vector class with the configured HNSW parameters.
  2. Ingest: Loads all vectors over gRPC using a producer-consumer pipeline with 8 worker goroutines and 100-vector batches.
  3. Quiesce: Waits 30 seconds after ingest so the index can settle.
  4. Query sweep: For each ef in {16, 24, 32, 48, 64, 96, 128, 256, 512}, runs the full test query set with 8 concurrent workers and records latency for every query.
  5. Recall computation: Compares returned neighbors against the ground truth.
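The query-sweep step can be sketched as follows. This is not the benchmark tool's code (that tool is a separate CLI); it is a minimal stand-in in which fake_search is a placeholder that just sleeps, the p99 is taken by a simple nearest-rank rule, and all names are hypothetical.

```python
import time
import statistics
from concurrent.futures import ThreadPoolExecutor

def fake_search(query, ef):
    """Placeholder for a real ANN query; sleeps slightly longer as ef grows."""
    time.sleep(0.0005 + ef * 0.000005)

def timed_search(query, ef):
    """Run one query and return its wall-clock latency in seconds."""
    start = time.perf_counter()
    fake_search(query, ef)
    return time.perf_counter() - start

def run_sweep(queries, ef_values, workers=8):
    """For each ef, run every query with a fixed worker pool and summarize."""
    results = {}
    for ef in ef_values:
        wall_start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=workers) as pool:
            latencies = list(pool.map(lambda q: timed_search(q, ef), queries))
        wall = time.perf_counter() - wall_start
        latencies.sort()
        p99 = latencies[min(len(latencies) - 1, int(0.99 * len(latencies)))]
        results[ef] = {
            "qps": len(queries) / wall,
            "mean_ms": 1000 * statistics.mean(latencies),
            "p99_ms": 1000 * p99,
        }
    return results

stats = run_sweep(queries=range(200), ef_values=[16, 64])
for ef, row in stats.items():
    print(ef, {name: round(value, 2) for name, value in row.items()})
```

Because each ef value gets only one pass over the query set, a short stall during that pass shifts both the QPS and the p99 for that row, which is the noise sensitivity the Note at the top of this page describes.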

Test Environment

  • Load generator: A separate VM running the benchmarker CLI in the same region as the target cluster.
  • Network: Traffic flows through the managed-cluster load balancer (TLS, gRPC over HTTPS on port 443). That path adds non-trivial latency compared with localhost benchmarks, so absolute numbers can look slower than leaderboards that run in-process.
  • Concurrency: --parallel 8 for both ingest and query phases.
  • Top-k: --limit 10. Recall and NDCG are reported at 10.
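With a fixed number of workers that each issue one query at a time, Little's law gives a quick sanity check: throughput should be close to concurrency divided by mean latency. The sketch below applies that to the ef=64 rows from the full-results tables later on this page, assuming the 8 workers from --parallel 8 were kept busy for the whole sweep. The function name is hypothetical.

```python
def expected_qps(concurrency, mean_latency_ms):
    """Little's law for a closed loop: throughput = concurrency / mean latency."""
    return concurrency / (mean_latency_ms / 1000.0)

# (tier, ef, mean latency in ms, reported QPS) from the full-results tables
rows = [
    ("small", 64, 59.4, 129.8),
    ("medium", 64, 23.4, 322.5),
    ("large", 64, 7.1, 1109.0),
]
for tier, ef, mean_ms, reported in rows:
    estimate = expected_qps(8, mean_ms)
    print(f"{tier}: estimated {estimate:.0f} QPS, reported {reported:.0f} QPS")
```

A reported figure slightly below the estimate is what you would expect if workers spend a little time outside the timed query call; a large gap would suggest the reported mean latency and QPS were not measured over the same window.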

Small Tier (100k Vectors)

| Property | Value |
| --- | --- |
| vCPU | 1 |
| Memory | 2 GB |
| Disk | 3 GB |
| Region | TOR1 |
| Replicas | 1 |
| Shards | 1 |

Headline Numbers

| Metric | Value |
| --- | --- |
| Import time (100k vectors) | 624.4 s (~10:24) |
| Peak QPS (ef=24) | 158.9 |
| Best recall (ef=512) | 0.9993 |
| Best mean latency (ef=24) | 48 ms |
| QPS at recall >= 0.98 | 129.8 |

Full Results

| ef | QPS | Mean latency (ms) | p99 latency (ms) | Recall@10 | NDCG@10 |
| --- | --- | --- | --- | --- | --- |
| 16 | 141.9 | 55.5 | 446.2 | 0.9140 | 0.9654 |
| 24 | 158.9 | 48.4 | 251.9 | 0.9447 | 0.9788 |
| 32 | 110.4 | 70.1 | 428.5 | 0.9594 | 0.9847 |
| 48 | 135.7 | 51.8 | 284.1 | 0.9763 | 0.9912 |
| 64 | 129.8 | 59.4 | 233.5 | 0.9846 | 0.9944 |
| 96 | 108.1 | 70.9 | 238.5 | 0.9912 | 0.9968 |
| 128 | 86.4 | 88.2 | 300.7 | 0.9933 | 0.9976 |
| 256 | 56.4 | 135.8 | 418.7 | 0.9986 | 0.9995 |
| 512 | 40.1 | 190.6 | 567.1 | 0.9993 | 0.9998 |

Observations

  • Throughput does not decrease monotonically with ef. The expected ordering would be ef=16 > 24 > 32 > 48, but QPS jumps around. The most likely cause is single-run noise: each ef sweep runs for only a few seconds, so a brief CPU contention spike can dominate the measurement.
  • p99 is roughly 3-8 times the mean, and absolute p99 (234-567 ms) is several times higher than on the medium and large tiers. This is the signature of a resource-constrained instance where concurrent queries queue behind one another on a single vCPU.
  • Import takes more than 10 minutes for 100,000 vectors, which is slow enough to matter for production data refresh patterns.
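The tail-latency observation can be checked directly from the full-results table. The sketch below computes the p99-to-mean ratio for every ef on the small tier; the values are copied from the table above and nothing else is assumed.

```python
# (ef, mean latency ms, p99 latency ms) from the small-tier full results
small = [
    (16, 55.5, 446.2), (24, 48.4, 251.9), (32, 70.1, 428.5),
    (48, 51.8, 284.1), (64, 59.4, 233.5), (96, 70.9, 238.5),
    (128, 88.2, 300.7), (256, 135.8, 418.7), (512, 190.6, 567.1),
]
ratios = {ef: p99 / mean for ef, mean, p99 in small}
for ef, ratio in ratios.items():
    print(f"ef={ef}: p99 is {ratio:.1f}x the mean")
```

The same calculation on the medium and large tables gives a like-for-like comparison of tail behavior across tiers.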

For most workloads targeting high recall, ef=64 provides the best balance:

| Metric | Value |
| --- | --- |
| Recall@10 | 0.985 |
| QPS | 130 |
| Mean latency | 59 ms |
| p99 latency | 234 ms |

If lower recall (~95%) is acceptable, ef=24 increases throughput to ~159 QPS.

Suitability

Best for development, staging, and low-traffic production (about 100 QPS). High tail latency makes it a poor fit for latency-sensitive applications.

Medium Tier (100k Vectors)

| Property | Value |
| --- | --- |
| vCPU | 2 |
| Memory | 4 GB |
| Disk | 11 GB |
| Region | TOR1 |
| Replicas | 1 |
| Shards | 1 |

Headline Numbers

| Metric | Value |
| --- | --- |
| Import time (100k vectors) | 184.5 s (~3:04) |
| Peak QPS (ef=16) | 454.9 |
| Best recall (ef=512) | 0.9996 |
| Best mean latency (ef=16) | 17 ms |
| QPS at recall >= 0.98 | 322.5 |

Full Results

| ef | QPS | Mean latency (ms) | p99 latency (ms) | Recall@10 | NDCG@10 |
| --- | --- | --- | --- | --- | --- |
| 16 | 454.9 | 17.1 | 54.2 | 0.9190 | 0.9687 |
| 24 | 330.8 | 22.1 | 111.6 | 0.9481 | 0.9799 |
| 32 | 271.0 | 25.4 | 160.2 | 0.9609 | 0.9853 |
| 48 | 414.5 | 17.6 | 90.9 | 0.9763 | 0.9913 |
| 64 | 322.5 | 23.4 | 102.2 | 0.9817 | 0.9934 |
| 96 | 337.4 | 22.7 | 99.8 | 0.9911 | 0.9968 |
| 128 | 348.4 | 21.9 | 54.4 | 0.9945 | 0.9980 |
| 256 | 267.5 | 29.0 | 60.1 | 0.9983 | 0.9994 |
| 512 | 178.6 | 43.2 | 95.4 | 0.9996 | 0.9999 |

Observations

  • The QPS curve is strongly non-monotonic, with throughput rising at ef=48, ef=96, and ef=128 even though each step does more work per query. This is not algorithmically plausible. The most likely cause is noisy-neighbor variance on shared infrastructure during the brief window each ef runs.
  • Tail latency ratios are healthier than the small tier (roughly 2-6 times the mean), which suggests this tier has the headroom to absorb concurrent load.
  • Import is about 3.4 times faster than the small tier on the same workload.
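One way to make the non-monotonic behavior concrete is to list every step in the sweep where QPS went up as ef increased. The sketch below does that for the medium-tier rows; the helper name is hypothetical and the data is copied from the full-results table.

```python
# (ef, QPS) from the medium-tier full results
medium = [(16, 454.9), (24, 330.8), (32, 271.0), (48, 414.5), (64, 322.5),
          (96, 337.4), (128, 348.4), (256, 267.5), (512, 178.6)]

def inversions(points):
    """Return (ef_prev, ef_next) pairs where QPS rose as ef increased."""
    return [(a[0], b[0]) for a, b in zip(points, points[1:]) if b[1] > a[1]]

print(inversions(medium))  # prints [(32, 48), (64, 96), (96, 128)]
```

Running the same check after a longer re-run is a simple way to see whether the curve has stabilized.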

For recall >= 0.98, ef=64 offers a strong balance:

| Metric | Value |
| --- | --- |
| Recall@10 | 0.982 |
| QPS | 322 |
| Mean latency | 23 ms |
| p99 latency | 102 ms |

If a recall target of 0.99 or higher is required, ef=128 gave 348 QPS at 22 ms mean and 54 ms p99 in this run (Recall@10 of 0.9945, just short of 0.995). The result is anomalously good, since it beats the ef=64 row on every metric, and should be re-validated with a longer query duration before being relied on.

Suitability

The medium tier fits moderate-traffic production workloads with sustained throughput in the low hundreds of QPS and p99 latency around 100 ms (54 to 160 ms across this sweep). The unstable QPS curve in this single run means absolute numbers should be re-validated with longer duration sweeps before being quoted in external materials.

Large Tier (100k Vectors)

| Property | Value |
| --- | --- |
| vCPU | 8 |
| Memory | 32 GB |
| Disk | 230 GB |
| Region | TOR1 |
| Replicas | 1 |
| Shards | 1 |

Headline Numbers

| Metric | Value |
| --- | --- |
| Import time (100k vectors) | 79.3 s (~1:19) |
| Peak QPS (ef=24) | 1304 |
| Best recall (ef=512) | 0.9994 |
| Best mean latency (ef=24) | 6.1 ms |
| QPS at recall >= 0.98 | 1109 |

Full Results

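| ef | QPS | Mean latency (ms) | p99 latency (ms) | Recall@10 | NDCG@10 |
| --- | --- | --- | --- | --- | --- |
| 16 | 1042 | 7.5 | 30.6 | 0.9201 | 0.9681 |
| 24 | 1304 | 6.1 | 13.9 | 0.9455 | 0.9786 |
| 32 | 1052 | 7.4 | 17.9 | 0.9581 | 0.9842 |
| 48 | 1025 | 7.6 | 15.3 | 0.9756 | 0.9909 |
| 64 | 1109 | 7.1 | 15.4 | 0.9828 | 0.9937 |
| 96 | 956 | 8.1 | 16.8 | 0.9888 | 0.9959 |
| 128 | 919 | 8.4 | 17.3 | 0.9936 | 0.9977 |
| 256 | 583 | 13.0 | 27.2 | 0.9974 | 0.9991 |
| 512 | 357 | 21.6 | 49.9 | 0.9994 | 0.9998 |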

Observations

  • The QPS curve trends downward as ef grows, with two exceptions: a bump at ef=24 (plausibly a warmup effect, since ef=16 is the first measurement window) and a smaller one at ef=64.
  • p99 latency stays at or below about 31 ms for ef <= 256 and under 50 ms at ef=512. Excluding the first window (ef=16, about 4 times), the p99-to-mean ratio is consistently 2-2.5 times, indicating available CPU and memory headroom.
  • Import takes 79 seconds for 100,000 vectors, which is fast enough to support frequent re-indexing or larger datasets.

For recall >= 0.98, run with ef=64:

| Metric | Value |
| --- | --- |
| Recall@10 | 0.983 |
| QPS | 1109 |
| Mean latency | 7 ms |
| p99 latency | 15 ms |

For recall >= 0.99, step up to ef=128:

| Metric | Value |
| --- | --- |
| Recall@10 | 0.994 |
| QPS | 919 |
| Mean latency | 8 ms |
| p99 latency | 17 ms |

Suitability

On the 100,000-vector workload in this benchmark, the large tier’s per-ef table shows higher QPS and lower mean and p99 latencies than the small and medium tiers for most settings. At recall >= 0.98 with ef=64, the operating-point table lists 1,109 QPS, 7 ms mean, and 15 ms p99, with 79 s import time for that dataset. The 1,000,000-vector section below uses the same large-tier SKU for a follow-up run.

Tier Comparison

At recall >= 0.98 (ef=64):

| Tier | QPS | Mean latency | p99 latency | Import time |
| --- | --- | --- | --- | --- |
| Small | 130 | 59 ms | 234 ms | 624 s |
| Medium | 322 | 23 ms | 102 ms | 184 s |
| Large | 1109 | 7 ms | 15 ms | 79 s |

In the tier comparison table above, each larger tier shows higher QPS, lower mean and p99 latency, and shorter import time than the tier below it. The largest absolute p99 drop is between small and medium (234 ms to 102 ms); the largest relative gains in both QPS (about 3.4 times) and p99 (about 6.8 times lower) are between medium and large.
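The step-to-step improvement factors can be derived from the tier comparison table with a few divisions. The sketch below uses the rounded ef=64 values from that table; the dictionary layout and function name are hypothetical.

```python
# ef=64 operating points from the tier comparison table
tiers = {
    "small": {"qps": 130, "p99_ms": 234, "import_s": 624},
    "medium": {"qps": 322, "p99_ms": 102, "import_s": 184},
    "large": {"qps": 1109, "p99_ms": 15, "import_s": 79},
}

def step(lower, upper):
    """Improvement factors when moving from the lower tier to the upper tier."""
    a, b = tiers[lower], tiers[upper]
    return {
        "qps_gain": b["qps"] / a["qps"],
        "p99_reduction": a["p99_ms"] / b["p99_ms"],
        "import_speedup": a["import_s"] / b["import_s"],
    }

for lower, upper in [("small", "medium"), ("medium", "large")]:
    factors = step(lower, upper)
    print(lower, "->", upper, {k: round(v, 2) for k, v in factors.items()})
```

Since every input comes from a single short run, these factors are as directional as the table they are computed from.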

Iso-Recall Comparison

For each tier, the best operating point at common recall targets:

Recall ~0.95 (ef=24 reaches 0.945 to 0.948, just under 0.95)

| Tier | Best ef | QPS | Mean (ms) | p99 (ms) |
| --- | --- | --- | --- | --- |
| Small | 24 | 159 | 48 | 252 |
| Medium | 24 | 331 | 22 | 112 |
| Large | 24 | 1304 | 6 | 14 |

Recall >= 0.98

| Tier | Best ef | QPS | Mean (ms) | p99 (ms) |
| --- | --- | --- | --- | --- |
| Small | 64 | 130 | 59 | 234 |
| Medium | 64 | 322 | 23 | 102 |
| Large | 64 | 1109 | 7 | 15 |

Recall >= 0.99

| Tier | Best ef | QPS | Mean (ms) | p99 (ms) |
| --- | --- | --- | --- | --- |
| Small | 96 | 108 | 71 | 238 |
| Medium | 128 | 348 | 22 | 54 |
| Large | 128 | 919 | 8 | 17 |

Recall >= 0.998

| Tier | Best ef | QPS | Mean (ms) | p99 (ms) |
| --- | --- | --- | --- | --- |
| Small | 256 | 56 | 136 | 419 |
| Medium | 256 | 268 | 29 | 60 |
| Large | 512 | 357 | 22 | 50 |
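Choosing an operating point for a recall target amounts to filtering the per-ef rows by recall and taking the highest-QPS survivor. The sketch below shows that selection on the large-tier 100k rows; the function name is hypothetical, and the rule (maximize QPS subject to a recall floor) is an assumption about how the tables in this section were built.

```python
# (ef, QPS, Recall@10) from the large-tier 100k full results
large = [(16, 1042, 0.9201), (24, 1304, 0.9455), (32, 1052, 0.9581),
         (48, 1025, 0.9756), (64, 1109, 0.9828), (96, 956, 0.9888),
         (128, 919, 0.9936), (256, 583, 0.9974), (512, 357, 0.9994)]

def best_operating_point(rows, target_recall):
    """Highest-QPS row whose recall meets the target, or None if none does."""
    eligible = [row for row in rows if row[2] >= target_recall]
    return max(eligible, key=lambda row: row[1]) if eligible else None

for target in (0.98, 0.99):
    print(target, best_operating_point(large, target))
```

With noisy single-run data, this rule can pick a higher ef than expected when a later row happens to post a better QPS, so it is worth re-checking the choice after a longer sweep.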

When to Pick Each Tier

| Use case | Tier | Rationale |
| --- | --- | --- |
| Development, staging, demos | Small | Same benchmark setup as the other tiers (HNSW, ef sweep, ground truth). In these runs, headline import for 100k vectors exceeded 10 minutes and p99 latency was often several times the mean in the full results, which points to a resource-constrained node and more tail variance than on medium or large. |
| Low-traffic or internal production | Medium | At the recall >= 0.98 operating point in this run (ef=64), mean latency was 23 ms and p99 was 102 ms (see Medium tier tables). QPS versus ef was not monotonic across the short sweeps, so treat headline QPS as indicative until you rerun with longer query windows. |
| Latency-sensitive applications | Large | Full results show lower mean and p99 latencies across most ef values than in the small- and medium-tier tables for this workload. At recall >= 0.98 with ef=64, this run reported 7 ms mean and 15 ms p99 (see Large tier tables). |
| Multi-million vector datasets | Large | Shorter headline import on the 100k workload in this benchmark than on small or medium (see tier comparison import row). This doc also includes a follow-up run on the same large-tier SKU with about 990k vectors after filtering. |

Large Tier (1,000,000 Vectors)

This run uses the 1,000,000-vector variant dbpedia-openai-1000k-angular.hdf5 (about 990,000 vectors after filtering) on the same large-tier hardware. It is excluded from the cross-tier comparison above because the dataset differs.

Headline Numbers

| Metric | Value |
| --- | --- |
| Import time (990k vectors) | 944.3 s (~15:44) |
| Peak QPS (ef=24) | 1090 |
| Best recall (ef=512) | 0.997 |
| Best mean latency (ef=24) | 7.2 ms |
| QPS at recall >= 0.98 | 873 |

Full Results

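| ef | QPS | Mean latency (ms) | p99 latency (ms) | Recall@10 | NDCG@10 |
| --- | --- | --- | --- | --- | --- |
| 16 | 539 | 14.4 | 111.3 | 0.8878 | 0.9537 |
| 24 | 1090 | 7.2 | 12.0 | 0.9241 | 0.9698 |
| 32 | 845 | 9.3 | 20.8 | 0.9430 | 0.9778 |
| 48 | 922 | 8.6 | 15.4 | 0.9628 | 0.9861 |
| 64 | 815 | 9.7 | 20.0 | 0.9721 | 0.9896 |
| 96 | 873 | 9.1 | 16.6 | 0.9817 | 0.9933 |
| 128 | 763 | 10.4 | 16.8 | 0.9866 | 0.9952 |
| 256 | 551 | 14.4 | 28.8 | 0.9935 | 0.9977 |
| 512 | 446 | 17.7 | 31.7 | 0.9969 | 0.9989 |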

Observations

  • ef=16 is anomalously slow. The most likely cause is cold-cache effects on the very first iteration of the sweep. Treat that row with caution.
  • From ef=24 onward the curve decreases roughly as expected with minor jitter, all within about 10-15%.
  • Recall is lower at the same ef than on the 100,000-vector workload, which is expected. With about 10 times the vectors, the HNSW graph is larger, so the same ef explores a smaller fraction of it. To match a recall target on the 1,000,000-vector dataset, increase ef accordingly.
  • p99 stays under 32 ms across the full sweep (excluding the anomalous ef=16), well within typical interactive-query SLAs for a million-vector index.
  • Import time scaled slightly worse than linearly: 944 seconds for about 990,000 vectors versus 79 seconds for 100,000, or about 11.9 times the time for about 9.9 times the vectors. This is consistent with HNSW build cost growing super-linearly with vector count.
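The import-scaling observation can be quantified from the two headline import times. The sketch below computes the size and time ratios, the ingest rate for each run, and the exponent p implied by fitting time ~ n**p through the two points. Two single-run points are a thin basis for an exponent, so treat it as indicative only; the 990,000 count is the approximate post-filtering figure quoted above.

```python
import math

# (vectors, import seconds) from the large-tier headline numbers
runs = {"100k": (100_000, 79.3), "1M": (990_000, 944.3)}

(n_small, t_small), (n_large, t_large) = runs["100k"], runs["1M"]
size_ratio = n_large / n_small
time_ratio = t_large / t_small
# Exponent p in time ~ n**p implied by the two points; p > 1 means super-linear.
exponent = math.log(time_ratio) / math.log(size_ratio)

print(f"vectors x{size_ratio:.1f}, import time x{time_ratio:.1f}, exponent {exponent:.2f}")
for name, (n, t) in runs.items():
    print(f"{name}: {n / t:.0f} vectors/s")
```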

Key Insight

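Increasing dataset size by about 10 times reduced throughput by about 21% at recall >= 0.98 (1,109 to 873 QPS) and by about 40% at recall >= 0.99 (919 to 551 QPS), while p99 stayed under 30 ms at both operating points, indicating relatively stable latency on the large tier.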

For recall >= 0.98, run with ef=96:

| Metric | Value |
| --- | --- |
| Recall@10 | 0.982 |
| QPS | 873 |
| Mean latency | 9 ms |
| p99 latency | 17 ms |

For recall >= 0.99, use ef=256:

| Metric | Value |
| --- | --- |
| Recall@10 | 0.994 |
| QPS | 551 |
| Mean latency | 14 ms |
| p99 latency | 29 ms |

Same Hardware, Different Scale

Side-by-side, large tier, both datasets, at recall >= 0.98:

| Dataset | ef | QPS | Mean (ms) | p99 (ms) | Recall | Import |
| --- | --- | --- | --- | --- | --- | --- |
| 100k | 64 | 1109 | 7 | 15 | 0.983 | 79 s |
| 1M | 96 | 873 | 9 | 17 | 0.982 | 944 s |
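The side-by-side rows can be turned into percentage changes with a one-line formula. The sketch below uses the unrounded ef=64 (100k) and ef=96 (1M) values from the two large-tier full-results tables; the function name is hypothetical.

```python
def percent_change(before, after):
    """Signed percentage change from before to after."""
    return 100.0 * (after - before) / before

# (100k value, 1M value) on the large tier at recall >= 0.98
at_098 = {"qps": (1109, 873), "mean_ms": (7.1, 9.1), "p99_ms": (15.4, 16.6)}
for metric, (small_set, large_set) in at_098.items():
    print(f"{metric}: {percent_change(small_set, large_set):+.1f}%")
```

Note that the two rows use different ef values, so this compares operating points at matched recall rather than identical configurations.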

Notes on Results

  • Single-run measurements: Each tier was benchmarked once. Per-ef query sweeps run for only 2-4 seconds, so they are vulnerable to transient noise. The clearest example is the medium tier’s non-monotonic QPS curve. Re-run with --queryDuration 60 or longer per ef value to get statistically stable numbers before quoting any of these results in customer-facing material.
  • Noisy-neighbor variance: Managed multi-tenant infrastructure is subject to co-tenant interference. The unstable medium and small results may understate the steady-state performance of those tiers.
  • No memory metrics: Weaviate’s Prometheus endpoint is not exposed on managed clusters during preview, so heap and resident-set-size measurements are not available.
  • Single dataset and dimensionality: All measurements are on 100,000 vectors at 1,536 dimensions (plus the 1,000,000-vector run on the large tier). A smaller-dimension dataset would round out the picture.
  • Network latency floor: Traffic flows through the managed-cluster load balancer in TOR1, so absolute latencies include several milliseconds of networking overhead that would not be present in an in-process or same-VPC benchmark.
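When the sweeps are repeated, a median plus a simple spread figure per ef value is one way to report results that is less sensitive to a single noisy window. The sketch below is a minimal illustration; the function name is hypothetical and the input numbers are round placeholder values, not measurements from any tier.

```python
import statistics

def summarize(qps_runs):
    """Median and relative spread (max minus min, over median) across runs."""
    median = statistics.median(qps_runs)
    spread = (max(qps_runs) - min(qps_runs)) / median
    return {"median_qps": median, "relative_spread": spread}

# Placeholder values for illustration only; these are NOT measured results.
hypothetical_runs = {64: [100.0, 110.0, 90.0, 105.0, 95.0]}
for ef, runs in hypothetical_runs.items():
    print(ef, summarize(runs))
```

A relative spread that stays small across repeated runs would be a reasonable signal that a number is stable enough to quote.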
