Weaviate Benchmarks

Validated on 27 Apr 2026 • Last edited on 27 Apr 2026

DigitalOcean Managed Weaviate is a fully managed Weaviate vector database for retrieval-augmented generation, semantic search, and similarity-based AI workloads. Clusters are provisioned, secured, backed up, and patched by DigitalOcean.

These benchmarks compare the small, medium, and large Managed Weaviate tiers on a 100,000-vector approximate nearest-neighbor (ANN) workload, with a separate run on a 1,000,000-vector workload for the large tier.

Note
The numbers below come from a single run per tier. Per-ef query sweeps are short (2-4 seconds), so absolute QPS and latency values can be skewed by transient infrastructure noise. Treat these results as indicative for tier selection rather than as a vendor performance guarantee. Re-run with longer query durations before quoting any specific number externally.

Methodology

What These Benchmarks Measure

Weaviate’s core operation is approximate nearest-neighbor search: given a query vector, return the k most similar vectors from the indexed set. Three metrics matter:

  • Recall@10: Of the 10 results returned, what fraction are actually in the true top 10. 1.0 means perfect agreement with brute-force search.
  • Throughput (QPS): Queries per second the cluster can sustain under concurrent load.
  • Latency: Wall-clock time per query, reported as the mean and the 99th percentile (p99).
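Recall@10 is straightforward to compute once exact ground truth is available: it is the overlap between the returned top 10 and the true top 10. A minimal sketch (the function name is illustrative, not part of the benchmark tool):

```python
def recall_at_k(returned_ids, true_ids, k=10):
    """Fraction of the true top-k that appears in the returned top-k."""
    returned = set(returned_ids[:k])
    truth = set(true_ids[:k])
    return len(returned & truth) / k

# Example: the ANN index returned 9 of the true top 10.
true_top10 = list(range(10))
ann_top10 = [0, 1, 2, 3, 4, 5, 6, 7, 8, 42]
print(recall_at_k(ann_top10, true_top10))  # 0.9
```

NDCG@10 additionally weights results by rank position, so it penalizes a miss at rank 1 more than a miss at rank 10; that is why the NDCG values in the tables below run slightly higher than the recall values.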

There is a fundamental tradeoff: the same index can be tuned to favor recall or speed, but not both at the same time.

The HNSW Index

Weaviate’s default vector index is HNSW (Hierarchical Navigable Small World graph). Its behavior is controlled by two kinds of parameters:

  • Build-time parameters, set when the index is created:
    • efConstruction: How thoroughly the graph is built. Higher values mean slower indexing and a higher recall ceiling. These benchmarks use 256 in all runs.
    • maxConnections (M): How many neighbors each graph node retains. Higher values mean more memory and slightly better recall. These benchmarks use 16 in all runs.
  • Query-time parameter, swept across each benchmark:
    • ef: Search depth at query time. Higher ef explores more of the graph per query, which increases recall and latency and decreases QPS.

By holding efConstruction and maxConnections fixed and sweeping ef, each tier traces a recall-versus-throughput curve.
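Assuming the Weaviate Python client (v4), this split maps onto the API directly: build-time parameters are set when the collection is created, while ef is mutable and can be changed without re-importing. A sketch under that assumption, not the benchmark tool's own code, and requiring a live cluster to run (exact parameter names can vary by client version):

```python
import weaviate
from weaviate.classes.config import Configure, Reconfigure, VectorDistances

# Connect to a cluster (connect_to_local is a placeholder for the real endpoint/auth).
client = weaviate.connect_to_local()

# Build-time HNSW parameters, fixed across all runs in these benchmarks.
client.collections.create(
    "Vector",
    vector_index_config=Configure.VectorIndex.hnsw(
        ef_construction=256,   # graph build thoroughness
        max_connections=16,    # neighbors retained per node (M)
        distance_metric=VectorDistances.COSINE,
    ),
)

# Query-time ef is mutable: sweep it between query runs without rebuilding the index.
vectors = client.collections.get("Vector")
vectors.config.update(vector_index_config=Reconfigure.VectorIndex.hnsw(ef=64))
```

This is why a single import per tier suffices for the whole sweep: only the query-time knob changes between ef values.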

The dataset

Property Value
Name dbpedia-openai-100k-angular.hdf5
Source ann-benchmarks.com
Vectors 100,000
Dimensions 1,536
Embedding model OpenAI text-embedding-ada-002
Distance metric Cosine (angular)
Test queries 1,000 (973 used after filtering)
Ground truth Pre-computed exact top-100 neighbors per query

Recall numbers from this dataset should generalize reasonably well to comparable semantic-search workloads.
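The pre-computed ground truth is just an exact brute-force pass under the same cosine (angular) distance. A minimal numpy sketch of how such ground truth is produced (names and data are illustrative):

```python
import numpy as np

def exact_top_k(queries, corpus, k=10):
    """Brute-force cosine-distance top-k: the ground truth an ANN index is scored against."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    # Cosine distance = 1 - cosine similarity; smaller is closer.
    dist = 1.0 - q @ c.T
    return np.argsort(dist, axis=1)[:, :k]

rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 64))
# Slightly perturbed copies of the first five corpus vectors as queries.
queries = corpus[:5] + 0.01 * rng.normal(size=(5, 64))
top = exact_top_k(queries, corpus, k=10)
print(top[:, 0])  # each query's nearest neighbor is its source vector: [0 1 2 3 4]
```

At 100,000 vectors this exact pass is feasible offline, which is what makes the recall scoring possible; at query time it would be far too slow, which is why the ANN index exists.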

The benchmark tool

These benchmarks use the open-source weaviate-benchmarking tool with the ann-benchmark subcommand. It performs:

  1. Schema setup: Drops and recreates the Vector class with the configured HNSW parameters.
  2. Ingest: Loads all vectors over gRPC using a producer-consumer pipeline with 8 worker goroutines and 100-vector batches.
  3. Quiesce: Waits 30 seconds after ingest so the index can settle.
  4. Query sweep: For each ef in {16, 24, 32, 48, 64, 96, 128, 256, 512}, runs the full test query set with 8 concurrent workers and records latency for every query.
  5. Recall computation: Compares returned neighbors against the ground truth.
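The ingest stage (step 2) batches vectors and fans them out to parallel workers. The real tool is Go and speaks gRPC; a simplified Python sketch of the same producer-consumer shape, with insert_batch as a stand-in for the real batch insert:

```python
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 100   # vectors per batch, as in these benchmark runs
WORKERS = 8        # parallel ingest workers, matching --parallel 8

def insert_batch(batch):
    # Stand-in for the real gRPC batch insert; returns the count acknowledged.
    return len(batch)

def ingest(vectors):
    """Split the dataset into fixed-size batches and insert them concurrently."""
    batches = [vectors[i:i + BATCH_SIZE] for i in range(0, len(vectors), BATCH_SIZE)]
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        return sum(pool.map(insert_batch, batches))

print(ingest(list(range(100_000))))  # 100000 vectors acknowledged
```

With 100-vector batches, the 100,000-vector workload is 1,000 batch inserts spread across 8 workers; the import times in the per-tier tables below are dominated by server-side HNSW graph construction, not by this client-side pipeline.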

Test environment

  • Load generator: A separate VM running the benchmarker CLI in the same region as the target cluster.
  • Network: Traffic flows through the managed-cluster load balancer (TLS, gRPC over HTTPS on port 443). This adds non-trivial latency compared with localhost benchmarks. Keep this in mind when comparing absolute numbers to leaderboards that run in-process.
  • Concurrency: --parallel 8 for both ingest and query phases.
  • Top-k: --limit 10. Recall and NDCG are reported at 10.

Small Tier (100k vectors)

Property Value
vCPU 1
Memory 2 GB
Disk 3 GB
Region TOR1
Replicas 1
Shards 1

Headline numbers

Metric Value
Import time (100k vectors) 624.4 s (~10:24)
Peak QPS (ef=24) 158.9
Best recall (ef=512) 0.9993
Best mean latency (ef=24) 48 ms
QPS at recall >= 0.98 129.8

Full results

ef QPS Mean latency (ms) p99 latency (ms) Recall@10 NDCG@10
16 141.9 55.5 446.2 0.9140 0.9654
24 158.9 48.4 251.9 0.9447 0.9788
32 110.4 70.1 428.5 0.9594 0.9847
48 135.7 51.8 284.1 0.9763 0.9912
64 129.8 59.4 233.5 0.9846 0.9944
96 108.1 70.9 238.5 0.9912 0.9968
128 86.4 88.2 300.7 0.9933 0.9976
256 56.4 135.8 418.7 0.9986 0.9995
512 40.1 190.6 567.1 0.9993 0.9998

Observations

  • Throughput does not decrease monotonically with ef. The expected ordering would be ef=16 > 24 > 32 > 48, but QPS jumps around. The most likely cause is single-run noise: each ef sweep runs for only a few seconds, so a brief CPU contention spike can dominate the measurement.
  • p99 is consistently 3 to 8 times the mean, much higher than on the medium and large tiers. This is the signature of a resource-constrained instance where slow queries queue behind fast ones.
  • Import takes more than 10 minutes for 100,000 vectors, which is slow enough to matter for production data refresh patterns.
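The mean and p99 columns in these tables are aggregated from the per-query latency samples recorded in step 4 of the methodology. With the raw samples, the same statistics can be recomputed; a sketch with synthetic heavy-tailed data (numpy's default percentile interpolation may differ slightly from the benchmarker's own aggregation):

```python
import numpy as np

def latency_stats(samples_ms):
    """Mean and 99th-percentile latency from per-query samples."""
    a = np.asarray(samples_ms)
    return {"mean": a.mean(), "p99": np.percentile(a, 99)}

# Synthetic heavy-tailed run: most queries fast, 2% stuck queuing behind slow ones.
rng = np.random.default_rng(1)
samples = np.concatenate([rng.uniform(30, 60, 980), rng.uniform(200, 500, 20)])
stats = latency_stats(samples)
print(stats["p99"] > 4 * stats["mean"])  # True: the tail dominates p99
```

A small fraction of queued slow queries barely moves the mean but sets p99 entirely, which is why the small tier's p99-to-mean ratio is the clearest symptom of its resource constraint.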

For a target of recall >= 0.98, run with ef=64:

Metric Value
Recall@10 0.985
QPS 130
Mean latency 59 ms
p99 latency 234 ms

If the application can tolerate ~95% recall in exchange for higher throughput, drop to ef=24 for ~159 QPS at 48 ms mean latency.

Suitability

The small tier is appropriate for development, staging, and low-traffic production workloads (up to about 100 sustained QPS). Its high tail latency makes it a poor choice for latency-sensitive interactive applications. For predictable sub-100 ms p99, use the medium or large tier.

Medium Tier (100k vectors)

Property Value
vCPU 2
Memory 4 GB
Disk 11 GB
Region TOR1
Replicas 1
Shards 1

Headline numbers

Metric Value
Import time (100k vectors) 184.5 s (~3:04)
Peak QPS (ef=16) 454.9
Best recall (ef=512) 0.9996
Best mean latency (ef=16) 17 ms
QPS at recall >= 0.98 322.5

Full results

ef QPS Mean latency (ms) p99 latency (ms) Recall@10 NDCG@10
16 454.9 17.1 54.2 0.9190 0.9687
24 330.8 22.1 111.6 0.9481 0.9799
32 271.0 25.4 160.2 0.9609 0.9853
48 414.5 17.6 90.9 0.9763 0.9913
64 322.5 23.4 102.2 0.9817 0.9934
96 337.4 22.7 99.8 0.9911 0.9968
128 348.4 21.9 54.4 0.9945 0.9980
256 267.5 29.0 60.1 0.9983 0.9994
512 178.6 43.2 95.4 0.9996 0.9999

Observations

  • The QPS curve is strongly non-monotonic, with throughput rising at ef=48 and ef=128 even though more work is being done per query. This is not algorithmically plausible; the most likely cause is noisy-neighbor variance on shared infrastructure during each ef value's brief measurement window.
  • Tail latency ratios are healthier than the small tier (typically 2-7 times the mean), which suggests this tier has the headroom to absorb concurrent load.
  • Import is about 3.4 times faster than the small tier on the same workload.

For recall >= 0.98, run with ef=64:

Metric Value
Recall@10 0.982
QPS 322
Mean latency 23 ms
p99 latency 102 ms

If a 99.5%+ recall target is required, ef=128 gave 348 QPS at 22 ms mean and 54 ms p99 in this run. The result is anomalously good and should be re-validated with a longer query duration before being relied on.

Suitability

The medium tier fits moderate-traffic production workloads with sustained throughput in the low hundreds of QPS at sub-100 ms p99. The unstable QPS curve in this single run means absolute numbers should be re-validated with longer duration sweeps before being quoted in external materials.

Large Tier (100k vectors)

Property Value
vCPU 8
Memory 32 GB
Disk 230 GB
Region TOR1
Replicas 1
Shards 1

Headline numbers

Metric Value
Import time (100k vectors) 79.3 s (~1:19)
Peak QPS (ef=24) 1304
Best recall (ef=512) 0.9994
Best mean latency (ef=24) 6.1 ms
QPS at recall >= 0.98 1109

Full results

ef QPS Mean latency (ms) p99 latency (ms) Recall@10 NDCG@10
16 1042 7.5 30.6 0.9201 0.9681
24 1304 6.1 13.9 0.9455 0.9786
32 1052 7.4 17.9 0.9581 0.9842
48 1025 7.6 15.3 0.9756 0.9909
64 1109 7.1 15.4 0.9828 0.9937
96 956 8.1 16.8 0.9888 0.9959
128 919 8.4 17.3 0.9936 0.9977
256 583 13.0 27.2 0.9974 0.9991
512 357 21.6 49.9 0.9994 0.9998

Observations

  • Apart from a small QPS bump at ef=24 (most likely because warmup effects depress the very first measurement window, at ef=16), the QPS curve is well-behaved and decreases monotonically.
  • p99 stays under 30 ms for ef <= 128 and under 50 ms even at ef=512. The p99-to-mean ratio is consistently 2-3 times, which indicates plenty of CPU and memory headroom.
  • Import takes 79 seconds for 100,000 vectors, which is fast enough to support frequent re-indexing or larger datasets.

For recall >= 0.98, run with ef=64:

Metric Value
Recall@10 0.983
QPS 1109
Mean latency 7 ms
p99 latency 15 ms

For recall >= 0.99, step up to ef=128:

Metric Value
Recall@10 0.994
QPS 919
Mean latency 8 ms
p99 latency 17 ms

Suitability

The large tier handles production semantic-search workloads at scale: 1,000+ QPS with double-digit-millisecond p99 at high recall is suitable for latency-sensitive interactive applications such as chat, search, and recommendations. For 100,000-vector workloads the cluster shows substantial CPU and memory headroom, indicating that significantly larger datasets are viable on the same tier.

Tier Comparison

At the recall >= 0.98 operating point (ef=64) on the 100,000-vector workload:

Tier QPS Mean latency p99 latency Import time
Small 130 59 ms 234 ms 624 s
Medium 322 23 ms 102 ms 184 s
Large 1109 7 ms 15 ms 79 s

The large tier delivered about 8.5 times the throughput of the small tier and about 3.4 times the throughput of the medium tier, with dramatically lower tail latency.

Iso-recall comparison

For each tier, the best operating point at common recall targets:

Recall >= 0.95

Tier Best ef QPS Mean (ms) p99 (ms)
Small 24 159 48 252
Medium 24 331 22 112
Large 24 1304 6 14

Recall >= 0.98

Tier Best ef QPS Mean (ms) p99 (ms)
Small 64 130 59 234
Medium 64 322 23 102
Large 64 1109 7 15

Recall >= 0.99

Tier Best ef QPS Mean (ms) p99 (ms)
Small 96 108 71 238
Medium 128 348 22 54
Large 128 919 8 17

Recall >= 0.999

Tier Best ef QPS Mean (ms) p99 (ms)
Small 512 40 191 567
Medium 512 179 43 95
Large 512 357 22 50
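The iso-recall tables above all follow the same selection rule: for a given recall floor, take the qualifying ef value with the highest QPS. With each tier's per-ef results as (ef, qps, recall) tuples, that rule can be sketched as (illustrative helper; data is the small-tier results table from above):

```python
SMALL_TIER = [  # (ef, qps, recall@10) from the small-tier full results
    (16, 141.9, 0.9140), (24, 158.9, 0.9447), (32, 110.4, 0.9594),
    (48, 135.7, 0.9763), (64, 129.8, 0.9846), (96, 108.1, 0.9912),
    (128, 86.4, 0.9933), (256, 56.4, 0.9986), (512, 40.1, 0.9993),
]

def best_operating_point(results, min_recall):
    """Highest-QPS row whose recall meets the floor; None if the floor is unreachable."""
    qualifying = [r for r in results if r[2] >= min_recall]
    return max(qualifying, key=lambda r: r[1]) if qualifying else None

print(best_operating_point(SMALL_TIER, 0.98))   # (64, 129.8, 0.9846)
print(best_operating_point(SMALL_TIER, 0.999))  # (512, 40.1, 0.9993)
```

Note that because the measured QPS curves are non-monotonic, the best operating point is not always the lowest qualifying ef (the medium tier's ef=128 beating ef=96 at the 0.99 floor is one example), which is why selecting on measured QPS rather than on ef matters.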

When to pick which tier

Use case Recommended tier Rationale
Development, staging, demos Small Cheapest tier, functionally complete. Latency variability is acceptable for non-production traffic.
Internal tools, batch jobs, low-traffic production Medium About 2.5 times the throughput of small at sub-100 ms p99 for the recommended operating point. Adequate headroom for traffic spikes.
Latency-sensitive production: chat, search UI, recommendations Large Sub-20 ms p99 at 99% recall and sustained 1000+ QPS. The only tested tier that delivered tight tail-latency guarantees.
Datasets significantly larger than 100,000 vectors Large Small and medium import times suggest those tiers would struggle with multi-million-vector indexing windows.

Bonus: Large Tier on a 1,000,000-Vector Workload

This run uses the 1,000,000-vector variant dbpedia-openai-1000k-angular.hdf5 (about 990,000 vectors after filtering) on the same large-tier hardware. It is excluded from the cross-tier comparison above because the dataset differs.

Headline numbers

Metric Value
Import time (990k vectors) 944.3 s (~15:44)
Peak QPS (ef=24) 1090
Best recall (ef=512) 0.997
Best mean latency (ef=24) 7.2 ms
QPS at recall >= 0.98 873

Full results

ef QPS Mean latency (ms) p99 latency (ms) Recall@10 NDCG@10
16 539 14.4 111.3 0.8878 0.9537
24 1090 7.2 12.0 0.9241 0.9698
32 845 9.3 20.8 0.9430 0.9778
48 922 8.6 15.4 0.9628 0.9861
64 815 9.7 20.0 0.9721 0.9896
96 873 9.1 16.6 0.9817 0.9933
128 763 10.4 16.8 0.9866 0.9952
256 551 14.4 28.8 0.9935 0.9977
512 446 17.7 31.7 0.9969 0.9989

Observations

  • ef=16 is anomalously slow. The most likely cause is cold-cache effects on the very first iteration of the sweep. Treat that row with caution.
  • From ef=24 onward the curve decreases roughly as expected with minor jitter, all within ~10-15%.
  • Recall is lower at the same ef than on the 100,000-vector workload, which is expected. With 10 times the vectors, the graph is 10 times larger, so the same ef explores a smaller fraction of it. To match a recall target on the 1,000,000-vector dataset, increase ef accordingly.
  • p99 stays under 32 ms across the full sweep (excluding the anomalous ef=16), well within typical interactive-query SLAs for a million-vector index.
  • Import time scaled slightly worse than linearly: 944 seconds for 1,000,000 vectors versus 79 seconds for 100,000, about 11.9 times the time for 10 times the vectors. HNSW build cost grows super-linearly with vector count.

For recall >= 0.98, run with ef=96:

Metric Value
Recall@10 0.982
QPS 873
Mean latency 9 ms
p99 latency 17 ms

For recall >= 0.99, use ef=256:

Metric Value
Recall@10 0.994
QPS 551
Mean latency 14 ms
p99 latency 29 ms

Same hardware, different scale

Side-by-side, large tier, both datasets, at recall >= 0.98:

Dataset ef QPS Mean (ms) p99 (ms) Recall Import
100k 64 1109 7 15 0.983 79 s
1M 96 873 9 17 0.982 944 s

A 10 times increase in dataset size reduced throughput at comparable recall by roughly 20% (1109 to 873 QPS), which is a graceful degradation curve.

Caveats

  1. Single-run measurements. Each tier was benchmarked once. Per-ef query sweeps run for only 2-4 seconds, so they are vulnerable to transient noise. The clearest example is the medium tier’s non-monotonic QPS curve. Re-run with --queryDuration 60 or longer per ef value to get statistically stable numbers before quoting any of these results in customer-facing material.
  2. Noisy-neighbor variance. Managed multi-tenant infrastructure is subject to co-tenant interference. The unstable medium and small results may understate the steady-state performance of those tiers.
  3. No memory metrics. Weaviate’s Prometheus endpoint is not exposed on managed clusters during preview, so heap and resident-set-size measurements are not available.
  4. Single dataset, single dimensionality. All measurements are on 100,000 vectors at 1,536 dimensions (plus the bonus 1,000,000-vector run on the large tier). A smaller-dimension dataset would round out the picture.
  5. Network latency floor. Traffic flows through the managed-cluster load balancer in TOR1, so absolute latencies include several milliseconds of networking overhead that would not be present in an in-process or same-VPC benchmark.
