Advanced Vector Workloads with pgvectorscale and Hybrid Search

Validated on 27 Apr 2026 • Last edited on 27 Apr 2026

Vector search on DigitalOcean Managed PostgreSQL runs on the same managed PostgreSQL engine available under Managed Databases, with the pgvector and pgvectorscale extensions for storing and querying vector embeddings alongside relational data.

This guide covers workloads that outgrow the pgvector defaults. It shows when to use pgvectorscale, how to build disk-resident vector indexes with StreamingDiskANN, how to combine full-text and vector search for better relevance, and a minimal retrieval-augmented generation (RAG) pattern using DigitalOcean Managed PostgreSQL as the memory layer.

When to Use pgvectorscale

pgvectorscale complements pgvector. It adds two capabilities that matter once your corpus outgrows memory:

  • StreamingDiskANN index: A graph index designed for disk-resident vector search. It keeps the hot part of the graph in memory and streams the rest from disk, so you can search corpora much larger than RAM without falling back to a sequential scan.
  • Statistical Binary Quantization (SBQ): Compresses vectors to one bit per dimension inside the index, with an optional exact rerank step against the full-precision vectors, cutting index size dramatically.
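
For a sense of scale: a 1,536-dimension float4 embedding occupies 6,144 bytes, while its one-bit-per-dimension SBQ form is 192 bytes, roughly a 32x reduction in per-vector index storage before graph overhead.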

Good fits for pgvectorscale:

  • Tens of millions to hundreds of millions of vectors.
  • Workloads where the pgvector HNSW index no longer fits comfortably in RAM.
  • Write-heavy ingest pipelines where HNSW build time becomes a bottleneck.

For most workloads under a few million rows, pgvector’s HNSW index is simpler and performs well. Start there, measure, and move to pgvectorscale when memory, ingest rate, or index build time pushes you off HNSW.
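
For reference, the pgvector HNSW baseline looks like this (m = 16 and ef_construction = 64 are pgvector's defaults, shown here explicitly so you can tune them later):

CREATE INDEX documents_embedding_hnsw
    ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);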

Enable the vectorscale Extension

pgvectorscale depends on pgvector, so enable both. The registered extension name in PostgreSQL is vectorscale, and the CASCADE clause installs pgvector if needed.

CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;

Verify both extensions are present:

SELECT extname, extversion
FROM   pg_extension
WHERE  extname IN ('vector', 'vectorscale')
ORDER  BY extname;
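
The examples in the rest of this guide assume a documents table along these lines. This is a minimal sketch: the 1536-dimension column matches the text-embedding-3-small model used in the RAG example below, so adjust it to your embedding model.

CREATE TABLE documents (
    id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    content   text NOT NULL,
    embedding vector(1536)
);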

Create a StreamingDiskANN Index

A StreamingDiskANN index is created with the same CREATE INDEX syntax you already know, and its operator classes follow the pgvector naming convention.

CREATE INDEX documents_embedding_diskann
    ON documents
    USING diskann (embedding vector_cosine_ops)
    WITH (
        storage_layout = 'memory_optimized',
        num_neighbors = 50,
        search_list_size = 100,
        max_alpha = 1.2
    );

Key parameters:

  • storage_layout: memory_optimized enables SBQ compression (one bit per dimension) with a rerank against the full-precision vector; plain keeps vectors uncompressed.
  • num_neighbors: Graph degree, analogous to HNSW’s m. Typical range 30-75.
  • search_list_size: Build-time candidate set size, analogous to ef_construction.
  • max_alpha: Graph pruning parameter. Values between 1.0 and 1.5 are typical.
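
Every parameter has a built-in default, so you can also omit the WITH clause entirely and create an equivalent index to tune later:

CREATE INDEX documents_embedding_diskann
    ON documents
    USING diskann (embedding vector_cosine_ops);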

Tune recall at query time with the per-session GUC:

SET diskann.query_rescore = 50;
SELECT id, content
FROM   documents
ORDER  BY embedding <=> $1
LIMIT  10;

With memory_optimized storage, the query first walks the SBQ-compressed graph, then reranks the top query_rescore candidates against the exact vectors in the heap. Raise query_rescore to improve recall; lower it to reduce latency.
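
To confirm queries are hitting the index rather than falling back to a sequential scan, inspect the plan (substitute a literal query vector for $1 when running this interactively):

EXPLAIN (ANALYZE, BUFFERS)
SELECT id, content
FROM   documents
ORDER  BY embedding <=> $1
LIMIT  10;

Look for an index scan node referencing documents_embedding_diskann; a Seq Scan here means the planner is answering the query exactly, without the index.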

Hybrid Search: Full-Text Plus Vector

Semantic search with vectors is strong at paraphrase and concept matching. Full-text search is strong at rare tokens, product codes, and exact phrases. Combining the two gives noticeably better relevance than either alone.

Add a generated tsvector column and index it with GIN:

ALTER TABLE documents
    ADD COLUMN content_tsv tsvector
        GENERATED ALWAYS AS (to_tsvector('english', content)) STORED;

CREATE INDEX documents_content_tsv_idx
    ON documents USING gin (content_tsv);

Retrieve candidates from both the vector index and the full-text index, then combine their ranks. Reciprocal Rank Fusion (RRF) is a simple, model-free way to merge the two lists:

WITH
  vec AS (
    SELECT id, row_number() OVER (ORDER BY embedding <=> $1) AS rnk
    FROM   documents
    ORDER  BY embedding <=> $1
    LIMIT  50
  ),
  txt AS (
    SELECT id, row_number() OVER (ORDER BY ts_rank(content_tsv, query) DESC) AS rnk
    FROM   documents, plainto_tsquery('english', $2) query
    WHERE  content_tsv @@ query
    ORDER  BY ts_rank(content_tsv, query) DESC
    LIMIT  50
  )
SELECT d.id, d.content,
       COALESCE(1.0 / (60 + vec.rnk), 0) + COALESCE(1.0 / (60 + txt.rnk), 0) AS score
FROM   documents d
LEFT   JOIN vec ON vec.id = d.id
LEFT   JOIN txt ON txt.id = d.id
WHERE  vec.id IS NOT NULL OR txt.id IS NOT NULL
ORDER  BY score DESC
LIMIT  10;

$1 is the query embedding; $2 is the raw user query string. The constant 60 is the conventional RRF smoothing constant; adjust it only if you have measured a better value for your corpus.
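
If application code issues this query often, wrapping it in a SQL function keeps call sites simple. The sketch below assumes the documents schema used throughout this guide; the hybrid_search name and the fixed candidate depth of 50 per list are illustrative choices:

CREATE FUNCTION hybrid_search(q_vec vector, q_text text, k int DEFAULT 10)
RETURNS TABLE (id bigint, content text, score numeric)
LANGUAGE sql STABLE AS $$
    WITH
      vec AS (
        SELECT d.id, row_number() OVER (ORDER BY d.embedding <=> q_vec) AS rnk
        FROM   documents d
        ORDER  BY d.embedding <=> q_vec
        LIMIT  50
      ),
      txt AS (
        SELECT d.id, row_number() OVER (ORDER BY ts_rank(d.content_tsv, query) DESC) AS rnk
        FROM   documents d, plainto_tsquery('english', q_text) query
        WHERE  d.content_tsv @@ query
        ORDER  BY ts_rank(d.content_tsv, query) DESC
        LIMIT  50
      )
    SELECT d.id, d.content,
           COALESCE(1.0 / (60 + vec.rnk), 0) + COALESCE(1.0 / (60 + txt.rnk), 0) AS score
    FROM   documents d
    LEFT   JOIN vec ON vec.id = d.id
    LEFT   JOIN txt ON txt.id = d.id
    WHERE  vec.id IS NOT NULL OR txt.id IS NOT NULL
    ORDER  BY score DESC
    LIMIT  k;
$$;

Call it with SELECT * FROM hybrid_search($1, $2); the default k returns the top 10 fused results.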

A Minimal RAG Pattern

Retrieval-augmented generation (RAG) combines a retriever (your PostgreSQL database) with a generator (an LLM). The retriever pulls the top-k most relevant chunks; the generator answers using those chunks as grounding.

The moving parts are:

  • A table of chunked documents with embeddings, indexed for similarity search (this guide).
  • An embedding model to turn the user’s question into a query vector.
  • A chat-completion model to generate the final answer.

The example below uses DigitalOcean AI Platform Serverless Inference as the chat endpoint and a third-party provider for embeddings.

import os

import numpy as np
import psycopg
from pgvector.psycopg import register_vector
from openai import OpenAI

# Embedding client (third-party provider).
embed = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Chat client pointed at DigitalOcean Serverless Inference.
chat = OpenAI(
    base_url="https://inference.do-ai.run/v1",
    api_key=os.environ["DO_MODEL_ACCESS_KEY"],
)

# 1. Embed the user's question.
q = "How do I secure a Managed PostgreSQL cluster?"
q_vec = embed.embeddings.create(
    model="text-embedding-3-small",
    input=[q],
).data[0].embedding

# 2. Retrieve the top-k most similar chunks from PostgreSQL.
with psycopg.connect(os.environ["DATABASE_URL"]) as conn:
    register_vector(conn)  # lets psycopg send numpy arrays as vector values
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT content
            FROM   documents
            ORDER  BY embedding <=> %s
            LIMIT  5
            """,
            (np.array(q_vec),),
        )
        context = "\n\n".join(row[0] for row in cur.fetchall())

# 3. Generate a grounded answer from the retrieved context.
answer = chat.chat.completions.create(
    model="openai-gpt-oss-120b",
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {q}"},
    ],
).choices[0].message.content

print(answer)

The chat model ID openai-gpt-oss-120b is one of the models available through DigitalOcean Serverless Inference. Check Foundation and Embedding Models on DigitalOcean AI Platform for the current list, or retrieve model IDs at runtime from GET /v1/models using your Model Access Key.
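
For example, the same OpenAI client can enumerate the model IDs available to your key (this reuses the Model Access Key from the RAG example):

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://inference.do-ai.run/v1",
    api_key=os.environ["DO_MODEL_ACCESS_KEY"],
)

# The endpoint is OpenAI-compatible, so the standard models listing works.
for model in client.models.list():
    print(model.id)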

Operational Notes

  • Resize before you hit the wall: Vector indexes grow quickly. Watch RAM and disk graphs on the Insights tab and resize ahead of need.
  • Offload search traffic to read-only nodes: Similarity search is CPU- and memory-heavy; routing it to a read-only node protects write latency.
  • Snapshot before REINDEX: Rebuilding a large DiskANN index takes time. Managed PostgreSQL runs daily backups, so you always have a point to fall back to.
  • Monitor recall as well as latency: For ANN indexes, it is easy to drift from high to low recall by changing GUCs or models. Add a recall regression test against a small labeled set, as in the sketch after this list.
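
A minimal recall@k harness, as a sketch: it assumes the documents table from this guide and a held-out list of query embeddings that you supply. Turning enable_indexscan off forces PostgreSQL into an exact sequential scan, which serves as the ground truth for the ANN results.

import os

import numpy as np
import psycopg
from pgvector.psycopg import register_vector

K = 10
queries = []  # fill with query embeddings from a small labeled set

def top_ids(cur, q_vec):
    # The same similarity query the application runs.
    cur.execute(
        "SELECT id FROM documents ORDER BY embedding <=> %s LIMIT %s",
        (np.array(q_vec), K),
    )
    return {row[0] for row in cur.fetchall()}

with psycopg.connect(os.environ["DATABASE_URL"]) as conn:
    register_vector(conn)
    with conn.cursor() as cur:
        recalls = []
        for q_vec in queries:
            ann = top_ids(cur, q_vec)                  # ANN results via the index
            cur.execute("SET enable_indexscan = off")  # force an exact scan
            exact = top_ids(cur, q_vec)                # ground-truth results
            cur.execute("RESET enable_indexscan")
            recalls.append(len(ann & exact) / K)
        if recalls:
            print(f"mean recall@{K}: {sum(recalls) / len(recalls):.3f}")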
