How to Create a Knowledge Base (KB) Public Preview

DigitalOcean GenAI Platform lets you build GPU-powered AI agents with fully managed deployment. Agents can use pre-built or custom foundation models, incorporate function and agent routes, and implement RAG pipelines with knowledge bases.


Create a Knowledge Base Using Automation

Creating a knowledge base using the DigitalOcean API requires a name for the knowledge base, an embedding model to use for indexing, a data source, the identifier of the project the knowledge base belongs to, and the datacenter region. You can also provide the unique identifier of a DigitalOcean OpenSearch database to store the vector embeddings of your data. If you do not provide one, we create a DigitalOcean OpenSearch database for the knowledge base to use. The new database is typically double the size of your data.

You can obtain a list of embedding models with their unique identifiers using the /v2/gen-ai/models endpoint with the usecases query parameter.
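
As a minimal sketch of that lookup in Python, the snippet below builds a GET request against /v2/gen-ai/models. The filter value MODEL_USECASE_KNOWLEDGEBASE and the helper names build_models_request and list_embedding_models are assumptions that this page does not state; confirm the accepted usecases values in the API reference before relying on them.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.digitalocean.com"


def build_models_request(
    token: str,
    usecase: str = "MODEL_USECASE_KNOWLEDGEBASE",  # assumed filter value
) -> urllib.request.Request:
    """Build (but do not send) a request that lists models for one use case."""
    query = urllib.parse.urlencode({"usecases": usecase})
    return urllib.request.Request(
        f"{API_BASE}/v2/gen-ai/models?{query}",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="GET",
    )


def list_embedding_models(token: str) -> dict:
    """Send the request and return the decoded JSON body as-is."""
    with urllib.request.urlopen(build_models_request(token)) as resp:
        return json.load(resp)
```

Calling list_embedding_models with your personal access token returns the raw response body; the sketch makes no assumption about its field names.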

How to Create a Knowledge Base Using the DigitalOcean API
  1. Create a personal access token and save it for use with the API.
  2. Send a POST request to https://api.digitalocean.com/v2/gen-ai/knowledge_bases

    Using cURL:

    curl -X POST \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $PREVIEW_API_TOKEN" \
      "https://api.digitalocean.com/v2/gen-ai/knowledge_bases" \
      -d '{
        "name": "kb-api-create",
        "embedding_model_uuid": "05700391-7aa8-11ef-bf8f-4e013e2ddde4",
        "project_id": "37455431-84bd-4fa2-94cf-e8486f8f8c5e",
        "tags": [
          "tag1"
        ],
        "database_id": "abf1055a-745d-4c24-a1db-1959ea819264",
        "datasources": [
          {
            "bucket_name": "test-public-gen-ai",
            "bucket_region": "tor1"
          }
        ],
        "region": "tor1",
        "vpc_uuid": "f7176e0b-8c5e-4e32-948e-79327e56225a"
      }'
                            

You can list all available KBs, view details of the KB, or update the KB after creation.
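
The POST request shown above can also be sent from a script. The following Python sketch mirrors the cURL example using only the standard library. The helper names are hypothetical, and the payload values are the placeholder identifiers from the cURL example, so replace them with your own.

```python
import json
import urllib.request

KB_ENDPOINT = "https://api.digitalocean.com/v2/gen-ai/knowledge_bases"

# Same body as the cURL example; every identifier here is a placeholder.
example_payload = {
    "name": "kb-api-create",
    "embedding_model_uuid": "05700391-7aa8-11ef-bf8f-4e013e2ddde4",
    "project_id": "37455431-84bd-4fa2-94cf-e8486f8f8c5e",
    "tags": ["tag1"],
    "database_id": "abf1055a-745d-4c24-a1db-1959ea819264",
    "datasources": [
        {"bucket_name": "test-public-gen-ai", "bucket_region": "tor1"}
    ],
    "region": "tor1",
    "vpc_uuid": "f7176e0b-8c5e-4e32-948e-79327e56225a",
}


def build_create_kb_request(token: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request that creates a knowledge base."""
    return urllib.request.Request(
        KB_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def create_knowledge_base(token: str, payload: dict) -> dict:
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(build_create_kb_request(token, payload)) as resp:
        return json.load(resp)
```

Keeping request construction separate from sending makes it possible to inspect the URL, headers, and body before any resource is created.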

Create a Knowledge Base Using the Control Panel

To create a knowledge base from the DigitalOcean Control Panel, click GenAI Platform in the left sidebar, click the Knowledge bases tab, and then click the Create Knowledge Base button.

Name Your Knowledge Base

You can keep the automatically generated name for the knowledge base or choose a custom name. Names must be unique, between 3 and 63 characters long, and contain only alphanumeric characters, dashes, and periods.
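
The length and character rules can be checked locally before you submit a name. This Python sketch is one possible reading of those rules: it assumes "alphanumeric" means ASCII letters and digits in either case, and the function name is_valid_kb_name is hypothetical. Uniqueness can only be confirmed by the platform.

```python
import re

# 3 to 63 characters drawn from letters, digits, dashes, and periods.
# Treating "alphanumeric" as ASCII-only is an assumption.
_KB_NAME = re.compile(r"[A-Za-z0-9.-]{3,63}")


def is_valid_kb_name(name: str) -> bool:
    """Return True if the name satisfies the documented length and character rules."""
    return _KB_NAME.fullmatch(name) is not None
```

For example, kb-api-create passes, while a two-character name or a name containing an underscore does not.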

Select Data Source

You can organize knowledge base files in dedicated Spaces buckets or folders, or in local storage. Include only relevant files to save processing time and money. GenAI Platform supports the .txt, .html, .md, .pdf, .doc, .json, and .csv formats.
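
Before uploading, it can help to separate files the platform supports from those it does not. The Python sketch below does this by file extension. The function name split_by_support is hypothetical, and matching extensions case-insensitively is an assumption.

```python
from pathlib import Path

# Formats listed as supported on this page.
SUPPORTED_EXTENSIONS = {".txt", ".html", ".md", ".pdf", ".doc", ".json", ".csv"}


def split_by_support(paths):
    """Split paths into (supported, skipped) lists based on file extension."""
    supported, skipped = [], []
    for path in map(Path, paths):
        if path.suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(path)
        else:
            skipped.append(path)
    return supported, skipped
```

Reviewing the skipped list is a quick way to spot files that would otherwise be added to a bucket or upload batch without being indexed.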

Click Select data source to open the Select data source window. From the Data source dropdown list, select one of the following options:

  • Spaces bucket or folder: Select one or more Spaces buckets or folders in a bucket where your data is stored. If you do not have Spaces buckets for your data, see How to Create a Spaces Bucket and How to Migrate Spaces with Flexify.IO.

  • Web crawling: Add a static or dynamic seed URL to extract data with the GenAI crawler. The URL must use HTTPS and be publicly accessible. The crawler indexes up to 5500 links within the defined scope. It follows robots.txt, respects disallow directives, and skips inaccessible links.

    Update your robots.txt file to allow the GenAI crawler

    If you want the GenAI crawler to index your site, you need to update your robots.txt file.

    First, find your robots.txt file. It is usually in the root directory of your site, for example, https://www.example.com/robots.txt. If your site does not have a robots.txt file, create one in the root directory of your site and add the following lines to it:

    User-agent: *
    Disallow: /private/
    Disallow: /admin/
    

    This configuration blocks all web crawlers from accessing the site’s /private/ and /admin/ sections. You can edit it to fit your site’s use case.

    To allow the GenAI crawler to access your site, add the following lines to the file:

    User-agent: DigitalOceanGenAICrawler/1.0
    Allow: /
    

    This targets the GenAI crawler and allows it to access and index all content on your site.

    After making the changes, save the robots.txt file in your site’s root directory.

    To verify the changes, go to https://www.example.com/robots.txt in your browser and check that the updates were applied.

    Once updated, the GenAI crawler can index your site.

    • Crawling Scope Levels: These rules determine which linked pages the crawler scrapes. The crawler indexes content from the seed URL, including supported media types like .svg, .jpeg, and .png images if specified. However, including images and SVGs may increase the indexing token count. The crawler ignores videos and avoids scraping links in footers, headers, and navigation elements. Downloadable files are processed only if they fall within the defined crawling scope; otherwise, they are ignored.

      Scoped (Most Narrow)
        Crawls only the seed URL and ignores all links to external pages.
        Seed URL example: https://www.example.com/products/ai-ml/
        Crawls: Only this page.

      URL and all linked pages in path (Narrow)
        Crawls the seed URL and all pages within the same URL path, ignoring pages outside this path.
        Seed URL example: https://www.example.com/docs/
        Includes: https://www.example.com/docs/tutorials/
        Excludes: https://www.example.com/products/

      URL and all linked pages in domain (Broad)
        Crawls all pages within the same domain as the seed URL but does not include subdomains.
        Seed URL example: https://www.example.com/docs/
        Includes: https://www.example.com/products/
        Excludes: https://docs.example.com/

      Subdomains (Most Broad)
        Crawls all pages within the domain and its subdomains, including docs.example.com and marketplace.example.com.
        Seed URL example: https://www.example.com/docs/
        Includes: https://community.example.com/

    If you add a seed URL for web crawling, you can check if it’s fully indexed by adding it again and starting a new crawl. If it returns zero tokens, the initial crawl indexed all content.

  • File: Drag and drop data files from your local storage or click Upload to select the files to add in the file browser.

Note
Due to browser limitations, we recommend uploading files smaller than 2 GB and batches of fewer than 100 files when using the control panel. For larger files and batches, use the DigitalOcean API.
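
For the web crawling option, the robots.txt change described earlier can be checked with a short script as well as in the browser. The Python sketch below is a plain text-level check, not a full robots.txt evaluator: it collects the Allow and Disallow lines from groups whose User-agent line exactly matches a given name. The function name agent_rules is hypothetical, and the sketch does not model how the GenAI crawler resolves precedence between groups.

```python
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /admin/

User-agent: DigitalOceanGenAICrawler/1.0
Allow: /
"""


def agent_rules(robots_txt: str, agent: str) -> list[tuple[str, str]]:
    """Return (directive, path) pairs from groups naming exactly this agent."""
    rules = []
    group_agents = []   # User-agent values of the group being read
    in_rules = False    # True once the current group has rule lines
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                group_agents = []
                in_rules = False
            group_agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            if agent.lower() in group_agents:
                rules.append((field, value))
    return rules
```

With the sample file, looking up DigitalOceanGenAICrawler/1.0 yields a single allow rule for /, which is the entry the instructions ask you to add.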

Next, click Add selected data source to add the data source.
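
The four crawling scope levels described above can be written as a small predicate, which may help when deciding which scope to choose for a seed URL. This Python sketch is an illustration of the documented rules, not the crawler's actual implementation. The scope keys and the function name in_scope are hypothetical, and deriving the base domain by stripping a leading "www." is a simplifying assumption.

```python
from urllib.parse import urlsplit


def in_scope(seed: str, url: str, scope: str) -> bool:
    """Return True if url falls inside the given crawl scope for the seed URL."""
    s, u = urlsplit(seed), urlsplit(url)
    seed_host, host = s.hostname or "", u.hostname or ""
    if scope == "scoped":        # only the seed page itself
        return (host, u.path) == (seed_host, s.path)
    if scope == "path":          # same host, under the seed's URL path
        return host == seed_host and u.path.startswith(s.path)
    if scope == "domain":        # same host, subdomains excluded
        return host == seed_host
    if scope == "subdomains":    # the base domain and any of its subdomains
        base = seed_host.removeprefix("www.")  # simplifying assumption
        return host == base or host.endswith("." + base)
    raise ValueError(f"unknown scope: {scope}")
```

Using the seed https://www.example.com/docs/, the "path" scope accepts https://www.example.com/docs/tutorials/ and rejects https://www.example.com/products/, matching the examples listed for each scope.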

Add to OpenSearch Cluster

A knowledge base requires a new or existing OpenSearch cluster to store the vector embeddings of your data from the data source. To use an existing cluster, in the OpenSearch database options section, select the Use existing option and then select the existing cluster from the Select an OpenSearch database dropdown.

To create a new one, select the Create new option and select the datacenter region in which to create the cluster. Embeddings are typically double the size of the data you add. We create an OpenSearch cluster with the smallest size that can store the embeddings of your data.

Select Embedding Model

An embedding model converts your data into vector embeddings, which are then stored in your OpenSearch database cluster. Use the Embeddings Model drop-down menu to select your model.

You currently cannot change embedding models after creating the knowledge base.

Review Pricing

Review the estimated cost of your knowledge base, which is based on token count. As a rough guide, one token is about four characters, and 100 tokens are about 75 words of text. Estimates assume a Latin alphabet dataset. Using non-Latin characters, emojis, or binary data may result in more tokens.
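
The two rules of thumb above can be turned into a quick back-of-the-envelope estimate. The Python sketch below rounds up to whole tokens. The function names are hypothetical, and the results are rough approximations for Latin alphabet text, not the token counts the platform bills.

```python
def tokens_from_characters(char_count: int) -> int:
    """Estimate tokens at about four characters per token, rounded up."""
    return (char_count + 3) // 4


def tokens_from_words(word_count: int) -> int:
    """Estimate tokens at about 100 tokens per 75 words, rounded up."""
    return (word_count * 100 + 74) // 75
```

For example, a 400-character snippet is estimated at 100 tokens, and a 75-word paragraph is also estimated at 100 tokens.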

Final Details

Choose the project to add the knowledge base to and any tags you want to use.

  • Select a project: You can leave the default project or choose another one.

  • Tags: You can add a tag by typing it into the text box and pressing Enter. Tags can only contain letters, numbers, colons, dashes, and underscores.
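
The tag character rule can be checked before you submit tags through the control panel or the API. This Python sketch assumes "letters" means ASCII letters and that empty tags are not allowed; the function name is_valid_tag is hypothetical.

```python
import re

# Letters, numbers, colons, dashes, and underscores; ASCII-only is an assumption.
_TAG = re.compile(r"[A-Za-z0-9:_-]+")


def is_valid_tag(tag: str) -> bool:
    """Return True if the tag uses only the documented characters."""
    return _TAG.fullmatch(tag) is not None
```

For example, tag1 and env:prod pass, while a tag containing a space or a period does not.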

When you’re ready, click the Create Knowledge Base button. Knowledge bases typically take five minutes or more to provision.
