How to Create, Index, List, and Delete Data Sources (Public Preview)

DigitalOcean GenAI Platform lets you build GPU-powered AI agents with fully-managed deployment. Agents can use pre-built or custom foundation models, incorporate function and agent routes, and implement RAG pipelines with knowledge bases.


Add a Data Source Using Automation

Adding a data source to a knowledge base requires the unique identifier of the knowledge base and the Spaces bucket, folder, or file to use as the data source.

You can list all knowledge bases with their unique identifiers using the /v2/gen-ai/knowledge_bases endpoint.
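As a sketch, you can build this request in Python with only the standard library. The endpoint is the one named above; the token value is a placeholder, and sending the request is left commented out:

```python
import urllib.request

API_TOKEN = "your-personal-access-token"  # placeholder, not a real token

# Build a GET request against the knowledge bases listing endpoint.
req = urllib.request.Request(
    "https://api.digitalocean.com/v2/gen-ai/knowledge_bases",
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_TOKEN}",
    },
    method="GET",
)

# urllib.request.urlopen(req) would send the request; the JSON response
# includes each knowledge base with its unique identifier.
```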

How to Add a Data Source Using the DigitalOcean API
  1. Create a personal access token and save it for use with the API.
  2. Send a POST request to https://api.digitalocean.com/v2/gen-ai/knowledge_bases/{knowledge_base_uuid}/data_sources

    Using cURL:

    For example, to add a Spaces bucket as a data source (the body follows the API's spaces_data_source schema; the bucket name, folder path, and region below are placeholders):

      curl -X POST \
        -H "Content-Type: application/json" \
        -H "Authorization: Bearer $PREVIEW_API_TOKEN" \
        -d '{"spaces_data_source": {"bucket_name": "example-bucket", "item_path": "data/", "region": "nyc3"}}' \
        "https://api.digitalocean.com/v2/gen-ai/knowledge_bases/9a6e3975-b0c6-11ef-bf8f-4e013e2ddde4/data_sources"

Add a Data Source Using the Control Panel

To add a data source to a knowledge base after creation, go to your knowledge base in the control panel, click the Data sources tab, and then click the Add source button.

Click Select data source to open the Select data source window. From the Data source dropdown list, select one of the following options:

  • Spaces bucket or folder: Select one or more Spaces buckets or folders in a bucket where your data is stored. If you do not have Spaces buckets for your data, see How to Create a Spaces Bucket and How to Migrate Spaces with Flexify.IO.

  • Web crawling: Add a static or dynamic seed URL to extract data with the GenAI crawler. The URL must use HTTPS and be publicly accessible. The crawler indexes up to 5500 links within the defined scope. It follows robots.txt, respects disallow directives, and skips inaccessible links.

    Update your robots.txt file to allow GenAI crawler

    If you want the GenAI crawler to index your site, you need to update your robots.txt file.

    First, find your robots.txt file. It is usually in the root directory of your site, for example, https://www.example.com/robots.txt. If your site does not have a robots.txt file, create one in the root directory of your site and add the following lines to it:

    User-agent: *
    Disallow: /private/
    Disallow: /admin/
    

    This configuration blocks all web crawlers from accessing the site’s /private/ and /admin/ sections. You can edit it to fit your site’s use case.

    To allow the GenAI crawler to access your site, add the following lines to the file:

    User-agent: DigitalOceanGenAICrawler/1.0
    Allow: /
    

    This targets the GenAI crawler and allows it to access and index all content on your site.

    After making the changes, save the robots.txt file in your site’s root directory.

    To verify the changes, go to https://www.example.com/robots.txt in your browser and confirm that your updates appear.

    Once updated, the GenAI crawler can index your site.
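    Before the crawler visits, you can sanity-check your robots.txt rules with Python's standard urllib.robotparser. Note that robotparser matches on the product token before the slash, so the version suffix is omitted in this sketch:

```python
from urllib.robotparser import RobotFileParser

# robots.txt combining the rules above: block everyone from /private/
# and /admin/, but allow the GenAI crawler everywhere.
robots_txt = """\
User-agent: *
Disallow: /private/
Disallow: /admin/

User-agent: DigitalOceanGenAICrawler
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# The GenAI crawler is allowed everywhere, including /private/.
print(rp.can_fetch("DigitalOceanGenAICrawler", "https://www.example.com/private/page"))  # True

# Other crawlers fall back to the "*" rules and are blocked from /private/.
print(rp.can_fetch("SomeOtherBot", "https://www.example.com/private/page"))  # False
```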

    • Crawling Scope Levels: These rules determine which linked pages the crawler scrapes. The crawler indexes content from the seed URL, including supported media types like .svg, .jpeg, and .png images if specified. However, including images and SVGs may increase the indexing token count. The crawler ignores videos and avoids scraping links in footers, headers, and navigation elements. Downloadable files are processed only if they fall within the defined crawling scope; otherwise, they are ignored.

      The four scope levels are:

      • Scoped (Most Narrow): Crawls only the seed URL and ignores all links to external pages. For example, with the seed URL https://www.example.com/products/ai-ml/, the crawler indexes only that page.

      • URL and all linked pages in path (Narrow): Crawls the seed URL and all pages within the same URL path, ignoring pages outside this path. For example, with the seed URL https://www.example.com/docs/, the crawl includes https://www.example.com/docs/tutorials/ but excludes https://www.example.com/products/.

      • URL and all linked pages in domain (Broad): Crawls all pages within the same domain as the seed URL but does not include subdomains. For example, with the seed URL https://www.example.com/docs/, the crawl includes https://www.example.com/products/ but excludes https://docs.example.com/.

      • Subdomains (Most Broad): Crawls all pages within the domain and its subdomains, such as docs.example.com and marketplace.example.com. For example, with the seed URL https://www.example.com/docs/, the crawl includes https://community.example.com/.
    If you add a seed URL for web crawling, you can check whether it is fully indexed by adding it again and starting a new crawl. If the new crawl returns zero tokens, the initial crawl indexed all of the content.

  • File: Drag and drop data files from your local storage or click Upload to select the files to add in the file browser.

Note
Due to browser limitations, we recommend uploading files smaller than 2GB and batches of fewer than 100 files using the control panel. For larger files and batches of files, use the DigitalOcean API.

Next, click Add selected data source to add the data source.

Adding data sources automatically indexes them and updates the knowledge base. When you add multiple data sources, the knowledge base's embeddings update as soon as indexing finishes for each data source, even while the remaining data sources are still being indexed.
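The crawling scope levels described above can be approximated with simple URL matching. This is an illustrative sketch only; the crawler's actual matching logic is not documented here and may differ:

```python
from urllib.parse import urlparse

def in_scope(seed: str, candidate: str, scope: str) -> bool:
    """Approximate the four crawling scope levels (illustrative only)."""
    s, c = urlparse(seed), urlparse(candidate)
    seed_domain = s.netloc.removeprefix("www.")
    cand_domain = c.netloc.removeprefix("www.")
    if scope == "scoped":        # Most narrow: only the seed URL itself
        return (c.netloc, c.path) == (s.netloc, s.path)
    if scope == "path":          # Narrow: same host, same URL path prefix
        return c.netloc == s.netloc and c.path.startswith(s.path)
    if scope == "domain":        # Broad: same host, any path, no subdomains
        return c.netloc == s.netloc
    if scope == "subdomains":    # Most broad: the domain plus any subdomain
        return cand_domain == seed_domain or cand_domain.endswith("." + seed_domain)
    raise ValueError(f"unknown scope: {scope}")

seed = "https://www.example.com/docs/"
print(in_scope(seed, "https://www.example.com/docs/tutorials/", "path"))  # True
print(in_scope(seed, "https://docs.example.com/", "domain"))              # False
print(in_scope(seed, "https://community.example.com/", "subdomains"))     # True
```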

If the data in your data source has changed, you can re-index the data manually to update the vector embeddings.

Organize Data Sources

We index the data you store in your knowledge base. GenAI Platform supports the .txt, .html, .md, .pdf, .doc, .json, and .csv formats.

Spaces buckets seamlessly organize files for your knowledge base (KB) using S3-compatible systems. You need an initial dataset from a Spaces bucket to create a KB, and the system indexes all supported file formats in the bucket, regardless of their privacy settings.

  • Include Only Indexing Data. Keep bucket contents limited to data meant for indexing to reduce costs and prevent errors.
  • Use Five Buckets Maximum. For optimal performance, limit your setup to five or fewer buckets.
  • Use Supported File Formats. Make sure your files are in a supported format (.txt, .html, .md, .pdf, .doc, .json, or .csv). For file setup instructions, see the How to Create a Spaces Bucket and create your knowledge base guides.

Due to browser limitations, when uploading files from local storage for your data source, we recommend files smaller than 2GB and batches of fewer than 100 files in the control panel. For larger files and batches of files, use the DigitalOcean API.
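As a small guard against unsupported formats ending up in your bucket, you can filter a local file list against the extensions this article lists as supported before uploading. A minimal sketch:

```python
from pathlib import Path

# Formats this article lists as supported for indexing.
SUPPORTED = {".txt", ".html", ".md", ".pdf", ".doc", ".json", ".csv"}

def indexable(paths):
    """Return only the files a knowledge base can index, by extension."""
    return [p for p in paths if Path(p).suffix.lower() in SUPPORTED]

print(indexable(["notes.txt", "report.PDF", "video.mp4", "data.json"]))
# ['notes.txt', 'report.PDF', 'data.json']
```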

Re-index a Data Source

To re-index all your data sources, click the Update all sources button.

You can view the indexing progress and the job details in the banner at the top of the knowledge base page.

For web crawling data sources, you cannot currently re-index a previously crawled seed URL. To re-index the content, delete the seed URL data source and add it again to start a new crawl.

Remove a Data Source

To remove a data source, click the ••• menu to its right and select Remove source.
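You can also remove a data source through the API by sending a DELETE request to the data source's URL under the knowledge base's data_sources endpoint. A Python sketch that builds the request without sending it (the UUIDs and token are placeholders):

```python
import urllib.request

API_TOKEN = "your-personal-access-token"          # placeholder
KB_UUID = "9a6e3975-b0c6-11ef-bf8f-4e013e2ddde4"  # knowledge base UUID
DS_UUID = "bd2a2db5-b8b0-11ef-bf8f-4e013e2ddde4"  # data source UUID

req = urllib.request.Request(
    f"https://api.digitalocean.com/v2/gen-ai/knowledge_bases/{KB_UUID}/data_sources/{DS_UUID}",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    method="DELETE",
)

# urllib.request.urlopen(req) would send the request and remove the data source.
```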
