How to Create, Index, List, and Delete Data Sources (Public Preview)

DigitalOcean GenAI Platform lets you build GPU-powered AI agents with fully-managed deployment. Agents can use pre-built or custom foundation models, incorporate function and agent routes, and implement RAG pipelines with knowledge bases.


Add a Data Source Using Automation

Adding a data source to a knowledge base requires the unique identifier of the knowledge base and the Spaces bucket, folder, or file to use as the data source.

You can list all knowledge bases with their unique identifiers using the /v2/gen-ai/knowledge_bases endpoint.
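For example, a minimal sketch of listing knowledge bases with cURL (it assumes `$PREVIEW_API_TOKEN` holds your personal access token, and only calls the API when the token is set):

```shell
# Endpoint for listing knowledge bases, as described above.
ENDPOINT="https://api.digitalocean.com/v2/gen-ai/knowledge_bases"

# Only call the API when a token is available.
if [ -n "$PREVIEW_API_TOKEN" ]; then
  curl -s -X GET \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $PREVIEW_API_TOKEN" \
    "$ENDPOINT"
fi
```

The response includes each knowledge base's unique identifier, which you need for the data source requests below.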

How to Add a Data Source Using the DigitalOcean API
  1. Create a personal access token and save it for use with the API.
  2. Send a POST request to https://api.digitalocean.com/v2/gen-ai/knowledge_bases/{knowledge_base_uuid}/data_sources

    Using cURL:

        curl -X POST \
          -H "Content-Type: application/json" \
          -H "Authorization: Bearer $PREVIEW_API_TOKEN" \
          -d '{
                "spaces_data_source": {
                  "bucket_name": "example-bucket",
                  "item_path": "docs/",
                  "region": "nyc3"
                }
              }' \
          "https://api.digitalocean.com/v2/gen-ai/knowledge_bases/9a6e3975-b0c6-11ef-bf8f-4e013e2ddde4/data_sources"

    This example adds a Spaces bucket as the data source. Replace the bucket name, item path, and region with your own values.
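You can also list the data sources already attached to a knowledge base by sending a GET request to the same endpoint. A sketch, reusing the example knowledge base UUID from above and running only when `$PREVIEW_API_TOKEN` is set:

```shell
# Example knowledge base UUID from the request above.
KB_UUID="9a6e3975-b0c6-11ef-bf8f-4e013e2ddde4"
ENDPOINT="https://api.digitalocean.com/v2/gen-ai/knowledge_bases/${KB_UUID}/data_sources"

# Only call the API when a token is available.
if [ -n "$PREVIEW_API_TOKEN" ]; then
  curl -s -X GET \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $PREVIEW_API_TOKEN" \
    "$ENDPOINT"
fi
```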

Add a Data Source Using the Control Panel

To add a data source to a knowledge base after creation, go to your knowledge base in the control panel, click on the Data sources tab, and then click the Add source button to open the Knowledge bases page.

Click Select data source to open the Select data source window. From the Data source dropdown list, select one of the following options:

  • Spaces bucket or folder: Select one or more Spaces buckets or folders in a bucket where your data is stored. If you do not have Spaces buckets for your data, see How to Create a Spaces Bucket and How to Migrate Spaces with Flexify.IO.

  • Web crawling: Add a static or dynamic seed URL to extract data through web crawling with the GenAI crawler. The URL must use HTTPS and be publicly accessible. The crawler indexes up to 1000 links based on the defined crawling scope. The crawler follows robots.txt rules, respects disallow directives, and skips links it cannot access.

    Update your robots.txt file to allow GenAI crawler

    If you want the GenAI crawler to index your site, you need to update your robots.txt file.

    First, find your robots.txt file. It is usually in the root directory of your site, for example, https://www.example.com/robots.txt. If your site does not have a robots.txt file, create one in the root directory of your site and add the following lines to it:

    User-agent: *
    Disallow: /private/
    Disallow: /admin/
    

    This configuration blocks all web crawlers from accessing the site’s /private/ and /admin/ sections. You can edit it to fit your site’s use case.

    To allow the GenAI crawler to access your site, add the following lines to the file:

    User-agent: DigitalOceanGenAICrawler/1.0
    Allow: /
    

    This targets the GenAI crawler and allows it to access and index all content on your site.

    After making the changes, save the robots.txt file in your site’s root directory.

    To verify the changes, go to https://www.example.com/robots.txt in your browser and confirm that your updates appear.

    Once updated, the GenAI crawler can index your site.
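    Before deploying the file, you can also sanity-check it from the command line. This sketch writes the example rules to a local robots.txt and greps for the GenAI crawler's user agent (the file path and check are illustrative):

```shell
# Write the example rules to a local robots.txt (illustrative path).
cat > robots.txt <<'EOF'
User-agent: DigitalOceanGenAICrawler/1.0
Allow: /
EOF

# Confirm the GenAI crawler's user agent is listed before deploying the file.
if grep -q '^User-agent: DigitalOceanGenAICrawler' robots.txt; then
  result="allowed"
else
  result="missing"
fi
echo "$result"   # allowed
```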

    • Crawling Scope Levels: These rules determine which linked pages the crawler scrapes. The crawler indexes content from the seed URL, including supported media types like .svg, .jpeg, and .png images if specified. However, including images and SVGs may increase the indexing token count. The crawler ignores videos and avoids scraping links in footers, headers, and navigation elements. Downloadable files are processed only if they fall within the defined crawling scope; otherwise, they are ignored.
      Scoped (Most Narrow): Crawls only the seed URL and ignores all links.
      Seed URL example: https://www.example.com/products/ai-ml/
      Crawls: Only this page.

      URL and all linked pages in path (Narrow): Crawls the seed URL and all pages within the same URL path, ignoring pages outside this path.
      Seed URL example: https://www.example.com/docs/
      Crawls: https://www.example.com/docs/tutorials/, but not https://www.example.com/products/

      URL and all linked pages in domain (Broad): Crawls all pages within the same domain as the seed URL, but does not include subdomains.
      Seed URL example: https://www.example.com/docs/
      Crawls: https://www.example.com/products/, but not https://docs.example.com/

      Subdomains (Most Broad): Crawls all pages within the domain and its subdomains, such as docs.example.com and marketplace.example.com.
      Seed URL example: https://www.example.com/docs/
      Crawls: https://community.example.com/ and all other pages in the domain and its subdomains.

    If you add a seed URL for web crawling, you can check if it’s fully indexed by adding it again and starting a new crawl. If it returns zero tokens, the initial crawl indexed all content.
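    The "URL and all linked pages in path (Narrow)" rule above amounts to a simple prefix test: a discovered link is crawled only if it begins with the seed URL. A minimal sketch of that logic (the function name is illustrative, not part of the platform):

```shell
# Seed URL from the scope examples above.
seed="https://www.example.com/docs/"

# A link is in path scope only if it starts with the seed URL.
in_path_scope() {
  case "$1" in
    "$seed"*) echo "crawled" ;;
    *)        echo "ignored" ;;
  esac
}

in_path_scope "https://www.example.com/docs/tutorials/"   # crawled
in_path_scope "https://www.example.com/products/"         # ignored
```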

  • File: Drag and drop data files from your local storage or click Upload to select the files to add in the file browser.

Note
Due to browser limitations, we recommend uploading files smaller than 2 GB and batches of fewer than 100 files using the control panel. For larger files and batches, use the DigitalOcean API.
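Before uploading through the control panel, you can check your local files against these limits from the command line. A sketch, using a hypothetical kb-data directory seeded with two small sample files:

```shell
# Hypothetical local data directory with two small sample files.
mkdir -p kb-data
printf 'doc one' > kb-data/a.txt
printf 'doc two' > kb-data/b.txt

# Files larger than 2 GB should go through the API instead.
oversized=$(find kb-data -type f -size +2G | wc -l)
# Batches should stay under roughly 100 files in the control panel.
total=$(find kb-data -type f | wc -l)

echo "oversized: $((oversized)), total: $((total))"
```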

Next, click Add selected data source to add the data source.

Adding a data source automatically indexes it and updates the knowledge base. When you add multiple data sources, the knowledge base's embeddings update as soon as each data source finishes indexing, while the remaining data sources continue to index.

Re-index a Data Source

If the data in your data source has changed, you can re-index the data manually to update the vector embeddings.

To re-index all your data sources, click the Update all sources button.

To re-index a specific data source, click the ••• menu to its right and select Update source. In the Confirm source window, click the Update source button to start re-indexing the data source.

You can view the indexing progress and the job details in the banner at the top of the knowledge base page.

For web crawling data sources, you cannot currently re-index a previously crawled seed URL. To re-index its content, remove the seed URL as a data source and add it again to start a new crawl.

Remove a Data Source

To remove a data source, click the ••• menu to its right and select Remove source.
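You can also remove a data source through the API by sending a DELETE request to the data source's endpoint. A sketch, reusing the example knowledge base and data source UUIDs from earlier in this article and running only when `$PREVIEW_API_TOKEN` is set:

```shell
# Example UUIDs from earlier in this article.
KB_UUID="9a6e3975-b0c6-11ef-bf8f-4e013e2ddde4"
DS_UUID="bd2a2db5-b8b0-11ef-bf8f-4e013e2ddde4"
ENDPOINT="https://api.digitalocean.com/v2/gen-ai/knowledge_bases/${KB_UUID}/data_sources/${DS_UUID}"

# Only call the API when a token is available.
if [ -n "$PREVIEW_API_TOKEN" ]; then
  curl -X DELETE \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $PREVIEW_API_TOKEN" \
    "$ENDPOINT"
fi
```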
