
How to Manage Data Sources

Validated on 15 Apr 2026 • Last edited on 8 May 2026

DigitalOcean Knowledge Bases let you store, index, and retrieve data from private files, websites, Spaces buckets, and other sources to power retrieval-augmented generation with your own content.


You can add or remove your data sources as needed.

To manage your data sources, go to the DigitalOcean Control Panel. In the left menu, click DATA SERVICES, and then click Knowledge Bases.

Then, find the knowledge base whose data sources you want to manage. On the right of it, click …, and then click Manage data sources to open the Data sources tab.

Add a Data Source Using the Control Panel

You can add multiple types of data sources and include as many as needed. To save processing time and cost, organize your files in dedicated Spaces buckets, specific folders, or local storage containing only relevant files.

To avoid delays, we recommend uploading fewer than 100 files at a time, each under 2 GB. For larger uploads, use the DigitalOcean API. If uploads continue to stall, contact support.
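The upload recommendation above can be applied to a local file list before you start uploading. The following is a minimal sketch, not part of any DigitalOcean tooling: the helper name `plan_upload_batches` is hypothetical, and it assumes "2 GB" means 2 × 10^9 bytes, since the exact unit is not stated here.

```python
from typing import Iterable

MAX_FILES_PER_BATCH = 99      # "fewer than 100 files at a time"
MAX_FILE_BYTES = 2 * 10**9    # assumption: 2 GB read as decimal gigabytes


def plan_upload_batches(files: Iterable[tuple[str, int]]):
    """Split (name, size_in_bytes) pairs into upload batches.

    Returns (batches, oversized). Each batch holds at most 99 names.
    Files at or above the size limit are set aside in `oversized`
    as candidates for uploading through the API instead.
    """
    batches: list[list[str]] = []
    oversized: list[str] = []
    current: list[str] = []
    for name, size in files:
        if size >= MAX_FILE_BYTES:
            oversized.append(name)
            continue
        current.append(name)
        if len(current) == MAX_FILES_PER_BATCH:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches, oversized
```

Passing 150 small files and one 3 GB file yields two batches (99 and 51 files) and one oversized entry.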

To add a data source, on the right, click Add source to open the Add Data Source page.

Under the Select data sources to index section, select the type of data you want to add.

Knowledge bases support the following text-based file formats: .csv, .eml, .epub, .xls, .xlsx, .html, .md, .odt, .pdf, .txt, .rst, .rtf, .tsv, .doc, .docx, .xml, .json, and .jsonl. When supported files contain embedded media, such as images or SVGs, we also attempt to index that content.
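To keep unsupported files out of a bucket or upload folder, you can filter file names against the format list above before adding them. This is an illustrative sketch: the helper names are hypothetical, and matching extensions case-insensitively is an assumption rather than documented behavior.

```python
from pathlib import PurePath

# Extensions copied from the supported formats listed above.
SUPPORTED_EXTENSIONS = {
    ".csv", ".eml", ".epub", ".xls", ".xlsx", ".html", ".md", ".odt", ".pdf",
    ".txt", ".rst", ".rtf", ".tsv", ".doc", ".docx", ".xml", ".json", ".jsonl",
}


def is_supported(filename: str) -> bool:
    """Return True if the file's final extension is in the supported list."""
    # Assumption: extension matching ignores case (e.g. ".PDF").
    return PurePath(filename).suffix.lower() in SUPPORTED_EXTENSIONS


def split_by_support(filenames):
    """Partition file names into (supported, skipped) lists."""
    supported = [name for name in filenames if is_supported(name)]
    skipped = [name for name in filenames if not is_supported(name)]
    return supported, skipped
```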

You can add any of the following data sources:

File Upload

To upload files, click Upload a file to open the Select files to upload window.

For performance and reliability, we recommend uploading files no larger than 2 GB and uploading fewer than 100 files at a time.

Under the Choose Files section, either click Upload, or drag-and-drop at least one file.

If you want to add more files, on the bottom right, click Upload more files.

If you want to remove a file, on the right of it, click the trash icon.

Spaces Bucket or Folder

To add a Spaces bucket or folder, click Pull from a Spaces bucket or folder to open the Select Spaces bucket or folder window.

We can index all supported file formats in selected buckets and folders, regardless of privacy settings.

Then, either choose at least one bucket or folder you want to index, or on the left of a bucket, click + to expand its contents, and then select specific folders. For optimal performance and indexing quality, we recommend using five or fewer buckets and storing only the data you want indexed in them.

Web or Site Map URL

Note

When you specify a website URL as a data source for your knowledge base, DigitalOcean uses a custom agent named DigitalOceanGradientAICrawler/1.0 to index the website content. The crawler indexes up to 5,500 pages and skips inaccessible or disallowed links to prevent excessively large indexing jobs.

Depending on the behavior you select, the crawler follows HTML links on the site, indexes text and certain image types, and ignores videos and navigation links. It respects the website’s robots.txt rules, including any Disallow directives or the wildcard *.
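Before adding a site, you can check whether its robots.txt rules would block the crawler's user agent. The sketch below uses Python's standard `urllib.robotparser` on robots.txt text you supply. It shows how that parser reads the rules, which is an approximation: the crawler's own interpretation may differ. The helper name and sample rules are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# User agent string taken from the note above.
CRAWLER_USER_AGENT = "DigitalOceanGradientAICrawler/1.0"

# Hypothetical robots.txt content used only for illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def is_crawl_allowed(robots_txt: str, url: str) -> bool:
    """Return True if the given robots.txt text permits the crawler to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(CRAWLER_USER_AGENT, url)
```

With the sample rules, a URL under /private/ is reported as disallowed and other paths as allowed.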

To add a URL for web crawling, click Add a web or site map URL. You can then choose to specify a Seed URL or a Site map URL.

Specify Seed URL

Specifying a seed URL crawls the seed URL and, depending on the crawl scope you select, linked pages within the same path, domain, or subdomains.

To specify a seed URL, click Seed URL, and then in the Seed URL field, enter the public URL you want to crawl.

Under the Crawling rules section, select the crawl scope (from most narrow to most broad):

Scoped crawls only the seed URL.
URL and all linked pages in path crawls the seed URL and all pages within the same path.
URL and all linked pages in domain crawls all pages in the same domain.
Subdomains crawls the domain and all its subdomains.
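One way to reason about the four scopes is as progressively looser URL filters. The sketch below is only an interpretation of the descriptions above, not the crawler's actual matching logic: the scope labels (`seed_only`, `path`, `domain`, `subdomains`) and the function name are hypothetical, and details such as trailing-slash handling are assumptions.

```python
from urllib.parse import urlsplit


def in_scope(seed_url: str, candidate_url: str, scope: str) -> bool:
    """Return True if candidate_url falls inside the chosen scope of seed_url.

    Scope labels are illustrative, ordered from narrowest to broadest:
    "seed_only", "path", "domain", "subdomains".
    """
    seed, cand = urlsplit(seed_url), urlsplit(candidate_url)
    seed_host = (seed.hostname or "").lower()
    cand_host = (cand.hostname or "").lower()

    if scope == "seed_only":
        return candidate_url == seed_url
    if scope == "path":
        # Assumption: "same path" means the seed path or anything beneath it.
        prefix = seed.path if seed.path.endswith("/") else seed.path + "/"
        same_path = cand.path == seed.path or cand.path.startswith(prefix)
        return cand_host == seed_host and same_path
    if scope == "domain":
        return cand_host == seed_host
    if scope == "subdomains":
        return cand_host == seed_host or cand_host.endswith("." + seed_host)
    raise ValueError(f"unknown scope: {scope}")
```

For a seed of https://example.com/docs, the page https://example.com/blog is outside the `path` scope but inside `domain`, and https://app.example.com/ is only inside `subdomains`.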

Click the Index embedded media option to index supported images and other media encountered during the crawl.

Click the Include headers and footers navigation links option to include each page’s header and footer content, such as the navigation links they contain.

Specify Site Map URL

Specifying the site map URL crawls only URLs listed in the site map.

To crawl other URLs, use the Seed URL option, or add another web crawling data source.

To specify a site map URL, click Sitemap URL, and then in the Sitemap URL field, enter the URL you want to crawl. For example, docs.digitalocean.com/sitemap.xml.

The site map must be an .xml file that lists the specific URLs to crawl. Using a site map URL lets you add a scoped set of URLs all at once instead of adding them individually or choosing a crawling rule for a seed URL.
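To preview which URLs a site map would contribute, you can list its entries yourself. This is a minimal sketch that assumes the file follows the standard sitemaps.org `urlset` schema. It does not handle sitemap index files, and the function name and sample content are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical site map content used only for illustration.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs/</loc></url>
</urlset>"""


def list_sitemap_urls(sitemap_xml: str) -> list[str]:
    """Return the <loc> values of every <url> entry in a urlset site map."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]
```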

Click the Index embedded media option to index supported images and other media encountered during the crawl.

Click the Include headers and footers navigation links option to include each page’s header and footer content, such as the navigation links they contain.

Dropbox Folder

If you haven’t connected your Dropbox account, on the right of the Pull from a Dropbox folder option, click Connect account to first log in to your Dropbox account and authorize the connection.

To add a Dropbox folder, click Pull from a Dropbox folder, and then choose at least one folder you want to index, or on the left of a folder, click + to expand its contents and select specific folders.

Amazon S3 Bucket or Folder

To add an Amazon S3 bucket or folder, click Pull from an AWS S3 bucket folder.

In the Access Key ID field, enter the IAM access key ID for your S3 bucket or folder.

In the Secret Key field, enter the secret key associated with your access key ID.

In the Bucket Name field, enter the name of the S3 bucket to index.

In the Region field, enter the AWS region where your S3 bucket or folder is located, such as us-east-1 or eu-west-1.

On the right of the Region field, click + to add the S3 bucket.

If you want to control how the data source is split into chunks during indexing, click Advanced Options to configure its chunking strategy. By default, all data sources use section-based chunking. For more information about chunking strategies, see our chunking strategy best practices.

Then, click Add selected data source.

Below the Data sources to be indexed section, review each data source’s estimated size, configuration, and status:

Ready: The data source is uploaded and ready for indexing.
Uploading: The data source is still uploading and isn’t ready for indexing.
Error: Upload or processing failed. If your data source failed, remove the data source and try again. If the issue persists, contact support.

Note

Size estimates are available only for sources with known values, such as Spaces buckets and uploaded files. Other sources show a size after the initial indexing job completes.

If you want to remove a data source, click the trash icon next to it.

Under Summary, review the embeddings model and token cost, total estimated dataset size, and number of data sources.

To estimate indexing costs, click How much will I pay for an indexing job? to open the Estimating indexing job costs window. Larger datasets cost more to index, but you only pay for successfully indexed data. Final costs may vary. For details, see embeddings model pricing.

After reviewing the data source, click Index added source.

If you added a seed or site map URL as a data source, you can verify that the crawl indexed everything by adding the same seed or site map URL again as a new data source. If the indexing job results for the duplicate show zero tokens, the original crawl indexed all content, and you can delete the duplicate.

Add a Data Source via the API

To add a data source via the API, provide the knowledge base’s unique identifier and specify the source you want to index, such as a bucket, folder, file, or URL.

To retrieve knowledge base IDs, use the /v2/gen-ai/knowledge_bases endpoint.
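As a rough illustration, the request to that endpoint can be built with Python's standard library. The helper names are hypothetical, the token is assumed to be a personal access token sent as a bearer token (as in the cURL examples on this page), and the response is returned as raw text because its JSON shape is not described here.

```python
import urllib.request

API_BASE = "https://api.digitalocean.com"


def build_list_knowledge_bases_request(token: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for the knowledge base list."""
    return urllib.request.Request(
        f"{API_BASE}/v2/gen-ai/knowledge_bases",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def send_request(request: urllib.request.Request) -> str:
    """Send a prepared request and return the response body as text."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `send_request(build_list_knowledge_bases_request(token))` with a valid token performs the actual API call.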

You can optionally configure chunking for each data source:

chunking_algorithm: The chunking strategy (section, semantic, hierarchical, or fixed).
chunking_options: A configuration object defining parameters such as max_chunk_size, semantic_threshold, parent_chunk_size, or child_chunk_size.

Changing the chunking configuration later triggers a re-indexing job, which consumes tokens.

After you add the data source, indexing starts automatically to make its content available for retrieval.

How to Add a Data Source Using the DigitalOcean API

Create a personal access token and save it for use with the API.

cURL

Send a POST request to https://api.digitalocean.com/v2/gen-ai/knowledge_bases/{knowledge_base_uuid}/data_sources.

Using cURL:

curl -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DIGITALOCEAN_TOKEN" \
  "https://api.digitalocean.com/v2/gen-ai/knowledge_bases/20cd8434-6ea1-11f0-bf8f-4e013e2ddde4/data_sources" \
  -d '{
    "knowledge_base_uuid": "20cd8434-6ea1-11f0-bf8f-4e013e2ddde4",
    "web_crawler_data_source": {
      "base_url": "https://example.com",
      "crawling_option": "SCOPED",
      "embed_media": false,
      "exclude_tags": ["nav", "footer", "header", "aside", "script", "style", "form", "iframe", "noscript"]
    },
    "chunking_algorithm": "CHUNKING_ALGORITHM_SECTION_BASED",
    "chunking_options": {
      "max_chunk_size": 500
    }
  }'
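For scripts that do not shell out to cURL, the same POST request can be sketched with Python's standard library. This is an illustrative equivalent of the command above, not an official client: the function name is hypothetical, and the body fields are copied from the cURL example.

```python
import json
import urllib.request

# Request body copied from the cURL example above.
EXAMPLE_PAYLOAD = {
    "knowledge_base_uuid": "20cd8434-6ea1-11f0-bf8f-4e013e2ddde4",
    "web_crawler_data_source": {
        "base_url": "https://example.com",
        "crawling_option": "SCOPED",
        "embed_media": False,
        "exclude_tags": ["nav", "footer", "header", "aside", "script",
                         "style", "form", "iframe", "noscript"],
    },
    "chunking_algorithm": "CHUNKING_ALGORITHM_SECTION_BASED",
    "chunking_options": {"max_chunk_size": 500},
}


def build_add_data_source_request(token: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request that adds a data source."""
    kb_uuid = payload["knowledge_base_uuid"]
    url = f"https://api.digitalocean.com/v2/gen-ai/knowledge_bases/{kb_uuid}/data_sources"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
```

Passing the returned request to `urllib.request.urlopen` sends it. Keeping construction separate from sending makes the request easy to inspect first.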

To confirm the data source was added, list the knowledge base’s data sources.
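A listing request might look like the sketch below. It assumes that the data sources are listed with a GET request to the same `/data_sources` path used for adding them; that path is an assumption to confirm against the API reference, and the function name is hypothetical.

```python
import urllib.request


def build_list_data_sources_request(token: str, knowledge_base_uuid: str) -> urllib.request.Request:
    """Build (but do not send) a GET request listing a knowledge base's data sources."""
    # Assumption: listing uses GET on the same path that POST uses for adding.
    url = (
        "https://api.digitalocean.com/v2/gen-ai/knowledge_bases/"
        f"{knowledge_base_uuid}/data_sources"
    )
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )
```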

Remove a Data Source Using the Control Panel

Removing a data source removes it and its configurations, such as chunking settings, from the knowledge base without deleting the original file, folder, bucket, or URL.

To remove a data source, on the right of the data source you want to delete, click the trash icon to open the Remove data source window.

Then, enter the name of the data source to confirm its removal, and then click Destroy.

After removal, the knowledge base automatically reindexes the remaining data sources. You can track the reindexing process in the Activity tab.

Remove a Data Source via the API

To remove a data source via the API, provide the knowledge base ID and the data source ID. Removing a data source removes it and its configurations, such as chunking settings, from the knowledge base without deleting the original file, folder, bucket, or URL.

You can find data source IDs by listing the knowledge base’s data sources.

How to Remove a Data Source Using the DigitalOcean API

Create a personal access token and save it for use with the API.

cURL

Send a DELETE request to https://api.digitalocean.com/v2/gen-ai/knowledge_bases/{knowledge_base_uuid}/data_sources/{data_source_uuid}.

Using cURL:

curl -X DELETE \
  -H "Content-Type: application/json"  \
  -H "Authorization: Bearer $DIGITALOCEAN_TOKEN" \
  "https://api.digitalocean.com/v2/gen-ai/knowledge_bases/9a6e3975-b0c6-11ef-bf8f-4e013e2ddde4/data_sources/bd2a2db5-b8b0-11ef-bf8f-4e013e2ddde4"

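The DELETE request above can also be sketched with Python's standard library. As with the cURL command, it needs both the knowledge base ID and the data source ID. The function name is hypothetical, and the example IDs in the usage note are the ones from the cURL command.

```python
import urllib.request


def build_remove_data_source_request(
    token: str, knowledge_base_uuid: str, data_source_uuid: str
) -> urllib.request.Request:
    """Build (but do not send) the DELETE request that removes a data source."""
    url = (
        "https://api.digitalocean.com/v2/gen-ai/knowledge_bases/"
        f"{knowledge_base_uuid}/data_sources/{data_source_uuid}"
    )
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="DELETE",
    )
```

For example, `build_remove_data_source_request(token, "9a6e3975-b0c6-11ef-bf8f-4e013e2ddde4", "bd2a2db5-b8b0-11ef-bf8f-4e013e2ddde4")` targets the same URL as the cURL command, and passing the result to `urllib.request.urlopen` sends it.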