How to Manage Data Sources
Validated on 15 Apr 2026 • Last edited on 27 Apr 2026
DigitalOcean Knowledge Bases let you store, index, and retrieve data from private files, websites, Spaces buckets, and other sources to power retrieval-augmented generation with your own content.
You can add or remove your data sources as needed.
To manage your data sources, go to the DigitalOcean Control Panel, click DATA SERVICES in the left menu, and then click Knowledge Bases.
Then, find the knowledge base whose data sources you want to manage, click … on its right, and then click Manage data sources to open the Data sources tab.
Add a Data Source Using the Control Panel
You can add multiple types of data sources and include as many as needed. To save processing time and cost, organize your files in dedicated Spaces buckets, specific folders, or local storage containing only relevant files.
To avoid delays, we recommend uploading fewer than 100 files at a time, each under 2 GB. For larger uploads, use the DigitalOcean API. If uploads continue to stall, contact support.
To add a data source, on the right, click Add source to open the Add Data Source page.
Under the Select data sources to index section, select the type of data you want to add.
Knowledge bases support the following text-based file formats: .csv, .eml, .epub, .xls, .xlsx, .html, .md, .odt, .pdf, .txt, .rst, .rtf, .tsv, .doc, .docx, .xml, .json, and .jsonl. When supported files contain embedded media, such as images or SVGs, we also attempt to index that content.
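As a rough illustration, the following Python sketch pre-checks a list of file names against the formats listed above before you upload them. The function name split_by_support is hypothetical, and matching extensions case-insensitively is an assumption, not documented behavior.

```python
from pathlib import Path

# Extensions copied from the supported formats listed above.
SUPPORTED_EXTENSIONS = {
    ".csv", ".eml", ".epub", ".xls", ".xlsx", ".html", ".md", ".odt", ".pdf",
    ".txt", ".rst", ".rtf", ".tsv", ".doc", ".docx", ".xml", ".json", ".jsonl",
}


def split_by_support(filenames):
    """Return (supported, unsupported) lists of file names.

    Assumption: extensions are compared case-insensitively.
    """
    supported, unsupported = [], []
    for name in filenames:
        if Path(name).suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(name)
        else:
            unsupported.append(name)
    return supported, unsupported
```

For example, calling split_by_support(["a.pdf", "clip.mp4"]) places a.pdf in the first list and clip.mp4 in the second.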
You can add any of the following data sources:
To upload files, click Upload a file to open the Select files to upload window.
For performance and reliability, we recommend uploading files no larger than 2 GB and uploading fewer than 100 files at a time.
Under the Choose Files section, either click Upload, or drag-and-drop at least one file.
If you want to add more files, on the bottom right, click Upload more files.
If you want to remove a file, on the right of it, click the trash icon.
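If you prepare uploads with a script, one possible way to follow the size and count recommendations is to group files into batches beforehand. The sketch below is illustrative only: plan_uploads is a hypothetical helper, and it assumes that 2 GB means 2 × 10^9 bytes.

```python
MAX_FILES_PER_BATCH = 99        # "fewer than 100 files at a time"
MAX_FILE_BYTES = 2 * 1000**3    # assumption: 2 GB = 2 * 10^9 bytes


def plan_uploads(files):
    """Group files into upload batches.

    files: list of (name, size_in_bytes) tuples.
    Returns (batches, oversized), where batches is a list of lists of
    names and oversized lists the files above the recommended size.
    """
    within_limit = [name for name, size in files if size <= MAX_FILE_BYTES]
    oversized = [name for name, size in files if size > MAX_FILE_BYTES]
    batches = [
        within_limit[i:i + MAX_FILES_PER_BATCH]
        for i in range(0, len(within_limit), MAX_FILES_PER_BATCH)
    ]
    return batches, oversized
```

Files reported as oversized are candidates for the API-based upload path rather than the control panel.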
To add a Spaces bucket or folder, click Pull from a Spaces bucket or folder to open the Select Spaces bucket or folder window.
We can index all supported file formats in selected buckets and folders, regardless of privacy settings.
Then, either choose at least one bucket or folder you want to index, or on the left of a bucket, click + to expand its contents, and then select specific folders. For optimal performance and indexing quality, we recommend using five or fewer buckets and storing only the data you want indexed in those buckets.
Note
When you specify a website URL as a data source for your knowledge base, DigitalOcean uses a custom agent named DigitalOceanGradientAICrawler/1.0 to index the website content. The crawler indexes up to 5,500 pages and skips inaccessible or disallowed links to prevent excessively large indexing jobs.
Depending on the behavior you select, the crawler follows HTML links on the site, indexes text and certain image types, and ignores videos and navigation links. It respects the website’s robots.txt rules, including any Disallow directives or the wildcard *.
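To preview how a site's robots.txt rules might apply to the crawler, you can evaluate them locally with Python's standard urllib.robotparser module. This is only an approximation: the crawler's own rule matching may differ from Python's parser, and the allowed_by_robots helper and example rules are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# User agent named in the note above.
CRAWLER_USER_AGENT = "DigitalOceanGradientAICrawler/1.0"


def allowed_by_robots(robots_txt, url, user_agent=CRAWLER_USER_AGENT):
    """Return True if Python's robots.txt parser would allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content used for illustration.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""
```

With these example rules, a URL under /private/ is reported as disallowed, while other paths on the same site are allowed.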
To add a URL for web crawling, click Add a web or site map URL. You can then choose to specify a Seed URL or a Site map URL.
Specify Seed URL
Specifying a seed URL crawls only the seed URL and linked pages within the same path, domain, or subdomains.
To specify a seed URL, click Seed URL, and then in the Seed URL field, enter the public URL you want to crawl.
Under the Crawling rules section, select the crawl scope (from most narrow to most broad):
Scoped crawls only the seed URL.
URL and all linked pages in path crawls the seed URL and all pages within the same path.
URL and all linked pages in domain crawls all pages in the same domain.
Subdomains crawls the domain and all its subdomains.
Click the Index embedded media option to index supported images and other media encountered during the crawl.
Click the Include headers and footers navigation links option to also index each page’s header and footer content, such as navigation links.
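The crawl scopes described above can be read as progressively broader URL filters. The following sketch shows one possible interpretation. It is not the crawler's actual logic: the scope names ("scoped", "path", "domain", "subdomains") and the in_scope function are hypothetical, and it assumes that the domain scope means the same host as the seed URL.

```python
from urllib.parse import urlsplit


def in_scope(seed_url, candidate_url, scope):
    """Return True if candidate_url falls within the given scope of seed_url.

    Illustrative reading of the crawl scopes, not the crawler's real rules.
    """
    seed, cand = urlsplit(seed_url), urlsplit(candidate_url)
    seed_host = (seed.hostname or "").lower()
    cand_host = (cand.hostname or "").lower()
    seed_path = seed.path.rstrip("/")
    cand_path = cand.path.rstrip("/")

    if scope == "scoped":
        # Only the seed URL itself.
        return candidate_url.rstrip("/") == seed_url.rstrip("/")
    if scope == "path":
        # Same host, and the seed path or anything beneath it.
        same_path = cand_path == seed_path or cand.path.startswith(seed_path + "/")
        return cand_host == seed_host and same_path
    if scope == "domain":
        # Assumption: "same domain" means the same host as the seed.
        return cand_host == seed_host
    if scope == "subdomains":
        # The seed host plus any host that ends with ".<seed host>".
        return cand_host == seed_host or cand_host.endswith("." + seed_host)
    raise ValueError(f"unknown scope: {scope}")
```

Under this reading, a seed of https://example.com/docs/ with the path scope covers https://example.com/docs/guide but not https://example.com/blog.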
Specify Site Map URL
Specifying the site map URL crawls only URLs listed in the site map.
To crawl other URLs, use the Seed URL option, or add another web crawling data source.
To specify a site map URL, click Sitemap URL, and then in the Sitemap URL field, enter the URL you want to crawl. For example, docs.digitalocean.com/sitemap.xml.
The site map must be an .xml file that lists the specific URLs to crawl. You can use a site map URL to add scoped URLs all at once instead of adding them individually or choosing a crawling rule for a seed URL.
Click the Index embedded media option to index supported images and other media encountered during the crawl.
Click the Include headers and footers navigation links option to also index each page’s header and footer content, such as navigation links.
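Before adding a site map as a data source, it can help to check which URLs it actually lists. The sketch below parses a site map with Python's standard library, assuming the file follows the sitemaps.org 0.9 schema. The sitemap_urls helper and the example content are hypothetical, and site map index files that point to other site maps are not handled.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org 0.9 schema.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the URLs listed in the <loc> elements of a site map document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical site map content used for illustration.
EXAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs/</loc></url>
</urlset>"""
```

Calling sitemap_urls(EXAMPLE_SITEMAP) returns the two example URLs in document order.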
If you haven’t connected your Dropbox account, on the right of the Pull from a Dropbox folder option, click Connect account to first log in to your Dropbox account and authorize the connection.
To add a Dropbox folder, click Pull from a Dropbox folder, and then choose at least one folder you want to index, or on the left of a folder, click + to expand its contents and select specific folders.
To add an Amazon S3 bucket or folder, click Pull from an AWS S3 bucket folder.
In the Access Key ID field, enter the IAM access key ID for your S3 bucket or folder.
In the Secret Key field, enter the secret key associated with your access key ID.
In the Bucket Name field, enter the name of the S3 bucket to index.
In the Region field, enter the AWS region where your S3 bucket or folder is located, such as us-east-1 or eu-west-1.
On the right of the Region field, click + to add the S3 bucket.
If you want to control how the data source is split into chunks during indexing, click Advanced Options to configure its chunking strategy. By default, all data sources use section-based chunking. For more information about chunking strategies, see our chunking strategy best practices.
Then, click Add selected data source.
Below the Data sources to be indexed section, review each data source’s estimated size, configuration, and status:
Ready: The data source is uploaded and ready for indexing.
Uploading: The data source is still uploading and isn’t ready for indexing.
Note
Size estimates are available only for sources with known values, such as Spaces buckets and uploaded files. Other sources show a size after the initial indexing job completes.
If you want to remove a data source, click the trash icon next to it.
Under Summary, review the embeddings model and token cost, total estimated dataset size, and number of data sources.
To estimate indexing costs, click How much will I pay for an indexing job? to open the Estimating indexing job costs window. Larger datasets cost more to index, but you only pay for successfully indexed data. Final costs may vary. For details, see embeddings model pricing.
After reviewing the data source, click Index added source.
If you added a seed or site map URL as a data source, you can verify that the crawl indexed all content by re-adding the same seed or site map URL as a new data source. If the indexing job results for the duplicate data source show zero tokens, the original crawl indexed all content, and you can delete the duplicate.
Add a Data Source via the API
To add a data source via the API, provide the knowledge base’s unique identifier and specify the source you want to index, such as a bucket, folder, file, or URL.
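As a hedged illustration, the following Python sketch builds such a request without sending it. The endpoint path and the request body field names (web_crawler_data_source, base_url, crawling_option, embed_media) are assumptions to verify against the DigitalOcean API reference. The function name, token, and IDs are placeholders.

```python
import json
import urllib.request

API_BASE = "https://api.digitalocean.com"


def build_add_data_source_request(token, knowledge_base_id, source):
    """Build (but do not send) a POST request that adds a data source.

    Assumption: the endpoint path below matches the current API reference.
    """
    url = f"{API_BASE}/v2/gen-ai/knowledge_bases/{knowledge_base_id}/data_sources"
    return urllib.request.Request(
        url,
        data=json.dumps(source).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Assumed body shape for a web crawling source; field names are not confirmed here.
EXAMPLE_SOURCE = {
    "web_crawler_data_source": {
        "base_url": "https://example.com/docs/",
        "crawling_option": "PATH",
        "embed_media": False,
    }
}
```

To send the request, pass the returned object to urllib.request.urlopen and inspect the JSON response.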
Remove a Data Source Using the Control Panel
Removing a data source removes it and its configurations, such as chunking settings, from the knowledge base without deleting the original file, folder, bucket, or URL.
To remove a data source, on the right of the data source you want to delete, click the trash icon to open the Remove data source window.
Then, enter the name of the data source to confirm its removal, and then click Destroy.
After removal, the knowledge base automatically reindexes the remaining data sources. You can track the reindexing process in the Activity tab.
Remove a Data Source via the API
To remove a data source via the API, provide the knowledge base ID and the data source ID. Removing a data source removes it and its configurations, such as chunking settings, from the knowledge base without deleting the original file, folder, bucket, or URL.
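A minimal Python sketch of the removal request is shown below. It only constructs the request. The endpoint path is an assumption to confirm in the DigitalOcean API reference, and the function name, token, and IDs are placeholders.

```python
import urllib.request

API_BASE = "https://api.digitalocean.com"


def build_remove_data_source_request(token, knowledge_base_id, data_source_id):
    """Build (but do not send) a DELETE request that removes a data source.

    Assumption: the endpoint path below matches the current API reference.
    """
    url = (
        f"{API_BASE}/v2/gen-ai/knowledge_bases/"
        f"{knowledge_base_id}/data_sources/{data_source_id}"
    )
    return urllib.request.Request(
        url,
        method="DELETE",
        headers={"Authorization": f"Bearer {token}"},
    )
```

As with the control panel, removing the data source this way is expected to leave the original file, folder, bucket, or URL untouched.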