How to Add, Re-index, and Remove Data Sources

Validated on 28 Apr 2025 • Last edited on 20 May 2025

DigitalOcean GenAI Platform lets you build GPU-powered AI agents with fully-managed deployment. Agents can use pre-built or custom foundation models, incorporate function and agent routes, and implement RAG pipelines with knowledge bases.

A data source is the custom content you provide, such as files, folders, or URLs, that the knowledge base processes into vector embeddings and stores in an associated OpenSearch database. Each knowledge base requires at least one data source. You can either add a data source to an existing knowledge base or add one during knowledge base creation.

We support the following data sources:

  • Spaces Bucket or Folder, index files stored in a DigitalOcean Spaces bucket or specific folders.
  • File Upload, upload and index files directly from your local machine.
  • URL for Web Crawling, use a seed URL to crawl and index content from a public website.

Add a Data Source Using the Control Panel

You can add new data sources to an existing knowledge base at any time. Each source is automatically indexed, and the knowledge base updates as indexing completes. If you add multiple sources, the knowledge base starts using embeddings from each source as soon as its indexing finishes.

To add a data source from the DigitalOcean Control Panel, click GenAI Platform in the left-hand menu, click the Knowledge Bases tab, then find and select the knowledge base you want to update. On the right of the knowledge base, open its menu and select Manage data sources. Then, on the Data Sources page, click Add source to open the Add Data Source page.

Add a Data Source

In the Add Data Source page, under the Select data source to index section, click Select data source to open the selection window. Use the Data source dropdown menu to choose the type of data source to add.

Info
For web crawling data sources, the crawler indexes up to 5500 pages and skips inaccessible or disallowed links to prevent excessively large indexing jobs.

You can add multiple types of data sources to a knowledge base and include as many as needed. To save processing time and cost, organize your files in dedicated Spaces buckets, specific folders, or local storage containing only relevant files.

Supported File Formats

We support a wide range of text-based file formats, including: .csv, .eml, .epub, .xls, .xlsx, .html, .md, .odt, .pdf, .txt, .rst, .rtf, .tsv, .doc, .docx, .xml, .json, and .jsonl.

Info
PowerPoint files (.ppt, .pptx) are partially supported. We extract text but do not process images or other visual content. Image files (such as .png, .jpeg, .tiff, and .bmp) are not currently supported.
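As a rough pre-flight check before uploading, the format list above can be encoded as a lookup table. This is an illustrative sketch only: the function name `classify_file` is hypothetical, and matching extensions case-insensitively is an assumption. Because the list above is introduced with "including", a result of "unlisted" means only that the extension does not appear in this guide.

```python
from pathlib import Path

# Extensions copied from the supported-formats list in this guide.
FULLY_SUPPORTED = {
    ".csv", ".eml", ".epub", ".xls", ".xlsx", ".html", ".md", ".odt", ".pdf",
    ".txt", ".rst", ".rtf", ".tsv", ".doc", ".docx", ".xml", ".json", ".jsonl",
}
# PowerPoint files are partially supported: text is extracted, images are not.
TEXT_ONLY = {".ppt", ".pptx"}


def classify_file(name: str) -> str:
    """Return 'supported', 'text-only', or 'unlisted' for a file name."""
    # Assumption: extension matching is case-insensitive.
    suffix = Path(name).suffix.lower()
    if suffix in FULLY_SUPPORTED:
        return "supported"
    if suffix in TEXT_ONLY:
        return "text-only"
    return "unlisted"


if __name__ == "__main__":
    for name in ["report.PDF", "slides.pptx", "diagram.png"]:
        print(name, "->", classify_file(name))
```

Running a check like this over a folder before indexing can help keep a bucket limited to files that will actually be embedded.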

You can add any of the following data sources:

Add a Spaces Bucket or Folder

Add entire Spaces buckets or select specific folders to organize files in your knowledge base. The system indexes all supported file formats in selected buckets and folders, regardless of privacy settings.

For optimal performance and indexing quality:

  • Include only indexing data, keep bucket contents limited to files intended for indexing.
  • Use five buckets maximum, limit usage to five buckets or fewer for best performance.
  • Use supported file formats, ensure your files use supported formats.

Click + next to a bucket to expand and select folders.

Add a File Upload

In the Choose Files section, drag and drop files from your local storage, or click Upload to select them manually.

Add a URL for Web Crawling

The web crawler indexes only publicly accessible content, follows HTML links, supports certain image types, ignores videos and navigation links, and respects robots.txt rules.

In the Seed URL field, type the public URL you want to crawl.

Under the Crawling Rules section, define the crawl scope:

  • Scoped, crawls only the seed URL.
  • Path, crawls the seed URL and all pages within the same path.
  • Domain, crawls all pages in the same domain.
  • Subdomains, crawls the domain and all its subdomains.
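To make the four scopes concrete, the sketch below shows one way to decide whether a discovered link falls inside a given scope. It is an interpretation of the descriptions above, not the crawler's actual logic. The function name `in_scope`, the lowercase scope strings, the trailing-slash handling, and reading "same path" as "shares the seed's path prefix" are all assumptions.

```python
from urllib.parse import urlsplit


def in_scope(seed: str, candidate: str, scope: str) -> bool:
    """Return True if candidate falls inside the crawl scope chosen for seed."""
    s, c = urlsplit(seed), urlsplit(candidate)
    s_host = (s.hostname or "").lower()
    c_host = (c.hostname or "").lower()
    # Assumption: a trailing slash does not distinguish two pages.
    s_path, c_path = s.path.rstrip("/"), c.path.rstrip("/")

    if scope == "scoped":
        # Only the seed URL itself.
        return (s_host, s_path) == (c_host, c_path)
    if scope == "path":
        # The seed URL plus anything beneath its path on the same host.
        return s_host == c_host and (
            c_path == s_path or c_path.startswith(s_path + "/")
        )
    if scope == "domain":
        # Any page on exactly the same host.
        return s_host == c_host
    if scope == "subdomains":
        # The seed's host plus any host that ends with ".<seed host>".
        return c_host == s_host or c_host.endswith("." + s_host)
    raise ValueError(f"unknown scope: {scope!r}")
```

With a seed of `https://example.com/docs/`, the link `https://example.com/docs/api` is inside the path scope under these assumptions, while `https://blog.example.com/` is only inside the subdomains scope.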

To verify the crawl completed, re-add the same seed URL as a new data source. If it shows zero tokens, the original crawl indexed all content and you can delete the duplicate.

For smooth file uploads, keep each batch under 100 files, with each file no larger than 2 GB. For larger files or batches, use the DigitalOcean API.
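The upload guidance above can be checked locally before you start a batch. The following is a minimal sketch under stated assumptions: the helper name `check_upload_batch` is hypothetical, and "2 GB" is read as 2 × 1024³ bytes, which the guide does not specify.

```python
MAX_FILES = 99                # "under 100 files" per batch
MAX_FILE_BYTES = 2 * 1024**3  # assumption: 2 GB interpreted as 2 GiB


def check_upload_batch(sizes: dict[str, int]) -> list[str]:
    """Return a list of problems; an empty list means the batch fits the guidance.

    sizes maps each file name to its size in bytes.
    """
    problems = []
    if len(sizes) > MAX_FILES:
        problems.append(f"{len(sizes)} files in batch; keep it under 100")
    for name, size in sizes.items():
        if size > MAX_FILE_BYTES:
            problems.append(f"{name} is larger than 2 GB")
    return problems
```

For example, `check_upload_batch({p.name: p.stat().st_size for p in folder.iterdir()})` would report any oversized files in a local folder, where `folder` is a `pathlib.Path` you supply.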

After selecting your data source, click Add selected data source. If needed, you can add more files later.

Review Index Prices

The How much will I pay? section shows indexing costs, which depend on the embedding model you choose and the size of the data you’re embedding. You cannot change the embedding model for existing knowledge bases. To use a different model, create a new knowledge base.

The pricing table shows estimated token counts and indexing costs based on your dataset size and the selected model’s token rate. Each row shows the Dataset Size, estimated Token Count, and Indexing Cost. Larger datasets produce more tokens, increasing the cost. Pricing scales linearly with both dataset size and token rate, and you only pay for successfully indexed data. Final costs may vary. For more details on pricing, see our embedding model pricing page.
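Since pricing scales linearly with token count and token rate, a back-of-the-envelope estimate is a single multiplication. In this sketch the function name is hypothetical, and so is the rate of 0.5 USD per million tokens in the usage note. It is a placeholder, not a published price, and the per-million-token unit is also an assumption.

```python
def estimate_indexing_cost(token_count: int, usd_per_million_tokens: float) -> float:
    """Estimate indexing cost as tokens multiplied by a per-million-token rate."""
    if token_count < 0 or usd_per_million_tokens < 0:
        raise ValueError("token_count and rate must be non-negative")
    return token_count / 1_000_000 * usd_per_million_tokens
```

With the placeholder rate, `estimate_indexing_cost(2_000_000, 0.5)` and `estimate_indexing_cost(4_000_000, 0.5)` differ by exactly a factor of two, which is the linear scaling described above. The final billed amount may still differ from any estimate.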

After reviewing your index price summary, click Index added source to begin indexing.

View Indexing Job

To track indexing progress, go to the Knowledge Bases tab, find your knowledge base, then check the last indexing time. Click the knowledge base to view detailed progress, including updates for each data source, tokens indexed, and any sources still processing. The list updates automatically, and agents begin using the updated embeddings as soon as they become available.

Indexing typically takes five minutes or longer while the system processes, embeds, and stores your data. After indexing completes, go to the knowledge base’s Overview tab, then review the summary of the indexing results, including final costs, under the EMBEDDINGS DETAILS section.

If indexing takes longer than expected, click Stop job to cancel it, then Re-run job to restart it. If issues persist, contact support.

Add a Data Source Using the API

To add a data source using the API, provide the knowledge base’s unique identifier and specify the Spaces bucket, folder, file, or URL to use. To retrieve knowledge base IDs, use the /v2/gen-ai/knowledge_bases endpoint.

After adding a data source, start indexing it using the API to make the content available for retrieval.
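As a sketch of the lookup step, the snippet below builds a GET request for the /v2/gen-ai/knowledge_bases endpoint using only the Python standard library. The function names are hypothetical, and the `DIGITALOCEAN_TOKEN` environment variable mirrors the one used in this guide's cURL examples. The response layout is not described in this guide, so the helper returns the decoded JSON without assuming any field names.

```python
import json
import urllib.request

API_BASE = "https://api.digitalocean.com"


def list_knowledge_bases_request(token: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for the knowledge base list."""
    return urllib.request.Request(
        f"{API_BASE}/v2/gen-ai/knowledge_bases",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="GET",
    )


def fetch_json(request: urllib.request.Request):
    """Send the request and return the decoded JSON body as-is."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

To try it, call `fetch_json(list_knowledge_bases_request(os.environ["DIGITALOCEAN_TOKEN"]))` after importing `os`, then inspect the result for the ID of the knowledge base you want.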

How to Add a Data Source Using the DigitalOcean API
  1. Create a personal access token and save it for use with the API.
  2. Send a POST request to https://api.digitalocean.com/v2/gen-ai/knowledge_bases/{knowledge_base_uuid}/data_sources.

Using cURL:

# The request also needs a JSON body (passed with -d) that describes the data source.
# Its fields are defined in the API reference and are not shown here.
curl -X POST \
  -H "Content-Type: application/json"  \
  -H "Authorization: Bearer $DIGITALOCEAN_TOKEN" \
  "https://api.digitalocean.com/v2/gen-ai/knowledge_bases/9a6e3975-b0c6-11ef-bf8f-4e013e2ddde4/data_sources"

To confirm the data source was added, list the knowledge base’s data sources.

Index a Data Source Using the Control Panel

If your data sources change, such as updated file contents or new folder contents, you may need to re-index to keep the knowledge base up to date. Re-indexing regenerates the vector embeddings, allowing your agent to retrieve the most current information.

You cannot currently re-index a previously crawled seed URL. To re-index the content, delete the seed URL, and then add it again to start a new crawl.

To re-index from the DigitalOcean Control Panel, click GenAI Platform in the left-hand menu, click the Knowledge Bases tab, then select the knowledge base you want to update. Click the knowledge base’s Data Sources tab, then click Update all sources to re-index all attached data sources. The control panel re-indexes every data source in a knowledge base at once; you cannot re-index a single source from this page.

Note
The Update all sources button is disabled if indexing is already in progress or if the knowledge base has no attached data sources.

To track indexing progress, go to the Knowledge Bases tab, find your knowledge base, then check the last indexing time. Click the knowledge base to view detailed progress, including updates for each data source, tokens indexed, and any sources still processing. The list updates automatically, and agents begin using the updated embeddings as soon as they become available.

Indexing typically takes five minutes or longer while the system processes, embeds, and stores your data. After indexing completes, go to the knowledge base’s Overview tab, then review the summary of the indexing results, including final costs, under the EMBEDDINGS DETAILS section.

If indexing takes longer than expected, click Stop job to cancel it, then Re-run job to restart it. If issues persist, contact support.

Index a Data Source Using the API

To index a data source using the API, create an indexing job with the knowledge base ID and data source ID. Use the Create Indexing Job endpoint to start the process.

How to Start an Indexing Job Using the DigitalOcean API
  1. Create a personal access token and save it for use with the API.
  2. Send a POST request to https://api.digitalocean.com/v2/gen-ai/knowledge_bases/{knowledge_base_uuid}/data_sources/{data_source_uuid}/indexing_jobs.
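The request in step 2 might be assembled in Python as follows. This is a sketch, not an official client: the function name is hypothetical, the URL pattern is taken from the step above, and the request is built without a body because this guide does not state whether one is required.

```python
import urllib.request

API_BASE = "https://api.digitalocean.com"


def start_indexing_request(
    token: str, knowledge_base_uuid: str, data_source_uuid: str
) -> urllib.request.Request:
    """Build (but do not send) the POST request that starts an indexing job."""
    url = (
        f"{API_BASE}/v2/gen-ai/knowledge_bases/{knowledge_base_uuid}"
        f"/data_sources/{data_source_uuid}/indexing_jobs"
    )
    return urllib.request.Request(
        url,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        # Assumption: no request body; check the API reference before relying on this.
        method="POST",
    )
```

Pass the returned object to `urllib.request.urlopen` to send it. Keeping construction separate from sending makes the URL and headers easy to inspect first.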

Indexing typically takes five minutes or longer while the system processes, embeds, and stores your data. During this time, check the job status using the Get Indexing Job endpoint. Agents begin using the embedded data as soon as it’s available.

After indexing completes, use the Get Knowledge Base endpoint to confirm completion and review the final token count and indexing cost.

If the job takes longer than expected, cancel it using the Cancel Indexing Job endpoint, then restart it. If issues persist, contact support for assistance.
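The check-then-cancel flow described above amounts to polling with a deadline. The sketch below keeps the API calls out of the loop by taking them as callables, so nothing here assumes an endpoint URL or status value. The names `wait_for_job`, `get_status`, and `is_done` and the default timings are hypothetical.

```python
import time
from typing import Callable


def wait_for_job(
    get_status: Callable[[], str],
    is_done: Callable[[str], bool],
    timeout_s: float = 900.0,
    interval_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll get_status until is_done(status) is true.

    Returns True when the job finishes, or False if timeout_s elapses first,
    at which point the caller can cancel the job and restart it.
    """
    deadline = clock() + timeout_s
    while True:
        if is_done(get_status()):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Here `get_status` would wrap a call to the Get Indexing Job endpoint, and `is_done` would encode whichever status values that endpoint reports for a finished job. The injectable `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.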

Remove a Data Source Using the Control Panel

You can remove a data source from a knowledge base if it’s no longer needed. Removing a data source triggers re-indexing to update the knowledge base with the remaining content.

To remove a data source from the DigitalOcean Control Panel, in the left-hand menu, click GenAI Platform, click the Knowledge Bases tab, find and then select the knowledge base you need, then click the Data Sources tab.

On the Data Sources page, find the data source you want to remove and click the trash icon to its right to open the Remove Data Source window. Confirm removal by typing the data source name, then click Destroy to remove it.

After removal, the knowledge base automatically re-indexes the remaining data sources.

Remove a Data Source Using the API

To remove a data source using the API, provide the knowledge base ID and the specific data source ID. This detaches the data source from the knowledge base but does not delete the original source file or URL.

You can find data source IDs by listing the knowledge base’s data sources.

How to Remove a Data Source Using the DigitalOcean API
  1. Create a personal access token and save it for use with the API.
  2. Send a DELETE request to https://api.digitalocean.com/v2/gen-ai/knowledge_bases/{knowledge_base_uuid}/data_sources/{data_source_uuid}.

Using cURL:

curl -X DELETE \
  -H "Content-Type: application/json"  \
  -H "Authorization: Bearer $DIGITALOCEAN_TOKEN" \
  "https://api.digitalocean.com/v2/gen-ai/knowledge_bases/9a6e3975-b0c6-11ef-bf8f-4e013e2ddde4/data_sources/bd2a2db5-b8b0-11ef-bf8f-4e013e2ddde4"
