How to Add, Re-index, and Remove Data Sources
Validated on 28 Apr 2025 • Last edited on 20 May 2025
DigitalOcean GenAI Platform lets you build GPU-powered AI agents with fully-managed deployment. Agents can use pre-built or custom foundation models, incorporate function and agent routes, and implement RAG pipelines with knowledge bases.
A data source is the custom content you provide, such as files, folders, or URLs, that the knowledge base processes into vector embeddings and stores in an associated OpenSearch database. Each knowledge base requires at least one data source. You can add a data source either to an existing knowledge base or during knowledge base creation.
We support the following data sources:
- Spaces Bucket or Folder: index files stored in a DigitalOcean Spaces bucket or in specific folders.
- File Upload: upload and index files directly from your local machine.
- URL for Web Crawling: use a seed URL to crawl and index content from a public website.
Add a Data Source Using the Control Panel
You can add new data sources to an existing knowledge base at any time. Each source is automatically indexed, and the knowledge base updates as indexing completes. If you add multiple sources, the knowledge base starts using embeddings from each source as soon as its indexing finishes.
To add a data source from the DigitalOcean Control Panel, click GenAI Platform in the left-hand menu, click the Knowledge Bases tab, then find and select the knowledge base you want to update. On the right of the knowledge base, click …, then select Manage data sources. On the Data Sources page, click Add source to open the Add Data Source page.
Add a Data Source
On the Add Data Source page, under the Select data source to index section, click Select data source to open the selection window. Use the Data source dropdown menu to choose the type of data source to add.
You can add multiple types of data sources to a knowledge base and include as many as needed. To save processing time and cost, organize your files in dedicated Spaces buckets, specific folders, or local storage containing only relevant files.
Supported File Formats
We support a wide range of text-based file formats, including .csv, .eml, .epub, .xls, .xlsx, .html, .md, .odt, .pdf, .txt, .rst, .rtf, .tsv, .doc, .docx, .xml, .json, and .jsonl.
Presentation files (such as .ppt and .pptx) are partially supported. We extract text but do not process images or other visual content. Image files (such as .png, .jpeg, .tiff, and .bmp) are not currently supported.
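Before uploading a folder, it can help to check which files fall into each category. The following is a minimal sketch that classifies file names by extension using only the formats listed above. The function name and the three category labels are illustrative and not part of the platform.

```python
from pathlib import Path

# Extensions copied from the supported-formats list above.
FULLY_SUPPORTED = {
    ".csv", ".eml", ".epub", ".xls", ".xlsx", ".html", ".md", ".odt", ".pdf",
    ".txt", ".rst", ".rtf", ".tsv", ".doc", ".docx", ".xml", ".json", ".jsonl",
}
# Presentation formats: text is extracted, visual content is skipped.
PARTIALLY_SUPPORTED = {".ppt", ".pptx"}


def classify_file(name: str) -> str:
    """Return 'supported', 'partial', or 'unsupported' for a file name.

    The labels are hypothetical; the check looks only at the final extension.
    """
    ext = Path(name).suffix.lower()
    if ext in FULLY_SUPPORTED:
        return "supported"
    if ext in PARTIALLY_SUPPORTED:
        return "partial"
    return "unsupported"
```

Running a check like this over a local folder before upload is one way to avoid paying to process files that contribute no indexed text.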
You can add any of the data source types listed earlier: a Spaces bucket or folder, a file upload, or a seed URL for web crawling.
For smooth uploads, keep batches under 100 files, each no larger than 2 GB. For larger files or batches, use the DigitalOcean API.
After selecting your data source, click Add selected data source. If needed, you can add more files later.
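The upload guidance above (batches under 100 files, each file no larger than 2 GB) can be applied with a small planning step before you upload. This sketch assumes "under 100" means at most 99 files per batch and treats 2 GB as 2 × 10^9 bytes; both readings are assumptions, and the function name is illustrative.

```python
from typing import Iterable, List, Tuple

MAX_FILES_PER_BATCH = 99      # assumption: "under 100 files" read strictly
MAX_FILE_BYTES = 2 * 10**9    # assumption: 2 GB interpreted as decimal gigabytes


def plan_upload_batches(
    files: Iterable[Tuple[str, int]],
) -> Tuple[List[List[str]], List[str]]:
    """Split (name, size_in_bytes) pairs into upload batches.

    Returns (batches, oversized). Files above the size limit are not batched;
    they are returned separately so they can be sent through the API instead.
    """
    batches: List[List[str]] = []
    oversized: List[str] = []
    current: List[str] = []
    for name, size in files:
        if size > MAX_FILE_BYTES:
            oversized.append(name)
            continue
        current.append(name)
        if len(current) == MAX_FILES_PER_BATCH:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches, oversized
```

For example, 200 small files plus one 3 GB file would yield three batches (99, 99, and 2 files) and one oversized entry.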
Review Index Prices
The How much will I pay? section shows your indexing costs, which depend on the embedding model you choose and the size of the data you’re embedding. You cannot change the embedding model for existing knowledge bases. To use a different model, create a new knowledge base.
The pricing table shows estimated token counts and indexing costs based on your dataset size and the selected model’s token rate. Each row shows the Dataset Size, estimated Token Count, and Indexing Cost. Larger datasets produce more tokens, increasing the cost. Pricing scales linearly with both dataset size and token rate, and you only pay for successfully indexed data. Final costs may vary. For more details on pricing, see our embedding model pricing page.
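Because pricing scales linearly with dataset size and token rate, a rough estimate takes only a few lines of arithmetic. The sketch below is illustrative: the bytes-per-token ratio is a common rule of thumb rather than a platform figure, and the price passed in is a placeholder, so take the real rate from the embedding model pricing page.

```python
from typing import Tuple


def estimate_indexing_cost(
    dataset_bytes: int,
    price_per_million_tokens: float,
    bytes_per_token: float = 4.0,  # assumption: rough heuristic, varies by content
) -> Tuple[int, float]:
    """Return (estimated_tokens, estimated_cost) for a dataset.

    Cost is linear in both the dataset size and the per-token price.
    """
    tokens = round(dataset_bytes / bytes_per_token)
    cost = tokens / 1_000_000 * price_per_million_tokens
    return tokens, cost


# Placeholder price, not a real rate: 4 MB of text at 0.10 per million tokens.
tokens, cost = estimate_indexing_cost(4_000_000, 0.10)
print(tokens, round(cost, 4))
```

Doubling either the dataset size or the token rate doubles the estimate, which matches the linear scaling described above. Final costs may still differ because only successfully indexed data is billed.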
After reviewing your index price summary, click Index added source to begin indexing.
View Indexing Job
To track indexing progress, go to the Knowledge Bases tab, find your knowledge base, then check the last indexing time. Click the knowledge base to view detailed progress, including updates for each data source, tokens indexed, and any sources still processing. The list updates automatically, and agents begin using the updated embeddings as soon as they become available.
Indexing typically takes five minutes or longer while the system processes, embeds, and stores your data. After indexing completes, go to the knowledge base’s Overview tab, then under the EMBEDDINGS DETAILS section, review the summary of the indexing results, including final costs.
If indexing takes longer than expected, click Stop job to cancel it, then Re-run job to restart it. If issues persist, contact support.
Add a Data Source Using the API
To add a data source using the API, provide the knowledge base’s unique identifier and specify the Spaces bucket, folder, file, or URL to use. To retrieve knowledge base IDs, use the /v2/gen-ai/knowledge_bases endpoint.
After adding a data source, start indexing it using the API to make the content available for retrieval.
To confirm the data source was added, list the knowledge base’s data sources.
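The API flow above can be sketched with Python's standard library. Only the /v2/gen-ai/knowledge_bases path comes from this page. The data_sources sub-path and the use of a bearer token are assumptions, and the request body is passed through unchanged because its field names must come from the API reference. The functions only build requests; nothing is sent until you call urllib.request.urlopen.

```python
import json
import urllib.request
from typing import Any, Dict, Optional

API_BASE = "https://api.digitalocean.com"


def build_request(
    method: str, path: str, token: str, body: Optional[Dict[str, Any]] = None
) -> urllib.request.Request:
    """Build (but do not send) an authenticated JSON request."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return urllib.request.Request(
        API_BASE + path,
        data=data,
        method=method,
        headers={
            "Authorization": "Bearer " + token,
            "Content-Type": "application/json",
        },
    )


def list_knowledge_bases_request(token: str) -> urllib.request.Request:
    # Endpoint named on this page; use it to look up knowledge base IDs.
    return build_request("GET", "/v2/gen-ai/knowledge_bases", token)


def add_data_source_request(
    token: str, kb_id: str, source_payload: Dict[str, Any]
) -> urllib.request.Request:
    # Assumed sub-path; source_payload must follow the API reference.
    path = "/v2/gen-ai/knowledge_bases/%s/data_sources" % kb_id
    return build_request("POST", path, token, source_payload)


def list_data_sources_request(token: str, kb_id: str) -> urllib.request.Request:
    # Assumed sub-path; used to confirm the new data source is attached.
    path = "/v2/gen-ai/knowledge_bases/%s/data_sources" % kb_id
    return build_request("GET", path, token)
```

To send a request, pass it to urllib.request.urlopen and decode the JSON response. Keeping request construction separate from sending makes it possible to inspect the URL and payload before anything reaches the API.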
Index a Data Source Using the Control Panel
If your data sources change, such as updated file contents or new folder contents, you may need to re-index to keep the knowledge base up to date. Re-indexing regenerates the vector embeddings, allowing your agent to retrieve the most current information.
You cannot currently re-index a previously crawled seed URL. To re-index the content, delete the seed URL, and then add it again to start a new crawl.
To re-index from the DigitalOcean Control Panel, click GenAI Platform in the left-hand menu, click the Knowledge Bases tab, then select the knowledge base you want to update. Click the knowledge base’s Data Sources tab, then click Update all sources to re-index all attached data sources. From the Control Panel, you can only re-index all of a knowledge base’s data sources at once.
To track indexing progress, go to the Knowledge Bases tab, find your knowledge base, then check the last indexing time. Click the knowledge base to view detailed progress, including updates for each data source, tokens indexed, and any sources still processing. The list updates automatically, and agents begin using the updated embeddings as soon as they become available.
Indexing typically takes five minutes or longer while the system processes, embeds, and stores your data. After indexing completes, go to the knowledge base’s Overview tab, then under the EMBEDDINGS DETAILS section, review the summary of the indexing results, including final costs.
If indexing takes longer than expected, click Stop job to cancel it, then Re-run job to restart it. If issues persist, contact support.
Index a Data Source Using the API
To index a data source using the API, create an indexing job with the knowledge base ID and data source ID. Use the Create Indexing Job endpoint to start the process.
Indexing typically takes five minutes or longer while the system processes, embeds, and stores your data. During this time, check the job status using the Get Indexing Job endpoint. Agents begin using the embedded data as soon as it’s available.
After indexing completes, use the Get Knowledge Base endpoint to confirm completion and review the final token count and indexing cost.
If the job takes longer than expected, cancel it using the Cancel Indexing Job endpoint, then restart it. If issues persist, contact support for assistance.
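The check-status-then-cancel flow described above amounts to a polling loop with a timeout. The sketch below is deliberately API-agnostic: fetch_status and cancel are callables you supply (for example, thin wrappers around the Get Indexing Job and Cancel Indexing Job endpoints), and the status strings are hypothetical placeholders, not the values the API returns.

```python
import time
from typing import Callable, Iterable


def wait_for_indexing(
    fetch_status: Callable[[], str],
    cancel: Callable[[], None],
    timeout_s: float = 900.0,
    interval_s: float = 15.0,
    done_states: Iterable[str] = ("completed", "failed", "cancelled"),  # hypothetical
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll a job until it reaches a terminal state or the timeout passes.

    On timeout, calls cancel() once and returns 'timed_out' so the caller
    can decide whether to start a new indexing job.
    """
    terminal = set(done_states)
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status in terminal:
            return status
        if clock() >= deadline:
            cancel()
            return "timed_out"
        sleep(interval_s)
```

Injecting sleep and clock keeps the loop easy to test without waiting in real time. Because indexing typically takes five minutes or longer, a timeout well above that avoids cancelling jobs that are simply still in progress.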
Remove a Data Source Using the Control Panel
You can remove a data source from a knowledge base if it’s no longer needed. Removing a data source triggers re-indexing to update the knowledge base with the remaining content.
To remove a data source from the DigitalOcean Control Panel, click GenAI Platform in the left-hand menu, click the Knowledge Bases tab, find and select the knowledge base you need, then click the Data Sources tab.
On the Data Sources page, find the data source you want to remove, then click the trash icon to its right to open the Remove Data Source window. Confirm removal by typing the data source name, then click Destroy to remove it.
After removal, the knowledge base automatically re-indexes the remaining data sources.
Remove a Data Source Using the API
To remove a data source using the API, provide the knowledge base ID and the specific data source ID. This detaches the data source from the knowledge base but does not delete the original source file or URL.
You can find data source IDs by listing the knowledge base’s data sources.
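As a minimal sketch of the removal call, the function below builds a DELETE request from the two IDs. The path shape (a data_sources segment under the knowledge base) and the bearer-token header are assumptions to verify against the API reference. The request is only constructed here, not sent.

```python
import urllib.request

API_BASE = "https://api.digitalocean.com"


def remove_data_source_request(
    token: str, kb_id: str, data_source_id: str
) -> urllib.request.Request:
    """Build (but do not send) a request that detaches a data source.

    Assumed path: /v2/gen-ai/knowledge_bases/{kb_id}/data_sources/{data_source_id}
    """
    url = "%s/v2/gen-ai/knowledge_bases/%s/data_sources/%s" % (
        API_BASE, kb_id, data_source_id,
    )
    return urllib.request.Request(
        url,
        method="DELETE",
        headers={"Authorization": "Bearer " + token},
    )
```

Pass the result to urllib.request.urlopen to perform the removal. As noted above, this detaches the data source from the knowledge base and leaves the original file or URL untouched.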