DigitalOcean GenAI Platform lets you build GPU-powered AI agents with fully-managed deployment. Agents can use pre-built or custom foundation models, incorporate function and agent routes, and implement RAG pipelines with knowledge bases.
Adding a data source to a knowledge base requires the unique identifier of the knowledge base and the Spaces bucket, folder, or file to use as the data source.
You can list all knowledge bases with their unique identifiers using the /v2/gen-ai/knowledge_bases endpoint.
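As a rough illustration, the listing call can be sketched in Python with only the standard library. The endpoint path comes from the text above. The response shape (a knowledge_bases array whose items carry uuid and name fields) and the helper names build_list_request, extract_ids, and list_knowledge_bases are assumptions made for this sketch, so check them against the API reference before relying on them.

```python
import json
import urllib.request

API_BASE = "https://api.digitalocean.com"  # DigitalOcean public API host


def build_list_request(token: str) -> urllib.request.Request:
    """Build a GET request for the knowledge base listing endpoint."""
    return urllib.request.Request(
        f"{API_BASE}/v2/gen-ai/knowledge_bases",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def extract_ids(payload: dict) -> list[tuple[str, str]]:
    """Return (uuid, name) pairs from a parsed response body.

    Assumption: the body looks like
    {"knowledge_bases": [{"uuid": "...", "name": "..."}, ...]}.
    """
    return [
        (kb.get("uuid", ""), kb.get("name", ""))
        for kb in payload.get("knowledge_bases", [])
    ]


def list_knowledge_bases(token: str) -> list[tuple[str, str]]:
    """Send the request and return the (uuid, name) pairs it yields."""
    with urllib.request.urlopen(build_list_request(token)) as response:
        return extract_ids(json.load(response))
```

Calling list_knowledge_bases with a personal access token returns the identifiers you need when adding a data source.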
To add a data source to a knowledge base after creation, go to your knowledge base in the control panel, click on the Data sources tab, and then click the Add source button to open the Knowledge bases page.
Click Select data source to open the Select data source window. From the Data source dropdown list, select one of the following options:
Spaces bucket or folder: Select one or more Spaces buckets or folders in a bucket where your data is stored. If you do not have Spaces buckets for your data, see How to Create a Spaces Bucket and How to Migrate Spaces with Flexify.IO.
Web crawling: Add a static or dynamic seed URL to extract data with the GenAI crawler. The URL must use HTTPS and be publicly accessible. The crawler indexes up to 5500 links within the defined scope. It follows robots.txt, respects disallow directives, and skips inaccessible links. The crawler also indexes .svg, .jpeg, and .png images if specified. However, including images and SVGs may increase the indexing token count. The crawler ignores videos and avoids scraping links in footers, headers, and navigation elements. Downloadable files are processed only if they fall within the defined crawling scope; otherwise, they are ignored.

Scope | Seed URL Example | Crawls |
---|---|---|
Scoped (Most Narrow): Crawls only the seed URL and ignores all links to external pages. | https://www.example.com/products/ai-ml/ | Only this page. |
URL and all linked pages in path (Narrow): Crawls the seed URL and all pages within the same URL path, ignoring pages outside this path. | https://www.example.com/docs/ | Includes: https://www.example.com/docs/tutorials/ Excludes: https://www.example.com/products/ |
URL and all linked pages in domain (Broad): Crawls all pages within the same domain as the seed URL but does not include subdomains. | https://www.example.com/docs/ | Includes: https://www.example.com/products/ Excludes: https://docs.example.com/ |
Subdomains (Most Broad): Crawls all pages within the domain and its subdomains, including docs.example.com and marketplace.example.com. | https://www.example.com/docs/ | Includes: https://community.example.com/ |

If you add a seed URL for web crawling, you can check if it’s fully indexed by adding it again and starting a new crawl. If it returns zero tokens, the initial crawl indexed all content.
File: Drag and drop data files from your local storage or click Upload to select the files to add in the file browser.
Next, click Add selected data source to add the data source.
Adding data sources automatically indexes them and updates the knowledge base. When you add multiple data sources, the embeddings in the knowledge base update as soon as each data source finishes indexing, while the remaining data sources continue to index.
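To make the four web crawling scopes in the table above more concrete, they can be read as a URL filter. The sketch below is only an approximation of that table, not the GenAI crawler's actual logic. The scope names ("scoped", "path", "domain", "subdomains") and the helpers base_domain and in_scope are hypothetical, and the base domain is naively taken as the last two labels of the host, which is wrong for suffixes such as .co.uk.

```python
from urllib.parse import urlsplit


def base_domain(host: str) -> str:
    """Naive base domain: the last two labels of the host (illustration only)."""
    return ".".join(host.split(".")[-2:])


def in_scope(seed: str, url: str, scope: str) -> bool:
    """Return True if url would be crawled for the given seed URL and scope."""
    s, u = urlsplit(seed), urlsplit(url)
    seed_host, url_host = s.hostname or "", u.hostname or ""

    if scope == "scoped":  # only the seed page itself
        return seed_host == url_host and s.path == u.path
    if scope == "path":  # same host, and at or under the seed path
        prefix = s.path if s.path.endswith("/") else s.path + "/"
        return seed_host == url_host and (
            u.path == s.path or u.path.startswith(prefix)
        )
    if scope == "domain":  # same host, subdomains excluded
        return seed_host == url_host
    if scope == "subdomains":  # same base domain, any subdomain
        return base_domain(seed_host) == base_domain(url_host)
    raise ValueError(f"unknown scope: {scope}")
```

For example, with the seed https://www.example.com/docs/, the "domain" scope accepts https://www.example.com/products/ but rejects https://docs.example.com/, matching the Includes and Excludes columns of the table.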
If the data in your data source has changed, you can re-index the data manually to update the vector embeddings.
We index the data you store in your knowledge base. GenAI Platform supports the .txt, .html, .md, .pdf, .doc, .json, and .csv formats.
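If you prepare files locally before adding them, a small check against the supported formats can show which files would be indexed. This is a minimal sketch: the extension list is taken from the text above, while the helper names and the case-insensitive matching are assumptions for illustration.

```python
from pathlib import Path

# Formats listed as supported in this document.
SUPPORTED_EXTENSIONS = {".txt", ".html", ".md", ".pdf", ".doc", ".json", ".csv"}


def is_supported(filename: str) -> bool:
    """True if the file extension is one of the supported formats.

    Assumption: extensions are compared case-insensitively.
    """
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


def partition_by_support(filenames: list[str]) -> tuple[list[str], list[str]]:
    """Split filenames into (supported, unsupported), preserving order."""
    supported = [name for name in filenames if is_supported(name)]
    unsupported = [name for name in filenames if not is_supported(name)]
    return supported, unsupported
```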
Spaces buckets seamlessly organize files for your knowledge base (KB) using S3-compatible systems. You need an initial dataset from a Spaces bucket to create a KB, and the system indexes all supported file formats in the bucket, regardless of their privacy settings.
Due to browser limitations, when you upload files from local storage using the control panel, we recommend uploading files smaller than 2 GB in batches of fewer than 100 files. For larger files and batches, use the DigitalOcean API.
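One way to apply these recommendations is to plan uploads before opening the control panel. The sketch below groups small files into batches of at most 99 and sets aside files at or above the size limit for the API. The function name plan_uploads is hypothetical, and reading the 2 GB limit as 2,000,000,000 bytes is an assumption.

```python
from typing import Iterable

MAX_FILE_BYTES = 2 * 1000**3  # size limit read as decimal gigabytes (assumption)
MAX_BATCH_FILES = 99          # "fewer than 100 files" per batch


def plan_uploads(
    files: Iterable[tuple[str, int]],
) -> tuple[list[list[str]], list[str]]:
    """Plan control panel uploads from (name, size_in_bytes) pairs.

    Returns (batches, too_large): batches holds control panel sized groups
    of file names; too_large holds files better sent through the API.
    """
    small: list[str] = []
    too_large: list[str] = []
    for name, size in files:
        if size < MAX_FILE_BYTES:
            small.append(name)
        else:
            too_large.append(name)
    batches = [
        small[i:i + MAX_BATCH_FILES]
        for i in range(0, len(small), MAX_BATCH_FILES)
    ]
    return batches, too_large
```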
To re-index all your data sources, click the Update all sources button.
You can view the indexing progress and the job details in the banner at the top of the knowledge base page.
For web crawling data sources, you cannot currently re-index a previously crawled seed URL. To re-index the content, delete the seed URL as a data source and re-add it to start a new crawl.
To remove a data source, click the ••• menu to its right and select Remove source.