Agent Evaluations
Generated on 28 Apr 2026
This content is automatically generated from https://github.com/digitalocean-labs/mcp-digitalocean/blob/main/pkg/registry/genai/README.md.
GenAI Evaluation Tools
This package provides MCP tools for managing and running evaluation workflows in DigitalOcean’s GenAI platform.
Overview
The evaluation tools enable users to:
- List available evaluation metrics
- Manage evaluation datasets (upload CSV files)
- Create and update evaluation test cases
- Run evaluations against agent deployments
- Monitor evaluation run status
Tools
Atomic Tools (One API Call Each)
genai-list-evaluation-metrics
Lists all available evaluation metrics that can be used in test cases.
Arguments: None
Returns: JSON object with array of metrics and metadata
{
"metrics": [
{
"metric_uuid": "...",
"metric_name": "correctness",
"metric_type": "...",
"category": "METRIC_CATEGORY_CORRECTNESS",
...
}
],
"count": 5
}
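A minimal Go sketch of decoding this response follows, assuming the struct fields mirror the JSON keys shown above; the full schema may carry additional fields beyond these.

package main

import (
	"encoding/json"
	"fmt"
)

// Metric mirrors the keys shown in the example response above.
type Metric struct {
	MetricUUID string `json:"metric_uuid"`
	MetricName string `json:"metric_name"`
	MetricType string `json:"metric_type"`
	Category   string `json:"category"`
}

type MetricsResponse struct {
	Metrics []Metric `json:"metrics"`
	Count   int      `json:"count"`
}

func main() {
	raw := `{"metrics":[{"metric_name":"correctness","category":"METRIC_CATEGORY_CORRECTNESS"}],"count":1}`
	var resp MetricsResponse
	if err := json.Unmarshal([]byte(raw), &resp); err != nil {
		panic(err)
	}
	fmt.Println(resp.Count, resp.Metrics[0].MetricName)
}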
genai-list-evaluation-test-cases
Lists evaluation test cases for a specific workspace.
Arguments:
- workspace_uuid (string, optional): Workspace UUID
- agent_workspace_name (string, optional): Workspace name
At least one of workspace_uuid or agent_workspace_name must be provided.
Returns: JSON object with array of test cases
{
"test_cases": [
{
"test_case_uuid": "...",
"name": "my_test",
"description": "...",
...
}
],
"count": 2
}
genai-create-evaluation-dataset
Creates an evaluation dataset by uploading a CSV file. The file is validated to ensure:
- File extension is .csv
- Contains a query column
- All query column values are valid JSON objects
Arguments:
- name (string, required): Name for the dataset
- file_path (string, required): Path to the CSV file to upload
Returns: JSON object with dataset UUID and metadata
{
"dataset_uuid": "...",
"name": "my_dataset",
"file_size": 1024
}
genai-create-evaluation-test-case
Creates a new evaluation test case.
Arguments:
- name (string, required): Name of the test case
- description (string, optional): Description
- dataset_uuid (string, required): Dataset UUID to use
- metrics (array of strings, optional): Metric UUIDs to include
- workspace_uuid (string, optional): Workspace UUID
- agent_workspace_name (string, optional): Workspace name
At least one of workspace_uuid or agent_workspace_name must be provided.
Returns: JSON object with test case UUID
{
"test_case_uuid": "...",
"name": "my_test",
"dataset_uuid": "..."
}
genai-update-evaluation-test-case
Updates an existing evaluation test case.
Arguments:
- test_case_uuid (string, required): Test case UUID to update
- name (string, optional): New name
- description (string, optional): New description
- dataset_uuid (string, optional): New dataset UUID
- metrics (array of strings, optional): New metric UUIDs
Returns: JSON object with test case UUID and new version
{
"test_case_uuid": "...",
"version": 2
}
genai-run-evaluation-test-case
Runs an evaluation test case against specified agent deployments.
Arguments:
- test_case_uuid (string, required): Test case UUID to run
- agent_deployment_names (array of strings, required): Deployment names to evaluate
- run_name (string, required): Name for this evaluation run
Returns: JSON object with evaluation run UUIDs
{
"evaluation_run_uuids": ["uuid1", "uuid2"],
"count": 2
}
genai-get-evaluation-run
Gets the status and results of an evaluation run.
Arguments:
- evaluation_run_uuid (string, required): Evaluation run UUID
Returns: JSON object with full evaluation run details
{
"evaluation_run": {
"evaluation_run_uuid": "...",
"status": "EVALUATION_RUN_SUCCESSFUL",
"run_level_metric_results": [
{
"metric_name": "correctness",
"number_value": 0.95,
"reasoning": "..."
}
],
...
}
}
High-Level Orchestrated Tool
genai-run-evaluation-workflow
Runs a complete end-to-end evaluation workflow. This tool orchestrates all the steps:
- Validates the dataset CSV
- Lists available metrics and filters by category
- Uploads the dataset
- Creates or updates the test case
- Runs the evaluation
- Polls for results until completion (with configurable timeout)
This tool is ideal for users unfamiliar with the multi-step evaluation process, as it handles all orchestration internally.
Arguments:
- dataset_file_path (string, required): Path to CSV evaluation dataset
- workspace_name (string, required): Agent workspace name
- test_case_name (string, required): Name for the test case
- agent_deployment_names (array of strings, required): Deployment names to evaluate
- run_name (string, required): Name for the evaluation run
- description (string, optional): Test case description
- metric_categories (array of strings, optional): Filter by metric categories (e.g., "METRIC_CATEGORY_CORRECTNESS", "METRIC_CATEGORY_SAFETY_AND_SECURITY"). If empty, all metrics are used.
- timeout_seconds (number, optional): Timeout for polling results (default: 300 seconds)
- poll_interval_seconds (number, optional): Interval between status polls (default: 5 seconds)
Returns: JSON object with complete workflow results
{
"dataset_uuid": "...",
"test_case_uuid": "...",
"evaluation_run_uuid": "...",
"status": "EVALUATION_RUN_SUCCESSFUL",
"metric_results": [
{
"metric_name": "correctness",
"number_value": 0.95,
"reasoning": "..."
}
],
"duration_seconds": 45.3,
"error_message": null
}
Dataset CSV Format
Evaluation datasets must be CSV files with:
- A query column containing JSON objects as strings
- Additional columns for ground truth, expected outputs, etc. (optional)
Example:
query,expected_output
"{\"question\": \"What is 2+2?\"}",4
"{\"question\": \"What is the capital of France?\"}","Paris"Workflow Example
Workflow Example
Using Atomic Tools (Step-by-Step)
1. List metrics to see available options:
genai-list-evaluation-metrics
2. Upload your dataset:
genai-create-evaluation-dataset
name: "my_dataset"
file_path: "/path/to/queries.csv"
3. Create a test case:
genai-create-evaluation-test-case
name: "test_my_agent"
description: "Testing agent correctness"
dataset_uuid: "<uuid from step 2>"
metrics: ["<metric_uuid_1>", "<metric_uuid_2>"]
agent_workspace_name: "my_workspace"
4. Run the evaluation:
genai-run-evaluation-test-case
test_case_uuid: "<uuid from step 3>"
agent_deployment_names: ["my_agent_deployment"]
run_name: "run_1"
5. Poll for results until the run reaches a terminal status (a polling-loop sketch follows these steps):
genai-get-evaluation-run
evaluation_run_uuid: "<uuid from step 4>"
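A rough Go sketch of such a polling loop is shown below. The getRunStatus helper is a hypothetical stand-in for issuing the genai-get-evaluation-run tool call and extracting the status field; the terminal statuses come from the list at the end of this document.

package main

import (
	"fmt"
	"time"
)

// Terminal statuses, per the status list at the end of this document.
var terminal = map[string]bool{
	"EVALUATION_RUN_SUCCESSFUL":           true,
	"EVALUATION_RUN_PARTIALLY_SUCCESSFUL": true,
	"EVALUATION_RUN_FAILED":               true,
	"EVALUATION_RUN_CANCELLED":            true,
}

// getRunStatus is hypothetical: in practice it would issue the
// genai-get-evaluation-run tool call and return the run's status.
func getRunStatus(runUUID string) (string, error) {
	return "EVALUATION_RUN_SUCCESSFUL", nil
}

func pollRun(runUUID string, timeout, interval time.Duration) (string, error) {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		status, err := getRunStatus(runUUID)
		if err != nil {
			return "", err
		}
		if terminal[status] {
			return status, nil
		}
		time.Sleep(interval)
	}
	return "", fmt.Errorf("evaluation polling timed out after %s", timeout)
}

func main() {
	status, err := pollRun("<run-uuid>", 300*time.Second, 5*time.Second)
	fmt.Println(status, err)
}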
Using the Orchestrated Workflow Tool (All-in-One)
genai-run-evaluation-workflow
dataset_file_path: "/path/to/queries.csv"
workspace_name: "my_workspace"
test_case_name: "test_my_agent"
agent_deployment_names: ["my_agent_deployment"]
run_name: "run_1"
description: "Testing agent correctness"
metric_categories: ["METRIC_CATEGORY_CORRECTNESS"]
timeout_seconds: 300
poll_interval_seconds: 5
CSV Validation
The CSV dataset is validated to ensure:
- File has a .csv extension
- File contains a query column
- All query column values are valid JSON
- File is readable and not empty
If validation fails, a detailed error message is returned describing the issue.
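The Go sketch below illustrates these four rules; the package's actual validator may differ in details, so treat this as illustrative only.

package main

import (
	"encoding/csv"
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func validateDataset(path string) error {
	if strings.ToLower(filepath.Ext(path)) != ".csv" {
		return fmt.Errorf("file must have a .csv extension")
	}
	f, err := os.Open(path)
	if err != nil {
		return fmt.Errorf("file is not readable: %w", err)
	}
	defer f.Close()

	rows, err := csv.NewReader(f).ReadAll()
	if err != nil {
		return err
	}
	if len(rows) == 0 {
		return fmt.Errorf("file is empty")
	}

	// Locate the required query column in the header row.
	queryCol := -1
	for i, name := range rows[0] {
		if name == "query" {
			queryCol = i
		}
	}
	if queryCol < 0 {
		return fmt.Errorf("missing required query column")
	}

	// Every query value must parse as a JSON object.
	for i, row := range rows[1:] {
		var obj map[string]any
		if err := json.Unmarshal([]byte(row[queryCol]), &obj); err != nil {
			return fmt.Errorf("row %d: query is not a valid JSON object: %w", i+2, err)
		}
	}
	return nil
}

func main() {
	fmt.Println(validateDataset("queries.csv"))
}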
Error Handling
All tools return structured error messages. Errors from API calls are wrapped with context about which step failed:
{
"error": "failed to create evaluation dataset: service error"
}
For the workflow tool, errors include the step number:
"step 4: failed to create presigned URL: ..."
"step 7: evaluation polling timed out"
Metric Categories
Available metric categories (when filtering in workflow tool):
- METRIC_CATEGORY_CORRECTNESS: Correctness and accuracy metrics
- METRIC_CATEGORY_USER_OUTCOMES: User satisfaction and engagement metrics
- METRIC_CATEGORY_SAFETY_AND_SECURITY: Safety and security related metrics
- METRIC_CATEGORY_CONTEXT_QUALITY: Context and retrieval quality metrics
- METRIC_CATEGORY_MODEL_FIT: Model fit and performance metrics
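As an illustration, the sketch below shows one way a client could apply this category filter, keeping a metric when its category appears in the requested set and keeping everything when the set is empty. The Metric type here is a simplified stand-in, not the API's actual schema.

package main

import "fmt"

type Metric struct {
	Name     string
	Category string
}

func filterByCategory(metrics []Metric, categories []string) []Metric {
	if len(categories) == 0 {
		return metrics // empty filter: use all metrics
	}
	want := make(map[string]bool)
	for _, c := range categories {
		want[c] = true
	}
	var out []Metric
	for _, m := range metrics {
		if want[m.Category] {
			out = append(out, m)
		}
	}
	return out
}

func main() {
	all := []Metric{
		{"correctness", "METRIC_CATEGORY_CORRECTNESS"},
		{"toxicity", "METRIC_CATEGORY_SAFETY_AND_SECURITY"},
	}
	fmt.Println(filterByCategory(all, []string{"METRIC_CATEGORY_CORRECTNESS"}))
}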
Evaluation Run Status Values
- EVALUATION_RUN_QUEUED: Run is waiting to start
- EVALUATION_RUN_RUNNING: Run is currently executing
- EVALUATION_RUN_RUNNING_DATASET: Processing dataset
- EVALUATION_RUN_EVALUATING_RESULTS: Evaluating metric results
- EVALUATION_RUN_SUCCESSFUL: Run completed successfully
- EVALUATION_RUN_PARTIALLY_SUCCESSFUL: Some metrics were evaluated, others failed
- EVALUATION_RUN_FAILED: Run failed completely
- EVALUATION_RUN_CANCELLED: Run was cancelled
Terminal statuses: SUCCESSFUL, FAILED, CANCELLED, PARTIALLY_SUCCESSFUL