Agent Evaluations
Generated on 26 Jun 2026
This content is automatically generated from https://github.com/digitalocean-labs/mcp-digitalocean/blob/main/pkg/registry/genai/README.md.
GenAI Tools
This package provides MCP tools for DigitalOcean’s GenAI platform.
Overview
The package contains two sets of evaluation tools:
Agent Evaluation (genai-evaluation) — evaluate deployed agents end-to-end:
- List available evaluation metrics
- Manage evaluation datasets (upload CSV files)
- Create and update evaluation test cases
- Run evaluations against agent deployments
- Monitor evaluation run status
Model Evaluation (under genai-evaluation) — evaluate raw models directly:
- List available model evaluation metrics
- Upload evaluation datasets
- Create and run model evaluation runs
- Download evaluation results
- Monitor model evaluation run status
Tools
Atomic Tools (One API Call Each)
genai-list-evaluation-metrics
Lists all available evaluation metrics that can be used in test cases.
Arguments: None
Returns: JSON object with array of metrics and metadata
{
"metrics": [
{
"metric_uuid": "...",
"metric_name": "correctness",
"metric_type": "...",
"category": "METRIC_CATEGORY_CORRECTNESS",
...
}
],
"count": 5
}
genai-list-evaluation-test-cases
Lists evaluation test cases for a specific workspace.
Arguments:
- workspace_uuid (string, optional): Workspace UUID
- agent_workspace_name (string, optional): Workspace name
At least one of workspace_uuid or agent_workspace_name must be provided.
Returns: JSON object with array of test cases
{
"test_cases": [
{
"test_case_uuid": "...",
"name": "my_test",
"description": "...",
...
}
],
"count": 2
}
genai-create-evaluation-dataset
Creates an evaluation dataset by uploading a CSV file. The file is validated to ensure:
- File extension is .csv
- Contains a query column
- All query column values are valid JSON objects
Arguments:
- name (string, required): Name for the dataset
- file_path (string, required): Path to the CSV file to upload
Returns: JSON object with dataset UUID and metadata
{
"dataset_uuid": "...",
"name": "my_dataset",
"file_size": 1024
}
genai-create-evaluation-test-case
Creates a new evaluation test case.
Arguments:
- name (string, required): Name of the test case
- description (string, optional): Description
- dataset_uuid (string, required): Dataset UUID to use
- metrics (array of strings, optional): Metric UUIDs to include
- workspace_uuid (string, optional): Workspace UUID
- agent_workspace_name (string, optional): Workspace name
At least one of workspace_uuid or agent_workspace_name must be provided.
Returns: JSON object with test case UUID
{
"test_case_uuid": "...",
"name": "my_test",
"dataset_uuid": "..."
}
genai-update-evaluation-test-case
Updates an existing evaluation test case.
Arguments:
- test_case_uuid (string, required): Test case UUID to update
- name (string, optional): New name
- description (string, optional): New description
- dataset_uuid (string, optional): New dataset UUID
- metrics (array of strings, optional): New metric UUIDs
Returns: JSON object with test case UUID and new version
{
"test_case_uuid": "...",
"version": 2
}
genai-run-evaluation-test-case
Runs an evaluation test case against specified agent deployments.
Arguments:
- test_case_uuid (string, required): Test case UUID to run
- agent_deployment_names (array of strings, required): Deployment names to evaluate
- run_name (string, required): Name for this evaluation run
Returns: JSON object with evaluation run UUIDs
{
"evaluation_run_uuids": ["uuid1", "uuid2"],
"count": 2
}
genai-get-evaluation-run
Gets the status and results of an evaluation run.
Arguments:
- evaluation_run_uuid (string, required): Evaluation run UUID
Returns: JSON object with full evaluation run details
{
"evaluation_run": {
"evaluation_run_uuid": "...",
"status": "EVALUATION_RUN_SUCCESSFUL",
"run_level_metric_results": [
{
"metric_name": "correctness",
"number_value": 0.95,
"reasoning": "..."
}
],
...
}
}
High-Level Orchestrated Tool
genai-run-evaluation-workflow
Runs a complete end-to-end evaluation workflow. This tool orchestrates all the steps:
- Validates the dataset CSV
- Lists available metrics and filters by category
- Uploads the dataset
- Creates or updates the test case
- Runs the evaluation
- Polls for results until completion (with configurable timeout)
This tool is ideal for users unfamiliar with the multi-step evaluation process, as it handles all orchestration internally.
Arguments:
- dataset_file_path (string, required): Path to CSV evaluation dataset
- workspace_name (string, required): Agent workspace name
- test_case_name (string, required): Name for the test case
- agent_deployment_names (array of strings, required): Deployment names to evaluate
- run_name (string, required): Name for the evaluation run
- description (string, optional): Test case description
- metric_categories (array of strings, optional): Filter by metric categories (e.g., "METRIC_CATEGORY_CORRECTNESS", "METRIC_CATEGORY_SAFETY_AND_SECURITY"). If empty, all metrics are used.
- timeout_seconds (number, optional): Timeout for polling results (default: 300 seconds)
- poll_interval_seconds (number, optional): Interval between status polls (default: 5 seconds)
Returns: JSON object with complete workflow results
{
"dataset_uuid": "...",
"test_case_uuid": "...",
"evaluation_run_uuid": "...",
"status": "EVALUATION_RUN_SUCCESSFUL",
"metric_results": [
{
"metric_name": "correctness",
"number_value": 0.95,
"reasoning": "..."
}
],
"duration_seconds": 45.3,
"error_message": null
}
Dataset CSV Format
Evaluation datasets must be CSV files with:
- A query column containing JSON objects as strings
- Additional columns for ground truth, expected outputs, etc. (optional)
Example:
query,expected_output
"{\"question\": \"What is 2+2?\"}",4
"{\"question\": \"What is the capital of France?\"}","Paris"
Workflow Example
Using Atomic Tools (Step-by-Step)
1. List metrics to see available options:
genai-list-evaluation-metrics
2. Upload your dataset:
genai-create-evaluation-dataset
name: "my_dataset"
file_path: "/path/to/queries.csv"
3. Create a test case:
genai-create-evaluation-test-case
name: "test_my_agent"
description: "Testing agent correctness"
dataset_uuid: "<uuid from step 2>"
metrics: ["<metric_uuid_1>", "<metric_uuid_2>"]
agent_workspace_name: "my_workspace"
4. Run the evaluation:
genai-run-evaluation-test-case
test_case_uuid: "<uuid from step 3>"
agent_deployment_names: ["my_agent_deployment"]
run_name: "run_1"
5. Poll for results:
genai-get-evaluation-run
evaluation_run_uuid: "<uuid from step 4>"
Using the Orchestrated Workflow Tool (All-in-One)
genai-run-evaluation-workflow
dataset_file_path: "/path/to/queries.csv"
workspace_name: "my_workspace"
test_case_name: "test_my_agent"
agent_deployment_names: ["my_agent_deployment"]
run_name: "run_1"
description: "Testing agent correctness"
metric_categories: ["METRIC_CATEGORY_CORRECTNESS"]
timeout_seconds: 300
poll_interval_seconds: 5
CSV Validation
The CSV dataset is validated to ensure:
- File has a .csv extension
- File contains a query column
- All query column values are valid JSON
- File is readable and not empty
If validation fails, a detailed error message is returned describing the issue.
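The checks listed above can be approximated locally before calling the tools. The following Python sketch is an illustration only, not the package's own validator: the function name validate_dataset_csv is hypothetical, and it assumes quotes inside the query column are escaped the way Python's csv module expects (by doubling them), which may differ from what the server-side validation accepts.

```python
import csv
import json
from pathlib import Path
from typing import List


def validate_dataset_csv(path: str) -> List[str]:
    """Return a list of problems found; an empty list means the file passed."""
    p = Path(path)
    if p.suffix.lower() != ".csv":
        return [f"{p.name}: file extension must be .csv"]
    try:
        with p.open(newline="", encoding="utf-8") as fh:
            reader = csv.DictReader(fh)
            # fieldnames is None when the file has no header row at all.
            if not reader.fieldnames:
                return [f"{p.name}: file is empty"]
            if "query" not in reader.fieldnames:
                return [f"{p.name}: missing 'query' column"]
            rows = list(reader)
    except (OSError, UnicodeDecodeError, csv.Error) as exc:
        return [f"{p.name}: cannot read file: {exc}"]

    if not rows:
        return [f"{p.name}: file has a header but no data rows"]

    problems = []
    for index, row in enumerate(rows, start=1):
        try:
            # A missing cell is read as None; treat it as an empty string.
            value = json.loads(row["query"] or "")
        except json.JSONDecodeError as exc:
            problems.append(f"data row {index}: query is not valid JSON ({exc.msg})")
            continue
        if not isinstance(value, dict):
            problems.append(f"data row {index}: query must be a JSON object")
    return problems
```

Calling validate_dataset_csv("/path/to/queries.csv") returns an empty list when every check passes, so the result can be used directly as a go/no-go signal before uploading.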
Error Handling
All tools return structured error messages. Errors from API calls are wrapped with context about which step failed:
{
"error": "failed to create evaluation dataset: service error"
}
For the workflow tool, errors include the step number:
"step 4: failed to create presigned URL: ..."
"step 7: evaluation polling timed out"
Metric Categories
Available metric categories (when filtering in workflow tool):
- METRIC_CATEGORY_CORRECTNESS: Correctness and accuracy metrics
- METRIC_CATEGORY_USER_OUTCOMES: User satisfaction and engagement metrics
- METRIC_CATEGORY_SAFETY_AND_SECURITY: Safety and security related metrics
- METRIC_CATEGORY_CONTEXT_QUALITY: Context and retrieval quality metrics
- METRIC_CATEGORY_MODEL_FIT: Model fit and performance metrics
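When using the atomic tools, the same category filtering can be done by hand on the output of genai-list-evaluation-metrics. The sketch below assumes each metric is a dictionary with the metric_uuid and category fields shown in that tool's sample response; the helper name metric_uuids_for_categories is hypothetical. As with the workflow tool, an empty category list selects every metric.

```python
from typing import Any, Dict, Iterable, List


def metric_uuids_for_categories(
    metrics: Iterable[Dict[str, Any]], categories: Iterable[str]
) -> List[str]:
    """Pick the UUIDs of metrics whose category is in `categories`."""
    wanted = set(categories)
    return [
        m["metric_uuid"]
        for m in metrics
        # An empty filter means "use all metrics".
        if not wanted or m.get("category") in wanted
    ]
```

The returned list can be passed as the metrics argument of genai-create-evaluation-test-case.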
Evaluation Run Status Values
- EVALUATION_RUN_QUEUED: Run is waiting to start
- EVALUATION_RUN_RUNNING: Run is currently executing
- EVALUATION_RUN_RUNNING_DATASET: Processing dataset
- EVALUATION_RUN_EVALUATING_RESULTS: Evaluating metric results
- EVALUATION_RUN_SUCCESSFUL: Run completed successfully
- EVALUATION_RUN_PARTIALLY_SUCCESSFUL: Some metrics were evaluated, others failed
- EVALUATION_RUN_FAILED: Run failed completely
- EVALUATION_RUN_CANCELLED: Run was cancelled
Terminal statuses: SUCCESSFUL, FAILED, CANCELLED, PARTIALLY_SUCCESSFUL
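A caller that polls genai-get-evaluation-run by hand needs to stop on any terminal status or when a timeout expires. The following Python sketch shows one way to structure that loop. It is not the workflow tool's implementation: wait_for_run and the get_status callable are hypothetical stand-ins for whatever client code fetches the run status, and the defaults mirror the documented 300-second timeout and 5-second poll interval.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {
    "EVALUATION_RUN_SUCCESSFUL",
    "EVALUATION_RUN_PARTIALLY_SUCCESSFUL",
    "EVALUATION_RUN_FAILED",
    "EVALUATION_RUN_CANCELLED",
}


def wait_for_run(
    get_status: Callable[[], str],
    timeout_seconds: float = 300.0,
    poll_interval_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the run reaches a terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"evaluation polling timed out (last status: {status})")
        sleep(poll_interval_seconds)
```

The sleep parameter exists only so the loop can be exercised in tests without waiting.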
Model Evaluation Tools
These tools evaluate raw models directly (not full agent deployments). They use the /v2/genai/model_evaluation* API endpoints.
Key Concepts
- Candidate Model: The model being evaluated.
- Judge Model: An LLM that scores the candidate model’s responses.
Tools
Atomic Tools
genai-model-eval-list-metrics
List all available model evaluation metrics.
Arguments: None
Returns: JSON object with array of metrics and count
{
"metrics": [
{
"metric_uuid": "...",
"metric_name": "correctness",
"category": "METRIC_CATEGORY_CORRECTNESS"
}
],
"count": 5
}
genai-model-eval-list-datasets
List previously uploaded evaluation datasets so you can reuse an existing dataset’s UUID in genai-model-eval-create-run (instead of uploading a new one). Defaults to model-evaluation datasets.
Arguments:
- dataset_type (string, optional): Filter by dataset type. Defaults to EVALUATION_DATASET_TYPE_MODEL. Other values: EVALUATION_DATASET_TYPE_UNKNOWN, EVALUATION_DATASET_TYPE_ADK, EVALUATION_DATASET_TYPE_NON_ADK.
Returns: JSON object with an array of datasets and a count. Each dataset includes dataset_uuid, dataset_name, created_at, row_count, file_size, and has_ground_truth.
{
"datasets": [
{
"dataset_uuid": "...",
"dataset_name": "queries.csv",
"row_count": 20,
"has_ground_truth": true,
"created_at": "2025-01-01T00:00:00Z"
}
],
"count": 1
}
Backed by GET /v2/gen-ai/evaluation_datasets. Use the returned dataset_uuid directly as the dataset_uuid argument of genai-model-eval-create-run.
genai-model-eval-create-dataset
Upload and register a model evaluation dataset (presign → Spaces upload → database record).
Arguments:
- name (string, required): Name for the dataset
- file_path (string, required): Path to a .csv or .jsonl file to upload. CSV must include an input column; JSONL must be one JSON object per line with an input field. ground_truth is optional in both formats.
Returns: JSON object with the registered dataset UUID and upload metadata
{
"evaluation_dataset_uuid": "...",
"dataset_uuid": "...",
"object_key": "...",
"name": "my_dataset",
"file_name": "queries.csv",
"file_size": 1024
}
genai-model-eval-create-run
Create a model evaluation run.
User confirmation (chat, two steps): (1) Call without user_message — returns a preview with prompt_for_user. Post that to the end user and wait for their chat reply. (2) Call again with user_message set to their verbatim reply (typically yes) and the same arguments. The run is not created until step 2.
Arguments:
- name (string, required): Name for the evaluation run
- candidate_model_name (string, required): Exact candidate model name (partial names return a match list)
- candidate_model_uuid (string, optional): Exact full candidate UUID (optional when name is exact)
- eval_preset_uuid (string, optional): Preset UUID (dataset/judge/metrics from preset; judge name not required)
- dataset_uuid (string, required without preset): Dataset UUID. Get it from genai-model-eval-create-dataset (returns evaluation_dataset_uuid) or, for an already-uploaded dataset, from genai-model-eval-list-datasets.
- judge_model_name (string, required without preset): Exact judge model name
- judge_model_uuid (string, optional): Exact full judge UUID
- metric_uuids (array of strings, optional): Metric UUIDs to evaluate
- star_metric (object, optional): Primary success metric
- candidate_inference_config (object, optional): Inference params (max_tokens, temperature, top_p)
- user_message (string, optional): End user’s verbatim chat reply after preview (second call; typically yes)
Returns: JSON object with the evaluation run UUID
{
"eval_run_uuid": "...",
"name": "my_eval_run"
}
genai-model-eval-list-runs
List model evaluation runs with optional filters.
Arguments:
- status (string, optional): Filter by status (e.g., MODEL_EVALUATION_RUN_SUCCESSFUL, FAILED, QUEUED)
- page (number, optional): Page number
- per_page (number, optional): Results per page
Returns: JSON object with array of run summaries
genai-model-eval-get-run
Get a single model evaluation run with per-prompt results.
Arguments:
- eval_run_uuid (string, required): UUID of the evaluation run
- page (number, optional): Page for per-prompt results
- per_page (number, optional): Per-prompt results per page
Returns: JSON object with full run detail and per-prompt results
genai-model-eval-get-results-download-url
Get a presigned download URL for the full results of an evaluation run.
Arguments:
- eval_run_uuid (string, required): UUID of the evaluation run
Returns: JSON object with download URL and expiry
{
"download_url": "https://...",
"expires_at": "2025-01-01T00:00:00Z"
}
genai-model-eval-delete-run
Delete a model evaluation run by UUID. Deletion is permanent: the run record and its results cannot be recovered.
User consent: confirm_deletion must be true. Present eval_run_uuid and that deletion is permanent, then ask for yes/no in chat. Only set confirm_deletion: true after the user explicitly agrees.
Arguments:
- eval_run_uuid (string, required): UUID of the run to delete
- confirm_deletion (boolean, required): Must be true; only after the user has agreed in chat
Returns: JSON status from the delete API.
genai-model-eval-cancel-run
Cancel an in-progress model evaluation run by UUID. The run transitions to MODEL_EVALUATION_RUN_CANCELLING and then MODEL_EVALUATION_RUN_CANCELLED. Any partial results may be lost.
User consent: confirm_cancel must be true. Present eval_run_uuid and that partial results may be lost, then ask for yes/no in chat. Only set confirm_cancel: true after the user explicitly agrees.
Arguments:
- eval_run_uuid (string, required): UUID of the run to cancel
- confirm_cancel (boolean, required): Must be true; only after the user has agreed in chat
Returns: JSON object with the cancelled run summary.
genai-model-eval-delete-preset
Delete a saved model evaluation preset by UUID. Existing runs that referenced the preset are not affected.
User consent: confirm_deletion must be true. Present eval_preset_uuid and that deletion is permanent, then ask for yes/no in chat. Only set confirm_deletion: true after the user explicitly agrees.
Arguments:
- eval_preset_uuid (string, required): UUID of the preset to delete
- confirm_deletion (boolean, required): Must be true; only after the user has agreed in chat
Returns: JSON object confirming the deletion.
genai-model-eval-delete-dataset
Delete an evaluation dataset by UUID. Works for both model and agent evaluation datasets. Deletion is permanent: the dataset record cannot be recovered.
User consent: confirm_deletion must be true. Present dataset_uuid and that deletion is permanent, then ask for yes/no in chat. Only set confirm_deletion: true after the user explicitly agrees.
Arguments:
- dataset_uuid (string, required): UUID of the dataset to delete. Get it from genai-model-eval-list-datasets (dataset_uuid) or genai-model-eval-create-dataset (evaluation_dataset_uuid).
- confirm_deletion (boolean, required): Must be true; only after the user has agreed in chat
Returns: JSON object confirming the deletion.
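The consent rule shared by the delete and cancel tools can be enforced in client code by refusing to build the arguments at all unless the user has clearly agreed. The sketch below is one possible guard, shown for genai-model-eval-delete-dataset. The helper name build_delete_dataset_args and the set of replies treated as agreement are assumptions made for illustration; the README only requires that the user explicitly agrees in chat.

```python
from typing import Any, Dict, Optional

# Assumption: only these replies count as explicit agreement.
AFFIRMATIVE_REPLIES = {"yes", "y"}


def build_delete_dataset_args(
    dataset_uuid: str, user_reply: str
) -> Optional[Dict[str, Any]]:
    """Return tool arguments only when the user explicitly agreed, else None."""
    if user_reply.strip().lower() not in AFFIRMATIVE_REPLIES:
        return None
    return {"dataset_uuid": dataset_uuid, "confirm_deletion": True}
```

Returning None rather than arguments with confirm_deletion set to false keeps an ambiguous reply from ever reaching the delete call.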
Orchestrated Workflow Tool
genai-model-eval-run-workflow
Run a complete model evaluation workflow: upload dataset, create run, and poll for results.
User consent: Same two-step chat confirmation as genai-model-eval-create-run.
Arguments:
- dataset_file_path (string, required): Path to the .csv or .jsonl evaluation dataset
- name (string, required): Name for the evaluation run
- candidate_model_name (string, required): Exact candidate model name
- candidate_model_uuid (string, optional): Exact full candidate UUID
- judge_model_name (string, required): Exact judge model name
- judge_model_uuid (string, optional): Exact full judge UUID
- metric_uuids (array of strings, optional): Metric UUIDs (if empty, all available metrics are used)
- candidate_inference_config (object, optional): Inference params
- timeout_seconds (number, optional): Polling timeout (default: 300)
- poll_interval_seconds (number, optional): Poll interval (default: 5)
- user_message (string, optional): End user’s verbatim chat reply after preview (second call; typically yes)
Returns: JSON object with complete workflow results
{
"eval_run_uuid": "...",
"status": "SUCCESSFUL",
"metric_results": [
{
"metric_name": "correctness",
"number_value": 0.92
}
],
"duration_seconds": 45.3,
"error_message": ""
}
Model Evaluation Dataset Format
Model evaluation datasets accept CSV or JSONL:
CSV, with a header row containing an input column (and optional ground_truth):
input,ground_truth
What is 2+2?,4
What is the capital of France?,Paris
JSONL, with one JSON object per line containing an input field (and optional ground_truth):
{"input":"What is 2+2?","ground_truth":"4"}
{"input":"What is the capital of France?","ground_truth":"Paris"}
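A JSONL dataset in this shape can be produced with the standard json module. The sketch below serializes one object per line and rejects rows without an input field; the helper name to_jsonl is hypothetical, and the check mirrors only the input requirement stated above, not any further server-side validation.

```python
import json
from typing import Any, Dict, Iterable


def to_jsonl(rows: Iterable[Dict[str, Any]]) -> str:
    """Serialize rows as JSON Lines, one object per line."""
    lines = []
    for index, row in enumerate(rows, start=1):
        if "input" not in row:
            raise ValueError(f"row {index}: missing 'input' field")
        # ensure_ascii=False keeps non-ASCII text readable in the file.
        lines.append(json.dumps(row, ensure_ascii=False))
    return "".join(line + "\n" for line in lines)
```

Write the returned string to a file with a .jsonl extension and pass its path as file_path to genai-model-eval-create-dataset.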
Model Evaluation Workflow Examples
Using Atomic Tools (Step-by-Step)
1. Upload dataset:
genai-model-eval-create-dataset
name: "my_dataset"
file_path: "/path/to/queries.csv"
2. List metrics:
genai-model-eval-list-metrics
3. Create run (first call — preview; post prompt_for_user and wait for user to type yes):
genai-model-eval-create-run
name: "eval_run_1"
candidate_model_name: "Llama 3.3 70B"
judge_model_name: "GPT-4o"
dataset_uuid: "<evaluation_dataset_uuid from step 1>"
metric_uuids: ["<metric-uuid-1>", "<metric-uuid-2>"]
4. Create run (second call — after user types yes):
genai-model-eval-create-run
(same arguments as step 3)
user_message: "yes"
5. Poll for results:
genai-model-eval-get-run
eval_run_uuid: "<eval_run_uuid from step 4>"
Using the Orchestrated Workflow (All-in-One)
# First call — preview; post prompt_for_user and wait for yes in chat
genai-model-eval-run-workflow
dataset_file_path: "/path/to/queries.csv"
name: "eval_llama_v1"
candidate_model_name: "Llama 3.3 70B"
judge_model_name: "GPT-4o"
# Second call — same args plus user_message (verbatim reply from end user)
genai-model-eval-run-workflow
dataset_file_path: "/path/to/queries.csv"
name: "eval_llama_v1"
candidate_model_name: "Llama 3.3 70B"
judge_model_name: "GPT-4o"
user_message: "yes"
timeout_seconds: 300
poll_interval_seconds: 5
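The two-call confirmation pattern used by genai-model-eval-create-run and genai-model-eval-run-workflow can be wrapped in a small helper. The Python sketch below is a rough illustration under several assumptions: call_tool and ask_user are hypothetical callables supplied by the MCP client, and the preview is assumed to be a dictionary exposing the prompt_for_user field described earlier. The user's reply is forwarded verbatim, leaving the decision about whether it counts as consent to the tool.

```python
from typing import Any, Callable, Dict

ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def run_with_confirmation(
    call_tool: ToolCaller,
    ask_user: Callable[[str], str],
    tool_name: str,
    arguments: Dict[str, Any],
) -> Dict[str, Any]:
    """Call a tool twice: once for the preview, once with the user's reply."""
    # First call: no user_message, so only a preview is returned.
    preview = call_tool(tool_name, dict(arguments))
    # Show the tool's prompt to the end user and capture the reply verbatim.
    reply = ask_user(preview["prompt_for_user"])
    # Second call: same arguments plus the user's reply.
    return call_tool(tool_name, {**arguments, "user_message": reply})
```

Because the original arguments are copied rather than modified, the same dictionary can be reused if the user declines and the flow is restarted later.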
Model Evaluation Run Status Values
- QUEUED: Run is waiting to start
- RUNNING_DATASET: Processing dataset queries through the candidate model
- EVALUATING_RESULTS: Judge model is scoring the responses
- CANCELLING: Run cancellation in progress
- CANCELLED: Run was cancelled
- SUCCESSFUL: Run completed successfully
- PARTIALLY_SUCCESSFUL: Some prompts were evaluated, others failed
- FAILED: Run failed completely
Terminal statuses: SUCCESSFUL, FAILED, CANCELLED, PARTIALLY_SUCCESSFUL