Reference for Agent Evaluation Metrics
Validated on 1 Jul 2025 • Last edited on 1 Jul 2025
The DigitalOcean GenAI Platform lets you work with popular foundation models and build GPU-powered AI agents with fully managed deployment, or send direct requests using serverless inference. Create agents that incorporate guardrails, functions, agent routing, and retrieval-augmented generation (RAG) pipelines with knowledge bases.
General Agent Quality
General agent quality metrics measure the overall quality of an agent’s responses, including correctness, instruction following, and safety. These metrics help ensure that agents provide accurate, relevant, and safe responses to user queries.
Metric | Description | Returns | Recommendations |
---|---|---|---|
Correctness (General Hallucinations) | Measures how factually accurate the agent’s response is without using context. High score = likely accurate; low score = possible hallucinations or errors. | Number (0 to 100); high = likely accurate | Flag low scores, adjust the prompt, prevent non-factual answers |
Instruction Following | Measures how well the agent follows instructions. High = followed closely; low = ignored parts. | Number (0 to 100); Boolean (Yes/No) | Flag ignored instructions, reword or vary instructions, add safeguards |
Ground Truth Faithfulness (includes BLEU and ROUGE-1) | Compares the response to a known correct output. High = semantically equivalent; low = different meaning. BLEU and ROUGE-1 measure n-gram overlap with the ground truth (BLEU = multi-word, ROUGE-1 = single-word); see the overlap sketch after this table. | % Yes judgments | Use with other metrics for a full picture |
Prompt Perplexity | Measures input prompt complexity and model confidence. Lower perplexity = better performance; see the perplexity sketch after this table. | Number (0 to 100) | Lower perplexity by revising complex prompts |
PII Leaks | Detects whether the input or output contains personally identifiable information (PII). | Boolean (Yes/No) | No recommendations |
Toxicity | Flags hateful, offensive, or harmful content. | Boolean (Yes/No) | Apply guardrails, retrain agents, change models |
Tone | Identifies emotional tone (Neutral, Joy, Love, Fear, etc.). | String | Align tone via agent instructions |
Sexism | Flags sexist content; identifies harmful gender-based language. | Boolean (Yes/No) | Apply guardrails, retrain agents |
Prompt Injection | Flags input designed to manipulate agent behavior. | Boolean (Yes/No) | Apply guardrails |
User Goal Progress (Action Advancement) | Measures whether the agent advanced the user’s task or question (giving a partial or full answer, asking a clarifying question, or confirming an action). | Number (0 to 100); 100 = advanced or accomplished at least one goal | No recommendations |
User Goal Completion (Action Completion) | Measures whether the agent fully accomplished the user’s goal; the response must be accurate, comprehensive, and aligned with tool outputs. | Number (0 to 100) | No recommendations |
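To make the BLEU and ROUGE-1 distinction in Ground Truth Faithfulness concrete, here is a minimal Python sketch of unigram-level overlap scoring. It is illustrative only: the platform computes the metric for you, and full BLEU also combines 2- to 4-gram precisions with a brevity penalty. The example strings are hypothetical.

```python
from collections import Counter

def rouge1_f1(response: str, reference: str) -> float:
    """ROUGE-1: F1 over single-word (unigram) overlap with the ground truth."""
    resp, ref = Counter(response.split()), Counter(reference.split())
    overlap = sum((resp & ref).values())  # shared words, counted with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(resp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def unigram_bleu(response: str, reference: str) -> float:
    """Clipped unigram precision, the 1-gram core of BLEU. Full BLEU also
    averages higher-order n-gram precisions and applies a brevity penalty."""
    resp, ref = Counter(response.split()), Counter(reference.split())
    clipped = sum(min(count, ref[word]) for word, count in resp.items())
    return clipped / sum(resp.values())

print(rouge1_f1("the cat sat on the mat", "the cat is on the mat"))    # ~0.83
print(unigram_bleu("the cat sat on the mat", "the cat is on the mat"))  # ~0.83
```

Both scores here agree because only unigrams are compared; on longer texts, BLEU’s multi-word matching penalizes reordered or paraphrased responses more heavily than ROUGE-1 does.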
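Prompt Perplexity follows the standard definition: the exponential of the average negative log-probability the model assigns to the prompt’s tokens, so a less "surprising" prompt scores lower. The platform reports the metric directly; this sketch only shows the arithmetic, and the per-token log-probability values are made up.

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity from per-token natural-log probabilities:
    PPL = exp(-(1/N) * sum(log p(t_i))). Lower = the model is
    more confident about the prompt."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical log-probs a model might assign to a 4-token prompt.
print(round(perplexity([-0.2, -1.1, -0.5, -2.3]), 2))  # 2.79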
RAG and Tool Correctness
RAG and Tool Correctness metrics evaluate how well agents use retrieval-augmented generation (RAG) pipelines and external tools to produce accurate, relevant, and contextually grounded responses. These metrics help ensure that agents effectively leverage external knowledge and tools.
Metric | Description | Returns | Recommendations |
---|---|---|---|
Context Adherence (Context Hallucinations) | Measures whether the agent stays within the retrieved context when generating a response. High = relies only on provided facts; low = introduces unsupported information. | Number (0 to 1); close to 1 = fully adherent, close to 0 = hallucinations likely | No recommendations |
Response-Context Completeness (Completeness) | Measures how thoroughly the agent covers key details from the provided context. | Number (0 to 1) | Rewrite the prompt to explicitly ask for full inclusion of relevant info; adjust the prompt to encourage thorough coverage of key details |
Retrieved Chunk Usage (Chunk Attribution) | Shows whether each retrieved chunk influenced the response; see the sketch after this table. | Boolean (Yes/No) per chunk; also Number (N of K) for the count of attributed chunks | Reduce the number of retrieved chunks if many go unused; tune retrieval to balance recall and performance; use attribution for debugging |
Retrieved Context Relevance (Context Relevance) | Measures how relevant the retrieved context is to the input prompt; checks whether the context supports the query. | Number (0 to 100); high = significant similarity or relevance | No recommendations |
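The "N of K" return value of Chunk Attribution counts how many retrieved chunks influenced the response. The platform judges attribution for you; the sketch below only illustrates the Yes/No-per-chunk and N-of-K output shapes using a naive lexical-overlap heuristic (the min_overlap threshold, sample chunks, and response text are all hypothetical, and real attribution is judged semantically, not by word overlap).

```python
def attributed_chunks(response: str, chunks: list[str], min_overlap: int = 4):
    """Mark a chunk as 'used' if it shares at least min_overlap words
    with the response; return per-chunk flags plus an 'N of K' summary."""
    resp_words = set(response.lower().split())
    flags = [len(resp_words & set(chunk.lower().split())) >= min_overlap
             for chunk in chunks]
    return flags, f"{sum(flags)} of {len(chunks)}"

chunks = [
    "Droplets are Linux-based virtual machines that run on DigitalOcean.",
    "Knowledge bases store indexed documents for RAG pipelines.",
]
response = ("Droplets are Linux-based virtual machines you can run "
            "on DigitalOcean for your workloads.")
flags, n_of_k = attributed_chunks(response, chunks)
print(flags, n_of_k)  # [True, False] 1 of 2
```

A low N-of-K count like the one above is the signal behind the table’s recommendation: if most retrieved chunks go unattributed, reduce the retrieval count or tune the retriever rather than paying to send unused context to the model.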