Reference for Agent Evaluation Metrics

Validated on 1 Jul 2025 • Last edited on 1 Jul 2025

The DigitalOcean GenAI Platform lets you work with popular foundation models and build GPU-powered AI agents with fully managed deployment, or send direct requests using serverless inference. You can create agents that incorporate guardrails, functions, agent routing, and retrieval-augmented generation (RAG) pipelines with knowledge bases.

General Agent Quality

General agent quality metrics measure the overall quality of an agent’s responses, including correctness, instruction following, and safety. These metrics help ensure that agents provide accurate, relevant, and safe responses to user queries.

| Metric | Description | Returns | Recommendations |
| --- | --- | --- | --- |
| Correctness (General Hallucinations) | Measures how factually accurate the agent’s response is without using context. High score = likely accurate; low score = possible hallucinations or errors. | Number (0 to 100); high = likely accurate | Flag low scores, adjust prompt, prevent non-factual answers |
| Instruction Following | Measures how well the agent follows instructions. High = followed closely; low = ignored parts. | Number (0 to 100); Boolean (Yes/No) | Flag ignored instructions, reword or vary instructions, add safeguards |
| Ground Truth Faithfulness (includes BLEU and ROUGE-1) | Compares response to known correct output. High = semantically equivalent; low = different meaning. Measures n-gram overlap with ground truth (BLEU = multi-word, ROUGE-1 = single-word). | % Yes judgments | Use with other metrics for full picture |
| Prompt Perplexity | Measures input prompt complexity and model confidence. Lower perplexity = better performance. | Number (0 to 100) | Lower perplexity, revise complex prompts |
| PII Leaks | Detects if input/output contains personally identifiable info (PII). | Boolean (Yes/No) | No recommendations |
| Toxicity | Flags hateful, offensive, or harmful content. | Boolean (Yes/No) | Apply guardrails, retrain agents, change models |
| Tone | Identifies emotional tone (Neutral, Joy, Love, Fear, etc.). | String | Align tone via agent instructions |
| Sexism | Flags sexist content; identifies harmful gender-based language. | Boolean (Yes/No) | Apply guardrails, retrain agents |
| Prompt Injection | Flags input designed to manipulate agent behavior. | Boolean (Yes/No) | Apply guardrails |
| User Goal Progress (Action Advancement) | Measures if the agent advanced the user’s task or question (partial/full answer, clarification, confirm action). | Number (0 to 100); 100 = advanced or accomplished at least one goal | No recommendations |
| User Goal Completion (Action Completion) | Measures if the agent fully accomplished the user’s goal; must be accurate, comprehensive, aligned with tool outputs. | Number (0 to 100) | No recommendations |
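
The n-gram overlap behind the BLEU and ROUGE-1 parts of Ground Truth Faithfulness can be illustrated with a small Python sketch. This is not the platform's implementation: the function names, whitespace tokenization, and sample strings are assumptions made for illustration. It also leaves out the parts of full BLEU that combine several n-gram sizes and apply a brevity penalty.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def overlap_counts(candidate, reference, n):
    """Count clipped n-gram matches plus the n-gram totals on each side."""
    # Assumption: lowercase whitespace tokenization; real scorers tokenize more carefully.
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    matches = sum((cand & ref).values())  # multiset intersection clips repeats
    return matches, sum(cand.values()), sum(ref.values())


def rouge1_recall(candidate, reference):
    """Share of ground-truth words that also appear in the response."""
    matches, _, ref_total = overlap_counts(candidate, reference, 1)
    return matches / ref_total if ref_total else 0.0


def ngram_precision(candidate, reference, n=2):
    """Share of response n-grams found in the ground truth (a BLEU building block)."""
    matches, cand_total, _ = overlap_counts(candidate, reference, n)
    return matches / cand_total if cand_total else 0.0


# Hypothetical agent response and ground-truth answer.
response = "the agent resets the password"
ground_truth = "the agent resets the user password"

print(rouge1_recall(response, ground_truth))         # single-word overlap
print(ngram_precision(response, ground_truth, n=2))  # two-word overlap
```

Because these scores only count shared word sequences, a correct paraphrase can score low, which is one reason to read them alongside the other metrics.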

RAG and Tool Correctness

RAG and Tool Correctness metrics evaluate how well agents use Retrieval-Augmented Generation (RAG) pipelines and external tools to provide accurate, relevant, and contextually grounded responses. These metrics help ensure that agents effectively leverage external knowledge and tools to enhance their responses.

| Metric | Description | Returns | Recommendations |
| --- | --- | --- | --- |
| Context Adherence (Context Hallucinations) | Measures whether the agent stays within the retrieved context when generating a response. High = relies only on provided facts; low = introduces unsupported information. | Number (0 to 1); score close to 1 means fully adherent; close to 0 means hallucinations likely. | No recommendations |
| Response-Context Completeness (Completeness) | Measures how thoroughly the agent covers key details from the provided context. | Number (0 to 1) | Rewrite the prompt to explicitly ask for full inclusion of relevant info; adjust prompt to encourage thorough coverage of key details |
| Retrieved Chunk Usage (Chunk Attribution) | Shows whether each chunk influenced the response. | Boolean (Yes/No) per chunk; also Number (N of K) for count of attributed chunks | Reduce number of retrieved chunks if many unused; tune retrieval to balance recall and performance; use attribution for debugging |
| Retrieved Context Relevance (Context Relevance) | Measures how relevant the retrieved context is to the input prompt; checks if the context supports the query. | Number (0 or 100); high = significant similarity or relevance. | No recommendations |
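
As a rough illustration of how the per-chunk Yes/No results of Retrieved Chunk Usage relate to its "N of K" count, the sketch below summarizes a list of attribution flags. The function name and the list-of-booleans input shape are hypothetical. They are not the platform's API or export format.

```python
def summarize_chunk_usage(attributed):
    """Return (N, K, unused_indexes) from per-chunk Yes/No attribution flags."""
    used = sum(1 for flag in attributed if flag)
    unused = [i for i, flag in enumerate(attributed) if not flag]
    return used, len(attributed), unused


# Hypothetical flags for five retrieved chunks (True = chunk influenced the response).
flags = [True, False, True, False, False]
n, k, unused = summarize_chunk_usage(flags)
print(f"{n} of {k} chunks attributed; unused chunk indexes: {unused}")
```

If the unused list stays long across many runs, that is the situation where the recommendation to reduce the number of retrieved chunks applies.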
