Reference for Agent Evaluation Metrics
Validated on 1 Jul 2025 • Last edited on 1 Jul 2025
The DigitalOcean GenAI Platform lets you work with popular foundation models and build GPU-powered AI agents with fully managed deployment, or send direct requests using serverless inference. Create agents that incorporate guardrails, functions, agent routing, and retrieval-augmented generation (RAG) pipelines with knowledge bases.
General Agent Quality
General agent quality metrics measure the overall quality of an agent’s responses, including correctness, instruction following, and safety. These metrics help ensure that agents provide accurate, relevant, and safe responses to user queries.
Metric | Description | Returns | Recommendations |
---|---|---|---|
Correctness (General Hallucinations) | Measures how factually accurate the agent’s response is without using context. High score = likely accurate; low score = possible hallucinations or errors. | Number (0 to 100); high = likely accurate | Flag low scores, adjust the prompt, prevent non-factual answers |
Instruction Following | Measures how well the agent follows instructions. High = followed closely; low = ignored parts. | Number (0 to 100); Boolean (Yes/No) | Flag ignored instructions, reword or vary instructions, add safeguards |
Ground Truth Faithfulness (includes BLEU and ROUGE-1) | Compares the response to a known correct output. High = semantically equivalent; low = different meaning. BLEU and ROUGE-1 measure n-gram overlap with the ground truth (BLEU = multi-word, ROUGE-1 = single-word); see the overlap sketch after this table. | % Yes judgments | Use with other metrics for a full picture |
Prompt Perplexity | Measures input prompt complexity and model confidence. Lower perplexity = better performance; see the perplexity sketch after this table. | Number (0 to 100) | Lower perplexity by revising complex prompts |
PII Leaks | Detects whether the input or output contains personally identifiable information (PII). | Boolean (Yes/No) | No recommendations |
Toxicity | Flags hateful, offensive, or harmful content. | Boolean (Yes/No) | Apply guardrails, retrain agents, change models |
Tone | Identifies emotional tone (Neutral, Joy, Love, Fear, etc.). | String | Align tone via agent instructions |
Sexism | Flags sexist content; identifies harmful gender-based language. | Boolean (Yes/No) | Apply guardrails, retrain agents |
Prompt Injection | Flags input designed to manipulate agent behavior. | Boolean (Yes/No) | Apply guardrails |
User Goal Progress (Action Advancement) | Measures whether the agent advanced the user’s task or question (giving a partial or full answer, asking a clarifying question, or confirming an action). | Number (0 to 100); 100 = advanced or accomplished at least one goal | No recommendations |
User Goal Completion (Action Completion) | Measures whether the agent fully accomplished the user’s goal; the response must be accurate, comprehensive, and aligned with tool outputs. | Number (0 to 100) | No recommendations |
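To make the BLEU and ROUGE-1 distinction in Ground Truth Faithfulness concrete, here is a minimal Python sketch of unigram-level overlap scoring. It is illustrative only: the platform computes the metric for you, and full BLEU also combines 2- to 4-gram precisions with a brevity penalty. The example strings are hypothetical.

```python
from collections import Counter

def rouge1_f1(response: str, reference: str) -> float:
    """ROUGE-1: F1 over single-word (unigram) overlap with the ground truth."""
    resp, ref = Counter(response.split()), Counter(reference.split())
    overlap = sum((resp & ref).values())  # shared words, counted with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(resp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def unigram_bleu(response: str, reference: str) -> float:
    """Clipped unigram precision, the 1-gram core of BLEU. Full BLEU also
    averages higher-order n-gram precisions and applies a brevity penalty."""
    resp, ref = Counter(response.split()), Counter(reference.split())
    clipped = sum(min(count, ref[word]) for word, count in resp.items())
    return clipped / sum(resp.values())

print(rouge1_f1("the cat sat on the mat", "the cat is on the mat"))    # ~0.83
print(unigram_bleu("the cat sat on the mat", "the cat is on the mat"))  # ~0.83
```

Both scores here agree because only unigrams are compared; on longer texts, BLEU’s multi-word matching penalizes reordered or paraphrased responses more heavily than ROUGE-1 does.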
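Prompt Perplexity follows the standard definition: the exponential of the average negative log-probability the model assigns to the prompt’s tokens, so a less "surprising" prompt scores lower. The platform reports the metric directly; this sketch only shows the arithmetic, and the per-token log-probability values are made up.

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity from per-token natural-log probabilities:
    PPL = exp(-(1/N) * sum(log p(t_i))). Lower = the model is
    more confident about the prompt."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical log-probs a model might assign to a 4-token prompt.
print(round(perplexity([-0.2, -1.1, -0.5, -2.3]), 2))  # 2.79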
RAG and Tool Correctness
RAG and Tool Correctness metrics evaluate how well agents use retrieval-augmented generation (RAG) pipelines and external tools to produce accurate, relevant, and contextually grounded responses. These metrics help ensure that agents effectively leverage external knowledge and tools.
Metric | Description | Returns | Recommendations |
---|---|---|---|
Context Adherence (Context Hallucinations) | Measures whether the agent stays within the retrieved context when generating a response. High = relies only on provided facts; low = introduces unsupported information. | Number (0 to 1); close to 1 = fully adherent, close to 0 = hallucinations likely | No recommendations |
Response-Context Completeness (Completeness) | Measures how thoroughly the agent covers key details from the provided context. | Number (0 to 1) | Rewrite the prompt to explicitly ask for full inclusion of relevant info; adjust the prompt to encourage thorough coverage of key details |
Retrieved Chunk Usage (Chunk Attribution) | Shows whether each retrieved chunk influenced the response; see the sketch after this table. | Boolean (Yes/No) per chunk; also Number (N of K) for the count of attributed chunks | Reduce the number of retrieved chunks if many go unused; tune retrieval to balance recall and performance; use attribution for debugging |
Retrieved Context Relevance (Context Relevance) | Measures how relevant the retrieved context is to the input prompt; checks whether the context supports the query. | Number (0 to 100); high = significant similarity or relevance | No recommendations |
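The "N of K" return value of Chunk Attribution counts how many retrieved chunks influenced the response. The platform judges attribution for you; the sketch below only illustrates the Yes/No-per-chunk and N-of-K output shapes using a naive lexical-overlap heuristic (the min_overlap threshold, sample chunks, and response text are all hypothetical, and real attribution is judged semantically, not by word overlap).

```python
def attributed_chunks(response: str, chunks: list[str], min_overlap: int = 4):
    """Mark a chunk as 'used' if it shares at least min_overlap words
    with the response; return per-chunk flags plus an 'N of K' summary."""
    resp_words = set(response.lower().split())
    flags = [len(resp_words & set(chunk.lower().split())) >= min_overlap
             for chunk in chunks]
    return flags, f"{sum(flags)} of {len(chunks)}"

chunks = [
    "Droplets are Linux-based virtual machines that run on DigitalOcean.",
    "Knowledge bases store indexed documents for RAG pipelines.",
]
response = ("Droplets are Linux-based virtual machines you can run "
            "on DigitalOcean for your workloads.")
flags, n_of_k = attributed_chunks(response, chunks)
print(flags, n_of_k)  # [True, False] 1 of 2
```

A low N-of-K count like the one above is the signal behind the table’s recommendation: if most retrieved chunks go unattributed, reduce the retrieval count or tune the retriever rather than paying to send unused context to the model.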