AI Agent Evaluation Framework

The evaluation engine for every agent you deploy.

AgentFit is the open-source framework that gives teams a structured, reproducible way to evaluate any AI agent against their specific business requirements — across 7 behavioral dimensions, with explanations you can actually act on.

Capabilities

Six capabilities. One evaluation layer.
Built to make every deployment defensible.

01

Business Need Profiles

Define your organization's agent requirements in a structured markdown file — which capabilities matter, how they should be weighted, what compliance standards apply. Every evaluation is anchored to your context, not an abstract benchmark.

02

Seven Evaluation Dimensions

Task Competence, Tool Use, Autonomy & Escalation, Safety & Alignment, Compliance & Auditability, Operational Performance, and Deployment Compatibility — each producing a 0–1 score with sub-metrics and weighted feedback.

03

LLM-Powered Interpretability

After scoring, AgentFit packages the full evaluation — scores, sub-metric breakdowns, exact arithmetic, and your BNP context — into a structured prompt sent to your LLM. The result: explanations grounded in your requirements, not post-hoc commentary.

04

Framework-Agnostic Protocol

Bring any OpenAI, Anthropic, Google, or fully custom agent. AgentFit evaluates through a universal protocol without you changing a single line of agent code. Pre-built adapters for the major providers — custom adapters in 3 async methods.

05

Reproducible & Auditable

Every evaluation locked, timestamped, and exportable. Compare agent versions side by side. Share results across teams. Full calculation trail for enterprise governance — no black box, no silent weighting.

06

Scalable Architecture

From a single laptop run to a multi-tenant evaluation platform. All seven dimensions run concurrently via asyncio. REST API with background tasks supports CI/CD pipelines, webhooks, and batch evaluation on every agent commit.

Workflow

Four steps to a defensible deployment.
Every score traceable to its source.

I

Define Your BNP

Write a Business Need Profile — a lightweight markdown file expressing your organization's agent requirements: which capabilities matter, how they should be weighted, what compliance standards apply, and at what task complexity you're operating.

Define capabilities, weights, and compliance requirements
Machine-readable markdown — version-control it alongside your agent
Anchored to your domain and organization, not a generic benchmark
Example weights: 35%, 25%, 25%, 15%. Total weight: 100% (BNP defined).
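As a rough illustration of the weighting idea, the sketch below reads weights from a markdown-style profile and checks that they total 100%. The bullet layout, the pairing of dimension names with the four example weights (35%, 25%, 25%, 15%), and the function name `parse_bnp_weights` are assumptions made for illustration; AgentFit's actual BNP schema may look different.

```python
import re

# Hypothetical BNP snippet; the real AgentFit file format may differ.
BNP_TEXT = """
# Business Need Profile (illustrative)
- Task Competence: 35%
- Safety & Alignment: 25%
- Compliance & Auditability: 25%
- Tool Use: 15%
"""

def parse_bnp_weights(text: str) -> dict[str, float]:
    """Extract '- Name: NN%' bullet lines into a {name: fraction} mapping."""
    pattern = r"^-[ \t]*(.+?):[ \t]*(\d+(?:\.\d+)?)%[ \t]*$"
    weights = {}
    for match in re.finditer(pattern, text, re.MULTILINE):
        weights[match.group(1).strip()] = float(match.group(2)) / 100.0
    total = sum(weights.values())
    # Reject profiles whose weights do not add up to 100%.
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"weights sum to {total:.2%}, expected 100%")
    return weights

weights = parse_bnp_weights(BNP_TEXT)
```

Failing fast on a bad total keeps a mistyped profile from silently skewing every downstream score.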
II

Connect Your Agent

Wrap any agent in the universal protocol using pre-built adapters for OpenAI, Anthropic, and Google — or write a custom adapter in 3 async methods. Zero changes to your existing agent code required.

Pre-built adapters for OpenAI, Anthropic, Google, and more
Universal protocol — no changes to existing agent code
Custom adapters built in 3 async methods
Results locked: reproducible and auditable
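To make "a custom adapter in 3 async methods" concrete, here is a minimal sketch of what such an interface could look like. The method names `setup`, `run_task`, and `teardown`, the `AgentAdapter` protocol, and the `EchoAgentAdapter` class are all hypothetical; the page does not name the three methods, so treat this as one plausible shape and not AgentFit's actual protocol.

```python
import asyncio
from typing import Protocol, runtime_checkable

@runtime_checkable
class AgentAdapter(Protocol):
    """Hypothetical three-method async interface; names are illustrative."""
    async def setup(self) -> None: ...
    async def run_task(self, prompt: str) -> str: ...
    async def teardown(self) -> None: ...

class EchoAgentAdapter:
    """Wraps a trivial stand-in agent; the wrapped agent code is untouched."""
    def __init__(self) -> None:
        self.ready = False

    async def setup(self) -> None:
        self.ready = True

    async def run_task(self, prompt: str) -> str:
        if not self.ready:
            raise RuntimeError("call setup() first")
        return f"echo: {prompt}"

    async def teardown(self) -> None:
        self.ready = False

async def demo() -> str:
    adapter = EchoAgentAdapter()
    await adapter.setup()
    try:
        return await adapter.run_task("ping")
    finally:
        # Always release resources, even if the task raises.
        await adapter.teardown()

result = asyncio.run(demo())
```

Because the protocol is structural, the adapter class does not need to inherit from anything; it only has to provide the three coroutines.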
III

Run the Evaluation

Seven behavioral dimensions evaluated concurrently. Each produces a 0–1 score with sub-metrics, weighted feedback, and pass/fail thresholds — all anchored to your BNP, not an abstract benchmark.

All 7 dimensions run concurrently via asyncio.gather
Per-dimension score with every sub-metric and its weight contribution
CLI, Python SDK, or REST API — fits any workflow
Agent Score: 76 / 100
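The concurrency pattern named above can be sketched with the standard library alone. The dimension names come from this page; the `evaluate_dimension` coroutine and its fixed dummy score are placeholders, since the real evaluators are not described here.

```python
import asyncio

DIMENSIONS = [
    "Task Competence", "Tool Use", "Autonomy & Escalation", "Safety & Alignment",
    "Compliance & Auditability", "Operational Performance", "Deployment Compatibility",
]

async def evaluate_dimension(name: str) -> tuple[str, float]:
    """Placeholder evaluator; a real one would exercise the agent."""
    await asyncio.sleep(0.01)  # stands in for I/O-bound evaluation work
    return name, 0.5           # fixed dummy score in the 0 to 1 range

async def evaluate_all() -> dict[str, float]:
    # gather runs all seven coroutines concurrently and returns
    # results in the same order as the inputs.
    results = await asyncio.gather(*(evaluate_dimension(d) for d in DIMENSIONS))
    return dict(results)

scores = asyncio.run(evaluate_all())
```

Since the evaluators overlap while waiting on I/O, total wall time stays close to the slowest single dimension instead of the sum of all seven.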
IV

Get the Interpretation

The LLM receives your complete evaluation — scores, sub-metric breakdowns, exact weighted arithmetic, and BNP context — and returns business-grounded explanations with prioritized, actionable recommendations.

Full calculation trail passed to your LLM of choice
Explanations arithmetically grounded — not hallucinated summaries
Prioritized recommendations tied to your weakest dimensions
Step 1: Define · Step 2: Evaluate · Step 3: Interpret · Structured, not subjective
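One way to picture the interpretation step is as a function that bundles scores, weights, and the weighted arithmetic into a single prompt string. The payload keys, the instruction wording, and the function name `build_interpretation_prompt` below are assumptions for illustration; the actual prompt AgentFit sends is not shown on this page.

```python
import json

def build_interpretation_prompt(scores: dict[str, float],
                                weights: dict[str, float]) -> str:
    """Assemble scores, weights, and per-dimension arithmetic into one prompt."""
    # Each contribution is score * weight, so the LLM can cite exact numbers.
    contributions = {name: round(scores[name] * weights[name], 4) for name in scores}
    payload = {
        "scores": scores,
        "weights": weights,
        "contributions": contributions,
        "overall": round(sum(contributions.values()), 4),
    }
    return (
        "Explain the evaluation below using only the numbers provided.\n"
        "List recommendations starting with the weakest dimension.\n\n"
        + json.dumps(payload, indent=2, sort_keys=True)
    )

# Toy two-dimension input, chosen only to keep the arithmetic easy to check.
prompt = build_interpretation_prompt(
    scores={"Task Competence": 0.8, "Tool Use": 0.5},
    weights={"Task Competence": 0.6, "Tool Use": 0.4},
)
```

Including the intermediate contributions in the payload is what lets a reader verify any explanation against the numbers it was given.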
Chart: BNP target vs. agent score across Task Competence, Tool Use, Safety, Compliance, and Autonomy
Evaluation Dimensions

Seven dimensions. One agent score.

AgentFit evaluates agents across seven behavioral dimensions, each weighted according to your Business Need Profile. A fintech team running compliance workflows weights Compliance differently than a team deploying a DevOps agent, and AgentFit adapts accordingly.

  • Task Competence: 82
  • Tool Use: 74
  • Safety & Alignment: 88
  • Compliance: 68
  • Autonomy: 55
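To show how the same dimension scores can yield different overall results under different profiles, the sketch below applies two weight sets to the five example scores listed above. Both weight profiles (`FINTECH`, `DEVOPS`) are invented for illustration, and a plain weighted average is assumed as the aggregation rule; the page does not spell out AgentFit's exact formula.

```python
# Dimension scores from the example list above (0 to 100 scale).
SCORES = {"Task Competence": 82, "Tool Use": 74, "Safety & Alignment": 88,
          "Compliance": 68, "Autonomy": 55}

# Hypothetical weight profiles; each must sum to 1.0.
FINTECH = {"Task Competence": 0.20, "Tool Use": 0.10, "Safety & Alignment": 0.25,
           "Compliance": 0.35, "Autonomy": 0.10}
DEVOPS = {"Task Competence": 0.30, "Tool Use": 0.30, "Safety & Alignment": 0.15,
          "Compliance": 0.05, "Autonomy": 0.20}

def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of dimension scores (assumed aggregation rule)."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    return sum(scores[name] * weights[name] for name in scores)

fintech_total = weighted_score(SCORES, FINTECH)
devops_total = weighted_score(SCORES, DEVOPS)
```

With these made-up weights the fintech profile lands at about 75.1 and the DevOps profile at about 74.4, even though the underlying dimension scores are identical.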
Explore the dimensions
Evaluation Audit Trail

Every evaluation. Logged, timestamped, reproducible.

The full evaluation record — from BNP definition to interpretation — in a reproducible, immutable log. Compare agent versions over time. Share results across teams. Export for enterprise governance without losing the calculation trail.

  • Every score, sub-metric, and weight contribution attributed to its source
  • Full calculation trail — no black-box aggregation, ever
  • Version your evaluations alongside your agent code
  • JSON and PDF export for compliance and governance teams
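A reproducible record of this kind can be approximated with a content digest plus a timestamp. The field names, the choice of SHA-256, and the function `make_audit_record` are assumptions for this sketch and do not describe AgentFit's actual export format.

```python
import hashlib
import json
from datetime import datetime, timezone

def make_audit_record(bnp_text: str, scores: dict[str, float]) -> dict:
    """Build a timestamped record whose digest changes if any input changes."""
    body = {
        "bnp_sha256": hashlib.sha256(bnp_text.encode("utf-8")).hexdigest(),
        "scores": scores,
    }
    # Canonical JSON (sorted keys, fixed separators) keeps the digest stable
    # across runs and machines for identical inputs.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return {
        **body,
        "digest": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
        "created_at": datetime.now(timezone.utc).isoformat(),
    }

record = make_audit_record("example profile text", {"Task Competence": 0.82})
```

The timestamp is deliberately kept outside the digest, so two runs with the same profile and scores can be matched by digest while still recording when each one happened.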
View on GitHub
LLM-Powered Interpretability

The gap between your BNP target and agent score is the signal.

After scoring, AgentFit packages the complete evaluation — every sub-metric, weight, and arithmetic step — into a structured prompt for your chosen LLM. What comes back are business-grounded explanations of exactly why the agent scored as it did, not a generic summary.

  • Explanations grounded in your exact calculation trail — arithmetically verifiable
  • Per-dimension summaries with identified strengths and weaknesses
  • Prioritized recommendations tied to your lowest-scoring dimensions
  • Supports 10+ LLM providers — OpenAI, Anthropic, Groq, Ollama, and more
See interpretability in action
BNP Target vs. Agent Score (example): 5/5 vs. 2/5 (gap), 4/5 vs. 4/5, 3/5 vs. 3/5, 5/5 vs. 3/5 (gap), 4/5 vs. 4/5. Gap detected: 2 dimensions underperforming BNP target.
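The gap check in the example panel comes down to a per-dimension comparison. The sketch below uses the five target and score pairs from that panel; the function name `find_gaps` is hypothetical, and the panel does not say which dimension each pair belongs to, so positions are used instead of names.

```python
def find_gaps(targets: list[int], scores: list[int]) -> list[int]:
    """Return the positions where the agent score falls below the BNP target."""
    return [i for i, (t, s) in enumerate(zip(targets, scores)) if s < t]

# Values read from the example panel, each out of 5.
bnp_targets = [5, 4, 3, 5, 4]
agent_scores = [2, 4, 3, 3, 4]

gaps = find_gaps(bnp_targets, agent_scores)
```

Here `gaps` holds two positions (the first and fourth pairs), which matches the panel's "2 dimensions underperforming" message.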
Integrations

Bring any agent.
Evaluate it the same way.

AgentFit works with any AI provider or custom agent through a universal protocol. No vendor lock-in — compare OpenAI, Anthropic, Google, or your own agent implementation side by side.

OpenAI (LLM Provider)
Anthropic (LLM Provider)
Google Gemini (LLM Provider)
Mistral (LLM Provider)
DeepSeek (LLM Provider)
Groq (Inference)
Together AI (Inference)
Ollama (Local)
LM Studio (Local)
vLLM (Self-hosted)
LangChain (Framework)
AutoGen (Framework)
Pricing

Open-source.
Enterprise-ready.

AgentFit is free and open-source for every team. Enterprise support is available for organizations running agents at scale.

Open Source

Self-hosted · Apache 2.0 · No usage caps

Free

Forever. No credit card required.

Download Open-Source (Beta)
All 7 evaluation dimensions
Business Need Profiles (BNPs)
LLM-powered interpretability
10+ LLM provider support
CLI, Python SDK, and REST API
Framework-agnostic universal protocol
Full evaluation audit trail
Apache 2.0 — use it, fork it, build on it
For teams at scale

Enterprise

Managed service · Dedicated support · Custom SLA

Custom

Pricing based on your usage and requirements.

Request Demo
Everything in Open Source
Managed cloud evaluation service
Dedicated account manager
SLA guarantee
Custom integrations
Compliance and governance exports
Priority support
Professional services

Not sure which fits your team? Book a 30-minute call and we will walk you through the right setup.

Start evaluating
your agents.

Free and open-source · Apache 2.0