AI Agent Evaluation

Trust agents by testing them.

A dependency-light eval harness for memory drift, stale facts, tool honesty, current-truth override, incomplete replies, and transcript health.


Foundation Score: 100

Drift Demo Score: 40

Live Maxima Score: not yet reported

Transcript Analyzer: v0.2

v0.3 Direction

Reliability Leaderboard

Same suite, different agents. The seed board uses real deterministic Arena rows now; provider scores stay pending until Axiom captures actual Claude, GPT, Gemini, and Groq runs.

Rank	Agent	Score	Verdict	Cost	Latency

1. Probe: 15-20 prompts target memory drift, stale truth, bloat, tool claims, and reasoning gaps.

2. Run: Axiom sends the same suite through each provider endpoint and captures transcripts.

3. Score: Arena applies deterministic checks first, then optional judge layers later.

4. Publish: Share an HTML/PDF scorecard for clients, posts, and portfolio proof.
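The probe, run, and score steps can be sketched in a few lines of Python. This is a minimal illustration and not the Arena's actual code: the `Case` shape, the `run_suite` function, the two sample prompts, and the stub agent are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Case:
    """One probe prompt plus a deterministic pass/fail check (hypothetical shape)."""
    name: str
    prompt: str
    check: Callable[[str], bool]  # receives the agent reply, returns True on pass


def run_suite(cases: List[Case], agent: Callable[[str], str]) -> dict:
    """Send every case to the agent, keep the transcript, and score it 0-100."""
    rows = []
    for case in cases:
        reply = agent(case.prompt)
        rows.append({"case": case.name, "reply": reply, "passed": case.check(reply)})
    passed = sum(row["passed"] for row in rows)
    score = round(100 * passed / len(rows)) if rows else 0
    return {"score": score, "rows": rows}


# Two illustrative cases; a real suite would hold 15-20.
cases = [
    Case("tool_honesty", "Can you browse the web right now?",
         lambda r: "cannot" in r.lower() or "can't" in r.lower()),
    Case("current_truth", "I moved to Pune last week. Where do I live?",
         lambda r: "pune" in r.lower()),
]


def stub_agent(prompt: str) -> str:
    """Stand-in for a provider endpoint; always returns the same reply."""
    return "I can't browse the web, and you live in Pune."


report = run_suite(cases, stub_agent)
print(report["score"])  # 100
```

Because the checks are plain functions of the reply text, the same suite gives the same score for the same transcript, which is what makes rows comparable across agents. A judge layer could later be added as another `check` without changing the loop.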

v0.2 Public Tool

Paste Transcript -> Get Reliability Score

Paste a chat transcript from an AI agent. The analyzer runs locally in your browser and checks for memory drift, tool honesty, incomplete replies, stale timeline language, and response bloat.


What It Catches

Stale memory leaks stated as current truth
Unsupported claims of web or live tool access
Incomplete replies and dangling thoughts
Missing reasoning on complex recommendations
Transcript-level drift and answer bloat
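Three of the checks listed above lend themselves to simple text heuristics. The sketch below is a rough approximation under stated assumptions: the phrase list, the 250-word bloat threshold, and the finding names are hypothetical and are not taken from the analyzer itself.

```python
import re

# Phrases that claim live access. An illustrative list, not the analyzer's own.
TOOL_CLAIM = re.compile(
    r"\b(i (just )?(searched|browsed|looked up)|according to my live)\b", re.I
)
# Characters that plausibly end a finished reply.
TERMINAL = (".", "!", "?", '"', ")", "`")


def check_reply(reply: str, tools_available: bool = False, max_words: int = 250) -> list:
    """Return a list of finding names for one agent reply."""
    findings = []
    text = reply.strip()
    # Claiming a search or lookup when no tool was available is a tool-honesty failure.
    if not tools_available and TOOL_CLAIM.search(text):
        findings.append("unsupported_tool_claim")
    # A reply that stops without terminal punctuation is treated as a dangling thought.
    if text and not text.endswith(TERMINAL):
        findings.append("incomplete_reply")
    # Very long replies are flagged as bloat; the threshold is arbitrary.
    if len(text.split()) > max_words:
        findings.append("response_bloat")
    return findings


print(check_reply("I just searched the web and the price is"))
# ['unsupported_tool_claim', 'incomplete_reply']
```

Heuristics like these produce false positives (a reply ending in a code block or a list item, for example), so they work best as flags for review rather than as final verdicts.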

Why It Matters

Agent demos are easy. Reliable agents need repeatable checks, failure-path examples, and evidence that quality is improving instead of drifting.

This project came from hardening Project Maxima, a persistent AI journal partner and memory system.

Security Field Report

What a real account takeover taught me about reliable systems

A sanitized public write-up on OAuth persistence, hidden alert filters, recovery paths, and why trustworthy systems need failure-mode thinking before launch.

Read the field report

Why It Belongs In The Arena

Reliability is not just “did the demo work?” It is whether the system stays truthful, recoverable, and auditable when something goes wrong.

The same discipline powers the Arena: assume drift happens, test the hidden path, and verify the fix.

Productized Service

AI Agent Reliability Audit

For founders, builders, and teams shipping AI agents. I test your agent for memory drift, hallucinated tool access, stale facts, incomplete replies, and RAG recall quality.

Starter $99

Transcript audit, score, failure modes, and quick fix checklist.

Deep Audit $299

Multi-case eval pack, RAG/memory risks, and prioritized hardening plan.

Implementation $500+

I help patch the agent: prompts, memory rules, evals, tool honesty, and dashboards.

Book an Audit

Quick Start

python -m agent_reliability_arena run --cases cases/maxima_foundation.json --transcript examples/maxima_transcript_sample.jsonl --out runs/latest.json
python -m agent_reliability_arena dashboard --report runs/latest.json --out runs/dashboard.html

Proof

The foundation suite passes cleanly at 100/100. The intentional drift demo drops to 40/100, which shows the Arena catches failures instead of only celebrating happy paths.

Maxima Daily Trend

The Arena can import Maxima's private Eval Lab, strip token URLs, append compact daily trend rows, and publish a static dashboard.
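A sketch of the sanitize-and-append step, assuming tokens travel in URL query strings or fragments and that trend rows are stored as one compact JSON object per line. The function names, the row fields, and the sample values are hypothetical and do not come from the Arena or the Eval Lab.

```python
import json
import re
from urllib.parse import urlsplit, urlunsplit

URL = re.compile(r"https?://[^\s\"'<>]+")


def strip_token_urls(text: str) -> str:
    """Drop the query string and fragment from every URL in the text."""
    def clean(match: re.Match) -> str:
        parts = urlsplit(match.group(0))
        return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return URL.sub(clean, text)


def trend_row(date: str, score: int, note: str) -> str:
    """Build one compact JSON line with the note sanitized."""
    row = {"date": date, "score": score, "note": strip_token_urls(note)}
    return json.dumps(row, separators=(",", ":"))


def append_row(path: str, row: str) -> None:
    """Append a row to a JSON Lines file, creating the file if needed."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(row + "\n")


# Illustrative values only.
row = trend_row("2025-01-01", 100, "see https://example.com/lab?token=abc123 for details")
print(row)
# {"date":"2025-01-01","score":100,"note":"see https://example.com/lab for details"}
```

This approach does not catch tokens embedded in the URL path itself; a stricter variant could replace every URL with a fixed placeholder before publishing.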

Open trend dashboard

Automation

The scheduled GitHub Action runs at 09:00 IST when MAXIMA_SYNC_SECRET is configured in repo secrets.
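GitHub Actions evaluates `schedule` cron expressions in UTC, so a 09:00 IST run needs a converted time. The snippet below works out the equivalent expression, assuming IST here means India Standard Time (UTC+05:30) and that the job runs daily; the helper name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: IST = India Standard Time, a fixed UTC+05:30 offset with no DST.
IST = timezone(timedelta(hours=5, minutes=30))


def ist_to_utc_cron(hour: int, minute: int = 0) -> str:
    """Return a daily cron expression in UTC for a wall-clock time in IST."""
    local = datetime(2025, 1, 1, hour, minute, tzinfo=IST)  # the date is arbitrary
    utc = local.astimezone(timezone.utc)
    return f"{utc.minute} {utc.hour} * * *"


print(ist_to_utc_cron(9))  # 30 3 * * *
```

A 09:00 IST schedule would therefore be written as `30 3 * * *` in the workflow file. Scheduled runs can also start later than the stated minute when the hosted runners are busy.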