VAKRA is a new executable benchmark testing AI agents across 8,000+ APIs and 62 domains, revealing systematic failure in multi-step enterprise workflows.
Researchers released VAKRA, an executable benchmark designed to stress-test AI agents in enterprise-like environments using over 8,000 locally hosted APIs backed by real databases across 62 domains. Unlike existing benchmarks, VAKRA evaluates compositional reasoning through full execution traces — requiring 3–7 step reasoning chains that combine structured API calls with unstructured document retrieval. The benchmark includes four distinct task types covering 2,077+ test instances, and current frontier models perform poorly across all of them. VAKRA is publicly accessible and runnable against real agent systems.
VAKRA is the first benchmark that tests what actually breaks agents in production: compositional reasoning across chained API calls, document retrieval, dialog context, and policy constraints — all within a live execution environment. Current models fail not on individual tool calls but on multi-hop sequences where errors compound. If you're building any agentic pipeline, this is the closest thing to a real stress test available today.
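The compounding-error claim can be made concrete with a back-of-the-envelope calculation (illustrative numbers, not results from VAKRA): if an agent succeeds at each individual step with probability p, a k-step chain where every step must succeed completes with probability p**k.

```python
# Illustrative arithmetic only -- these per-step rates are assumptions,
# not VAKRA measurements. A 95% per-step success rate still loses
# roughly 30% of 7-step chains.
def chain_success(p: float, k: int) -> float:
    """Probability that all k independent steps succeed."""
    return p ** k

for k in (3, 5, 7):
    print(f"per-step 95%, {k} steps -> {chain_success(0.95, k):.2f}")
```

This is why a model that looks strong on single tool-call evals can still fail most of VAKRA's 3–7 step tasks: the chain, not the individual call, is the unit of success.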
Run your current agent stack against VAKRA's SLOT-BIRD task suite this week — specifically the 1–12 tool-call chaining scenarios — and log where it first drops context or selects the wrong tool. Use the failure trace to identify whether the breakdown is in tool selection, argument passing, or state tracking.
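A minimal sketch of that logging-and-triage loop, assuming a hypothetical agent/environment interface (`next_action`, `execute`) and gold reference traces; VAKRA's actual harness and trace format may differ:

```python
import time

# Hypothetical failure taxonomy from the diagnosis step above; the Agent
# and env objects are assumed interfaces, not VAKRA's real API.
def run_with_trace(agent, task, max_steps=12):
    """Run one task, recording every tool call so the first
    breakdown can be localized afterwards."""
    trace = []
    state = task["initial_state"]
    for step in range(max_steps):
        action = agent.next_action(task["goal"], state)  # assumed API
        trace.append({"step": step, "tool": action["tool"],
                      "args": action["args"], "ts": time.time()})
        if action["tool"] == "finish":
            break
        state = task["env"].execute(action["tool"], action["args"])  # assumed API
    return trace

def first_failure(trace, expected):
    """Compare a logged trace against a gold trace; return the step
    index of the first divergence and a coarse failure label."""
    for got, want in zip(trace, expected):
        if got["tool"] != want["tool"]:
            return got["step"], "tool_selection"
        if got["args"] != want["args"]:
            return got["step"], "argument_passing"
    if len(trace) < len(expected):
        # Agent stopped or wandered off before completing the chain.
        return len(trace), "state_tracking"
    return None, None
```

Bucketing failures this way (wrong tool vs. wrong arguments vs. lost state) tells you whether to work on tool routing, schema grounding, or context management first.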
Navigate to the VAKRA benchmark environment via the link in the blog post