Figure: Exploit Coverage by Benchmark. Percentage of tasks exploitable without solving any task.

Terminal-Bench: 100% (89 tasks)
SWE-bench Verified: 100% (500 tasks)
SWE-bench Pro: 100% (731 tasks)
FieldWorkArena: 100% (890 tasks)
WebArena: 100% (812 tasks)
CAR-bench: 100% (hallu.)
GAIA: ~98% (165 tasks)
OSWorld: 73% (270/369 tasks)

Zero tasks solved. Zero capability required.