Exploit Coverage by Benchmark
Percentage of tasks exploitable without solving any task
Terminal-Bench: 100% (89 tasks)
SWE-bench Verified: 100% (500 tasks)
SWE-bench Pro: 100% (731 tasks)
FieldWorkArena: 100% (890 tasks)
WebArena: 100% (812 tasks)
CAR-bench: 100% (hallu.)
GAIA: ~98% (165 tasks)
OSWorld: 73% (270/369 tasks)
Zero tasks solved. Zero capability required.