Leaderboard
Per-model pass rates on V2 (130 newer everyday tasks across 63 platforms) and V1 (153 tasks across 144 platforms). Two-stage scoring: HTTP-request interception + LLM judge on the intercepted payload. Scoring details ↗
V2 Snapshot — 6 models
| Rank | Model | Harness | Intercepted | Reward | Pass / Total |
|---|---|---|---|---|---|
| 1 | claude-opus-4-7·partial | hermes | 54.7% | 13.3% | 10 / 75 |
| 2 | glm-5.1 | hermes | 48.5% | 18.5% | 24 / 130 |
| 3 | gpt-5.5·partial | hermes | 48.1% | 11.1% | 9 / 81 |
| 4 | deepseek-v4-pro | hermes | 43.8% | 10.0% | 13 / 130 |
| 5 | openrouter-owl-alpha | hermes | 14.6% | 4.6% | 6 / 130 |
| 6 | deepseek-v4-flash | hermes | 3.1% | 1.5% | 2 / 130 |
Intercepted (the headline sort key) is the fraction of tasks whose final HTTP request matched the per-task URL/method schema — Stage 1, deterministic, no judge. Reward additionally requires an LLM judge (default: deepseek/deepseek-v4-pro) to confirm the payload fulfilled the instruction — Stage 2, payload-correct. Rows are ranked by Intercepted, with Reward as the tiebreak. Rows marked ·partial attempted fewer than the full 130 V2 tasks; their displayed % is over attempted tasks only, so a partial 54.7% Intercepted (41/75) can rank above a complete 48.5% (63/130) despite covering fewer tasks — compare ·partial rows with care.
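The two-stage pipeline above can be sketched in a few lines. This is an illustrative mock, not the real harness: the schema fields, request dict keys, and the judge callable are all assumptions.

```python
# Hypothetical sketch of two-stage scoring; field names and the judge
# interface are assumptions, not ClawBench's actual code.
import re
from dataclasses import dataclass

@dataclass
class TaskSchema:
    method: str       # expected HTTP method, e.g. "POST"
    url_pattern: str  # regex the final request URL must match

def stage1_intercepted(schema: TaskSchema, request: dict) -> bool:
    """Stage 1: deterministic URL/method match — no judge involved."""
    return (request["method"] == schema.method
            and re.search(schema.url_pattern, request["url"]) is not None)

def stage2_reward(schema: TaskSchema, request: dict, judge) -> bool:
    """Stage 2: Reward requires Stage 1 plus an LLM judge (here, any
    callable returning bool) confirming the payload fulfilled the task."""
    return stage1_intercepted(schema, request) and judge(request["body"])

def rank(rows):
    """Sort rows by Intercepted %, with Reward % as tiebreak."""
    return sorted(rows, key=lambda r: (r["intercepted"], r["reward"]),
                  reverse=True)

rows = [
    {"model": "glm-5.1", "intercepted": 48.5, "reward": 18.5},
    {"model": "claude-opus-4-7", "intercepted": 54.7, "reward": 13.3},
]
print([r["model"] for r in rank(rows)])  # claude-opus-4-7 sorts first
```

The tuple sort key mirrors the leaderboard's ordering rule: Intercepted dominates, Reward only breaks ties.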
Snapshot generated 2026-05-12. Scoring details: eval/scoring.md ↗.
Fresh runs + V1 results: interactive HF Space ↗.
New here? About ClawBench — how it works ↗.
Browser-agent execution traces — curated, open for download, Apache-2.0
What will you do with them?
Sample the corpus before you download
Browse the 283 task definitions these traces capture — searchable, filterable, no download. Each row is a prompt that one of the 13 frontier models attempted.
Powered by the Hugging Face Datasets Viewer · Open full dataset
A real turn from this corpus
Excerpted from agent-messages.jsonl of one V2 run (z-ai/glm-5 · task 001 · Uber Eats / Pad Thai). Every trace bundle has hundreds of these, time-aligned with the recording, actions, and HTTP requests.
Inside every trace — 6 time-synchronized signals per run, captured by a multi-track recorder
Every signal is timestamped against the same clock — click frame 1872 of recording.mp4 and you can find the exact actions.jsonl event, the LLM turn that triggered it, and the HTTP requests it fired.
Cross-org mirrors: NAIL-Group · TIGER-Lab · Apache-2.0 · Bundle format: one tar.gz per run, JSONL files within.
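The shared-clock alignment can be sketched as a nearest-neighbor lookup: convert a frame index to seconds, then binary-search the sorted event timestamps. The `ts` field name and 30 fps are assumptions for illustration; the actual bundle schema may differ.

```python
# Illustrative sketch (not the official tooling): map a recording.mp4
# frame index to the nearest actions.jsonl event on the shared clock.
# The "ts" key and FPS value are assumptions about the bundle format.
import bisect
import json

FPS = 30.0

def frame_to_seconds(frame: int, fps: float = FPS) -> float:
    return frame / fps

def nearest_event(events: list, t: float) -> int:
    """events: dicts sorted by 'ts' (seconds). Returns the index of the
    event whose timestamp is closest to t."""
    ts = [e["ts"] for e in events]
    i = bisect.bisect_left(ts, t)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(events)]
    return min(candidates, key=lambda j: abs(events[j]["ts"] - t))

# Simulated actions.jsonl content: one JSON object per line.
raw = '{"ts": 62.1, "action": "click"}\n{"ts": 62.5, "action": "type"}'
events = [json.loads(line) for line in raw.splitlines()]

t = frame_to_seconds(1872)    # frame 1872 at 30 fps = 62.4 s
idx = nearest_event(events, t)
print(events[idx]["action"])  # prints "type" (62.5 s is nearest)
```

The same lookup works against llm-turns or HTTP-request logs, since every signal shares one clock.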
Cite this benchmark
Using ClawBench in your research? Please cite the arXiv paper:
@misc{zhang2026clawbench,
  title         = {ClawBench: Can AI Agents Complete Everyday Online Tasks?},
  author        = {Yuxuan Zhang and Yubo Wang and Yipeng Zhu and Penghui Du and Junwen Miao and Xuan Lu and Wendong Xu and Yunzhuo Hao and Songcheng Cai and Xiaochen Wang and Huaisong Zhang and Xian Wu and Yi Lu and Minyi Lei and Kai Zou and Huifeng Yin and Ping Nie and Liang Chen and Dongfu Jiang and Wenhu Chen and Kelsey R. Allen},
  year          = {2026},
  eprint        = {2604.08523},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI},
  url           = {https://arxiv.org/abs/2604.08523}
}