# ClawBench

> ClawBench is a comprehensive benchmark for evaluating AI browser agents on 153 real-world everyday online tasks across 144 live websites and 15 life categories. It captures 5 layers of behavioral data (session replay, screenshots, HTTP traffic, agent reasoning traces, and browser actions), includes human ground truth for every task, and scores with an agentic evaluator that provides step-level traceable diagnostics. The best-performing model (Claude Sonnet 4.6) achieves only a 33.3% success rate, revealing a large gap between current AI agents and human-level web task completion.

## Links

- [Paper](https://arxiv.org/abs/2604.08523): ClawBench: Can AI Agents Complete Everyday Online Tasks? (arXiv:2604.08523)
- [PDF](https://arxiv.org/pdf/2604.08523): Full paper PDF
- [Website](https://claw-bench.com): Interactive leaderboard, task browser, trace viewer, and agent demo gallery
- [GitHub](https://github.com/reacher-z/ClawBench): Source code — framework, evaluators, test driver, and Chrome extension
- [Dataset](https://huggingface.co/datasets/NAIL-Group/ClawBench): 153 tasks in Parquet format on Hugging Face
- [Hugging Face Papers](https://huggingface.co/papers/2604.08523): Community discussion page
- [PyPI](https://pypi.org/project/clawbench-eval/): Install with `pip install clawbench-eval`

## Interactive Pages

- [Leaderboard](https://claw-bench.com/#results): Model rankings plus a per-task × per-model heatmap
- [Difficulty](https://claw-bench.com/#difficulty): 153 tasks sorted by cross-model pass rate, filterable by category
- [Categories](https://claw-bench.com/#categories): 8 meta-categories (Daily, Work, Dev, Social, Academic, Travel, Pets, Finance)
- [Task Detail](https://claw-bench.com/#task/1): Per-task page with the full instruction and per-model pass/fail results
- [Model Profile](https://claw-bench.com/#model/claude-opus-4-6): Per-model page with category strengths, unique solves, and failures
- [Traces](https://claw-bench.com/#traces): Step-by-step agent execution with
screenshots and HTTP requests
- [Gallery](https://claw-bench.com/#gallery): Video recordings of successful agent runs at 16× speed
- [Compare](https://claw-bench.com/#compare): How ClawBench compares to WebArena, REAL Bench, and Claw-Eval
- [Contribute](https://claw-bench.com/#contribute): Community task proposal form (Low / Mid / High tier)

## Key Facts

- 153 tasks across 144 live websites in 15 life categories
- Life categories: food delivery, travel booking, job applications, shopping, housing search, email and calendar management, academic research, software development, learning platforms, office tasks, finance, entertainment, home services, pets, and government services
- 5 layers of behavioral data: session replay (rrweb), screenshots, HTTP traffic, agent reasoning traces, and browser actions
- Human ground truth recorded for every task
- Agentic evaluator with VLM, LLM, and Human-Agent evaluation modes, providing step-level traceable diagnostics
- Request interceptor prevents irreversible real-world actions (payments, form submissions) during evaluation
- 7 models evaluated: Claude Sonnet 4.6, GLM-5, Gemini 3 Flash, Claude Haiku 4.5, GPT-5.4, Kimi K2.5, and Gemini 3.1 Flash Lite
- Apache 2.0 license
- COLM 2026 submission
- 21 authors from 11 institutions (UBC, Vector Institute, Etude AI, CMU, U Waterloo, SJTU, UniPat AI, ZJU, HKUST, Tsinghua, Netmind.ai)

## Leaderboard (Overall Success Rate %)

| Model | Provider | Score |
|-------|----------|-------|
| Claude Sonnet 4.6 | Anthropic | 33.3% |
| GLM-5 | Zhipu AI | 24.2% |
| Gemini 3 Flash | Google | 19.0% |
| Claude Haiku 4.5 | Anthropic | 18.3% |
| Kimi K2.5 | Moonshot AI | 15.0% |
| GPT-5.4 | OpenAI | 6.5% |
| Gemini 3.1 Flash Lite | Google | 3.3% |

## What Makes ClawBench Different

- **Real websites, not simulations**: Tasks run on 144 actual live platforms (Airbnb, Uber Eats, Coursera, Indeed, etc.), not synthetic environments
- **Everyday tasks**: Booking flights, ordering groceries, applying for jobs,
scheduling appointments — tasks people actually do online
- **Safe evaluation**: A request interceptor blocks the final HTTP request before irreversible actions, allowing evaluation on production websites without side effects
- **Rich behavioral data**: 5 complementary data layers enable fine-grained analysis of where and why agents fail
- **Human baseline**: Every task has a human-completed ground-truth recording for direct comparison

## Citation

```bibtex
@article{zhang2026clawbench,
  title={ClawBench: Can AI Agents Complete Everyday Online Tasks?},
  author={Zhang, Yuxuan and Wang, Yubo and Zhu, Yipeng and Du, Penghui and Miao, Junwen and Lu, Xuan and Xu, Wendong and Hao, Yunzhuo and Cai, Songcheng and Wang, Xiaochen and Zhang, Huaisong and Wu, Xian and Lu, Yi and Lei, Minyi and Zou, Kai and Yin, Huifeng and Nie, Ping and Chen, Liang and Jiang, Dongfu and Chen, Wenhu and Allen, Kelsey R.},
  journal={arXiv preprint arXiv:2604.08523},
  year={2026}
}
```
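The safe-evaluation idea described above — intercepting the final HTTP request of an irreversible flow while letting ordinary browsing through — can be sketched as a simple predicate over outgoing requests. This is a minimal illustration, not ClawBench's actual interceptor: the URL patterns, function name, and method deny-list below are assumptions for the sake of the example.

```python
import re

# Hypothetical deny-list of URL patterns for irreversible actions
# (payments, order and form submissions). Illustrative only -- not
# the pattern set ClawBench actually uses.
IRREVERSIBLE_PATTERNS = [
    re.compile(r"/checkout|/payments?|/purchase", re.IGNORECASE),
    re.compile(r"/orders?/submit|/applications?/submit", re.IGNORECASE),
]

def should_block(method: str, url: str) -> bool:
    """Return True if a request looks like the final, state-changing
    step of an irreversible real-world action."""
    # Read-only methods never change server state, so they pass through.
    if method.upper() not in {"POST", "PUT", "PATCH", "DELETE"}:
        return False
    return any(p.search(url) for p in IRREVERSIBLE_PATTERNS)

# The final POST of a checkout flow is blocked; browsing is not.
print(should_block("POST", "https://shop.example.com/checkout/confirm"))  # True
print(should_block("GET", "https://shop.example.com/products/42"))        # False
```

In a real harness this predicate would sit inside the browser's network-interception hook (e.g. a Chrome extension's `webRequest` listener), aborting matched requests so the agent can be scored on everything up to the point of no return.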