Methodology
Data Collection, Ranking Rules & Model Classification
Data Sources
All benchmark results are collected from published papers and official repositories. We do not re-run experiments.
Ranking Rules
Models are ranked by their primary metric on each benchmark. For LIBERO, Meta-World and RoboCasa-GR1-Tabletop, this is the Average Success Rate. For CALVIN, this is the Average Length (Avg. Len.) on the ABC→D setting. For RoboChallenge, this is the Score. For RoboTwin 2.0, this is the Hard Success Rate.
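As a rough illustration of the rule above, the sketch below maps each benchmark to its primary metric and sorts models on it. The dictionary keys, the record layout, the `rank_models` helper, and the sample values are all hypothetical and not the leaderboard's actual code or data. It also assumes that a higher value is better for every primary metric.

```python
# Hypothetical mapping from benchmark name to its primary ranking metric.
PRIMARY_METRIC = {
    "LIBERO": "Average Success Rate",
    "Meta-World": "Average Success Rate",
    "RoboCasa-GR1-Tabletop": "Average Success Rate",
    "CALVIN": "Avg. Len. (ABC→D)",
    "RoboChallenge": "Score",
    "RoboTwin 2.0": "Hard Success Rate",
}


def rank_models(benchmark, results):
    """Return the results for one benchmark, best first.

    `results` is a list of dicts, one per model, keyed by metric name.
    Models that do not report the primary metric are left out.
    Assumption: higher is better for every primary metric.
    """
    metric = PRIMARY_METRIC[benchmark]
    scored = [r for r in results if r.get(metric) is not None]
    return sorted(scored, key=lambda r: r[metric], reverse=True)


# Illustrative placeholder values only, not real leaderboard results.
example = [
    {"model": "model_a", "Hard Success Rate": 31.0},
    {"model": "model_b", "Hard Success Rate": 42.5},
    {"model": "model_c"},  # primary metric not reported
]
for position, row in enumerate(rank_models("RoboTwin 2.0", example), start=1):
    print(position, row["model"], row["Hard Success Rate"])
```

Keeping the metric choice in a single lookup table means each benchmark is always sorted on one agreed column, which matches the per-benchmark ranking described here.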
Known Limitations
Results on different benchmarks are not directly comparable. Even on the same benchmark, different papers may use slightly different evaluation protocols.
Model Classification
We classify models into two categories based on their open-source status:
Open-Source Models
Models with publicly available code, marked with an "Open Source" badge. These models provide the highest level of reproducibility and transparency.
Other Models
Models without the "Open Source" badge include: (1) models whose code repository we could not find, and (2) models whose code was still marked "Coming Soon" at the data collection deadline. These models are hidden by default but can be shown using the "Include All Models" toggle.
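The default visibility rule can be sketched as a simple filter. The `open_source` field, the `visible_models` helper, and the `include_all` flag (standing in for the "Include All Models" toggle) are hypothetical names chosen for illustration, and the site's real implementation may differ.

```python
def visible_models(models, include_all=False):
    """Return the models to display.

    By default only models flagged as open source are shown.
    Passing include_all=True mimics switching the toggle on.
    """
    if include_all:
        return list(models)
    return [m for m in models if m.get("open_source", False)]


# Illustrative placeholder entries only.
catalog = [
    {"model": "model_a", "open_source": True},
    {"model": "model_b", "open_source": False},  # e.g. code not found or "Coming Soon"
]
print([m["model"] for m in visible_models(catalog)])
print([m["model"] for m in visible_models(catalog, include_all=True)])
```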
Disclaimer
Cross-benchmark comparisons should be avoided. Each benchmark has its own evaluation protocol and metrics.
Supported Benchmarks
⭐ Support This Project
If you find this leaderboard helpful for your research, please consider giving us a star on GitHub!
Contact Us
Found an error or want to submit your model? Reach out via a GitHub issue, email, or our WeChat group!