Methodology
Data Collection, Ranking Rules & Model Classification
Data Sources
All benchmark results are collected from published papers and official repositories. We do not re-run experiments.
Ranking Rules
Models are ranked by their primary metric on each benchmark. For LIBERO, LIBERO Plus, Meta-World, and RoboCasa-GR1-Tabletop, this is the Average Success Rate. For CALVIN, this is the Average Length (Avg. Len.) on the ABC→D setting. For RoboChallenge, this is the Score.
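For illustration, the ranking rule amounts to a single descending sort per benchmark. The sketch below is ours, not the leaderboard's actual code; the data layout and the names `BENCHMARK_METRICS` and `rank_models` are assumptions.

```python
# Minimal sketch of the per-benchmark ranking rule described above.
# Field names and data layout are illustrative assumptions.
BENCHMARK_METRICS = {
    "LIBERO": "avg_success_rate",                 # %
    "LIBERO Plus": "avg_success_rate",            # %
    "Meta-World": "avg_success_rate",             # %
    "RoboCasa-GR1-Tabletop": "avg_success_rate",  # %
    "CALVIN": "avg_length",                       # Avg. Len. on ABC→D
    "RoboChallenge": "score",
}

def rank_models(entries: list[dict], benchmark: str) -> list[dict]:
    """Sort one benchmark's entries by its primary metric, best first."""
    metric = BENCHMARK_METRICS[benchmark]
    scored = [e for e in entries if e.get(metric) is not None]
    return sorted(scored, key=lambda e: e[metric], reverse=True)
```

All of these metrics are higher-is-better (success rate, average length, score), so one descending sort covers every supported benchmark.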
Known Limitations
Results from different benchmarks are not directly comparable, and even on the same benchmark, different papers may use slightly different evaluation protocols.
Model Classification
We classify models into two categories based on their open-source status:
Open-Source Models
Models with publicly available code, marked with an "Open Source" badge. These models provide the highest level of reproducibility and transparency.
Other Models
Models without the "Open Source" badge include: (1) models whose code repository we could not find, and (2) models that were in "Coming Soon" status before the data collection deadline. These models are hidden by default but can be shown using the "Include All Models" toggle.
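As a rough sketch of this default filtering behavior (the field and function names here are hypothetical, not the leaderboard's actual code):

```python
def visible_models(models: list[dict], include_all: bool = False) -> list[dict]:
    """Open-source models are shown by default; the 'Include All Models'
    toggle reveals the rest (no repository found, or 'Coming Soon')."""
    if include_all:
        return models
    return [m for m in models if m.get("is_open_source")]
```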
Disclaimer
Cross-benchmark comparisons should be avoided. Each benchmark has its own evaluation protocol and metrics.
Supported Benchmarks
| Benchmark | Primary Metric | Description |
|---|---|---|
| LIBERO | Average Success Rate (%) | 130 language-conditioned tasks |
| Meta-World | Average Success Rate (%) | 50 robotic manipulation tasks |
| CALVIN | Average Length (mainly ABC→D) | Long-horizon manipulation |
| LIBERO Plus | Average Success Rate (%) | Extended LIBERO with 6 categories |
| RoboChallenge | Score | Real-world robotic manipulation |
| RoboCasa-GR1-Tabletop | Average Success Rate (%) | Tabletop manipulation tasks |
⭐ Support This Project
If you find this leaderboard helpful for your research, please consider giving us a star on GitHub!
Contact Us
Found errors or want to submit your model? Reach out via a GitHub issue, email, or the WeChat group!