Methodology

Data Collection, Ranking Rules & Model Classification

Data Sources

All benchmark results are collected from the original published papers and official code repositories; we do not re-run any experiments ourselves.

Ranking Rules

Models are ranked by their primary metric on each benchmark. For LIBERO, Meta-World, and RoboCasa-GR1-Tabletop, this is the Average Success Rate; for CALVIN, it is the Average Length (Avg. Len.) on the ABC→D setting; and for RoboChallenge, it is the Score.
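
The sketch below illustrates this ranking rule. It is a minimal example, not the leaderboard's actual code: the field names ("avg_success_rate", "avg_len", "score") and the entry structure are assumptions made for illustration.

    # Primary metric per benchmark; higher is better in all cases.
    PRIMARY_METRIC = {
        "LIBERO": "avg_success_rate",
        "Meta-World": "avg_success_rate",
        "RoboCasa-GR1-Tabletop": "avg_success_rate",
        "CALVIN (ABC->D)": "avg_len",
        "RoboChallenge": "score",
    }

    def rank_models(entries, benchmark):
        """Sort leaderboard entries for one benchmark by its primary metric."""
        metric = PRIMARY_METRIC[benchmark]
        return sorted(entries, key=lambda e: e[metric], reverse=True)

    # Example usage with made-up numbers:
    entries = [
        {"model": "A", "avg_success_rate": 0.91},
        {"model": "B", "avg_success_rate": 0.87},
    ]
    print(rank_models(entries, "LIBERO"))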

Known Limitations

Results across different benchmarks are not directly comparable. Different papers may use slightly different evaluation protocols.

Model Classification

We classify models into two categories based on their open-source status:

Open-Source Models (Default)

Models with publicly available code are marked with an "Open Source" badge and are shown by default. These models provide the highest level of reproducibility and transparency.

Other Models (Optional)

Models without the "Open Source" badge fall into two groups: (1) models whose code repository we could not find, and (2) models whose code was still listed as "Coming Soon" at the data collection deadline. These models are hidden by default but can be shown using the "Include All Models" toggle.
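
A minimal sketch of how this toggle behaves, assuming each entry carries an illustrative "open_source" flag (not the leaderboard's actual schema):

    def visible_models(entries, include_all=False):
        """Open-source models are shown by default; others only when include_all is True."""
        if include_all:
            return entries
        return [e for e in entries if e.get("open_source", False)]

    # Example: only model "A" is shown unless the toggle is enabled.
    entries = [
        {"model": "A", "open_source": True},
        {"model": "B", "open_source": False},  # e.g. no repository found or "Coming Soon"
    ]
    print(visible_models(entries))                    # model A only
    print(visible_models(entries, include_all=True))  # both models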

Data Notice

  • If you find any errors or omissions, please let us know by creating an issue on GitHub, contacting us via email (business@evomind-tech.com), or joining our WeChat group!

Disclaimer

Cross-benchmark comparisons should be avoided, since each benchmark has its own evaluation protocol and metrics.

Support This Project

If you find this leaderboard helpful for your research, please consider giving us a star on GitHub!
Contact Us

Found errors or want to submit your model? Reach out via a GitHub issue, email, or our WeChat group!