PAIWorld
A 3D-Consistent World Foundation Model for Robotic Manipulation
Abstract
World foundation models have emerged as powerful simulators of physical environments, yet they predominantly operate in a single-view setting and lack the multi-view 3D consistency required by robotic manipulation. Robotic systems rely on egocentric, eye-to-hand, and wrist-mounted cameras to capture complementary viewpoints for policy learning, but current multi-view world models often concatenate view tokens without explicit geometric reasoning, leading to cross-view object drift, depth inconsistency, and texture misalignment. PAIWorld addresses these failures through two coupled technical pillars: an inter-view communication pathway and a geometric learning signal. Geometry-Aware Cross-View Attention and Geometric Rotary Position Embedding establish geometry-aware feature exchange across views, while Latent 3D-REPA distills 3D-aware features from frozen foundation models to keep exchanged content 3D-consistent. Built on a DiT-based world foundation model, PAIWorld ranks 1st on the WorldArena leaderboard and 2nd on the AgiBot-Challenge2026 leaderboard, while supporting downstream applications including model-based planning, world action models, and multi-view policy post-training.
Overview
Method
PAIWorld injects 3D consistency into flow-matching world foundation models through two coupled pillars: an inter-view communication pathway and a geometric training objective.
Two technical pillars realized by three lightweight components.
Geometry-Aware Cross-View Attention
Cross-view attention blocks open an explicit communication pathway across viewpoints, letting each view exchange features with the others during world-model generation.
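The idea of an explicit cross-view pathway can be illustrated with a minimal NumPy sketch. It assumes the simplest possible form: tokens from all views are flattened into one sequence and attend to each other with single-head scaled dot-product attention. The function name `cross_view_attention`, the shapes, and the random projection weights are hypothetical and are not taken from the PAIWorld implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_view_attention(tokens, w_q, w_k, w_v):
    """Single-head attention over the tokens of all views jointly.

    tokens: (V, N, D) array with V views, N tokens per view, D channels.
    Returns the updated tokens (V, N, D_out) and the attention matrix (V*N, V*N).
    Illustrative assumption only, not the authors' block design.
    """
    num_views, num_tokens, dim = tokens.shape
    flat = tokens.reshape(num_views * num_tokens, dim)  # merge views into one sequence
    q, k, v = flat @ w_q, flat @ w_k, flat @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])             # every token scores every view
    attn = softmax(scores, axis=-1)
    out = attn @ v
    return out.reshape(num_views, num_tokens, -1), attn

rng = np.random.default_rng(0)
V, N, D = 3, 4, 8
tokens = rng.normal(size=(V, N, D))
w_q, w_k, w_v = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))

out, attn = cross_view_attention(tokens, w_q, w_k, w_v)

# Perturbing view 1 changes the output of view 0: information flows across views.
perturbed = tokens.copy()
perturbed[1] += 1.0
out_perturbed, _ = cross_view_attention(perturbed, w_q, w_k, w_v)
```

In this sketch nothing steers attention toward geometrically corresponding tokens; the pathway only makes the exchange possible.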
Geometric Rotary Position Embedding
Geo-RoPE encodes camera ray directions and extrinsic poses into attention, biasing the inter-view pathway toward geometrically corresponding tokens.
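One way to picture a ray-driven rotary embedding is sketched below. It is an assumption-laden illustration, not the published Geo-RoPE formulation: per-pixel ray directions are computed from a pinhole intrinsic matrix `K` and a camera-to-world rotation `R_c2w`, and the three world-frame ray components then drive the rotation angles of consecutive feature pairs. The helper names, the frequency schedule, and the choice to ignore camera translation are all hypothetical.

```python
import numpy as np

def world_ray_directions(K, R_c2w, pixels):
    """Unit ray directions in the world frame for (u, v) pixel coordinates (N, 2)."""
    homog = np.concatenate([pixels, np.ones((len(pixels), 1))], axis=1)  # (N, 3)
    d_cam = homog @ np.linalg.inv(K).T    # back-project pixels to camera-frame rays
    d_world = d_cam @ R_c2w.T             # rotate rays into the world frame
    return d_world / np.linalg.norm(d_world, axis=1, keepdims=True)

def rotary_from_rays(x, rays):
    """Rotate consecutive feature pairs of x (N, D) by angles driven by rays (N, 3).

    Hypothetical schedule: D/2 pairs are split evenly across the x, y, z ray axes.
    """
    n, d = x.shape
    assert d % 6 == 0, "D must split into pairs shared evenly by the 3 ray axes"
    pairs_per_axis = d // 6
    freqs = np.pi * 2.0 ** np.arange(pairs_per_axis)               # (P,)
    angles = (rays[:, :, None] * freqs[None, None, :]).reshape(n, d // 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
theta = np.deg2rad(30.0)
R_a = np.eye(3)                                                    # view A
R_b = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                [0.0, 1.0, 0.0],
                [-np.sin(theta), 0.0, np.cos(theta)]])             # view B, yawed 30 degrees

pix_a = np.array([[32.0, 32.0]])                                   # principal point of A
ray_a = world_ray_directions(K, R_a, pix_a)

# Find the pixel in view B that looks along the same world direction.
d_cam_b = ray_a @ R_b                                              # R_b^T applied to the ray
proj = d_cam_b @ K.T
pix_b = proj[:, :2] / proj[:, 2:]
ray_b = world_ray_directions(K, R_b, pix_b)

rng = np.random.default_rng(0)
q, k = rng.normal(size=(1, 12)), rng.normal(size=(1, 12))
score_plain = (q @ k.T).item()
score_same_ray = (rotary_from_rays(q, ray_a) @ rotary_from_rays(k, ray_b).T).item()
```

Because a rotary embedding makes the query-key score depend only on the difference between the two rotation angles, tokens in different views that share a world-frame ray keep their unrotated score in this sketch, while tokens with different rays are modulated by their angular offset.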
Latent 3D-REPA
Intermediate DiT features are aligned with 3D-aware features from Depth Anything 3, supplying the geometric objective that makes exchanged content 3D-consistent.
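A representation-alignment objective of this kind is commonly written as a cosine-similarity loss between projected model features and frozen target features. The sketch below assumes that form with a single linear projection; the function name `repa_alignment_loss`, the projection `w_proj`, and the feature shapes are hypothetical, and random arrays stand in for the DiT and Depth Anything 3 features.

```python
import numpy as np

def repa_alignment_loss(dit_feats, target_feats, w_proj):
    """Mean (1 - cosine similarity) between projected DiT tokens and frozen targets.

    dit_feats:    (N, D_dit) intermediate features of the generator (trainable side).
    target_feats: (N, D_tgt) features of a frozen 3D-aware encoder (no gradient).
    w_proj:       (D_dit, D_tgt) linear projection, an assumption of this sketch.
    """
    projected = dit_feats @ w_proj
    p = projected / np.linalg.norm(projected, axis=-1, keepdims=True)
    t = target_feats / np.linalg.norm(target_feats, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(p * t, axis=-1)))

rng = np.random.default_rng(0)
target = rng.normal(size=(16, 32))     # stand-in for frozen 3D-aware features
dit = rng.normal(size=(16, 64))        # stand-in for intermediate DiT features
w_proj = rng.normal(size=(64, 32)) / np.sqrt(64)

loss_random = repa_alignment_loss(dit, target, w_proj)           # unaligned features
loss_aligned = repa_alignment_loss(target, target, np.eye(32))   # perfectly aligned
```

The loss is 0 when every projected token points in the same direction as its target and grows toward 2 as they become opposed, so minimizing it pulls the generator's features toward the 3D-aware ones.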
Core claim
Communication and geometry must work together.
Cross-view attention alone provides a path for information flow but does not guarantee physically correct structure. A 3D prior alone improves each view independently but cannot propagate constraints across cameras. PAIWorld combines the two, with Geo-RoPE providing the shared geometric reference frame.
Results
Stronger generation quality and cross-view consistency across world-model leaderboards and benchmarks.
WorldArena Leaderboard
PAIWorld ranks first on the WorldArena leaderboard by EWMScore_P (72.31, ahead of UNIS at 72.16).
| Method | EWMScore_P | Rank | Visual Quality | Motion Quality | Content Consistency | Physics Adherence | 3D Accuracy | Controllability | Aesthetic Quality | Background Consistency | Depth Accuracy | Dynamic Degree | Flow Score | Image Quality | Instruction Following | Interaction Quality | JEPA Similarity | Motion Smoothness | Perspectivity | Photometric Consistency | Semantic Alignment | Subject Consistency | Trajectory Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PAIWorld | 72.31 | 1 | 63.04 | 80.45 | 57.85 | 61.66 | 91.51 | 87.16 | 40.39 | 89.13 | 86.94 | 68.76 | 77.18 | 51.88 | 85.22 | 77.32 | 96.84 | 95.41 | 96.08 | 3.01 | 89.09 | 81.40 | 45.99 |
| UNIS | 72.16 | 2 | 60.85 | 81.60 | 56.44 | 61.56 | 91.16 | 90.19 | 40.94 | 86.88 | 83.91 | 68.41 | 83.00 | 54.49 | 91.40 | 84.52 | 87.12 | 93.38 | 98.42 | 2.49 | 88.98 | 79.94 | 38.59 |
| BWM-Fast | 72.15 | 3 | 62.79 | 78.79 | 58.30 | 61.18 | 91.53 | 88.58 | 39.94 | 90.27 | 86.76 | 69.14 | 73.14 | 50.62 | 88.02 | 78.02 | 97.81 | 94.10 | 96.30 | 3.08 | 89.15 | 81.54 | 44.35 |
| FlowWAM-FiveAges | 72.00 | 4 | 63.43 | 79.45 | 57.43 | 59.80 | 91.60 | 88.16 | 40.57 | 88.55 | 86.18 | 69.00 | 74.79 | 55.70 | 87.46 | 78.48 | 94.02 | 94.55 | 97.02 | 3.09 | 88.87 | 80.65 | 41.12 |
| EvoPhysWorld | 71.73 | 5 | 60.07 | 84.15 | 58.04 | 59.74 | 89.20 | 85.62 | 39.34 | 88.46 | 82.37 | 79.61 | 77.27 | 58.73 | 82.84 | 76.00 | 82.15 | 95.58 | 96.02 | 3.59 | 88.39 | 82.07 | 43.49 |
| Inspatio-Curious | 71.60 | 6 | 62.39 | 73.32 | 58.37 | 65.78 | 92.57 | 87.52 | 40.66 | 90.24 | 89.71 | 61.18 | 66.43 | 50.34 | 85.54 | 76.46 | 96.16 | 92.35 | 95.44 | 3.25 | 89.49 | 81.61 | 55.11 |
| BetaBWM | 71.22 | 7 | 62.85 | 68.83 | 61.36 | 63.78 | 92.04 | 88.78 | 40.16 | 90.78 | 87.41 | 58.21 | 59.99 | 51.00 | 88.48 | 78.24 | 97.39 | 88.29 | 96.68 | 10.55 | 89.08 | 82.75 | 49.31 |
| MAI | 70.96 | 8 | 63.42 | 64.05 | 59.10 | 67.35 | 94.42 | 90.56 | 40.16 | 88.82 | 90.18 | 54.32 | 53.18 | 60.46 | 92.36 | 80.32 | 89.64 | 84.66 | 98.66 | 6.88 | 88.75 | 81.60 | 54.38 |
| SynapX | 70.93 | 9 | 63.21 | 76.39 | 57.51 | 57.69 | 90.08 | 88.54 | 39.77 | 88.72 | 85.06 | 65.97 | 69.14 | 53.37 | 87.80 | 75.74 | 96.50 | 94.06 | 95.10 | 3.01 | 89.28 | 80.79 | 39.65 |
| WAI | 70.89 | 10 | 63.35 | 63.90 | 59.11 | 67.24 | 94.44 | 90.44 | 40.10 | 88.85 | 90.27 | 54.18 | 52.92 | 60.40 | 92.18 | 80.04 | 89.54 | 84.60 | 98.60 | 6.85 | 88.70 | 81.63 | 54.45 |
AgiBot-Challenge2026
PAIWorld ranks second overall by EWMScore (0.8245, behind NeoVerse-ABot at 0.829) and achieves the best scene consistency (0.9041) among the teams that report it.
| Team | EWMScore | PSNR | Scene Consistency | nDTW |
|---|---|---|---|---|
| Direction | Higher | Higher | Higher | Higher |
| NeoVerse-ABot | 0.829 | 0.6246 | 0.8974 | 0.9651 |
| Loop | 0.8241 | 0.6207 | 0.9024 | 0.9492 |
| Wild Path | 0.8232 | - | - | - |
| VIPL-GENUN | 0.8195 | - | - | - |
| PAIWorld | 0.8245 | 0.6161 | 0.9041 | 0.9531 |
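The table rows are not sorted by score, so the placement can be checked with a few lines of Python. The values are copied from the table above; teams without a reported scene-consistency score are left out of that comparison.

```python
# EWMScore per team, copied from the AgiBot-Challenge2026 table.
ewm_score = {
    "NeoVerse-ABot": 0.829,
    "Loop": 0.8241,
    "Wild Path": 0.8232,
    "VIPL-GENUN": 0.8195,
    "PAIWorld": 0.8245,
}
# Scene consistency, only for teams that report it.
scene_consistency = {
    "NeoVerse-ABot": 0.8974,
    "Loop": 0.9024,
    "PAIWorld": 0.9041,
}

ranking = sorted(ewm_score, key=ewm_score.get, reverse=True)
paiworld_rank = ranking.index("PAIWorld") + 1
best_scene = max(scene_consistency, key=scene_consistency.get)

for place, team in enumerate(ranking, start=1):
    print(place, team, ewm_score[team])
```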
AgiBot-World Model
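Text-conditioned multi-view generation results, as reported in the manuscript. PAIWorld attains the best SSIM, LPIPS, FID, and MEt3R; Wan2.1 has a slightly lower FVD, and Genie-Envisioner leads on the Semantic and Geometric scores.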
| Method | SSIM | LPIPS | FID | FVD | Semantic | Geometric | MEt3R |
|---|---|---|---|---|---|---|---|
| Direction | Higher | Lower | Lower | Lower | Higher | Higher | Lower |
| Genie-Envisioner | 0.7445 | 0.3345 | 83.7847 | 207.2025 | 0.9231 | 0.5327 | 15.75 |
| Cosmos-Predict2 | 0.5870 | 0.3251 | 58.2837 | 188.6350 | 0.8456 | 0.4824 | 17.47 |
| Wan2.1 | 0.5715 | 0.3354 | 56.4735 | 174.2186 | 0.8617 | 0.4716 | 16.59 |
| PAIWorld | 0.7683 | 0.1844 | 45.0389 | 175.7778 | 0.9041 | 0.4056 | 14.20 |
Demos
WorldArena and AgiBot-Challenge generation examples.
WorldArena
Representative action-conditioned world model generation episodes.
World model generation
Reconstruction
AgiBot-Challenge
World model generation, GT image/depth comparison, and multi-seed consistency examples.
World model generation
Multi-view generation
GT Image and Depth Comparison
Multi-Seed Generation Consistency
Contribution
Core Contributors
Yuhang Huang, Jiazhao Zhang, Xuan Lv, Junyan Xu, Zhiyuan Yu, Ruizhen Hu, Kai Xu
Contributors
Wancheng Feng, Shilong Zou, Hewen Xiao, Ziqiao Zhou, Kaiyun Huang, Zhiyu Peng, Juzhan Xu, Hang Zhao, Zhibin Zhu, Chenyang Zhu, Renjiao Yi, Yifei Huang, Douhui Wu, Yan Zhang, Kexu Cheng, Chunhe Song, Yunzhi Xue, Xiuhong Zhang, Leitao Guo, Yunji Chen, Bin Wu, Haibin Yu
Corresponding Author
Kai Xu
Citation
BibTeX
@misc{huang2026paiworld3dconsistentworldfoundation,
title={PAIWorld: A 3D-Consistent World Foundation Model for Robotic Manipulation},
author={Yuhang Huang and Xuan Lv and Junyan Xu and Zhiyuan Yu and Jiazhao Zhang and Ruizhen Hu and Wancheng Feng and Shilong Zou and Hewen Xiao and Ziqiao Zhou and Kaiyun Huang and Zhiyu Peng and Juzhan Xu and Hang Zhao and Chenyang Zhu and Renjiao Yi and Yifei Huang and Douhui Wu and Yan Zhang and Kexu Cheng and Chunhe Song and Yunzhi Xue and Xiuhong Zhang and Leitao Guo and Yunji Chen and Bin Wu and Haibin Yu and Kai Xu},
year={2026},
eprint={2606.18375},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2606.18375},
}