PAIWorld

A 3D-Consistent World Foundation Model for Robotic Manipulation

The PAIWorld Team
Institute of AI for Industries, Chinese Academy of Sciences

Abstract

World foundation models have emerged as powerful simulators of physical environments, yet they predominantly operate in a single-view setting and lack the multi-view 3D consistency required by robotic manipulation. Robotic systems rely on egocentric, eye-to-hand, and wrist-mounted cameras to capture complementary viewpoints for policy learning, but current multi-view world models often concatenate view tokens without explicit geometric reasoning, leading to cross-view object drift, depth inconsistency, and texture misalignment. PAIWorld addresses these failures through two coupled technical pillars: an inter-view communication pathway and a geometric learning signal. Geometry-Aware Cross-View Attention and Geometric Rotary Position Embedding establish geometry-aware feature exchange across views, while Latent 3D-REPA distills 3D-aware features from frozen foundation models to keep exchanged content 3D-consistent. Built on a DiT-based world foundation model, PAIWorld ranks 1st on the WorldArena leaderboard and 2nd on the AgiBot-Challenge2026 leaderboard, while supporting downstream applications including model-based planning, world action models, and multi-view policy post-training.

WorldArena: Rank #1 (score 72.31)
AgiBot-Challenge2026: Rank #2 (score 82.45)
Figures: WorldArena leaderboard; AgiBot-Challenge2026 leaderboard.

Overview

PAIWorld overview teaser
PAIWorld is a 3D-consistent multi-view world foundation model for robotic manipulation. Pre-trained on 2.5M multi-view video clips, PAIWorld serves as a versatile backbone for a range of downstream applications: multi-view world generation, world action modeling, robotic planning, and policy post-training. Across these settings, PAIWorld maintains cross-view 3D consistency: coherent object placement, depth, and texture across all viewpoints, making its imagined rollouts physically plausible for embodied decision-making.

Method

PAIWorld injects 3D consistency into flow-matching world foundation models through two coupled pillars: an inter-view communication pathway and a geometric training objective.

PAIWorld framework pipeline
Overview of the PAIWorld framework. Built on a DiT-based flow-matching backbone, PAIWorld rests on two technical pillars realized by three components. The inter-view pathway combines Geometry-Aware Cross-View Attention and Geometric Rotary Position Embedding (Geo-RoPE) to enable communication across views while biasing attention toward geometrically corresponding tokens. The geometric objective uses Latent 3D-REPA to align intermediate DiT representations with 3D-aware features from Depth Anything 3, which encourages the exchanged information to remain 3D-consistent.
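For readers unfamiliar with the flow-matching backbone mentioned above, the following is a minimal sketch of how one training pair and its loss are commonly built under a linear (rectified-flow style) path. It is not PAIWorld's training code: the function names, the convention that t=0 is noise and t=1 is data, and the plain mean-squared-error loss are all assumptions made for illustration.

```python
import random

def flow_matching_pair(x_data, t, rng):
    """Build one training pair under a linear noise-to-data path.

    Assumed convention (not taken from the source): t=0 is pure noise, t=1 is data.
    Returns the interpolated sample x_t and the target velocity (data - noise).
    """
    noise = [rng.gauss(0.0, 1.0) for _ in x_data]
    x_t = [(1.0 - t) * n + t * x for n, x in zip(noise, x_data)]
    target_velocity = [x - n for n, x in zip(noise, x_data)]
    return x_t, target_velocity

def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    sq = [(p - v) ** 2 for p, v in zip(predicted_velocity, target_velocity)]
    return sum(sq) / len(sq)
```

In a real model, a network would predict the velocity from x_t, t, and the conditioning signals; here the loss is simply evaluated on given vectors.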

Two technical pillars realized by three lightweight components.

01. Geometry-Aware Cross-View Attention

Cross-view attention blocks open an explicit communication pathway across viewpoints, letting each view exchange features with the others during world-model generation.
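As a rough illustration of the communication pathway, the sketch below flattens the tokens of all views into one sequence and lets every token attend to every other token. It is a deliberately simplified assumption: the function names are hypothetical, queries, keys, and values are the raw tokens (no learned projections), and no geometric bias is applied, so it shows only the plain cross-view exchange and not PAIWorld's geometry-aware variant.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_view_attention(views):
    """views: list of V views, each a list of N token vectors of length D.

    Every token attends to the tokens of ALL views (including its own view).
    Returns the mixed tokens in the same V x N x D layout.
    """
    tokens = [tok for view in views for tok in view]  # flatten to (V*N, D)
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    mixed = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        mixed.append([sum(w * v[d] for w, v in zip(weights, tokens))
                      for d in range(dim)])
    n = len(views[0])
    return [mixed[i * n:(i + 1) * n] for i in range(len(views))]
```

Because each output token is a convex combination of tokens from all cameras, information visible in one view can influence the features of another.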

02. Geometric Rotary Position Embedding

Geo-RoPE encodes camera ray directions and extrinsic poses into attention, biasing the inter-view pathway toward geometrically corresponding tokens.
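To make the idea concrete, the sketch below computes a world-frame ray direction for a pixel under a pinhole camera model and then applies a standard rotary rotation whose angles come from that ray. How PAIWorld actually maps rays and extrinsics to rotary phases is not described here, so `geo_rope`, the `freq` parameter, and the use of the three ray components as phases are hypothetical; camera translation is omitted for brevity.

```python
import math

def pixel_ray_world(u, v, fx, fy, cx, cy, R_c2w):
    """Unit ray direction in the world frame for pixel (u, v), pinhole model.

    R_c2w is a 3x3 camera-to-world rotation given as nested lists.
    """
    d_cam = [(u - cx) / fx, (v - cy) / fy, 1.0]
    d_world = [sum(R_c2w[i][j] * d_cam[j] for j in range(3)) for i in range(3)]
    norm = math.sqrt(sum(c * c for c in d_world))
    return [c / norm for c in d_world]

def rotate_pairs(x, angles):
    """Standard RoPE step: rotate feature pair (x[2i], x[2i+1]) by angles[i]."""
    out = []
    for i, theta in enumerate(angles):
        a, b = x[2 * i], x[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        out += [a * c - b * s, a * s + b * c]
    return out

def geo_rope(x, ray, freq=1.0):
    """Hypothetical mapping: one rotary phase per ray component (x has 6 dims)."""
    return rotate_pairs(x, [freq * c for c in ray])
```

With this construction, the dot product between a rotated query and a rotated key depends only on the difference of their ray-derived phases, so tokens from different cameras that share the same world-frame ray direction keep their original similarity, while mismatched directions are modulated.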

03. Latent 3D-REPA

Intermediate DiT features are aligned with 3D-aware features from Depth Anything 3, supplying the geometric objective that makes exchanged content 3D-consistent.
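One common way to express such an alignment objective, following the general representation-alignment (REPA) recipe, is a negative mean cosine similarity between projected intermediate features and frozen teacher features. The sketch below assumes that form; the exact loss, projection head, and layer choice used by PAIWorld are not stated here, and `project`, `alignment_loss`, and the single linear `weight` matrix are illustrative stand-ins.

```python
import math

def cosine(a, b, eps=1e-8):
    """Cosine similarity between two vectors, with a small eps for stability."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + eps)

def project(feat, weight):
    """Hypothetical linear projection head mapping DiT features to teacher space."""
    return [sum(w * f for w, f in zip(row, feat)) for row in weight]

def alignment_loss(dit_feats, teacher_feats, weight):
    """Negative mean cosine similarity over tokens (lower means better aligned).

    teacher_feats stand in for frozen 3D-aware features; they receive no gradient.
    """
    sims = [cosine(project(h, weight), t)
            for h, t in zip(dit_feats, teacher_feats)]
    return -sum(sims) / len(sims)
```

In training, a term like this would typically be added to the generative loss with a weighting coefficient, so the backbone is pulled toward 3D-aware representations while still learning to generate.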

Core claim

Communication and geometry must work together.

Cross-view attention alone provides a path for information flow but does not guarantee physically correct structure. A 3D prior alone improves each view independently but cannot propagate constraints across cameras. PAIWorld combines the two, with Geo-RoPE serving as the shared geometric reference frame.

Results

Stronger generation quality and cross-view consistency across world-model leaderboards and benchmarks.

WorldArena Leaderboard

PAIWorld ranks first by EWMScore_P on the WorldArena leaderboard.

| Method | EWMScore_P | Rank | Visual Quality | Motion Quality | Content Consistency | Physics Adherence | 3D Accuracy | Controllability | Aesthetic Quality | Background Consistency | Depth Accuracy | Dynamic Degree | Flow Score | Image Quality | Instruction Following | Interaction Quality | JEPA Similarity | Motion Smoothness | Perspectivity | Photometric Consistency | Semantic Alignment | Subject Consistency | Trajectory Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PAIWorld | 72.31 | 1 | 63.04 | 80.45 | 57.85 | 61.66 | 91.51 | 87.16 | 40.39 | 89.13 | 86.94 | 68.76 | 77.18 | 51.88 | 85.22 | 77.32 | 96.84 | 95.41 | 96.08 | 3.01 | 89.09 | 81.40 | 45.99 |
| UNIS | 72.16 | 2 | 60.85 | 81.60 | 56.44 | 61.56 | 91.16 | 90.19 | 40.94 | 86.88 | 83.91 | 68.41 | 83.00 | 54.49 | 91.40 | 84.52 | 87.12 | 93.38 | 98.42 | 2.49 | 88.98 | 79.94 | 38.59 |
| BWM-Fast | 72.15 | 3 | 62.79 | 78.79 | 58.30 | 61.18 | 91.53 | 88.58 | 39.94 | 90.27 | 86.76 | 69.14 | 73.14 | 50.62 | 88.02 | 78.02 | 97.81 | 94.10 | 96.30 | 3.08 | 89.15 | 81.54 | 44.35 |
| FlowWAM-FiveAges | 72.00 | 4 | 63.43 | 79.45 | 57.43 | 59.80 | 91.60 | 88.16 | 40.57 | 88.55 | 86.18 | 69.00 | 74.79 | 55.70 | 87.46 | 78.48 | 94.02 | 94.55 | 97.02 | 3.09 | 88.87 | 80.65 | 41.12 |
| EvoPhysWorld | 71.73 | 5 | 60.07 | 84.15 | 58.04 | 59.74 | 89.20 | 85.62 | 39.34 | 88.46 | 82.37 | 79.61 | 77.27 | 58.73 | 82.84 | 76.00 | 82.15 | 95.58 | 96.02 | 3.59 | 88.39 | 82.07 | 43.49 |
| Inspatio-Curious | 71.60 | 6 | 62.39 | 73.32 | 58.37 | 65.78 | 92.57 | 87.52 | 40.66 | 90.24 | 89.71 | 61.18 | 66.43 | 50.34 | 85.54 | 76.46 | 96.16 | 92.35 | 95.44 | 3.25 | 89.49 | 81.61 | 55.11 |
| BetaBWM | 71.22 | 7 | 62.85 | 68.83 | 61.36 | 63.78 | 92.04 | 88.78 | 40.16 | 90.78 | 87.41 | 58.21 | 59.99 | 51.00 | 88.48 | 78.24 | 97.39 | 88.29 | 96.68 | 10.55 | 89.08 | 82.75 | 49.31 |
| MAI | 70.96 | 8 | 63.42 | 64.05 | 59.10 | 67.35 | 94.42 | 90.56 | 40.16 | 88.82 | 90.18 | 54.32 | 53.18 | 60.46 | 92.36 | 80.32 | 89.64 | 84.66 | 98.66 | 6.88 | 88.75 | 81.60 | 54.38 |
| SynapX | 70.93 | 9 | 63.21 | 76.39 | 57.51 | 57.69 | 90.08 | 88.54 | 39.77 | 88.72 | 85.06 | 65.97 | 69.14 | 53.37 | 87.80 | 75.74 | 96.50 | 94.06 | 95.10 | 3.01 | 89.28 | 80.79 | 39.65 |
| WAI | 70.89 | 10 | 63.35 | 63.90 | 59.11 | 67.24 | 94.44 | 90.44 | 40.10 | 88.85 | 90.27 | 54.18 | 52.92 | 60.40 | 92.18 | 80.04 | 89.54 | 84.60 | 98.60 | 6.85 | 88.70 | 81.63 | 54.45 |

AgiBot-Challenge2026

PAIWorld ranks second overall and achieves the best scene consistency among reported teams.

| Team | EWMScore (↑) | PSNR (↑) | Scene Consistency (↑) | nDTW (↑) |
|---|---|---|---|---|
| NeoVerse-ABot | 0.829 | 0.6246 | 0.8974 | 0.9651 |
| Loop | 0.8241 | 0.6207 | 0.9024 | 0.9492 |
| Wild Path | 0.8232 | - | - | - |
| VIPL-GENUN | 0.8195 | - | - | - |
| PAIWorld | 0.8245 | 0.6161 | 0.9041 | 0.9531 |

AgiBot-World Model

Text-conditioned multi-view generation results, as reported in the paper.

| Method | SSIM (↑) | LPIPS (↓) | FID (↓) | FVD (↓) | Semantic (↑) | Geometric (↑) | MEt3R (↓) |
|---|---|---|---|---|---|---|---|
| Genie-Envisioner | 0.7445 | 0.3345 | 83.7847 | 207.2025 | 0.9231 | 0.5327 | 15.75 |
| Cosmos-Predict2 | 0.5870 | 0.3251 | 58.2837 | 188.6350 | 0.8456 | 0.4824 | 17.47 |
| Wan2.1 | 0.5715 | 0.3354 | 56.4735 | 174.2186 | 0.8617 | 0.4716 | 16.59 |
| PAIWorld | 0.7683 | 0.1844 | 45.0389 | 175.7778 | 0.9041 | 0.4056 | 14.20 |

Demos

WorldArena and AgiBot-Challenge generation examples.

WorldArena

Representative action-conditioned world model generation episodes.

World model generation

Reconstruction

WorldArena reconstruction examples: episodes 381, 401, 421, 441, 461, and 481.

AgiBot-Challenge

World model generation, GT image/depth comparison, and multi-seed consistency examples.

World model generation

AgiBot-Challenge world model generation example 1
AgiBot-Challenge world model generation example 2
Multi-view generation
3D reconstructions: multiview examples 1, 2, and 3.

GT Image and Depth Comparison

AgiBot-Challenge GT image and depth comparison 1
AgiBot-Challenge GT image and depth comparison 2

Multi-Seed Generation Consistency

AgiBot-Challenge multi-seed generation consistency example 1
AgiBot-Challenge multi-seed generation consistency example 2

Contribution

Core Contributors

Yuhang Huang, Jiazhao Zhang, Xuan Lv, Junyan Xu, Zhiyuan Yu, Ruizhen Hu, Kai Xu

Contributors

Wancheng Feng, Shilong Zou, Hewen Xiao, Ziqiao Zhou, Kaiyun Huang, Zhiyu Peng, Juzhan Xu, Hang Zhao, Zhibin Zhu, Chenyang Zhu, Renjiao Yi, Yifei Huang, Douhui Wu, Yan Zhang, Kexu Cheng, Chunhe Song, Yunzhi Xue, Xiuhong Zhang, Leitao Guo, Yunji Chen, Bin Wu, Haibin Yu

Corresponding Author

Kai Xu

Citation

BibTeX

@misc{huang2026paiworld3dconsistentworldfoundation,
  title={PAIWorld: A 3D-Consistent World Foundation Model for Robotic Manipulation},
  author={Yuhang Huang and Xuan Lv and Junyan Xu and Zhiyuan Yu and Jiazhao Zhang and Ruizhen Hu and Wancheng Feng and Shilong Zou and Hewen Xiao and Ziqiao Zhou and Kaiyun Huang and Zhiyu Peng and Juzhan Xu and Hang Zhao and Chenyang Zhu and Renjiao Yi and Yifei Huang and Douhui Wu and Yan Zhang and Kexu Cheng and Chunhe Song and Yunzhi Xue and Xiuhong Zhang and Leitao Guo and Yunji Chen and Bin Wu and Haibin Yu and Kai Xu},
  year={2026},
  eprint={2606.18375},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2606.18375},
}