PAIWorld: A 3D-Consistent World Foundation Model for Robotic Manipulation

Abstract

World foundation models have emerged as powerful simulators of physical environments, yet they predominantly operate in a single-view setting and lack the multi-view 3D consistency required by robotic manipulation. Robotic systems rely on egocentric, eye-to-hand, and wrist-mounted cameras to capture complementary viewpoints for policy learning, but current multi-view world models often concatenate view tokens without explicit geometric reasoning, leading to cross-view object drift, depth inconsistency, and texture misalignment. PAIWorld addresses these failures through two coupled technical pillars: an inter-view communication pathway and a geometric learning signal. Geometry-Aware Cross-View Attention and Geometric Rotary Position Embedding establish geometry-aware feature exchange across views, while Latent 3D-REPA distills 3D-aware features from frozen foundation models to keep exchanged content 3D-consistent. Built on a DiT-based world foundation model, PAIWorld ranks 1st on the WorldArena leaderboard and 2nd on the AgiBot-Challenge2026 leaderboard, while supporting downstream applications including model-based planning, world action models, and multi-view policy post-training.

WorldArena: Rank #1 (EWMScore_P 72.31)

AgiBot-Challenge2026: Rank #2 (EWMScore 82.45, listed as 0.8245 in the results table)

Figure: WorldArena leaderboard.

Figure: AgiBot-Challenge2026 leaderboard.

Overview

Method

PAIWorld injects 3D consistency into flow-matching world foundation models through two coupled pillars: an inter-view communication pathway and a geometric training objective.

Figure: Overview of the PAIWorld framework. Built on a DiT-based flow-matching backbone, PAIWorld rests on two technical pillars realized by three components. The inter-view pathway combines Geometry-Aware Cross-View Attention and Geometric Rotary Position Embedding (Geo-RoPE) to enable communication across views while biasing attention toward geometrically corresponding tokens. The geometric objective uses Latent 3D-REPA to align intermediate DiT representations with 3D-aware features from Depth Anything 3, encouraging the exchanged information to remain 3D-consistent.
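To make the flow-matching backbone concrete, the sketch below builds one training pair under the common linear-interpolation (rectified-flow) formulation. This is an assumption: the page does not say which flow-matching variant PAIWorld uses, and `flow_matching_pair` and `mse` are hypothetical helper names. Plain Python lists stand in for video latents.

```python
import random

def flow_matching_pair(x_data, t, rng):
    """Return (x_t, v_target) for one sample under linear-path flow matching.

    Assumed convention (not taken from the paper): x_0 is Gaussian noise,
    x_1 is the data latent, x_t = (1 - t) * x_0 + t * x_1, and the network
    regresses the constant velocity x_1 - x_0.
    """
    x_noise = [rng.gauss(0.0, 1.0) for _ in x_data]
    x_t = [(1.0 - t) * n + t * d for n, d in zip(x_noise, x_data)]
    v_target = [d - n for n, d in zip(x_noise, x_data)]
    return x_t, v_target

def mse(pred, target):
    """Mean squared error, the usual regression loss on the velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)

rng = random.Random(0)
latent = [0.5, -1.0, 2.0, 0.0]   # toy stand-in for a data latent
t = 0.3
x_t, v_target = flow_matching_pair(latent, t, rng)

# With a perfect velocity prediction, integrating from t to 1 recovers the data.
x_end = [x + (1.0 - t) * v for x, v in zip(x_t, v_target)]

# A model that predicts zero velocity pays a strictly positive loss.
zero_pred_loss = mse([0.0] * len(latent), v_target)
```

In a real model, the velocity network would be the DiT, conditioned on text or actions, and the loss would be averaged over random `t` and noise draws.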

Two technical pillars realized by three lightweight components.

01. Geometry-Aware Cross-View Attention

Cross-view attention blocks open an explicit communication pathway across viewpoints, letting each view exchange features with the others during world-model generation.
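As a rough illustration of such a pathway, the sketch below lets every token attend over the tokens of all views at once. It is a minimal single-head version with identity projections and no geometric bias. Attending jointly over all views, including a token's own view, is an assumption rather than the documented PAIWorld design, and `cross_view_attention` is a hypothetical name.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_view_attention(views):
    """views: list of V views, each a list of N token vectors of size D.

    Keys and values are the concatenation of the tokens of every view, so a
    query from one camera can read features from the other cameras.
    Learned Q/K/V projections are omitted for brevity.
    """
    flat = [tok for view in views for tok in view]   # V * N tokens
    dim = len(flat[0])
    out = []
    for view in views:
        out_view = []
        for q in view:
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                      for k in flat]
            weights = softmax(scores)
            out_view.append([sum(w * k[d] for w, k in zip(weights, flat))
                             for d in range(dim)])
        out.append(out_view)
    return out

# Two views with two tokens each; matching tokens share the same feature.
views = [
    [[10.0, 0.0], [0.0, 10.0]],   # e.g. head camera
    [[10.0, 0.0], [0.0, 10.0]],   # e.g. wrist camera
]
out = cross_view_attention(views)
```

Here each query splits its attention between its own token and the matching token in the other view, which is the kind of exchange the geometric terms are then meant to steer.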

02. Geometric Rotary Position Embedding

Geo-RoPE encodes camera ray directions and extrinsic poses into attention, biasing the inter-view pathway toward geometrically corresponding tokens.
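One way to picture this, offered as a sketch rather than the paper's formulation, is to use the azimuth and elevation of each token's world-frame camera ray as the rotary "position". Tokens whose rays agree then keep their plain dot-product score, and mismatched rays are attenuated. The parameterization, the frequencies, and every function name below are assumptions. The sketch also uses only the camera rotation, whereas the description above also mentions full extrinsic poses.

```python
import math

def yaw_rotation(phi):
    """Camera-to-world rotation about the y axis (row-major 3x3)."""
    c, s = math.cos(phi), math.sin(phi)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def pixel_ray_world(u, v, intr, R):
    """Unit ray direction in the world frame for pixel (u, v)."""
    fx, fy, cx, cy = intr
    d = [(u - cx) / fx, (v - cy) / fy, 1.0]
    norm = math.sqrt(sum(c * c for c in d))
    return [sum(R[i][j] * d[j] / norm for j in range(3)) for i in range(3)]

def project_world_ray(d_world, intr, R):
    """Pixel in a camera with rotation R that looks along d_world."""
    fx, fy, cx, cy = intr
    d = [sum(R[j][i] * d_world[j] for j in range(3)) for i in range(3)]  # R^T d
    return cx + fx * d[0] / d[2], cy + fy * d[1] / d[2]

def geo_rope(x, ray, freqs=(1.0, 4.0)):
    """Rotate consecutive feature pairs by angles derived from the ray.

    len(x) must equal 4 * len(freqs): the first pairs use azimuth, the rest
    use elevation.
    """
    azimuth = math.atan2(ray[0], ray[2])
    elevation = math.asin(max(-1.0, min(1.0, ray[1])))
    angles = [f * azimuth for f in freqs] + [f * elevation for f in freqs]
    out = []
    for i, theta in enumerate(angles):
        a, b = x[2 * i], x[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        out += [a * c - b * s, a * s + b * c]
    return out

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

intr = (100.0, 100.0, 64.0, 64.0)            # fx, fy, cx, cy (toy values)
R_a, R_b = yaw_rotation(0.0), yaw_rotation(0.3)

ray_a = pixel_ray_world(80.0, 50.0, intr, R_a)
u_b, v_b = project_world_ray(ray_a, intr, R_b)   # matching pixel in camera B
ray_b = pixel_ray_world(u_b, v_b, intr, R_b)
ray_other = pixel_ray_world(20.0, 100.0, intr, R_b)

q = [1.0, 0.5, -0.3, 0.8, 0.2, -1.0, 0.7, 0.1]
matched = dot(geo_rope(q, ray_a), geo_rope(q, ray_b))
mismatched = dot(geo_rope(q, ray_a), geo_rope(q, ray_other))
```

Because rotary embeddings make the score depend only on the angle difference, `matched` equals the unrotated `dot(q, q)` while `mismatched` is smaller. Rays with equal direction from different camera centers are parallel rather than identical, so a fuller version would also need to account for camera translation.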

03. Latent 3D-REPA

Intermediate DiT features are aligned with 3D-aware features from Depth Anything 3, supplying the geometric objective that makes exchanged content 3D-consistent.
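A representation-alignment term of this kind is typically a per-token cosine-similarity loss between projected hidden states and frozen target features. The sketch below assumes that form. The single linear projection `W`, the `1 - cosine` shape of the loss, and the toy vectors standing in for DiT states and Depth Anything 3 features are all illustrative assumptions, not details from the paper.

```python
import math

def linear(x, W):
    """Apply a projection matrix W (list of rows) to a vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def cosine(a, b, eps=1e-8):
    """Cosine similarity with a small epsilon for numerical safety."""
    num = sum(p * q for p, q in zip(a, b))
    na = math.sqrt(sum(p * p for p in a))
    nb = math.sqrt(sum(q * q for q in b))
    return num / (na * nb + eps)

def repa_alignment_loss(dit_tokens, target_tokens, W):
    """Mean (1 - cosine similarity) between projected DiT tokens and targets.

    target_tokens play the role of features from a frozen 3D-aware encoder;
    only the DiT and the projection W would receive gradients in training.
    """
    sims = [cosine(linear(h, W), z) for h, z in zip(dit_tokens, target_tokens)]
    return 1.0 - sum(sims) / len(sims)

W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]       # hypothetical 2 -> 3 projection
dit_tokens = [[1.0, 2.0], [0.0, 3.0]]          # toy intermediate features

aligned_targets = [[2.0, 4.0, 6.0], [0.0, 1.0, 1.0]]      # parallel to W @ h
orthogonal_targets = [[2.0, -1.0, 0.0], [1.0, 0.0, 0.0]]  # orthogonal to W @ h

loss_aligned = repa_alignment_loss(dit_tokens, aligned_targets, W)
loss_orthogonal = repa_alignment_loss(dit_tokens, orthogonal_targets, W)
```

The loss is scale-invariant: it approaches 0 when projected features point the same way as the targets and equals 1 when they are orthogonal. In training, it would be added to the generation loss with a weighting coefficient.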

Core claim

Communication and geometry must work together.

Cross-view attention alone provides a path for information flow but does not guarantee physically correct structure. A 3D prior alone improves each view independently but cannot propagate constraints across cameras. PAIWorld combines the two, with Geo-RoPE supplying the shared geometric reference frame.

Results

Stronger generation quality and cross-view consistency across world-model leaderboards and benchmarks.

WorldArena Leaderboard

PAIWorld ranks first by EWMScore_P on the WorldArena leaderboard.

Method	EWMScore_P	Rank	Visual Quality	Motion Quality	Content Consistency	Physics Adherence	3D Accuracy	Controllability	Aesthetic Quality	Background Consistency	Depth Accuracy	Dynamic Degree	Flow Score	Image Quality	Instruction Following	Interaction Quality	JEPA Similarity	Motion Smoothness	Perspectivity	Photometric Consistency	Semantic Alignment	Subject Consistency	Trajectory Accuracy
PAIWorld	72.31	1	63.04	80.45	57.85	61.66	91.51	87.16	40.39	89.13	86.94	68.76	77.18	51.88	85.22	77.32	96.84	95.41	96.08	3.01	89.09	81.40	45.99
UNIS	72.16	2	60.85	81.60	56.44	61.56	91.16	90.19	40.94	86.88	83.91	68.41	83.00	54.49	91.40	84.52	87.12	93.38	98.42	2.49	88.98	79.94	38.59
BWM-Fast	72.15	3	62.79	78.79	58.30	61.18	91.53	88.58	39.94	90.27	86.76	69.14	73.14	50.62	88.02	78.02	97.81	94.10	96.30	3.08	89.15	81.54	44.35
FlowWAM-FiveAges	72.00	4	63.43	79.45	57.43	59.80	91.60	88.16	40.57	88.55	86.18	69.00	74.79	55.70	87.46	78.48	94.02	94.55	97.02	3.09	88.87	80.65	41.12
EvoPhysWorld	71.73	5	60.07	84.15	58.04	59.74	89.20	85.62	39.34	88.46	82.37	79.61	77.27	58.73	82.84	76.00	82.15	95.58	96.02	3.59	88.39	82.07	43.49
Inspatio-Curious	71.60	6	62.39	73.32	58.37	65.78	92.57	87.52	40.66	90.24	89.71	61.18	66.43	50.34	85.54	76.46	96.16	92.35	95.44	3.25	89.49	81.61	55.11
BetaBWM	71.22	7	62.85	68.83	61.36	63.78	92.04	88.78	40.16	90.78	87.41	58.21	59.99	51.00	88.48	78.24	97.39	88.29	96.68	10.55	89.08	82.75	49.31
MAI	70.96	8	63.42	64.05	59.10	67.35	94.42	90.56	40.16	88.82	90.18	54.32	53.18	60.46	92.36	80.32	89.64	84.66	98.66	6.88	88.75	81.60	54.38
SynapX	70.93	9	63.21	76.39	57.51	57.69	90.08	88.54	39.77	88.72	85.06	65.97	69.14	53.37	87.80	75.74	96.50	94.06	95.10	3.01	89.28	80.79	39.65
WAI	70.89	10	63.35	63.90	59.11	67.24	94.44	90.44	40.10	88.85	90.27	54.18	52.92	60.40	92.18	80.04	89.54	84.60	98.60	6.85	88.70	81.63	54.45

AgiBot-Challenge2026

PAIWorld ranks second overall (EWMScore 0.8245) and achieves the best scene consistency (0.9041) among the teams that report it.

Team	EWMScore	PSNR	Scene Consistency	nDTW
Direction	Higher	Higher	Higher	Higher
NeoVerse-ABot	0.829	0.6246	0.8974	0.9651
Loop	0.8241	0.6207	0.9024	0.9492
Wild Path	0.8232	-	-	-
VIPL-GENUN	0.8195	-	-	-
PAIWorld	0.8245	0.6161	0.9041	0.9531
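The page does not define how EWMScore is aggregated for this challenge. As a sanity check on the table, the short script below compares each reported EWMScore with the unweighted mean of PSNR, Scene Consistency, and nDTW for the three teams that list all metrics. Treating EWMScore as that mean is an observation about these numbers, not a statement of the official formula.

```python
# (EWMScore, PSNR, Scene Consistency, nDTW), copied from the table above.
rows = {
    "NeoVerse-ABot": (0.829, 0.6246, 0.8974, 0.9651),
    "Loop": (0.8241, 0.6207, 0.9024, 0.9492),
    "PAIWorld": (0.8245, 0.6161, 0.9041, 0.9531),
}

def mean_of_metrics(psnr, scene_consistency, ndtw):
    """Unweighted mean of the three component metrics (assumed aggregation)."""
    return (psnr + scene_consistency + ndtw) / 3.0

# Absolute gap between the reported score and the mean of its components.
gaps = {
    team: abs(mean_of_metrics(*vals[1:]) - vals[0])
    for team, vals in rows.items()
}

# Teams ordered by reported EWMScore, highest first.
ranking = sorted(rows, key=lambda team: rows[team][0], reverse=True)

# Team with the highest reported Scene Consistency.
best_scene = max(rows, key=lambda team: rows[team][2])
```

For all three rows the gap stays below 2e-4, which is within what rounding of the published four-decimal values could explain.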

AgiBot-World Model

Text-conditioned multi-view generation results reported in the paper.

Method	SSIM	LPIPS	FID	FVD	Semantic	Geometric	MEt3R
Direction	Higher	Lower	Lower	Lower	Higher	Higher	Lower
Genie-Envisioner	0.7445	0.3345	83.7847	207.2025	0.9231	0.5327	15.75
Cosmos-Predict2	0.5870	0.3251	58.2837	188.6350	0.8456	0.4824	17.47
Wan2.1	0.5715	0.3354	56.4735	174.2186	0.8617	0.4716	16.59
PAIWorld	0.7683	0.1844	45.0389	175.7778	0.9041	0.4056	14.20

Demos

WorldArena and AgiBot-Challenge generation examples.

WorldArena

Representative action-conditioned world model generation episodes.

World model generation

Reconstruction

AgiBot-Challenge

World model generation, GT image/depth comparison, and multi-seed consistency examples.

World model generation

Multi-view generation

GT Image and Depth Comparison

Multi-Seed Generation Consistency

Contribution

Core Contributors

Yuhang Huang, Jiazhao Zhang, Xuan Lv, Junyan Xu, Zhiyuan Yu, Ruizhen Hu, Kai Xu

Contributors

Wancheng Feng, Shilong Zou, Hewen Xiao, Ziqiao Zhou, Kaiyun Huang, Zhiyu Peng, Juzhan Xu, Hang Zhao, Zhibin Zhu, Chenyang Zhu, Renjiao Yi, Yifei Huang, Douhui Wu, Yan Zhang, Kexu Cheng, Chunhe Song, Yunzhi Xue, Xiuhong Zhang, Leitao Guo, Yunji Chen, Bin Wu, Haibin Yu

Corresponding Author

Kai Xu

Citation

BibTeX

@misc{huang2026paiworld3dconsistentworldfoundation,
  title={PAIWorld: A 3D-Consistent World Foundation Model for Robotic Manipulation},
  author={Yuhang Huang and Xuan Lv and Junyan Xu and Zhiyuan Yu and Jiazhao Zhang and Ruizhen Hu and Wancheng Feng and Shilong Zou and Hewen Xiao and Ziqiao Zhou and Kaiyun Huang and Zhiyu Peng and Juzhan Xu and Hang Zhao and Chenyang Zhu and Renjiao Yi and Yifei Huang and Douhui Wu and Yan Zhang and Kexu Cheng and Chunhe Song and Yunzhi Xue and Xiuhong Zhang and Leitao Guo and Yunji Chen and Bin Wu and Haibin Yu and Kai Xu},
  year={2026},
  eprint={2606.18375},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2606.18375},
}