LaDi-WM: A Latent Diffusion-based World Model for Predictive Manipulation

Yuhang Huang1, Jiazhao Zhang2, Shilong Zou1, Xinwang Liu1, Ruizhen Hu3*, Kai Xu1*
National University of Defense Technology1, Peking University2, Shenzhen University3
CoRL 2025

* Co-corresponding Author

Our method consists of a latent diffusion-based world model and a corresponding predictive manipulation policy. (a): ① Given an initial action sequence generated by the policy, our world model produces imagined future states. ② These imagined states provide efficient guidance for the policy model, ③ yielding refined actions for improved manipulation. (b): Benefiting from our world model, we achieve significant improvements in both simulation and real-world performance.

Abstract

Predictive manipulation has recently gained considerable attention in the Embodied AI community due to its potential to improve robot policy performance by leveraging predicted states. However, generating accurate future visual states of robot-object interactions from world models remains a well-known challenge, particularly in achieving high-quality pixel-level representations.

To this end, we propose LaDi-WM, a world model that uses diffusion modeling to predict future states in latent space. Specifically, LaDi-WM leverages the well-established latent space aligned with pre-trained Visual Foundation Models (VFMs), which comprises both geometric features (DINO-based) and semantic features (CLIP-based). We find that predicting the evolution of the latent space is easier to learn and more generalizable than directly predicting pixel-level images.
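To make the idea of diffusing over a latent state concrete, here is a minimal sketch of a generic DDPM-style forward noising step applied to a toy latent vector. It is an illustration under stated assumptions, not the LaDi-WM implementation: the linear noise schedule, the concatenation of the two feature groups, and the names `alpha_bar` and `noise_latent` are all hypothetical, and the feature values are made-up stand-ins for DINO-based and CLIP-based features.

```python
import math
import random
from typing import List, Tuple


def alpha_bar(t: int, num_steps: int = 1000,
              beta_start: float = 1e-4, beta_end: float = 0.02) -> float:
    """Cumulative product of (1 - beta_s) for s = 0..t under a linear beta schedule (assumed)."""
    prod = 1.0
    for s in range(t + 1):
        beta = beta_start + (beta_end - beta_start) * s / (num_steps - 1)
        prod *= 1.0 - beta
    return prod


def noise_latent(z0: List[float], t: int,
                 rng: random.Random) -> Tuple[List[float], List[float]]:
    """Return (z_t, eps) with z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * eps."""
    ab = alpha_bar(t)
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = [math.sqrt(ab) * z + math.sqrt(1.0 - ab) * e for z, e in zip(z0, eps)]
    return zt, eps


# Hypothetical stand-ins for geometric (DINO-like) and semantic (CLIP-like) features.
geometric = [0.2, -0.1, 0.4]
semantic = [0.7, 0.3]

# Assumption: the two feature groups are simply concatenated into one latent state.
z0 = geometric + semantic
zt, eps = noise_latent(z0, t=500, rng=random.Random(0))
```

In a setup like this, a denoising network would be trained to recover `eps` (or `z0`) from `zt`, conditioned on the current observation and the candidate actions. The latent vector is far smaller than an image, which is one intuition for why prediction in latent space may be easier to learn than pixel-level prediction.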

Building on LaDi-WM, we design a diffusion policy that iteratively refines output actions by incorporating forecasted states, thereby generating more consistent and accurate results. Extensive experiments on both synthetic and real-world benchmarks demonstrate that LaDi-WM significantly improves policy performance, by 27.9% on the LIBERO-LONG benchmark and by 20% in real-world scenarios. Furthermore, our world model and policies achieve impressive generalizability in real-world experiments.
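The refinement loop described above (propose actions, imagine future states, refine actions) can be sketched as follows. This is a toy illustration, not the authors' code: the 1-D state, the `GOAL` value, and the functions `toy_policy`, `toy_world_model`, and `refine_actions` are hypothetical stand-ins for the diffusion policy and the latent world model.

```python
from typing import Callable, List, Optional

GOAL = 1.0     # hypothetical target state for the toy task
HORIZON = 4    # hypothetical number of actions per sequence


def toy_world_model(state: float, actions: List[float]) -> List[float]:
    """Imagine future states by accumulating actions onto the current state."""
    imagined, s = [], state
    for a in actions:
        s += a
        imagined.append(s)
    return imagined


def toy_policy(state: float, imagined: Optional[List[float]]) -> List[float]:
    """Without imagined states, emit a crude guess; with them, correct toward GOAL."""
    if imagined is None:
        return [0.1] * HORIZON
    # Recover the previously proposed actions from the imagined trajectory.
    trajectory = [state] + imagined
    previous = [b - a for a, b in zip(trajectory, trajectory[1:])]
    # Close half of the predicted terminal error, spread evenly over the horizon.
    correction = 0.5 * (GOAL - imagined[-1]) / HORIZON
    return [a + correction for a in previous]


def refine_actions(policy: Callable[[float, Optional[List[float]]], List[float]],
                   world_model: Callable[[float, List[float]], List[float]],
                   state: float, num_iters: int = 2) -> List[float]:
    """Propose actions, then alternate imagination and refinement num_iters times."""
    actions = policy(state, None)
    for _ in range(num_iters):
        imagined = world_model(state, actions)   # step 1: imagine future states
        actions = policy(state, imagined)        # steps 2-3: guide and refine
    return actions


initial = refine_actions(toy_policy, toy_world_model, 0.0, num_iters=0)
refined = refine_actions(toy_policy, toy_world_model, 0.0, num_iters=2)
```

In this toy setting each refinement pass halves the predicted terminal error, so `refined` ends closer to `GOAL` than `initial`. The real policy is a learned diffusion model conditioned on imagined latent states, so its refinement behavior is learned rather than hand-coded as it is here.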

Method

Quantitative Results

Qualitative Results

BibTeX

@inproceedings{huang2025ladi,
  title={LaDi-WM: A Latent Diffusion-based World Model for Predictive Manipulation},
  author={Huang, Yuhang and Zhang, Jiazhao and Zou, Shilong and Liu, Xinwang and Hu, Ruizhen and Xu, Kai},
  booktitle={CoRL},
  year={2025}
}