LaDi-WM: A Latent Diffusion-based World Model for Predictive Manipulation

Abstract

Predictive manipulation has recently gained considerable attention in the Embodied AI community due to its potential to improve robot policy performance by leveraging predicted states. However, generating accurate future visual states of robot-object interactions from world models remains a well-known challenge, particularly in achieving high-quality pixel-level representations.

To this end, we propose LaDi-WM, a world model that predicts the latent space of future states using diffusion modeling. Specifically, LaDi-WM leverages the well-established latent space aligned with pre-trained Visual Foundation Models (VFMs), which comprises both geometric features (DINO-based) and semantic features (CLIP-based). We find that predicting the evolution of the latent space is easier to learn and more generalizable than directly predicting pixel-level images.

Building on LaDi-WM, we design a diffusion policy that iteratively refines output actions by incorporating forecasted states, thereby generating more consistent and accurate results. Extensive experiments on both synthetic and real-world benchmarks demonstrate that LaDi-WM significantly enhances policy performance, by 27.9% on the LIBERO-LONG benchmark and by 20% in the real-world scenario. Furthermore, our world model and policies achieve impressive generalizability in real-world experiments.

Method

Overall framework of the proposed method. We first train LaDi-WM on task-agnostic clips, which benefits cross-task generalization. The trained world model is used to generate future imagined states, which serve as effective guidance and input to the policy network, improving the manipulation performance significantly.
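The interplay between the policy and the world model can be sketched as a short loop. This is a minimal illustration, not the released implementation: the names `refine_action`, `toy_policy`, and `toy_world_model`, the number of refinement rounds, and the 1-D toy dynamics are all assumptions made for the example.

```python
def refine_action(policy, world_model, obs, num_rounds=2):
    """Alternate between proposing an action and imagining its outcome.

    `policy(obs, imagined)` and `world_model(obs, action)` are hypothetical
    callables standing in for the trained networks.
    """
    imagined = None  # no forecast exists before the first proposal
    action = policy(obs, imagined)
    for _ in range(num_rounds):
        imagined = world_model(obs, action)  # forecast the future state
        action = policy(obs, imagined)       # refine using the forecast
    return action


# Toy 1-D stand-ins (assumptions, for illustration only).
GOAL = 1.0


def toy_world_model(obs, action):
    # Imagined next state under trivial additive dynamics.
    return obs + action


def toy_policy(obs, imagined):
    if imagined is None:
        return 0.5 * (GOAL - obs)  # rough first guess
    previous_action = imagined - obs
    # Correct the previous action by half of the forecasted remaining error.
    return previous_action + 0.5 * (GOAL - imagined)


refined = refine_action(toy_policy, toy_world_model, obs=0.0, num_rounds=2)
```

In this toy setting each round halves the forecasted distance to the goal, which mirrors the idea that imagined states give the policy a signal for correcting its own output.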

We propose to learn the environmental dynamics in a more compact and generalizable latent space. To capture both geometric and semantic information from the environment, we use pretrained foundation models (DINO and SigLIP) to extract these two types of features as latent states. Because the two types of latent dynamics follow different distributions, we introduce an interactive diffusion process to model them. The interactive diffusion avoids the difficulty of learning two mismatched distributions simultaneously and allows for effective interaction between the two dynamics predictions.

We employ a diffusion policy to learn robotic manipulation. The diffusion policy involves two processes: noising and denoising. In the noising stage, we gradually add Gaussian noise to the ground-truth action until it becomes pure noise. In the denoising stage, we use transformer layers to learn to recover the action from the noise. We first tokenize the different input modalities; the resulting tokens are fed into the transformer encoder and decoder to obtain the final output.
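The noising stage can be illustrated with the standard closed-form forward step used in DDPM-style diffusion. This is a sketch under stated assumptions: the linear beta schedule, the 1000 steps, the 7-D action, and the function names `make_alpha_bar` and `noise_action` are illustrative choices, not details taken from the paper.

```python
import math
import random


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative signal-retention factors for an assumed linear beta schedule."""
    alpha_bar, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def noise_action(action, t, alpha_bar, rng):
    """Sample a noisy action at step t in closed form; also return the noise."""
    signal = math.sqrt(alpha_bar[t])
    sigma = math.sqrt(1.0 - alpha_bar[t])
    eps = [rng.gauss(0.0, 1.0) for _ in action]
    noisy = [signal * a + sigma * e for a, e in zip(action, eps)]
    return noisy, eps


rng = random.Random(0)
alpha_bar = make_alpha_bar()
clean_action = [0.1, -0.2, 0.3, 0.0, 0.0, 0.0, 1.0]  # hypothetical 7-D action
noisy_action, eps = noise_action(clean_action, t=500, alpha_bar=alpha_bar, rng=rng)
```

As `t` approaches the last step, `alpha_bar[t]` approaches zero and the sample is dominated by Gaussian noise. A denoising network is then typically trained to predict `eps` (or the clean action) from the noisy action, the step index, and the conditioning tokens.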

Quantitative Results

Quantitative comparisons with state-of-the-art methods. Note that the results on LIBERO-LONG are obtained using only 10 demonstrations.

Scalability of our method. (a) scaling up the training data of the world model; (b) scaling up the training data of the policy model; (c) scaling up the model size of the policy model.

Qualitative Results

BibTeX

@inproceedings{huang2025ladi,
  title={LaDi-WM: A Latent Diffusion-based World Model for Predictive Manipulation},
  author={Huang, Yuhang and Zhang, Jiazhao and Zou, Shilong and Liu, Xinwang and Hu, Ruizhen and Xu, Kai},
  booktitle={CoRL},
  year={2025}
}