AdaWorld

Learning Adaptable World Models with Latent Actions

HKUST1, Harvard University2, Google DeepMind3, UMass Amherst4, MIT-IBM Watson AI Lab5

TL;DR: AdaWorld is a highly adaptable world model trained with continuous latent actions, enabling efficient action transfer, world model adaptation, and visual planning.

Abstract

World models aim to learn action-controlled prediction of environment dynamics and have proven essential for the development of intelligent agents. However, most existing world models demand substantial action-labeled data and costly training, making it challenging to adapt them to novel environments with heterogeneous actions through limited interactions. This limitation hinders their applicability across broader domains.

To overcome this challenge, we propose AdaWorld, an innovative world model learning approach that enables efficient adaptation. The key idea is to incorporate action information during the pretraining of world models. This is achieved by extracting latent actions from videos in a self-supervised manner, capturing the most critical transitions between frames. We then develop an autoregressive world model that conditions on these latent actions.

This learning paradigm enables highly adaptable world models, facilitating efficient transfer and learning of new actions even with limited interactions and finetuning. Our comprehensive experiments across multiple environments demonstrate that AdaWorld achieves superior performance in both simulation quality and visual planning.

Teaser

We introduce latent actions as a unified condition for action-aware pretraining from videos. Our world model, dubbed AdaWorld, can readily transfer actions across contexts without training. By initializing action embeddings with corresponding latent actions, AdaWorld can also be adapted into specialized world models efficiently.

[Video grids: three sets of action transfer results, each pairing a source video with the same latent actions replayed in a target scene (source video → target scene).]

AdaWorld enables more effective agent planning.

[Left]: action-agnostic baseline; [Right]: AdaWorld (ours).



Latent Action Autoencoder

With an information bottleneck design, our latent action autoencoder extracts the most critical action information from videos and compresses it into a continuous latent action.
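A minimal sketch of the bottleneck idea, assuming toy dimensions and simple linear maps (the actual model uses deep networks; all names and shapes here are illustrative, not the paper's architecture):

```python
import numpy as np

class LatentActionAutoencoder:
    """Toy latent action autoencoder: the encoder sees a pair of
    consecutive frames, but the bottleneck keeps only a few continuous
    dimensions, so the latent must capture the transition (the action)
    rather than the frame contents themselves."""

    def __init__(self, frame_dim=64, action_dim=4, seed=0):
        rng = np.random.default_rng(seed)
        # Illustrative linear maps standing in for deep encoder/decoder nets.
        self.W_enc = 0.01 * rng.standard_normal((2 * frame_dim, action_dim))
        self.W_dec = 0.01 * rng.standard_normal((frame_dim + action_dim, frame_dim))

    def encode(self, frame_t, frame_t1):
        """Compress the frame_t -> frame_t1 transition into a latent action."""
        pair = np.concatenate([frame_t, frame_t1])
        return np.tanh(pair @ self.W_enc)  # bounded, continuous latent action

    def decode(self, frame_t, latent_action):
        """Predict the next frame from the current frame plus the latent action."""
        inp = np.concatenate([frame_t, latent_action])
        return inp @ self.W_dec

# Usage: extract a latent action from two frames, then reapply it.
lae = LatentActionAutoencoder()
f0, f1 = np.random.default_rng(1).standard_normal((2, 64))
a = lae.encode(f0, f1)       # shape (4,): the compressed "action"
f1_hat = lae.decode(f0, a)   # shape (64,): predicted next frame
```

Training would minimize the reconstruction error between `f1_hat` and `f1`, which is what forces the bottleneck to carry the action information.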

Autoregressive World Model

We extract latent actions from videos using the latent action encoder. By leveraging the extracted actions as unified conditions, we pretrain a world model that can perform autoregressive rollouts at inference.
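The rollout described above can be sketched as follows; the world model here is a hypothetical stand-in for the paper's autoregressive predictor, conditioned on one latent action per step:

```python
import numpy as np

def rollout(world_model, first_frame, latent_actions):
    """Autoregressive rollout: each predicted frame is fed back as the
    context for the next step, conditioned on one latent action per step."""
    frames = [first_frame]
    for action in latent_actions:
        frames.append(world_model(frames[-1], action))
    return frames

# Stand-in world model: any function (frame, action) -> next frame.
def toy_world_model(frame, action):
    # Shifts the frame by the mean of the action; purely illustrative.
    return frame + action.mean()

rng = np.random.default_rng(0)
f0 = rng.standard_normal(64)
actions = [rng.standard_normal(4) for _ in range(3)]
traj = rollout(toy_world_model, f0, actions)  # 1 initial + 3 predicted frames
```

Because the conditioning is a continuous latent action rather than an environment-specific discrete action, the same rollout interface works across environments with heterogeneous action spaces.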


Experiment Results


Efficient World Model Adaptation

[Figures: Experiment 1-1, Experiment 1-2]

Visual Planning

[Figure: Experiment 2]

Action Composition

[Figure: Experiment 3]

Customizable Actions with Strong Controllability

[Figure: Experiment 4]

Qualitative Comparison with Other Variants

[Figure: Experiment 5]

BibTeX

@article{gao2025adaworld,
  title={AdaWorld: Learning Adaptable World Models with Latent Actions}, 
  author={Gao, Shenyuan and Zhou, Siyuan and Du, Yilun and Zhang, Jun and Gan, Chuang},
  journal={arXiv preprint arXiv:2503.18938},
  year={2025}
}