AdaWorld

Learning Adaptable World Models with Latent Actions

HKUST1, Harvard University2, Google DeepMind3, UMass Amherst4, MIT-IBM Watson AI Lab5

TL;DR: AdaWorld is a highly adaptable world model trained with continuous latent actions, enabling efficient action transfer, world model adaptation, and visual planning.

Abstract

World models aim to learn action-controlled prediction of environment dynamics and have proven essential for the development of intelligent agents. However, most existing world models demand substantial action-labeled data and costly training, making it challenging to adapt them to novel environments with heterogeneous actions through limited interactions. This limitation hinders their applicability across broader domains.

To overcome this challenge, we propose AdaWorld, an innovative world model learning approach that enables efficient adaptation. The key idea is to incorporate action information during the pretraining of world models. This is achieved by extracting latent actions from videos in a self-supervised manner, capturing the most critical transitions between frames. We then develop an autoregressive world model that conditions on these latent actions.

This learning paradigm enables highly adaptable world models, facilitating efficient transfer and learning of new actions even with limited interactions and finetuning. Our comprehensive experiments across multiple environments demonstrate that AdaWorld achieves superior performance in both simulation quality and visual planning.

Teaser

We introduce latent actions as a unified condition for action-aware pretraining from videos. Our world model, dubbed AdaWorld, can readily transfer actions across contexts without training. By initializing action embeddings with corresponding latent actions, AdaWorld can also be adapted into specialized world models efficiently.

[Video grids: three sets of action transfer results, each pairing a source video with the same latent actions replayed in a target scene (source video → target scene).]

AdaWorld enables more effective agent planning.

[Left]: action-agnostic baseline; [Right]: AdaWorld (ours).



Latent Action Autoencoder

With an information bottleneck design, our latent action autoencoder extracts the most critical action information from videos and compresses it into a continuous latent action.
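A minimal sketch of the bottleneck idea, assuming toy dimensions and simple linear maps (the actual model uses deep networks; all names and shapes here are illustrative, not the paper's architecture):

```python
import numpy as np

class LatentActionAutoencoder:
    """Toy latent action autoencoder: the encoder sees a pair of
    consecutive frames, but the bottleneck keeps only a few continuous
    dimensions, so the latent must capture the transition (the action)
    rather than the frame contents themselves."""

    def __init__(self, frame_dim=64, action_dim=4, seed=0):
        rng = np.random.default_rng(seed)
        # Illustrative linear maps standing in for deep encoder/decoder nets.
        self.W_enc = 0.01 * rng.standard_normal((2 * frame_dim, action_dim))
        self.W_dec = 0.01 * rng.standard_normal((frame_dim + action_dim, frame_dim))

    def encode(self, frame_t, frame_t1):
        """Compress the frame_t -> frame_t1 transition into a latent action."""
        pair = np.concatenate([frame_t, frame_t1])
        return np.tanh(pair @ self.W_enc)  # bounded, continuous latent action

    def decode(self, frame_t, latent_action):
        """Predict the next frame from the current frame plus the latent action."""
        inp = np.concatenate([frame_t, latent_action])
        return inp @ self.W_dec

# Usage: extract a latent action from two frames, then reapply it.
lae = LatentActionAutoencoder()
f0, f1 = np.random.default_rng(1).standard_normal((2, 64))
a = lae.encode(f0, f1)       # shape (4,): the compressed "action"
f1_hat = lae.decode(f0, a)   # shape (64,): predicted next frame
```

Training would minimize the reconstruction error between `f1_hat` and `f1`, which is what forces the bottleneck to carry the action information.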

Autoregressive World Model

We extract latent actions from videos using the latent action encoder. By leveraging the extracted actions as unified conditions, we pretrain a world model that can perform autoregressive rollouts at inference.
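The rollout described above can be sketched as follows; the world model here is a hypothetical stand-in for the paper's autoregressive predictor, conditioned on one latent action per step:

```python
import numpy as np

def rollout(world_model, first_frame, latent_actions):
    """Autoregressive rollout: each predicted frame is fed back as the
    context for the next step, conditioned on one latent action per step."""
    frames = [first_frame]
    for action in latent_actions:
        frames.append(world_model(frames[-1], action))
    return frames

# Stand-in world model: any function (frame, action) -> next frame.
def toy_world_model(frame, action):
    # Shifts the frame by the mean of the action; purely illustrative.
    return frame + action.mean()

rng = np.random.default_rng(0)
f0 = rng.standard_normal(64)
actions = [rng.standard_normal(4) for _ in range(3)]
traj = rollout(toy_world_model, f0, actions)  # 1 initial + 3 predicted frames
```

Because the conditioning is a continuous latent action rather than an environment-specific discrete action, the same rollout interface works across environments with heterogeneous action spaces.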


Experiment Results


Efficient World Model Adaptation

[Figures: Experiment 1-1, Experiment 1-2]

Visual Planning

[Figure: Experiment 2]

Action Composition

[Figure: Experiment 3]

Customizable Actions with Strong Controllability

[Figure: Experiment 4]

Qualitative Comparison with Other Variants

[Figure: Experiment 5]

BibTeX

@article{gao2025adaworld,
  title={AdaWorld: Learning Adaptable World Models with Latent Actions}, 
  author={Gao, Shenyuan and Zhou, Siyuan and Du, Yilun and Zhang, Jun and Gan, Chuang},
  journal={arXiv preprint arXiv:2503.18938},
  year={2025}
}