Latent Wasserstein Adversarial Imitation Learning

Siqi Yang1 , Kai Yan1 , Alexander G. Schwing1 , Yu-Xiong Wang1

1 University of Illinois Urbana-Champaign (UIUC)

ICLR 2026


PDF | Code

Abstract

Traditional distance metrics fail to capture environment dynamics in Wasserstein AIL;
we solve this by computing the Wasserstein distance in a dynamics-aware latent space.

Full Abstract

Imitation Learning (IL) enables agents to mimic expert behavior by learning from demonstrations. However, traditional IL methods require large amounts of medium-to-high-quality demonstrations, as well as the actions taken in those demonstrations, both of which are often unavailable. To reduce this need, we propose Latent Wasserstein Adversarial Imitation Learning (LWAIL), a novel adversarial imitation learning framework that focuses on state-only distribution matching. It benefits from the Wasserstein distance computed in a dynamics-aware latent space. This dynamics-aware latent space differs from prior work and is obtained via a pre-training stage, in which we train the Intention Conditioned Value Function (ICVF) to capture a dynamics-aware structure of the state space using a small set of randomly generated state-only data. We show that this enhances the policy's understanding of state transitions, enabling the learning process to use only one or a few state-only expert episodes to achieve expert-level performance. Through experiments on multiple MuJoCo environments, we demonstrate that our method outperforms prior Wasserstein-based IL methods and prior adversarial IL methods, achieving better results across various tasks.

Why the Distance Metric Fails

Many prior Wasserstein IL works [1] that employ the Kantorovich-Rubinstein (KR) dual overlook an important issue: the ground metric between individual states is overly simplistic, typically plain Euclidean distance. Euclidean distance fails to capture the environment's dynamics. For example, a state might be physically close to an expert state in Euclidean space yet unreachable due to an obstacle, making it a poor cost for the learning process.


Illustration of a case where the Euclidean distance between states is not a good metric: State B is closer to Expert State C in Euclidean distance, but State A is actually closer to Expert State C under the true environment dynamics.
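The failure mode in the figure can be reproduced in a toy gridworld: with a wall between two states, Euclidean distance and distance under the dynamics (number of legal moves) disagree about which state is "closer" to the expert. This is an illustrative sketch, not the paper's code; the grid, states A/B/C, and the BFS-based `dynamics_distance` are all hypothetical stand-ins.

```python
from collections import deque

# Toy 5x5 gridworld with a vertical wall; '#' cells are impassable.
# Expert state C sits just across the wall from B, so B looks close
# in Euclidean distance but is far under the true dynamics.
GRID = [
    ".....",
    "..#..",
    "..#..",
    "..#..",
    ".....",
]

def euclidean(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def dynamics_distance(grid, start, goal):
    """Fewest legal 4-connected moves, i.e. distance under the dynamics (BFS)."""
    rows, cols = len(grid), len(grid[0])
    frontier, seen = deque([(start, 0)]), {start}
    while frontier:
        (r, c), d = frontier.popleft()
        if (r, c) == goal:
            return d
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "#" \
                    and (nr, nc) not in seen:
                seen.add((nr, nc))
                frontier.append(((nr, nc), d + 1))
    return float("inf")

C = (2, 3)   # expert state, right of the wall
B = (2, 1)   # left of the wall: Euclidean-close, dynamics-far
A = (0, 4)   # same side as C: Euclidean-farther, dynamics-close
```

Here `euclidean(B, C) = 2.0 < euclidean(A, C) ≈ 2.24`, yet reaching C takes 6 moves from B and only 3 from A, which is exactly the ranking a dynamics-aware metric should recover.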

We propose a two-stage process:

  • Pre-training stage: We leverage a small amount (about 1% of the online rollout budget) of unstructured, low-quality (e.g., randomly generated) state-only data to train an Intention-Conditioned Value Function (ICVF) [2]. The resulting embedding captures a rich, dynamics-aware notion of reachability between states.
  • Imitation stage: We freeze the ICVF embedding and use the Euclidean distance in the resulting latent space as the ground cost within a standard Wasserstein AIL framework.
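The interface between the two stages reduces to a single function: a frozen embedding $\phi$ plus a Euclidean ground cost in latent space. A minimal sketch follows; the fixed random linear map standing in for the pretrained ICVF network, and the dimensions `STATE_DIM`/`LATENT_DIM`, are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the frozen ICVF embedding phi(.): in the method
# this is a pretrained network; a fixed random linear map keeps the sketch
# runnable and, crucially, is never updated during imitation.
STATE_DIM, LATENT_DIM = 17, 8  # e.g. a HalfCheetah-sized observation
W = rng.normal(size=(LATENT_DIM, STATE_DIM)) / np.sqrt(STATE_DIM)

def phi(s):
    """Frozen embedding: raw state -> dynamics-aware latent space."""
    return W @ s

def latent_cost(s1, s2):
    """Ground cost for Wasserstein AIL: Euclidean distance between embeddings."""
    return float(np.linalg.norm(phi(s1) - phi(s2)))
```

Because $\phi$ is frozen, `latent_cost` is a fixed, stable metric throughout adversarial training; only the critic and policy are updated.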

In the adversarial imitation learning stage, we optimize the following objective:

$$ \min_\pi\max_{\|f\|_L\leq 1} \left( \mathbb{E}_{(s, s')\sim d^\pi_{ss}}[f(\phi(s),\phi(s'))] - \mathbb{E}_{(s,s')\sim d^E_{ss}}[f(\phi(s),\phi(s'))] \right), $$

where $\pi$ is the policy to be learned, $f$ is the critic constrained to be 1-Lipschitz ($\lVert f \rVert_L \le 1$), $d^\pi_{ss}$ is the distribution over state-transition pairs $(s,s')$ induced by policy $\pi$, $d^E_{ss}$ is the expert's state-transition pair distribution, and $\phi(\cdot)$ is the frozen ICVF embedding that maps raw states to the dynamics-aware latent space. Intuitively, the critic maximizes the Wasserstein discrepancy between the policy and expert transition-pair distributions in latent space, while the policy minimizes it.
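The inner expression is estimated from samples: a 1-Lipschitz critic scores latent transition pairs, and the gap between its mean score on policy pairs and on expert pairs is the quantity the critic ascends and the policy descends. The sketch below is an assumption-laden toy: the linear critic with a unit-norm weight (which is exactly 1-Lipschitz) stands in for whatever Lipschitz constraint (clipping, gradient penalty) an actual implementation would use, and `wasserstein_gap` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(1)
LATENT_DIM = 8

# Toy critic f(z, z') = w . concat(z, z'); normalizing w to unit norm makes f
# exactly 1-Lipschitz, a stand-in for weight clipping / gradient penalties.
w = rng.normal(size=2 * LATENT_DIM)
w /= np.linalg.norm(w)

def critic(z, z_next):
    return float(w @ np.concatenate([z, z_next]))

def wasserstein_gap(policy_pairs, expert_pairs):
    """Sample estimate of E_pi[f(phi(s),phi(s'))] - E_E[f(phi(s),phi(s'))].

    Each element of *_pairs is a (z, z_next) tuple already mapped through the
    frozen embedding phi. The critic is trained to maximize this gap; the
    policy is trained with reward r(s, s') = -f(phi(s), phi(s')) to shrink it.
    """
    f_pi = np.mean([critic(z, zn) for z, zn in policy_pairs])
    f_exp = np.mean([critic(z, zn) for z, zn in expert_pairs])
    return float(f_pi - f_exp)
```

When the policy's latent transition distribution matches the expert's, the gap is zero for any critic, which is the fixed point the min-max objective drives toward.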

Performance

We validate our approach on pointmaze, antmaze, and challenging MuJoCo locomotion tasks from the D4RL benchmark, achieving strong results with only a single state-only expert trajectory. The results show that the latent space captures the transition dynamics far better than the vanilla Euclidean distance.


t-SNE visualizations in the original state space and the embedding latent space on HalfCheetah and Walker2d. The color of the points represents the ground-truth reward of the state (greener is higher). States connected by lines are adjacent in the trajectory. The ICVF-trained embedding provides a more dynamics-aware metric.

MuJoCo Environments (1 Expert Trajectory)

Normalized Rewards (Higher is Better)

References

[1] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In ICML, 2017.

[2] Dibya Ghosh, Chethan Anand Bhateja, and Sergey Levine. Reinforcement learning from passive data via latent intentions. In ICML, 2023.