MEReQ: Max-Ent Residual-Q Inverse RL for Sample-Efficient Alignment from Intervention

*Denotes equal contribution
1University of California, Berkeley, 2The University of Texas at Austin, 3Sony AI

Conference on Robot Learning (CoRL), 2025
Main Figure

MEReQ efficiently aligns the prior policy with human preferences by learning a residual reward through maximum-entropy inverse reinforcement learning and updating the policy with residual Q-learning.

Abstract

Aligning robot behavior with human preferences is crucial for deploying embodied AI agents in human-centered environments. A promising solution is interactive imitation learning from human intervention, where a human expert observes the policy's execution and provides interventions as feedback. However, existing methods often fail to exploit the prior policy to facilitate learning, which limits their sample efficiency.

In this work, we introduce Maximum-Entropy Residual-Q Inverse Reinforcement Learning (MEReQ), designed for sample-efficient alignment from human intervention. Instead of inferring the full characteristics of the human's behavior, MEReQ infers a residual reward function that captures the discrepancy between the underlying reward functions of the human expert and the prior policy. Residual Q-Learning (RQL) is then employed to align the policy with human preferences using the inferred residual reward. Extensive evaluations on simulated and real-world tasks show that MEReQ achieves sample-efficient alignment from human intervention compared to baseline methods.
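To give a concrete, purely illustrative sense of the residual-reward inference step, the sketch below performs one maximum-entropy IRL gradient update for a linear residual reward r_res(s, a) = w · φ(s, a): the gradient is the gap between the expert feature expectation, estimated from human-intervention samples, and the policy feature expectation, estimated from rollouts of the current policy. This is not the authors' implementation; the feature map phi, the array shapes, and the function names are hypothetical placeholders.

import numpy as np

def maxent_residual_reward_step(w_res, expert_feats, policy_feats, lr=1e-2):
    # One gradient-ascent step on the max-ent IRL objective for a linear
    # residual reward r_res(s, a) = w_res . phi(s, a).
    #   expert_feats: (N_e, d) features of human-intervention samples
    #   policy_feats: (N_p, d) features of rollouts from the current policy
    # The gradient is the expert-vs-policy gap in feature expectations.
    grad = expert_feats.mean(axis=0) - policy_feats.mean(axis=0)
    return w_res + lr * grad

# Hypothetical usage: alternate this reward update with policy updates
# (e.g., residual Q-learning on the inferred residual reward).
d = 4                                  # number of hand-crafted features
w_res = np.zeros(d)                    # residual reward weights
expert_feats = np.random.rand(32, d)   # stand-in for intervention samples
policy_feats = np.random.rand(128, d)  # stand-in for policy rollouts
w_res = maxent_residual_reward_step(w_res, expert_feats, policy_feats)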

Video

Tasks

We design multiple simulated and real-world tasks. In simulated tasks, we evaluate the sub-optimality of the learned policy via a synthesized expert. In real-world tasks, we validate MEReQ with human-in-the-loop (HITL) experiments.

Highway Sim

Highway Simulation Task

The task is to control a vehicle to navigate through highway traffic in the highway-env simulator. The prior policy can change lanes arbitrarily to maximize progress, while the expert policy encourages the vehicle to stay in the rightmost lane.

Bottle Pushing Sim

Bottle Pushing Simulation Task

The task is to control a robot arm to push a wine bottle to a goal position in MuJoCo. The prior policy can push the bottle anywhere along the height of the bottle, while the expert policy encourages pushing near the bottom of the bottle.

Erasing Sim

Erasing Simulation Task

The task is to control a robot arm to erase a marker on a whiteboard in MuJoCo. The prior policy applies insufficient force for effective erasing, whereas the expert encourages greater contact force to ensure the marker is fully erased.

Pillow Grasping Sim

Pillow Grasping Simulation Task

The task is to control a robot arm to grasp a pillow in MuJoCo. The prior policy does not have a grasping point preference, whereas the expert favors grasping from the center.

Highway Human

Real-world Highway Task with Human

A human expert monitors the task execution through a GUI and intervenes using a keyboard. The human is instructed to keep the vehicle in the rightmost lane if possible.

Bottle Pushing Human

Real-world Bottle Pushing Task with Human

This experiment is conducted on a Fanuc LR Mate 200iD/7L 6-DoF robot arm with a customized tooltip to push the bottle. The human expert intervenes with a SpaceMouse when the robot does not aim for the bottom of the bottle.

Pillow Grasping Human

Real-world Pillow Grasping Task with Human

The experiment configuration is similar to the real-world bottle-pushing task, but the robot arm is equipped with a two-finger gripper.

Results

Sample Efficiency

MEReQ converges faster and maintains a low intervention rate throughout the sample-collection iterations. It also requires fewer total expert samples than the baselines to reach comparable policy performance under varying intervention-rate thresholds δ.

Sample Efficiency Results
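As a small, hypothetical illustration of this metric (not the authors' evaluation code), the snippet below counts the total expert samples collected up to the first iteration whose intervention rate falls below a threshold δ; all numbers are made up.

import numpy as np

def expert_samples_to_threshold(intervention_rates, expert_samples_per_iter, delta):
    # Total expert samples collected up to (and including) the first
    # iteration whose intervention rate drops below delta; None if never.
    cumulative = np.cumsum(expert_samples_per_iter)
    for i, rate in enumerate(intervention_rates):
        if rate < delta:
            return int(cumulative[i])
    return None

# Made-up per-iteration statistics, for illustration only
rates = [0.62, 0.35, 0.18, 0.07, 0.05]  # intervention rate per iteration
samples = [310, 180, 95, 40, 28]        # expert samples per iteration
print(expert_samples_to_threshold(rates, samples, delta=0.10))  # 625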

Human Effort

MEReQ aligns the prior policy with human preferences in fewer sample collection iterations and with fewer human intervention samples.

Human Effort Analysis

Behavior Alignment

We evaluate the feature distributions induced by each method's policy in the Bottle-Pushing-Sim environment, using a convergence threshold of 0.1 for each feature. All methods except IWR-FT align well with the Expert on the table-distance feature (table dist), and MEReQ aligns more closely with the Expert on the remaining three features than the other baselines.

Behavior Alignment Results

Reward Alignment

We visualize the reward distributions of all methods in the Bottle-Pushing-Sim environment, using a convergence threshold of 0.1 for each feature. MEReQ's reward distribution aligns best with the Expert's among all methods.

Reward Alignment Analysis

BibTeX

@inproceedings{chen2025mereq,
  author    = {Chen, Yuxin and Tang, Chen and Wei, Jianglan and Li, Chenran and Tian, Ran and Zhang, Xiang and Zhan, Wei and Stone, Peter and Tomizuka, Masayoshi},
  title     = {MEReQ: Max-Ent Residual-Q Inverse RL for Sample-Efficient Alignment from Intervention},
  booktitle = {Conference on Robot Learning (CoRL)},
  year      = {2025},
}