Aligning robot behavior with human preferences is crucial for deploying embodied AI agents in human-centered environments. A promising solution is interactive imitation learning from human intervention, where a human expert observes the policy's execution and provides interventions as feedback. However, existing methods often fail to exploit the prior policy to facilitate learning, which hinders sample efficiency.
In this work, we introduce Maximum-Entropy Residual-Q Inverse Reinforcement Learning (MEReQ), designed for sample-efficient alignment from human intervention. Instead of inferring the complete human behavior characteristics, MEReQ infers a residual reward function that captures the discrepancy between the human expert's and the prior policy's underlying reward functions. Residual Q-Learning (RQL) is then employed to align the policy with human preferences using the inferred reward function. Extensive evaluations on simulated and real-world tasks show that MEReQ achieves sample-efficient alignment from human intervention compared to baselines.
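To make the idea concrete, below is a minimal toy sketch of the two alternating steps; it is not the paper's implementation. It assumes a linear residual reward r_res(s, a) = w · phi(s, a) over hand-crafted features, a small tabular MDP with uniform toy dynamics, and a synthetic stand-in for the human intervention data. Each iteration takes a maximum-entropy IRL gradient step on the residual reward weights and a residual soft-Q (RQL) backup that stacks the learned residual on top of the prior policy's soft Q-values.

# Toy sketch of MEReQ's core loop (all quantities below are illustrative).
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions, n_features = 5, 3, 4
alpha, gamma = 1.0, 0.9       # entropy temperature, discount factor
lr_w, lr_q = 0.1, 0.5         # learning rates for reward weights and Q-values

phi = rng.normal(size=(n_states, n_actions, n_features))  # feature map phi(s, a)
Q_prior = rng.normal(size=(n_states, n_actions))          # prior policy's soft Q-values

w = np.zeros(n_features)                  # residual reward weights to be inferred
Q_res = np.zeros((n_states, n_actions))   # residual soft Q-values

def soft_policy(Q):
    """Max-ent (Boltzmann) policy induced by soft Q-values."""
    z = (Q - Q.max(axis=1, keepdims=True)) / alpha
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def soft_value(Q):
    """Soft state value V(s) = alpha * log sum_a exp(Q(s, a) / alpha)."""
    m = Q.max(axis=1)
    return m + alpha * np.log(np.exp((Q - m[:, None]) / alpha).sum(axis=1))

def feature_expectation(pi):
    """Expected features under a policy, assuming a uniform state distribution."""
    return np.einsum("sa,saf->f", pi, phi) / n_states

# Synthetic stand-in for human intervention data: an "expert" whose behavior
# differs from the prior policy by an unknown residual reward phi(s, a) . w_true.
w_true = np.array([1.0, -1.0, 0.5, 0.0])
mu_expert = feature_expectation(soft_policy(Q_prior + phi @ w_true))

for _ in range(200):
    pi = soft_policy(Q_prior + Q_res)                    # current total policy
    # Max-ent IRL step on the residual reward: match expert feature expectations.
    w += lr_w * (mu_expert - feature_expectation(pi))
    # Residual soft-Q backup (RQL): only the residual reward and the gap between
    # total and prior soft values are needed, never the prior reward itself.
    r_res = phi @ w
    v_gap = soft_value(Q_prior + Q_res) - soft_value(Q_prior)
    Q_res = (1 - lr_q) * Q_res + lr_q * (r_res + gamma * v_gap.mean())

print("recovered residual weights:", np.round(w, 2))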
The task is to control a vehicle to navigate through highway traffic in the highway-env simulator. The prior policy can change lanes arbitrarily to maximize progress, while the expert policy encourages the vehicle to stay in the right-most lane.
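As an illustration only (not the paper's feature set), the rightmost-lane preference could be encoded as a single residual-reward feature computed from highway-env's state, assuming the library's gymnasium API and its convention that larger lane indices are further to the right:

# Illustrative rightmost-lane feature in highway-env (assumptions noted above).
import gymnasium as gym
import highway_env  # noqa: F401  -- registers "highway-v0" and its variants

env = gym.make("highway-v0")
env.unwrapped.configure({"lanes_count": 4})
obs, info = env.reset()

def rightmost_lane_feature(env) -> float:
    """1.0 when the ego vehicle is in the right-most lane, 0.0 in the left-most,
    assuming highway-env's convention that larger lane ids are further right."""
    lane_id = env.unwrapped.vehicle.lane_index[2]
    n_lanes = env.unwrapped.config["lanes_count"]
    return lane_id / max(n_lanes - 1, 1)

print("rightmost-lane feature:", rightmost_lane_feature(env))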
The task is to control a robot arm to push a wine bottle to a goal position in MuJoCo. The prior policy may push the bottle at any height, while the expert policy encourages pushing near the bottom of the bottle.
The task is to control a robot arm to erase a marker on a whiteboard in MuJoCo. The prior policy applies insufficient force for effective erasing, whereas the expert encourages greater contact force to ensure the marker is fully erased.
The task is to control a robot arm to grasp a pillow in MuJoCo. The prior policy does not have a grasping point preference, whereas the expert favors grasping from the center.
A human expert monitors the task execution through a GUI and intervenes using a keyboard. The human is instructed to keep the vehicle in the rightmost lane if possible.
This experiment is conducted on a Fanuc LR Mate 200iD/7L 6-DoF robot arm with a customized tooltip to push the bottle. The human expert intervenes with a SpaceMouse when the robot does not aim for the bottom of the bottle.
The experiment configuration is similar to that of bottle pushing, except that the robot arm is equipped with a two-finger gripper.
MEReQ converges faster and maintains a low intervention rate throughout the sample collection iterations. Under varying intervention rate thresholds δ, MEReQ requires fewer total expert samples than the baselines to achieve comparable policy performance.
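The intervention rate here is the fraction of environment steps in which the human takes over. A minimal sketch of this metric together with a δ-based stopping rule follows; the function names, window size, and example values are illustrative assumptions.

# Illustrative intervention-rate metric and delta-based convergence check.
from typing import List, Sequence

def intervention_rate(intervened: Sequence[int]) -> float:
    """Fraction of environment steps in which the human expert overrode the
    policy's action (intervened[i] is 1 if step i was a human action, else 0)."""
    return sum(intervened) / max(len(intervened), 1)

def has_converged(rates: List[float], delta: float = 0.1, window: int = 3) -> bool:
    """Consider the policy aligned once the intervention rate stays at or below
    the threshold delta for `window` consecutive sample collection iterations."""
    return len(rates) >= window and all(r <= delta for r in rates[-window:])

# Example: per-iteration intervention rates from three rollout batches.
rates = [intervention_rate(b) for b in ([1, 1, 0, 1], [0, 1, 0, 0], [0, 0, 0, 0])]
print(rates, has_converged(rates, delta=0.25))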
MEReQ aligns the prior policy with human preferences in fewer sample collection iterations and with fewer human intervention samples than the baselines.
We evaluate the policy distributions of all methods with a convergence threshold of 0.1 for each feature in the Bottle-Pushing-Sim environment. All methods except IWR-FT align well with the Expert on the table dist feature. Additionally, MEReQ aligns more closely with the Expert on the other three features than the other baselines do.
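One simple way to quantify this kind of per-feature alignment is a one-dimensional distance between feature values observed under the expert and under each learned policy; the specific metric below is an assumption, not necessarily the paper's, and the sample values are made up.

# Illustrative per-feature alignment score using the 1-D Wasserstein distance.
from scipy.stats import wasserstein_distance

def feature_alignment(expert_features: dict, policy_features: dict) -> dict:
    """Per-feature distance between expert and policy rollouts; smaller means
    the policy's feature distribution is closer to the expert's."""
    return {
        name: wasserstein_distance(expert_features[name], policy_features[name])
        for name in expert_features
    }

# Example with made-up samples of a "table dist" feature (meters).
expert = {"table dist": [0.05, 0.06, 0.04, 0.05]}
learned = {"table dist": [0.07, 0.05, 0.06, 0.08]}
print(feature_alignment(expert, learned))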
We visualize the reward distributions of all methods with a convergence threshold of 0.1 for each feature in the Bottle-Pushing-Sim environment. MEReQ aligns most closely with the Expert among all methods.
@inproceedings{chen2025mereq,
author = {Chen, Yuxin and Tang, Chen and Wei, Jianglan and Li, Chenran and Tian, Ran and Zhang, Xiang and Zhan, Wei and Stone, Peter and Tomizuka, Masayoshi},
title = {MEReQ: Max-Ent Residual-Q Inverse RL for Sample-Efficient Alignment from Intervention},
booktitle = {Conference on Robot Learning (CoRL)},
year = {2025},
}