HOPMan: Towards Generalizable Zero-Shot Manipulation via Translating Human Interaction Plans

HOPMan exhibiting its skills across diverse tasks in unseen scenarios


Accepted at the 2024 IEEE International Conference on Robotics and Automation (ICRA)
Carnegie Mellon University and Meta AI

We develop HOPMan, a framework for generalizable zero-shot manipulation that efficiently acquires a wide diversity of non-trivial skills and generalizes them to diverse unseen scenarios.

Trained on large datasets of passive human videos and a small paired dataset of human-robot trajectories, HOPMan exhibits 16 non-trivial manipulation skills (beyond picking and pushing, including articulated-object manipulation and object re-orientation) across 100 tasks, and generalizes to diverse unseen scenarios involving unseen objects, unseen tasks, and entirely unseen kitchens and offices.
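At inference time, the framework first predicts a human interaction plan toward the goal and then translates that plan into robot actions, as the title suggests. The sketch below illustrates this two-stage control loop with toy stand-ins; every class and method name here is hypothetical (the paper's actual interfaces and plan representations are not shown on this page), and the scalar "dynamics" are purely illustrative.

```python
class DummyPlanModel:
    """Stand-in for a plan predictor trained on passive human videos.

    Given the current observation and a goal, it returns a toy
    "human interaction plan" (here, just the target state).
    """
    def predict(self, obs, goal):
        return {"target": goal}


class DummyTranslator:
    """Stand-in for the human-to-robot plan translator, which in the
    paper is trained on a small paired human-robot dataset."""
    def translate(self, plan, obs):
        # Toy action: move directly toward the plan's target state.
        return plan["target"] - obs


def hopman_loop(plan_model, translator, obs, goal, max_steps=10):
    """Two-stage zero-shot loop: predict a human plan, translate it
    to a robot action, step the (toy) dynamics, repeat until done."""
    for _ in range(max_steps):
        plan = plan_model.predict(obs, goal)
        action = translator.translate(plan, obs)
        obs = obs + action  # toy scalar dynamics for illustration
        if abs(obs - goal) < 1e-9:
            break
    return obs
```

The key design point is the separation of concerns: the plan model needs only cheap, passive human video, while the translator is the only component that needs (scarce) paired human-robot data.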

In-The-Wild Manipulation Capabilities of HOPMan

Different Levels of Generalization

Mild Generalization (MG): This involves generalizing among unseen configurations (i.e. position and orientation variations) for seen object instances and seen skills, along with mild variations in the scene like lighting changes and camera pose changes.
Standard Generalization (G): Here we have two categories.
Instance Generalization (Ga): In addition to the variations in MG, in Ga we evaluate unseen object instances for seen skills. For example, only a red mug is seen with the pushing skill in training, and we generalize pushing to green and purple mugs of different shapes and textures.
Unseen Combinations (Gb): In addition to the variations in MG, in Gb we evaluate unseen combinations of seen skills and seen objects. For example, a skill and an object are each seen during training, but never together, and we generalize to applying that skill to that object.
Strong Generalization (SG): Here we have two categories.
Object category completely unseen (SGa): This includes scenarios where a particular object category, e.g. a microwave, is never seen in training.
Skill completely unseen (SGb): This includes scenarios where a particular skill, e.g. turning, is never seen in any context during training.
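The taxonomy above amounts to checking an evaluation scenario's skill, object category, and object instance against what appeared in training, from the strongest condition down to the mildest. A minimal sketch of that decision procedure (the function and field names are illustrative, not from the paper's code):

```python
def classify_generalization(skill, category, instance, train):
    """Classify an evaluation scenario into the generalization levels above.

    train: set of (skill, category, instance) triples seen during training.
    Returns one of "MG", "Ga", "Gb", "SGa", "SGb".
    """
    seen_skills = {s for s, _, _ in train}
    seen_categories = {c for _, c, _ in train}
    seen_pairs = {(s, c) for s, c, _ in train}

    if skill not in seen_skills:
        return "SGb"   # skill never seen in any context
    if category not in seen_categories:
        return "SGa"   # object category never seen
    if (skill, category) not in seen_pairs:
        return "Gb"    # skill and object both seen, but never together
    if (skill, category, instance) not in train:
        return "Ga"    # unseen instance for a seen skill
    return "MG"        # seen instance and skill; only pose/scene variations
```

Note the ordering: the checks run from strongest to mildest, so a scenario is labeled with the hardest level it satisfies.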

Explanation of HOPMan and Results

HOPMan exhibits skills across diverse tasks in diverse scenes

Result videos are organized by generalization level: Mild Generalization (MG), Standard Generalization (Ga, Gb), and Strong Generalization (SGa, SGb).
@misc{bharadhwaj2023generalizable,
      title={Towards Generalizable Zero-Shot Manipulation via Translating Human Interaction Plans}, 
      author={Homanga Bharadhwaj and Abhinav Gupta and Vikash Kumar and Shubham Tulsiani},
      year={2023},
      eprint={2312.00775},
      archivePrefix={arXiv},
      primaryClass={cs.RO}
}