A Glimpse of the Diverse Manipulation Capabilities Enabled by Gen2Act
We present Gen2Act: a framework for diverse, generalizable manipulation via human video generation. To solve a manipulation task in a new scene, Gen2Act first imagines how a human would perform the task by generating a video with a pre-trained model, and then executes a single learned policy conditioned on the generated video.
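As a concrete illustration of this two-stage flow, below is a minimal Python sketch of how inference could be organized. All names (generate_human_video, Gen2ActPolicy, env_reset, env_step) are hypothetical placeholders for illustration, not the released implementation.

import numpy as np

def generate_human_video(first_frame: np.ndarray, instruction: str) -> np.ndarray:
    """Stand-in for a pre-trained, language- and image-conditioned video model
    used zero-shot to imagine a human performing the task in the current scene."""
    return np.repeat(first_frame[None], 16, axis=0)  # (T, H, W, 3) generated frames

class Gen2ActPolicy:
    """Stand-in for the closed-loop policy conditioned on the generated human video."""
    def act(self, observation: np.ndarray, generated_video: np.ndarray) -> np.ndarray:
        return np.zeros(7)  # e.g. a 7-DoF end-effector action

def run_episode(policy, env_reset, env_step, instruction, max_steps=100):
    obs = env_reset()                               # first camera frame of the new scene
    video = generate_human_video(obs, instruction)  # stage 1: imagine a human doing the task
    for _ in range(max_steps):                      # stage 2: closed-loop robot execution
        action = policy.act(obs, video)
        obs, done = env_step(action)
        if done:
            break
    return obs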
Gen2Act decomposes overall policy learning into human video generation followed by robot execution conditioned on the generated video. It leverages a pre-trained video generation model zero-shot to generate human videos in novel scenarios. The closed-loop policy is trained with behavior cloning on a robot interaction dataset, combined with a point-track prediction loss that extracts motion information from the generated video. The video model generalizes well to new scenarios by virtue of large-scale web training, and the policy conditioned on the generated video also generalizes to tasks beyond those in the robot data, since it has the much simpler job of translating the generated video into robot actions by following its motion cues.
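One plausible way to wire up this combined objective is sketched below in PyTorch: a behavior-cloning loss on expert robot actions plus an auxiliary loss for predicting point tracks from the generated-video features. The encoder architecture, track parameterization, input shapes, and loss weight are assumptions chosen for illustration, not the exact design used in Gen2Act.

import torch
import torch.nn as nn

class VideoConditionedPolicy(nn.Module):
    def __init__(self, action_dim=7, num_tracks=32, track_len=16, feat_dim=128):
        super().__init__()
        # Toy encoders over flattened 64x64 RGB inputs; real encoders would be far richer.
        self.video_encoder = nn.Linear(3 * 64 * 64, feat_dim)
        self.obs_encoder = nn.Linear(3 * 64 * 64, feat_dim)
        self.action_head = nn.Linear(2 * feat_dim, action_dim)
        # Auxiliary head predicting 2D point tracks from the generated-video features.
        self.track_head = nn.Linear(feat_dim, num_tracks * track_len * 2)

    def forward(self, obs, video):
        # video: (B, T, 3, 64, 64) generated human video; obs: (B, 3, 64, 64) current frame.
        v = self.video_encoder(video.mean(dim=1).flatten(1))  # crude temporal pooling for this toy example
        o = self.obs_encoder(obs.flatten(1))
        action = self.action_head(torch.cat([v, o], dim=-1))
        tracks = self.track_head(v)
        return action, tracks

def training_loss(policy, obs, video, expert_action, gt_tracks, track_weight=0.5):
    # gt_tracks: reference point tracks for the video, e.g. from an off-the-shelf point tracker.
    pred_action, pred_tracks = policy(obs, video)
    bc_loss = nn.functional.mse_loss(pred_action, expert_action)            # behavior cloning
    track_loss = nn.functional.mse_loss(pred_tracks, gt_tracks.flatten(1))  # point-track prediction
    return bc_loss + track_weight * track_loss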
We show manipulation results for the long-horizon activities "cleaning the table" and "making coffee", each of which consists of several tasks. We chain Gen2Act across a task sequence by using the last image of the previous policy rollout as the first frame for generating a human video of the subsequent task.
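A rough sketch of this chaining loop is shown below, reusing the hypothetical run-time interfaces from the earlier inference sketch; the sub-task list in the usage comment is a placeholder, not the exact task decomposition used in our evaluations.

def run_activity(policy, env_reset, env_step, task_instructions, max_steps_per_task=100):
    obs = env_reset()
    for instruction in task_instructions:
        # The last frame of the previous rollout seeds video generation for the next task.
        video = generate_human_video(obs, instruction)
        for _ in range(max_steps_per_task):
            action = policy.act(obs, video)
            obs, done = env_step(action)
            if done:
                break
    return obs

# Example usage with placeholder sub-tasks of a long-horizon activity:
# run_activity(Gen2ActPolicy(), env.reset, env.step,
#              ["pick up the cup", "wipe the table", "place the cup in the sink"])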
We characterize different levels of generalization based on the robot interaction data and show robot evaluations for each level. The videos are from a third-person camera and show the robot executing different tasks; the language instruction for each task is shown at the bottom. These videos are at 1x speed.
Here we generalize to completely unseen motion types, where the tasks require moving an object with a novel motion described by the language instruction.
In OTG, we generalize to completely unseen object types. These include new categories of objects, such as scoopers, doors, and chairs, whose instances have never been seen in the robot data.
In G, we evaluate unseen object instances in seen and unseen scenes. For example, only a red mug is seen in the context of a pushing activity during training, and we generalize to pushing green and purple mugs of different shapes and textures.
This involves generalizing to unseen configurations (i.e., position and orientation variations) of seen object instances, along with mild variations in the scene such as lighting changes and camera pose changes.
The rows respectively show the generated videos, robot executions from the on-board camera, and the same executions from a third-person camera.
Some common failure cases of our approach include incorrect grasps of objects, getting stuck while trying to articulate large or heavy objects like fridge doors, and toppling smaller objects like bottles.
@article{bharadhwaj2024gen2act,
  title={Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation},
  author={Bharadhwaj, Homanga and Dwibedi, Debidatta and Gupta, Abhinav and Tulsiani, Shubham and Doersch, Carl and Xiao, Ted and Shah, Dhruv and Xia, Fei and Sadigh, Dorsa and Kirmani, Sean},
  journal={arXiv preprint arXiv:2409.16283},
  year={2024}
}