Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation

Glimpse of the diverse Manipulation Capabilities enabled by Gen2Act


Google DeepMind, Carnegie Mellon University, Stanford University

We present Gen2Act: a framework for diverse generalizable manipulation via human video generation. For solving a manipulation task in a new scene, Gen2Act first imagines how a human would perform the task through video generation with a pre-trained model, and then executes a common policy conditioned on the generated video.

Gen2Act decomposes the overall policy learning into human video generation followed by robot execution conditioned on the generated video. Gen2Act leverages a pre-trained video generation model zero-shot for generating human videos in novel scenarios. The closed-loop policy of Gen2Act is trained through behavior cloning on a robot interaction dataset, combined with a point track prediction loss that utilizes motion information from the generated video. The video model generalizes well to new scenarios by virtue of large-scale web training, and the policy conditioned on the generated video also generalizes to tasks beyond those in the robot data, since it has the much simpler job of translating the generated video into robot actions by following motion cues from the video.
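
To make this training objective concrete, below is a minimal sketch, assuming a PyTorch-style implementation; the module names, architecture details, and auxiliary-loss weight are hypothetical placeholders and not the authors' code.

```python
# Illustrative sketch of the Gen2Act training objective described above:
# behavior cloning on robot actions plus an auxiliary point-track prediction
# loss. All names and shapes here are hypothetical.
import torch
import torch.nn as nn

class Gen2ActPolicySketch(nn.Module):
    def __init__(self, feat_dim=256, action_dim=7, num_tracks=32, horizon=16):
        super().__init__()
        # Shared image encoder applied to a generated-video frame and the robot observation.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(32 * 4 * 4, feat_dim), nn.ReLU(),
        )
        # Behavior-cloning head: predicts the robot action.
        self.action_head = nn.Linear(2 * feat_dim, action_dim)
        # Auxiliary head: predicts future 2D point tracks (x, y per point per step).
        self.track_head = nn.Linear(2 * feat_dim, num_tracks * horizon * 2)
        self.num_tracks, self.horizon = num_tracks, horizon

    def forward(self, video_frame, robot_obs):
        z = torch.cat([self.encoder(video_frame), self.encoder(robot_obs)], dim=-1)
        action = self.action_head(z)
        tracks = self.track_head(z).view(-1, self.num_tracks, self.horizon, 2)
        return action, tracks

# Toy batch standing in for (generated-video frame, robot image, expert action, point tracks).
B = 4
model = Gen2ActPolicySketch()
video_frame = torch.randn(B, 3, 64, 64)
robot_obs = torch.randn(B, 3, 64, 64)
expert_action = torch.randn(B, 7)
gt_tracks = torch.randn(B, 32, 16, 2)

pred_action, pred_tracks = model(video_frame, robot_obs)
bc_loss = nn.functional.mse_loss(pred_action, expert_action)
track_loss = nn.functional.mse_loss(pred_tracks, gt_tracks)
loss = bc_loss + 0.1 * track_loss  # the auxiliary-loss weight is a guess
loss.backward()
```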

Chaining Gen2Act for Long-Horizon Activities

We show manipulation results for the long-horizon activities "cleaning the table" and "making coffee", each of which consists of several tasks. We chain Gen2Act across the task sequence by using the last image of the previous policy rollout as the first frame for generating a human video of the subsequent task, as sketched below.
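
A rough sketch of this chaining procedure, assuming hypothetical helpers generate_human_video and run_policy (not the actual Gen2Act interface):

```python
# Hedged sketch of chaining Gen2Act across a sequence of sub-tasks.
def chain_gen2act(initial_image, task_sequence, generate_human_video, run_policy):
    """Run Gen2Act on each sub-task, reusing the last observed frame as the next seed."""
    current_image = initial_image
    for instruction in task_sequence:
        # 1. Generate a human video of the sub-task, conditioned on the
        #    current scene image and the language instruction.
        human_video = generate_human_video(current_image, instruction)
        # 2. Execute the video-conditioned policy closed-loop on the robot.
        rollout_frames = run_policy(human_video, instruction)
        # 3. The last frame of this rollout seeds video generation for the
        #    next sub-task (e.g. the sub-tasks of "cleaning the table").
        current_image = rollout_frames[-1]
    return current_image
```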

Qualitative Results for Gen2Act Policy Executions

We characterize different levels of generalization based on the robot interaction data and show robot evaluations for each level. The videos are from a third person camera and show the robot executing different tasks. The language instructions for each task are shown at the bottom. These videos are at 1x speed.

Motion Type Generalization (MTG)

Here we generalize to completely unseen motion types where the tasks require moving the object with a novel motion as described by the language instruction.

Object Type Generalization (OTG)

In OTG, we generalize to completely unseen object types. These include new categories of objects, such as scoopers, doors, and chairs, whose instances have never been seen in the robot data.

Standard Generalization (G)

In G, we evaluate unseen object instances in seen and unseen scenes. For example, only a red mug is seen in the context of a pushing activity in training, and we generalize to pushing motions for green and purple mugs of different shapes and textures.

Mild Generalization (MG)

This involves generalizing to unseen configurations (i.e. position and orientation variations) of seen object instances, along with mild variations in the scene like lighting changes and camera pose changes.

Detailed Results for Video Generation and Robot Execution

The rows respectively show the generated videos, robot executions from on-board camera, and the same executions from a third-person camera.

[Video grid with rows: Initial Image, Video Gen, Executions (1st person), Executions (3rd person)]

Failures

Some common failure cases of our approach include incorrect grasps of objects, getting stuck while trying to articulate large/heavy objects like fridge doors, and toppling smaller objects like bottles.

Acknowledgements

We thank Jie Tan for feedback and guidance throughout the project. We are grateful to Peng Xu, Alex Kim, Alexander Herzog, Paul Wohlhart, Alex Irpan, Justice Carbajal, Clayton Tan for help with robot and compute infrastructures. We thank David Ross, Bryan Seybold, Xiuye Gu, and Ozgun Bursalioglu for helpful pointers regarding video generation. We enjoyed discussions with Chen Wang, Jason Ma, Laura Smith, Danny Driess, Soroush Nasiriany, Coline Devin, Keerthana Gopalakrishnan, and Joey Hejna that were helpful for the project. Finally, we thank Jacky Liang and Carolina Parada for feedback on the paper.