A Glimpse of the Diverse Manipulation Capabilities Enabled by Gen2Act
We present Gen2Act: a framework for diverse, generalizable manipulation via human video generation. To solve a manipulation task in a new scene, Gen2Act first imagines how a human would perform the task by generating a video with a pre-trained model, and then executes a single learned policy conditioned on the generated video.
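As a concrete illustration of this two-stage flow, below is a minimal Python sketch of how inference could be organized. All names (generate_human_video, Gen2ActPolicy, env_reset, env_step) are hypothetical placeholders for illustration, not the released implementation.

import numpy as np

def generate_human_video(first_frame: np.ndarray, instruction: str) -> np.ndarray:
    """Stand-in for a pre-trained, language- and image-conditioned video model
    used zero-shot to imagine a human performing the task in the current scene."""
    return np.repeat(first_frame[None], 16, axis=0)  # (T, H, W, 3) generated frames

class Gen2ActPolicy:
    """Stand-in for the closed-loop policy conditioned on the generated human video."""
    def act(self, observation: np.ndarray, generated_video: np.ndarray) -> np.ndarray:
        return np.zeros(7)  # e.g. a 7-DoF end-effector action

def run_episode(policy, env_reset, env_step, instruction, max_steps=100):
    obs = env_reset()                               # first camera frame of the new scene
    video = generate_human_video(obs, instruction)  # stage 1: imagine a human doing the task
    for _ in range(max_steps):                      # stage 2: closed-loop robot execution
        action = policy.act(obs, video)
        obs, done = env_step(action)
        if done:
            break
    return obs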
Gen2Act decomposes overall policy learning into human video generation followed by robot execution conditioned on the generated video. It leverages a pre-trained video generation model zero-shot to generate human videos in novel scenarios. The closed-loop policy is trained with behavior cloning on a robot interaction dataset, combined with a point-track prediction loss that extracts motion information from the generated video. The video model generalizes well to new scenarios by virtue of large-scale web training, and the policy conditioned on the generated video also generalizes to tasks beyond those in the robot data, since it has the much simpler job of translating the generated video into robot actions by following its motion cues.
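One plausible way to wire up this combined objective is sketched below in PyTorch: a behavior-cloning loss on expert robot actions plus an auxiliary loss for predicting point tracks from the generated-video features. The encoder architecture, track parameterization, input shapes, and loss weight are assumptions chosen for illustration, not the exact design used in Gen2Act.

import torch
import torch.nn as nn

class VideoConditionedPolicy(nn.Module):
    def __init__(self, action_dim=7, num_tracks=32, track_len=16, feat_dim=128):
        super().__init__()
        # Toy encoders over flattened 64x64 RGB inputs; real encoders would be far richer.
        self.video_encoder = nn.Linear(3 * 64 * 64, feat_dim)
        self.obs_encoder = nn.Linear(3 * 64 * 64, feat_dim)
        self.action_head = nn.Linear(2 * feat_dim, action_dim)
        # Auxiliary head predicting 2D point tracks from the generated-video features.
        self.track_head = nn.Linear(feat_dim, num_tracks * track_len * 2)

    def forward(self, obs, video):
        # video: (B, T, 3, 64, 64) generated human video; obs: (B, 3, 64, 64) current frame.
        v = self.video_encoder(video.mean(dim=1).flatten(1))  # crude temporal pooling for this toy example
        o = self.obs_encoder(obs.flatten(1))
        action = self.action_head(torch.cat([v, o], dim=-1))
        tracks = self.track_head(v)
        return action, tracks

def training_loss(policy, obs, video, expert_action, gt_tracks, track_weight=0.5):
    # gt_tracks: reference point tracks for the video, e.g. from an off-the-shelf point tracker.
    pred_action, pred_tracks = policy(obs, video)
    bc_loss = nn.functional.mse_loss(pred_action, expert_action)            # behavior cloning
    track_loss = nn.functional.mse_loss(pred_tracks, gt_tracks.flatten(1))  # point-track prediction
    return bc_loss + track_weight * track_loss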
We show manipulation results for the long-horizon activities "cleaning the table" and "making coffee", each of which consists of several tasks. We chain Gen2Act across a task sequence by using the last image of the previous policy rollout as the first frame for generating a human video of the subsequent task.
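A rough sketch of this chaining loop is shown below, reusing the hypothetical run-time interfaces from the earlier inference sketch; the sub-task list in the usage comment is a placeholder, not the exact task decomposition used in our evaluations.

def run_activity(policy, env_reset, env_step, task_instructions, max_steps_per_task=100):
    obs = env_reset()
    for instruction in task_instructions:
        # The last frame of the previous rollout seeds video generation for the next task.
        video = generate_human_video(obs, instruction)
        for _ in range(max_steps_per_task):
            action = policy.act(obs, video)
            obs, done = env_step(action)
            if done:
                break
    return obs

# Example usage with placeholder sub-tasks of a long-horizon activity:
# run_activity(Gen2ActPolicy(), env.reset, env.step,
#              ["pick up the cup", "wipe the table", "place the cup in the sink"])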
We characterize different levels of generalization based on the robot interaction data and show robot evaluations for each level. The videos are from a third-person camera and show the robot executing different tasks; the language instruction for each task is shown at the bottom. These videos are at 1x speed.
Here we generalize to completely unseen motion types, where the tasks require moving an object with a novel motion described by the language instruction.
In OTG, we generalize to completely unseen object types. These include new categories of objects, such as scoopers, doors, and chairs, whose instances have never been seen in the robot data.
In G, we evaluate unseen object instances in seen and unseen scenes. For example, only a red mug is seen in the context of a pushing activity during training, and we generalize to pushing green and purple mugs of different shapes and textures.
This involves generalizing to unseen configurations (i.e., position and orientation variations) of seen object instances, along with mild variations in the scene such as lighting changes and camera pose changes.
The rows respectively show the generated videos, robot executions from the on-board camera, and the same executions from a third-person camera.
Some common failure cases of our approach include incorrect grasps of objects, getting stuck while trying to articulate large or heavy objects like fridge doors, and toppling smaller objects like bottles.
@article{bharadhwaj2024gen2act,
  title={Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation},
  author={Bharadhwaj, Homanga and Dwibedi, Debidatta and Gupta, Abhinav and Tulsiani, Shubham and Doersch, Carl and Xiao, Ted and Shah, Dhruv and Xia, Fei and Sadigh, Dorsa and Kirmani, Sean},
  journal={arXiv preprint arXiv:2409.16283},
  year={2024}
}