Track2Act: Predicting Point Tracks from Internet Videos Enables Diverse Zero-shot Manipulation

Glimpse of the diverse Manipulation Capabilities enabled by Track2Act


Carnegie Mellon University, University of Washington, Meta

We present Track2Act: a framework for generalizable zero-shot manipulation. Track2Act learns a common goal-conditioned policy that can be deployed zero-shot (without any test-time adaptation) across a wide variety of manipulation tasks and environments.

Track2Act decomposes overall policy learning into visual track prediction followed by residual correction that refines the resulting open-loop plan. Trained on large datasets of passive web videos (for visual track prediction) and a small dataset of robot interactions (for the residual policy), Track2Act exhibits diverse manipulation capabilities (beyond picking/pushing, including articulated object manipulation and object re-orientation) over 30 tasks with a common goal-conditioned policy, and generalizes to diverse unseen scenarios (involving unseen objects, unseen tasks, and completely unseen kitchens and offices).
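To make the decomposition concrete, below is a minimal Python sketch (not the authors' released code) of the two stages: a track-prediction model (trained on web videos) predicts 2D point tracks from an initial image and a goal image, the tracks are converted into an open-loop plan by fitting rigid transforms between consecutive sets of tracked points, and a residual policy (trained on a small robot dataset) corrects each planned step. TrackPredictor, ResidualPolicy, obs_fn, and act_fn are hypothetical stand-ins introduced only for illustration.

import numpy as np

class TrackPredictor:
    """Hypothetical stand-in for the web-video-trained model that predicts
    future 2D locations of P query points over `horizon` steps, conditioned
    on an initial image and a goal image."""
    def predict(self, init_img, goal_img, query_points, horizon):
        # Dummy output of shape (horizon, P, 2): points stay where they are.
        return np.repeat(query_points[None], horizon, axis=0)

class ResidualPolicy:
    """Hypothetical stand-in for the robot-data-trained policy that outputs
    a correction to the open-loop action given the current observation."""
    def correct(self, obs, planned_action):
        return planned_action  # dummy: zero residual

def fit_rigid_transform(src, dst):
    """Least-squares 2D rigid transform (rotation + translation) between
    corresponding point sets src -> dst, via SVD (Kabsch / Procrustes)."""
    src_c, dst_c = src - src.mean(0), dst - dst.mean(0)
    U, _, Vt = np.linalg.svd(src_c.T @ dst_c)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:   # guard against a reflection
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = dst.mean(0) - R @ src.mean(0)
    T = np.eye(3)
    T[:2, :2], T[:2, 2] = R, t
    return T                   # 3x3 homogeneous transform

def rollout(init_img, goal_img, query_points, obs_fn, act_fn, horizon=8):
    """Open-loop plan from predicted tracks, refined step by step by the residual policy."""
    tracks = TrackPredictor().predict(init_img, goal_img, query_points, horizon)
    residual = ResidualPolicy()
    for k in range(1, horizon):
        # Open-loop step: rigid motion explaining how the tracked points move
        # from step k-1 to step k.
        planned = fit_rigid_transform(tracks[k - 1], tracks[k])
        # Residual correction conditioned on the current robot observation.
        action = residual.correct(obs_fn(), planned.ravel())
        act_fn(action)

In the actual system the predicted 2D tracks would be lifted to end-effector motion (e.g., using depth and camera calibration); this sketch only illustrates the control flow of the track-prediction-plus-residual decomposition.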

Glimpse of Manipulation Behaviors

We show manipulation results for different kitchen/office tasks, illustrating navigation (teleoperated) followed by manipulation (autonomous) of different objects in their natural scenes. Autonomous manipulation uses a single goal-conditioned policy, with subsequent goals shown as an inset in the top right.

Qualitative Results for Residual Policy Executions

We characterize different levels of generalization relative to the small amount of robot training data and show robot evaluations for each level. The videos are from a third-person camera and show the robot executing different tasks. The corresponding goal images are shown as an inset at the top right.

Type Generalization (TG)

Here we generalize to completely unseen object types in unseen scenes, for example jackets, microwaves, waffle makers, armoires, and refrigerators.

Compositional Generalization (CG)

In CG, we generalize to unseen activity-object type combinations. For example, the robot data includes opening a room's door but not closing one; similarly, we evaluate closing a cabinet door, flipping a spice box closed, closing a trash box, etc.

Standard Generalization (G)

In G, we evaluate unseen object instances in seen and unseen scenes. For example, only a red mug is seen in the context of the pushing activity during training, and we generalize pushing motions to green and purple mugs of different shapes and textures.

Mild Generalization (MG)

This involves generalizing to unseen configurations (i.e., position and orientation variations) of seen object instances, along with mild variations in the scene such as lighting changes and camera pose changes.

Detailed Results for Track Prediction and Robot Execution

We show results categorized by the different levels of generalization. Note that the generalization level refers to robot execution as described in the paper; the track prediction model is trained only on web videos. The videos and images are from the robot's on-board RealSense camera.

Mild Generalization (MG)

[Image/video grid: Initial Image, Goal Image, Predictions, Executions]

Standard Generalization (G)

[Image/video grid: Initial Image, Goal Image, Predictions, Executions]

Compositional Generalization (CG)

[Image/video grid: Initial Image, Goal Image, Predictions, Executions]

Type Generalization (TG)

[Image/video grid: Initial Image, Goal Image, Predictions, Executions]

Failures

Some common failure cases of our approach include incorrect grasps of objects, getting stuck while trying to articulate large/heavy objects like fridge doors, and toppling smaller objects like bottles.

Acknowledgements

We thank Yufei Ye, Himangi Mittal, Devendra Chaplot, Jason Zhang, Abitha Thankaraj, Tarasha Khurana, Akash Sharma, Sally Chen, Jay Vakil, Chen Bao, Unnat Jain, Swaminathan Gurumurthy for helpful discussions and feedback. We thank Carl Doersch and Nikita Karaev for insightful discussions about point tracking. This research was partially supported by a Google gift award.
BibTeX

@misc{bharadhwaj2024track2act,
      title={Track2Act: Predicting Point Tracks from Internet Videos enables Diverse Zero-shot Robot Manipulation}, 
      author={Homanga Bharadhwaj and Roozbeh Mottaghi and Abhinav Gupta and Shubham Tulsiani},
      year={2024},
      eprint={2405.01527},
      archivePrefix={arXiv},
      primaryClass={cs.RO}
}