Track2Act: Predicting Point Tracks from Internet Videos Enables Diverse Zero-shot Manipulation

Glimpse of the diverse Manipulation Capabilities enabled by Track2Act


Carnegie Mellon University, University of Washington, Meta

We present Track2Act: a framework for generalizable zero-shot manipulation. Track2Act learns a common goal-conditioned policy that can be deployed zero-shot (without any test-time adaptation) across a wide variety of manipulation tasks and environments.

Track2Act decomposes overall policy learning into visual track prediction followed by residual correction that refines the resulting open-loop plan. Trained on large datasets of passive web videos (for visual track prediction) and a small dataset of robot interactions (for the residual policy), Track2Act exhibits diverse manipulation capabilities (beyond picking/pushing, including articulated object manipulation and object re-orientation) over 30 tasks with a common goal-conditioned policy, and generalizes to diverse unseen scenarios (involving unseen objects, unseen tasks, and completely unseen kitchens and offices).
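To make the decomposition concrete, below is a minimal Python sketch (not the authors' released code) of the two stages: a track-prediction model (trained on web videos) predicts 2D point tracks from an initial image and a goal image, the tracks are converted into an open-loop plan by fitting rigid transforms between consecutive sets of tracked points, and a residual policy (trained on a small robot dataset) corrects each planned step. TrackPredictor, ResidualPolicy, obs_fn, and act_fn are hypothetical stand-ins introduced only for illustration.

import numpy as np

class TrackPredictor:
    """Hypothetical stand-in for the web-video-trained model that predicts
    future 2D locations of P query points over `horizon` steps, conditioned
    on an initial image and a goal image."""
    def predict(self, init_img, goal_img, query_points, horizon):
        # Dummy output of shape (horizon, P, 2): points stay where they are.
        return np.repeat(query_points[None], horizon, axis=0)

class ResidualPolicy:
    """Hypothetical stand-in for the robot-data-trained policy that outputs
    a correction to the open-loop action given the current observation."""
    def correct(self, obs, planned_action):
        return planned_action  # dummy: zero residual

def fit_rigid_transform(src, dst):
    """Least-squares 2D rigid transform (rotation + translation) between
    corresponding point sets src -> dst, via SVD (Kabsch / Procrustes)."""
    src_c, dst_c = src - src.mean(0), dst - dst.mean(0)
    U, _, Vt = np.linalg.svd(src_c.T @ dst_c)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:   # guard against a reflection
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = dst.mean(0) - R @ src.mean(0)
    T = np.eye(3)
    T[:2, :2], T[:2, 2] = R, t
    return T                   # 3x3 homogeneous transform

def rollout(init_img, goal_img, query_points, obs_fn, act_fn, horizon=8):
    """Open-loop plan from predicted tracks, refined step by step by the residual policy."""
    tracks = TrackPredictor().predict(init_img, goal_img, query_points, horizon)
    residual = ResidualPolicy()
    for k in range(1, horizon):
        # Open-loop step: rigid motion explaining how the tracked points move
        # from step k-1 to step k.
        planned = fit_rigid_transform(tracks[k - 1], tracks[k])
        # Residual correction conditioned on the current robot observation.
        action = residual.correct(obs_fn(), planned.ravel())
        act_fn(action)

In the actual system the predicted 2D tracks would be lifted to end-effector motion (e.g., using depth and camera calibration); this sketch only illustrates the control flow of the track-prediction-plus-residual decomposition.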

Glimpse of Manipulation Behaviors

We show manipulation results for different kitchen/office tasks, illustrating navigation (teleoperated) followed by manipulation (autonomous) of different objects in their natural scenes. Autonomous manipulation uses a single goal-conditioned policy, with subsequent goals shown as an inset in the top right.

Qualitative Results for Residual Policy Executions

We characterize different levels of generalization relative to the small amount of robot training data and show robot evaluations for each level. The videos are from a third-person camera and show the robot executing different tasks. The corresponding goal images are shown as an inset at the top right.

Type Generalization (TG)

Here we generalize to completely unseen object types in unseen scenes, for example jackets, microwaves, waffle makers, armoires, and refrigerators.

Compositional Generalization (CG)

In CG, we generalize to unseen activity-object type combinations. For example, the robot data includes opening a room's door but not closing one; similarly, we evaluate closing a cabinet door, flipping a spice box closed, closing a trash box, etc.

Standard Generalization (G)

In G, we evaluate unseen object instances in seen and unseen scenes. For example, only a red mug is seen in the context of the pushing activity during training, and we generalize pushing motions to green and purple mugs of different shapes and textures.

Mild Generalization (MG)

This involves generalizing to unseen configurations (i.e., position and orientation variations) of seen object instances, along with mild variations in the scene such as lighting changes and camera pose changes.

Detailed Results for Track Prediction and Robot Execution

We show results categorized by the different levels of generalization. Note that the generalization level refers to robot execution as described in the paper; the track prediction model is trained only on web videos. The videos and images are from the robot's on-board RealSense camera.

Mild Generalization (MG)

[Image/video grid: Initial Image, Goal Image, Predictions, Executions]

Standard Generalization (G)

[Image/video grid: Initial Image, Goal Image, Predictions, Executions]

Compositional Generalization (CG)

[Image/video grid: Initial Image, Goal Image, Predictions, Executions]

Type Generalization (TG)

[Image/video grid: Initial Image, Goal Image, Predictions, Executions]

Failures

Some common failure cases of our approach include incorrect grasps of objects, getting stuck while trying to articulate large/heavy objects like fridge doors, and toppling smaller objects like bottles.

Acknowledgements

We thank Yufei Ye, Himangi Mittal, Devendra Chaplot, Jason Zhang, Abitha Thankaraj, Tarasha Khurana, Akash Sharma, Sally Chen, Jay Vakil, Chen Bao, Unnat Jain, Swaminathan Gurumurthy for helpful discussions and feedback. We thank Carl Doersch and Nikita Karaev for insightful discussions about point tracking. This research was partially supported by a Google gift award.
BibTeX

@misc{bharadhwaj2024track2act,
      title={Track2Act: Predicting Point Tracks from Internet Videos enables Diverse Zero-shot Robot Manipulation}, 
      author={Homanga Bharadhwaj and Roozbeh Mottaghi and Abhinav Gupta and Shubham Tulsiani},
      year={2024},
      eprint={2405.01527},
      archivePrefix={arXiv},
      primaryClass={cs.RO}
}