A Glimpse of the Diverse Manipulation Capabilities Enabled by Track2Act
We present Track2Act: a framework for diverse generalizable manipulation. Track2Act learns a common goal-conditioned policy that can be deployed directly (without any test-time adaptation) across a wide variety of manipulation tasks and environments.
Track2Act decomposes the overall policy learning into visual track prediction followed by residual correction that refines the resulting open-loop plan. Trained with large datasets of passive web videos (for visual track prediction) and a small dataset of robot interactions (for the residual policy), Track2Act exhibits diverse manipulation capabilities (beyond picking/pushing, including articulated object manipulation and object re-orientation) over 30 tasks with a common goal-conditioned policy, and generalizes to diverse unseen scenarios (involving unseen objects, unseen tasks, and completely unseen kitchens and offices).
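A minimal sketch of this two-stage decomposition is shown below. The model interfaces (track_predictor, residual_policy, robot) and the conversion of predicted tracks into rigid end-effector transforms are illustrative assumptions for exposition, not the authors' actual API; the sketch assumes tracks have already been lifted to 3D (e.g., using depth).

```python
# Hedged sketch of Track2Act-style control, assuming hypothetical interfaces:
#   track_predictor(image, goal_image) -> (T, N, 3) predicted 3D point tracks
#   residual_policy(image, goal_image, pose) -> 4x4 pose correction
#   robot.get_obs(), robot.execute(pose_4x4)
import numpy as np


def fit_rigid_transform(p_src: np.ndarray, p_tgt: np.ndarray) -> np.ndarray:
    """Least-squares rigid transform (Kabsch) mapping p_src -> p_tgt, both (N, 3)."""
    c_src, c_tgt = p_src.mean(0), p_tgt.mean(0)
    H = (p_src - c_src).T @ (p_tgt - c_tgt)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # correct a reflection, if any
        Vt[-1] *= -1
        R = Vt.T @ U.T
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, c_tgt - R @ c_src
    return T


def open_loop_plan(tracks_3d: np.ndarray) -> list:
    """Turn predicted point tracks (T, N, 3) into per-step rigid transforms
    relative to the first frame, i.e., an open-loop end-effector plan."""
    return [fit_rigid_transform(tracks_3d[0], tracks_3d[t])
            for t in range(1, tracks_3d.shape[0])]


def rollout(track_predictor, residual_policy, robot, obs0, goal_img, horizon):
    """Execute the open-loop plan from predicted tracks, refining each step
    with a residual correction predicted from the current observation."""
    tracks = track_predictor(obs0["image"], goal_img)
    plan = open_loop_plan(tracks)
    for t in range(min(horizon, len(plan))):
        obs = robot.get_obs()
        delta = residual_policy(obs["image"], goal_img, plan[t])  # small pose correction
        robot.execute(plan[t] @ delta)                            # refined target pose
```

The point of the sketch is the division of labor: track prediction (trainable from web videos alone) supplies a coarse open-loop plan, while the residual policy (trained on a small amount of robot data) only has to correct small errors at execution time.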
We show manipulation results for different kitchen/office tasks illustrating navigation (teleoperated) followed by manipulation (autonomous) of different objects in their natural scenes. The autonomous manipulation uses a single goal-conditioned policy, with the subsequent goals shown as an inset in the top right.
We characterize different levels of generalization based on the small amount of robot training data and show robot evaluations for each level. The videos are from a third-person camera and show the robot executing different tasks. The corresponding goal images are shown as an inset at the top right.
Here we generalize to completely unseen object types in unseen scenes, for example jackets, microwaves, waffle makers, armoires, and refrigerators.
In CG, we generalize to unseen activity-object type combinations. For example, the robot data contains opening a room's door but not closing one. Other unseen combinations include closing a cabinet door, flipping a spice box closed, and closing a trash box.
In G, we evaluate unseen object instances in seen and unseen scenes. For example, only a red mug is seen in the context of the pushing activity during training, and we generalize to pushing green and purple mugs of different shapes and textures.
This involves generalizing to unseen configurations (i.e., position and orientation variations) of seen object instances, along with mild variations in the scene such as lighting and camera pose changes.
We show results categorized by different levels of generalization. Note that the generalization level refers to robot execution as described in the paper; the track prediction model is trained only on web videos. The videos and images are from the robot's on-board RealSense camera.
Some common failure cases of our approach include incorrect grasps of objects, getting stuck while articulating large/heavy objects such as fridge doors, and toppling smaller objects such as bottles.
@inproceedings{bharadhwaj2024track2act,
  author    = {Homanga Bharadhwaj and Roozbeh Mottaghi and Abhinav Gupta and Shubham Tulsiani},
  title     = {Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2024}
}