KALM: Keypoint Abstraction using Large Models
for Object-Relative Imitation Learning

Massachusetts Institute of Technology, National Taiwan University


KALM distills keypoint abstractions by prompting large pre-trained models for keypoint proposals and verifying them against a small amount of robot demonstration data; the distilled keypoints are then used to train a keypoint-conditioned policy model.

KALM generalizes effectively across varying object poses, camera views, and object instances
with only 10 demonstrations.

Abstract

Generalization to novel object configurations and instances across diverse tasks and environments is a critical challenge in robotics. Keypoint-based representations have proven effective as a succinct representation for capturing essential object features, and for establishing a reference frame in action prediction, enabling data-efficient learning of robot skills. However, their reliance on manual design and additional human labels limits their scalability. In this paper, we propose KALM, a framework that leverages large pre-trained vision-language models (LMs) to automatically generate task-relevant and cross-instance consistent keypoints. KALM distills robust and consistent keypoints across views and objects by generating proposals using LMs and verifying them against a small set of robot demonstration data. Based on the generated keypoints, we can train keypoint-conditioned policy models that predict actions in keypoint-centric frames, enabling robots to generalize effectively across varying object poses, camera views, and object instances with similar functional shapes. Our method demonstrates strong performance in the real world, adapting to different tasks and environments from only a handful of demonstrations while requiring no additional labels.

KALM Overview

[Figure: KALM framework overview]

(a) Given a demonstration video and a task description, we prompt a VLM to generate a coarse-grained region proposal, which is refined into a fine-grained point set via image segmentation models and VLMs. We use a keypoint detection function \(\phi\) to identify keypoint correspondences across a handful of demonstration trajectories. The final keypoint set is selected by verifying correspondence consistency. These keypoints are used to train a keypoint-conditioned action model. (b) Given a new scene, the keypoint detection function \(\phi\) localizes the distilled keypoints. The learned keypoint-conditioned action prediction model then generates an object-relative end-effector trajectory based on the keypoint positions and features.
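The verification step in (a) amounts to scoring each candidate point by how consistently the keypoint detection function \(\phi\) re-localizes it across demonstrations. Below is a minimal Python sketch of this distillation loop and of inference in (b); all helper names (propose_region_vlm, refine_points, detect_correspondence, consistency_score) and data structures are hypothetical placeholders for illustration, not the released KALM code.

import numpy as np

def distill_keypoints(demos, task_description, num_keypoints=8):
    """Select candidate points whose correspondences are consistent across demos.

    `demos` is a list of demonstration trajectories; each exposes its first
    camera frame. All helper functions below are assumed interfaces.
    """
    reference = demos[0].first_frame

    # (1) Coarse-grained region proposal from a VLM, refined into a
    #     fine-grained candidate point set with segmentation models.
    region = propose_region_vlm(reference, task_description)
    candidates = refine_points(reference, region)          # (N, 2) pixel coordinates

    # (2) Verify each candidate: check how consistently the keypoint
    #     detection function phi localizes it in the other demonstrations.
    scores = []
    for point in candidates:
        matches = [detect_correspondence(reference, point, demo.first_frame)
                   for demo in demos[1:]]
        scores.append(consistency_score(matches))

    # (3) Keep the candidates with the most consistent correspondences.
    order = np.argsort(scores)[::-1]
    return [candidates[i] for i in order[:num_keypoints]]

def act(policy, reference, keypoints, new_scene):
    """Inference: localize the distilled keypoints in the new scene, then
    predict an object-relative end-effector trajectory conditioned on them."""
    detected = [detect_correspondence(reference, kp, new_scene.frame)
                for kp in keypoints]
    return policy.predict_trajectory(detected, new_scene.features)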

Simulation Result

We evaluate our method, KALM, on five manipulation tasks in the Meta-World simulator, using Diffuser and 3D Diffuser Actor as baselines, along with ablation studies on keypoint usage. Our method consistently outperforms these baselines, demonstrating that keypoints serve as an effective abstraction.

[Figure: Simulation results on five Meta-World tasks]

Data Efficiency

We measure the average success rate across all 5 tasks, with the number of demonstrations increasing from 10 to 500. Our method, KALM, demonstrates superior data efficiency compared to all baselines.

[Figure: Average success rate vs. number of demonstrations]

Real-world Result

We evaluate all models under varying camera and object poses (View), as well as on unseen objects (Cross obj.). Our method performs strongly under view changes and generalizes to objects not seen during training of the diffusion policy model.

[Figure: Real-world results]

BibTeX


@article{fang2024kalm,
  title={Keypoint Abstraction using Large Models for Object-Relative Imitation Learning},
  author={Xiaolin Fang and Bo-Ruei Huang and Jiayuan Mao and Jasmine Shone and Joshua B. Tenenbaum and Tomás Lozano-Pérez and Leslie Pack Kaelbling},
  journal={arXiv:2410.},
  year={2024}
}