TAPTR: Tracking Any Point with TRansformers as Detection



SCUT, HKUST, Tsinghua University, IDEA Research
Corresponding author
This work was done while Hongyang Li, Hao Zhang, Shilong Liu, and Feng Li were interns at IDEA.

TAPTR in video editing and trajectory prediction.


Abstract

In this paper, we propose a simple and strong framework for Tracking Any Point with TRansformers (TAPTR). Based on the observation that point tracking bears a great resemblance to object detection and tracking, we borrow designs from DETR-like algorithms to address the task of TAP. In the proposed framework, each tracking point in each frame is represented as a DETR query, which consists of a positional part and a content part. As in DETR, each query (its position and content feature) is updated layer by layer, and its visibility is predicted from its updated content feature. Queries belonging to the same tracking point can exchange information through temporal self-attention. Since all of these operations are already well designed in DETR-like algorithms, the model is conceptually very simple. We also adopt useful designs such as the cost volume from optical flow models, and we develop simple designs to mitigate the feature drifting issue. Our framework achieves state-of-the-art performance on various TAP datasets with faster inference speed.

Method

Comparison with previous methods

Inspired by the detection transformer (DETR), we find that point tracking bears a great resemblance to object detection and tracking. In particular, tracking points can essentially be regarded as queries, which have been extensively studied in DETR-like algorithms. Building on this well-studied framework makes TAPTR conceptually simple yet strong in performance.
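The query representation and temporal self-attention described above can be illustrated with a minimal pure-Python sketch. It is not the TAPTR implementation: the `PointQuery` class and `temporal_self_attention` function are hypothetical names, and the attention uses identity projections for queries, keys, and values, whereas a real transformer layer learns them.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PointQuery:
    """One query per (tracking point, frame). Hypothetical illustration."""
    xy: Tuple[float, float]   # positional part: (x, y) location in the frame
    content: List[float]      # content part: a feature vector


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_self_attention(track: List[PointQuery]) -> List[PointQuery]:
    """Mix content features among the queries of ONE tracking point across frames.

    Assumption: identity Q/K/V projections and a single head, for brevity.
    Positions are left untouched; only the content part exchanges information.
    """
    dim = len(track[0].content)
    scale = 1.0 / math.sqrt(dim)
    updated = []
    for query in track:
        # Scaled dot-product similarity between this frame and every frame.
        scores = [
            scale * sum(a * b for a, b in zip(query.content, key.content))
            for key in track
        ]
        weights = softmax(scores)
        # Weighted average of the content features of all frames.
        mixed = [
            sum(w * key.content[i] for w, key in zip(weights, track))
            for i in range(dim)
        ]
        updated.append(PointQuery(xy=query.xy, content=mixed))
    return updated
```

In this sketch, `track` holds the queries of a single tracking point, one per frame, so attention runs along the time axis only. Each output content vector is a convex combination of the input content vectors of that point.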

TAPTR Network Architecture

The video preparation and query preparation parts provide the multi-scale feature maps, the point queries, and the cost volumes for the point decoder. The point decoder takes these elements as input and processes all frames in parallel. The outputs of the point decoder are sent to our window post-processing module, which uses them to update the states of the tracking points that the point queries belong to.
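As a rough illustration of the cost volume that feeds the point decoder, the sketch below correlates one query feature with every location of a single-frame feature map. It assumes a plain dot-product correlation in the style of optical flow models; the function names and the nested-list feature layout are hypothetical and are not taken from the TAPTR code.

```python
from typing import List, Tuple

FeatureMap = List[List[List[float]]]  # indexed as [row][col][channel]


def cost_volume(query_feat: List[float], feature_map: FeatureMap) -> List[List[float]]:
    """Dot-product similarity between one query feature and every pixel feature.

    Assumption: plain (unnormalized) correlation on a single scale.
    """
    return [
        [sum(q * f for q, f in zip(query_feat, pixel)) for pixel in row]
        for row in feature_map
    ]


def best_match(cost: List[List[float]]) -> Tuple[int, int]:
    """Return the (x, y) location with the highest similarity."""
    best_xy, best_val = (0, 0), cost[0][0]
    for y, row in enumerate(cost):
        for x, value in enumerate(row):
            if value > best_val:
                best_xy, best_val = (x, y), value
    return best_xy
```

Computing one such map per frame (and per scale) gives the decoder a similarity signal to consult when refining each query's position. The hard `best_match` lookup here is only for illustration; it is not how the decoder consumes the cost volume.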

Performance

We evaluate TAPTR on the TAP-Vid benchmark. As shown in the table, TAPTR outperforms previous SoTA methods on the majority of metrics while maintaining its advantage in inference speed. To compare the tracking speed of different methods fairly, we report the Point Per Second (PPS), which is the average number of points that a tracker can track across the entire video per second, measured on the DAVIS dataset in the "First" mode.
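One plausible reading of the PPS definition is sketched below: for each video, divide the number of tracked points by the wall-clock time needed to track them through the whole video, then average over videos. The per-video averaging and the function name `points_per_second` are assumptions; the text above does not specify how the average is taken.

```python
from typing import List, Tuple


def points_per_second(runs: List[Tuple[int, float]]) -> float:
    """Average tracking throughput over a set of videos.

    runs: one (num_points, seconds) pair per video, where `seconds` is the
    time taken to track all `num_points` across the entire video.
    Assumption: PPS is the mean of the per-video rates.
    """
    if not runs:
        raise ValueError("need at least one (num_points, seconds) measurement")
    rates = []
    for num_points, seconds in runs:
        if seconds <= 0:
            raise ValueError("elapsed time must be positive")
        rates.append(num_points / seconds)
    return sum(rates) / len(rates)


# Example with made-up timings: 100 points in 2.0 s and 50 points in 0.5 s.
example_pps = points_per_second([(100, 2.0), (50, 0.5)])  # (50 + 100) / 2 = 75.0
```

The timings in the example are invented for illustration and are not measurements of any tracker.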

BibTeX


        @misc{li2024taptr,
          title={TAPTR: Tracking Any Point with Transformers as Detection},
          author={Hongyang Li and Hao Zhang and Shilong Liu and Zhaoyang Zeng and Tianhe Ren and Feng Li and Lei Zhang},
          publisher={arXiv:2403.13042},
          year={2024},
        }