Compared with traditional RGB-only visual tracking, few datasets have been constructed for RGB-D tracking. In this paper, we propose ARKitTrack, a new RGB-D tracking dataset for both static and dynamic scenes captured by the consumer-grade LiDAR scanners built into Apple's iPhone and iPad.
ARKitTrack contains 300 RGB-D sequences, 455 targets, and 229.7K video frames in total. Along with the bounding box annotations and frame-level attributes, we also annotate this dataset with 123.9K pixel-level target masks. In addition, the camera intrinsics and camera pose of each frame are provided for future development. To demonstrate the potential usefulness of this dataset, we further present a unified baseline for both box-level and pixel-level tracking, which integrates RGB features with bird's-eye-view representations to better explore cross-modality 3D geometry.
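As an illustration of how the provided per-frame intrinsics and depth can be used, the snippet below lifts a depth map into a camera-space point cloud with the standard pinhole model. The function name and parameter layout (fx, fy, cx, cy) are our own assumptions for this sketch, not part of any released toolkit.

```python
# Minimal sketch (assumed names, not a dataset API): back-project a depth frame
# into a 3D point cloud using the per-frame camera intrinsics.
import numpy as np

def depth_to_points(depth: np.ndarray, fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Back-project an H x W depth map (in meters) into an (H*W, 3) point cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel grid, each of shape (H, W)
    z = depth
    x = (u - cx) / fx * z  # pinhole model: X = (u - cx) * Z / fx
    y = (v - cy) / fy * z  # pinhole model: Y = (v - cy) * Z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```

Combined with the provided camera pose, such points can further be transformed into world coordinates.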
In-depth empirical analysis verifies that the ARKitTrack dataset can significantly facilitate RGB-D tracking and that the proposed baseline method compares favorably against state-of-the-art methods.
Figure 2. Overview of the proposed unified RGB-D tracking pipeline for both box-level (VOT) and pixel-level (VOS) tracking.
Our method proceeds as follows. We first use a ViT model to extract the image feature map I, which is projected into 3D space to generate the BEV feature map B according to the input depth, pixel coordinates, and camera intrinsics. The BEV feature is processed in pillar format using BEV pooling and conv layers. The processed BEV feature is then back-projected to the 2D image space, producing I_BEV. The image feature I and the BEV feature I_BEV are fused via concatenation and conv layers to produce the final feature map. The image-BEV cross-view (re-)projection mainly follows LSS. Our framework can be applied to both RGB-D VOT and VOS, yielding superior performance even with naive output heads. More details can be found in our CVPR 2023 paper.
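For clarity, below is a minimal PyTorch sketch of the lift, pool, back-project, and fuse path described above. The BEV grid ranges, the average-based BEV pooling, and all module and parameter names are assumptions made for illustration only; the released implementation follows LSS and differs in detail (e.g., the Gaussian depth distribution ablated in Table 3 is omitted here).

```python
# Illustrative sketch of image-BEV cross-view projection and fusion (assumed
# shapes and grid settings, not the released implementation).
import torch
import torch.nn as nn

class ImageBEVFusion(nn.Module):
    def __init__(self, c: int, bev_size: int = 128, x_range=(-4.0, 4.0), z_range=(0.1, 8.1)):
        super().__init__()
        self.bev_size, self.x_range, self.z_range = bev_size, x_range, z_range
        self.bev_conv = nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.ReLU())
        self.fuse = nn.Sequential(nn.Conv2d(2 * c, c, 3, padding=1), nn.ReLU())

    def forward(self, feat, depth, K):
        # feat: (C, H, W) image features; depth: (H, W) in meters; K: 3x3 intrinsics
        c, h, w = feat.shape
        u, _ = torch.meshgrid(torch.arange(w), torch.arange(h), indexing="xy")
        x = (u - K[0, 2]) / K[0, 0] * depth  # lift pixels to camera-space X
        z = depth
        # Quantize (x, z) into BEV grid cells (pillars)
        xi = ((x - self.x_range[0]) / (self.x_range[1] - self.x_range[0]) * self.bev_size).long().clamp(0, self.bev_size - 1)
        zi = ((z - self.z_range[0]) / (self.z_range[1] - self.z_range[0]) * self.bev_size).long().clamp(0, self.bev_size - 1)
        cell = (zi * self.bev_size + xi).reshape(-1)  # flat BEV cell index per pixel
        # "BEV pooling": average image features that fall into the same pillar
        bev = torch.zeros(c, self.bev_size * self.bev_size)
        cnt = torch.zeros(self.bev_size * self.bev_size)
        bev.index_add_(1, cell, feat.reshape(c, -1))
        cnt.index_add_(0, cell, torch.ones_like(cell, dtype=torch.float32))
        bev = (bev / cnt.clamp(min=1)).reshape(c, self.bev_size, self.bev_size)
        bev = self.bev_conv(bev.unsqueeze(0)).squeeze(0)
        # Back-project: read each pixel's BEV cell to form I_BEV, then fuse with I
        i_bev = bev.reshape(c, -1)[:, cell].reshape(c, h, w)
        return self.fuse(torch.cat([feat, i_bev], dim=0).unsqueeze(0)).squeeze(0)
```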
Table 1. Comparisons with SoTA VOT methods on the ARKitTrack test set and existing popular RGB-D benchmarks, including DepthTrack and CDTB. For all trackers, the overall performance on our dataset is consistently lower than that on DepthTrack and CDTB, confirming that the proposed ARKitTrack is more challenging than existing RGB-D tracking datasets. Our tracker outperforms the other trackers on both ARKitTrack and DepthTrack, achieving 0.478 and 0.612 F-score, respectively. Although CDTB does not provide a training set and our tracker therefore cannot benefit from training on this dataset, our method still achieves satisfactory performance: a 0.677 F-score when trained with ARKitTrack and 0.690 when trained with DepthTrack.
Table 2. Comparisons with SoTA VOS methods on the ARKitTrack test set. Since there is no existing RGB-D VOS method or dataset, we select 4 state-of-the-art RGB VOS methods for comparison on the ARKitTrack-VOS test set. In addition, we design a variant named STCN_RGBD for RGB-D VOS by adding an additional depth branch to STCN and fusing the RGB-D features through concatenation. For a fair comparison, all methods are retrained on the ARKitTrack-VOS training set without static-image pre-training.
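For reference, here is a hypothetical sketch of the two-branch fusion idea behind the STCN_RGBD variant: a separate depth branch whose features are concatenated with the RGB features and fused by a conv layer. The backbone choice (ResNet-18), channel sizes, and module names are illustrative assumptions, not the exact variant evaluated in Table 2.

```python
# Hypothetical sketch of an RGB-D encoder with a separate depth branch and
# concatenation-based fusion (assumed backbones and channel sizes).
import torch
import torch.nn as nn
import torchvision.models as models

class RGBDFusionEncoder(nn.Module):
    def __init__(self, out_channels: int = 256):
        super().__init__()
        rgb = models.resnet18(weights=None)
        dep = models.resnet18(weights=None)
        dep.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)  # 1-channel depth input
        self.rgb_branch = nn.Sequential(*list(rgb.children())[:-2])    # drop avgpool/fc
        self.depth_branch = nn.Sequential(*list(dep.children())[:-2])
        self.fuse = nn.Conv2d(512 + 512, out_channels, kernel_size=1)  # fuse concatenated features

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        f_rgb = self.rgb_branch(rgb)      # (B, 512, H/32, W/32)
        f_dep = self.depth_branch(depth)  # (B, 512, H/32, W/32)
        return self.fuse(torch.cat([f_rgb, f_dep], dim=1))
```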
Figure 3. Attribute-specific analysis on the VOT test set. We also conduct an attribute-specific analysis of the aforementioned RGB-D VOT trackers using our per-frame attribute annotations. The results show that our tracker performs better than the other RGB-D trackers on 10 attributes, while DeT outperforms the other trackers on the full-occlusion attribute. All trackers deliver satisfactory performance under extreme illumination. However, none of them can well address the fast-motion and out-of-view attributes, indicating that these attributes remain challenging for existing RGB and RGB-D trackers.
Table 3. Ablation analysis of our VOT method on the ARKitTrack and DepthTrack test sets. Our tracker is trained and evaluated on the respective training and test sets of each dataset. The basic tracker is OSTrack, trained with only the common RGB tracking datasets. FT, BEV, CV, and GD stand for fine-tuning, BEV space, cross views, and Gaussian distribution, respectively. More details can be found in our paper.
Table 4. Comparison of different memory update strategies for VOS. ONE: keep only one template and update it every N frames. IP: use the IoU prediction branch to control template updates. ADD: consecutively add new templates to the memory bank and maintain several templates for prediction. More details can be found in our paper.
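To make the three strategies concrete, below is a hedged Python sketch of how they could be implemented. The class and parameter names (interval, max_size, iou_thresh) are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of the ONE / IP / ADD memory update strategies
# (assumed names and thresholds, not the paper's code).
class MemoryBank:
    def __init__(self, strategy: str = "ADD", interval: int = 5, max_size: int = 4, iou_thresh: float = 0.7):
        self.strategy = strategy
        self.interval = interval
        self.max_size = max_size
        self.iou_thresh = iou_thresh
        self.templates = []

    def update(self, frame_idx: int, template, pred_iou: float = 1.0):
        if frame_idx % self.interval != 0:
            return
        if self.strategy == "ONE":
            self.templates = [template]                       # keep a single, regularly refreshed template
        elif self.strategy == "IP":
            if pred_iou > self.iou_thresh:                    # gate updates with the IoU prediction branch
                self.templates = [template]
        elif self.strategy == "ADD":
            self.templates.append(template)                   # consecutively add templates
            self.templates = self.templates[-self.max_size:]  # keep several recent ones for prediction
```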
@inproceedings{zhao2023arkittrack,
  author    = {Haojie Zhao and Junsong Chen and Lijun Wang and Huchuan Lu},
  title     = {ARKitTrack: A New Diverse Dataset for Tracking Using Mobile RGB-D Data},
  booktitle = {CVPR},
  year      = {2023},
}