Activity Graph Transformer for Temporal Action Localization

21 Jan 2021 · Megha Nawhal, Greg Mori

We introduce Activity Graph Transformer, an end-to-end learnable model for temporal action localization that receives a video as input and directly predicts the set of action instances that appear in it. Detecting and localizing action instances in untrimmed videos requires reasoning over multiple instances jointly. The dominant paradigms in the literature process videos temporally, either proposing action regions or producing frame-level detections. However, sequential processing is problematic when action instances have non-sequential dependencies or non-linear temporal ordering, such as overlapping instances or instances that recur over the course of the video. In this work, we capture this non-linear temporal structure by representing videos as graphs and reasoning over them as non-sequential entities. We evaluate our model on the challenging THUMOS14, Charades, and EPIC-Kitchens-100 datasets. Our results show that the proposed model outperforms the state-of-the-art by a considerable margin.
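To make the direct set-prediction formulation concrete, below is a minimal PyTorch sketch of a transformer head that decodes a fixed set of learned queries into (class, segment) action instances in parallel, rather than scanning the video sequentially. This is not the authors' Activity Graph Transformer: it omits the graph-structured reasoning the paper introduces, and the class name, query count, and feature dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SetPredictionHead(nn.Module):
    """Hypothetical DETR-style set-prediction head for temporal action
    localization. Encodes pre-extracted clip features with a transformer
    and decodes N learned queries into N candidate action instances."""

    def __init__(self, feat_dim=512, num_queries=20, num_classes=20):
        super().__init__()
        # Learned queries, each responsible for one candidate instance.
        self.queries = nn.Parameter(torch.randn(num_queries, feat_dim))
        self.transformer = nn.Transformer(
            d_model=feat_dim, nhead=8,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        # Per-query class logits, with one extra "no action" slot.
        self.class_head = nn.Linear(feat_dim, num_classes + 1)
        # Per-query (center, width) segment, squashed to [0, 1] of video length.
        self.segment_head = nn.Linear(feat_dim, 2)

    def forward(self, clip_feats):
        # clip_feats: (batch, num_clips, feat_dim) video features.
        b = clip_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        decoded = self.transformer(clip_feats, q)  # (b, num_queries, feat_dim)
        return self.class_head(decoded), self.segment_head(decoded).sigmoid()

# Usage: 100 clips of 512-d features -> 20 candidate action instances.
feats = torch.randn(2, 100, 512)
logits, segments = SetPredictionHead()(feats)
print(logits.shape, segments.shape)  # (2, 20, 21) (2, 20, 2)
```

Because every query attends to every clip and to every other query, the decoder can represent overlapping and recurring instances that a left-to-right scan would entangle; the paper's contribution is to structure this reasoning as a graph rather than the plain attention used here.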


Results from the Paper


Task: Temporal Action Localization
Dataset: THUMOS’14
Model: AGT (Ours)

Metric         Value   Global Rank
mAP IoU@0.1    72.1    #3
mAP IoU@0.2    69.8    #3
mAP IoU@0.3    65.0    #23
mAP IoU@0.4    58.1    #24
mAP IoU@0.5    50.2    #24
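The mAP numbers above are computed at fixed temporal IoU thresholds: a predicted segment counts as a true positive only if its overlap with an unmatched ground-truth instance meets the threshold. A minimal sketch of the temporal IoU computation (the helper name is ours, not from the paper):

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two segments given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A prediction is a true positive at threshold t (e.g. 0.5 for mAP IoU@0.5)
# if temporal_iou(pred, gt) >= t for some unmatched ground-truth instance.
print(temporal_iou((2.0, 8.0), (4.0, 10.0)))  # 4 / 8 = 0.5
```

Lower thresholds (e.g. IoU@0.1) reward rough localization, while higher ones (IoU@0.5) require precise segment boundaries, which is why the reported mAP falls as the threshold rises.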
