Video-adverb retrieval with compositional adverb-action embeddings

26 Sep 2023 · Thomas Hummel, Otniel-Bogdan Mercea, A. Sophia Koepke, Zeynep Akata

Retrieving adverbs that describe an action in a video poses a crucial step towards fine-grained video understanding. We propose a framework for video-to-adverb retrieval (and vice versa) that aligns video embeddings with their matching compositional adverb-action text embedding in a joint embedding space. The compositional adverb-action text embedding is learned using a residual gating mechanism, along with a novel training objective consisting of triplet losses and a regression target. Our method achieves state-of-the-art performance on five recent benchmarks for video-adverb retrieval. Furthermore, we introduce dataset splits to benchmark video-adverb retrieval for unseen adverb-action compositions on subsets of the MSR-VTT Adverbs and ActivityNet Adverbs datasets. Our proposed framework outperforms all prior works for the generalisation task of retrieving adverbs from videos for unseen adverb-action compositions. Code and dataset splits are available at https://hummelth.github.io/ReGaDa/.
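
To make the ingredients named in the abstract more concrete, the following is a minimal NumPy sketch of a residual gating composition, a triplet loss, and a regression-style alignment term. It is an illustration under assumptions, not the authors' implementation: the function names, the weight matrices `W_gate` and `W_res`, the concatenation of action and adverb embeddings, the Euclidean distance, the margin value, and the mean-squared-error regression term are all hypothetical choices made here. The released code at the project page is the authoritative reference.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def compose(action, adverb, W_gate, W_res):
    """Hypothetical residual gating: the action embedding plus a gated
    residual computed from the concatenated action and adverb embeddings."""
    pair = np.concatenate([action, adverb], axis=-1)  # (batch, 2 * dim)
    gate = sigmoid(pair @ W_gate)                     # (batch, dim), values in (0, 1)
    residual = np.tanh(pair @ W_res)                  # (batch, dim)
    return action + gate * residual


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Standard margin-based triplet loss with Euclidean distance
    (the margin value is an assumption)."""
    d_pos = np.linalg.norm(anchor - positive, axis=-1)
    d_neg = np.linalg.norm(anchor - negative, axis=-1)
    return np.maximum(0.0, d_pos - d_neg + margin).mean()


def regression_loss(video, target):
    """Assumed regression term: mean squared error between the video
    embedding and its matching composed text embedding."""
    return np.mean((video - target) ** 2)


# Toy usage with random embeddings (all shapes and values are illustrative).
rng = np.random.default_rng(0)
batch, dim = 4, 8
action = rng.normal(size=(batch, dim))
adverb = rng.normal(size=(batch, dim))
antonym = rng.normal(size=(batch, dim))   # stand-in for a non-matching adverb
video = rng.normal(size=(batch, dim))
W_gate = rng.normal(size=(2 * dim, dim)) * 0.1
W_res = rng.normal(size=(2 * dim, dim)) * 0.1

pos_text = compose(action, adverb, W_gate, W_res)    # matching composition
neg_text = compose(action, antonym, W_gate, W_res)   # non-matching composition
total = triplet_loss(video, pos_text, neg_text) + regression_loss(video, pos_text)
print(float(total))
```

In this sketch the video embedding acts as the triplet anchor, the matching adverb-action composition is the positive, and a composition with a different adverb is the negative. How the actual method chooses negatives and weights the loss terms is not stated in the abstract.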

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Video-Adverb Retrieval (Unseen Compositions) | ActivityNet Adverbs | Pseudo Adverbs | Acc-A | 57 | #2 |
| Video-Adverb Retrieval (Unseen Compositions) | ActivityNet Adverbs | ReGaDa | Acc-A | 58.4 | #1 |
| Video-Adverb Retrieval | ActivityNet Adverbs | ReGaDa | mAP W | 0.239 | #1 |
| Video-Adverb Retrieval | ActivityNet Adverbs | ReGaDa | mAP M | 0.175 | #1 |
| Video-Adverb Retrieval | ActivityNet Adverbs | ReGaDa | Acc-A | 0.771 | #1 |
| Video-Adverb Retrieval (Unseen Compositions) | ActivityNet Adverbs | Action Changes (reg) | Acc-A | 53.9 | #4 |
| Video-Adverb Retrieval (Unseen Compositions) | ActivityNet Adverbs | CLIP | Acc-A | 55.1 | #3 |
| Video-Adverb Retrieval | AIR | ReGaDa | mAP W | 0.704 | #1 |
| Video-Adverb Retrieval | AIR | ReGaDa | mAP M | 0.418 | #1 |
| Video-Adverb Retrieval | AIR | ReGaDa | Acc-A | 0.874 | #1 |
| Video-Adverb Retrieval | HowTo100M Adverbs | ReGaDa | mAP W | 0.567 | #1 |
| Video-Adverb Retrieval | HowTo100M Adverbs | ReGaDa | Acc-A | 0.817 | #1 |
| Video-Adverb Retrieval | HowTo100M Adverbs | ReGaDa | mAP M | 0.528 | #1 |
| Video-Adverb Retrieval (Unseen Compositions) | MSR-VTT Adverbs | Pseudo Adverbs | Acc-A | 56 | #4 |
| Video-Adverb Retrieval (Unseen Compositions) | MSR-VTT Adverbs | CLIP | Acc-A | 57 | #3 |
| Video-Adverb Retrieval | MSR-VTT Adverbs | ReGaDa | mAP W | 0.378 | #1 |
| Video-Adverb Retrieval | MSR-VTT Adverbs | ReGaDa | mAP M | 0.228 | #1 |
| Video-Adverb Retrieval | MSR-VTT Adverbs | ReGaDa | Acc-A | 0.786 | #1 |
| Video-Adverb Retrieval (Unseen Compositions) | MSR-VTT Adverbs | Action Changes (reg) | Acc-A | 59 | #2 |
| Video-Adverb Retrieval (Unseen Compositions) | MSR-VTT Adverbs | Action Changes (cls) | Acc-A | 53.7 | #5 |
| Video-Adverb Retrieval (Unseen Compositions) | MSR-VTT Adverbs | ReGaDa | Acc-A | 61 | #1 |
| Video-Adverb Retrieval (Unseen Compositions) | VATEX Adverbs | CLIP | Acc-A | 54.5 | #3 |
| Video-Adverb Retrieval (Unseen Compositions) | VATEX Adverbs | Pseudo Adverbs | Acc-A | 53.8 | #5 |
| Video-Adverb Retrieval (Unseen Compositions) | VATEX Adverbs | Action Changes (reg) | Acc-A | 54.9 | #2 |
| Video-Adverb Retrieval (Unseen Compositions) | VATEX Adverbs | Action Changes (cls) | Acc-A | 54.3 | #4 |
| Video-Adverb Retrieval | VATEX Adverbs | ReGaDa | mAP W | 0.29 | #1 |
| Video-Adverb Retrieval | VATEX Adverbs | ReGaDa | mAP M | 0.113 | #1 |
| Video-Adverb Retrieval | VATEX Adverbs | ReGaDa | Acc-A | 0.817 | #1 |
| Video-Adverb Retrieval (Unseen Compositions) | VATEX Adverbs | ReGaDa | Acc-A | 61.7 | #1 |

Methods