Video-Text Retrieval Models

CAMoE is a multi-stream Corpus Alignment network with a single-gate Mixture-of-Experts (MoE) for video-text retrieval. CAMoE uses the MoE to extract multi-perspective video representations (action, entity, scene, and so on), then aligns each of them with the corresponding part of the text. A Dual Softmax Loss (DSL) is used to avoid the one-way optimal match that occurs in previous contrastive methods. By introducing the intrinsic prior of each pair in a batch, DSL acts as a reviser that corrects the similarity matrix and achieves a dual optimal match.
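
The role of the prior in DSL can be illustrated with a small sketch. It assumes that the prior for the video-to-text direction is a softmax taken along the opposite (text-to-video) axis of the batch similarity matrix, and vice versa, and that the revised matrix is then passed through an ordinary softmax cross-entropy in each direction. The function names, the `temp` and `logit_scale` parameters, and the choice to sum the two directional losses are illustrative assumptions, not the authors' released implementation.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]


def row_softmax(matrix):
    """Apply softmax independently to every row."""
    return [softmax(row) for row in matrix]


def col_softmax(matrix):
    """Apply softmax independently to every column."""
    return transpose(row_softmax(transpose(matrix)))


def dual_softmax_loss(sim, temp=1.0, logit_scale=1.0):
    """Sketch of a dual-softmax contrastive loss (assumed formulation).

    sim[i][j] is the similarity between video i and text j in a batch;
    matched pairs are assumed to lie on the diagonal.
    """
    n = len(sim)
    scaled = [[temp * s for s in row] for row in sim]

    # Priors: each direction is revised by a softmax along the other axis.
    prior_for_v2t = col_softmax(scaled)  # normalised over videos, per text
    prior_for_t2v = row_softmax(scaled)  # normalised over texts, per video

    def revise(prior):
        # Element-wise product of the similarity matrix and its prior.
        return [[logit_scale * s * p for s, p in zip(s_row, p_row)]
                for s_row, p_row in zip(sim, prior)]

    v2t_probs = row_softmax(revise(prior_for_v2t))  # each video picks a text
    t2v_probs = col_softmax(revise(prior_for_t2v))  # each text picks a video

    loss_v2t = -sum(math.log(v2t_probs[i][i]) for i in range(n)) / n
    loss_t2v = -sum(math.log(t2v_probs[i][i]) for i in range(n)) / n
    return loss_v2t + loss_t2v
```

In this sketch a pair scores highly only when the video prefers the text and the text prefers the video, because each direction's logits are damped by the other direction's softmax. A batch whose matched pairs sit on the diagonal, such as `[[1.0, 0.0], [0.0, 1.0]]`, therefore yields a lower loss than the same batch with its columns swapped.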

Source: Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss

Tasks


Task                  Papers  Share
Retrieval             1       33.33%
Video Retrieval       1       33.33%
Video-Text Retrieval  1       33.33%

Categories