CAMoE is a multi-stream Corpus Alignment network with single-gate Mixture-of-Experts (MoE) for video-text retrieval. It uses the MoE to extract multi-perspective video representations (action, entity, scene, etc.) and then aligns each of them with the corresponding part of the text. A Dual Softmax Loss (DSL) is used to avoid the one-way optimum match that occurs in previous contrastive methods. By introducing the intrinsic prior of each pair in a batch, DSL serves as a reviser that corrects the similarity matrix and achieves the dual optimal match.
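The role of DSL as a "reviser" of the similarity matrix can be illustrated with a minimal NumPy sketch. It assumes the prior for one retrieval direction is a softmax taken along the opposite axis of the batch similarity matrix, and that the revised matrix is the element-wise product of the two. The function name `dual_softmax_loss`, the parameters `prior_temp` and `logit_scale`, and their default values are illustrative assumptions, not taken from the description above.

```python
import numpy as np


def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def dual_softmax_loss(sim, prior_temp=100.0, logit_scale=100.0):
    """Sketch of a dual-softmax contrastive loss (hypothetical parameters).

    sim[i, j] is the similarity of video i and text j; matched pairs are
    assumed to lie on the diagonal of the (B, B) matrix.
    """
    # Priors: each direction is revised by the softmax of the other direction.
    prior_for_v2t = softmax(sim * prior_temp, axis=0)  # normalised over videos
    prior_for_t2v = softmax(sim * prior_temp, axis=1)  # normalised over texts

    # Revised similarity matrices (element-wise product with the prior).
    v2t_logits = sim * prior_for_v2t * logit_scale
    t2v_logits = sim * prior_for_t2v * logit_scale

    # Standard cross-entropy on the diagonal, once per direction.
    loss_v2t = -np.mean(np.diag(log_softmax(v2t_logits, axis=1)))
    loss_t2v = -np.mean(np.diag(log_softmax(t2v_logits, axis=0)))
    return loss_v2t + loss_t2v


# Toy batch of three pairs: matched pairs score 0.9, mismatched pairs 0.1.
aligned = np.full((3, 3), 0.1) + 0.8 * np.eye(3)
# Shift the rows so that no matched pair sits on the diagonal.
misaligned = aligned[[1, 2, 0]]

loss_aligned = dual_softmax_loss(aligned)
loss_misaligned = dual_softmax_loss(misaligned)
```

In this sketch a pair scores highly only when it is the best match in both directions, so the aligned batch yields a much smaller loss than the misaligned one.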
Source: Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss
Component | Type |
---|---|
BERT | Language Models |
Dual Softmax Loss | Loss Functions |
Vision Transformer | Image Models |