no code implementations • 6 Mar 2023 • Shuhong Ye, Weikai Kong, Chenglin Yao, Jianfeng Ren, Xudong Jiang
Specifically, we first extract video features using a TimeSformer and text features using a BERT model from the target application domain, and utilize CLIP to extract a pair of visual-text features from the general-knowledge domain through domain-specific learning.
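The feature-extraction step above can be sketched structurally as follows. This is only an illustrative sketch: random vectors stand in for the actual TimeSformer, BERT, and CLIP outputs, the dimensions (768 and 512) are common defaults for those models rather than values stated here, and the concatenation at the end is one simple way to combine the two domains, not necessarily the paper's fusion method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the real extractors (hypothetical dimensions):
# TimeSformer video features and BERT text features come from the
# target application domain; CLIP produces a visual-text feature
# pair from the general-knowledge domain.
video_feat = rng.standard_normal(768)              # TimeSformer output (assumed 768-d)
text_feat = rng.standard_normal(768)               # BERT [CLS] output (assumed 768-d)
clip_visual, clip_text = rng.standard_normal((2, 512))  # CLIP pair (assumed 512-d)

# One simple combination: concatenate all four feature vectors
# into a single joint representation of both domains.
joint = np.concatenate([video_feat, text_feat, clip_visual, clip_text])
print(joint.shape)  # (2560,)
```

In a real pipeline each stand-in would be replaced by a forward pass through the corresponding pretrained model.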