1 code implementation • 22 Oct 2021 • Zhongwei Xie, Ling Liu, Yanzhao Wu, Luo Zhong, Lin Li
This paper introduces a two-phase deep feature engineering framework for efficient learning of a semantics-enhanced joint embedding, which cleanly separates deep feature engineering in data preprocessing from the training of the text-image joint embedding model.
no code implementations • 16 Oct 2021 • Zhixin Sun, Xian Zhong, Shuqin Chen, Lin Li, Luo Zhong
Video captioning is a challenging task: it must capture distinct visual parts and describe them in sentences, which requires both visual and linguistic coherence.
no code implementations • 9 Aug 2021 • Zhongwei Xie, Ling Liu, Lin Li, Luo Zhong
This paper presents a three-tier modality alignment approach to learning a text-image joint embedding, coined JEMA, for cross-modal retrieval of cooking recipes and food images.
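The abstract does not spell out JEMA's alignment objective; a common building block for this kind of text-image alignment is a bidirectional triplet (hinge) loss that pulls matched recipe/image embeddings together and pushes mismatched ones apart. The sketch below is illustrative only, not the paper's method; the `margin` value and function name are assumptions.

```python
import numpy as np

def triplet_alignment_loss(text_emb, img_emb, margin=0.3):
    """Bidirectional triplet hinge loss over a batch of matched pairs.

    text_emb, img_emb: (n, d) L2-normalized embeddings; row i of each
    modality is a matching recipe/image pair. `margin` is illustrative.
    """
    sims = text_emb @ img_emb.T            # (n, n) cosine similarities
    pos = np.diag(sims)                    # similarity of each matched pair
    n = sims.shape[0]
    mask = 1.0 - np.eye(n)                 # exclude the positives on the diagonal
    # text -> image direction: every non-matching image is a negative ...
    t2i = np.maximum(0.0, margin + sims - pos[:, None]) * mask
    # ... and image -> text symmetrically.
    i2t = np.maximum(0.0, margin + sims - pos[None, :]) * mask
    return (t2i.sum() + i2t.sum()) / n
```

With perfectly aligned one-hot embeddings (`np.eye(3)` for both modalities), every negative similarity is below the positive by more than the margin, so the loss is exactly 0.0.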
1 code implementation • 2 Aug 2021 • Zhongwei Xie, Ling Liu, Yanzhao Wu, Lin Li, Luo Zhong
We present a Multi-modal Semantics enhanced Joint Embedding approach (MSJE) for learning a common feature space between the two modalities (text and image), with the ultimate goal of providing high-performance cross-modal retrieval services.
no code implementations • 2 Aug 2021 • Zhongwei Xie, Ling Liu, Lin Li, Luo Zhong
This paper introduces a two-phase deep feature calibration framework for efficient learning of a semantics-enhanced text-image cross-modal joint embedding, which cleanly separates deep feature calibration in data preprocessing from the training of the joint embedding model.
no code implementations • 4 Oct 2018 • Zhongwei Xie, Lin Li, Xian Zhong, Luo Zhong
In this paper, we propose an end-to-end neural network framework for image-to-video person re-identification that leverages cross-modal embeddings learned from extra information. Concretely, cross-modal embeddings from image captioning and video captioning models are reused to project the learned features into a coordinated space, where similarity can be computed directly.
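Once both modalities live in a coordinated space, retrieval reduces to ranking gallery items by similarity to the query embedding. A minimal sketch of that final step, assuming L2-normalized embeddings and cosine similarity (the function names and toy vectors are illustrative, not from the paper):

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale each vector to unit length so cosine similarity is a dot product.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def rank_gallery(query_emb, gallery_embs):
    """Rank gallery items by cosine similarity to the query.

    query_emb: (d,) image-side embedding in the coordinated space.
    gallery_embs: (n, d) video-side embeddings in the same space.
    Returns gallery indices sorted from most to least similar.
    """
    q = l2_normalize(query_emb)
    g = l2_normalize(gallery_embs)
    sims = g @ q                     # (n,) cosine similarities
    return np.argsort(-sims)         # indices in descending similarity

# Toy example in a 4-dim coordinated space.
query = np.array([1.0, 0.0, 0.0, 0.0])
gallery = np.array([
    [0.9, 0.1, 0.0, 0.0],   # nearly parallel to the query
    [0.0, 1.0, 0.0, 0.0],   # orthogonal
    [0.5, 0.5, 0.0, 0.0],   # in between
])
print(rank_gallery(query, gallery))  # → [0 2 1]
```

The same ranking routine serves both directions (image query against video gallery, or vice versa), since both sides are embedded into one space.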