no code implementations • 25 Apr 2024 • Kaixin Shen, Ruijie Quan, Linchao Zhu, Jun Xiao, Yi Yang
AudioScenic exploits inherent properties of audio, namely magnitude and frequency, to guide the editing process, aiming to control temporal dynamics and enhance temporal consistency.
no code implementations • 25 Apr 2024 • Kaixin Shen, Ruijie Quan, Linchao Zhu, Jun Xiao, Yi Yang
In this study, we introduce a framework called Multi-Agent Trajectory prediction via neural interaction Energy (MATE).
no code implementations • 22 Apr 2024 • Wenyi Xiao, Ziwei Huang, Leilei Gan, Wanggui He, Haoyuan Li, Zhelun Yu, Hao Jiang, Fei Wu, Linchao Zhu
The rapidly developing Large Vision-Language Models (LVLMs) have shown notable capabilities on a range of multi-modal tasks, but still suffer from hallucination, where the generated text does not align with the given context, significantly restricting the use of LVLMs.
no code implementations • 24 Mar 2024 • Xiangpeng Yang, Linchao Zhu, Hehe Fan, Yi Yang
We find that the crux of the issue stems from the imprecise distribution of attention weights across designated regions, including inaccurate text-to-attribute control and attention leakage.
no code implementations • 24 Mar 2024 • Yucheng Suo, Fan Ma, Linchao Zhu, Yi Yang
The pseudo-word tokens generated in this stream are explicitly aligned with fine-grained semantics in the text embedding space.
no code implementations • 23 Mar 2024 • Shuai Zhao, Linchao Zhu, Ruijie Quan, Yi Yang
These concealed passphrases in user documents are referred to as ghost sentences; once they are identified in the generated content of LLMs, users can be sure that their data was used for training.
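The idea can be illustrated with a minimal sketch. This is not the paper's actual detection procedure, just an assumed toy version: a user plants a unique, randomly generated passphrase in their documents and later checks whether it surfaces verbatim in LLM output. The word list and helper names here are hypothetical.

```python
# Toy illustration of the "ghost sentence" idea (assumed, not the paper's
# exact method): plant an unlikely passphrase, then test for verbatim recall.
import secrets

def make_ghost_sentence(wordlist, length=8):
    """Sample a random, highly unlikely word sequence to embed in documents."""
    return " ".join(secrets.choice(wordlist) for _ in range(length))

def appears_in_generation(ghost, generated_text):
    """A verbatim match in LLM output suggests the documents were trained on."""
    return ghost.lower() in generated_text.lower()

words = ["quartz", "lantern", "meadow", "cipher", "violet", "harbor", "ember", "sonnet"]
ghost = make_ghost_sentence(words)
print(appears_in_generation(ghost, f"some model output ... {ghost} ... more text"))  # True
```

In practice the check would run over large volumes of generated text, and the passphrase must be long enough that a verbatim match is vanishingly unlikely by chance.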
1 code implementation • 1 Feb 2024 • Chao Liang, Fan Ma, Linchao Zhu, Yingying Deng, Yi Yang
Moreover, we introduce the 3D facial prior to equip our model with control over the human head in a flexible and 3D-consistent manner.
1 code implementation • 19 Jan 2024 • Xiangpeng Yang, Linchao Zhu, Xiaohan Wang, Yi Yang
(2) Equipping the visual and text encoders with separate prompts fails to mitigate the visual-text modality gap.
no code implementations • 12 Jan 2024 • Yuanzhi Liang, Linchao Zhu, Yi Yang
To address this challenge, we introduce the Multi-Agent Interaction Evaluation Framework (AntEval), encompassing a novel interaction framework and evaluation methods.
no code implementations • 27 Nov 2023 • Yu Lu, Linchao Zhu, Hehe Fan, Yi Yang
Text-to-video (T2V) generation is a rapidly growing research area that aims to translate the scenes, objects, and actions within complex video text into a sequence of coherent visual frames.
no code implementations • 27 Oct 2023 • Yucheng Suo, Linchao Zhu, Yi Yang
This task aims to identify the instance mask that is most related to a referring expression without training on pixel-level annotations.
no code implementations • 16 Oct 2023 • Chao Liang, Linchao Zhu, Humphrey Shi, Yi Yang
Sample selection is an effective way to deal with label noise.
no code implementations • IEEE Transactions on Multimedia 2023 • Yuanzhi Liang, Linchao Zhu, Xiaohan Wang, Yi Yang
Video captioning is a more challenging task compared to image captioning, primarily due to differences in content density.
Ranked #5 on Video Captioning on VATEX (using extra training data)
1 code implementation • 4 Sep 2023 • Yunhong Lou, Linchao Zhu, Yaxiong Wang, Xiaohan Wang, Yi Yang
We present DiverseMotion, a new approach for synthesizing high-quality human motions conditioned on textual descriptions while preserving motion diversity. Despite recent significant progress in text-based human motion generation, existing methods often prioritize fitting training motions at the expense of action diversity.
Ranked #3 on Motion Synthesis on HumanML3D (using extra training data)
1 code implementation • 24 Jul 2023 • Yuanzhi Liang, Linchao Zhu, Yi Yang
MOE challenges models to understand characters' intentions and accurately determine their actions within intricate contexts involving multi-character and novel object interactions.
1 code implementation • 3 Jul 2023 • Chao Liang, Zongxin Yang, Linchao Zhu, Yi Yang
In real-world scenarios, collected and annotated data often exhibit the characteristics of multiple classes and long-tailed distribution.
1 code implementation • 29 May 2023 • Shuai Zhao, Xiaohan Wang, Linchao Zhu, Yi Yang
Given a single test sample, the VLM is forced to maximize the CLIP reward between the input and sampled results from the VLM output distribution.
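The description above suggests a reward-guided test-time tuning loop. The following is a toy sketch under assumptions (random vectors stand in for real CLIP embeddings, and the update rule is a simplified REINFORCE-style step, not the paper's exact objective): candidates are sampled from the model's output distribution, scored with a cosine-similarity reward, and the distribution is pushed toward high-reward samples.

```python
# Toy sketch of reward-guided test-time tuning. The embeddings are random
# stand-ins for CLIP features (assumption), not outputs of the real model.
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    """Cosine similarity, used here as a CLIP-style reward."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

image_emb = rng.normal(size=16)                    # stand-in input embedding
candidate_embs = [rng.normal(size=16) for _ in range(4)]  # 4 sampled outputs

logits = np.zeros(4)                               # output distribution (uniform)
for _ in range(50):                                # simplified policy-gradient steps
    rewards = np.array([cosine(image_emb, c) for c in candidate_embs])
    advantage = rewards - rewards.mean()           # baseline-subtracted reward
    logits += 0.5 * advantage                      # favor high-reward candidates

best = int(np.argmax(logits))                      # distribution now peaks here
```

After a few updates the distribution concentrates on the candidate whose embedding best matches the input, which is the intuition behind maximizing the CLIP reward at test time.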
1 code implementation • 28 May 2023 • Wenjie Zhuo, Yifan Sun, Xiaohan Wang, Linchao Zhu, Yi Yang
Consequently, using multiple positive samples with enhanced diversity further improves contrastive learning due to better alignment.
1 code implementation • 23 May 2023 • Shuai Zhao, Ruijie Quan, Linchao Zhu, Yi Yang
With such merits, we transform CLIP into a scene text reader and introduce CLIP4STR, a simple yet effective STR method built upon image and text encoders of CLIP.
Ranked #1 on Scene Text Recognition on Uber-Text
1 code implementation • 22 May 2023 • Kezhou Lin, Xiaohan Wang, Linchao Zhu, Ke Sun, Bang Zhang, Yi Yang
In this paper, we tackle the problem of sign language translation (SLT) without gloss annotations.
no code implementations • CVPR 2023 • Yaowei Li, Ruijie Quan, Linchao Zhu, Yi Yang
Large-scale pre-training has brought unimodal fields such as computer vision and natural language processing to a new era.
1 code implementation • 6 Mar 2023 • Wei Li, Linchao Zhu, Longyin Wen, Yi Yang
This decoder is both data-efficient and computation-efficient: 1) it only requires text data for training, easing the burden of collecting paired data.
no code implementations • 22 Jan 2023 • Juncheng Li, Siliang Tang, Linchao Zhu, Wenqiao Zhang, Yi Yang, Tat-Seng Chua, Fei Wu, Yueting Zhuang
To systematically benchmark the compositional generalizability of temporal grounding models, we introduce a new Compositional Temporal Grounding task and construct two new dataset splits, i.e., Charades-CG and ActivityNet-CG.
no code implementations • 18 Jan 2023 • Fan Ma, Xiaojie Jin, Heng Wang, Jingjia Huang, Linchao Zhu, Jiashi Feng, Yi Yang
Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description, and text localization, which matches a subset of texts with the video features.
1 code implementation • 1 Jan 2023 • Zenan Huang, Jun Wen, Siheng Chen, Linchao Zhu, Nenggan Zheng
Domain adaptation methods reduce domain shift typically by learning domain-invariant features.
1 code implementation • ICCV 2023 • Yuanzhi Liang, Xiaohan Wang, Linchao Zhu, Yi Yang
Experimental results and visualizations, based on the large-scale PartNet-Mobility dataset, show the effectiveness of MAAL in learning multi-modal data and solving the 3D articulated object affordance problem.
no code implementations • CVPR 2023 • Hehe Fan, Linchao Zhu, Yi Yang, Mohan Kankanhalli
Deep neural networks on regular 1D lists (e.g., natural languages) and irregular 3D sets (e.g., point clouds) have achieved tremendous success.
1 code implementation • CVPR 2023 • Difei Gao, Luowei Zhou, Lei Ji, Linchao Zhu, Yi Yang, Mike Zheng Shou
To build Video Question Answering (VideoQA) systems capable of assisting humans in daily activities, seeking answers from long-form videos with diverse and complex events is a must.
Ranked #2 on Video Question Answering on AGQA 2.0 balanced
1 code implementation • IEEE Transactions on Neural Networks and Learning Systems 2022 • Yuanzhi Liang, Linchao Zhu, Xiaohan Wang, Yi Yang
Second, we instantiate the loss function and provide a strong baseline for FGVC, where the performance of a naive backbone can be boosted and be comparable with recent methods.
Ranked #28 on Fine-Grained Image Classification on CUB-200-2011
Fine-Grained Image Classification Fine-Grained Visual Recognition
no code implementations • 30 Sep 2022 • Shuai Zhao, Xiaohan Wang, Linchao Zhu, Yi Yang
In this work, we present a one-stage solution to obtain pre-trained small models without the need for extra teachers, namely, slimmable networks for contrastive self-supervised learning (SlimCLR).
no code implementations • 6 Aug 2022 • Shannan Guan, Haiyan Lu, Linchao Zhu, Gengfa Fang
Existing 3D skeleton-based action recognition approaches reach impressive performance by encoding handcrafted action features into an image format and decoding them with CNNs.
1 code implementation • 4 Aug 2022 • Juncheng Li, Xin He, Longhui Wei, Long Qian, Linchao Zhu, Lingxi Xie, Yueting Zhuang, Qi Tian, Siliang Tang
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks.
1 code implementation • 3 Aug 2022 • Juncheng Li, Junlin Xie, Linchao Zhu, Long Qian, Siliang Tang, Wenqiao Zhang, Haochen Shi, Shengyu Zhang, Longhui Wei, Qi Tian, Yueting Zhuang
In this paper, we introduce a new task, named Temporal Emotion Localization in videos~(TEL), which aims to detect human emotions and localize their corresponding temporal boundaries in untrimmed videos with aligned subtitles.
no code implementations • 7 Jul 2022 • Shannan Guan, Haiyan Lu, Linchao Zhu, Gengfa Fang
3D pose estimation has recently gained substantial interest in the computer vision domain.
Ranked #35 on 3D Human Pose Estimation on MPI-INF-3DHP
1 code implementation • 2 May 2022 • Shuai Zhao, Linchao Zhu, Xiaohan Wang, Yi Yang
In this paper, to reduce the number of redundant video tokens, we design a multi-segment token clustering algorithm to find the most representative tokens and drop the non-essential ones.
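A minimal version of the per-segment clustering idea can be sketched as follows. This is an illustrative assumption, not the paper's exact algorithm: video tokens are split into temporal segments, a tiny k-means runs within each segment, and only the token nearest each cluster center is kept as a representative while the rest are dropped.

```python
# Illustrative per-segment token clustering (assumed simplification of the
# paper's multi-segment clustering): keep one real token per cluster center.
import numpy as np

def cluster_tokens(tokens, n_segments=2, k=2, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    kept = []
    for seg in np.array_split(tokens, n_segments):        # temporal segments
        centers = seg[rng.choice(len(seg), size=k, replace=False)].copy()
        for _ in range(iters):                            # plain k-means
            dists = np.linalg.norm(seg[:, None] - centers[None], axis=-1)
            assign = dists.argmin(axis=1)
            for c in range(k):
                if (assign == c).any():
                    centers[c] = seg[assign == c].mean(axis=0)
        # keep the actual token closest to each center as the representative
        dists = np.linalg.norm(seg[:, None] - centers[None], axis=-1)
        kept.extend(seg[dists.argmin(axis=0)])
    return np.array(kept)

tokens = np.random.default_rng(1).normal(size=(32, 8))    # 32 tokens, dim 8
reduced = cluster_tokens(tokens)
print(reduced.shape)  # (4, 8): 2 segments x 2 representatives each
```

Dropping non-representative tokens this way shrinks the sequence length, which is where the computational savings for video retrieval would come from.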
Ranked #11 on Video Retrieval on MSVD (using extra training data)
1 code implementation • CVPR 2022 • Fan Ma, Mike Zheng Shou, Linchao Zhu, Haoqi Fan, Yilei Xu, Yi Yang, Zhicheng Yan
Although UniTrack (Wang et al., 2021) demonstrates that a shared appearance model with multiple heads can be used to tackle individual tracking tasks, it fails to exploit the large-scale tracking datasets for training and performs poorly on single object tracking.
1 code implementation • CVPR 2022 • Juncheng Li, Junlin Xie, Long Qian, Linchao Zhu, Siliang Tang, Fei Wu, Yi Yang, Yueting Zhuang, Xin Eric Wang
To systematically measure the compositional generalizability of temporal grounding models, we introduce a new Compositional Temporal Grounding task and construct two new dataset splits, i.e., Charades-CG and ActivityNet-CG.
no code implementations • CVPR 2022 • Yang Jin, Linchao Zhu, Yadong Mu
The main contributions of this work are two-fold: 1) Different from existing black-box models, the proposed model simultaneously implements the localization of temporal boundaries and the recognition of action categories by grounding the logical rules of MLN in videos.
1 code implementation • CVPR 2022 • Yuanzhi Liang, Qianyu Feng, Linchao Zhu, Li Hu, Pan Pan, Yi Yang
Talking gesture generation is a practical yet challenging task which aims to synthesize gestures in line with speech.
Ranked #6 on Gesture Generation on TED Gesture Dataset
2 code implementations • CVPR 2022 • Yuanzhi Liang, Linchao Zhu, Xiaohan Wang, Yi Yang
In this paper, we propose an episodic linear probing (ELP) classifier to reflect the generalization of visual representations in an online manner.
Ranked #13 on Fine-Grained Image Classification on CUB-200-2011
1 code implementation • ICCV 2021 • Aming Wu, Rui Liu, Yahong Han, Linchao Zhu, Yi Yang
Secondly, domain-specific representations are introduced as the differences between the input and domain-invariant representations.
no code implementations • ICCV 2021 • Juncheng Li, Siliang Tang, Linchao Zhu, Haochen Shi, Xuanwen Huang, Fei Wu, Yi Yang, Yueting Zhuang
Secondly, we introduce semantic coherence learning to explicitly encourage the semantic coherence of the adaptive hierarchical graph network from three hierarchies.
no code implementations • 3 Jun 2021 • Kezhou Lin, Xiaohan Wang, Zhedong Zheng, Linchao Zhu, Yi Yang
Obtaining viewer responses from videos can be useful for creators and streaming platforms to analyze the video performance and improve the future user experience.
no code implementations • 2 May 2021 • Qianyu Feng, Linchao Zhu, Bang Zhang, Pan Pan, Yi Yang
Specifically, we expect to approximate the real joint distribution over the partial observations and latent variables, and thus infer the unseen targets.
1 code implementation • 30 Apr 2021 • Youjiang Xu, Linchao Zhu, Lu Jiang, Yi Yang
It has been shown that deep neural networks are prone to overfitting on biased training data.
1 code implementation • CVPR 2021 • Xiaohan Wang, Linchao Zhu, Yi Yang
Moreover, a global alignment method is proposed to provide a global cross-modal measurement that is complementary to the local perspective.
1 code implementation • ICCV 2021 • Aming Wu, Yahong Han, Linchao Zhu, Yi Yang
Thus, we develop a new framework of few-shot object detection with universal prototypes (FSOD^up) that owns the merit of feature generalization towards novel objects.
Ranked #23 on Few-Shot Object Detection on MS-COCO (10-shot)
no code implementations • 13 Jan 2021 • Yu Wu, Linchao Zhu, Xiaohan Wang, Yi Yang, Fei Wu
We further improve ImagineRNN by residual anticipation, i.e., changing its target to predicting the feature difference of adjacent frames instead of the frame content.
no code implementations • 1 Jan 2021 • Mathis Petrovich, Chao Liang, Ryoma Sato, Yanbin Liu, Yao-Hung Hubert Tsai, Linchao Zhu, Yi Yang, Ruslan Salakhutdinov, Makoto Yamada
To show the effectiveness of FROT, we propose using the FROT algorithm for the layer selection problem in deep neural networks for semantic correspondence.
no code implementations • ICCV 2021 • Xiaohan Wang, Linchao Zhu, Heng Wang, Yi Yang
To avoid these additional costs, we propose an end-to-end Interactive Prototype Learning (IPL) framework to learn better active object representations by leveraging the motion cues from the actor.
1 code implementation • ICCV 2021 • Yanbin Liu, Juho Lee, Linchao Zhu, Ling Chen, Humphrey Shi, Yi Yang
Most existing few-shot classification methods only consider generalization on one dataset (i.e., single-domain), failing to transfer across various seen and unseen domains.
no code implementations • 1 Jan 2021 • Hu Zhang, Linchao Zhu, Yi Yang
Motivated by such phenomenon, we propose to disentangle the distinctive effects of data-rich and data-poor gradient and asynchronously train a model via a dual-phase learning process.
3 code implementations • 30 Dec 2020 • Leilei Gan, Zhiyang Teng, Yue Zhang, Linchao Zhu, Fei Wu, Yi Yang
In this paper, we propose SemGloVe, which distills semantic co-occurrences from BERT into static GloVe word embeddings.
1 code implementation • CVPR 2020 • Linchao Zhu, Yi Yang
In this paper, we introduce ActBERT for self-supervised learning of joint video-text representations from unlabeled data.
Ranked #8 on Action Segmentation on COIN
no code implementations • CVPR 2020 • Linchao Zhu, Yi Yang
It is beneficial to incorporate more discriminative features to improve generalization on tail classes.
Ranked #16 on Long-tail Learning on Places-LT
1 code implementation • 25 May 2020 • Mathis Petrovich, Chao Liang, Ryoma Sato, Yanbin Liu, Yao-Hung Hubert Tsai, Linchao Zhu, Yi Yang, Ruslan Salakhutdinov, Makoto Yamada
To show the effectiveness of FROT, we propose using the FROT algorithm for the layer selection problem in deep neural networks for semantic correspondence.
no code implementations • CVPR 2021 • Zhun Zhong, Linchao Zhu, Zhiming Luo, Shaozi Li, Yi Yang, Nicu Sebe
In this paper, we tackle the problem of discovering new classes in unlabeled visual data given labeled data from disjoint classes.
1 code implementation • ECCV 2020 • Hu Zhang, Linchao Zhu, Yi Zhu, Yi Yang
Most of previous work on adversarial attack mainly focus on image models, while the vulnerability of video models is less explored.
1 code implementation • ECCV 2020 • Fan Ma, Linchao Zhu, Yi Yang, Shengxin Zha, Gourab Kundu, Matt Feiszli, Zheng Shou
To obtain the single-frame supervision, the annotators are asked to identify only a single frame within the temporal window of an action.
Ranked #5 on Weakly Supervised Action Localization on BEOID
no code implementations • 8 Feb 2020 • Xiaohan Wang, Yu Wu, Linchao Zhu, Yi Yang
Due to the large action vocabulary in egocentric video datasets, recent studies usually utilize a two-branch structure for action recognition, i.e., one branch for verb classification and the other for noun classification.
Ranked #4 on Egocentric Activity Recognition on EGTEA
1 code implementation • NeurIPS 2019 • Aming Wu, Linchao Zhu, Yahong Han, Yi Yang
Inspired by this idea, towards VCR, we propose a connective cognition network (CCN) to dynamically reorganize the visual neuron connectivity that is contextualized by the meaning of questions and answers.
no code implementations • 20 Nov 2019 • Aming Wu, Yahong Han, Linchao Zhu, Yi Yang
Most state-of-the-art methods of object detection suffer from poor generalization ability when the training and test data are from different domains, e.g., with different styles.
3 code implementations • CVPR 2020 • Zongxin Yang, Linchao Zhu, Yu Wu, Yi Yang
This lightweight layer incorporates a simple l2 normalization, making our transformation unit applicable at the operator level without a significant increase in parameters.
no code implementations • ECCV 2020 • Linchao Zhu, Sercan O. Arik, Yi Yang, Tomas Pfister
We propose a novel adaptive transfer learning framework, learning to transfer learn (L2TL), to improve performance on a target dataset by careful extraction of the related information from a source dataset.
no code implementations • 22 Jun 2019 • Xiaohan Wang, Yu Wu, Linchao Zhu, Yi Yang
In this report, we present the Baidu-UTS submission to the EPIC-Kitchens Action Recognition Challenge in CVPR 2019.
no code implementations • 10 Jun 2019 • Linchao Zhu, Laura Sevilla-Lara, Du Tran, Matt Feiszli, Yi Yang, Heng Wang
FASTER aims to leverage the redundancy between neighboring clips and reduce the computational cost by learning to aggregate the predictions from models of different complexities.
Ranked #26 on Action Recognition on UCF101
no code implementations • 20 Apr 2019 • Hehe Fan, Linchao Zhu, Yi Yang
Predicting future frames in videos has become a promising direction of research for both computer vision and robot learning communities.
no code implementations • CVPR 2019 • Fengda Zhu, Linchao Zhu, Yi Yang
Specifically, our method employs an adversarial feature adaptation model for visual representation transfer and a policy mimic strategy for policy behavior imitation.
no code implementations • 8 Apr 2019 • Yang He, Ping Liu, Linchao Zhu, Yi Yang
In addition, when evaluating filter importance, only the magnitude information of the filters is considered.
3 code implementations • ICCV 2019 • Ruijie Quan, Xuanyi Dong, Yu Wu, Linchao Zhu, Yi Yang
We propose to automatically search for a CNN architecture that is specifically suitable for the reID task.
Ranked #9 on Person Re-Identification on CUHK03 detected
no code implementations • ECCV 2018 • Linchao Zhu, Yi Yang
In this paper, we propose a new memory network structure for few-shot video classification by making the following contributions.
no code implementations • 27 Aug 2018 • Ke Ning, Linchao Zhu, Ming Cai, Yi Yang, Di Xie, Fei Wu
We validate the effectiveness of our ASST on two large-scale datasets.
1 code implementation • 11 Apr 2018 • Yu Wu, Linchao Zhu, Lu Jiang, Yi Yang
Thus, the sequence model can be decoupled from the novel object descriptions.
1 code implementation • 13 Jul 2017 • Linchao Zhu, Yanbin Liu, Yi Yang
In this paper, we present our solution to Google YouTube-8M Video Classification Challenge 2017.
no code implementations • CVPR 2017 • Zhongwen Xu, Linchao Zhu, Yi Yang
Then, we demonstrate that with our model, machine-labeled image annotations are very effective and abundant resources to perform object recognition on novel categories.
no code implementations • CVPR 2017 • Linchao Zhu, Zhongwen Xu, Yi Yang
This learning process makes the learned model more capable of dealing with motion speed variance.
no code implementations • 15 Nov 2015 • Linchao Zhu, Zhongwen Xu, Yi Yang, Alexander G. Hauptmann
In this work, we introduce Video Question Answering in temporal domain to infer the past, describe the present and predict the future.