no code implementations • 20 Feb 2024 • Md Mohaiminul Islam, Ngan Ho, Xitong Yang, Tushar Nagarajan, Lorenzo Torresani, Gedas Bertasius
We utilize a curriculum learning training scheme to learn the hierarchical structure of videos, starting from clip-level captions describing atomic actions, then focusing on segment-level descriptions, and concluding with generating summaries for hour-long videos.
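A minimal sketch of such a three-stage curriculum, assuming a generic captioning model trained on progressively longer temporal granularities; every name below (CaptionDataset, the stand-in model, the toy shapes) is a hypothetical placeholder, not the authors' code:

```python
# Hedged sketch: curriculum over clip -> segment -> video granularity.
import torch
from torch.utils.data import DataLoader, Dataset

class CaptionDataset(Dataset):
    """Placeholder dataset yielding (video_features, caption_tokens) pairs."""
    def __init__(self, granularity: str):
        self.granularity = granularity  # "clip", "segment", or "video"
    def __len__(self):
        return 8
    def __getitem__(self, idx):
        # Longer granularities use more frames per sample (toy numbers).
        frames = {"clip": 16, "segment": 64, "video": 256}[self.granularity]
        return torch.randn(frames, 512), torch.randint(0, 1000, (20,))

model = torch.nn.Linear(512, 1000)  # stand-in for a real captioner
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Curriculum: clip-level captions first, then segments, then full videos.
for stage in ["clip", "segment", "video"]:
    loader = DataLoader(CaptionDataset(stage), batch_size=2)
    for feats, tokens in loader:
        logits = model(feats.mean(dim=1))          # pool frames, predict token
        loss = torch.nn.functional.cross_entropy(  # toy first-token objective
            logits, tokens[:, 0])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```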
1 code implementation • 30 Nov 2023 • Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, Rawal Khirodkar, Devansh Kukreja, Kevin J Liang, Jia-Wei Liu, Sagnik Majumder, Yongsen Mao, Miguel Martin, Effrosyni Mavroudi, Tushar Nagarajan, Francesco Ragusa, Santhosh Kumar Ramakrishnan, Luigi Seminara, Arjun Somayazulu, Yale Song, Shan Su, Zihui Xue, Edward Zhang, Jinxu Zhang, Angela Castillo, Changan Chen, Xinzhu Fu, Ryosuke Furuta, Cristina Gonzalez, Prince Gupta, Jiabo Hu, Yifei HUANG, Yiming Huang, Weslie Khoo, Anush Kumar, Robert Kuo, Sach Lakhavani, Miao Liu, Mi Luo, Zhengyi Luo, Brighid Meredith, Austin Miller, Oluwatumininu Oguntola, Xiaqing Pan, Penny Peng, Shraman Pramanick, Merey Ramazanova, Fiona Ryan, Wei Shan, Kiran Somasundaram, Chenan Song, Audrey Southerland, Masatoshi Tateno, Huiyu Wang, Yuchen Wang, Takuma Yagi, Mingfei Yan, Xitong Yang, Zecheng Yu, Shengxin Cindy Zha, Chen Zhao, Ziwei Zhao, Zhifan Zhu, Jeff Zhuo, Pablo Arbelaez, Gedas Bertasius, David Crandall, Dima Damen, Jakob Engel, Giovanni Maria Farinella, Antonino Furnari, Bernard Ghanem, Judy Hoffman, C. V. Jawahar, Richard Newcombe, Hyun Soo Park, James M. Rehg, Yoichi Sato, Manolis Savva, Jianbo Shi, Mike Zheng Shou, Michael Wray
We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge.
1 code implementation • 8 Oct 2023 • Zuxuan Wu, Zejia Weng, Wujian Peng, Xitong Yang, Ang Li, Larry S. Davis, Yu-Gang Jiang
Despite significant results achieved by Contrastive Language-Image Pretraining (CLIP) in zero-shot image recognition, limited effort has been made to explore its potential for zero-shot video recognition.
no code implementations • CVPR 2023 • Bo He, Xitong Yang, Hanyu Wang, Zuxuan Wu, Hao Chen, Shuaiyi Huang, Yixuan Ren, Ser-Nam Lim, Abhinav Shrivastava
Implicit neural representations (INR) have gained increasing attention in representing 3D scenes and images, and have been recently applied to encode videos (e.g., NeRV, E-NeRV).
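For readers unfamiliar with the idea, here is a toy NeRV-style sketch: a small network is overfit to one video so that frames can be recovered from frame indices, i.e., the video is stored in the network's weights. The sizes and layers are illustrative assumptions, not the actual NeRV architecture:

```python
import torch
import torch.nn as nn

class TinyNeRV(nn.Module):
    def __init__(self, num_freqs=8, out_hw=(32, 32)):
        super().__init__()
        self.num_freqs = num_freqs
        self.out_hw = out_hw
        self.mlp = nn.Sequential(
            nn.Linear(2 * num_freqs, 256), nn.GELU(),
            nn.Linear(256, 3 * out_hw[0] * out_hw[1]))

    def forward(self, t):  # t: (B,) normalized frame indices in [0, 1]
        freqs = 2.0 ** torch.arange(self.num_freqs, device=t.device)
        ang = t[:, None] * freqs * torch.pi       # positional encoding of time
        emb = torch.cat([torch.sin(ang), torch.cos(ang)], dim=-1)
        rgb = self.mlp(emb).view(-1, 3, *self.out_hw)
        return torch.sigmoid(rgb)                 # pixels in [0, 1]

# Overfit the network to a short toy video: frames become queryable by index.
video = torch.rand(16, 3, 32, 32)
t = torch.linspace(0, 1, 16)
model = TinyNeRV()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(10):
    loss = torch.nn.functional.mse_loss(model(t), video)
    opt.zero_grad()
    loss.backward()
    opt.step()
```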
no code implementations • 16 Feb 2023 • Raghav Goyal, Effrosyni Mavroudi, Xitong Yang, Sainbayar Sukhbaatar, Leonid Sigal, Matt Feiszli, Lorenzo Torresani, Du Tran
Video understanding tasks take many forms, from action detection to visual query localization and spatio-temporal grounding of sentences.
1 code implementation • 1 Feb 2023 • Zejia Weng, Xitong Yang, Ang Li, Zuxuan Wu, Yu-Gang Jiang
Our framework extends CLIP with minimal modifications to model spatial-temporal relationships in videos, making it a specialized video classifier, while striving for generalization.
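One minimal way to realize this idea, sketched below under the assumption of a frozen per-frame encoder and simple temporal pooling; the backbone here is a stand-in, not the actual CLIP model or the paper's architecture:

```python
import torch
import torch.nn as nn

image_encoder = nn.Linear(3 * 224 * 224, 512)   # stand-in for CLIP's ViT
text_embeddings = torch.randn(400, 512)          # one embedding per class name

def classify_video(video):                       # video: (T, 3, 224, 224)
    frames = video.flatten(1)                    # flatten each frame
    feats = image_encoder(frames)                # (T, 512) per-frame features
    video_feat = feats.mean(dim=0)               # temporal mean pooling: the
                                                 # simplest temporal modeling
    video_feat = video_feat / video_feat.norm()
    text = text_embeddings / text_embeddings.norm(dim=-1, keepdim=True)
    return (text @ video_feat).softmax(dim=-1)   # similarity-based class probs

probs = classify_video(torch.randn(8, 3, 224, 224))
```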
no code implementations • CVPR 2023 • Shiyi Lan, Xitong Yang, Zhiding Yu, Zuxuan Wu, Jose M. Alvarez, Anima Anandkumar
We propose Mask Auto-Labeler (MAL), a high-quality Transformer-based mask auto-labeling framework for instance segmentation using only box annotations.
no code implementations • CVPR 2023 • Xitong Yang, Fu-Jen Chu, Matt Feiszli, Raghav Goyal, Lorenzo Torresani, Du Tran
In this paper, we propose to study these problems in a joint framework for long video understanding.
1 code implementation • CVPR 2022 • Bo He, Xitong Yang, Le Kang, Zhiyu Cheng, Xin Zhou, Abhinav Shrivastava
Without the boundary information of action segments, existing methods mostly rely on multiple instance learning (MIL), where the predictions of unlabeled instances (i.e., video snippets) are supervised by classifying labeled bags (i.e., untrimmed videos).
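A bare-bones illustration of this MIL setup, assuming top-k average pooling of snippet scores into a video-level prediction (a common choice in this literature, not necessarily the paper's exact pooling):

```python
import torch

def video_level_loss(snippet_logits, video_label, k=4):
    """snippet_logits: (T, C) per-snippet scores; video_label: (C,) multi-hot."""
    topk = snippet_logits.topk(k, dim=0).values   # (k, C) most confident snippets
    video_logits = topk.mean(dim=0)               # bag-level score per class
    return torch.nn.functional.binary_cross_entropy_with_logits(
        video_logits, video_label)

label = torch.zeros(20)
label[3] = 1.0                                    # the untrimmed video contains class 3
loss = video_level_loss(torch.randn(100, 20), label)
```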
1 code implementation • 23 Nov 2021 • Junke Wang, Xitong Yang, Hengduo Li, Li Liu, Zuxuan Wu, Yu-Gang Jiang
Video transformers have achieved impressive results on major video recognition benchmarks; however, they suffer from high computational cost.
1 code implementation • 22 Nov 2021 • Zejia Weng, Xitong Yang, Ang Li, Zuxuan Wu, Yu-Gang Jiang
Surprisingly, we show that Vision Transformers perform significantly worse than Convolutional Neural Networks when only a small set of labeled data is available.
no code implementations • CVPR 2021 • Xitong Yang, Haoqi Fan, Lorenzo Torresani, Larry Davis, Heng Wang
The standard way of training video models entails sampling at each iteration a single clip from a video and optimizing the clip prediction with respect to the video-level label.
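That standard paradigm can be sketched as follows; the feature shapes and model are toy assumptions for illustration only:

```python
import torch

def sample_clip(video, clip_len=16):
    # Pick one random contiguous clip from the video per iteration.
    start = torch.randint(0, video.shape[0] - clip_len + 1, (1,)).item()
    return video[start:start + clip_len]

model = torch.nn.Linear(512, 400)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

video = torch.randn(300, 512)     # toy per-frame features for one video
label = torch.tensor(7)           # single video-level label

clip = sample_clip(video)
logits = model(clip.mean(dim=0))  # clip prediction supervised by video label
loss = torch.nn.functional.cross_entropy(logits[None], label[None])
opt.zero_grad()
loss.backward()
opt.step()
```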
no code implementations • 15 Dec 2020 • Bo He, Xitong Yang, Zuxuan Wu, Hao Chen, Ser-Nam Lim, Abhinav Shrivastava
To this end, we introduce Global Temporal Attention (GTA), which performs global temporal attention on top of spatial attention in a decoupled manner.
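A simplified sketch of decoupled spatial-then-temporal attention, assuming standard multi-head attention blocks; this illustrates the decoupling, not GTA's exact formulation:

```python
import torch
import torch.nn as nn

class DecoupledAttention(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                      # x: (B, T, N, D) frame tokens
        b, t, n, d = x.shape
        s = x.reshape(b * t, n, d)             # attend over space per frame
        s, _ = self.spatial(s, s, s)
        s = s.reshape(b, t, n, d).permute(0, 2, 1, 3).reshape(b * n, t, d)
        s, _ = self.temporal(s, s, s)          # attend over time per location
        return s.reshape(b, n, t, d).permute(0, 2, 1, 3)

out = DecoupledAttention()(torch.randn(2, 8, 49, 64))
```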
no code implementations • 20 Jul 2020 • Xitong Yang, Xiaodong Yang, Sifei Liu, Deqing Sun, Larry Davis, Jan Kautz
Thus, the motion features at higher levels are trained to gradually capture semantic dynamics and become more discriminative for action recognition.
2 code implementations • ECCV 2020 • Ahmed Taha, Xitong Yang, Abhinav Shrivastava, Larry Davis
Compared to classification networks, attention visualization for retrieval networks has hardly been studied.
no code implementations • ICCV 2019 • Wei Luo, Xitong Yang, Xianjie Mo, Yuheng Lu, Larry S. Davis, Jun Li, Jian Yang, Ser-Nam Lim
Recognizing objects from subcategories with very subtle differences remains a challenging task due to the large intra-class and small inter-class variation.
Ranked #19 on Fine-Grained Image Classification on NABirds (using extra training data)
Fine-Grained Image Classification, Fine-Grained Visual Categorization
1 code implementation • CVPR 2019 • Xitong Yang, Xiaodong Yang, Ming-Yu Liu, Fanyi Xiao, Larry Davis, Jan Kautz
In this paper, we propose the Spatio-TEmporal Progressive (STEP) action detector, a progressive learning framework for spatio-temporal action detection in videos.
Ranked #7 on Action Detection on UCF101-24
no code implementations • 23 Jan 2019 • Ahmed Taha, Yi-Ting Chen, Xitong Yang, Teruhisa Misu, Larry Davis
We cast visual retrieval as a regression problem by posing the triplet loss as a regression loss.
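One way to read this framing, as a hedged sketch: the triplet hinge residual is treated as a regression error driven toward zero. This illustrates the framing only, not the paper's implementation:

```python
import torch

def triplet_as_regression(anchor, positive, negative, margin=0.2):
    d_pos = (anchor - positive).pow(2).sum(dim=-1)   # anchor-positive distance
    d_neg = (anchor - negative).pow(2).sum(dim=-1)   # anchor-negative distance
    residual = torch.relu(d_pos - d_neg + margin)    # hinge residual, target 0
    return torch.nn.functional.mse_loss(residual, torch.zeros_like(residual))

loss = triplet_as_regression(
    torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 128))
```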
no code implementations • 16 Jun 2018 • Ahmed Taha, Moustafa Meshry, Xitong Yang, Yi-Ting Chen, Larry Davis
The effectiveness of the self-supervised pre-trained weights is validated on the action recognition task.
no code implementations • 8 May 2018 • Zheng Xu, Xitong Yang, Xue Li, Xiaoshuai Sun
We propose a novel deep neural network architecture for the challenging problem of single image dehazing, which aims to recover the clear image from a degraded hazy image.
1 code implementation • 10 Jul 2017 • Wei Qian, Wending Li, Yasuhiro Sogawa, Ryohei Fujimaki, Xitong Yang, Ji Liu
Sparsity learning with known grouping structure has received considerable attention due to its wide range of applications in modern high-dimensional data analysis (a minimal example is sketched below).
Human Activity Recognition, Vocal Bursts Intensity Prediction
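As a refresher on the known-groups setting, here is a minimal group-lasso proximal step; it is a textbook illustration under standard assumptions, not the paper's solver:

```python
import numpy as np

def group_lasso_prox(w, groups, lam):
    """Proximal operator of lam * sum_g ||w_g||_2 for disjoint index groups."""
    out = w.copy()
    for g in groups:
        norm = np.linalg.norm(w[g])
        # Shrink the whole group toward zero; small groups vanish entirely.
        scale = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
        out[g] = scale * w[g]
    return out

w = np.random.randn(6)
sparse_w = group_lasso_prox(w, groups=[[0, 1, 2], [3, 4, 5]], lam=1.0)
```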
no code implementations • CVPR 2017 • Xitong Yang, Palghat Ramesh, Radha Chitta, Sriganesh Madhvanath, Edgar A. Bernal, Jiebo Luo
In recent years, Deep Learning has been successfully applied to multimodal learning problems, with the aim of learning useful joint representations in data fusion applications.
no code implementations • ICCV 2015 • Yuncheng Li, Xitong Yang, Jiebo Luo
In this paper, we propose to exploit video visual content to improve video entity linking.