no code implementations • 20 Feb 2024 • Md Mohaiminul Islam, Ngan Ho, Xitong Yang, Tushar Nagarajan, Lorenzo Torresani, Gedas Bertasius
We utilize a curriculum learning training scheme to learn the hierarchical structure of videos, starting from clip-level captions describing atomic actions, then focusing on segment-level descriptions, and concluding with generating summaries for hour-long videos.
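A minimal sketch of such a three-stage curriculum, assuming a generic captioning model trained on progressively longer temporal granularities; every name below (CaptionDataset, the stand-in model, the toy shapes) is a hypothetical placeholder, not the authors' code:

```python
# Hedged sketch: curriculum over clip -> segment -> video granularity.
import torch
from torch.utils.data import DataLoader, Dataset

class CaptionDataset(Dataset):
    """Placeholder dataset yielding (video_features, caption_tokens) pairs."""
    def __init__(self, granularity: str):
        self.granularity = granularity  # "clip", "segment", or "video"
    def __len__(self):
        return 8
    def __getitem__(self, idx):
        # Longer granularities use more frames per sample (toy numbers).
        frames = {"clip": 16, "segment": 64, "video": 256}[self.granularity]
        return torch.randn(frames, 512), torch.randint(0, 1000, (20,))

model = torch.nn.Linear(512, 1000)  # stand-in for a real captioner
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Curriculum: clip-level captions first, then segments, then full videos.
for stage in ["clip", "segment", "video"]:
    loader = DataLoader(CaptionDataset(stage), batch_size=2)
    for feats, tokens in loader:
        logits = model(feats.mean(dim=1))          # pool frames, predict token
        loss = torch.nn.functional.cross_entropy(  # toy first-token objective
            logits, tokens[:, 0])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```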
1 code implementation • 30 Nov 2023 • Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, Rawal Khirodkar, Devansh Kukreja, Kevin J Liang, Jia-Wei Liu, Sagnik Majumder, Yongsen Mao, Miguel Martin, Effrosyni Mavroudi, Tushar Nagarajan, Francesco Ragusa, Santhosh Kumar Ramakrishnan, Luigi Seminara, Arjun Somayazulu, Yale Song, Shan Su, Zihui Xue, Edward Zhang, Jinxu Zhang, Angela Castillo, Changan Chen, Xinzhu Fu, Ryosuke Furuta, Cristina Gonzalez, Prince Gupta, Jiabo Hu, Yifei HUANG, Yiming Huang, Weslie Khoo, Anush Kumar, Robert Kuo, Sach Lakhavani, Miao Liu, Mi Luo, Zhengyi Luo, Brighid Meredith, Austin Miller, Oluwatumininu Oguntola, Xiaqing Pan, Penny Peng, Shraman Pramanick, Merey Ramazanova, Fiona Ryan, Wei Shan, Kiran Somasundaram, Chenan Song, Audrey Southerland, Masatoshi Tateno, Huiyu Wang, Yuchen Wang, Takuma Yagi, Mingfei Yan, Xitong Yang, Zecheng Yu, Shengxin Cindy Zha, Chen Zhao, Ziwei Zhao, Zhifan Zhu, Jeff Zhuo, Pablo Arbelaez, Gedas Bertasius, David Crandall, Dima Damen, Jakob Engel, Giovanni Maria Farinella, Antonino Furnari, Bernard Ghanem, Judy Hoffman, C. V. Jawahar, Richard Newcombe, Hyun Soo Park, James M. Rehg, Yoichi Sato, Manolis Savva, Jianbo Shi, Mike Zheng Shou, Michael Wray
We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge.
1 code implementation • 8 Oct 2023 • Zuxuan Wu, Zejia Weng, Wujian Peng, Xitong Yang, Ang Li, Larry S. Davis, Yu-Gang Jiang
Despite significant results achieved by Contrastive Language-Image Pretraining (CLIP) in zero-shot image recognition, limited effort has been made to explore its potential for zero-shot video recognition.
no code implementations • CVPR 2023 • Bo He, Xitong Yang, Hanyu Wang, Zuxuan Wu, Hao Chen, Shuaiyi Huang, Yixuan Ren, Ser-Nam Lim, Abhinav Shrivastava
Implicit neural representations (INR) have gained increasing attention in representing 3D scenes and images, and have been recently applied to encode videos (e.g., NeRV, E-NeRV).
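For readers unfamiliar with the idea, here is a toy NeRV-style sketch: a small network is overfit to one video so that frames can be recovered from frame indices, i.e., the video is stored in the network's weights. The sizes and layers are illustrative assumptions, not the actual NeRV architecture:

```python
import torch
import torch.nn as nn

class TinyNeRV(nn.Module):
    def __init__(self, num_freqs=8, out_hw=(32, 32)):
        super().__init__()
        self.num_freqs = num_freqs
        self.out_hw = out_hw
        self.mlp = nn.Sequential(
            nn.Linear(2 * num_freqs, 256), nn.GELU(),
            nn.Linear(256, 3 * out_hw[0] * out_hw[1]))

    def forward(self, t):  # t: (B,) normalized frame indices in [0, 1]
        freqs = 2.0 ** torch.arange(self.num_freqs, device=t.device)
        ang = t[:, None] * freqs * torch.pi       # positional encoding of time
        emb = torch.cat([torch.sin(ang), torch.cos(ang)], dim=-1)
        rgb = self.mlp(emb).view(-1, 3, *self.out_hw)
        return torch.sigmoid(rgb)                 # pixels in [0, 1]

# Overfit the network to a short toy video: frames become queryable by index.
video = torch.rand(16, 3, 32, 32)
t = torch.linspace(0, 1, 16)
model = TinyNeRV()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(10):
    loss = torch.nn.functional.mse_loss(model(t), video)
    opt.zero_grad()
    loss.backward()
    opt.step()
```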
no code implementations • 16 Feb 2023 • Raghav Goyal, Effrosyni Mavroudi, Xitong Yang, Sainbayar Sukhbaatar, Leonid Sigal, Matt Feiszli, Lorenzo Torresani, Du Tran
Video understanding tasks take many forms, from action detection to visual query localization and spatio-temporal grounding of sentences.
1 code implementation • 1 Feb 2023 • Zejia Weng, Xitong Yang, Ang Li, Zuxuan Wu, Yu-Gang Jiang
Our framework extends CLIP with minimal modifications to model spatial-temporal relationships in videos, making it a specialized video classifier, while striving for generalization.
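One minimal way to realize this idea, sketched below under the assumption of a frozen per-frame encoder and simple temporal pooling; the backbone here is a stand-in, not the actual CLIP model or the paper's architecture:

```python
import torch
import torch.nn as nn

image_encoder = nn.Linear(3 * 224 * 224, 512)   # stand-in for CLIP's ViT
text_embeddings = torch.randn(400, 512)          # one embedding per class name

def classify_video(video):                       # video: (T, 3, 224, 224)
    frames = video.flatten(1)                    # flatten each frame
    feats = image_encoder(frames)                # (T, 512) per-frame features
    video_feat = feats.mean(dim=0)               # temporal mean pooling: the
                                                 # simplest temporal modeling
    video_feat = video_feat / video_feat.norm()
    text = text_embeddings / text_embeddings.norm(dim=-1, keepdim=True)
    return (text @ video_feat).softmax(dim=-1)   # similarity-based class probs

probs = classify_video(torch.randn(8, 3, 224, 224))
```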
no code implementations • CVPR 2023 • Shiyi Lan, Xitong Yang, Zhiding Yu, Zuxuan Wu, Jose M. Alvarez, Anima Anandkumar
We propose Mask Auto-Labeler (MAL), a high-quality Transformer-based mask auto-labeling framework for instance segmentation using only box annotations.
no code implementations • CVPR 2023 • Xitong Yang, Fu-Jen Chu, Matt Feiszli, Raghav Goyal, Lorenzo Torresani, Du Tran
In this paper, we propose to study these problems in a joint framework for long video understanding.
1 code implementation • CVPR 2022 • Bo He, Xitong Yang, Le Kang, Zhiyu Cheng, Xin Zhou, Abhinav Shrivastava
Without the boundary information of action segments, existing methods mostly rely on multiple instance learning (MIL), where the predictions of unlabeled instances (i.e., video snippets) are supervised by classifying labeled bags (i.e., untrimmed videos).
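A bare-bones illustration of this MIL setup, assuming top-k average pooling of snippet scores into a video-level prediction (a common choice in this literature, not necessarily the paper's exact pooling):

```python
import torch

def video_level_loss(snippet_logits, video_label, k=4):
    """snippet_logits: (T, C) per-snippet scores; video_label: (C,) multi-hot."""
    topk = snippet_logits.topk(k, dim=0).values   # (k, C) most confident snippets
    video_logits = topk.mean(dim=0)               # bag-level score per class
    return torch.nn.functional.binary_cross_entropy_with_logits(
        video_logits, video_label)

label = torch.zeros(20)
label[3] = 1.0                                    # the untrimmed video contains class 3
loss = video_level_loss(torch.randn(100, 20), label)
```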
1 code implementation • 23 Nov 2021 • Junke Wang, Xitong Yang, Hengduo Li, Li Liu, Zuxuan Wu, Yu-Gang Jiang
Video transformers have achieved impressive results on major video recognition benchmarks; however, they suffer from high computational cost.
1 code implementation • 22 Nov 2021 • Zejia Weng, Xitong Yang, Ang Li, Zuxuan Wu, Yu-Gang Jiang
Surprisingly, we show that Vision Transformers perform significantly worse than Convolutional Neural Networks when only a small set of labeled data is available.
no code implementations • CVPR 2021 • Xitong Yang, Haoqi Fan, Lorenzo Torresani, Larry Davis, Heng Wang
The standard way of training video models entails sampling at each iteration a single clip from a video and optimizing the clip prediction with respect to the video-level label.
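That standard paradigm can be sketched as follows; the feature shapes and model are toy assumptions for illustration only:

```python
import torch

def sample_clip(video, clip_len=16):
    # Pick one random contiguous clip from the video per iteration.
    start = torch.randint(0, video.shape[0] - clip_len + 1, (1,)).item()
    return video[start:start + clip_len]

model = torch.nn.Linear(512, 400)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

video = torch.randn(300, 512)     # toy per-frame features for one video
label = torch.tensor(7)           # single video-level label

clip = sample_clip(video)
logits = model(clip.mean(dim=0))  # clip prediction supervised by video label
loss = torch.nn.functional.cross_entropy(logits[None], label[None])
opt.zero_grad()
loss.backward()
opt.step()
```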
no code implementations • 15 Dec 2020 • Bo He, Xitong Yang, Zuxuan Wu, Hao Chen, Ser-Nam Lim, Abhinav Shrivastava
To this end, we introduce Global Temporal Attention (GTA), which performs global temporal attention on top of spatial attention in a decoupled manner.
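A simplified sketch of decoupled spatial-then-temporal attention, assuming standard multi-head attention blocks; this illustrates the decoupling, not GTA's exact formulation:

```python
import torch
import torch.nn as nn

class DecoupledAttention(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                      # x: (B, T, N, D) frame tokens
        b, t, n, d = x.shape
        s = x.reshape(b * t, n, d)             # attend over space per frame
        s, _ = self.spatial(s, s, s)
        s = s.reshape(b, t, n, d).permute(0, 2, 1, 3).reshape(b * n, t, d)
        s, _ = self.temporal(s, s, s)          # attend over time per location
        return s.reshape(b, n, t, d).permute(0, 2, 1, 3)

out = DecoupledAttention()(torch.randn(2, 8, 49, 64))
```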
no code implementations • 20 Jul 2020 • Xitong Yang, Xiaodong Yang, Sifei Liu, Deqing Sun, Larry Davis, Jan Kautz
Thus, the motion features at higher levels are trained to gradually capture semantic dynamics and become more discriminative for action recognition.
2 code implementations • ECCV 2020 • Ahmed Taha, Xitong Yang, Abhinav Shrivastava, Larry Davis
Compared to classification networks, attention visualization for retrieval networks has hardly been studied.
no code implementations • ICCV 2019 • Wei Luo, Xitong Yang, Xianjie Mo, Yuheng Lu, Larry S. Davis, Jun Li, Jian Yang, Ser-Nam Lim
Recognizing objects from subcategories with very subtle differences remains a challenging task due to the large intra-class and small inter-class variation.
Ranked #19 on Fine-Grained Image Classification on NABirds (using extra training data)
Fine-Grained Image Classification, Fine-Grained Visual Categorization
1 code implementation • CVPR 2019 • Xitong Yang, Xiaodong Yang, Ming-Yu Liu, Fanyi Xiao, Larry Davis, Jan Kautz
In this paper, we propose the Spatio-TEmporal Progressive (STEP) action detector, a progressive learning framework for spatio-temporal action detection in videos.
Ranked #7 on Action Detection on UCF101-24
no code implementations • 23 Jan 2019 • Ahmed Taha, Yi-Ting Chen, Xitong Yang, Teruhisa Misu, Larry Davis
We cast visual retrieval as a regression problem by posing the triplet loss as a regression loss.
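One way to read this framing, as a hedged sketch: the triplet hinge residual is treated as a regression error driven toward zero. This illustrates the framing only, not the paper's implementation:

```python
import torch

def triplet_as_regression(anchor, positive, negative, margin=0.2):
    d_pos = (anchor - positive).pow(2).sum(dim=-1)   # anchor-positive distance
    d_neg = (anchor - negative).pow(2).sum(dim=-1)   # anchor-negative distance
    residual = torch.relu(d_pos - d_neg + margin)    # hinge residual, target 0
    return torch.nn.functional.mse_loss(residual, torch.zeros_like(residual))

loss = triplet_as_regression(
    torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 128))
```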
no code implementations • 16 Jun 2018 • Ahmed Taha, Moustafa Meshry, Xitong Yang, Yi-Ting Chen, Larry Davis
The effectiveness of the self-supervised pre-trained weights is validated on the action recognition task.
no code implementations • 8 May 2018 • Zheng Xu, Xitong Yang, Xue Li, Xiaoshuai Sun
We propose a novel deep neural network architecture for the challenging problem of single image dehazing, which aims to recover the clear image from a degraded hazy image.
1 code implementation • 10 Jul 2017 • Wei Qian, Wending Li, Yasuhiro Sogawa, Ryohei Fujimaki, Xitong Yang, Ji Liu
Sparsity learning with known grouping structure has received considerable attention due to its wide range of applications in modern high-dimensional data analysis (a minimal example is sketched below).
Human Activity Recognition, Vocal Bursts Intensity Prediction
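As a refresher on the known-groups setting, here is a minimal group-lasso proximal step; it is a textbook illustration under standard assumptions, not the paper's solver:

```python
import numpy as np

def group_lasso_prox(w, groups, lam):
    """Proximal operator of lam * sum_g ||w_g||_2 for disjoint index groups."""
    out = w.copy()
    for g in groups:
        norm = np.linalg.norm(w[g])
        # Shrink the whole group toward zero; small groups vanish entirely.
        scale = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
        out[g] = scale * w[g]
    return out

w = np.random.randn(6)
sparse_w = group_lasso_prox(w, groups=[[0, 1, 2], [3, 4, 5]], lam=1.0)
```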
no code implementations • CVPR 2017 • Xitong Yang, Palghat Ramesh, Radha Chitta, Sriganesh Madhvanath, Edgar A. Bernal, Jiebo Luo
In recent years, Deep Learning has been successfully applied to multimodal learning problems, with the aim of learning useful joint representations in data fusion applications.
no code implementations • ICCV 2015 • Yuncheng Li, Xitong Yang, Jiebo Luo
In this paper, we propose to exploit video visual content to improve video entity linking.