1 code implementation • 24 Mar 2024 • Yifei HUANG, Guo Chen, Jilan Xu, Mingfang Zhang, Lijin Yang, Baoqi Pei, Hongjie Zhang, Lu Dong, Yali Wang, LiMin Wang, Yu Qiao
Along with the videos we record high-quality gaze data and provide detailed multimodal annotations, formulating a playground for modeling the human ability to bridge asynchronous procedural actions from different viewpoints.
2 code implementations • 22 Mar 2024 • Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Jilan Xu, Zun Wang, Yansong Shi, Tianxiang Jiang, Songze Li, Hongjie Zhang, Yifei HUANG, Yu Qiao, Yali Wang, LiMin Wang
We introduce InternVideo2, a new video foundation model (ViFM) that achieves the state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue.
Ranked #1 on Audio Classification on ESC-50 (using extra training data)
no code implementations • 8 Dec 2023 • Hongjie Zhang, Yi Liu, Lu Dong, Yifei HUANG, Zhen-Hua Ling, Yali Wang, LiMin Wang, Yu Qiao
While several long-form VideoQA datasets have been introduced, the length of both videos used to curate questions and sub-clips of clues leveraged to answer those questions have not yet reached the criteria for genuine long-form video understanding.
no code implementations • 22 Mar 2023 • Hongjie Zhang
Multi-view feature extraction is an efficient approach for alleviating the issue of dimensionality in highdimensional multi-view data.
no code implementations • 8 Feb 2023 • Hongjie Zhang
Multi-view feature extraction is an efficient approach for alleviating the issue of dimensionality in highdimensional multi-view data.
2 code implementations • 6 Dec 2022 • Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, LiMin Wang, Yu Qiao
Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications.
Ranked #1 on Action Recognition on Something-Something V1 (using extra training data)
no code implementations • 28 Nov 2022 • Hongjie Zhang
Inspired by the redundancy of neural networks, we propose a lightweight parallel training framework based on neural network compression, AcceRL, to accelerate the policy learning while ensuring policy quality.
2 code implementations • 17 Nov 2022 • Guo Chen, Sen Xing, Zhe Chen, Yi Wang, Kunchang Li, Yizhuo Li, Yi Liu, Jiahao Wang, Yin-Dong Zheng, Bingkun Huang, Zhiyu Zhao, Junting Pan, Yifei HUANG, Zun Wang, Jiashuo Yu, Yinan He, Hongjie Zhang, Tong Lu, Yali Wang, LiMin Wang, Yu Qiao
In this report, we present our champion solutions to five tracks at Ego4D challenge.
Ranked #1 on State Change Object Detection on Ego4D
no code implementations • 11 Jan 2022 • Hongjie Zhang
In this study, we propose a feature extraction framework based on contrastive learning with adaptive positive and negative samples (CL-FEFA) that is suitable for unsupervised, supervised, and semi-supervised single-view feature extraction.
no code implementations • 25 Jan 2021 • Hongjie Zhang
In this study, we proposed a unified framework based on a new perspective of contrastive learning (CL) that is suitable for both unsupervised and supervised feature extraction.
no code implementations • ECCV 2020 • Hongjie Zhang, Ang Li, Jie Guo, Yanwen Guo
We propose the OpenHybrid framework, which is composed of an encoder to encode the input data into a joint embedding space, a classifier to classify samples to inlier classes, and a flow-based density estimator to detect whether a sample belongs to the unknown category.
no code implementations • 6 Mar 2017 • Jingwu He, Linbo Wang, Wenzhe Zhou, Hongjie Zhang, Xiufen Cui, Yanwen Guo
Unlike previous efforts devoted to photo quality assessment which mainly rely on 2D image features, we show in this paper combining 2D image features extracted from images with 3D geometric features computed on the 3D models can result in more reliable evaluation of viewpoint quality.