Search Results for author: Yu Qiao

Found 414 papers, 249 papers with code

Mining Inter-Video Proposal Relations for Video Object Detection

1 code implementation • ECCV 2020 • Mingfei Han, Yali Wang, Xiaojun Chang, Yu Qiao

Recent studies have shown that aggregating context information from proposals in different frames can clearly enhance the performance of video object detection.

Ranked #12 on Video Object Detection on ImageNet VID

Object object-detection +3

Automated Classification of Written Proficiency Levels on the CEFR-Scale through Complexity Contours and RNNs

no code implementations • EACL (BEA) 2021 • Elma Kerz, Daniel Wiechmann, Yu Qiao, Emma Tseng, Marcus Ströbel

The key to the present paper is the combined use of what we refer to as ‘complexity contours’, a series of measurements of indices of L2 proficiency obtained by a computational tool that implements a sliding window technique, and recurrent neural network (RNN) classifiers that adequately capture the sequential information in those contours.

A Language-Based Approach to Fake News Detection Through Interpretable Features and BRNN

no code implementations • RDSM (COLING) 2020 • Yu Qiao, Daniel Wiechmann, Elma Kerz

We demonstrate that our approach is promising as it achieves similar results on these two datasets as the best performing black box models reported in the literature.

Explainable Models Fake News Detection +1

Language that Captivates the Audience: Predicting Affective Ratings of TED Talks in a Multi-Label Classification Task

no code implementations • EACL (WASSA) 2021 • Elma Kerz, Yu Qiao, Daniel Wiechmann

The aim of the paper is twofold: (1) to automatically predict the ratings assigned by viewers to 14 categories available for TED talks in a multi-label classification task and (2) to determine what types of features drive classification accuracy for each of the categories.

Multi-Label Classification

SPADE: A Big Five-Mturk Dataset of Argumentative Speech Enriched with Socio-Demographics for Personality Detection

1 code implementation • LREC 2022 • Elma Kerz, Yu Qiao, Sourabh Zanwar, Daniel Wiechmann

In recent years, there has been increasing interest in automatic personality detection based on language.

The Best of Both Worlds: Combining Engineered Features with Transformers for Improved Mental Health Prediction from Reddit Posts

no code implementations • SMM4H (COLING) 2022 • Sourabh Zanwar, Daniel Wiechmann, Yu Qiao, Elma Kerz

In recent years, there has been increasing interest in the application of natural language processing and machine learning techniques to the detection of mental health conditions (MHC) based on social media data.

RBF-Softmax: Learning Deep Representative Prototypes with Radial Basis Function Softmax

1 code implementation • ECCV 2020 • Xiao Zhang, Rui Zhao, Yu Qiao, Hongsheng Li

To address this problem, this paper introduces novel Radial Basis Function (RBF) distances to replace the commonly used inner products in the softmax loss function, such that it can adaptively assign losses to regularize the intra-class and inter-class distances by reshaping the relative differences, thus creating more representative class prototypes to improve optimization.

FANG-COVID: A New Large-Scale Benchmark Dataset for Fake News Detection in German

1 code implementation • EMNLP (FEVER) 2021 • Justus Mattern, Yu Qiao, Elma Kerz, Daniel Wiechmann, Markus Strohmaier

As the world continues to fight the COVID-19 pandemic, it is simultaneously fighting an ‘infodemic’ – a flood of disinformation and spread of conspiracy theories leading to health threats and the division of society.

Fake News Detection

MANTIS at SMM4H’2022: Pre-Trained Language Models Meet a Suite of Psycholinguistic Features for the Detection of Self-Reported Chronic Stress

no code implementations • SMM4H (COLING) 2022 • Sourabh Zanwar, Daniel Wiechmann, Yu Qiao, Elma Kerz

This paper describes our submission to Social Media Mining for Health (SMM4H) 2022 Shared Task 8, aimed at detecting self-reported chronic stress on Twitter.

Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers

1 code implementation • 9 May 2024 • Peng Gao, Le Zhuo, Ziyi Lin, Chris Liu, Junsong Chen, Ruoyi Du, Enze Xie, Xu Luo, Longtian Qiu, Yuhang Zhang, Chen Lin, Rongjie Huang, Shijie Geng, Renrui Zhang, Junlin Xi, Wenqi Shao, Zhengkai Jiang, Tianshuo Yang, Weicai Ye, He Tong, Jingwen He, Yu Qiao, Hongsheng Li

Sora unveils the potential of scaling Diffusion Transformer for generating photorealistic images and videos at arbitrary resolutions, aspect ratios, and durations, yet it still lacks sufficient implementation details.

430

Causal Evaluation of Language Models

1 code implementation • 1 May 2024 • Sirui Chen, Bo Peng, Meiqi Chen, Ruiqi Wang, Mengying Xu, Xingyu Zeng, Rui Zhao, Shengjie Zhao, Yu Qiao, Chaochao Lu

Recent advances in language models have expanded the horizons of artificial intelligence across various domains, sparking inquiries into their potential for causal reasoning.

Causal Discovery Causal Inference +1

Towards Real-world Video Face Restoration: A New Benchmark

no code implementations • 30 Apr 2024 • Ziyan Chen, Jingwen He, Xinqi Lin, Yu Qiao, Chao Dong

Blind face restoration (BFR) on images has progressed significantly over the last several years, while real-world video face restoration (VFR), which is more challenging because it involves more complex face motions such as changing gaze directions and facial orientations, remains unsolved.

Blind Face Restoration Image Quality Assessment +1

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

1 code implementation • 25 Apr 2024 • Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, Xiaoyi Dong, Hang Yan, Hewei Guo, Conghui He, Botian Shi, Zhenjiang Jin, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao, Jifeng Dai, Wenhai Wang

Compared to both open-source and proprietary models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results in 8 of 18 benchmarks.

Ranked #6 on Visual Question Answering on MM-Vet

4k Language Modelling +3

1,724

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

no code implementations • 24 Apr 2024 • Kaining Ying, Fanqing Meng, Jin Wang, Zhiqian Li, Han Lin, Yue Yang, Hao Zhang, Wenbo Zhang, Yuqi Lin, Shuo Liu, Jiayi Lei, Quanfeng Lu, Runjian Chen, Peng Xu, Renrui Zhang, Haozhe Zhang, Peng Gao, Yali Wang, Yu Qiao, Ping Luo, Kaipeng Zhang, Wenqi Shao

Large Vision-Language Models (LVLMs) show significant strides in general-purpose multimodal applications such as visual dialogue and embodied navigation.

FedCCL: Federated Dual-Clustered Feature Contrast Under Domain Heterogeneity

no code implementations • 14 Apr 2024 • Yu Qiao, Huy Q. Le, Mengchun Zhang, Apurba Adhikary, Chaoning Zhang, Choong Seon Hong

First, we employ clustering on the local representations of each client, aiming to capture intra-class information based on these local clusters at a high level of granularity.

Clustering Federated Learning +1

Logit Calibration and Feature Contrast for Robust Federated Learning on Non-IID Data

no code implementations • 10 Apr 2024 • Yu Qiao, Chaoning Zhang, Apurba Adhikary, Choong Seon Hong

Federated learning (FL) is a privacy-preserving distributed framework for collaborative model training on devices in edge networks.

Adversarial Robustness Federated Learning +1

InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD

2 code implementations • 9 Apr 2024 • Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Songyang Zhang, Haodong Duan, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Zhe Chen, Xinyue Zhang, Wei Li, Jingwen Li, Wenhai Wang, Kai Chen, Conghui He, Xingcheng Zhang, Jifeng Dai, Yu Qiao, Dahua Lin, Jiaqi Wang

The Large Vision-Language Model (LVLM) field has seen significant advancements, yet its progression has been hindered by challenges in comprehending fine-grained visual content due to limited resolution.

Ranked #12 on Visual Question Answering on MM-Vet

4k Language Modelling +1

1,703

DIBS: Enhancing Dense Video Captioning with Unlabeled Videos via Pseudo Boundary Enrichment and Online Refinement

no code implementations • 3 Apr 2024 • Hao Wu, Huabin Liu, Yu Qiao, Xiao Sun

We present Dive Into the BoundarieS (DIBS), a novel pretraining framework for dense video captioning (DVC) that focuses on improving the quality of the generated event captions and their associated pseudo event boundaries from unlabeled videos.

Dense Video Captioning

Linear Attention Sequence Parallelism

1 code implementation • 3 Apr 2024 • Weigao Sun, Zhen Qin, Dong Li, Xuyang Shen, Yu Qiao, Yiran Zhong

In this paper, we introduce Linear Attention Sequence Parallel (LASP), an efficient SP method tailored to linear attention-based language models.

VideoDistill: Language-aware Vision Distillation for Video Question Answering

no code implementations • 1 Apr 2024 • Bo Zou, Chao Yang, Yu Qiao, Chengbin Quan, Youjian Zhao

In this paper, inspired by human recognition and learning patterns, we propose VideoDistill, a framework with language-aware (i.e., goal-driven) behavior in both the vision perception and answer generation processes.

Answer Generation Question Answering +1

LLaMA-Excitor: General Instruction Tuning via Indirect Feature Interaction

no code implementations • 1 Apr 2024 • Bo Zou, Chao Yang, Yu Qiao, Chengbin Quan, Youjian Zhao

LLaMA-Excitor ensures a self-adaptive allocation of additional attention to input instructions, thus effectively preserving LLMs' pre-trained knowledge when fine-tuning LLMs on low-quality instruction-following datasets.

Image Captioning Instruction Following

DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model

1 code implementation • 31 Mar 2024 • Lirui Zhao, Yue Yang, Kaipeng Zhang, Wenqi Shao, Yuxin Zhang, Yu Qiao, Ping Luo, Rongrong Ji

Text-to-image (T2I) generative models have attracted significant attention and found extensive applications within and beyond academic research.

Language Modelling Large Language Model

Are We on the Right Way for Evaluating Large Vision-Language Models?

1 code implementation • 29 Mar 2024 • Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, Feng Zhao

We evaluate 16 leading LVLMs on MMStar to assess their multi-modal capabilities, and on 7 benchmarks with the proposed metrics to investigate their data leakage and actual multi-modal gain.

World Knowledge

105

Within the Dynamic Context: Inertia-aware 3D Human Modeling with Pose Sequence

no code implementations • 28 Mar 2024 • Yutong Chen, Yifan Zhan, Zhihang Zhong, Wei Wang, Xiao Sun, Yu Qiao, Yinqiang Zheng

Neural rendering techniques have significantly advanced 3D human body modeling.

Neural Rendering Quantization

RH20T-P: A Primitive-Level Robotic Dataset Towards Composable Generalization Agents

no code implementations • 28 Mar 2024 • Zeren Chen, Zhelun Shi, Xiaoya Lu, Lehan He, Sucheng Qian, Hao Shu Fang, Zhenfei Yin, Wanli Ouyang, Jing Shao, Yu Qiao, Cewu Lu, Lu Sheng

The ultimate goal of robotic learning is to acquire a comprehensive and generalizable robotic system capable of performing both seen skills within the training distribution and unseen skills in novel environments.

Motion Planning

InternLM2 Technical Report

1 code implementation • 26 Mar 2024 • Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, Xiaoyi Dong, Haodong Duan, Qi Fan, Zhaoye Fei, Yang Gao, Jiaye Ge, Chenya Gu, Yuzhe Gu, Tao Gui, Aijia Guo, Qipeng Guo, Conghui He, Yingfan Hu, Ting Huang, Tao Jiang, Penglong Jiao, Zhenjiang Jin, Zhikai Lei, Jiaxing Li, Jingwen Li, Linyang Li, Shuaibin Li, Wei Li, Yining Li, Hongwei Liu, Jiangning Liu, Jiawei Hong, Kaiwen Liu, Kuikun Liu, Xiaoran Liu, Chengqi Lv, Haijun Lv, Kai Lv, Li Ma, Runyuan Ma, Zerun Ma, Wenchang Ning, Linke Ouyang, Jiantao Qiu, Yuan Qu, FuKai Shang, Yunfan Shao, Demin Song, Zifan Song, Zhihao Sui, Peng Sun, Yu Sun, Huanze Tang, Bin Wang, Guoteng Wang, Jiaqi Wang, Jiayu Wang, Rui Wang, Yudong Wang, Ziyi Wang, Xingjian Wei, Qizhen Weng, Fan Wu, Yingtong Xiong, Chao Xu, Ruiliang Xu, Hang Yan, Yirong Yan, Xiaogui Yang, Haochen Ye, Huaiyuan Ying, JIA YU, Jing Yu, Yuhang Zang, Chuyu Zhang, Li Zhang, Pan Zhang, Peng Zhang, Ruijie Zhang, Shuo Zhang, Songyang Zhang, Wenjian Zhang, Wenwei Zhang, Xingcheng Zhang, Xinyue Zhang, Hui Zhao, Qian Zhao, Xiaomeng Zhao, Fengzhe Zhou, Zaida Zhou, Jingming Zhuo, Yicheng Zou, Xipeng Qiu, Yu Qiao, Dahua Lin

The evolution of Large Language Models (LLMs) like ChatGPT and GPT-4 has sparked discussions on the advent of Artificial General Intelligence (AGI).

Ranked #5 on Long-Context Understanding on Ada-LEval (BestAnswer)

4k Long-Context Understanding

5,261

Assessment of Multimodal Large Language Models in Alignment with Human Values

1 code implementation • 26 Mar 2024 • Zhelun Shi, Zhipin Wang, Hongxing Fan, Zaibin Zhang, Lijun Li, Yongting Zhang, Zhenfei Yin, Lu Sheng, Yu Qiao, Jing Shao

Large Language Models (LLMs) aim to serve as versatile assistants aligned with human values, as defined by the principles of being helpful, honest, and harmless (hhh).

267

EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World

1 code implementation • 24 Mar 2024 • Yifei HUANG, Guo Chen, Jilan Xu, Mingfang Zhang, Lijin Yang, Baoqi Pei, Hongjie Zhang, Lu Dong, Yali Wang, LiMin Wang, Yu Qiao

Along with the videos, we record high-quality gaze data and provide detailed multimodal annotations, formulating a playground for modeling the human ability to bridge asynchronous procedural actions from different viewpoints.

InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding

2 code implementations • 22 Mar 2024 • Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Jilan Xu, Zun Wang, Yansong Shi, Tianxiang Jiang, Songze Li, Hongjie Zhang, Yifei HUANG, Yu Qiao, Yali Wang, LiMin Wang

We introduce InternVideo2, a new video foundation model (ViFM) that achieves state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue.

Ranked #1 on Audio Classification on ESC-50 (using extra training data)

Action Classification Action Recognition +12

973

DreamDA: Generative Data Augmentation with Diffusion Models

1 code implementation • 19 Mar 2024 • Yunxiang Fu, Chaoqi Chen, Yu Qiao, Yizhou Yu

The acquisition of large-scale, high-quality data is a resource-intensive and time-consuming endeavor.

Data Augmentation

MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control

1 code implementation • 18 Mar 2024 • Enshen Zhou, Yiran Qin, Zhenfei Yin, Yuzhou Huang, Ruimao Zhang, Lu Sheng, Yu Qiao, Jing Shao

It is a long-standing goal to design a generalist embodied agent that can follow diverse instructions in human-like ways.

Instruction Following

AVIBench: Towards Evaluating the Robustness of Large Vision-Language Model on Adversarial Visual-Instructions

no code implementations • 14 Mar 2024 • Hao Zhang, Wenqi Shao, Hong Liu, Yongqiang Ma, Ping Luo, Yu Qiao, Kaipeng Zhang

To bridge this gap, we introduce AVIBench, a framework designed to analyze the robustness of LVLMs when facing various adversarial visual-instructions (AVIs), including four types of image-based AVIs, ten types of text-based AVIs, and nine types of content bias AVIs (such as gender, violence, cultural, and racial biases, among others).

Fairness Language Modelling

Desigen: A Pipeline for Controllable Design Template Generation

no code implementations • 14 Mar 2024 • Haohan Weng, Danqing Huang, Yu Qiao, Zheng Hu, Chin-Yew Lin, Tong Zhang, C. L. Philip Chen

In this paper, we present Desigen, an automatic template creation pipeline which generates background images as well as harmonious layout elements over the background.

Generalized Predictive Model for Autonomous Driving

1 code implementation • 14 Mar 2024 • Jiazhi Yang, Shenyuan Gao, Yihang Qiu, Li Chen, Tianyu Li, Bo Dai, Kashyap Chitta, Penghao Wu, Jia Zeng, Ping Luo, Jun Zhang, Andreas Geiger, Yu Qiao, Hongyang Li

In this paper, we introduce the first large-scale video prediction model in the autonomous driving discipline.

Autonomous Driving Video Prediction

384

Exploring Safety Generalization Challenges of Large Language Models via Code

no code implementations • 12 Mar 2024 • Qibing Ren, Chang Gao, Jing Shao, Junchi Yan, Xin Tan, Yu Qiao, Wai Lam, Lizhuang Ma

The rapid advancement of Large Language Models (LLMs) has brought about remarkable generative capabilities but also raised concerns about their potential misuse.

Code Completion

VideoMamba: State Space Model for Efficient Video Understanding

3 code implementations • 11 Mar 2024 • Kunchang Li, Xinhao Li, Yi Wang, Yinan He, Yali Wang, LiMin Wang, Yu Qiao

Addressing the dual challenges of local redundancy and global dependencies in video understanding, this work adapts Mamba to the video domain.

Video Understanding

603

Embodied Understanding of Driving Scenarios

1 code implementation • 7 Mar 2024 • Yunsong Zhou, Linyan Huang, Qingwen Bu, Jia Zeng, Tianyu Li, Hang Qiu, Hongzi Zhu, Minyi Guo, Yu Qiao, Hongyang Li

Hereby, we introduce the Embodied Language Model (ELM), a comprehensive framework tailored for agents' understanding of driving scenes with large spatial and temporal spans.

Autonomous Driving Language Modelling +1

Towards Robust Federated Learning via Logits Calibration on Non-IID Data

no code implementations • 5 Mar 2024 • Yu Qiao, Apurba Adhikary, Chaoning Zhang, Choong Seon Hong

Meanwhile, the non-independent and identically distributed (non-IID) challenge of data distribution between edge devices can further degrade the performance of models.

Federated Learning Privacy Preserving

Towards Implicit Prompt For Text-To-Image Models

no code implementations • 4 Mar 2024 • Yue Yang, Yuqi Lin, Hong Liu, Wenqi Shao, Runjian Chen, Hailong Shang, Yu Wang, Yu Qiao, Kaipeng Zhang, Ping Luo

We call for increased attention to the potential and risks of implicit prompts in the T2I community and further investigation into the capabilities and impacts of implicit prompts, advocating for a balanced approach that harnesses their benefits while mitigating their risks.

Position

Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures

1 code implementation • 4 Mar 2024 • Yuchen Duan, Weiyun Wang, Zhe Chen, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, Hongsheng Li, Jifeng Dai, Wenhai Wang

Our evaluations demonstrate that VRWKV surpasses ViT's performance in image classification and has significantly faster speeds and lower memory usage when processing high-resolution inputs.

Image Classification

237

Efficient Action Counting with Dynamic Queries

1 code implementation • 3 Mar 2024 • Zishi Li, Xiaoxuan Ma, Qiuyan Shang, Wentao Zhu, Hai Ci, Yu Qiao, Yizhou Wang

Temporal repetition counting aims to quantify the repeated action cycles within a video.

Contrastive Learning

WanJuan-CC: A Safe and High-Quality Open-sourced English Webtext Dataset

no code implementations • 29 Feb 2024 • Jiantao Qiu, Haijun Lv, Zhenjiang Jin, Rui Wang, Wenchang Ning, JIA YU, Chaobin Zhang, Zhenxiang Li, Pei Chu, Yuan Qu, Jin Shi, Lindong Lu, Runyu Peng, Zhiyuan Zeng, Huanze Tang, Zhikai Lei, Jiawei Hong, Keyu Chen, Zhaoye Fei, Ruiliang Xu, Wei Li, Zhongying Tu, Lin Dahua, Yu Qiao, Hang Yan, Conghui He

To evaluate the quality and utility of the dataset, we trained 1B-parameter and 3B-parameter models using WanJuan-CC and another dataset, RefinedWeb.

The All-Seeing Project V2: Towards General Relation Comprehension of the Open World

1 code implementation • 29 Feb 2024 • Weiyun Wang, Yiming Ren, Haowen Luo, Tiantong Li, Chenxiang Yan, Zhe Chen, Wenhai Wang, Qingyun Li, Lewei Lu, Xizhou Zhu, Yu Qiao, Jifeng Dai

In addition, we design a new benchmark, termed Circular-based Relation Probing Evaluation (CRPE) for comprehensively evaluating the relation comprehension capabilities of MLLMs.

Ranked #33 on Visual Question Answering on MM-Vet

Hallucination Object Localization +3

380

Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language Models

1 code implementation • 29 Feb 2024 • Chen Qian, Jie Zhang, Wei Yao, Dongrui Liu, Zhenfei Yin, Yu Qiao, Yong liu, Jing Shao

This research provides an initial exploration of trustworthiness modeling during LLM pre-training, seeking to unveil new insights and spur further developments in the field.

Fairness Mutual Information Estimation

Percept, Chat, and then Adapt: Multimodal Knowledge Transfer of Foundation Models for Open-World Video Recognition

no code implementations • 29 Feb 2024 • BoYu Chen, Siran Chen, Kunchang Li, Qinglin Xu, Yu Qiao, Yali Wang

Finally, we blend external multimodal knowledge in Adapt stage, by inserting multimodal knowledge adaptation modules into networks.

Transfer Learning Video Recognition

Rethinking Mutual Information for Language Conditioned Skill Discovery on Imitation Learning

no code implementations • 27 Feb 2024 • Zhaoxun Ju, Chao Yang, Hongbo Wang, Yu Qiao, Fuchun Sun

Language-conditioned robot behavior plays a vital role in executing complex tasks by associating human commands or instructions with perception and actions.

Imitation Learning Quantization

RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis

no code implementations • 25 Feb 2024 • Yao Mu, Junting Chen, Qinglong Zhang, Shoufa Chen, Qiaojun Yu, Chongjian Ge, Runjian Chen, Zhixuan Liang, Mengkang Hu, Chaofan Tao, Peize Sun, Haibao Yu, Chao Yang, Wenqi Shao, Wenhai Wang, Jifeng Dai, Yu Qiao, Mingyu Ding, Ping Luo

Robotic behavior synthesis, the problem of understanding multimodal inputs and generating precise physical control for robots, is an important part of Embodied AI.

Ranked #77 on Visual Question Answering on MM-Vet

Code Generation Multimodal Reasoning +1

RoboScript: Code Generation for Free-Form Manipulation Tasks across Real and Simulation

no code implementations • 22 Feb 2024 • Junting Chen, Yao Mu, Qiaojun Yu, Tianming Wei, Silang Wu, Zhecheng Yuan, Zhixuan Liang, Chao Yang, Kaipeng Zhang, Wenqi Shao, Yu Qiao, Huazhe Xu, Mingyu Ding, Ping Luo

To bridge this "ideal-to-real" gap, this paper presents RobotScript, a platform for 1) a deployable robot manipulation pipeline powered by code generation; and 2) a code generation benchmark for robot manipulation tasks in free-form natural language.

Code Generation Common Sense Reasoning +2

ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning

1 code implementation • 19 Feb 2024 • Renqiu Xia, Bo Zhang, Hancheng Ye, Xiangchao Yan, Qi Liu, Hongbin Zhou, Zijun Chen, Min Dou, Botian Shi, Junchi Yan, Yu Qiao

Recently, many versatile Multi-modal Large Language Models (MLLMs) have emerged.

181

Emulated Disalignment: Safety Alignment for Large Language Models May Backfire!

1 code implementation • 19 Feb 2024 • Zhanhui Zhou, Jie Liu, Zhichen Dong, Jiaheng Liu, Chao Yang, Wanli Ouyang, Yu Qiao

Large language models (LLMs) need to undergo safety alignment to ensure safe conversations with humans.

Language Modelling

BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation

2 code implementations • 18 Feb 2024 • Peng Xu, Wenqi Shao, Mengzhao Chen, Shitao Tang, Kaipeng Zhang, Peng Gao, Fengwei An, Yu Qiao, Ping Luo

Large language models (LLMs) have demonstrated outstanding performance in various tasks, such as text summarization and text question-answering.

Question Answering Text Summarization

OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM

1 code implementation • 14 Feb 2024 • Yutao Hu, Tianbin Li, Quanfeng Lu, Wenqi Shao, Junjun He, Yu Qiao, Ping Luo

Importantly, all images in this benchmark are sourced from authentic medical scenarios, ensuring alignment with the requirements of the medical field and suitability for evaluating LVLMs.

Medical Visual Question Answering Question Answering +1

374

Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey

1 code implementation • 14 Feb 2024 • Zhichen Dong, Zhanhui Zhou, Chao Yang, Jing Shao, Yu Qiao

Large Language Models (LLMs) are now commonplace in conversation applications.

Real-time Holistic Robot Pose Estimation with Unknown States

1 code implementation • 8 Feb 2024 • Shikun Ban, Juling Fan, Wentao Zhu, Xiaoxuan Ma, Yu Qiao, Yizhou Wang

We propose an end-to-end pipeline for real-time, holistic robot pose estimation from a single RGB image, even in the absence of known robot states.

Ranked #1 on Robot Pose Estimation on DREAM-dataset

6D Pose Estimation using RGB Robot Pose Estimation

SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

1 code implementation • 8 Feb 2024 • Peng Gao, Renrui Zhang, Chris Liu, Longtian Qiu, Siyuan Huang, Weifeng Lin, Shitian Zhao, Shijie Geng, Ziyi Lin, Peng Jin, Kaipeng Zhang, Wenqi Shao, Chao Xu, Conghui He, Junjun He, Hao Shao, Pan Lu, Hongsheng Li, Yu Qiao

We propose SPHINX-X, an extensive Multimodality Large Language Model (MLLM) series developed upon SPHINX.

Ranked #5 on Video Question Answering on MVBench

Benchmarking Language Modelling +4

2,537

SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models

1 code implementation • 7 Feb 2024 • Lijun Li, Bowen Dong, Ruohui Wang, Xuhao Hu, WangMeng Zuo, Dahua Lin, Yu Qiao, Jing Shao

In the rapidly evolving landscape of Large Language Models (LLMs), ensuring robust safety measures is paramount.

Multiple-choice

Safety of Multimodal Large Language Models on Images and Text

1 code implementation • 1 Feb 2024 • Xin Liu, Yichen Zhu, Yunshi Lan, Chao Yang, Yu Qiao

In this paper, we systematically survey current efforts on the evaluation, attack, and defense of MLLMs' safety on images and text.

InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

1 code implementation • 29 Jan 2024 • Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, Jiaqi Wang

We introduce InternLM-XComposer2, a cutting-edge vision-language model excelling in free-form text-image composition and comprehension.

Ranked #17 on Visual Question Answering on MM-Vet

Language Modelling Visual Question Answering

1,703

CO2: Efficient Distributed Training with Full Communication-Computation Overlap

1 code implementation • 29 Jan 2024 • Weigao Sun, Zhen Qin, Weixuan Sun, Shidi Li, Dong Li, Xuyang Shen, Yu Qiao, Yiran Zhong

CO2 is able to attain a high scalability even on extensive multi-node clusters constrained by very limited communication bandwidth.

From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities

no code implementations • 26 Jan 2024 • Chaochao Lu, Chen Qian, Guodong Zheng, Hongxing Fan, Hongzhi Gao, Jie Zhang, Jing Shao, Jingyi Deng, Jinlan Fu, Kexin Huang, Kunchang Li, Lijun Li, LiMin Wang, Lu Sheng, Meiqi Chen, Ming Zhang, Qibing Ren, Sirui Chen, Tao Gui, Wanli Ouyang, Yali Wang, Yan Teng, Yaru Wang, Yi Wang, Yinan He, Yingchun Wang, Yixu Wang, Yongting Zhang, Yu Qiao, Yujiong Shen, Yurong Mou, Yuxi Chen, Zaibin Zhang, Zhelun Shi, Zhenfei Yin, Zhipin Wang

Multi-modal Large Language Models (MLLMs) have shown impressive abilities in generating reasonable responses with respect to multi-modal contents.

Cross-Modal Prototype based Multimodal Federated Learning under Severely Missing Modality

no code implementations • 25 Jan 2024 • Huy Q. Le, Chu Myaet Thwal, Yu Qiao, Ye Lin Tun, Minh N. H. Nguyen, Choong Seon Hong

In this paper, we propose Multimodal Federated Cross Prototype Learning (MFCPL), a novel approach for MFL under severely missing modalities that uses complete prototypes to provide diverse modality knowledge at the modality-shared level through cross-modal regularization and at the modality-specific level through a cross-modal contrastive mechanism.

Federated Learning

SEER: Facilitating Structured Reasoning and Explanation via Reinforcement Learning

no code implementations • 24 Jan 2024 • Guoxin Chen, Kexin Tang, Chao Yang, Fuying Ye, Yu Qiao, Yiming Qian

Moreover, existing reinforcement learning (RL) based methods overlook the structured relationships, underutilizing the potential of RL in structured reasoning.

Question Answering reinforcement-learning +1

Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild

no code implementations • 24 Jan 2024 • Fanghua Yu, Jinjin Gu, Zheyuan Li, JinFan Hu, Xiangtao Kong, Xintao Wang, Jingwen He, Yu Qiao, Chao Dong

We introduce SUPIR (Scaling-UP Image Restoration), a groundbreaking image restoration method that harnesses a generative prior and the power of model scaling.

Descriptive Image Restoration

PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety

1 code implementation • 22 Jan 2024 • Zaibin Zhang, Yongting Zhang, Lijun Li, Hongzhi Gao, Lijun Wang, Huchuan Lu, Feng Zhao, Yu Qiao, Jing Shao

In this paper, we explore these concerns through the innovative lens of agent psychology, revealing that the dark psychological states of agents constitute a significant threat to safety.

MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer

1 code implementation • 18 Jan 2024 • Changyao Tian, Xizhou Zhu, Yuwen Xiong, Weiyun Wang, Zhe Chen, Wenhai Wang, Yuntao Chen, Lewei Lu, Tong Lu, Jie zhou, Hongsheng Li, Yu Qiao, Jifeng Dai

Developing generative models for interleaved image-text data has both research and practical value.

157

Vlogger: Make Your Dream A Vlog

1 code implementation • 17 Jan 2024 • Shaobin Zhuang, Kunchang Li, Xinyuan Chen, Yaohui Wang, Ziwei Liu, Yu Qiao, Yali Wang

More importantly, Vlogger can generate over 5-minute vlogs from open-world descriptions, without loss of video coherence on script and actor.

Language Modelling Large Language Model +1

Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications

1 code implementation • 11 Jan 2024 • Yuwen Xiong, Zhiqi Li, Yuntao Chen, Feng Wang, Xizhou Zhu, Jiapeng Luo, Wenhai Wang, Tong Lu, Hongsheng Li, Yu Qiao, Lewei Lu, Jie zhou, Jifeng Dai

The advancements in speed and efficiency of DCNv4, combined with its robust performance across diverse vision tasks, show its potential as a foundational building block for future vision models.

Image Classification Image Generation +1

341


Latte: Latent Diffusion Transformer for Video Generation

2 code implementations • 5 Jan 2024 • Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, Yu Qiao

We propose a novel Latent Diffusion Transformer, namely Latte, for video generation.

Text-to-Video Generation Video Generation

308


ChartAssisstant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning

1 code implementation • 4 Jan 2024 • Fanqing Meng, Wenqi Shao, Quanfeng Lu, Peng Gao, Kaipeng Zhang, Yu Qiao, Ping Luo

Charts play a vital role in data visualization, understanding data patterns, and informed decision-making.

Data Visualization Decision Making +2


Critic-Guided Decision Transformer for Offline Reinforcement Learning

no code implementations • 21 Dec 2023 • Yuanfu Wang, Chao Yang, Ying Wen, Yu Liu, Yu Qiao

Recent advancements in offline reinforcement learning (RL) have underscored the capabilities of Return-Conditioned Supervised Learning (RCSL), a paradigm that learns the action distribution based on target returns for each state in a supervised manner.

D4RL Offline RL +3


InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

2 code implementations • 21 Dec 2023 • Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, Jifeng Dai

However, the progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs.

Ranked #1 on Zero-Shot Video Retrieval on MSR-VTT-full (using extra training data)

Image Retrieval Image-to-Text Retrieval +11

1,724


M-BEV: Masked BEV Perception for Robust Autonomous Driving

1 code implementation • 19 Dec 2023 • Siran Chen, Yue Ma, Yu Qiao, Yali Wang

It mimics various missing cases by randomly masking features of different camera views, then leverages the original features of these views as self-supervision, and reconstructs the masked ones with the distinct spatio-temporal context across views.

Autonomous Driving


Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model

1 code implementation • 19 Dec 2023 • Lingjun Zhang, Xinyuan Chen, Yaohui Wang, Yue Lu, Yu Qiao

To tackle this problem, we propose Diff-Text, which is a training-free scene text generation framework for any language.

Text Generation Text-to-Image Generation


Point Transformer V3: Simpler, Faster, Stronger

3 code implementations • 15 Dec 2023 • Xiaoyang Wu, Li Jiang, Peng-Shuai Wang, Zhijian Liu, Xihui Liu, Yu Qiao, Wanli Ouyang, Tong He, Hengshuang Zhao

This paper is not motivated to seek innovation within the attention mechanism.

Ranked #1 on Semantic Segmentation on S3DIS (using extra training data)

3D Semantic Segmentation LIDAR Semantic Segmentation +1

1,995


Towards the Unification of Generative and Discriminative Visual Foundation Model: A Survey

no code implementations • 15 Dec 2023 • Xu Liu, Tong Zhou, Yuanxin Wang, Yuping Wang, Qinjingwen Cao, Weizhi Du, Yonghuan Yang, Junjun He, Yu Qiao, Yiqing Shen

The advent of foundation models, which are pre-trained on vast datasets, has ushered in a new era of computer vision, characterized by their robustness and remarkable zero-shot generalization capabilities.

Image Generation Image Segmentation +2


DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving

1 code implementation • 14 Dec 2023 • Wenhai Wang, Jiangwei Xie, Chuanyang Hu, Haoming Zou, Jianan Fan, Wenwen Tong, Yang Wen, Silei Wu, Hanming Deng, Zhiqi Li, Hao Tian, Lewei Lu, Xizhou Zhu, Xiaogang Wang, Yu Qiao, Jifeng Dai

In this work, we delve into the potential of large language models (LLMs) in autonomous driving (AD).

Autonomous Driving Motion Planning

124


Aleth-NeRF: Illumination Adaptive NeRF with Concealing Field Assumption

1 code implementation • 14 Dec 2023 • Ziteng Cui, Lin Gu, Xiao Sun, Xianzheng Ma, Yu Qiao, Tatsuya Harada

The standard Neural Radiance Fields (NeRF) paradigm employs a viewer-centered methodology, entangling the aspects of illumination and material reflectance into emission solely from 3D points.


Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft

no code implementations • 14 Dec 2023 • Hao Li, Xue Yang, Zhaokai Wang, Xizhou Zhu, Jie Zhou, Yu Qiao, Xiaogang Wang, Hongsheng Li, Lewei Lu, Jifeng Dai

Many reinforcement learning environments (e.g., Minecraft) provide only sparse rewards that indicate task completion or failure with binary values.

reinforcement-learning


MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception

1 code implementation • 12 Dec 2023 • Yiran Qin, Enshen Zhou, Qichang Liu, Zhenfei Yin, Lu Sheng, Ruimao Zhang, Yu Qiao, Jing Shao

It is a long-lasting goal to design an embodied system that can solve long-horizon open-world tasks in human-like ways.

267


Mask as Supervision: Leveraging Unified Mask Information for Unsupervised 3D Pose Estimation

1 code implementation • 12 Dec 2023 • Yuchen Yang, Yu Qiao, Xiao Sun

Automatic estimation of 3D human pose from monocular RGB images is a challenging and unsolved problem in computer vision.

3D Pose Estimation


Building Open-Ended Embodied Agent via Language-Policy Bidirectional Adaptation

no code implementations • 12 Dec 2023 • Shaopeng Zhai, Jie Wang, Tianyi Zhang, Fuxian Huang, Qi Zhang, Ming Zhou, Jing Hou, Yu Qiao, Yu Liu

Building embodied agents by integrating Large Language Models (LLMs) and Reinforcement Learning (RL) has revolutionized human-AI interaction: researchers can now leverage language instructions to plan decision-making for open-ended tasks.

Decision Making Language Modelling +1


TMT-VIS: Taxonomy-aware Multi-dataset Joint Training for Video Instance Segmentation

1 code implementation • NeurIPS 2023 • Rongkun Zheng, Lu Qi, Xi Chen, Yi Wang, Kun Wang, Yu Qiao, Hengshuang Zhao

What we possess are numerous isolated field-specific datasets; thus, it is appealing to jointly train models across the aggregation of datasets to enhance data volume and diversity.

Instance Segmentation Semantic Segmentation +1


EpiDiff: Enhancing Multi-View Synthesis via Localized Epipolar-Constrained Diffusion

no code implementations • 11 Dec 2023 • Zehuan Huang, Hao Wen, Junting Dong, Yaohui Wang, Yangguang Li, Xinyuan Chen, Yan-Pei Cao, Ding Liang, Yu Qiao, Bo Dai, Lu Sheng

Generating multiview images from a single view facilitates the rapid generation of a 3D mesh conditioned on a single image.

SSIM


MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding

no code implementations • 8 Dec 2023 • Hongjie Zhang, Yi Liu, Lu Dong, Yifei Huang, Zhen-Hua Ling, Yali Wang, LiMin Wang, Yu Qiao

While several long-form VideoQA datasets have been introduced, the lengths of both the videos used to curate questions and the sub-clips of clues leveraged to answer those questions have not yet reached the criteria for genuine long-form video understanding.

Question Answering Video Question Answering +1


Enhancing Medical Task Performance in GPT-4V: A Comprehensive Study on Prompt Engineering Strategies

no code implementations • 7 Dec 2023 • Pengcheng Chen, Ziyan Huang, Zhongying Deng, Tianbin Li, Yanzhou Su, Haoyu Wang, Jin Ye, Yu Qiao, Junjun He

OpenAI's latest large vision-language model (LVLM), GPT-4V(ision), has piqued considerable interest for its potential in medical applications.

Language Modelling Prompt Engineering


Towards Knowledge-driven Autonomous Driving

1 code implementation • 7 Dec 2023 • Xin Li, Yeqi Bai, Pinlong Cai, Licheng Wen, Daocheng Fu, Bo Zhang, Xuemeng Yang, Xinyu Cai, Tao Ma, Jianfei Guo, Xing Gao, Min Dou, Yikang Li, Botian Shi, Yong Liu, Liang He, Yu Qiao

This paper explores the emerging knowledge-driven autonomous driving technologies.

Autonomous Driving Neural Rendering

310


Open-sourced Data Ecosystem in Autonomous Driving: the Present and Future

2 code implementations • 6 Dec 2023 • Hongyang Li, Yang Li, Huijie Wang, Jia Zeng, Huilin Xu, Pinlong Cai, Li Chen, Junchi Yan, Feng Xu, Lu Xiong, Jingdong Wang, Futang Zhu, Chunjing Xu, Tiancai Wang, Fei Xia, Beipeng Mu, Zhihui Peng, Dahua Lin, Yu Qiao

With the continuous maturation and application of autonomous driving technology, a systematic examination of open-source autonomous driving datasets becomes instrumental in fostering the robust evolution of the industry ecosystem.

Autonomous Driving

384


OneLLM: One Framework to Align All Modalities with Language

1 code implementation • 6 Dec 2023 • Jiaming Han, Kaixiong Gong, Yiyuan Zhang, Jiaqi Wang, Kaipeng Zhang, Dahua Lin, Yu Qiao, Peng Gao, Xiangyu Yue

In detail, we first train an image projection module to connect a vision encoder with LLM.

Ranked #80 on Visual Question Answering on MM-Vet

Question Answering Visual Question Answering

460


VideoBooth: Diffusion-based Video Generation with Image Prompts

no code implementations • 1 Dec 2023 • Yuming Jiang, Tianxing Wu, Shuai Yang, Chenyang Si, Dahua Lin, Yu Qiao, Chen Change Loy, Ziwei Liu

In this paper, we study the task of video generation with image prompts, which provide more accurate and direct content control beyond the text prompts.

Video Generation


MLLMs-Augmented Visual-Language Representation Learning

1 code implementation • 30 Nov 2023 • Yanqing Liu, Kai Wang, Wenqi Shao, Ping Luo, Yu Qiao, Mike Zheng Shou, Kaipeng Zhang, Yang You

Visual-language pre-training has achieved remarkable success in many multi-modal tasks, largely attributed to the availability of large-scale image-text datasets.

Representation Learning Retrieval +1


VBench: Comprehensive Benchmark Suite for Video Generative Models

1 code implementation • 29 Nov 2023 • Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, LiMin Wang, Dahua Lin, Yu Qiao, Ziwei Liu

We will open-source VBench, including all prompts, evaluation methods, generated videos, and human preference annotations, and also include more video generation models in VBench to drive forward the field of video generation.

Image Generation Video Generation

294


MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models

1 code implementation • 29 Nov 2023 • Xin Liu, Yichen Zhu, Jindong Gu, Yunshi Lan, Chao Yang, Yu Qiao

The security concerns surrounding Large Language Models (LLMs) have been extensively explored, yet the safety of Multimodal Large Language Models (MLLMs) remains understudied.


MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

2 code implementations • 28 Nov 2023 • Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, LiMin Wang, Yu Qiao

With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models.

Ranked #1 on Zero-Shot Video Question Answer on STAR Benchmark

Fairness Multiple-choice +8

2,710


Point2RBox: Combine Knowledge from Synthetic Visual Patterns for End-to-end Oriented Object Detection with Single Point Supervision

2 code implementations • 23 Nov 2023 • Yi Yu, Xue Yang, Qingyun Li, Feipeng Da, Jifeng Dai, Yu Qiao, Junchi Yan

To the best of our knowledge, Point2RBox is the first end-to-end solution for point-supervised OOD.

Object object-detection +2

1,739


SinSR: Diffusion-Based Image Super-Resolution in a Single Step

1 code implementation • 23 Nov 2023 • YuFei Wang, Wenhan Yang, Xinyuan Chen, Yaohui Wang, Lanqing Guo, Lap-Pui Chau, Ziwei Liu, Yu Qiao, Alex C. Kot, Bihan Wen

Extensive experiments conducted on synthetic and real-world datasets demonstrate that the proposed method can achieve comparable or even superior performance compared to both previous SOTA methods and the teacher model, in just one sampling step, resulting in an inference speedup of up to 10x.

Image Super-Resolution

142


DiffusionMat: Alpha Matting as Sequential Refinement Learning

no code implementations • 22 Nov 2023 • Yangyang Xu, Shengfeng He, Wenqi Shao, Kwan-Yee K. Wong, Yu Qiao, Ping Luo

In this paper, we introduce DiffusionMat, a novel image matting framework that employs a diffusion model for the transition from coarse to refined alpha mattes.

Denoising Image Matting


SA-Med2D-20M Dataset: Segment Anything in 2D Medical Imaging with 20 Million masks

1 code implementation • 20 Nov 2023 • Jin Ye, Junlong Cheng, Jianpin Chen, Zhongying Deng, Tianbin Li, Haoyu Wang, Yanzhou Su, Ziyan Huang, Jilong Chen, Lei Jiang, Hui Sun, Min Zhu, Shaoting Zhang, Junjun He, Yu Qiao

Segment Anything Model (SAM) has achieved impressive results for natural image segmentation with input prompts such as points and bounding boxes.

Image Segmentation Medical Image Segmentation +2

749


Symbol-LLM: Towards Foundational Symbol-centric Interface For Large Language Models

no code implementations • 15 Nov 2023 • Fangzhi Xu, Zhiyong Wu, Qiushi Sun, Siyu Ren, Fei Yuan, Shuai Yuan, Qika Lin, Yu Qiao, Jun Liu

Although Large Language Models (LLMs) demonstrate remarkable ability in processing and generating human-like text, they do have limitations when it comes to comprehending and expressing world knowledge that extends beyond the boundaries of natural language (e.g., chemical molecular formulas).

World Knowledge


Clearer Frames, Anytime: Resolving Velocity Ambiguity in Video Frame Interpolation

1 code implementation • 14 Nov 2023 • Zhihang Zhong, Gurunandan Krishnan, Xiao Sun, Yu Qiao, Sizhuo Ma, Jian Wang

Existing video frame interpolation (VFI) methods blindly predict where each object is at a specific timestep t ("time indexing"), which struggles to predict precise object movements.

Object Video Editing +1

158


SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models

1 code implementation • 13 Nov 2023 • Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, Jiaming Han, Siyuan Huang, Yichi Zhang, Xuming He, Hongsheng Li, Yu Qiao

We present SPHINX, a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, tuning tasks, and visual embeddings.

Ranked #2 on Visual Question Answering on BenchLMM

Described Object Detection Language Modelling +4

2,537


Fake Alignment: Are LLMs Really Aligned Well?

1 code implementation • 10 Nov 2023 • Yixu Wang, Yan Teng, Kexin Huang, Chengqi Lyu, Songyang Zhang, Wenwei Zhang, Xingjun Ma, Yu-Gang Jiang, Yu Qiao, Yingchun Wang

The growing awareness of safety concerns in large language models (LLMs) has sparked considerable interest in the evaluation of safety.

Multiple-choice


On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving

1 code implementation • 9 Nov 2023 • Licheng Wen, Xuemeng Yang, Daocheng Fu, XiaoFeng Wang, Pinlong Cai, Xin Li, Tao Ma, Yingxuan Li, Linran Xu, Dengke Shang, Zheng Zhu, Shaoyan Sun, Yeqi Bai, Xinyu Cai, Min Dou, Shuanglu Hu, Botian Shi, Yu Qiao

This has been a significant bottleneck, particularly in the development of common sense reasoning and nuanced scene understanding necessary for safe and reliable autonomous driving.

Autonomous Driving Common Sense Reasoning +4

265


Asymmetric Masked Distillation for Pre-Training Small Foundation Models

no code implementations • 6 Nov 2023 • Zhiyu Zhao, Bingkun Huang, Sen Xing, Gangshan Wu, Yu Qiao, LiMin Wang

AMD achieves 73.3% classification accuracy using the ViT-B model on the Something-Something V2 dataset, a 3.7% improvement over the original ViT-B model from VideoMAE.

Ranked #20 on Action Recognition on Something-Something V2

Action Classification Action Recognition +3


ChEF: A Comprehensive Evaluation Framework for Standardized Assessment of Multimodal Large Language Models

1 code implementation • 5 Nov 2023 • Zhelun Shi, Zhipin Wang, Hongxing Fan, Zhenfei Yin, Lu Sheng, Yu Qiao, Jing Shao

We will publicly release all the detailed implementations for further analysis, as well as an easy-to-use modular toolkit for the integration of new recipes and models, so that ChEF can be a growing evaluation framework for the MLLM community.

Hallucination In-Context Learning +2

267


Octavius: Mitigating Task Interference in MLLMs via LoRA-MoE

1 code implementation • 5 Nov 2023 • Zeren Chen, Ziqin Wang, Zhen Wang, Huayang Liu, Zhenfei Yin, Si Liu, Lu Sheng, Wanli Ouyang, Yu Qiao, Jing Shao

While this phenomenon has been overlooked in previous work, we propose a novel and extensible framework, called Octavius, for comprehensive studies and experimentation on multimodal learning with Multimodal Large Language Models (MLLMs).

Decoder Zero-shot Generalization

267


SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction

no code implementations • 31 Oct 2023 • Xinyuan Chen, Yaohui Wang, Lingjun Zhang, Shaobin Zhuang, Xin Ma, Jiashuo Yu, Yali Wang, Dahua Lin, Yu Qiao, Ziwei Liu

The goal is to generate high-quality long videos with smooth and creative transitions between scenes and varying lengths of shot-level videos.


Harvest Video Foundation Models via Efficient Post-Pretraining

1 code implementation • 30 Oct 2023 • Yizhuo Li, Kunchang Li, Yinan He, Yi Wang, Yali Wang, LiMin Wang, Yu Qiao, Ping Luo

Building video-language foundation models is costly and difficult due to the redundant nature of video data and the lack of high-quality video-language datasets.

Question Answering Text Retrieval +2

973


ControlLLM: Augment Language Models with Tools by Searching on Graphs

1 code implementation • 26 Oct 2023 • Zhaoyang Liu, Zeqiang Lai, Zhangwei Gao, Erfei Cui, Ziheng Li, Xizhou Zhu, Lewei Lu, Qifeng Chen, Yu Qiao, Jifeng Dai, Wenhai Wang

We present ControlLLM, a novel framework that enables large language models (LLMs) to utilize multi-modal tools for solving complex real-world tasks.

Scheduling

163


Leveraging Vision-Centric Multi-Modal Expertise for 3D Object Detection

1 code implementation • NeurIPS 2023 • Linyan Huang, Zhiqi Li, Chonghao Sima, Wenhai Wang, Jingdong Wang, Yu Qiao, Hongyang Li

Current research is primarily dedicated to advancing the accuracy of camera-only 3D object detectors (apprentice) through the knowledge transferred from LiDAR- or multi-modal-based counterparts (expert).

Ranked #6 on 3D Object Detection on nuScenes Camera Only

3D Object Detection object-detection

1,083


SAM-Med3D

1 code implementation • 23 Oct 2023 • Haoyu Wang, Sizheng Guo, Jin Ye, Zhongying Deng, Junlong Cheng, Tianbin Li, Jianpin Chen, Yanzhou Su, Ziyan Huang, Yiqing Shen, Bin Fu, Shaoting Zhang, Junjun He, Yu Qiao

These issues can hardly be addressed by fine-tuning SAM on medical data because the original 2D structure of SAM neglects 3D spatial information.

3D Architecture Image Segmentation +1

362


A Comparative Study of Image Restoration Networks for General Backbone Network Design

1 code implementation • 18 Oct 2023 • Xiangyu Chen, Zheyuan Li, Yuandong Pu, Yihao Liu, Jiantao Zhou, Yu Qiao, Chao Dong

Following this, we present the benchmark results and analyze the reasons behind the performance disparity of different models across various tasks.

Image Restoration


Unifying Image Processing as Visual Prompting Question Answering

no code implementations • 16 Oct 2023 • Yihao Liu, Xiangyu Chen, Xianzheng Ma, Xintao Wang, Jiantao Zhou, Yu Qiao, Chao Dong

To address this issue, we propose a universal model for general image processing that covers image restoration, image enhancement, image feature extraction tasks, etc.

Image Enhancement Image Restoration +4


PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm

1 code implementation • 12 Oct 2023 • Haoyi Zhu, Honghui Yang, Xiaoyang Wu, Di Huang, Sha Zhang, Xianglong He, Hengshuang Zhao, Chunhua Shen, Yu Qiao, Tong He, Wanli Ouyang

In this paper, we introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representation, thereby establishing a pathway to 3D foundational models.

Ranked #1 on 3D Semantic Segmentation on ScanNet++ (using extra training data)

3D Object Detection 3D Reconstruction +5

302


Tree-Planner: Efficient Close-loop Task Planning with Large Language Models

no code implementations • 12 Oct 2023 • Mengkang Hu, Yao Mu, Xinmiao Yu, Mingyu Ding, Shiguang Wu, Wenqi Shao, Qiguang Chen, Bin Wang, Yu Qiao, Ping Luo

This paper studies close-loop task planning, which refers to the process of generating a sequence of skills (a plan) to accomplish a specific goal while adapting the plan based on real-time observations.

Decision Making


Mini-DALLE3: Interactive Text to Image by Prompting Large Language Models

1 code implementation • 11 Oct 2023 • Zeqiang Lai, Xizhou Zhu, Jifeng Dai, Yu Qiao, Wenhai Wang

The revolution of artificial intelligence content generation has been rapidly accelerated with the booming text-to-image (T2I) diffusion models.

Code Generation Image Generation +2

290


ConditionVideo: Training-Free Condition-Guided Text-to-Video Generation

1 code implementation • 11 Oct 2023 • Bo Peng, Xinyuan Chen, Yaohui Wang, Chaochao Lu, Yu Qiao

In this work, we introduce ConditionVideo, a training-free approach to text-to-video generation based on the provided condition, video, and input text, by leveraging the power of off-the-shelf text-to-image generation methods (e.g., Stable Diffusion).

Text-to-Image Generation Text-to-Video Generation +1


On the Evaluation and Refinement of Vision-Language Instruction Tuning Datasets

no code implementations • 10 Oct 2023 • Ning Liao, Shaofeng Zhang, Renqiu Xia, Min Cao, Yu Qiao, Junchi Yan

Instead of evaluating the models directly, in this paper, we try to evaluate the Vision-Language Instruction-Tuning (VLIT) datasets.

Benchmarking


Open-Vocabulary Animal Keypoint Detection with Semantic-feature Matching

no code implementations • 8 Oct 2023 • Hao Zhang, Lumin Xu, Shenqi Lai, Wenqi Shao, Nanning Zheng, Ping Luo, Yu Qiao, Kaipeng Zhang

Current image-based keypoint detection methods for animal (including human) bodies and faces are generally divided into fully supervised and few-shot class-agnostic approaches.

Keypoint Detection


Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization

1 code implementation • 5 Oct 2023 • Zhanhui Zhou, Jie Liu, Chao Yang, Jing Shao, Yu Liu, Xiangyu Yue, Wanli Ouyang, Yu Qiao

A single language model (LM), despite aligning well with an average labeler through reinforcement learning from human feedback (RLHF), may not universally suit diverse human preferences.

Language Modelling Long Form Question Answering


Exploring Counterfactual Alignment Loss towards Human-centered AI

no code implementations • 3 Oct 2023 • Mingzhou Liu, Xinwei Sun, Ching-Wen Lee, Yu Qiao, Yizhou Wang

In particular, we utilize the counterfactual generation's ability for causal attribution to introduce a novel loss called the CounterFactual Alignment (CF-Align) loss.

Attribute counterfactual +1


DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models

2 code implementations • 28 Sep 2023 • Licheng Wen, Daocheng Fu, Xin Li, Xinyu Cai, Tao Ma, Pinlong Cai, Min Dou, Botian Shi, Liang He, Yu Qiao

Recent advancements in autonomous driving have relied on data-driven approaches, which are widely adopted but face challenges including dataset bias, overfitting, and uninterpretability.

Autonomous Driving Common Sense Reasoning +1

317


InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition

1 code implementation • 26 Sep 2023 • Pan Zhang, Xiaoyi Dong, Bin Wang, Yuhang Cao, Chao Xu, Linke Ouyang, Zhiyuan Zhao, Haodong Duan, Songyang Zhang, Shuangrui Ding, Wenwei Zhang, Hang Yan, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, Jiaqi Wang

We propose InternLM-XComposer, a vision-language large model that enables advanced image-text comprehension and composition.

Ranked #9 on Visual Question Answering (VQA) on InfiMM-Eval

Image Comprehension Reading Comprehension +1

1,703


LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models

2 code implementations • 26 Sep 2023 • Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, Yuwei Guo, Tianxing Wu, Chenyang Si, Yuming Jiang, Cunjian Chen, Chen Change Loy, Bo Dai, Dahua Lin, Yu Qiao, Ziwei Liu

To this end, we propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models, comprising a base T2V model, a temporal interpolation model, and a video super-resolution model.

Ranked #4 on Text-to-Video Generation on EvalCrafter Text-to-Video (ECTV) Dataset (using extra training data)

Text-to-Video Generation Video Generation +1

737


StructChart: Perception, Structuring, Reasoning for Visual Chart Understanding

1 code implementation • 20 Sep 2023 • Renqiu Xia, Bo Zhang, Haoyang Peng, Hancheng Ye, Xiangchao Yan, Peng Ye, Botian Shi, Yu Qiao, Junchi Yan

Charts are common in literature across different scientific fields, conveying rich information easily accessible to readers.

Ranked #19 on Chart Question Answering on ChartQA (using extra training data)

Chart Question Answering Language Modelling +2


SPOT: Scalable 3D Pre-training via Occupancy Prediction for Autonomous Driving

1 code implementation • 19 Sep 2023 • Xiangchao Yan, Runjian Chen, Bo Zhang, Jiakang Yuan, Xinyu Cai, Botian Shi, Wenqi Shao, Junchi Yan, Ping Luo, Yu Qiao

Our contributions are threefold: (1) Occupancy prediction is shown to be promising for learning general representations, which is demonstrated by extensive experiments on plenty of datasets and tasks.

3D Object Detection Autonomous Driving +3

570


ReSimAD: Zero-Shot 3D Domain Transfer for Autonomous Driving with Source Reconstruction and Target Simulation

2 code implementations • 11 Sep 2023 • Bo Zhang, Xinyu Cai, Jiakang Yuan, Donglin Yang, Jianfei Guo, Xiangchao Yan, Renqiu Xia, Botian Shi, Min Dou, Tao Chen, Si Liu, Junchi Yan, Yu Qiao

Domain shifts such as sensor type changes and geographical situation variations are prevalent in Autonomous Driving (AD), which poses a challenge since an AD model relying on previous domain knowledge can hardly be deployed directly to a new domain without additional costs.

Autonomous Driving Domain Generalization

570


UniSeg: A Unified Multi-Modal LiDAR Segmentation Network and the OpenPCSeg Codebase

1 code implementation • ICCV 2023 • Youquan Liu, Runnan Chen, Xin Li, Lingdong Kong, Yuchen Yang, Zhaoyang Xia, Yeqi Bai, Xinge Zhu, Yuexin Ma, Yikang Li, Yu Qiao, Yuenan Hou

Besides, we construct the OpenPCSeg codebase, which is the largest and most comprehensive outdoor LiDAR segmentation codebase.

Ranked #2 on 3D Semantic Segmentation on SemanticKITTI (using extra training data)

3D Semantic Segmentation LIDAR Semantic Segmentation +2

303


HAT: Hybrid Attention Transformer for Image Restoration

2 code implementations • 11 Sep 2023 • Xiangyu Chen, Xintao Wang, Wenlong Zhang, Xiangtao Kong, Yu Qiao, Jiantao Zhou, Chao Dong

In the training stage, we additionally adopt a same-task pre-training strategy to further exploit the potential of the model.

Image Compression Image Denoising +2

1,092


Towards Efficient SDRTV-to-HDRTV by Learning from Image Formation

2 code implementations • 8 Sep 2023 • Xiangyu Chen, Zheyuan Li, Zhengwen Zhang, Jimmy S. Ren, Yihao Liu, Jingwen He, Yu Qiao, Jiantao Zhou, Chao Dong

However, the majority of available resources are still in standard dynamic range (SDR).

121


A-Eval: A Benchmark for Cross-Dataset Evaluation of Abdominal Multi-Organ Segmentation

2 code implementations • 7 Sep 2023 • Ziyan Huang, Zhongying Deng, Jin Ye, Haoyu Wang, Yanzhou Su, Tianbin Li, Hui Sun, Junlong Cheng, Jianpin Chen, Junjun He, Yun Gu, Shaoting Zhang, Lixu Gu, Yu Qiao

To address these questions, we introduce A-Eval, a benchmark for the cross-dataset Evaluation ('Eval') of Abdominal ('A') multi-organ segmentation.

Organ Segmentation Segmentation

362


ImageBind-LLM: Multi-modality Instruction Tuning

2 code implementations • 7 Sep 2023 • Jiaming Han, Renrui Zhang, Wenqi Shao, Peng Gao, Peng Xu, Han Xiao, Kaipeng Zhang, Chris Liu, Song Wen, Ziyu Guo, Xudong Lu, Shuai Ren, Yafei Wen, Xiaoxin Chen, Xiangyu Yue, Hongsheng Li, Yu Qiao

During training, we adopt a learnable bind network to align the embedding space between LLaMA and ImageBind's image encoder.

Instruction Following Text Generation

5,532


SEAL: A Framework for Systematic Evaluation of Real-World Super-Resolution

1 code implementation • 6 Sep 2023 • Wenlong Zhang, Xiaohui Li, Xiangyu Chen, Yu Qiao, Xiao-Ming Wu, Chao Dong

In particular, we cluster the extensive degradation space to create a set of representative degradation cases, which serves as a comprehensive test set.

Super-Resolution


SAM-Med2D

3 code implementations • 30 Aug 2023 • Junlong Cheng, Jin Ye, Zhongying Deng, Jianpin Chen, Tianbin Li, Haoyu Wang, Yanzhou Su, Ziyan Huang, Jilong Chen, Lei Jiang, Hui Sun, Junjun He, Shaoting Zhang, Min Zhu, Yu Qiao

To bridge this gap, we introduce SAM-Med2D, the most comprehensive study on applying SAM to medical 2D images.

Decoder Image Segmentation +4

749


DiffBIR: Towards Blind Image Restoration with Generative Diffusion Prior

1 code implementation • 29 Aug 2023 • Xinqi Lin, Jingwen He, Ziyan Chen, Zhaoyang Lyu, Bo Dai, Fanghua Yu, Wanli Ouyang, Yu Qiao, Chao Dong

We present DiffBIR, a general restoration pipeline that could handle different blind image restoration tasks in a unified framework.

Ranked #1 on Blind Face Restoration on LFW

Blind Face Restoration Image Denoising +2

3,047


OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models

2 code implementations • 25 Aug 2023 • Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, Ping Luo

LWC modulates the extreme values of weights by optimizing the clipping threshold.

Common Sense Reasoning Computational Efficiency +3

579


MGMAE: Motion Guided Masking for Video Masked Autoencoding

1 code implementation • ICCV 2023 • Bingkun Huang, Zhiyu Zhao, Guozhen Zhang, Yu Qiao, LiMin Wang

Based on this masking volume, we can track the unmasked tokens in time and sample a set of temporally consistent cubes from videos.

Optical Flow Estimation Representation Learning


Shrinking Class Space for Enhanced Certainty in Semi-Supervised Learning

1 code implementation • ICCV 2023 • Lihe Yang, Zhen Zhao, Lei Qi, Yu Qiao, Yinghuan Shi, Hengshuang Zhao

To mitigate potentially incorrect pseudo labels, recent frameworks mostly set a fixed confidence threshold to discard uncertain samples.

Ranked #1 on Semi-Supervised Image Classification on SVHN, 40 Labels

Semi-Supervised Image Classification

Paper
Code

Foundation Model is Efficient Multimodal Multitask Model Selector

1 code implementation • NeurIPS 2023 • Fanqing Meng, Wenqi Shao, Zhanglin Peng, Chonghe Jiang, Kaipeng Zhang, Yu Qiao, Ping Luo

This paper investigates an under-explored but important problem: given a collection of pre-trained neural networks, predicting their performance on each multi-modal task without fine-tuning them, such as image recognition, referring, captioning, visual question answering, and text question answering.

Model Selection Question Answering +1

Paper
Code

Tiny LVLM-eHub: Early Multimodal Experiments with Bard

1 code implementation • 7 Aug 2023 • Wenqi Shao, Yutao Hu, Peng Gao, Meng Lei, Kaipeng Zhang, Fanqing Meng, Peng Xu, Siyuan Huang, Hongsheng Li, Yu Qiao, Ping Luo

Secondly, it conducts an in-depth analysis of LVLMs' predictions using the ChatGPT Ensemble Evaluation (CEE), which leads to a robust and accurate evaluation and exhibits improved alignment with human evaluation compared to the word matching approach.

Hallucination Visual Reasoning

374

Paper
Code

The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World

1 code implementation • 3 Aug 2023 • Weiyun Wang, Min Shi, Qingyun Li, Wenhai Wang, Zhenhang Huang, Linjie Xing, Zhe Chen, Hao Li, Xizhou Zhu, Zhiguo Cao, Yushi Chen, Tong Lu, Jifeng Dai, Yu Qiao

We present the All-Seeing (AS) project: a large-scale dataset and model for recognizing and understanding everything in the open world.

Question Answering Retrieval +1

380

Paper
Code

Scaling Data Generation in Vision-and-Language Navigation

1 code implementation • ICCV 2023 • Zun Wang, Jialu Li, Yicong Hong, Yi Wang, Qi Wu, Mohit Bansal, Stephen Gould, Hao Tan, Yu Qiao

Recent research in language-guided visual navigation has demonstrated a significant demand for the diversity of traversable environments and the quantity of supervision for training generalizable agents.

Imitation Learning Vision and Language Navigation +1

136

Paper
Code

Text-guided Foundation Model Adaptation for Pathological Image Classification

2 code implementations • 27 Jul 2023 • Yunkun Zhang, Jin Gao, Mu Zhou, Xiaosong Wang, Yu Qiao, Shaoting Zhang, Dequan Wang

In this paper, we propose to Connect Image and Text Embeddings (CITE) to enhance pathological image classification.

Classification Image Classification +1

Paper
Code

TransNormerLLM: A Faster and Better Large Language Model with Improved TransNormer

2 code implementations • 27 Jul 2023 • Zhen Qin, Dong Li, Weigao Sun, Weixuan Sun, Xuyang Shen, Xiaodong Han, Yunshen Wei, Baohong Lv, Xiao Luo, Yu Qiao, Yiran Zhong

TransNormerLLM evolves from the previous linear attention architecture TransNormer by making advanced modifications that include positional embedding, linear attention acceleration, gating mechanisms, tensor normalization, and inference acceleration and stabilization.

Language Modelling Large Language Model

214

Paper
Code

FedMEKT: Distillation-based Embedding Knowledge Transfer for Multimodal Federated Learning

no code implementations • 25 Jul 2023 • Huy Q. Le, Minh N. H. Nguyen, Chu Myaet Thwal, Yu Qiao, Chaoning Zhang, Choong Seon Hong

Bringing this concept into a system, we develop a distillation-based multimodal embedding knowledge transfer mechanism, namely FedMEKT, which allows the server and clients to exchange the joint knowledge of their learning models extracted from a small multimodal proxy dataset.

Federated Learning Human Activity Recognition +2

Paper
Add Code

Meta-Transformer: A Unified Framework for Multimodal Learning

1 code implementation • 20 Jul 2023 • Yiyuan Zhang, Kaixiong Gong, Kaipeng Zhang, Hongsheng Li, Yu Qiao, Wanli Ouyang, Xiangyu Yue

Multimodal learning aims to build models that can process and relate information from multiple modalities.

Time Series

1,442

Paper
Code

Boosting Federated Learning Convergence with Prototype Regularization

no code implementations • 20 Jul 2023 • Yu Qiao, Huy Q. Le, Choong Seon Hong

As a distributed machine learning technique, federated learning (FL) requires clients to collaboratively train a shared model with an edge server without leaking their local data.

Federated Learning

Paper
Add Code

Drive Like a Human: Rethinking Autonomous Driving with Large Language Models

1 code implementation • 14 Jul 2023 • Daocheng Fu, Xin Li, Licheng Wen, Min Dou, Pinlong Cai, Botian Shi, Yu Qiao

In this paper, we explore the potential of using a large language model (LLM) to understand the driving environment in a human-like manner and analyze its ability to reason, interpret, and memorize when facing complex scenarios.

Autonomous Driving Common Sense Reasoning +3

317

Paper
Code

LimSim: A Long-term Interactive Multi-scenario Traffic Simulator

1 code implementation • 13 Jul 2023 • Licheng Wen, Daocheng Fu, Song Mao, Pinlong Cai, Min Dou, Yikang Li, Yu Qiao

With the growing popularity of digital twin and autonomous driving in transportation, the demand for simulation systems capable of generating high-fidelity and reliable scenarios is increasing.

Autonomous Driving

324

Paper
Code

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

1 code implementation • 13 Jul 2023 • Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, Conghui He, Ping Luo, Ziwei Liu, Yali Wang, LiMin Wang, Yu Qiao

Specifically, we utilize a multi-scale approach to generate video-related descriptions.

Action Recognition Contrastive Learning +7

973

Paper
Code

AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

4 code implementations • 10 Jul 2023 • Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, Bo Dai

Once trained, the motion module can be inserted into a personalized T2I model to form a personalized animation generator.

Image Animation

8,945

Paper
Code

Faster Segment Anything: Towards Lightweight SAM for Mobile Applications

2 code implementations • 25 Jun 2023 • Chaoning Zhang, Dongshen Han, Yu Qiao, Jung Uk Kim, Sung-Ho Bae, Seungkyu Lee, Choong Seon Hong

Concretely, we distill the knowledge from the heavy image encoder (ViT-H in the original SAM) to a lightweight image encoder, which can be automatically compatible with the mask decoder in the original SAM.

Decoder Image Segmentation +2

4,324

Paper
Code

Align, Adapt and Inject: Sound-guided Unified Image Generation

no code implementations • 20 Jun 2023 • Yue Yang, Kaipeng Zhang, Yuying Ge, Wenqi Shao, Zeyue Xue, Yu Qiao, Ping Luo

Then, we propose the audio adapter to adapt audio representation into an audio token enriched with specific semantics, which can be injected into a frozen T2I model flexibly.

Image Generation Retrieval +1

Paper
Add Code

MedFMC: A Real-world Dataset and Benchmark For Foundation Model Adaptation in Medical Image Classification

1 code implementation • 16 Jun 2023 • Dequan Wang, Xiaosong Wang, Lilong Wang, Mengzhang Li, Qian Da, Xiaoqiang Liu, Xiangyu Gao, Jun Shen, Junjun He, Tian Shen, Qi Duan, Jie Zhao, Kang Li, Yu Qiao, Shaoting Zhang

Foundation models, often pre-trained with large-scale data, have achieved paramount success in jump-starting various vision and language applications.

Diabetic Retinopathy Grading Image Classification +3

251

Paper
Code

LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models

1 code implementation • 15 Jun 2023 • Peng Xu, Wenqi Shao, Kaipeng Zhang, Peng Gao, Shuo Liu, Meng Lei, Fanqing Meng, Siyuan Huang, Yu Qiao, Ping Luo

Large Vision-Language Models (LVLMs) have recently played a dominant role in multimodal vision-language learning.

Hallucination Image Captioning +3

374

Paper
Code

Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models

no code implementations • 15 Jun 2023 • Junting Pan, Ziyi Lin, Yuying Ge, Xiatian Zhu, Renrui Zhang, Yi Wang, Yu Qiao, Hongsheng Li

Video Question Answering (VideoQA) has advanced significantly with the scaling of recent Large Language Models (LLMs).

Ranked #3 on Temporal/Casual QA on NExT-QA (using extra training data)

Domain Generalization Retrieval +2

Paper
Add Code

Robustness of SAM: Segment Anything Under Corruptions and Beyond

no code implementations • 13 Jun 2023 • Yu Qiao, Chaoning Zhang, Taegoo Kang, Donghun Kim, Chenshuang Zhang, Choong Seon Hong

After interpreting the effects of synthetic corruption as style changes, we conduct a comprehensive evaluation of its robustness against 15 types of common corruption.

Style Transfer

Paper
Add Code

DetZero: Rethinking Offboard 3D Object Detection with Long-term Sequential Point Clouds

1 code implementation • ICCV 2023 • Tao Ma, Xuemeng Yang, Hongbin Zhou, Xin Li, Botian Shi, Junjie Liu, Yuchen Yang, Zhizheng Liu, Liang He, Yu Qiao, Yikang Li, Hongsheng Li

Extensive experiments on the Waymo Open Dataset show that our DetZero outperforms all state-of-the-art onboard and offboard 3D detection methods.

3D Object Detection Object +1

284

Paper
Code

Understanding Segment Anything Model: SAM is Biased Towards Texture Rather than Shape

no code implementations • 3 Jun 2023 • Chaoning Zhang, Yu Qiao, Shehbaz Tariq, Sheng Zheng, Chenshuang Zhang, Chenghao Li, Hyundong Shin, Choong Seon Hong

Different from label-oriented recognition tasks, SAM is trained to predict a mask covering the object shape based on a prompt.

Image Segmentation Semantic Segmentation

Paper
Add Code

Denoising Diffusion Semantic Segmentation with Mask Prior Modeling

no code implementations • 2 Jun 2023 • Zeqiang Lai, Yuchen Duan, Jifeng Dai, Ziheng Li, Ying Fu, Hongsheng Li, Yu Qiao, Wenhai Wang

In this paper, we propose to ameliorate the semantic segmentation quality of existing discriminative approaches with a mask prior modeled by a recently-developed denoising diffusion generative model.

Denoising Segmentation +1

Paper
Add Code

AD-PT: Autonomous Driving Pre-Training with Large-scale Point Cloud Dataset

1 code implementation • NeurIPS 2023 • Jiakang Yuan, Bo Zhang, Xiangchao Yan, Tao Chen, Botian Shi, Yikang Li, Yu Qiao

It is a long-term vision of the Autonomous Driving (AD) community that perception models can learn from a large-scale point cloud dataset to obtain unified representations that achieve promising results on different tasks or benchmarks.

Autonomous Driving Point Cloud Pre-training

570

Paper
Code

DiffInDScene: Diffusion-based High-Quality 3D Indoor Scene Generation

1 code implementation • 1 Jun 2023 • Xiaoliang Ju, Zhaoyang Huang, Yijin Li, Guofeng Zhang, Yu Qiao, Hongsheng Li

In addition to the scene generation, the final part of DiffInDScene can be used as a post-processing module to refine the 3D reconstruction results from multi-view stereo.

3D Generation 3D Reconstruction +1

Paper
Code

DiffRate: Differentiable Compression Rate for Efficient Vision Transformers

1 code implementation • ICCV 2023 • Mengzhao Chen, Wenqi Shao, Peng Xu, Mingbao Lin, Kaipeng Zhang, Fei Chao, Rongrong Ji, Yu Qiao, Ping Luo

Token compression aims to speed up large-scale vision transformers (e.g., ViTs) by pruning (dropping) or merging tokens.

Ranked #4 on Efficient ViTs on ImageNet-1K (with DeiT-S)

Efficient ViTs

Paper
Code

Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory

1 code implementation • 25 May 2023 • Xizhou Zhu, Yuntao Chen, Hao Tian, Chenxin Tao, Weijie Su, Chenyu Yang, Gao Huang, Bin Li, Lewei Lu, Xiaogang Wang, Yu Qiao, Zhaoxiang Zhang, Jifeng Dai

These agents, equipped with the logic and common sense capabilities of LLMs, can skillfully navigate complex, sparse-reward environments with text-based interactions.

Common Sense Reasoning Navigate +1

568

Paper
Code

Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation

1 code implementation • 25 May 2023 • Shilin Yan, Renrui Zhang, Ziyu Guo, Wenchao Chen, Wei Zhang, Hongyang Li, Yu Qiao, Hao Dong, Zhongjiang He, Peng Gao

In this paper, we propose MUTR, a Multi-modal Unified Temporal transformer for Referring video object segmentation.

Ranked #1 on Referring Expression Segmentation on Referring Expressions for DAVIS 2016 & 2017

Object Referring Expression Segmentation +3

Paper
Code

EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought

no code implementations • NeurIPS 2023 • Yao Mu, Qinglong Zhang, Mengkang Hu, Wenhai Wang, Mingyu Ding, Jun Jin, Bin Wang, Jifeng Dai, Yu Qiao, Ping Luo

In this work, we introduce EmbodiedGPT, an end-to-end multi-modal foundation model for embodied AI, empowering embodied agents with multi-modal understanding and execution capabilities.

Image Captioning Language Modelling +3

Paper
Add Code

VideoLLM: Modeling Video Sequence with Large Language Models

1 code implementation • 22 May 2023 • Guo Chen, Yin-Dong Zheng, Jiahao Wang, Jilan Xu, Yifei Huang, Junting Pan, Yi Wang, Yali Wang, Yu Qiao, Tong Lu, LiMin Wang

Building upon this insight, we propose a novel framework called VideoLLM that leverages the sequence reasoning capabilities of pre-trained LLMs from natural language processing (NLP) for video sequence understanding.

Decoder Video Understanding

153

Paper
Code

Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model

1 code implementation • 18 May 2023 • Siyuan Huang, Zhengkai Jiang, Hao Dong, Yu Qiao, Peng Gao, Hongsheng Li

This paper presents Instruct2Act, a framework that utilizes Large Language Models to map multi-modal instructions to sequential actions for robotic manipulation tasks.

Language Modelling Large Language Model +2

264

Paper
Code

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

2 code implementations • NeurIPS 2023 • Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, Jifeng Dai

We hope this model can set a new baseline for generalist vision and language models.

Decoder Language Modelling +1

3,134

Paper
Code

A Survey on Segment Anything Model (SAM): Vision Foundation Model Meets Prompt Engineering

no code implementations • 12 May 2023 • Chaoning Zhang, Fachrina Dewi Puspitasari, Sheng Zheng, Chenghao Li, Yu Qiao, Taegoo Kang, Xinru Shan, Chenshuang Zhang, Caiyan Qin, Francois Rameau, Lik-Hang Lee, Sung-Ho Bae, Choong Seon Hong

This is an ongoing project and we intend to update the manuscript on a regular basis.

Edge Detection Prompt Engineering

Paper
Add Code

VideoChat: Chat-Centric Video Understanding

1 code implementation • 10 May 2023 • Kunchang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, LiMin Wang, Yu Qiao

In this paper, we make an initial attempt at developing an end-to-end chat-centric video understanding system, coined VideoChat.

Ranked #1 on Question Answering on NExT-QA (Open-ended VideoQA)

Video-based Generative Performance Benchmarking (Consistency) Video-based Generative Performance Benchmarking (Contextual Understanding) +5

2,710

Paper
Code

InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language

2 code implementations • 9 May 2023 • Zhaoyang Liu, Yinan He, Wenhai Wang, Weiyun Wang, Yi Wang, Shoufa Chen, Qinglong Zhang, Zeqiang Lai, Yang Yang, Qingyun Li, Jiashuo Yu, Kunchang Li, Zhe Chen, Xue Yang, Xizhou Zhu, Yali Wang, LiMin Wang, Ping Luo, Jifeng Dai, Yu Qiao

Different from existing interactive systems that rely on pure language, by incorporating pointing instructions, the proposed iGPT significantly improves the efficiency of communication between users and chatbots, as well as the accuracy of chatbots in vision-centric tasks, especially in complicated visual scenarios where the number of objects is greater than 2.

Language Modelling

3,134

Paper
Code

Causal Discovery via Conditional Independence Testing with Proxy Variables

1 code implementation • 9 May 2023 • Mingzhou Liu, Xinwei Sun, Yu Qiao, Yizhou Wang

Distinguishing causal connections from correlations is important in many scenarios.

Causal Discovery Causal Identification

Paper
Code

LEO: Generative Latent Image Animator for Human Video Synthesis

5 code implementations • 6 May 2023 • Yaohui Wang, Xin Ma, Xinyuan Chen, Antitza Dantcheva, Bo Dai, Yu Qiao

Our key idea is to represent motion as a sequence of flow maps in the generation process, which inherently isolate motion from appearance.

Disentanglement Video Editing

Paper
Code

Long-Term Rhythmic Video Soundtracker

1 code implementation • 2 May 2023 • Jiashuo Yu, Yaohui Wang, Xinyuan Chen, Xiao Sun, Yu Qiao

To this end, we present Long-Term Rhythmic Video Soundtracker (LORIS), a novel framework to synthesize long-term conditional waveforms.

Paper
Code

Segment Anything Model (SAM) Meets Glass: Mirror and Transparent Objects Cannot Be Easily Detected

no code implementations • 29 Apr 2023 • Dongsheng Han, Chaoning Zhang, Yu Qiao, Maryam Qamar, Yuna Jung, Seungkyu Lee, Sung-Ho Bae, Choong Seon Hong

Meta AI Research has recently released SAM (Segment Anything Model) which is trained on a large segmentation dataset of over 1 billion masks.

Segmentation Semantic Segmentation +1

Paper
Add Code

LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

3 code implementations • 28 Apr 2023 • Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, Hongsheng Li, Yu Qiao

This strategy effectively alleviates the interference between the two tasks of image-text alignment and instruction following and achieves strong multi-modal reasoning with only a small-scale image-text and instruction dataset.

Ranked #6 on Visual Question Answering (VQA) on InfiMM-Eval

Instruction Following Optical Character Recognition (OCR) +7

5,532

Paper
Code

Hierarchical Diffusion Autoencoders and Disentangled Image Manipulation

no code implementations • 24 Apr 2023 • Zeyu Lu, Chengyue Wu, Xinyuan Chen, Yaohui Wang, Lei Bai, Yu Qiao, Xihui Liu

To mitigate those limitations, we propose Hierarchical Diffusion Autoencoders (HDAE) that exploit the fine-grained-to-abstract and low-level-to-high-level feature hierarchy for the latent space of diffusion models.

Image Generation Image Manipulation +1

Paper
Add Code

Perception Imitation: Towards Synthesis-free Simulator for Autonomous Vehicles

no code implementations • 19 Apr 2023 • Xiaoliang Ju, Yiyang Sun, Yiming Hao, Yikang Li, Yu Qiao, Hongsheng Li

We propose a perception imitation method to simulate the results of a given perception model, and discuss a new heuristic route toward an autonomous driving simulator without data synthesis.

Autonomous Driving

Paper
Add Code

STU-Net: Scalable and Transferable Medical Image Segmentation Models Empowered by Large-Scale Supervised Pre-training

no code implementations • 13 Apr 2023 • Ziyan Huang, Haoyu Wang, Zhongying Deng, Jin Ye, Yanzhou Su, Hui Sun, Junjun He, Yun Gu, Lixu Gu, Shaoting Zhang, Yu Qiao

However, the state-of-the-art models for medical image segmentation are still small-scale, with their parameters only in the tens of millions.

Image Segmentation Medical Image Segmentation +2

Paper
Add Code

One Small Step for Generative AI, One Giant Leap for AGI: A Complete Survey on ChatGPT in AIGC Era

no code implementations • 4 Apr 2023 • Chaoning Zhang, Chenshuang Zhang, Chenghao Li, Yu Qiao, Sheng Zheng, Sumit Kumar Dam, Mengchun Zhang, Jung Uk Kim, Seong Tae Kim, Jinwoo Choi, Gyeong-Moon Park, Sung-Ho Bae, Lik-Hang Lee, Pan Hui, In So Kweon, Choong Seon Hong

Overall, this work is the first to survey ChatGPT with a comprehensive review of its underlying technology, applications, and challenges.

Paper
Add Code

MP-FedCL: Multiprototype Federated Contrastive Learning for Edge Intelligence

no code implementations • 1 Apr 2023 • Yu Qiao, Md. Shirajum Munir, Apurba Adhikary, Huy Q. Le, Avi Deb Raha, Chaoning Zhang, Choong Seon Hong

The existing single prototype-based strategy represents a class by using the mean of the feature space.

Contrastive Learning Federated Learning

Paper
Add Code

VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking

1 code implementation • CVPR 2023 • LiMin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, Yu Qiao

Finally, we successfully train a video ViT model with a billion parameters, which achieves a new state-of-the-art performance on the datasets of Kinetics (90.0% on K400 and 89.9% on K600) and Something-Something (68.7% on V1 and 77.0% on V2).

Ranked #1 on Self-Supervised Action Recognition on UCF101 (using extra training data)

Action Classification Action Recognition In Videos +4

409

Paper
Code

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

7 code implementations • 28 Mar 2023 • Renrui Zhang, Jiaming Han, Chris Liu, Peng Gao, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Yu Qiao

We present LLaMA-Adapter, a lightweight adaption method to efficiently fine-tune LLaMA into an instruction-following model.

Ranked #2 on Music Question Answering on MusicQA

Instruction Following Language Modelling +3

5,819

Paper
Code

Unmasked Teacher: Towards Training-Efficient Video Foundation Models

1 code implementation • ICCV 2023 • Kunchang Li, Yali Wang, Yizhuo Li, Yi Wang, Yinan He, LiMin Wang, Yu Qiao

Previous VFMs rely on Image Foundation Models (IFMs), which face challenges in transferring to the video domain.

Ranked #1 on Video Retrieval on SSv2-template retrieval (using extra training data)

Action Classification Action Recognition +5

250

Paper
Code

Fine-grained Audible Video Description

1 code implementation • CVPR 2023 • Xuyang Shen, Dong Li, Jinxing Zhou, Zhen Qin, Bowen He, Xiaodong Han, Aixuan Li, Yuchao Dai, Lingpeng Kong, Meng Wang, Yu Qiao, Yiran Zhong

We explore a new task for audio-visual-language modeling called fine-grained audible video description (FAVD).

Language Modelling Masked Language Modeling +5

Paper
Code

Prototype Helps Federated Learning: Towards Faster Convergence

no code implementations • 22 Mar 2023 • Yu Qiao, Seong-Bae Park, Sun Moo Kang, Choong Seon Hong

In this paper, a prototype-based federated learning framework is proposed, which can achieve better inference performance with only a few changes to the last global iteration of the typical federated learning process.

Federated Learning

Paper
Add Code

A Complete Survey on Generative AI (AIGC): Is ChatGPT from GPT-4 to GPT-5 All You Need?

no code implementations • 21 Mar 2023 • Chaoning Zhang, Chenshuang Zhang, Sheng Zheng, Yu Qiao, Chenghao Li, Mengchun Zhang, Sumit Kumar Dam, Chu Myaet Thwal, Ye Lin Tun, Le Luang Huy, Donguk Kim, Sung-Ho Bae, Lik-Hang Lee, Yang Yang, Heng Tao Shen, In So Kweon, Choong Seon Hong

As ChatGPT goes viral, generative AI (AIGC, a.k.a. AI-generated content) has made headlines everywhere because of its ability to analyze and create text, images, and beyond.

Language Modelling

Paper
Add Code

Video Dehazing via a Multi-Range Temporal Alignment Network with Physical Prior

1 code implementation • CVPR 2023 • Jiaqi Xu, Xiaowei Hu, Lei Zhu, Qi Dou, Jifeng Dai, Yu Qiao, Pheng-Ann Heng

Video dehazing aims to recover haze-free frames with high visibility and contrast.

Paper
Code

Uni3D: A Unified Baseline for Multi-dataset 3D Object Detection

1 code implementation • CVPR 2023 • Bo Zhang, Jiakang Yuan, Botian Shi, Tao Chen, Yikang Li, Yu Qiao

In this paper, we study the task of training a unified 3D detector from multiple datasets.

3D Object Detection object-detection

570

Paper
Code

SCPNet: Semantic Scene Completion on Point Cloud

1 code implementation • CVPR 2023 • Zhaoyang Xia, Youquan Liu, Xin Li, Xinge Zhu, Yuexin Ma, Yikang Li, Yuenan Hou, Yu Qiao

We propose a simple yet effective label rectification strategy, which uses off-the-shelf panoptic segmentation labels to remove the traces of dynamic objects in completion labels, greatly improving the performance of deep models especially for those moving objects.

Ranked #1 on 3D Semantic Scene Completion on SemanticKITTI

3D Semantic Scene Completion Knowledge Distillation +3

Paper
Code

Aleth-NeRF: Low-light Condition View Synthesis with Concealing Fields

1 code implementation • 10 Mar 2023 • Ziteng Cui, Lin Gu, Xiao Sun, Xianzheng Ma, Yu Qiao, Tatsuya Harada

Commonly captured low-light scenes are challenging for most computer vision techniques, including Neural Radiance Fields (NeRF).

Paper
Code

Bi3D: Bi-domain Active Learning for Cross-domain 3D Object Detection

1 code implementation • CVPR 2023 • Jiakang Yuan, Bo Zhang, Xiangchao Yan, Tao Chen, Botian Shi, Yikang Li, Yu Qiao

The Unsupervised Domain Adaptation (UDA) technique has recently been explored in 3D cross-domain tasks.

3D Object Detection Active Learning +2

570

Paper
Code

Rethinking Range View Representation for LiDAR Segmentation

no code implementations • ICCV 2023 • Lingdong Kong, Youquan Liu, Runnan Chen, Yuexin Ma, Xinge Zhu, Yikang Li, Yuenan Hou, Yu Qiao, Ziwei Liu

We show that, for the first time, a range view method is able to surpass the point, voxel, and multi-view fusion counterparts in the competing LiDAR semantic and panoptic segmentation benchmarks, i.e., SemanticKITTI, nuScenes, and ScribbleKITTI.

Ranked #4 on 3D Semantic Segmentation on SemanticKITTI

3D Semantic Segmentation Autonomous Driving +4

Paper
Add Code

FCN+: Global Receptive Convolution Makes FCN Great Again

no code implementations • 8 Mar 2023 • Zhongying Deng, Xiaoyu Ren, Jin Ye, Junjun He, Yu Qiao

The motivation of GRC is that different channels of a convolutional filter can have different grid sampling locations across the whole input feature map.

Segmentation Semantic Segmentation

Paper
Add Code

LoGoNet: Towards Accurate 3D Object Detection with Local-to-Global Cross-Modal Fusion

1 code implementation • CVPR 2023 • Xin Li, Tao Ma, Yuenan Hou, Botian Shi, Yuchen Yang, Youquan Liu, Xingjiao Wu, Qin Chen, Yikang Li, Yu Qiao, Liang He

Notably, LoGoNet ranks 1st on the Waymo 3D object detection leaderboard and obtains 81.02 mAPH (L2) detection performance.

3D Object Detection object-detection +1

Paper
Code

OpenICL: An Open-Source Framework for In-context Learning

3 code implementations • 6 Mar 2023 • Zhenyu Wu, Yaoxiang Wang, Jiacheng Ye, Jiangtao Feng, Jingjing Xu, Yu Qiao, Zhiyong Wu

However, the implementation of ICL is sophisticated due to the diverse retrieval and inference methods involved, as well as the varying pre-processing requirements for different models, datasets, and tasks.

In-Context Learning Language Modelling +4

503

Paper
Code

Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners

3 code implementations • CVPR 2023 • Renrui Zhang, Xiangfei Hu, Bohao Li, Siyuan Huang, Hanqiu Deng, Hongsheng Li, Yu Qiao, Peng Gao

Our CaFo incorporates CLIP's language-contrastive knowledge, DINO's vision-contrastive knowledge, DALL-E's vision-generative knowledge, and GPT-3's language-generative knowledge.

Few-Shot Learning Representation Learning

478

Paper
Code

Uncertainty-Estimation with Normalized Logits for Out-of-Distribution Detection

no code implementations • 15 Feb 2023 • Mouxiao Huang, Yu Qiao

However, neural networks often suffer from the overconfidence issue, assigning high confidence to OOD data that are never seen during training and may be irrelevant to the training data, namely in-distribution (ID) data.

Autonomous Driving Medical Diagnosis +2

Paper
Add Code

Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision

1 code implementation • CVPR 2023 • Jilan Xu, Junlin Hou, Yuejie Zhang, Rui Feng, Yi Wang, Yu Qiao, Weidi Xie

The former aims to infer all masked entities in the caption given the group tokens, which enables the model to learn fine-grained alignment between visual groups and text entities.

Open Vocabulary Semantic Segmentation Semantic Segmentation

Paper
Code

CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP

1 code implementation • CVPR 2023 • Runnan Chen, Youquan Liu, Lingdong Kong, Xinge Zhu, Yuexin Ma, Yikang Li, Yuenan Hou, Yu Qiao, Wenping Wang

For the first time, our pre-trained network achieves annotation-free 3D semantic segmentation with 20.8% and 25.08% mIoU on nuScenes and ScanNet, respectively.

3D Semantic Segmentation Contrastive Learning +4

132

Paper
Code
