2 code implementations • 6 Feb 2024 • Quan Sun, Jinsheng Wang, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Xinlong Wang
Scaling up contrastive language-image pretraining (CLIP) is critical for empowering both vision and multimodal models.
Ranked #1 on Zero-Shot Transfer Image Classification on SUN
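The contrastive objective behind CLIP-style pretraining can be pictured as a symmetric InfoNCE loss over a batch of paired image and text embeddings. A minimal NumPy sketch, where the function name and the temperature value are illustrative rather than taken from the paper:

```python
import numpy as np

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each matrix is a matched image-text pair.
    Embeddings are assumed L2-normalized, shape (batch, dim)."""
    logits = image_embs @ text_embs.T / temperature   # (batch, batch) similarities
    n = logits.shape[0]

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)          # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()  # diagonal = positives

    # average the image->text and text->image directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Scaling this objective is mainly a matter of batch size and encoder capacity; the loss itself stays the same.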
1 code implementation • 20 Dec 2023 • Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, Xinlong Wang
The human ability to easily solve multimodal tasks in context (i.e., with only a few demonstrations or simple instructions) is what current multimodal systems have largely struggled to imitate.
Ranked #21 on Visual Question Answering on MM-Vet
1 code implementation • 31 Oct 2023 • Qiying Yu, Quan Sun, Xiaosong Zhang, Yufeng Cui, Fan Zhang, Yue Cao, Xinlong Wang, Jingjing Liu
To provide higher-quality and more scalable multimodal pretraining data, we propose CapsFusion, an advanced framework that leverages large language models to consolidate and refine information from both web-based image-text pairs and synthetic captions.
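The consolidation step can be pictured as prompting a large language model with both caption sources at once. The template below is a hypothetical illustration, not the prompt actually used by CapsFusion:

```python
def build_fusion_prompt(web_caption: str, synthetic_caption: str) -> str:
    """Build an instruction asking an LLM to merge a noisy web caption (rich in
    real-world knowledge) with a clean but generic synthetic caption.
    The wording is a made-up placeholder, not the paper's actual template."""
    return (
        "Merge the two descriptions of the same image into one accurate, "
        "informative caption, keeping real-world details and dropping noise.\n"
        f"Web caption: {web_caption}\n"
        f"Synthetic caption: {synthetic_caption}\n"
        "Refined caption:"
    )
```

The refined captions produced this way are then used as the pretraining targets in place of either raw source.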
2 code implementations • 11 Jul 2023 • Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, Xinlong Wang
We present Emu, a Transformer-based multimodal foundation model that can seamlessly generate images and texts in a multimodal context.
Ranked #1 on Visual Question Answering on VQA v2
1 code implementation • 6 Apr 2023 • Xinlong Wang, Xiaosong Zhang, Yue Cao, Wen Wang, Chunhua Shen, Tiejun Huang
We unify various segmentation tasks into a generalist in-context learning framework that accommodates different kinds of segmentation data by transforming them into the same format of images.
Ranked #1 on Few-Shot Semantic Segmentation on PASCAL-5i (5-Shot) (using extra training data)
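Casting segmentation as in-context image generation means a label map must be rendered as an ordinary color image. A minimal sketch of that transformation (the palette choice here is arbitrary, not the paper's):

```python
import numpy as np

def mask_to_image(mask, palette=None):
    """Render a segmentation mask as an RGB image so mask prediction becomes
    an image-generation problem. mask: (H, W) integer class map."""
    if palette is None:
        rng = np.random.default_rng(0)
        # one arbitrary color per class id (illustrative placeholder palette)
        palette = rng.integers(0, 256, size=(int(mask.max()) + 1, 3), dtype=np.uint8)
    return palette[mask]  # (H, W, 3) color-coded image
```

Once every segmentation dataset is expressed as pairs of images in this format, a single generalist model can be trained on all of them.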
no code implementations • ICCV 2023 • Xinlong Wang, Xiaosong Zhang, Yue Cao, Wen Wang, Chunhua Shen, Tiejun Huang
1 code implementation • 30 May 2022 • Xiaosong Zhang, Yunjie Tian, Wei Huang, Qixiang Ye, Qi Dai, Lingxi Xie, Qi Tian
A key idea of efficient implementation is to discard the masked image patches (or tokens) throughout the target network (encoder), which requires the encoder to be a plain vision transformer (e.g., ViT), even though hierarchical vision transformers (e.g., Swin Transformer) have potentially better properties for formulating vision inputs.
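The token-discarding step is only straightforward for a plain ViT, whose self-attention accepts an arbitrary subset of patch tokens; a hierarchical model such as Swin needs the full 2-D grid for its window attention. A minimal sketch of the discarding step (MAE-style random masking; names are illustrative):

```python
import numpy as np

def discard_masked_patches(tokens, mask_ratio=0.75, rng=None):
    """Keep only the visible patch tokens, so the encoder never sees masked ones.
    tokens: (num_patches, dim). Returns the kept tokens plus their original
    indices, which a decoder later uses to scatter tokens back into place."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n = tokens.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    kept_idx = np.sort(rng.permutation(n)[:n_keep])  # random visible positions
    return tokens[kept_idx], kept_idx
```

With a 75% mask ratio, the encoder processes only a quarter of the sequence, which is where the efficiency gain comes from.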
3 code implementations • ICCV 2023 • Feng Liu, Xiaosong Zhang, Zhiliang Peng, Zonghao Guo, Fang Wan, Xiangyang Ji, Qixiang Ye
However, components other than the backbone networks, such as the detector head and the feature pyramid network (FPN), are still trained from scratch, which hinders fully tapping the potential of the representation models.
Ranked #3 on Few-Shot Object Detection on MS-COCO (30-shot)
1 code implementation • 6 Oct 2021 • Zhiliang Peng, Wei Huang, Zonghao Guo, Xiaosong Zhang, Jianbin Jiao, Qixiang Ye
We propose to jointly optimize empirical risks of the unbalanced and balanced domains and approximate their domain divergence by intra-class and inter-class distances, with the aim to adapt models trained on the long-tailed distribution to general distributions in an interpretable way.
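The divergence approximation can be sketched as the ratio of mean intra-class spread to mean inter-class center distance; this is an illustrative simplification, and the paper's exact formulation differs:

```python
import numpy as np

def domain_divergence(features, labels):
    """Approximate domain divergence via intra-class vs. inter-class distances.
    features: (N, D), labels: (N,). Tighter classes that are further apart
    give a smaller divergence estimate (illustrative sketch only)."""
    classes = np.unique(labels)
    centers = np.stack([features[labels == c].mean(axis=0) for c in classes])
    # mean distance of samples to their own class center
    intra = np.mean([
        np.linalg.norm(features[labels == c] - centers[i], axis=1).mean()
        for i, c in enumerate(classes)
    ])
    # mean pairwise distance between class centers
    diffs = centers[:, None, :] - centers[None, :, :]
    pair = np.linalg.norm(diffs, axis=-1)
    inter = pair[np.triu_indices(len(classes), k=1)].mean()
    return intra / (inter + 1e-8)
```

Minimizing such a quantity alongside the two empirical risks pushes the long-tail-trained model toward a representation that also serves the balanced domain.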
2 code implementations • CVPR 2021 • Zonghao Guo, Chang Liu, Xiaosong Zhang, Jianbin Jiao, Xiangyang Ji, Qixiang Ye
Detecting oriented and densely packed objects remains challenging due to spatial feature aliasing caused by the intersection of receptive fields between objects.
Ranked #34 on Object Detection In Aerial Images on DOTA (using extra training data)
4 code implementations • NeurIPS 2019 • Xiaosong Zhang, Fang Wan, Chang Liu, Rongrong Ji, Qixiang Ye
In this study, we propose a learning-to-match approach to break the IoU restriction, allowing objects to match anchors in a flexible manner.
Ranked #136 on Object Detection on COCO test-dev
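The flexible matching replaces a hard IoU threshold with a per-object bag of candidate anchors, from which training learns which anchor represents the object best. A minimal sketch of the bag construction (the bag size is illustrative):

```python
import numpy as np

def build_anchor_bags(iou_matrix, bag_size=5):
    """For each ground-truth object, collect its top-IoU candidate anchors into
    a 'bag'; a learnable matching then decides which anchor in the bag
    represents the object, instead of a fixed IoU cutoff.
    iou_matrix: (num_objects, num_anchors) IoU between GT boxes and anchors."""
    order = np.argsort(-iou_matrix, axis=1)  # anchors by descending IoU
    return order[:, :bag_size]               # (num_objects, bag_size) indices
```

During training, a likelihood over each bag then selects the anchor to optimize, rather than committing to the highest-IoU one up front.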
no code implementations • 12 Feb 2019 • Xiaolei Liu, Xiaojiang Du, Xiaosong Zhang, Qingxin Zhu, Mohsen Guizani
An automated testing framework is needed to help these learning-based malware detection systems for IoT devices perform security analysis.
no code implementations • 26 Jan 2019 • Xiaolei Liu, Xiaosong Zhang, Kun Wan, Qingxin Zhu, Yufei Ding
In this paper, we propose weighted-sampling audio adversarial examples, focusing on the numbers and the weights of distortion to reinforce the attack.
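One way to read "numbers and weights of distortion" is a sparse perturbation whose position count and per-sample magnitude are both controlled. A hedged sketch; the names and the precomputed gradient-sign input are illustrative, not the paper's exact algorithm:

```python
import numpy as np

def weighted_sparse_perturbation(audio, grad_sign, n_points=100, weight=0.002, rng=None):
    """Perturb only a sampled subset of audio samples, scaling each by a weight,
    rather than distorting the whole waveform.
    audio: (num_samples,) float waveform in [-1, 1].
    grad_sign: (num_samples,) sign of the attack gradient (assumed precomputed)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    idx = rng.choice(len(audio), size=n_points, replace=False)  # sampled positions
    adv = audio.copy()
    adv[idx] += weight * grad_sign[idx]  # weighted, sparse distortion
    return np.clip(adv, -1.0, 1.0)
```

Keeping the distortion sparse and small is what makes such examples hard to perceive while still flipping the recognizer's output.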
no code implementations • 26 Jan 2019 • Xiaolei Liu, Yuheng Luo, Xiaosong Zhang, Qingxin Zhu
Our experimental results show that both MNIST images and CIFAR-10 images can be perturbed to successfully generate a black-box attack with 100% probability on average.