RemoteCLIP: A Vision Language Foundation Model for Remote Sensing

19 Jun 2023  ·  Fan Liu, Delong Chen, Zhangqingyun Guan, Xiaocong Zhou, Jiale Zhu, Qiaolin Ye, Liyong Fu, Jun Zhou

General-purpose foundation models have led to recent breakthroughs in artificial intelligence. In remote sensing, self-supervised learning (SSL) and masked image modeling (MIM) have been adopted to build foundation models. However, these models primarily learn low-level features and require annotated data for fine-tuning. Moreover, they are inapplicable to retrieval and zero-shot applications due to the lack of language understanding. To address these limitations, we propose RemoteCLIP, the first vision-language foundation model for remote sensing, which aims to learn robust visual features with rich semantics and aligned text embeddings for seamless downstream application. To address the scarcity of pre-training data, we leverage data scaling, which converts heterogeneous annotations into a unified image-caption data format based on Box-to-Caption (B2C) and Mask-to-Box (M2B) conversion. By further incorporating UAV imagery, we produce a 12$\times$ larger pretraining dataset than the combination of all available datasets. RemoteCLIP can be applied to a variety of downstream tasks, including zero-shot image classification, linear probing, $\textit{k}$-NN classification, few-shot classification, image-text retrieval, and object counting in remote sensing images. Evaluation on 16 datasets, including a newly introduced RemoteCount benchmark for testing object counting ability, shows that RemoteCLIP consistently outperforms baseline foundation models across different model scales. Impressively, RemoteCLIP beats the state-of-the-art method by 9.14% mean recall on the RSITMD dataset and by 8.92% on the RSICD dataset. For zero-shot classification, RemoteCLIP outperforms the CLIP baseline by up to 6.39% average accuracy on 12 downstream datasets. Project website: https://github.com/ChenDelong1999/RemoteCLIP
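
The Box-to-Caption (B2C) step mentioned above turns object-detection annotations into caption text so that detection datasets can join the image-caption pretraining pool. The toy sketch below only illustrates the idea of generating count-based captions from box labels; the function name and caption templates are hypothetical, and the paper's actual rule set differs in detail.

```python
# Toy illustration of a Box-to-Caption (B2C) style conversion: detection boxes
# are summarized into natural-language captions. Templates and helper below are
# hypothetical; the paper defines its own rule-based caption generation.
from collections import Counter

def boxes_to_captions(boxes: list[dict]) -> list[str]:
    """boxes: [{"label": "airplane", "bbox": [x1, y1, x2, y2]}, ...]"""
    counts = Counter(b["label"] for b in boxes)
    captions = []
    for label, n in counts.items():
        if n == 1:
            captions.append(f"There is one {label} in the image.")
        else:
            captions.append(f"There are {n} {label}s in the image.")
    return captions

print(boxes_to_captions([
    {"label": "airplane", "bbox": [10, 20, 80, 90]},
    {"label": "airplane", "bbox": [120, 40, 200, 110]},
    {"label": "vehicle", "bbox": [5, 5, 30, 25]},
]))
```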

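Because RemoteCLIP keeps the CLIP interface (an image encoder and a text encoder aligned in a shared embedding space), zero-shot scene classification reduces to comparing an image embedding against prompt embeddings. The sketch below assumes the released checkpoint (e.g. RemoteCLIP-ViT-B-32.pt from the project repository) is loaded into a standard open_clip ViT-B-32 model, as the project README suggests; the prompt template, class names, and file paths are illustrative.

```python
# Minimal zero-shot classification sketch with an open_clip ViT-B-32 backbone.
# Checkpoint path, prompt template, and class names are assumptions.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

ckpt = torch.load("RemoteCLIP-ViT-B-32.pt", map_location="cpu")  # downloaded checkpoint (assumed path)
model.load_state_dict(ckpt)
model.eval()

class_names = ["airport", "beach", "farmland", "forest", "harbor"]  # illustrative labels
text = tokenizer([f"a satellite image of a {c}" for c in class_names])
image = preprocess(Image.open("scene.jpg")).unsqueeze(0)  # any remote sensing image

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Cosine similarity between the image and each class prompt -> softmax probabilities
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(class_names, probs[0].tolist())))
```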

Results from the Paper


Ranked #2 on Cross-Modal Retrieval on RSITMD (using extra training data)

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Cross-Modal Retrieval | RSICD | RemoteCLIP | Mean Recall | 36.35% | #2 |
| Cross-Modal Retrieval | RSICD | RemoteCLIP | Image-to-text R@1 | 18.39% | #2 |
| Cross-Modal Retrieval | RSICD | RemoteCLIP | Text-to-image R@1 | 14.73% | #2 |
| Cross-Modal Retrieval | RSITMD | RemoteCLIP | Mean Recall | 50.52% | #2 |
| Cross-Modal Retrieval | RSITMD | RemoteCLIP | Image-to-text R@1 | 28.76% | #2 |
| Cross-Modal Retrieval | RSITMD | RemoteCLIP | Text-to-image R@1 | 23.76% | #2 |
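
For reference, R@K on these benchmarks is the fraction of queries whose ground-truth match appears among the top-K retrieved candidates, and mean recall is conventionally the average of R@1, R@5, and R@10 over both retrieval directions. The sketch below assumes a single matched caption per image for simplicity (RSICD and RSITMD actually provide several captions per image, which changes the bookkeeping but not the idea).

```python
# Sketch of R@K and mean recall for image-text retrieval, assuming one
# ground-truth caption per image (index-aligned pairs).
import torch

def recall_at_k(similarity: torch.Tensor, k: int) -> float:
    """similarity[i, j] = score between query i and candidate j; the correct
    candidate for query i is assumed to sit at index i."""
    ranks = similarity.argsort(dim=-1, descending=True)
    correct = torch.arange(similarity.size(0)).unsqueeze(1)
    hits = (ranks[:, :k] == correct).any(dim=-1)
    return hits.float().mean().item() * 100

# image_feats, text_feats: L2-normalized embeddings of N matched image-text pairs
N, d = 100, 512
image_feats = torch.nn.functional.normalize(torch.randn(N, d), dim=-1)
text_feats = torch.nn.functional.normalize(torch.randn(N, d), dim=-1)

sim = image_feats @ text_feats.T                        # image-to-text scores
scores = [recall_at_k(sim, k) for k in (1, 5, 10)]      # image-to-text R@1/5/10
scores += [recall_at_k(sim.T, k) for k in (1, 5, 10)]   # text-to-image R@1/5/10
mean_recall = sum(scores) / len(scores)
print(f"mean recall: {mean_recall:.2f}%")
```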
