TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Zero-Shot Composed Image Retrieval (ZS-CIR)	CIRCO	CIReVL	mAP@10	27.59	# 1
Zero-Shot Composed Image Retrieval (ZS-CIR)	CIRR	CIReVL	R@5	64.29	# 4
Zero-Shot Composed Image Retrieval (ZS-CIR)	Fashion IQ	CIReVL (Training-Free)	(Recall@10+Recall@50)/2	42.28	# 6

Vision-by-Language for Training-Free Compositional Image Retrieval

13 Oct 2023 · Shyamgopal Karthik, Karsten Roth, Massimiliano Mancini, Zeynep Akata

Given an image and a target modification (e.g., an image of the Eiffel Tower and the text "without people and at night-time"), Compositional Image Retrieval (CIR) aims to retrieve the relevant target image from a database. While supervised approaches rely on costly annotation of triplets (i.e., query image, textual modification, and target image), recent research sidesteps this need by using large-scale vision-language models (VLMs) to perform Zero-Shot CIR (ZS-CIR). However, state-of-the-art approaches in ZS-CIR still require training task-specific, customized models over large amounts of image-text pairs. In this work, we propose to tackle CIR in a training-free manner via our Compositional Image Retrieval through Vision-by-Language (CIReVL), a simple yet human-understandable and scalable pipeline that effectively recombines large-scale VLMs with large language models (LLMs). By captioning the reference image using a pre-trained generative VLM and asking an LLM to recompose the caption based on the textual target modification for subsequent retrieval via, e.g., CLIP, we achieve modular language reasoning. On four ZS-CIR benchmarks, we find competitive, in part state-of-the-art performance, improving over supervised methods. Moreover, the modularity of CIReVL offers simple scalability without re-training, allowing us to investigate both scaling laws and bottlenecks for ZS-CIR while easily scaling up to, in parts, more than double the previously reported results. Finally, we show that CIReVL makes CIR human-understandable by composing image and text in a modular fashion in the language domain, thereby making it intervenable and allowing failure cases to be re-aligned post hoc. Code will be released upon acceptance.
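
The three stages named in the abstract (caption the reference image, have an LLM recompose the caption, retrieve with a text query) can be sketched as below. This is only an illustrative toy and not the authors' implementation: `caption_image` and `recompose_caption` are hypothetical callables standing in for a generative VLM and an LLM, and `embed_text` is a bag-of-words stand-in for a CLIP text encoder. The gallery vectors are likewise built from text descriptions here, whereas a real system would presumably use image embeddings.

```python
import math
from collections import Counter
from typing import Callable, Dict, List


def embed_text(text: str) -> Counter:
    """Toy bag-of-words embedding (assumption: stand-in for a CLIP text encoder)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def compose_and_retrieve(
    reference_image: str,
    modification: str,
    caption_image: Callable[[str], str],           # hypothetical VLM captioner
    recompose_caption: Callable[[str, str], str],  # hypothetical LLM rewriter
    gallery: Dict[str, Counter],                   # image id -> embedding
    top_k: int = 3,
) -> List[str]:
    """Caption the reference, rewrite the caption, then rank the gallery."""
    caption = caption_image(reference_image)
    target_caption = recompose_caption(caption, modification)
    query = embed_text(target_caption)
    ranked = sorted(gallery, key=lambda name: cosine(query, gallery[name]), reverse=True)
    return ranked[:top_k]


# Hard-coded stand-ins so the sketch runs without any model.
def fake_captioner(image: str) -> str:
    return "the eiffel tower with people during the day"


def fake_llm(caption: str, modification: str) -> str:
    return "the eiffel tower at night without people"


gallery = {
    "eiffel_day.jpg": embed_text("the eiffel tower with people during the day"),
    "eiffel_night.jpg": embed_text("the eiffel tower at night without people"),
    "louvre.jpg": embed_text("the louvre museum at night"),
}

result = compose_and_retrieve(
    "reference.jpg", "without people and at night-time",
    fake_captioner, fake_llm, gallery, top_k=2,
)
print(result)
```

Because the intermediate caption and its rewrite are plain strings, either one can be inspected or manually edited before retrieval, which is the kind of intervention the abstract refers to.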

Code

explainableml/vision_by_language official

Tasks

Image Retrieval

Retrieval

Zero-Shot Composed Image Retrieval (ZS-CIR)

Datasets

Fashion IQ

CIRR

CIRCO

GeneCIS

Results from the Paper

Ranked #1 on Zero-Shot Composed Image Retrieval (ZS-CIR) on CIRCO
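
For readers unfamiliar with the metric names reported for this paper (R@5, mAP@10, and the averaged (Recall@10+Recall@50)/2), the sketch below shows one common way such quantities are computed for a single query. The function names are hypothetical, and the normalization of average precision by min(K, number of ground truths) is an assumption about the benchmark protocol, not something stated on this page. Reported scores are typically these values averaged over all queries and expressed as percentages.

```python
from typing import List, Set


def recall_at_k(ranked: List[str], targets: Set[str], k: int) -> float:
    """1.0 if any ground-truth item appears in the top-k results, else 0.0."""
    return 1.0 if any(item in targets for item in ranked[:k]) else 0.0


def average_precision_at_k(ranked: List[str], targets: Set[str], k: int) -> float:
    """Average precision over the top-k results for one query.

    Assumes `ranked` has no duplicates. Normalizing by min(k, len(targets))
    is an assumed convention for multi-target benchmarks.
    """
    if not targets:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in targets:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / min(k, len(targets))


def averaged_recall(ranked: List[str], targets: Set[str]) -> float:
    """(Recall@10 + Recall@50) / 2 for a single query."""
    return (recall_at_k(ranked, targets, 10) + recall_at_k(ranked, targets, 50)) / 2


# Toy ranking with two ground-truth items at ranks 2 and 4.
ranked = ["a", "b", "c", "d"]
ap = average_precision_at_k(ranked, {"b", "d"}, k=10)
print(ap)
```

In the toy ranking the two hits contribute precisions 1/2 and 2/4, so the average precision is (0.5 + 0.5) / 2 = 0.5.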

Methods

BLIP • CLIP
