SK-VG is a dataset for Scene Knowledge-guided Visual Grounding, where the image content and referring expressions are not sufficient to ground the target objects, forcing the models to have a reasoning ability on the long-form scene knowledge. To perform this task, SK-VG is the first dataset of the fourth type, where for each image, we provide human-written knowledge to describe its content.
Paper | Code | Results | Date | Stars |
---|