See-Through-Text Grouping for Referring Image Segmentation
Motivated by conventional grouping techniques for image segmentation, we develop their DNN counterpart to tackle the referring variant. The proposed method is driven by a convolutional-recurrent neural network (ConvRNN) that iteratively carries out top-down processing of bottom-up segmentation cues. Given a natural language referring expression, our method learns to predict its relevance to each pixel and derives a See-through-Text Embedding Pixelwise (STEP) heatmap, which reveals pixel-level segmentation cues via the learned visual-textual co-embedding. The ConvRNN performs a top-down approximation by converting the STEP heatmap into a refined one, where the improvement is expected to come from training the network with a classification loss against the ground truth. With the refined heatmap, we update the textual representation of the referring expression by re-evaluating its attention distribution and then compute a new STEP heatmap as the next input to the ConvRNN. Boosted by such collaborative learning, the framework progressively and simultaneously yields the desired referring segmentation and a reasonable attention distribution over the referring sentence. Our method is general and does not rely on, say, the outcomes of object detection from other DNN models, while achieving state-of-the-art performance on all four datasets in the experiments.
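The alternating loop described in the abstract (text vector, STEP heatmap, refinement, re-weighted word attention) can be illustrated with a toy sketch. This is not the authors' implementation: all function names are hypothetical, cosine similarity stands in for the learned visual-textual co-embedding, and `refine` is a plain moving average standing in for the trained ConvRNN. It only shows the control flow under those assumptions.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / (norm + 1e-8)

def weighted_sum(vectors, weights):
    dim = len(vectors[0])
    return [sum(w * v[d] for v, w in zip(vectors, weights)) for d in range(dim)]

def step_heatmap(pixel_feats, text_vec):
    # Assumed relevance score: cosine similarity mapped to [0, 1].
    return [0.5 * (cosine(p, text_vec) + 1.0) for p in pixel_feats]

def refine(heatmap, hidden):
    # Placeholder for the ConvRNN: a moving average, NOT a learned network.
    return [0.5 * h + 0.5 * x for h, x in zip(hidden, heatmap)]

def iterative_grouping(pixel_feats, word_embs, num_steps=3):
    # Start from uniform attention over the words of the expression.
    attention = [1.0 / len(word_embs)] * len(word_embs)
    hidden = [0.0] * len(pixel_feats)
    for _ in range(num_steps):
        text_vec = weighted_sum(word_embs, attention)
        heatmap = step_heatmap(pixel_feats, text_vec)
        hidden = refine(heatmap, hidden)
        # Pool pixel features under the refined heatmap, then re-score words.
        total = sum(hidden) + 1e-8
        visual = weighted_sum(pixel_feats, [h / total for h in hidden])
        attention = softmax([dot(w, visual) for w in word_embs])
    return hidden, attention
```

Each pixel is a flat feature vector here; a real model would operate on spatial feature maps and learn every component end to end with the classification loss mentioned above.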
Results from the Paper
Task | Dataset | Model | Metric Name | Metric Value | Global Rank | Benchmark |
---|---|---|---|---|---|---|
Referring Expression Segmentation | RefCOCO testA | STEP (1-fold) | Overall IoU | 58.70 | # 23 | |
Referring Expression Segmentation | RefCOCO+ testA | STEP (5-fold) | Overall IoU | 52.33 | # 18 | |
Referring Expression Segmentation | RefCOCO testB | STEP (1-fold) | Overall IoU | 55.39 | # 16 | |
Referring Expression Segmentation | RefCOCO+ testB | STEP (5-fold) | Overall IoU | 40.41 | # 17 | |
Referring Expression Segmentation | RefCOCO val | STEP (1-fold) | Overall IoU | 56.58 | # 24 | |
Referring Expression Segmentation | RefCOCO+ val | STEP (5-fold) | Overall IoU | 48.18 | # 19 | |
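
The page does not define the "Overall IoU" metric. A minimal sketch follows, assuming the definition commonly used for referring segmentation: total intersection divided by total union, accumulated over all test samples rather than averaged per image. The function name and the nested-list mask format are illustrative choices, not taken from the paper.

```python
def overall_iou(pred_masks, gt_masks):
    """Cumulative intersection over cumulative union across all samples.

    Each mask is a 2-D list of 0/1 (or bool) values; pred_masks and
    gt_masks are parallel lists with matching shapes per sample.
    """
    total_inter = 0
    total_union = 0
    for pred, gt in zip(pred_masks, gt_masks):
        for pred_row, gt_row in zip(pred, gt):
            for p, g in zip(pred_row, gt_row):
                total_inter += int(bool(p) and bool(g))
                total_union += int(bool(p) or bool(g))
    # Guard against an empty union (no foreground anywhere).
    return total_inter / total_union if total_union else 0.0
```

Under this definition large objects contribute more pixels and therefore weigh more heavily than small ones. The values in the table appear to be percentages, so a return value of 0.587 would correspond to 58.70.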