Harvesting information from captions for weakly supervised semantic segmentation
Research output: Contributions to collected editions/works › Article in conference proceedings › Research › peer-review
Standard
Sawatzky, J., Banerjee, D., & Gall, J. (2019). Harvesting information from captions for weakly supervised semantic segmentation. In 2019 International Conference on Computer Vision Workshops (ICCVW 2019): proceedings, 27 October - 2 November 2019, Seoul, Korea (pp. 4481-4490). Article 9022140. (IEEE International Conference on Computer Vision Workshops; Vol. 2019). Piscataway: Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICCVW.2019.00549
RIS
TY - CHAP
T1 - Harvesting information from captions for weakly supervised semantic segmentation
AU - Sawatzky, Johann
AU - Banerjee, Debayan
AU - Gall, Juergen
N1 - Conference code: 17
PY - 2019/10
Y1 - 2019/10
AB - Since acquiring pixel-wise annotations for training convolutional neural networks for semantic image segmentation is time-consuming, weakly supervised approaches that only require class tags have been proposed. In this work, we propose another form of supervision, namely image captions as they can be found on the Internet. These captions have two advantages: unlike the clean class tags used by current weakly supervised approaches, they do not require additional curation, and they provide textual context for the classes present in an image. To leverage such textual context, we deploy a multi-modal network that learns a joint embedding of the visual representation of the image and the textual representation of the caption. The network estimates text activation maps (TAMs) for class names as well as compound concepts, i.e. combinations of nouns and their attributes. The TAMs of compound concepts describing classes of interest substantially improve the quality of the estimated class activation maps, which are then used to train a network for semantic segmentation. We evaluate our method on the COCO dataset, where it achieves state-of-the-art results for weakly supervised image segmentation.
KW - Multimodal learning
KW - Semantic segmentation
KW - Weakly supervised learning
KW - Weakly supervised semantic segmentation
KW - Informatics
UR - http://www.scopus.com/inward/record.url?scp=85082499279&partnerID=8YFLogxK
U2 - 10.1109/ICCVW.2019.00549
DO - 10.1109/ICCVW.2019.00549
M3 - Article in conference proceedings
AN - SCOPUS:85082499279
SN - 978-1-7281-5024-6
T3 - IEEE International Conference on Computer Vision workshops
SP - 4481
EP - 4490
BT - 2019 International Conference on Computer Vision Workshops
PB - Institute of Electrical and Electronics Engineers Inc.
CY - Piscataway
T2 - 17th IEEE/CVF International Conference on Computer Vision Workshop - ICCVW 2019
Y2 - 27 October 2019 through 28 October 2019
ER -
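
The abstract above describes a joint image-text embedding from which text activation maps (TAMs) are derived. The sketch below is not the authors' implementation; it is a minimal, hypothetical PyTorch illustration of the general idea: project per-pixel visual features and a text embedding into a shared space and score each spatial location against the text query. All module names, layer sizes, and the cosine-similarity scoring are assumptions for illustration only.

```python
# Minimal sketch (not the authors' code) of a text activation map (TAM):
# per-pixel visual features and a text embedding are projected into a joint
# space; cosine similarity at each location yields a spatial activation map.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbeddingTAM(nn.Module):
    def __init__(self, vis_channels=2048, text_dim=300, joint_dim=256):
        super().__init__()
        # 1x1 conv maps backbone feature maps into the joint embedding space
        self.vis_proj = nn.Conv2d(vis_channels, joint_dim, kernel_size=1)
        # linear layer maps a caption/word embedding into the same space
        self.text_proj = nn.Linear(text_dim, joint_dim)

    def forward(self, vis_feat, text_emb):
        # vis_feat: (B, C, H, W) backbone features; text_emb: (B, text_dim)
        v = F.normalize(self.vis_proj(vis_feat), dim=1)   # (B, D, H, W)
        t = F.normalize(self.text_proj(text_emb), dim=1)  # (B, D)
        # cosine similarity at every spatial location -> TAM of shape (B, H, W)
        return torch.einsum('bdhw,bd->bhw', v, t)

if __name__ == "__main__":
    model = JointEmbeddingTAM()
    feats = torch.randn(2, 2048, 14, 14)   # e.g. ResNet conv5 feature maps
    words = torch.randn(2, 300)            # e.g. an embedding of "wooden bench"
    print(model(feats, words).shape)       # torch.Size([2, 14, 14])
```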