Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval
by Yale Song, Mohammad Soleymani
Abstract:
Visual-semantic embedding aims to find a shared latent space where related visual and textual instances are close to each other. Most current methods learn injective embedding functions that map an instance to a single point in the shared space. Unfortunately, injective embedding cannot effectively handle polysemous instances with multiple possible meanings; at best, it would find an average representation of different meanings. This hinders its use in real-world scenarios where individual instances and their cross-modal associations are often ambiguous. In this work, we introduce Polysemous Instance Embedding Networks (PIE-Nets) that compute multiple and diverse representations of an instance by combining global context with locally-guided features via multi-head self-attention and residual learning. To learn visual-semantic embedding, we tie up two PIE-Nets and optimize them jointly in the multiple instance learning framework. Most existing work on cross-modal retrieval focuses on image-text pairs of data. Here, we also tackle the more challenging case of video-text retrieval. To facilitate further research in video-text retrieval, we release a new dataset of 50K video-sentence pairs collected from social media, dubbed MRW (my reaction when). We demonstrate our approach on both image-text and video-text retrieval scenarios using MS-COCO, TGIF, and our new MRW dataset.
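To make the core idea concrete, below is a minimal, illustrative sketch of a PIE-Net-style module: K attention heads score the local features, each head's attended summary is projected and added residually to the projected global feature, yielding K diverse unit-norm embeddings per instance. This is not the authors' released implementation; the class name (PIENetSketch), feature dimensions, and the single-linear attention scorer are assumptions made for illustration only.

```python
# Illustrative sketch of a PIE-Net-style polysemous embedding module.
# NOT the authors' code: dimensions, attention form, and normalization
# details are assumptions chosen to keep the example short and runnable.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PIENetSketch(nn.Module):
    def __init__(self, num_embeds=2, feat_dim=2048, embed_dim=1024):
        super().__init__()
        self.num_embeds = num_embeds
        # One attention scorer per head, applied to every local feature.
        self.attn = nn.Linear(feat_dim, num_embeds)
        # Projects each attended local summary into the embedding space.
        self.local_fc = nn.Linear(feat_dim, embed_dim)
        # Projects the global context feature into the same space.
        self.global_fc = nn.Linear(feat_dim, embed_dim)

    def forward(self, global_feat, local_feats):
        # global_feat: (B, feat_dim); local_feats: (B, N, feat_dim)
        scores = self.attn(local_feats)                   # (B, N, K)
        weights = F.softmax(scores, dim=1)                # attention over N locals
        attended = torch.einsum('bnk,bnd->bkd',
                                weights, local_feats)     # (B, K, feat_dim)
        g = self.global_fc(global_feat).unsqueeze(1)      # (B, 1, embed_dim)
        # Residual combination: global context plus locally-guided variation.
        embeds = g + self.local_fc(attended)              # (B, K, embed_dim)
        return F.normalize(embeds, dim=-1)                # K unit-norm embeddings

# Example: 2 diverse embeddings for a batch of 4 images with 36 region features.
net = PIENetSketch(num_embeds=2)
out = net(torch.randn(4, 2048), torch.randn(4, 36, 2048))
print(out.shape)  # torch.Size([4, 2, 1024])
```

In the paper's setting, one such module would embed the visual side and a second one the textual side, with the two sets of K embeddings matched under a multiple instance learning objective; that pairing logic is omitted here.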
Reference:
Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval (Yale Song, Mohammad Soleymani), In Proceedings of the 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2019.
Bibtex Entry:
@inproceedings{song_polysemous_2019,
	address = {Long Beach, CA},
	title = {Polysemous {Visual}-{Semantic} {Embedding} for {Cross}-{Modal} {Retrieval}},
	url = {https://arxiv.org/abs/1906.04402},
	abstract = {Visual-semantic embedding aims to find a shared latent space where related visual and textual instances are close to each other. Most current methods learn injective embedding functions that map an instance to a single point in the shared space. Unfortunately, injective embedding cannot effectively handle polysemous instances with multiple possible meanings; at best, it would find an average representation of different meanings. This hinders its use in real-world scenarios where individual instances and their cross-modal associations are often ambiguous. In this work, we introduce Polysemous Instance Embedding Networks (PIE-Nets) that compute multiple and diverse representations of an instance by combining global context with locally-guided features via multi-head self-attention and residual learning. To learn visual-semantic embedding, we tie-up two PIE-Nets and optimize them jointly in the multiple instance learning framework. Most existing work on cross-modal retrieval focus on image-text pairs of data. Here, we also tackle a more challenging case of video-text retrieval. To facilitate further research in video-text retrieval, we release a new dataset of 50K video-sentence pairs collected from social media, dubbed MRW (my reaction when). We demonstrate our approach on both image-text and video-text retrieval scenarios using MS-COCO, TGIF, and our new MRW dataset.},
	booktitle = {Proceedings of the 2019 {IEEE} {Conference} on {Computer} {Vision} and {Pattern} {Recognition} ({CVPR})},
	publisher = {IEEE},
	author = {Song, Yale and Soleymani, Mohammad},
	month = jun,
	year = {2019},
	keywords = {Virtual Humans, UARC},
	pages = {10}
}