[1] F. Faghri, D. J. Fleet, J. R. Kiros, and S. Fidler, “VSE++: improving visual-semantic embeddings with hard negatives,” in BMVC 2018. BMVA Press, 2018, p. 12.
[2] F. Carrara, A. Esuli, T. Fagni, F. Falchi, and A. M. Fernández, “Picture it in your mind: generating high level visual representations from textual descriptions,” Inf. Retr. J., vol. 21, no. 2-3, pp. 208-229, 2018.
[3] J. Lu, D. Batra, D. Parikh, and S. Lee, “ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks,” in NeurIPS 2019, 2019, pp. 13-23.
[4] A. Karpathy and F. Li, “Deep visual-semantic alignments for generating image descriptions,” in CVPR 2015. IEEE Computer Society, 2015, pp. 3128-3137.
[5] K. Lee, X. Chen, G. Hua, H. Hu, and X. He, “Stacked cross attention for image-text matching,” in ECCV 2018, ser. Lecture Notes in Computer Science, vol. 11208. Springer, 2018, pp. 212-228.
[6] D. Qi, L. Su, J. Song, E. Cui, T. Bharti, and A. Sacheti, “ImageBERT: Cross-modal pre-training with large-scale weak-supervised image-text data,” CoRR, vol. abs/2001.07966, 2020.
[7] K. Li, Y. Zhang, K. Li, Y. Li, and Y. Fu, “Visual semantic reasoning for image-text matching,” in ICCV 2019. IEEE, 2019, pp. 4653-4661.
[8] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in NeurIPS 2017, 2017, pp. 5998-6008.
[9] B. Klein, G. Lev, G. Sadeh, and L. Wolf, “Associating neural word embeddings with deep image representations using fisher vectors,” in CVPR 2015. IEEE Computer Society, 2015, pp. 4437-4446.
[10] I. Vendrov, R. Kiros, S. Fidler, and R. Urtasun, “Order-embeddings of images and language,” in 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2016.
[11] X. Lin and D. Parikh, “Leveraging visual question answering for image-caption ranking,” in ECCV 2016, ser. Lecture Notes in Computer Science, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds., vol. 9906. Springer, 2016, pp. 261-277.
[12] Y. Huang, W. Wang, and L. Wang, “Instance-aware image and sentence matching with selective multimodal LSTM,” in CVPR 2017. IEEE Computer Society, 2017, pp. 7254-7262.
[13] A. Eisenschtat and L. Wolf, “Linking image and text with 2-way nets,” in CVPR 2017. IEEE Computer Society, 2017, pp. 1855-1865.
[14] Y. Liu, Y. Guo, E. M. Bakker, and M. S. Lew, “Learning a recurrent residual fusion network for multimodal matching,” in ICCV 2017. IEEE Computer Society, 2017, pp. 4127-4136.
[15] J. Gu, J. Cai, S. R. Joty, L. Niu, and G. Wang, “Look, imagine and match: Improving textual-visual cross-modal retrieval with generative models,” in CVPR 2018. IEEE Computer Society, 2018, pp. 7181-7189.
[16] Y. Huang, Q. Wu, C. Song, and L. Wang, “Learning semantic concepts and order for image and sentence matching,” in CVPR 2018. IEEE Computer Society, 2018, pp. 6163-6171.
[17] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, “Bottom-up and top-down attention for image captioning and VQA,” CoRR, vol. abs/1707.07998, 2017.
[18] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” in NAACL-HLT 2019. Association for Computational Linguistics, 2019, pp. 4171-4186.
[19] A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap, “A simple neural network module for relational reasoning,” in NeurIPS 2017, 2017, pp. 4967-4976.
[20] N. Messina, G. Amato, F. Carrara, F. Falchi, and C. Gennaro, “Learning visual features for relational CBIR,” International Journal of Multimedia Information Retrieval, Sep 2019. [Online]. Available: https://doi.org/10.1007/s13735-019-00178-7
[21] N. Messina, G. Amato, F. Carrara, F. Falchi, and C. Gennaro, “Learning relationship-aware visual features,” in ECCV 2018 Workshops, ser. Lecture Notes in Computer Science, L. Leal-Taixé and S. Roth, Eds., vol. 11132. Springer, 2018, pp. 486-501. [Online]. Available: https://doi.org/10.1007/978-3-030-11018-5_40
[22] R. Hu, J. Andreas, M. Rohrbach, T. Darrell, and K. Saenko, “Learning to reason: End-to-end module networks for visual question answering,” in ICCV 2017. IEEE, 2017.
[23] J. Johnson, B. Hariharan, L. van der Maaten, J. Hoffman, L. Fei-Fei, C. L. Zitnick, and R. Girshick, “Inferring and executing programs for visual reasoning,” in ICCV 2017. IEEE, 2017.
[24] T. Yao, Y. Pan, Y. Li, and T. Mei, “Exploring visual relationship for image captioning,” in ECCV 2018, ser. Lecture Notes in Computer Science, vol. 11218. Springer, 2018, pp. 711-727.
[25] X. Yang, K. Tang, H. Zhang, and J. Cai, “Auto-encoding scene graphs for image captioning,” in CVPR 2019. Computer Vision Foundation / IEEE, 2019, pp. 10685-10694.
[26] X. Li and S. Jiang, “Know more say less: Image captioning based on scene graphs,” IEEE Trans. Multimedia, vol. 21, no. 8, pp. 2117-2130, 2019.
[27] J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh, “Graph R-CNN for scene graph generation,” in ECCV 2018, ser. Lecture Notes in Computer Science, vol. 11205. Springer, 2018, pp. 690-706.
[28] Y. Li, W. Ouyang, B. Zhou, J. Shi, C. Zhang, and X. Wang, “Factorizable net: An efficient subgraph-based framework for scene graph generation,” in ECCV 2018, ser. Lecture Notes in Computer Science, vol. 11205. Springer, 2018, pp. 346-363.
[29] K. Lee, H. Palangi, X. Chen, H. Hu, and J. Gao, “Learning visual relation priors for image-text matching and image captioning with neural scene graph generators,” CoRR, vol. abs/1909.09953, 2019.
[30] S. Ren, K. He, R. B. Girshick, and J. Sun, “Faster R-CNN: towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems 28, 2015, pp. 91-99.
[31] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, “Bottom-up and top-down attention for image captioning and visual question answering,” in CVPR 2018. IEEE Computer Society, 2018, pp. 6077-6086.
[32] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, M. S. Bernstein, and F. Li, “Visual genome: Connecting language and vision using crowdsourced dense image annotations,” CoRR, vol. abs/1602.07332, 2016.
[33] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” in 1st International Conference on Learning Representations, ICLR 2013, 2013.
[34] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in Text Summarization Branches Out. Barcelona, Spain: Association for Computational Linguistics, Jul. 2004, pp. 74-81.
[35] P. Anderson, B. Fernando, M. Johnson, and S. Gould, “SPICE: semantic propositional image caption evaluation,” in ECCV 2016, ser. Lecture Notes in Computer Science, vol. 9909. Springer, 2016, pp. 382-398.
[36] T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: common objects in context,” in ECCV 2014, ser. Lecture Notes in Computer Science, vol. 8693. Springer, 2014, pp. 740-755.