[1] F. Faghri, D. J. Fleet, J. R. Kiros, and S. Fidler, “VSE++: improving visual-semantic embeddings with hard negatives,” in BMVC 2018. BMVA Press, 2018, p. 12.
[2] F. Carrara, A. Esuli, T. Fagni, F. Falchi, and A. M. Fernández, “Picture it in your mind: generating high level visual representations from textual descriptions,” Inf. Retr. J., vol. 21, no. 2-3, pp. 208-229, 2018.
[3] J. Lu, D. Batra, D. Parikh, and S. Lee, “ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks,” in NeurIPS 2019, 2019, pp. 13-23.
[4] A. Karpathy and F. Li, “Deep visual-semantic alignments for generating image descriptions,” in CVPR 2015. IEEE Computer Society, 2015, pp. 3128-3137.
[5] K. Lee, X. Chen, G. Hua, H. Hu, and X. He, “Stacked cross attention for image-text matching,” in ECCV 2018, ser. Lecture Notes in Computer Science, vol. 11208. Springer, 2018, pp. 212-228.
[6] D. Qi, L. Su, J. Song, E. Cui, T. Bharti, and A. Sacheti, “ImageBERT: Cross-modal pre-training with large-scale weak-supervised image-text data,” CoRR, vol. abs/2001.07966, 2020.
[7] K. Li, Y. Zhang, K. Li, Y. Li, and Y. Fu, “Visual semantic reasoning for image-text matching,” in ICCV 2019. IEEE, 2019, pp. 4653-4661.
[8] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in NeurIPS 2017, 2017, pp. 5998-6008.
[9] B. Klein, G. Lev, G. Sadeh, and L. Wolf, “Associating neural word embeddings with deep image representations using fisher vectors,” in CVPR 2015. IEEE Computer Society, 2015, pp. 4437-4446.
[10] I. Vendrov, R. Kiros, S. Fidler, and R. Urtasun, “Order-embeddings of images and language,” in 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2016.
[11] X. Lin and D. Parikh, “Leveraging visual question answering for image-caption ranking,” in ECCV 2016, ser. Lecture Notes in Computer Science, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds., vol. 9906. Springer, 2016, pp. 261-277.
[12] Y. Huang, W. Wang, and L. Wang, “Instance-aware image and sentence matching with selective multimodal LSTM,” in CVPR 2017. IEEE Computer Society, 2017, pp. 7254-7262.
[13] A. Eisenschtat and L. Wolf, “Linking image and text with 2-way nets,” in CVPR 2017. IEEE Computer Society, 2017, pp. 1855-1865.
[14] Y. Liu, Y. Guo, E. M. Bakker, and M. S. Lew, “Learning a recurrent residual fusion network for multimodal matching,” in ICCV 2017. IEEE Computer Society, 2017, pp. 4127-4136.
[15] J. Gu, J. Cai, S. R. Joty, L. Niu, and G. Wang, “Look, imagine and match: Improving textual-visual cross-modal retrieval with generative models,” in CVPR 2018. IEEE Computer Society, 2018, pp. 7181-7189.
[16] Y. Huang, Q. Wu, C. Song, and L. Wang, “Learning semantic concepts and order for image and sentence matching,” in CVPR 2018. IEEE Computer Society, 2018, pp. 6163-6171.
[17] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, “Bottom-up and top-down attention for image captioning and VQA,” CoRR, vol. abs/1707.07998, 2017.
[18] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” in NAACL-HLT 2019. Association for Computational Linguistics, 2019, pp. 4171-4186.
[19] A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap, “A simple neural network module for relational reasoning,” in NeurIPS 2017, 2017, pp. 4967-4976.
[20] N. Messina, G. Amato, F. Carrara, F. Falchi, and C. Gennaro, “Learning visual features for relational CBIR,” International Journal of Multimedia Information Retrieval, Sep 2019. [Online]. Available: https://doi.org/10.1007/s13735-019-00178-7
[21] N. Messina, G. Amato, F. Carrara, F. Falchi, and C. Gennaro, “Learning relationship-aware visual features,” in ECCV 2018 Workshops, ser. Lecture Notes in Computer Science, L. Leal-Taixé and S. Roth, Eds., vol. 11132. Springer, 2018, pp. 486-501. [Online]. Available: https://doi.org/10.1007/978-3-030-11018-5_40
[22] R. Hu, J. Andreas, M. Rohrbach, T. Darrell, and K. Saenko, “Learning to reason: End-to-end module networks for visual question answering,” in ICCV 2017. IEEE, 2017.
[23] J. Johnson, B. Hariharan, L. van der Maaten, J. Hoffman, L. Fei-Fei, C. L. Zitnick, and R. Girshick, “Inferring and executing programs for visual reasoning,” in ICCV 2017. IEEE, 2017.
[24] T. Yao, Y. Pan, Y. Li, and T. Mei, “Exploring visual relationship for image captioning,” in ECCV 2018, ser. Lecture Notes in Computer Science, vol. 11218. Springer, 2018, pp. 711-727.
[25] X. Yang, K. Tang, H. Zhang, and J. Cai, “Auto-encoding scene graphs for image captioning,” in CVPR 2019. Computer Vision Foundation / IEEE, 2019, pp. 10685-10694.
[26] X. Li and S. Jiang, “Know more say less: Image captioning based on scene graphs,” IEEE Trans. Multimedia, vol. 21, no. 8, pp. 2117-2130, 2019.
[27] J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh, “Graph R-CNN for scene graph generation,” in ECCV 2018, ser. Lecture Notes in Computer Science, vol. 11205. Springer, 2018, pp. 690-706.
[28] Y. Li, W. Ouyang, B. Zhou, J. Shi, C. Zhang, and X. Wang, “Factorizable net: An efficient subgraph-based framework for scene graph generation,” in ECCV 2018, ser. Lecture Notes in Computer Science, vol. 11205. Springer, 2018, pp. 346-363.
[29] K. Lee, H. Palangi, X. Chen, H. Hu, and J. Gao, “Learning visual relation priors for image-text matching and image captioning with neural scene graph generators,” CoRR, vol. abs/1909.09953, 2019.
[30] S. Ren, K. He, R. B. Girshick, and J. Sun, “Faster R-CNN: towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems 28, 2015, pp. 91-99.
[31] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, “Bottom-up and top-down attention for image captioning and visual question answering,” in CVPR 2018. IEEE Computer Society, 2018, pp. 6077-6086.
[32] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, M. S. Bernstein, and F. Li, “Visual genome: Connecting language and vision using crowdsourced dense image annotations,” CoRR, vol. abs/1602.07332, 2016.
[33] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” in 1st International Conference on Learning Representations, ICLR 2013, 2013.
[34] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in Text Summarization Branches Out. Barcelona, Spain: Association for Computational Linguistics, Jul. 2004, pp. 74-81.
[35] P. Anderson, B. Fernando, M. Johnson, and S. Gould, “SPICE: semantic propositional image caption evaluation,” in ECCV 2016, ser. Lecture Notes in Computer Science, vol. 9909. Springer, 2016, pp. 382-398.
[36] T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: common objects in context,” in ECCV 2014, ser. Lecture Notes in Computer Science, vol. 8693. Springer, 2014, pp. 740-755.