2020
Conference article  Open Access

Relational visual-textual information retrieval

Messina N.

Keywords: Neural networks, Deep features, Cross-media retrieval

With the advent of deep learning, multimedia information processing gained a huge boost, and astonishing results have been observed on a multitude of interesting visual-textual tasks. Relation networks paved the way towards an attentive processing methodology that considers images and texts as sets of basic interconnected elements (regions and words). These winning ideas recently helped reach state-of-the-art results on the image-text matching task. Cross-media information retrieval has been proposed as a benchmark to test the ability of these networks to match complex multi-modal concepts in a common space. Modern deep-learning-powered networks are complex, and almost none of them can produce concise multi-modal descriptions suitable for fast multi-modal search engines. In fact, the latest image-sentence matching networks use cross-attention and early-fusion approaches, which force all the elements of the database to be considered at query time. In this work, I lay down some ideas for bridging the gap between the effectiveness of modern deep-learning multi-modal matching architectures and their efficiency in fast, scalable visual-textual information retrieval.
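
To make the efficiency argument concrete, below is a minimal sketch in Python/NumPy contrasting the two regimes discussed in the abstract: a common-space (late-fusion) retriever whose fixed-size embeddings can be indexed offline, versus a toy cross-attention (early-fusion) scorer that must process the query jointly with every database item. All names, dimensions, and the scorer itself are illustrative assumptions, not the paper's architecture.

# A minimal sketch, assuming NumPy and random stand-in "features"; the
# dimensions and the toy scorer are illustrative, not the paper's model.
import numpy as np

rng = np.random.default_rng(0)
D = 256            # assumed dimensionality of the common embedding space
N_IMAGES = 10_000  # assumed database size

# Common-space (late-fusion) retrieval: every image is encoded offline
# into ONE fixed-size vector, so the database collapses to a matrix
# that a standard vector index can search.
image_emb = rng.normal(size=(N_IMAGES, D)).astype(np.float32)
image_emb /= np.linalg.norm(image_emb, axis=1, keepdims=True)

def search(query_emb: np.ndarray, k: int = 10) -> np.ndarray:
    # One matrix-vector product ranks the whole database.
    return np.argsort(-(image_emb @ query_emb))[:k]

query = rng.normal(size=D).astype(np.float32)
query /= np.linalg.norm(query)
print("top-10 image ids:", search(query))

# Cross-attention (early-fusion) scoring: the similarity is a joint
# function of the query words and each image's regions, so nothing can
# be precomputed per image; every database item must pass through the
# joint model at query time.
def cross_attention_score(words: np.ndarray, regions: np.ndarray) -> float:
    # Toy stand-in: each word attends over the image regions.
    aff = words @ regions.T                            # (n_words, n_regions)
    attn = np.exp(aff - aff.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)            # row-wise softmax
    attended = attn @ regions                          # (n_words, D)
    return float(np.mean(np.sum(words * attended, axis=1)))

words = rng.normal(size=(12, D)).astype(np.float32)     # a 12-word query
subset = [rng.normal(size=(36, D)).astype(np.float32)   # 36 regions/image
          for _ in range(100)]                          # small demo subset
scores = [cross_attention_score(words, r) for r in subset]
print("best of subset:", int(np.argmax(scores)))

The late-fusion regime costs one dot product per database item and supports precomputed indexes; the early-fusion regime costs one joint model pass per item per query, which is why the abstract argues that cross-attention approaches resist fast, scalable retrieval.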

Source: SISAP 2020 - 13th International Conference on Similarity Search and Applications, pp. 405–411, Copenhagen, Denmark, September 30 - October 2, 2020


BibTeX entry
@inproceedings{oai:it.cnr:prodotti:443698,
	title = {Relational visual-textual information retrieval},
	author = {Messina, N.},
	booktitle = {SISAP 2020 - 13th International Conference on Similarity Search and Applications},
	address = {Copenhagen, Denmark},
	pages = {405--411},
	doi = {10.1007/978-3-030-60936-8_33},
	year = {2020}
}