Messina N, Coccomini Da, Esuli A, Falchi F
Multi-modal matching · Information retrieval · Deep learning · Transformer networks
With the increasing importance of multimedia and multilingual data in online encyclopedias, novel methods are needed to fill domain gaps and automatically connect different modalities for increased accessibility. For example, Wikipedia is composed of millions of pages written in multiple languages. Images, when present, often lack textual context, thus remaining conceptually floating and harder to find and manage. In this work, we tackle the novel task of associating images from Wikipedia pages with the correct caption among a large pool of available ones written in multiple languages, as required by the image-caption matching Kaggle challenge organized by the Wikimedia Foundation. A system able to perform this task would improve the accessibility and completeness of the underlying multi-modal knowledge graph in online encyclopedias. We propose a cascade of two models powered by recent Transformer networks, able to efficiently and effectively infer a relevance score between the query image data and the captions. We verify through extensive experiments that the proposed cascaded approach effectively handles a large pool of images and captions while keeping the overall computational complexity at inference time bounded. With respect to other approaches in the challenge leaderboard, we achieve remarkable improvements over previous proposals (+8% in nDCG@5 with respect to the sixth position) with constrained resources. The code is publicly available at https://tinyurl.com/wiki-imcap.
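The cascade described in the abstract (a cheap first stage shortlisting candidates, an expensive second stage re-scoring only the shortlist) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embedding vectors, the `rescore_fn` callable (standing in for a heavier cross-modal Transformer), and all parameter names are hypothetical.

```python
import numpy as np

def cascade_rank(img_vec, caption_vecs, rescore_fn, shortlist_k=100, top_n=5):
    """Two-stage cascade for image-caption matching (illustrative sketch).

    img_vec:      (d,) query image embedding (hypothetical, pre-computed).
    caption_vecs: (N, d) caption embeddings for the whole candidate pool.
    rescore_fn:   expensive scorer (e.g. a cross-attention model); here any
                  callable mapping a caption index to a relevance float.
    """
    # Stage 1: cheap cosine similarity over all N captions to build a shortlist.
    sims = caption_vecs @ img_vec / (
        np.linalg.norm(caption_vecs, axis=1) * np.linalg.norm(img_vec) + 1e-9)
    shortlist = np.argsort(-sims)[:shortlist_k]
    # Stage 2: run the costly model only on shortlist_k candidates, so the
    # expensive cost is bounded regardless of the pool size N.
    rescored = sorted(shortlist, key=lambda i: -rescore_fn(i))
    return rescored[:top_n]
```

The key design point the abstract emphasizes is that only the shortlist reaches the second stage, so inference cost stays bounded even for a Wikipedia-scale caption pool.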
Source: MULTIMEDIA TOOLS AND APPLICATIONS, vol. 83, pp. 62915-62935
@article{oai:it.cnr:prodotti:491916, title = {Cascaded transformer-based networks for Wikipedia large-scale image-caption matching}, author = {Messina N and Coccomini Da and Esuli A and Falchi F}, journal = {Multimedia Tools and Applications}, volume = {83}, pages = {62915--62935}, doi = {10.1007/s11042-023-17977-0}, year = {2024} }