2021 Software Unknown
VISIONE III
Amato G., Bolettieri P., Carrara F., Falchi F., Gennaro C., Messina N., Vadicamo L., Vairo C.
VISIONE III is a content-based retrieval system that supports various search functionalities (text search, object/color-based search, semantic and visual similarity search, temporal search). It uses a full-text search engine as a search backend. In this third version of our system, we modified the user interface, and we made some changes to the techniques used to analyze and search for videos.

See at: bilioso.isti.cnr.it | CNR ExploRA


2021 Conference article Open Access
VISIONE at Video Browser Showdown 2021
Amato G., Bolettieri P., Falchi F., Gennaro C., Messina N., Vadicamo L., Vairo C.
This paper presents the second release of VISIONE, a tool for effective video search on large-scale collections. It allows users to search for videos using textual descriptions, keywords, occurrence of objects and their spatial relationships, occurrence of colors and their spatial relationships, and image similarity. One of the main features of our system is that it employs specially designed textual encodings for indexing and searching video content using the mature and scalable Apache Lucene full-text search engine (a minimal sketch of this encoding idea follows this record).
Source: MMM 2021 - 27th International Conference on Multimedia Modeling, pp. 473–478, Prague, Czech Republic, 22-24/06/2021
DOI: 10.1007/978-3-030-67835-7_47
Project(s): AI4EU via OpenAIRE, AI4Media via OpenAIRE


See at: ISTI Repository Open Access | ZENODO Open Access | zenodo.org Open Access | Lecture Notes in Computer Science Restricted | link.springer.com Restricted | CNR ExploRA
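
The distinguishing idea above, serializing visual content into "surrogate text" so a mature full-text engine such as Apache Lucene can index it, can be illustrated with a small sketch. The token scheme below (grid-cell object tokens repeated in proportion to detector confidence) and the toy inverted index are hypothetical simplifications for illustration, not VISIONE's actual encoding:

```python
# Minimal sketch: encode detected objects and their spatial positions as
# "surrogate text" so a full-text engine can index video frames. The token
# scheme (e.g. "cell6_dog") is hypothetical, not VISIONE's exact codec.
from collections import defaultdict

def encode_frame(detections, grid=(3, 3)):
    """detections: list of (label, confidence, (cx, cy)), with cx, cy in [0, 1]."""
    tokens = []
    for label, conf, (cx, cy) in detections:
        col = min(int(cx * grid[0]), grid[0] - 1)
        row = min(int(cy * grid[1]), grid[1] - 1)
        cell = row * grid[0] + col
        # Repeat the token in proportion to confidence, so that the engine's
        # term-frequency scoring rewards confident detections.
        tokens += [f"cell{cell}_{label}"] * max(1, round(conf * 5))
    return " ".join(tokens)

# Toy inverted index standing in for Lucene.
index = defaultdict(set)

def add_document(doc_id, surrogate_text):
    for token in surrogate_text.split():
        index[token].add(doc_id)

add_document("frame_001", encode_frame([("dog", 0.9, (0.2, 0.8)),
                                        ("ball", 0.6, (0.7, 0.8))]))
# A query like "a dog in the bottom-left" reduces to a plain term lookup:
print(index["cell6_dog"])  # -> {'frame_001'}
```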


2021 Report Open Access
AIMH research activities 2021
Aloia N., Amato G., Bartalesi V., Benedetti F., Bolettieri P., Cafarelli D., Carrara F., Casarosa V., Coccomini D., Ciampi L., Concordia C., Corbara S., Di Benedetto M., Esuli A., Falchi F., Gennaro C., Lagani G., Massoli F. V., Meghini C., Messina N., Metilli D., Molinari A., Moreo A., Nardi A., Pedrotti A., Pratelli N., Rabitti F., Savino P., Sebastiani F., Sperduti G., Thanos C., Trupiano L., Vadicamo L., Vairo C.
The Artificial Intelligence for Media and Humanities laboratory (AIMH) has the mission of investigating and advancing the state of the art in artificial intelligence, specifically addressing applications to digital media and digital humanities, and also taking into account issues related to scalability. This report summarizes the 2021 activities of the research group.
Source: ISTI Annual Report, ISTI-2021-AR/003, pp. 1–34, 2021
DOI: 10.32079/isti-ar-2021/003


See at: ISTI Repository Open Access | CNR ExploRA


2021 Conference article Open Access
AIMH at SemEval-2021 - Task 6: multimodal classification using an ensemble of transformer models
Messina N., Falchi F., Gennaro C., Amato G.
This paper describes the system used by the AIMH Team to approach SemEval-2021 Task 6. We propose an approach that relies on a transformer-based architecture to process multimodal content (text and images) in memes. Our architecture, called DVTT (Double Visual Textual Transformer), approaches Subtasks 1 and 3 of Task 6 as multi-label classification problems, where the text and/or images of the meme are processed and the probabilities of the presence of each possible persuasion technique are returned as a result. DVTT uses two complete transformer networks that work on text and images and are mutually conditioned: one modality acts as the main one, and the second intervenes to enrich it, yielding two distinct modes of operation. The two transformers' outputs are merged by averaging the inferred probabilities for each possible label, and the overall network is trained end-to-end with a binary cross-entropy loss (a minimal sketch of this fusion step follows this record).
Source: SemEval-2021 - 15th International Workshop on Semantic Evaluation, pp. 1020–1026, Bangkok, Thailand, 5-6/08/2021
DOI: 10.18653/v1/2021.semeval-1.140
Project(s): AI4EU via OpenAIRE, AI4Media via OpenAIRE


See at: aclanthology.org Open Access | ISTI Repository Open Access | CNR ExploRA
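
The fusion step described above can be sketched in a few lines, assuming placeholder linear heads in place of the two full transformer branches; the feature dimensions and label count are illustrative values, not the task's actual configuration:

```python
# Minimal sketch of DVTT-style probability averaging for multi-label
# classification. The two heads stand in for the full text and image
# transformer branches; dimensions and num_labels are illustrative.
import torch
import torch.nn as nn

class DualBranchClassifier(nn.Module):
    def __init__(self, text_dim=768, image_dim=2048, num_labels=20):
        super().__init__()
        self.text_head = nn.Linear(text_dim, num_labels)
        self.image_head = nn.Linear(image_dim, num_labels)

    def forward(self, text_feat, image_feat):
        # Independent per-label probabilities from each modality...
        p_text = torch.sigmoid(self.text_head(text_feat))
        p_image = torch.sigmoid(self.image_head(image_feat))
        # ...merged by averaging the inferred probabilities per label.
        return (p_text + p_image) / 2

model = DualBranchClassifier()
text_feat, image_feat = torch.randn(4, 768), torch.randn(4, 2048)
targets = torch.randint(0, 2, (4, 20)).float()  # multi-label ground truth
probs = model(text_feat, image_feat)
loss = nn.functional.binary_cross_entropy(probs, targets)  # end-to-end BCE
loss.backward()
```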


2021 Journal article Open Access
Solving the same-different task with convolutional neural networks
Messina N., Amato G., Carrara F., Gennaro C., Falchi F.
Deep learning has demonstrated remarkable abilities in solving many kinds of real-world computer vision problems. However, deep models are still strained by simple reasoning tasks that humans consider easy to solve. In this work, we probe current state-of-the-art convolutional neural networks on a difficult set of tasks known as the same-different problems. All the problems share the same prerequisite for being solved correctly: understanding whether two random shapes inside the same image are the same or not. With the experiments carried out in this work, we show that residual connections, and more generally skip connections, seem to have only a marginal impact on the learning of the proposed problems. In particular, we experiment with DenseNets, and we examine the contribution of residual and recurrent connections in already-tested architectures, ResNet-18 and CORnet-S respectively. Our experiments show that older feed-forward networks, AlexNet and VGG, are almost unable to learn the proposed problems, except in some specific scenarios. We show that recently introduced architectures can converge even when important parts of their architecture are removed. Finally, we carry out zero-shot generalization tests and discover that in these scenarios residual and recurrent connections can have a stronger impact on the overall test accuracy. On four difficult problems from the SVRT dataset, we reach state-of-the-art results with respect to previous approaches, obtaining super-human performance on three of the four problems (a minimal sketch of the task setup follows this record).
Source: Pattern Recognition Letters 143 (2021): 75–80. doi:10.1016/j.patrec.2020.12.019
DOI: 10.1016/j.patrec.2020.12.019
DOI: 10.48550/arxiv.2101.09129
Project(s): AI4EU via OpenAIRE


See at: arXiv.org e-Print Archive Open Access | Pattern Recognition Letters Open Access | ISTI Repository Open Access | ZENODO Open Access | Pattern Recognition Letters Restricted | doi.org Restricted | www.sciencedirect.com Restricted | CNR ExploRA
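
To make the task concrete, here is a minimal sketch of a same-different stimulus in the spirit of the SVRT problems: two binary patches in one image, labeled by whether they are identical. Canvas size, patch size, and the fixed placement are illustrative choices, not the dataset's actual generation procedure:

```python
# Minimal sketch of a same-different stimulus: two random binary shapes
# in one image, labeled 1 if identical and 0 otherwise. All sizes and
# positions are illustrative, not SVRT's actual parameters.
import numpy as np

rng = np.random.default_rng(0)

def make_stimulus(same, canvas=64, patch=12):
    img = np.zeros((canvas, canvas), dtype=np.float32)
    a = (rng.random((patch, patch)) > 0.5).astype(np.float32)
    b = a.copy() if same else (rng.random((patch, patch)) > 0.5).astype(np.float32)
    # Fixed, non-overlapping placements (random placement omitted for brevity).
    img[4:4 + patch, 4:4 + patch] = a
    img[canvas - patch - 4:canvas - 4, canvas - patch - 4:canvas - 4] = b
    return img, int(same)

x, y = make_stimulus(same=True)   # a positive ("same") example
```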


2021 Conference article Open Access
Transformer reasoning network for image-text matching and retrieval
Messina N., Falchi F., Esuli A., Amato G.
Image-text matching is a fascinating task in modern AI research. Despite the evolution of deep-learning-based image and text processing systems, multi-modal matching remains a challenging problem. In this work, we consider the problem of accurate image-text matching for the task of multi-modal large-scale information retrieval. State-of-the-art results in image-text matching are achieved by inter-playing image and text features from two different processing pipelines, usually using mutual attention mechanisms. However, this invalidates any chance to extract the separate visual and textual features needed for later indexing steps in large-scale retrieval systems. In this regard, we introduce the Transformer Encoder Reasoning Network (TERN), an architecture built upon one of the modern relationship-aware self-attentive architectures, the Transformer Encoder (TE). This architecture is able to reason separately on the two different modalities and to enforce a final common abstract concept space by sharing the weights of the deeper transformer layers (a minimal sketch of this design follows this record). Thanks to this design, the implemented network is able to produce compact and very rich visual and textual features available for the successive indexing step. Experiments are conducted on the MS-COCO dataset, and we evaluate the results using a discounted cumulative gain metric with relevance computed from caption similarities, in order to assess possibly non-exact but relevant search results. We demonstrate that on this metric we achieve state-of-the-art results in the image retrieval task. Our code is freely available at https://github.com/mesnico/TERN.
Source: ICPR 2020 - 25th International Conference on Pattern Recognition, pp. 5222–5229, Online conference, 10-15/01/2021
DOI: 10.1109/icpr48806.2021.9413172
DOI: 10.48550/arxiv.2004.09144
Project(s): AI4EU via OpenAIRE, AI4Media via OpenAIRE


See at: arXiv.org e-Print Archive Open Access | arxiv.org Open Access | ISTI Repository Open Access | ZENODO Open Access | doi.org Restricted | doi.org Restricted | Archivio della Ricerca - Università di Pisa Restricted | ieeexplore.ieee.org Restricted | CNR ExploRA
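
A minimal sketch of the weight-sharing design described above, assuming illustrative layer counts and simple mean pooling for the global descriptor (the actual TERN configuration and pooling differ):

```python
# Minimal sketch of the TERN idea: modality-specific lower transformer
# layers plus shared deeper layers, so images and sentences can be
# encoded independently into one common space. Depths, pooling, and
# dimensions here are illustrative, not the paper's configuration.
import torch
import torch.nn as nn

def encoder_stack(n_layers, d_model=1024):
    layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=n_layers)

class TERNSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.visual_lower = encoder_stack(2)   # modality-specific reasoning
        self.text_lower = encoder_stack(2)
        self.shared_upper = encoder_stack(2)   # weights shared by both modalities

    def encode_image(self, regions):           # regions: (B, R, 1024)
        return self.shared_upper(self.visual_lower(regions)).mean(dim=1)

    def encode_text(self, words):              # words: (B, W, 1024)
        return self.shared_upper(self.text_lower(words)).mean(dim=1)

model = TERNSketch()
img_vec = model.encode_image(torch.randn(2, 36, 1024))  # 36 region features
txt_vec = model.encode_text(torch.randn(2, 12, 1024))   # 12 word features
# The two descriptors are produced independently, so each can be indexed
# offline and compared only at query time:
sim = nn.functional.cosine_similarity(img_vec, txt_vec)
```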


2021 Conference article Open Access
Towards efficient cross-modal visual textual retrieval using transformer-encoder deep features
Messina N., Amato G., Falchi F., Gennaro C., Marchand-Maillet S.
Cross-modal retrieval is an important functionality in modern search engines, as it improves the user experience by allowing queries and retrieved objects to pertain to different modalities. In this paper, we focus on the image-sentence retrieval task, where the objective is to efficiently find relevant images for a given sentence (image retrieval) or relevant sentences for a given image (sentence retrieval). The computer vision literature reports the best results on the image-sentence matching task using deep neural networks equipped with attention and self-attention mechanisms; these works evaluate matching performance on the retrieval task by performing sequential scans of the whole dataset, a method that does not scale well with an increasing number of images or captions. In this work, we explore different preprocessing techniques to produce sparsified deep multi-modal features, extracting them from state-of-the-art deep-learning architectures for image-text matching (a minimal sketch of one such sparsification follows this record). Our main objective is to lay the groundwork for efficient indexing of complex multi-modal descriptions. We use the recently introduced TERN architecture as an image-sentence feature extractor. It is designed to produce fixed-size 1024-d vectors describing whole images and sentences, as well as variable-length sets of 1024-d vectors describing the building components of the two modalities (image regions and sentence words, respectively). All these vectors are enforced by the TERN design to lie in the same common space. Our experiments show interesting preliminary results on the explored methods and suggest further experimentation in this important research direction.
Source: CBMI - International Conference on Content-Based Multimedia Indexing, Lille, France, 28-30/06/2021
DOI: 10.1109/cbmi50038.2021.9461890
Project(s): AI4EU via OpenAIRE


See at: ISTI Repository Open Access | ieeexplore.ieee.org Restricted | CNR ExploRA
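
One generic way to sparsify the dense 1024-d TERN features for inverted-index storage is plain top-k magnitude thresholding; the sketch below is an assumed stand-in for illustration, not necessarily one of the techniques evaluated in the paper:

```python
# Minimal sketch: keep only the k largest-magnitude components of a
# dense feature vector so each surviving dimension can be posted like a
# "term" in an inverted index. Generic top-k, not the paper's exact method.
import numpy as np

def sparsify_topk(v, k=64):
    """Zero out all but the k largest-magnitude components of v."""
    keep = np.argsort(np.abs(v))[-k:]
    out = np.zeros_like(v)
    out[keep] = v[keep]
    return out

dense = np.random.randn(1024).astype(np.float32)   # stand-in TERN feature
sparse = sparsify_topk(dense)
print(np.count_nonzero(sparse))  # -> 64 postings instead of 1024
```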


2021 Journal article Open Access
Fine-grained visual textual alignment for cross-modal retrieval using transformer encoders
Messina N., Amato G., Esuli A., Falchi F., Gennaro C., Marchand-Maillet S.
Despite the evolution of deep-learning-based visual-textual processing systems, precise multi-modal matching remains a challenging task. In this work, we tackle the task of cross-modal retrieval through image-sentence matching based on word-region alignments, using supervision only at the global image-sentence level. Specifically, we present a novel approach called Transformer Encoder Reasoning and Alignment Network (TERAN). TERAN enforces a fine-grained match between the underlying components of images and sentences, i.e., image regions and words respectively, in order to preserve the informative richness of both modalities (a minimal sketch of such an alignment score follows this record). TERAN obtains state-of-the-art results on the image retrieval task on both the MS-COCO and Flickr30k datasets. Moreover, on MS-COCO it also outperforms current approaches on the sentence retrieval task. Focusing on scalable cross-modal information retrieval, TERAN is designed to keep the visual and textual data pipelines well separated: cross-attention links would invalidate any chance to separately extract the visual and textual features needed for the online search and offline indexing steps in large-scale retrieval systems. In this respect, TERAN merges the information from the two domains only during the final alignment phase, immediately before the loss computation. We argue that the fine-grained alignments produced by TERAN pave the way towards effective and efficient methods for large-scale cross-modal information retrieval. We compare the effectiveness of our approach against relevant state-of-the-art methods: on the MS-COCO 1K test set, we obtain an improvement of 5.7% and 3.5% in the Recall@1 metric on the image and sentence retrieval tasks, respectively. The code used for the experiments is publicly available on GitHub at https://github.com/mesnico/TERAN.
Source: ACM Transactions on Multimedia Computing, Communications, and Applications 17 (2021). doi:10.1145/3451390
DOI: 10.1145/3451390
Project(s): AI4EU via OpenAIRE, AI4Media via OpenAIRE


See at: ISTI Repository Open Access | dl.acm.org Restricted | CNR ExploRA
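
A minimal sketch of a fine-grained word-region alignment score of the kind TERAN is built around: a cosine-similarity matrix between region and word features, pooled by taking the best region per word and summing over words. The pooling variant shown is one common choice, not necessarily the paper's exact formulation:

```python
# Minimal sketch of a word-region alignment score: cosine similarities
# between all (word, region) pairs, max-pooled over regions per word and
# summed over words. A common pooling variant, assumed for illustration.
import torch
import torch.nn.functional as F

def alignment_score(regions, words):
    """regions: (R, D) image-region features; words: (W, D) word features."""
    regions = F.normalize(regions, dim=-1)
    words = F.normalize(words, dim=-1)
    sim = words @ regions.T              # (W, R) word-region cosine matrix
    best_region = sim.max(dim=1).values  # best-matching region per word
    return best_region.sum()             # one global image-sentence score

score = alignment_score(torch.randn(36, 1024), torch.randn(12, 1024))
```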