2020 Conference article Open Access OPEN
Relational visual-textual information retrieval
Messina N.
With the advent of deep learning, multimedia information processing gained a huge boost, and astonishing results have been observed on a multitude of interesting visual-textual tasks. Relation networks paved the way towards an attentive processing methodology that considers images and texts as sets of basic interconnected elements (regions and words). These winning ideas recently helped to reach the state-of-the-art on the image-text matching task. Cross-media information retrieval has been proposed as a benchmark to test the capabilities of the proposed networks to match complex multi-modal concepts in the same common space. Modern deep-learning powered networks are complex and almost all of them cannot provide concise multi-modal descriptions that can be used in fast multi-modal search engines. In fact, the latest image-sentence matching networks use cross-attention and early-fusion approaches, which force all the elements of the database to be considered at query time. In this work, I will try to lay down some ideas to bridge the gap between the effectiveness of modern deep-learning multi-modal matching architectures and their efficiency, as far as fast and scalable visual-textual information retrieval is concerned.
Source: SISAP 2020 - 13th International Conference on Similarity Search and Applications, pp. 405–411, Copenhagen, Denmark, September 30 - October 2, 2020
DOI: 10.1007/978-3-030-60936-8_33

See at: ISTI Repository Open Access | doi.org Restricted | link.springer.com Restricted | CNR ExploRA


2023 Conference article Open Access OPEN
CrowdSim2: an open synthetic benchmark for object detectors
Foszner P., Szczesna A., Ciampi L., Messina N., Cygan A., Bizon B., Cogiel M., Golba D., Macioszek E., Staniszewski M.
Data scarcity has become one of the main obstacles to developing supervised models based on Artificial Intelligence in Computer Vision. Indeed, Deep Learning-based models systematically struggle when applied in new scenarios never seen during training and may not be adequately tested in non-ordinary yet crucial real-world situations. This paper presents and publicly releases CrowdSim2, a new synthetic collection of images suitable for people and vehicle detection gathered from a simulator based on the Unity graphical engine. It consists of thousands of images gathered from various synthetic scenarios resembling the real world, where we varied some factors of interest, such as the weather conditions and the number of objects in the scenes. The labels are automatically collected and consist of bounding boxes that precisely localize objects belonging to the two object classes, leaving out humans from the annotation pipeline. We exploited this new benchmark as a testing ground for some state-of-the-art detectors, showing that our simulated scenarios can be a valuable tool for measuring their performance in a controlled environment.
Source: VISIGRAPP 2023 - 18th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, pp. 676–683, Lisbon, Portugal, 19-21/02/2023
DOI: 10.5220/0011692500003417
Project(s): AI4Media via OpenAIRE

See at: ISTI Repository Open Access | www.scitepress.org Restricted | CNR ExploRA


2023 Conference article Open Access OPEN
Development of a realistic crowd simulation environment for fine-grained validation of people tracking methods
Foszner P., Szczesna A., Ciampi L., Messina N., Cygan A., Bizon B., Cogiel M., Golba D., Macioszek E., Staniszewski M.
Generally, crowd datasets can be collected or generated from real or synthetic sources. Real data is generated by using infrastructure-based sensors (such as static cameras or other sensors). The use of simulation tools can significantly reduce the time required to generate scenario-specific crowd datasets, facilitate data-driven research, and, in turn, support the building of functional machine learning models. The main goal of this work was to develop an extension of crowd simulation (named CrowdSim2) and prove its usability in the application of people-tracking algorithms. The simulator is developed using the very popular Unity 3D engine with particular emphasis on the aspects of realism in the environment, weather conditions, traffic, and the movement and models of individual agents. Finally, three tracking methods were used to validate the generated dataset: IOU-Tracker, Deep-Sort, and Deep-TAMA.
Source: VISIGRAPP 2023 - 18th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, pp. 222–229, Lisbon, Portugal, 19-21/02/2023
DOI: 10.5220/0011691500003417
Project(s): AI4Media via OpenAIRE

See at: ISTI Repository Open Access | www.scitepress.org Restricted | CNR ExploRA


2023 Conference article Restricted
Improving query and assessment quality in text-based interactive video retrieval evaluation
Bailer W., Arnold R., Benz V., Coccomini D., Gkagkas A., Þór Guðmundsson G., Heller S., Þór Jónsson B., Lokoc J., Messina N., Pantelidis N., Wu J.
Different task interpretations are a highly undesired element in interactive video retrieval evaluations. When a participating team focuses partially on a wrong goal, the evaluation results might become partially misleading. In this paper, we propose a process for refining known-item and open-set type queries, and preparing the assessors that judge the correctness of submissions to open-set queries. Our findings from recent years reveal that a proper methodology can lead to objective query quality improvements and subjective participant satisfaction with query clarity.
Source: ICMR '23: International Conference on Multimedia Retrieval, pp. 597–601, Thessaloniki, Greece, 12-15/06/2023
DOI: 10.1145/3591106.3592281
Project(s): AI4Media via OpenAIRE

See at: dl.acm.org Restricted | CNR ExploRA


2023 Conference article Open Access OPEN
Text-to-motion retrieval: towards joint understanding of human motion data and natural language
Messina N., Sedmidubský J., Falchi F., Rebok T.
Due to recent advances in pose-estimation methods, human motion can be extracted from a common video in the form of 3D skeleton sequences. Despite wonderful application opportunities, effective and efficient content-based access to large volumes of such spatio-temporal skeleton data still remains a challenging problem. In this paper, we propose a novel content-based text-to-motion retrieval task, which aims at retrieving relevant motions based on a specified natural-language textual description. To define baselines for this uncharted task, we employ the BERT and CLIP language representations to encode the text modality and successful spatio-temporal models to encode the motion modality. We additionally introduce our transformer-based approach, called Motion Transformer (MoT), which employs divided space-time attention to effectively aggregate the different skeleton joints in space and time. Inspired by the recent progress in text-to-image/video matching, we experiment with two widely-adopted metric-learning loss functions. Finally, we set up a common evaluation protocol by defining qualitative metrics for assessing the quality of the retrieved motions, targeting the two recently-introduced KIT Motion-Language and HumanML3D datasets. The code for reproducing our results is available here: https://github.com/mesnico/text-to-motion-retrieval.
(A minimal code sketch of the metric-learning objective follows this record.)
Source: SIGIR '23: The 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2420–2425, Taipei, Taiwan, 23-27/07/2023
DOI: 10.1145/3539618.3592069
Project(s): AI4Media via OpenAIRE

See at: ISTI Repository Open Access | CNR ExploRA
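
The abstract above aligns text and motion embeddings in a common space using widely adopted metric-learning losses. As a rough illustration only (not the authors' released code: the embedding size, the temperature, and the choice of a symmetric InfoNCE objective are assumptions of this sketch), such a loss can be written as:

    import torch
    import torch.nn.functional as F

    def info_nce(text_emb, motion_emb, temperature=0.07):
        # L2-normalize both modalities so dot products become cosine similarities
        text_emb = F.normalize(text_emb, dim=-1)
        motion_emb = F.normalize(motion_emb, dim=-1)
        logits = text_emb @ motion_emb.t() / temperature   # (B, B) similarity matrix
        targets = torch.arange(text_emb.size(0))           # matching pairs lie on the diagonal
        # symmetric cross-entropy: text-to-motion and motion-to-text directions
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

    if __name__ == "__main__":
        # random tensors stand in for BERT/CLIP text features and MoT motion features
        t, m = torch.randn(8, 256), torch.randn(8, 256)
        print(info_nce(t, m).item())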


2023 Journal article Open Access OPEN
Interactive video retrieval in the age of effective joint embedding deep models: lessons from the 11th VBS
Lokoc J., Andreadis S., Bailer W., Duane A., Gurrin C., Ma Z., Messina N., Nguyen T. N., Peska L., Rossetto L., Sauter L., Schall K., Schoeffmann K., Khan O. S., Spiess F., Vadicamo L., Vrochidis S.
This paper presents findings of the eleventh Video Browser Showdown competition, where sixteen teams competed in known-item and ad-hoc search tasks. Many of the teams utilized state-of-the-art video retrieval approaches that demonstrated high effectiveness in challenging search scenarios. In this paper, a broad survey of all utilized approaches is presented in connection with an analysis of the performance of participating teams. Specifically, high-level performance indicators are presented with overall statistics, as well as an in-depth analysis of the performance of selected tools implementing result set logging. The analysis reveals evidence that the CLIP model represents a versatile tool for cross-modal video retrieval when combined with interactive search capabilities. Furthermore, the analysis investigates the effect of different users and text query properties on the performance in search tasks. Last but not least, lessons learned from search task preparation are presented, and a new direction for ad-hoc search based tasks at Video Browser Showdown is introduced.
Source: Multimedia systems (2023). doi:10.1007/s00530-023-01143-5
DOI: 10.1007/s00530-023-01143-5
Project(s): AI4Media via OpenAIRE, XRECO via OpenAIRE

See at: ISTI Repository Open Access | ZENODO Open Access | link.springer.com Restricted | CNR ExploRA


2020 Conference article Closed Access
Re-implementing and Extending Relation Network for R-CBIR
Messina N., Amato G., Falchi F.
Relational reasoning is an emerging theme in Machine Learning in general and in Computer Vision in particular. DeepMind has recently proposed a module called Relation Network (RN) that has shown impressive results on visual question answering tasks. Unfortunately, the implementation of the proposed approach was not public. To reproduce their experiments and extend their approach in the context of Information Retrieval, we had to re-implement everything, testing many parameters and conducting many experiments. Our implementation is now public on GitHub and it is already used by a large community of researchers. Furthermore, we recently presented a variant of the relation network module that we called Aggregated Visual Features RN (AVF-RN). This network can produce and aggregate at inference time compact visual relationship-aware features for the Relational-CBIR (R-CBIR) task. R-CBIR consists in retrieving images with given relationships among objects. In this paper, we discuss the details of our Relation Network implementation and more experimental results than in the original paper. Relational reasoning is a very promising topic for better understanding and retrieving inter-object relationships, especially in digital libraries.
(A minimal code sketch of the Relation Network module follows this record.)
Source: 16th Italian Research Conference on Digital Libraries, IRCDL 2020, pp. 82–92, Bari, Italy, 30-31/01/2020
DOI: 10.1007/978-3-030-39905-4_9

See at: doi.org Restricted | link.springer.com Restricted | CNR ExploRA
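
The Relation Network discussed above computes RN(O) = f_phi(sum_{i,j} g_theta(o_i, o_j)) over all pairs of objects. Below is a minimal PyTorch sketch of that module; the question conditioning used in visual question answering and the CNN feature extractor are omitted, and the layer sizes are placeholders rather than the configuration used in the paper:

    import torch
    import torch.nn as nn

    class RelationNetwork(nn.Module):
        """Minimal RN head: g_theta scores every ordered pair of objects,
        f_phi maps the aggregated pair representations to the output."""
        def __init__(self, obj_dim=256, hidden=256, out_dim=10):
            super().__init__()
            self.g = nn.Sequential(nn.Linear(2 * obj_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
            self.f = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                   nn.Linear(hidden, out_dim))

        def forward(self, objects):                    # objects: (B, N, obj_dim)
            B, N, D = objects.shape
            o_i = objects.unsqueeze(2).expand(B, N, N, D)
            o_j = objects.unsqueeze(1).expand(B, N, N, D)
            pairs = torch.cat([o_i, o_j], dim=-1)      # all ordered pairs, (B, N, N, 2D)
            rel = self.g(pairs).sum(dim=(1, 2))        # aggregate pairwise relations
            return self.f(rel)

    if __name__ == "__main__":
        rn = RelationNetwork()
        print(rn(torch.randn(4, 8, 256)).shape)        # torch.Size([4, 10])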


2021 Conference article Open Access OPEN
Transformer reasoning network for image-text matching and retrieval
Messina N., Falchi F., Esuli A., Amato G.
Image-text matching is an interesting and fascinating task in modern AI research. Despite the evolution of deep-learning-based image and text processing systems, multi-modal matching remains a challenging problem. In this work, we consider the problem of accurate image-text matching for the task of multi-modal large-scale information retrieval. State-of-the-art results in image-text matching are achieved by inter-playing image and text features from the two different processing pipelines, usually using mutual attention mechanisms. However, this invalidates any chance to extract separate visual and textual features needed for later indexing steps in large-scale retrieval systems. In this regard, we introduce the Transformer Encoder Reasoning Network (TERN), an architecture built upon one of the modern relationship-aware self-attentive architectures, the Transformer Encoder (TE). This architecture is able to separately reason on the two different modalities and to enforce a final common abstract concept space by sharing the weights of the deeper transformer layers. Thanks to this design, the implemented network is able to produce compact and very rich visual and textual features available for the successive indexing step. Experiments are conducted on the MS-COCO dataset, and we evaluate the results using a discounted cumulative gain metric with relevance computed exploiting caption similarities, in order to assess possibly non-exact but relevant search results. We demonstrate that on this metric we are able to achieve state-of-the-art results in the image retrieval task. Our code is freely available at https://github.com/mesnico/TERN.
(A minimal code sketch of the shared-layer design follows this record.)
Source: ICPR 2020 - 25th International Conference on Pattern Recognition, pp. 5222–5229, Online conference, 10-15/01/2021
DOI: 10.1109/icpr48806.2021.9413172
DOI: 10.48550/arxiv.2004.09144
Project(s): AI4EU via OpenAIRE, AI4Media via OpenAIRE

See at: arXiv.org e-Print Archive Open Access | arxiv.org Open Access | ISTI Repository Open Access | ZENODO Open Access | doi.org Restricted | doi.org Restricted | Archivio della Ricerca - Università di Pisa Restricted | ieeexplore.ieee.org Restricted | CNR ExploRA
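
TERN, as described above, reasons separately on the two modalities and enforces a common abstract space by sharing the weights of the deeper transformer layers. The following is a simplified sketch of that weight-sharing idea (layer counts, dimensions, and the CLS-style pooling are assumptions of this sketch, not the released implementation):

    import torch
    import torch.nn as nn

    class SharedSpaceEncoder(nn.Module):
        """Two modality-specific transformer-encoder stacks whose deeper layers
        share weights, pushing both modalities toward a common abstract space."""
        def __init__(self, d_model=1024, nhead=8, private_layers=2, shared_layers=2):
            super().__init__()
            make = lambda n: nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), n)
            self.visual_private = make(private_layers)
            self.text_private = make(private_layers)
            self.shared = make(shared_layers)          # same weights for both modalities

        def forward(self, regions, words):
            v = self.shared(self.visual_private(regions))
            t = self.shared(self.text_private(words))
            # first token of each sequence acts as the global descriptor (CLS-like)
            return v[:, 0], t[:, 0]

    if __name__ == "__main__":
        enc = SharedSpaceEncoder()
        img_vec, txt_vec = enc(torch.randn(2, 36, 1024), torch.randn(2, 20, 1024))
        print(img_vec.shape, txt_vec.shape)            # both (2, 1024)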


2021 Conference article Open Access OPEN
AIMH at SemEval-2021 - Task 6: multimodal classification using an ensemble of transformer models
Messina N., Falchi F., Gennaro C., Amato G.
This paper describes the system used by the AIMH Team to approach the SemEval Task 6. We propose an approach that relies on an architecture based on the transformer model to process multimodal content (text and images) in memes. Our architecture, called DVTT (Double Visual Textual Transformer), approaches Subtasks 1 and 3 of Task 6 as multi-label classification problems, where the text and/or images of the meme are processed, and the probabilities of the presence of each possible persuasion technique are returned as a result. DVTT uses two complete transformer networks that work on text and images and are mutually conditioned. One of the two modalities acts as the main one and the second one intervenes to enrich the first one, thus obtaining two distinct modes of operation. The two transformers' outputs are merged by averaging the inferred probabilities for each possible label, and the overall network is trained end-to-end with a binary cross-entropy loss.
(A toy sketch of this probability-averaging setup follows this record.)
Source: SemEval-2021 - 15th International Workshop on Semantic Evaluation, pp. 1020–1026, Bangkok, Thailand, 5-6/08/2021
DOI: 10.18653/v1/2021.semeval-1.140
Project(s): AI4EU via OpenAIRE, AI4Media via OpenAIRE

See at: aclanthology.org Open Access | ISTI Repository Open Access | ISTI Repository Open Access | CNR ExploRA
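
The abstract above states that the two branches' inferred probabilities are averaged and the network is trained with a binary cross-entropy loss. A toy sketch of that fusion-and-loss setup follows; the linear heads stand in for the actual visual and textual transformers, and all dimensions and label counts are illustrative assumptions:

    import torch
    import torch.nn as nn

    class TwoBranchAverager(nn.Module):
        """Toy stand-in for the DVTT fusion: two branches produce per-label
        probabilities for the same meme, and the probabilities are averaged."""
        def __init__(self, text_dim=768, image_dim=512, num_labels=20):
            super().__init__()
            self.text_head = nn.Linear(text_dim, num_labels)
            self.image_head = nn.Linear(image_dim, num_labels)

        def forward(self, text_feat, image_feat):
            p_text = torch.sigmoid(self.text_head(text_feat))
            p_image = torch.sigmoid(self.image_head(image_feat))
            return 0.5 * (p_text + p_image)            # average the inferred probabilities

    if __name__ == "__main__":
        model = TwoBranchAverager()
        probs = model(torch.randn(4, 768), torch.randn(4, 512))
        labels = torch.randint(0, 2, (4, 20)).float()  # multi-label targets
        loss = nn.BCELoss()(probs, labels)             # end-to-end multi-label objective
        loss.backward()
        print(loss.item())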


2021 Conference article Open Access OPEN
Towards efficient cross-modal visual textual retrieval using transformer-encoder deep features
Messina N., Amato G., Falchi F., Gennaro C., Marchand-Maillet S.
Cross-modal retrieval is an important functionality in modern search engines, as it increases the user experience by allowing queries and retrieved objects to pertain to different modalities. In this paper, we focus on the image-sentence retrieval task, where the objective is to efficiently find relevant images for a given sentence (image-retrieval) or the relevant sentences for a given image (sentence-retrieval). Computer vision literature reports the best results on the image-sentence matching task using deep neural networks equipped with attention and self-attention mechanisms. They evaluate the matching performance on the retrieval task by performing sequential scans of the whole dataset. This method does not scale well with an increasing amount of images or captions. In this work, we explore different preprocessing techniques to produce sparsified deep multi-modal features, extracting them from state-of-the-art deep-learning architectures for image-text matching. Our main objective is to lay down the paths for efficient indexing of complex multi-modal descriptions. We use the recently introduced TERN architecture as an image-sentence features extractor. It is designed for producing fixed-size 1024-d vectors describing whole images and sentences, as well as variable-length sets of 1024-d vectors describing the various building components of the two modalities (image regions and sentence words respectively). All these vectors are enforced by the TERN design to lie in the same common space. Our experiments show interesting preliminary results on the explored methods and suggest further experimentation in this important research direction.
(A simple sparsification sketch follows this record.)
Source: CBMI - International Conference on Content-Based Multimedia Indexing, Lille, France, 28-30/06/2021
DOI: 10.1109/cbmi50038.2021.9461890
Project(s): AI4EU via OpenAIRE

See at: ISTI Repository Open Access | ieeexplore.ieee.org Restricted | CNR ExploRA
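
The record above explores preprocessing techniques that sparsify the 1024-d TERN descriptors for indexing. The paper evaluates several such techniques; the snippet below only illustrates one simple baseline (keeping the top-k largest-magnitude components per vector), which is an assumption of this sketch rather than a method taken from the paper:

    import numpy as np

    def topk_sparsify(features, k=64):
        """Keep only the k largest-magnitude components of each row and zero
        the rest: a simple sparsification baseline for indexing."""
        features = np.asarray(features, dtype=np.float32)
        idx = np.argpartition(np.abs(features), -k, axis=1)[:, -k:]  # per-row top-k indices
        sparse = np.zeros_like(features)
        rows = np.arange(features.shape[0])[:, None]
        sparse[rows, idx] = features[rows, idx]
        return sparse

    if __name__ == "__main__":
        dense = np.random.randn(5, 1024).astype(np.float32)          # e.g. TERN descriptors
        sparse = topk_sparsify(dense, k=64)
        print((sparse != 0).sum(axis=1))                             # 64 nonzeros per vector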


2022 Doctoral thesis Open Access OPEN
Relational Learning in computer vision
Messina N.
The increasing interest in social networks, smart cities, and Industry 4.0 is encouraging the development of techniques for processing, understanding, and organizing vast amounts of data. Recent important advances in Artificial Intelligence brought to life a subfield of Machine Learning called Deep Learning, which can automatically learn common patterns from raw data directly, without relying on manual feature selection. This framework revolutionized many computer science fields, like Computer Vision and Natural Language Processing, obtaining astonishing results. Nevertheless, many challenges are still open. Although deep neural networks obtained impressive results on many tasks, they cannot perform non-local processing by explicitly relating potentially interconnected visual or textual entities. This relational aspect is fundamental for capturing high-level semantic interconnections in multimedia data or understanding the relationships between spatially distant objects in an image. This thesis tackles the relational understanding problem in Deep Neural Networks, considering three different yet related tasks: Relational Content-based Image Retrieval (R-CBIR), Visual-Textual Retrieval, and the Same-Different task. We use state-of-the-art deep learning methods for relational learning, such as the Relation Networks and the Transformer Networks, for relating the different entities in an image or in a text.

See at: etd.adm.unipi.it Open Access | ISTI Repository Open Access | CNR ExploRA


2022 Conference article Open Access OPEN
Combining EfficientNet and vision transformers for video deepfake detection
Coccomini D. A., Messina N., Gennaro C., Falchi F.
Deepfakes are the result of digital manipulation to forge realistic yet fake imagery. With the astonishing advances in deep generative models, fake images or videos are nowadays obtained using variational autoencoders (VAEs) or Generative Adversarial Networks (GANs). These technologies are becoming more accessible and accurate, resulting in fake videos that are very difficult to detect. Traditionally, Convolutional Neural Networks (CNNs) have been used to perform video deepfake detection, with the best results obtained using methods based on EfficientNet B7. In this study, we focus on video deepfake detection on faces, given that most methods are becoming extremely accurate in the generation of realistic human faces. Specifically, we combine various types of Vision Transformers with a convolutional EfficientNet B0 used as a feature extractor, obtaining comparable results with some very recent methods that use Vision Transformers. Differently from the state-of-the-art approaches, we use neither distillation nor ensemble methods. Furthermore, we present a straightforward inference procedure based on a simple voting scheme for handling multiple faces in the same video shot. The best model achieved an AUC of 0.951 and an F1 score of 88.0%, very close to the state-of-the-art on the DeepFake Detection Challenge (DFDC). The code for reproducing our results is publicly available here: https://tinyurl.com/cnn-vit-dfd.
(A toy sketch of a per-shot voting rule follows this record.)
Source: ICIAP 2022 - 21st International Conference on Image Analysis and Processing, pp. 219–229, Lecce, Italy, 23-27/05/2022
DOI: 10.1007/978-3-031-06433-3_19

See at: ISTI Repository Open Access | doi.org Restricted | link.springer.com Restricted | CNR ExploRA
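
The inference procedure mentioned above aggregates per-face predictions within a video shot through a simple voting scheme. The following sketch shows one plausible reading of such a rule; the thresholds and the exact aggregation are illustrative assumptions, not the tuned procedure released with the paper:

    def shot_is_fake(face_probs, face_threshold=0.55, min_fake_faces=1):
        """Toy voting rule for a video shot: each detected face contributes the
        mean fake probability over its frames, and the shot is flagged as fake
        when enough faces vote 'fake'.

        face_probs: dict mapping a face-track id to a list of per-frame fake
        probabilities produced by the EfficientNet+ViT classifier."""
        votes = 0
        for track_id, probs in face_probs.items():
            if sum(probs) / len(probs) >= face_threshold:
                votes += 1
        return votes >= min_fake_faces

    if __name__ == "__main__":
        shot = {"face_0": [0.2, 0.3, 0.25], "face_1": [0.8, 0.9, 0.7]}
        print(shot_is_fake(shot))   # True: one face consistently looks manipulated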


2022 Conference article Open Access OPEN
Towards unsupervised machine learning approaches for knowledge graphs
Minutella F., Falchi F., Manghi P., De Bonis M., Messina N.
Nowadays, a lot of data is in the form of Knowledge Graphs aiming at representing information as a set of nodes and relationships between them. This paper proposes an efficient framework to create informative embeddings for node classification on large knowledge graphs. Such embeddings capture how a particular node of the graph interacts with its neighborhood and indicate whether it is isolated or part of a bigger clique. Since a homogeneous graph is necessary to perform this kind of analysis, the framework exploits the metapath approach to split the heterogeneous graph into multiple homogeneous graphs. The proposed pipeline includes an unsupervised attentive neural network to merge different metapaths and produce node embeddings suitable for classification. Preliminary experiments on the IMDb dataset demonstrate the validity of the proposed approach, which can outperform current state-of-the-art unsupervised methods.
(A minimal metapath-construction sketch follows this record.)
Source: IRCDL 2022 - 18th Italian Research Conference on Digital Libraries, Padua, Italy, 24-25/02/2022
Project(s): OpenAIRE Nexus via OpenAIRE

See at: ceur-ws.org Open Access | ISTI Repository Open Access | CNR ExploRA
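
The framework described above splits the heterogeneous knowledge graph into homogeneous graphs via metapaths. The snippet below shows the standard way a two-hop metapath (e.g. movie-actor-movie on IMDb) collapses into a homogeneous adjacency matrix; it is a generic construction used for illustration, not necessarily the exact implementation of the paper:

    import numpy as np

    def metapath_adjacency(biadjacency):
        """Collapse a two-hop metapath (e.g. movie-actor-movie) into a
        homogeneous adjacency: two movies are connected when they share at
        least one intermediate node. `biadjacency` is the movie x actor matrix."""
        a = np.asarray(biadjacency, dtype=np.float32)
        adj = a @ a.T                      # counts of shared intermediate nodes
        np.fill_diagonal(adj, 0)           # drop self-loops
        return (adj > 0).astype(np.float32)

    if __name__ == "__main__":
        movie_actor = np.array([[1, 0, 1],
                                [1, 1, 0],
                                [0, 1, 0]], dtype=np.float32)   # 3 movies, 3 actors
        print(metapath_adjacency(movie_actor))
        # movies 0 and 1 share actor 0; movies 1 and 2 share actor 1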


2022 Conference article Open Access OPEN
A spatio-temporal attentive network for video-based crowd counting
Avvenuti M., Bongiovanni M., Ciampi L., Falchi F., Gennaro C., Messina N.
Automatic people counting from images has recently drawn attention for urban monitoring in modern Smart Cities due to the ubiquity of surveillance camera networks. Current computer vision techniques rely on deep learning-based algorithms that estimate pedestrian densities in still, individual images. Only a handful of works take advantage of temporal consistency in video sequences. In this work, we propose a spatio-temporal attentive neural network to estimate the number of pedestrians from surveillance videos. By taking advantage of the temporal correlation between consecutive frames, we lowered the state-of-the-art count error by 5% and the localization error by 7.5% on the widely-used FDST benchmark.
Source: ISCC 2022 - 27th IEEE Symposium on Computers and Communications, Rhodes Island, Greece, 30/06/2022-03/07/2022
DOI: 10.1109/iscc55528.2022.9913019
Project(s): AI4Media via OpenAIRE

See at: ISTI Repository Open Access | ieeexplore.ieee.org Restricted | CNR ExploRA


2022 Conference article Open Access OPEN
ALADIN: distilling fine-grained alignment scores for efficient image-text matching and retrieval
Messina N., Stefanini M., Cornia M., Baraldi L., Falchi F., Amato G., Cucchiara R.
Image-text matching is gaining a leading role among tasks involving the joint understanding of vision and language. In literature, this task is often used as a pre-training objective to forge architectures able to jointly deal with images and texts. Nonetheless, it has a direct downstream application: cross-modal retrieval, which consists in finding images related to a given query text or vice-versa. Solving this task is of critical importance in cross-modal search engines. Many recent methods proposed effective solutions to the image-text matching problem, mostly using recent large vision-language (VL) Transformer networks. However, these models are often computationally expensive, especially at inference time. This prevents their adoption in large-scale cross-modal retrieval scenarios, where results should be provided to the user almost instantaneously. In this paper, we propose to fill in the gap between effectiveness and efficiency by proposing an ALign And DIstill Network (ALADIN). ALADIN first produces highly effective scores by aligning images and texts at a fine-grained level. Then, it learns a shared embedding space -- where an efficient kNN search can be performed -- by distilling the relevance scores obtained from the fine-grained alignments. We obtained remarkable results on MS-COCO, showing that our method can compete with state-of-the-art VL Transformers while being almost 90 times faster. The code for reproducing our results is available at https://github.com/mesnico/ALADIN.
(A minimal score-distillation sketch follows this record.)
Source: CBMI 2022 - International Conference on Content-based Multimedia Indexing, pp. 64–70, Graz, Austria, 14-16/09/2022
DOI: 10.1145/3549555.3549576
Project(s): AI4Media via OpenAIRE

See at: ISTI Repository Open Access | CNR ExploRA
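
ALADIN, as summarized above, distills the relevance scores of a fine-grained alignment head into a shared embedding space where kNN search is efficient. The sketch below illustrates one way such score distillation can be expressed (an in-batch listwise KL objective); the actual loss used in the paper may differ, and all dimensions are placeholders:

    import torch
    import torch.nn.functional as F

    def distill_alignment_scores(img_emb, txt_emb, teacher_scores, tau=0.05):
        """Push the student's dot-product similarities between global
        image/text embeddings toward the teacher's fine-grained alignment
        scores via a KL divergence over softmaxed rows."""
        student = F.normalize(img_emb, dim=-1) @ F.normalize(txt_emb, dim=-1).t()
        s_log = F.log_softmax(student / tau, dim=1)
        t_prob = F.softmax(teacher_scores / tau, dim=1)
        return F.kl_div(s_log, t_prob, reduction="batchmean")

    if __name__ == "__main__":
        B, D = 8, 1024
        imgs = torch.randn(B, D, requires_grad=True)
        txts = torch.randn(B, D, requires_grad=True)
        teacher = torch.randn(B, B)        # frozen fine-grained alignment scores
        loss = distill_alignment_scores(imgs, txts, teacher)
        loss.backward()
        print(loss.item())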


2023 Conference article Open Access OPEN
An optimized pipeline for image-based localization in museums from egocentric images
Messina N., Falchi F., Furnari A., Gennaro C., Farinella G. M.
With the increasing interest in augmented and virtual reality, visual localization is acquiring a key role in many downstream applications requiring a real-time estimate of the user location only from visual streams. In this paper, we propose an optimized hierarchical localization pipeline by specifically tackling cultural heritage sites with specific applications in museums. Specifically, we propose to enhance the Structure from Motion (SfM) pipeline for constructing the sparse 3D point cloud by filtering out blurred and near-duplicate images beforehand. We also study an improved inference pipeline that merges similarity-based localization with geometric pose estimation to effectively mitigate the effect of strong outliers. We show that the proposed optimized pipeline obtains the lowest localization error on the challenging Bellomo dataset. Our proposed approach keeps both build and inference times bounded, in turn enabling the deployment of this pipeline in real-world scenarios.
(A minimal blur-filtering sketch follows this record.)
Source: ICIAP 2023 - 22nd International Conference on Image Analysis and Processing, pp. 512–524, Udine, Italy, 11-15/09/2023
DOI: 10.1007/978-3-031-43148-7_43
Project(s): AI4Media via OpenAIRE

See at: IRIS - Università degli Studi di Catania Open Access | ISTI Repository Open Access | doi.org Restricted | CNR ExploRA
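
The pipeline above filters blurred and near-duplicate images before running Structure from Motion. A common blur heuristic is the variance of the Laplacian; the snippet below applies it as a pre-filter. The threshold value and the use of OpenCV are assumptions of this sketch, not details taken from the paper:

    import cv2

    def filter_blurred(image_paths, threshold=100.0):
        """Discard images whose variance of the Laplacian falls below a
        threshold (low variance = few sharp edges = likely blurred)."""
        kept = []
        for path in image_paths:
            gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
            if gray is None:
                continue                                   # unreadable file, skip
            sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()
            if sharpness >= threshold:
                kept.append(path)
        return kept

    # kept = filter_blurred(["frame_0001.jpg", "frame_0002.jpg"])
    # The surviving frames would then be passed to the SfM tool (e.g. COLMAP)
    # to build the sparse 3D point cloud.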


2024 Journal article Open Access OPEN
Cascaded transformer-based networks for Wikipedia large-scale image-caption matching
Messina N., Coccomini D. A., Esuli A., Falchi F.
With the increasing importance of multimedia and multilingual data in online encyclopedias, novel methods are needed to fill domain gaps and automatically connect different modalities for increased accessibility. For example, Wikipedia is composed of millions of pages written in multiple languages. Images, when present, often lack textual context, thus remaining conceptually floating and harder to find and manage. In this work, we tackle the novel task of associating images from Wikipedia pages with the correct caption among a large pool of available ones written in multiple languages, as required by the image-caption matching Kaggle challenge organized by the Wikimedia Foundation. A system able to perform this task would improve the accessibility and completeness of the underlying multi-modal knowledge graph in online encyclopedias. We propose a cascade of two models powered by the recent Transformer networks able to efficiently and effectively infer a relevance score between the query image data and the captions. We verify through extensive experiments that the proposed cascaded approach effectively handles a large pool of images and captions while keeping the overall computational complexity at inference time bounded. With respect to other approaches in the challenge leaderboard, we can achieve remarkable improvements over the previous proposals (+8% in nDCG@5 with respect to the sixth position) with constrained resources. The code is publicly available at https://tinyurl.com/wiki-imcap.
(A minimal two-stage cascade sketch follows this record.)
Source: Multimedia tools and applications (2024). doi:10.1007/s11042-023-17977-0
DOI: 10.1007/s11042-023-17977-0
Project(s): AI4Media via OpenAIRE

See at: link.springer.com Open Access | ISTI Repository Open Access | CNR ExploRA
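
The cascade described above first scores the whole caption pool cheaply and then applies a heavier transformer only to a shortlist. The following sketch captures that two-stage pattern with placeholder scoring functions; the shortlist size and the dummy re-ranker are assumptions, not the models used in the paper:

    import numpy as np

    def cascade_rank(image_vec, caption_vecs, rerank_fn, k=100):
        """Two-stage cascade: a cheap dot-product ranking over the whole pool,
        followed by an expensive scorer applied only to the top-k candidates."""
        sims = caption_vecs @ image_vec                      # stage 1: fast scores
        top_k = np.argsort(-sims)[:k]                        # shortlist
        rescored = [(idx, rerank_fn(image_vec, caption_vecs[idx])) for idx in top_k]
        rescored.sort(key=lambda pair: -pair[1])             # stage 2: precise order
        return [idx for idx, _ in rescored]

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        img = rng.normal(size=256).astype(np.float32)
        pool = rng.normal(size=(10_000, 256)).astype(np.float32)
        # dummy reranker: in practice this would be the second transformer of the cascade
        ranking = cascade_rank(img, pool, lambda i, c: float(i @ c), k=100)
        print(ranking[:5])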


2019 Conference article Open Access OPEN
Learning relationship-aware visual features
Messina N., Amato G., Carrara F., Falchi F., Gennaro C.
Relational reasoning in Computer Vision has recently shown impressive results on visual question answering tasks. On the challenging dataset called CLEVR, the recently proposed Relation Network (RN), a simple plug-and-play module and one of the state-of-the-art approaches, has obtained a very good accuracy (95.5%) answering relational questions. In this paper, we define a sub-field of Content-Based Image Retrieval (CBIR) called Relational-CBIR (R-CBIR), in which we are interested in retrieving images with given relationships among objects. To this aim, we employ the RN architecture in order to extract relation-aware features from CLEVR images. To prove the effectiveness of these features, we extended both the CLEVR and Sort-of-CLEVR datasets, generating a ground-truth for R-CBIR by exploiting relational data embedded into scene-graphs. Furthermore, we propose a modification of the RN module - a two-stage Relation Network (2S-RN) - that enabled us to extract relation-aware features by using a preprocessing stage able to focus on the image content, leaving the question apart. Experiments show that our RN features, especially the 2S-RN ones, outperform the RMAC state-of-the-art features on this new challenging task.
Source: ECCV 2018 - European Conference on Computer Vision, pp. 486–501, Munich, Germany, 8-14 September 2018
DOI: 10.1007/978-3-030-11018-5_40

See at: ISTI Repository Open Access | doi.org Restricted | link.springer.com Restricted | CNR ExploRA


2019 Conference article Open Access OPEN
Testing Deep Neural Networks on the Same-Different Task
Messina N., Amato G., Carrara F., Falchi F., Gennaro C.
Developing abstract reasoning abilities in neural networks is an important goal towards the achievement of human-like performance on many tasks. As of now, some works have tackled this problem, developing ad-hoc architectures and reaching overall good generalization performance. In this work we try to understand to what extent state-of-the-art convolutional neural networks for image classification are able to deal with a challenging abstract problem, the so-called same-different task. This problem consists in understanding if two random shapes inside the same image are the same or not. A recent work demonstrated that simple convolutional neural networks are almost unable to solve this problem. We extend their work, showing that ResNet-inspired architectures are able to learn, while VGG cannot converge. In light of this, we suppose that residual connections play an important role in the learning process, while the depth of the network seems not so relevant. In addition, we carry out some targeted tests on the converged architectures to figure out to what extent they are able to generalize to never-seen patterns. However, further investigation is needed in order to understand what the architectural peculiarities and limits are as far as abstract reasoning is concerned.
(A toy same-different data generator follows this record.)
Source: 2019 International Conference on Content-Based Multimedia Indexing (CBMI), Dublin, Ireland, 4-6/09/2019
DOI: 10.1109/cbmi.2019.8877412
Project(s): AI4EU via OpenAIRE

See at: ISTI Repository Open Access | doi.org Restricted | ieeexplore.ieee.org Restricted | CNR ExploRA
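
The same-different task discussed above asks whether two shapes in one image are identical. The generator below produces toy examples in that spirit (two squares, same or different side length); the actual benchmark images are more varied, so this is only an illustrative stand-in:

    import numpy as np

    def make_same_different(size=64, rng=None):
        """Generate one toy same-different example: a blank canvas with two
        axis-aligned squares that either have the same side length (label 1)
        or different side lengths (label 0)."""
        rng = rng or np.random.default_rng()
        img = np.zeros((size, size), dtype=np.float32)
        same = int(rng.integers(0, 2))
        s1 = int(rng.integers(6, 16))
        s2 = s1 if same else int(rng.choice([s for s in range(6, 16) if s != s1]))
        for side, x_off in ((s1, 0), (s2, size // 2)):
            x = x_off + int(rng.integers(0, size // 2 - side))   # keep each square in its half
            y = int(rng.integers(0, size - side))
            img[y:y + side, x:x + side] = 1.0                    # draw a filled square
        return img, same

    if __name__ == "__main__":
        image, label = make_same_different()
        print(image.shape, "same" if label else "different")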


2019 Conference article Open Access OPEN
Learning pedestrian detection from virtual worlds
Amato G., Ciampi L., Falchi F., Gennaro C., Messina N.
In this paper, we present a real-time pedestrian detection system that has been trained using a virtual environment. This is a very popular research topic with countless practical applications, and recently there has been increasing interest in deep learning architectures for performing such a task. However, the availability of large labeled datasets is a key point for the effective training of such algorithms. For this reason, in this work, we introduce ViPeD, a new synthetically generated set of images extracted from a realistic 3D video game, where the labels can be automatically generated by exploiting 2D pedestrian positions extracted from the graphics engine. We exploited this new synthetic dataset to fine-tune a state-of-the-art, computationally efficient Convolutional Neural Network (CNN). A preliminary experimental evaluation, compared to the performance of other existing approaches trained on real-world images, shows encouraging results.
Source: Image Analysis and Processing - ICIAP 2019, pp. 302–312, Trento, Italy, 9-13/09/2019
DOI: 10.1007/978-3-030-30642-7_27
Project(s): AI4EU via OpenAIRE

See at: ISTI Repository Open Access | Lecture Notes in Computer Science Restricted | link.springer.com Restricted | CNR ExploRA