59 result(s)

2020 Conference article Open Access OPEN
Relational visual-textual information retrieval
Messina N
With the advent of deep learning, multimedia information processing gained a huge boost, and astonishing results have been observed on a multitude of interesting visual-textual tasks. Relation networks paved the way towards an attentive processing methodology that considers images and texts as sets of basic interconnected elements (regions and words). These winning ideas recently helped to reach the state-of-the-art on the image-text matching task. Cross-media information retrieval has been proposed as a benchmark to test the capabilities of the proposed networks to match complex multi-modal concepts in the same common space. Modern deep-learning powered networks are complex and almost all of them cannot provide concise multi-modal descriptions that can be used in fast multi-modal search engines. In fact, the latest image-sentence matching networks use cross-attention and early-fusion approaches, which force all the elements of the database to be considered at query time. In this work, I will try to lay down some ideas to bridge the gap between the effectiveness of modern deep-learning multi-modal matching architectures and their efficiency, as far as fast and scalable visual-textual information retrieval is concerned.

See at: CNR IRIS Open Access | link.springer.com Open Access | ISTI Repository Open Access | CNR IRIS Restricted | CNR IRIS Restricted
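The efficiency concern raised in this abstract can be made concrete with a small sketch (illustrative only, not code from the paper): with separate, precomputed embeddings a query reduces to one matrix product over the database, whereas cross-attention or early-fusion models must re-run the joint network on every (query, item) pair at query time.

```python
# Hypothetical sketch (not the paper's code): why late-fusion embeddings scale
# better than cross-attention scoring for visual-textual retrieval.
import numpy as np

rng = np.random.default_rng(0)
n_images, dim = 10_000, 1024

# Late fusion: image embeddings are precomputed once, offline.
image_embs = rng.standard_normal((n_images, dim)).astype(np.float32)
image_embs /= np.linalg.norm(image_embs, axis=1, keepdims=True)

def late_fusion_search(query_emb, k=10):
    """One matrix-vector product against precomputed vectors (indexable)."""
    scores = image_embs @ query_emb          # cosine similarity (unit-norm vectors)
    return np.argsort(-scores)[:k]

def cross_attention_search(query_tokens, image_token_sets, score_fn, k=10):
    """Early fusion: the joint model must be run on every (query, image) pair."""
    scores = np.array([score_fn(query_tokens, toks) for toks in image_token_sets])
    return np.argsort(-scores)[:k]

query = rng.standard_normal(dim).astype(np.float32)
query /= np.linalg.norm(query)
print(late_fusion_search(query)[:5])
```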


2022 Other Open Access OPEN
Relational Learning in computer vision
Messina N
The growing interest in social networks, smart cities, and Industry 4.0 is driving the development of techniques for processing, understanding, and organizing vast amounts of data. Recent advances in Artificial Intelligence have given rise to Deep Learning, a branch of Machine Learning that autonomously discovers the most relevant patterns in the input data, without relying on feature selection guided by a human expert. Deep Learning has revolutionized important application fields such as Computer Vision and Natural Language Processing; nevertheless, it still suffers from significant limitations. Although extraordinary results have been achieved in many application domains, neural networks still struggle to understand the relationships between semantically related yet distant elements, both in the spatio-temporal dimension and, more generally, in their form (a text is inherently different from an image, even though it may describe it perfectly). This shortcoming negatively affects the search for interconnections between multimedia objects of a different nature, as well as the search for relationships between spatially distant objects within an image. In this thesis we addressed the problem of relational understanding in deep neural networks, taking as reference three different yet closely related tasks. First, we introduced Relational Content-Based Image Retrieval (R-CBIR), an extension of the classical CBIR task whose goal is to retrieve all images sharing similar relationships among the objects they contain. We tackled Relational CBIR by defining architectures capable of extracting relational descriptors and by extending the synthetic CLEVR dataset to obtain a ground truth suitable for evaluating this new task. The next step extended these preliminary results to real images in the context of cross-modal retrieval, where natural-language descriptions are used as queries to search large image databases (and vice versa). We used the Transformer architecture to correlate visual and textual elements, with large-scale retrieval as the final objective. After integrating these networks into a tool for interactive large-scale video retrieval (VISIONE), we observed that the obtained descriptors can encode highly semantic elements, achieving excellent results on the Semantic CBIR task. We then used the same technologies to address a highly relevant problem in social networks: the detection of persuasion techniques in disinformation campaigns. The last part of the research focused on studying convolutional architectures on simple visual reasoning problems that require comparisons between spatially distant shapes. In this context we proposed a hybrid CNN-Transformer architecture that achieved very good results while remaining less complex and more efficient than competing networks. The primary aim of this thesis was to explore new neural models for the semantic and relational understanding of images and texts, with large-scale applications and immediate extensions to further modalities such as audio and/or video.

See at: etd.adm.unipi.it Open Access | CNR IRIS Open Access | ISTI Repository Open Access | CNR IRIS Restricted


2023 Conference article Open Access OPEN
CrowdSim2: an open synthetic benchmark for object detectors
Foszner P, Szczesna A, Ciampi L, Messina N, Cygan A, Bizon B, Cogiel M, Golba D, Macioszek E, Staniszewski M
Data scarcity has become one of the main obstacles to developing supervised models based on Artificial Intelligence in Computer Vision. Indeed, Deep Learning-based models systematically struggle when applied in new scenarios never seen during training and may not be adequately tested in non-ordinary yet crucial real-world situations. This paper presents and publicly releases CrowdSim2, a new synthetic collection of images suitable for people and vehicle detection, gathered from a simulator based on the Unity graphical engine. It consists of thousands of images from various synthetic scenarios resembling the real world, where we varied some factors of interest, such as the weather conditions and the number of objects in the scenes. The labels are automatically collected and consist of bounding boxes that precisely localize objects belonging to the two object classes, leaving humans out of the annotation pipeline. We exploited this new benchmark as a testing ground for some state-of-the-art detectors, showing that our simulated scenarios can be a valuable tool for measuring their performance in a controlled environment.
Project(s): AI4Media via OpenAIRE

See at: CNR IRIS Open Access | ISTI Repository Open Access | www.scitepress.org Open Access | CNR IRIS Restricted | CNR IRIS Restricted


2023 Conference article Open Access OPEN
Development of a realistic crowd simulation environment for fine-grained validation of people tracking methods
Foszner P, Szczesna A, Ciampi L, Messina N, Cygan A, Bizon B, Cogiel M, Golba D, Macioszek E, Staniszewski M
Generally, crowd datasets can be collected or generated from real or synthetic sources. Real data is generated by using infrastructure-based sensors (such as static cameras or other sensors). The use of simulation tools can significantly reduce the time required to generate scenario-specific crowd datasets, facilitate data-driven research, and subsequently build functional machine learning models. The main goal of this work was to develop an extension of a crowd simulator (named CrowdSim2) and prove its usability for the application of people-tracking algorithms. The simulator is developed using the very popular Unity 3D engine, with particular emphasis on the aspects of realism in the environment, weather conditions, traffic, and the movement and models of individual agents. Finally, three tracking methods were used to validate the generated dataset: IOU-Tracker, Deep-Sort, and Deep-TAMA.
Project(s): AI4Media via OpenAIRE

See at: CNR IRIS Open Access | ISTI Repository Open Access | www.scitepress.org Open Access | CNR IRIS Restricted | CNR IRIS Restricted


2023 Conference article Restricted
Improving query and assessment quality in text-based interactive video retrieval evaluation
Bailer W, Arnold R, Benz V, Coccomini D, Gkagkas A, Þór Guðmundsson G, Heller S, Þór Jónsson B, Lokoc J, Messina N, Pantelidis N, Wu J
Different task interpretations are a highly undesired element in interactive video retrieval evaluations. When a participating team focuses partially on a wrong goal, the evaluation results might become partially misleading. In this paper, we propose a process for refining known-item and open-set type queries, and preparing the assessors that judge the correctness of submissions to open-set queries. Our findings from recent years reveal that a proper methodology can lead to objective query quality improvements and subjective participant satisfaction with query clarity.
Project(s): AI4Media via OpenAIRE

See at: dl.acm.org Restricted | CNR IRIS Restricted | CNR IRIS Restricted


2023 Conference article Open Access OPEN
Text-to-motion retrieval: towards joint understanding of human motion data and natural language
Messina N, Sedmidubský J, Falchi F, Rebok T
Due to recent advances in pose-estimation methods, human motion can be extracted from a common video in the form of 3D skeleton sequences. Despite wonderful application opportunities, effective and efficient content-based access to large volumes of such spatio-temporal skeleton data still remains a challenging problem. In this paper, we propose a novel content-based text-to-motion retrieval task, which aims at retrieving relevant motions based on a specified natural-language textual description. To define baselines for this uncharted task, we employ the BERT and CLIP language representations to encode the text modality and successful spatio-temporal models to encode the motion modality. We additionally introduce our transformer-based approach, called Motion Transformer (MoT), which employs divided space-time attention to effectively aggregate the different skeleton joints in space and time. Inspired by the recent progress in text-to-image/video matching, we experiment with two widely-adopted metric-learning loss functions. Finally, we set up a common evaluation protocol by defining qualitative metrics for assessing the quality of the retrieved motions, targeting the two recently-introduced KIT Motion-Language and HumanML3D datasets. The code for reproducing our results is available here: https://github.com/mesnico/text-to-motion-retrieval.
DOI: 10.1145/3539618.3592069
Project(s): AI4Media via OpenAIRE

See at: CNR IRIS Open Access | ISTI Repository Open Access | CNR IRIS Restricted
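A minimal sketch of the divided space-time attention idea mentioned above, assuming a skeleton sequence shaped (frames, joints, features); layer sizes and the residual/normalization scheme are illustrative and not taken from the official MoT implementation.

```python
# Minimal sketch (assumptions, not the official MoT code): divided space-time
# attention over a skeleton sequence of shape (frames T, joints J, features D).
import torch
import torch.nn as nn

class DividedSpaceTimeBlock(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):                    # x: (B, T, J, D)
        B, T, J, D = x.shape
        # Spatial attention: joints attend to each other within the same frame.
        s = x.reshape(B * T, J, D)
        s = s + self.spatial(self.norm1(s), self.norm1(s), self.norm1(s))[0]
        x = s.reshape(B, T, J, D)
        # Temporal attention: each joint attends to itself across frames.
        t = x.permute(0, 2, 1, 3).reshape(B * J, T, D)
        t = t + self.temporal(self.norm2(t), self.norm2(t), self.norm2(t))[0]
        return t.reshape(B, J, T, D).permute(0, 2, 1, 3)

block = DividedSpaceTimeBlock()
out = block(torch.randn(2, 30, 21, 256))     # 30 frames, 21 joints
print(out.shape)                             # torch.Size([2, 30, 21, 256])
```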


2023 Journal article Open Access OPEN
Interactive video retrieval in the age of effective joint embedding deep models: lessons from the 11th VBS
Lokoc J, Andreadis S, Bailer W, Duane A, Gurrin C, Ma Z, Messina N, Nguyen T N, Peska L, Rossetto L, Sauter L, Schall K, Schoeffmann K, Khan Os, Spiess F, Vadicamo L, Vrochidis S
This paper presents findings of the eleventh Video Browser Showdown competition, where sixteen teams competed in known-item and ad-hoc search tasks. Many of the teams utilized state-of-the-art video retrieval approaches that demonstrated high effectiveness in challenging search scenarios. In this paper, a broad survey of all utilized approaches is presented in connection with an analysis of the performance of participating teams. Specifically, both high-level performance indicators are presented with overall statistics as well as in-depth analysis of the performance of selected tools implementing result set logging. The analysis reveals evidence that the CLIP model represents a versatile tool for cross-modal video retrieval when combined with interactive search capabilities. Furthermore, the analysis investigates the effect of different users and text query properties on the performance in search tasks. Last but not least, lessons learned from search task preparation are presented, and a new direction for ad-hoc search based tasks at Video Browser Showdown is introduced.
Source: MULTIMEDIA SYSTEMS
Project(s): AI4Media via OpenAIRE

See at: CNR IRIS Open Access | link.springer.com Open Access | ISTI Repository Open Access | CNR IRIS Restricted | CNR IRIS Restricted


2020 Conference article Restricted
Re-implementing and Extending Relation Network for R-CBIR
Messina N, Amato G, Falchi F
Relational reasoning is an emerging theme in Machine Learning in general and in Computer Vision in particular. DeepMind has recently proposed a module called Relation Network (RN) that has shown impressive results on visual question answering tasks. Unfortunately, the implementation of the proposed approach was not public. To reproduce their experiments and extend their approach in the context of Information Retrieval, we had to re-implement everything, testing many parameters and conducting many experiments. Our implementation is now public on GitHub and it is already used by a large community of researchers. Furthermore, we recently presented a variant of the relation network module that we called Aggregated Visual Features RN (AVF-RN). This network can produce and aggregate at inference time compact visual relationship-aware features for the Relational-CBIR (R-CBIR) task. R-CBIR consists in retrieving images with given relationships among objects. In this paper, we discuss the details of our Relation Network implementation and present more experimental results than the original paper. Relational reasoning is a very promising topic for better understanding and retrieving inter-object relationships, especially in digital libraries.
Source: COMMUNICATIONS IN COMPUTER AND INFORMATION SCIENCE (PRINT), pp. 82-92. Bari, Italy, 30-31/01/2020

See at: CNR IRIS Restricted | CNR IRIS Restricted | link.springer.com Restricted
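For readers unfamiliar with the module being re-implemented, the following is a hedged sketch of a Relation Network in the spirit described above: a shared MLP g scores every ordered pair of object features (conditioned on a query), the pair activations are summed, and a second MLP f produces the output. Dimensions and layer counts are illustrative, not those of the authors' released code.

```python
# Hedged sketch of a Relation Network (RN) module; sizes are illustrative.
import torch
import torch.nn as nn

class RelationNetwork(nn.Module):
    def __init__(self, obj_dim=256, q_dim=128, hidden=256, out_dim=10):
        super().__init__()
        # g_theta scores every ordered pair of objects conditioned on the query.
        self.g = nn.Sequential(nn.Linear(2 * obj_dim + q_dim, hidden),
                               nn.ReLU(), nn.Linear(hidden, hidden), nn.ReLU())
        # f_phi maps the aggregated relational feature to the output.
        self.f = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                               nn.Linear(hidden, out_dim))

    def forward(self, objects, question):    # (B, N, obj_dim), (B, q_dim)
        B, N, D = objects.shape
        oi = objects.unsqueeze(2).expand(B, N, N, D)
        oj = objects.unsqueeze(1).expand(B, N, N, D)
        q = question[:, None, None, :].expand(B, N, N, question.shape[-1])
        pairs = torch.cat([oi, oj, q], dim=-1)          # all N*N object pairs
        rel = self.g(pairs).sum(dim=(1, 2))             # permutation-invariant sum
        return self.f(rel)

rn = RelationNetwork()
print(rn(torch.randn(4, 8, 256), torch.randn(4, 128)).shape)   # (4, 10)
```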


2021 Journal article Open Access OPEN
Solving the same-different task with convolutional neural networks
Messina N, Amato G, Carrara F, Gennaro C, Falchi F
Deep learning has demonstrated remarkable abilities in solving many kinds of real-world problems in computer vision. However, deep models are still strained by simple reasoning tasks that humans consider easy to solve. In this work, we probe current state-of-the-art convolutional neural networks on a difficult set of tasks known as the same-different problems. All the problems require the same prerequisite to be solved correctly: understanding if two random shapes inside the same image are the same or not. With the experiments carried out in this work, we demonstrate that residual connections, and more generally the skip connections, seem to have only a marginal impact on the learning of the proposed problems. In particular, we experiment with DenseNets, and we examine the contribution of residual and recurrent connections in already tested architectures, ResNet-18 and CorNet-S respectively. Our experiments show that older feed-forward networks, AlexNet and VGG, are almost unable to learn the proposed problems, except in some specific scenarios. We show that recently introduced architectures can converge even in the cases where the important parts of their architecture are removed. We finally carry out some zero-shot generalization tests, and we discover that in these scenarios residual and recurrent connections can have a stronger impact on the overall test accuracy. On four difficult problems from the SVRT dataset, we reach state-of-the-art results with respect to previous approaches, obtaining super-human performance on three of the four problems.
Source: PATTERN RECOGNITION LETTERS, vol. 143, pp. 75-80
Project(s): AI4EU via OpenAIRE

See at: CNR IRIS Open Access | ISTI Repository Open Access | www.sciencedirect.com Open Access | CNR IRIS Restricted | CNR IRIS Restricted


2021 Conference article Open Access OPEN
Transformer reasoning network for image-text matching and retrieval
Messina N, Falchi F, Esuli A, Amato G
Image-text matching is an interesting and fascinating task in modern AI research. Despite the evolution of deep-learning-based image and text processing systems, multi-modal matching remains a challenging problem. In this work, we consider the problem of accurate image-text matching for the task of multi-modal large-scale information retrieval. State-of-the-art results in image-text matching are achieved by inter-playing image and text features from the two different processing pipelines, usually using mutual attention mechanisms. However, this invalidates any chance to extract separate visual and textual features needed for later indexing steps in large-scale retrieval systems. In this regard, we introduce the Transformer Encoder Reasoning Network (TERN), an architecture built upon one of the modern relationship-aware self-attentive architectures, the Transformer Encoder (TE). This architecture is able to separately reason on the two different modalities and to enforce a final common abstract concept space by sharing the weights of the deeper transformer layers. Thanks to this design, the implemented network is able to produce compact and very rich visual and textual features available for the successive indexing step. Experiments are conducted on the MS-COCO dataset, and we evaluate the results using a discounted cumulative gain metric with relevance computed exploiting caption similarities, in order to assess possibly non-exact but relevant search results. We demonstrate that on this metric we are able to achieve state-of-the-art results in the image retrieval task. Our code is freely available at https://github.com/mesnico/TERN.
Source: INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, pp. 5222-5229. Online conference, 10-15/01/2021
Project(s): AI4EU via OpenAIRE, AI4Media via OpenAIRE

See at: CNR IRIS Open Access | ieeexplore.ieee.org Open Access | ISTI Repository Open Access | CNR IRIS Restricted | CNR IRIS Restricted
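A compact sketch of the weight-sharing idea described above (an assumed structure, not the released TERN code): each modality has its own lower transformer-encoder layers, while the deeper layers are shared so that both pipelines are pushed toward a common abstract space.

```python
# Illustrative TERN-style sketch: modality-specific lower layers, shared top layers.
import torch
import torch.nn as nn

def make_layers(n, dim=1024, heads=8):
    return nn.ModuleList(
        nn.TransformerEncoderLayer(dim, heads, batch_first=True) for _ in range(n))

class TERNSketch(nn.Module):
    def __init__(self, dim=1024, low=2, shared=2):
        super().__init__()
        self.visual_low = make_layers(low, dim)
        self.textual_low = make_layers(low, dim)
        self.shared_top = make_layers(shared, dim)   # weights shared by both modalities

    def encode(self, tokens, low_stack):
        for layer in low_stack:
            tokens = layer(tokens)
        for layer in self.shared_top:
            tokens = layer(tokens)
        return tokens[:, 0]       # a CLS-like token acts as the global descriptor

    def forward(self, regions, words):
        return self.encode(regions, self.visual_low), self.encode(words, self.textual_low)

model = TERNSketch()
img_emb, txt_emb = model(torch.randn(2, 36, 1024), torch.randn(2, 20, 1024))
print(img_emb.shape, txt_emb.shape)   # (2, 1024) (2, 1024)
```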


2021 Conference article Open Access OPEN
AIMH at SemEval-2021 - Task 6: multimodal classification using an ensemble of transformer models
Messina N, Falchi F, Gennaro C, Amato G
This paper describes the system used by the AIMH Team to approach SemEval-2021 Task 6. We propose an approach that relies on an architecture based on the transformer model to process multimodal content (text and images) in memes. Our architecture, called DVTT (Double Visual Textual Transformer), approaches Subtasks 1 and 3 of Task 6 as multi-label classification problems, where the text and/or images of the meme are processed, and the probabilities of the presence of each possible persuasion technique are returned as a result. DVTT uses two complete transformer networks that work on text and images and are mutually conditioned. One of the two modalities acts as the main one and the second one intervenes to enrich it, thus obtaining two distinct modes of operation. The outputs of the two transformers are merged by averaging the inferred probabilities for each possible label, and the overall network is trained end-to-end with a binary cross-entropy loss.
Source: PROCEEDINGS OF THE CONFERENCE - ASSOCIATION FOR COMPUTATIONAL LINGUISTICS. MEETING, pp. 1020-1026. Bangkok, Thailand, 5-6/08/2021
Project(s): AI4EU via OpenAIRE, AI4Media via OpenAIRE

See at: aclanthology.org Open Access | CNR IRIS Open Access | ISTI Repository Open Access | ISTI Repository Open Access | CNR IRIS Restricted | CNR IRIS Restricted
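The fusion and training step described above can be sketched as follows (illustrative; the number of labels and the branch outputs are placeholders): per-label probabilities from the textual and visual transformers are averaged and trained with binary cross-entropy.

```python
# Hedged sketch of the DVTT fusion step: average the two branches' probabilities
# and train with binary cross-entropy (label count is an assumed placeholder).
import torch
import torch.nn as nn

n_labels = 20                               # e.g. persuasion techniques (multi-label)
text_logits = torch.randn(8, n_labels)      # stand-ins for the two branch outputs
image_logits = torch.randn(8, n_labels)
targets = torch.randint(0, 2, (8, n_labels)).float()

probs = (torch.sigmoid(text_logits) + torch.sigmoid(image_logits)) / 2
loss = nn.functional.binary_cross_entropy(probs, targets)
print(loss.item())
```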


2021 Conference article Open Access OPEN
Towards efficient cross-modal visual textual retrieval using transformer-encoder deep features
Messina N, Amato G, Falchi F, Gennaro C, Marchand-Maillet S
Cross-modal retrieval is an important functionality in modern search engines, as it improves the user experience by allowing queries and retrieved objects to pertain to different modalities. In this paper, we focus on the image-sentence retrieval task, where the objective is to efficiently find relevant images for a given sentence (image-retrieval) or the relevant sentences for a given image (sentence-retrieval). Computer vision literature reports the best results on the image-sentence matching task using deep neural networks equipped with attention and self-attention mechanisms. They evaluate the matching performance on the retrieval task by performing sequential scans of the whole dataset. This method does not scale well with an increasing amount of images or captions. In this work, we explore different preprocessing techniques to produce sparsified deep multi-modal features, extracting them from state-of-the-art deep-learning architectures for image-text matching. Our main objective is to lay down the paths for efficient indexing of complex multi-modal descriptions. We use the recently introduced TERN architecture as an image-sentence feature extractor. It is designed to produce fixed-size 1024-d vectors describing whole images and sentences, as well as variable-length sets of 1024-d vectors describing the various building components of the two modalities (image regions and sentence words, respectively). All these vectors are enforced by the TERN design to lie in the same common space. Our experiments show interesting preliminary results on the explored methods and suggest further experimentation in this important research direction.
Project(s): AI4EU via OpenAIRE

See at: CNR IRIS Open Access | ieeexplore.ieee.org Open Access | ISTI Repository Open Access | CNR IRIS Restricted | CNR IRIS Restricted
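One simple way to sparsify dense 1024-d descriptors, shown here purely as an illustration of the preprocessing direction explored above (the paper's actual sparsification strategies may differ): keep only the k highest-magnitude components of each vector so that inverted-file style indexing becomes feasible.

```python
# Hedged top-k sparsification of dense descriptors (illustrative choice of k).
import torch

def sparsify_topk(features, k=64):
    """Keep only the k highest-magnitude components of each vector, zero the rest."""
    _, indices = features.abs().topk(k, dim=-1)
    sparse = torch.zeros_like(features)
    return sparse.scatter(-1, indices, features.gather(-1, indices))

dense = torch.randn(5, 1024)
sparse = sparsify_topk(dense)
print((sparse != 0).sum(dim=-1))   # tensor([64, 64, 64, 64, 64])
```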


2022 Conference article Open Access OPEN
Combining EfficientNet and vision transformers for video deepfake detection
Coccomini Da, Messina N, Gennaro C, Falchi F
Deepfakes are the result of digital manipulation aimed at forging realistic yet fake imagery. With the astonishing advances in deep generative models, fake images or videos are nowadays obtained using variational autoencoders (VAEs) or Generative Adversarial Networks (GANs). These technologies are becoming more accessible and accurate, resulting in fake videos that are very difficult to detect. Traditionally, Convolutional Neural Networks (CNNs) have been used to perform video deepfake detection, with the best results obtained using methods based on EfficientNet B7. In this study, we focus on video deepfake detection on faces, given that most methods are becoming extremely accurate in the generation of realistic human faces. Specifically, we combine various types of Vision Transformers with a convolutional EfficientNet B0 used as a feature extractor, obtaining comparable results with some very recent methods that use Vision Transformers. Differently from the state-of-the-art approaches, we use neither distillation nor ensemble methods. Furthermore, we present a straightforward inference procedure based on a simple voting scheme for handling multiple faces in the same video shot. The best model achieved an AUC of 0.951 and an F1 score of 88.0%, very close to the state of the art on the DeepFake Detection Challenge (DFDC). The code for reproducing our results is publicly available here: https://tinyurl.com/cnn-vit-dfd.

See at: CNR IRIS Open Access | link.springer.com Open Access | ISTI Repository Open Access | CNR IRIS Restricted | CNR IRIS Restricted
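The simple voting scheme for multiple faces in a shot could look like the following hedged sketch (threshold and aggregation rule are illustrative, not the exact procedure of the paper): per-face fake probabilities are averaged and the shot is flagged when the mean exceeds a threshold.

```python
# Illustrative voting scheme over the faces detected in one video shot.
import numpy as np

def classify_shot(face_probs, threshold=0.5):
    """face_probs: fake probabilities predicted for every detected face in the shot."""
    return float(np.mean(face_probs)) > threshold

print(classify_shot([0.92, 0.85, 0.40, 0.77]))   # True -> labelled as deepfake
```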


2022 Conference article Open Access OPEN
Towards unsupervised machine learning approaches for knowledge graphs
Minutella F, Falchi F, Manghi P, De Bonis M, Messina N
Nowadays, a lot of data is in the form of Knowledge Graphs, which represent information as a set of nodes and relationships between them. This paper proposes an efficient framework to create informative embeddings for node classification on large knowledge graphs. Such embeddings capture how a particular node of the graph interacts with its neighborhood and indicate whether it is isolated or part of a bigger clique. Since a homogeneous graph is necessary to perform this kind of analysis, the framework exploits the metapath approach to split the heterogeneous graph into multiple homogeneous graphs. The proposed pipeline includes an unsupervised attentive neural network to merge different metapaths and produce node embeddings suitable for classification. Preliminary experiments on the IMDb dataset demonstrate the validity of the proposed approach, which can outperform current state-of-the-art unsupervised methods.
Source: CEUR WORKSHOP PROCEEDINGS. Padua, Italy, 24-25/02/2022
Project(s): OpenAIRE Nexus via OpenAIRE

See at: ceur-ws.org Open Access | CNR IRIS Open Access | ISTI Repository Open Access | CNR IRIS Restricted
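A hedged sketch of the metapath-merging idea described above (names and sizes are illustrative, not the paper's pipeline): node embeddings computed on each metapath-induced homogeneous graph are fused with a learned attention weighting before classification.

```python
# Illustrative attention-based fusion of per-metapath node embeddings.
import torch
import torch.nn as nn

class MetapathAttentionFusion(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.score = nn.Linear(dim, 1)      # one attention score per metapath view

    def forward(self, views):               # views: (num_metapaths, num_nodes, dim)
        weights = torch.softmax(self.score(views).mean(dim=1), dim=0)   # (M, 1)
        return (weights.unsqueeze(1) * views).sum(dim=0)                # (nodes, dim)

views = torch.randn(3, 1000, 128)           # e.g. 3 metapaths, 1000 nodes
fused = MetapathAttentionFusion()(views)
print(fused.shape)                           # torch.Size([1000, 128])
```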


2022 Conference article Open Access OPEN
A spatio-temporal attentive network for video-based crowd counting
Avvenuti M, Bongiovanni M, Ciampi L, Falchi F, Gennaro C, Messina N
Automatic people counting from images has recently drawn attention for urban monitoring in modern Smart Cities due to the ubiquity of surveillance camera networks. Current computer vision techniques rely on deep learning-based algorithms that estimate pedestrian densities in still, individual images. Only a handful of works take advantage of temporal consistency in video sequences. In this work, we propose a spatio-temporal attentive neural network to estimate the number of pedestrians from surveillance videos. By taking advantage of the temporal correlation between consecutive frames, we lowered the state-of-the-art count error by 5% and the localization error by 7.5% on the widely-used FDST benchmark.
Source: PROCEEDINGS - IEEE SYMPOSIUM ON COMPUTERS AND COMMUNICATIONS. Rhodes Island, Greece, 30/06/2022-03/07/2022
Project(s): AI4Media via OpenAIRE

See at: CNR IRIS Open Access | ieeexplore.ieee.org Open Access | ISTI Repository Open Access | CNR IRIS Restricted | CNR IRIS Restricted


2022 Conference article Open Access OPEN
ALADIN: distilling fine-grained alignment scores for efficient image-text matching and retrieval
Messina N, Stefanini M, Cornia M, Baraldi L, Falchi F, Amato G, Cucchiara R
Image-text matching is gaining a leading role among tasks involving the joint understanding of vision and language. In the literature, this task is often used as a pre-training objective to forge architectures able to jointly deal with images and texts. Nonetheless, it has a direct downstream application: cross-modal retrieval, which consists in finding images related to a given query text or vice-versa. Solving this task is of critical importance in cross-modal search engines. Many recent methods proposed effective solutions to the image-text matching problem, mostly using recent large vision-language (VL) Transformer networks. However, these models are often computationally expensive, especially at inference time. This prevents their adoption in large-scale cross-modal retrieval scenarios, where results should be provided to the user almost instantaneously. In this paper, we propose to fill in the gap between effectiveness and efficiency by proposing an ALign And DIstill Network (ALADIN). ALADIN first produces highly effective scores by aligning images and texts at a fine-grained level. Then, it learns a shared embedding space -- where an efficient kNN search can be performed -- by distilling the relevance scores obtained from the fine-grained alignments. We obtained remarkable results on MS-COCO, showing that our method can compete with state-of-the-art VL Transformers while being almost 90 times faster. The code for reproducing our results is available at https://github.com/mesnico/ALADIN.
DOI: 10.1145/3549555.3549576
Project(s): AI4Media via OpenAIRE

See at: CNR IRIS Open Access | ISTI Repository Open Access | CNR IRIS Restricted | CNR IRIS Restricted
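The distillation step described above can be illustrated with a short sketch (one plausible formulation, not the released ALADIN code): global image and text embeddings are trained so that their dot-product scores mimic the relevance matrix produced by the expensive fine-grained alignment head.

```python
# Hedged sketch of score distillation into an efficient embedding space.
import torch
import torch.nn as nn

def distillation_loss(img_emb, txt_emb, fine_grained_scores):
    """img_emb, txt_emb: (B, D) global embeddings; fine_grained_scores: (B, B)
    teacher relevance matrix produced by the fine-grained alignment head."""
    student_scores = img_emb @ txt_emb.t()
    teacher = torch.softmax(fine_grained_scores, dim=-1)
    # Cross-entropy against soft teacher targets (one possible distillation choice).
    return nn.functional.cross_entropy(student_scores, teacher)

B, D = 16, 1024
loss = distillation_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, B))
print(loss.item())
```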


2023 Conference article Open Access OPEN
An optimized pipeline for image-based localization in museums from egocentric images
Messina N, Falchi F, Furnari A, Gennaro C, Farinella Gm
With the increasing interest in augmented and virtual reality, visual localization is acquiring a key role in many downstream applications requiring a real-time estimate of the user location from visual streams only. In this paper, we propose an optimized hierarchical localization pipeline specifically tackling cultural heritage sites, with applications in museums. In particular, we propose to enhance the Structure from Motion (SfM) pipeline for constructing the sparse 3D point cloud by a-priori filtering blurred and near-duplicate images. We also study an improved inference pipeline that merges similarity-based localization with geometric pose estimation to effectively mitigate the effect of strong outliers. We show that the proposed optimized pipeline obtains the lowest localization error on the challenging Bellomo dataset. Our proposed approach keeps both build and inference times bounded, in turn enabling the deployment of this pipeline in real-world scenarios.
Source: LECTURE NOTES IN COMPUTER SCIENCE, vol. 14233, pp. 512-524. Udine, Italy, 11-15/09/2023
DOI: 10.1007/978-3-031-43148-7_43
Project(s): AI4Media via OpenAIRE, SUN via OpenAIRE, Visual Analysis For Location And Understanding Of Environments

See at: IRIS - Università degli Studi di Catania Open Access | CNR IRIS Open Access | ISTI Repository Open Access | doi.org Restricted | CNR IRIS Restricted | CNR IRIS Restricted
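As a hedged illustration of the a-priori blur filtering mentioned above (the paper does not specify the criterion here; variance of the Laplacian is just one common focus measure): frames scoring below a sharpness threshold are discarded before the SfM reconstruction.

```python
# Illustrative blur filter using variance of the Laplacian (assumed criterion,
# not necessarily the one used in the paper).
import cv2
import numpy as np

def is_sharp(gray_image, threshold=100.0):
    """Variance of the Laplacian as a simple focus measure."""
    return cv2.Laplacian(gray_image, cv2.CV_64F).var() >= threshold

# Synthetic check: random noise is "sharp", a constant image is "blurred".
print(is_sharp(np.random.randint(0, 255, (240, 320), dtype=np.uint8)))  # True
print(is_sharp(np.full((240, 320), 127, dtype=np.uint8)))               # False
```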


2024 Journal article Open Access OPEN
Cascaded transformer-based networks for Wikipedia large-scale image-caption matching
Messina N, Coccomini Da, Esuli A, Falchi F
With the increasing importance of multimedia and multilingual data in online encyclopedias, novel methods are needed to fill domain gaps and automatically connect different modalities for increased accessibility. For example, Wikipedia is composed of millions of pages written in multiple languages. Images, when present, often lack textual context, thus remaining conceptually floating and harder to find and manage. In this work, we tackle the novel task of associating images from Wikipedia pages with the correct caption among a large pool of available ones written in multiple languages, as required by the image-caption matching Kaggle challenge organized by the Wikimedia Foundation. A system able to perform this task would improve the accessibility and completeness of the underlying multi-modal knowledge graph in online encyclopedias. We propose a cascade of two models powered by the recent Transformer networks able to efficiently and effectively infer a relevance score between the query image data and the captions. We verify through extensive experiments that the proposed cascaded approach effectively handles a large pool of images and captions while keeping the overall computational complexity bounded at inference time. With respect to other approaches in the challenge leaderboard, we can achieve remarkable improvements over the previous proposals (+8% in nDCG@5 with respect to the sixth position) with constrained resources. The code is publicly available at https://tinyurl.com/wiki-imcap.
Source: MULTIMEDIA TOOLS AND APPLICATIONS, vol. 83, pp. 62915-62935
Project(s): AI4Media via OpenAIRE

See at: CNR IRIS Open Access | link.springer.com Open Access | ISTI Repository Open Access | CNR IRIS Restricted
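The cascaded inference described above follows a common shortlist-then-rerank pattern; the sketch below is illustrative (function names and sizes are placeholders): a cheap dual-encoder scores the whole caption pool with a dot product, and the more expensive joint model re-ranks only the top candidates.

```python
# Illustrative two-stage cascade: fast shortlist, expensive re-ranking.
import numpy as np

def cascade_rank(query_image_emb, caption_embs, rerank_score_fn, shortlist=100):
    # Stage 1: cheap dot-product scoring over the whole caption pool.
    coarse = caption_embs @ query_image_emb
    candidates = np.argsort(-coarse)[:shortlist]
    # Stage 2: expensive joint model applied only to the shortlist.
    fine = np.array([rerank_score_fn(query_image_emb, caption_embs[i]) for i in candidates])
    return candidates[np.argsort(-fine)]

embs = np.random.randn(5000, 512).astype(np.float32)
q = np.random.randn(512).astype(np.float32)
ranked = cascade_rank(q, embs, lambda a, b: float(a @ b), shortlist=50)
print(ranked[:5])
```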


2024 Conference article Open Access OPEN
Is CLIP the main roadblock for fine-grained open-world perception?
Bianchi L, Carrara F, Messina N, Falchi F
Modern applications increasingly demand flexible computer vision models that adapt to novel concepts not encountered during training. This necessity is pivotal in emerging domains like extended reality, robotics, and autonomous driving, which require the ability to respond to open-world stimuli. A key ingredient is the ability to identify objects based on free-form textual queries defined at inference time – a task known as open-vocabulary object detection. Multimodal backbones like CLIP are the main enabling technology for current open-world perception solutions. Despite performing well on generic queries, recent studies highlighted limitations on the fine-grained recognition capabilities in open-vocabulary settings – i.e., for distinguishing subtle object features like color, shape, and material. In this paper, we perform a detailed examination of these open-vocabulary object recognition limitations to find the root cause. We evaluate the performance of CLIP, the most commonly used vision-language backbone, against a fine-grained object-matching benchmark, revealing interesting analogies between the limitations of open-vocabulary object detectors and their backbones. Experiments suggest that the lack of fine-grained understanding is caused by the poor separability of object characteristics in the CLIP latent space. Therefore, we try to understand whether fine-grained knowledge is present in CLIP embeddings but not exploited at inference time due, for example, to the unsuitability of the cosine similarity matching function, which may discard important object characteristics. Our preliminary experiments show that simple CLIP latent-space re-projections help separate fine-grained concepts, paving the way towards the development of backbones inherently able to process fine-grained details. The code for reproducing these experiments is available at https://github.com/lorebianchi98/FG-CLIP.
Project(s): Italian Strengthening of ESFRI RI RESILIENCE, SUN via OpenAIRE, a MUltimedia platform for Content Enrichment and Search in audiovisual archives

See at: CNR IRIS Open Access | CNR IRIS Restricted
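The latent-space re-projection experiment mentioned above can be sketched as follows (a single learned linear map is one simple choice and is only an assumption here): CLIP embeddings are re-projected before the usual cosine-similarity matching.

```python
# Hedged sketch of a learned re-projection applied before cosine similarity.
import torch
import torch.nn as nn

dim = 512                                    # e.g. CLIP ViT-B/32 embedding size
reproject = nn.Linear(dim, dim, bias=False)  # trained to separate fine-grained attributes

def fine_grained_similarity(image_emb, text_emb):
    a = nn.functional.normalize(reproject(image_emb), dim=-1)
    b = nn.functional.normalize(reproject(text_emb), dim=-1)
    return (a * b).sum(dim=-1)               # cosine similarity in the re-projected space

print(fine_grained_similarity(torch.randn(4, dim), torch.randn(4, dim)).shape)
```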


2024 Conference article Open Access OPEN
Vibration monitoring of historical towers: new contributions from data science
Girardi M, Gurioli G, Messina N, Padovani C, Pellegrini D
Deep neural networks are used to study the ambient vibrations of the medieval towers of the San Frediano Cathedral and the Guinigi Palace in the historic centre of Lucca. The towers have been continuously monitored for many months via high-sensitivity seismic stations. The recorded data sets, integrated with environmental parameters, are employed to train a Temporal Fusion Transformer network and forecast the dynamic behaviour of the monitored structures. The results show that the adopted algorithm can learn the main features of the towers' dynamic response, predict its evolution over time, and detect anomalies.
Source: LECTURE NOTES IN CIVIL ENGINEERING, vol. 514, pp. 15-24. Naples, Italy, 21-24/05/2024

See at: CNR IRIS Open Access | link.springer.com Open Access | CNR IRIS Restricted | CNR IRIS Restricted
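A hedged sketch of the anomaly-detection principle implied above (thresholding on forecast residuals is an illustrative choice, not the monitoring system's actual code): the forecasting model's prediction error on new vibration data is compared against a threshold calibrated on a healthy reference period.

```python
# Illustrative residual-based anomaly flagging on top of a forecasting model.
import numpy as np

def detect_anomalies(measured, predicted, calibration_errors, z=3.0):
    """Flag time steps whose forecast residual exceeds mean + z * std of the
    residuals observed during a reference (healthy) period."""
    residuals = np.abs(measured - predicted)
    threshold = calibration_errors.mean() + z * calibration_errors.std()
    return residuals > threshold

calib = np.abs(np.random.normal(0, 0.1, 1000))      # residuals under normal conditions
flags = detect_anomalies(np.random.normal(0, 0.1, 50), np.zeros(50), calib)
print(flags.any())
```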