62 result(s)
2025 Conference article Restricted
Towards identity-aware cross-modal retrieval: a dataset and a baseline
Messina N., Vadicamo L., Maltese L., Gennaro C.
Recent advancements in deep learning have significantly enhanced content-based retrieval methods, notably through models like CLIP that map images and texts into a shared embedding space. However, these methods often struggle with domain-specific entities and long-tail concepts absent from their training data, particularly in identifying specific individuals. In this paper, we explore the task of identity-aware cross-modal retrieval, which aims to retrieve images of persons in specific contexts based on natural language queries. This task is critical in various scenarios, such as for searching and browsing personalized video collections or large audio-visual archives maintained by national broadcasters. We introduce a novel dataset, COCO Person FaceSwap (COCO-PFS), derived from the widely used COCO dataset and enriched with deepfake-generated faces from VGGFace2. This dataset addresses the lack of large-scale datasets needed for training and evaluating models for this task. Our experiments assess the performance of different CLIP variations repurposed for this task, including our architecture, Identity-aware CLIP (Id-CLIP), which achieves competitive retrieval performance through targeted fine-tuning. Our contributions lay the groundwork for more robust cross-modal retrieval systems capable of recognizing long-tail identities and contextual nuances. Data and code are available at .
Source: LECTURE NOTES IN COMPUTER SCIENCE, vol. 15572, pp. 437-452. Lucca, Italy, April 6–10, 2025
DOI: 10.1007/978-3-031-88708-6_28
Project(s): Future Artificial Intelligence Research, a MUltimedia platform for Content Enrichment and Search in audiovisual archives
See at: CNR IRIS Restricted
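
For context on the entry above: CLIP-style retrieval embeds the text query and all candidate images in a shared space and ranks images by cosine similarity. A minimal sketch of that baseline with the Hugging Face CLIP implementation follows; it illustrates the generic mechanism, not the Id-CLIP architecture introduced in the paper, and the checkpoint name, image files, and query are illustrative assumptions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Generic CLIP text-to-image retrieval baseline (not Id-CLIP itself).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

images = [Image.open(p) for p in ["img0.jpg", "img1.jpg"]]  # hypothetical collection
query = "a person riding a bicycle in the rain"             # hypothetical query

with torch.no_grad():
    text_inputs = processor(text=[query], return_tensors="pt", padding=True)
    image_inputs = processor(images=images, return_tensors="pt")
    t = model.get_text_features(**text_inputs)
    v = model.get_image_features(**image_inputs)

# L2-normalise so the dot product equals cosine similarity, then rank.
t = t / t.norm(dim=-1, keepdim=True)
v = v / v.norm(dim=-1, keepdim=True)
scores = (t @ v.T).squeeze(0)                # one similarity score per image
ranking = scores.argsort(descending=True)    # best-matching images first
print(ranking.tolist())
```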


2025 Conference article Open Access OPEN
Is CLIP the main roadblock for fine-grained open-world perception?
Bianchi L., Carrara F., Messina N., Falchi F.
Modern applications increasingly demand flexible computer vision models that adapt to novel concepts not encountered during training. This necessity is pivotal in emerging domains like extended reality, robotics, and autonomous driving, which require the ability to respond to open-world stimuli. A key ingredient is the ability to identify objects based on free-form textual queries defined at inference time – a task known as open-vocabulary object detection. Multimodal backbones like CLIP are the main enabling technology for current open-world perception solutions. Despite performing well on generic queries, recent studies highlighted limitations on the fine-grained recognition capabilities in open-vocabulary settings – i.e., for distinguishing subtle object features like color, shape, and material. In this paper, we perform a detailed examination of these open-vocabulary object recognition limitations to find the root cause. We evaluate the performance of CLIP, the most commonly used vision-language backbone, against a fine-grained object-matching benchmark, revealing interesting analogies between the limitations of open-vocabulary object detectors and their backbones. Experiments suggest that the lack of fine-grained understanding is caused by the poor separability of object characteristics in the CLIP latent space. Therefore, we try to understand whether fine-grained knowledge is present in CLIP embeddings but not exploited at inference time due, for example, to the unsuitability of the cosine similarity matching function, which may discard important object characteristics. Our preliminary experiments show that simple CLIP latent-space re-projections help separate fine-grained concepts, paving the way towards the development of backbones inherently able to process fine-grained details. The code for reproducing these experiments is available at https://github.com/lorebianchi98/FG-CLIP.
DOI: 10.1109/cbmi62980.2024.10859215
Project(s): Future Artificial Intelligence Research, Italian Strengthening of ESFRI RI RESILIENCE, SUN via OpenAIRE, a MUltimedia platform for Content Enrichment and Search in audiovisual archives
See at: CNR IRIS Open Access | ieeexplore.ieee.org Open Access | CNR IRIS Restricted
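
The abstract above attributes the fine-grained weakness to poor separability of object characteristics in the CLIP latent space and reports that simple latent-space re-projections help. As a rough, hypothetical sketch of such a re-projection (the paper does not prescribe this exact form), one can train a single linear map on frozen CLIP embeddings so that attribute classes become more separable; the dimensions, data, and objective below are placeholders.

```python
import torch
from torch import nn

# Hypothetical linear re-projection of frozen CLIP embeddings, trained so that
# fine-grained attributes (e.g. colours or materials) separate better.
dim, n_attr = 512, 10
proj = nn.Linear(dim, dim, bias=False)       # the re-projection itself
head = nn.Linear(dim, n_attr)                # auxiliary attribute classifier
opt = torch.optim.Adam(list(proj.parameters()) + list(head.parameters()), lr=1e-3)

embeddings = torch.randn(1000, dim)          # placeholder precomputed CLIP embeddings
labels = torch.randint(0, n_attr, (1000,))   # placeholder attribute labels

for _ in range(100):
    z = proj(embeddings)
    z = z / z.norm(dim=-1, keepdim=True)     # keep vectors on the unit sphere, as in CLIP
    loss = nn.functional.cross_entropy(head(z), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
```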


2025 Conference article Restricted
Maybe you are looking for CroQS: Cross-Modal Query Suggestion for text-to-image retrieval
Pacini G., Carrara F., Messina N., Tonellotto N., Amato G., Falchi F.
Query suggestion, a technique widely adopted in information retrieval, enhances system interactivity and the browsing experience of document collections. In cross-modal retrieval, many works have focused on retrieving relevant items from natural language queries, while few have explored query suggestion solutions. In this work, we address query suggestion in cross-modal retrieval, introducing a novel task that focuses on suggesting minimal textual modifications needed to explore visually consistent subsets of the collection, following the premise of “Maybe you are looking for”. To facilitate the evaluation and development of methods, we present a tailored benchmark named CroQS. This dataset comprises initial queries, grouped result sets, and human-defined suggested queries for each group. We establish dedicated metrics to rigorously evaluate the performance of various methods on this task, measuring representativeness, cluster specificity, and similarity of the suggested queries to the original ones. Baseline methods from related fields, such as image captioning and content summarization, are adapted for this task to provide reference performance scores. Although relatively far from human performance, our experiments reveal that both LLM-based and captioning-based methods achieve competitive results on CroQS, improving the recall on cluster specificity by more than 115% and representativeness mAP by more than 52% with respect to the initial query. The dataset, the implementation of the baseline methods and the notebooks containing our experiments are available here: paciosoft.com/CroQS-benchmark/.
Source: LECTURE NOTES IN COMPUTER SCIENCE, vol. 15573, pp. 138-152. Lucca, Italy, April 6–10, 2025
DOI: 10.1007/978-3-031-88711-6_9
Project(s): Future Artificial Intelligence Research, a MUltimedia platform for Content Enrichment and Search in audiovisual archives
See at: CNR IRIS Restricted | link.springer.com Restricted
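
To make the task concrete: a query-suggestion pipeline of this kind retrieves images for the initial query, groups them into visually consistent subsets, and proposes one modified query per group. The sketch below is an illustrative outline under those assumptions, not one of the CroQS baselines; the embeddings are random placeholders and the captioning step is a hypothetical helper.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative cross-modal query-suggestion outline: cluster the embeddings of
# the images retrieved for an initial query, then derive one suggested query
# per visually consistent cluster.
retrieved = np.random.randn(200, 512)          # placeholder embeddings of retrieved images
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(retrieved)

def describe_cluster(members: np.ndarray) -> str:
    """Hypothetical helper: e.g. caption the member closest to the centroid."""
    return "cluster-specific modifier"

suggestions = [
    f"initial query, {describe_cluster(retrieved[kmeans.labels_ == k])}"
    for k in range(kmeans.n_clusters)
]
print(suggestions)
```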


2024 Journal article Open Access OPEN
Deep learning and structural health monitoring: temporal fusion transformers for anomaly detection in masonry towers
Falchi F., Girardi M., Gurioli G., Messina N., Padovani C., Pellegrini D.
Detecting anomalies in the vibrational features of age-old buildings is crucial within the Structural Health Monitoring (SHM) framework. The SHM techniques can leverage information from onsite measurements and environmental sources to identify the dynamic properties (such as the frequencies) of the monitored structure, searching for possible deviations or unusual behavior over time. In this paper, the Temporal Fusion Transformer (TFT) network, a deep learning algorithm initially designed for multi-horizon time series forecasting and tested on electricity, traffic, retail, and volatility problems, is applied to SHM. The TFT approach is adopted to investigate the behavior of the Guinigi Tower located in Lucca (Italy) and subjected to a long-term dynamic monitoring campaign. The TFT network is trained on the tower's experimental frequencies enriched with other environmental parameters. The transformer is then employed to predict the vibrational features (natural frequencies, root mean square values of the velocity time series) and detect possible anomalies or unexpected events by inspecting how much the actual frequencies deviate from the predicted ones. The TFT technique is used to detect the effects of the Viareggio earthquake that occurred on 6 February 2022, and the structural damage induced by three simulated damage scenarios.
Source: MECHANICAL SYSTEMS AND SIGNAL PROCESSING, vol. 215 (issue 111382)
DOI: 10.1016/j.ymssp.2024.111382
See at: CNR IRIS Open Access | ISTI Repository Open Access | www.sciencedirect.com Open Access | CNR IRIS Restricted
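
The anomaly-detection step described above boils down to comparing measured natural frequencies with the values forecast by the trained network and flagging large deviations. A minimal sketch of that comparison follows, with synthetic data standing in for the tower's frequencies and for the TFT forecasts, and a simple 3-sigma residual rule as an assumed threshold.

```python
import numpy as np

# Residual-based anomaly flagging: measurements deviating too far from the
# forecast are marked as potential anomalies. Data and threshold are assumptions.
rng = np.random.default_rng(0)
measured = 1.02 + 0.005 * rng.standard_normal(500)    # frequency estimates [Hz]
predicted = 1.02 + 0.005 * rng.standard_normal(500)   # stand-in for TFT forecasts [Hz]

residuals = measured - predicted
sigma = residuals[:300].std()                 # spread estimated on a reference period
anomalies = np.abs(residuals) > 3.0 * sigma   # assumed 3-sigma rule
print("flagged epochs:", np.flatnonzero(anomalies))
```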


2024 Journal article Open Access OPEN
Cascaded transformer-based networks for Wikipedia large-scale image-caption matching
Messina N, Coccomini Da, Esuli A, Falchi F
With the increasing importance of multimedia and multilingual data in online encyclopedias, novel methods are needed to fill domain gaps and automatically connect different modalities for increased accessibility. For example, Wikipedia is composed of millions of pages written in multiple languages. Images, when present, often lack textual context, thus remaining conceptually floating and harder to find and manage. In this work, we tackle the novel task of associating images from Wikipedia pages with the correct caption among a large pool of available ones written in multiple languages, as required by the image-caption matching Kaggle challenge organized by the Wikimedia Foundation. A system able to perform this task would improve the accessibility and completeness of the underlying multi-modal knowledge graph in online encyclopedias. We propose a cascade of two models powered by the recent Transformer networks able to efficiently and effectively infer a relevance score between the query image data and the captions. We verify through extensive experiments that the proposed cascaded approach effectively handles a large pool of images and captions while keeping the overall computational complexity bounded at inference time. With respect to other approaches in the challenge leaderboard, we can achieve remarkable improvements over the previous proposals (+8% in nDCG@5 with respect to the sixth position) with constrained resources. The code is publicly available at https://tinyurl.com/wiki-imcap.
Source: MULTIMEDIA TOOLS AND APPLICATIONS, vol. 83, pp. 62915-62935
DOI: 10.1007/s11042-023-17977-0
Project(s): AI4Media via OpenAIRE
See at: CNR IRIS Open Access | link.springer.com Open Access | ISTI Repository Open Access | CNR IRIS Restricted
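
The cascade described above pairs a fast model that shortlists captions with a heavier model that re-ranks only the shortlist, which is what keeps inference cost bounded over a large caption pool. The sketch below illustrates that generic two-stage pattern; both scorers are placeholders, not the models used in the paper.

```python
import torch

# Two-stage cascade: a cheap dual-encoder scores the whole caption pool, a
# heavier cross-encoder re-scores only the top-k shortlist.

def fast_scores(image_emb: torch.Tensor, caption_embs: torch.Tensor) -> torch.Tensor:
    # Stage 1 (placeholder): cosine similarity against precomputed caption embeddings.
    image_emb = image_emb / image_emb.norm()
    caption_embs = caption_embs / caption_embs.norm(dim=-1, keepdim=True)
    return caption_embs @ image_emb

def slow_scores(image, captions) -> torch.Tensor:
    # Stage 2 (placeholder): a cross-attention model jointly encoding image and caption.
    return torch.randn(len(captions))

captions = [f"caption {i}" for i in range(10_000)]   # hypothetical multilingual pool
caption_embs = torch.randn(len(captions), 512)       # embedded offline for stage 1
image_emb = torch.randn(512)                          # query image embedding

k = 100
shortlist = fast_scores(image_emb, caption_embs).topk(k).indices
rerank = slow_scores(None, [captions[i] for i in shortlist.tolist()])
best = shortlist[rerank.argmax()].item()
print("best caption:", captions[best])
```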


2024 Conference article Open Access OPEN
Will VISIONE remain competitive in lifelog image search?
Amato G., Bolettieri P., Carrara F., Falchi F., Gennaro C., Messina N., Vadicamo L., Vairo C.
VISIONE is a versatile video retrieval system supporting diverse search functionalities, including free-text, similarity, and temporal searches. Its recent success in securing first place in the 2024 Video Browser Showdown (VBS) highlights its effectiveness. Originally designed for analyzing, indexing, and searching diverse video content, VISIONE can also be adapted to images from lifelog cameras thanks to its reliance on frame-based representations and retrieval mechanisms. In this paper, we present an overview of VISIONE's core characteristics and the adjustments made to accommodate lifelog images. These adjustments primarily focus on enhancing result visualization within the GUI, such as grouping images by date or hour to align with lifelog dataset imagery. It's important to note that while the GUI has been updated, the core search engine and visual content analysis components remain unchanged from the version presented at VBS 2024. Specifically, metadata such as local time, GPS coordinates, and concepts associated with images are not indexed or utilized in the system. Instead, the system relies solely on the visual content of the images, with date and time information extracted from their filenames, which are utilized exclusively within the GUI for visualization purposes. Our objective is to evaluate the system's performance within the Lifelog Search Challenge, emphasizing reliance on visual content analysis without additional metadata.
DOI: 10.1145/3643489.3661122
Project(s): AI4Media via OpenAIRE
See at: IRIS Cnr Open Access | doi.org Restricted | CNR IRIS Restricted
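
The GUI-side adjustment mentioned above, grouping results by date or hour taken from the image filenames, can be illustrated with a few lines of parsing and bucketing. The filename pattern below is an assumption and may not match the actual lifelog naming scheme.

```python
from collections import defaultdict
from datetime import datetime

# Group lifelog result images by (date, hour) parsed from an assumed filename pattern.
def parse_timestamp(filename: str) -> datetime:
    # e.g. "20190103_112233_000.jpg" -> 2019-01-03 11:22:33
    stem = filename.split(".")[0]
    return datetime.strptime("_".join(stem.split("_")[:2]), "%Y%m%d_%H%M%S")

results = ["20190103_112233_000.jpg", "20190103_115900_001.jpg", "20190104_083000_002.jpg"]
groups = defaultdict(list)
for name in results:
    ts = parse_timestamp(name)
    groups[(ts.date(), ts.hour)].append(name)

for (day, hour), items in sorted(groups.items()):
    print(day, f"{hour:02d}:00", len(items), "images")
```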


2024 Conference article Open Access OPEN
Visione 5.0: toward evaluation with novice users
Amato G., Bolettieri P., Carrara F., Falchi F., Gennaro C., Messina N., Vadicamo L., Vairo C.
VISIONE is a video search system that integrates multiple search functionalities, allowing users to search for video segments using textual and visual queries, complemented by temporal search capabilities. It exploits state-of-the-art Artificial Intelligence approaches for visual content analysis and highly efficient indexing techniques to ensure fast response and scalability. In the recently concluded Video Browser Showdown (VBS2024) - a well-established international competition in interactive video retrieval - VISIONE ranked first and scored as the best interactive video search system in four out of seven tasks carried out in the competition. This paper provides an overview of the VISIONE system, emphasizing the improvements made to the system in the last year to improve its usability for novice users. A demonstration video showcasing the system's capabilities across 2,300 hours of diverse video content is available online, as well as a simplified demo of VISIONE.
DOI: 10.1109/cbmi62980.2024.10859203
Project(s): AI4Media via OpenAIRE, National Centre for HPC, Big Data and Quantum Computing, a MUltimedia platform for Content Enrichment and Search in audiovisual archives
See at: CNR IRIS Open Access | ieeexplore.ieee.org Open Access | CNR IRIS Restricted


2024 Conference article Open Access OPEN
Vibration monitoring of historical towers: new contributions from data science
Girardi M., Gurioli G., Messina N., Padovani C., Pellegrini D.
Deep neural networks are used to study the ambient vibrations of the medieval towers of the San Frediano Cathedral and the Guinigi Palace in the historic centre of Lucca. The towers have been continuously monitored for many months via high-sensitivity seismic stations. The recorded data sets integrated with environmental parameters are employed to train a Temporal Fusion Transformer network and forecast the dynamic behaviour of the monitored structures. The results show that the adopted algorithm can learn the main features of the towers’ dynamic response, predict its evolution over time, and detect anomalies.
Source: LECTURE NOTES IN CIVIL ENGINEERING, vol. 514, pp. 15-24. Naples, Italy, 21-24/05/2024
DOI: 10.1007/978-3-031-61421-7_2
See at: IRIS Cnr Open Access | doi.org Restricted | CNR IRIS Restricted


2024 Conference article Open Access OPEN
The devil is in the fine-grained details: evaluating open-vocabulary object detectors for fine-grained understanding
Bianchi L., Carrara F., Messina N., Gennaro C., Falchi F.
Recent advancements in large vision-language models enabled visual object detection in open-vocabulary scenarios, where object classes are defined in free-text formats during inference. In this paper, we aim to probe the state-of-the-art methods for open-vocabulary object detection to determine to what extent they understand fine-grained properties of objects and their parts. To this end, we introduce an evaluation protocol based on dynamic vocabulary generation to test whether models detect, discern, and assign the correct fine-grained description to objects in the presence of hard-negative classes. We contribute with a benchmark suite of increasing difficulty and probing different properties like color, pattern, and material. We further enhance our investigation by evaluating several state-of-the-art open-vocabulary object detectors using the proposed protocol and find that most existing solutions, which shine in standard open-vocabulary benchmarks, struggle to accurately capture and distinguish finer object details. We conclude the paper by highlighting the limitations of current methodologies and exploring promising research directions to overcome the discovered drawbacks. Data and code are available at https://lorebianchi98.github.io/FG-OVD/.
Source: PROCEEDINGS IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, pp. 22520-22529. Seattle (USA), 17-21/06/2024
DOI: 10.1109/cvpr52733.2024.02125
DOI: 10.48550/arxiv.2311.17518
Project(s): SUN via OpenAIRE, a MUltimedia platform for Content Enrichment and Search in audiovisual archives
See at: arXiv.org e-Print Archive Open Access | IRIS Cnr Open Access | ieeexplore.ieee.org Open Access | doi.org Restricted | CNR IRIS Restricted
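
One round of the dynamic-vocabulary evaluation described above can be pictured as follows: for each annotated object, the candidate vocabulary contains the correct fine-grained description plus hard negatives that differ in a single attribute, and the detector is credited only if it scores the positive highest. The captions and the scoring function in this sketch are illustrative assumptions, not the benchmark's actual data.

```python
import random

# One illustrative evaluation round: positive caption plus attribute-level hard negatives.
positive = "a light brown wooden chair"
hard_negatives = [
    "a dark brown wooden chair",    # wrong colour shade
    "a light brown metal chair",    # wrong material
    "a light brown wooden table",   # wrong object category (easier negative)
]
vocabulary = [positive] + hard_negatives

def detector_scores(image_region, vocabulary):
    """Placeholder for an open-vocabulary detector scoring a region against each caption."""
    return [random.random() for _ in vocabulary]

scores = detector_scores(None, vocabulary)
predicted = vocabulary[max(range(len(vocabulary)), key=scores.__getitem__)]
correct = predicted == positive     # contributes to the fine-grained metric
print(predicted, correct)
```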


2024 Journal article Open Access OPEN
Evaluating performance and trends in interactive video retrieval: insights from the 12th VBS competition
Vadicamo L., Arnold R., Bailer W., Carrara F., Gurrin C., Hezel N., Li X., Lokoc J., Lubos S., Ma Z., Messina N., Nguyen T., Peska L., Rossetto L., Sauter L., Schöffmann K., Spiess F., Tran M., Vrochidis S.
This paper conducts a thorough examination of the 12th Video Browser Showdown (VBS) competition, a well-established international benchmarking campaign for interactive video search systems. The annual VBS competition has witnessed a steep rise in the popularity of multimodal embedding-based approaches in interactive video retrieval. Most of the thirteen systems participating in VBS 2023 utilized a CLIP-based cross-modal search model, allowing the specification of free-form text queries to search visual content. This shared emphasis on joint embedding models contributed to balanced performance across various teams. However, the distinguishing factors of the top-performing teams included the adept combination of multiple models and search modes, along with the capabilities of interactive interfaces to facilitate and refine the search process. Our work provides an overview of the state-of-the-art approaches employed by the participating systems and conducts a thorough analysis of their search logs, which record user interactions and results of their queries for each task. Our comprehensive examination of the VBS competition offers assessments of the effectiveness of the retrieval models, browsing efficiency, and user query patterns. Additionally, it provides valuable insights into the evolving landscape of interactive video retrieval and its future challenges.
Source: IEEE ACCESS, vol. 12, pp. 79342-79366
DOI: 10.1109/access.2024.3405638
Project(s): AI4Media via OpenAIRE, SUN via OpenAIRE
See at: CNR IRIS Open Access | ieeexplore.ieee.org Open Access | CNR IRIS Restricted


2024 Conference article Open Access OPEN
VISIONE 5.0: enhanced user interface and AI models for VBS2024
Giuseppe Amato, Paolo Bolettieri, Fabio Carrara, Fabrizio Falchi, Claudio Gennaro, Nicola Messina, Lucia Vadicamo, Claudio Vairo
In this paper, we introduce the fifth release of VISIONE, an advanced video retrieval system offering diverse search functionalities. The user can search for a target video using textual prompts, drawing objects and colors appearing in the target scenes on a canvas, or images as query examples to search for video keyframes with similar content. Compared to the previous version of our system, which was runner-up at VBS 2023, the forthcoming release, set to participate in VBS 2024, showcases a refined user interface that enhances its usability and updated AI models for more effective video content analysis.
Source: LECTURE NOTES IN COMPUTER SCIENCE, vol. 14557, pp. 332-339. Amsterdam, NL, 29/01-2/02/2024
DOI: 10.1007/978-3-031-53302-0_29
Project(s): AI4Media via OpenAIRE, SUN via OpenAIRE
See at: CNR IRIS Open Access | doi.org Restricted | CNR IRIS Restricted


2023 Conference article Open Access OPEN
CrowdSim2: an open synthetic benchmark for object detectors
Foszner P, Szczesna A, Ciampi L, Messina N, Cygan A, Bizon B, Cogiel M, Golba D, Macioszek E, Staniszewski M
Data scarcity has become one of the main obstacles to developing supervised models based on Artificial Intelligence in Computer Vision. Indeed, Deep Learning-based models systematically struggle when applied in new scenarios never seen during training and may not be adequately tested in non-ordinary yet crucial real-world situations. This paper presents and publicly releases CrowdSim2, a new synthetic collection of images suitable for people and vehicle detection gathered from a simulator based on the Unity graphical engine. It consists of thousands of images gathered from various synthetic scenarios resembling the real world, where we varied some factors of interest, such as the weather conditions and the number of objects in the scenes. The labels are automatically collected and consist of bounding boxes that precisely localize objects belonging to the two object classes, leaving out humans from the annotation pipeline. We exploited this new benchmark as a testing ground for some state-of-the-art detectors, showing that our simulated scenarios can be a valuable tool for measuring their performances in a controlled environment.
DOI: 10.5220/0011692500003417
Project(s): AI4Media via OpenAIRE
See at: CNR IRIS Open Access | ISTI Repository Open Access | www.scitepress.org Open Access | CNR IRIS Restricted


2023 Conference article Open Access OPEN
Development of a realistic crowd simulation environment for fine-grained validation of people tracking methods
Foszner P, Szczesna A, Ciampi L, Messina N, Cygan A, Bizon B, Cogiel M, Golba D, Macioszek E, Staniszewski M
Generally, crowd datasets can be collected or generated from real or synthetic sources. Real data is generated by using infrastructure-based sensors (such as static cameras or other sensors). The use of simulation tools can significantly reduce the time required to generate scenario-specific crowd datasets, facilitate data-driven research, and subsequently build functional machine learning models. The main goal of this work was to develop an extension of crowd simulation (named CrowdSim2) and prove its usability in the application of people-tracking algorithms. The simulator is developed using the very popular Unity 3D engine with particular emphasis on the aspects of realism in the environment, weather conditions, traffic, and the movement and models of individual agents. Finally, three tracking methods were used to validate the generated dataset: IOU-Tracker, Deep-Sort, and Deep-TAMA.
DOI: 10.5220/0011691500003417
Project(s): AI4Media via OpenAIRE
See at: CNR IRIS Open Access | ISTI Repository Open Access | www.scitepress.org Open Access | CNR IRIS Restricted


2023 Conference article Restricted
Improving query and assessment quality in text-based interactive video retrieval evaluation
Bailer W, Arnold R, Benz V, Coccomini D, Gkagkas A, Þór Guðmundsson G, Heller S, Þór Jónsson B, Lokoc J, Messina N, Pantelidis N, Wu J
Different task interpretations are a highly undesired element in interactive video retrieval evaluations. When a participating team focuses partially on a wrong goal, the evaluation results might become partially misleading. In this paper, we propose a process for refining known-item and open-set type queries, and preparing the assessors that judge the correctness of submissions to open-set queries. Our findings from recent years reveal that a proper methodology can lead to objective query quality improvements and subjective participant satisfaction with query clarity.
DOI: 10.1145/3591106.3592281
Project(s): AI4Media via OpenAIRE
See at: dl.acm.org Restricted | CNR IRIS Restricted


2023 Conference article Open Access OPEN
Text-to-motion retrieval: towards joint understanding of human motion data and natural language
Messina N, Sedmidubský J, Falchi F, Rebok T
Due to recent advances in pose-estimation methods, human motion can be extracted from a common video in the form of 3D skeleton sequences. Despite wonderful application opportunities, effective and efficient content-based access to large volumes of such spatio-temporal skeleton data still remains a challenging problem. In this paper, we propose a novel content-based text-to-motion retrieval task, which aims at retrieving relevant motions based on a specified natural-language textual description. To define baselines for this uncharted task, we employ the BERT and CLIP language representations to encode the text modality and successful spatio-temporal models to encode the motion modality. We additionally introduce our transformer-based approach, called Motion Transformer (MoT), which employs divided space-time attention to effectively aggregate the different skeleton joints in space and time. Inspired by the recent progress in text-to-image/video matching, we experiment with two widely-adopted metric-learning loss functions. Finally, we set up a common evaluation protocol by defining qualitative metrics for assessing the quality of the retrieved motions, targeting the two recently-introduced KIT Motion-Language and HumanML3D datasets. The code for reproducing our results is available here: https://github.com/mesnico/text-to-motion-retrieval.
DOI: 10.1145/3539618.3592069
Project(s): AI4Media via OpenAIRE
See at: CNR IRIS Open Access | ISTI Repository Open Access | CNR IRIS Restricted
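
The metric-learning setup described above trains text and motion encoders so that matching text-motion pairs end up close in a shared space. A minimal sketch of one widely adopted choice, a symmetric InfoNCE-style contrastive loss, follows; the encoders are omitted and replaced by random placeholder embeddings, and the paper's actual loss functions may differ.

```python
import torch
import torch.nn.functional as F

# Symmetric InfoNCE-style contrastive loss over a batch of text-motion pairs.
def contrastive_loss(text_emb, motion_emb, temperature=0.07):
    text_emb = F.normalize(text_emb, dim=-1)
    motion_emb = F.normalize(motion_emb, dim=-1)
    logits = text_emb @ motion_emb.T / temperature
    targets = torch.arange(len(text_emb))          # matching pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

# Placeholder batch: 32 text-motion pairs already encoded to 256-d vectors.
text_emb = torch.randn(32, 256, requires_grad=True)
motion_emb = torch.randn(32, 256, requires_grad=True)
loss = contrastive_loss(text_emb, motion_emb)
loss.backward()
print(loss.item())
```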


2023 Journal article Open Access OPEN
Interactive video retrieval in the age of effective joint embedding deep models: lessons from the 11th VBS
Lokoc J, Andreadis S, Bailer W, Duane A, Gurrin C, Ma Z, Messina N, Nguyen T N, Peska L, Rossetto L, Sauter L, Schall K, Schoeffmann K, Khan Os, Spiess F, Vadicamo L, Vrochidis S
This paper presents findings of the eleventh Video Browser Showdown competition, where sixteen teams competed in known-item and ad-hoc search tasks. Many of the teams utilized state-of-the-art video retrieval approaches that demonstrated high effectiveness in challenging search scenarios. In this paper, a broad survey of all utilized approaches is presented in connection with an analysis of the performance of participating teams. Specifically, both high-level performance indicators are presented with overall statistics as well as in-depth analysis of the performance of selected tools implementing result set logging. The analysis reveals evidence that the CLIP model represents a versatile tool for cross-modal video retrieval when combined with interactive search capabilities. Furthermore, the analysis investigates the effect of different users and text query properties on the performance in search tasks. Last but not least, lessons learned from search task preparation are presented, and a new direction for ad-hoc search based tasks at Video Browser Showdown is introduced.
Source: MULTIMEDIA SYSTEMS
DOI: 10.1007/s00530-023-01143-5
Project(s): AI4Media via OpenAIRE, XRECO via OpenAIRE
See at: CNR IRIS Open Access | link.springer.com Open Access | ISTI Repository Open Access | ZENODO Open Access | CNR IRIS Restricted


2023 Conference article Open Access OPEN
VISIONE: a large-scale video retrieval system with advanced search functionalities
Amato G, Bolettieri P, Carrara F, Falchi F, Gennaro C, Messina N, Vadicamo L, Vairo C
VISIONE is a large-scale video retrieval system that integrates multiple search functionalities, including free text search, spatial color and object search, visual and semantic similarity search, and temporal search. The system leverages cutting-edge AI technology for visual analysis and advanced indexing techniques to ensure scalability. As demonstrated by its runner-up position in the 2023 Video Browser Showdown competition, VISIONE effectively integrates these capabilities to provide a comprehensive video retrieval solution. A system demo is available online, showcasing its capabilities on over 2300 hours of diverse video content (V3C1+V3C2 dataset) and 12 hours of highly redundant content (Marine dataset). The demo can be accessed at https://visione.isti.cnr.it
DOI: 10.1145/3591106.3592226
Project(s): AI4Media via OpenAIRE
See at: CNR IRIS Open Access | ISTI Repository Open Access | CNR IRIS Restricted


2023 Conference article Open Access OPEN
VISIONE at Video Browser Showdown 2023
Amato G, Bolettieri P, Carrara F, Falchi F, Gennaro C, Messina N, Vadicamo L, Vairo C
In this paper, we present the fourth release of VISIONE, a tool for fast and effective video search on a large-scale dataset. It includes several search functionalities like text search, object and color-based search, semantic and visual similarity search, and temporal search. VISIONE uses ad-hoc textual encoding for indexing and searching video content, and it exploits a full-text search engine as search backend. In this new version of the system, we introduced some changes both to the current search techniques and to the user interface.
DOI: 10.1007/978-3-031-27077-2_48
Project(s): AI4Media via OpenAIRE
See at: CNR IRIS Open Access | ISTI Repository Open Access | ZENODO Open Access | CNR IRIS Restricted
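
The "ad-hoc textual encoding" mentioned above refers to turning dense visual features into token sequences that an off-the-shelf full-text search engine can index. The sketch below shows one generic way to do this (repeat a per-dimension token proportionally to its quantised value); it is only an illustration, and the exact encoding used by VISIONE may differ.

```python
import numpy as np

# Turn a dense feature vector into a "surrogate text": keep non-negative
# components and repeat a per-dimension token proportionally to its value,
# so a Lucene-style full-text engine can index and score it.
def surrogate_text(features: np.ndarray, scale: int = 10) -> str:
    tokens = []
    for i, value in enumerate(features):
        reps = int(round(max(value, 0.0) * scale))   # crude scalar quantisation
        tokens.extend([f"f{i}"] * reps)              # token "f<i>" repeated reps times
    return " ".join(tokens)

vec = np.clip(np.random.randn(8), 0, None)           # placeholder feature vector
print(surrogate_text(vec))                            # e.g. "f0 f0 f3 f5 f5 ..."
```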


2023 Conference article Open Access OPEN
MC-GTA: a synthetic benchmark for multi-camera vehicle tracking
Ciampi L, Messina N, Valenti Ge, Amato G, Falchi F, Gennaro C
Multi-camera vehicle tracking (MCVT) aims to trace multiple vehicles among videos gathered from overlapping and non-overlapping city cameras. It is beneficial for city-scale traffic analysis and management as well as for security. However, developing MCVT systems is tricky, and their real-world applicability is dampened by the lack of data for training and testing computer vision deep learning-based solutions. Indeed, creating new annotated datasets is cumbersome as it requires great human effort and often has to face privacy concerns. To alleviate this problem, we introduce MC-GTA - Multi Camera Grand Tracking Auto, a synthetic collection of images gathered from the virtual world provided by the highly-realistic Grand Theft Auto 5 (GTA) video game. Our dataset has been recorded from several cameras recording urban scenes at various crossroads. The annotations, consisting of bounding boxes localizing the vehicles with associated unique IDs consistent across the video sources, have been automatically generated by interacting with the game engine. To assess this simulated scenario, we conduct a performance evaluation using an MCVT SOTA approach, showing that it can be a valuable benchmark that mitigates the need for real-world data. The MC-GTA dataset and the code for creating new ad-hoc custom scenarios are available at https://github.com/GaetanoV10/GT5-Vehicle-BB.
Source: LECTURE NOTES IN COMPUTER SCIENCE, vol. 14233, pp. 316-327. Udine, Italy, 11-15/09/2023
DOI: 10.1007/978-3-031-43148-7_27
Project(s): AI4Media via OpenAIRE
See at: CNR IRIS Open Access | link.springer.com Open Access | ISTI Repository Open Access | CNR IRIS Restricted


2023 Conference article Open Access OPEN
AIMH Lab 2022 activities for Vision
Ciampi L, Amato G, Bolettieri P, Carrara F, Di Benedetto M, Falchi F, Gennaro C, Messina N, Vadicamo L, Vairo C
The explosion of smartphones and cameras has led to a vast production of multimedia data. Consequently, Artificial Intelligence-based tools for automatically understanding and exploring these data have recently gained much attention. In this short paper, we report some activities of the Artificial Intelligence for Media and Humanities (AIMH) laboratory of the ISTI-CNR, tackling some challenges in the field of Computer Vision for the automatic understanding of visual data and for novel interactive tools aimed at multimedia data exploration. Specifically, we provide innovative solutions based on Deep Learning techniques carrying out typical vision tasks such as object detection and visual counting, with particular emphasis on scenarios characterized by scarcity of labeled data needed for the supervised training and on environments with limited power resources imposing miniaturization of the models. Furthermore, we describe VISIONE, our large-scale video search system designed to search extensive multimedia databases in an interactive and user-friendly manner.
Source: CEUR WORKSHOP PROCEEDINGS, pp. 538-543. Pisa, Italy, 29-31/05/2023
Project(s): AI4Media via OpenAIRE, Future Artificial Intelligence Research

See at: ceur-ws.org Open Access | CNR IRIS Open Access | ISTI Repository Open Access | CNR IRIS Restricted