68 result(s)
Not yet published Conference article Open Access
Talking to DINO: bridging self-supervised vision backbones with language for open-vocabulary segmentation
Barsellotti L., Bianchi L., Messina N., Carrara F., Cornia M., Baraldi L., Falchi F., Cucchiara R.
Open-Vocabulary Segmentation (OVS) aims at segmenting images from free-form textual concepts without predefined training classes. While existing vision-language models such as CLIP can generate segmentation masks by leveraging coarse spatial information from Vision Transformers, they face challenges in spatial localization due to their global alignment of image and text features. Conversely, self-supervised visual models like DINO excel in fine-grained visual encoding but lack integration with language. To bridge this gap, we present Talk2DINO, a novel hybrid approach that combines the spatial accuracy of DINOv2 with the language understanding of CLIP. Our approach aligns the textual embeddings of CLIP to the patch-level features of DINOv2 through a learned mapping function without the need to fine-tune the underlying backbones. At training time, we exploit the attention maps of DINOv2 to selectively align local visual patches with textual embeddings. We show that the powerful semantic and localization abilities of Talk2DINO can enhance the segmentation process, resulting in more natural and less noisy segmentations, and that our approach can also effectively distinguish foreground objects from the background. Experimental results demonstrate that Talk2DINO achieves state-of-the-art performance across several unsupervised OVS benchmarks.
Source: PROCEEDINGS IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION, pp. 22025-22035. Honolulu, Hawaii (USA), 19-23/10/2025
Project(s): Future Artificial Intelligence Research, Italian Strengthening of ESFRI RI RESILIENCE, SUN via OpenAIRE, a MUltimedia platform for Content Enrichment and Search in audiovisual archives

See at: CNR IRIS Open Access | openaccess.thecvf.com Open Access | CNR IRIS Restricted
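The core mechanism the Talk2DINO abstract describes — mapping CLIP textual embeddings into the DINOv2 patch space and assigning each patch to its closest concept — can be sketched in a few lines. This is an illustrative numpy sketch, not the paper's implementation: the dimensions (512-d CLIP text space, 768-d DINOv2 patch space, a 14×14 patch grid) and the mapping `W` (random here, standing in for the learned mapping function) are all assumptions.

```python
import numpy as np

def cosine(a, b):
    # Row-wise cosine similarity: normalize rows of a and b, then take dot products.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 768)) * 0.02     # stand-in for the learned text-to-patch mapping

text_emb = rng.normal(size=(3, 512))       # 3 free-form textual concepts encoded by CLIP
patches = rng.normal(size=(196, 768))      # 14x14 grid of DINOv2 patch features

mapped = text_emb @ W                      # project text embeddings into the patch space
scores = cosine(mapped, patches)           # (3, 196): concept-vs-patch similarity
seg = scores.argmax(axis=0).reshape(14, 14)  # per-patch concept assignment (a coarse mask)
```

With trained weights, `seg` would be upsampled to pixel resolution to produce the final segmentation.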


2025 Conference article Restricted
Towards identity-aware cross-modal retrieval: a dataset and a baseline
Messina N., Vadicamo L., Maltese L., Gennaro C.
Recent advancements in deep learning have significantly enhanced content-based retrieval methods, notably through models like CLIP that map images and texts into a shared embedding space. However, these methods often struggle with domain-specific entities and long-tail concepts absent from their training data, particularly in identifying specific individuals. In this paper, we explore the task of identity-aware cross-modal retrieval, which aims to retrieve images of persons in specific contexts based on natural language queries. This task is critical in various scenarios, such as for searching and browsing personalized video collections or large audio-visual archives maintained by national broadcasters. We introduce a novel dataset, COCO Person FaceSwap (COCO-PFS), derived from the widely used COCO dataset and enriched with deepfake-generated faces from VGGFace2. This dataset addresses the lack of large-scale datasets needed for training and evaluating models for this task. Our experiments assess the performance of different CLIP variations repurposed for this task, including our architecture, Identity-aware CLIP (Id-CLIP), which achieves competitive retrieval performance through targeted fine-tuning. Our contributions lay the groundwork for more robust cross-modal retrieval systems capable of recognizing long-tail identities and contextual nuances. Data and code are available at .
Source: LECTURE NOTES IN COMPUTER SCIENCE, vol. 15572, pp. 437-452. Lucca, Italy, April 6–10, 2025
DOI: 10.1007/978-3-031-88708-6_28
Project(s): Future Artificial Intelligence Research, a MUltimedia platform for Content Enrichment and Search in audiovisual archives


See at: CNR IRIS Restricted


2025 Other Open Access
ISTI-day 2025 Proceedings
Del Corso G., Pedrotti A., Federico G., Gennaro C., Carrara F., Amato G., Di Benedetto M., Gabrielli E., Belli D., Matrullo Z., Miori V., Tolomei G., Waheed T., Marchetti E., Calabrò A., Rossetti G., Stella M., Cazabet R., Abramski K., Cau E., Citraro S., Failla A., Mesina V., Morini V., Pansanella V., Colantonio S., Germanese D., Pascali M. A., Bianchi L., Messina N., Falchi F., Barsellotti L., Pacini G., Cassese M., Puccetti G., Esuli A., Volpi L., Moreo A., Sebastiani F., Sperduti G., Nguyen D., Broccia G., Ter Beek M. H., Ferrari A., Massink M., Belmonte G., Ciancia V., Papini O., Canapa G., Catricalà B., Manca M., Paternò F., Santoro C., Zedda E., Gallo S., Maenza S., Mattioli A., Simeoli L., Rucci D., Carlini E., Dazzi P., Kavalionak H., Mordacchini M., Rulli C., Muntean Cristina Ioana, Nardini F. M., Perego R., Rocchietti G., Lettich F., Renso C., Pugliese C., Casini G., Haldimann J., Meyer T., Assante M., Candela L., Dell'Amico A., Frosini L., Mangiacrapa F., Oliviero A., Pagano P., Panichi G., Peccerillo B., Procaccini M., Mannocci A., Manghi P., Lonetti F., Kang D., Di Giandomenico F., Jee E., Lazzini G., Conti F., Scopigno R., D'Acunto M., Moroni D., Cafiso M., Paradisi P., Callieri M., Pavoni G., Corsini M., De Falco A., Sala F., Saraceni Q., Gattiglia G.
ISTI-Day is an annual information and networking event organized by the Institute of Information Science and Technologies "A. Faedo" (ISTI) of the Italian National Research Council (CNR). The event features an opening talk by the Director of the Dept. DIITET (Emilio F. Campana) as well as an overview of the Institute's activities presented by the ISTI Director (Roberto Scopigno). These institutional segments are complemented by dedicated presentations and round tables featuring former staff members, as well as internal and external collaborators. To foster a network of knowledge and collaboration among newcomers, the 2025 ISTI-Day edition also includes a large poster session that provides a comprehensive overview of current research activities. Each of the 13 laboratories contributes 1–3 posters, highlighting the most innovative work and offering early-career researchers a platform for discussion. These proceedings therefore include the posters selected for ISTI-Day 2025, reflecting the diverse and innovative nature of the Institute's research.

See at: CNR IRIS Open Access | www.isti.cnr.it Open Access | CNR IRIS Restricted


2025 Journal article Open Access
Structural monitoring of heritage buildings via deep learning algorithms
Girardi M., Gurioli G., Messina N.
Monitoring systems constitute a significant, non-invasive tool for verifying the structural health of buildings and infrastructure over time. Deep learning neural networks can be used to analyse data from long-term monitoring systems, such as time series of velocity/acceleration measured at specific points and environmental parameters, and to predict the main features of the buildings’ structural behaviour with respect to ambient stresses. Potential anomalies of the structure’s vibrational features related to damage or unexpected events, such as earthquakes or exceptional loads, can also be detected. The paper focuses on the application of a Temporal Fusion Transformer (TFT) network to data from the dynamic monitoring of a medieval tower in the historic centre of Lucca (Tuscany, Italy).
Source: ERCIM NEWS, vol. 141, pp. 13-14

See at: ercim-news.ercim.eu Open Access | CNR IRIS Open Access | CNR IRIS Restricted


2025 Conference article Open Access
Is CLIP the main roadblock for fine-grained open-world perception?
Bianchi L., Carrara F., Messina N., Falchi F.
Modern applications increasingly demand flexible computer vision models that adapt to novel concepts not encountered during training. This necessity is pivotal in emerging domains like extended reality, robotics, and autonomous driving, which require the ability to respond to open-world stimuli. A key ingredient is the ability to identify objects based on free-form textual queries defined at inference time – a task known as open-vocabulary object detection. Multimodal backbones like CLIP are the main enabling technology for current open-world perception solutions. Despite performing well on generic queries, recent studies highlighted limitations on the fine-grained recognition capabilities in open-vocabulary settings – i.e., for distinguishing subtle object features like color, shape, and material. In this paper, we perform a detailed examination of these open-vocabulary object recognition limitations to find the root cause. We evaluate the performance of CLIP, the most commonly used vision-language backbone, against a fine-grained object-matching benchmark, revealing interesting analogies between the limitations of open-vocabulary object detectors and their backbones. Experiments suggest that the lack of fine-grained understanding is caused by the poor separability of object characteristics in the CLIP latent space. Therefore, we try to understand whether fine-grained knowledge is present in CLIP embeddings but not exploited at inference time due, for example, to the unsuitability of the cosine similarity matching function, which may discard important object characteristics. Our preliminary experiments show that simple CLIP latent-space re-projections help separate fine-grained concepts, paving the way towards the development of backbones inherently able to process fine-grained details. The code for reproducing these experiments is available at https://github.com/lorebianchi98/FG-CLIP.
DOI: 10.1109/cbmi62980.2024.10859215
Project(s): Future Artificial Intelligence Research, Italian Strengthening of ESFRI RI RESILIENCE, SUN via OpenAIRE, a MUltimedia platform for Content Enrichment and Search in audiovisual archives


See at: CNR IRIS Open Access | ieeexplore.ieee.org Open Access | CNR IRIS Restricted
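The matching mechanics this abstract questions — cosine similarity in the original CLIP space versus a learned re-projection applied before matching — can be sketched as follows. This is a hypothetical numpy illustration only: the embeddings are synthetic and the projection `P` is random, merely showing where a trained re-projection would slot into the matching pipeline.

```python
import numpy as np

rng = np.random.default_rng(1)

def cos(a, b):
    # Cosine similarity between two vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

img = rng.normal(size=512)                   # hypothetical CLIP image embedding
cap_pos = img + 0.05 * rng.normal(size=512)  # caption with the correct attribute
cap_neg = img + 0.06 * rng.normal(size=512)  # hard negative with a wrong attribute

# Plain CLIP-style matching: cosine similarity in the original latent space.
raw = [cos(img, cap_pos), cos(img, cap_neg)]

# Re-projected matching: apply a linear map (random here, trained in practice)
# to both sides before computing the similarity.
P = rng.normal(size=(512, 128)) / np.sqrt(512)
proj = [cos(img @ P, cap_pos @ P), cos(img @ P, cap_neg @ P)]
```

The paper's observation is that a suitable learned `P` can spread fine-grained attributes apart in the projected space even when the raw cosine scores are nearly indistinguishable.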


2025 Conference article Restricted
Maybe you are looking for CroQS: Cross-Modal Query Suggestion for text-to-image retrieval
Pacini G., Carrara F., Messina N., Tonellotto N., Amato G., Falchi F.
Query suggestion, a technique widely adopted in information retrieval, enhances system interactivity and the browsing experience of document collections. In cross-modal retrieval, many works have focused on retrieving relevant items from natural language queries, while few have explored query suggestion solutions. In this work, we address query suggestion in cross-modal retrieval, introducing a novel task that focuses on suggesting minimal textual modifications needed to explore visually consistent subsets of the collection, following the premise of “Maybe you are looking for”. To facilitate the evaluation and development of methods, we present a tailored benchmark named CroQS. This dataset comprises initial queries, grouped result sets, and human-defined suggested queries for each group. We establish dedicated metrics to rigorously evaluate the performance of various methods on this task, measuring representativeness, cluster specificity, and similarity of the suggested queries to the original ones. Baseline methods from related fields, such as image captioning and content summarization, are adapted for this task to provide reference performance scores. Although relatively far from human performance, our experiments reveal that both LLM-based and captioning-based methods achieve competitive results on CroQS, improving the recall on cluster specificity by more than 115% and representativeness mAP by more than 52% with respect to the initial query. The dataset, the implementation of the baseline methods and the notebooks containing our experiments are available here: paciosoft.com/CroQS-benchmark/.
Source: LECTURE NOTES IN COMPUTER SCIENCE, vol. 15573, pp. 138-152. Lucca, Italy, April 6–10, 2025
DOI: 10.1007/978-3-031-88711-6_9
Project(s): Future Artificial Intelligence Research, a MUltimedia platform for Content Enrichment and Search in audiovisual archives


See at: CNR IRIS Restricted | link.springer.com Restricted


2025 Journal article Open Access
Joint-dataset learning and cross-consistent regularization for text-to-motion retrieval
Messina N., Sedmidubsky J., Falchi F., Rebok T.
Pose-estimation methods enable extracting human motion from common videos in the structured form of 3D skeleton sequences. Despite great application opportunities, effective content-based access to such spatio-temporal motion data is a challenging problem. In this paper, we focus on the recently introduced text-motion retrieval tasks, which aim to search for database motions that are the most relevant to a specified natural-language textual description (text-to-motion) and vice versa (motion-to-text). Despite recent efforts to explore these promising avenues, a primary challenge remains the insufficient data available to train robust text-motion models effectively. To address this issue, we propose to investigate joint-dataset learning – where we train on multiple text-motion datasets simultaneously – together with the introduction of a Cross-Consistent Contrastive Loss function (CCCL), which regularizes the learned text-motion common space by imposing uni-modal constraints that augment the representation ability of the trained network. To learn a proper motion representation, we also introduce a transformer-based motion encoder, called MoT++, which employs spatio-temporal attention to process skeleton data sequences. We demonstrate the benefits of the proposed approaches on the widely-used KIT Motion-Language and HumanML3D datasets, as well as some results on the recent Motion-X dataset. We perform detailed experimentation on joint-dataset learning and cross-dataset scenarios, showing the effectiveness of each introduced module in a carefully conducted ablation study and, in turn, pointing out the limitations of state-of-the-art methods. The code for reproducing our results is available here: .
Source: ACM TRANSACTIONS ON MULTIMEDIA COMPUTING, COMMUNICATIONS AND APPLICATIONS
DOI: 10.1145/3744565
DOI: 10.48550/arxiv.2407.02104


See at: arXiv.org e-Print Archive Open Access | CNR IRIS Open Access | ACM Transactions on Multimedia Computing Communications and Applications Restricted | doi.org Restricted | CNR IRIS Restricted
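The cross-modal objective underlying work like the above can be illustrated with a symmetric InfoNCE-style contrastive loss over a batch of paired text/motion embeddings. This is a generic sketch of the base loss, not the paper's CCCL (which adds uni-modal cross-consistency constraints on top of such a term):

```python
import numpy as np

def info_nce(text, motion, tau=0.07):
    # Symmetric contrastive loss: matched pairs sit on the diagonal of the
    # similarity matrix and should out-score all mismatched pairs.
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    m = motion / np.linalg.norm(motion, axis=1, keepdims=True)
    logits = t @ m.T / tau

    def xent(l):
        # Cross-entropy of each row against its diagonal (matching) entry.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # Average the text-to-motion and motion-to-text directions.
    return 0.5 * (xent(logits) + xent(logits.T))

loss_aligned = info_nce(np.eye(8), np.eye(8))  # perfectly paired toy batch
```

A well-aligned batch (identical paired embeddings) yields a near-zero loss, while shuffled pairings are penalized heavily.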


2025 Conference article Open Access
Mind the prompt: a novel benchmark for prompt-based class-agnostic counting
Ciampi L., Messina N., Pierucci M., Amato G., Avvenuti M., Falchi F.
Recently, object counting has shifted towards class-agnostic counting (CAC), which counts instances of arbitrary object classes never seen during model training. With advancements in robust vision-and-language foundation models, there is a growing interest in prompt-based CAC, where object categories are specified using natural language. However, we identify significant limitations in current benchmarks for evaluating this task, which hinder both accurate assessment and the development of more effective solutions. Specifically, we argue that the current evaluation protocols do not measure the ability of the model to understand which object has to be counted. This is due to two main factors: (i) the shortcomings of CAC datasets, which primarily consist of images containing objects from a single class, and (ii) the limitations of current counting performance evaluators, which are based on traditional class-specific counting and focus solely on counting errors. To fill this gap, we introduce the Prompt-Aware Counting (PrACo) benchmark. It comprises two targeted tests coupled with evaluation metrics specifically designed to quantitatively measure the robustness and trustworthiness of existing prompt-based CAC models. We evaluate state-of-the-art methods and demonstrate that, although some achieve impressive results on standard class-specific counting metrics, they exhibit a significant deficiency in understanding the input prompt, indicating the need for more careful training procedures or revised designs. The code for reproducing our results is available at https://github.com/ciampluca/PrACo.
DOI: 10.1109/wacv61041.2025.00774
DOI: 10.48550/arxiv.2409.15953
Project(s): SUN via OpenAIRE


See at: arXiv.org e-Print Archive Open Access | CNR IRIS Open Access | ieeexplore.ieee.org Open Access | doi.org Restricted | Archivio della Ricerca - Università di Pisa Restricted | CNR IRIS Restricted


2025 Conference article Open Access
Using Artificial Intelligence for the dynamic monitoring of an old tower in the Medicean Port of Livorno (Italy)
Bartoli G., Betti M., Girardi M., Gurioli G., Messina N., Padovani C., Pellegrini D., Zini G.
This paper applies deep learning techniques to the signals acquired during a long-term dynamic monitoring campaign conducted on the Matilde donjon, a fortified keep belonging to the Old Fortress in the Medicean Port of Livorno, Italy. The time series collected during the dynamic monitoring complemented with the environmental parameters (temperature, wind speed) were used to train a deep learning neural network and forecast the dynamical behaviour of the tower. Although the signals are sparse and noisy, the algorithm can learn the main features of the tower’s dynamic response and detect anomalies and events occurring in the surrounding environment.
Source: LECTURE NOTES IN CIVIL ENGINEERING, vol. 676, pp. 54-62. Porto, Portugal, 2-4/07/2025
DOI: 10.1007/978-3-031-96114-4_7


See at: CNR IRIS Open Access | link.springer.com Open Access | CNR IRIS Restricted


2024 Journal article Open Access
Deep learning and structural health monitoring: temporal fusion transformers for anomaly detection in masonry towers
Falchi F., Girardi M., Gurioli G., Messina N., Padovani C., Pellegrini D.
Detecting anomalies in the vibrational features of age-old buildings is crucial within the Structural Health Monitoring (SHM) framework. The SHM techniques can leverage information from onsite measurements and environmental sources to identify the dynamic properties (such as the frequencies) of the monitored structure, searching for possible deviations or unusual behavior over time. In this paper, the Temporal Fusion Transformer (TFT) network, a deep learning algorithm initially designed for multi-horizon time series forecasting and tested on electricity, traffic, retail, and volatility problems, is applied to SHM. The TFT approach is adopted to investigate the behavior of the Guinigi Tower located in Lucca (Italy) and subjected to a long-term dynamic monitoring campaign. The TFT network is trained on the tower's experimental frequencies enriched with other environmental parameters. The transformer is then employed to predict the vibrational features (natural frequencies, root mean squares values of the velocity time series) and detect possible anomalies or unexpected events by inspecting how much the actual frequencies deviate from the predicted ones. The TFT technique is used to detect the effects of the Viareggio earthquake that occurred on 6 February 2022, and the structural damage induced by three simulated damage scenarios.
Source: MECHANICAL SYSTEMS AND SIGNAL PROCESSING, vol. 215 (issue 111382)
DOI: 10.1016/j.ymssp.2024.111382


See at: CNR IRIS Open Access | ISTI Repository Open Access | www.sciencedirect.com Open Access | CNR IRIS Restricted
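The anomaly criterion this abstract describes — flagging measurements whose actual frequencies deviate too far from the network's forecast — reduces to thresholding forecast residuals. A minimal sketch with a robust (MAD-based) threshold; the 2 Hz frequency track, the noise level, and the simulated drop are invented purely for illustration:

```python
import numpy as np

def detect_anomalies(actual, predicted, k=3.0):
    # Flag samples whose forecast residual exceeds k robust standard deviations.
    resid = actual - predicted
    med = np.median(resid)
    mad = np.median(np.abs(resid - med)) * 1.4826  # robust sigma estimate
    return np.abs(resid - med) > k * mad

# Toy frequency track: stable around 2.0 Hz with one sudden drop at sample 60,
# standing in for an event such as an earthquake or induced damage.
pred = np.full(100, 2.0)
actual = 2.0 + 0.005 * np.random.default_rng(2).normal(size=100)
actual[60] = 1.8
flags = detect_anomalies(actual, pred)
```

In the paper's setting, `pred` comes from the trained TFT forecast rather than a constant, but the deviation test is the same idea.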


2024 Journal article Open Access
Cascaded transformer-based networks for Wikipedia large-scale image-caption matching
Messina N., Coccomini D. A., Esuli A., Falchi F.
With the increasing importance of multimedia and multilingual data in online encyclopedias, novel methods are needed to fill domain gaps and automatically connect different modalities for increased accessibility. For example, Wikipedia is composed of millions of pages written in multiple languages. Images, when present, often lack textual context, thus remaining conceptually floating and harder to find and manage. In this work, we tackle the novel task of associating images from Wikipedia pages with the correct caption among a large pool of available ones written in multiple languages, as required by the image-caption matching Kaggle challenge organized by the Wikimedia Foundation. A system able to perform this task would improve the accessibility and completeness of the underlying multi-modal knowledge graph in online encyclopedias. We propose a cascade of two models powered by the recent Transformer networks able to efficiently and effectively infer a relevance score between the query image data and the captions. We verify through extensive experiments that the proposed cascaded approach effectively handles a large pool of images and captions while keeping the overall computational complexity bounded at inference time. With respect to other approaches in the challenge leaderboard, we can achieve remarkable improvements over the previous proposals (+8% in nDCG@5 with respect to the sixth position) with constrained resources. The code is publicly available at https://tinyurl.com/wiki-imcap.
Source: MULTIMEDIA TOOLS AND APPLICATIONS, vol. 83, pp. 62915-62935
DOI: 10.1007/s11042-023-17977-0
Project(s): AI4Media via OpenAIRE


See at: CNR IRIS Open Access | link.springer.com Open Access | ISTI Repository Open Access | CNR IRIS Restricted
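The cascade pattern this abstract describes — a cheap model pruning the large caption pool, a heavier model re-ranking only the survivors — is a generic retrieval strategy and can be sketched as follows. The toy closeness-based scoring functions stand in for the two Transformer stages and are not the paper's models:

```python
import numpy as np

def cascade_rank(query, captions, cheap, expensive, k=10):
    # Stage 1: score every caption with the cheap model, keep the top-k.
    s1 = np.array([cheap(query, c) for c in captions])
    shortlist = np.argsort(-s1)[:k]
    # Stage 2: re-rank only the shortlist with the expensive model.
    s2 = np.array([expensive(query, captions[i]) for i in shortlist])
    return shortlist[np.argsort(-s2)]

# Toy stand-ins: the cheap scorer sees only a coarse, bucketed distance,
# while the expensive scorer sees the exact one.
cheap = lambda q, c: -round(abs(q - c) / 5)
expensive = lambda q, c: -abs(q - c)
ranking = cascade_rank(5, list(range(100)), cheap, expensive, k=10)
```

The point of the cascade is cost: the expensive model runs on `k` candidates instead of the full pool, keeping inference complexity bounded as the pool grows.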


2024 Conference article Open Access
Will VISIONE remain competitive in lifelog image search?
Amato G., Bolettieri P., Carrara F., Falchi F., Gennaro C., Messina N., Vadicamo L., Vairo C.
VISIONE is a versatile video retrieval system supporting diverse search functionalities, including free-text, similarity, and temporal searches. Its recent success in securing first place in the 2024 Video Browser Showdown (VBS) highlights its effectiveness. Originally designed for analyzing, indexing, and searching diverse video content, VISIONE can also be adapted to images from lifelog cameras thanks to its reliance on frame-based representations and retrieval mechanisms. In this paper, we present an overview of VISIONE's core characteristics and the adjustments made to accommodate lifelog images. These adjustments primarily focus on enhancing result visualization within the GUI, such as grouping images by date or hour to align with lifelog dataset imagery. It's important to note that while the GUI has been updated, the core search engine and visual content analysis components remain unchanged from the version presented at VBS 2024. Specifically, metadata such as local time, GPS coordinates, and concepts associated with images are not indexed or utilized in the system. Instead, the system relies solely on the visual content of the images, with date and time information extracted from their filenames, which are utilized exclusively within the GUI for visualization purposes. Our objective is to evaluate the system's performance within the Lifelog Search Challenge, emphasizing reliance on visual content analysis without additional metadata.
DOI: 10.1145/3643489.3661122
Project(s): AI4Media via OpenAIRE


See at: IRIS Cnr Open Access | doi.org Restricted | CNR IRIS Restricted


2024 Conference article Open Access
VISIONE 5.0: toward evaluation with novice users
Amato G., Bolettieri P., Carrara F., Falchi F., Gennaro C., Messina N., Vadicamo L., Vairo C.
VISIONE is a video search system that integrates multiple search functionalities, allowing users to search for video segments using textual and visual queries, complemented by temporal search capabilities. It exploits state-of-the-art Artificial Intelligence approaches for visual content analysis and highly efficient indexing techniques to ensure fast response and scalability. In the recently concluded Video Browser Showdown (VBS2024) - a well-established international competition in interactive video retrieval - VISIONE ranked first and scored as the best interactive video search system in four out of seven tasks carried out in the competition. This paper provides an overview of the VISIONE system, emphasizing the improvements made over the last year to enhance its usability for novice users. A demonstration video showcasing the system's capabilities across 2,300 hours of diverse video content is available online, as well as a simplified demo of VISIONE.
DOI: 10.1109/cbmi62980.2024.10859203
Project(s): AI4Media via OpenAIRE, National Centre for HPC, Big Data and Quantum Computing, a MUltimedia platform for Content Enrichment and Search in audiovisual archives


See at: CNR IRIS Open Access | ieeexplore.ieee.org Open Access | CNR IRIS Restricted


2024 Conference article Open Access
Vibration monitoring of historical towers: new contributions from data science
Girardi M., Gurioli G., Messina N., Padovani C., Pellegrini D.
Deep neural networks are used to study the ambient vibrations of the medieval towers of the San Frediano Cathedral and the Guinigi Palace in the historic centre of Lucca. The towers have been continuously monitored for many months via high-sensitivity seismic stations. The recorded data sets integrated with environmental parameters are employed to train a Temporal Fusion Transformer network and forecast the dynamic behaviour of the monitored structures. The results show that the adopted algorithm can learn the main features of the towers’ dynamic response, predict its evolution over time, and detect anomalies.
Source: LECTURE NOTES IN CIVIL ENGINEERING, vol. 514, pp. 15-24. Naples, Italy, 21-24/05/2024
DOI: 10.1007/978-3-031-61421-7_2


See at: IRIS Cnr Open Access | doi.org Restricted | CNR IRIS Restricted


2024 Conference article Open Access
The devil is in the fine-grained details: evaluating open-vocabulary object detectors for fine-grained understanding
Bianchi L., Carrara F., Messina N., Gennaro C., Falchi F.
Recent advancements in large vision-language models enabled visual object detection in open-vocabulary scenarios, where object classes are defined in free-text formats during inference. In this paper, we aim to probe the state-of-the-art methods for open-vocabulary object detection to determine to what extent they understand fine-grained properties of objects and their parts. To this end, we introduce an evaluation protocol based on dynamic vocabulary generation to test whether models detect, discern, and assign the correct fine-grained description to objects in the presence of hard-negative classes. We contribute with a benchmark suite of increasing difficulty and probing different properties like color, pattern, and material. We further enhance our investigation by evaluating several state-of-the-art open-vocabulary object detectors using the proposed protocol and find that most existing solutions, which shine in standard open-vocabulary benchmarks, struggle to accurately capture and distinguish finer object details. We conclude the paper by highlighting the limitations of current methodologies and exploring promising research directions to overcome the discovered drawbacks. Data and code are available at https://lorebianchi98.github.io/FG-OVD/.
Source: PROCEEDINGS IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, pp. 22520-22529. Seattle (USA), 17-21/06/2024
DOI: 10.1109/cvpr52733.2024.02125
DOI: 10.48550/arxiv.2311.17518
Project(s): SUN via OpenAIRE, a MUltimedia platform for Content Enrichment and Search in audiovisual archives


See at: arXiv.org e-Print Archive Open Access | IRIS Cnr Open Access | ieeexplore.ieee.org Open Access | doi.org Restricted | CNR IRIS Restricted


2024 Other Open Access
AIMH Research Activities 2024
Aloia N., Amato G., Bartalesi Lenzi V., Bianchi L., Bolettieri P., Bosio C., Carraglia M., Carrara F., Casarosa V., Cassese M., Ciampi L., Coccomini D. A., Concordia C., Connor R., Corbara S., De Martino C., Di Benedetto M., Esuli A., Falchi F., Fazzari E., Gennaro C., Iannello L., Negi K., Lagani G., Lenzi E., Leocata M., Malvaldi M., Meghini C., Messina N., Moreo Fernandez A., Nardi A., Pacini G., Pedrotti A., Pratelli N., Puccetti G., Rabitti F., Savino P., Scotti F., Sebastiani F., Sperduti G., Thanos C., Trupiano L., Vadicamo L., Vairo C., Versienti L., Volpi L.
The AIMH (Artificial Intelligence for Media and Humanities) laboratory is committed to advancing the field of Artificial Intelligence, with a special emphasis on its applications in digital media and the humanities. The lab aims to improve AI technologies, particularly in areas such as deep learning, text analysis, computer vision, multimedia information retrieval, content analysis, recognition, and retrieval. This report summarizes the laboratory’s achievements and activities over the course of 2024.
DOI: 10.32079/isti-ar-2024/001


See at: CNR IRIS Open Access | CNR IRIS Restricted


2024 Journal article Open Access
Evaluating performance and trends in interactive video retrieval: insights from the 12th VBS competition
Vadicamo L., Arnold R., Bailer W., Carrara F., Gurrin C., Hezel N., Li X., Lokoc J., Lubos S., Ma Z., Messina N., Nguyen T., Peska L., Rossetto L., Sauter L., Schöffmann K., Spiess F., Tran M., Vrochidis S.
This paper conducts a thorough examination of the 12th Video Browser Showdown (VBS) competition, a well-established international benchmarking campaign for interactive video search systems. The annual VBS competition has witnessed a steep rise in the popularity of multimodal embedding-based approaches in interactive video retrieval. Most of the thirteen systems participating in VBS 2023 utilized a CLIP-based cross-modal search model, allowing the specification of free-form text queries to search visual content. This shared emphasis on joint embedding models contributed to balanced performance across various teams. However, the distinguishing factors of the top-performing teams included the adept combination of multiple models and search modes, along with the capabilities of interactive interfaces to facilitate and refine the search process. Our work provides an overview of the state-of-the-art approaches employed by the participating systems and conducts a thorough analysis of their search logs, which record user interactions and results of their queries for each task. Our comprehensive examination of the VBS competition offers assessments of the effectiveness of the retrieval models, browsing efficiency, and user query patterns. Additionally, it provides valuable insights into the evolving landscape of interactive video retrieval and its future challenges.
Source: IEEE ACCESS, vol. 12, pp. 79342-79366
DOI: 10.1109/access.2024.3405638
Project(s): AI4Media via OpenAIRE, SUN via OpenAIRE


See at: CNR IRIS Open Access | ieeexplore.ieee.org Open Access | CNR IRIS Restricted


2024 Conference article Open Access
VISIONE 5.0: enhanced user interface and AI models for VBS2024
Giuseppe Amato, Paolo Bolettieri, Fabio Carrara, Fabrizio Falchi, Claudio Gennaro, Nicola Messina, Lucia Vadicamo, Claudio Vairo
In this paper, we introduce the fifth release of VISIONE, an advanced video retrieval system offering diverse search functionalities. The user can search for a target video using textual prompts, drawing objects and colors appearing in the target scenes in a canvas, or images as query examples to search for video keyframes with similar content. Compared to the previous version of our system, which was runner-up at VBS 2023, the forthcoming release, set to participate in VBS 2024, showcases a refined user interface that enhances its usability and updated AI models for more effective video content analysis.
Source: LECTURE NOTES IN COMPUTER SCIENCE, vol. 14557, pp. 332-339. Amsterdam, NL, 29/01-2/02/2024
DOI: 10.1007/978-3-031-53302-0_29
Project(s): AI4Media via OpenAIRE, SUN via OpenAIRE


See at: CNR IRIS Open Access | doi.org Restricted | CNR IRIS Restricted


2023 Conference article Open Access
CrowdSim2: an open synthetic benchmark for object detectors
Foszner P, Szczesna A, Ciampi L, Messina N, Cygan A, Bizon B, Cogiel M, Golba D, Macioszek E, Staniszewski M
Data scarcity has become one of the main obstacles to developing supervised models based on Artificial Intelligence in Computer Vision. Indeed, Deep Learning-based models systematically struggle when applied in new scenarios never seen during training and may not be adequately tested in non-ordinary yet crucial real-world situations. This paper presents and publicly releases CrowdSim2, a new synthetic collection of images suitable for people and vehicle detection gathered from a simulator based on the Unity graphical engine. It consists of thousands of images gathered from various synthetic scenarios resembling the real world, where we varied some factors of interest, such as the weather conditions and the number of objects in the scenes. The labels are automatically collected and consist of bounding boxes that precisely localize objects belonging to the two object classes, leaving out humans from the annotation pipeline. We exploited this new benchmark as a testing ground for some state-of-the-art detectors, showing that our simulated scenarios can be a valuable tool for measuring their performances in a controlled environment.
DOI: 10.5220/0011692500003417
Project(s): AI4Media via OpenAIRE


See at: CNR IRIS Open Access | ISTI Repository Open Access | www.scitepress.org Open Access | CNR IRIS Restricted


2023 Conference article Open Access
Development of a realistic crowd simulation environment for fine-grained validation of people tracking methods
Foszner P, Szczesna A, Ciampi L, Messina N, Cygan A, Bizon B, Cogiel M, Golba D, Macioszek E, Staniszewski M
Generally, crowd datasets can be collected or generated from real or synthetic sources. Real data is generated by using infrastructure-based sensors (such as static cameras or other sensors). The use of simulation tools can significantly reduce the time required to generate scenario-specific crowd datasets, facilitate data-driven research, and support building functional machine learning models. The main goal of this work was to develop an extension of crowd simulation (named CrowdSim2) and prove its usability in the application of people-tracking algorithms. The simulator is developed using the very popular Unity 3D engine with particular emphasis on the aspects of realism in the environment, weather conditions, traffic, and the movement and models of individual agents. Finally, three tracking methods were used to validate the generated dataset: IOU-Tracker, Deep-Sort, and Deep-TAMA.
DOI: 10.5220/0011691500003417
Project(s): AI4Media via OpenAIRE


See at: CNR IRIS Open Access | ISTI Repository Open Access | www.scitepress.org Open Access | CNR IRIS Restricted