124 result(s)
not yet published Conference article Open Access OPEN
Talking to DINO: bridging self-supervised vision backbones with language for open-vocabulary segmentation
Barsellotti L., Bianchi L., Messina N., Carrara F., Cornia M., Baraldi L., Falchi F., Cucchiara R.
Open-Vocabulary Segmentation (OVS) aims at segmenting images from free-form textual concepts without predefined training classes. While existing vision-language models such as CLIP can generate segmentation masks by leveraging coarse spatial information from Vision Transformers, they face challenges in spatial localization due to their global alignment of image and text features. Conversely, self-supervised visual models like DINO excel in fine-grained visual encoding but lack integration with language. To bridge this gap, we present Talk2DINO, a novel hybrid approach that combines the spatial accuracy of DINOv2 with the language understanding of CLIP. Our approach aligns the textual embeddings of CLIP to the patch-level features of DINOv2 through a learned mapping function without the need to fine-tune the underlying backbones. At training time, we exploit the attention maps of DINOv2 to selectively align local visual patches with textual embeddings. We show that the powerful semantic and localization abilities of Talk2DINO can enhance the segmentation process, resulting in more natural and less noisy segmentations, and that our approach can also effectively distinguish foreground objects from the background. Experimental results demonstrate that Talk2DINO achieves state-of-the-art performance across several unsupervised OVS benchmarks. (A minimal sketch of the text-to-patch alignment follows this record.)
Source: PROCEEDINGS IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION, pp. 22025-22035. Honolulu, Hawaii (USA), 19-23/10/2025
Project(s): Future Artificial Intelligence Research, Italian Strengthening of ESFRI RI RESILIENCE, SUN via OpenAIRE, a MUltimedia platform for Content Enrichment and Search in audiovisual archives

See at: CNR IRIS Open Access | openaccess.thecvf.com Open Access | CNR IRIS Restricted
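To make the alignment idea concrete, here is a minimal, hypothetical PyTorch sketch: a small learned mapping projects frozen CLIP text embeddings into the frozen DINOv2 patch-feature space, and per-patch cosine similarities yield a coarse segmentation map. The dimensions and the mapping architecture are illustrative assumptions, not the authors' exact design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextToPatchAlignment(nn.Module):
    def __init__(self, clip_dim=512, dino_dim=768):
        super().__init__()
        # Only this mapping is trained; both backbones stay frozen.
        self.project = nn.Sequential(
            nn.Linear(clip_dim, dino_dim), nn.Tanh(), nn.Linear(dino_dim, dino_dim)
        )

    def forward(self, text_emb, patch_feats):
        # text_emb: (num_classes, clip_dim) CLIP text embeddings
        # patch_feats: (num_patches, dino_dim) DINOv2 patch features
        t = F.normalize(self.project(text_emb), dim=-1)
        p = F.normalize(patch_feats, dim=-1)
        return p @ t.T  # (num_patches, num_classes) similarity map

model = TextToPatchAlignment()
sims = model(torch.randn(3, 512), torch.randn(1369, 768))
labels = sims.argmax(dim=-1)  # per-patch class, reshaped into a mask in practice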


not yet published Conference article Open Access OPEN
Breaking the 2D dependency: what limits 3D-only open-vocabulary scene understanding
D’Orsi Domenico, Carrara Fabio, Falchi Fabrizio, Tonellotto Nicola
Open-vocabulary 3D scene understanding, i.e., recognizing and classifying objects in 3D scenes without being limited to a predefined set of classes, is a foundational task for robotics and extended reality applications. Current leading methods often rely on 2D foundation models to extract semantics, which are then projected into 3D. This paper investigates the viability of a purely 3D-native pipeline, thereby eliminating dependencies on 2D models and reprojections. We systematically explore various architectural combinations using established 3D components. However, our extensive experiments on benchmark datasets reveal significant performance limitations with this direct 3D-native approach, with performance metrics falling short of expectations. Rather than a simple failure, these outcomes provide critical insights into the current deficiencies of existing 3D models when cascaded for complex open-vocabulary tasks. We highlight the lessons learned, identify the pipeline's limitations (e.g., segmenter-encoder domain gap, robustness to imperfect segmentations), and posit future research directions. We argue that a fundamental rethinking of model design and interplay is necessary to realize the potential of truly 3D-native open-vocabulary understanding.
DOI: 10.5281/zenodo.17338754
DOI: 10.5281/zenodo.17338755
Project(s): Social and Human Centered XR

See at: CNR IRIS Open Access | zenodo.org Open Access | ZENODO Restricted | ZENODO Restricted | CNR IRIS Restricted


2026 Conference article Open Access OPEN
ViSketch-GPT: collaborative multi-scale feature extraction for hand-drawn sketch retrieval
Federico Giulio, Carrara Fabio, Gennaro Claudio, Di Benedetto Marco
Understanding the nature of hand-drawn sketches is challenging due to the wide variation in their creation. Federico et al. [10] demonstrated that recognizing complex structural patterns enhances both sketch recognition and generation. Building on this foundation, we explore how the extracted features can also be leveraged for hand-drawn sketch retrieval. In this work, we extend ViSketch-GPT, a multi-scale context extraction model originally designed for classification and generation, to the task of retrieval. The model’s ability to capture intricate details at multiple scales allows it to learn highly discriminative representations, making it well-suited for retrieval applications. Through extensive experiments on the QuickDraw and TU-Berlin datasets, we show that ViSketch-GPT surpasses state-of-the-art methods in sketch retrieval, achieving substantial improvements across multiple evaluation metrics. Our results show that the extracted feature representations, originally designed for classification and generation, are also highly effective for retrieval tasks. This highlights ViSketch-GPT as a versatile and powerful framework for various applications in computer vision and sketch analysis. (An illustrative retrieval sketch follows this record.)
Source: LECTURE NOTES IN COMPUTER SCIENCE, vol. 16134, pp. 3-13. Reykjavik, Iceland, 1–3 October 2025
DOI: 10.1007/978-3-032-06069-3_1
Project(s): Italian Strengthening of ESFRI RI RESILIENCE, SUN via OpenAIRE

See at: CNR IRIS Open Access | link.springer.com Open Access | CNR IRIS Restricted | CNR IRIS Restricted
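A minimal illustration of embedding-based sketch retrieval: rank a gallery by cosine similarity to a query embedding. The 64-d random vectors below are stand-ins for ViSketch-GPT features, not the model's actual output.

import numpy as np

rng = np.random.default_rng(0)
gallery = rng.normal(size=(10_000, 64))   # embeddings of indexed sketches
query = rng.normal(size=64)               # embedding of the query sketch

# Cosine similarity = dot product of L2-normalized vectors.
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)
query /= np.linalg.norm(query)
scores = gallery @ query
top10 = np.argsort(-scores)[:10]          # indices of the best matches
print(top10, scores[top10])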


2026 Journal article Open Access OPEN
Pupillometry and brain-wide c-Fos mapping uncover multimodal mirror emotional contagion related networks of mice
Caldarelli Matteo, Zucca Stefano, Viglione Aurelia, Stella Alessandra, Nisar Rida, Sagona Giulia, Papini Ester M., Carrara Fabio, Bovetti Serena, Mazziotti Raffaele M., Pizzorusso Tommaso
Emotional contagion (ECo) represents a fundamental form of empathy. In this study, we used pupillometry to quantify ECo by assessing pupil responses of a mouse watching another mouse receive a tail shock. Pupil dilation effectively measured both direct and vicarious emotional response thresholds at the individual level through psychometric curve analysis. The pupillary ECo response diminished when the observer could not see the demonstrator, suggesting a multisensory process involving vision. Viewing videos of tail-shocked mice elicited a pupil response in the observer. Brain-wide c-Fos mapping revealed a broad network of 88 brain regions activated during ECo, with all areas activated in the demonstrator also engaged in the observer. Additionally, in some brain regions, correlated activation was detected between each observer-demonstrator pair, indicating that ECo promotes a shared neural state. These findings advance our understanding of the neural basis of shared emotions, with implications for analyzing neuropsychiatric disorder models. (A toy psychometric-curve fit follows this record.)
Source: ISCIENCE
DOI: 10.1016/j.isci.2026.114827

See at: CNR IRIS Open Access | www.cell.com Open Access | CNR IRIS Restricted
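A toy version of the psychometric-curve analysis mentioned above: fit a logistic function to pupil response versus stimulus intensity and read off the threshold as the curve's midpoint. The data are synthetic and the paper's actual fitting procedure may differ.

import numpy as np
from scipy.optimize import curve_fit

def logistic(x, lower, upper, threshold, slope):
    return lower + (upper - lower) / (1 + np.exp(-slope * (x - threshold)))

intensity = np.array([0.1, 0.2, 0.4, 0.8, 1.6, 3.2])      # stimulus intensity (a.u.)
dilation = np.array([0.02, 0.05, 0.12, 0.30, 0.42, 0.45])  # pupil response (a.u.)

params, _ = curve_fit(logistic, intensity, dilation, p0=[0.0, 0.5, 0.8, 3.0])
print(f"estimated response threshold: {params[2]:.2f}")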


2026 Conference article Open Access OPEN
JoinPap: Learning-based matching for the reconstruction of fragmentary papyri
Carrara Fabio, Corsini Massimiliano, Falchi Fabrizio, Messina Nicola
Reconstructing ancient papyri from fragmented pieces is a demanding task, posing significant challenges for papyrologists due to degraded material, subtle texture cues, and a lack of distinct landmarks. This paper introduces JoinPap, an intelligent interactive system designed to foster human-machine collaboration in this specialized domain. JoinPap leverages a self-supervised convolutional autoencoder, trained with a contrastive learning objective on high-resolution papyri scans, to acquire robust and discriminative texture-aware embeddings. These representations capture the continuity of fiber patterns across fragments, enabling a specialized matching algorithm to propose optimal vertical and horizontal alignments. We elaborate on data preparation, network design, training methodology, and integration of the matcher into a user-centered interface that supports fragment manipulation and annotation. JoinPap effectively supports expert-in-the-loop reconstruction by offering high-quality alignment suggestions grounded in visual texture continuity. (A simplified matching sketch follows this record.)
Source: LECTURE NOTES IN COMPUTER SCIENCE, vol. 16170, pp. 296-306. Roma, Italy, 15–19 September 2025
DOI: 10.1007/978-3-032-11381-8_25
Project(s): FAIR - "Future Artificial Intelligence Research" - Spoke 1 "Human-centered AI", JoinPap – Reconstructing Fragmentary Papyri through Human-Machine Interaction

See at: CNR IRIS Open Access | link.springer.com Open Access | CNR IRIS Restricted | CNR IRIS Restricted
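A simplified sketch of the matching step: score candidate fragment pairs by the similarity of embeddings computed on their facing border strips. The embed() stub below stands in for JoinPap's contrastive autoencoder, which is not reproduced here; it is an assumption for illustration only.

import numpy as np

def embed(border_strip: np.ndarray) -> np.ndarray:
    # Placeholder for the texture-aware encoder: a crude intensity
    # histogram as a stand-in fiber-pattern descriptor.
    h, _ = np.histogram(border_strip, bins=32, range=(0, 255), density=True)
    return h / (np.linalg.norm(h) + 1e-9)

def match_score(frag_a: np.ndarray, frag_b: np.ndarray, strip=16) -> float:
    # Compare the right edge of fragment A with the left edge of B.
    return float(embed(frag_a[:, -strip:]) @ embed(frag_b[:, :strip]))

rng = np.random.default_rng(1)
frags = [rng.integers(0, 256, size=(256, 256)) for _ in range(4)]
# Rank all ordered pairs as candidate horizontal alignments.
pairs = sorted(((match_score(a, b), i, j) for i, a in enumerate(frags)
                for j, b in enumerate(frags) if i != j), reverse=True)
print(pairs[:3])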


2026 Journal article Open Access OPEN
Vi-SketchGPT: a novel multi-scale and context-aware representation for sketch generation and classification
Federico Giulio, Amato Giuseppe, Carrara Fabio, Gennaro Claudio, Di Benedetto Marco
Human sketches exhibit substantial variability across individuals in terms of line style, abstraction level and drawing conventions. Unlike realistic images, they provide limited contextual information and rely on highly simplified concept representations. Recognizing and generating sketches therefore requires efficient use of the available information, identification of the most informative local features, interpretation of their meaning within a minimal context, and understanding of the spatial relationships that define the overall structure. In this study, we introduce ViSketch-GPT, a representation and model that can extract these local features, contextualize them within the sketch and encode spatial relationships, thereby enabling a deeper understanding of the sketch structure. Guided by the intuition of the void as information, we leverage Signed Distance Functions (SDF) to reveal this potentially hidden information, organizing it via quadtree decomposition and processing it with a hierarchical Transformer to capture multi-scale dependencies. This structured representation allows the model to support both high-fidelity generation and accurate classification. Experiments on the QuickDraw and TU-Berlin datasets demonstrated that the model classifies sketches with high accuracy while generating outputs that preserve structural coherence, respect part relationships, and capture essential conceptual patterns despite the scarcity of information in the original sketches. (A minimal SDF example follows this record.)
Source: IEEE ACCESS
DOI: 10.1109/access.2026.3659732
Project(s): Italian Strengthening of ESFRI RI RESILIENCE, SUN via OpenAIRE

See at: CNR IRIS Open Access | CNR IRIS Restricted
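A minimal illustration of the "void as information" idea: turn a binary sketch image into a Signed Distance Function, so the empty space around the strokes carries structure. The quadtree and Transformer stages of ViSketch-GPT are not reproduced here.

import numpy as np
from scipy.ndimage import distance_transform_edt

sketch = np.zeros((64, 64), dtype=bool)
sketch[20:44, 31:33] = True   # a toy vertical stroke

# SDF: positive distance outside the stroke, negative inside.
outside = distance_transform_edt(~sketch)
inside = distance_transform_edt(sketch)
sdf = outside - inside
print(sdf.min(), sdf.max())   # near-stroke cells are ~0, far cells large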


2025 Other Open Access OPEN
ISTI-Day 2025 Proceedings
Del Corso G., Pedrotti A., Federico G., Gennaro C., Carrara F., Amato G., Di Benedetto M., Gabrielli E., Belli D., Matrullo Z., Miori V., Tolomei G., Waheed T., Marchetti E., Calabrò A., Rossetti G., Stella M., Cazabet R., Abramski K., Cau E., Citraro S., Failla A., Mesina V., Morini V., Pansanella V., Colantonio S., Germanese D., Pascali M. A., Bianchi L., Messina N., Falchi F., Barsellotti L., Pacini G., Cassese M., Puccetti G., Esuli A., Volpi L., Moreo A., Sebastiani F., Sperduti G., Nguyen D., Broccia G., Ter Beek M. H., Ferrari A., Massink M., Belmonte G., Ciancia V., Papini O., Canapa G., Catricalà B., Manca M., Paternò F., Santoro C., Zedda E., Gallo S., Maenza S., Mattioli A., Simeoli L., Rucci D., Carlini E., Dazzi P., Kavalionak H., Mordacchini M., Rulli C., Muntean Cristina Ioana, Nardini F. M., Perego R., Rocchietti G., Lettich F., Renso C., Pugliese C., Casini G., Haldimann J., Meyer T., Assante M., Candela L., Dell'Amico A., Frosini L., Mangiacrapa F., Oliviero A., Pagano P., Panichi G., Peccerillo B., Procaccini M., Mannocci A., Manghi P., Lonetti F., Kang D., Di Giandomenico F., Jee E., Lazzini G., Conti F., Scopigno R., D'Acunto M., Moroni D., Cafiso M., Paradisi P., Callieri M., Pavoni G., Corsini M., De Falco A., Sala F., Saraceni Q., Gattiglia G.
ISTI-Day is an annual information and networking event organized by the Institute of Information Science and Technologies "A. Faedo" (ISTI) of the Italian National Research Council (CNR). The event features an opening talk by the Director of the DIITET Department (Emilio F. Campana) as well as an overview of the Institute's activities presented by the ISTI Director (Roberto Scopigno). These institutional segments are complemented by dedicated presentations and round tables featuring former staff members, as well as internal and external collaborators. To foster a network of knowledge and collaboration among newcomers, the 2025 ISTI-Day edition also includes a large poster session that provides a comprehensive overview of current research activities. Each of the 13 laboratories contributes 1–3 posters, highlighting the most innovative work and offering early-career researchers a platform for discussion. These proceedings thus include the posters selected for ISTI-Day 2025, reflecting the diverse and innovative nature of the Institute's research.

See at: CNR IRIS Open Access | www.isti.cnr.it Open Access | CNR IRIS Restricted


2025 Conference article Open Access OPEN
Is CLIP the main roadblock for fine-grained open-world perception?
Bianchi L., Carrara F., Messina N., Falchi F.
Modern applications increasingly demand flexible computer vision models that adapt to novel concepts not encountered during training. This necessity is pivotal in emerging domains like extended reality, robotics, and autonomous driving, which require the ability to respond to open-world stimuli. A key ingredient is the ability to identify objects based on free-form textual queries defined at inference time – a task known as open-vocabulary object detection. Multimodal backbones like CLIP are the main enabling technology for current open-world perception solutions. Despite performing well on generic queries, recent studies highlighted limitations on the fine-grained recognition capabilities in open-vocabulary settings – i.e., for distinguishing subtle object features like color, shape, and material. In this paper, we perform a detailed examination of these open-vocabulary object recognition limitations to find the root cause. We evaluate the performance of CLIP, the most commonly used vision-language backbone, against a fine-grained object-matching benchmark, revealing interesting analogies between the limitations of open-vocabulary object detectors and their backbones. Experiments suggest that the lack of fine-grained understanding is caused by the poor separability of object characteristics in the CLIP latent space. Therefore, we try to understand whether fine-grained knowledge is present in CLIP embeddings but not exploited at inference time due, for example, to the unsuitability of the cosine similarity matching function, which may discard important object characteristics. Our preliminary experiments show that simple CLIP latent-space re-projections help separate fine-grained concepts, paving the way towards the development of backbones inherently able to process fine-grained details. The code for reproducing these experiments is available at https://github.com/lorebianchi98/FG-CLIP. (A small matching sketch follows this record.)
DOI: 10.1109/cbmi62980.2024.10859215
Project(s): Future Artificial Intelligence Research, Italian Strengthening of ESFRI RI RESILIENCE, SUN via OpenAIRE, a MUltimedia platform for Content Enrichment and Search in audiovisual archives

See at: CNR IRIS Open Access | ieeexplore.ieee.org Open Access | CNR IRIS Restricted | CNR IRIS Restricted
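A hedged sketch of the diagnosis above: match embeddings with plain cosine similarity, then apply a linear latent-space re-projection intended to better separate fine-grained attributes. The projection below is randomly initialized purely for illustration; in practice it would be trained on attribute-labelled data, and the embeddings are random stand-ins for CLIP outputs.

import torch
import torch.nn.functional as F

def cosine_match(image_emb, text_embs):
    image_emb = F.normalize(image_emb, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)
    return text_embs @ image_emb   # one score per candidate caption

d = 512
image_emb = torch.randn(d)            # stand-in for a CLIP image embedding
text_embs = torch.randn(8, d)         # captions differing in color/material
reproject = torch.nn.Linear(d, d)     # latent-space re-projection (untrained)

plain = cosine_match(image_emb, text_embs)
reprojected = cosine_match(reproject(image_emb), reproject(text_embs))
print(plain.argmax().item(), reprojected.argmax().item())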


2025 Conference article Restricted
"Maybe you are looking for CroQS": cross-modal query suggestion for text-to-image retrieval
Pacini G., Carrara F., Messina N., Tonellotto N., Amato G., Falchi F.
Query suggestion, a technique widely adopted in information retrieval, enhances system interactivity and the browsing experience of document collections. In cross-modal retrieval, many works have focused on retrieving relevant items from natural language queries, while few have explored query suggestion solutions. In this work, we address query suggestion in cross-modal retrieval, introducing a novel task that focuses on suggesting minimal textual modifications needed to explore visually consistent subsets of the collection, following the premise of “Maybe you are looking for”. To facilitate the evaluation and development of methods, we present a tailored benchmark named CroQS. This dataset comprises initial queries, grouped result sets, and human-defined suggested queries for each group. We establish dedicated metrics to rigorously evaluate the performance of various methods on this task, measuring representativeness, cluster specificity, and similarity of the suggested queries to the original ones. Baseline methods from related fields, such as image captioning and content summarization, are adapted for this task to provide reference performance scores. Although relatively far from human performance, our experiments reveal that both LLM-based and captioning-based methods achieve competitive results on CroQS, improving the recall on cluster specificity by more than 115% and representativeness mAP by more than 52% with respect to the initial query. The dataset, the implementation of the baseline methods and the notebooks containing our experiments are available here: paciosoft.com/CroQS-benchmark/. (A clustering sketch follows this record.)
Source: LECTURE NOTES IN COMPUTER SCIENCE, vol. 15573, pp. 138-152. Lucca, Italy, April 6–10, 2025
DOI: 10.1007/978-3-031-88711-6_9
Project(s): Future Artificial Intelligence Research, a MUltimedia platform for Content Enrichment and Search in audiovisual archives

See at: CNR IRIS Restricted | CNR IRIS Restricted | CNR IRIS Restricted | link.springer.com Restricted
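An illustrative first step of a query-suggestion pipeline in the spirit of CroQS: cluster the embeddings of a query's result set so that each visually consistent subset can receive its own refined textual suggestion. The embeddings are synthetic, and CroQS's own grouping and suggestion methods may differ.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
result_embs = rng.normal(size=(200, 128))   # embeddings of retrieved images

kmeans = KMeans(n_clusters=4, random_state=0).fit(result_embs)
for c in range(4):
    members = np.flatnonzero(kmeans.labels_ == c)
    # Each cluster would then be captioned/summarized to produce a
    # "Maybe you are looking for ..." suggestion.
    print(f"cluster {c}: {len(members)} images")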


2025 Conference article Restricted
A comparative demonstration of relevance feedback methods for image retrieval
Scotti F., Vadicamo L., Amato G., Carrara F.
Relevance feedback is a well-established approach to refine search results based on user input, but its comparative evaluation across different methods remains limited in practice. This demonstration paper introduces an interactive platform that supports and compares four relevance feedback methods (Rocchio, PicHunter, Polyadic Search, and SVM-based active learning) under consistent conditions. The primary goal is to enhance the understanding of how different relevance feedback methods affect retrieval performance from both a technical and user-centric perspective. The source code is available at https://github.com/francescascotti16/Demo-Relevance-Feedback, while the demonstration can be found at http://relevance-feedback.isti.cnr.it/. (A Rocchio sketch follows this record.)
Source: LECTURE NOTES IN COMPUTER SCIENCE, pp. 375-383. Reykjavik, Iceland, 1-3/10/2025
DOI: 10.1007/978-3-032-06069-3_30
Project(s): MUCES – a MUltimedia platform for Content Enrichment and Search in audiovisual archives, SUN via OpenAIRE

See at: CNR IRIS Restricted | CNR IRIS Restricted | CNR IRIS Restricted | link.springer.com Restricted
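Classic Rocchio relevance feedback, one of the four methods compared in the demo above: move the query vector toward relevant results and away from non-relevant ones. These are standard textbook weights; the demo's own parameter choices are not known here.

import numpy as np

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    q = alpha * query
    if len(relevant):
        q = q + beta * np.mean(relevant, axis=0)
    if len(nonrelevant):
        q = q - gamma * np.mean(nonrelevant, axis=0)
    return q

rng = np.random.default_rng(3)
query = rng.normal(size=128)
refined = rocchio(query, relevant=rng.normal(size=(5, 128)),
                  nonrelevant=rng.normal(size=(3, 128)))
print(np.linalg.norm(refined - query))  # feedback shifts the query vector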


2025 Journal article Open Access OPEN
Training-free sparse representations of dense vectors for scalable information retrieval
Carrara F., Vadicamo L., Amato G., Gennaro C.
In this paper, we propose and analyze Vec2Doc, a novel training-free method to transform dense vectors into sparse integer vectors, facilitating the use of inverted indexes for information retrieval (IR). The exponential growth of deep learning and artificial intelligence has revolutionized scientific problem-solving in areas such as computer vision, natural language processing, and automatic content generation. These advances have also significantly impacted IR, with a better understanding of natural language and multimodal content analysis leading to more accurate information retrieval. Despite these developments, modern IR relies primarily on the similarity evaluation of dense vectors from the latent spaces of deep neural networks. This dependence introduces substantial challenges in performing similarity searches on large collections containing billions of vectors. Traditional IR methods, which employ inverted indexes and vector space models, are adept at handling sparse vectors but do not work well with dense ones. Vec2Doc attempts to fill this gap by converting dense vectors into a format compatible with conventional inverted index techniques. Our preliminary experimental evaluations show that Vec2Doc is a promising solution to overcome the scalability problems inherent in vector-based IR, offering an alternative method for efficient and accurate large-scale information retrieval. (A hedged sparsification sketch follows this record.)
Source: INFORMATION SYSTEMS, vol. 133 (issue 102567)
DOI: 10.1016/j.is.2025.102567
Project(s): Empowering Knowledge Extraction to Empower Learners, National Centre for HPC, Big Data and Quantum Computing, SUN via OpenAIRE, a MUltimedia platform for Content Enrichment and Search in audiovisual archives

See at: CNR IRIS Open Access | www.sciencedirect.com Open Access | CNR IRIS Restricted | CNR IRIS Restricted
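One plausible training-free dense-to-sparse transform in the spirit of Vec2Doc, not the paper's exact algorithm: split each dimension into positive/negative "terms", keep the largest-magnitude components, and quantize them to small integers usable as term frequencies in an inverted index.

import numpy as np

def to_sparse_doc(vec, top_k=32, levels=255):
    # Treat positive and negative parts of each dimension as distinct terms.
    signed = np.concatenate([np.maximum(vec, 0), np.maximum(-vec, 0)])
    keep = np.argsort(-signed)[:top_k]              # most informative "terms"
    freqs = np.rint(levels * signed[keep] / signed[keep].max()).astype(int)
    return {int(t): int(f) for t, f in zip(keep, freqs) if f > 0}

rng = np.random.default_rng(4)
doc = to_sparse_doc(rng.normal(size=768))
print(len(doc), list(doc.items())[:5])   # {term_id: integer frequency}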


2024 Journal article Open Access OPEN
Interactive multimodal video search: an extended post-evaluation for the VBS 2022 competition
Schall K., Bailer W., Barthel K. -U., Carrara F., Lokoč J., Peška L., Schoeffmann K., Vadicamo L., Vairo C.
CLIP-based text-to-image retrieval has proven to be very effective at the interactive video retrieval competition Video Browser Showdown 2022, where all three top-scoring teams had implemented a variant of a CLIP model in their system. Since the performance of these three systems was quite close, this post-evaluation was designed to get better insights on the differences of the systems and compare the CLIP-based text-query retrieval engines by introducing slight modifications to the original competition settings. An extended analysis of the overall results and the retrieval performance of all systems’ functionalities shows that a strong text retrieval model certainly helps, but has to be coupled with extensive browsing capabilities and other query-modalities to consistently solve known-item-search tasks in a large-scale video database.
Source: INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, vol. 13 (issue 15)
DOI: 10.1007/s13735-024-00325-9
Project(s): AI4Media via OpenAIRE, XReco via OpenAIRE

See at: CNR IRIS Open Access | link.springer.com Open Access | CNR IRIS Restricted | CNR IRIS Restricted


2024 Conference article Open Access OPEN
Beyond human imagination: the art of creating prompt-driven 3D scenes with Generative AI
Federico G., Carrara F., Amato G., Di Benedetto M.
Reconstructing large-scale outdoor environments is essential for advancing XR applications but is hindered by the high cost and limitations of traditional methods like LiDAR, depth sensors, and photogrammetry. We propose generative neural architectures to address these issues. Our initial Spatio-Temporal Diffusion model combines temporal image sequences and coarse spatial data with a novel SDF_MIP representation for efficient training. Building on this, we introduce Neural-Clipmap, a scalable framework using an enhanced octree structure and Triplane representations to refine 3D reconstructions iteratively. Additionally, we leverage monocular RGB image sequences with 2D diffusion priors via Score Distillation Sampling (SDS) to reconstruct missing data, addressing challenges like initialization coherence and color accuracy through a multi-phase inpainting process. These approaches reduce resource requirements while enabling efficient, high-quality reconstructions.
Source: VTT TECHNOLOGY, vol. 432, pp. 201-206. Athens, Greece, 27-29 November 2024
Project(s): Italian Strengthening of ESFRI RI RESILIENCE, SUN via OpenAIRE

See at: cris.vtt.fi Open Access | CNR IRIS Open Access | CNR IRIS Restricted | CNR IRIS Restricted


2024 Conference article Open Access OPEN
Will VISIONE remain competitive in lifelog image search?
Amato G., Bolettieri P., Carrara F., Falchi F., Gennaro C., Messina N., Vadicamo L., Vairo C.
VISIONE is a versatile video retrieval system supporting diverse search functionalities, including free-text, similarity, and temporal searches. Its recent success in securing first place in the 2024 Video Browser Showdown (VBS) highlights its effectiveness. Originally designed for analyzing, indexing, and searching diverse video content, VISIONE can also be adapted to images from lifelog cameras thanks to its reliance on frame-based representations and retrieval mechanisms. In this paper, we present an overview of VISIONE's core characteristics and the adjustments made to accommodate lifelog images. These adjustments primarily focus on enhancing result visualization within the GUI, such as grouping images by date or hour to align with lifelog dataset imagery. It's important to note that while the GUI has been updated, the core search engine and visual content analysis components remain unchanged from the version presented at VBS 2024. Specifically, metadata such as local time, GPS coordinates, and concepts associated with images are not indexed or utilized in the system. Instead, the system relies solely on the visual content of the images, with date and time information extracted from their filenames, which are utilized exclusively within the GUI for visualization purposes. Our objective is to evaluate the system's performance within the Lifelog Search Challenge, emphasizing reliance on visual content analysis without additional metadata. (A toy grouping example follows this record.)
DOI: 10.1145/3643489.3661122
Project(s): AI4Media via OpenAIRE

See at: IRIS Cnr Open Access | IRIS Cnr Open Access | IRIS Cnr Open Access | doi.org Restricted | CNR IRIS Restricted
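A toy version of the GUI-side grouping described above: parse date and time from lifelog image filenames and group results by day and hour. The filename pattern below is an assumption for illustration, not the challenge's actual naming scheme.

from collections import defaultdict
from datetime import datetime

filenames = [
    "20240316_081502_000.jpg",
    "20240316_084910_001.jpg",
    "20240317_121204_002.jpg",
]

by_hour = defaultdict(list)
for name in filenames:
    ts = datetime.strptime(name[:15], "%Y%m%d_%H%M%S")
    by_hour[(ts.date(), ts.hour)].append(name)

for (day, hour), group in sorted(by_hour.items()):
    print(day, f"{hour:02d}:00", group)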


2024 Conference article Open Access OPEN
VISIONE 5.0: toward evaluation with novice users
Amato G., Bolettieri P., Carrara F., Falchi F., Gennaro C., Messina N., Vadicamo L., Vairo C.
VISIONE is a video search system that integrates multiple search functionalities, allowing users to search for video segments using textual and visual queries, complemented by temporal search capabilities. It exploits state-of-the-art Artificial Intelligence approaches for visual content analysis and highly efficient indexing techniques to ensure fast response and scalability. In the recently concluded Video Browser Showdown (VBS2024) - a well-established international competition in interactive video retrieval - VISIONE ranked first and scored as the best interactive video search system in four out of seven tasks carried out in the competition. This paper provides an overview of the VISIONE system, emphasizing the improvements made to the system in the last year to improve its usability for novice users. A demonstration video showcasing the system's capabilities across 2,300 hours of diverse video content is available online, as well as a simplified demo of VISIONE.
DOI: 10.1109/cbmi62980.2024.10859203
Project(s): AI4Media via OpenAIRE, National Centre for HPC, Big Data and Quantum Computing, a MUltimedia platform for Content Enrichment and Search in audiovisual archives

See at: CNR IRIS Open Access | ieeexplore.ieee.org Open Access | CNR IRIS Restricted | CNR IRIS Restricted


2024 Conference article Open Access OPEN
The devil is in the fine-grained details: evaluating open-vocabulary object detectors for fine-grained understanding
Bianchi L., Carrara F., Messina N., Gennaro C., Falchi F.
Recent advancements in large vision-language models enabled visual object detection in open-vocabulary scenarios, where object classes are defined in free-text formats during inference. In this paper, we aim to probe the state-of-the-art methods for open-vocabulary object detection to determine to what extent they understand fine-grained properties of objects and their parts. To this end, we introduce an evaluation protocol based on dynamic vocabulary generation to test whether models detect, discern, and assign the correct fine-grained description to objects in the presence of hard-negative classes. We contribute with a benchmark suite of increasing difficulty and probing different properties like color, pattern, and material. We further enhance our investigation by evaluating several state-of-the-art open-vocabulary object detectors using the proposed protocol and find that most existing solutions, which shine in standard open-vocabulary benchmarks, struggle to accurately capture and distinguish finer object details. We conclude the paper by highlighting the limitations of current methodologies and exploring promising research directions to overcome the discovered drawbacks. Data and code are available at https://lorebianchi98.github.io/FG-OVD/. (A vocabulary-building sketch follows this record.)
Source: PROCEEDINGS IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, pp. 22520-22529. Seattle (USA), 17-21/06/2024
DOI: 10.1109/cvpr52733.2024.02125
DOI: 10.48550/arxiv.2311.17518
Project(s): SUN via OpenAIRE, a MUltimedia platform for Content Enrichment and Search in audiovisual archives

See at: arXiv.org e-Print Archive Open Access | IRIS Cnr Open Access | ieeexplore.ieee.org Open Access | doi.org Restricted | doi.org Restricted | CNR IRIS Restricted | CNR IRIS Restricted
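A sketch of dynamic-vocabulary evaluation with hard negatives, in the spirit of the protocol above: for each object, pair its correct fine-grained caption with negatives obtained by swapping one attribute. The attribute lists and captions below are invented for illustration.

COLORS = ["red", "blue", "green", "black"]

def build_vocabulary(obj_color: str, obj_name: str, n_neg: int = 3):
    positive = f"a {obj_color} {obj_name}"
    negatives = [f"a {c} {obj_name}" for c in COLORS if c != obj_color][:n_neg]
    return positive, negatives

pos, negs = build_vocabulary("red", "backpack")
print(pos, negs)
# An open-vocabulary detector is then asked to rank the positive caption
# above every hard negative for the ground-truth box; failures reveal
# missing fine-grained understanding.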


2024 Other Open Access OPEN
AIMH Research Activities 2024
Aloia N., Amato G., Bartalesi Lenzi V., Bianchi L., Bolettieri P., Bosio C., Carraglia M., Carrara F., Casarosa V., Cassese M., Ciampi L., Coccomini D. A., Concordia C., Connor R., Corbara S., De Martino C., Di Benedetto M., Esuli A., Falchi F., Fazzari E., Gennaro C., Iannello L., Negi K., Lagani G., Lenzi E., Leocata M., Malvaldi M., Meghini C., Messina N., Moreo Fernandez A., Nardi A., Pacini G., Pedrotti A., Pratelli N., Puccetti G., Rabitti F., Savino P., Scotti F., Sebastiani F., Sperduti G., Thanos C., Trupiano L., Vadicamo L., Vairo C., Versienti L., Volpi L.
The AIMH (Artificial Intelligence for Media and Humanities) laboratory is committed to advancing the field of Artificial Intelligence, with a special emphasis on its applications in digital media and the humanities. The lab aims to improve AI technologies, particularly in areas such as deep learning, text analysis, computer vision, multimedia information retrieval, content analysis, recognition, and retrieval. This report summarizes the laboratory’s achievements and activities over the course of 2024.
DOI: 10.32079/isti-ar-2024/001

See at: CNR IRIS Open Access | CNR IRIS Restricted


2024 Journal article Restricted
Bridging virtual and physical worlds through AI
Carrara F.
Immersive and user-friendly experiences will win the Extended Reality (XR) game in the long run. However, setting up good VR/AR scenarios often requires manual asset authoring, which is realistic only when dealing with a limited number of predefined objects and scenes. The Social and hUman ceNtered (SUN) XR project is investigating low-cost, yet effective, solutions to create links between a physical environment and its corresponding one in the virtual world.
Source: ERCIM NEWS, vol. 2024 (issue 137), pp. 17-18
Project(s): SUN via OpenAIRE

See at: ercim-news.ercim.eu Restricted | CNR IRIS Restricted | CNR IRIS Restricted


2024 Conference article Open Access OPEN
Spatio-temporal 3D reconstruction from frame sequences and feature points
Federico G., Carrara F., Amato G., Di Benedetto M.
Reconstructing a large real environment is a fundamental task to promote eXtended Reality adoption in industrial and entertainment fields. However, the short range of depth cameras, the sparsity of LiDAR sensors, and the huge computational cost of Structure-from-Motion pipelines prevent scene replication in near real time. To overcome these limitations, we introduce a spatio-temporal diffusion neural architecture, a generative AI technique that fuses temporal information (i.e., a short temporally-ordered list of color photographs, like sparse frames of a video stream) with an approximate spatial resemblance of the explored environment. Our aim is to modify an existing 3D diffusion neural model to produce a Signed Distance Field volume from which a 3D mesh representation can be extracted. Our results show that the hallucination approach of diffusion models is an effective methodology where a fast reconstruction is a crucial target. (An SDF-to-mesh example follows this record.)
DOI: 10.1145/3672406.3672415
Project(s): Italian Strengthening of ESFRI RI RESILIENCE, SUN via OpenAIRE

See at: dl.acm.org Open Access | CNR IRIS Open Access | CNR IRIS Restricted | CNR IRIS Restricted
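A minimal example of the final step mentioned above: extracting a triangle mesh from a Signed Distance Field volume. An analytic sphere SDF stands in for the diffusion model's output.

import numpy as np
from skimage.measure import marching_cubes

# Build a 64^3 SDF of a sphere of radius 0.5 centered in [-1, 1]^3.
axis = np.linspace(-1.0, 1.0, 64)
x, y, z = np.meshgrid(axis, axis, axis, indexing="ij")
sdf = np.sqrt(x**2 + y**2 + z**2) - 0.5

# The zero level set of the SDF is the surface.
verts, faces, normals, _ = marching_cubes(sdf, level=0.0)
print(verts.shape, faces.shape)   # mesh vertices and triangles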


2024 Journal article Open Access OPEN
Evaluating performance and trends in interactive video retrieval: insights from the 12th VBS competition
Vadicamo L., Arnold R., Bailer W., Carrara F., Gurrin C., Hezel N., Li X., Lokoč J., Lubos S., Ma Z., Messina N., Nguyen T., Peška L., Rossetto L., Sauter L., Schöffmann K., Spiess F., Tran M., Vrochidis S.
This paper conducts a thorough examination of the 12th Video Browser Showdown (VBS) competition, a well-established international benchmarking campaign for interactive video search systems. The annual VBS competition has witnessed a steep rise in the popularity of multimodal embedding-based approaches in interactive video retrieval. Most of the thirteen systems participating in VBS 2023 utilized a CLIP-based cross-modal search model, allowing the specification of free-form text queries to search visual content. This shared emphasis on joint embedding models contributed to balanced performance across various teams. However, the distinguishing factors of the top-performing teams included the adept combination of multiple models and search modes, along with the capabilities of interactive interfaces to facilitate and refine the search process. Our work provides an overview of the state-of-the-art approaches employed by the participating systems and conducts a thorough analysis of their search logs, which record user interactions and results of their queries for each task. Our comprehensive examination of the VBS competition offers assessments of the effectiveness of the retrieval models, browsing efficiency, and user query patterns. Additionally, it provides valuable insights into the evolving landscape of interactive video retrieval and its future challenges.
Source: IEEE ACCESS, vol. 12, pp. 79342-79366
DOI: 10.1109/access.2024.3405638
Project(s): AI4Media via OpenAIRE, SUN via OpenAIRE

See at: CNR IRIS Open Access | ieeexplore.ieee.org Open Access | CNR IRIS Restricted