284 result(s)
2025 Conference article Open Access OPEN
Is CLIP the main roadblock for fine-grained open-world perception?
Bianchi L., Carrara F., Messina N., Falchi F.
Modern applications increasingly demand flexible computer vision models that adapt to novel concepts not encountered during training. This necessity is pivotal in emerging domains like extended reality, robotics, and autonomous driving, which require the ability to respond to open-world stimuli. A key ingredient is the ability to identify objects based on free-form textual queries defined at inference time – a task known as open-vocabulary object detection. Multimodal backbones like CLIP are the main enabling technology for current open-world perception solutions. Despite performing well on generic queries, recent studies highlighted limitations on the fine-grained recognition capabilities in open-vocabulary settings – i.e., for distinguishing subtle object features like color, shape, and material. In this paper, we perform a detailed examination of these open-vocabulary object recognition limitations to find the root cause. We evaluate the performance of CLIP, the most commonly used vision-language backbone, against a fine-grained object-matching benchmark, revealing interesting analogies between the limitations of open-vocabulary object detectors and their backbones. Experiments suggest that the lack of fine-grained understanding is caused by the poor separability of object characteristics in the CLIP latent space. Therefore, we try to understand whether fine-grained knowledge is present in CLIP embeddings but not exploited at inference time due, for example, to the unsuitability of the cosine similarity matching function, which may discard important object characteristics. Our preliminary experiments show that simple CLIP latent-space re-projections help separate fine-grained concepts, paving the way towards the development of backbones inherently able to process fine-grained details. The code for reproducing these experiments is available at https://github.com/lorebianchi98/FG-CLIP.
DOI: 10.1109/cbmi62980.2024.10859215
Project(s): Future Artificial Intelligence Research, Italian Strengthening of ESFRI RI RESILIENCE, SUN via OpenAIRE, a MUltimedia platform for Content Enrichment and Search in audiovisual archives
See at: CNR IRIS Open Access | ieeexplore.ieee.org Open Access | CNR IRIS Restricted
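The paper above attributes weak fine-grained discrimination to poor separability of object attributes in CLIP's latent space and reports that simple latent-space re-projections help. A minimal sketch of that idea, assuming precomputed embeddings; the random vectors, the linear projection head, and the training loop are illustrative stand-ins, not the authors' code (see the linked FG-CLIP repository for that):

```python
# Sketch: probe whether a learned linear re-projection of CLIP embeddings
# separates fine-grained concepts better than raw cosine-similarity matching.
# Random vectors below are placeholders for real CLIP image/text features.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, n = 512, 256                                  # embedding size, number of (image, caption) pairs
img = F.normalize(torch.randn(n, d), dim=-1)     # stand-in image embeddings
txt = F.normalize(torch.randn(n, d), dim=-1)     # stand-in captions (row i matches image i)

# Baseline: plain cosine-similarity matching, as used by open-vocabulary detectors.
baseline_acc = (img @ txt.T).argmax(dim=1).eq(torch.arange(n)).float().mean()

# Learned linear re-projection applied to both modalities before matching.
proj = torch.nn.Linear(d, d, bias=False)
opt = torch.optim.Adam(proj.parameters(), lr=1e-3)
for _ in range(200):
    zi, zt = F.normalize(proj(img), dim=-1), F.normalize(proj(txt), dim=-1)
    logits = zi @ zt.T / 0.07                            # temperature-scaled similarities
    loss = F.cross_entropy(logits, torch.arange(n))      # symmetric term omitted for brevity
    opt.zero_grad(); loss.backward(); opt.step()

zi, zt = F.normalize(proj(img), dim=-1), F.normalize(proj(txt), dim=-1)
probe_acc = (zi @ zt.T).argmax(dim=1).eq(torch.arange(n)).float().mean()
print(f"cosine matching acc: {baseline_acc.item():.2f} | after re-projection: {probe_acc.item():.2f}")
```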


2025 Contribution to book Open Access OPEN
Adversarial magnification to deceive deepfake detection through super resolution
Coccomini D. A., Caldelli R., Amato G., Falchi F., Gennaro C.
Deepfake technology is rapidly advancing, posing significant challenges to the detection of manipulated media content. In parallel, some adversarial attack techniques have been developed to fool deepfake detectors and make deepfakes even harder to detect. This paper explores the application of super resolution techniques as a possible adversarial attack in deepfake detection. Through our experiments, we demonstrate that minimal changes made by these methods to the visual appearance of images can have a profound impact on the performance of deepfake detection systems. We propose a novel attack using super resolution as a quick, black-box and effective method to camouflage fake images and/or generate false alarms on pristine images. Our results indicate that the usage of super resolution can significantly impair the accuracy of deepfake detectors, thereby highlighting the vulnerability of such systems to adversarial attacks. The code to reproduce our experiments is available at: https://github.com/davide-coccomini/Adversarial-Magnification-to-Deceive-Deepfake-Detection-through-Super-Resolution.
Source: COMMUNICATIONS IN COMPUTER AND INFORMATION SCIENCE, vol. 2134, pp. 491-501
DOI: 10.1007/978-3-031-74627-7_41
Project(s): AI4Media via OpenAIRE
See at: CNR IRIS Open Access | link.springer.com Open Access | doi.org Restricted | CNR IRIS Restricted
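The paper above uses super resolution as a quick, black-box perturbation that shifts deepfake-detector scores. A schematic of the attack loop under stated assumptions: the upscaler and the detector below are dummy placeholder modules, not the models used in the paper (see the linked repository for the real pipeline):

```python
# Sketch: super resolution as a black-box "magnification" attack on a deepfake detector.
# DummySR and DummyDetector are placeholders; swap in real models to reproduce the idea.
import torch
import torch.nn.functional as F

class DummySR(torch.nn.Module):
    """Stand-in super-resolution step: bicubic x2 upscale, then resize back."""
    def forward(self, x):
        up = F.interpolate(x, scale_factor=2, mode="bicubic", align_corners=False)
        return F.interpolate(up, size=x.shape[-2:], mode="bicubic", align_corners=False)

class DummyDetector(torch.nn.Module):
    """Stand-in binary deepfake detector returning a fake-probability per image."""
    def __init__(self):
        super().__init__()
        self.head = torch.nn.Linear(3, 1)
    def forward(self, x):
        return torch.sigmoid(self.head(x.mean(dim=(2, 3))))   # pooled RGB -> score

sr_model, detector = DummySR(), DummyDetector()
images = torch.rand(4, 3, 224, 224)                # batch of fake or pristine images

with torch.no_grad():
    scores_before = detector(images)
    scores_after = detector(sr_model(images))      # same images after SR "camouflage"

# A large score shift indicates the detector is sensitive to the SR perturbation.
print((scores_before - scores_after).abs().squeeze(1))
```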


2025 Conference article Restricted
Maybe you are looking for CroQS: Cross-Modal Query Suggestion for text-to-image retrieval
Pacini G., Carrara F., Messina N., Tonellotto N., Amato G., Falchi F.
Query suggestion, a technique widely adopted in information retrieval, enhances system interactivity and the browsing experience of document collections. In cross-modal retrieval, many works have focused on retrieving relevant items from natural language queries, while few have explored query suggestion solutions. In this work, we address query suggestion in cross-modal retrieval, introducing a novel task that focuses on suggesting minimal textual modifications needed to explore visually consistent subsets of the collection, following the premise of “Maybe you are looking for”. To facilitate the evaluation and development of methods, we present a tailored benchmark named CroQS. This dataset comprises initial queries, grouped result sets, and human-defined suggested queries for each group. We establish dedicated metrics to rigorously evaluate the performance of various methods on this task, measuring representativeness, cluster specificity, and similarity of the suggested queries to the original ones. Baseline methods from related fields, such as image captioning and content summarization, are adapted for this task to provide reference performance scores. Our experiments reveal that both LLM-based and captioning-based methods achieve competitive results on CroQS, although still relatively far from human performance, improving recall on cluster specificity by more than 115% and representativeness mAP by more than 52% with respect to the initial query. The dataset, the implementation of the baseline methods and the notebooks containing our experiments are available here: paciosoft.com/CroQS-benchmark/.
Source: LECTURE NOTES IN COMPUTER SCIENCE, vol. 15573, pp. 138-152. Lucca, Italy, April 6–10, 2025
DOI: 10.1007/978-3-031-88711-6_9
Project(s): Future Artificial Intelligence Research, a MUltimedia platform for Content Enrichment and Search in audiovisual archives
See at: CNR IRIS Restricted | link.springer.com Restricted
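The CroQS entry above suggests per-cluster query refinements for a text query's result set. A rough baseline sketch, assuming placeholder image embeddings, k-means grouping, and a hypothetical suggest_query helper standing in for the captioning/LLM-based suggesters evaluated in the paper:

```python
# Sketch: cluster the retrieved images of a text query and produce one suggested
# query per visually consistent cluster. Embeddings and the suggester are placeholders.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
image_embs = rng.normal(size=(100, 512))      # stand-in image embeddings of the result set
initial_query = "a dog playing"

def suggest_query(cluster_embs, base_query):
    # Hypothetical stand-in: a captioning model or LLM would summarize the cluster
    # and append the distinguishing detail to the base query.
    return f"{base_query} ({len(cluster_embs)} visually similar results)"

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(image_embs)
for c in range(kmeans.n_clusters):
    members = image_embs[kmeans.labels_ == c]
    print(c, suggest_query(members, initial_query))
```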


2024 Journal article Open Access OPEN
Deep learning and structural health monitoring: temporal fusion transformers for anomaly detection in masonry towers
Falchi F., Girardi M., Gurioli G., Messina N., Padovani C., Pellegrini D.
Detecting anomalies in the vibrational features of age-old buildings is crucial within the Structural Health Monitoring (SHM) framework. The SHM techniques can leverage information from onsite measurements and environmental sources to identify the dynamic properties (such as the frequencies) of the monitored structure, searching for possible deviations or unusual behavior over time. In this paper, the Temporal Fusion Transformer (TFT) network, a deep learning algorithm initially designed for multi-horizon time series forecasting and tested on electricity, traffic, retail, and volatility problems, is applied to SHM. The TFT approach is adopted to investigate the behavior of the Guinigi Tower located in Lucca (Italy) and subjected to a long-term dynamic monitoring campaign. The TFT network is trained on the tower's experimental frequencies enriched with other environmental parameters. The transformer is then employed to predict the vibrational features (natural frequencies, root mean squares values of the velocity time series) and detect possible anomalies or unexpected events by inspecting how much the actual frequencies deviate from the predicted ones. The TFT technique is used to detect the effects of the Viareggio earthquake that occurred on 6 February 2022, and the structural damage induced by three simulated damage scenarios.
Source: MECHANICAL SYSTEMS AND SIGNAL PROCESSING, vol. 215 (issue 111382)
DOI: 10.1016/j.ymssp.2024.111382
See at: CNR IRIS Open Access | ISTI Repository Open Access | www.sciencedirect.com Open Access | CNR IRIS Restricted
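The SHM paper above flags anomalies by how far measured frequencies drift from the TFT forecast. A minimal sketch of that residual-thresholding step with synthetic numbers standing in for the tower's measured and predicted frequencies (the TFT model itself is omitted; the 3-sigma rule is an illustrative choice):

```python
# Sketch: flag anomalies where the measured natural frequency deviates from the
# forecast by more than 3 standard deviations of the residuals on healthy data.
import numpy as np

rng = np.random.default_rng(0)
predicted = 6.0 + 0.05 * rng.standard_normal(500)    # stand-in TFT forecasts (Hz)
measured = predicted + 0.01 * rng.standard_normal(500)
measured[400:] -= 0.2                                 # simulated damage: a frequency drop

residuals = measured - predicted
sigma = residuals[:300].std()                         # residual spread on healthy samples
anomaly = np.abs(residuals) > 3 * sigma               # 3-sigma detection rule

print("first anomalous sample:", int(np.argmax(anomaly)))
```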


2024 Journal article Open Access OPEN
Cascaded transformer-based networks for Wikipedia large-scale image-caption matching
Messina N, Coccomini Da, Esuli A, Falchi F
With the increasing importance of multimedia and multilingual data in online encyclopedias, novel methods are needed to fill domain gaps and automatically connect different modalities for increased accessibility. For example, Wikipedia is composed of millions of pages written in multiple languages. Images, when present, often lack textual context, thus remaining conceptually floating and harder to find and manage. In this work, we tackle the novel task of associating images from Wikipedia pages with the correct caption among a large pool of available ones written in multiple languages, as required by the image-caption matching Kaggle challenge organized by the Wikimedia Foundation. A system able to perform this task would improve the accessibility and completeness of the underlying multi-modal knowledge graph in online encyclopedias. We propose a cascade of two models powered by the recent Transformer networks able to efficiently and effectively infer a relevance score between the query image data and the captions. We verify through extensive experiments that the proposed cascaded approach effectively handles a large pool of images and captions while keeping the overall computational complexity at inference time bounded. With respect to other approaches in the challenge leaderboard, we can achieve remarkable improvements over the previous proposals (+8% in nDCG@5 with respect to the sixth position) with constrained resources. The code is publicly available at https://tinyurl.com/wiki-imcap.
Source: MULTIMEDIA TOOLS AND APPLICATIONS, vol. 83, pp. 62915-62935
DOI: 10.1007/s11042-023-17977-0
Project(s): AI4Media via OpenAIRE
See at: CNR IRIS Open Access | link.springer.com Open Access | ISTI Repository Open Access | CNR IRIS Restricted
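The cascade described above keeps inference cost bounded by letting a cheap scorer shortlist captions for an expensive re-ranker. A schematic of that two-stage ranking with placeholder scoring functions standing in for the two Transformer models:

```python
# Sketch: two-stage cascade for matching one image against a large caption pool.
# `fast_score` and `slow_score` are placeholders for the efficient first-stage model
# and the heavier cross-attention re-ranker described in the paper.
import numpy as np

rng = np.random.default_rng(0)
num_captions, k = 100_000, 50            # caption pool size and re-ranking budget

def fast_score(image_id, caption_ids):
    # Stage 1 stand-in: dot products of precomputed embeddings would go here.
    return rng.normal(size=len(caption_ids))

def slow_score(image_id, caption_ids):
    # Stage 2 stand-in: a cross-attention Transformer scoring each (image, caption) pair.
    return rng.normal(size=len(caption_ids))

caption_ids = np.arange(num_captions)
coarse = fast_score(0, caption_ids)
shortlist = caption_ids[np.argsort(-coarse)[:k]]     # keep only the k most promising captions
fine = slow_score(0, shortlist)
ranking = shortlist[np.argsort(-fine)]               # final ranking, cost bounded by k

print("top-5 caption ids:", ranking[:5])
```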


2024 Conference article Open Access OPEN
Will VISIONE remain competitive in lifelog image search?
Amato G., Bolettieri P., Carrara F., Falchi F., Gennaro C., Messina N., Vadicamo L., Vairo C.
VISIONE is a versatile video retrieval system supporting diverse search functionalities, including free-text, similarity, and temporal searches. Its recent success in securing first place in the 2024 Video Browser Showdown (VBS) highlights its effectiveness. Originally designed for analyzing, indexing, and searching diverse video content, VISIONE can also be adapted to images from lifelog cameras thanks to its reliance on frame-based representations and retrieval mechanisms. In this paper, we present an overview of VISIONE's core characteristics and the adjustments made to accommodate lifelog images. These adjustments primarily focus on enhancing result visualization within the GUI, such as grouping images by date or hour to align with lifelog dataset imagery. It's important to note that while the GUI has been updated, the core search engine and visual content analysis components remain unchanged from the version presented at VBS 2024. Specifically, metadata such as local time, GPS coordinates, and concepts associated with images are not indexed or utilized in the system. Instead, the system relies solely on the visual content of the images, with date and time information extracted from their filenames, which are utilized exclusively within the GUI for visualization purposes. Our objective is to evaluate the system's performance within the Lifelog Search Challenge, emphasizing reliance on visual content analysis without additional metadata.
DOI: 10.1145/3643489.3661122
Project(s): AI4Media via OpenAIRE
See at: IRIS Cnr Open Access | doi.org Restricted | CNR IRIS Restricted
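The lifelog adaptation above parses date and time from image filenames and uses them only to group results in the GUI. A small sketch of that grouping step, assuming a hypothetical YYYYMMDD_HHMMSS filename pattern (the dataset's real naming convention may differ):

```python
# Sketch: group lifelog keyframes by day and hour using timestamps embedded in filenames.
# The filename pattern below is an assumption made for illustration only.
from collections import defaultdict
from datetime import datetime

filenames = ["20240106_081502.jpg", "20240106_083010.jpg", "20240107_121500.jpg"]

groups = defaultdict(list)
for name in filenames:
    ts = datetime.strptime(name.split(".")[0], "%Y%m%d_%H%M%S")
    groups[(ts.date(), ts.hour)].append(name)          # bucket key: (day, hour)

for (day, hour), items in sorted(groups.items()):
    print(day, f"{hour:02d}:00", items)
```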


2024 Journal article Open Access OPEN
Detecting images generated by diffusers
Coccomini D. A., Esuli A., Falchi F., Gennaro C., Amato G.
In recent years, the field of artificial intelligence has witnessed a remarkable surge in the generation of synthetic images, driven by advancements in deep learning techniques. These synthetic images, often created through complex algorithms, closely mimic real photographs, blurring the lines between reality and artificiality. This proliferation of synthetic visuals presents a pressing challenge: how to accurately and reliably distinguish between genuine and generated images. This article, in particular, explores the task of detecting images generated by text-to-image diffusion models, highlighting the challenges and peculiarities of this field. To evaluate this, we consider images generated from captions in the MSCOCO and Wikimedia datasets using two state-of-the-art models: Stable Diffusion and GLIDE. Our experiments show that it is possible to detect the generated images using simple multi-layer perceptrons (MLPs), starting from features extracted by CLIP or RoBERTa, or using traditional convolutional neural networks (CNNs). These latter models achieve remarkable performance, in particular when pretrained on large datasets. We also observe that models trained on images generated by Stable Diffusion can occasionally detect images generated by GLIDE, but only on the MSCOCO dataset. However, the reverse is not true. Lastly, we find that incorporating the textual information associated with the images can in some cases lead to better generalization capability, especially if textual features are closely related to visual ones. We also discovered that the type of subject depicted in the image can significantly impact performance. This work provides insights into the feasibility of detecting generated images and has implications for security and privacy concerns in real-world applications.
Source: PEERJ COMPUTER SCIENCE, vol. 10
DOI: 10.7717/peerj-cs.2127
Project(s): AI4Media via OpenAIRE
See at: CNR IRIS Open Access | peerj.com Open Access | CNR IRIS Restricted
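The study above finds that a simple MLP over frozen CLIP features can separate real from diffusion-generated images. A minimal sketch of such a classifier, with random vectors standing in for the extracted features (not the paper's data or models):

```python
# Sketch: real-vs-generated classification with a small MLP on top of frozen features.
# Random vectors stand in for CLIP embeddings of real and diffusion-generated images.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(1000, 512))
fake = rng.normal(0.3, 1.0, size=(1000, 512))        # synthetic shift mimicking generator artifacts
X = np.vstack([real, fake])
y = np.array([0] * 1000 + [1] * 1000)                # 0 = real, 1 = generated

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=300, random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```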


2024 Conference article Open Access OPEN
VISIONE 5.0: toward evaluation with novice users
Amato G., Bolettieri P., Carrara F., Falchi F., Gennaro C., Messina N., Vadicamo L., Vairo C.
VISIONE is a video search system that integrates multiple search functionalities, allowing users to search for video segments using textual and visual queries, complemented by temporal search capabilities. It exploits state-of-the-art Artificial Intelligence approaches for visual content analysis and highly efficient indexing techniques to ensure fast response and scalability. In the recently concluded Video Browser Showdown (VBS2024) - a well-established international competition in interactive video retrieval - VISIONE ranked first and scored as the best interactive video search system in four out of seven tasks carried out in the competition. This paper provides an overview of the VISIONE system, emphasizing the changes made to the system in the last year to improve its usability for novice users. A demonstration video showcasing the system's capabilities across 2,300 hours of diverse video content is available online, as well as a simplified demo of VISIONE.
DOI: 10.1109/cbmi62980.2024.10859203
Project(s): AI4Media via OpenAIRE, National Centre for HPC, Big Data and Quantum Computing, a MUltimedia platform for Content Enrichment and Search in audiovisual archives
See at: CNR IRIS Open Access | ieeexplore.ieee.org Open Access | CNR IRIS Restricted


2024 Conference article Open Access OPEN
The devil is in the fine-grained details: evaluating open-vocabulary object detectors for fine-grained understanding
Bianchi L., Carrara F., Messina N., Gennaro C., Falchi F.
Recent advancements in large vision-language models enabled visual object detection in open-vocabulary scenarios, where object classes are defined in free-text formats during inference. In this paper, we aim to probe the state-of-the-art methods for open-vocabulary object detection to determine to what extent they understand fine-grained properties of objects and their parts. To this end, we introduce an evaluation protocol based on dynamic vocabulary generation to test whether models detect, discern, and assign the correct fine-grained description to objects in the presence of hard-negative classes. We contribute with a benchmark suite of increasing difficulty and probing different properties like color, pattern, and material. We further enhance our investigation by evaluating several state-of-the-art open-vocabulary object detectors using the proposed protocol and find that most existing solutions, which shine in standard open-vocabulary benchmarks, struggle to accurately capture and distinguish finer object details. We conclude the paper by highlighting the limitations of current methodologies and exploring promising research directions to overcome the discovered drawbacks. Data and code are available at https://lorebianchi98.github.io/FG-OVD/.
Source: PROCEEDINGS IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, pp. 22520-22529. Seattle (USA), 17-21/06/2024
DOI: 10.1109/cvpr52733.2024.02125
DOI: 10.48550/arxiv.2311.17518
Project(s): SUN via OpenAIRE, a MUltimedia platform for Content Enrichment and Search in audiovisual archives
See at: arXiv.org e-Print Archive Open Access | IRIS Cnr Open Access | ieeexplore.ieee.org Open Access | doi.org Restricted | CNR IRIS Restricted
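The FG-OVD protocol above builds dynamic vocabularies where hard negatives differ from the positive caption by one fine-grained attribute. A toy sketch of such a vocabulary builder; the attribute lists and single-swap rule are illustrative, not the benchmark's actual generation procedure:

```python
# Sketch: build a per-object vocabulary of one positive caption plus hard negatives
# obtained by swapping a single fine-grained attribute (color, material, ...).
attributes = {
    "color": ["red", "blue", "green"],
    "material": ["leather", "plastic", "wooden"],
}

def hard_negatives(positive, n_per_type=2):
    negatives = []
    for attr_type, values in attributes.items():
        present = [v for v in values if v in positive]
        if not present:
            continue                                  # attribute type not mentioned in the caption
        swaps = [v for v in values if v not in present][:n_per_type]
        negatives += [positive.replace(present[0], s) for s in swaps]
    return negatives

positive = "a red leather chair"
vocabulary = [positive] + hard_negatives(positive)
print(vocabulary)   # the detector must rank the positive above every hard negative
```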


2024 Journal article Open Access OPEN
Selective state models are what you need for animal action recognition
Fazzari E., Romano D., Falchi F., Stefanini C.
Recognizing animal actions provides valuable insights into animal welfare, yielding crucial information for agricultural, ethological, and neuroscientific research. While video-based action recognition models have been applied to this task, current approaches often rely on computationally intensive Transformer layers, limiting their practical application in field settings such as farms and wildlife reserves. This study introduces Mamba-MSQNet, a novel architecture family for multilabel Animal Action Recognition using Selective State Space Models. By transforming the state-of-the-art MSQNet model with Mamba blocks, we achieve significant reductions in computational requirements: up to 90% fewer Floating point OPerations and 78% fewer parameters compared to MSQNet. These optimizations not only make the model more efficient but also enable it to outperform Transformer-based counterparts on the Animal Kingdom dataset, achieving a mean Average Precision of 74.6, marking an improvement over previous architectures. This combination of enhanced efficiency and improved performance represents a significant advancement in the field of animal action recognition. The dramatic reduction in computational demands, coupled with a performance boost, opens new possibilities for real-time animal behavior monitoring in resource-constrained environments. This enhanced efficiency could revolutionize how we observe and analyze animal behavior, potentially leading to breakthroughs in animal welfare assessment, behavioral studies, and conservation efforts.
Source: ECOLOGICAL INFORMATICS
DOI: 10.1016/j.ecoinf.2024.102955
See at: Ecological Informatics Open Access | CNR IRIS Open Access | www.sciencedirect.com Open Access | CNR IRIS Restricted


2024 Journal article Open Access OPEN
In the wild video violence detection: an unsupervised domain adaptation approach
Ciampi L., Santiago C., Falchi F., Gennaro C., Amato G.
This work addresses the challenge of video violence detection in data-scarce scenarios, focusing on bridging the domain gap that often hinders the performance of deep learning models when applied to unseen domains. We present a novel unsupervised domain adaptation (UDA) scheme designed to effectively mitigate this gap by combining supervised learning in the training (source) domain with unlabeled test (target) data. We employ single-image classification and multiple instance learning (MIL) to select frames with the highest classification scores, and, upon this, we exploit UDA techniques to adapt the model to unlabeled target domains. We perform an extensive experimental evaluation, using general-context data as the source domain and target domain datasets collected in specific environments, such as violent/non-violent actions in hockey matches and public transport. The results demonstrate that our UDA pipeline substantially enhances model performance, improving generalization capabilities in novel scenarios without requiring additional labeled data.
Source: SN COMPUTER SCIENCE, vol. 5 (issue 7)
DOI: 10.1007/s42979-024-03126-3
Project(s): "FAIR - Future Artificial Intelligence Research" - Spoke 1 "Human-centered AI", AI4Media via OpenAIRE, SUN via OpenAIRE
See at: CNR IRIS Open Access | link.springer.com Open Access | CNR IRIS Restricted
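The pipeline above scores individual frames and keeps only the highest-scoring ones per video (multiple instance learning) before the domain-adaptation step. A sketch of that top-k selection with random stand-in scores; the frame classifier and the UDA stage themselves are omitted:

```python
# Sketch: MIL-style selection of the k highest-scoring frames per unlabeled target video,
# which would then feed the unsupervised domain adaptation step.
import torch

torch.manual_seed(0)
num_videos, frames_per_video, k = 8, 32, 4
frame_scores = torch.rand(num_videos, frames_per_video)     # stand-in classifier scores

topk_scores, topk_idx = frame_scores.topk(k, dim=1)         # k most "violent-looking" frames
video_score = topk_scores.mean(dim=1)                       # video-level aggregation

print("selected frame indices (video 0):", topk_idx[0].tolist())
print("video-level scores:", [round(v, 2) for v in video_score.tolist()])
```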


2024 Conference article Open Access OPEN
Robustness and generalization of synthetic images detectors
Coccomini D. A., Caldelli R., Gennaro C., Fiameni G., Amato G., Falchi F.
In recent times, the increasing spread of synthetic media, known as deepfakes, has been made possible by the rapid progress in artificial intelligence technologies, especially deep learning algorithms. Growing worries about the increasing availability and believability of deepfakes have spurred researchers to concentrate on developing methods to detect them. In this field, researchers at ISTI CNR's AIMH Lab, in collaboration with researchers from other organizations, have conducted research, investigations, and projects to contribute to combating this trend, exploring new solutions and threats. This article summarizes the most recent efforts made in this area by our researchers and in collaboration with other institutions and experts.

See at: ceur-ws.org Open Access | CNR IRIS Open Access | CNR IRIS Restricted


2024 Conference article Open Access OPEN
You write like a GPT
Esuli A., Falchi F., Malvaldi M., Puccetti G.
We investigate how Raymond Queneau's Exercises in Style are evaluated by automatic methods for the detection of artificially-generated text. We work with Queneau's original French version and the Italian translation by Umberto Eco. We start by comparing how various methods for the detection of automatically generated text, also using different large language models, evaluate the different styles in the work. We then link this automatic evaluation to distinct characteristics related to the content and structure of the various styles. This work is an initial attempt at exploring how methods for the detection of artificially-generated text can find application as tools to evaluate the qualities and characteristics of human writing, to support better writing in terms of originality, informativeness, and clarity.
Source: CEUR WORKSHOP PROCEEDINGS, vol. 3878. Pisa, Italy, 4-6/12/2024
Project(s): Future Artificial Intelligence Research

See at: ceur-ws.org Open Access | CNR IRIS Open Access | CNR IRIS Restricted
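The study above runs detectors of machine-generated text over the different styles of Exercises in Style. One common ingredient of such detectors is the perplexity of a passage under a language model; a minimal sketch with GPT-2 via Hugging Face Transformers (the model choice and the example sentence are illustrative, not necessarily what the paper uses):

```python
# Sketch: perplexity of a passage under a causal language model, a common signal
# used by detectors of machine-generated text. GPT-2 is an illustrative choice.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss          # mean token-level cross-entropy
    return float(torch.exp(loss))

print(perplexity("On a crowded bus, a young man with a long neck complains about jostling."))
```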


2024 Journal article Open Access OPEN
Scalable bio-inspired training of Deep Neural Networks with FastHebb
Lagani G., Falchi F., Gennaro C., Fassold H., Amato G.
Recent work on sample efficient training of Deep Neural Networks (DNNs) proposed a semi-supervised methodology based on biologically inspired Hebbian learning, combined with traditional backprop-based training. Promising results were achieved on various computer vision benchmarks, in scenarios of scarce labeled data availability. However, current Hebbian learning solutions can hardly address large-scale scenarios due to their demanding computational cost. In order to tackle this limitation, in this contribution, we investigate a novel solution, named FastHebb (FH), based on the reformulation of Hebbian learning rules in terms of matrix multiplications, which can be executed more efficiently on GPU. Starting from Soft-Winner-Takes-All (SWTA) and Hebbian Principal Component Analysis (HPCA) learning rules, we formulate their improved FH versions: SWTA-FH and HPCA-FH. We experimentally show that the proposed approach accelerates training speed up to 70 times, allowing us to gracefully scale Hebbian learning experiments on large datasets and network architectures such as ImageNet and VGG.
Source: NEUROCOMPUTING, vol. 595
DOI: 10.1016/j.neucom.2024.127867
See at: CNR IRIS Open Access | www.sciencedirect.com Open Access | CNR IRIS Restricted
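FastHebb's key move, per the abstract above, is expressing Hebbian updates as matrix multiplications so they run efficiently on GPU. A toy sketch of a batched soft-winner-takes-all (SWTA) update in that vectorized form, Δw_i = Σ_b y_bi (x_b − w_i); this illustrates the idea only and is not the library's implementation:

```python
# Sketch: batched SWTA Hebbian update written as matrix multiplications.
# For each unit i: delta_w_i = sum_b y[b, i] * (x[b] - w_i), with y a softmax over units.
import torch

torch.manual_seed(0)
batch, in_dim, n_units, lr, temperature = 128, 784, 64, 1e-2, 0.1

x = torch.rand(batch, in_dim)                        # input batch
W = torch.randn(n_units, in_dim) * 0.01              # one weight vector per unit

y = torch.softmax(x @ W.T / temperature, dim=1)      # soft winner-takes-all responses
delta_W = y.T @ x - y.sum(dim=0, keepdim=True).T * W # whole-batch Hebbian update via matmuls
W += lr * delta_W

print("update norm:", delta_W.norm().item())
```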


2024 Conference article Open Access OPEN
VISIONE 5.0: enhanced user interface and AI models for VBS2024
Giuseppe Amato, Paolo Bolettieri, Fabio Carrara, Fabrizio Falchi, Claudio Gennaro, Nicola Messina, Lucia Vadicamo, Claudio Vairo
In this paper, we introduce the fifth release of VISIONE, an advanced video retrieval system offering diverse search functionalities. The user can search for a target video using textual prompts, by drawing objects and colors appearing in the target scenes on a canvas, or by providing images as query examples to search for video keyframes with similar content. Compared to the previous version of our system, which was runner-up at VBS 2023, the forthcoming release, set to participate in VBS 2024, showcases a refined user interface that enhances its usability and updated AI models for more effective video content analysis.
Source: LECTURE NOTES IN COMPUTER SCIENCE, vol. 14557, pp. 332-339. Amsterdam, NL, 29/01-2/02/2024
DOI: 10.1007/978-3-031-53302-0_29
Project(s): AI4Media via OpenAIRE, SUN via OpenAIRE
See at: CNR IRIS Open Access | doi.org Restricted | CNR IRIS Restricted


2024 Journal article Open Access OPEN
MINTIME: Multi-identity size-invariant video deepfake detection
Coccomini D. A., Zilos G. K., Amato G., Caldelli R., Falchi F., Papadopoulos S., Gennaro C.
In this paper, we present MINTIME, a video deepfake detection method that effectively captures spatial and temporal inconsistencies in videos that depict multiple individuals and varying face sizes. Unlike previous approaches that either employ simplistic a-posteriori aggregation schemes, i.e., averaging or max operations, or only focus on the largest face in the video, our proposed method learns to accurately detect spatio-temporal inconsistencies across multiple identities in a video through a Spatio-Temporal Transformer combined with a Convolutional Neural Network backbone. This is achieved through an Identity-aware Attention mechanism that applies a masking operation on the face sequence to process each identity independently, which enables effective video-level aggregation. Furthermore, our system incorporates two novel embedding schemes: (i) the Temporal Coherent Positional Embedding, which encodes the temporal information of the face sequences of each identity, and (ii) the Size Embedding, which captures the relative sizes of the faces to the video frames. MINTIME achieves state-of-the-art performance on the ForgeryNet dataset, with a remarkable improvement of up to 14% AUC in videos containing multiple people. Moreover, it demonstrates very robust generalization capabilities in cross-forgery and cross-dataset settings. The code is publicly available at: https://github.com/davide-coccomini/MINTIME-Multi-Identity-size-iNvariant-TIMEsformer-for-Video-Deepfake-Detection.
Source: IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, vol. 19, pp. 6084-6096
DOI: 10.1109/tifs.2024.3409054
Project(s): AI4Media via OpenAIRE
See at: CNR IRIS Open Access | ieeexplore.ieee.org Open Access | CNR IRIS Restricted
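The Identity-aware Attention described above masks the face-token sequence so that each identity is processed independently. A toy sketch of building such a block mask for PyTorch 2's scaled_dot_product_attention; the token layout and mask convention are illustrative assumptions, not MINTIME's code:

```python
# Sketch: block attention mask so face tokens only attend to tokens of the same identity,
# an illustration of identity-aware masking for video-level aggregation.
import torch
import torch.nn.functional as F

identity_ids = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2])   # identity of each face token
n, d = identity_ids.numel(), 32

# Boolean (n, n) mask: True where two tokens belong to the same identity.
mask = identity_ids.unsqueeze(0) == identity_ids.unsqueeze(1)

q = k = v = torch.randn(1, 1, n, d)                      # (batch, heads, tokens, dim)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)  # True = attend, False = block
print(out.shape)
```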


2023 Journal article Open Access OPEN
On the generalization of Deep Learning models in video deepfake detection
Coccomini Da, Caldelli R, Falchi F, Gennaro C
The increasing use of deep learning techniques to manipulate images and videos, commonly referred to as "deepfakes", is making it more challenging to differentiate between real and fake content. While various deepfake detection systems have been developed, they often struggle to detect deepfakes in real-world situations. In particular, these methods are often unable to effectively distinguish images or videos modified using novel techniques which have not been used in the training set. In this study, we carry out an analysis of different deep learning architectures in an attempt to understand which is more capable of generalizing the concept of deepfake. According to our results, Convolutional Neural Networks (CNNs) seem to be more capable of storing specific anomalies and thus excel on datasets with a limited number of elements and manipulation methodologies. The Vision Transformer, conversely, is more effective when trained with more varied datasets, achieving stronger generalization capabilities than the other methods analysed. Finally, the Swin Transformer appears to be a good alternative for using an attention-based method in a more limited data regime and performs very well in cross-dataset scenarios. All the analysed architectures seem to look at deepfakes in a different way, but since generalization capability is essential in a real-world environment, the experiments carried out suggest that the attention-based architectures provide superior performance.
Source: JOURNAL OF IMAGING, vol. 9 (issue 5)
DOI: 10.3390/jimaging9050089
DOI: 10.20944/preprints202303.0161.v1
Project(s): AI4Media via OpenAIRE
See at: doi.org Open Access | Journal of Imaging Open Access | CNR IRIS Open Access | www.mdpi.com Open Access | CNR IRIS Restricted


2023 Conference article Open Access OPEN
Text-to-motion retrieval: towards joint understanding of human motion data and natural language
Messina N, Sedmidubský J, Falchi F, Rebok T
Due to recent advances in pose-estimation methods, human motion can be extracted from a common video in the form of 3D skeleton sequences. Despite wonderful application opportunities, effective and efficient content-based access to large volumes of such spatio-temporal skeleton data still remains a challenging problem. In this paper, we propose a novel content-based text-to-motion retrieval task, which aims at retrieving relevant motions based on a specified natural-language textual description. To define baselines for this uncharted task, we employ the BERT and CLIP language representations to encode the text modality and successful spatio-temporal models to encode the motion modality. We additionally introduce our transformer-based approach, called Motion Transformer (MoT), which employs divided space-time attention to effectively aggregate the different skeleton joints in space and time. Inspired by the recent progress in text-to-image/video matching, we experiment with two widely-adopted metric-learning loss functions. Finally, we set up a common evaluation protocol by defining qualitative metrics for assessing the quality of the retrieved motions, targeting the two recently-introduced KIT Motion-Language and HumanML3D datasets. The code for reproducing our results is available here: https://github.com/mesnico/text-to-motion-retrieval.
DOI: 10.1145/3539618.3592069
Project(s): AI4Media via OpenAIRE
See at: CNR IRIS Open Access | ISTI Repository Open Access | CNR IRIS Restricted
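The text-to-motion baselines above are trained with widely adopted metric-learning losses. A minimal sketch of one standard choice, a symmetric InfoNCE-style contrastive loss between matched text and motion embeddings; random vectors stand in for the outputs of the text encoder and the Motion Transformer:

```python
# Sketch: symmetric contrastive (InfoNCE-style) loss between matched text and
# motion embeddings, a standard metric-learning objective for cross-modal retrieval.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, dim, temperature = 32, 256, 0.07
text_emb = F.normalize(torch.randn(batch, dim), dim=-1)     # stand-in text embeddings
motion_emb = F.normalize(torch.randn(batch, dim), dim=-1)   # stand-in motion embeddings

logits = text_emb @ motion_emb.T / temperature               # pairwise similarities
targets = torch.arange(batch)                                # i-th text matches i-th motion
loss = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
print("contrastive loss:", loss.item())
```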


2023 Conference article Open Access OPEN
VISIONE: a large-scale video retrieval system with advanced search functionalities
Amato G, Bolettieri P, Carrara F, Falchi F, Gennaro C, Messina N, Vadicamo L, Vairo C
VISIONE is a large-scale video retrieval system that integrates multiple search functionalities, including free text search, spatial color and object search, visual and semantic similarity search, and temporal search. The system leverages cutting-edge AI technology for visual analysis and advanced indexing techniques to ensure scalability. As demonstrated by its runner-up position in the 2023 Video Browser Showdown competition, VISIONE effectively integrates these capabilities to provide a comprehensive video retrieval solution. A system demo is available online, showcasing its capabilities on over 2300 hours of diverse video content (V3C1+V3C2 dataset) and 12 hours of highly redundant content (Marine dataset). The demo can be accessed at https://visione.isti.cnr.it
DOI: 10.1145/3591106.3592226
Project(s): AI4Media via OpenAIRE
See at: CNR IRIS Open Access | ISTI Repository Open Access | CNR IRIS Restricted


2023 Conference article Open Access OPEN
VISIONE at Video Browser Showdown 2023
Amato G, Bolettieri P, Carrara F, Falchi F, Gennaro C, Messina N, Vadicamo L, Vairo C
In this paper, we present the fourth release of VISIONE, a tool for fast and effective video search on a large-scale dataset. It includes several search functionalities like text search, object and color-based search, semantic and visual similarity search, and temporal search. VISIONE uses ad-hoc textual encoding for indexing and searching video content, and it exploits a full-text search engine as search backend. In this new version of the system, we introduced some changes both to the current search techniques and to the user interface.
DOI: 10.1007/978-3-031-27077-2_48
Project(s): AI4Media via OpenAIRE
See at: CNR IRIS Open Access | ISTI Repository Open Access | ZENODO Open Access | CNR IRIS Restricted