2026 Journal article Open Access
Quantifying query fairness under unawareness
Jaenich Thomas, Moreo Alejandro, Fabris Alessandro, Mcdonald Graham, Esuli Andrea, Ounis Iadh, Sebastiani Fabrizio
Traditional ranking algorithms are designed to retrieve the most relevant items for a user’s query, but they often inherit biases from data that can unfairly disadvantage vulnerable groups. Fairness in information access systems (IAS) is typically assessed by comparing the distribution of groups in a ranking to a target distribution, such as the overall group distribution in the dataset. These fairness metrics depend on knowing the true group labels for each item. However, when groups are defined by demographic or sensitive attributes, these labels are often unknown, leading to a setting known as “fairness under unawareness.” To address this, group membership can be inferred using machine-learned classifiers, and group prevalence is estimated by counting the predicted labels. Unfortunately, such an estimation is known to be unreliable under dataset shift, compromising the accuracy of fairness evaluations. In this paper, we introduce a robust fairness estimator based on quantification that effectively handles multiple sensitive attributes beyond binary classifications. Our method outperforms existing baselines across various sensitive attributes and, to the best of our knowledge, is the first to establish a reliable protocol for measuring fairness under unawareness across multiple queries and groups.
Source: JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH, vol. 85, article 7
DOI: 10.1613/jair.1.17675

See at: CNR IRIS Open Access | www.jair.org Open Access | CNR IRIS Restricted
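
The abstract above notes that estimating group prevalence by counting predicted labels becomes unreliable under dataset shift. The following minimal sketch illustrates that point for the binary case only. It contrasts plain "classify and count" with the textbook "adjusted count" correction, which is a generic quantification baseline and not the estimator proposed in the paper. The function names and the `tpr`/`fpr` values are hypothetical and assumed to come from held-out validation data.

```python
import numpy as np


def classify_and_count(predictions):
    """Estimate positive-class prevalence as the fraction of positive predictions."""
    return float(np.mean(predictions))


def adjusted_count(predictions, tpr, fpr):
    """Correct the classify-and-count estimate using the classifier's error rates.

    tpr and fpr are assumed to have been measured on validation data; the
    correction inverts  cc = tpr * p + fpr * (1 - p)  for the prevalence p.
    """
    cc = classify_and_count(predictions)
    if tpr == fpr:
        # The classifier carries no information; fall back to the raw count.
        return cc
    return float(np.clip((cc - fpr) / (tpr - fpr), 0.0, 1.0))


# Hypothetical predicted labels for 100 items: 32 positive, 68 negative.
predicted = np.array([1] * 32 + [0] * 68)
raw_estimate = classify_and_count(predicted)
corrected_estimate = adjusted_count(predicted, tpr=0.8, fpr=0.2)
```

With these illustrative numbers the raw count suggests a prevalence of 0.32, while the corrected estimate is 0.2, because part of the counted positives are attributed to false positives.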


2026 Journal article Open Access
Misspellings in natural language processing: a survey of recent literature
Sperduti Gianluca, Moreo Alejandro
This survey provides an overview of the challenges of misspellings in natural language processing (NLP). Misspellings are ubiquitous in digital communication, and even if humans can generally interpret misspelt text, NLP models frequently struggle to handle it: this causes a decline in performance in common tasks like text classification and machine translation. In this paper, we reconstruct a history of misspellings as a scientific problem. We then discuss the latest advancements to address the challenge of misspellings in NLP. Main strategies to mitigate the effect of misspellings include data augmentation, double step, character-order agnostic, and tuple-based methods, among others. This survey also examines dedicated data challenges and competitions to spur progress in the field. Critical safety and ethical concerns are also examined, for example, the voluntary use of misspellings to inject malicious messages and hate speech on social networks. The survey also explores psycholinguistic perspectives on how humans process misspellings, potentially informing innovative computational techniques for text normalisation and representation. Additionally, the survey explores the challenges that misspellings pose in multilingual contexts. Finally, the misspelling-related challenges and opportunities associated with modern large language models are also analysed, including benchmarks, datasets and performances of the most prominent language models against misspellings. This survey provides a comprehensive review of recent research on misspellings and aims to serve as a valuable resource for researchers seeking to get up to speed on this problem within the rapidly evolving landscape of NLP.
Source: Natural Language Processing, pp. 1-47
DOI: 10.1017/nlp.2026.10020
Project(s): Word EMBeddings: From Cognitive Linguistics to Language Engineering, and Back

See at: CNR IRIS Open Access | www.cambridge.org Open Access | CNR IRIS Restricted


2025 Conference article Open Access
Transductive model selection under prior probability shift
Volpi L., Moreo Fernandez A., Sebastiani F.
Transductive learning is a supervised machine learning task in which, unlike in traditional inductive learning, the unlabelled data that require labelling are a finite set and are available at training time. Similarly to inductive learning contexts, transductive learning contexts may be affected by dataset shift, i.e., may be such that the assumption according to which the training data and the unlabelled data are independently and identically distributed (IID) does not hold. We here propose a method, tailored to transductive classification contexts, for performing model selection (i.e., hyperparameter optimisation) when the data exhibit prior probability shift, an important type of dataset shift typical of anti-causal learning problems. In our proposed method the hyperparameters can be optimised directly on the unlabelled data to which the trained classifier must be applied; this is unlike traditional model selection methods, which are based on performing cross-validation on the labelled training data. By tailoring model selection to the actual test distribution, our approach contributes to the trustworthiness of AI systems, as it enables more reliable and robust classifier deployment under changed conditions. We provide experimental results that show the benefits brought about by our method.
Source: CEUR WORKSHOP PROCEEDINGS, vol. 4132, pp. 256-265. Bologna, Italy, 25-26 October 2025
Project(s): Future Artificial Intelligence Research

See at: ceur-ws.org Open Access | CNR IRIS Open Access | CNR IRIS Restricted


2025 Journal article Open Access
QuAcc: using quantification to predict classifier accuracy under prior probability shift
Volpi L., Moreo Fernandez A., Sebastiani F.
Using cross-validation to predict the accuracy of a classifier on unseen data can be done reliably only in the absence of dataset shift, i.e., when the training data and the unseen data are IID. In this work we deal instead with the problem of predicting classifier accuracy on unseen data affected by prior probability shift (PPS), an important type of dataset shift. We propose QuAcc, a method built on top of “quantification” algorithms robust to PPS, i.e., algorithms devised for estimating the prevalence values of the classes in unseen data affected by PPS. QuAcc is based on the idea of viewing the cells of the contingency table (on which classifier accuracy is computed) as classes, and of estimating, via a quantification algorithm, their prevalence values on the unseen data labelled by the classifier. We perform systematic experiments in which we compare the prediction error incurred by QuAcc with that of state-of-the-art classifier accuracy prediction (CAP) methods.
Source: INTELLIGENZA ARTIFICIALE, vol. 19 (issue 2), pp. 141-157
DOI: 10.1177/17248035251338347
Project(s): Future Artificial Intelligence Research, Italian Strengthening of ESFRI RI RESILIENCE, Quantification in the Context of Dataset Shift, Strengthening the Italian RI for Social Mining and Big Data Analytics

See at: CNR IRIS Open Access | Intelligenza Artificiale Restricted | CNR IRIS Restricted


2025 Book Open Access
Proceedings of the 5th International Workshop on Learning to Quantify (LQ 2025)
Bunse M., González P., Moreo Fernandez A., Sebastiani F.
The 5th International Workshop on Learning to Quantify (LQ 2025 – https://lq-2025.github.io/) was held in Porto, PT, on September 15, 2025, as a satellite workshop of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD 2025). While the 1st edition of the workshop (LQ 2021 – https://cikmlq2021.github.io/) had to be an entirely online event due to the COVID-19 pandemic, the 2nd edition (LQ 2022 – https://lq-2022.github.io/), 3rd edition (LQ 2023 – https://lq-2023.github.io/), 4th edition (LQ 2024 – https://lq-2024.github.io/), and this 5th edition have been hybrid events, with presentations given in person and with both in-person and remote attendees. The LQ 2025 workshop consisted of the presentations of seven contributed papers, each of which had gone through a rigorous peer-reviewing process by three reviewers, and a final collective discussion on the open problems of learning to quantify and on future initiatives. The present volume contains the text of five of the seven presentations given at the workshop (for the other two presentations the authors asked for their papers not to be in the proceedings). We hope that the availability of the present volume will increase the interest in the subject of quantification on the part of researchers and practitioners alike, and will contribute to making quantification better known to potential users of this technology and to researchers interested in advancing the field.
Project(s): SoBigData via OpenAIRE

See at: CNR IRIS Open Access | lq-2025.github.io Open Access | CNR IRIS Restricted


2025 Journal article Open Access
LEAP: Linear equations for classifier accuracy prediction under prior probability shift
Volpi L., Moreo Fernandez A., Sebastiani F.
The standard technique for predicting the accuracy that a classifier will have on unseen data (classifier accuracy prediction, CAP) is cross-validation (CV). However, CV relies on the assumption that the training data and the test data are sampled from the same distribution, an assumption that is often violated in many real-world scenarios. When such violations occur (i.e., in the presence of dataset shift), the estimates returned by CV are unreliable. The contribution of this paper is three-fold. First, we propose a CAP method specifically designed to work under prior probability shift (PPS), an instance of dataset shift in which the training and test distributions are characterized by different class priors. This method estimates the n^2 entries of the contingency table of the test data (thus allowing the estimation of any specific evaluation measure) by solving a system of n^2 independent linear equations, with n the number of classes. Second, we show that the equations that the cells of the contingency table must satisfy are actually more than n^2, which gives rise to an overconstrained problem, and present a family of methods each based on a different selection of n^2 such equations. Third, we observe that, since a key step of the above methods involves predicting the class priors of the test data, one can exploit intuitions from the field of class prior estimation (a.k.a. “quantification”). Our experiments show that, when combined with state-of-the-art quantification techniques, under PPS our methods tend to outperform existing CAP methods.
Source: MACHINE LEARNING, vol. 114 (issue 12)
DOI: 10.1007/s10994-025-06878-y

See at: CNR IRIS Open Access | CNR IRIS Restricted


2025 Conference article Open Access
An efficient method for deriving confidence intervals in aggregative quantification
Moreo Fernandez A., Salvati N.
This paper explores efficient methods for deriving confidence intervals in quantification, the area of machine learning concerned with estimating class prevalence values. By focusing on computationally efficient strategies, we propose a robust framework for quantifying uncertainty. The key idea is to disentangle the two main phases of current aggregative quantifiers (classification followed by aggregation) and apply bootstrap only to the second phase. We investigate different methods for constructing confidence regions, including confidence intervals, confidence ellipses in the simplex, and confidence regions in the transformed Centered Log-Ratio space. Additionally, we examine various bootstrap strategies, including model-based, population-based, and a combined approach. Our results demonstrate the effectiveness of combining model-based and population-based bootstrap approaches, particularly when used with traditional confidence intervals, while also achieving significant efficiency gains compared to a naive application of bootstrap.
Project(s): Quantification in the Context of Dataset Shift

See at: CNR IRIS Open Access | lq-2025.github.io Open Access | CNR IRIS Restricted
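
The key idea stated in the abstract above, classifying once and bootstrapping only the aggregation phase, can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: the aggregation step is plain classify-and-count on binary predictions, the interval is a percentile interval, and the function name and parameters are hypothetical.

```python
import numpy as np


def bootstrap_prevalence_interval(predictions, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a classify-and-count prevalence estimate.

    The classifier is assumed to have been applied once beforehand; only its
    outputs are resampled, so no retraining or re-classification is needed.
    """
    rng = np.random.default_rng(seed)
    predictions = np.asarray(predictions, dtype=float)
    n = len(predictions)
    # Draw n_boot resamples (with replacement) of the already-computed outputs.
    idx = rng.integers(0, n, size=(n_boot, n))
    # Aggregation phase, repeated per resample: fraction of positive predictions.
    estimates = predictions[idx].mean(axis=1)
    low, high = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return float(predictions.mean()), float(low), float(high)


# Hypothetical classifier outputs on 100 test items (30 predicted positive).
outputs = np.array([1] * 30 + [0] * 70)
point, low, high = bootstrap_prevalence_interval(outputs)
```

Because the expensive classification step runs once, the cost of each bootstrap replicate is only that of re-aggregating cached outputs, which is where the efficiency gain over a naive bootstrap would come from in this simplified setting.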


2025 Conference article Open Access
A simple method for classifier accuracy prediction under prior probability shift
Volpi L., Moreo Fernandez A., Sebastiani F.
The standard technique for predicting the accuracy that a classifier will have on unseen data (classifier accuracy prediction – CAP) is cross-validation (CV). However, CV relies on the assumption that the training data and the test data are sampled from the same distribution, an assumption that is often violated in many real-world scenarios. When such violations occur (i.e., in the presence of dataset shift), the estimates returned by CV are unreliable. In this paper we propose a CAP method specifically designed to address prior probability shift (PPS), an instance of dataset shift in which the training and test distributions are characterized by different class priors. By solving a system of n^2 independent linear equations, with n the number of classes, our method estimates the n^2 entries of the contingency table of the test data, and thus allows estimating any specific evaluation measure. Since a key step in this method involves predicting the class priors of the test data, we further observe a connection between our method and the field of “learning to quantify”. Our experiments show that, when combined with state-of-the-art quantification techniques, under PPS our method tends to outperform existing CAP methods.
Source: LECTURE NOTES IN COMPUTER SCIENCE, vol. 15244, pp. 267-283. Pisa, Italy, 14-16/10/2024
DOI: 10.1007/978-3-031-78980-9_17
Project(s): Quantification in the Context of Dataset Shift

See at: CNR IRIS Open Access | link.springer.com Open Access | CNR IRIS Restricted


2025 Other Open Access
ISTI-day 2025 Proceedings
Del Corso G., Pedrotti A., Federico G., Gennaro C., Carrara F., Amato G., Di Benedetto M., Gabrielli E., Belli D., Matrullo Zoe, Miori V., Tolomei Gabriele, Waheed T., Marchetti E., Calabrò Antonello, Rossetti G., Stella Massimo, Cazabet Rémy, Abramski K., Cau E., Citraro S., Failla A., Mesina V., Morini V., Pansanella V., Colantonio S., Germanese D., Pascali M. A., Bianchi L., Messina N., Falchi F., Barsellotti L., Pacini G., Cassese M., Puccetti G., Esuli A., Volpi L., Moreo Alejandro, Sebastiani F., Sperduti G., Nguyen Dong, Broccia G., Ter Beek M. H., Ferrari A., Massink M., Belmonte Gina, Ciancia V., Papini O., Canapa G., Catricalà B., Manca M., Paternò F., Santoro C., Zedda E., Gallo S., Maenza S., Mattioli A., Simeoli L., Rucci D., Carlini E., Dazzi P., Kavalionak H., Mordacchini M., Rulli C., Muntean Cristina Ioana, Nardini F. M., Perego R., Rocchietti G., Lettich F., Renso C., Pugliese C., Casini G., Haldimann Jonas, Meyer Thomas, Assante M., Candela L., Dell'Amico A., Frosini L., Mangiacrapa F., Oliviero A., Pagano P., Panichi G., Peccerillo B., Procaccini M., Mannocci A., Manghi P., Lonetti F., Kang Dongjae, Di Giandomenico F., Jee Eunkyoung, Lazzini G., Conti F., Scopigno R., D'Acunto M., Moroni D., Cafiso M., Paradisi P., Callieri M., Pavoni G., Corsini M., De Falco A., Sala F., Saraceni Q., Gattiglia Gabriele
ISTI-Day is an annual information and networking event organized by the Institute of Information Science and Technologies "A. Faedo" (ISTI) of the Italian National Research Council (CNR). This event features an opening talk of the Director of the Dept. DIITET (Emilio F. Campana) as well as an overview of the Institute's activities presented by the ISTI Director (Roberto Scopigno). Those institutional segments are complemented by dedicated presentations and round tables featuring former staff members, as well as internal and external collaborators. To foster a network of knowledge and collaboration among newcomers, the 2025 ISTI Day edition also includes a large poster session that provides a comprehensive overview of current research activities. Each of the 13 laboratories contributes 1–3 posters, highlighting the most innovative work and offering early-career researchers a platform for discussion. Thus these proceedings include the posters selected for ISTI-Day 2025, reflecting the diverse and innovative nature of the Institute's research.

See at: CNR IRIS Open Access | www.isti.cnr.it Open Access | CNR IRIS Restricted


2025 Contribution to book Open Access
Sull’autorialità dantesca della Questio de aqua et terra: uno studio computazionale [On the Dantean authorship of the Questio de aqua et terra: a computational study]
Leocata M., Moreo Fernandez A., Sebastiani F., Signori M.
This work presents the results of a computational authorship analysis, or computational authorship identification (CAI), applied to the Questio de aqua et terra, with the aim of adding a further piece to the studies that seek to determine whether or not the text is authentic.

See at: CNR IRIS Open Access | www.pisauniversitypress.it Open Access | CNR IRIS Restricted


2025 Conference article Open Access
ReCoptic: computer vision for the reconstruction of dismembered coptic codices
Bianchi L., Falchi F., Moreo Fernandez A., Sebastiani F., Bianchi C.
In the course of history, many ancient codices (i.e., bound volumes of manuscripts) written in the Coptic language have been dismembered, often at the hands of sellers of antiques, into individual sheets, which have ended up scattered across the planet. Reconstructing these codices in their original form would be extremely important for a better understanding of the culture of Coptic-speaking communities, and is a long-standing goal of paleographers and egyptologists alike. In this paper we present ReCoptic, a probabilistic, “contrastive” image classification system based on computer vision techniques, whose goal is to aid scholars in reconstructing dismembered ancient Coptic codices. Given a collection of scans of individual pages of ancient Coptic manuscripts, the system evaluates, for each pair of such scans, the (“posterior”) probability that the two pages originate from the same codex, and ranks all such pairs in descending order of their associated posterior probability. The scholar can thus discover yet unknown pairs of pages originating from the same codex by examining, starting from the top of the list, the pairs proposed by ReCoptic. In experiments that we have run on a collection of 6,000+ pages of Coptic manuscripts, ReCoptic displays extremely high accuracy. The code for reproducing these experiments is available at https://github.com/lorebianchi98/ReCoptic
DOI: 10.1109/ieee-ch65308.2025.11279398
Project(s): ITSERR Italian Strengthening of the ESFRI RI RESILIENCE

See at: CNR IRIS Open Access | ieeexplore.ieee.org Open Access | CNR IRIS Restricted


2025 Journal article Open Access
Efficient quantification on large-scale networks
Micheli A., Moreo Fernandez A., Podda M., Sebastiani F., Simoni W., Tortorella D.
Network quantification (NQ) is the problem of estimating the proportions of nodes belonging to each class in subsets of unlabelled graph nodes. When prior probability shift is at play, this task cannot be effectively addressed by first classifying the nodes and then counting the class predictions. In addition, unlike non-relational quantification, NQ demands enhanced flexibility in order to capture a broad range of connectivity patterns, resilience to the challenge of heterophily, and scalability to large networks. In order to meet these stringent requirements, we introduce XNQ, a novel method that synergizes the flexibility and efficiency of the unsupervised node embeddings computed by randomized recursive Graph Neural Networks, with an Expectation-Maximization algorithm that provides a robust quantification-aware adjustment to the output probabilities of a calibrated node classifier. In an extensive evaluation, in which we also validate the design choices underpinning XNQ through comprehensive ablation experiments, we find that XNQ consistently and significantly improves on the best network quantification methods to date, thereby setting the new state of the art for this challenging task. XNQ also provides a training speed-up of up to 10x–100x over other methods based on graph learning.
Source: MACHINE LEARNING, vol. 114 (issue 12)
DOI: 10.1007/s10994-025-06915-w

See at: CNR IRIS Open Access | link.springer.com Open Access | CNR IRIS Restricted
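
The abstract above mentions an Expectation-Maximization adjustment of the output probabilities of a calibrated classifier. As a rough illustration of that general idea, the sketch below implements the classic EM prior-adjustment procedure (Saerens, Latinne and Decaestecker, 2002) on a matrix of posterior probabilities. It is a generic sketch, not XNQ itself: it involves no graph embeddings, and the function name and toy inputs are hypothetical.

```python
import numpy as np


def em_prior_adjustment(posteriors, train_prior, max_iter=1000, tol=1e-10):
    """Re-estimate class priors on unlabelled data and rescale the posteriors.

    posteriors: array of shape (n_items, n_classes), assumed calibrated with
    respect to the training prior. Returns (estimated_prior, adjusted_posteriors).
    """
    posteriors = np.asarray(posteriors, dtype=float)
    train_prior = np.asarray(train_prior, dtype=float)
    prior = train_prior.copy()
    adjusted = posteriors
    for _ in range(max_iter):
        # E-step: reweight each posterior by the ratio of current to training prior.
        weighted = posteriors * (prior / train_prior)
        adjusted = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: the new prior is the mean of the adjusted posteriors.
        new_prior = adjusted.mean(axis=0)
        converged = np.abs(new_prior - prior).max() < tol
        prior = new_prior
        if converged:
            break
    return prior, adjusted


# Toy input: 80 items scored (0.9, 0.1) and 20 items scored (0.1, 0.9),
# produced by a classifier trained with balanced priors (0.5, 0.5).
toy_posteriors = [[0.9, 0.1]] * 80 + [[0.1, 0.9]] * 20
estimated_prior, _ = em_prior_adjustment(toy_posteriors, [0.5, 0.5])
```

On this toy input, simply averaging the posteriors gives 0.74 for the first class, whereas the EM iteration converges to about 0.875, the value consistent with a classifier whose scores were calibrated for a balanced training set.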


2025 Journal article Open Access
Kernel density estimation for multiclass quantification
Moreo Fernandez A., González P., Del Coz Juan J.
Several disciplines, like the social sciences, epidemiology, sentiment analysis, or market research, are interested in knowing the distribution of the classes in a population rather than the individual labels of the members thereof. Quantification is the supervised machine learning task concerned with obtaining accurate predictors of class prevalence, and to do so particularly in the presence of label shift. The distribution-matching (DM) approaches represent one of the most important families among the quantification methods that have been proposed in the literature so far. Current DM approaches model the involved populations using histograms of posterior probabilities. In this paper, we argue that their application to the multiclass setting is suboptimal since the histograms become class-specific, thus missing the opportunity to model inter-class information that may exist in the data. We propose a new representation mechanism based on multivariate densities that we model via kernel density estimation (KDE). The experiments we have carried out show our method, dubbed KDEy, yields superior quantification performance compared to previous DM approaches and other state-of-the-art quantification systems.
Source: MACHINE LEARNING, vol. 114 (issue 92)
DOI: 10.1007/s10994-024-06726-5
Project(s): Quantification in the Context of Dataset Shift

See at: arXiv.org e-Print Archive Open Access | CNR IRIS Open Access | Software Heritage Restricted | Machine Learning Restricted | GitHub Restricted | CNR IRIS Restricted


2025 Journal article Open Access
Quantification using permutation-invariant networks based on histograms
Pérez-Mon O., Moreo Fernandez A., Del Coz J. J., González P.
Quantification, also known as class prevalence estimation, is the supervised learning task in which a model is trained to predict the prevalence of each class in a given bag of examples. This paper investigates the application of deep neural networks for tasks of quantification in scenarios where it is possible to apply a symmetric supervised approach that eliminates the need for classification as an intermediate step, thus directly addressing the quantification problem. Additionally, it discusses existing permutation-invariant layers designed for set processing and assesses their suitability for quantification. Based on our analysis, we propose HistNetQ, a novel neural architecture that relies on a permutation-invariant representation based on histograms that is especially suited for quantification problems. Our experiments carried out in two standard competitions, which have become a reference in the quantification field, show that HistNetQ outperforms other deep neural network architectures designed for set processing, as well as the current state-of-the-art quantification methods. Furthermore, HistNetQ offers two significant advantages over traditional quantification methods: i) it does not require the labels of the training examples but only the prevalence values of a collection of training bags, making it applicable to new scenarios; and ii) it is able to optimize any custom quantification-oriented loss function.
Source: NEURAL COMPUTING & APPLICATIONS, vol. 37, pp. 3505-3520
DOI: 10.1007/s00521-024-10721-1
Project(s): Quantification in the Context of Dataset Shift

See at: CNR IRIS Open Access | CNR IRIS Restricted


2024 Journal article Open Access
Explainable authorship identification in Cultural Heritage applications
Setzu M., Corbara S., Monreale A., Moreo Fernandez A., Sebastiani F.
While a substantial amount of work has recently been devoted to improving the accuracy of computational Authorship Identification (AId) systems for textual data, little to no attention has been paid to endowing AId systems with the ability to explain the reasons behind their predictions. This substantially hinders the practical application of AId methods, since the predictions returned by such systems are hardly useful unless they are supported by suitable explanations. In this article, we explore the applicability of existing general-purpose eXplainable Artificial Intelligence (XAI) techniques to AId, with a focus on explanations addressed to scholars working in cultural heritage. In particular, we assess the relative merits of three different types of XAI techniques (feature ranking, probing, factual and counterfactual selection) on three different AId tasks (authorship attribution, authorship verification and same-authorship verification) by running experiments on real AId textual data. Our analysis shows that, while these techniques make important first steps towards XAI, more work remains to be done to provide tools that can be profitably integrated into the workflows of scholars.
Source: ACM JOURNAL ON COMPUTING AND CULTURAL HERITAGE, vol. 17 (issue 3), pp. 1-23
DOI: 10.1145/3654675
Project(s): SoBigData-PlusPlus via OpenAIRE

See at: CNR IRIS Open Access | Journal on Computing and Cultural Heritage Restricted | CNR IRIS Restricted


2024 Journal article Open Access
Regularization-based methods for ordinal quantification
Bunse M., Moreo Fernandez A., Sebastiani F., Senz M.
Quantification, i.e., the task of predicting the class prevalence values in bags of unlabeled data items, has received increased attention in recent years. However, most quantification research has concentrated on developing algorithms for binary and multi-class problems in which the classes are not ordered. Here, we study the ordinal case, i.e., the case in which a total order is defined on the set of n > 2 classes. We give three main contributions to this field. First, we create and make available two datasets for ordinal quantification (OQ) research that overcome the inadequacies of the previously available ones. Second, we experimentally compare the most important OQ algorithms proposed in the literature so far. To this end, we bring together algorithms proposed by authors from very different research fields, such as data mining and astrophysics, who were unaware of each other’s developments. Third, we propose a novel class of regularized OQ algorithms, which outperforms existing algorithms in our experiments. The key to this gain in performance is that our regularization prevents ordinally implausible estimates, assuming that ordinal distributions tend to be smooth in practice. We informally verify this assumption for several real-world applications.
Source: DATA MINING AND KNOWLEDGE DISCOVERY
DOI: 10.1007/s10618-024-01067-2
Project(s): AI4Media via OpenAIRE, Quantification in the Context of Dataset Shift, SoBigData-PlusPlus via OpenAIRE

See at: CNR IRIS Open Access | CNR IRIS Restricted


2024 Conference article Open Access
Multimodal heterogeneous transfer learning for multilingual image-text classification
Pedrotti A., Moreo Fernandez A., Sebastiani F.
The Multilingual Image-Text Classification (MITC) task is a specific instance of the Image-Text Classification (ITC) task, where each item to be classified consists of a visual representation and a textual description written in one of several possible languages. In this paper we propose MM-gFun, an extension of the gFun learning architecture originally developed for cross-lingual text classification. We extend its original text-only implementation to handle perceptual modalities.
Source: CEUR WORKSHOP PROCEEDINGS, vol. 3928. Pisa, Italy, 14-16/10/2024
Project(s): SoBigData via OpenAIRE

See at: ceur-ws.org Open Access | CNR IRIS Open Access | CNR IRIS Restricted


2024 Journal article Open Access
A noise-oriented and redundancy-aware instance selection framework
Cunha W., Moreo Fernandez A., Esuli A., Sebastiani F., Rocha L., Gonçalves M. A.
Fine-tuning transformer-based deep learning models is currently at the forefront of natural language processing (NLP) and information retrieval (IR) tasks. However, fine-tuning these transformers for specific tasks, especially when dealing with ever-expanding volumes of data, constant retraining requirements, and budget constraints, can be computationally and financially costly, requiring substantial energy consumption and contributing to carbon dioxide emissions. This article focuses on advancing the state-of-the-art (SOTA) on instance selection (IS) – a range of document filtering techniques designed to select the most representative documents for the sake of training. The objective is to either maintain or enhance classification effectiveness while reducing the overall training (fine-tuning) total processing time. In our prior research, we introduced the E2SC framework, a redundancy-oriented IS method focused on transformers and large datasets – currently the state-of-the-art in IS. Nonetheless, important research questions remained unanswered in our previous work, mostly due to E2SC’s sole emphasis on redundancy. In this article, we take our research a step further by proposing biO-IS – an extended bi-objective instance selection solution, a novel IS framework aimed at simultaneously removing redundant and noisy instances from the training. biO-IS estimates redundancy based on scalable, fast, and calibrated weak classifiers and captures noise with the support of a new entropy-based step. We also propose a novel iterative process to estimate near-optimum reduction rates for both steps. Our extended solution is able to reduce the training sets by 41% on average (up to 60%) while maintaining the effectiveness in all tested datasets, with speedup gains of 1.67x on average (up to 2.46x). No other baseline, not even our previous SOTA solution, was capable of achieving results with this level of quality, considering the tradeoff among training reduction, effectiveness, and speedup. To ensure reproducibility, our documentation, code, and datasets can be accessed on GitHub – https://github.com/waashk/bio-is.
Source: ACM TRANSACTIONS ON INFORMATION SYSTEMS
DOI: 10.1145/3705000
Project(s): Future Artificial Intelligence Research, Italian Strengthening of the ESFRI RI RESILIENCE, SoBigData.it

See at: dl.acm.org Open Access | CNR IRIS Open Access | ACM Transactions on Information Systems Restricted | CNR IRIS Restricted


2024 Journal article Open Access
Forging the Forger: an attempt to improve authorship verification via data augmentation
Corbara S., Moreo Fernandez A.
Authorship Verification (AV) is a text classification task concerned with inferring whether a candidate text has been written by one specific author (A) or by someone else (Ā). It has been shown that many AV systems are vulnerable to adversarial attacks, where a malicious author actively tries to fool the classifier by either concealing their writing style, or by imitating the style of another author. In this paper, we investigate the potential benefits of augmenting the classifier training set with (negative) synthetic examples. These synthetic examples are generated to imitate the style of A. We analyze the improvements in the classifier predictions that this augmentation brings to bear in the task of AV in an adversarial setting. In particular, we experiment with three different generator architectures (one based on Recurrent Neural Networks, another based on small-scale transformers, and another based on the popular GPT model) and with two training strategies (one inspired by standard Language Models, and another inspired by Wasserstein Generative Adversarial Networks). We evaluate our hypothesis on five datasets (three of which have been specifically collected to represent an adversarial setting) and using two learning algorithms for the AV classifier (Support Vector Machines and Convolutional Neural Networks). This experimentation yields negative results, revealing that, although our methodology proves effective in many adversarial settings, its benefits are too sporadic for practical application.
Source: IEEE ACCESS, vol. 12, pp. 171911-171925
DOI: 10.1109/access.2024.3481161
DOI: 10.48550/arxiv.2403.11265
Project(s): SoBigData-PlusPlus via OpenAIRE

See at: arXiv.org e-Print Archive Open Access | IEEE Access Open Access | Archivio istituzionale della Ricerca - Scuola Normale Superiore Open Access | CNR IRIS Open Access | ieeexplore.ieee.org Open Access | doi.org Restricted | GitHub Restricted | CNR IRIS Restricted


2024 Other Open Access
AIMH Research Activities 2024
Aloia N., Amato G., Bartalesi Lenzi V., Bianchi L., Bolettieri P., Bosio C., Carraglia M., Carrara F., Casarosa V., Cassese M., Ciampi L., Coccomini D. A., Concordia C., Connor R., Corbara S., De Martino C., Di Benedetto M., Esuli A., Falchi F., Fazzari E., Gennaro C., Iannello L., Negi K., Lagani G., Lenzi E., Leocata M., Malvaldi M., Meghini C., Messina N., Moreo Fernandez A., Nardi A., Pacini G., Pedrotti A., Pratelli N., Puccetti G., Rabitti F., Savino P., Scotti F., Sebastiani F., Sperduti G., Thanos C., Trupiano L., Vadicamo L., Vairo C., Versienti L., Volpi L.
The AIMH (Artificial Intelligence for Media and Humanities) laboratory is committed to advancing the field of Artificial Intelligence, with a special emphasis on its applications in digital media and the humanities. The lab aims to improve AI technologies, particularly in areas such as deep learning, text analysis, computer vision, multimedia information retrieval, content analysis, recognition, and retrieval. This report summarizes the laboratory’s achievements and activities over the course of 2024.
DOI: 10.32079/isti-ar-2024/001

See at: CNR IRIS Open Access | CNR IRIS Restricted