2018
Software
Metadata Only Access
PyDCI repository
Moreo Fernandez A
A Python implementation of the Distributional Correspondence Indexing (DCI) algorithm for cross-domain and cross-lingual domain adaptation, described in https://arxiv.org/abs/1810.09311.
See at:
github.com
| CNR IRIS
2019
Software
Metadata Only Access
Funnelling repository
Moreo Fernandez Ad
Software repository containing the Python code implementing Funnelling, a new ensemble method for heterogeneous transfer learning described in https://arxiv.org/abs/1901.11459.
See at:
github.com
| CNR IRIS
2018
Software
Metadata Only Access
inntt: Interactive NeuralNet Trainer for pyTorch
Moreo Fernandez A
Interactive NeuralNet Trainer for pyTorch (INNTT) is a Python class that lets the practitioner modify, on the fly and via the keyboard, many of the hyperparameters involved in training neural networks in PyTorch.
See at:
github.com
| CNR IRIS
2020
Software
Metadata Only Access
PyDRO: A Python reimplementation of the Distributional Random Oversampling method for binary text classification
Moreo Fernandez Ad
This repo is a stand-alone (re)implementation of the Distributional Random Oversampling (DRO) method presented at SIGIR'16. The previous implementation was part of the JaTeCs framework for Java.
Distributional Random Oversampling (DRO) is an oversampling method to counter data imbalance in binary text classification. DRO generates new random minority-class synthetic documents by exploiting the distributional properties of the terms in the collection. The variability introduced by the oversampling method is enclosed in a latent space; the original space is replicated and left untouched.
See at:
github.com
| CNR IRIS
2024
Journal article
Open Access
Quantification using permutation-invariant networks based on histograms
Pérez-Mon O., Moreo Fernandez A., Del Coz J. J., González P.
Quantification, also known as class prevalence estimation, is the supervised learning task in which a model is trained to predict the prevalence of each class in a given bag of examples. This paper investigates the application of deep neural networks to quantification tasks in scenarios where it is possible to apply a symmetric supervised approach that eliminates the need for classification as an intermediate step, thus directly addressing the quantification problem. Additionally, it discusses existing permutation-invariant layers designed for set processing and assesses their suitability for quantification. Based on our analysis, we propose HistNetQ, a novel neural architecture that relies on a permutation-invariant representation based on histograms and is especially suited for quantification problems. Our experiments, carried out on two standard competitions that have become a reference in the quantification field, show that HistNetQ outperforms other deep neural network architectures designed for set processing, as well as the current state-of-the-art quantification methods. Furthermore, HistNetQ offers two significant advantages over traditional quantification methods: i) it does not require the labels of the training examples but only the prevalence values of a collection of training bags, making it applicable to new scenarios; and ii) it is able to optimize any custom quantification-oriented loss function.
Source: NEURAL COMPUTING & APPLICATIONS
DOI: 10.1007/s00521-024-10721-1
Project(s): Quantification in the Context of Dataset Shift
See at:
CNR IRIS
2015
Conference article
Restricted
Distributional correspondence indexing for cross-language text categorization
Esuli A, Fernandez Am
Cross-Language Text Categorization (CLTC) aims at producing a classifier for a target language when the only available training examples belong to a different source language. Existing CLTC methods are usually affected by high computational costs, require external linguistic resources, or demand a considerable human annotation effort. This paper presents a simple, yet effective, CLTC method based on projecting features from both source and target languages into a common vector space, by using a computationally lightweight distributional correspondence profile with respect to a small set of pivot terms. Experiments on a popular sentiment classification dataset show that our method compares favorably with state-of-the-art methods while requiring a significantly reduced computational cost and minimal human intervention.
See at:
CNR IRIS
| link.springer.com
2015
Conference article
Open Access
A Multi-lingual Annotated Dataset for Aspect-Oriented Opinion Mining
Jimenez Zafra S, Berardi G, Esuli A, Marcheggiani D, Martinvaldivia M T, Moreo Fernández A
We present the Trip-MAML dataset, a Multi-Lingual dataset of hotel reviews that have been manually annotated at the sentence level with Multi-Aspect sentiment labels. This dataset has been built as an extension of an existing English-only dataset, adding documents written in Italian and Spanish. We detail the dataset construction process, covering the data gathering, selection, and annotation. We present inter-annotator agreement figures and baseline experimental results, comparing the three languages. Trip-MAML is a multi-lingual dataset for aspect-oriented opinion mining that enables researchers (i) to face the problem on languages other than English and (ii) to experiment with the application of cross-lingual learning methods to the task.
See at:
CNR IRIS
| ISTI Repository
| www.aclweb.org
2018
Software
Metadata Only Access
QuaNet repository
Esuli A, Moreo Fernandez Ad
This repository contains the Python code implementing the QuaNet model for quantification (described in https://arxiv.org/pdf/1809.00836.pdf) and everything needed to reproduce all experiments.
See at:
github.com
| CNR IRIS
2019
Conference article
Open Access
Learning to quantify: Estimating class prevalence via supervised learning
Moreo Fernandez Ad, Sebastiani F
Quantification (also known as "supervised prevalence estimation", or "class prior estimation") is the task of estimating, given a set of unlabelled items and a set of classes C = {c1, ..., c|C|}, the relative frequency (or "prevalence") p(ci) of each class ci in C, i.e., the fraction of items in the set that belong to ci.
The goal of this course is to introduce the audience to the problem of quantification and to its importance, to the main supervised learning techniques that have been proposed for solving it, to the metrics used to evaluate them, and to what appear to be the most promising directions for further research.
See at:
dl.acm.org
| CNR IRIS
| ISTI Repository
2019
Conference article
Open Access
Tutorial: Supervised Learning for Prevalence Estimation
Moreo Fernandez Ad, Sebastiani F
Quantification is the task of estimating, given a set of unlabelled items and a set of classes, the relative frequency (or "prevalence") of each class in the set. Quantification is important in many disciplines (e.g., market research, political science, the social sciences, and epidemiology) which usually deal with aggregate (as opposed to individual) data. In these contexts, classifying individual unlabelled instances is usually not a primary goal, while estimating the prevalence of the classes of interest in the data is. Quantification may in principle be solved via classification, i.e., by classifying each item and counting, for each class, how many items have been assigned to it. However, it has been shown in a multitude of works that this "classify and count" (CC) method yields suboptimal quantification accuracy, one of the reasons being that most classifiers are optimized for classification accuracy, and not for quantification accuracy. As a result, quantification is no longer considered a mere byproduct of classification, and has evolved into a task of its own, devoted to designing methods and algorithms that deliver better prevalence estimates than CC. The goal of this tutorial is to introduce the main supervised learning techniques that have been proposed for solving quantification, the metrics used to evaluate them, and the most promising directions for further research.
See at:
CNR IRIS
| link.springer.com
| ISTI Repository
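The "classify and count" (CC) baseline named in the tutorial abstract above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not code from the tutorial: the function name `classify_and_count` and the keyword-based `toy_classifier` are hypothetical stand-ins for a trained classifier.

```python
from collections import Counter
from typing import Callable, Dict, Hashable, Iterable, Sequence


def classify_and_count(
    items: Iterable[object],
    classify: Callable[[object], Hashable],
    classes: Sequence[Hashable],
) -> Dict[Hashable, float]:
    """Estimate class prevalence by classifying every item and counting labels."""
    counts = Counter(classify(item) for item in items)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot estimate prevalence of an empty sample")
    # Classes that were never predicted get prevalence 0.0. The estimates sum
    # to 1 only if the classifier outputs labels listed in `classes`.
    return {c: counts.get(c, 0) / total for c in classes}


def toy_classifier(text: str) -> str:
    # Hypothetical stand-in for a trained classifier.
    return "positive" if "good" in text else "negative"


sample = ["good food", "bad service", "good view", "noisy room"]
estimate = classify_and_count(sample, toy_classifier, ["positive", "negative"])
print(estimate)
```

The estimate is only as good as the classifier's behaviour on the sample at hand, which is the weakness of CC that the abstract points out.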
2022
Journal article
Open Access
Report on the 1st International Workshop on Learning to Quantify (LQ 2021)
Del Coz J. J., González P., Moreo Fernandez A. D., Sebastiani F.
The 1st International Workshop on Learning to Quantify (LQ 2021 - https://cikmlq2021.github.io/), organized as a satellite event of the 30th ACM International Conference on Information and Knowledge Management (CIKM 2021), took place on two separate days, November 1 and 5, 2021. Like the main CIKM 2021 conference, the workshop was held entirely online, due to the COVID-19 pandemic. This report presents a summary of each keynote speech and contributed paper presented at this event, and discusses the issues that were raised during the workshop.
Source: SIGKDD EXPLORATIONS, vol. 24 (issue 1), pp. 49-51
Project(s): AI4Media, SoBigData-PlusPlus
See at:
ISTI Repository
| CNR IRIS
| kdd.org
2022
Journal article
Open Access
Tweet sentiment quantification: an experimental re-evaluation
Moreo A, Sebastiani F
Sentiment quantification is the task of training, by means of supervised learning, estimators of the relative frequency (also called "prevalence") of sentiment-related classes (such as Positive, Neutral, Negative) in a sample of unlabelled texts. This task is especially important when these texts are tweets, since the final goal of most sentiment classification efforts carried out on Twitter data is actually quantification (and not the classification of individual tweets). It is well-known that solving quantification by means of "classify and count" (i.e., by classifying all unlabelled items by means of a standard classifier and counting the items that have been assigned to a given class) is less than optimal in terms of accuracy, and that more accurate quantification methods exist. Gao and Sebastiani (2016) carried out a systematic comparison of quantification methods on the task of tweet sentiment quantification. In hindsight, we observe that the experimentation carried out in that work was weak, and that the reliability of the conclusions that were drawn from the results is thus questionable. We here re-evaluate those quantification methods (plus a few more modern ones) on exactly the same datasets, this time following a now consolidated and robust experimental protocol (which also involves simulating the presence, in the test data, of class prevalence values very different from those of the training set). This experimental protocol (even without counting the newly added methods) involves a number of experiments 5,775 times larger than that of the original study.
Due to the above-mentioned presence, in the test data, of samples characterised by class prevalence values very different from those of the training set, the results of our experiments are dramatically different from those obtained by Gao and Sebastiani, and provide a different, much more solid understanding of the relative strengths and weaknesses of different sentiment quantification methods.
Source: PLOS ONE, vol. 17 (issue 9)
Project(s): AI4Media, SoBigData-PlusPlus
See at:
CNR IRIS
| journals.plos.org
| ISTI Repository
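The abstract above mentions simulating test samples whose class prevalence differs sharply from that of the training set. One simple way to do this is to draw a sample at a chosen prevalence from a labelled pool. The sketch below is only an illustration of that idea, not the protocol used in the paper: the function name `sample_at_prevalence`, the sampling with replacement, and the toy pool are all assumptions.

```python
import random
from typing import Dict, Hashable, List, Sequence, Tuple

Labelled = Tuple[object, Hashable]


def sample_at_prevalence(
    pool: Sequence[Labelled],
    prevalence: Dict[Hashable, float],
    size: int,
    rng: random.Random,
) -> List[Labelled]:
    """Draw roughly `size` labelled items so that each class has the requested share."""
    by_class: Dict[Hashable, List[object]] = {}
    for item, label in pool:
        by_class.setdefault(label, []).append(item)

    sample: List[Labelled] = []
    for label, share in prevalence.items():
        k = round(share * size)  # rounding may make the total differ slightly from `size`
        # Assumption: sample with replacement, so rare classes can still supply k items.
        sample.extend((item, label) for item in rng.choices(by_class[label], k=k))
    rng.shuffle(sample)
    return sample


rng = random.Random(0)
# Toy pool in which positives are rare (10%).
pool = [(f"pos-{i}", "positive") for i in range(10)]
pool += [(f"neg-{i}", "negative") for i in range(90)]

# Test sample in which positives are frequent (80%).
shifted = sample_at_prevalence(pool, {"positive": 0.8, "negative": 0.2}, size=50, rng=rng)
positives = sum(1 for _, label in shifted if label == "positive")
print(positives / len(shifted))
```

Repeating the draw over a grid of prevalence values gives a collection of test samples that covers both mild and extreme shifts.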
2022
Conference article
Open Access
Rhythmic and psycholinguistic features for authorship tasks in the Spanish parliament: evaluation and analysis
Corbara S., Chulvi B., Rosso P., Moreo Fernandez A.
Among the many tasks of the authorship field, Authorship Identification aims at uncovering the author of a document, while Author Profiling focuses on the analysis of personal characteristics of the author(s), such as gender, age, etc. Methods devised for such tasks typically focus on the style of the writing, and are expected not to make inferences grounded on the topics that certain authors tend to write about. In this paper, we present a series of experiments evaluating the use of topic-agnostic feature sets for Authorship Identification and Author Profiling tasks in Spanish political language. In particular, we propose to employ features based on rhythmic and psycholinguistic patterns, obtained via different approaches of text masking that we use to actively mask the underlying topic. We feed these feature sets to an SVM learner, and show that they lead to results that are comparable to those obtained by a BETO transformer, when the latter is trained on the original text, i.e., potentially learning from topical information. Moreover, we further investigate the results for the different authors, showing that variations in performance are partially explainable in terms of the authors' political affiliation and communication style.
Project(s): AI4Media
See at:
CNR IRIS
| link.springer.com
| ISTI Repository
2023
Conference article
Open Access
Ordinal quantification through regularization
Bunse M, Moreo A, Sebastiani F, Senz M
Quantification, i.e., the task of training predictors of the class prevalence values in sets of unlabelled data items, has received increased attention in recent years. However, most quantification research has concentrated on developing algorithms for binary and multiclass problems in which the classes are not ordered. We here study the ordinal case, i.e., the case in which a total order is defined on the set of n > 2 classes. We give three main contributions to this field. First, we create and make available two datasets for ordinal quantification (OQ) research that overcome the inadequacies of the previously available ones. Second, we experimentally compare the most important OQ algorithms proposed in the literature so far. To this end, we bring together algorithms proposed by authors from very different research fields, who were unaware of each other's developments. Third, we propose three OQ algorithms, based on the idea of preventing ordinally implausible estimates through regularization. Our experiments show that these algorithms outperform the existing ones if the ordinal plausibility assumption holds.
Project(s): AI4Media, SoBigData-PlusPlus
See at:
CNR IRIS
| link.springer.com
| ISTI Repository
2023
Journal article
Open Access
Multi-label quantification
Moreo A, Francisco M, Sebastiani F
Quantification, variously called supervised prevalence estimation or learning to quantify, is the supervised learning task of generating predictors of the relative frequencies (a.k.a. prevalence values) of the classes of interest in unlabelled data samples. While many quantification methods have been proposed in the past for binary problems and, to a lesser extent, single-label multiclass problems, the multi-label setting (i.e., the scenario in which the classes of interest are not mutually exclusive) remains by and large unexplored. A straightforward solution to the multi-label quantification problem could simply consist of recasting the problem as a set of independent binary quantification problems. Such a solution is simple but naïve, since the independence assumption upon which it rests is, in most cases, not satisfied. In these cases, knowing the relative frequency of one class could be of help in determining the prevalence of other related classes. We propose the first truly multi-label quantification methods, i.e., methods for inferring estimators of class prevalence values that strive to leverage the stochastic dependencies among the classes of interest in order to predict their relative frequencies more accurately. We show empirical evidence that natively multi-label solutions outperform the naïve approaches by a large margin. The code to reproduce all our experiments is available online.
Source: ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA (ONLINE), vol. 18 (issue 1)
Project(s): AI4Media, SoBigData-PlusPlus
See at:
dl.acm.org
| CNR IRIS
| ISTI Repository
2023
Journal article
Open Access
Same or different? Diff-vectors for authorship analysis
Corbara S., Moreo Fernandez A. D., Sebastiani F.
In this paper we investigate the effects on authorship identification tasks (including authorship verification, closed-set authorship attribution, and closed-set and open-set same-author verification) of a fundamental shift in how to conceive the vectorial representations of documents that are given as input to a supervised learner. In "classic" authorship analysis a feature vector represents a document, the value of a feature represents (an increasing function of) the relative frequency of the feature in the document, and the class label represents the author of the document. We instead investigate the situation in which a feature vector represents an unordered pair of documents, the value of a feature represents the absolute difference in the relative frequencies (or increasing functions thereof) of the feature in the two documents, and the class label indicates whether the two documents are from the same author or not. This latter (learner-independent) type of representation has been occasionally used before, but has never been studied systematically. We argue that it is advantageous, and that in some cases (e.g., authorship verification) it provides a much larger quantity of information to the training process than the standard representation. The experiments that we carry out on several publicly available datasets (among which one that we here make available for the first time) show that feature vectors representing pairs of documents (that we here call Diff-Vectors) bring about systematic improvements in the effectiveness of authorship identification tasks, and especially so when training data are scarce (as is often the case in real-life authorship identification scenarios).
Our experiments tackle same-author verification, authorship verification, and closed-set authorship attribution; while DVs are naturally geared for solving the 1st, we also provide two novel methods for solving the 2nd and 3rd that use a solver for the 1st as a building block. The code to reproduce our experiments is open-source and available online.
Source: ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA (ONLINE)
Project(s): AI4Media, SoBigData-PlusPlus
See at:
dl.acm.org
| CNR IRIS
| ISTI Repository
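The pair representation described in the abstract above (a feature value is the absolute difference between the relative frequencies of the feature in two documents, and the label says whether the two documents share an author) can be sketched as follows. This is a minimal illustration, not the authors' released code: the function names, the whitespace tokenisation, and the three-word feature set are assumptions, and the sketch uses plain relative frequencies rather than any increasing function of them.

```python
from collections import Counter
from itertools import combinations
from typing import List, Sequence, Tuple


def relative_frequencies(tokens: Sequence[str], features: Sequence[str]) -> List[float]:
    """Relative frequency of each feature (here: a word) in one document."""
    if not tokens:
        return [0.0] * len(features)
    counts = Counter(tokens)
    return [counts[f] / len(tokens) for f in features]


def diff_vector(doc_a: Sequence[str], doc_b: Sequence[str], features: Sequence[str]) -> List[float]:
    """Absolute difference of the relative frequencies; symmetric in the two documents."""
    freq_a = relative_frequencies(doc_a, features)
    freq_b = relative_frequencies(doc_b, features)
    return [abs(x - y) for x, y in zip(freq_a, freq_b)]


def labelled_pairs(
    docs: Sequence[Sequence[str]], authors: Sequence[str], features: Sequence[str]
) -> List[Tuple[List[float], bool]]:
    """One training example per unordered pair of documents; label is True for same author."""
    return [
        (diff_vector(docs[i], docs[j], features), authors[i] == authors[j])
        for i, j in combinations(range(len(docs)), 2)
    ]


features = ["the", "of", "and"]  # hypothetical feature set
docs = [
    "the cat and the dog".split(),
    "of the people and of".split(),
    "the sea and the sky".split(),
]
authors = ["x", "y", "x"]

pairs = labelled_pairs(docs, authors, features)
for vector, same_author in pairs:
    print([round(v, 2) for v in vector], same_author)
```

With n documents this yields n(n-1)/2 training pairs, which is one way to read the abstract's remark that the pair representation can supply more information to the training process when labelled documents are scarce.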