2022 Conference article Open Access

LeQua@CLEF2022: learning to quantify
Esuli A., Moreo A., Sebastiani F.
LeQua 2022 is a new lab for the evaluation of methods for "learning to quantify" in textual datasets, i.e., for training predictors of the relative frequencies of the classes of interest in sets of unlabelled textual documents. While these predictions could be easily achieved by first classifying all documents via a text classifier and then counting the numbers of documents assigned to the classes, a growing body of literature has shown this approach to be suboptimal, and has proposed better methods. The goal of this lab is to provide a setting for the comparative evaluation of methods for learning to quantify, both in the binary setting and in the single-label multiclass setting. For each such setting we provide data either in ready-made vector form or in raw document form.
Source: ECIR 2022 - 44th European Conference on IR Research, pp. 374–381, Stavanger, Norway, 10-14/04/2022
DOI: 10.1007/978-3-030-99739-7_47
Project(s): AI4Media via OpenAIRE, SoBigData-PlusPlus via OpenAIRE

See at: ISTI Repository Open Access | link.springer.com Restricted | CNR ExploRA Restricted
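The "classify and count" approach that the abstract above calls suboptimal is simple to make concrete. The following is a minimal sketch in Python using scikit-learn; the learner, vectorizer, and variable names are illustrative assumptions, not LeQua's official baseline code.

```python
# Hedged sketch of the trivial "classify and count" (CC) prevalence estimator.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def classify_and_count(train_docs, train_labels, unlabelled_docs):
    """Estimate class prevalences in unlabelled_docs by classifying and counting."""
    vec = TfidfVectorizer()
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vec.fit_transform(train_docs), train_labels)
    preds = clf.predict(vec.transform(unlabelled_docs))
    counts = np.array([(preds == c).sum() for c in clf.classes_])
    return dict(zip(clf.classes_, counts / counts.sum()))
```

Methods evaluated in the lab aim to beat exactly this baseline under prior shift.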


2022 Journal article Open Access

Report on the 1st International Workshop on Learning to Quantify (LQ 2021)
Del Coz J. J., González P., Moreo A., Sebastiani F.
The 1st International Workshop on Learning to Quantify (LQ 2021 - https://cikmlq2021.github.io/), organized as a satellite event of the 30th ACM International Conference on Information and Knowledge Management (CIKM 2021), took place on two separate days, November 1 and 5, 2021. Like the main CIKM 2021 conference, the workshop was held entirely online, due to the COVID-19 pandemic. This report presents a summary of each keynote speech and contributed paper presented at this event, and discusses the issues that were raised during the workshop.
Source: SIGKDD explorations (Online) 24 (2022): 49–51.
Project(s): AI4Media via OpenAIRE, SoBigData-PlusPlus via OpenAIRE

See at: kdd.org Open Access | ISTI Repository Open Access | CNR ExploRA Open Access


2021 Journal article Open Access

A critical reassessment of the Saerens-Latinne-Decaestecker algorithm for posterior probability adjustment
Esuli A., Molinari A., Sebastiani F.
We critically re-examine the Saerens-Latinne-Decaestecker (SLD) algorithm, a well-known method for estimating class prior probabilities ("priors") and adjusting posterior probabilities ("posteriors") in scenarios characterized by distribution shift, i.e., a difference in the distribution of the priors between the training and the unlabelled documents. Given a machine-learned classifier and a set of unlabelled documents for which the classifier has returned posterior probabilities and estimates of the prior probabilities, SLD updates them both in an iterative, mutually recursive way, with the goal of making both more accurate; this is of key importance in downstream tasks such as single-label multiclass classification and cost-sensitive text classification. Since its publication, SLD has become the standard algorithm for improving the quality of the posteriors in the presence of distribution shift, and SLD is still considered a top contender when we need to estimate the priors (a task that has become known as "quantification"). However, its real effectiveness in improving the quality of the posteriors has been questioned. We here present the results of systematic experiments conducted on a large, publicly available dataset, across multiple amounts of distribution shift and multiple learners. Our experiments show that SLD improves the quality of the posterior probabilities and of the estimates of the prior probabilities, but only when the number of classes in the classification scheme is very small and the classifier is calibrated. As the number of classes grows, or as we use non-calibrated classifiers, SLD converges more slowly (and often does not converge at all), performance degrades rapidly, and the impact of SLD on the quality of the prior estimates and of the posteriors becomes negative rather than positive.
Source: ACM transactions on information systems 39 (2021). doi:10.1145/3433164
DOI: 10.1145/3433164
Project(s): AI4Media via OpenAIRE, ARIADNEplus via OpenAIRE, SoBigData-PlusPlus via OpenAIRE

See at: ZENODO Open Access | ACM Transactions on Information Systems Open Access | ACM Transactions on Information Systems Restricted | dl.acm.org Restricted | CNR ExploRA Restricted
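The iterative, mutually recursive update that SLD performs is compact enough to sketch from the description in the abstract above. This is a hedged rendition of the underlying EM loop, not the authors' reference code; the function name and convergence threshold are my own choices.

```python
import numpy as np

def sld(posteriors, train_priors, epsilon=1e-6, max_iter=1000):
    """Saerens-Latinne-Decaestecker EM: jointly re-estimate the priors on the
    unlabelled set and adjust the classifier's posteriors under prior shift.

    posteriors:   (n_docs, n_classes) calibrated posteriors on unlabelled docs
    train_priors: (n_classes,) class prevalences observed in the training set
    Returns (adjusted_posteriors, estimated_priors)."""
    priors = train_priors.copy()
    for _ in range(max_iter):
        # E-step: rescale the original posteriors by the prior ratio, renormalise
        post = posteriors * (priors / train_priors)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: the new prior estimate is the mean of the adjusted posteriors
        new_priors = post.mean(axis=0)
        if np.abs(new_priors - priors).sum() < epsilon:
            break
        priors = new_priors
    return post, priors
```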


2021 Journal article Open Access

Word-class embeddings for multiclass text classification
Moreo A., Esuli A., Sebastiani F.
Pre-trained word embeddings encode general word semantics and lexical regularities of natural language, and have proven useful across many NLP tasks, including word sense disambiguation, machine translation, and sentiment analysis, to name a few. In supervised tasks such as multiclass text classification (the focus of this article) it seems appealing to enhance word representations with ad-hoc embeddings that encode task-specific information. We propose (supervised) word-class embeddings (WCEs), and show that, when concatenated to (unsupervised) pre-trained word embeddings, they substantially facilitate the training of deep-learning models in multiclass classification by topic. We show empirical evidence that WCEs yield a consistent improvement in multiclass classification accuracy, using six popular neural architectures and six widely used and publicly available datasets for multiclass text classification. One further advantage of this method is that it is conceptually simple and straightforward to implement. Our code that implements WCEs is publicly available at https://github.com/AlexMoreo/word-class-embeddings.
Source: Data mining and knowledge discovery 35 (2021): 911–963. doi:10.1007/s10618-020-00735-3
DOI: 10.1007/s10618-020-00735-3
DOI: 10.5281/zenodo.4468312
DOI: 10.5281/zenodo.4468313
Project(s): AI4Media via OpenAIRE, ARIADNEplus via OpenAIRE, SoBigData-PlusPlus via OpenAIRE

See at: arXiv.org e-Print Archive Open Access | ISTI Repository Open Access | link.springer.com Restricted | Data Mining and Knowledge Discovery Restricted | CNR ExploRA Restricted
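One simple way to realise the word-class embeddings described above is to represent each word by its normalised co-occurrence with each class. The sketch below is an illustrative approximation under that assumption; the authors' actual correlation functions are in the repository linked in the abstract.

```python
import numpy as np

def word_class_embeddings(X, y, n_classes):
    """Hedged sketch of (supervised) word-class embeddings: each word is
    represented by its correlation with each class, here approximated by
    class-conditional word counts, row-normalised per word.

    X: (n_docs, n_words) term-frequency matrix (dense, for simplicity)
    y: (n_docs,) integer class labels
    Returns an (n_words, n_classes) matrix, to be concatenated to the
    (unsupervised) pre-trained embedding matrix along the feature axis."""
    Y = np.eye(n_classes)[y]                 # (n_docs, n_classes) one-hot labels
    W = X.T @ Y                              # word-class co-occurrence counts
    W /= np.maximum(W.sum(axis=1, keepdims=True), 1e-12)
    return W
```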


2021 Journal article Open Access

Lost in transduction: transductive transfer learning in text classification
Moreo A., Esuli A., Sebastiani F.
Obtaining high-quality labelled data for training a classifier in a new application domain is often costly. Transfer Learning (a.k.a. "Inductive Transfer") tries to alleviate these costs by transferring, to the "target" domain of interest, knowledge available from a different "source" domain. In transfer learning the lack of labelled information from the target domain is compensated by the availability at training time of a set of unlabelled examples from the target distribution. Transductive Transfer Learning denotes the transfer learning setting in which the only set of target documents that we are interested in classifying is known and available at training time. Although this definition is indeed in line with Vapnik's original definition of "transduction", current terminology in the field is confused. In this article, we discuss how the term "transduction" has been misused in the transfer learning literature, and propose a clarification consistent with the original characterization of this term given by Vapnik. We go on to observe that the above terminology misuse has brought about misleading experimental comparisons, with inductive transfer learning methods that have been incorrectly compared with transductive transfer learning methods. We then give empirical evidence that the difference in performance between the inductive version and the transductive version of a transfer learning method can indeed be statistically significant (i.e., that knowing at training time the only data one needs to classify indeed gives an advantage). Our clarification allows a reassessment of the field, and of the relative merits of the major, state-of-the-art algorithms for transfer learning in text classification.
Source: ACM transactions on knowledge discovery from data 16 (2021). doi:10.1145/3453146
DOI: 10.1145/3453146
Project(s): ARIADNEplus via OpenAIRE

See at: ISTI Repository Open Access | dl.acm.org Restricted | CNR ExploRA Restricted


2021 Conference article Open Access

Heterogeneous document embeddings for cross-lingual text classification
Moreo A., Pedrotti A., Sebastiani F.
Funnelling (Fun) is a method for cross-lingual text classification (CLC) based on a two-tier ensemble for heterogeneous transfer learning. In Fun, 1st-tier classifiers, each working on a different, language-dependent feature space, return a vector of calibrated posterior probabilities (with one dimension for each class) for each document, and the final classification decision is taken by a metaclassifier that uses this vector as its input. The metaclassifier can thus exploit class-class correlations, and this (among other things) gives Fun an edge over CLC systems where these correlations cannot be leveraged. We here describe Generalized Funnelling (gFun), a learning ensemble where the metaclassifier receives as input the above vector of calibrated posterior probabilities, concatenated with document embeddings (aligned across languages) that embody other types of correlations, such as word-class correlations (as encoded by Word-Class Embeddings) and word-word correlations (as encoded by Multilingual Unsupervised or Supervised Embeddings). We show that gFun improves on Fun by describing experiments on two large, standard multilingual datasets for multi-label text classification.
Source: SAC 2021: 36th ACM/SIGAPP Symposium On Applied Computing, pp. 685–688, Online conference, 22-26/03/2021
DOI: 10.1145/3412841.3442093
Project(s): AI4Media via OpenAIRE, ARIADNEplus via OpenAIRE, SoBigData-PlusPlus via OpenAIRE

See at: ISTI Repository Open Access | ZENODO Open Access | dl.acm.org Restricted | CNR ExploRA Restricted
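The two-tier structure of Fun can be sketched with scikit-learn components. The code below is a deliberately simplified, single-label illustration: the actual method trains the metaclassifier on cross-validated first-tier posteriors and targets the multi-label case.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

def train_funnelling(data_by_lang):
    """data_by_lang: {lang: (X, y)} with language-specific feature matrices.
    Returns (first_tier, meta): per-language calibrated classifiers plus a
    metaclassifier operating on their posterior-probability vectors."""
    first_tier, meta_X, meta_y = {}, [], []
    for lang, (X, y) in data_by_lang.items():
        clf = CalibratedClassifierCV(LinearSVC())   # calibrated posteriors
        clf.fit(X, y)
        first_tier[lang] = clf
        meta_X.append(clf.predict_proba(X))         # language-independent view
        meta_y.append(y)
    meta = CalibratedClassifierCV(LinearSVC())      # 2nd-tier metaclassifier
    meta.fit(np.vstack(meta_X), np.concatenate(meta_y))
    return first_tier, meta

def funnelling_predict(first_tier, meta, lang, X):
    """Route a document through its language's 1st tier, then the metaclassifier."""
    return meta.predict(first_tier[lang].predict_proba(X))
```

In gFun, the vectors fed to the metaclassifier would additionally be concatenated with aligned document embeddings (e.g., WCEs or multilingual word embeddings).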


2021 Conference article Open Access

Re-assessing the "Classify and Count" quantification method
Moreo A., Sebastiani F.
Learning to quantify (a.k.a. quantification) is a task concerned with training unbiased estimators of class prevalence via supervised learning. This task originated with the observation that "Classify and Count" (CC), the trivial method of obtaining class prevalence estimates, is often a biased estimator, and thus delivers suboptimal quantification accuracy. Following this observation, several methods for learning to quantify have been proposed and have been shown to outperform CC. In this work we contend that previous works have failed to use properly optimised versions of CC. We thus reassess the real merits of CC and its variants, and argue that, while still inferior to some cutting-edge methods, they deliver near-state-of-the-art accuracy once (a) hyperparameter optimisation is performed, and (b) this optimisation is performed by using a truly quantification-oriented evaluation protocol. Experiments on three publicly available binary sentiment classification datasets support these conclusions.
Source: ECIR 2021 - 43rd European Conference on Information Retrieval, pp. 75–91, Online conference, 28/03-01/04/2021
DOI: 10.1007/978-3-030-72240-1_6
DOI: 10.5281/zenodo.4468277
DOI: 10.5281/zenodo.4468276
Project(s): AI4Media via OpenAIRE, SoBigData-PlusPlus via OpenAIRE

See at: arXiv.org e-Print Archive Open Access | ISTI Repository Open Access | ZENODO Open Access | link.springer.com Restricted | CNR ExploRA Restricted
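One of the CC variants reassessed in the paper above is "adjusted classify and count" (ACC), which corrects the raw CC estimate using the classifier's true-positive and false-positive rates. A minimal binary-case sketch; the 10-fold estimation of tpr/fpr and the clipping to [0, 1] are common choices, not necessarily the paper's exact setup.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict

def adjusted_classify_and_count(clf, X_train, y_train, X_unlabelled):
    """ACC for binary quantification: p = (cc - fpr) / (tpr - fpr)."""
    preds_cv = cross_val_predict(clf, X_train, y_train, cv=10)
    tpr = (preds_cv[y_train == 1] == 1).mean()    # true-positive rate
    fpr = (preds_cv[y_train == 0] == 1).mean()    # false-positive rate
    clf.fit(X_train, y_train)
    cc = (clf.predict(X_unlabelled) == 1).mean()  # raw classify-and-count
    return float(np.clip((cc - fpr) / (tpr - fpr), 0.0, 1.0))
```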


2021 Contribution to conference Open Access

Advances in Information Retrieval. 43rd European Conference on IR Research, ECIR 2021. Proceedings
Hiemstra D., Moens M. F., Mothe J., Perego R., Potthast M., Sebastiani F.
This two-volume set LNCS 12656 and 12657 constitutes the refereed proceedings of the 43rd European Conference on IR Research, ECIR 2021, held virtually in March/April 2021, due to the COVID-19 pandemic. The 50 full papers presented together with 11 reproducibility papers, 39 short papers, 15 demonstration papers, 12 CLEF lab description papers, 5 doctoral consortium papers, 5 workshop abstracts, and 8 tutorial abstracts were carefully reviewed and selected from 436 submissions. The accepted contributions cover the state of the art in IR: deep learning-based information retrieval techniques, use of entities and knowledge graphs, recommender systems, retrieval methods, information extraction, question answering, topic and prediction models, multimedia retrieval, and much more.
DOI: 10.1007/978-3-030-72240-1

See at: ISTI Repository Open Access | CNR ExploRA Open Access


2021 Contribution to journal Open Access

Report on the 43rd European Conference on Information Retrieval (ECIR 2021)
Perego R., Sebastiani F.
Source: SIGIR forum 55 (2021).

See at: ISTI Repository Open Access | CNR ExploRA Open Access | sigir.org Open Access


2021 Conference article Open Access

Garbled-word embeddings for jumbled text
Sperduti G., Moreo A., Sebastiani F.
"Aoccdrnig to a reasrech at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny itmopnrat tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe". We investigate the extent to which this phenomenon applies to computers as well. Our hypothesis is that computers are able to learn distributed word representations that are resilient to character reshuffling, without incurring a significant loss in performance in tasks that use these representations. If our hypothesis is confirmed, this may form the basis for a new and more efficient way of encoding character-based representations of text in deep learning, and one that may prove especially robust to misspellings, or to corruption of text due to OCR. This paper discusses some fundamental psycho-linguistic aspects that lie at the basis of the phenomenon we investigate, and reports on a preliminary proof of concept of the above idea.Source: IIR 2021 - 11th Italian Information Retrieval Workshop, Bari, Italy, 13-15/09/21

See at: ceur-ws.org Open Access | ISTI Repository Open Access | CNR ExploRA Open Access
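The character reshuffling quoted in the abstract above is easy to reproduce: keep the first and last letter of each word and shuffle the rest. A small Python sketch; the tokenisation regex is an illustrative choice.

```python
import random
import re

def jumble(text, seed=None):
    """Shuffle the inner letters of each word, keeping the first and last
    characters in place, as in the 'Cmabrigde' text quoted above."""
    rng = random.Random(seed)
    def jumble_word(match):
        w = match.group(0)
        if len(w) <= 3:
            return w                      # nothing to shuffle
        inner = list(w[1:-1])
        rng.shuffle(inner)
        return w[0] + ''.join(inner) + w[-1]
    return re.sub(r"[A-Za-z]+", jumble_word, text)

print(jumble("According to a research at Cambridge University", seed=42))
```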


2021 Conference article Open Access

Generalized funnelling: ensemble learning and heterogeneous document embeddings for cross-lingual text classification
Moreo A., Pedrotti A., Sebastiani F.
Funnelling (Fun) is a method for cross-lingual text classification (CLTC) based on a two-tier learning ensemble for heterogeneous transfer learning (HTL). In this ensemble method, 1st-tier classifiers, each working on a different and language-dependent feature space, return a vector of calibrated posterior probabilities (with one dimension for each class) for each document, and the final classification decision is taken by a metaclassifier that uses this vector as its input. In this paper we describe Generalized Funnelling (gFun), a generalization of Fun consisting of an HTL architecture in which 1st-tier components can be arbitrary view-generating functions, i.e., language-dependent functions that each produce a language-independent representation ("view") of the document. We describe an instance of gFun in which the metaclassifier receives as input a vector of calibrated posterior probabilities (as in Fun) combined with other embedded representations that embody other types of correlations. We describe preliminary results that we have obtained on a large standard dataset for multilingual multilabel text classification.
Source: IIR 2021 - 11th Italian Information Retrieval Workshop, Bari, Italy, 13-15/09/21

See at: ceur-ws.org Open Access | ISTI Repository Open Access | CNR ExploRA Open Access


2021 Conference article Open Access

QuaPy: a Python-based framework for quantification
Moreo A., Esuli A., Sebastiani F.
QuaPy is an open-source framework for performing quantification (a.k.a. supervised prevalence estimation), written in Python. Quantification is the task of training quantifiers via supervised learning, where a quantifier is a predictor that estimates the relative frequencies (a.k.a. prevalence values) of the classes of interest in a sample of unlabelled data. While quantification can be trivially performed by applying a standard classifier to each unlabelled data item and counting how many data items have been assigned to each class, it has been shown that this "classify and count" method is outperformed by methods specifically designed for quantification. QuaPy provides implementations of a number of baseline methods and advanced quantification methods, of routines for quantification-oriented model selection, of several broadly accepted evaluation measures, and of robust evaluation protocols routinely used in the field. QuaPy also makes available datasets commonly used for testing quantifiers, and offers visualization tools for facilitating the analysis and interpretation of the results. The software is open-source and publicly available under a BSD-3 licence via GitHub, and can be installed via pip.
Source: CIKM 2021 - 30th International Conference on Information and Knowledge Management, pp. 4534–4543, Online conference, 01-05/11/2021
DOI: 10.1145/3459637.3482015
Project(s): AI4Media via OpenAIRE, SoBigData-PlusPlus via OpenAIRE

See at: ISTI Repository Open Access | ZENODO Open Access | dl.acm.org Restricted | CNR ExploRA Restricted
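A typical QuaPy workflow, adapted from the project's README (https://github.com/HLT-ISTI/QuaPy); the API may differ across versions, so treat this as a sketch rather than version-exact code.

```python
import quapy as qp
from quapy.method.aggregative import ACC
from sklearn.linear_model import LogisticRegression

# Fetch one of the datasets shipped with QuaPy, as tf-idf vectors
data = qp.datasets.fetch_reviews('imdb', tfidf=True, min_df=5)

model = ACC(LogisticRegression())        # "adjusted classify and count" quantifier
model.fit(data.training)

estim_prev = model.quantify(data.test.instances)
true_prev = data.test.prevalence()
print('MAE =', qp.error.mae(true_prev, estim_prev))
```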


2021 Software Unknown

QuaPy
Moreo A., Esuli A., Sebastiani F.
QuaPy is an open-source framework for quantification (a.k.a. supervised prevalence estimation) written in Python. QuaPy is built around the concept of the data sample, and provides implementations of the most important concepts in the quantification literature, such as the most important quantification baselines, many advanced quantification methods, quantification-oriented model selection, and many evaluation measures and protocols used for evaluating quantification methods. QuaPy also integrates commonly used datasets and offers visualization tools for facilitating the analysis and interpretation of results.
Project(s): AI4Media via OpenAIRE, SoBigData-PlusPlus via OpenAIRE

See at: github.com | CNR ExploRA


2021 Report Open Access

AIMH research activities 2021
Aloia N., Amato G., Bartalesi V., Benedetti F., Bolettieri P., Cafarelli D., Carrara F., Casarosa V., Coccomini D., Ciampi L., Concordia C., Corbara S., Di Benedetto M., Esuli A., Falchi F., Gennaro C., Lagani G., Massoli F. V., Meghini C., Messina N., Metilli D., Molinari A., Moreo A., Nardi A., Pedrotti A., Pratelli N., Rabitti F., Savino P., Sebastiani F., Sperduti G., Thanos C., Trupiano L., Vadicamo L., Vairo C.
The Artificial Intelligence for Media and Humanities laboratory (AIMH) has the mission to investigate and advance the state of the art in the Artificial Intelligence field, specifically addressing applications to digital media and digital humanities, and also taking into account issues related to scalability. This report summarizes the 2021 activities of the research group.
Source: ISTI Annual Report, ISTI-2021-AR/003, pp.1–34, 2021
DOI: 10.32079/isti-ar-2021/003

See at: ISTI Repository Open Access | CNR ExploRA Open Access


2020 Journal article Open Access

Evaluation measures for quantification: an axiomatic approach
Sebastiani F.
Quantification is the task of estimating, given a set D of unlabelled items and a set of classes C = {c_1, ..., c_|C|}, the prevalence (or "relative frequency") in D of each class c_i ∈ C. While quantification may in principle be solved by classifying each item in D and counting how many such items have been labelled with c_i, it has long been shown that this "classify and count" method yields suboptimal quantification accuracy. As a result, quantification is no longer considered a mere byproduct of classification, and has evolved as a task of its own. While the scientific community has devoted a lot of attention to devising more accurate quantification methods, it has not devoted much to discussing what properties an evaluation measure for quantification (EMQ) should enjoy, and which EMQs should be adopted as a result. This paper lays down a number of interesting properties that an EMQ may or may not enjoy, discusses if (and when) each of these properties is desirable, surveys the EMQs that have been used so far, and discusses whether or not they enjoy the above properties. As a result of this investigation, some of the EMQs that have been used in the literature turn out to be severely unfit, while others emerge as closer to what the quantification community actually needs. However, a significant result is that no existing EMQ satisfies all the properties identified as desirable, thus indicating that more research is needed in order to identify (or synthesize) a truly adequate EMQ.
Source: Information retrieval (Boston) 23 (2020): 255–288. doi:10.1007/s10791-019-09363-y
DOI: 10.1007/s10791-019-09363-y

See at: arXiv.org e-Print Archive Open Access | Information Retrieval Open Access | ISTI Repository Open Access | Information Retrieval Restricted | CNR ExploRA Restricted
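For concreteness, three EMQs of the kind surveyed above are absolute error, relative absolute error, and the Kullback-Leibler divergence between the true and the estimated prevalence vectors. Hedged Python sketches; the additive smoothing shown for RAE is one common choice in the quantification literature, not necessarily the paper's.

```python
import numpy as np

def absolute_error(p, p_hat):
    """Mean absolute difference between true and estimated prevalences."""
    return float(np.abs(np.asarray(p) - np.asarray(p_hat)).mean())

def relative_absolute_error(p, p_hat, eps):
    """RAE with additive smoothing (e.g. eps = 1/(2*sample_size)), which
    avoids division by zero for classes with zero true prevalence."""
    p = (np.asarray(p, float) + eps) / (1 + eps * len(p))
    p_hat = (np.asarray(p_hat, float) + eps) / (1 + eps * len(p_hat))
    return float((np.abs(p_hat - p) / p).mean())

def kl_divergence(p, p_hat, eps=1e-12):
    """KL divergence of the estimated from the true class distribution."""
    p = np.asarray(p, float) + eps
    p_hat = np.asarray(p_hat, float) + eps
    return float((p * np.log(p / p_hat)).sum())
```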


2020 Journal article Open Access

Cross-Lingual Sentiment Quantification
Esuli A., Moreo A., Sebastiani F.
Sentiment Quantification is the task of estimating the relative frequency of sentiment-related classes, such as Positive and Negative, in a set of unlabeled documents. It is an important topic in sentiment analysis, as the study of sentiment-related quantities and trends across a population is often of higher interest than the analysis of individual instances. In this article, we propose a method for cross-lingual sentiment quantification, the task of performing sentiment quantification when training documents are available for a source language S, but not for the target language T, for which sentiment quantification needs to be performed. Cross-lingual sentiment quantification (and cross-lingual text quantification in general) has never been discussed before in the literature; we establish baseline results for the binary case by combining state-of-the-art quantification methods with methods capable of generating cross-lingual vectorial representations of the source and target documents involved. Experiments on publicly available datasets for cross-lingual sentiment classification show that the presented method performs cross-lingual sentiment quantification with high accuracy.
Source: IEEE intelligent systems 35 (2020): 106–113. doi:10.1109/MIS.2020.2979203
DOI: 10.1109/mis.2020.2979203
Project(s): SoBigData-PlusPlus via OpenAIRE

See at: IEEE Intelligent Systems Open Access | ISTI Repository Open Access | IEEE Intelligent Systems Restricted | ieeexplore.ieee.org Restricted | CNR ExploRA Restricted


2020 Report Open Access

Tweet Sentiment Quantification: An Experimental Re-Evaluation
Moreo A., Sebastiani F.
Sentiment quantification is the task of estimating the relative frequency (or "prevalence") of sentiment-related classes (such as Positive, Neutral, Negative) in a sample of unlabelled texts; this is especially important when these texts are tweets, since most sentiment classification endeavours carried out on Twitter data actually have quantification (and not the classification of individual tweets) as their ultimate goal. It is well known that solving quantification via "classify and count" (i.e., by classifying all unlabelled items via a standard classifier and counting the items that have been assigned to a given class) is suboptimal in terms of accuracy, and that more accurate quantification methods exist. In 2016, Gao and Sebastiani carried out a systematic comparison of quantification methods on the task of tweet sentiment quantification. In hindsight, we observe that the experimental protocol followed in that work is flawed, and that its results are thus unreliable. We now re-evaluate those quantification methods on the very same datasets, this time following a now consolidated and much more robust experimental protocol that involves 5,775 times as many experiments as were run in the original study. Our experimentation yields results dramatically different from those obtained by Gao and Sebastiani, and thus provides a different, much more solid understanding of the relative strengths and weaknesses of different sentiment quantification methods.
Source: Research report, SoBigData++ and AI4Media, 2020
Project(s): AI4Media via OpenAIRE, SoBigData-PlusPlus via OpenAIRE

See at: arxiv.org Open Access | ISTI Repository Open Access | CNR ExploRA Open Access
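The "consolidated and much more robust experimental protocol" referred to above is, in the quantification literature, typically the artificial-prevalence protocol (APP): a quantifier is evaluated on many test samples whose class prevalence is varied over a grid, so that accuracy is measured under controlled distribution shift. A hedged binary-case sketch; the parameter names are illustrative, and the pool is assumed large enough for sampling without replacement.

```python
import numpy as np

def app_samples(pos_idx, neg_idx, sample_size, prevalences, repeats, seed=0):
    """Yield index arrays for test samples whose positive-class prevalence
    sweeps the given grid (e.g. np.linspace(0.0, 1.0, 21)), `repeats` times each."""
    rng = np.random.default_rng(seed)
    for p in prevalences:
        n_pos = int(round(p * sample_size))
        for _ in range(repeats):
            pos = rng.choice(pos_idx, n_pos, replace=False)
            neg = rng.choice(neg_idx, sample_size - n_pos, replace=False)
            yield np.concatenate([pos, neg]).astype(int)
```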


2020 Report Open Access

Re-Assessing the "Classify and Count" Quantification Method
Moreo A., Sebastiani F.
Learning to quantify (a.k.a. "quantification") is a task concerned with training unbiased estimators of class prevalence via supervised learning. This task originated with the observation that "Classify and Count" (CC), the trivial method of obtaining class prevalence estimates, is often a biased estimator, and thus delivers suboptimal quantification accuracy; following this observation, several methods for learning to quantify have been proposed that have been shown to outperform CC. In this work we contend that previous works have failed to use properly optimised versions of CC. We thus reassess the real merits of CC (and its variants), and argue that, while still inferior to some cutting-edge methods, they deliver near-state-of-the-art accuracy once (a) hyperparameter optimisation is performed, and (b) this optimisation is performed by using a true quantification loss instead of a standard classification-based loss. Experiments on three publicly available binary sentiment classification datasets support these conclusions.
Source: Research report, SoBigData++ and AI4Media, 2020
Project(s): AI4Media via OpenAIRE, SoBigData-PlusPlus via OpenAIRE

See at: arxiv.org Open Access | ISTI Repository Open Access | CNR ExploRA Open Access


2020 Report Open Access

MedLatin1 and MedLatin2: Two Datasets for the Computational Authorship Analysis of Medieval Latin Texts
Corbara S., Moreo A., Sebastiani F., Tavoni M.
We present and make available MedLatin1 and MedLatin2, two datasets of medieval Latin texts to be used in research on computational authorship analysis. MedLatin1 and MedLatin2 consist of 294 and 30 curated texts, respectively, labelled by author, with MedLatin1 texts being of an epistolary nature and MedLatin2 texts consisting of literary comments and treatises about various subjects. As such, these two datasets lend themselves to supporting research in authorship analysis tasks, such as authorship attribution, authorship verification, or same-author verification.
Source: Research report, 2020

See at: arxiv.org Open Access | ISTI Repository Open Access | CNR ExploRA Open Access


2020 Contribution to conference Open Access

Evaluation Measures for Quantification: An Axiomatic Approach
Sebastiani F.
Source: 42nd European Conference on Information Retrieval, pp. 862–862, Lisbon, PT, 14-17/04/2020
DOI: 10.1007/978-3-030-45439-5

See at: link.springer.com Open Access | ISTI Repository Open Access | CNR ExploRA Open Access