68 result(s)

2023 Conference article Open Access
Ordinal quantification through regularization
Bunse M., Moreo A., Sebastiani F., Senz M.
Quantification, i.e., the task of training predictors of the class prevalence values in sets of unlabelled data items, has received increased attention in recent years. However, most quantification research has concentrated on developing algorithms for binary and multiclass problems in which the classes are not ordered. We here study the ordinal case, i.e., the case in which a total order is defined on the set of n > 2 classes. We give three main contributions to this field. First, we create and make available two datasets for ordinal quantification (OQ) research that overcome the inadequacies of the previously available ones. Second, we experimentally compare the most important OQ algorithms proposed in the literature so far. To this end, we bring together algorithms that are proposed by authors from very different research fields, who were unaware of each other's developments. Third, we propose three OQ algorithms, based on the idea of preventing ordinally implausible estimates through regularization. Our experiments show that these algorithms outperform the existing ones if the ordinal plausibility assumption holds. Source: ECML/PKDD 2022 - 33rd European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, pp. 36–52, Grenoble, FR, 19-23/09/2022. A toy sketch of one such smoothness regularizer follows this record.
DOI: 10.1007/978-3-031-26419-1_3
Project(s): AI4Media via OpenAIRE, SoBigData-PlusPlus via OpenAIRE

See at: ISTI Repository Open Access | link.springer.com Restricted | CNR ExploRA Restricted
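The regularization idea in the abstract above can be made concrete in a few lines. Below is a minimal sketch, assuming NumPy and a simple curvature penalty (squared second-order differences of the prevalence vector); it illustrates the general idea of penalizing ordinally implausible estimates, and is not necessarily the exact regularizer proposed in the paper.

    import numpy as np

    def ordinal_implausibility(prevalences: np.ndarray) -> float:
        # Curvature penalty: sum of squared second-order differences.
        # For ordered classes y1 < ... < yn, smooth (e.g., unimodal)
        # prevalence vectors score low; jagged, ordinally implausible
        # vectors score high.
        second_diffs = np.diff(prevalences, n=2)
        return float(np.sum(second_diffs ** 2))

    smooth = np.array([0.05, 0.20, 0.50, 0.20, 0.05])
    jagged = np.array([0.30, 0.05, 0.30, 0.05, 0.30])
    print(ordinal_implausibility(smooth))  # lower penalty: ordinally plausible
    print(ordinal_implausibility(jagged))  # higher penalty: ordinally implausible

A quantifier can then be regularized by adding such a penalty to its training loss, steering it away from ordinally implausible prevalence estimates.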


2023 Book Open Access
Learning to Quantify
Esuli A., Fabris A., Moreo A., Sebastiani F.
This open access book provides an introduction and an overview of learning to quantify (a.k.a. "quantification"), i.e. the task of training estimators of class proportions in unlabeled data by means of supervised learning. In data science, learning to quantify is a task of its own related to classification yet different from it, since estimating class proportions by simply classifying all data and counting the labels assigned by the classifier is known to often return inaccurate ("biased") class proportion estimates. The book introduces learning to quantify by looking at the supervised learning methods that can be used to perform it, at the evaluation measures and evaluation protocols that should be used for evaluating the quality of the returned predictions, at the numerous fields of human activity in which the use of quantification techniques may provide improved results with respect to the naive use of classification techniques, and at advanced topics in quantification research. The book is suitable for researchers, data scientists, and PhD students who want to come up to speed with the state of the art in learning to quantify, but also for researchers wishing to apply data science technologies to fields of human activity (e.g., the social sciences, political science, epidemiology, market research) which focus on aggregate ("macro") data rather than on individual ("micro") data. A small worked example of the bias mentioned above follows this record.
DOI: 10.1007/978-3-031-20467-8
Project(s): AI4Media via OpenAIRE, SoBigData-PlusPlus via OpenAIRE

See at: link.springer.com Open Access | CNR ExploRA Open Access
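To see why the "classify and count" approach mentioned in the abstract above is biased, consider a small worked example with assumed numbers: a classifier with true positive rate 0.8 and false positive rate 0.1, applied to a sample whose true positive prevalence is 0.5. Classify and count returns, in expectation, 0.8 × 0.5 + 0.1 × 0.5 = 0.45 rather than 0.5; the classic "adjusted count" correction inverts this relation and recovers (0.45 − 0.1) / (0.8 − 0.1) = 0.5.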


2022 Conference article Open Access
LeQua@CLEF2022: learning to quantify
Esuli A., Moreo A., Sebastiani F.
LeQua 2022 is a new lab for the evaluation of methods for "learning to quantify" in textual datasets, i.e., for training predictors of the relative frequencies of the classes of interest in sets of unlabelled textual documents. While these predictions could be easily achieved by first classifying all documents via a text classifier and then counting the numbers of documents assigned to the classes, a growing body of literature has shown this approach to be suboptimal, and has proposed better methods. The goal of this lab is to provide a setting for the comparative evaluation of methods for learning to quantify, both in the binary setting and in the single-label multiclass setting. For each such setting we provide data either in ready-made vector form or in raw document form. Source: ECIR 2022 - 44th European Conference on IR Research, pp. 374–381, Stavanger, Norway, 10-14/04/2022
DOI: 10.1007/978-3-030-99739-7_47
Project(s): AI4Media via OpenAIRE, SoBigData-PlusPlus via OpenAIRE

See at: ISTI Repository Open Access | link.springer.com Restricted | CNR ExploRA Restricted


2022 Journal article Open Access
Report on the 1st International Workshop on Learning to Quantify (LQ 2021)
Del Coz J. J., González P., Moreo A., Sebastiani F.
The 1st International Workshop on Learning to Quantify (LQ 2021 - https://cikmlq2021.github.io/), organized as a satellite event of the 30th ACM International Conference on Information and Knowledge Management (CIKM 2021), took place on two separate days, November 1 and 5, 2021. Like the main CIKM 2021 conference, the workshop was held entirely online, due to the COVID-19 pandemic. This report presents a summary of each keynote speech and contributed paper presented at this event, and discusses the issues that were raised during the workshop. Source: SIGKDD explorations (Online) 24 (2022): 49–51.
Project(s): AI4Media via OpenAIRE, SoBigData-PlusPlus via OpenAIRE

See at: kdd.org Open Access | ISTI Repository Open Access | CNR ExploRA Open Access


2022 Journal article Open Access
Syllabic quantity patterns as rhythmic features for Latin authorship attribution
Corbara S., Moreo A., Sebastiani F.
It is well known that, within the Latin production of written text, peculiar metric schemes were followed not only in poetic compositions, but also in many prose works. Such metric patterns were based on so-called syllabic quantity, that is, on the length of the involved syllables, and there is substantial evidence suggesting that certain authors had a preference for certain metric patterns over others. In this research we investigate the possibility of employing syllabic quantity as a basis for deriving rhythmic features for the task of computational authorship attribution of Latin prose texts. We test the impact of these features on the authorship attribution task when combined with other topic-agnostic features. Our experiments, carried out on three different datasets using support vector machines (SVMs), show that rhythmic features based on syllabic quantity are beneficial in discriminating among Latin prose authors. Source: Journal of the Association for Information Science and Technology (2022). doi:10.1002/asi.24660
DOI: 10.1002/asi.24660

See at: asistdl.onlinelibrary.wiley.com Open Access | ISTI Repository Open Access | CNR ExploRA Open Access


2022 Journal article Open Access
MedLatinEpi and MedLatinLit: two datasets for the computational authorship analysis of medieval Latin texts
Corbara S., Moreo A., Sebastiani F., Tavoni M.
We present and make available MedLatinEpi and MedLatinLit, two datasets of medieval Latin texts to be used in research on computational authorship analysis. MedLatinEpi and MedLatinLit consist of 294 and 30 curated texts, respectively, labelled by author; MedLatinEpi texts are of epistolary nature, while MedLatinLit texts consist of literary comments and treatises about various subjects. As such, these two datasets lend themselves to supporting research in authorship analysis tasks, such as authorship attribution, authorship verification, or same-author verification. Along with the datasets we provide experimental results, obtained on these datasets, for the authorship verification task, i.e., the task of predicting whether a text of unknown authorship was written by a candidate author or not. We also make available the source code of the authorship verification system we have used, thus allowing our experiments to be reproduced, and to be used as baselines, by other researchers. We also describe the application of the above authorship verification system, using these datasets as training data, for investigating the authorship of two medieval epistles whose authorship has been disputed by scholars. Source: ACM journal on computing and cultural heritage (Print) 3 (2022). doi:10.1145/3485822
DOI: 10.1145/3485822

See at: ISTI Repository Open Access | dl.acm.org Restricted | CNR ExploRA Restricted


2022 Conference article Open Access
A detailed overview of LeQua 2022: learning to quantify
Esuli A., Moreo A., Sebastiani F., Sperduti G.
LeQua 2022 is a new lab for the evaluation of methods for "learning to quantify" in textual datasets, i.e., for training predictors of the relative frequencies of the classes of interest Y = {y1, ..., yn} in sets of unlabelled textual documents. While these predictions could be easily achieved by first classifying all documents via a text classifier and then counting the numbers of documents assigned to the classes, a growing body of literature has shown this approach to be suboptimal, and has proposed better methods. The goal of this lab is to provide a setting for the comparative evaluation of methods for learning to quantify, both in the binary setting and in the single-label multiclass setting; this is the first time that an evaluation exercise solely dedicated to quantification is organized. For both the binary setting and the single-label multiclass setting, data were provided to participants both in ready-made vector form and in raw document form. In this overview article we describe the structure of the lab, we report the results obtained by the participants in the four proposed tasks and subtasks, and we comment on the lessons that can be learned from these results. Source: CLEF 2022 - 13th Conference and Labs of the Evaluation Forum, pp. 1849–1868, Bologna, Italy, 5-8/9/2022
Project(s): AI4Media via OpenAIRE, SoBigData-PlusPlus via OpenAIRE

See at: ceur-ws.org Open Access | ISTI Repository Open Access | CNR ExploRA Open Access


2022 Conference article Open Access
A concise overview of LeQua@CLEF 2022: Learning to Quantify
Esuli A., Moreo A., Sebastiani F., Sperduti G.
LeQua 2022 is a new lab for the evaluation of methods for "learning to quantify" in textual datasets, i.e., for training predictors of the relative frequencies of the classes of interest Y = {y1, ..., yn} in sets of unlabelled textual documents. While these predictions could be easily achieved by first classifying all documents via a text classifier and then counting the numbers of documents assigned to the classes, a growing body of literature has shown this approach to be suboptimal, and has proposed better methods. The goal of this lab is to provide a setting for the comparative evaluation of methods for learning to quantify, both in the binary setting and in the single-label multiclass setting; this is the first time that an evaluation exercise solely dedicated to quantification is organized. For both the binary setting and the single-label multiclass setting, data were provided to participants both in ready-made vector form and in raw document form. In this overview article we describe the structure of the lab, we report the results obtained by the participants in the four proposed tasks and subtasks, and we comment on the lessons that can be learned from these results. Source: CLEF 2022 - 13th Conference and Labs of the Evaluation Forum, pp. 362–381, Bologna, Italy, 5-8/9/2022
DOI: 10.1007/978-3-031-13643-6_23
Project(s): AI4Media via OpenAIRE, SoBigData-PlusPlus via OpenAIRE

See at: ISTI Repository Open Access | link.springer.com Restricted | CNR ExploRA Restricted


2022 Journal article Open Access
Tweet sentiment quantification: an experimental re-evaluation
Moreo A., Sebastiani F.
Sentiment quantification is the task of training, by means of supervised learning, estimators of the relative frequency (also called "prevalence") of sentiment-related classes (such as Positive, Neutral, Negative) in a sample of unlabelled texts. This task is especially important when these texts are tweets, since the final goal of most sentiment classification efforts carried out on Twitter data is actually quantification (and not the classification of individual tweets). It is well known that solving quantification by means of "classify and count" (i.e., by classifying all unlabelled items by means of a standard classifier and counting the items that have been assigned to a given class) is less than optimal in terms of accuracy, and that more accurate quantification methods exist. Gao and Sebastiani (2016) carried out a systematic comparison of quantification methods on the task of tweet sentiment quantification. In hindsight, we observe that the experimentation carried out in that work was weak, and that the reliability of the conclusions that were drawn from the results is thus questionable. We here re-evaluate those quantification methods (plus a few more modern ones) on exactly the same datasets, this time following a now consolidated and robust experimental protocol (which also involves simulating the presence, in the test data, of class prevalence values very different from those of the training set). This experimental protocol (even without counting the newly added methods) involves a number of experiments 5,775 times larger than that of the original study. Due to the above-mentioned presence, in the test data, of samples characterised by class prevalence values very different from those of the training set, the results of our experiments are dramatically different from those obtained by Gao and Sebastiani, and provide a different, much more solid understanding of the relative strengths and weaknesses of different sentiment quantification methods. Source: PloS one 17 (2022). doi:10.1371/journal.pone.0263449
DOI: 10.1371/journal.pone.0263449
Project(s): AI4Media via OpenAIRE, SoBigData-PlusPlus via OpenAIRE

See at: journals.plos.org Open Access | ISTI Repository Open Access | CNR ExploRA Open Access


2022 Contribution to conference Open Access
Proceedings of the 2nd International Workshop on Learning to Quantify (LQ 2022)
Del Coz J. J., González P., Moreo A., Sebastiani F.
The 2nd International Workshop on Learning to Quantify (LQ 2022 - https://lq-2022.github.io/) was held in Grenoble, FR, on September 23, 2022, as a satellite workshop of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD 2022). While the 1st edition of the workshop (LQ 2021 - https://cikmlq2021.github.io/), which was instead co-located with the 30th ACM International Conference on Information and Knowledge Management (CIKM 2021), had to be an entirely online event, LQ 2022 was a hybrid event, with presentations given in presence and with both in-presence and remote attendees. The workshop was a half-day event, and consisted of a keynote talk by Marco Saerens (Université Catholique de Louvain), presentations of four contributed papers, and a final collective discussion on the open problems of learning to quantify and on future initiatives. The present volume contains the four contributed papers that were accepted for presentation at the workshop. Each of these papers was submitted as a response to the call for papers, was reviewed by at least three members of the international program committee, and was revised by the authors so as to take into account the feedback provided by the reviewers. We hope that the availability of the present volume will increase the interest in the subject of quantification on the part of researchers and practitioners alike, and will contribute to making quantification better known to potential users of this technology and to researchers interested in advancing the field.
Project(s): AI4Media via OpenAIRE, SoBigData-PlusPlus via OpenAIRE

See at: lq-2022.github.io Open Access | ISTI Repository Open Access | CNR ExploRA Open Access


2022 Journal article Open Access
Generalized funnelling: ensemble learning and heterogeneous document embeddings for cross-lingual text classification
Moreo A., Pedrotti A., Sebastiani F.
Funnelling (Fun) is a recently proposed method for cross-lingual text classification (CLTC) based on a two-tier learning ensemble for heterogeneous transfer learning (HTL). In this ensemble method, 1st-tier classifiers, each working on a different and language-dependent feature space, return a vector of calibrated posterior probabilities (with one dimension for each class) for each document, and the final classification decision is taken by a meta-classifier that uses this vector as its input. The meta-classifier can thus exploit class-class correlations, and this (among other things) gives Fun an edge over CLTC systems in which these correlations cannot be brought to bear. In this paper we describe Generalized Funnelling (gFun), a generalisation of Fun consisting of an HTL architecture in which 1st-tier components can be arbitrary view-generating functions, i.e., language-dependent functions that each produce a language-independent representation ("view") of the (monolingual) document. We describe an instance of gFun in which the meta-classifier receives as input a vector of calibrated posterior probabilities (as in Fun) aggregated to other embedded representations that embody other types of correlations, such as word-class correlations (as encoded by Word-Class Embeddings), word-word correlations (as encoded by Multilingual Unsupervised or Supervised Embeddings), and word-context correlations (as encoded by multilingual BERT). We show that this instance of gFun substantially improves over Fun and over state-of-the-art baselines, by reporting experimental results obtained on two large, standard datasets for multilingual multilabel text classification. Our code that implements gFun is publicly available. Source: ACM transactions on information systems (2022). doi:10.1145/3544104. A minimal sketch of the two-tier architecture follows this record.
DOI: 10.1145/3544104
Project(s): AI4Media via OpenAIRE, SoBigData-PlusPlus via OpenAIRE

See at: dl.acm.org Open Access | ISTI Repository Open Access | CNR ExploRA Open Access
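The two-tier design described in the record above can be sketched compactly. The following is a minimal, single-label simplification, assuming scikit-learn, a shared label set across languages, and dictionaries mapping each language to its own feature matrix and labels; it is not the authors' implementation (in particular, it omits the cross-validated projection used to train the meta-classifier and the extra embedding views that gFun adds).

    import numpy as np
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.svm import LinearSVC

    class TwoTierFunnelling:
        def __init__(self):
            self.first_tier = {}  # language -> calibrated 1st-tier classifier
            self.meta = CalibratedClassifierCV(LinearSVC())

        def fit(self, X_by_lang, y_by_lang):
            views, labels = [], []
            for lang, X in X_by_lang.items():
                # 1st tier: one classifier per language, on its own feature space
                clf = CalibratedClassifierCV(LinearSVC()).fit(X, y_by_lang[lang])
                self.first_tier[lang] = clf
                # its calibrated posteriors act as a language-independent "view"
                views.append(clf.predict_proba(X))
                labels.append(y_by_lang[lang])
            # 2nd tier: meta-classifier trained on the stacked posterior vectors
            self.meta.fit(np.vstack(views), np.concatenate(labels))
            return self

        def predict(self, X, lang):
            # route documents through their language's 1st-tier classifier,
            # then through the shared meta-classifier
            return self.meta.predict(self.first_tier[lang].predict_proba(X))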


2022 Conference article Open Access
Rhythmic and psycholinguistic features for authorship tasks in the Spanish parliament: evaluation and analysis
Corbara S., Chulvi B., Rosso P., Moreo A.
Among the many tasks of the authorship field, Authorship Identification aims at uncovering the author of a document, while Author Profiling focuses on the analysis of personal characteristics of the author(s), such as gender, age, etc. Methods devised for such tasks typically focus on the style of the writing, and are expected not to make inferences grounded on the topics that certain authors tend to write about. In this paper, we present a series of experiments evaluating the use of topic-agnostic feature sets for Authorship Identification and Author Profiling tasks in Spanish political language. In particular, we propose to employ features based on rhythmic and psycholinguistic patterns, obtained via different approaches of text masking that we use to actively mask the underlying topic. We feed these feature sets to an SVM learner, and show that they lead to results that are comparable to those obtained by a BETO transformer, when the latter is trained on the original text, i.e., potentially learning from topical information. Moreover, we further investigate the results for the different authors, showing that variations in performance are partially explainable in terms of the authors' political affiliation and communication style. Source: CLEF 2022 - 13th Conference of the CLEF Association, pp. 79–92, Bologna, Italy, 5-8/9/2022
DOI: 10.1007/978-3-031-13643-6_6
Project(s): AI4Media via OpenAIRE

See at: ISTI Repository Open Access | link.springer.com Restricted | CNR ExploRA Restricted


2022 Conference article Open Access
Investigating topic-agnostic features for authorship tasks in Spanish political speeches
Corbara S., Chulvi Ferriols B., Rosso P., Moreo A.
Authorship Identification is the branch of authorship analysis concerned with uncovering the author of a written document. Methods devised for Authorship Identification typically employ stylometry (the analysis of unconscious traits that authors exhibit while writing), and are expected not to make inferences grounded on the topics the authors usually write about (as reflected in their past production). In this paper, we present a series of experiments evaluating the use of feature sets based on rhythmic and psycholinguistic patterns for Authorship Verification and Attribution in Spanish political language, via different approaches of text distortion used to actively mask the underlying topic. We feed these feature sets to an SVM learner, and show that they lead to results that are comparable to those obtained by the BETO transformer when the latter is trained on the original text, i.e., when potentially learning from topical information. Source: NLDB 2022 - 27th International Conference on Applications of Natural Language to Information Systems, pp. 394–402, Valencia, Spain, 15-17/6/2022
DOI: 10.1007/978-3-031-08473-7_36
Project(s): AI4Media via OpenAIRE

See at: ISTI Repository Open Access | doi.org Restricted | link.springer.com Restricted | CNR ExploRA Restricted


2022 Report Open Access
AIMH research activities 2022
Aloia N., Amato G., Bartalesi V., Benedetti F., Bolettieri P., Cafarelli D., Carrara F., Casarosa V., Ciampi L., Coccomini D. A., Concordia C., Corbara S., Di Benedetto M., Esuli A., Falchi F., Gennaro C., Lagani G., Lenzi E., Meghini C., Messina N., Metilli D., Molinari A., Moreo A., Nardi A., Pedrotti A., Pratelli N., Rabitti F., Savino P., Sebastiani F., Sperduti G., Thanos C., Trupiano L., Vadicamo L., Vairo C.
The Artificial Intelligence for Media and Humanities laboratory (AIMH) has the mission to investigate and advance the state of the art in the Artificial Intelligence field, specifically addressing applications to digital media and digital humanities, and also taking into account issues related to scalability. This report summarizes the 2022 activities of the research group. Source: ISTI Annual reports, 2022
DOI: 10.32079/isti-ar-2022/002

See at: ISTI Repository Open Access | CNR ExploRA Open Access


2021 Journal article Open Access
Word-class embeddings for multiclass text classification
Moreo A., Esuli A., Sebastiani F.
Pre-trained word embeddings encode general word semantics and lexical regularities of natural language, and have proven useful across many NLP tasks, including word sense disambiguation, machine translation, and sentiment analysis, to name a few. In supervised tasks such as multiclass text classification (the focus of this article) it seems appealing to enhance word representations with ad-hoc embeddings that encode task-specific information. We propose (supervised) word-class embeddings (WCEs), and show that, when concatenated to (unsupervised) pre-trained word embeddings, they substantially facilitate the training of deep-learning models in multiclass classification by topic. We show empirical evidence that WCEs yield a consistent improvement in multiclass classification accuracy, using six popular neural architectures and six widely used and publicly available datasets for multiclass text classification. One further advantage of this method is that it is conceptually simple and straightforward to implement. Our code that implements WCEs is publicly available at https://github.com/AlexMoreo/word-class-embeddings. Source: Data mining and knowledge discovery 35 (2021): 911–963. doi:10.1007/s10618-020-00735-3. A minimal sketch of the WCE construction follows this record.
DOI: 10.1007/s10618-020-00735-3
DOI: 10.48550/arxiv.1911.11506
DOI: 10.5281/zenodo.4468312
DOI: 10.5281/zenodo.4468313
Project(s): AI4Media via OpenAIRE, ARIADNEplus via OpenAIRE, SoBigData-PlusPlus via OpenAIRE

See at: arXiv.org e-Print Archive Open Access | Data Mining and Knowledge Discovery Open Access | ZENODO Open Access | ISTI Repository Open Access | Data Mining and Knowledge Discovery Restricted | doi.org Restricted | ZENODO Restricted | link.springer.com Restricted | CNR ExploRA Restricted
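The WCE construction described in the record above fits in a few lines. Below is a minimal sketch, assuming NumPy/SciPy, a single-label setting, a term-document matrix X (documents x vocabulary), and integer labels y; the correlation measure used here, an estimate of P(class | word) from co-occurrence counts, is one simple choice and may differ from the measure used in the paper.

    import numpy as np
    from scipy.sparse import csr_matrix

    def word_class_embeddings(X: csr_matrix, y: np.ndarray, n_classes: int) -> np.ndarray:
        # Returns a |V| x n_classes matrix whose row w is the supervised
        # embedding of word w: its occurrence-count distribution over classes,
        # normalized so that entry (w, c) approximates P(c | w).
        V = X.shape[1]
        wce = np.zeros((V, n_classes))
        for c in range(n_classes):
            # total occurrences of each word in documents of class c
            wce[:, c] = np.asarray(X[y == c].sum(axis=0)).ravel()
        totals = wce.sum(axis=1, keepdims=True)
        return wce / np.maximum(totals, 1e-9)  # avoid division by zero

Each row is then concatenated to the corresponding word's pre-trained (unsupervised) embedding before the joint embedding matrix is handed to the neural model.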


2021 Journal article Open Access
Lost in transduction: transductive transfer learning in text classification
Moreo A., Esuli A., Sebastiani F.
Obtaining high-quality labelled data for training a classifier in a new application domain is often costly. Transfer Learning (a.k.a. "Inductive Transfer") tries to alleviate these costs by transferring, to the "target" domain of interest, knowledge available from a different "source" domain. In transfer learning the lack of labelled information from the target domain is compensated by the availability at training time of a set of unlabelled examples from the target distribution. Transductive Transfer Learning denotes the transfer learning setting in which the only set of target documents that we are interested in classifying is known and available at training time. Although this definition is indeed in line with Vapnik's original definition of "transduction", current terminology in the field is confused. In this article, we discuss how the term "transduction" has been misused in the transfer learning literature, and propose a clarification consistent with the original characterization of this term given by Vapnik. We go on to observe that the above terminology misuse has brought about misleading experimental comparisons, with inductive transfer learning methods that have been incorrectly compared with transductive transfer learning methods. We then give empirical evidence that the difference in performance between the inductive version and the transductive version of a transfer learning method can indeed be statistically significant (i.e., that knowing at training time the only data one needs to classify indeed gives an advantage). Our clarification allows a reassessment of the field, and of the relative merits of the major, state-of-the-art algorithms for transfer learning in text classification. Source: ACM transactions on knowledge discovery from data 16 (2021). doi:10.1145/3453146
DOI: 10.1145/3453146
Project(s): ARIADNEplus via OpenAIRE

See at: ISTI Repository Open Access | dl.acm.org Restricted | CNR ExploRA Restricted


2021 Conference article Open Access
Heterogeneous document embeddings for cross-lingual text classification
Moreo A., Pedrotti A., Sebastiani F.
Funnelling (Fun) is a method for cross-lingual text classification (CLC) based on a two-tier ensemble for heterogeneous transfer learning. In Fun, 1st-tier classifiers, each working on a different, language-dependent feature space, return a vector of calibrated posterior probabilities (with one dimension for each class) for each document, and the final classification decision is taken by a meta-classifier that uses this vector as its input. The meta-classifier can thus exploit class-class correlations, and this (among other things) gives Fun an edge over CLC systems where these correlations cannot be leveraged. We here describe Generalized Funnelling (gFun), a learning ensemble where the meta-classifier receives as input the above vector of calibrated posterior probabilities, concatenated with document embeddings (aligned across languages) that embody other types of correlations, such as word-class correlations (as encoded by Word-Class Embeddings) and word-word correlations (as encoded by Multilingual Unsupervised or Supervised Embeddings). We show that gFun improves on Fun by describing experiments on two large, standard multilingual datasets for multi-label text classification. Source: SAC 2021: 36th ACM/SIGAPP Symposium On Applied Computing, pp. 685–688, Online conference, 22-26/03/2021
DOI: 10.1145/3412841.3442093
Project(s): AI4Media via OpenAIRE, ARIADNEplus via OpenAIRE, SoBigData-PlusPlus via OpenAIRE

See at: ISTI Repository Open Access | ZENODO Open Access | dl.acm.org Restricted | dl.acm.org Restricted | CNR ExploRA Restricted


2021 Conference article Open Access
Re-assessing the "Classify and Count" quantification method
Moreo A., Sebastiani F.
Learning to quantify (a.k.a. quantification) is a task concerned with training unbiased estimators of class prevalence via supervised learning. This task originated with the observation that "Classify and Count" (CC), the trivial method of obtaining class prevalence estimates, is often a biased estimator, and thus delivers suboptimal quantification accuracy. Following this observation, several methods for learning to quantify have been proposed and have been shown to outperform CC. In this work we contend that previous works have failed to use properly optimised versions of CC. We thus reassess the real merits of CC and its variants, and argue that, while still inferior to some cutting-edge methods, they deliver near-state-of-the-art accuracy once (a) hyperparameter optimisation is performed, and (b) this optimisation is performed by using a truly quantification-oriented evaluation protocol. Experiments on three publicly available binary sentiment classification datasets support these conclusions. Source: ECIR 2021 - 43rd European Conference on Information Retrieval, pp. 75–91, Online conference, 28/03-01/04/2021. A minimal sketch of CC and of its classic adjusted variant follows this record.
DOI: 10.1007/978-3-030-72240-1_6
DOI: 10.5281/zenodo.4468276
DOI: 10.48550/arxiv.2011.02552
DOI: 10.5281/zenodo.4468277
Project(s): AI4Media via OpenAIRE, SoBigData-PlusPlus via OpenAIRE

See at: arXiv.org e-Print Archive Open Access | arxiv.org Open Access | ZENODO Open Access | ZENODO Open Access | ISTI Repository Open Access | Lecture Notes in Computer Science Restricted | doi.org Restricted | link.springer.com Restricted | CNR ExploRA Restricted
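As a companion to the record above, here is a minimal sketch of CC and of its classic "adjusted" variant (ACC) for the binary case, assuming scikit-learn and hypothetical variable names (X_train, y_train, X_test, with labels in {0, 1}); it illustrates the textbook estimators, not the specific optimised configurations studied in the paper.

    import numpy as np
    from sklearn.base import clone
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict

    def classify_and_count(clf, X):
        # CC: estimated prevalence = fraction of items labelled positive
        return float(np.mean(clf.predict(X) == 1))

    def adjusted_classify_and_count(clf, X_train, y_train, X_test):
        # Estimate tpr/fpr on training data via cross-validation, then invert
        # E[cc] = tpr * p + fpr * (1 - p) to de-bias the CC estimate.
        y_hat = cross_val_predict(clone(clf), X_train, y_train, cv=10)
        tpr = float(np.mean(y_hat[y_train == 1] == 1))
        fpr = float(np.mean(y_hat[y_train == 0] == 1))
        cc = classify_and_count(clf, X_test)
        return float(np.clip((cc - fpr) / (tpr - fpr), 0.0, 1.0))

    # Usage sketch:
    # clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    # print(classify_and_count(clf, X_test))
    # print(adjusted_classify_and_count(clf, X_train, y_train, X_test))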


2021 Conference article Open Access
Garbled-word embeddings for jumbled text
Sperduti G., Moreo A., Sebastiani F.
"Aoccdrnig to a reasrech at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny itmopnrat tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe". We investigate the extent to which this phenomenon applies to computers as well. Our hypothesis is that computers are able to learn distributed word representations that are resilient to character reshuffling, without incurring a significant loss in performance in tasks that use these representations. If our hypothesis is confirmed, this may form the basis for a new and more efficient way of encoding character-based representations of text in deep learning, and one that may prove especially robust to misspellings, or to corruption of text due to OCR. This paper discusses some fundamental psycho-linguistic aspects that lie at the basis of the phenomenon we investigate, and reports on a preliminary proof of concept of the above idea.Source: IIR 2021 - 11th Italian Information Retrieval Workshop, Bari, Italy, 13-15/09/21

See at: ceur-ws.org Open Access | ISTI Repository Open Access | CNR ExploRA Open Access
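The character reshuffling studied in the record above is easy to reproduce. A minimal sketch, with hypothetical helper names: keep the first and last letter of each word fixed and shuffle the interior, which is exactly the transformation exhibited by the quoted passage.

    import random
    import re

    def jumble_word(word: str, rng: random.Random) -> str:
        # Words of up to three letters have no interior to shuffle
        if len(word) <= 3:
            return word
        interior = list(word[1:-1])
        rng.shuffle(interior)  # first and last characters stay in place
        return word[0] + "".join(interior) + word[-1]

    def jumble_text(text: str, seed: int = 0) -> str:
        rng = random.Random(seed)
        return re.sub(r"[A-Za-z]+", lambda m: jumble_word(m.group(), rng), text)

    print(jumble_text("According to a research at Cambridge University"))
    # output is seed-dependent, e.g. "Acnidcorg to a raerecsh at ..."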


2021 Conference article Open Access
Generalized funnelling: ensemble learning and heterogeneous document embeddings for cross-lingual text classification
Moreo A., Pedrotti A., Sebastiani F.
Funnelling (Fun) is a method for cross-lingual text classification (CLTC) based on a two-tier learning ensemble for heterogeneous transfer learning (HTL). In this ensemble method, 1st-tier classifiers, each working on a different and language-dependent feature space, return a vector of calibrated posterior probabilities (with one dimension for each class) for each document, and the final classification decision is taken by a meta-classifier that uses this vector as its input. In this paper we describe Generalized Funnelling (gFun), a generalization of Fun consisting of an HTL architecture in which 1st-tier components can be arbitrary view-generating functions, i.e., language-dependent functions that each produce a language-independent representation ("view") of the document. We describe an instance of gFun in which the meta-classifier receives as input a vector of calibrated posterior probabilities (as in Fun) aggregated to other embedded representations that embody other types of correlations. We describe preliminary results that we have obtained on a large standard dataset for multilingual multilabel text classification. Source: IIR 2021 - 11th Italian Information Retrieval Workshop, Bari, Italy, 13-15/09/21

See at: ceur-ws.org Open Access | ISTI Repository Open Access | CNR ExploRA Open Access