2002
Journal article
Restricted
Machine learning in automated text categorisation
Sebastiani F
The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.
Source: ACM COMPUTING SURVEYS, vol. 34 (issue 1), pp. 1-47
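A minimal sketch of the inductive approach the survey describes (learning a classifier from a set of preclassified documents): the library, the toy documents, and the choice of a linear SVM are illustrative assumptions, not material taken from the survey itself.

# Hypothetical example: building a text classifier inductively from
# preclassified documents, using scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Toy training set: documents with known categories.
train_docs = ["wheat prices rose sharply", "the central bank cut interest rates"]
train_labels = ["grain", "money-fx"]

# Document representation (tf-idf vectors) + classifier construction (linear SVM).
classifier = make_pipeline(TfidfVectorizer(), LinearSVC())
classifier.fit(train_docs, train_labels)

# The learned classifier assigns categories to previously unseen documents.
print(classifier.predict(["rates were left unchanged by the bank"]))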
See at:
CNR IRIS | CNR IRIS
2003
Book
Restricted
Editorial activity - ECIR 2003
Sebastiani F
The European Conference on Information Retrieval Research, now in its 25th "Silver Jubilee" edition, was initially established by the Information Retrieval Specialist Group of the British Computer Society (BCS-IRSG) under the name "Annual Colloquium on Information Retrieval Research", and was always held in the United Kingdom until 1997. Since 1998 the location of the colloquium has alternated between the United Kingdom and the rest of Europe, in order to reflect the growing European orientation of the event. For the same reason, in 2001 the event was renamed "European Annual Colloquium on Information Retrieval Research". Since 2002, the proceedings of the colloquium have been published by Springer Verlag in the Lecture Notes in Computer Science series.
See at:
CNR IRIS | CNR IRIS | link.springer.com
2009
Journal article
Restricted
Preferential text classification: learning algorithms and evaluation measures
Aiolli F, Cardin R, Sebastiani F, Sperduti A
In many applicative contexts in which textual documents are labelled with thematic categories, a distinction is made between the primary and the secondary categories that are attached to a given document. The primary categories represent the topics that are central to the document, while the secondary categories represent topics that the document somehow touches upon, albeit peripherally. This distinction has always been neglected in text categorization (TC) research. We contend that the distinction is important, and deserves to be explicitly tackled. The contribution of this paper is three-fold. First, we propose an evaluation measure for this preferential text categorization task, whereby different kinds of misclassifications involving either primary or secondary categories have a different impact on effectiveness. Second, we establish baseline results for this task on a well-known benchmark for patent classification in which the distinction between primary and secondary categories is present; these results are obtained by using state-of-the-art learning technology such as multiclass SVMs (for detecting the unique primary category) and binary SVMs (for detecting the secondary categories). Third, we improve on these results by using a recently proposed class of algorithms explicitly devised for learning from training data expressed in preferential form, i.e. in the form "for document d_i, category c' is preferred to category c''"; this allows us to distinguish between primary and secondary categories not only in the testing phase but also in the learning phase, thus differentiating their impact on the classifiers to be generated.
Source: INFORMATION RETRIEVAL (BOSTON), vol. 12 (issue 5), pp. 559-580
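A minimal sketch of the baseline setup mentioned in the abstract, with a multiclass SVM for the single primary category and independent binary (one-vs-rest) SVMs for the secondary categories; the toy data, the library, and the category names are assumptions made for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

docs = ["engine cooling system", "brake pad material", "engine brake control"]
primary = ["engines", "brakes", "engines"]            # exactly one per document
secondary = [["cooling"], [], ["brakes", "control"]]  # zero or more per document

X = TfidfVectorizer().fit_transform(docs)

# Primary category: single-label multiclass SVM.
primary_clf = LinearSVC().fit(X, primary)

# Secondary categories: one independent binary SVM per category.
mlb = MultiLabelBinarizer()
secondary_clf = OneVsRestClassifier(LinearSVC()).fit(X, mlb.fit_transform(secondary))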
See at:
CNR IRIS | CNR IRIS | www.springerlink.com
2002
Journal article
Open Access
Extending thematic lexical resources by term categorization
Lavelli A, Magnini B, Sebastiani F
Researchers from IEI-CNR, Pisa, and ITC-irst, Trento, are currently working on the automated construction of specialized lexicons, as part of an ongoing collaboration in the fields of Machine Learning and Information Retrieval. Increasing attention is being given to the generation of thematic lexicons (i.e. sets of specialized terms, pertaining to a given theme or discipline). Such lexicons are useful in a variety of tasks in the natural language processing and information access fields, including supporting information retrieval applications in the context of thematic, 'vertical' portals.
Source: ERCIM NEWS, vol. 50, pp. 45-46
See at:
CNR IRIS | CNR IRIS
2002
Journal article
Restricted
Report on the 2nd Workshop on Operational Text Classification Systems (OTC-02)
Dumais S, Lewis Dd, Sebastiani F
The research side of text classification has been widely discussed in conferences and journals. In contrast, operational text classification has been covered in the popular media, but less so in technical forums. Issues other than effectiveness, such as engineering and workflow issues, have not been widely discussed in published research. The goal of this workshop was to expose researchers and practitioners to the challenges encountered in building and fielding operational text classification systems.
Source: SIGIR FORUM, vol. 36 (issue 2), pp. 68-71
See at:
CNR IRIS | CNR IRIS
2003
Conference article
Restricted
Discretizing continuous attributes in AdaBoost for text categorization
Nardiello P, Sebastiani F, Sperduti A
We focus on two recently proposed algorithms in the family of "boosting"-based learners for automated text classification, AdaBoost.MH and AdaBoost.MHKR. While the former is a realization of the well-known AdaBoost algorithm specifically aimed at multi-label text categorization, the latter is a generalization of the former based on the idea of learning a committee of classifier sub-committees. Both algorithms have been among the best performers in text categorization experiments so far. A problem in the use of both algorithms is that they require documents to be represented by binary vectors, indicating presence or absence of the terms in the document. As a consequence, these algorithms cannot take full advantage of the "weighted" representations (consisting of vectors of continuous attributes) that are customary in information retrieval tasks, and that provide a much more significant rendition of the document's content than binary representations. In this paper we address the problem of exploiting the potential of weighted representations in the context of AdaBoost-like algorithms by discretizing the continuous attributes through the application of entropy-based discretization methods. We present experimental results on the Reuters-21578 text categorization collection, showing that for both algorithms the version with discretized continuous attributes outperforms the version with traditional binary representations.
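A minimal sketch of the kind of entropy-based discretization the paper applies: for a single term, the continuous weight is binarized at the cut point that minimizes class entropy. The toy data, the single-split simplification, and the use of NumPy are assumptions made for illustration, not the paper's own procedure or code.

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_cut(weights, labels):
    # Threshold on one term's weights that minimizes the weighted class entropy.
    order = np.argsort(weights)
    w, y = weights[order], labels[order]
    best_t, best_h = None, np.inf
    for i in range(1, len(w)):
        if w[i] == w[i - 1]:
            continue
        t = (w[i] + w[i - 1]) / 2
        h = (i * entropy(y[:i]) + (len(w) - i) * entropy(y[i:])) / len(w)
        if h < best_h:
            best_t, best_h = t, h
    return best_t

# Hypothetical tf-idf weights of one term across six training documents.
weights = np.array([0.00, 0.05, 0.10, 0.40, 0.55, 0.60])
labels = np.array([0, 0, 0, 1, 1, 1])
t = best_cut(weights, labels)
binary_feature = (weights > t).astype(int)  # binarized attribute fed to the booster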
See at:
CNR IRIS | CNR IRIS | link.springer.com
2006
Journal article
Restricted
Automatic expansion of domain-specific lexicons by term categorization
Avancini H, Lavelli A, Sebastiani F, Zanoli R
We discuss an approach to the automatic expansion of domain-specific lexicons, i.e., to the problem of extending, for each c_i in a predefined set C = {c_1, ..., c_m} of semantic domains, an initial lexicon L_i^0 into a larger lexicon L_i^1. Our approach relies on term categorization, defined as the task of labeling previously unlabeled terms according to a predefined set of domains. We approach this as a supervised learning problem, in which term classifiers are built using the initial lexicons as training data. Dually to classic text categorization tasks, in which documents are represented as vectors in a space of terms, we represent terms as vectors in a space of documents. We present the results of a number of experiments in which we use a boosting-based learning device for training our term classifiers. We test the effectiveness of our method by using WordNetDomains, a well-known large set of domain-specific lexicons, as a benchmark. Our experiments are performed using the documents in the Reuters Corpus Volume 1 as 'implicit' representations for our terms.
Source: ACM TRANSACTIONS ON SPEECH AND LANGUAGE PROCESSING, vol. 3 (issue 1), pp. 1-30
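A minimal sketch of the dual representation described above: terms become vectors in a space of documents (the transpose of the usual document-term matrix), and the initial lexicons supply training examples for a domain classifier. The toy corpus, the seed lexicon, and the choice of logistic regression (in place of the boosting learner used in the paper) are assumptions made for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

corpus = [
    "the striker scored a late goal in the derby",
    "the referee booked two players for fouls",
    "the senate passed the budget bill",
    "the minister resigned after the vote",
]
vectorizer = CountVectorizer()
doc_term = vectorizer.fit_transform(corpus)  # documents x terms
term_doc = doc_term.T.tocsr()                # terms x documents: term vectors

terms = vectorizer.get_feature_names_out()
seed = {"goal": "sport", "referee": "sport", "senate": "politics", "vote": "politics"}

# Train on terms found in the initial lexicons, then label the remaining terms.
train_idx = [i for i, t in enumerate(terms) if t in seed]
clf = LogisticRegression().fit(term_doc[train_idx], [seed[terms[i]] for i in train_idx])
predicted_domains = dict(zip(terms, clf.predict(term_doc)))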
See at:
dl.acm.org | CNR IRIS | CNR IRIS
2001
Journal article
Open Access
Boosting algorithms for automated text categorization
Sebastiani F, Sperduti A, Valdambrini N
As part of its Digital Library activities, and in collaboration with the Department of Computer Science of the University of Pisa, IEI-CNR is working on the construction of tools for the automatic or semi-automatic labeling of texts with thematic categories or subject codes.
Source: ERCIM NEWS, vol. 44, pp. 55-57
See at:
CNR IRIS | CNR IRIS
2001
Journal article
Restricted
Report on the workshop on operational text classification systems (OTC-01)
Lewis Dd, Sebastiani F
The Workshop on Operational Text Classification (OTC-01) was held on September 13, 2001 in New Orleans, Louisiana, US. It was co-located with ACM SIGIR 2001 and brought together researchers, practitioners, and system designers interested in building and fielding operational text classification systems. The workshop organizers were David Lewis (chair), Susan Dumais, Ronen Feldman, and Fabrizio Sebastiani.
Source: SIGIR FORUM, vol. 35 (issue 2), pp. 8-11
See at:
CNR IRIS | CNR IRIS
2009
Journal article
Open Access
Preferential text classification: learning algorithms and evaluation measures
Aiolli F, Cardin R, Sebastiani F, Sperduti A
Researchers from ISTI-CNR, Pisa, and from the Department of Pure and Applied Mathematics at the University of Padova, are explicitly attacking the document classification problem of distinguishing primary from secondary classes by using 'preferential learning' technology.
Source: ERCIM NEWS, vol. 76, pp. 60-61
See at:
CNR IRIS | CNR IRIS
2003
Conference article
Restricted
Research in automated classification of texts: trends and perspectives
Sebastiani F
Text categorization (also known as text classification, or topic spotting) is the task of automatically sorting a set of documents into categories from a predefined set. This task has several applications, including automated indexing of scientific articles according to predefined thesauri of technical terms, filing patents into patent directories, selective dissemination of information to information consumers, automated population of hierarchical catalogues of Web resources, spam filtering, identification of document genre, authorship attribution, automated survey coding, and even automated essay grading. Automated text classification is attractive because it frees organizations from the need to manually organize document bases, which can be too expensive, or simply infeasible given the time constraints of the application or the number of documents involved. The accuracy of modern text classification systems rivals that of trained human professionals, thanks to a combination of information retrieval (IR) technology and machine learning (ML) technology. This paper will outline the fundamental traits of the technologies involved, of the applications that can feasibly be tackled through text classification, and of the tools and resources that are available to the researcher and developer wishing to take up these technologies for deploying real-world applications.
See at:
CNR IRIS | CNR IRIS
2003
Conference article
Restricted
Expanding Domain-Specific Lexicons by Term Categorization
Avancini H, Lavelli A, Magnini B, Sebastiani F, Zanoli R
We discuss an approach to the automatic expansion of domain-specific lexicons by means of term categorization, a novel task employing techniques from information retrieval (IR) and machine learning (ML). Specifically, we view the expansion of such lexicons as a process of learning previously unknown associations between terms and domains. The process generates, for each c_i in a set C = {c_1, ..., c_m} of domains, a lexicon L_i^1, bootstrapping from an initial lexicon L_i^0 and a set of documents given as input. The method is inspired by text categorization (TC), the discipline concerned with labelling natural language texts with labels from a predefined set of domains, or categories. However, while TC deals with documents represented as vectors in a space of terms, we formulate the task of term categorization as one in which terms are (dually) represented as vectors in a space of documents, and in which terms (instead of documents) are labelled with domains.
See at:
CNR IRIS | CNR IRIS
2004
Conference article
Restricted
An experimental comparison of term representations for term management applications
Lavelli A, Sebastiani F, Zanoli R
A number of content management tasks, including term clustering, term categorization, and automated thesaurus generation, see natural language terms (e.g. words, noun phrases) as first-class objects, i.e. as objects endowed with an internal representation which makes them suitable for being explicitly manipulated by the corresponding algorithms. The information retrieval (IR) literature has traditionally used an extensional representation for terms according to which a term is represented by the 'bag of documents' in which the term occurs. The computational linguistics (CL) literature has independently developed an alternative extensional representation for terms, according to which a term is represented by the 'bag of terms' that co-occur with it in some document. This paper aims at discovering which of the two representations is most effective, i.e. brings about higher effectiveness once used in tasks that require terms to be explicitly represented and manipulated. In order to discover this we carry out experiments on a term categorization task, which allows us to compare the two different representations in closely controlled experimental conditions. We report the results of a large-scale experimentation carried out by classifying under 42 different classes the terms extracted from a corpus of more than 60,000 documents. Our results show a substantial difference in effectiveness between the two representation styles; we give both an intuitive explanation and an information-theoretic justification for these different behaviours.
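A minimal sketch of the two term representations being compared, built from a toy corpus; the corpus and the raw co-occurrence counts are illustrative assumptions (the paper works with a much larger document collection and weighted variants of these vectors).

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "interest rates and inflation worry the markets",
    "the central bank raised interest rates",
    "the team won the championship final",
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus).toarray()  # documents x terms

bag_of_documents = X.T             # IR-style: a term is the vector of documents it occurs in
bag_of_terms = X.T @ X             # CL-style: a term is the vector of terms it co-occurs with
np.fill_diagonal(bag_of_terms, 0)  # a term is not part of its own context

i = list(vectorizer.get_feature_names_out()).index("rates")
print(bag_of_documents[i])         # occurrence profile over documents
print(bag_of_terms[i])             # co-occurrence profile over terms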
See at:
CNR IRIS | CNR IRIS
2004
Conference article
Restricted
Distributional term representations: an experimental comparison
Lavelli A, Sebastiani F, Zanoli R
A number of content management tasks, including term categorization, term clustering, and automated thesaurus generation, view natural language terms (e.g. words, noun phrases) as first-class objects, i.e. as objects endowed with an internal representation which makes them suitable for explicit manipulation by the corresponding algorithms. The information retrieval (IR) literature has traditionally used an extensional (aka distributional) representation for terms according to which a term is represented by the 'bag of documents' in which the term occurs. The computational linguistics (CL) literature has independently developed an alternative distributional representation for terms, according to which a term is represented by the 'bag of terms' that co-occur with it in some document. This paper aims at discovering which of the two representations is most effective, i.e. brings about higher effectiveness once used in tasks that require terms to be explicitly represented and manipulated. We carry out experiments on (i) a term categorization task, and (ii) a term clustering task; this allows us to compare the two different representations in closely controlled experimental conditions. We report the results of experiments in which we categorize/cluster under 42 different classes the terms extracted from a corpus of more than 65,000 documents. Our results show a substantial difference in effectiveness between the two representation styles; we give both an intuitive explanation and an information-theoretic justification for these different behaviours.
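A minimal sketch of the term clustering side of the comparison: terms, represented here by their co-occurrence ('bag of terms') vectors, are grouped with k-means. The toy corpus, the number of clusters, and the choice of k-means are assumptions made for illustration.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

corpus = [
    "stocks fell as bond yields rose",
    "the bank cut rates to calm bond markets",
    "the midfielder scored twice in the cup match",
    "the coach praised the match officials",
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus).toarray()  # documents x terms
term_vectors = X.T @ X                          # 'bag of terms' representation
np.fill_diagonal(term_vectors, 0)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(term_vectors)
for term, cluster in zip(vectorizer.get_feature_names_out(), labels):
    print(cluster, term)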
See at:
CNR IRIS | CNR IRIS
2004
Conference article
Restricted
Organizing digital libraries by automated text categorization
Avancini H, Rauber A, Sebastiani F
Text Categorization (TC) is the discipline concerned with the construction of automatic text classifiers, i.e. programs capable of assigning to a document one or more among a set of predefined categories based on the content of the document. Building these classifiers is itself done automatically, by means of a general inductive process that learns the characteristics of the categories from a set of preclassified documents. In this paper we discuss a class of applications, automatic indexing with controlled vocabularies, that is of direct concern to organizing digital libraries. We exemplify this class of applications by discussing an ongoing project aimed at classifying scientific papers about computer science with respect to the ACM Classification Scheme.
See at:
CNR IRIS | CNR IRIS
2003
Conference article
Restricted
Automatic coding of open-ended surveys using text categorization techniques
Giorgetti D, Sebastiani F, Prodanof I
Open-ended questions do not limit respondents' answers in terms of linguistic form and semantic content, but bring about severe problems in terms of cost and speed, since their coding requires trained professionals to manually identify and tag meaningful text segments. To overcome these problems, a few automatic approaches have been proposed in the past, some based on matching the answer with textual descriptions of the codes, others based on manually building rules that check the answer for the presence or absence of code-revealing words. While the former approach is scarcely effective, the major drawback of the latter approach is that the rules need to be developed manually, and before the actual observation of text data. We propose a new approach, inspired by work in information retrieval (IR), that overcomes these drawbacks. In this approach survey coding is viewed as a task of multiclass text categorization (MTC), and is tackled through techniques originally developed in the field of supervised machine learning. In MTC each text belonging to a given corpus has to be classified into exactly one from a set of predefined categories. In the supervised machine learning approach to MTC, a set of categorization rules is built automatically by learning the characteristics that a text should have in order to be classified under a given category. Such characteristics are automatically learnt from a set of training examples, i.e. a set of texts whose category is known. For survey coding, we equate the set of codes with categories, and all the collected answers to a given question with texts. Giorgetti and Sebastiani have carried out automatic coding experiments with two different supervised learning techniques, one based on a naïve Bayesian method and the other based on multiclass support vector machines. Experiments have been run on a corpus of social surveys carried out by the National Opinion Research Center, University of Chicago (NORC). These experiments show that our methods outperform, in terms of accuracy, previous automated methods tested on the same corpus.
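A minimal sketch of survey coding cast as single-label text categorization with the naïve Bayesian method mentioned above; the answers, the codes, and the use of scikit-learn's multinomial naïve Bayes are illustrative assumptions (the paper's experiments use the NORC survey corpus, not this toy data).

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

answers = [
    "I lost my job last year and could not find another",
    "unemployment in my town is very high",
    "my biggest worry is paying the doctor and the hospital",
    "health insurance costs too much for my family",
]
codes = ["employment", "employment", "health", "health"]

# Each collected answer is a text; each survey code is a category.
coder = make_pipeline(CountVectorizer(), MultinomialNB())
coder.fit(answers, codes)
print(coder.predict(["I am afraid of being laid off"]))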
See at:
CNR IRIS | CNR IRIS