2002 Journal article Open Access
Machine learning in automated text categorization
Sebastiani F.
The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.
Source: ACM computing surveys 34 (2002): 1–47. doi:10.1145/505282.505283
DOI: 10.1145/505282.505283
DOI: 10.48550/arxiv.cs/0110053

See at: arXiv.org e-Print Archive Open Access | ACM Computing Surveys Open Access | ACM Computing Surveys Restricted | doi.org Restricted | CNR ExploRA


2006 Journal article Restricted
Cluster generation and cluster labelling for Web snippets
Geraci F., Maggini M., Pellegrini M., Sebastiani F.
This paper describes Armil, a meta-search engine that groups into disjoint labelled clusters the Web snippets returned by auxiliary search engines. The cluster labels generated by Armil provide the user with a compact guide to assessing the relevance of each cluster to her information need. Striking the right balance between running time and cluster well-formedness was a key point in the design of our system. Both the clustering and the labelling tasks are performed on the fly by processing only the snippets provided by the auxiliary search engines, and use no external sources of knowledge. Clustering is performed by means of a fast version of the furthest-point-first algorithm for metric k-center clustering. Cluster labelling is achieved by combining intra-cluster and inter-cluster term extraction based on a variant of the information gain measure. We have tested the clustering effectiveness of Armil against Vivisimo, the de facto industrial standard in Web snippet clustering, using as benchmark a comprehensive set of snippets obtained from the Open Directory Project hierarchy. According to two widely accepted 'external' metrics of clustering quality, Armil achieves better performance levels by 10%. We also report the results of a thorough user evaluation of both the clustering and the cluster labelling algorithms.
Source: Lecture notes in computer science (2006): 25–36.

See at: www.springerlink.com Restricted | CNR ExploRA


2004 Contribution to conference Unknown
Introduction to Special Issue on the 25th European Conference on Information Retrieval Research
Sebastiani F.
[no abstract]
Source: Dordrecht: Kluwer Academic Publishers, 2004

See at: CNR ExploRA


2003 Journal article Unknown
Discretizing continuous attributes in AdaBoost for text categorization
Nardiello P., Sebastiani F., Sperduti A.
We focus on two recently proposed algorithms in the family of "boosting"-based learners for automated text classification, AdaBoost.MH and AdaBoost.MHKR. While the former is a realization of the well-known AdaBoost algorithm specifically aimed at multi-label text categorization, the latter is a generalization of the former based on the idea of learning a committee of classifier sub-committees. Both algorithms have been among the best performers in text categorization experiments so far. A problem in the use of both algorithms is that they require documents to be represented by binary vectors, indicating presence or absence of the terms in the document. As a consequence, these algorithms cannot take full advantage of the "weighted" representations (consisting of vectors of continuous attributes) that are customary in information retrieval tasks, and that provide a much more significant rendition of the document's content than binary representations. In this paper we address the problem of exploiting the potential of weighted representations in the context of AdaBoost-like algorithms by discretizing the continuous attributes through the application of entropy-based discretization methods. We present experimental results on the Reuters-21578 text categorization collection, showing that for both algorithms the version with discretized continuous attributes outperforms the version with traditional binary representations.
Source: Lecture notes in computer science 2633 (2003): 320–334.

See at: CNR ExploRA


2003 Contribution to conference Restricted
Editorial activity - ECIR 2003
Sebastiani F.
The European Conference on Information Retrieval Research, now in its 25th "Silver Jubilee" edition, was initially established by the Information Retrieval Specialist Group of the British Computer Society (BCS-IRSG) under the name "Annual Colloquium on Information Retrieval Research", and always held in the United Kingdom until 1997. Since 1998 the location of the colloquium has alternated between the United Kingdom and the rest of Europe, in order to reflect the growing European orientation of the event. For the same reason, in 2001 the event was renamed "European Annual Colloquium on Information Retrieval Research". Since 2002, the proceedings of the Colloquium are being published by Springer Verlag in their Lecture Notes in Computer Science series.
DOI: 10.1007/3-540-36618-0

See at: link.springer.com Restricted | CNR ExploRA


2009 Journal article Open Access
Preferential text classification: learning algorithms and evaluation measures
Aiolli F., Cardin R., Sebastiani F., Sperduti A.
In many applicative contexts in which textual documents are labelled with thematic categories, a distinction is made between the primary and the secondary categories that are attached to a given document. The primary categories represent the topics that are central to the document, while the secondary categories represent topics that the document somehow touches upon, albeit peripherally. This distinction has always been neglected in text categorization (TC) research. We contend that the distinction is important, and deserves to be explicitly tackled. The contribution of this paper is three-fold. First, we propose an evaluation measure for this preferential text categorization task, whereby different kinds of misclassifications involving either primary or secondary categories have a different impact on effectiveness. Second, we establish baseline results for this task on a well-known benchmark for patent classification in which the distinction between primary and secondary categories is present; these results are obtained by using state-of-the-art learning technology such as multiclass SVMs (for detecting the unique primary category) and binary SVMs (for detecting the secondary categories). Third, we improve on these results by using a recently proposed class of algorithms explicitly devised for learning from training data expressed in preferential form, i.e. in the form 'for document d_i, category c' is preferred to category c'''; this allows us to distinguish between primary and secondary categories not only in the testing phase but also in the learning phase, thus differentiating their impact on the classifiers to be generated.
Source: Information retrieval (Boston) 12 (2009): 559–580. doi:10.1007/S10791-008-9071-Y
DOI: 10.1007/s10791-008-9071-y

See at: Information Retrieval Open Access | Information Retrieval Restricted | www.springerlink.com Restricted | CNR ExploRA


2010 Contribution to book Open Access
Preface to ECDL 2010
Lalmas M., Jose J., Rauber A., Sebastiani F., Frommholz I.
An abstract is not available
DOI: 10.1007/978-3-642-15464-5

See at: eprints.whiterose.ac.uk Open Access | doi.org Restricted | www.springerlink.com Restricted | CNR ExploRA


2002 Journal article Unknown
Extending Thematic Lexical Resources by Term Categorization
Lavelli A., Magnini B., Sebastiani F.
Researchers from IEI-CNR, Pisa, and ITC-irst, Trento, are currently working on the automated construction of specialized lexicons, as part of an ongoing collaboration in the fields of Machine Learning and Information Retrieval. Increasing attention is being given to the generation of thematic lexicons (i.e. sets of specialized terms, pertaining to a given theme or discipline). Such lexicons are useful in a variety of tasks in the natural language processing and information access fields, including supporting information retrieval applications in the context of thematic, 'vertical' portals.
Source: ERCIM news online edition 50 (2002): 45–46.

See at: CNR ExploRA


2002 Journal article Unknown
Report on the 2nd Workshop on Operational Text Classification Systems (OTC-02)
Dumais S., Lewis D. D., Sebastiani F.
The research side of text classification has been widely discussed in conferences and journals. In contrast, operational text classification has been covered in the popular media, but less so in technical forums. Issues other than effectiveness, such as engineering and workflow issues, have not been widely discussed in published research. The goal of this workshop was to expose researchers and practitioners to the challenges encountered in building and fielding operational text classification systems.
Source: SIGIR forum 36 (2002): 68–71.

See at: CNR ExploRA


2003 Conference article Restricted
Discretizing continuous attributes in AdaBoost for text categorization
Nardiello P., Sebastiani F., Sperduti A.
We focus on two recently proposed algorithms in the family of "boosting"-based learners for automated text classification, AdaBoost.MH and AdaBoost.MHKR. While the former is a realization of the well-known AdaBoost algorithm specifically aimed at multi-label text categorization, the latter is a generalization of the former based on the idea of learning a committee of classifier sub-committees. Both algorithms have been among the best performers in text categorization experiments so far. A problem in the use of both algorithms is that they require documents to be represented by binary vectors, indicating presence or absence of the terms in the document. As a consequence, these algorithms cannot take full advantage of the "weighted" representations (consisting of vectors of continuous attributes) that are customary in information retrieval tasks, and that provide a much more significant rendition of the document's content than binary representations. In this paper we address the problem of exploiting the potential of weighted representations in the context of AdaBoost-like algorithms by discretizing the continuous attributes through the application of entropy-based discretization methods. We present experimental results on the Reuters-21578 text categorization collection, showing that for both algorithms the version with discretized continuous attributes outperforms the version with traditional binary representations.
Source: ECIR 2003 - 25th European Conference on Information Retrieval, pp. 320–334, Pisa, Italy, April 14-16, 2003
DOI: 10.1007/3-540-36618-0_23

See at: doi.org Restricted | link.springer.com Restricted | www.scopus.com Restricted | CNR ExploRA


2003 Journal article Unknown
Report on the 25th European Conference on Information Retrieval Research (ECIR-03)
Sebastiani F.
This is a report of the 25th European Conference on Information Retrieval Research (ECIR-03), which took place in Pisa, Italy, on 14-16 April 2003.
Source: SIGIR forum 37 (2003): 229–232.

See at: CNR ExploRA


2001 Journal article Unknown
Boosting algorithms for automated text categorization
Sebastiani F., Sperduti A., Valdambrini N.
As part of its Digital Library activities, and in collaboration with the Department of Computer Science of the University of Pisa, IEI-CNR is working on the construction of tools for the automatic or semi-automatic labeling of texts with thematic categories or subject codes.
Source: ERCIM news 44 (2001): 55–57.

See at: CNR ExploRA


2001 Journal article Unknown
Report on the workshop on operational text classification systems (OTC-01)
Lewis D. D., Sebastiani F.
The Workshop on Operational Text Classification (OTC-01) took place on September 13, 2001 in New Orleans, Louisiana, US. It was co-located with ACM SIGIR 2001 and brought together researchers, practitioners, and system designers interested in building and fielding operational text classification systems. The workshop organizers were David Lewis (chair), Susan Dumais, Ronen Feldman, and Fabrizio Sebastiani.
Source: SIGIR forum 35 (2001): 8–11.

See at: CNR ExploRA


2009 Journal article Unknown
Preferential text classification: learning algorithms and evaluation measures
Aiolli F., Cardin R., Sebastiani F., Sperduti A.
Researchers from ISTI-CNR, Pisa, and from the Department of Pure and Applied Mathematics at the University of Padova are explicitly attacking the document classification problem of distinguishing primary from secondary classes by using 'preferential learning' technology.
Source: ERCIM news 76 (2009): 60–61.

See at: CNR ExploRA


2003 Conference article Unknown
Research in automated classification of texts: trends and perspectives
Sebastiani F.
Text categorization (also known as text classification, or topic spotting) is the task of automatically sorting a set of documents into categories from a predefined set. This task has several applications, including automated indexing of scientific articles according to predefined thesauri of technical terms, filing patents into patent directories, selective dissemination of information to information consumers, automated population of hierarchical catalogues of Web resources, spam filtering, identification of document genre, authorship attribution, automated survey coding, and even automated essay grading. Automated text classification is attractive because it frees organizations from the need of manually organizing document bases, which can be too expensive, or simply infeasible given the time constraints of the application or the number of documents involved. The accuracy of modern text classification systems rivals that of trained human professionals, thanks to a combination of information retrieval (IR) technology and machine learning (ML) technology. This paper will outline the fundamental traits of the technologies involved, of the applications that can feasibly be tackled through text classification, and of the tools and resources that are available to the researcher and developer wishing to take up these technologies for deploying real-world applications.
Source: Fourth International Colloquium on Library and Information Science, pp. 298–311, Salamanca, 5-7 May 2003

See at: CNR ExploRA


2004 Conference article Unknown
An experimental comparison of term representations for term management applications
Lavelli A., Sebastiani F., Zanoli R.
A number of content management tasks, including term clustering, term categorization, and automated thesaurus generation, see natural language terms (e.g. words, noun phrases) as first-class objects, i.e. as objects endowed with an internal representation which makes them suitable for being explicitly manipulated by the corresponding algorithms. The information retrieval (IR) literature has traditionally used an extensional representation for terms according to which a term is represented by the 'bag of documents' in which the term occurs. The computational linguistics (CL) literature has independently developed an alternative extensional representation for terms, according to which a term is represented by the 'bag of terms' that co-occur with it in some document. This paper aims at discovering which of the two representations is most effective, i.e. brings about higher effectiveness once used in tasks that require terms to be explicitly represented and manipulated. In order to discover this we carry out experiments on a term categorization task, which allows us to compare the two different representations in closely controlled experimental conditions. We report the results of a large scale experimentation carried out by classifying under 42 different classes the terms extracted from a corpus of more than 60,000 documents. Our results show a substantial difference in effectiveness between the two representation styles; we give both an intuitive explanation and an information-theoretic justification for these different behaviours.
Source: SEBD 2004. 12° Convegno Nazionale su Sistemi Evoluti per Basi di Dati, pp. 190–201, S.Margherita di Pula, Cagliari, 21-23 June 2004

See at: CNR ExploRA


2004 Conference article Unknown
Distributional term representations: an experimental comparison
Lavelli A., Sebastiani F., Zanoli R.
A number of content management tasks, including term categorization, term clustering, and automated thesaurus generation, view natural language terms (e.g. words, noun phrases) as first-class objects, i.e. as objects endowed with an internal representation which makes them suitable for explicit manipulation by the corresponding algorithms. The information retrieval (IR) literature has traditionally used an extensional (aka distributional) representation for terms according to which a term is represented by the 'bag of documents' in which the term occurs. The computational linguistics (CL) literature has independently developed an alternative distributional representation for terms, according to which a term is represented by the 'bag of terms' that co-occur with it in some document. This paper aims at discovering which of the two representations is most effective, i.e. brings about higher effectiveness once used in tasks that require terms to be explicitly represented and manipulated. We carry out experiments on (i) a term categorization task, and (ii) a term clustering task; this allows us to compare the two different representations in closely controlled experimental conditions. We report the results of experiments in which we categorize/cluster under 42 different classes the terms extracted from a corpus of more than 65,000 documents. Our results show a substantial difference in effectiveness between the two representation styles; we give both an intuitive explanation and an information-theoretic justification for these different behaviours.
Source: 13th CIKM-04, ACM International Conference on Information and Knowledge Management, pp. 615–624, Washington, US, November 8-13, 2004

See at: CNR ExploRA


2001 Conference article Unknown
Organizing and using digital libraries by automated text categorization
Sebastiani F.
When it was proclaimed that the Library contained all books, the first impression was one of extravagant happiness. All men felt themselves to be the masters of an intact and secret treasure. There was no personal or world problem whose eloquent solution did not exist in some hexagon. (...)
Source: Intelligenza Artificiale per i Beni Culturali e le Biblioteche Digitali, pp. 93–94, Bari, Italy, 25 September 2001

See at: CNR ExploRA


2002 Conference article Unknown
Building thematic lexical resources by bootstrapping and machine learning
Lavelli A., Magnini B., Sebastiani F.
An abstract is not available
Source: LREC 2002. Linguistic Knowledge Acquisition and Representation - Bootstrapping Annotated Language Data, pp. 53–62, Las Palmas de Gran Canaria, Spain, 1 June 2002

See at: CNR ExploRA


2002 Conference article Unknown
Building thematic lexical resources by term categorization
Lavelli A., Magnini B., Sebastiani F.
We discuss the automatic generation of thematic lexicons by means of term categorization, a novel task employing techniques from information retrieval (IR) and machine learning (ML). Specifically, we view the generation of such lexicons as an iterative process of learning previously unknown associations between terms and themes (i.e. disciplines, or fields of activity). The process is iterative, in that it generates, for each c_i in a set C = {c_1, ..., c_m} of themes, a sequence L^i_0 ⊆ L^i_1 ⊆ ... ⊆ L^i_n of lexicons, bootstrapping from an initial lexicon L^i_0 and a set of text corpora Θ = {θ_0, ..., θ_{n-1}} given as input. The method is inspired by text categorization, the discipline concerned with labelling natural language texts with labels from a predefined set of themes, or categories. However, while text categorization deals with documents represented as vectors in a space of terms, term categorization deals (dually) with terms represented as vectors in a space of documents, and labels terms (instead of documents) with themes. As a learning device we adopt boosting, since (a) it has demonstrated state-of-the-art effectiveness in a variety of text categorization applications, and (b) it naturally allows for a form of "data cleaning", thereby making the process of generating a thematic lexicon an iteration of generate-and-test steps.
Source: SIGIR 2002. The Twenty-Fifth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 415–416, Tampere, Finland, 11-15 August 2002

See at: CNR ExploRA