Discretizing continuous attributes in AdaBoost for text categorization

2003

Journal article Open Access

Nardiello P, Sebastiani F, Sperduti A

Text categorization

We focus on two recently proposed algorithms in the family of "boosting"-based learners for automated text classification, AdaBoost.MH and AdaBoost.MHKR. While the former is a realization of the well-known AdaBoost algorithm specifically aimed at multi-label text categorization, the latter is a generalization of the former based on the idea of learning a committee of classifier sub-committees. Both algorithms have been among the best performers in text categorization experiments so far. A problem in the use of both algorithms is that they require documents to be represented by binary vectors, indicating presence or absence of the terms in the document. As a consequence, these algorithms cannot take full advantage of the "weighted" representations (consisting of vectors of continuous attributes) that are customary in information retrieval tasks, and that provide a much more significant rendition of the document's content than binary representations. In this paper we address the problem of exploiting the potential of weighted representations in the context of AdaBoost-like algorithms by discretizing the continuous attributes through the application of entropy-based discretization methods. We present experimental results on the Reuters-21578 text categorization collection, showing that for both algorithms the version with discretized continuous attributes outperforms the version with traditional binary representations.
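
The general idea of entropy-based discretization mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify which discretization methods the paper uses, so the code below is an assumption for illustration only: it picks a single threshold on one term's continuous weight by maximizing information gain with respect to a binary category label, then maps each weight to a binary attribute of the kind AdaBoost.MH expects. The function names `entropy`, `best_cut`, and `binarize` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_cut(weights, labels):
    """Find the single threshold on `weights` with maximal information gain.

    Returns (threshold, gain), or (None, 0.0) if no cut improves on the
    unsplit entropy. This is an illustrative single-cut search, not the
    paper's method.
    """
    pairs = sorted(zip(weights, labels))
    n = len(pairs)
    base = entropy([y for _, y in pairs])
    best_t, best_gain = None, 0.0
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # a cut is only possible between distinct weight values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        # Entropy after the split, weighted by the size of each side.
        cond = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        gain = base - cond
        if gain > best_gain:
            best_t, best_gain = (lo + hi) / 2, gain
    return best_t, best_gain


def binarize(weight, threshold):
    """Map a continuous term weight to a binary attribute using the threshold."""
    return 1 if weight > threshold else 0
```

For example, calling `best_cut` on the weights of one term across the training documents, together with their labels for one category, yields a threshold that `binarize` can apply to turn each weight into a 0/1 attribute. Methods of this family often apply the cut recursively with a stopping criterion to obtain several intervals per attribute; that extension is omitted here.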

Cite as

BibTeX entry

@article{oai:it.cnr:prodotti:44074,
	title = {Discretizing continuous attributes in AdaBoost for text categorization},
	author = {Nardiello P and Sebastiani F and Sperduti A},
	year = {2003}
}

CNR authors and affiliations

CNR authors

Sebastiani, Fabrizio
0000-0003-4221-6427

Laboratories

Networked Multimedia Information System (2002-2020)
