Document - A learner-independent evaluation of the usefulness of statistical phrases for automated text categorization

2001

Book Unknown

A learner-independent evaluation of the usefulness of statistical phrases for automated text categorization

Caropreso M. F., Matwin S., Sebastiani F.

Document indexing in text categorization; Information search and retrieval

In this work we investigate the usefulness of {\em $n$-grams} for document indexing in text categorization (TC). We call an $n$-gram a set $g_k$ of $n$ word stems, and we say that $g_k$ occurs in a document $d_j$ when a sequence of words appears in $d_j$ that, after stop word removal and stemming, consists exactly of the $n$ stems in $g_k$, in some order. Previous research has investigated the use of $n$-grams (or some variant of them) in the context of specific learning algorithms, and thus has not obtained general answers on their usefulness for TC. In this work we investigate the usefulness of $n$-grams in TC independently of any specific learning algorithm. We do so by applying feature selection to the pool of all $k$-grams ($k \leq n$), and checking how many $n$-grams score high enough to be selected in the top $\sigma$ $k$-grams. We report the results of our experiments, performed on the {\sc Reuters-21578} standard TC benchmark, using various feature selection measures and varying values of $\sigma$. We also report the results of actually using the selected $n$-grams in a linear classifier induced by means of the Rocchio method.
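
The definition and the selection procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stop list, the `toy_stem` function, and the helper names are hypothetical stand-ins, and document frequency is used as an assumed example of a feature selection measure, since the abstract does not name the measures that were tested.

```python
from collections import Counter

# Toy stop list: an illustrative stand-in, not the one used in the paper.
STOP_WORDS = {"the", "of", "a", "an", "in", "and", "to", "is"}


def toy_stem(word):
    """Strip a trailing 's' as a crude stand-in for a real stemmer."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def preprocess(text):
    """Lowercase, drop stop words, and stem the remaining tokens."""
    tokens = text.lower().split()
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]


def kgrams(stems, k):
    """Return the k-grams of a stem sequence.

    A k-gram is an unordered set of k distinct stems that appear as
    k consecutive stems, so word order inside the window is ignored.
    """
    grams = set()
    for i in range(len(stems) - k + 1):
        window = frozenset(stems[i:i + k])
        if len(window) == k:  # skip windows that repeat a stem
            grams.add(window)
    return grams


def top_sigma_by_size(docs, n, sigma):
    """Pool all k-grams (k <= n), rank them, and count the top sigma by size.

    Ranking uses document frequency, an assumed example measure.
    """
    doc_freq = Counter()
    for text in docs:
        stems = preprocess(text)
        for k in range(1, n + 1):
            doc_freq.update(kgrams(stems, k))
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(doc_freq, key=lambda g: (-doc_freq[g], sorted(g)))
    return Counter(len(g) for g in ranked[:sigma])


docs = ["interest rates rise", "rise of interest rates", "the bank rates"]
print(top_sigma_by_size(docs, n=2, sigma=4))
```

On this toy corpus, three of the four selected features are single stems and one is the 2-gram {interest, rate}. Because each $k$-gram is stored as a `frozenset`, "interest rates" and "rates of interest" map to the same feature, which matches the "in some order" clause of the definition.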


Cite as

BibTeX entry

@book{oai:it.cnr:prodotti:138923,
	title = {A learner-independent evaluation of the usefulness of statistical phrases for automated text categorization},
	author = {Caropreso M. F. and Matwin S. and Sebastiani F.},
	year = {2001}
}

CNR authors and affiliations

CNR authors

Sebastiani, Fabrizio
0000-0003-4221-6427
