2002

Journal article Open Access

Machine learning in automated text categorisation

Sebastiani F.

Computer Science - Machine Learning; General Computer Science; I.2.3; Information Retrieval (cs.IR); FOS: Computer and information sciences; Computer Science - Information Retrieval; Theoretical Computer Science; Machine learning; Information retrieval; H.3.1; Machine Learning (cs.LG); H.3.3; Text classification

The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.
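The three problems named in the abstract (document representation, classifier construction, and classifier evaluation) can be illustrated with a minimal sketch. The code below is not taken from the survey: it assumes a TF-IDF bag-of-words representation, a Rocchio-style centroid classifier chosen only for brevity, and per-category precision and recall as the evaluation measure. All function names and the toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenization (an assumption; real systems do more)."""
    return text.lower().split()


def fit_idf(docs):
    """Document representation, step 1: smoothed inverse document frequencies."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}


def vectorize(doc, idf):
    """Document representation, step 2: length-normalized TF-IDF vector."""
    tf = Counter(t for t in tokenize(doc) if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def train_centroids(docs, labels, idf):
    """Classifier construction: one mean vector per category, learned
    inductively from the preclassified training documents."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = Counter(labels)
    for doc, label in zip(docs, labels):
        for t, w in vectorize(doc, idf).items():
            sums[label][t] += w
    return {c: {t: w / counts[c] for t, w in vec.items()}
            for c, vec in sums.items()}


def classify(doc, centroids, idf):
    """Assign the category whose centroid has the largest dot product."""
    vec = vectorize(doc, idf)

    def score(category):
        centroid = centroids[category]
        return sum(w * centroid.get(t, 0.0) for t, w in vec.items())

    return max(sorted(centroids), key=score)


def precision_recall(predicted, actual, category):
    """Classifier evaluation: precision and recall for a single category."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == category and a == category for p, a in pairs)
    fp = sum(p == category and a != category for p, a in pairs)
    fn = sum(p != category and a == category for p, a in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy preclassified corpus, invented purely for illustration.
train_docs = [
    "wheat harvest grain prices",
    "grain exports wheat crop",
    "bank interest rates rise",
    "central bank cuts interest",
]
train_labels = ["grain", "grain", "money", "money"]
test_docs = ["wheat crop prices", "bank rates"]
test_labels = ["grain", "money"]

idf = fit_idf(train_docs)
centroids = train_centroids(train_docs, train_labels, idf)
predicted = [classify(doc, centroids, idf) for doc in test_docs]
print(predicted, precision_recall(predicted, test_labels, "grain"))
```

The classifier here is built entirely from the labeled examples, with no hand-written rules, which is the contrast with the knowledge engineering approach that the abstract draws.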

Source: ACM computing surveys 34 (2002): 1–47. doi:10.1145/505282.505283

Publisher: Association for Computing Machinery, New York, N.Y., United States of America

Citations

Amati, G. and Crestani, F. 1999. Probabilistic learning for selective dissemination of information. Information Processing and Management 35, 5, 633-654.
Androutsopoulos, I., Koutsias, J., Chandrinos, K. V., and Spyropoulos, C. D. 2000. An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages. In Proceedings of SIGIR-00, 23rd ACM International Conference on Research and Development in Information Retrieval (Athens, GR, 2000), pp. 160-167.
Apté, C., Damerau, F. J., and Weiss, S. M. 1994. Automated learning of decision rules for text categorization. ACM Transactions on Information Systems 12, 3, 233-251.
Attardi, G., Di Marco, S., and Salvi, D. 1998. Universal Computer Science 4, 9, 719-736.
Cavnar, W. B. and Trenkle, J. M. 1994. N-gram-based text categorization. In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval (Las Vegas, US, 1994), pp. 161-175.
Chakrabarti, S., Dom, B. E., Agrawal, R., and Raghavan, P. 1998a. Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies. Journal of Very Large Data Bases 7, 3, 163-178.
Chakrabarti, S., Dom, B. E., and Indyk, P. 1998b. Enhanced hypertext categorization using hyperlinks. In Proceedings of SIGMOD-98, ACM International Conference on Management of Data (Seattle, US, 1998), pp. 307-318.
Clack, C., Farringdon, J., Lidwell, P., and Yu, T. 1997. Autonomous document classification for business. In Proceedings of the 1st International Conference on Autonomous Agents (Marina del Rey, US, 1997), pp. 201-208.
Cleverdon, C. 1984. Optimizing convenient online access to bibliographic databases. Information Services and Use 4, 1, 37-47. Also reprinted in [Willett 1988], pp. 32-41.
Cohen, W. W. 1995a. Learning to classify English text with ILP methods. In L. De Raedt Ed., Advances in inductive logic programming, pp. 124-143. Amsterdam, NL: IOS Press.
Cohen, W. W. 1995b. Text categorization and relational learning. In Proceedings of ICML-95, 12th International Conference on Machine Learning (Lake Tahoe, US, 1995), pp. 124-132.
Cohen, W. W. and Hirsh, H. 1998. Joins that generalize: text classification using Whirl. In Proceedings of KDD-98, 4th International Conference on Knowledge Discovery and Data Mining (New York, US, 1998), pp. 169-173.
Cohen, W. W. and Singer, Y. 1999. Context-sensitive learning methods for text categorization. ACM Transactions on Information Systems 17, 2, 141-173.
Cooper, W. S. 1995. Some inconsistencies and misnomers in probabilistic information retrieval. ACM Transactions on Information Systems 13, 1, 100-111.
Creecy, R. M., Masand, B. M., Smith, S. J., and Waltz, D. L. 1992. Trading MIPS and memory for knowledge engineering: classifying census returns on the Connection Machine. Communications of the ACM 35, 8, 48-63.
Crestani, F., Lalmas, M., van Rijsbergen, C. J., and Campbell, I. 1998. “Is this document relevant? . . . probably”. A survey of probabilistic models in information retrieval. ACM Computing Surveys 30, 4, 528-552.
Dagan, I., Karov, Y., and Roth, D. 1997. Mistake-driven learning in text categorization. In Proceedings of EMNLP-97, 2nd Conference on Empirical Methods in Natural Language Processing (Providence, US, 1997), pp. 55-63.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41, 6, 391-407.
Denoyer, L., Zaragoza, H., and Gallinari, P. 2001. HMM-based passage models for document classification and ranking. In Proceedings of ECIR-01, 23rd European Colloquium on Information Retrieval Research (Darmstadt, DE, 2001).
Díaz Esteban, A., de Buenaga Rodríguez, M., Ureña López, L. A., and García Vega, M. 1998. Integrating linguistic resources in an uniform way for text classification tasks. In Proceedings of LREC-98, 1st International Conference on Language Resources and Evaluation (Granada, ES, 1998), pp. 1197-1204.
Domingos, P. and Pazzani, M. J. 1997. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning 29, 2-3, 103-130.
Drucker, H., Vapnik, V., and Wu, D. 1999. Support vector machines for spam categorization. IEEE Transactions on Neural Networks 10, 5, 1048-1054.
Dumais, S. T. and Chen, H. 2000. Hierarchical classification of Web content. In Proceedings of SIGIR-00, 23rd ACM International Conference on Research and Development in Information Retrieval (Athens, GR, 2000), pp. 256-263.
Dumais, S. T., Platt, J., Heckerman, D., and Sahami, M. 1998. Inductive learning algorithms and representations for text categorization. In Proceedings of CIKM-98, 7th ACM International Conference on Information and Knowledge Management (Bethesda, US, 1998), pp. 148-155.
Escudero, G., Màrquez, L., and Rigau, G. 2000. Boosting applied to word sense disambiguation. In Proceedings of ECML-00, 11th European Conference on Machine Learning (Barcelona, ES, 2000), pp. 129-141.
Field, B. 1975. Towards automatic indexing: automatic assignment of controlled-language indexing and classification from free indexing. Journal of Documentation 31, 4, 246-265.
Forsyth, R. S. 1999. New directions in text categorization. In A. Gammerman Ed., Causal models and intelligent data management, pp. 151-185. Heidelberg, DE: Springer.
Frasconi, P., Soda, G., and Vullo, A. 2001. Text categorization for multi-page documents: A hybrid naive Bayes HMM approach. Journal of Intelligent Information Systems. Forthcoming.
Fuhr, N. 1985. A probabilistic model of dictionary-based automatic indexing. In Proceedings of RIAO-85, 1st International Conference “Recherche d'Information Assistée par Ordinateur” (Grenoble, FR, 1985), pp. 207-216.
Fuhr, N. 1989. Models for retrieval with probabilistic indexing. Information Processing and Management 25, 1, 55-72.
Fuhr, N. and Buckley, C. 1991. A probabilistic learning approach for document indexing. ACM Transactions on Information Systems 9, 3, 223-248.
Fuhr, N., Hartmann, S., Knorz, G., Lustig, G., Schwantner, M., and Tzeras, K. 1991. AIR/X - a rule-based multistage indexing system for large subject fields. In Proceedings of RIAO-91, 3rd International Conference “Recherche d'Information Assistée par Ordinateur” (Barcelona, ES, 1991), pp. 606-623.
Fuhr, N. and Knorz, G. 1984. Retrieval test evaluation of a rule-based automated indexing (AIR/PHYS). In Proceedings of SIGIR-84, 7th ACM International Conference on Research and Development in Information Retrieval (Cambridge, UK, 1984), pp. 391-408.
Fuhr, N. and Pfeifer, U. 1994. Probabilistic information retrieval as combination of abstraction inductive learning and probabilistic assumptions. ACM Transactions on Information Systems 12, 1, 92-115.
Fürnkranz, J. 1999. Exploiting structural information for text classification on the WWW. In Proceedings of IDA-99, 3rd Symposium on Intelligent Data Analysis (Amsterdam, NL, 1999), pp. 487-497.
Galavotti, L., Sebastiani, F., and Simi, M. 2000. Experiments on the use of feature selection and negative evidence in automated text categorization. In Proceedings of ECDL-00, 4th European Conference on Research and Advanced Technology for Digital Libraries (Lisbon, PT, 2000), pp. 59-68.
Gale, W. A., Church, K. W., and Yarowsky, D. 1993. A method for disambiguating word senses in a large corpus. Computers and the Humanities 26, 5, 415-439.
Gövert, N., Lalmas, M., and Fuhr, N. 1999. A probabilistic description-oriented approach for categorising Web documents. In Proceedings of CIKM-99, 8th ACM International Conference on Information and Knowledge Management (Kansas City, US, 1999), pp. 475-482.
Gray, W. A. and Harley, A. J. 1971. Computer-assisted indexing. Information Storage and Retrieval 7, 4, 167-174.
Guthrie, L., Walker, E., and Guthrie, J. A. 1994. Document classification by machine: theory and practice. In Proceedings of COLING-94, 15th International Conference on Computational Linguistics (Kyoto, JP, 1994), pp. 1059-1063.
Hayes, P. J., Andersen, P. M., Nirenburg, I. B., and Schmandt, L. M. 1990. Tcs: a shell for content-based text categorization. In Proceedings of CAIA-90, 6th IEEE Conference on Artificial Intelligence Applications (Santa Barbara, US, 1990), pp. 320-326.
Heaps, H. 1973. A theory of relevance for automatic document classification. Information and Control 22, 3, 268-278.
Hersh, W., Buckley, C., Leone, T., and Hickman, D. 1994. Ohsumed: an interactive retrieval evaluation and new large text collection for research. In Proceedings of SIGIR-94, 17th ACM International Conference on Research and Development in Information Retrieval (Dublin, IE, 1994), pp. 192-201.
Hull, D. A. 1994. Improving text retrieval for the routing problem using latent semantic indexing. In Proceedings of SIGIR-94, 17th ACM International Conference on Research and Development in Information Retrieval (Dublin, IE, 1994), pp. 282-289.
Hull, D. A., Pedersen, J. O., and Schütze, H. 1996. Method combination for document filtering. In Proceedings of SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval (Zürich, CH, 1996), pp. 279-288.
Ittner, D. J., Lewis, D. D., and Ahn, D. D. 1995. Text categorization of low quality images. In Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval (Las Vegas, US, 1995), pp. 301-315.
Iwayama, M. and Tokunaga, T. 1995. Cluster-based text categorization: a comparison of category search strategies. In Proceedings of SIGIR-95, 18th ACM International Conference on Research and Development in Information Retrieval (Seattle, US, 1995), pp. 273-281.
Iyer, R. D., Lewis, D. D., Schapire, R. E., Singer, Y., and Singhal, A. 2000. Boosting for document routing. In Proceedings of CIKM-00, 9th ACM International Conference on Information and Knowledge Management (McLean, US, 2000), pp. 70-77.
Joachims, T. 1997. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In Proceedings of ICML-97, 14th International Conference on Machine Learning (Nashville, US, 1997), pp. 143-151.
Joachims, T. 1998. Text categorization with support vector machines: learning with many relevant features. In Proceedings of ECML-98, 10th European Conference on Machine Learning (Chemnitz, DE, 1998), pp. 137-142.
Joachims, T. 1999. Transductive inference for text classification using support vector machines. In Proceedings of ICML-99, 16th International Conference on Machine Learning (Bled, SL, 1999), pp. 200-209.
Joachims, T. and Sebastiani, F. 2001. Guest editors' introduction to the special issue on automated text categorization. Journal of Intelligent Information Systems. Forthcoming.
John, G. H., Kohavi, R., and Pfleger, K. 1994. Irrelevant features and the subset selection problem. In Proceedings of ICML-94, 11th International Conference on Machine Learning (New Brunswick, US, 1994), pp. 121-129.
Junker, M. and Abecker, A. 1997. Exploiting thesaurus knowledge in rule induction for text classification. In Proceedings of RANLP-97, 2nd International Conference on Recent Advances in Natural Language Processing (Tzigov Chark, BL, 1997), pp. 202-207.
Junker, M. and Hoch, R. 1998. An experimental evaluation of OCR text representations for learning document classifiers. International Journal on Document Analysis and Recognition 1, 2, 116-122.
Kessler, B., Nunberg, G., and Schütze, H. 1997. Automatic detection of text genre. In Proceedings of ACL-97, 35th Annual Meeting of the Association for Computational Linguistics (Madrid, ES, 1997), pp. 32-38.
Kim, Y.-H., Hahn, S.-Y., and Zhang, B.-T. 2000. Text filtering by boosting naive Bayes classifiers. In Proceedings of SIGIR-00, 23rd ACM International Conference on Research and Development in Information Retrieval (Athens, GR, 2000), pp. 168-175.
Klinkenberg, R. and Joachims, T. 2000. Detecting concept drift with support vector machines. In Proceedings of ICML-00, 17th International Conference on Machine Learning (Stanford, US, 2000), pp. 487-494.
Knight, K. 1999. Mining online text. Communications of the ACM 42, 11, 58-61.
Knorz, G. 1982. A decision theory approach to optimal automated indexing. In Proceedings of SIGIR-82, 5th ACM International Conference on Research and Development in Information Retrieval (Berlin, DE, 1982), pp. 174-193.
Koller, D. and Sahami, M. 1997. Hierarchically classifying documents using very few words. In Proceedings of ICML-97, 14th International Conference on Machine Learning (Nashville, US, 1997), pp. 170-178.
Korfhage, R. R. 1997. Information storage and retrieval. Wiley Computer Publishing, New York, US.
Lam, S. L. and Lee, D. L. 1999. Feature reduction for neural network based text categorization. In Proceedings of DASFAA-99, 6th IEEE International Conference on Database Advanced Systems for Advanced Application (Hsinchu, TW, 1999), pp. 195-202.
Lam, W. and Ho, C. Y. 1998. Using a generalized instance set for automatic text categorization. In Proceedings of SIGIR-98, 21st ACM International Conference on Research and Development in Information Retrieval (Melbourne, AU, 1998), pp. 81-89.
Lam, W., Low, K. F., and Ho, C. Y. 1997. Using a Bayesian network induction approach for text categorization. In Proceedings of IJCAI-97, 15th International Joint Conference on Artificial Intelligence (Nagoya, JP, 1997), pp. 745-750.
Lam, W., Ruiz, M. E., and Srinivasan, P. 1999. Automatic text categorization and its applications to text retrieval. IEEE Transactions on Knowledge and Data Engineering 11, 6, 865-879.
Lang, K. 1995. NewsWeeder: learning to filter netnews. In Proceedings of ICML-95, 12th International Conference on Machine Learning (Lake Tahoe, US, 1995), pp. 331-339.
Larkey, L. S. 1998. Automatic essay grading using text categorization techniques. In Proceedings of SIGIR-98, 21st ACM International Conference on Research and Development in Information Retrieval (Melbourne, AU, 1998), pp. 90-95.
Larkey, L. S. 1999. A patent search and classification system. In Proceedings of DL-99, 4th ACM Conference on Digital Libraries (Berkeley, US, 1999), pp. 179-187.
Larkey, L. S. and Croft, W. B. 1996. Combining classifiers in text categorization. In Proceedings of SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval (Zürich, CH, 1996), pp. 289-297.
Lewis, D. D. 1992a. An evaluation of phrasal and clustered representations on a text categorization task. In Proceedings of SIGIR-92, 15th ACM International Conference on Research and Development in Information Retrieval (København, DK, 1992), pp. 37-50.
Lewis, D. D. 1992b. Representation and learning in information retrieval. Ph. D. thesis, Department of Computer Science, University of Massachusetts, Amherst, US.
Lewis, D. D. 1995a. Evaluating and optimizing autonomous text classification systems. In Proceedings of SIGIR-95, 18th ACM International Conference on Research and Development in Information Retrieval (Seattle, US, 1995), pp. 246-254.
Lewis, D. D. 1995b. A sequential algorithm for training text classifiers: corrigendum and additional data. SIGIR Forum 29, 2, 13-19.
Lewis, D. D. 1995c. The TREC-4 filtering track: description and analysis. In Proceedings of TREC-4, 4th Text Retrieval Conference (Gaithersburg, US, 1995), pp. 165-180.
Lewis, D. D. 1998. Naive (Bayes) at forty: The independence assumption in information retrieval. In Proceedings of ECML-98, 10th European Conference on Machine Learning (Chemnitz, DE, 1998), pp. 4-15.
Lewis, D. D. and Catlett, J. 1994. Heterogeneous uncertainty sampling for supervised learning. In Proceedings of ICML-94, 11th International Conference on Machine Learning (New Brunswick, US, 1994), pp. 148-156.
Lewis, D. D. and Gale, W. A. 1994. A sequential algorithm for training text classifiers. In Proceedings of SIGIR-94, 17th ACM International Conference on Research and Development in Information Retrieval (Dublin, IE, 1994), pp. 3-12. See also [Lewis 1995b].
Lewis, D. D. and Hayes, P. J. 1994. Guest editorial for the special issue on text categorization. ACM Transactions on Information Systems 12, 3, 231.
Lewis, D. D. and Ringuette, M. 1994. A comparison of two learning algorithms for text categorization. In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval (Las Vegas, US, 1994), pp. 81-93.
Lewis, D. D., Schapire, R. E., Callan, J. P., and Papka, R. 1996. Training algorithms for linear text classifiers. In Proceedings of SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval (Zürich, CH, 1996), pp. 298-306.
Li, H. and Yamanishi, K. 1999. Text classification using ESC-based stochastic decision lists. In Proceedings of CIKM-99, 8th ACM International Conference on Information and Knowledge Management (Kansas City, US, 1999), pp. 122-130.
Li, Y. H. and Jain, A. K. 1998. Classification of text documents. The Computer Journal 41, 8, 537-546.
Liddy, E. D., Paik, W., and Yu, E. S. 1994. Text categorization for multiple users based on semantic features from a machine-readable dictionary. ACM Transactions on Information Systems 12, 3, 278-295.
Liere, R. and Tadepalli, P. 1997. Active learning with committees for text categorization. In Proceedings of AAAI-97, 14th Conference of the American Association for Artificial Intelligence (Providence, US, 1997), pp. 591-596.
Lim, J. H. 1999. Learnable visual keywords for image classification. In Proceedings of DL-99, 4th ACM Conference on Digital Libraries (Berkeley, US, 1999), pp. 139-145.
Manning, C. and Schütze, H. 1999. Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, US.
Maron, M. 1961. Automatic indexing: an experimental inquiry. Journal of the Association for Computing Machinery 8, 3, 404-417.
Masand, B. 1994. Optimising confidence of text classification by evolution of symbolic expressions. In K. E. Kinnear Ed., Advances in genetic programming, Chapter 21, pp. 459-476. Cambridge, US: The MIT Press.
Masand, B., Linoff, G., and Waltz, D. 1992. Classifying news stories using memory-based reasoning. In Proceedings of SIGIR-92, 15th ACM International Conference on Research and Development in Information Retrieval (København, DK, 1992), pp. 59-65.
McCallum, A. K. and Nigam, K. 1998. Employing EM in pool-based active learning for text classification. In Proceedings of ICML-98, 15th International Conference on Machine Learning (Madison, US, 1998), pp. 350-358.
McCallum, A. K., Rosenfeld, R., Mitchell, T. M., and Ng, A. Y. 1998. Improving text classification by shrinkage in a hierarchy of classes. In Proceedings of ICML-98, 15th International Conference on Machine Learning (Madison, US, 1998), pp. 359-367.
Merkl, D. 1998. Text classification with self-organizing maps: Some lessons learned. Neurocomputing 21, 1/3, 61-77.
Mitchell, T. M. 1996. Machine learning. McGraw Hill, New York, US.
Mladenić, D. 1998. Feature subset selection in text learning. In Proceedings of ECML-98, 10th European Conference on Machine Learning (Chemnitz, DE, 1998), pp. 95-100.
Mladenić, D. and Grobelnik, M. 1998. Word sequences as features in text-learning. In Proceedings of ERK-98, the Seventh Electrotechnical and Computer Science Conference (Ljubljana, SL, 1998), pp. 145-148.
Moulinier, I. and Ganascia, J.-G. 1996. Applying an existing machine learning algorithm to text categorization. In S. Wermter, E. Riloff, and G. Scheler Eds., Connectionist, statistical, and symbolic approaches to learning for natural language processing (Heidelberg, DE, 1996), pp. 343-354. Springer Verlag.
Moulinier, I., Raškinis, G., and Ganascia, J.-G. 1996. Text categorization: a symbolic approach. In Proceedings of SDAIR-96, 5th Annual Symposium on Document Analysis and Information Retrieval (Las Vegas, US, 1996), pp. 87-99.
Myers, K., Kearns, M., Singh, S., and Walker, M. A. 2000. A boosting approach to topic spotting on subdialogues. In Proceedings of ICML-00, 17th International Conference on Machine Learning (Stanford, US, 2000).
Ng, H. T., Goh, W. B., and Low, K. L. 1997. Feature selection, perceptron learning, and a usability case study for text categorization. In Proceedings of SIGIR-97, 20th ACM International Conference on Research and Development in Information Retrieval (Philadelphia, US, 1997), pp. 67-73.
Nigam, K., McCallum, A. K., Thrun, S., and Mitchell, T. M. 2000. Text classification from labeled and unlabeled documents using EM. Machine Learning 39, 2/3, 103-134.
Oh, H.-J., Myaeng, S. H., and Lee, M.-H. 2000. A practical hypertext categorization method using links and incrementally available class information. In Proceedings of SIGIR-00, 23rd ACM International Conference on Research and Development in Information Retrieval (Athens, GR, 2000), pp. 264-271.
Pazienza, M. T. Ed. 1997. Information extraction. Number 1299 in Lecture Notes in Computer Science. Springer, Heidelberg, DE.
Riloff, E. 1995. Little words can make a big difference for text classification. In Proceedings of SIGIR-95, 18th ACM International Conference on Research and Development in Information Retrieval (Seattle, US, 1995), pp. 130-136.
Riloff, E. and Lehnert, W. 1994. Information extraction as a basis for high-precision text classification. ACM Transactions on Information Systems 12, 3, 296-333.
Robertson, S. E. and Harding, P. 1984. Probabilistic automatic indexing by learning from human indexers. Journal of Documentation 40, 4, 264-270.
Robertson, S. E. and Sparck Jones, K. 1976. Relevance weighting of search terms. Journal of the American Society for Information Science 27, 3, 129-146. Also reprinted in [Willett 1988], pp. 143-160.
Roth, D. 1998. Learning to resolve natural language ambiguities: a unified approach. In Proceedings of AAAI-98, 15th Conference of the American Association for Artificial Intelligence (Madison, US, 1998), pp. 806-813.
Ruiz, M. E. and Srinivasan, P. 1999. Hierarchical neural networks for text categorization. In Proceedings of SIGIR-99, 22nd ACM International Conference on Research and Development in Information Retrieval (Berkeley, US, 1999), pp. 281-282.
Sable, C. L. and Hatzivassiloglou, V. 2000. Text-based approaches for non-topical image categorization. International Journal of Digital Libraries 3, 3, 261-275.
Salton, G. and Buckley, C. 1988. Term-weighting approaches in automatic text retrieval. Information Processing and Management 24, 5, 513-523. Also reprinted in [Sparck Jones and Willett 1997], pp. 323-328.
Salton, G., Wong, A., and Yang, C. 1975. A vector space model for automatic indexing. Communications of the ACM 18, 11, 613-620. Also reprinted in [Sparck Jones and Willett 1997], pp. 273-280.
Saracevic, T. 1975. Relevance: a review of and a framework for the thinking on the notion in information science. Journal of the American Society for Information Science 26, 6, 321-343. Also reprinted in [Sparck Jones and Willett 1997], pp. 143-165.
Schapire, R. E. and Singer, Y. 2000. BoosTexter: a boosting-based system for text categorization. Machine Learning 39, 2/3, 135-168.
Schapire, R. E., Singer, Y., and Singhal, A. 1998. Boosting and Rocchio applied to text filtering. In Proceedings of SIGIR-98, 21st ACM International Conference on Research and Development in Information Retrieval (Melbourne, AU, 1998), pp. 215-223.
Schütze, H. 1998. Automatic word sense discrimination. Computational Linguistics 24, 1, 97-124.
Schütze, H., Hull, D. A., and Pedersen, J. O. 1995. A comparison of classifiers and document representations for the routing problem. In Proceedings of SIGIR-95, 18th ACM International Conference on Research and Development in Information Retrieval (Seattle, US, 1995), pp. 229-237.
Scott, S. and Matwin, S. 1999. Feature engineering for text classification. In Proceedings of ICML-99, 16th International Conference on Machine Learning (Bled, SL, 1999), pp. 379-388.
Sebastiani, F., Sperduti, A., and Valdambrini, N. 2000. An improved boosting algorithm and its application to automated text categorization. In Proceedings of CIKM-00, 9th ACM International Conference on Information and Knowledge Management (McLean, US, 2000), pp. 78-85.
Singhal, A., Mitra, M., and Buckley, C. 1997. Learning routing queries in a query zone. In Proceedings of SIGIR-97, 20th ACM International Conference on Research and Development in Information Retrieval (Philadelphia, US, 1997), pp. 25-32.
Singhal, A., Salton, G., Mitra, M., and Buckley, C. 1996. Document length normalization. Information Processing and Management 32, 5, 619-633.
Slonim, N. and Tishby, N. 2001. The power of word clusters for text classification. In Proceedings of ECIR-01, 23rd European Colloquium on Information Retrieval Research (Darmstadt, DE, 2001).
Sparck Jones, K. and Willett, P. Eds. 1997. Readings in information retrieval. Morgan Kaufmann, San Mateo, US.
Taira, H. and Haruno, M. 1999. Feature selection in SVM text categorization. In Proceedings of AAAI-99, 16th Conference of the American Association for Artificial Intelligence (Orlando, US, 1999), pp. 480-486.
Tauritz, D. R., Kok, J. N., and Sprinkhuizen-Kuyper, I. G. 2000. Adaptive information filtering using evolutionary computation. Information Sciences 122, 2-4, 121-140.
Tumer, K. and Ghosh, J. 1996. Error correlation and error reduction in ensemble classifiers. Connection Science 8, 3-4, 385-403.
Tzeras, K. and Hartmann, S. 1993. Automatic indexing based on Bayesian inference networks. In Proceedings of SIGIR-93, 16th ACM International Conference on Research and Development in Information Retrieval (Pittsburgh, US, 1993), pp. 22-34.
van Rijsbergen, C. J. 1977. A theoretical basis for the use of co-occurrence data in information retrieval. Journal of Documentation 33, 2, 106-119.
van Rijsbergen, C. J. 1979. Information Retrieval (Second ed.). Butterworths, London, UK. Available at http://www.dcs.gla.ac.uk/Keith.
Weigend, A. S., Wiener, E. D., and Pedersen, J. O. 1999. Exploiting hierarchy in text categorization. Information Retrieval 1, 3, 193-216.
Weiss, S. M., Apté, C., Damerau, F. J., Johnson, D. E., Oles, F. J., Goetz, T., and Hampp, T. 1999. Maximizing text-mining performance. IEEE Intelligent Systems 14, 4, 63-69.
Wiener, E. D., Pedersen, J. O., and Weigend, A. S. 1995. A neural network approach to topic spotting. In Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval (Las Vegas, US, 1995), pp. 317-332.
Willett, P. Ed. 1988. Document retrieval systems. Taylor Graham, London, UK.
Wong, J. W., Kan, W.-K., and Young, G. H. 1996. Action: automatic classification for full-text documents. SIGIR Forum 30, 1, 26-41.
Yang, Y. 1994. Expert network: effective and efficient learning from human decisions in text categorisation and retrieval. In Proceedings of SIGIR-94, 17th ACM International Conference on Research and Development in Information Retrieval (Dublin, IE, 1994), pp. 13-22.
Yang, Y. 1995. Noise reduction in a statistical approach to text categorization. In Proceedings of SIGIR-95, 18th ACM International Conference on Research and Development in Information Retrieval (Seattle, US, 1995), pp. 256-263.
Yang, Y. 1999. An evaluation of statistical approaches to text categorization. Information Retrieval 1, 1-2, 69-90.
Yang, Y. and Chute, C. G. 1994. An example-based mapping method for text categorization and retrieval. ACM Transactions on Information Systems 12, 3, 252-277.
Yang, Y. and Liu, X. 1999. A re-examination of text categorization methods. In Proceedings of SIGIR-99, 22nd ACM International Conference on Research and Development in Information Retrieval (Berkeley, US, 1999), pp. 42-49.
Yang, Y. and Pedersen, J. O. 1997. A comparative study on feature selection in text categorization. In Proceedings of ICML-97, 14th International Conference on Machine Learning (Nashville, US, 1997), pp. 412-420.
Yang, Y., Slattery, S., and Ghani, R. 2001. A study of approaches to hypertext categorization. Journal of Intelligent Information Systems. Forthcoming.
Yu, K. L. and Lam, W. 1998. A new on-line learning algorithm for adaptive text filtering. In Proceedings of CIKM-98, 7th ACM International Conference on Information and Knowledge Management (Bethesda, US, 1998), pp. 156-160.

Cite as

BibTeX entry

@article{oai:it.cnr:prodotti:43722,
	title = {Machine learning in automated text categorisation},
	author = {Sebastiani F.},
	publisher = {Association for Computing Machinery, New York, N.Y., United States of America},
	doi = {10.1145/505282.505283},
	note = {Also available under DOI 10.48550/arxiv.cs/0110053},
	journal = {ACM computing surveys},
	volume = {34},
	pages = {1--47},
	year = {2002}
}