Document - Utility-theoretic ranking for semiautomated text classification

2015

Journal article Open Access

Utility-theoretic ranking for semiautomated text classification

Berardi G, Esuli A, Sebastiani F

Computer Science - Machine Learning; FOS: Computer and information sciences; Utility theory; Semi-automated text classification; I.2.6 Learning; General Computer Science; Semiautomated classification; Machine Learning (cs.LG)

Semiautomated Text Classification (SATC) may be defined as the task of ranking a set D of automatically labelled textual documents in such a way that, if a human annotator validates (i.e., inspects and corrects where appropriate) the documents in a top-ranked portion of D with the goal of increasing the overall labelling accuracy of D, the expected increase is maximized. An obvious SATC strategy is to rank D so that the documents that the classifier has labelled with the lowest confidence are top ranked. In this work, we show that this strategy is suboptimal. We develop new utility-theoretic ranking methods based on the notion of validation gain, defined as the improvement in classification effectiveness that would derive from validating a given automatically labelled document. We also propose a new effectiveness measure for SATC-oriented ranking methods, based on the expected reduction in classification error brought about by partially validating a list generated by a given ranking method. We report the results of experiments showing that, with respect to the baseline method mentioned earlier, and according to the proposed measure, our utility-theoretic ranking methods can achieve substantially higher expected reductions in classification error.
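
The contrast between the confidence-based baseline and a gain-weighted ranking can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give formulas, so the sketch assumes a single binary class, F1 as the effectiveness measure, classifier confidences that can be read as probabilities of being correct, and already-estimated contingency counts (tp, fp, fn, each false count at least 1). All function and field names are hypothetical.

```python
def f1(tp, fp, fn):
    """F1 from contingency counts; taken to be 1.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 1.0 if denom == 0 else 2 * tp / denom


def rank_by_confidence(docs):
    """Baseline described in the abstract: lowest-confidence documents first."""
    return sorted(docs, key=lambda d: d["confidence"])


def rank_by_expected_gain(docs, tp, fp, fn):
    """Rank by (probability of misclassification) x (gain in F1 from correcting it).

    Assumes fp >= 1 and fn >= 1, and treats confidence as P(label is correct).
    """
    base = f1(tp, fp, fn)
    # Gain when a corrected document was a false positive (it becomes a true negative).
    gain_fp = f1(tp, fp - 1, fn) - base
    # Gain when a corrected document was a false negative (it becomes a true positive).
    gain_fn = f1(tp + 1, fp, fn - 1) - base

    def expected_gain(d):
        p_error = 1.0 - d["confidence"]
        return p_error * (gain_fp if d["predicted_positive"] else gain_fn)

    return sorted(docs, key=expected_gain, reverse=True)


# Toy data (invented for illustration only).
docs = [
    {"id": "A", "confidence": 0.60, "predicted_positive": True},
    {"id": "B", "confidence": 0.70, "predicted_positive": False},
]
baseline = [d["id"] for d in rank_by_confidence(docs)]
utility = [d["id"] for d in rank_by_expected_gain(docs, tp=50, fp=10, fn=40)]
print(baseline, utility)  # ['A', 'B'] ['B', 'A']
```

In this toy setting, correcting a false negative raises F1 by roughly twice as much as correcting a false positive, so document B outranks A despite its higher confidence. This is the kind of case in which a pure confidence ranking can be suboptimal.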

Source: ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA, vol. 10 (issue 1), p. 6

Cite as

BibTeX entry

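@article{oai:it.cnr:prodotti:332904,
	title = {Utility-theoretic ranking for semiautomated text classification},
	author = {Berardi, G. and Esuli, A. and Sebastiani, F.},
	journal = {ACM Transactions on Knowledge Discovery from Data},
	volume = {10},
	number = {1},
	pages = {6},
	doi = {10.1145/2742548},
	note = {arXiv version: doi 10.48550/arxiv.1503.00491},
	year = {2015}
}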