2015

Journal article Open Access

Optimizing text quantifiers for multivariate loss functions.

Esuli A., Sebastiani F.

Computer Science - Machine Learning (cs.LG); Computer Science - Information Retrieval (cs.IR); FOS: Computer and information sciences; Quantification; General Computer Science; Text quantification

We address the problem of quantification, a supervised learning task whose goal is, given a class, to estimate the relative frequency (or prevalence) of the class in a dataset of unlabeled items. Quantification has several applications in data and text mining, such as estimating the prevalence of positive reviews in a set of reviews of a given product or estimating the prevalence of a given support issue in a dataset of transcripts of phone calls to tech support. So far, quantification has been addressed by learning a general-purpose classifier, counting the unlabeled items that have been assigned the class, and tuning the obtained counts according to some heuristics. In this article, we depart from the tradition of using general-purpose classifiers and use instead a supervised learning model for structured prediction, capable of generating classifiers directly optimized for the (multivariate and nonlinear) function used for evaluating quantification accuracy. The experiments that we have run on 5,500 binary high-dimensional datasets (averaging more than 14,000 documents each) show that this method is more accurate, more stable, and more efficient than existing state-of-the-art quantification methods.
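
The traditional pipeline described above (classify, count, then correct the count) can be sketched as follows. This is a minimal illustration, not the method proposed in the article: the function names are hypothetical, and the correction shown is the widely used adjusted-count formula, which assumes that the classifier's true positive rate and false positive rate, estimated on held-out labeled data, carry over to the unlabeled set.

```python
from typing import Sequence


def classify_and_count(predictions: Sequence[int]) -> float:
    """Estimate prevalence as the fraction of items predicted positive (1)."""
    if not predictions:
        raise ValueError("predictions must be non-empty")
    return sum(predictions) / len(predictions)


def adjusted_count(cc_estimate: float, tpr: float, fpr: float) -> float:
    """Correct a classify-and-count estimate using the classifier's error rates.

    Assumption: tpr and fpr were estimated on labeled held-out data and
    remain valid on the unlabeled set.
    """
    if tpr == fpr:
        # The correction is undefined; fall back to the raw estimate.
        return cc_estimate
    corrected = (cc_estimate - fpr) / (tpr - fpr)
    # Clip to the valid range for a prevalence.
    return min(1.0, max(0.0, corrected))


# Hypothetical binary predictions for ten unlabeled documents.
predictions = [1, 0, 0, 1, 1, 0, 0, 0, 0, 0]
cc = classify_and_count(predictions)            # 0.3
acc = adjusted_count(cc, tpr=0.8, fpr=0.1)      # roughly 0.286
print(cc, round(acc, 3))
```

The article's own approach differs from this sketch: instead of correcting the output of a general-purpose classifier, it trains the classifier to optimize the quantification evaluation function directly.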

Source: ACM Transactions on Knowledge Discovery from Data 9 (2015). doi:10.1145/2700406

Publisher: Association for Computing Machinery, New York, NY, United States of America

Cite as

BibTeX entry

@article{oai:it.cnr:prodotti:331333,
	title = {Optimizing text quantifiers for multivariate loss functions},
	author = {Esuli, Andrea and Sebastiani, Fabrizio},
	publisher = {Association for Computing Machinery},
	address = {New York, NY, United States of America},
	doi = {10.1145/2700406},
	eprint = {1502.05491},
	archiveprefix = {arXiv},
	journal = {ACM Transactions on Knowledge Discovery from Data},
	volume = {9},
	year = {2015}
}