2021

Conference article Open Access

Re-assessing the "Classify and Count" quantification method

Moreo A., Sebastiani F.

Learning to quantify; Quantification; Classify and count; Prevalence estimation; Model selection; Hyperparameter optimization; Re-assessing; Machine Learning (cs.LG); Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); FOS: Computer and information sciences

Learning to quantify (a.k.a. quantification) is a task concerned with training unbiased estimators of class prevalence via supervised learning. This task originated with the observation that "Classify and Count" (CC), the trivial method of obtaining class prevalence estimates, is often a biased estimator, and thus delivers suboptimal quantification accuracy. Following this observation, several methods for learning to quantify have been proposed and have been shown to outperform CC. In this work we contend that previous works have failed to use properly optimised versions of CC. We thus reassess the real merits of CC and its variants, and argue that, while still inferior to some cutting-edge methods, they deliver near-state-of-the-art accuracy once (a) hyperparameter optimisation is performed, and (b) this optimisation is performed by using a truly quantification-oriented evaluation protocol. Experiments on three publicly available binary sentiment classification datasets support these conclusions.
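
As a rough illustration of the two ideas in the abstract, the sketch below implements CC as "classify every item, then report the fraction labelled positive", and selects a hyperparameter by minimising a quantification error (absolute error of the prevalence estimate) averaged over validation samples with different prevalences, rather than a classification error. This is not the authors' code: the function names, the use of a decision threshold as the only hyperparameter, and the toy scores are assumptions made purely for illustration.

```python
from typing import Callable, Sequence


def classify_and_count(classify: Callable[[float], int],
                       items: Sequence[float]) -> float:
    """CC: estimate positive prevalence as the fraction of items labelled positive."""
    if len(items) == 0:
        raise ValueError("cannot estimate prevalence of an empty sample")
    return sum(classify(x) for x in items) / len(items)


def absolute_error(true_prev: float, est_prev: float) -> float:
    """A simple quantification-oriented error measure."""
    return abs(true_prev - est_prev)


def select_threshold(samples: Sequence[Sequence[float]],
                     true_prevs: Sequence[float],
                     candidates: Sequence[float]) -> float:
    """Pick the candidate threshold (a hypothetical stand-in for any
    hyperparameter) with the lowest mean absolute error of the CC
    estimate across validation samples of differing prevalence."""
    def mean_error(threshold: float) -> float:
        errors = [
            absolute_error(
                prev,
                classify_and_count(lambda s: int(s >= threshold), sample),
            )
            for sample, prev in zip(samples, true_prevs)
        ]
        return sum(errors) / len(errors)

    return min(candidates, key=mean_error)


# Toy classifier scores (assumed data, not taken from the paper).
validation_samples = [[0.2, 0.3, 0.45, 0.8], [0.45, 0.7, 0.9, 0.3]]
validation_prevalences = [0.5, 0.75]
best_threshold = select_threshold(validation_samples,
                                  validation_prevalences,
                                  candidates=[0.4, 0.5])
```

In this toy setting the default threshold of 0.5 underestimates prevalence on both validation samples, while 0.4 does not, so selecting by quantification error changes which configuration is chosen.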

Source: ECIR 2021 - 43rd European Conference on Information Retrieval, pp. 75–91, Online conference, 28/03-01/04/2021

Cite as

BibTeX entry

@inproceedings{oai:it.cnr:prodotti:456429,
	title = {Re-assessing the "Classify and Count" quantification method},
	author = {Moreo, A. and Sebastiani, F.},
	doi = {10.1007/978-3-030-72240-1_6},
	booktitle = {ECIR 2021 - 43rd European Conference on Information Retrieval},
	pages = {75--91},
	address = {Online conference},
	note = {28/03-01/04/2021. Also available under DOIs 10.5281/zenodo.4468276, 10.48550/arxiv.2011.02552 and 10.5281/zenodo.4468277},
	year = {2021}
}

CNR authors and affiliations

CNR authors

Moreo Fernandez, Alejandro David
0000-0002-0377-1025
Sebastiani, Fabrizio
0000-0003-4221-6427

Laboratories

Artificial Intelligence for Media and Humanities (2021-ongoing)

DOI

10.1007/978-3-030-72240-1_6
10.5281/zenodo.4468276
10.48550/arxiv.2011.02552
10.5281/zenodo.4468277

Also available from

arXiv.org e-Print Archive
arxiv.org
link.springer.com

Projects (via OpenAIRE)

AI4Media
A European Excellence Centre for Media, Society and Democracy
SoBigData-PlusPlus
SoBigData++: European Integrated Infrastructure for Social Mining and Big Data Analytics