2021
Journal article  Open Access

Word-class embeddings for multiclass text classification

Moreo A., Esuli A., Sebastiani F.

Keywords: Text classification; Machine Learning (cs.LG, stat.ML); Computation and Language (cs.CL); Deep learning; Neural networks; Language models; Information Systems; Computer Networks and Communications; Computer Science Applications; FOS: Computer and information sciences

Pre-trained word embeddings encode general word semantics and lexical regularities of natural language, and have proven useful across many NLP tasks, including word sense disambiguation, machine translation, and sentiment analysis, to name a few. In supervised tasks such as multiclass text classification (the focus of this article) it seems appealing to enhance word representations with ad hoc embeddings that encode task-specific information. We propose (supervised) word-class embeddings (WCEs), and show that, when concatenated to (unsupervised) pre-trained word embeddings, they substantially facilitate the training of deep-learning models for multiclass classification by topic. We provide empirical evidence that WCEs yield a consistent improvement in multiclass classification accuracy, using six popular neural architectures and six widely used, publicly available datasets for multiclass text classification. A further advantage of the method is that it is conceptually simple and straightforward to implement. Our code implementing WCEs is publicly available at https://github.com/AlexMoreo/word-class-embeddings.
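As a rough illustration of the idea (a sketch, not the authors' exact formulation — the paper studies several word-class correlation functions), one simple WCE assigns each word a vector of length-normalised co-occurrence counts with the classes, which is then concatenated to the word's pre-trained embedding:

```python
import numpy as np

def word_class_embeddings(X, y, n_classes):
    """Build a simple variant of word-class embeddings (WCEs).

    X: (n_docs, n_words) document-term matrix (binary or tf counts)
    y: (n_docs,) integer class labels
    Returns a (n_words, n_classes) matrix in which row w is word w's
    L2-normalised profile of co-occurrence with each class.
    """
    # One-hot document-class matrix, shape (n_docs, n_classes)
    Y = np.eye(n_classes)[y]
    # Word-class co-occurrence counts, shape (n_words, n_classes)
    wce = X.T @ Y
    # L2-normalise each word's class profile (guard against all-zero rows)
    norms = np.linalg.norm(wce, axis=1, keepdims=True)
    return wce / np.maximum(norms, 1e-12)

# Toy example: 4 documents, a 3-word vocabulary, 2 classes
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [0, 1, 0]], dtype=float)
y = np.array([0, 0, 1, 1])

wce = word_class_embeddings(X, y, n_classes=2)

# Concatenate with (hypothetical) pre-trained embeddings of dimension 5,
# yielding the augmented word representations fed to the neural model
pretrained = np.random.default_rng(0).normal(size=(3, 5))
augmented = np.hstack([pretrained, wce])
print(augmented.shape)  # (3, 7)
```

Here word 0 occurs only in class-0 documents, so its WCE row is (1, 0); words occurring evenly across classes get flatter profiles. The pre-trained matrix and its dimension are placeholders for illustration.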

Source: Data Mining and Knowledge Discovery 35 (2021): 911–963. doi:10.1007/s10618-020-00735-3

Publisher: Kluwer Academic Publishers, Dordrecht; United States of America



BibTeX entry
@article{oai:it.cnr:prodotti:454276,
	title = {Word-class embeddings for multiclass text classification},
	author = {Moreo, A. and Esuli, A. and Sebastiani, F.},
	publisher = {Kluwer Academic Publishers, Dordrecht},
	doi = {10.1007/s10618-020-00735-3},
	note = {Also available as arXiv:1911.11506 (doi:10.48550/arxiv.1911.11506) and on Zenodo (doi:10.5281/zenodo.4468312, 10.5281/zenodo.4468313)},
	journal = {Data Mining and Knowledge Discovery},
	volume = {35},
	pages = {911--963},
	year = {2021}
}

Funding: AI4Media (A European Excellence Centre for Media, Society and Democracy); ARIADNEplus (Advanced Research Infrastructure for Archaeological Data Networking in Europe - plus); SoBigData-PlusPlus (SoBigData++: European Integrated Infrastructure for Social Mining and Big Data Analytics)


OpenAIRE