Berti B., Esuli A., Sebastiani F.
Keywords: Machine learning, Native language identification, Explainable AI
Native language identification (NLI) is the task of training (via supervised machine learning) a classifier that guesses the native language of the author of a text. This task has been extensively researched in the last decade, and the performance of NLI systems has steadily improved over the years. We focus on a different facet of the NLI task, i.e. that of analysing the internals of an NLI classifier trained by an explainable machine learning (EML) algorithm, in order to obtain explanations of its classification decisions, with the ultimate goal of gaining insight into which linguistic phenomena 'give a speaker's native language away'. We use this perspective in order to tackle both NLI and a (much less researched) companion task, i.e. guessing whether a text has been written by a native or a non-native speaker. Using three datasets of different provenance (two datasets of English learners' essays and a dataset of social media posts), we investigate which kinds of linguistic traits (lexical, morphological, syntactic, and statistical) are most effective for solving our two tasks, i.e. are most indicative of a speaker's L1; our experiments indicate that the most discriminative features are the lexical ones, followed by the morphological, syntactic, and statistical features, in this order. We also present two case studies, one on Italian and one on Spanish learners of English, in which we analyse individual linguistic traits that the classifiers have singled out as most important for spotting these L1s; we show that the traits identified as most discriminative align well with our intuition, i.e. they represent typical patterns of language misuse, underuse, or overuse by speakers of the given L1. Overall, our study shows that EML can be a valuable tool for the scholar who investigates interlanguage facts and language transfer.
Source: Digital Scholarship in the Humanities (2023). doi:10.1093/llc/fqad019
Publisher: Oxford University Press, Oxford, UK
@article{oai:it.cnr:prodotti:481847,
  title = {Unravelling interlanguage facts via explainable machine learning},
  author = {Berti B. and Esuli A. and Sebastiani F.},
  publisher = {Oxford University Press, Oxford, UK},
  doi = {10.1093/llc/fqad019},
  journal = {Digital Scholarship in the Humanities},
  year = {2023}
}
AI4Media
A European Excellence Centre for Media, Society and Democracy
SoBigData-PlusPlus
SoBigData++: European Integrated Infrastructure for Social Mining and Big Data Analytics