Unravelling interlanguage facts via explainable machine learning

2023

Journal article (Open Access)

Berti B., Esuli A., Sebastiani F.

Keywords: Machine learning; Native language identification; Explainable AI

Native language identification (NLI) is the task of training (via supervised machine learning) a classifier that guesses the native language of the author of a text. This task has been extensively researched in the last decade, and the performance of NLI systems has steadily improved over the years. We focus on a different facet of the NLI task, i.e. that of analysing the internals of an NLI classifier trained by an explainable machine learning (EML) algorithm, in order to obtain explanations of its classification decisions, with the ultimate goal of gaining insight into which linguistic phenomena 'give a speaker's native language away'. We use this perspective to tackle both NLI and a (much less researched) companion task, i.e. guessing whether a text has been written by a native or a non-native speaker. Using three datasets of different provenance (two datasets of English learners' essays and a dataset of social media posts), we investigate which kinds of linguistic traits (lexical, morphological, syntactic, and statistical) are most effective for solving our two tasks, i.e. which are most indicative of a speaker's L1; our experiments indicate that the most discriminative features are the lexical ones, followed by the morphological, syntactic, and statistical features, in this order. We also present two case studies, one on Italian and one on Spanish learners of English, in which we analyse individual linguistic traits that the classifiers have singled out as most important for spotting these L1s; we show that the traits identified as most discriminative align well with our intuition, i.e. they represent typical patterns of language misuse, underuse, or overuse by speakers of the given L1. Overall, our study shows that EML can be a valuable tool for the scholar who investigates interlanguage facts and language transfer.
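
The idea of reading a classifier's internals to see which traits "give a speaker away" can be illustrated with a minimal sketch. The abstract does not name the EML algorithm or the feature extraction used by the authors, so the example below swaps in a plain multinomial naive Bayes model over word tokens and inspects its per-token log-odds. The sentences, labels, and function names (`train_log_odds`, `classify`) are invented for illustration and are not taken from the paper or its datasets.

```python
import math
from collections import Counter

# Toy, invented sentences (NOT from the paper's datasets); labels are hypothetical.
TRAIN = [
    ("i am agree with this opinion", "non-native"),
    ("i am agree that the people is right", "non-native"),
    ("the people is very kind here", "non-native"),
    ("i agree with this opinion", "native"),
    ("i agree that people are right", "native"),
    ("people are very kind here", "native"),
]


def train_log_odds(examples, positive, alpha=1.0):
    """Return per-token log-odds weights of a multinomial naive Bayes model.

    A weight above zero means the token is evidence for `positive`;
    a weight below zero means it is evidence for the other class.
    `alpha` is the add-alpha (Laplace) smoothing constant.
    """
    pos, neg = Counter(), Counter()
    for text, label in examples:
        (pos if label == positive else neg).update(text.split())
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + alpha * len(vocab)
    neg_total = sum(neg.values()) + alpha * len(vocab)
    return {
        tok: math.log((pos[tok] + alpha) / pos_total)
        - math.log((neg[tok] + alpha) / neg_total)
        for tok in vocab
    }


def classify(text, weights, positive, negative):
    """Sum the token weights (equal class priors assumed; unseen tokens ignored)."""
    score = sum(weights.get(tok, 0.0) for tok in text.split())
    return positive if score > 0 else negative


weights = train_log_odds(TRAIN, positive="non-native")

# The "explanation": tokens ranked by how strongly they indicate the positive class.
top_tokens = sorted(weights, key=weights.get, reverse=True)[:3]
print("most indicative of non-native:", top_tokens)
print(classify("i am agree with the plan", weights, "non-native", "native"))
```

Because the model is a sum of per-token weights, the ranking in `top_tokens` is the explanation itself: no separate post-hoc attribution step is needed. A realistic study would of course use far richer features than raw word tokens, as the abstract's list of lexical, morphological, syntactic, and statistical traits suggests.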

Source: Digital Scholarship in the Humanities (2023). doi:10.1093/llc/fqad019

Publisher: Oxford University Press, Oxford, UK

Cite as

BibTeX entry

@article{oai:it.cnr:prodotti:481847,
	title = {Unravelling interlanguage facts via explainable machine learning},
	author = {Berti B. and Esuli A. and Sebastiani F.},
	publisher = {Oxford University Press, Oxford, UK},
	doi = {10.1093/llc/fqad019},
	journal = {Digital Scholarship in the Humanities},
	year = {2023}
}

CNR authors and affiliations

CNR authors

Esuli, Andrea
0000-0002-5725-4322
Sebastiani, Fabrizio
0000-0003-4221-6427

Laboratories

Artificial Intelligence for Media and Humanities (2021-ongoing)

Download

CNR ExploRA: Bibliographic record

ISTI Repository: Preprint version

DOI

10.1093/llc/fqad019

Also available from

academic.oup.com

Projects (via OpenAIRE)

AI4Media
A European Excellence Centre for Media, Society and Democracy
SoBigData-PlusPlus
SoBigData++: European Integrated Infrastructure for Social Mining and Big Data Analytics