Moroni L., Puccetti G., Huguet Cabot P.-L., Bejgu A. S., Barba E., Miaschi A., Dell'Orletta F., Esuli A., Navigli R.
Keywords: LLMs; vocabulary adaptation; Italian; large language models. Subjects: Computation and Language (cs.CL); FOS: Computer and information sciences.
Abstract: The number of pretrained Large Language Models (LLMs) is increasing steadily, though the majority are designed predominantly for the English language. While state-of-the-art LLMs can handle other languages, due to language contamination or some degree of multilingual pretraining data, they are not optimized for non-English languages, leading to inefficient encoding (high token "fertility") and slower inference speed. In this work, we thoroughly compare a variety of vocabulary adaptation techniques for optimizing English LLMs for the Italian language, and put forward Semantic Alignment Vocabulary Adaptation (SAVA), a novel method that leverages neural mapping for vocabulary substitution. SAVA achieves competitive performance across multiple downstream tasks, enhancing grounded alignment strategies. We adapt two LLMs: Mistral-7B-v0.1, reducing token fertility by 25%, and Llama-3.1-8B, optimizing the vocabulary and reducing the number of parameters by 1 billion. We show that, following the adaptation of the vocabulary, these models can recover their performance with a relatively limited stage of continual training on the target language. Finally, we test the capabilities of the adapted models on various multiple-choice and generative tasks.
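The two quantities at the core of the abstract lend themselves to a compact illustration. Below is a minimal Python sketch, not the authors' released code: the fertility measure (average subword tokens per whitespace-separated word) follows the standard definition, while the least-squares embedding alignment is only a simple stand-in for the learned semantic mapping SAVA applies during vocabulary substitution; all model names in the comments are placeholders.

```python
# Illustrative sketch of (1) token fertility and (2) a least-squares
# embedding alignment for initializing a swapped-in vocabulary.
# Assumes `numpy` and `transformers` are installed; the model names are
# placeholders, and the mapping below is a stand-in, not SAVA itself.

import numpy as np
from transformers import AutoTokenizer


def token_fertility(tokenizer, texts):
    """Average subword tokens per whitespace word; values close to 1
    indicate efficient encoding, higher values indicate waste."""
    n_tokens = sum(len(tokenizer.tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / max(n_words, 1)


def fit_linear_map(shared_src, shared_tgt):
    """Least-squares W with shared_src @ W ~= shared_tgt, fitted on the
    embeddings of tokens that appear in both vocabularies."""
    W, *_ = np.linalg.lstsq(shared_src, shared_tgt, rcond=None)
    return W


def init_new_embeddings(shared_src, shared_tgt, new_src):
    """Project helper-model embeddings of tokens that are new to the
    target vocabulary into the adapted LLM's embedding space."""
    return new_src @ fit_linear_map(shared_src, shared_tgt)


# Hypothetical usage: compare an English-centric tokenizer against an
# Italian-adapted one on Italian text (both model names are placeholders).
# tok_en = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
# tok_it = AutoTokenizer.from_pretrained("path/to/italian-adapted-model")
# print(token_fertility(tok_en, italian_sentences))
# print(token_fertility(tok_it, italian_sentences))
```

For scale, the 25% fertility reduction reported for Mistral-7B-v0.1 means Italian text that previously cost 4 tokens per 3 words is encoded in roughly 3, shortening input sequences and correspondingly helping inference speed.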
Publisher: Association for Computational Linguistics
@inproceedings{oai:iris.cnr.it:20.500.14243/552066,
  title         = {Optimizing {LLMs} for {Italian}: reducing token fertility and enhancing efficiency through vocabulary adaptation},
  author        = {Moroni, L. and Puccetti, G. and Huguet Cabot, P.-L. and Bejgu, A. S. and Barba, E. and Miaschi, A. and Dell'Orletta, F. and Esuli, A. and Navigli, R.},
  booktitle     = {Findings of the Association for Computational Linguistics: NAACL 2025},
  publisher     = {Association for Computational Linguistics},
  doi           = {10.18653/v1/2025.findings-naacl.371},
  eprint        = {2504.17025},
  archiveprefix = {arXiv},
  year          = {2025}
}