Puccetti G., Cassese M., Esuli A.
Benchmarks, Large Language Models, Mathematical understanding
While Italian is a high-resource language, there are few Italian-native benchmarks for evaluating generative Large Language Models (LLMs) in it. This work presents three new benchmarks: Invalsi MATE, to evaluate model performance on mathematical understanding in Italian; Invalsi ITA, to evaluate language understanding in Italian; and Olimpiadi MATE, for more complex mathematical understanding. The first two benchmarks are based on the Invalsi tests, which are administered to students aged 6 to 18 within the Italian school system and have been validated by several experts in teaching and pedagogy; the third is derived from the Italian high school mathematics Olympiad. We evaluate 10 powerful language models on these benchmarks and find that their performance is limited to 71% accuracy on Invalsi MATE, achieved by Llama 3.1 70b instruct, and to 88% on Invalsi ITA. For both Invalsi MATE and Invalsi ITA we compare LLMs with the average performance of Italian students, showing that Llama 3.1 is the only model that outperforms them on Invalsi MATE, while most models do so on Invalsi ITA. We then show that Olimpiadi MATE is more challenging than Invalsi MATE: the highest accuracy, achieved by Llama 3.1 405b instruct, is 45%.
@inproceedings{oai:iris.cnr.it:20.500.14243/529024,
  title  = {The Invalsi benchmarks: measuring the linguistic and mathematical understanding of large language models in Italian},
  author = {Puccetti, G. and Cassese, M. and Esuli, A.},
  year   = {2025}
}