2025
Conference article
Open Access
Stress-testing machine generated text detection: shifting language models writing style to fool detectors
Pedrotti A., Papucci M., Ciaccio C., Miaschi A., Puccetti G., Dell'Orletta F., Esuli A.
Recent advancements in Generative AI and Large Language Models (LLMs) have enabled the creation of highly realistic synthetic content, raising concerns about the potential for malicious use, such as misinformation and manipulation. Moreover, detecting Machine-Generated Text (MGT) remains challenging due to the lack of robust benchmarks that assess generalization to real-world scenarios. In this work, we evaluate the resilience of state-of-the-art MGT detectors (e.g., Mage, Radar, LLM-DetectAIve) to linguistically informed adversarial attacks. We develop a pipeline that fine-tunes language models using Direct Preference Optimization (DPO) to shift the MGT style toward human-written text (HWT), obtaining generations that are more challenging for current models to detect. Additionally, we analyze the linguistic shifts induced by the alignment and how detectors rely on “linguistic shortcuts” to detect texts. Our results show that detectors can be easily fooled with relatively few examples, resulting in a significant drop in detection performance. This highlights the importance of improving detection methods and making them robust to unseen in-domain texts. We release code, models, and data to support future research on more robust MGT detection benchmarks.
DOI: 10.18653/v1/2025.findings-acl.156
Project(s): SoBigData
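The alignment step described in the abstract uses Direct Preference Optimization. As an illustrative sketch (not the authors' code), the DPO objective for one preference pair — with the human-written text as the "chosen" completion and the machine generation as "rejected" — can be written as:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a single preference pair.

    logp_* are summed token log-probabilities of the completion under
    the policy being trained; ref_logp_* are the same quantities under
    the frozen reference model. In the style-shifting setup above,
    'chosen' would be the human-written text and 'rejected' the
    machine generation (an assumption for illustration).
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)): small when the policy prefers 'chosen'
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the policy raises the chosen text's likelihood
# relative to the reference model:
weak = dpo_loss(-50.0, -40.0, -50.0, -40.0)    # no shift: margin 0
strong = dpo_loss(-45.0, -45.0, -50.0, -40.0)  # chosen up, rejected down
```

Minimizing this loss over many (human, machine) pairs is what nudges the generator's style toward HWT.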
See at:
aclanthology.org
| CNR IRIS
2025
Conference article
Open Access
DIACU: a dataset for the DIAchronic analysis of Church Slavonic
Cassese M., Puccetti G., Napolitano M., Esuli A.
The Church Slavonic language has evolved over time without being formalized into a precise grammar. Therefore, there is currently no clearly outlined history of this language tracing its evolution. However, in recent years, there has been a greater effort to digitize these resources, partly motivated by increased sensitivity to the need to preserve multilingual knowledge. To exploit them, we propose DIACU (DIAchronic Analysis of Church Slavonic), a comprehensive collection of several existing corpora in Church Slavonic. In this work, we thoroughly describe the collection of this novel dataset and test its effectiveness as a training set for attributing Slavonic texts to specific periods. The dataset and the code of the experiments are available at https://github.com/MariaCassese/DIACU.
DOI: 10.18653/v1/2025.bsnlp-1.12
See at:
aclanthology.org
| CNR IRIS
2025
Conference article
Open Access
Optimizing LLMs for Italian: reducing token fertility and enhancing efficiency through vocabulary adaptation
Moroni L., Puccetti G., Huguet Cabot P. -L., Bejgu A. S., Barba E., Miaschi A., Dell'Orletta F., Esuli A., Navigli R.
The number of pretrained Large Language Models (LLMs) is increasing steadily, though the majority are designed predominantly for the English language. While state-of-the-art LLMs can handle other languages, due to language contamination or some degree of multilingual pretraining data, they are not optimized for non-English languages, leading to inefficient encoding (high token "fertility") and slower inference speed. In this work, we thoroughly compare a variety of vocabulary adaptation techniques for optimizing English LLMs for the Italian language, and put forward Semantic Alignment Vocabulary Adaptation (SAVA), a novel method that leverages neural mapping for vocabulary substitution. SAVA achieves competitive performance across multiple downstream tasks, enhancing grounded alignment strategies. We adapt two LLMs: Mistral-7B-v0.1, reducing token fertility by 25%, and Llama-3.1-8B, optimizing the vocabulary and reducing the number of parameters by 1 billion. We show that, following the adaptation of the vocabulary, these models can recover their performance with a relatively limited stage of continual training on the target language. Finally, we test the capabilities of the adapted models on various multi-choice and generative tasks.
DOI: 10.18653/v1/2025.findings-naacl.371
DOI: 10.48550/arxiv.2504.17025
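Token fertility, the efficiency measure this paper targets, is the average number of subword tokens a tokenizer produces per word. A minimal sketch of the metric (the fixed-length chunker stands in for a real BPE tokenizer and is purely an assumption for illustration):

```python
def toy_tokenize(word, max_len=4):
    """Stand-in for a subword tokenizer: split a word into fixed-size
    chunks. A real BPE vocabulary would merge frequent substrings
    instead, but the fertility computation is identical."""
    return [word[i:i + max_len] for i in range(0, len(word), max_len)]

def fertility(text, tokenize=toy_tokenize):
    """Average number of subword tokens per whitespace-separated word."""
    words = text.split()
    n_tokens = sum(len(tokenize(w)) for w in words)
    return n_tokens / len(words)

# Under an English-centric vocabulary, longer Italian words fragment
# into more pieces, raising fertility (and thus inference cost):
f_en = fertility("the cat sat on the mat")
f_it = fertility("il gatto si sedette sul tappetino")
```

A 25% fertility reduction, as reported for the adapted Mistral-7B-v0.1, means proportionally fewer tokens to encode and decode the same Italian text.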
See at:
aclanthology.org
| arXiv.org e-Print Archive
| CNR IRIS
| doi.org
| Archivio della ricerca- Università di Roma La Sapienza
2025
Journal article
Open Access
Automatic extraction of regesta for medieval latin text summarization
Puccetti G., Righi L., Sabbatini I., Esuli A.
We produced a novel dataset of 4,533 medieval Latin regesta (summaries) paired with full texts, extracted through a meticulous pipeline involving manual annotation, custom model training, text extraction, and post-processing to ensure high-quality, structured data for AI-driven summarization tasks.
Source: ERCIM NEWS, vol. 141, pp. 31-32
See at:
ercim-news.ercim.eu
| CNR IRIS
2025
Other
Open Access
ISTI-day 2025 Proceedings
Del Corso G., Pedrotti A., Federico G., Gennaro C., Carrara F., Amato G., Di Benedetto M., Gabrielli E., Belli D., Matrullo Z., Miori V., Tolomei G., Waheed T., Marchetti E., Calabrò A., Rossetti G., Stella M., Cazabet R., Abramski K., Cau E., Citraro S., Failla A., Mesina V., Morini V., Pansanella V., Colantonio S., Germanese D., Pascali M. A., Bianchi L., Messina N., Falchi F., Barsellotti L., Pacini G., Cassese M., Puccetti G., Esuli A., Volpi L., Moreo A., Sebastiani F., Sperduti G., Nguyen D., Broccia G., Ter Beek M. H., Ferrari A., Massink M., Belmonte G., Ciancia V., Papini O., Canapa G., Catricalà B., Manca M., Paternò F., Santoro C., Zedda E., Gallo S., Maenza S., Mattioli A., Simeoli L., Rucci D., Carlini E., Dazzi P., Kavalionak H., Mordacchini M., Rulli C., Muntean Cristina Ioana, Nardini F. M., Perego R., Rocchietti G., Lettich F., Renso C., Pugliese C., Casini G., Haldimann J., Meyer T., Assante M., Candela L., Dell'Amico A., Frosini L., Mangiacrapa F., Oliviero A., Pagano P., Panichi G., Peccerillo B., Procaccini M., Mannocci A., Manghi P., Lonetti F., Kang D., Di Giandomenico F., Jee E., Lazzini G., Conti F., Scopigno R., D'Acunto M., Moroni D., Cafiso M., Paradisi P., Callieri M., Pavoni G., Corsini M., De Falco A., Sala F., Saraceni Q., Gattiglia G.
ISTI-Day is an annual information and networking event organized by the Institute of Information Science and Technologies "A. Faedo" (ISTI) of the Italian National Research Council (CNR). This event features an opening talk by the Director of the Dept. DIITET (Emilio F. Campana) as well as an overview of the Institute's activities presented by the ISTI Director (Roberto Scopigno). These institutional segments are complemented by dedicated presentations and round tables featuring former staff members, as well as internal and external collaborators.
To foster a network of knowledge and collaboration among newcomers, the 2025 ISTI-Day edition also includes a large poster session that provides a comprehensive overview of current research activities. Each of the 13 laboratories contributes 1–3 posters, highlighting the most innovative work and offering early-career researchers a platform for discussion. These proceedings thus include the posters selected for ISTI-Day 2025, reflecting the diverse and innovative nature of the Institute's research.
See at:
CNR IRIS
| www.isti.cnr.it
2025
Conference article
Open Access
GenAI content detection Task 1: English and multilingual machine-generated text detection: AI vs. Human
Wang Y., Shelmanov A., Mansurov J., Tsvigun A., Mikhailov V., Xing R., Xie Z., Geng J., Puccetti G., Artemova E., Su J., Ta M. N., Abassy M., Elozeiri K., El Dine Ahmed S., Goloburda M., Mahmoud T., Tomar R. V., Aziz A., Laiyk N., Afzal O. M., Koike R., Kaneko M., Aji A. F., Habash N., Gurevych I., Nakov P.
We present the GenAI Content Detection Task 1 - a shared task on binary machine-generated text detection, conducted as a part of the GenAI workshop at COLING 2025. The task consists of two subtasks: Monolingual (English) and Multilingual. The shared task attracted many participants: 36 teams made official submissions to the Monolingual subtask during the test phase and 27 teams to the Multilingual subtask. We provide a comprehensive overview of the data, a summary of the results - including system rankings and performance scores - detailed descriptions of the participating systems, and an in-depth analysis of submissions.
See at:
aclanthology.org
| CNR IRIS
2025
Conference article
Open Access
Prompt-based bias control in large language models: a mechanistic analysis
Cassese M., Puccetti G., Esuli A.
This study investigates the role of prompt design in controlling stereotyped content generation in large language models (LLMs). Specifically, we examine how adding a fairness-oriented request to the prompt instructions influences both the output and internal states of LLMs. Using the StereoSet dataset, we evaluate models from different families (Llama, Gemma, OLMo) with base and fairness-focused prompts. Human evaluations reveal that models exhibit medium levels of stereotyped output by default, with a varying impact of fairness prompts on reducing it. We apply for the first time a mechanistic interpretability technique (Logit Lens) to the task, showing the depth of the impact of the fairness prompts in the stack of transformer layers, and finding that even with the fairness prompt, stereotypical words remain more probable than anti-stereotypical ones across most layers. While fairness prompts reduce stereotypical probabilities, they are insufficient to reverse the overall trend. This study is an initial step into the analysis of the presence and propagation of stereotype bias in LLMs, and the findings highlight the challenges of mitigating bias through prompt engineering, suggesting the need for broader interventions on models.
Source: CEUR WORKSHOP PROCEEDINGS, vol. 4074, pp. 324-337. Pisa, Italy, 9-10 June 2025
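The Logit Lens technique mentioned in the abstract reads out a next-token distribution at every layer by projecting intermediate hidden states through the model's final layer norm and unembedding matrix. A minimal numpy sketch of the idea (shapes and variable names are illustrative, not tied to any specific model):

```python
import numpy as np

def logit_lens(hidden_states, W_U, ln_gain, ln_bias, eps=1e-5):
    """Per-layer vocabulary distributions at one token position.

    hidden_states: (n_layers, d_model) residual-stream vectors
    W_U:           (d_model, vocab)    unembedding matrix
    ln_gain/bias:  (d_model,)          final layer-norm parameters
    """
    probs = []
    for h in hidden_states:
        # apply the final layer norm, as the model would before unembedding
        h_norm = (h - h.mean()) / np.sqrt(h.var() + eps)
        logits = (ln_gain * h_norm + ln_bias) @ W_U
        e = np.exp(logits - logits.max())  # stable softmax
        probs.append(e / e.sum())
    return np.stack(probs)  # (n_layers, vocab)

# Random stand-in weights, just to show the shapes involved:
rng = np.random.default_rng(0)
d, v, n_layers = 16, 10, 4
P = logit_lens(rng.normal(size=(n_layers, d)), rng.normal(size=(d, v)),
               np.ones(d), np.zeros(d))
```

Comparing the probability of stereotypical vs. anti-stereotypical words in each row of `P` is the kind of layer-by-layer analysis the study performs.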
Project(s): ITSERR Italian Strengthening of the ESFRI RI RESILIENCE
See at:
ceur-ws.org
| CNR IRIS
2025
Conference article
Open Access
Scaling laws for robust comparison of open foundation language-vision models and datasets
Nezhurina M., Porian T., Puccetti G., Kerssies T., Beaumont R., Cherti M., Jitsev J.
In studies of transferable learning, scaling laws are obtained for various important foundation models to predict their properties and performance at larger scales. Taking language-vision learning as an example, we show here how scaling law derivation can also be used for model and dataset comparison, allowing one to decide which procedure is to be preferred for pre-training. Full scaling laws based on dense measurements across a wide span of model and samples-seen scales are derived for two important language-vision learning procedures, CLIP and MaMMUT, which use either a contrastive-only loss or a combined contrastive and captioning text-generative loss. For the first time, we use derived scaling laws to compare both models and three open datasets, DataComp-1.4B, Re-LAION-1.4B and DFN-1.4B, while ensuring sufficient prediction accuracy on held-out points. From the comparison, we obtain evidence for (i) MaMMUT's stronger improvement with scale and better sample efficiency than standard CLIP and (ii) DFN-1.4B outperforming the other open datasets. To strengthen the validity of the comparison, we show scaling laws for various downstream tasks (classification, retrieval, and segmentation), observing consistently the same scaling trends for models and datasets across tasks. We show that the comparison can also be performed when deriving scaling laws with a constant learning rate schedule, reducing compute cost. Accurate derivation of scaling laws thus provides a means to perform model and dataset comparison on an aligned common compute axis across a large span of scales, avoiding misleading conclusions based on measurements from a few isolated single reference scales only. This paves the road for guided collective improvement of open foundation models and training datasets, as scaling-law-based comparisons from various studies executed in a common frame can be combined to identify overall better procedures.
We release all the pre-trained models with their intermediate checkpoints, including openMaMMUT-L/14, which achieves 80.3% zero-shot ImageNet-1k accuracy, trained on 12.8B samples from DataComp-1.4B.
Source: ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS, vol. 39. San Diego, CA, USA, 2-7 December 2025
See at:
CNR IRIS
| neurips.cc
2025
Conference article
Open Access
REVERINO: REgesta generation VERsus latIN summarizatiOn
Puccetti G., Righi L., Sabbatini I., Esuli A.
In this work we introduce the REVERINO dataset, a collection of 4533 pairs of Latin regesta with their respective full-text medieval pontifical documents, extracted from two collections, Epistolae saeculi XIII e regestis pontificum Romanorum selectae (1216-1268) and Les Registres de Gregoire IX (1227/41). We describe the pipeline used to extract the text from the images of the printed pages and we provide a high-level analysis of the corpus. After developing REVERINO we use it as a benchmark to test the ability of Large Language Models (LLMs) to generate the regestum of a given Latin text. We test 3 LLMs among the best performing ones, GPT-4o, Llama 3.1 70B and Llama 3.1 405B, and find that GPT-4o is the best at generating text in Latin. Interestingly, we also find that for Llama models it can be beneficial to first generate a text in English and then translate it into Latin to write better regesta.
Source: CEUR WORKSHOP PROCEEDINGS, vol. 3937. Udine, Italy, 2021/02/2025
See at:
ceur-ws.org
| CNR IRIS
2024
Conference article
Open Access
SemEval-2024 task 8: multidomain, multimodel and multilingual machine-generated text detection
Wang Y., Mansurov J., Ivanov P., Su J., Shelmanov A., Tsvigun A., Afzal O. M., Mahmoud T., Puccetti G., Arnold T., Whitehouse C., Aji A. F., Habash N., Gurevych I., Nakov P.
We present the results and the main findings of SemEval-2024 Task 8: Multigenerator, Multidomain, and Multilingual Machine-Generated Text Detection. The task featured three subtasks. Subtask A is a binary classification task determining whether a text is written by a human or generated by a machine. This subtask has two tracks: a monolingual track focused solely on English texts and a multilingual track. Subtask B is to detect the exact source of a text, discerning whether it is written by a human or generated by a specific LLM. Subtask C aims to identify the changing point within a text at which the authorship transitions from human to machine. The task attracted a large number of participants: subtask A monolingual (126), subtask A multilingual (59), subtask B (70), and subtask C (30). In this paper, we present the task, analyze the results, and discuss the system submissions and the methods they used. For all subtasks, the best systems used LLMs.
DOI: 10.18653/v1/2024.semeval-1.279
See at:
aclanthology.org
| CNR IRIS
| doi.org
2024
Conference article
Open Access
INVALSI - mathematical and language understanding in Italian: a CALAMITA challenge
Puccetti G., Cassese M., Esuli A.
While Italian is a high-resource language, there are few Italian-native benchmarks to evaluate the generative abilities of Language Models (LMs) in this language. This work presents two new benchmarks: Invalsi MATE, to evaluate models' performance on mathematical understanding in Italian, and Invalsi ITA, to evaluate language understanding in Italian. These benchmarks are based on the Invalsi tests, which are administered to students between the ages of 6 and 18 within the Italian school system. These tests are prepared by expert pedagogists and have the explicit goal of testing average students' performance over time across Italy. Therefore, the questions are well written, appropriate for the age of the students, and are developed with the goal of assessing students' skills that are essential in the learning process, ensuring that the benchmark proposed here measures key knowledge for undergraduate students. Invalsi MATE is composed of 420 questions about mathematical understanding; these questions range from simple money-counting problems to Cartesian geometry questions, e.g. determining if a point belongs to a given line. They are divided into 4 different types: scelta multipla (multiple choice), vero/falso (true/false), numero (number), completa frase (fill the gap). Invalsi ITA is composed of 1279 questions regarding language understanding; these questions involve both the ability to extract information and answer questions about a text passage as well as questions about grammatical knowledge. They are divided into 4 different types: scelta multipla (multiple choice), binaria (binary), domanda aperta (open question), altro (other). We evaluate 4 powerful language models, both English-first and tuned for Italian, and find that the best accuracy on Invalsi MATE is 55% while the best accuracy on Invalsi ITA is 80%.
Source: CEUR WORKSHOP PROCEEDINGS, vol. 3878. Pisa, Italy, 4-6/12/2024
See at:
ceur-ws.org
| CNR IRIS
2024
Other
Open Access
AIMH Research Activities 2024
Aloia N., Amato G., Bartalesi Lenzi V., Bianchi L., Bolettieri P., Bosio C., Carraglia M., Carrara F., Casarosa V., Cassese M., Ciampi L., Coccomini D. A., Concordia C., Connor R., Corbara S., De Martino C., Di Benedetto M., Esuli A., Falchi F., Fazzari E., Gennaro C., Iannello L., Negi K., Lagani G., Lenzi E., Leocata M., Malvaldi M., Meghini C., Messina N., Moreo Fernandez A., Nardi A., Pacini G., Pedrotti A., Pratelli N., Puccetti G., Rabitti F., Savino P., Scotti F., Sebastiani F., Sperduti G., Thanos C., Trupiano L., Vadicamo L., Vairo C., Versienti L., Volpi L.
The AIMH (Artificial Intelligence for Media and Humanities) laboratory is committed to advancing the field of Artificial Intelligence, with a special emphasis on its applications in digital media and the humanities. The lab aims to improve AI technologies, particularly in areas such as deep learning, text analysis, computer vision, multimedia information retrieval, content analysis, recognition, and retrieval. This report summarizes the laboratory's achievements and activities over the course of 2024.
DOI: 10.32079/isti-ar-2024/001
See at:
CNR IRIS
2024
Conference article
Open Access
You write like a GPT
Esuli A., Falchi F., Malvaldi M., Puccetti G.
We investigate how Raymond Queneau's "Exercises in Style" is evaluated by automatic methods for the detection of artificially-generated text. We work with Queneau's original French version and the Italian translation by Umberto Eco. We start by comparing how various methods for the detection of automatically generated text, also using different large language models, evaluate the different styles in the work. We then link this automatic evaluation to distinct characteristics related to the content and structure of the various styles. This work is an initial attempt at exploring how methods for the detection of artificially-generated text can find application as tools to evaluate the qualities and characteristics of human writing, to support better writing in terms of originality, informativeness, and clarity.
Source: CEUR WORKSHOP PROCEEDINGS, vol. 3878. Pisa, Italy, 4-6/12/2024
Project(s): Future Artificial Intelligence Research
See at:
ceur-ws.org
| CNR IRIS
2024
Conference article
Open Access
AI 'News' Content Farms Are Easy to Make and Hard to Detect: A Case Study in Italian
Puccetti G., Rogers A., Alzetta C., Dell'Orletta F., Esuli A.
Large Language Models (LLMs) are increasingly used as 'content farm' models (CFMs), to generate synthetic text that could pass for real news articles. This is already happening even for languages that do not have high-quality monolingual LLMs. We show that fine-tuning Llama (v1), mostly trained on English, on as little as 40K Italian news articles is sufficient for producing news-like texts that native speakers of Italian struggle to identify as synthetic. We investigate three LLMs and three methods of detecting synthetic texts (log-likelihood, DetectGPT, and supervised classification), finding that they all perform better than human raters, but they are all impractical in the real world (requiring either access to token likelihood information or a large dataset of CFM texts). We also explore the possibility of creating a proxy CFM: an LLM fine-tuned on a dataset similar to the one used by the real 'content farm'. We find that even a small amount of fine-tuning data suffices for creating a successful detector, but we need to know which base LLM is used, which is a major challenge. Our results suggest that there are currently no practical methods for detecting synthetic news-like texts 'in the wild', while generating them is too easy. We highlight the urgency of more NLP research on this problem.
Source: PROCEEDINGS OF THE CONFERENCE - ASSOCIATION FOR COMPUTATIONAL LINGUISTICS. MEETING, vol. 1, pp. 15312-15338. Thailand, 2024
DOI: 10.18653/v1/2024.acl-long.817
DOI: 10.48550/arxiv.2406.12128
See at:
IRIS Cnr
| aclanthology.org
| arXiv.org e-Print Archive
| CNR IRIS
| doi.org
2024
Conference article
Open Access
ABRICOT - ABstRactness and Inclusiveness in COntexT: a CALAMITA challenge
Puccetti G., Collacciani C., Ravelli A. A., Esuli A., Bolognesi M. M.
The ABRICOT Task is designed to evaluate Italian language models on their ability to understand and assess the abstractness and inclusiveness of language, two nuanced features that humans naturally convey in everyday communication. Unlike binary categorizations such as abstract/concrete or inclusive/exclusive, these features exist on a continuous spectrum with varying degrees of intensity. The task is based on a manual collection of sentences that present the same noun phrase (NP) in different contexts, allowing its interpretation to vary between the extremes of abstractness and inclusiveness. This challenge aims to verify how LLMs perceive subtle linguistic variations and their implications in natural language.
Source: CEUR WORKSHOP PROCEEDINGS, vol. 3878. Pisa, Italy, 4-6/12/2024
See at:
ceur-ws.org
| CNR IRIS
2024
Conference article
Open Access
M4GT-Bench: evaluation benchmark for black-box machine-generated text detection
Wang Y., Mansurov J., Ivanov P., Su J., Shelmanov A., Tsvigun A., Mohammed Afzal O., Mahmoud T., Puccetti G., Arnold T., Aji A., Habash N., Gurevych I., Nakov P.
The advent of Large Language Models (LLMs) has brought an unprecedented surge in machine-generated text (MGT) across diverse channels. This raises legitimate concerns about its potential misuse and societal implications. The need to identify and differentiate such content from genuine human-generated text is critical in combating disinformation, preserving the integrity of education and scientific fields, and maintaining trust in communication. In this work, we address this problem by introducing a new benchmark based on a multilingual, multi-domain and multi-generator corpus of MGTs: M4GT-Bench. The benchmark comprises three tasks: (1) monolingual and multilingual binary MGT detection; (2) multi-way detection, where one needs to identify which particular model generated the text; and (3) mixed human-machine text detection, where a word boundary delimiting MGT from human-written content should be determined. On the developed benchmark, we have tested several MGT detection baselines and also conducted an evaluation of human performance. We see that obtaining good performance in MGT detection usually requires access to the training data from the same domain and generators.
DOI: 10.18653/v1/2024.acl-long.218
DOI: 10.48550/arxiv.2402.11175
See at:
aclanthology.org
| arXiv.org e-Print Archive
| CNR IRIS
| doi.org