Page 1 of 1

2024 Conference article Open Access

SemEval-2024 task 8: multidomain, multimodel and multilingual machine-generated text detection
Wang Y., Mansurov J., Ivanov P., Su J., Shelmanov A., Tsvigun A., Afzal O. M., Mahmoud T., Puccetti G., Arnold T., Whitehouse C., Aji A. F., Habash N., Gurevych I., Nakov P.
We present the results and the main findings of SemEval-2024 Task 8: Multigenerator, Multidomain, and Multilingual Machine-Generated Text Detection. The task featured three subtasks. Subtask A is a binary classification task determining whether a text is written by a human or generated by a machine. This subtask has two tracks: a monolingual track focused solely on English texts and a multilingual track. Subtask B is to detect the exact source of a text, discerning whether it is written by a human or generated by a specific LLM. Subtask C aims to identify the changing point within a text, at which the authorship transitions from human to machine. The task attracted a large number of participants: subtask A monolingual (126), subtask A multilingual (59), subtask B (70), and subtask C (30). In this paper, we present the task, analyze the results, and discuss the system submissions and the methods they used. For all subtasks, the best systems used LLMs.DOI: 10.18653/v1/2024.semeval-1.279
Metrics:

See at: aclanthology.org Open Access | CNR IRIS | doi.org Restricted | CNR IRIS

2022 Conference article Open Access

Outlier dimensions that disrupt transformers are driven by frequency
Puccetti G., Rogers A., Drozd A., Dell'Orletta F.
While Transformer-based language models are generally very robust to pruning, there is the recently discovered outlier phenomenon: disabling only 48 out of 110M parameters in BERT-base drops its performance by nearly 30% on MNLI. We replicate the original evidence for the outlier phenomenon and we link it to the geometry of the embedding space. We find that in both BERT and RoBERTa the magnitude of hidden state coefficients corresponding to outlier dimensions correlates with the frequency of encoded tokens in pre-training data, and it also contributes to the “vertical” self-attention pattern enabling the model to focus on the special tokens. This explains the drop in performance from disabling the outliers, and it suggests that to decrease anisotropicity in future models we need pre-training schemas that would better take into account the skewed token distributions.DOI: 10.18653/v1/2022.findings-emnlp.93
DOI: 10.48550/arxiv.2205.11380
Metrics:

2023 Journal article Restricted

Unveiling the inventive process from patents by extracting problems, solutions and advantages with natural language processing
Giordano V., Puccetti G., Chiarello F., Pavanello T., Fantoni G.
Patents are the main means for disclosing an invention. These documents encompass many steps of the inventive process starting with the definition of the problem to be solved and ending with the identification of a solution. In this study we focus on three fundamental concepts of the inventive process: (A) technical problems; (B) solutions; and (C) advantageous effects of the invention, which, based on the WIPO guidelines, any patent should include. We propose a system based on Natural Language Processing (NLP) pipeline that uses transformer language models to identify technical problems, solutions and advantageous effects from patents. We use a training dataset composed of 480,000 patents sentences contained in sections manually labelled by inventors or attorneys. Our model reaches a F1 score of 90%. The model is evaluated on a random set of patents to assess its deployability in a real-world scenario. The proposed model can be used as a novel tool for prior art mapping, novel ideas generation and technological evolution identification and can help to disclose valuable information hidden in patent documents.Source: EXPERT SYSTEMS WITH APPLICATIONS, vol. 229 (issue part A)
DOI: 10.1016/j.eswa.2023.120499
DOI: 10.2139/ssrn.4223458
Metrics:

2021 Journal article Restricted

A simple and fast method for Named Entity context extraction from patents
Puccetti G., Chiarello F., Fantoni G.
The process of extracting relevant technical information from patents or technical literature is as valuable as it is challenging. It deals with highly relevant information extraction from a corpus of documents with particular structure, and a mix of technical and legal jargon. Patents are the wider free source of technical information where homogeneous entities can be found. From a technical perspective the approaches refer to Named Entity Recognition (NER) and make use of Machine Learning techniques for Natural Language Processing (NLP). However, due to the large amount of data, to the complexity of the lexicon, the peculiarity of the structure and the scarcity of the examples to be used to feed the machine learning system, new approaches should be studied. NER methods are increasing their performances in many contexts, but a gap still exists when dealing with technical documentation. The aim of this work is to create an automatic training sets for NER systems by exploiting the nature and structure of patents, an open and massive source of technical documentation. In particular, we focus on collecting the context where users of the invention appear within patents. We then measure to which extent we achieve our goal and discuss how much our method is generalizable to other entities and documents.Source: EXPERT SYSTEMS WITH APPLICATIONS, vol. 184
DOI: 10.1016/j.eswa.2021.115570
Metrics:

2022 Journal article Open Access

Technology identification from patent texts: a novel named entity recognition method
Puccetti G., Giordano V., Spada I., Chiarello F., Fantoni G.
Identifying technologies is a key element for mapping a domain and its evolution. It allows managers and decision makers to anticipate trends for an accurate forecast and effective foresight. Researchers and practitioners are taking advantage of the rapid growth of the publicly accessible sources to map technological domains. Among these sources, patents are the widest technical open access database used in the literature and in practice. Nowadays, Natural Language Processing (NLP) techniques enable new methods for the analysis of patent texts. Among these techniques, in this paper we explore the use of Named Entity Recognition (NER) with the purpose to identify the technologies mentioned in patents' text. We compare three different NER methods, gazetteer-based, rule-based and deep learning-based (e.g. BERT), measuring their performances in terms of precision, recall and computational time. We test the approaches on 1600 patents from four assorted IPC classes as case studies. Our NER systems collected over 4500 fine-grained technologies, achieving the best results thanks to the combination of the three methodologies. The proposed method overcomes the literature thanks to the ability to filter generic technological terms. Our study delineates a valid technology identification tool that can be integrated in any text analysis pipeline to support academics and companies in investigating a technological domain.Source: TECHNOLOGICAL FORECASTING AND SOCIAL CHANGE, vol. 186 (issue part B)
DOI: 10.1016/j.techfore.2022.122160
Metrics:

2020 Conference article Open Access

B4DS @ PRELEARN: ensemble method for prerequisite learning
Puccetti G., Bolanos L., Chiarello F., Fantoni G.
In this paper we describe the methodologies we proposed to tackle the EVALITA 2020 shared task PRELEARN. We propose both a methodology based on gated recurrent units as well as one using more classical word embeddings together with ensemble methods. Our goal in choosing these approaches, is twofold, on one side we wish to see how much of the prerequisite information is present within the pages themselves. On the other we would like to compare how much using the information from the rest of Wikipedia can help in identifying this type of relation. This second approach is particularly useful in terms of extension to new entities close to the one in the corpus provided for the task but not actually present in it. With this methodologies we reached second position in the challengeSource: CEUR WORKSHOP PROCEEDINGS, vol. 2765. Online, 17/12/2020

See at: ceur-ws.org Open Access | CNR IRIS | CNR IRIS Restricted

2021 Conference article Open Access

How do BERT embeddings organize linguistic knowledge?
Puccetti G., Miaschi A., Dell'Orletta F.
Several studies investigated the linguistic information implicitly encoded in Neural Language Models. Most of these works focused on quantifying the amount and type of information available within their internal representations and across their layers. In line with this scenario, we proposed a different study, based on Lasso regression, aimed at understanding how the information encoded by BERT sentence-level representations is arrange within its hidden units. Using a suite of several probing tasks, we showed the existence of a relationship between the implicit knowledge learned by the model and the number of individual units involved in the encodings of this competence. Moreover, we found that it is possible to identify groups of hidden units more relevant for specific linguistic properties.DOI: 10.18653/v1/2021.deelio-1.6
Metrics:

2024 Conference article Open Access

M4GT-Bench: evaluation benchmark for black-box machine-generated text detection
Wang Y., Mansurov J., Ivanov P., Su J., Shelmanov A., Tsvigun A., Mohammed Afzal O., Mahmoud T., Puccetti G., Arnold T., Aji A., Habash N., Gurevych I., Nakov P.
The advent of Large Language Models (LLMs) has brought an unprecedented surge in machine-generated text (MGT) across diverse channels. This raises legitimate concerns about its potential misuse and societal implications. The need to identify and differentiate such content from genuine human-generated text is critical in combating disinformation, preserving the integrity of education and scientific fields, and maintaining trust in communication. In this work, we address this problem by introducing a new benchmark based on a multilingual, multi-domain and multi-generator corpus of MGTs — M4GT-Bench. The benchmark is compiled of three tasks: (1) mono-lingual and multi-lingual binary MGT detection; (2) multi-way detection where one need to identify, which particular model generated the text; and (3) mixed human-machine text detection, where a word boundary delimiting MGT from human-written content should be determined. On the developed benchmark, we have tested several MGT detection baselines and also conducted an evaluation of human performance. We see that obtaining good performance in MGT detection usually requires an access to the training data from the same domain and generators.DOI: 10.18653/v1/2024.acl-long.218
DOI: 10.48550/arxiv.2402.11175
Metrics:

2023 Conference article Open Access

AIMH at MULTI-Fake-DetectIVE: system report
Puccetti G, Esuli A
This report describes our contribution to the EVALITA 2023 shared task MULTI-Fake-DetectIVE which involves the classification of news including textual and visual components. To experiment on this task we focus on textual data augmentation, extending the Italian text and the Images available in the training set using machine translation models and image captioning ones. To train using different set of input features, we use different transformer encoders for each variant of text (Italian, English) and modality (Image). For Task 1, among the models we test, we find that using the Italian text together with its translation improves the model performance while the captions don't provide any improvement. We test the same architecture also on Task 2 although in this case we achieve less satisfactory resultsSource: CEUR WORKSHOP PROCEEDINGS. Parma, Italy, 7-9/09/2023
Project(s): SoBigData via OpenAIRE

See at: ceur-ws.org Open Access | CNR IRIS | ISTI Repository | CNR IRIS Restricted

2024 Conference article Open Access

INVALSI - mathematical and language understanding in Italian: a CALAMITA challenge
Puccetti G., Cassese M., Esuli A.
While Italian is a high resource language, there are few Italian-native benchmarks to evaluate Language Models (LMs) generative abilities in this language. This work presents two new benchmarks: Invalsi MATE to evaluate models performance on mathematical understanding in Italian and Invalsi ITA to evaluate language understanding in Italian. These benchmarks are based on the Invalsi tests, which are administered to students of age between 6 and 18 within the Italian school system. These tests are prepared by expert pedagogists and have the explicit goal of testing average students' performance over time across Italy. Therefore, the questions are well written, appropriate for the age of the students, and are developed with the goal of assessing students' skills that are essential in the learning process, ensuring that the benchmark proposed here measures key knowledge for undergraduate students. Invalsi MATE is composed of 420 questions about mathematical understanding, these questions range from simple money counting problems to Cartesian geometry questions, e.g. determining if a point belongs to a given line. They are divided into 4 different types: scelta multipla (multiple choice), vero/falso (true/false), numero (number), completa frase (fill the gap). Invalsi ITA is composed of 1279 questions regarding language understanding, these questions involve both the ability to extract information and answer questions about a text passage as well as questions about grammatical knowledge. They are divided into 4 different types: scelta multipla (multiple choice), binaria (binary), domanda aperta (open question), altro (other). We evaluate 4 powerful language models both English-first and tuned for Italian to see that best accuracy on Invalsi MATE is 55% while best accuracy on Invalsi ITA is 80%.Source: CEUR WORKSHOP PROCEEDINGS, vol. 3878. Pisa, Italy, 4-6/12/2024

See at: ceur-ws.org Open Access | CNR IRIS | CNR IRIS Restricted

2025 Conference article Open Access

The Invalsi benchmarks: measuring the linguistic and mathematical understanding of large language models in Italian
Puccetti G., Cassese M., Esuli A.
While Italian is a high-resource language, there are few Italian-native benchmarks to evaluate generative Large Language Models (LLMs) in this language. This work presents three new benchmarks: Invalsi MATE to evaluate models performance on mathematical understanding in Italian, Invalsi ITA to evaluate language understanding in Italian and Olimpiadi MATE for more complex mathematical understanding. The first two benchmarks are based on the Invalsi tests, which are administered to students of age between 6 and 18 within the Italian school system and have been validated by several experts in teaching and pedagogy, the third one comes from the Italian high school math Olympiad. We evaluate 10 powerful language models on these benchmarks and find that their performance is limited to 71% accuracy on Invalsi MATE, achieved by Llama 3.1 70b instruct and by 88% on Invalsi ITA. For both Invalsi MATE and Invalsi ITA we compare LLMs with the average performance of Italian students to show that Llama 3.1 is the only one which outperforms them on Invalsi MATE while most models do so on Invalsi ITA, we then show that Olimpiadi MATE is more challenging than Invalsi MATE and the highest accuracy, achieved by Llama 3.1 405b instruct is 45%.

See at: aclanthology.org Open Access | CNR IRIS | CNR IRIS Restricted

2024 Conference article Open Access

AI 'News' Content Farms Are Easy to Make and Hard to Detect: A Case Study in Italian
Puccetti G., Rogers A., Alzetta C., Dell'Orletta F., Esuli A.
Large Language Models (LLMs) are increasingly used as 'content farm' models (CFMs), to generate synthetic text that could pass for real news articles. This is already happening even for languages that do not have high-quality monolingual LLMs. We show that fine-tuning Llama (v1), mostly trained on English, on as little as 40K Italian news articles, is sufficient for producing news-like texts that native speakers of Italian struggle to identify as synthetic. We investigate three LLMs and three methods of detecting synthetic texts (log-likelihood, DetectGPT, and supervised classification), finding that they all perform better than human raters, but they are all impractical in the real world (requiring either access to token likelihood information or a large dataset of CFM texts). We also explore the possibility of creating a proxy CFM: an LLM fine-tuned on a similar dataset to one used by the real 'content farm'. We find that even a small amount of fine-tuning data suffices for creating a successful detector, but we need to know which base LLM is used, which is a major challenge. Our results suggest that there are currently no practical methods for detecting synthetic news-like texts 'in the wild', while generating them is too easy. We highlight the urgency of more NLP research on this problem.Source: PROCEEDINGS OF THE CONFERENCE - ASSOCIATION FOR COMPUTATIONAL LINGUISTICS. MEETING, vol. 1, pp. 15312-15338. tha, 2024
DOI: 10.18653/v1/2024.acl-long.817
DOI: 10.48550/arxiv.2406.12128
Metrics:

2024 Conference article Open Access

ABRICOT - ABstRactness and Inclusiveness in COntexT: a CALAMITA challenge
Puccetti G., Collacciani C., Ravelli A. A., Esuli A., Bolognesi M. M.
The ABRICOT Task is designed to evaluate Italian language models on their ability to understand and assess the abstractness and inclusiveness of language, two nuanced features that humans naturally convey in everyday communication. Unlike binary categorizations such as abstract/concrete or inclusive/exclusive, these features exist on a continuous spectrum with varying degrees of intensity. The task is based on a manual collection of sentences that present the same noun phrase (NP) in different contexts, allowing its interpretation to vary between the extremes of abstractness and inclusiveness. This challenge aims to verify the how LLMs perceive subtle linguistic variations and their implications in natural language.Source: CEUR WORKSHOP PROCEEDINGS, vol. 3878. Pisa, Italy, 4-6/12/2024

See at: ceur-ws.org Open Access | CNR IRIS | CNR IRIS Restricted

2024 Conference article Open Access

You write like a GPT
Esuli A., Falchi F., Malvaldi M., Puccetti G.
We investigate how Raymond Queneau's \textit{Exercises in Style} are evaluated by automatic methods for detection of artificially-generated text. We work with the Queneau's original French version, and the Italian translation by Umberto Eco. We start by comparing how various methods for the detection of automatically generated text, also using different large language models, evaluate the different styles in the opera. We then link this automatic evaluation to distinct characteristic related to content and structure of the various styles. This work is an initial attempt at exploring how methods for the detection of artificially-generated text can find application as tools to evaluate the qualities and characteristics of human writing, to support better writing in terms of originality, informativeness, clarity.Source: CEUR WORKSHOP PROCEEDINGS, vol. 3878. Pisa, Italy, 4-6/12/2024
Project(s): Future Artificial Intelligence Research

See at: ceur-ws.org Open Access | CNR IRIS | CNR IRIS Restricted

2023 Other Open Access

AIMH Research Activities 2023
Aloia N., Amato G., Bartalesi Lenzi V., Bianchi L., Bolettieri P., Bosio C., Carraglia M., Carrara F., Casarosa V., Ciampi L., Coccomini D. A., Concordia C., Corbara S., De Martino C., Di Benedetto M., Esuli A., Falchi F., Fazzari E., Gennaro C., Lagani G., Lenzi E., Meghini C., Messina N., Molinari A., Moreo Fernandez A., Nardi A., Pedrotti A., Pratelli N., Puccetti G., Rabitti F., Savino P., Sebastiani F., Sperduti G., Thanos C., Trupiano L., Vadicamo L., Vairo C., Versienti L.
The AIMH (Artificial Intelligence for Media and Humanities) laboratory is dedicated to exploring and pushing the boundaries in the field of Artificial Intelligence, with a particular focus on its application in digital media and humanities. This lab's objective is to enhance the current state of AI technology particularly on deep learning, text analysis, computer vision, multimedia information retrieval, multimedia content analysis, recognition, and retrieval. This report encapsulates the laboratory's progress and activities throughout the year 2023.DOI: 10.32079/isti-ar-2023/001
Metrics:

See at: CNR IRIS Open Access | ISTI Repository | CNR IRIS Restricted