2024
Conference article
Open Access
SemEval-2024 task 8: multidomain, multimodel and multilingual machine-generated text detection
Wang Y., Mansurov J., Ivanov P., Su J., Shelmanov A., Tsvigun A., Afzal O. M., Mahmoud T., Puccetti G., Arnold T., Whitehouse C., Aji A. F., Habash N., Gurevych I., Nakov P.
We present the results and the main findings of SemEval-2024 Task 8: Multigenerator, Multidomain, and Multilingual Machine-Generated Text Detection. The task featured three subtasks. Subtask A is a binary classification task determining whether a text is written by a human or generated by a machine. This subtask has two tracks: a monolingual track focused solely on English texts and a multilingual track. Subtask B is to detect the exact source of a text, discerning whether it is written by a human or generated by a specific LLM. Subtask C aims to identify the changing point within a text, at which the authorship transitions from human to machine. The task attracted a large number of participants: subtask A monolingual (126), subtask A multilingual (59), subtask B (70), and subtask C (30). In this paper, we present the task, analyze the results, and discuss the system submissions and the methods they used. For all subtasks, the best systems used LLMs.
See at:
aclanthology.org
| CNR IRIS
2023
Journal article
Restricted
Unveiling the inventive process from patents by extracting problems, solutions and advantages with natural language processing
Giordano V., Puccetti G., Chiarello F., Pavanello T., Fantoni G.
Patents are the main means for disclosing an invention. These documents encompass many steps of the inventive process, starting with the definition of the problem to be solved and ending with the identification of a solution. In this study we focus on three fundamental concepts of the inventive process: (A) technical problems; (B) solutions; and (C) advantageous effects of the invention, which, based on the WIPO guidelines, any patent should include. We propose a system based on a Natural Language Processing (NLP) pipeline that uses transformer language models to identify technical problems, solutions and advantageous effects from patents. We use a training dataset composed of 480,000 patent sentences contained in sections manually labelled by inventors or attorneys. Our model reaches an F1 score of 90%. The model is evaluated on a random set of patents to assess its deployability in a real-world scenario. The proposed model can be used as a novel tool for prior art mapping, novel idea generation and technological evolution identification, and can help to disclose valuable information hidden in patent documents.
Source: EXPERT SYSTEMS WITH APPLICATIONS, vol. 229 (issue part A)
See at:
CNR IRIS
| www.sciencedirect.com
2021
Journal article
Restricted
A simple and fast method for Named Entity context extraction from patents
Puccetti G., Chiarello F., Fantoni G.
The process of extracting relevant technical information from patents or technical literature is as valuable as it is challenging. It deals with highly relevant information extraction from a corpus of documents with a particular structure and a mix of technical and legal jargon. Patents are the widest free source of technical information where homogeneous entities can be found. From a technical perspective, the approaches refer to Named Entity Recognition (NER) and make use of Machine Learning techniques for Natural Language Processing (NLP). However, due to the large amount of data, the complexity of the lexicon, the peculiarity of the structure and the scarcity of examples to feed the machine learning system, new approaches should be studied. NER methods are improving their performance in many contexts, but a gap still exists when dealing with technical documentation. The aim of this work is to automatically create training sets for NER systems by exploiting the nature and structure of patents, an open and massive source of technical documentation. In particular, we focus on collecting the context where users of the invention appear within patents. We then measure to what extent we achieve our goal and discuss how far our method generalizes to other entities and documents.
Source: EXPERT SYSTEMS WITH APPLICATIONS, vol. 184
See at:
CNR IRIS
| www.sciencedirect.com
2022
Journal article
Open Access
Technology identification from patent texts: a novel named entity recognition method
Puccetti G., Giordano V., Spada I., Chiarello F., Fantoni G.
Identifying technologies is a key element for mapping a domain and its evolution. It allows managers and decision makers to anticipate trends for an accurate forecast and effective foresight. Researchers and practitioners are taking advantage of the rapid growth of publicly accessible sources to map technological domains. Among these sources, patents are the widest open access technical database used in the literature and in practice. Nowadays, Natural Language Processing (NLP) techniques enable new methods for the analysis of patent texts. Among these techniques, in this paper we explore the use of Named Entity Recognition (NER) to identify the technologies mentioned in patent texts. We compare three different NER methods, gazetteer-based, rule-based and deep learning-based (e.g. BERT), measuring their performance in terms of precision, recall and computational time. We test the approaches on 1600 patents from four assorted IPC classes as case studies. Our NER systems collected over 4500 fine-grained technologies, achieving the best results thanks to the combination of the three methodologies. The proposed method improves on the existing literature thanks to its ability to filter out generic technological terms. Our study delineates a valid technology identification tool that can be integrated in any text analysis pipeline to support academics and companies in investigating a technological domain.
Source: TECHNOLOGICAL FORECASTING AND SOCIAL CHANGE, vol. 186 (issue part B)
See at:
CNR IRIS
| www.sciencedirect.com
2024
Conference article
Open Access
INVALSI - mathematical and language understanding in Italian: a CALAMITA challenge
Puccetti G., Cassese M., Esuli A.
While Italian is a high-resource language, there are few Italian-native benchmarks to evaluate the generative abilities of Language Models (LMs) in this language. This work presents two new benchmarks: Invalsi MATE, to evaluate model performance on mathematical understanding in Italian, and Invalsi ITA, to evaluate language understanding in Italian. These benchmarks are based on the Invalsi tests, which are administered to students aged between 6 and 18 within the Italian school system. These tests are prepared by expert pedagogists and have the explicit goal of testing average students' performance over time across Italy. Therefore, the questions are well written, appropriate for the age of the students, and developed with the goal of assessing students' skills that are essential in the learning process, ensuring that the benchmark proposed here measures key knowledge for undergraduate students. Invalsi MATE is composed of 420 questions about mathematical understanding; these questions range from simple money counting problems to Cartesian geometry questions, e.g. determining if a point belongs to a given line. They are divided into 4 different types: scelta multipla (multiple choice), vero/falso (true/false), numero (number), completa frase (fill the gap). Invalsi ITA is composed of 1279 questions regarding language understanding; these questions involve both the ability to extract information and answer questions about a text passage and questions about grammatical knowledge. They are divided into 4 different types: scelta multipla (multiple choice), binaria (binary), domanda aperta (open question), altro (other). We evaluate 4 powerful language models, both English-first and tuned for Italian, and find that the best accuracy on Invalsi MATE is 55%, while the best accuracy on Invalsi ITA is 80%.
Source: CEUR WORKSHOP PROCEEDINGS, vol. 3878. Pisa, Italy, 4-6/12/2024
See at:
ceur-ws.org
| CNR IRIS
2024
Conference article
Open Access
AI 'News' Content Farms Are Easy to Make and Hard to Detect: A Case Study in Italian
Puccetti G., Rogers A., Alzetta C., Dell'Orletta F., Esuli A.
Large Language Models (LLMs) are increasingly used as 'content farm' models (CFMs), to generate synthetic text that could pass for real news articles. This is already happening even for languages that do not have high-quality monolingual LLMs. We show that fine-tuning Llama (v1), mostly trained on English, on as little as 40K Italian news articles, is sufficient for producing news-like texts that native speakers of Italian struggle to identify as synthetic. We investigate three LLMs and three methods of detecting synthetic texts (log-likelihood, DetectGPT, and supervised classification), finding that they all perform better than human raters, but they are all impractical in the real world (requiring either access to token likelihood information or a large dataset of CFM texts). We also explore the possibility of creating a proxy CFM: an LLM fine-tuned on a similar dataset to one used by the real 'content farm'. We find that even a small amount of fine-tuning data suffices for creating a successful detector, but we need to know which base LLM is used, which is a major challenge. Our results suggest that there are currently no practical methods for detecting synthetic news-like texts 'in the wild', while generating them is too easy. We highlight the urgency of more NLP research on this problem.
Source: PROCEEDINGS OF THE CONFERENCE - ASSOCIATION FOR COMPUTATIONAL LINGUISTICS. MEETING, vol. 1, pp. 15312-15338. tha, 2024
See at:
aclanthology.org
| CNR IRIS
2024
Conference article
Open Access
ABRICOT - ABstRactness and Inclusiveness in COntexT: a CALAMITA challenge
Puccetti G., Collacciani C., Ravelli A. A., Esuli A., Bolognesi M. M.
The ABRICOT Task is designed to evaluate Italian language models on their ability to understand and assess the abstractness and inclusiveness of language, two nuanced features that humans naturally convey in everyday communication. Unlike binary categorizations such as abstract/concrete or inclusive/exclusive, these features exist on a continuous spectrum with varying degrees of intensity. The task is based on a manual collection of sentences that present the same noun phrase (NP) in different contexts, allowing its interpretation to vary between the extremes of abstractness and inclusiveness. This challenge aims to verify how LLMs perceive subtle linguistic variations and their implications in natural language.
Source: CEUR WORKSHOP PROCEEDINGS, vol. 3878. Pisa, Italy, 4-6/12/2024
See at:
ceur-ws.org
| CNR IRIS
2024
Conference article
Open Access
You write like a GPT
Esuli A., Falchi F., Malvaldi M., Puccetti G.
We investigate how Raymond Queneau's Exercises in Style are evaluated by automatic methods for the detection of artificially generated text. We work with Queneau's original French version and the Italian translation by Umberto Eco. We start by comparing how various methods for the detection of automatically generated text, also using different large language models, evaluate the different styles in the work. We then link this automatic evaluation to distinct characteristics related to the content and structure of the various styles. This work is an initial attempt at exploring how methods for the detection of artificially generated text can find application as tools to evaluate the qualities and characteristics of human writing, to support better writing in terms of originality, informativeness, and clarity.
Source: CEUR WORKSHOP PROCEEDINGS, vol. 3878. Pisa, Italy, 4-6/12/2024
Project(s): Future Artificial Intelligence Research
See at:
ceur-ws.org
| CNR IRIS
2023
Other
Open Access
AIMH Research Activities 2023
Aloia N, Amato G, Bartalesi V, Bianchi L, Bolettieri P, Bosio C, Carraglia M, Carrara F, Casarosa V, Ciampi L, Coccomini Da, Concordia C, Corbara S, De Martino C, Di Benedetto M, Esuli A, Falchi F, Fazzari E, Gennaro C, Lagani G, Lenzi E, Meghini C, Messina N, Molinari A, Moreo A, Nardi A, Pedrotti A, Pratelli N, Puccetti G, Rabitti F, Savino P, Sebastiani F, Sperduti G, Thanos C, Trupiano L, Vadicamo L, Vairo C, Versienti L
The AIMH (Artificial Intelligence for Media and Humanities) laboratory is dedicated to exploring and pushing the boundaries in the field of Artificial Intelligence, with a particular focus on its application in digital media and humanities. The lab's objective is to advance the current state of AI technology, particularly in deep learning, text analysis, computer vision, multimedia information retrieval, and multimedia content analysis and recognition. This report summarizes the laboratory's progress and activities throughout the year 2023.
DOI: 10.32079/isti-ar-2023/001
See at:
CNR IRIS
| ISTI Repository