2017
Other
Open Access
Data Flow Quality Monitoring in Data Infrastructures
Mannocci A
In the last decade, researchers, organizations, and funders worldwide have devoted considerable attention to the realization of Data Infrastructures (DIs), namely systems supporting researchers with the broad spectrum of resources they need to perform science. DIs are here intended as ICT (eco)systems offering data and processing components which can be combined into data flows so as to enable arbitrarily complex data manipulation actions serving the consumption needs of DI customers, be they humans or machines. Data resulting from the execution of data flows represent an important asset both for the DI users, typically craving the information they need, and for the organization (or community) operating the DI, whose existence and cost sustainability depend on the adoption and usefulness of the DI. On the other hand, when operating several data flows over time, several issues, well known to practitioners, may arise and compromise the behaviour of the DI, and therefore undermine its reliability and generate stakeholder dissatisfaction. Such issues stem from a plethora of causes, such as (i) the lack of any kind of guarantee (e.g. quality, stability, findability, etc.) from integrated external data sources, typically not under the jurisdiction of the DI; (ii) the occurrence at any abstraction level of subtle, unexpected errors in the data flows; and (iii) the ever-changing nature of the DI, in terms of data flow composition and algorithms/configurations in use. The autonomy of DI components, their use across several data flows, and the evolution of end-user requirements over time make DI data flows a critical environment, subject to the most subtle inconsistencies. Accordingly, DI users demand guarantees, and quality managers are called to provide them, on the "correctness" of the DI data flows' behaviour over time, to be quantified in terms of "data quality" and "processing quality".
Monitoring the quality of data flows is therefore a key activity to ensure the uptake and long-term existence of a DI. Indeed, monitoring can detect or anticipate misbehaviours of a DI's data flows, in order to prevent and correct errors, or at least "formally" justify to the stakeholders the underlying reasons, possibly not due to the DI, for such errors. Moreover, monitoring can be vital for DI operation, as having hardware and software resources actively employed in processing low-quality data can yield inefficient resource allocation and waste of time. However, data flow quality monitoring is further hindered by the "hybrid" nature of such infrastructures, which typically consist of a patchwork of individual components ("system of systems") possibly developed by distinct stakeholders with possibly distinct life-cycles, evolving over time, whose interactions are regulated mainly by shared policies agreed at infrastructural level. Due to such heterogeneity, DIs are generally not equipped with built-in monitoring systems of this kind, and to date DI quality managers are therefore bound to use combinations of existing tools, with non-trivial integration efforts, or to develop and integrate ex post their own ad hoc solutions, at high cost of realization and maintenance. In this thesis, we introduce MoniQ, a general-purpose Data Flow Quality Monitoring system enabling the monitoring of critical data flow components, which are routinely checked during and after every run of the data flow against a set of user-defined quality control rules to make sure the data flow meets the expected behaviour and quality criteria over time, as established upfront by the quality manager.
MoniQ introduces a monitoring description language capable of (i) describing the semantics and the time ordering of the observational intents and capturing the essence of the DI data flows to be monitored; and (ii) describing monitoring intents over the monitoring flows in terms of metrics to be extracted and controls to be ensured. The novelty of the language is that it incorporates the essence of existing data quality monitoring approaches, identifies and captures process monitoring scenarios, and, above all, provides abstractions to represent monitoring scenarios that combine data and process quality monitoring within the scope of a data flow. The study includes an extensive analysis of two real-world use cases used as support and validation of the proposed approach, and discusses an implementation of MoniQ providing quality managers with high-level tools to integrate the solution in a DI in an easy, technology-transparent and cost-efficient way, so that they can start gaining insight into data flows by visualizing the trends of the metrics defined and the outcome of the controls declared against them.
Project(s): OPENAIRE
See at:
etd.adm.unipi.it
| CNR IRIS
| ISTI Repository
2024
Conference article
Open Access
Exploring the Italian research landscape on Digital Library in the Conference IRCDL
Bernasconi E., Mannocci A., Tammaro A. M.
This study explores the structure of knowledge around digital libraries embedded in IRCDL Conference presentations and examines research trends over time. It also analyses the subjects of the published articles, their authors, the authors' affiliations and provenance, and the collaboration network in IRCDL. We applied several bibliometric techniques, including productivity visualisation, authorship network analysis, and subject analysis.
Source: CEUR WORKSHOP PROCEEDINGS, vol. 3643, pp. 230-245. Brixen, 22-23 February 2024.
See at:
ceur-ws.org
| CNR IRIS
2024
Book
Open Access
Preface for Joint Proceedings of Posters, Demos, Workshops, and Tutorials of SEMANTiCS 2024
Garijo D., Gentile A. L., Kurteva A., Mannocci A., Osborne F., Vahdati S.
This volume contains the proceedings of the Poster and Demo Track of the 20th International Conference on Semantic Systems, SEMANTiCS 2024, which took place from September 17-19, 2024, in Amsterdam. It also features the proceedings of the First International Workshop on Scaling Knowledge Graphs for Industry, along with an overview of the NeXt-generation Data Governance Workshop 2024 (NXDG 2024), both of which were co-located with SEMANTiCS 2024. SEMANTiCS is the annual meeting place for professionals who make semantic computing work, understand its benefits, and encounter its limitations. Every year, SEMANTiCS attracts information managers, IT architects, software engineers, and researchers from organizations ranging from research facilities and NPOs through public administrations to the largest and/or most innovative companies in the world. Conference participants learn from top researchers and industry experts about emerging trends and topics in the wide area of semantic computing. The SEMANTiCS community is highly diverse; attendees have responsibilities in interlinking areas such as Artificial Intelligence, knowledge discovery and management, big data analytics, e-commerce, enterprise search, technical documentation, document management, business intelligence, and enterprise vocabulary management.
Source: CEUR WORKSHOP PROCEEDINGS, vol. 3759
See at:
ceur-ws.org
| CNR IRIS
2024
Book
Open Access
Preface to the proceedings of IRCDL 2024 - 20th conference on Information and Research science Connecting to Digital and Library Science
Bernasconi E., Mannocci A., Poggi A., Salatino A., Silvello G.
The IRCDL 2024 conference, marking its 20th edition since its inception in 2005, celebrates two decades of advancements in the field of Digital Libraries (DL). Originating at the University of Padua, the conference has traversed various locations, embodying the evolution of DL over time. The 20th-anniversary edition featured a special panel titled “20 Years of IRCDL,” where Prof. Maristella Agosti explored the history of Information Retrieval in the DL landscape. Prof. Floriana Esposito emphasized the pivotal role of Machine Learning in DL and IRCDL. At the same time, Prof. Domenico Saccà provided insights into the significance of databases in Italy and within DL, focusing on structured, semi-structured, and unstructured data. The conference addressed diverse topics, including applications of DL, machine learning in research data, cultural heritage analysis, data citation and provenance, digital preservation, document analysis, knowledge acquisition, user experience, and more. These topics underscored the multidisciplinary nature of IRCDL and its role in shaping the future of DL.
Source: CEUR WORKSHOP PROCEEDINGS, vol. 3643
See at:
ceur-ws.org
| CNR IRIS
2013
Conference article
Restricted
Data searchery: preliminary analysis of data sources interlinking
Manghi P, Mannocci A
The novel data-centric paradigm of e-Science has shown that interlinking publications and research data objects coming from different realms and data sources (e.g. publication repositories, data repositories) makes dissemination, re-use, and validation of research activities more effective. Scholarly Communication Infrastructures are advocated for bridging such data sources, by offering tools for identification, creation, and navigation of relationships. Since realization and maintenance of such infrastructures is expensive, in this demo we propose a lightweight approach for "preliminary analysis of data source interlinking" to help practitioners evaluate whether and to what extent realizing them can be effective. We present Data Searchery, a configurable tool enabling users to easily plug in data sources from different realms with the purpose of cross-relating their objects, be they publications or research data, by identifying relationships between their metadata descriptions.
DOI: 10.1007/978-3-642-40501-3_60
Project(s): RDA EUROPE
See at:
CNR IRIS
| link.springer.com
2014
Contribution to book
Restricted
Preliminary analysis of data sources interlinking
Mannocci A., Manghi P.
The novel data-centric paradigm of e-Science has shown that interlinking publications and research data objects coming from different realms and data sources (e.g. publication repositories, data repositories) makes dissemination, re-use, and validation of research activities more effective. Scholarly Communication Infrastructures (SCIs) are advocated for bridging such data sources by offering an overlay of services for identification, creation, and navigation of relationships among objects of different nature. Since realization and maintenance of such infrastructures is in general very costly, in this paper we propose a lightweight approach for "preliminary analysis of data source interlinking" to help practitioners evaluate whether and to what extent realizing them can be effective. We present Data Searchery, a configurable tool delivering a service for relating objects across data sources, be they publications or research data, by identifying relationships between their metadata descriptions in real time.
DOI: 10.1007/978-3-319-08425-1_6
DOI: 10.1007/978-3-319-14226-5_6
See at:
biblioproxy.cnr.it
| doi.org
| doi.org
| CNR IRIS
2020
Conference article
Open Access
Open Science Graphs must interoperate!
Aryani A, Fenner M, Manghi P, Mannocci A, Stocker M
Open Science Graphs (OSGs) are Scientific Knowledge Graphs whose intent is to improve the overall FAIRness of science, by enabling open access to graph representations of metadata about people, artefacts, institutions involved in the research lifecycle, as well as the relationships between these entities, in order to support stakeholder needs, such as discovery, reuse, reproducibility, statistics, trends, monitoring, impact, validation, and assessment. The represented information may span across entities such as research artefacts (e.g. publications, data, software, samples, instruments) and items of their content (e.g. statistical hypothesis tests reported in publications), research organisations, researchers, services, projects, and funders. OSGs include relationships between such entities and sometimes formalised (semantic) concepts characterising them, such as machine-readable concept descriptions for advanced discoverability, interoperability, and reuse. OSGs are generally valuable individually, but would greatly benefit from information exchange across their collections, thereby improving their efficacy to serve stakeholder needs. They could, therefore, reuse and exploit the data aggregation and added value that characterise each OSG, decentralising the effort and capitalising on synergies, as no one-size-fits-all solution exists. The RDA IG on Open Science Graphs for FAIR Data is investigating the motivation and challenges underpinning the realisation of an Interoperability Framework for OSGs. This work describes the key motivations for i) the definition of a classification for OSGs to compare their features, identify commonalities and differences, and added value and for ii) the definition of an Interoperability Framework, specifically an information model and APIs that enable a seamless exchange of information across graphs.
Source: COMMUNICATIONS IN COMPUTER AND INFORMATION SCIENCE (PRINT), pp. 195-206. Lyon, France, 25-27/08/2020
DOI: 10.1007/978-3-030-55814-7_16
Project(s): OpenAIRE-Advance
See at:
CNR IRIS
| link.springer.com
| ISTI Repository
2022
Conference article
Open Access
Will open science change authorship for good? Towards a quantitative analysis
Mannocci A, Irrera O, Manghi P
Authorship of scientific articles has profoundly changed from early science until now.
While a paper was once authored by a handful of authors, scientific collaborations are much more prominent on average nowadays.
As authorship (and citation) is essentially the primary reward mechanism according to the traditional research evaluation frameworks, it has turned out to be a rather hot-button topic from which a significant portion of academic disputes stems.
However, the novel Open Science practices could be an opportunity to disrupt such dynamics and diversify the credit of the different scientific contributors involved in the diverse phases of the lifecycle of the same research effort.
In fact, a paper and research data (or software) contextually published could exhibit different authorship to give credit to the various contributors right where it feels most appropriate.
We argue that this can be computationally analysed by taking advantage of the wealth of information in modern Open Science Graphs.
Such a study can pave the way to understand better the dynamics and patterns of authorship in linked literature, research data and software, and how they evolved over the years.
Source: CEUR WORKSHOP PROCEEDINGS. Padua, Italy, 24-25/02/2022
Project(s): OpenAIRE Nexus 
See at:
ceur-ws.org
| CNR IRIS
| ISTI Repository
2022
Conference article
Open Access
BIP! scholar: a service to facilitate fair researcher assessment
Vergoulis T, Chatzopoulos S, Vichos K, Kanellos I, Mannocci A, Manola N, Manghi P
In recent years, assessing the performance of researchers has become a burden due to the extensive volume of the existing research output. As a result, evaluators often end up relying heavily on a selection of performance indicators like the h-index. However, over-reliance on such indicators may result in reinforcing dubious research practices, while overlooking important aspects of a researcher's career, such as their exact role in the production of particular research works or their contribution to other important types of academic or research activities (e.g., production of datasets, peer reviewing). In response, a number of initiatives that attempt to provide guidelines towards fairer research assessment frameworks have been established. In this work, we present BIP! Scholar, a Web-based service that offers researchers the opportunity to set up profiles that summarise their research careers taking into consideration well-established guidelines for fair research assessment, facilitating the work of evaluators who want to be more compliant with the respective practices.
DOI: 10.1145/3529372.3533296
DOI: 10.48550/arxiv.2205.03152
Project(s): OpenAIRE Nexus
See at:
arXiv.org e-Print Archive
| dl.acm.org
| CNR IRIS
| ISTI Repository
| doi.org
| doi.org
2022
Conference article
Restricted
Sci-K 2022 - International Workshop on Scientific Knowledge: Representation, Discovery, and Assessment
Manghi P, Mannocci A, Osborne F, Sacharidis D, Salatino A, Vergoulis T
In this paper we present the 2nd edition of the Scientific Knowledge: Representation, Discovery, and Assessment (Sci-K 2022) workshop. Sci-K aims to explore innovative solutions and ideas for the generation of approaches, data models, and infrastructures (e.g., knowledge graphs) for supporting, directing, monitoring and assessing the scientific knowledge and progress. This edition is also a reflection point as the community is seeking alternative solutions to the now-defunct Microsoft Academic Graph (MAG).
DOI: 10.1145/3487553.3524883
See at:
dl.acm.org
| CNR IRIS
2022
Conference article
Open Access
Open Science and authorship of supplementary material. Evidence from a research community
Mannocci A, Irrera O, Manghi P
While, in early science, most of the papers were authored by a handful of scientists, modern science is characterised by more extensive collaborations, and the average number of authors per article has increased across many disciplines (Baethge, 2008; Cronin, 2001; Fernandes & Monteiro, 2017; Frandsen & Nicolaisen, 2010; Wren et al., 2007). Indeed, in some fields of science (e.g., High Energy Physics), it is not infrequent to encounter hundreds or thousands of authors co-participating in the same piece of research. Such intricate collaboration patterns make it difficult to establish a correct relationship between contributor and scientific contribution and hence get an accurate and fair reward during research evaluation (Brand, Allen, Altman, Hlava, & Scott, 2015; Vasilevsky et al., 2021; Vergoulis et al., 2022). Thus, as widely known, scientific authorship tends to be a rather hot-button topic in academia, as roughly one-fifth of academic disputes among authors stem from this (Dance, 2012). Open Science, however, has the potential to disrupt such traditional mechanisms by injecting into the "academic market" new kinds of "currency" for credit attribution, merit and impact assessment (Mooney & Newton, 2012; Silvello, 2018). To this end, the new practices of supplementary research data (and software) deposition and citation could be perceived as an opportunity to diversify the attribution portfolio and eventually give credit to the different contributors involved in the diverse phases of the lifecycle within the same research endeavour (Bierer, Crosas, & Pierce, 2017; Brand et al., 2015). While, on the one hand, it is known that authors' ordering tells little or nothing about authors' roles and contributions (Kosmulski, 2012), on the other hand, we argue that variations of any kind in author sets of paired publications and supplementary material can be indicative.
Although the actual reason behind such a variation is unclear, the presence of a fracture between the publication and research data realms might suggest once more that current practices for research assessment and reward should be revised and updated to capture such peculiarities as well. In (Mannocci, Irrera, & Manghi, 2022), we argue that modern Open Science Graphs (OSGs) can be used to analyse whether this is the case or not and understand if the opportunity has been seized already. By offering extensive metadata descriptions of literature, research data, software, and their semantic relations, OSGs constitute a fertile ground to analyse this phenomenon computationally and thus detect the emergence of significant patterns. As a preliminary study, in this paper, we conduct a focused analysis on a subset of publications with supplementary material drawn from the European Marine Science (MES) research community. The results are promising and suggest our hypothesis is worth exploring further. Indeed, in 702 cases out of 3,075 (22.83%), there are substantial variations between the authors participating in the publication and the authors participating in the supplementary dataset (or software), thus laying the groundwork for a longitudinal, large-scale analysis of the phenomenon.
DOI: 10.5281/zenodo.6975411
Project(s): OpenAIRE Nexus
See at:
CNR IRIS
| ISTI Repository
| zenodo.org
2023
Journal article
Open Access
A novel curated scholarly graph connecting textual and data publications
Irrera O, Mannocci A, Manghi P, Silvello G
In the last decade, scholarly graphs became fundamental to storing and managing scholarly knowledge in a structured and machine-readable way. Methods and tools for discovery and impact assessment of science rely on such graphs and their quality to serve scientists, policymakers, and publishers. Since research data became very important in scholarly communication, scholarly graphs started including dataset metadata and their relationships to publications. Such graphs are the foundations for Open Science investigations, data-article publishing workflows, discovery, and assessment indicators. However, due to the heterogeneity of practices (FAIRness is indeed in the making), they often lack the complete and reliable metadata necessary to perform accurate data analysis; e.g., dataset metadata is inaccurate, author names are not uniform, and the semantics of the relationships is unknown, ambiguous or incomplete. This work describes an open and curated scholarly graph we built and published as a training and test set for data discovery, data connection, author disambiguation, and link prediction tasks. Overall the graph contains 4,047 publications, 5,488 datasets, 22 software, 21,561 authors; 9,692 edges interconnect publications to datasets and software and are labeled with semantics that outline whether a publication is citing, referencing, documenting, or supplementing another product. To ensure high-quality metadata and semantics, we relied on the information extracted from PDFs of the publications and the datasets and software webpages to curate and enrich node metadata and edge semantics. To the best of our knowledge, this is the first published resource that includes publications and datasets with manually validated and curated metadata.
Source: ACM JOURNAL OF DATA AND INFORMATION QUALITY, vol. 15 (issue 3)
DOI: 10.1145/3597310
Project(s): OpenAIRE Nexus
See at:
dl.acm.org
| CNR IRIS
| ISTI Repository
2023
Conference article
Open Access
Tracing data footprints: formal and informal data citations in the scientific literature
Irrera O, Mannocci A, Manghi P, Silvello G
Data citation has become a prevalent practice within the scientific community, serving the purpose of facilitating data discovery, reproducibility, and credit attribution. Consequently, data has gained significant importance in the scholarly process. Despite its growing prominence, data citation is still at an early stage, with considerable variations in practices observed across scientific domains. Such diversity hampers the ability to consistently analyze, detect, and quantify data citations. We focus on the European Marine Science (MES) community to examine how data is cited in this specific context. We identify four types of data citations: formal, informal, complete, and incomplete. By analyzing the usage of these diverse data citation modalities, we investigate their impact on the widespread adoption of data citation practices.
DOI: 10.1007/978-3-031-43849-3_7
See at:
CNR IRIS
| link.springer.com
| ISTI Repository
| doi.org
not yet published
Conference article
Open Access
Exploring scientometrics with the OpenAIRE Graph: introducing the OpenAIRE Beginner's Kit
Mannocci A., Baglioni M.
The OpenAIRE Graph is an extensive resource housing diverse information on research products, including literature, datasets, and software, alongside research projects and other scholarly outputs and context. It stands as a cornerstone among contemporary research information databases, offering invaluable insights for scientometric investigations. Despite its wealth of data, its sheer size may initially appear daunting, potentially hindering its widespread adoption. To address this challenge, this paper introduces the OpenAIRE Beginner's Kit, a user-friendly solution providing access to a subset of the OpenAIRE Graph within a sandboxed environment coupled with a Jupyter notebook for analysis. The OpenAIRE Beginner's Kit is meticulously designed to democratise research and data exploration, offering accessibility from standard desktop and laptop setups. Within this paper, we provide a brief overview of the included dataset and offer guidance on leveraging the kit through a selection of illustrative queries tailored to address common scientometric inquiries.
See at:
arxiv.org
| CNR IRIS
2020
Conference article
Open Access
Context-Driven Discoverability of Research Data
Baglioni M, Manghi P, Mannocci A
Research data sharing has proved to be key for accelerating scientific progress and fostering interdisciplinary research; hence, the ability to search, discover and reuse data items is nowadays vital in doing science. However, research data discovery is still an open challenge. In many cases, descriptive metadata exhibit poor quality, and the ability to automatically enrich metadata with semantic information is limited by the data file format, which is typically not textual and hard to mine. More generally, however, researchers would like to find data used across different research experiments or even disciplines. Such needs are not met by traditional metadata description schemata, which are designed to freeze research data features at deposition time. In this paper, we propose a methodology that enables "context-driven discovery" for research data thanks to their proven usage across research activities that might differ from the original one, potentially across diverse disciplines. The methodology exploits the collection of publication-dataset and dataset-dataset links provided by the OpenAIRE Scholexplorer data citation index so as to propagate article metadata into related research datasets by leveraging semantic relatedness. Such a "context propagation" process enables the construction of "context-enriched" metadata of datasets, which enables "context-driven" discoverability of research data. To this end, we provide a real-case evaluation of this technique applied to Scholexplorer. Due to the broad coverage of Scholexplorer, the evaluation documents the effectiveness of this technique at improving data discovery on a variety of research data repositories and databases.
DOI: 10.1007/978-3-030-54956-5_15
Project(s): OpenAIRE-Advance
See at:
ZENODO
| zenodo.org
| Lecture Notes in Computer Science
| CNR IRIS
| link.springer.com
2014
Conference article
Restricted
The Europeana network of Ancient Greek and Latin Epigraphy data infrastructure
Mannocci A, Casarosa V, Manghi P, Zoppi F
Epigraphic archives, containing collections of editions about ancient Greek and Latin inscriptions, have been created in several European countries during the last couple of centuries. Today, the project EAGLE (Europeana network of Ancient Greek and Latin Epigraphy, a Best Practice Network partially funded by the European Commission) aims at providing a single access point for the content of about 15 epigraphic archives, totaling about 1.5M digital objects. This paper illustrates some of the challenges encountered, and their solutions, in the realization of the EAGLE data infrastructure. The challenges mainly concern the harmonization, interoperability and service integration issues caused by the aggregation of metadata from heterogeneous archives (different data models and metadata schemas, and exchange formats). EAGLE has defined a common data model for epigraphic information, into which data models from different archives can be optimally mapped. The data infrastructure is based on the D-NET software toolkit, capable of dealing with data collection, mapping, cleaning, indexing, and access provisioning through web portals or standard access protocols.
Source: COMMUNICATIONS IN COMPUTER AND INFORMATION SCIENCE (PRINT), pp. 286-300. Karlsruhe, Germany, 27-29 November 2014
DOI: 10.1007/978-3-319-13674-5_27
See at:
doi.org
| CNR IRIS
| link.springer.com
| www.scopus.com