2016 Doctoral thesis Open Access
gDup: an integrated and scalable graph deduplication system.
Atzori C.
In this thesis we start from the experiences and solutions for duplicate identification in Big Data collections and address the broader and more complex problem of 'Entity Deduplication over Big Graphs'. By 'Graph' we mean any digital representation of an Entity Relationship model, hence entity types (structured properties) and relationships between them. By 'Big' we mean that duplicate identification over the objects of such entity types cannot be handled with traditional backends and solutions, e.g. collections ranging from tens of millions of objects to any higher number. By 'entity deduplication' we mean the combined process of duplicate identification and graph disambiguation. Duplicate identification has the aim of efficiently identifying pairs of equivalent objects for the same entity type, while graph disambiguation has the goal of removing the duplication anomaly from the graph. A large number of Big Graphs are today being maintained, e.g. collections populated over time with no duplicate controls or aggregations of multiple collections, which need continuous or extemporaneous entity deduplication cleaning. Examples are person deduplication in census records, deduplication of authors in library bibliographical collections (e.g. Google Scholar graph, Thomson Reuters citation graph, OpenAIRE graph), deduplication of catalogues from multiple stores, deduplication of Linked Open Data clouds resulting from integration of multiple clouds, any subset of the Web, etc. As things stand today, data curators can find a plethora of tools supporting duplicate identification for Big collections of objects, which they can adopt to efficiently process the objects of individual entity type collections. However, the extension of such tools to the Big Data scenario is absent, as well as the support for graph disambiguation.
In order to implement a full entity deduplication workflow for Big Graphs, data curators end up building patchwork systems, tailored to their graph data model, often bound to their physical representation of the graph (i.e. graph storage), expensive in terms of design, development, and maintenance, and in general not reusable by other practitioners with similar problems in different domains. The first contribution of this thesis is a reference architecture for 'Big Graph Entity Deduplication Systems' (BGEDSs), which are integrated, scalable, general purpose systems for entity deduplication over Big Graphs. BGEDSs are intended to support data curators with the out-of-the-box functionalities they need to implement all phases of duplicate identification and graph disambiguation. The architecture formally defines the challenge, by providing a graph type language and a graph object language, defining the specifics of the entity deduplication phases, and explaining how such phases manipulate the initial graph to eventually return the final disambiguated graph. Most importantly, it defines the level of configuration, i.e. customization, that data curators should be able to exploit when relying on BGEDSs to implement entity deduplication. The second contribution of this thesis is GDup, an implementation of a BGEDS whose instantiation is today used in the real production environment of the OpenAIRE infrastructure, the European e-infrastructure for Open Science and Access. GDup can be used to operate over Big Graphs represented using standards such as RDF-graphs or JSON-LD graphs and conforming to any graph schema. The system supports highly configurable duplicate identification and graph disambiguation settings, allowing data curators to tailor object matching functions by entity type properties and define the strategy of duplicate objects merging that will disambiguate the graph. GDup also provides functionalities to semi-automatically manage a Ground Truth, i.e. a set of trustworthy assertions of equality between objects, that can be used to preprocess objects of the same entity type and reduce computation time. The system is conceived to be extensible with other, possibly new methods in the deduplication domain (e.g. clustering functions, similarity functions) and supports scalability and performance over Big Graphs by exploiting an HBase - Hadoop MapReduce stack.
Project(s): OpenAIRE2020 via OpenAIRE

See at: etd.adm.unipi.it Open Access | ISTI Repository Open Access | CNR ExploRA


2016 Other Unknown
OpenAIRE API documentation
Atzori C., Bardi A., Iatropoulou K.
Documentation of the API for consuming the OpenAIRE information space.
Project(s): OpenAIRE2020 via OpenAIRE, OpenAIRE-Advance via OpenAIRE

See at: api.openaire.eu | CNR ExploRA


2021 Contribution to conference Open Access
OpenOrgs: bridging registries of research organizations. Supporting disambiguation and improving the quality of data
Pavone G., Atzori C.
This presentation was given at the OpenAIRE Tech Clinic webinar on 21 June 2021, focusing on the OpenOrgs tool. Unambiguously identifying organizations involved in the research work may not be a trivial task. Their names can be derived from various data sources, each of which often contains a different version of the organization's name (full legal name, short or alternative names, acronym, and so on) and different metadata fields. In OpenOrgs, data curators can enrich the metadata description of organizations and resolve the ambiguity of duplicates detected with an automated process by stating whether or not two or more entities correspond to the same organization. With these tasks, OpenOrgs users can compensate for the lack of information available and improve the organizations' discoverability.
Source: OpenAIRE Tech Clinic webinar, Online event, 21 June 2021
DOI: 10.5281/zenodo.5101096
Project(s): OpenAIRE Nexus via OpenAIRE

See at: ISTI Repository Open Access | zenodo.org Open Access | CNR ExploRA


2012 Report Open Access
OPENAIREPLUS - Specification of the authority file service (de-duplication service)
Manghi P., Mikulicic M., Atzori C.
The goal of this deliverable is to (i) describe the motivations behind the realization of an authority file management service in OpenAIREplus, (ii) make explicit the requirements of such a service, and (iii) define the specification of the service. Finally, it proposes a high-level description of the technical solution devised in the project and currently driving the implementation of the service.
Source: Project report, OpenAIRE, Deliverable D6.4, 2012
Project(s): OPENAIREPLUS via OpenAIRE

See at: ISTI Repository Open Access | CNR ExploRA


2012 Report Open Access
OPENAIREPLUS - Specification of adaptation of content management services
Manghi P., Mikulicic M., Atzori C., Artini M.
The deliverable describes how the services of the OpenAIRE infrastructure dedicated to collecting, storing, curating and indexing content have been adapted to manage data conforming to the data model (D6.1) of the OpenAIREplus infrastructure.
Source: Project report, OpenAIREplus, Deliverable D6.2, 2012
Project(s): OPENAIREPLUS via OpenAIRE

See at: ISTI Repository Open Access | CNR ExploRA


2012 Report Open Access
OpenAIREplus Data Model Specification
Manghi P., Mikulicic M., Atzori C.
The OpenAIREplus web site will offer functionalities for administrators, anonymous and registered users to manage an Information Space of publications, together with their connections with funding projects (from the EC and national agencies) and research datasets. The aim of this document is to describe the conceived structure and semantics of this Information Space, i.e., the OpenAIREplus data model, by providing an abstract definition of its main entities and the relationships between them. In this definitional process, the interaction with the EuroCRIS initiatives, several scientific institutes (i.e., KNAW-DANS, EBI-EMBL, BADC) as well as inspiration from DataCite and LinkedData play an important role in the specification of project data (i.e., how project data should be described, stored and exported in OpenAIRE), of dataset metadata (i.e., how datasets should be described), and of how such interconnected entities can be made available and consumable by third-party systems. The data model will be subject to changes in the future and therefore result in further versions. Such changes will be described in the following Section, in order to summarize for the reader the differences from the previous versions.
Source: Project report, OpenAIREplus, Deliverable D6.1, 2012
Project(s): OPENAIREPLUS via OpenAIRE

See at: ISTI Repository Open Access | CNR ExploRA


2012 Journal article Open Access
De-duplication of aggregation authority files
Manghi P., Mikulicic M., Atzori C.
This paper presents PACE (Programmable Authority Control Engine), an authority control tool conceived to maintain 'aggregation authority files'. These are obtained as continuous aggregations of records originating from a variable set of information systems with heterogeneous and duplicated content. To facilitate record deduplication in the presence of such heterogeneity and dynamicity, PACE user interfaces enable an iterative curation process, where data curators can: (i) configure algorithms for the identification of record duplicates; (ii) open work sessions where algorithm configurations can be run and evaluated; (iii) merge the identified record duplicates to disambiguate the authority file and (iv) repeat this cycle several times. PACE supports a tunable probabilistic similarity measure and performs record matching with a customisable variation of the sorted neighbourhood heuristic. Finally, it addresses the underlying performance and scalability issues by exploiting multi-core parallel processing and Cassandra's storage systems, to support I/O performances that scale up linearly with the number of records.
Source: International journal of metadata, semantics and ontologies (Online) 7 (2012): 114–130. doi:10.1504/IJMSO.2012.050014
DOI: 10.1504/ijmso.2012.050014
Project(s): OPENAIRE via OpenAIRE

See at: ISTI Repository Open Access | International Journal of Metadata Semantics and Ontologies Restricted | www.inderscience.com Restricted | CNR ExploRA


2014 Conference article Restricted
Keeping your aggregative infrastructure under control
Artini M., Atzori C., Manghi P.
'Aggregative Data Infrastructures' (ADIs) are systems devised to collect metadata descriptions (and files) from several data sources to construct uniform Information Spaces, hence providing cross-data source access via standard APIs or custom portals. ADIs typically deal with data collection workflows from arbitrary numbers of data sources, with heterogeneous access protocols, data exchange formats, and data models. Besides, they handle data processing workflows for the harmonization and enrichment of aggregated metadata. Correct workflow management is crucial to ensure Information Space consistency, but is in general hard to sustain. This demo will present the solution offered in the context of the OpenAIRE infrastructure, which today collects metadata and files from 450+ data sources (and growing) of several typologies. The D-NET Workflow Management Suite user interfaces support data curators in orchestrating, over time and in a sustainable way, the configuration, execution, and monitoring of data collection and processing workflows for thousands of data sources.
Source: JCDL - IEEE/ACM Joint Conference on Digital Libraries, pp. 409–410, London, UK, 8-12 September 2014
DOI: 10.1109/jcdl.2014.6970199
Project(s): OPENAIREPLUS via OpenAIRE

See at: doi.org Restricted | www.scopus.com Restricted | CNR ExploRA


2015 Report Open Access
OpenAIRE2020 - OpenAIRE specification and release plan
Manghi P., Bardi A., Atzori C., Iatropoulou K., Shirrwagen J., Summann F., Jahn N., Kobos M., Nielsen L., Lange C.
This deliverable presents the plan of design and development of the OpenAIRE infrastructure services for the next 12 months. At months 12 and 24 an update of this deliverable will be produced. Its intended use is mainly for technical partners to locate their software release duties in the wider context of the project software release, but also for the generic reader to get an overall picture of, and insight into, the technical activities.
Source: Project report, OpenAIRE2020, Deliverable D6.1, 2015
Project(s): OpenAIRE2020 via OpenAIRE

See at: issue.openaire.research-infrastructures.eu Open Access | ISTI Repository Open Access | CNR ExploRA


2015 Report Open Access
OpenAIRE2020 - OpenAIRE Data Model - D8.1
Manghi P., Bardi A., Atzori C.
The aim of this deliverable is to describe the structure and semantics of the OpenAIRE Information Space, i.e., the OpenAIRE data model, by providing an abstract definition of its main entities and the relationships between them. Requirements have been collected over the years from all "consumers" of the OpenAIRE infrastructure, including data sources (providing content to OpenAIRE), portal end-users of various roles (researchers, project coordinators, general public, research communities), OpenAIRE data curators (responsible for the workflows for collecting, harmonizing, de-duplicating, inferring content), and third-party services (accessing content via APIs). The data model will be subject to changes in the future, depending on the evolution of the requirements of the OpenAIRE infrastructure, and this document will be updated accordingly.
Source: Deliverable D8.1, 2015
Project(s): OpenAIRE2020 via OpenAIRE

See at: issue.openaire.research-infrastructures.eu Open Access | ISTI Repository Open Access | CNR ExploRA


2015 Report Open Access
OpenAIRE2020 - OpenAIRE Literature Broker Service - D9.4
Manghi P., Bardi A., Atzori C., Artini M.
This deliverable describes the OpenAIRE Literature Broker Service. The Service is designed to offer subscription and notification functionalities for institutional repositories to: (i) learn about publication objects in OpenAIRE that do not appear in their collection but may be pertinent to it, and (ii) learn about extra properties or relationships relative to publication objects in their collection.
Source: Project report, OpenAIRE2020, Deliverable D9.4, 2015
Project(s): OpenAIRE2020 via OpenAIRE

See at: ISTI Repository Open Access | CNR ExploRA


2016 Report Open Access
A subscription and notification broker for scholarly communication: functionalities and architecture
Artini M., Atzori C., La Bruzzo S.
The OpenAIRE infrastructure services populate and provide access to a graph of objects relative to publications, datasets, people, organizations, projects, and funders aggregated from a variety of data sources. Moreover, objects in the graph are harmonized to achieve semantic homogeneity, de-duplicated and merged, and enriched by inference with missing properties and/or relationships. The OpenAIRE Literature Broker Service (OLBS) is designed to offer subscription and notification functionalities for institutional repositories to: (i) learn about publication objects in OpenAIRE that do not appear in their collection but may be pertinent to it, and (ii) learn about extra properties or relationships relative to publication objects in their collection. Due to the high variability of the information space the following problems may arise: (i) subscriptions may vary over time to adapt to information space evolution, (ii) repository managers need to be able to quickly test their configurations before activating them, (iii) notifications may be redundant, and (iv) notifications may be very large over time. This paper presents the data model and software architecture of the OLBS, specifically designed to address these issues.
Source: ISTI Technical reports, 2016
Project(s): OpenAIRE2020 via OpenAIRE

See at: ISTI Repository Open Access | CNR ExploRA


2014 Report Open Access
The OpenAIRE action manager framework
Artini M., Atzori C., La Bruzzo S.
The OpenAIRE infrastructure offers services for collecting records (publications, datasets, persons, organizations, data sources, projects) from external data sources with the purpose of identifying relationships between them. The collected objects and their relationships are stored in HBase according to the OpenAIRE data model. The Action Manager framework has been designed to offer an API, oriented to the OpenAIRE data model, for enriching and fixing the OpenAIRE HBase information space.
Source: ISTI Technical reports, 2014
Project(s): OPENAIRE via OpenAIRE

See at: ISTI Repository Open Access | CNR ExploRA


2014 Report Open Access
OpenAIRE APIs for data access to third party services
Artini M., Atzori C., Dell'Amico A., Labruzzo S.
The OpenAIRE infrastructure services populate and provide access to a graph of objects relative to publications, datasets, people, organizations, projects, and funders aggregated from a variety of data sources. Moreover, objects in the graph are harmonized to achieve semantic homogeneity, de-duplicated and merged, and enriched by inference with missing properties and/or relationships. The aim of this technical report is to describe to third-party service managers (developers who need to access data) how the OpenAIRE information space can be accessed and according to which combination of protocol and format. The document is organized according to a data-centric view, where managers should first identify the typology of data they would like to access, and then verify which protocols and formats are available.
Source: ISTI Technical reports, 2014
Project(s): OPENAIREPLUS via OpenAIRE

See at: ISTI Repository Open Access | CNR ExploRA


2017 Contribution to book Open Access
The OpenAIRE workflows for data management
Atzori C., Bardi A., Manghi P., Mannocci A.
The OpenAIRE initiative is the point of reference for Open Access in Europe and aims at the creation of an e-Infrastructure for the free flow, access, sharing, and re-use of research outcomes, services and processes for the advancement of research and the dissemination of scientific knowledge. OpenAIRE makes openly accessible a rich Information Space Graph (ISG) where products of the research life-cycle (e.g. publications, datasets, projects) are semantically linked to each other. Such an information space graph is constructed by a set of autonomic (orchestrated) workflows operating in a regime of continuous data integration. This paper discusses the principal workflows operated by the OpenAIRE technical infrastructure in its different functional areas and gives the reader a sense of the extent of the challenges faced and the solutions realized.
Source: Digital Libraries and Archives, edited by Costantino Grana, Lorenzo Baraldi, pp. 95–107, 2017
DOI: 10.1007/978-3-319-68130-6_8
DOI: 10.5281/zenodo.996006
DOI: 10.5281/zenodo.996005
Project(s): OpenAIRE2020 via OpenAIRE

See at: ZENODO Open Access | ZENODO Open Access | zenodo.org Open Access | doi.org Restricted | link.springer.com Restricted | CNR ExploRA


2017 Report Restricted
OpenAIRE - OpenAIRE back-end and Invenio upgrade: specification and release plan
Manghi P., Atzori C., Bardi A., Baglioni M., Nielsen L. H.
The aim of this document is to explain in detail how the software release plan for upgrading OpenAIRE back-ends and Invenio according to the data model described in D4.1 OpenAIRE Data model extension will be accomplished by the technical partners. For this, it will illustrate the plan of design, development, testing, and integration into beta and production of the infrastructure services to be delivered by T4.1 (OpenAIRE extension to research methods and artifact packages) and T4.2 (OpenAIRE's Zenodo for research methods and artifact packages). The plan's technical activities will be supervised and led by CNR and carried out across the technical partners CNR and CERN, in synergy with the partners Jisc, UMinho, UniHB, PIN, CNRS, IRD, ICRE8. The deliverable is on-going and will be updated at M15 (before first BETA release of the service), M23 (before second BETA release of the service), and M27 (before production release of the service). The first release of this document reports on extensions to be provided by M8. To ease the update of the plan and support the collaborative approach, the deliverable is published as a wiki and is available at https://support.d4science.org/projects/openaire-connect-wiki/wiki/D4_2.
Source: Project report, OpenAIRE, Deliverable D4.2, 2017
Project(s): OpenAIRE-Connect via OpenAIRE

See at: support.d4science.org Restricted | CNR ExploRA


2017 Report Restricted
OpenAIRE - OpenAIRE publishing APIs: specification and release plan
Manghi P., Bardi A., Baglioni M., Atzori C.
The aim of this deliverable is to provide the specification of the software and the release plan for the OpenAIRE publishing APIs that support third-party services at publishing metadata about interlinked and packaged research products into the OpenAIRE Information Graph. The OpenAIRE publishing APIs support the concept of "continuous publishing" in digital research settings where researchers conduct their activities in digital laboratories using ICT tools and services for processing and analysing research data. By using the OpenAIRE publishing APIs, a service/tool can automatically publish metadata on behalf of the researchers. The service/tool and its underlying infrastructure is responsible for keeping persistent identifiers, preserving the payload of the objects and the metadata. The service pushes metadata into OpenAIRE, the effect being: (i) the metadata record is immediately visible via the OpenAIRE search portal and APIs; (ii) the metadata record will be cleaned and de-duplicated in a second stage according to the OpenAIRE content provision workflow described at https://www.openaire.eu/aggregation-and-contentprovision-workflows and http://doi.org/10.5281/zenodo.996006. Researchers benefit from a service that uses the OpenAIRE publishing APIs in several ways: (i) the service will support the generation of metadata to improve the FAIRness of the relative research products; (ii) researchers are relieved of the burden of depositing the products they want to publish in a repository external to their digital laboratory; (iii) researchers can choose to publish research products at any step of their research activity. The full specification of the APIs is published as a wiki and available at https://support.d4science.org/projects/openaire-connect-wiki/wiki/D4_5.
Source: Project report, OpenAIRE, Deliverable D4.5, 2017
Project(s): OpenAIRE-Connect via OpenAIRE

See at: support.d4science.org Restricted | CNR ExploRA


2017 Report Restricted
OpenAIRE - Catch-All Notification Broker Back-end: specification and release plan
Atzori C., Baglioni M., Bardi A., Manghi P.
The aim of this deliverable is to present the functional requirements, a specification of the software, and a release plan for the deployment of the OpenAIRE-Connect Catch-All Notification Broker Service (CAB Service). The CAB Service will connect all types of research artefact providers (institutional repositories, publishers, data repositories, and CRIS systems) and allow them to subscribe and be notified by OpenAIRE of events interesting to them. These notifications will comprise: 1) the existence of artefacts of interest to the providers (which may pertain to their collection); 2) the existence of links from artefacts in their collection to other artefacts. The CAB Service will extend OpenAIRE's notification brokering services, which serve literature repositories, and will broaden the content provider base with the ones that serve specific research communities. Content provider managers will be allowed to register as consumers of the service, set and test the service (preview the results of the service over some subscriptions), to commit their subscriptions, and finally to manage their history of notifications over time. The broker service will be tested in two beta releases and changed and/or updated following the requirements obtained from the betas. The deliverable is published as a wiki and is available at https://support.d4science.org/projects/openaire-connect-wiki/wiki/D5_1.
Source: Project report, OpenAIRE, Deliverable D5.1, 2017
Project(s): OpenAIRE-Connect via OpenAIRE

See at: support.d4science.org Restricted | CNR ExploRA


2018 Conference article Open Access
GDup: De-duplication of Scholarly Communication Big Graphs
Atzori C., Manghi P., Bardi A.
Today, several online services offer functionalities to access information from big scholarly communication graphs, which interlink entities such as publications, authors, datasets, organizations, etc. Such graphs are often populated over time as aggregations of multiple sources and therefore suffer from entity duplication problems. Although deduplication of graphs is a known and current problem, solutions tend to be dedicated and address only a few of the underlying challenges. In this paper, we propose the GDup system, an integrated, scalable, general-purpose system for entity deduplication over big information graphs. GDup supports practitioners with the functionalities needed to realize a fully-fledged entity deduplication workflow over a generic input graph, inclusive of Ground Truth support, end-user feedback, and strategies for identifying and merging duplicates to obtain an output disambiguated graph. GDup is today one of the core components of the OpenAIRE infrastructure production system, monitoring Open Science trends on behalf of the European Commission.
Source: 2018 IEEE/ACM 5th International Conference on Big Data Computing Applications and Technologies (BDCAT), pp. 142–151, Zurich, 17-20 December 2018
DOI: 10.1109/bdcat.2018.00025
Project(s): OpenAIRE2020 via OpenAIRE, OpenAIRE-Advance via OpenAIRE

See at: ISTI Repository Open Access | ISTI Repository Open Access | ZENODO Open Access | zenodo.org Open Access | doi.org Restricted | ieeexplore.ieee.org Restricted | CNR ExploRA


2018 Conference article Open Access
De-duplicating the OpenAIRE scholarly communication big graph
Atzori C., Manghi P., Bardi A.
The OpenAIRE infrastructure populates a scholarly communication big graph interlinking metadata objects of publications, datasets, software, organizations, funders, and projects. In order to de-duplicate this graph, OpenAIRE has developed GDup, an integrated, scalable, general-purpose system for entity deduplication over big information graphs. GDup offers functionalities to realize a fully-fledged entity deduplication workflow over a generic input graph, inclusive of Ground Truth support, end-user feedback, and strategies for identifying and merging duplicates to obtain an output disambiguated graph.
Source: e-science 2018 - 14th IEEE International Conference on e-Science (e-Science), pp. 372–373, Amsterdam, the Netherlands, 29 October - 01 November 2018
DOI: 10.1109/escience.2018.00104
DOI: 10.5281/zenodo.1489139
DOI: 10.5281/zenodo.1489140
Project(s): OpenAIRE2020 via OpenAIRE, OpenAIRE-Advance via OpenAIRE

See at: ZENODO Open Access | ZENODO Open Access | ISTI Repository Open Access | zenodo.org Open Access | doi.org Restricted | ieeexplore.ieee.org Restricted | CNR ExploRA