2017 Doctoral thesis Open Access

Improving the Efficiency and Effectiveness of Document Understanding in Web Search
Trani S.
Web Search Engines (WSEs) are probably the most complex information systems in use today, since they need to handle an ever-increasing number of web pages and match them with the information needs expressed in short and often ambiguous queries by a multitude of heterogeneous users. In addressing this challenging task they have to deal, at an unprecedented scale, with two classic and contrasting IR problems: the satisfaction of effectiveness requirements and of efficiency constraints. While the former refers to the user-perceived quality of query results, the latter regards the time spent by the system in retrieving and presenting them to the user. Due to the importance of text data on the Web, natural language understanding techniques have gained popularity in recent years and are profitably exploited by WSEs to overcome ambiguities in natural language queries caused, for example, by polysemy and synonymy. A promising approach in this direction is the so-called Web of Data, a paradigm shift which originates from the Semantic Web and promotes the enrichment of Web documents with the semantic concepts they refer to. Enriching unstructured text with an entity-based representation of documents, where entities can precisely identify persons, companies, locations, etc., allows a remarkable improvement in retrieval effectiveness. In this thesis, we argue that it is possible to improve both the efficiency and the effectiveness of document understanding in Web search by exploiting learning-to-rank, i.e., a supervised technique aimed at learning effective ranking functions from training data. On the one hand, enriching documents with machine-learnt semantic annotations leads to an improvement of WSE effectiveness, since the retrieval of relevant documents can exploit a finer comprehension of the documents. On the other hand, by enhancing the efficiency of learning-to-rank techniques we can improve both WSE efficiency and effectiveness, since a faster ranking technique can reduce query processing time or, alternatively, allow a more complex and accurate ranking model to be deployed. The contributions of this thesis are manifold: i) we discuss a novel machine-learnt measure for estimating the relatedness among entities mentioned in a document, thus enhancing the accuracy of text disambiguation techniques for document understanding; ii) we propose a novel machine-learnt technique to label the mentioned entities according to a notion of saliency, where the most salient entities are those that have the highest utility in understanding the topics discussed; iii) we enhance state-of-the-art ensemble-based ranking models by means of a general learning-to-rank framework that is able to iteratively prune the less useful part of the ensemble and re-weight the remaining part according to the loss function adopted. Finally, we share with the research community working in this area several open source tools to promote collaborative development and favor the reproducibility of research results.

See at: etd.adm.unipi.it Open Access | ISTI Repository | CNR ExploRA

2012 Journal article Open Access

Cite-as-you-write
Jack K., Sambati M., Silvestri F., Trani S., Venturini R.
Engines and dedicated social networks are generally used to search for relevant literature. Current technologies rely on keyword-based searches which, however, do not provide the support of a wider context. Cite-as-you-write aims to simplify and shorten this exploratory task: given a verbose description of the problem to be investigated, the system automatically recommends related papers/citations.
Source: ERCIM news 90 (2012).
Project(s): ADVANCE via OpenAIRE

See at: ercim-news.ercim.eu Open Access | CNR ExploRA

2013 Conference article Open Access

Learning relatedness measures for entity linking
Ceccarelli D., Lucchese C., Orlando S., Perego R., Trani S.
Entity Linking is the task of detecting, in text documents, relevant mentions of entities of a given knowledge base. To this end, entity-linking algorithms use several signals and features extracted from the input text or from the knowledge base. The most important of such features is entity relatedness. Indeed, we argue that these algorithms benefit from maximizing the relatedness among the relevant entities selected for annotation, since this minimizes disambiguation errors. The definition of an effective relatedness function is thus a crucial point in any entity-linking algorithm. In this paper we address the problem of learning high-quality entity relatedness functions. First, we formalize the problem of learning entity relatedness as a learning-to-rank problem. Second, we propose a methodology to create reference datasets on the basis of manually annotated data. Finally, we show that our machine-learned entity relatedness function performs better than other relatedness functions previously proposed and, more importantly, improves the overall performance of different state-of-the-art entity-linking algorithms.
Source: CIKM'2013 - 22nd ACM International Conference on Information & Knowledge Management, pp. 139–148, San Francisco, USA, 27 October - 1 November 2013
DOI: 10.1145/2505515.2505711

See at: www.dsi.unive.it Open Access | dl.acm.org Restricted | doi.org | CNR ExploRA

2013 Journal article Unknown

A cloud-based platform for sharing geodata across Europe
Trani S., Lucchese C., Perego R., Atzemoglou M., Baurens B., Kotzinos D.
The Inspired Geodata Cloud Services (InGeoCloudS) project, coordinated by AKKA Informatique et Systèmes (France), was launched in February 2012 with the aim of establishing the feasibility of using a Cloud-based approach for the publication and use of geodata across Europe. The initiative seeks to leverage the economies of scale achievable for a multi-consumer consortium and its ubiquitous availability of access for the geographically distributed end-users of the European institutions in the environmental field. The purpose is to demonstrate that a Cloud infrastructure can be used by public organizations to provide more efficient, scalable and flexible services for creating, sharing and disseminating spatial environmental data. InGeoCloudS is exploiting this concept based on the work of eight partner institutions from five different countries (including ERCIM members CNR, Italy and FORTH, Greece); some partners are IT enterprises and some are public data providers, covering hydrogeology and natural hazards applications. The project roadmap entails two main steps: Pilot1, which is currently available to project partners, and Pilot2, which will open up the services to a broader audience in summer 2013. The whole set of services will be available for free for the duration of the project.
Source: ERCIM news 94 (2013): 33–34.

See at: CNR ExploRA

2016 Conference article Restricted

Post-learning optimization of tree ensembles for efficient ranking
Lucchese C., Perego R., Nardini F. M., Silvestri F., Orlando S., Trani S.
Learning to Rank (LtR) is the machine learning method of choice for producing high-quality document ranking functions from a ground truth of training examples. In practice, efficiency and effectiveness are intertwined concepts, and trading off effectiveness to meet the efficiency constraints that typically exist in large-scale systems is one of the most urgent issues. In this paper we propose a new framework, named CLEaVER, for optimizing machine-learned ranking models based on ensembles of regression trees. The goal is to improve efficiency at document scoring time without affecting quality. Since the cost of an ensemble is linear in its size, CLEaVER first removes a subset of the trees in the ensemble, and then fine-tunes the weights of the remaining trees according to any given quality measure. Experiments conducted on two publicly available LtR datasets show that CLEaVER is able to prune up to 80% of the trees and provides a speed-up of up to 2.6x without affecting the effectiveness of the model.
Source: 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 949–952, Pisa, Italy, 17-21 July 2016
DOI: 10.1145/2911451.2914763
Project(s): SoBigData via OpenAIRE

See at: dl.acm.org Restricted | doi.org | CNR ExploRA
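
Illustrative sketch (not taken from the paper): the abstract above describes a two-step scheme, first removing a subset of trees and then fine-tuning the weights of the remaining ones. The Python below mimics that shape under loud assumptions: trees are kept by largest absolute weight, the re-weighting is a simple greedy coordinate search, and mean squared error stands in for "any given quality measure". All function names are hypothetical and this is not the CLEaVER implementation.

```python
def ensemble_scores(tree_outputs, weights):
    """Score each document as the weighted sum of its per-tree outputs."""
    return [sum(w * o for w, o in zip(weights, row)) for row in tree_outputs]


def mse(scores, labels):
    """Mean squared error; an assumed stand-in for a ranking quality measure."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(labels)


def prune_and_reweight(tree_outputs, weights, labels, keep,
                       steps=(0.5, 0.1, 0.02), rounds=20):
    """Keep `keep` trees, then adjust their weights to reduce the loss.

    tree_outputs[d][t] is the output of tree t on document d.
    """
    # Step 1 (pruning, assumed strategy): keep the trees with the largest |weight|.
    by_weight = sorted(range(len(weights)), key=lambda t: -abs(weights[t]))
    kept = sorted(by_weight[:keep])
    outputs = [[row[t] for t in kept] for row in tree_outputs]
    new_w = [weights[t] for t in kept]

    # Step 2 (re-weighting): greedy coordinate search; a change is accepted
    # only when it strictly lowers the loss, so the loss never increases.
    best = mse(ensemble_scores(outputs, new_w), labels)
    for _ in range(rounds):
        for i in range(len(new_w)):
            for step in steps:
                for delta in (step, -step):
                    trial = new_w[:]
                    trial[i] += delta
                    loss = mse(ensemble_scores(outputs, trial), labels)
                    if loss < best:
                        new_w, best = trial, loss
    return kept, new_w, best
```

Because scoring cost is linear in the number of trees, keeping `keep` out of `n` trees reduces the per-document work to roughly `keep / n` of the original in this sketch.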

2016 Conference article Restricted

SEL: A unified algorithm for entity linking and saliency detection
Trani S., Ceccarelli D., Lucchese C., Orlando S., Perego R.
The Entity Linking task consists in automatically identifying and linking the entities mentioned in a text to their URIs in a given Knowledge Base, e.g., Wikipedia. Entity Linking has a large impact in several text analysis and information retrieval related tasks. This task is very challenging due to natural language ambiguity. However, not all the entities mentioned in a document have the same relevance and utility in understanding the topics being discussed. Thus, the related problem of identifying the most relevant entities present in a document, also known as Salient Entities, is attracting increasing interest. In this paper we propose SEL, a novel supervised two-step algorithm comprehensively addressing both entity linking and saliency detection. The first step is based on a classifier aimed at identifying a set of candidate entities that are likely to be mentioned in the document, thus maximizing the precision of the method without hindering its recall. The second step is still based on machine learning, and aims at choosing from the previous set the entities that actually occur in the document. Indeed, we tested two different versions of the second step, one aimed at solving only the entity linking task, and the other that, besides detecting linked entities, also scores them according to their saliency. Experiments conducted on two different datasets show that the proposed algorithm outperforms state-of-the-art competitors, and is able to detect salient entities with high accuracy.
Source: ACM Symposium on Document Engineering, pp. 85–94, Vienna, Austria, 13-16 September 2016
DOI: 10.1145/2960811.2960819
Project(s): SoBigData via OpenAIRE

See at: dl.acm.org Restricted | doi.org | CNR ExploRA

2018 Journal article Open Access

SEL: a unified algorithm for salient entity linking
Trani S., Lucchese C., Perego R., Losada D. E., Ceccarelli D., Orlando S.
The entity linking task consists in automatically identifying and linking the entities mentioned in a text to their uniform resource identifiers in a given knowledge base. This task is very challenging due to natural language ambiguity. However, not all the entities mentioned in the document have the same utility in understanding the topics being discussed. Thus, the related problem of identifying the most relevant entities present in the document, also known as salient entities (SE), is attracting increasing interest. In this paper, we propose salient entity linking, a novel supervised 2-step algorithm comprehensively addressing both entity linking and saliency detection. The first step is aimed at identifying a set of candidate entities that are likely to be mentioned in the document. The second step, besides detecting linked entities, also scores them according to their saliency. Experiments conducted on 2 different data sets show that the proposed algorithm outperforms state-of-the-art competitors and is able to detect SE with high accuracy. Furthermore, we used salient entity linking for extractive text summarization. We found that entity saliency can be incorporated into text summarizers to extract salient sentences from text. The resulting summarizers outperform well-known summarization systems, proving the importance of using the SE information.
Source: Computational intelligence 34 (2018): 2–29. doi:10.1111/coin.12147
DOI: 10.1111/coin.12147
Project(s): SoBigData via OpenAIRE

See at: ISTI Repository Open Access | Computational Intelligence Restricted | onlinelibrary.wiley.com | CNR ExploRA

2017 Conference article Open Access

The impact of negative samples on learning to rank
Lucchese C., Nardini F. M., Perego R., Trani S.
Learning-to-Rank (LtR) techniques leverage machine learning algorithms and large amounts of training data to induce high-quality ranking functions. Given a set of documents and a user query, these functions are able to predict a score for each of the documents that is in turn exploited to induce a relevance ranking. The effectiveness of these learned functions has been proved to be significantly affected by the data used to learn them. Several analyses and document selection strategies have been proposed in the past to deal with this aspect. In this paper we review the state-of-the-art proposals and we report the results of a preliminary investigation of a new sampling strategy aimed at reducing the number of non-relevant query-document pairs, so as to significantly decrease the training time of the learning algorithm and to increase the final effectiveness of the model by reducing noise and redundancy in the training set.
Source: 1st International Workshop on LEARning Next GEneration Rankers, LEARNER 2017, Amsterdam, Netherlands, 1 October, 2017

See at: ceur-ws.org Open Access | ISTI Repository | CNR ExploRA

2020 Conference article Open Access

Query-level early exit for additive learning-to-rank ensembles
Lucchese C., Nardini F. M., Orlando S., Perego R., Trani S.
Search engine ranking pipelines are commonly based on large ensembles of machine-learned decision trees. The tight constraints on query response time recently motivated researchers to investigate algorithms to speed up the traversal of the additive ensemble or to terminate early the evaluation of documents that are unlikely to be ranked among the top-k. In this paper, we investigate the novel problem of query-level early exiting, aimed at deciding the profitability of stopping early the traversal of the ranking ensemble for all the candidate documents to be scored for a query, by simply returning a ranking based on the additive scores computed by a limited portion of the ensemble. Besides the obvious advantage on query latency and throughput, we address the possible positive impact on ranking effectiveness. To this end, we study the actual contribution of incremental portions of the tree ensemble to the ranking of the top-k documents scored for a given query. Our main finding is that queries exhibit different behaviors as scores are accumulated during the traversal of the ensemble and that query-level early stopping can remarkably improve ranking quality. We present a reproducible and comprehensive experimental evaluation, conducted on two public datasets, showing that query-level early exiting achieves an overall gain of up to 7.5% in terms of NDCG@10 with a speedup of the scoring process of up to 2.2x.
Source: 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2033–2036, Online Conference, 25-30 July, 2020
DOI: 10.1145/3397271.3401256
DOI: 10.48550/arxiv.2004.14641
Project(s): BigDataGrapes via OpenAIRE

2020 Journal article Open Access

RankEval: Evaluation and investigation of ranking models
Lucchese C., Muntean C. I., Nardini F. M., Perego R., Trani S.
RankEval is an open-source Python tool for the analysis and evaluation of ranking models based on ensembles of decision trees. Learning-to-Rank (LtR) approaches that generate tree ensembles are considered the most effective solution for difficult ranking tasks, and several impactful LtR libraries have been developed with the aim of improving ranking quality and training efficiency. However, these libraries are not very helpful in terms of hyper-parameter tuning and in-depth analysis of the learned models, and even the implementations of the most popular Information Retrieval (IR) metrics differ among them, thus making it difficult to compare different models. RankEval overcomes these limitations by providing a unified environment in which to perform an easy, comprehensive inspection and assessment of ranking models trained using different machine learning libraries. The tool focuses on ensuring efficiency, flexibility and extensibility and is fully interoperable with the most popular LtR libraries.
Source: Softwarex (Amsterdam) 12 (2020). doi:10.1016/j.softx.2020.100614
DOI: 10.1016/j.softx.2020.100614
Project(s): BigDataGrapes via OpenAIRE

See at: SoftwareX Open Access | ISTI Repository | SoftwareX | www.sciencedirect.com | CNR ExploRA

2021 Journal article Closed Access

Efficient traversal of decision tree ensembles with FPGAs
Molina R., Loor F., Gil-Costa V., Nardini F. M., Perego R., Trani S.
System-on-Chip (SoC) based Field Programmable Gate Arrays (FPGAs) provide a hardware acceleration technology that can be rapidly deployed and tuned, thus providing a flexible solution adaptable to specific design requirements and to changing demands. In this paper, we present three SoC architecture designs for speeding up inference tasks based on machine-learned ensembles of decision trees. We focus on QuickScorer, the state-of-the-art algorithm for the efficient traversal of tree ensembles, and present the issues and the advantages related to its deployment on two SoC devices with different capacities. The results of the experiments conducted using publicly available datasets show that the proposed solution is very efficient and scalable. More importantly, it provides almost constant inference times, independently of the number of trees in the model and the number of instances to score. This allows the deployed SoC solution to be fine-tuned on the basis of the accuracy and latency constraints of the application scenario considered.
Source: Journal of parallel and distributed computing (Print) 155 (2021): 38–49. doi:10.1016/j.jpdc.2021.04.008
DOI: 10.1016/j.jpdc.2021.04.008

See at: Journal of Parallel and Distributed Computing Restricted | Journal of Parallel and Distributed Computing | CNR ExploRA

2021 Conference article Open Access

Learning early exit strategies for additive ranking ensembles
Busolin F., Lucchese C., Nardini F. M., Orlando S., Perego R., Trani S.
Modern search engine ranking pipelines are commonly based on large machine-learned ensembles of regression trees. We propose LEAR, a novel learned technique aimed at reducing the average number of trees traversed by documents to accumulate the scores, thus reducing the overall query response time. LEAR exploits a classifier that predicts whether a document can early exit the ensemble because it is unlikely to be ranked among the final top-k results. The early exit decision occurs at a sentinel point, i.e., after having evaluated a limited number of trees, and the partial scores are exploited to filter out non-promising documents. We evaluate LEAR by deploying it in a production-like setting, adopting a state-of-the-art algorithm for ensemble traversal. We provide a comprehensive experimental evaluation on two public datasets. The experiments show that LEAR has a significant impact on the efficiency of the query processing without hindering its ranking quality. In detail, on a first dataset, LEAR is able to achieve a speedup of 3x without any loss in NDCG@10, while on a second dataset the speedup is larger than 5x with a negligible NDCG@10 loss (< 0.05%).
Source: SIGIR '21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2217–2221, Online conference, 11-15/07/2021
DOI: 10.1145/3404835.3463088

See at: arXiv.org e-Print Archive Open Access | Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari | dl.acm.org Restricted | dl.acm.org | CNR ExploRA
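
Illustrative sketch (not taken from the paper): the abstract above describes a sentinel-based early exit, where a decision made after a limited number of trees lets unpromising documents skip the rest of the ensemble. The Python below shows only that control flow. The names `score_with_early_exit` and `should_exit` are hypothetical, trees are modelled as plain callables, and the exit predicate is a placeholder for the learned classifier, which is not reproduced here.

```python
def score_with_early_exit(docs, trees, sentinel, should_exit):
    """Score documents with an additive ensemble and one sentinel point.

    docs:        list of feature vectors, one per candidate document
    trees:       list of callables, each mapping a feature vector to a float
    sentinel:    number of trees evaluated before the exit decision
    should_exit: callable (partial_score, features) -> bool; a stand-in
                 for a learned exit classifier (assumption)
    Returns the scores and the total number of tree evaluations performed.
    """
    scores, evaluated = [], 0
    for features in docs:
        # Accumulate the partial score over the first `sentinel` trees.
        partial = 0.0
        for tree in trees[:sentinel]:
            partial += tree(features)
        evaluated += min(sentinel, len(trees))

        # Documents judged unpromising stop here and keep the partial score.
        if should_exit(partial, features):
            scores.append(partial)
            continue

        # All other documents traverse the remaining trees.
        for tree in trees[sentinel:]:
            partial += tree(features)
        evaluated += max(len(trees) - sentinel, 0)
        scores.append(partial)
    return scores, evaluated
```

With `n` trees, a sentinel at `s`, and a fraction `f` of documents exiting early, this sketch performs about `s + (1 - f) * (n - s)` tree evaluations per document instead of `n`.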

2022 Journal article Open Access

Distilled neural networks for efficient learning to rank
Nardini F. M., Rulli C., Trani S., Venturini R.
Recent studies in Learning to Rank have shown the possibility to effectively distill a neural network from an ensemble of regression trees. This result makes neural networks a natural competitor of tree-based ensembles on the ranking task. Nevertheless, ensembles of regression trees outperform neural models both in terms of efficiency and effectiveness, particularly when scoring on CPU. In this paper, we propose an approach for speeding up neural scoring time by applying a combination of Distillation, Pruning and Fast Matrix multiplication. We employ knowledge distillation to learn shallow neural networks from an ensemble of regression trees. Then, we exploit an efficiency-oriented pruning technique that sparsifies the most computationally intensive layers of the neural network, which is then scored with optimized sparse matrix multiplication. Moreover, by studying both dense and sparse high-performance matrix multiplication, we develop a scoring time prediction model which helps in devising neural network architectures that match the desired efficiency requirements. Comprehensive experiments on two public learning-to-rank datasets show that neural networks produced with our novel approach are competitive at any point of the effectiveness-efficiency trade-off when compared with tree-based ensembles, providing up to 4x scoring time speed-up without affecting the ranking quality.
Source: IEEE transactions on knowledge and data engineering (Online) 35 (2022): 4695–4712. doi:10.1109/TKDE.2022.3152585
DOI: 10.1109/tkde.2022.3152585

See at: ISTI Repository Open Access | ieeexplore.ieee.org Restricted | CNR ExploRA

2022 Conference article Closed Access

Ensemble model compression for fast and energy-efficient ranking on FPGAs
Gil-Costa V., Loor F., Molina R., Nardini F. M., Perego R., Trani S.
We investigate novel SoC-FPGA solutions for fast and energy-efficient ranking based on machine-learned ensembles of decision trees. Since the memory footprint of ranking ensembles limits the effective exploitation of programmable logic for large-scale inference tasks, we investigate binning and quantization techniques to reduce the memory occupation of the learned model and we optimize the state-of-the-art ensemble-traversal algorithm for deployment on low-cost, energy-efficient FPGA devices. The results of the experiments conducted using publicly available Learning-to-Rank datasets show that our model compression techniques do not significantly impact accuracy. Moreover, the reduced space requirements allow the models and the logic to be replicated on the FPGA device in order to execute several inference tasks in parallel. We discuss in detail the experimental settings and the feasibility of deploying the proposed solution in a real setting. The results of the experiments conducted show that our FPGA solution achieves state-of-the-art performance and consumes from 9× up to 19.8× less energy than an equivalent multi-threaded CPU implementation.
Source: ECIR 2022 - 44th European Conference on IR Research, pp. 260–273, Stavanger, Norway, 10-14/04/2022
DOI: 10.1007/978-3-030-99736-6_18

See at: doi.org Restricted | link.springer.com | CNR ExploRA
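
Illustrative sketch (not taken from the paper): the abstract above mentions quantization as a way to shrink the memory footprint of a learned tree ensemble. The Python below shows one generic form of it, uniform quantization of split thresholds to small integer codes. The function names, the 8-bit default, and the choice to quantize thresholds are assumptions made for illustration; the paper's actual binning and quantization schemes are not reproduced here.

```python
def quantize(values, bits=8):
    """Map floats to integer codes in [0, 2**bits - 1] on a uniform grid.

    Returns the codes plus the offset and step needed to decode them.
    """
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    # Guard against a zero step when all values are identical.
    step = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / step) for v in values]
    return codes, lo, step


def dequantize(codes, lo, step):
    """Recover approximate floats from integer codes."""
    return [lo + c * step for c in codes]
```

Storing each threshold as one 8-bit code instead of a 32-bit float cuts that part of the model to a quarter of its size, at the cost of a rounding error of at most half a step per value.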

2022 Contribution to conference Open Access

Energy-efficient ranking on FPGAs through ensemble model compression (Abstract)
Gil-Costa V., Loor F., Molina R., Nardini F. M., Perego R., Trani S.
In this talk, we present the main results of a paper accepted at ECIR 2022 [1]. We investigate novel SoC-FPGA solutions for fast and energy-efficient ranking based on machine-learned ensembles of decision trees. Since the memory footprint of ranking ensembles limits the effective exploitation of programmable logic for large-scale inference tasks [2], we investigate binning and quantization techniques to reduce the memory occupation of the learned model and we optimize the state-of-the-art ensemble-traversal algorithm for deployment on low-cost, energy-efficient FPGA devices. The results of the experiments conducted using publicly available Learning-to-Rank datasets show that our model compression techniques do not significantly impact accuracy. Moreover, the reduced space requirements allow the models and the logic to be replicated on the FPGA device in order to execute several inference tasks in parallel. We discuss in detail the experimental settings and the feasibility of deploying the proposed solution in a real setting. The results of the experiments conducted show that our FPGA solution achieves state-of-the-art performance and consumes from 9× up to 19.8× less energy than an equivalent multi-threaded CPU implementation.
Source: IIR 2022 - 12th Italian Information Retrieval Workshop 2022, Tirrenia, Pisa, Italy, 19-22/06/2022

See at: ceur-ws.org Open Access | ISTI Repository | CNR ExploRA

2023 Journal article Open Access

Early exit strategies for learning-to-rank cascades
Busolin F., Lucchese C., Nardini F. M., Orlando S., Perego R., Trani S.
The ranking pipelines of modern search platforms commonly exploit complex machine-learned models and have a significant impact on the query response time. In this paper, we discuss several techniques to speed up the document scoring process based on large ensembles of decision trees without hindering ranking quality. Specifically, we study the problem of document early exit within the framework of a cascading ranker made of three components: 1) an efficient but sub-optimal ranking stage; 2) a pruner that exploits signals from the previous component to force the early exit of documents classified as not relevant; and 3) a final high-quality component aimed at finely ranking the documents that survived the previous phase. To maximize speedup and preserve effectiveness, we aim to increase the accuracy of the pruner in identifying non-relevant documents without early exiting documents that are likely to be ranked among the final top-k results. We propose an in-depth study of heuristic and machine-learning techniques for designing the pruner. While the heuristic technique only exploits the score/ranking information supplied by the first sub-optimal ranker, the machine-learned solution named LEAR uses these signals as additional features along with those representing query-document pairs. Moreover, we study alternative solutions to implement the first ranker, either a small prefix of the original forest or an auxiliary machine-learned ranker explicitly trained for this purpose. We evaluated our techniques through reproducible experiments using publicly available datasets and state-of-the-art competitors. The experiments confirm that our early-exit strategies achieve speedups ranging from 3× to 10× without statistically significant differences in effectiveness.
Source: IEEE access 11 (2023): 126691–126704. doi:10.1109/ACCESS.2023.3331088
DOI: 10.1109/access.2023.3331088

See at: CNR ExploRA

2013 Conference article Open Access

Dexter: an open source framework for entity linking
Ceccarelli D., Lucchese C., Orlando S., Perego R., Trani S.
We introduce Dexter, an open source framework for entity linking. The entity linking task aims at identifying all the small text fragments in a document referring to an entity contained in a given knowledge base, e.g., Wikipedia. The annotation is usually organized in three tasks. Given an input document, the first task consists in discovering the fragments that could refer to an entity. Since a mention could refer to multiple entities, it is necessary to perform a disambiguation step, where the correct entity is selected among the candidates.
Source: ESAIR'13 - Sixth International Workshop on Exploiting Semantic Annotations in Information Retrieval, pp. 17–20, San Francisco, USA, 27 October - 1 November 2013
DOI: 10.1145/2513204.2513212

See at: doi.org Restricted | CNR ExploRA

2014 Conference article Open Access

Dexter 2.0 - an open source tool for semantically enriching data
Ceccarelli D., Lucchese C., Orlando S., Perego R., Trani S.
Entity Linking (EL) makes it possible to automatically link unstructured data to entities in a Knowledge Base. Linking unstructured data (like news, blog posts, tweets) has several important applications: for example, it allows enriching the text with useful external content or improving the categorization and retrieval of documents. In recent years many effective approaches for performing EL have been proposed, but only a few authors have published the code to perform the task. In this work we describe Dexter 2.0, a major revision of our open source framework to experiment with different EL approaches. We designed Dexter to be easy to deploy and to use. The new version provides several important features: the possibility to adopt different EL strategies at run-time and to annotate semi-structured documents, as well as a well-documented REST API. In this demo we present the current state of the system, the improvements made, its architecture and the APIs provided.
Source: ISWC 2014 Posters & Demonstrations Track. A track within the 13th International Semantic Web Conference, pp. 417–420, ISWC-P&D 2014, 21 October 2014

See at: ceur-ws.org Open Access | CNR ExploRA

2014 Conference article Restricted

Manual annotation of semi-structured documents for entity-linking
Ceccarelli D., Lucchese C., Orlando S., Perego R., Trani S.
The Entity Linking (EL) problem consists in automatically linking short fragments of text within a document to entities in a given Knowledge Base like Wikipedia. Due to its impact in several text-understanding related tasks, EL is a hot research topic. The correlated problem of identifying the most relevant entities mentioned in the document, a.k.a. salient entities (SE), is also attracting increasing interest. Unfortunately, publicly available evaluation datasets that contain accurate and supervised knowledge about mentioned entities and their relevance ranking are currently very poor both in number and quality. This lack makes it very difficult to compare different EL and SE solutions on a fair basis, as well as to devise innovative techniques that rely on these datasets to train machine learning models, in turn used to automatically link and rank entities. In this demo paper we propose a Web-deployed tool that allows crowdsourcing the creation of these datasets, by supporting the collaborative human annotation of semi-structured documents. The tool, called Elianto, is an open source framework which provides a user-friendly and reactive Web interface to support both EL and SE labelling tasks, through a guided two-step process.
Source: CIKM'14 - 23rd ACM International Conference on Information and Knowledge Management, pp. 2075–2077, Shanghai, China, 3-7 November 2014
DOI: 10.1145/2661829.2661854

See at: dl.acm.org Restricted | doi.org | CNR ExploRA

2015 Conference article Open Access

Entity linking on philosophical documents
Trani S., Ceccarelli D., De Francesco A., Perego R., Segala M., Tonellotto N.
Entity Linking consists in automatically enriching a document by detecting the text fragments mentioning a given entity in an external knowledge base, e.g., Wikipedia. This problem is a hot research topic due to its impact in several text-understanding related tasks. However, its application to some specific, restricted topic domains has not received much attention. In this work we study how we can improve entity linking performance by exploiting a domain-oriented knowledge base, obtained by filtering out from Wikipedia the entities that are not relevant for the target domain. We focus on the philosophical domain, and we experiment with a combination of three different entity filtering approaches: one based on the "Philosophy" category of Wikipedia, and two based on similarity metrics between philosophical documents and the textual description of the entities in the knowledge base, namely cosine similarity and Kullback-Leibler divergence. We apply traditional entity linking strategies to the domain-oriented knowledge base obtained with these filtering techniques. Finally, we use the resulting enriched documents to conduct a preliminary user study with an expert in the area.
Source: Italian Information Retrieval Workshop, pp. 12–12, Cagliari, Italy, 25-26/05/2015

See at: ceur-ws.org Open Access | CNR ExploRA
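
Illustrative sketch (not taken from the paper): the abstract above names cosine similarity and Kullback-Leibler divergence as metrics for comparing domain documents with entity descriptions. The Python below computes both on smoothed term distributions and applies a cosine threshold as a filter. The whitespace tokenizer, the add-one smoothing, the threshold, and all function names are assumptions made for illustration, not the authors' setup.

```python
import math
from collections import Counter


def term_distribution(text, vocabulary, smoothing=1.0):
    """Smoothed term distribution of `text` over a fixed vocabulary.

    Keep smoothing > 0 so no probability is zero, which keeps the
    KL divergence below finite.
    """
    counts = Counter(text.lower().split())
    total = sum(counts[t] for t in vocabulary) + smoothing * len(vocabulary)
    return [(counts[t] + smoothing) / total for t in vocabulary]


def cosine_similarity(p, q):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def kl_divergence(p, q):
    """KL(p || q) in nats; 0.0 when the two distributions are identical."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def filter_entities(domain_text, descriptions, min_cosine):
    """Keep entities whose description is similar enough to the domain text."""
    texts = [domain_text] + list(descriptions.values())
    vocabulary = sorted({w for t in texts for w in t.lower().split()})
    domain = term_distribution(domain_text, vocabulary)
    return [
        name
        for name, text in descriptions.items()
        if cosine_similarity(domain, term_distribution(text, vocabulary)) >= min_cosine
    ]
```

Cosine similarity is symmetric and higher means closer, while KL divergence is asymmetric and lower means closer, so a filter based on KL would keep entities below a threshold rather than above it.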