7 result(s)
Page Size: 10, 20, 50
Export: bibtex, xml, json, csv
Order by:

CNR Author operator: and / or
Typology operator: and / or
Language operator: and / or
Date operator: and / or
Rights operator: and / or
2021 Conference article Open Access OPEN

PTHash: revisiting FCH minimal perfect hashing
Pibiri G. M., Trani R.
Given a set S of n distinct keys, a function f that bijectively maps the keys of S into the range {0,...,n-1} is called a minimal perfect hash function for S. Algorithms that find such functions when n is large and retain constant evaluation time are of practical interest; for instance, search engines and databases typically use minimal perfect hash functions to quickly assign identifiers to static sets of variable-length keys such as strings. The challenge is to design an algorithm which is efficient in three different aspects: time to find f (construction time), time to evaluate f on a key of S (lookup time), and space of representation for f . Several algorithms have been proposed to trade-off between these aspects. In 1992, Fox, Chen, and Heath (FCH) presented an algorithm at SIGIR providing very fast lookup evaluation. However, the approach received little attention because of its large construction time and higher space consumption compared to other subsequent techniques. Almost thirty years later we revisit their framework and present an improved algorithm that scales well to large sets and reduces space consumption altogether, without compromising the lookup time. We conduct an extensive experimental assessment and show that the algorithm finds functions that are competitive in space with state-of-the art techniques and provide 2 - 4X better lookup time.Source: ACM Conference on Research and Development in Information Retrieval, Canada, Montreal (Virtual Event), 11-15/07/2021
DOI: 10.1145/3404835.3462849

See at: ISTI Repository Open Access | CNR ExploRA Open Access


2020 Doctoral thesis Closed Access

Efficiency-Effectiveness Trade-offs in Modern Query Processing
Trani R.
Modern search engines face enormous performance challenges. The most popular ones process tens of thousands of queries per second and manage billions of documents. Besides being efficient in processing the queries, they have also to be effective in satisfying the information needs of the users. However, techniques that improve the quality of the results can also require longer processing time. Search engines have thus often to attain a compromise between effectiveness and efficiency when employing new query processing solutions. In this thesis, we propose three new solutions for improving the efficiency-effectiveness trade-offs in several different tasks of modern query processors, namely: query expansion, search results filtering, and top-1 retrieval. Query expansion is the task of augmenting the user query with additional terms so to overcome some of the difficulties arising from the natural language, such as synonymy and polysemy. To this regard, we propose a thesaurus-based query expansion framework relying on structured Conjunctive Normal Form queries. We also propose three novel term selection supervised models capturing efficiency and effectiveness of the expansion candidates to include into the expanded query. Search results filtering refers to the task of discarding some search results from an attribute-sorted list, e.g., by recency or by price, to improve the effectiveness of the returned results while preserving the ordering. To this end, we propose one efficient optimal algorithm and one faster approximate algorithm with strong approximation guarantees, which enable filtering in scenarios with tight time constraints. Top-1 retrieval regards the identification of the utmost relevant result. It is critical in many applications, such as conversational AI and question answering, returning only one utmost relevant result to each user query. To this regard, we propose an efficient algorithm for finding the result winning the highest number of pairwise comparisons, given by a pairwise machine learning classifier, while performing the minimum number of comparisons needed to solve this problem. In this thesis, we show that by trading efficiency for effectiveness it is possible to achieve a huge improvement of efficiency on these tasks at the cost of a negligible loss of effectiveness. We also demonstrate that new efficient algorithms can enable effective yet inefficient techniques to be efficiently used, thus providing new appealing solutions in the efficiency-effectiveness trade-off space.

See at: etd.adm.unipi.it Restricted | CNR ExploRA Restricted


2020 Journal article Restricted

A novel approach to define the local region of dynamic selection techniques in imbalanced credit scoring problems
Melo Junior L., Nardini F. M. Renso C., Trani R., Macedo J. A.
Lenders, such as banks and credit card companies, use credit scoring models to evaluate the potential risk posed by lending money to customers, and therefore to mitigate losses due to bad credit. The profitability of the banks thus highly depends on the models used to decide on the customer's loans. State-of-the-art credit scoring models are based on machine learning and statistical methods. One of the major problems of this field is that lenders often deal with imbalanced datasets that usually contain many paid loans but very few not paid ones (called defaults). Recently, dynamic selection methods combined with ensemble methods and preprocessing techniques have been evaluated to improve classification models in imbalanced datasets presenting advantages over the static machine learning methods. In a dynamic selection technique, samples in the neighborhood of each query sample are used to compute the local competence of each base classifier. Then, the technique selects only competent classifiers to predict the query sample. In this paper, we evaluate the suitability of dynamic selection techniques for credit scoring problem, and we present Reduced Minority k-Nearest Neighbors (RMkNN), an approach that enhances state of the art in defining the local region of dynamic selection techniques for imbalanced credit scoring datasets. This proposed technique has a superior prediction performance in imbalanced credit scoring datasets compared to state of the art. Furthermore, RMkNN does not need any preprocessing or sampling method to generate the dynamic selection dataset (called DSEL). Additionally, we observe an equivalence between dynamic selection and static selection classification. We conduct a comprehensive evaluation of the proposed technique against state-of-the-art competitors on six real-world public datasets and one private one. Experiments show that RMkNN improves the classification performance of the evaluated datasets regarding AUC, balanced accuracy, H-measure, G-mean, F-measure, and Recall.Source: Expert systems with applications 152 (2020). doi:10.1016/j.eswa.2020.113351
DOI: 10.1016/j.eswa.2020.113351
Project(s): MC2020 via OpenAIRE, BigDataGrapes via OpenAIRE, MASTER via OpenAIRE

See at: Expert Systems with Applications Restricted | Expert Systems with Applications Restricted | Expert Systems with Applications Restricted | Expert Systems with Applications Restricted | Expert Systems with Applications Restricted | CNR ExploRA Restricted | Expert Systems with Applications Restricted


2019 Conference article Open Access OPEN

Fast Approximate Filtering of Search Results Sorted by Attribute
Nardini F. M., Trani R., Venturini R.
Several Web search services enable their users with the possibility of sorting the list of results by a specific attribute, e.g., sort "by price" in e-commerce. However, sorting the results by attribute could bring marginally relevant results in the top positions thus leading to a poor user experience. This motivates the definition of the relevance-aware filtering problem. This problem asks to remove results from the attribute-sorted list to maximize its final overall relevance. Recently, an optimal solution to this problem has been proposed. However, it has strong limitations in the Web scenario due to its high computational cost. In this paper, we propose ?-Filtering: an efficient approximate algorithm with strong approximation guarantees on the relevance of the final list. More precisely, given an allowed approximation error ?, the proposed algorithm finds a(1-?)"optimal filtering, i.e., the relevance of its solution is at least (1-?) times the optimum. We conduct a comprehensive evaluation of ?-Filtering against state-of-the-art competitors on two real-world public datasets. Experiments show that ?-Filtering achieves the desired levels of effectiveness with a speedup of up to two orders of magnitude with respect to the optimal solution while guaranteeing very small approximation errors.Source: 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 815–824, Parigi, Francia, 21/07/2019, 25/07/2019
DOI: 10.1145/3331184.3331227
Project(s): BigDataGrapes via OpenAIRE

See at: ISTI Repository Open Access | academic.microsoft.com Restricted | dblp.uni-trier.de Restricted | dl.acm.org Restricted | dl.acm.org Restricted | dl.acm.org Restricted | Archivio della Ricerca - Università di Pisa Restricted | CNR ExploRA Restricted


2019 Conference article Open Access OPEN

An Optimal Algorithm to Find Champions of Tournament Graphs
Beretta L., Nardini F. M., Trani R., Venturini R.
A tournament graph T=(V,E) is an oriented complete graph, which can be used to model a round-robin tournament between n players. In this short paper, we address the problem of finding a champion of the tournament, also known as Copeland winner, which is a player that wins the highest number of matches. Our goal is to solve the problem by minimizing the number of arc lookups, i.e., the number of matches played. We prove that finding a champion requires ?(ln) comparisons, where l is the number of matches lost by the champion, and we present a deterministic algorithm matching this lower bound without knowing l . Solving this problem has important implications on several Information Retrieval applications including Web search, conversational AI, machine translation, question answering, recommender systems, etc.Source: SPIRE 2019: International Symposium on String Processing and Information Retrieval, pp. 267–273, Segovia, Spagna, 07/10/2019, 09/10/2019
DOI: 10.1007/978-3-030-32686-9_19
Project(s): BigDataGrapes via OpenAIRE

See at: ISTI Repository Open Access | academic.microsoft.com Restricted | dblp.uni-trier.de Restricted | doi.org Restricted | Archivio della Ricerca - Università di Pisa Restricted | link.springer.com Restricted | link.springer.com Restricted | link.springer.com Restricted | CNR ExploRA Restricted | rd.springer.com Restricted


2019 Journal article Embargo

Speed prediction in large and dynamic traffic sensor networks
Magalhaes R. P., Lettich F., Macedo J. A., Nardini F. M., Perego R., Renso C., Trani R.
Smart cities are nowadays equipped with pervasive networks of sensors that monitor traffic in real-time and record huge volumes of traffic data. These datasets constitute a rich source of information that can be used to extract knowledge useful for municipalities and citizens. In this paper we are interested in exploiting such data to estimate future speed in traffic sensor networks, as accurate predictions have the potential to enhance decision making capabilities of traffic management systems. Building effective speed prediction models in large cities poses important challenges that stem from the complexity of traffic patterns, the number of traffic sensors typically deployed, and the evolving nature of sensor networks. Indeed, sensors are frequently added to monitor new road segments or replaced/removed due to different reasons (e.g., maintenance). Exploiting a large number of sensors for effective speed prediction thus requires smart solutions to collect vast volumes of data and train effective prediction models. Furthermore, the dynamic nature of real-world sensor networks calls for solutions that are resilient not only to changes in traffic behavior, but also to changes in the network structure, where the cold start problem represents an important challenge. We study three different approaches in the context of large and dynamic sensor networks: local, global, and cluster-based. The local approach builds a specific prediction model for each sensor of the network. Conversely, the global approach builds a single prediction model for the whole sensor network. Finally, the cluster-based approach groups sensors into homogeneous clusters and generates a model for each cluster. We provide a large dataset, generated from ~1.3 billion records collected by up to 272 sensors deployed in Fortaleza, Brazil, and use it to experimentally assess the effectiveness and resilience of prediction models built according to the three aforementioned approaches. The results show that the global and cluster-based approaches provide very accurate prediction models that prove to be robust to changes in traffic behavior and in the structure of sensor networks.Source: Information systems (Oxf.) 98 (2019). doi:10.1016/j.is.2019.101444
DOI: 10.1016/j.is.2019.101444
Project(s): BigDataGrapes via OpenAIRE, MASTER via OpenAIRE

See at: Information Systems Restricted | Information Systems Restricted | Information Systems Restricted | Information Systems Restricted | CNR ExploRA Restricted | Information Systems Restricted


2018 Conference article Open Access OPEN

Efficient and effective query expansion for web search
Lucchese C., Nardini F. M., Perego R., Trani R., Venturini R.
Query Expansion (QE) techniques expand the user queries with additional terms, e.g., synonyms and acronyms, to enhance the system recall. State-of-the-art solutions employ machine learning methods to select the most suitable terms. However, most of them neglect the cost of processing the expanded queries, thus selecting effective, yet very expensive, terms. The goal of this paper is to enable QE in scenarios with tight time constraints proposing a QE framework based on structured queries and efficiency-aware term selection strategies. In particular, the proposed expansion selection strategies aim at capturing the efficiency and the effectiveness of the expansion candidates, as well as the dependencies among them. We evaluate our proposals by conducting an extensive experimental assessment on real-world search engine data and public TREC data. Results confirm that our approach leads to a remarkable efficiency improvement w.r.t. the state-of-the-art: a reduction of the retrieval time up to 30 times, with only a small loss of effectiveness.Source: International Conference on Information and Knowledge Management (CIKM), pp. 1551–1554, 22-26/10/2018
DOI: 10.1145/3269206.3269305
DOI: 10.5281/zenodo.2668249
DOI: 10.5281/zenodo.2668248
Project(s): BigDataGrapes via OpenAIRE

See at: ISTI Repository Open Access | zenodo.org Open Access | zenodo.org Open Access | academic.microsoft.com Restricted | dblp.uni-trier.de Restricted | dl.acm.org Restricted | dl.acm.org Restricted | dl.acm.org Restricted | iris.unive.it Restricted | CNR ExploRA Restricted