11 result(s)

2018 Conference article Open Access
Efficient and effective query expansion for web search
Lucchese C., Nardini F. M., Perego R., Trani R., Venturini R.
Query Expansion (QE) techniques expand the user queries with additional terms, e.g., synonyms and acronyms, to enhance the system's recall. State-of-the-art solutions employ machine learning methods to select the most suitable terms. However, most of them neglect the cost of processing the expanded queries, thus selecting effective, yet very expensive, terms. The goal of this paper is to enable QE in scenarios with tight time constraints by proposing a QE framework based on structured queries and efficiency-aware term selection strategies. In particular, the proposed expansion selection strategies aim at capturing the efficiency and the effectiveness of the expansion candidates, as well as the dependencies among them. We evaluate our proposals by conducting an extensive experimental assessment on real-world search engine data and public TREC data. Results confirm that our approach leads to a remarkable efficiency improvement w.r.t. the state of the art: a reduction of the retrieval time of up to 30 times, with only a small loss of effectiveness.
Source: International Conference on Information and Knowledge Management (CIKM), pp. 1551–1554, 22-26/10/2018
DOI: 10.1145/3269206.3269305
DOI: 10.5281/zenodo.2668248
DOI: 10.5281/zenodo.2668249
Project(s): BigDataGrapes via OpenAIRE
See at: ZENODO Open Access | ZENODO Open Access | ISTI Repository Open Access | zenodo.org Open Access | zenodo.org Open Access | dl.acm.org Restricted | doi.org Restricted | CNR ExploRA
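
A minimal sketch of what a CNF-structured expanded query looks like, assuming a toy thesaurus; this is illustrative only, not the paper's efficiency-aware selection models: each original term becomes an OR-clause grouping the term with a few expansion candidates, and the clauses are joined by AND.

```python
# Illustrative sketch only: build a Conjunctive Normal Form (CNF) structured
# query by ORing each original term with a few expansion candidates taken
# from a placeholder thesaurus (the paper selects candidates with supervised,
# efficiency-aware models instead of simply taking the first ones).

def expand_query_cnf(query_terms, thesaurus, max_expansions_per_term=2):
    """Return a CNF query as a list of OR-clauses, one per original term."""
    clauses = []
    for term in query_terms:
        candidates = thesaurus.get(term, [])[:max_expansions_per_term]
        clauses.append([term] + candidates)
    return clauses

def to_query_string(cnf):
    """Render the CNF clauses as a boolean query string."""
    return " AND ".join("(" + " OR ".join(clause) + ")" for clause in cnf)

if __name__ == "__main__":
    toy_thesaurus = {"car": ["automobile", "auto"], "price": ["cost"]}
    cnf = expand_query_cnf(["car", "price"], toy_thesaurus)
    print(to_query_string(cnf))  # (car OR automobile OR auto) AND (price OR cost)
```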


2019 Conference article Open Access
Fast Approximate Filtering of Search Results Sorted by Attribute
Nardini F. M., Trani R., Venturini R.
Several Web search services allow their users to sort the list of results by a specific attribute, e.g., sort "by price" in e-commerce. However, sorting the results by attribute can bring marginally relevant results into the top positions, thus leading to a poor user experience. This motivates the definition of the relevance-aware filtering problem, which asks to remove results from the attribute-sorted list so as to maximize its final overall relevance. Recently, an optimal solution to this problem has been proposed. However, it has strong limitations in the Web scenario due to its high computational cost. In this paper, we propose ε-Filtering: an efficient approximate algorithm with strong approximation guarantees on the relevance of the final list. More precisely, given an allowed approximation error ε, the proposed algorithm finds a (1-ε)-optimal filtering, i.e., the relevance of its solution is at least (1-ε) times the optimum. We conduct a comprehensive evaluation of ε-Filtering against state-of-the-art competitors on two real-world public datasets. Experiments show that ε-Filtering achieves the desired levels of effectiveness with a speedup of up to two orders of magnitude with respect to the optimal solution, while guaranteeing very small approximation errors.
Source: 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 815–824, Paris, France, 21-25/07/2019
DOI: 10.1145/3331184.3331227
Project(s): BigDataGrapes via OpenAIRE
See at: ISTI Repository Open Access | dl.acm.org Restricted | doi.org Restricted | CNR ExploRA
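
To make the relevance-aware filtering problem concrete, here is a small exact baseline under an assumed DCG-style objective (relevance discounted by the position each kept result ends up in); it is not the paper's ε-Filtering nor the optimal algorithm it compares against, only a quadratic dynamic program showing why dropping marginally relevant results from an attribute-sorted list can raise the overall relevance of the list.

```python
import math

def filter_by_relevance(relevances, discount=lambda p: 1.0 / math.log2(p + 1)):
    """Exact O(n^2) dynamic program: choose which items of an attribute-sorted
    list to keep (order preserved) so that the position-discounted relevance of
    the kept list is maximized. Returns (best score, indices of kept items)."""
    n = len(relevances)
    NEG = float("-inf")
    # dp[i][j] = best score over the first i items when exactly j of them are kept
    dp = [[NEG] * (n + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        rel = relevances[i - 1]
        for j in range(0, i + 1):
            skip = dp[i - 1][j]
            keep = dp[i - 1][j - 1] + rel * discount(j) if j > 0 else NEG
            dp[i][j] = max(skip, keep)
    best_j = max(range(n + 1), key=lambda j: dp[n][j])
    kept, j = [], best_j
    for i in range(n, 0, -1):  # backtrack to recover the kept items
        if j > 0 and dp[i][j] == dp[i - 1][j - 1] + relevances[i - 1] * discount(j):
            kept.append(i - 1)
            j -= 1
    return dp[n][best_j], kept[::-1]

# Dropping the weak first result lifts the strong ones into better positions.
print(filter_by_relevance([0.1, 10.0, 0.2, 5.0]))  # keeps items 1 and 3
```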


2019 Conference article Open Access
An Optimal Algorithm to Find Champions of Tournament Graphs
Beretta L., Nardini F. M., Trani R., Venturini R.
A tournament graph T=(V,E) is an oriented complete graph, which can be used to model a round-robin tournament between n players. In this short paper, we address the problem of finding a champion of the tournament, also known as a Copeland winner, i.e., a player who wins the highest number of matches. Our goal is to solve the problem by minimizing the number of arc lookups, i.e., the number of matches played. We prove that finding a champion requires Ω(ℓn) comparisons, where ℓ is the number of matches lost by the champion, and we present a deterministic algorithm matching this lower bound without knowing ℓ. Solving this problem has important implications for several Information Retrieval applications, including Web search, conversational AI, machine translation, question answering, recommender systems, etc.
Source: SPIRE 2019: International Symposium on String Processing and Information Retrieval, pp. 267–273, Segovia, Spain, 07-09/10/2019
DOI: 10.1007/978-3-030-32686-9_19
Project(s): BigDataGrapes via OpenAIRE
See at: ISTI Repository Open Access | Lecture Notes in Computer Science Restricted | link.springer.com Restricted | CNR ExploRA
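
For contrast with the Ω(ℓn) bound above, here is the brute-force baseline that plays the full round-robin, i.e., n(n-1)/2 matches; the `beats` predicate stands in for whatever pairwise comparison (e.g., a learned model) induces the tournament. This is illustrative only and is not the algorithm proposed in the paper.

```python
from itertools import combinations

def copeland_winner_bruteforce(players, beats):
    """`beats(a, b)` returns True iff a beats b (e.g., a pairwise ML model).
    Plays every match and returns the player with the most wins."""
    players = list(players)
    wins = {p: 0 for p in players}
    for a, b in combinations(players, 2):
        wins[a if beats(a, b) else b] += 1
    return max(players, key=lambda p: wins[p])

# Toy usage: players are integers and the larger number always wins.
print(copeland_winner_bruteforce(range(8), lambda a, b: a > b))  # 7
```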


2021 Conference article Open Access
TSXor: a simple time series compression algorithm
Bruno A., Nardini F. M., Pibiri G. E., Trani R., Venturini R.
Time series are ubiquitous in computing as a key ingredient of many machine learning analytics, ranging from classification to forecasting. Typically, training such machine learning algorithms on time series requires accessing the data in temporal order several times. Therefore, a compression algorithm providing good compression ratios and fast decompression speed is desirable. In this paper, we present TSXor, a simple yet effective lossless compressor for time series. The main idea is to exploit the redundancy/similarity between close-in-time values through a window that acts as a cache, so as to improve the compression ratio and decompression speed. We show that TSXor achieves up to 3× better compression and up to 2× faster decompression than the state of the art on real-world datasets.
Source: SPIRE 2021 - International Symposium on String Processing and Information Retrieval, pp. 217–223, Lille, France (Virtual Event), 04-06/10/2021
DOI: 10.1007/978-3-030-86692-1_18
See at: ISTI Repository Open Access | link.springer.com Restricted | CNR ExploRA
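
The window-as-cache idea can be sketched as follows; this is only a minimal illustration of the principle on IEEE 754 doubles, not TSXor's actual byte-oriented format: each value is XORed against the cached value that leaves the smallest residual, so near-duplicate readings produce residuals made mostly of zero bits, which a real encoder would then store compactly.

```python
import struct

def xor_window_encode(values, window=8):
    """Encode doubles as (index of a reference in a small cache, XOR residual)."""
    bits = [struct.unpack("<Q", struct.pack("<d", v))[0] for v in values]
    out, cache = [], []
    for x in bits:
        if not cache:
            out.append((None, x))                    # first value stored verbatim
        else:
            # reference the cached value giving the numerically smallest XOR
            ref = min(range(len(cache)), key=lambda i: cache[i] ^ x)
            out.append((ref, cache[ref] ^ x))
        cache = (cache + [x])[-window:]              # the window acts as a cache
    return out

def xor_window_decode(encoded, window=8):
    """Invert xor_window_encode by replaying the same cache states."""
    bits, cache = [], []
    for ref, residual in encoded:
        x = residual if ref is None else cache[ref] ^ residual
        bits.append(x)
        cache = (cache + [x])[-window:]
    return [struct.unpack("<d", struct.pack("<Q", b))[0] for b in bits]

data = [20.5, 20.5, 20.6, 21.0, 20.9]
assert xor_window_decode(xor_window_encode(data)) == data   # lossless round-trip
```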


2023 Journal article Open Access
Parallel and external-memory construction of minimal perfect hash functions with PTHash
Pibiri G. E., Trani R.
A function f is a minimal perfect hash function for a set S of size n if f bijectively maps S into the first n natural numbers. These functions are important for many practical applications in computing, such as search engines, computer networks, and databases. Several algorithms have been proposed to build minimal perfect hash functions that scale well to large sets, retain fast evaluation time, and take very little space, e.g., 2-3 bits/key. PTHash is one such algorithm, achieving very fast evaluation in compressed space, typically many times faster than other techniques. In this work, we propose a new construction algorithm for PTHash enabling (1) parallel construction, to either build functions more quickly or more space-efficiently, and (2) external-memory processing, to scale to inputs much larger than the available internal memory. Only a few other algorithms in the literature share these features, despite their practical impact. We conduct an extensive experimental assessment on large real-world string collections and show that, with respect to other techniques, PTHash is competitive in construction time and space consumption, but retains 2-6× better lookup time.
Source: IEEE transactions on knowledge and data engineering (Online) (2023). doi:10.1109/TKDE.2023.3303341
DOI: 10.1109/tkde.2023.3303341
See at: ISTI Repository Open Access | ieeexplore.ieee.org Restricted | CNR ExploRA
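
To make the MPHF definition concrete, the toy sketch below finds a minimal perfect hash function for a handful of keys by brute-force seed search; it bears no resemblance to PTHash's construction (which scales to billions of keys in compressed space), and the choice of blake2b and the seed range are arbitrary.

```python
import hashlib

def _h(key, seed, n):
    """A seeded hash of a string key onto {0, ..., n-1}."""
    digest = hashlib.blake2b(key.encode(), salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest[:8], "little") % n

def build_toy_mphf(keys, max_seed=1_000_000):
    """Try seeds until the hash maps the keys bijectively onto {0, ..., n-1}."""
    n = len(keys)
    for seed in range(max_seed):
        if len({_h(k, seed, n) for k in keys}) == n:   # no collisions => bijection
            return lambda key, s=seed: _h(key, s, n)
    raise RuntimeError("no seed found; the set is too large for this toy approach")

keys = ["apple", "banana", "cherry", "date"]
mphf = build_toy_mphf(keys)
print(sorted(mphf(k) for k in keys))   # [0, 1, 2, 3]
```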


2021 Conference article Open Access
PTHash: revisiting FCH minimal perfect hashing
Pibiri G. E., Trani R.
Given a set S of n distinct keys, a function f that bijectively maps the keys of S into the range {0,...,n-1} is called a minimal perfect hash function for S. Algorithms that find such functions when n is large and retain constant evaluation time are of practical interest; for instance, search engines and databases typically use minimal perfect hash functions to quickly assign identifiers to static sets of variable-length keys such as strings. The challenge is to design an algorithm which is efficient in three different aspects: time to find f (construction time), time to evaluate f on a key of S (lookup time), and space of representation for f. Several algorithms have been proposed to trade off between these aspects. In 1992, Fox, Chen, and Heath (FCH) presented an algorithm at SIGIR providing very fast lookup evaluation. However, the approach received little attention because of its large construction time and higher space consumption compared to other subsequent techniques. Almost thirty years later we revisit their framework and present an improved algorithm that scales well to large sets and reduces space consumption altogether, without compromising the lookup time. We conduct an extensive experimental assessment and show that the algorithm finds functions that are competitive in space with state-of-the-art techniques and provide 2-4× better lookup time.
Source: SIGIR '21 - 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1339–1348, Montreal, Canada (Virtual Event), 11-15/07/2021
DOI: 10.1145/3404835.3462849
See at: ISTI Repository Open Access | dl.acm.org Restricted | CNR ExploRA


2023 Journal article Open Access
An optimal algorithm for finding champions in tournament graphs
Beretta L., Nardini F. M., Trani R., Venturini R.
A tournament graph is a complete directed graph, which can be used to model a round-robin tournament between n players. In this paper, we address the problem of finding a champion of the tournament, also known as a Copeland winner, i.e., a player who wins the highest number of matches. In detail, we aim to investigate algorithms that find the champion by playing a low number of matches. Solving this problem allows us to speed up several Information Retrieval and Recommender System applications, including question answering, conversational search, etc. Indeed, these applications often search for the champion by inducing a round-robin tournament among the players, employing a machine learning model to estimate who wins each pairwise comparison. Our contribution thus allows finding the champion by performing a low number of model inferences. We prove that any deterministic or randomized algorithm finding a champion with constant success probability requires Ω(ℓn) comparisons, where ℓ is the number of matches lost by the champion. We then present an asymptotically optimal deterministic algorithm matching this lower bound without knowing ℓ, and we extend our analysis to three variants of the problem. Lastly, we conduct a comprehensive experimental assessment of the proposed algorithms on a question answering task on public data. Results show that our proposed algorithms speed up the retrieval of the champion by up to 13× with respect to the state-of-the-art algorithm that performs the full tournament.
Source: IEEE transactions on knowledge and data engineering (Online) 35 (2023): 10197–10209. doi:10.1109/TKDE.2023.3267345
DOI: 10.1109/tkde.2023.3267345
DOI: 10.48550/arxiv.2111.13621
See at: IEEE Transactions on Knowledge and Data Engineering Open Access | ISTI Repository Open Access | doi.org Restricted | ieeexplore.ieee.org Restricted | CNR ExploRA


2019 Journal article Embargo
Speed prediction in large and dynamic traffic sensor networks
Magalhaes R. P., Lettich F., Macedo J. A., Nardini F. M., Perego R., Renso C., Trani R.
Smart cities are nowadays equipped with pervasive networks of sensors that monitor traffic in real time and record huge volumes of traffic data. These datasets constitute a rich source of information that can be used to extract knowledge useful for municipalities and citizens. In this paper we are interested in exploiting such data to estimate future speed in traffic sensor networks, as accurate predictions have the potential to enhance the decision-making capabilities of traffic management systems. Building effective speed prediction models in large cities poses important challenges that stem from the complexity of traffic patterns, the number of traffic sensors typically deployed, and the evolving nature of sensor networks. Indeed, sensors are frequently added to monitor new road segments or replaced/removed for various reasons (e.g., maintenance). Exploiting a large number of sensors for effective speed prediction thus requires smart solutions to collect vast volumes of data and train effective prediction models. Furthermore, the dynamic nature of real-world sensor networks calls for solutions that are resilient not only to changes in traffic behavior, but also to changes in the network structure, where the cold start problem represents an important challenge. We study three different approaches in the context of large and dynamic sensor networks: local, global, and cluster-based. The local approach builds a specific prediction model for each sensor of the network. Conversely, the global approach builds a single prediction model for the whole sensor network. Finally, the cluster-based approach groups sensors into homogeneous clusters and generates a model for each cluster. We provide a large dataset, generated from ~1.3 billion records collected by up to 272 sensors deployed in Fortaleza, Brazil, and use it to experimentally assess the effectiveness and resilience of prediction models built according to the three aforementioned approaches. The results show that the global and cluster-based approaches provide very accurate prediction models that prove to be robust to changes in traffic behavior and in the structure of sensor networks.
Source: Information systems (Oxf.) 98 (2019). doi:10.1016/j.is.2019.101444
DOI: 10.1016/j.is.2019.101444
Project(s): BigDataGrapes via OpenAIRE, MASTER via OpenAIRE
See at: Information Systems Restricted | www.sciencedirect.com Restricted | CNR ExploRA
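
A generic skeleton of the cluster-based approach described above is sketched below on synthetic data (not the Fortaleza dataset); the clustering features, the lag-based features, and the regressor are placeholder choices rather than the paper's actual pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_sensors, hours = 30, 24 * 14                  # two weeks of hourly readings
daily_pattern = 50 + 10 * np.sin(np.arange(hours) * 2 * np.pi / 24)
speeds = daily_pattern + rng.normal(0, 5, size=(n_sensors, hours))

# 1) Group sensors into clusters by their average 24-hour speed profile.
profiles = speeds.reshape(n_sensors, -1, 24).mean(axis=1)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(profiles)

# 2) Train one regressor per cluster: predict next-hour speed from the last 3 hours.
models = {}
for c in set(labels):
    X, y = [], []
    for s in np.where(labels == c)[0]:
        for t in range(3, hours):
            X.append(speeds[s, t - 3:t])
            y.append(speeds[s, t])
    models[c] = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

# 3) At prediction time, each sensor uses the model of its cluster.
sensor = 0
prediction = models[labels[sensor]].predict([speeds[sensor, -3:]])
print(f"predicted next-hour speed for sensor {sensor}: {prediction[0]:.1f}")
```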


2020 Doctoral thesis Closed Access
Efficiency-Effectiveness Trade-offs in Modern Query Processing
Trani R.
Modern search engines face enormous performance challenges. The most popular ones process tens of thousands of queries per second and manage billions of documents. Besides being efficient in processing the queries, they also have to be effective in satisfying the information needs of the users. However, techniques that improve the quality of the results can also require longer processing time. Search engines thus often have to strike a compromise between effectiveness and efficiency when employing new query processing solutions. In this thesis, we propose three new solutions for improving the efficiency-effectiveness trade-offs in several different tasks of modern query processors, namely: query expansion, search results filtering, and top-1 retrieval. Query expansion is the task of augmenting the user query with additional terms so as to overcome some of the difficulties arising from natural language, such as synonymy and polysemy. To this end, we propose a thesaurus-based query expansion framework relying on structured Conjunctive Normal Form queries, together with three novel supervised term-selection models capturing the efficiency and effectiveness of the expansion candidates to include in the expanded query. Search results filtering refers to the task of discarding some search results from an attribute-sorted list, e.g., by recency or by price, to improve the effectiveness of the returned results while preserving the ordering. To this end, we propose one efficient optimal algorithm and one faster approximate algorithm with strong approximation guarantees, which enable filtering in scenarios with tight time constraints. Top-1 retrieval concerns the identification of the single most relevant result. It is critical in many applications, such as conversational AI and question answering, which return only one highly relevant result for each user query. To this end, we propose an efficient algorithm for finding the result that wins the highest number of pairwise comparisons, as judged by a pairwise machine learning classifier, while performing the minimum number of comparisons needed to solve this problem. In this thesis, we show that it is possible to achieve a huge improvement in efficiency on these tasks at the cost of a negligible loss of effectiveness. We also demonstrate that new efficient algorithms can make effective yet previously inefficient techniques usable in practice, thus providing appealing new solutions in the efficiency-effectiveness trade-off space.

See at: etd.adm.unipi.it Restricted | CNR ExploRA


2020 Journal article Open Access
A novel approach to define the local region of dynamic selection techniques in imbalanced credit scoring problems
Melo Junior L., Nardini F. M., Renso C., Trani R., Macedo J. A.
Lenders, such as banks and credit card companies, use credit scoring models to evaluate the potential risk posed by lending money to customers, and therefore to mitigate losses due to bad credit. The profitability of banks thus highly depends on the models used to decide on customers' loans. State-of-the-art credit scoring models are based on machine learning and statistical methods. One of the major problems of this field is that lenders often deal with imbalanced datasets that usually contain many paid loans but very few unpaid ones (called defaults). Recently, dynamic selection methods combined with ensemble methods and preprocessing techniques have been evaluated to improve classification models on imbalanced datasets, presenting advantages over static machine learning methods. In a dynamic selection technique, samples in the neighborhood of each query sample are used to compute the local competence of each base classifier. Then, the technique selects only competent classifiers to predict the query sample. In this paper, we evaluate the suitability of dynamic selection techniques for the credit scoring problem, and we present Reduced Minority k-Nearest Neighbors (RMkNN), an approach that enhances the state of the art in defining the local region of dynamic selection techniques for imbalanced credit scoring datasets. The proposed technique has superior prediction performance on imbalanced credit scoring datasets compared to the state of the art. Furthermore, RMkNN does not need any preprocessing or sampling method to generate the dynamic selection dataset (called DSEL). Additionally, we observe an equivalence between dynamic selection and static selection classification. We conduct a comprehensive evaluation of the proposed technique against state-of-the-art competitors on six real-world public datasets and one private dataset. Experiments show that RMkNN improves the classification performance on the evaluated datasets regarding AUC, balanced accuracy, H-measure, G-mean, F-measure, and Recall.
Source: Expert systems with applications 152 (2020). doi:10.1016/j.eswa.2020.113351
DOI: 10.1016/j.eswa.2020.113351
Project(s): BigDataGrapes via OpenAIRE, MASTER via OpenAIRE
See at: ISTI Repository Open Access | Expert Systems with Applications Restricted | www.sciencedirect.com Restricted | CNR ExploRA
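
As background for the local-region idea, here is a generic dynamic-selection skeleton based on plain local accuracy over the k nearest neighbours in DSEL; it is not RMkNN's region definition, and the pool of decision trees and the synthetic imbalanced dataset are illustrative placeholders.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def dynamic_select_predict(classifiers, X_dsel, y_dsel, X_query, k=7):
    """For each query, predict with the base classifier that is most accurate
    on the query's k nearest neighbours in the selection set (DSEL)."""
    neighbours = NearestNeighbors(n_neighbors=k).fit(X_dsel).kneighbors(X_query)[1]
    query_preds = np.array([clf.predict(X_query) for clf in classifiers])
    dsel_preds = np.array([clf.predict(X_dsel) for clf in classifiers])
    out = []
    for i, region in enumerate(neighbours):
        # local competence = accuracy of each base classifier inside the region
        competence = (dsel_preds[:, region] == y_dsel[region]).mean(axis=1)
        out.append(query_preds[np.argmax(competence), i])
    return np.array(out)

if __name__ == "__main__":
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier
    X, y = make_classification(n_samples=600, weights=[0.9], random_state=0)
    X_train, y_train = X[:300], y[:300]
    X_dsel, y_dsel = X[300:500], y[300:500]
    pool = [DecisionTreeClassifier(max_depth=d, random_state=0).fit(X_train, y_train)
            for d in (1, 3, 5, None)]
    print(dynamic_select_predict(pool, X_dsel, y_dsel, X[500:])[:10])
```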


2021 Journal article Open Access
Fast filtering of search results sorted by attribute
Nardini F. M., Trani R., Venturini R.
Modern search services often provide multiple options to rank the search results, e.g., sort "by relevance", "by price", or "by discount" in e-commerce. While the traditional rank by relevance effectively places the relevant results in the top positions of the result list, the rank by attribute can place many marginally relevant results at the head of the list, leading to a poor user experience. In the past, this issue has been addressed by investigating the relevance-aware filtering problem, which asks to select the subset of results maximizing the relevance of the attribute-sorted list. Recently, an exact algorithm has been proposed to solve this problem optimally. However, the high computational cost of the algorithm makes it impractical for the Web search scenario, which is characterized by huge lists of results and strict time constraints. For this reason, the problem is often solved using efficient yet inaccurate heuristic algorithms. In this paper, we first prove the performance bounds of the existing heuristics. We then propose two efficient and effective algorithms to solve the relevance-aware filtering problem. First, we propose OPT-Filtering, a novel exact algorithm that is faster than the existing state-of-the-art optimal algorithm. Second, we propose an approximate and even more efficient algorithm, ε-Filtering, which, given an allowed approximation error ε, finds a (1-ε)-optimal filtering, i.e., the relevance of its solution is at least (1-ε) times the optimum. We conduct a comprehensive evaluation of the two proposed algorithms against state-of-the-art competitors on two real-world public datasets. Experimental results show that OPT-Filtering achieves a significant speedup of up to two orders of magnitude with respect to the existing optimal solution, while ε-Filtering further improves this result by trading effectiveness for efficiency. In particular, experiments show that ε-Filtering can achieve quasi-optimal solutions while being faster than all state-of-the-art competitors in most of the tested configurations.
Source: ACM transactions on information systems 40 (2021). doi:10.1145/3477982
DOI: 10.1145/3477982
See at: ISTI Repository Open Access | dl.acm.org Restricted | CNR ExploRA