111 result(s)
2026 Journal article Open Access OPEN
Projection-displacement-based query performance prediction for embedded space of dense retrievers
Datta Suchana, Faggioli Guglielmo, Ferro Nicola, Ganguly Debasis, Muntean Cristina Ioana, Perego Raffaele, Tonellotto Nicola
Recent advances in representation learning have enabled neural Information Retrieval (IR) systems to use learned dense representations for queries and documents to effectively handle semantics, language nuances, and vocabulary mismatch problems. In contrast to traditional IR systems that rely on word matching, dense IR models exploit query/document similarity in dense latent spaces to account for semantics. This requires substantial training data and comes with increased computational demands. Thus, it would be beneficial to predict how a system will perform for a given query to decide whether a dense IR model is the best option or alternatives should be used. Traditional Query Performance Prediction (QPP) models are designed for lexical IR approaches and perform sub-optimally when applied to dense neural IR systems. Therefore, there has been a renewed interest in QPP methods to improve their effectiveness for dense neural IR models. While the results of the new QPP methods are generally encouraging, there is ample room for improvement in absolute performance and stability. We argue that by using features more aligned with the underlying rationale of dense IR models, we can enhance the performance of QPP. In this respect, we propose the Projection-Displacement-Based QPP (PDQPP), which exploits the geometric properties of dense IR models, projects queries and retrieved documents onto subspaces defined by pseudo-relevant documents, and considers changes in retrieval scores within them as a proxy for retrieval coherence. Minor score changes suggest robust and coherent retrieval, while significant alterations indicate semantic divergence and potentially poor performance. Results over a wide range of experimental settings on both traditional (TREC Robust) and neural-oriented (TREC Deep Learning) test collections show that PDQPP mostly outperforms the state-of-the-art QPP baselines.
Source: ACM TRANSACTIONS ON INFORMATION SYSTEMS, vol. 44 (issue 1), pp. 1-30
DOI: 10.1145/3765617

See at: dl.acm.org Open Access | CNR IRIS Open Access | ACM Transactions on Information Systems Restricted | CNR IRIS Restricted


2025 Conference article Open Access OPEN
Maybe you are looking for CroQS: Cross-Modal Query Suggestion for text-to-image retrieval
Pacini G., Carrara F., Messina N., Tonellotto N., Amato G., Falchi F.
Query suggestion, a technique widely adopted in information retrieval, enhances system interactivity and the browsing experience of document collections. In cross-modal retrieval, many works have focused on retrieving relevant items from natural language queries, while few have explored query suggestion solutions. In this work, we address query suggestion in cross-modal retrieval, introducing a novel task that focuses on suggesting minimal textual modifications needed to explore visually consistent subsets of the collection, following the premise of “Maybe you are looking for”. To facilitate the evaluation and development of methods, we present a tailored benchmark named CroQS. This dataset comprises initial queries, grouped result sets, and human-defined suggested queries for each group. We establish dedicated metrics to rigorously evaluate the performance of various methods on this task, measuring representativeness, cluster specificity, and similarity of the suggested queries to the original ones. Baseline methods from related fields, such as image captioning and content summarization, are adapted for this task to provide reference performance scores. Although relatively far from human performance, our experiments reveal that both LLM-based and captioning-based methods achieve competitive results on CroQS, improving the recall on cluster specificity by more than 115% and representativeness mAP by more than 52% with respect to the initial query. The dataset, the implementation of the baseline methods and the notebooks containing our experiments are available here: paciosoft.com/CroQS-benchmark/.
Source: LECTURE NOTES IN COMPUTER SCIENCE, vol. 15573, pp. 138-152. Lucca, Italy, April 6–10, 2025
DOI: 10.1007/978-3-031-88711-6_9
Project(s): Future Artificial Intelligence Research, a MUltimedia platform for Content Enrichment and Search in audiovisual archives

See at: CNR IRIS Open Access | link.springer.com Open Access | CNR IRIS Restricted | CNR IRIS Restricted


2025 Book Open Access OPEN
Early-exit graph neural networks
Di Francesco A. G., Bucarelli M. S., Nardini F. M., Perego R., Tonellotto N., Silvestri F.
Early-exit mechanisms allow deep neural networks to stop inference once prediction confidence is high, reducing latency and energy on easy inputs while retaining full-depth accuracy on harder ones. Similarly, adding early-exit mechanisms to Graph Neural Networks (GNNs), the go-to models for graph-structured data, allows dynamically trading depth for confidence on simple graphs while maintaining full-depth accuracy on harder ones to capture intricate relationships. Yet, their potential in deep GNNs, where over-smoothing, over-squashing or, more generally, vanishing gradients prevent these models from learning properly, remains largely unexplored. To address this, we introduce Symmetric-Anti-Symmetric GNNs (SAS-GNN), whose symmetry-based inductive biases yield stable intermediate representations that support safe early exits. Building on this backbone, we propose Early-Exit GNNs (EEGNNs), which attach confidence-aware neural exit heads that are trainable end-to-end based on the task objective, enabling on-the-fly termination at node or graph level. Experiments show that EEGNNs learn task-driven exit strategies, while achieving competitive results on heterophilic graphs and long-range tasks. Even when not outperforming the strongest baselines, EEGNNs consistently deliver favorable accuracy-efficiency trade-offs thanks to their adaptive and parameter-efficient design. We plan to release the code to reproduce our experiments.
DOI: 10.48550/arxiv.2505.18088

See at: arXiv.org e-Print Archive Open Access | CNR IRIS Open Access | doi.org Restricted | CNR IRIS Restricted


2025 Conference article Open Access OPEN
Breaking the 2D dependency: what limits 3D-only open-vocabulary scene understanding
D’orsi D., Carrara F., Falchi F., Tonellotto N.
Open-vocabulary 3D scene understanding, i.e., recognizing and classifying objects in 3D scenes without being limited to a predefined set of classes, is a foundational task for robotics and extended reality applications. Current leading methods often rely on 2D foundation models to extract semantics, which are then projected into 3D. This paper investigates the viability of a purely 3D-native pipeline, thereby eliminating dependencies on 2D models and reprojections. We systematically explored various architectural combinations using established 3D components. However, our extensive experiments on benchmark datasets reveal significant performance limitations with this direct 3D-native approach, with performance metrics falling short of expectations. Rather than a simple failure, these outcomes provide critical insights into the current deficiencies of existing 3D models when cascaded for complex open-vocabulary tasks. We highlight the lessons learned, identify the pipeline's limitations (e.g., segmenter-encoder domain gap, robustness to imperfect segmentations), and posit future research directions. We argue that a fundamental rethinking of model design and interplay is necessary to realize the potential of truly 3D-native open-vocabulary understanding.
Source: PROCEEDINGS INTERNATIONAL WORKSHOP ON CONTENT-BASED MULTIMEDIA. Dublin, Ireland, 22-24 October 2025
DOI: 10.1109/cbmi66578.2025.11339286
Project(s): Social and Human Centered XR

See at: CNR IRIS Open Access | ieeexplore.ieee.org Open Access | CNR IRIS Restricted | CNR IRIS Restricted


2024 Patent Restricted
Caching historical embeddings in conversational search
Frieder O., Mele I., Muntean C., Nardini F. M., Perego R., Tonellotto N.
A method and system are described for improving the speed and efficiency of obtaining conversational search results. A user may speak a phrase to perform a conversational search or a series of phrases to perform a series of searches. These spoken phrases may be enriched by context and then converted into a query embedding. A similarity between the query embedding and document embeddings is used to determine the search results including a query cutoff number of documents and a cache cutoff number of documents. A second search phrase may use the cache of documents along with comparisons of the returned documents and the first query embedding to determine the quality of the cache for responding to the second search query. If the results are high-quality then the search may proceed much more rapidly by applying the second query only to the cached documents rather than to the server.

See at: CNR IRIS Restricted | CNR IRIS Restricted


2024 Conference article Open Access OPEN
Learning to rank for non independent and identically distributed datasets
Cecchetti J., Tonellotto N., Perego R.
With growing data privacy concerns, federated machine learning algorithms capable of preserving the confidentiality of sensitive information while enabling collaborative model training across decentralized data sources are attracting increasing interest. In this paper, we address the problem of collaboratively learning effective ranking models from non-independently and identically distributed (non-IID) training data owned by distinct search clients. We assume that the learning agents cannot access each other's data, and that the models learned from local datasets might be biased or underperforming due to a skewed distribution of certain document features or query topics in the learning-to-rank training data. Thus, we aim to instill in the local ranking model learned from local data the knowledge from other models to obtain a more robust ranker capable of effectively handling documents and queries underrepresented in the local collection. To achieve this, we explore different methods for merging the ranking models, thus obtaining in each client a model that excels in ranking documents from the local data distribution but also performs well on queries retrieving documents having distributions typical of a partner's node. In particular, our findings suggest that by relying on a linear combination of the local models, we can improve IR model effectiveness by up to +17.92% in NDCG@10 (moving from 0.619 to 0.730), and by up to +19.64% in MAP (moving from 0.713 to 0.853).
DOI: 10.1145/3664190.3672513
Project(s): EFRA via OpenAIRE, Future Artificial Intelligence Research

See at: IRIS Cnr Open Access | IRIS Cnr Open Access | IRIS Cnr Open Access | Archivio della Ricerca - Università di Pisa Restricted | Archivio della Ricerca - Università di Pisa Restricted | CNR IRIS Restricted


2024 Conference article Open Access OPEN
DESIRE-ME: Domain-Enhanced Supervised Information Retrieval Using Mixture-of-Experts
Kasela P., Pasi G., Perego R., Tonellotto N.
Open-domain question answering requires retrieval systems able to cope with the diverse and varied nature of questions, providing accurate answers across a broad spectrum of query types and topics. To deal with such topic heterogeneity through a unique model, we propose DESIRE-ME, a neural information retrieval model that leverages the Mixture-of-Experts framework to combine multiple specialized neural models. We rely on Wikipedia data to train an effective neural gating mechanism that classifies the incoming query and that weighs the predictions of the different domain-specific experts correspondingly. This allows DESIRE-ME to specialize adaptively in multiple domains. Through extensive experiments on publicly available datasets, we show that our proposal can effectively generalize domain-enhanced neural models. DESIRE-ME excels in handling open-domain questions adaptively, boosting the underlying state-of-the-art dense retrieval model by up to 12% in NDCG@10 and 22% in P@1.
Source: LECTURE NOTES IN COMPUTER SCIENCE, vol. 14609, pp. 111-125. Glasgow, UK, 24–28/03/2024
DOI: 10.1007/978-3-031-56060-6_8
DOI: 10.48550/arxiv.2403.13468
Project(s): EFRA via OpenAIRE

See at: arXiv.org e-Print Archive Open Access | IRIS Cnr Open Access | IRIS Cnr Open Access | IRIS Cnr Open Access | doi.org Restricted | doi.org Restricted | BOA - Bicocca Open Archive Restricted | Archivio della Ricerca - Università di Pisa Restricted | CNR IRIS Restricted | CNR IRIS Restricted


2023 Conference article Restricted
A geometric framework for query performance prediction in conversational search
Faggioli G., Ferro N., Muntean C., Perego R., Tonellotto N.
Thanks to recent advances in IR and NLP, the way users interact with search engines is evolving rapidly, with multi-turn conversations replacing traditional one-shot textual queries. Given its interactive nature, Conversational Search (CS) is one of the scenarios that can benefit the most from Query Performance Prediction (QPP) techniques. QPP for the CS domain is a relatively new field and lacks proper framing. In this study, we address this gap by proposing a framework for the application of QPP in the CS domain and use it to evaluate the performance of predictors. We characterize what it means to predict the performance in the CS scenario, where information needs are not independent queries but a series of closely related utterances. We identify three main ways to use QPP models in the CS domain: as a diagnostic tool, as a way to adjust the system's behaviour during a conversation, or as a way to predict the system's performance on the next utterance. Due to the lack of established evaluation procedures for QPP in the CS domain, we propose a protocol to evaluate QPPs for each of the use cases. Additionally, we introduce a set of spatial-based QPP models designed to work best in the conversational search domain, where dense neural retrieval models are the most common approaches and query cutoffs are typically small. We show how the proposed QPP approaches significantly improve the predictive performance over the state-of-the-art in different scenarios and collections.
DOI: 10.1145/3539618.3591625
Project(s): SoBigData-PlusPlus via OpenAIRE

See at: dl.acm.org Restricted | CNR IRIS Restricted | CNR IRIS Restricted


2023 Journal article Open Access OPEN
Artificial intelligence of things at the edge: scalable and efficient distributed learning for massive scenarios
Bano S., Tonellotto N., Cassarà P., Gotta A.
Federated Learning (FL) is a distributed optimization method in which multiple client nodes collaborate to train a machine learning model without sharing data with a central server. However, communication between numerous clients and the central aggregation server to share model parameters can cause several problems, including latency and network congestion. To address these issues, we propose a scalable communication infrastructure based on Information-Centric Networking built and tested on Apache Kafka®. The proposed architecture consists of a two-tier communication model. In the first layer, client updates are cached at the edge between clients and the server, while in the second layer, the server computes global model updates by aggregating the cached models. The data stored in the intermediate nodes at the edge enables reliable and effective data transmission and solves the problem of intermittent connectivity of mobile nodes. While many local model updates provided by clients can result in a more accurate global model in FL, they can also result in massive data traffic that negatively impacts congestion at the edge. For this reason, we couple a client selection procedure based on a congestion control mechanism at the edge for the given architecture of FL. The proposed algorithm selects a subset of clients based on their resources through a time-based backoff system to account for the time-averaged accuracy of FL while limiting the traffic load. Experiments show that our proposed architecture achieves an improvement of over 40% over the network-centric FL architecture, i.e., Flower. The architecture also provides scalability and reliability in the case of mobile nodes. It also improves client resource utilization, avoids overflow, and ensures fairness in client selection. The experiments show that the proposed algorithm leads to the desired client selection patterns and is adaptable to changing network environments.
Source: COMPUTER COMMUNICATIONS, vol. 205, pp. 45-57
DOI: 10.1016/j.comcom.2023.04.010
Project(s): TEACHING via OpenAIRE

See at: CNR IRIS Open Access | ISTI Repository Open Access | Computer Communications Restricted | CNR IRIS Restricted | CNR IRIS Restricted | CNR IRIS Restricted


2020 Journal article Open Access OPEN
Topical result caching in web search engines
Mele I, Tonellotto N, Frieder O, Perego R
Caching search results is employed in information retrieval systems to expedite query processing and reduce back-end server workload. Motivated by the observation that queries belonging to different topics have different temporal-locality patterns, we investigate a novel caching model called STD (Static-Topic-Dynamic cache), a refinement of the traditional SDC (Static-Dynamic Cache) that stores in a static cache the results of popular queries and manages the dynamic cache with a replacement policy for intercepting the temporal variations in the query stream. Our proposed caching scheme includes another layer for topic-based caching, where the entries are allocated to different topics (e.g., weather, education). The results of queries characterized by a topic are kept in the fraction of the cache dedicated to it. This makes it possible to adapt the cache-space utilization to the temporal locality of the various topics and reduces cache misses due to those queries that are neither sufficiently popular to be in the static portion nor requested within short-time intervals to be in the dynamic portion. We simulate different configurations for STD using two real-world query streams. Experiments demonstrate that our approach outperforms SDC with an increase of up to 3% in terms of hit rates, and a reduction of up to 36% of the gap between SDC and the theoretical optimal caching algorithm.
Source: INFORMATION PROCESSING & MANAGEMENT, vol. 57 (issue 3), pp. 1-21
DOI: 10.1016/j.ipm.2019.102193
DOI: 10.48550/arxiv.2001.03010
Project(s): BigDataGrapes via OpenAIRE

See at: arXiv.org e-Print Archive Open Access | Information Processing & Management Open Access | CNR IRIS Open Access | ISTI Repository Open Access | www.sciencedirect.com Open Access | Information Processing & Management Restricted | doi.org Restricted | CNR IRIS Restricted | CNR IRIS Restricted


2019 Conference article Open Access OPEN
Performance analysis of WebRTC-based video streaming over power constrained platforms
Bacco M, Catena M, De Cola T, Gotta A, Tonellotto N
This work analyses the use of the Web Real-Time Communications (WebRTC) framework on resource-constrained platforms. WebRTC is a consolidated solution for real-time video streaming, and it is an appealing solution in a wide range of application scenarios. We focus our attention on those in which power consumption, size and weight are of paramount importance because of the so-called Size, Weight and Power (SWaP) requirements, such as the use case of Unmanned Aerial Vehicles (UAVs) delivering real-time video streams over WebRTC to peers on the ground. The testbed described in this work shows that the power consumption can be reduced by changing WebRTC default settings while maintaining comparable video quality.
DOI: 10.1109/glocom.2018.8647375
DOI: 10.5281/zenodo.2705728
DOI: 10.5281/zenodo.2705727
Project(s): BigDataGrapes via OpenAIRE

See at: ZENODO Open Access | ZENODO Open Access | CNR IRIS Open Access | ieeexplore.ieee.org Open Access | ISTI Repository Open Access | zenodo.org Open Access | zenodo.org Open Access | doi.org Restricted | CNR IRIS Restricted | CNR IRIS Restricted


2019 Journal article Open Access OPEN
Parallel Traversal of Large Ensembles of Decision Trees
Lettich F, Lucchese C, Nardini Fm, Orlando S, Perego R, Tonellotto N, Venturini R
Machine-learnt models based on additive ensembles of regression trees are currently deemed the best solution to address complex classification, regression, and ranking tasks. The deployment of such models is computationally demanding: to compute the final prediction, the whole ensemble must be traversed by accumulating the contributions of all its trees. In particular, traversal cost impacts applications where the number of candidate items is large, the time budget available to apply the learnt model to them is limited, and the users' expectations in terms of quality-of-service are high. Document ranking in web search, where sub-optimal ranking models are deployed to find a proper trade-off between efficiency and effectiveness of query answering, is probably the most typical example of this challenging issue. This paper investigates multi/many-core parallelization strategies for speeding up the traversal of large ensembles of regression trees, thus obtaining machine-learnt models that are, at the same time, effective, fast, and scalable. Our best results are obtained by the GPU-based parallelization of the state-of-the-art algorithm, with speedups of up to 102.6x.
Source: IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS (PRINT), vol. 30 (issue 9), pp. 2075-2089
DOI: 10.1109/tpds.2018.2860982
DOI: 10.5281/zenodo.2668378
DOI: 10.5281/zenodo.2668379
Project(s): BigDataGrapes via OpenAIRE

See at: IEEE Transactions on Parallel and Distributed Systems Open Access | ZENODO Open Access | ZENODO Open Access | Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari Open Access | CNR IRIS Open Access | ieeexplore.ieee.org Open Access | ISTI Repository Open Access | IEEE Transactions on Parallel and Distributed Systems Restricted | CNR IRIS Restricted | CNR IRIS Restricted


2019 Patent Open Access OPEN
Cache optimization via topics in web search engines
Frieder O, Mele I, Perego R, Tonellotto N

See at: CNR IRIS Open Access | ISTI Repository Open Access | patentimages.storage.googleapis.com Open Access | CNR IRIS Restricted


2019 Conference article Open Access OPEN
Enhanced news retrieval: passages lead the way!
Catena M., Nardini F. M., Frieder O., Perego R., Muntean Cristina Ioana, Tonellotto N.
We observe that most relevant terms in unstructured news articles are primarily concentrated towards the beginning and the end of the document. Exploiting this observation, we propose a novel version of the classical BM25 weighting model, called BM25 Passage (BM25P), which scores query results by computing a linear combination of term statistics in the different portions of news articles. Our experimentation, conducted using three publicly available news datasets, demonstrates that BM25P markedly outperforms BM25 in terms of effectiveness by up to 17.44% in NDCG@5 and 85% in NDCG@1.
DOI: 10.1145/3331184.3331373
Project(s): BIGDATAGRAPES

See at: dl.acm.org Open Access | CNR IRIS Open Access | doi.org Restricted | CNR IRIS Restricted


2019 Conference article Closed Access
Multiple query processing via logic function factoring
Catena M, Tonellotto N
Some extensions to search systems require support for multiple query processing. This is the case with query variations, i.e., different query formulations of the same information need. The results of their processing can be fused together to improve effectiveness, but this requires traversing the query terms' posting lists more than once, thus prolonging the multiple query processing time. In this work, we propose an approach to optimize the processing of query variations to reduce their overall response time. Similarly to the standard Boolean model, we first represent a group of query variations as a logic function where Boolean variables represent query terms. We then apply factoring to this function in order to produce a more compact but logically equivalent representation. The factored form is used to process the query variations in a single pass over the inverted index. We experimentally show that our approach can improve by up to 1.95× the mean processing time of a multiple query with no statistically significant degradation in terms of NDCG@10.
DOI: 10.1145/3331184.3331297
Project(s): BigDataGrapes via OpenAIRE

See at: dl.acm.org Restricted | doi.org Restricted | CNR IRIS Restricted


2018 Journal article Open Access OPEN
Dataset popularity prediction for caching of CMS big data
Meoni M, Perego R, Tonellotto N
The Compact Muon Solenoid (CMS) experiment at the European Organization for Nuclear Research (CERN) deploys its data collections, simulation and analysis activities on a distributed computing infrastructure involving more than 70 sites worldwide. The historical usage data recorded by this large infrastructure is a rich source of information for system tuning and capacity planning. In this paper we investigate how to leverage machine learning on this huge amount of data in order to discover patterns and correlations useful to enhance the overall efficiency of the distributed infrastructure in terms of CPU utilization and task completion time. In particular we propose a scalable pipeline of components built on top of the Spark engine for large-scale data processing, whose goal is collecting from different sites the dataset access logs, organizing them into weekly snapshots, and training, on these snapshots, predictive models able to forecast which datasets will become popular over time. The high accuracy achieved indicates the ability of the learned model to correctly separate popular datasets from unpopular ones. Dataset popularity predictions are then exploited within a novel data caching policy, called PPC (Popularity Prediction Caching). We evaluate the performance of PPC against popular caching policy baselines like LRU (Least Recently Used). The experiments conducted on large traces of real dataset accesses show that PPC outperforms LRU, reducing the number of cache misses by up to 20% at some sites.
Source: JOURNAL OF GRID COMPUTING, vol. 16 (issue 2), pp. 211-228
DOI: 10.1007/s10723-018-9436-4

See at: CNR IRIS Open Access | link.springer.com Open Access | ISTI Repository Open Access | Journal of Grid Computing Restricted | CNR IRIS Restricted | CNR IRIS Restricted


2018 Conference article Open Access OPEN
Efficient energy management in distributed web search
Catena M, Frieder O, Tonellotto N
Distributed Web search engines (WSEs) require warehouse-scale computers to deal with the ever-increasing size of the Web and the large number of user queries they receive daily. The energy consumption of this infrastructure has a major impact on the economic profitability of WSEs. Recently several approaches to reduce the energy consumption of WSEs have been proposed. Such solutions leverage dynamic voltage and frequency scaling techniques in modern CPUs to adapt the WSEs' query processing to the incoming query traffic without negative impacts on latencies. A state-of-the-art research approach is the PESOS (Predictive Energy Saving Online Scheduling) algorithm, which can reduce the energy consumption of a single WSE server by up to 50%. We evaluate PESOS on a simulated distributed WSE composed of a thousand servers, and we compare its performance w.r.t. an industry-level baseline, called PEGASUS. Our results show that PESOS can reduce the CPU energy consumption of a distributed WSE by up to 18% with respect to PEGASUS, while providing query response times which are in line with user expectations.
DOI: 10.1145/3269206.3269263
DOI: 10.5281/zenodo.2710864
DOI: 10.5281/zenodo.2710863
Project(s): BigDataGrapes via OpenAIRE

See at: dl.acm.org Open Access | ZENODO Open Access | ZENODO Open Access | CNR IRIS Open Access | ISTI Repository Open Access | zenodo.org Open Access | doi.org Restricted | CNR IRIS Restricted | CNR IRIS Restricted


2018 Conference article Open Access OPEN
Efficient query processing infrastructures: A half-day tutorial at SIGIR 2018
Tonellotto N, Macdonald C
Typically, techniques that benefit effectiveness of information retrieval (IR) systems have a negative impact on efficiency. Yet, with the large scale of Web search engines, there is a need to deploy efficient query processing techniques to reduce the cost of the infrastructure required. This tutorial aims to provide a detailed overview of the infrastructure of an IR system devoted to the efficient yet effective processing of user queries. This tutorial guides the attendees through the main ideas, approaches and algorithms developed in the last 30 years in query processing. In particular, we illustrate, with detailed examples and simplified pseudo-code, the most important query processing strategies adopted in major search engines, with a particular focus on dynamic pruning techniques. Moreover, we present and discuss the state-of-the-art innovations in query processing, such as impact-sorted and blockmax indexes. We also describe how modern search engines exploit such algorithms with learning-to-rank (LtR) models to produce effective results, exploiting new approaches in LtR query processing. Finally, this tutorial introduces query efficiency predictors for dynamic pruning, and discusses their main applications to scheduling, routing, selective processing and parallelisation of query processing, as deployed by a major search engine.
DOI: 10.1145/3209978.3210191

See at: dl.acm.org Open Access | CNR IRIS Open Access | ISTI Repository Open Access | doi.org Restricted | CNR IRIS Restricted | CNR IRIS Restricted


2018 Journal article Open Access OPEN
Efficient query processing for scalable web search
Tonellotto N, Macdonald C, Ounis I
Search engines are exceptionally important tools for accessing information in today's world. In satisfying the information needs of millions of users, the effectiveness (the quality of the search results) and the efficiency (the speed at which the results are returned to the users) of a search engine are two goals that form a natural trade-off, as techniques that improve the effectiveness of the search engine can also make it less efficient. Meanwhile, search engines continue to rapidly evolve, with larger indexes, more complex retrieval strategies and growing query volumes. Hence, there is a need for the development of efficient query processing infrastructures that make appropriate sacrifices in effectiveness in order to make gains in efficiency. This survey comprehensively reviews the foundations of search engines, from index layouts to basic term-at-a-time (TAAT) and document-at-a-time (DAAT) query processing strategies, while also providing the latest trends in the literature in efficient query processing, including coherent and systematic reviews of techniques such as dynamic pruning and impact-sorted posting lists as well as their variants and optimisations. Our explanations of query processing strategies, for instance the WAND and BMW dynamic pruning algorithms, are presented with illustrative figures showing how the processing state changes as the algorithms progress. Moreover, acknowledging the recent trends in applying a cascading infrastructure within search systems, this survey describes techniques for efficiently integrating effective learned models, such as those obtained from learning-to-rank techniques. The survey also covers the selective application of query processing techniques, often achieved by predicting the response times of the search engine (known as query efficiency prediction), and making per-query tradeoffs between efficiency and effectiveness to ensure that the required retrieval speed targets can be met. Finally, the survey concludes with a summary of open directions in efficient search infrastructures, namely the use of signatures, real-time, energy-efficient and modern hardware and software architectures.
Source: FOUNDATIONS AND TRENDS IN INFORMATION RETRIEVAL, vol. 12, pp. 319-500
DOI: 10.1561/1500000057
DOI: 10.1561/9781680835434
DOI: 10.5281/zenodo.3268358
DOI: 10.5281/zenodo.3268359
Project(s): BigDataGrapes via OpenAIRE

See at: Enlighten Open Access | Archivio della Ricerca - Università di Pisa Open Access | ZENODO Open Access | Foundations and Trends® in Information Retrieval Open Access | CNR IRIS Restricted | www.nowpublishers.com Restricted


2018 Contribution to book Metadata Only Access
Popularity-based caching of CMS datasets
Meoni M, Perego R, Tonellotto N
The distributed monitoring infrastructure of the Compact Muon Solenoid (CMS) experiment at the European Organization for Nuclear Research (CERN) records a broad variety of computing and storage logs on a Hadoop infrastructure. They represent a valuable source of information for system tuning and capacity planning. In this paper we analyze machine learning (ML) techniques on a large amount of traces to discover patterns and correlations useful to classify the popularity of experiment-related datasets. We implement a scalable pipeline of Spark components which collect the dataset access logs from heterogeneous monitoring sources and group them into weekly snapshots organized by CMS sites. Predictive models are trained on these snapshots and forecast which datasets will become popular over time. Dataset popularity predictions are then used to experiment with a novel strategy of data caching, called Popularity Prediction Caching (PPC). We compare the hit rates of PPC with those produced by well-known caching policies. We demonstrate that the performance improvement is as high as 20% at some sites.
DOI: 10.3233/978-1-61499-843-3-221

See at: ebooks.iospress.nl Restricted | CNR IRIS Restricted