2021 Contribution to conference Open Access

Compressed indexes for fast search of semantic data
Perego R., Pibiri G. E., Venturini R.
The sheer increase in volume of RDF data demands efficient solutions for the triple indexing problem, that is, devising a compressed data structure that compactly represents RDF triples while guaranteeing fast pattern matching operations. This problem lies at the heart of delivering good practical performance for the resolution of complex SPARQL queries on large RDF datasets. We propose a trie-based index layout to solve the problem and introduce two novel techniques to reduce its space of representation for improved effectiveness. The extensive experimental analysis reveals that our best space/time trade-off configuration substantially outperforms existing state-of-the-art solutions, taking 30-60% less space and speeding up query execution by a factor of 2-81.
Source: IEEE International Conference on Data Engineering (ICDE), 19-22/04/2021
Project(s): BigDataGrapes

See at: ISTI Repository | CNR ExploRA

2021 Contribution to conference Open Access

Cloud and Data Federation in MobiDataLab
Carlini E., Dazzi P., Lettich F., Perego R., Renso C.
Today's innovative digital services dealing with the mobility of persons and goods produce huge amounts of data. To propose advanced and efficient mobility services, the collection and aggregation of new sources of data from various producers are necessary. The overall objective of the MobiDataLab H2020 project is to propose to the mobility stakeholders (transport organising authorities, operators, industry, government and innovators) reproducible methodologies and sustainable tools that foster the development of a data-sharing culture in Europe and beyond. This short paper introduces the key concepts driving the design and definition of the Cloud and Data Federation that stands at the basis of MobiDataLab.
Source: FRAME'21 - 1st Workshop on Flexible Resource and Application Management on the Edge, Virtual Event, Sweden, 25/06/2021
DOI: 10.1145/3452369.3463819
Project(s): ACCORDION

See at: ISTI Repository | CNR ExploRA

2021 Contribution to conference Open Access

Advances in Information Retrieval. 43rd European Conference on IR Research, ECIR 2021. Proceedings
Hiemstra D., Moens M. F., Mothe J., Perego R., Potthast M., Sebastiani F.
This two-volume set LNCS 12656 and 12657 constitutes the refereed proceedings of the 43rd European Conference on IR Research, ECIR 2021, held virtually in March/April 2021, due to the COVID-19 pandemic. The 50 full papers presented together with 11 reproducibility papers, 39 short papers, 15 demonstration papers, 12 CLEF lab description papers, 5 doctoral consortium papers, 5 workshop abstracts, and 8 tutorial abstracts were carefully reviewed and selected from 436 submissions. The accepted contributions cover the state of the art in IR: deep learning-based information retrieval techniques, use of entities and knowledge graphs, recommender systems, retrieval methods, information extraction, question answering, topic and prediction models, multimedia retrieval, and much more.
DOI: 10.1007/978-3-030-72240-1

See at: ISTI Repository | CNR ExploRA

2021 Contribution to journal Open Access

Report on the 43rd European Conference on Information Retrieval (ECIR 2021)
Perego R., Sebastiani F.
Source: SIGIR forum 55 (2021).

See at: ISTI Repository | CNR ExploRA | sigir.org

2020 Journal article Restricted

Leveraging feature selection to detect potential tax fraudsters
Matos T., Macedo J. A., Lettich F., Monteiro J. M., Renso C., Perego R., Nardini F. M.
Tax evasion is any act that knowingly or unknowingly, legally or unlawfully, leads to non-payment or underpayment of tax due. Enforcing the correct payment of taxes by taxpayers is fundamental in maintaining investments that are necessary and benefit society as a whole. Indeed, without taxes it is not possible to guarantee basic services such as health-care, education, sanitation, transportation, infrastructure, among other services essential to the population. This issue is especially relevant in developing countries such as Brazil. In this work we consider a real-world case study involving the Treasury Office of the State of Ceará (SEFAZ-CE, Brazil), the agency in charge of supervising more than 300,000 active taxpayer companies. SEFAZ-CE maintains a very large database containing vast amounts of information concerning such companies. Its enforcement team struggles to perform thorough inspections of taxpayers' accounts, as the underlying traditional human-based inspection processes involve the evaluation of countless fraud indicators (i.e., binary features), thus requiring burdensome amounts of time and being potentially prone to human errors. On the other hand, the vast amount of taxpayer information collected by fiscal agencies opens up the possibility of devising novel techniques able to tackle fiscal evasion much more effectively than traditional approaches. In this work we address the problem of using feature selection to select the most relevant binary features to improve the classification of potential tax fraudsters. Identifying possible fraudsters from taxpayer data with binary features presents several challenges. First, taxpayer data typically have features with low linear correlation between themselves. Also, tax frauds may originate from intricate illicit tactics, which in turn requires uncovering non-linear relationships between multiple features. Finally, few features may be correlated with the targeted class.
In this work we propose Alicia, a new feature selection method based on association rules and propositional logic with a carefully crafted graph centrality measure that attempts to tackle the above challenges while, at the same time, being agnostic to specific classification techniques. Alicia is structured in three phases: first, it generates a set of relevant association rules from a set of fraud indicators (features). Subsequently, from such association rules Alicia builds a graph, whose structure is then used to determine the most relevant features. To achieve this Alicia applies a novel centrality measure we call the Feature Topological Importance. We perform an extensive experimental evaluation to assess the validity of our proposal on four different real-world datasets, where we compare our solution with eight other feature selection methods. The results show that Alicia achieves F-measure scores up to 76.88%, and consistently outperforms its competitors.
Source: Expert systems with applications 145 (2020). doi:10.1016/j.eswa.2019.113128
DOI: 10.1016/j.eswa.2019.113128

2020 Journal article Open Access

Topical result caching in web search engines
Mele I., Tonellotto N., Frieder O., Perego R.
Caching search results is employed in information retrieval systems to expedite query processing and reduce back-end server workload. Motivated by the observation that queries belonging to different topics have different temporal-locality patterns, we investigate a novel caching model called STD (Static-Topic-Dynamic cache), a refinement of the traditional SDC (Static-Dynamic Cache) that stores in a static cache the results of popular queries and manages the dynamic cache with a replacement policy for intercepting the temporal variations in the query stream. Our proposed caching scheme includes another layer for topic-based caching, where the entries are allocated to different topics (e.g., weather, education). The results of queries characterized by a topic are kept in the fraction of the cache dedicated to it. This makes it possible to adapt the cache-space utilization to the temporal locality of the various topics and reduces cache misses due to those queries that are neither sufficiently popular to be in the static portion nor requested within short-time intervals to be in the dynamic portion. We simulate different configurations for STD using two real-world query streams. Experiments demonstrate that our approach outperforms SDC with an increase of up to 3% in terms of hit rates, and up to 36% of gap reduction w.r.t. SDC from the theoretical optimal caching algorithm.
Source: Information processing & management 57 (2020): 1–21. doi:10.1016/j.ipm.2019.102193
DOI: 10.1016/j.ipm.2019.102193
Project(s): BigDataGrapes

2020 Journal article Open Access

Compressed Indexes for Fast Search of Semantic Data
Pibiri G. E., Perego R., Venturini R.
The sheer increase in volume of RDF data demands efficient solutions for the triple indexing problem, that is, devising a compressed data structure that compactly represents RDF triples while guaranteeing fast pattern matching operations. This problem lies at the heart of delivering good practical performance for the resolution of complex SPARQL queries on large RDF datasets. In this work, we propose a trie-based index layout to solve the problem and introduce two novel techniques to reduce its space of representation for improved effectiveness. The extensive experimental analysis, conducted over a wide range of publicly available real-world datasets, reveals that our best space/time trade-off configuration substantially outperforms existing state-of-the-art solutions, taking 30-60% less space and speeding up query execution by a factor of 2-81×.
Source: IEEE transactions on knowledge and data engineering (Print) (2020): 1–11. doi:10.1109/TKDE.2020.2966609
DOI: 10.1109/tkde.2020.2966609
Project(s): BigDataGrapes

2020 Report Open Access

Dynamic hard pruning of neural networks at the edge of the internet
Valerio L., Nardini F. M., Passarella A., Perego R.
Neural Networks (NN), although successfully applied to several Artificial Intelligence tasks, are often unnecessarily over-parametrized. In fog/edge computing, this might make their training prohibitive on resource-constrained devices, contrasting with the current trend of decentralising intelligence from remote data-centres to local constrained devices. Therefore, we investigate the problem of training effective NN models on constrained devices having a fixed, potentially small, memory budget. We target techniques that are both resource-efficient and performance effective while enabling significant network compression. Our technique, called Dynamic Hard Pruning (DynHP), incrementally prunes the network during training, identifying neurons that marginally contribute to the model accuracy. DynHP enables a tunable size reduction of the final neural network and reduces the NN memory occupancy during training. Freed memory is reused by a dynamic batch sizing approach to counterbalance the accuracy degradation caused by the hard pruning strategy, improving its convergence and effectiveness. We assess the performance of DynHP through reproducible experiments on two public datasets, comparing them against reference competitors. Results show that DynHP compresses a NN up to times without significant performance drops (up to relative error wrt competitors), reducing up to the training memory occupancy.
Source: IIT TR-21/2020 and ISTI Technical Reports 2020/016, 2020
DOI: 10.32079/isti-tr-2020/016
Project(s): BigDataGrapes

See at: ISTI Repository | CNR ExploRA

2020 Journal article Restricted

Weighting passages enhances accuracy
Muntean C. I., Nardini F. M., Perego R., Tonellotto N., Frieder O.
We observe that in curated documents the distribution of the occurrences of salient terms, e.g., terms with a high Inverse Document Frequency, is not uniform, and such terms are primarily concentrated towards the beginning and the end of the document. Exploiting this observation, we propose a novel version of the classical BM25 weighting model, called BM25 Passage (BM25P), which scores query results by computing a linear combination of term statistics in the different portions of the document. We study a multiplicity of partitioning schemes of document content into passages and compute the collection-dependent weights associated with them on the basis of the distribution of occurrences of salient terms in documents. Moreover, we tune BM25P hyperparameters and investigate their impact on ad hoc document retrieval through fully reproducible experiments conducted using four publicly available datasets. Our findings demonstrate that our BM25P weighting model markedly and consistently outperforms BM25 in terms of effectiveness by up to 17.44% in NDCG@5 and 85% in NDCG@1, and up to 21% in MRR.
Source: ACM transactions on information systems 39 (2020). doi:10.1145/3428687
DOI: 10.1145/3428687

2020 Conference article Open Access

Efficient document re-ranking for transformers by precomputing term representations
Macavaney S., Nardini F. M., Perego R., Tonellotto N., Goharian N., Frieder O.
Deep pretrained transformer networks are effective at various ranking tasks, such as question answering and ad-hoc document ranking. However, their computational expense makes them cost-prohibitive in practice. Our proposed approach, called PreTTR (Precomputing Transformer Term Representations), considerably reduces the query-time latency of deep transformer networks (up to a 42x speedup on web document ranking), making these networks more practical to use in a real-time ranking scenario. Specifically, we precompute part of the document term representations at indexing time (without a query), and merge them with the query representation at query time to compute the final ranking score. Due to the large size of the token representations, we also propose an effective approach to reduce the storage requirement by training a compression layer to match attention scores. Our compression technique reduces the storage required by up to 95% and can be applied without a substantial degradation in ranking performance.
Source: 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 49–58, online, 25-30 July, 2020
DOI: 10.1145/3397271.3401093
Project(s): BigDataGrapes

2020 Conference article Open Access

Training curricula for open domain answer re-ranking
Macavaney S., Nardini F. M., Perego R., Tonellotto N., Goharian N., Frieder O.
DOI: 10.1145/3397271.3401094
Project(s): BigDataGrapes

2020 Conference article Open Access

Expansion via prediction of importance with contextualization
Macavaney S., Nardini F. M., Perego R., Tonellotto N., Goharian N., Frieder O.
The identification of relevance with little textual context is a primary challenge in passage retrieval. We address this problem with a representation-based ranking approach that: (1) explicitly models the importance of each term using a contextualized language model; (2) performs passage expansion by propagating the importance to similar terms; and (3) grounds the representations in the lexicon, making them interpretable. Passage representations can be pre-computed at index time to reduce query-time latency. We call our approach EPIC (Expansion via Prediction of Importance with Contextualization). We show that EPIC significantly outperforms prior importance-modeling and document expansion approaches. We also observe that the performance is additive with the current leading first-stage retrieval methods, further narrowing the gap between inexpensive and cost-prohibitive passage ranking approaches. Specifically, EPIC achieves an MRR@10 of 0.304 on the MS-MARCO passage ranking dataset with 78ms average query latency on commodity hardware. We also find that the latency is further reduced to 68ms by pruning document representations, with virtually no difference in effectiveness.
Source: 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1573–1576, online, 25-30 July, 2020
DOI: 10.1145/3397271.3401262
Project(s): BigDataGrapes

2020 Conference article Open Access

Query-level early exit for additive learning-to-rank ensembles
Lucchese C., Nardini F. M., Orlando S., Perego R., Trani S.
Search engine ranking pipelines are commonly based on large ensembles of machine-learned decision trees. The tight constraints on query response time recently motivated researchers to investigate algorithms that speed up the traversal of the additive ensemble or terminate early the evaluation of documents that are unlikely to be ranked among the top-k. In this paper, we investigate the novel problem of query-level early exiting, aimed at deciding the profitability of early stopping the traversal of the ranking ensemble for all the candidate documents to be scored for a query, by simply returning a ranking based on the additive scores computed by a limited portion of the ensemble. Besides the obvious advantage on query latency and throughput, we address the possible positive impact on ranking effectiveness. To this end, we study the actual contribution of incremental portions of the tree ensemble to the ranking of the top-k documents scored for a given query. Our main finding is that queries exhibit different behaviors as scores are accumulated during the traversal of the ensemble and that query-level early stopping can remarkably improve ranking quality. We present a reproducible and comprehensive experimental evaluation, conducted on two public datasets, showing that query-level early exiting achieves an overall gain of up to 7.5% in terms of NDCG@10 with a speedup of the scoring process of up to 2.2x.
Source: 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2033–2036, online, 25-30 July, 2020
DOI: 10.1145/3397271.3401256
Project(s): BigDataGrapes

2020 Conference article Open Access

Topic propagation in conversational search
Mele I., Muntean C. I., Nardini F. M., Perego R., Tonellotto N., Frieder O.
In a conversational context, a user expresses her multi-faceted information need as a sequence of natural-language questions, i.e., utterances. Starting from a given topic, the conversation evolves through user utterances and system replies. The retrieval of documents relevant to a given utterance in a conversation is challenging due to ambiguity of natural language and to the difficulty of detecting possible topic shifts and semantic relationships among utterances. We adopt the 2019 TREC Conversational Assistant Track (CAsT) framework to experiment with a modular architecture performing: (i) topic-aware utterance rewriting, (ii) retrieval of candidate passages for the rewritten utterances, and (iii) neural-based re-ranking of candidate passages. We present a comprehensive experimental evaluation of the architecture assessed in terms of traditional IR metrics at small cutoffs. Experimental results show the effectiveness of our techniques that achieve an improvement of up to 0.28 (+93%) for P@1 and 0.19 (+89.9%) for nDCG@3 w.r.t. the CAsT baseline.
Source: SIGIR 2020 - 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2057–2060, Online Conference, July 25-30, 2020
DOI: 10.1145/3397271.3401268
Project(s): BigDataGrapes

2020 Journal article Open Access

RankEval: Evaluation and investigation of ranking models
Lucchese C., Muntean C. I., Nardini F. M., Perego R., Trani S.
RankEval is an open-source Python tool for the analysis and evaluation of ranking models based on ensembles of decision trees. Learning-to-Rank (LtR) approaches that generate tree-ensembles are considered the most effective solution for difficult ranking tasks, and several impactful LtR libraries have been developed aimed at improving ranking quality and training efficiency. However, these libraries are not very helpful in terms of hyper-parameter tuning and in-depth analysis of the learned models, and even the implementations of the most popular Information Retrieval (IR) metrics differ among them, making it difficult to compare different models. RankEval overcomes these limitations by providing a unified environment in which to perform an easy, comprehensive inspection and assessment of ranking models trained using different machine learning libraries. The tool focuses on ensuring efficiency, flexibility and extensibility and is fully interoperable with the most popular LtR libraries.
Source: Softwarex (Amsterdam) 12 (2020). doi:10.1016/j.softx.2020.100614
DOI: 10.1016/j.softx.2020.100614
Project(s): BigDataGrapes

2019 Journal article Open Access

Parallel Traversal of Large Ensembles of Decision Trees
Lettich F., Lucchese C., Nardini F. M., Orlando S., Perego R., Tonellotto N., Venturini R.
Machine-learnt models based on additive ensembles of regression trees are currently deemed the best solution to address complex classification, regression, and ranking tasks. The deployment of such models is computationally demanding: to compute the final prediction, the whole ensemble must be traversed by accumulating the contributions of all its trees. In particular, traversal cost impacts applications where the number of candidate items is large, the time budget available to apply the learnt model to them is limited, and the users' expectations in terms of quality-of-service are high. Document ranking in web search, where sub-optimal ranking models are deployed to find a proper trade-off between efficiency and effectiveness of query answering, is probably the most typical example of this challenging issue. This paper investigates multi/many-core parallelization strategies for speeding up the traversal of large ensembles of regression trees, thus obtaining machine-learnt models that are, at the same time, effective, fast, and scalable. Our best results are obtained by the GPU-based parallelization of the state-of-the-art algorithm, with speedups of up to 102.6x.
Source: IEEE transactions on parallel and distributed systems (Print) 30 (2019): 2075–2089. doi:10.1109/TPDS.2018.2860982
DOI: 10.1109/tpds.2018.2860982
DOI: 10.5281/zenodo.2668379
DOI: 10.5281/zenodo.2668378
Project(s): BigDataGrapes

2019 Journal article Embargo

Speed prediction in large and dynamic traffic sensor networks
Magalhaes R. P., Lettich F., Macedo J. A., Nardini F. M., Perego R., Renso C., Trani R.
Smart cities are nowadays equipped with pervasive networks of sensors that monitor traffic in real-time and record huge volumes of traffic data. These datasets constitute a rich source of information that can be used to extract knowledge useful for municipalities and citizens. In this paper we are interested in exploiting such data to estimate future speed in traffic sensor networks, as accurate predictions have the potential to enhance decision making capabilities of traffic management systems. Building effective speed prediction models in large cities poses important challenges that stem from the complexity of traffic patterns, the number of traffic sensors typically deployed, and the evolving nature of sensor networks. Indeed, sensors are frequently added to monitor new road segments or replaced/removed due to different reasons (e.g., maintenance). Exploiting a large number of sensors for effective speed prediction thus requires smart solutions to collect vast volumes of data and train effective prediction models. Furthermore, the dynamic nature of real-world sensor networks calls for solutions that are resilient not only to changes in traffic behavior, but also to changes in the network structure, where the cold start problem represents an important challenge. We study three different approaches in the context of large and dynamic sensor networks: local, global, and cluster-based. The local approach builds a specific prediction model for each sensor of the network. Conversely, the global approach builds a single prediction model for the whole sensor network. Finally, the cluster-based approach groups sensors into homogeneous clusters and generates a model for each cluster. We provide a large dataset, generated from ~1.3 billion records collected by up to 272 sensors deployed in Fortaleza, Brazil, and use it to experimentally assess the effectiveness and resilience of prediction models built according to the three aforementioned approaches. 
The results show that the global and cluster-based approaches provide very accurate prediction models that prove to be robust to changes in traffic behavior and in the structure of sensor networks.
Source: Information systems (Oxf.) 98 (2019). doi:10.1016/j.is.2019.101444
DOI: 10.1016/j.is.2019.101444
Project(s): BigDataGrapes, MASTER

2019 Patent Open Access

Cache optimization via topics in web search engines
Frieder O., Mele I., Perego R., Tonellotto N.
Source: US 10503792, International, 2019

2019 Journal article Open Access

Event attendance classification in social media
De Lira V. M., Macdonald C., Ounis I., Perego R., Renso C., Cesario Times V.
Popular events are well reflected on social media, where people share their feelings and discuss their experiences. In this paper, we investigate the novel problem of exploiting the content of non-geotagged posts on social media to infer the users' attendance of large events in three temporal periods: before, during and after an event. We detail the features used to train event attendance classifiers and report on experiments conducted on data from two large music festivals in the UK, namely the VFestival and Creamfields events. Our classifiers attain very high accuracy, with the highest result observed for the Creamfields festival (approximately 91% accuracy at classifying users that will participate in the event). We study the most informative features for the tasks addressed and the generalization of the learned models across different events. Finally, we discuss an illustrative application of the methodology in the field of transportation.
Source: Information processing & management 56 (2019): 687–703. doi:10.1016/j.ipm.2018.11.001
DOI: 10.1016/j.ipm.2018.11.001

2019 Journal article Open Access

Boosting learning to rank with user dynamics and continuation methods
Ferro N., Lucchese C., Maistro M., Perego R.
Learning to rank (LtR) techniques leverage assessed samples of query-document relevance to learn effective ranking functions able to exploit the noisy signals hidden in the features used to represent queries and documents. In this paper we explore how to enhance the state-of-the-art LambdaMart LtR algorithm by integrating in the training process an explicit knowledge of the underlying user-interaction model and the possibility of targeting different objective functions that can effectively drive the algorithm towards promising areas of the search space. We enrich the iterative process followed by the learning algorithm in two ways: (1) by considering complex query-based user dynamics rather than simply discounting the gain by the rank position; (2) by designing a learning path across different loss functions that can capture different signals in the training data. Our extensive experiments, conducted on publicly available datasets, show that the proposed solution improves various ranking quality measures by statistically significant margins.
Source: Information retrieval (Boston) 23 (2019): 528–554. doi:10.1007/s10791-019-09366-9
DOI: 10.1007/s10791-019-09366-9