17 result(s)
Page Size: 10, 20, 50
Export: bibtex, xml, json, csv
Order by:

CNR Author operator: and / or
Typology operator: and / or
Language operator: and / or
Date operator: and / or
Rights operator: and / or
2021 Conference article Open Access OPEN

Fast and compact set intersection through recursive universe partitioning
Pibiri G. E.
We present a data structure that encodes a sorted integer sequence in small space allowing, at the same time, fast intersection operations. The data layout is carefully designed to exploit word-level parallelism and SIMD instructions, hence providing good practical performance. The core algorithmic idea is that of recursive partitioning the universe of representation: a markedly different paradigm than the widespread strategy of partitioning the sequence based on its length. Extensive experimentation and comparison against several competitive techniques shows that the proposed solution embodies an improved space/time trade-off for the set intersection problem.Source: IEEE Data Compression Conference, Online Conference, 23-26/03/2021

See at: ISTI Repository Open Access | CNR ExploRA Open Access


2021 Journal article Open Access OPEN

Rank/select queries over mutable bitmaps
Pibiri G. E., Kanda S.
The problem of answering rank/select queries over a bitmap is of utmost importance for many succinct data structures. When the bitmap does not change, many solutions exist in the theoretical and practical side. In this work we consider the case where one is allowed to modify the bitmap via a flip(i) operation that toggles its i-th bit. By adapting and properly extending some results concerning prefix-sum data structures, we present a practical solution to the problem, tailored for modern CPU instruction sets. Compared to the state-of-the-art, our solution improves runtime with no space degradation. Moreover, it does not incur in a significant runtime penalty when compared to the fastest immutable indexes, while providing even lower space overhead.Source: Information systems (Oxf.) (2021).
Project(s): BigDataGrapes via OpenAIRE

See at: ISTI Repository Open Access | CNR ExploRA Open Access


2021 Contribution to conference Open Access OPEN

Compressed indexes for fast search of semantic data
Perego R., Pibiri G. E., Venturini R.
The sheer increase in volume of RDF data demands efficient solutions for the triple indexing problem, that is devising a compressed data structure to compactly represent RDF triples by guaranteeing, at the same time, fast pattern matching operations. This problem lies at the heart of delivering good practical performance for the resolution of complex SPARQL queries on large RDF datasets. We propose a trie-based index layout to solve the problem and introduce two novel techniques to reduce its space of representation for improved effectiveness. The extensive experimental analysis reveals that our best space/time trade-off configuration substantially outperforms existing solutions at the state-of-the-art, by taking 30-60% less space and speeding up query execution by a factor of 2-81 times.Source: ICDE 2021 - 37th IEEE International Conference on Data Engineering, pp. 2325–2326, Online conference, 19-22/04/2021
DOI: 10.1109/icde51399.2021.00248
Project(s): BigDataGrapes via OpenAIRE

See at: ISTI Repository Open Access | ieeexplore.ieee.org Restricted | CNR ExploRA Restricted


2021 Conference article Open Access OPEN

PTHash: revisiting FCH minimal perfect hashing
Pibiri G. M., Trani R.
Given a set S of n distinct keys, a function f that bijectively maps the keys of S into the range {0,...,n-1} is called a minimal perfect hash function for S. Algorithms that find such functions when n is large and retain constant evaluation time are of practical interest; for instance, search engines and databases typically use minimal perfect hash functions to quickly assign identifiers to static sets of variable-length keys such as strings. The challenge is to design an algorithm which is efficient in three different aspects: time to find f (construction time), time to evaluate f on a key of S (lookup time), and space of representation for f . Several algorithms have been proposed to trade-off between these aspects. In 1992, Fox, Chen, and Heath (FCH) presented an algorithm at SIGIR providing very fast lookup evaluation. However, the approach received little attention because of its large construction time and higher space consumption compared to other subsequent techniques. Almost thirty years later we revisit their framework and present an improved algorithm that scales well to large sets and reduces space consumption altogether, without compromising the lookup time. We conduct an extensive experimental assessment and show that the algorithm finds functions that are competitive in space with state-of-the art techniques and provide 2 - 4X better lookup time.Source: ACM Conference on Research and Development in Information Retrieval, Canada, Montreal (Virtual Event), 11-15/07/2021
DOI: 10.1145/3404835.3462849

See at: ISTI Repository Open Access | CNR ExploRA Open Access


2021 Conference article Open Access OPEN

TSXor: a simple time series compression algorithm
Bruno A., Nardini F. M., Pibiri G. E., Trani R., Venturini R.
Time series are ubiquitous in computing as a key ingredient of many machine learning analytics, ranging from classification to forecasting. Typically, the training of such machine learning algorithms on time series requires to access the data in temporal order for several times. Therefore, a compression algorithm providing good compression ratios and fast decompression speed is desirable. In this paper, we present TSXor, a simple yet effective lossless compressor for time series. The main idea is to exploit the redundancy/similarity between close-in-time values through a window that acts as a cache, as to improve the compression ratio and decompression speed. We show that TSXor achieves up to 3× better compression and up to 2× faster decompression than the state of the art on real-world datasets.Source: SPIRE 2021 - International Symposium on String Processing and Information Retrieval, pp. 217–223, France, Lille (Virtual Event), 04/10/2021-06/10/2021
DOI: 10.1007/978-3-030-86692-1_18

See at: link.springer.com Open Access | ISTI Repository Open Access | CNR ExploRA Open Access


2020 Journal article Open Access OPEN

Compressed Indexes for Fast Search of Semantic Data
Pibiri G. E., Perego R., Venturini R.
The sheer increase in volume of RDF data demands efficient solutions for the triple indexing problem, that is to devise a compressed data structure to compactly represent RDF triples by guaranteeing, at the same time, fast pattern matching operations. This problem lies at the heart of delivering good practical performance for the resolution of complex SPARQL queries on large RDF datasets. In this work, we propose a trie-based index layout to solve the problem and introduce two novel techniques to reduce its space of representation for improved effectiveness. The extensive experimental analysis, conducted over a wide range of publicly available real-world datasets, reveals that our best space/time trade-off configuration substantially outperforms existing solutions at the state-of-the-art, by taking 30 - 60% less space and speeding up query execution by a factor of 2-81× .Source: IEEE transactions on knowledge and data engineering (Print) (2020): 1–11. doi:10.1109/TKDE.2020.2966609
DOI: 10.1109/tkde.2020.2966609
Project(s): BigDataGrapes via OpenAIRE

See at: arXiv.org e-Print Archive Open Access | IEEE Transactions on Knowledge and Data Engineering Open Access | ISTI Repository Open Access | IEEE Transactions on Knowledge and Data Engineering Restricted | IEEE Transactions on Knowledge and Data Engineering Restricted | ieeexplore.ieee.org Restricted | CNR ExploRA Restricted | IEEE Transactions on Knowledge and Data Engineering Restricted


2020 Conference article Open Access OPEN

Efficient and effective query auto-completion
Gog S., Pibiri G. E., Venturini R.
Query Auto-Completion (QAC) is an ubiquitous feature of modern textual search systems, suggesting possible ways of completing the query being typed by the user. Efficiency is crucial to make the system have a real-time responsiveness when operating in the million-scale search space. Prior work has extensively advocated the use of a trie data structure for fast prefix-search operations in compact space. However, searching by prefix has little discovery power in that only completions that are prefixed by the query are returned. This may impact negatively the effectiveness of the QAC system, with a consequent monetary loss for real applications like Web Search Engines and eCommerce. In this work we describe the implementation that empowers a new QAC system at eBay, and discuss its efficiency/effectiveness in relation to other approaches at the state-of-the-art. The solution is based on the combination of an inverted index with succinct data structures, a much less explored direction in the literature. This system is replacing the previous implementation based on Apache SOLR that was not always able to meet the required service-level-agreement.Source: ACM Conference on Research and Development in Information Retrieval, pp. 2271–2280, 25/07/2020-30/07/2020
DOI: 10.1145/3397271.3401432
Project(s): BigDataGrapes via OpenAIRE

See at: arXiv.org e-Print Archive Open Access | arxiv.org Open Access | ISTI Repository Open Access | academic.microsoft.com Restricted | arxiv.org Restricted | dblp.uni-trier.de Restricted | dl.acm.org Restricted | dl.acm.org Restricted | CNR ExploRA Restricted


2020 Journal article Open Access OPEN

Practical trade-offs for the prefix-sum problem
Pibiri G. E., Venturini R.
Given an integer arrayA, theprefix-sum problemis to answersum(i)queries that return the sum of the elements inA[0..i], knowing that the integers inAcan be changed. It is a classic problem in data structure design with a wide range of applications in computing from coding to databases. In this work, we propose and compare practical solutions to this problem, showing that new trade-offs between the performance of queries and updates can be achieved on modern hardware.Source: Software, practice & experience (Print) (2020). doi:10.1002/spe.2918
DOI: 10.1002/spe.2918
Project(s): BigDataGrapes via OpenAIRE

See at: Software Practice and Experience Open Access | Software Practice and Experience Restricted | Software Practice and Experience Restricted | Software Practice and Experience Restricted | Software Practice and Experience Restricted | onlinelibrary.wiley.com Restricted | Software Practice and Experience Restricted | Software Practice and Experience Restricted | Software Practice and Experience Restricted | CNR ExploRA Restricted


2020 Journal article Open Access OPEN

Techniques for Inverted Index Compression
Pibiri G. E., Venturini R.
The data structure at the core of large-scale search engines is the inverted index, which is essentially a collection of sorted integer sequences called inverted lists. Because of the many documents indexed by such engines and stringent performance requirements imposed by the heavy load of queries, the inverted index stores billions of integers that must be searched efficiently. In this scenario, index compression is essential because it leads to a better exploitation of the computer memory hierarchy for faster query processing and, at the same time, allows reducing the number of storage machines. The aim of this article is twofold: first, surveying the encoding algorithms suitable for inverted index compression and, second, characterizing the performance of the inverted index through experimentation.Source: ACM computing surveys 53 (2020). doi:10.1145/3415148
DOI: 10.1145/3415148
Project(s): BigDataGrapes via OpenAIRE

See at: arXiv.org e-Print Archive Open Access | ACM Computing Surveys Open Access | ISTI Repository Open Access | ACM Computing Surveys Restricted | ACM Computing Surveys Restricted | ACM Computing Surveys Restricted | ACM Computing Surveys Restricted | dl.acm.org Restricted | ACM Computing Surveys Restricted | CNR ExploRA Restricted


2019 Conference article Open Access OPEN

Fast dictionary-based compression for inverted indexes
Pibiri G. E., Petri M., Moffat A.
Dictionary-based compression schemes provide fast decoding operation, typically at the expense of reduced compression effectiveness compared to statistical or probability-based approaches. In this work, we apply dictionary-based techniques to the compression of inverted lists, showing that the high degree of regularity that these integer sequences exhibit is a good match for certain types of dictionary methods, and that an important new trade-off balance between compression effectiveness and compression efficiency can be achieved. Our observations are supported by experiments using the document-level inverted index data for two large text collections, and a wide range of other index compression implementations as reference points. Those experiments demonstrate that the gap between efficiency and effectiveness can be substantially narrowed.Source: International Conference on Web Search and Data Mining, pp. 6–14, 11/02/2019,15/02/2019
DOI: 10.1145/3289600.3290962

See at: ISTI Repository Open Access | academic.microsoft.com Restricted | dl.acm.org Restricted | dl.acm.org Restricted | dl.acm.org Restricted | findanexpert.unimelb.edu.au Restricted | CNR ExploRA Restricted


2019 Journal article Open Access OPEN

Handling massive n-gram datasets efficiently
Pibiri G. E., Venturini R.
Two fundamental problems concern the handling of large n-gram language models: indexing, that is, compressing the n-grams and associated satellite values without compromising their retrieval speed, and estimation, that is, computing the probability distribution of the n-grams extracted from a large textual source. Performing these two tasks efficiently is vital for several applications in the fields of Information Retrieval, Natural Language Processing, and Machine Learning, such as auto-completion in search engines and machine translation. Regarding the problem of indexing, we describe compressed, exact, and lossless data structures that simultaneously achieve high space reductions and no time degradation with respect to the state-of-the-art solutions and related software packages. In particular, we present a compressed trie data structure in which each word of an n-gram following a context of fixed length k, that is, its preceding k words, is encoded as an integer whose value is proportional to the number of words that follow such context. Since the number of words following a given context is typically very small in natural languages, we lower the space of representation to compression levels that were never achieved before, allowing the indexing of billions of strings. Despite the significant savings in space, our technique introduces a negligible penalty at query time. Specifically, the most space-efficient competitors in the literature, which are both quantized and lossy, do not take less than our trie data structure and are up to 5 times slower. Conversely, our trie is as fast as the fastest competitor but also retains an advantage of up to 65% in absolute space. Regarding the problem of estimation, we present a novel algorithm for estimating modified Kneser-Ney language models that have emerged as the de-facto choice for language modeling in both academia and industry thanks to their relatively low perplexity performance. Estimating such models from large textual sources poses the challenge of devising algorithms that make a parsimonious use of the disk. The state-of-the-art algorithm uses three sorting steps in external memory: we show an improved construction that requires only one sorting step by exploiting the properties of the extracted n-gram strings. With an extensive experimental analysis performed on billions of n-grams, we show an average improvement of 4.5 times on the total runtime of the previous approach.Source: ACM transactions on information systems 37 (2019). doi:10.1145/3302913
DOI: 10.1145/3302913
DOI: 10.5281/zenodo.3257995
DOI: 10.5281/zenodo.3257994
Project(s): BigDataGrapes via OpenAIRE

See at: arXiv.org e-Print Archive Open Access | ISTI Repository Open Access | ZENODO Open Access | ACM Transactions on Information Systems Open Access | ACM Transactions on Information Systems Restricted | ACM Transactions on Information Systems Restricted | ACM Transactions on Information Systems Restricted | ACM Transactions on Information Systems Restricted | ACM Transactions on Information Systems Restricted | CNR ExploRA Restricted


2019 Journal article Open Access OPEN

On optimally partitioning variable-byte codes
Pibiri G. E., Venturini R.
The ubiquitous Variable-Byte encoding is one of the fastest compressed representation for integer sequences. However, its compression ratio is usually not competitive with other more sophisticated encoders, especially when the integers to be compressed are small that is the typical case for inverted indexes. This paper shows that the compression ratio of Variable-Byte can be improved by 2× by adopting a partitioned representation of the inverted lists. This makes Variable-Byte surprisingly competitive in space with the best bit-aligned encoders, hence disproving the folklore belief that Variable-Byte is space-inefficient for inverted index compression. Despite the significant space savings, we show that our optimization almost comes for free, given that: we introduce an optimal partitioning algorithm that does not affect indexing time because of its linear-time complexity; we show that the query processing speed of Variable-Byte is preserved, with an extensive experimental analysis and comparison with several other state-of-the-art encoders.Source: IEEE transactions on knowledge and data engineering (Print) 32 (2019): 1812–1823. doi:10.1109/TKDE.2019.2911288
DOI: 10.1109/tkde.2019.2911288
Project(s): BigDataGrapes via OpenAIRE

See at: arXiv.org e-Print Archive Open Access | IEEE Transactions on Knowledge and Data Engineering Open Access | ISTI Repository Open Access | IEEE Transactions on Knowledge and Data Engineering Restricted | IEEE Transactions on Knowledge and Data Engineering Restricted | IEEE Transactions on Knowledge and Data Engineering Restricted | ieeexplore.ieee.org Restricted | CNR ExploRA Restricted | IEEE Transactions on Knowledge and Data Engineering Restricted


2019 Doctoral thesis Open Access OPEN

Space and time-efficient data structures for massive datasets
Pibiri G. M.
This thesis concerns the design of compressed data structures for the efficient storage of massive datasets of integer sequences and short strings.

See at: etd.adm.unipi.it Open Access | CNR ExploRA Open Access


2018 Contribution to book Open Access OPEN

Inverted Index Compression
Pibiri G. E., Venturini R.
The data structure at the core of nowadays large-scale search engines, social networks and storage architectures is the inverted index, which can be regarded as being a collection of sorted integer sequences called inverted lists. Because of the many documents indexed by search engines and stringent performance requirements dictated by the heavy load of user queries, the inverted lists often store several million (even billion) of integers and must be searched efficiently. In this scenario, compressing the inverted lists of the index appears as a mandatory design phase since it can introduce a twofold advantage over a non-compressed representation: feed faster memory levels with more data in order to speed up the query processing algorithms and reduce the number of storage machines needed to host the whole index. The scope of the chapter is the one of surveying the most important encoding algorithms developed for efficient inverted index compression.DOI: 10.1007/978-3-319-63962-8_52-1
DOI: 10.1007/978-3-319-77525-8_52

See at: ISTI Repository Open Access | link.springer.com Restricted | link.springer.com Restricted | CNR ExploRA Restricted


2017 Journal article Open Access OPEN

Clustered Elias-Fano Indexes
Pibiri G. E., Venturini R.
State-of-the-art encoders for inverted indexes compress each posting list individually. Encoding clusters of posting lists offers the possibility of reducing the redundancy of the lists while maintaining a noticeable query processing speed.Source: ACM transactions on information systems 36 (2017). doi:10.1145/3052773
DOI: 10.1145/3052773
Project(s): SoBigData via OpenAIRE

See at: ACM Transactions on Information Systems Open Access | Archivio della Ricerca - Università di Pisa Open Access | ISTI Repository Open Access | ACM Transactions on Information Systems Restricted | ACM Transactions on Information Systems Restricted | ACM Transactions on Information Systems Restricted | ACM Transactions on Information Systems Restricted | ACM Transactions on Information Systems Restricted | ACM Transactions on Information Systems Restricted | ACM Transactions on Information Systems Restricted | ACM Transactions on Information Systems Restricted | CNR ExploRA Restricted


2017 Conference article Open Access OPEN

Dynamic Elias-Fano Representation
Pibiri G. E., Venturini R.
We show that it is possible to store a dynamic ordered set S(n,u) of n integers drawn from a bounded universe of size u in space close to the information-theoretic lower bound and yet preserve the asymptotic time optimality of the operations. Our results leverage on the Elias-Fano representation of S(n,u) which takes EF(S(n,u))=n?log(u/n)?+2n bits of space and can be shown to be less than half a bit per element away from the information-theoretic minimum. Considering a RAM model with memory words of ?(log u) bits, we focus on the case in which the integers of S are drawn from a polynomial universe of size u=n?, for any ?=?(1). We represent S(n,u) with EF(S(n,u))+o(n) bits of space and: 1. support static predecessor/successor queries in O(min{1+log(u/n),loglog n}); 2. make S grow in an append-only fashion by spending O(1) per inserted element; 3. support random access in O(log n/loglog n) worst-case, insertions/deletions in O(log n/loglog n) amortized and predecessor/successor queries in O(min{1+log(u/n),loglog n}) worst-case time. These time bounds are optimal.Source: Annual Symposium on Combinatorial Pattern Matching, Varsavia, Polonia, 4-6/07/2017
DOI: 10.4230/lipics.cpm.2017.30

See at: drops.dagstuhl.de Open Access | ISTI Repository Open Access | CNR ExploRA Open Access


2017 Conference article Restricted

Efficient Data Structures for Massive N-Gram Datasets
Pibiri G. E., Venturini R.
The effcient indexing of large and sparse N-gram datasets is crucial in several applications in Information Retrieval, Natural Language Processing and Machine Learning. Because of the stringent efficiency requirements, dealing with billions of N-grams poses the challenge of introducing a compressed representation that preserves the query processing speed. In this paperwe study the problem of reducing the space required by the representation of such datasets, maintaining the capability of looking up for a given N-gram within micro seconds. For this purpose we describe compressed, exact and lossless data structures that achieve, at the same time, high space reductions and no time degradation with respect to state-of-The-Art software packages. In particular, we present a trie data structure in which each word following a context of fixed length k, i.e., its preceding k words, is encoded as an integer whose value is proportional to the number of words that follow such context. Since the number of words following a given context is typically very small in natural languages, we are able to lower the space of representation to compression levels that were never achieved before. Despite the significant savings in space, we show that our technique introduces a negligible penalty at query time.Source: International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 615–624, Tokyo, Giappone, 7-11/08/2017
DOI: 10.1145/3077136.3080798

See at: academic.microsoft.com Restricted | dblp.uni-trier.de Restricted | dl.acm.org Restricted | dl.acm.org Restricted | dl.acm.org Restricted | CNR ExploRA Restricted