4 result(s)
Page Size: 10, 20, 50
Export: bibtex, xml, json, csv
Order by:

CNR Author operator: and / or
Typology operator: and / or
Language operator: and / or
Date operator: and / or
Rights operator: and / or
2019 Doctoral thesis Open Access OPEN

Space and time-efficient data structures for massive datasets
Pibiri G. M.
This thesis concerns the design of compressed data structures for the efficient storage of massive datasets of integer sequences and short strings.

See at: etd.adm.unipi.it Open Access | CNR ExploRA Open Access


2019 Conference article Open Access OPEN

Fast dictionary-based compression for inverted indexes
Pibiri G. E., Petri M., Moffat A.
Dictionary-based compression schemes provide fast decoding operation, typically at the expense of reduced compression effectiveness compared to statistical or probability-based approaches. In this work, we apply dictionary-based techniques to the compression of inverted lists, showing that the high degree of regularity that these integer sequences exhibit is a good match for certain types of dictionary methods, and that an important new trade-off balance between compression effectiveness and compression efficiency can be achieved. Our observations are supported by experiments using the document-level inverted index data for two large text collections, and a wide range of other index compression implementations as reference points. Those experiments demonstrate that the gap between efficiency and effectiveness can be substantially narrowed.Source: International Conference on Web Search and Data Mining, pp. 6–14, 11/02/2019,15/02/2019
DOI: 10.1145/3289600.3290962

See at: ISTI Repository Open Access | academic.microsoft.com Restricted | dl.acm.org Restricted | dl.acm.org Restricted | dl.acm.org Restricted | findanexpert.unimelb.edu.au Restricted | CNR ExploRA Restricted


2019 Journal article Open Access OPEN

On optimally partitioning variable-byte codes
Pibiri G. E., Venturini R.
The ubiquitous Variable-Byte encoding is one of the fastest compressed representation for integer sequences. However, its compression ratio is usually not competitive with other more sophisticated encoders, especially when the integers to be compressed are small that is the typical case for inverted indexes. This paper shows that the compression ratio of Variable-Byte can be improved by 2× by adopting a partitioned representation of the inverted lists. This makes Variable-Byte surprisingly competitive in space with the best bit-aligned encoders, hence disproving the folklore belief that Variable-Byte is space-inefficient for inverted index compression. Despite the significant space savings, we show that our optimization almost comes for free, given that: we introduce an optimal partitioning algorithm that does not affect indexing time because of its linear-time complexity; we show that the query processing speed of Variable-Byte is preserved, with an extensive experimental analysis and comparison with several other state-of-the-art encoders.Source: IEEE transactions on knowledge and data engineering (Print) 32 (2019): 1812–1823. doi:10.1109/TKDE.2019.2911288
DOI: 10.1109/tkde.2019.2911288
Project(s): BigDataGrapes via OpenAIRE

See at: arXiv.org e-Print Archive Open Access | IEEE Transactions on Knowledge and Data Engineering Open Access | ISTI Repository Open Access | IEEE Transactions on Knowledge and Data Engineering Restricted | IEEE Transactions on Knowledge and Data Engineering Restricted | IEEE Transactions on Knowledge and Data Engineering Restricted | ieeexplore.ieee.org Restricted | CNR ExploRA Restricted | IEEE Transactions on Knowledge and Data Engineering Restricted


2019 Journal article Open Access OPEN

Handling massive n-gram datasets efficiently
Pibiri G. E., Venturini R.
Two fundamental problems concern the handling of large n-gram language models: indexing, that is, compressing the n-grams and associated satellite values without compromising their retrieval speed, and estimation, that is, computing the probability distribution of the n-grams extracted from a large textual source. Performing these two tasks efficiently is vital for several applications in the fields of Information Retrieval, Natural Language Processing, and Machine Learning, such as auto-completion in search engines and machine translation. Regarding the problem of indexing, we describe compressed, exact, and lossless data structures that simultaneously achieve high space reductions and no time degradation with respect to the state-of-the-art solutions and related software packages. In particular, we present a compressed trie data structure in which each word of an n-gram following a context of fixed length k, that is, its preceding k words, is encoded as an integer whose value is proportional to the number of words that follow such context. Since the number of words following a given context is typically very small in natural languages, we lower the space of representation to compression levels that were never achieved before, allowing the indexing of billions of strings. Despite the significant savings in space, our technique introduces a negligible penalty at query time. Specifically, the most space-efficient competitors in the literature, which are both quantized and lossy, do not take less than our trie data structure and are up to 5 times slower. Conversely, our trie is as fast as the fastest competitor but also retains an advantage of up to 65% in absolute space. Regarding the problem of estimation, we present a novel algorithm for estimating modified Kneser-Ney language models that have emerged as the de-facto choice for language modeling in both academia and industry thanks to their relatively low perplexity performance. Estimating such models from large textual sources poses the challenge of devising algorithms that make a parsimonious use of the disk. The state-of-the-art algorithm uses three sorting steps in external memory: we show an improved construction that requires only one sorting step by exploiting the properties of the extracted n-gram strings. With an extensive experimental analysis performed on billions of n-grams, we show an average improvement of 4.5 times on the total runtime of the previous approach.Source: ACM transactions on information systems 37 (2019). doi:10.1145/3302913
DOI: 10.1145/3302913
DOI: 10.5281/zenodo.3257995
DOI: 10.5281/zenodo.3257994
Project(s): BigDataGrapes via OpenAIRE

See at: arXiv.org e-Print Archive Open Access | ISTI Repository Open Access | ZENODO Open Access | ACM Transactions on Information Systems Open Access | ACM Transactions on Information Systems Restricted | ACM Transactions on Information Systems Restricted | ACM Transactions on Information Systems Restricted | ACM Transactions on Information Systems Restricted | ACM Transactions on Information Systems Restricted | CNR ExploRA Restricted