2010
Journal article
Restricted
The compressed permuterm index
Ferragina P, Venturini R
The Permuterm index [Garfield 1976] is a time-efficient and elegant solution to the string dictionary problem in which pattern queries may include one wild-card symbol (the so-called Tolerant Retrieval problem). Unfortunately, the Permuterm index is space-inefficient because it quadruples the dictionary size. In this article we propose the Compressed Permuterm Index, which solves the Tolerant Retrieval problem in time proportional to the length of the searched pattern and in space close to the kth-order empirical entropy of the indexed dictionary. We also design a dynamic version of this index that efficiently supports insertions into, and deletions from, the dictionary of individual strings. The result is based on a simple variant of the Burrows-Wheeler Transform, defined on a dictionary of strings of variable length, that allows us to solve the Tolerant Retrieval problem efficiently via known (dynamic) compressed indexes [Navarro and Mäkinen 2007]. We complement our theoretical study with a significant set of experiments showing that the Compressed Permuterm Index supports fast queries within a space occupancy close to the one achievable by compressing the string dictionary via gzip or bzip. This improves known approaches based on Front-Coding [Witten et al. 1999] by more than 50% in absolute space occupancy, while guaranteeing comparable query time.
Source: ACM TRANSACTIONS ON ALGORITHMS, vol. 7 (issue 1), pp. 10-21
DOI: 10.1145/1868237.1868248
See at:
dl.acm.org
| ACM Transactions on Algorithms
| CNR IRIS
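As an aside, here is a minimal Python sketch of the classical (uncompressed) permuterm idea that the paper compresses: every rotation of each term, terminated by '$', is indexed, so a single-wild-card query a*b becomes a prefix search for b$a. This is only an illustration of the underlying trick, not the paper's BWT-based compressed data structure; function and variable names are hypothetical.

```python
import bisect

def build_permuterm(dictionary):
    """Index every rotation of term + '$' (the classical, uncompressed permuterm trick)."""
    rotations = []
    for term in dictionary:
        t = term + "$"                          # '$' marks the end of the term
        for i in range(len(t)):
            rotations.append((t[i:] + t[:i], term))
    rotations.sort()
    return rotations

def query(permuterm, pattern):
    """Answer a pattern containing at most one '*' wild-card, e.g. 'c*er'."""
    if "*" in pattern:
        a, b = pattern.split("*", 1)
        prefix = b + "$" + a                    # rotate so the wild-card falls at the end
    else:
        prefix = pattern + "$"                  # exact-match query
    lo = bisect.bisect_left(permuterm, (prefix,))
    out = set()
    while lo < len(permuterm) and permuterm[lo][0].startswith(prefix):
        out.add(permuterm[lo][1])
        lo += 1
    return sorted(out)

print(query(build_permuterm(["cover", "cutter", "cat"]), "c*er"))  # ['cover', 'cutter']
```

Indexing all rotations is exactly what quadruples the dictionary size; the Compressed Permuterm Index obtains the same query semantics within compressed space.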
2011
Journal article
Open Access
On optimally partitioning a text to improve its compression
Ferragina P, Nitto I, Venturini R
In this paper we investigate the problem of partitioning an input string T in such a way that compressing its parts individually via a base compressor C yields a compressed output that is shorter than applying C over the entire T at once. This problem was introduced in Buchsbaum et al. (Proc. of 11th ACM-SIAM Symposium on Discrete Algorithms, pp. 175-184, 2000; J. ACM 50(6):825-851, 2003) in the context of table compression, and then further elaborated and extended to strings and trees by Ferragina et al. (J. ACM 52:688-713, 2005; Proc. of 46th IEEE Symposium on Foundations of Computer Science, pp. 184-193, 2005) and Mäkinen and Navarro (Proc. of 14th Symposium on String Processing and Information Retrieval, pp. 229-241, 2007). Unfortunately, the literature offers only partial solutions: we know either a cubic-time algorithm for computing the optimal partition based on dynamic programming (Buchsbaum et al. in J. ACM 50(6):825-851, 2003; Giancarlo and Sciortino in Proc. of 14th Symposium on Combinatorial Pattern Matching, pp. 129-143, 2003), or a few heuristics that do not guarantee any bound on the efficacy of their computed partition (Buchsbaum et al. in Proc. of 11th ACM-SIAM Symposium on Discrete Algorithms, pp. 175-184, 2000; J. ACM 50(6):825-851, 2003), or algorithms that are efficient but work only in specific scenarios (such as the Burrows-Wheeler Transform, see e.g. Ferragina et al. in J. ACM 52:688-713, 2005; Mäkinen and Navarro in Proc. of 14th Symposium on String Processing and Information Retrieval, pp. 229-241, 2007) and achieve compression performance that might be worse than the optimal partitioning by an Ω(log n/log log n) factor. Therefore, computing the optimal solution efficiently is still open (Buchsbaum and Giancarlo in Encyclopedia of Algorithms, pp. 939-942, 2008). In this paper we provide the first algorithm which computes, in O(n log^{1+ε} n) time and O(n) space, a partition of T whose compressed output is guaranteed [...]
Source: ALGORITHMICA, vol. 61 (issue 1), pp. 51-74
DOI: 10.1007/s00453-010-9437-6
DOI: 10.48550/arxiv.0906.4692
DOI: 10.1007/978-3-642-04128-0_38
See at:
arXiv.org e-Print Archive
| Algorithmica
| doi.org
| CNR IRIS
| www.springerlink.com
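For intuition, here is a sketch of the cubic-time dynamic-programming baseline mentioned in the abstract, with zlib standing in for the base compressor C purely for illustration; the paper's O(n log^{1+ε} n)-time algorithm does not compute the partition this way. All names are hypothetical.

```python
import zlib

def optimal_partition(T, C=lambda piece: len(zlib.compress(piece))):
    """Cubic-time DP: dp[j] = min over i < j of dp[i] + |C(T[i:j])|."""
    n = len(T)
    dp = [0] + [float("inf")] * n          # dp[j] = best compressed size for T[:j]
    cut = [0] * (n + 1)                    # back-pointers to recover the partition
    for j in range(1, n + 1):
        for i in range(j):
            cost = dp[i] + C(T[i:j])       # compress the piece T[i:j] on its own
            if cost < dp[j]:
                dp[j], cut[j] = cost, i
    pieces, j = [], n                      # walk the back-pointers
    while j > 0:
        pieces.append((cut[j], j))
        j = cut[j]
    return dp[n], list(reversed(pieces))

cost, pieces = optimal_partition(b"a" * 64 + b"xyz" * 32)
print(cost, pieces)
```

Each of the O(n^2) table cells triggers a linear-time compression of a piece, hence the cubic running time that the paper's algorithm avoids.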
2009
Conference article
Restricted
On the bit-complexity of Lempel-Ziv compression
Ferragina P, Nitto I, Venturini R
One of the most famous and investigated lossless data-compression schemes is the one introduced by Lempel and Ziv about 30 years ago [37]. This compression scheme is known as a "dictionary-based compressor" and consists of squeezing an input string by replacing some of its substrings with (shorter) codewords, which are actually pointers to a dictionary of phrases built as the string is processed. Surprisingly enough, although many fundamental results are nowadays known about the speed and effectiveness of this compression process (see e.g. [23, 28] and references therein), "we are not aware of any parsing scheme that achieves optimality when the LZ77-dictionary is in use under any constraint on the codewords other than being of equal length" [28, p. 159]. Here optimality means achieving the minimum number of bits in compressing each individual input string, without any assumption on its generating source. In this paper we investigate three issues pertaining to the bit-complexity of LZ-based compressors, and we design algorithms which achieve bit-optimality in the compressed output size while taking efficient/optimal time and optimal space. These theoretical results are sustained by experiments that compare our novel LZ-based compressors against the most popular compression tools (like gzip and bzip2) and state-of-the-art compressors (like the booster of [14, 13]).
See at:
CNR IRIS
| www.siam.org
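To fix ideas, the sketch below shows a plain greedy LZ77 parser producing (distance, length, next-symbol) phrases. It is not the bit-optimal parser studied in the paper: the greedy longest-match rule ignores how many bits the variable-length codewords of each phrase will cost, which is precisely the gap the paper addresses. Names and the window size are illustrative.

```python
def lz77_greedy(s, window=4096):
    """Greedy longest-match LZ77 parse into (distance, length, next_char) phrases."""
    i, phrases = 0, []
    while i < len(s):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):      # candidate match starts
            k = 0
            while i + k < len(s) - 1 and s[j + k] == s[i + k]:
                k += 1                              # extend the copy as far as possible
            if k > best_len:
                best_len, best_dist = k, i - j
        nxt = s[i + best_len]                       # the literal closing the phrase
        phrases.append((best_dist, best_len, nxt))
        i += best_len + 1
    return phrases

print(lz77_greedy("abababbbc"))  # [(0, 0, 'a'), (0, 0, 'b'), (2, 4, 'b'), (6, 1, 'c')]
```

Once each triple is encoded with variable-length codes, the choice of phrases determines the output bit size, which is the quantity the bit-optimal parsers of the paper minimize.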
2014
Book
Restricted
Compressed data structures for strings. On searching and extracting strings from compressed textual data
Venturini R
Data compression is mandatory to manage massive datasets; indexing is fundamental to query them. However, their goals appear counterposed: the former aims at minimizing data redundancies, whereas the latter augments the dataset with auxiliary information to speed up query resolution. In this monograph we introduce solutions that overcome this dichotomy. We start by presenting the use of optimization techniques to improve the compression of classical data compression algorithms, then we move to the design of compressed data structures providing fast random access or efficient pattern matching queries on the compressed dataset. These theoretical studies are supported by experimental evidence of their impact in practical scenarios.
Source: ATLANTIS STUDIES IN COMPUTING
DOI: 10.2991/978-94-6239-033-1
See at:
doi.org
| CNR IRIS
| link.springer.com
2013
Other
Restricted
Cache-oblivious peeling of random hypergraphs
Belazzougui D, Boldi P, Ottaviano G, Venturini R, Vigna S
The computation of a peeling order in a randomly generated hypergraph is the most time-consuming step in a number of constructions, such as perfect hashing schemes, random r-SAT solvers, error-correcting codes, and approximate set encodings. While there exists a straightforward linear-time algorithm, its poor I/O performance makes it impractical for hypergraphs whose size exceeds the available internal memory. We show how to reduce the computation of a peeling order to a small number of sequential scans and sorts, and analyze its I/O complexity in the cache-oblivious model. The resulting algorithm requires O(sort(n)) I/Os and O(n log n) time to peel a random hypergraph with n edges. We experimentally evaluate the performance of our implementation of this algorithm in a real-world scenario by using the construction of minimal perfect hash functions (MPHF) as our test case: our algorithm builds an MPHF of 7.6 billion keys in less than 21 hours on a single machine. The resulting data structure is both more space-efficient and faster than that obtained with the current state-of-the-art MPHF construction for large-scale key sets.
See at:
CNR IRIS
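For context, a minimal in-memory sketch of the straightforward linear-time peeling procedure that the paper re-engineers for the cache-oblivious setting: repeatedly pick a vertex of degree one and remove it together with its unique incident hyperedge. This is only the baseline the abstract refers to, not the paper's scan-and-sort algorithm; names are hypothetical.

```python
from collections import defaultdict, deque

def peel(num_vertices, edges):
    """edges: list of hyperedges (tuples of vertices).
    Returns a peeling order of edge ids, or None if the hypergraph cannot be fully peeled."""
    incident = defaultdict(set)                 # vertex -> ids of still-alive incident edges
    for e, verts in enumerate(edges):
        for v in verts:
            incident[v].add(e)
    queue = deque(v for v in range(num_vertices) if len(incident[v]) == 1)
    order, alive = [], [True] * len(edges)
    while queue:
        v = queue.popleft()
        if len(incident[v]) != 1:
            continue                            # its only edge was peeled in the meantime
        e = next(iter(incident[v]))
        order.append(e)
        alive[e] = False
        for u in edges[e]:                      # detach the peeled edge from all its vertices
            incident[u].discard(e)
            if len(incident[u]) == 1:
                queue.append(u)
    return order if not any(alive) else None

# A 3-uniform hypergraph on 6 vertices with two edges: fully peelable.
print(peel(6, [(0, 1, 2), (2, 3, 4)]))          # e.g. [0, 1]
```

The random accesses into the incidence structure are what ruin I/O performance once the hypergraph no longer fits in internal memory; the paper replaces them with a small number of sequential scans and sorts.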
2013
Journal article
Open Access
On the bit-complexity of Lempel-Ziv compression
Ferragina P, Nitto I, Venturini R
One of the most famous and investigated lossless data-compression schemes is the one introduced by Lempel and Ziv about 30 years ago [38]. This compression scheme is known as a dictionary-based compressor and consists of squeezing an input string by replacing some of its substrings with (shorter) codewords, which are actually pointers to a dictionary of phrases built as the string is processed. Surprisingly enough, although many fundamental results are nowadays known about the speed and effectiveness of this compression process, "we are not aware of any parsing scheme that achieves optimality [...] under any constraint on the codewords other than being of equal length" [29, p. 159]. Here optimality means achieving the minimum number of bits in compressing each individual input string, without any assumption on its generating source. In this paper we investigate some issues pertaining to the bit-complexity of LZ77-based compressors, the most powerful variant of the LZ-compression scheme, and we design algorithms which achieve bit-optimality in the compressed output size while taking efficient/optimal time and optimal space.
Source: SIAM JOURNAL ON COMPUTING, vol. 42 (issue 4), pp. 1521-1541
DOI: 10.1137/120869511
DOI: 10.1137/1.9781611973068.84
See at:
SIAM Journal on Computing
| www.di.unipi.it
| doi.org
| epubs.siam.org
| CNR IRIS
| www.scopus.com
2013
Contribution to book
Restricted
Web search
Ferragina P, Venturini R
Faced with the massive amount of information on the Web, which nowadays includes not only texts but any kind of file (audio, video, images, etc.), Web users tend to lose their way when browsing, falling into what psychologists call "getting lost in hyperspace". Search engines alleviate this by presenting the most relevant pages that best match the user's information needs. Collecting a large part of the pages on the Web, extrapolating a user's information need expressed by means of often ambiguous queries, and establishing the importance of Web pages and their relevance for a query are just a few examples of the difficult problems that search engines address every day to achieve their ambitious goal. In this chapter, we introduce the concepts and the algorithms that lie at the core of modern search engines by providing running examples that simplify understanding, and we comment on some recent and powerful tools and functionalities that should increase the ability of users to match their information needs on the Web.
DOI: 10.1007/978-3-642-39652-6_5
See at:
CNR IRIS
| link.springer.com
2013
Conference article
Open Access
Dynamic compressed strings with random access
Grossi R, Raman R, Rao S, Venturini R
We consider the problem of storing a string S in dynamic compressed form, while permitting operations directly on the compressed representation of S: access a substring of S; replace, insert or delete a symbol in S; count how many occurrences of a given symbol appear in any given prefix of S (called the rank operation); and locate the position of the i-th occurrence of a symbol inside S (called the select operation). We discuss the time complexity of several combinations of these operations along with the entropy space bounds of the corresponding compressed indexes. In this way, we extend or improve the bounds of previous work by Ferragina and Venturini [TCS, 2007], Jansson et al. [ICALP, 2012], and Nekrich and Navarro [SODA, 2013].
DOI: 10.1007/978-3-642-39206-1_43
See at:
figshare.com
| doi.org
| CNR IRIS
| link.springer.com
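To pin down the semantics of the operations listed in the abstract, here is a deliberately naive, uncompressed Python rendering of access, insert, delete, rank and select; the paper's contribution is supporting this same interface on a compressed, dynamic representation of S within entropy space bounds, which this sketch does not attempt. The class name is hypothetical.

```python
class NaiveDynamicString:
    """Uncompressed stand-in that only fixes the meaning of the supported operations."""
    def __init__(self, s):
        self.s = list(s)

    def access(self, i, j):            # substring S[i..j)
        return "".join(self.s[i:j])

    def insert(self, i, c):            # insert symbol c before position i
        self.s.insert(i, c)

    def delete(self, i):               # delete the symbol at position i
        del self.s[i]

    def rank(self, c, i):              # occurrences of c in the prefix S[0..i)
        return self.s[:i].count(c)

    def select(self, c, k):            # position of the k-th (1-based) occurrence of c
        seen = 0
        for pos, x in enumerate(self.s):
            seen += x == c
            if seen == k:
                return pos
        raise ValueError("fewer than k occurrences of %r" % c)

S = NaiveDynamicString("mississippi")
print(S.rank("s", 5), S.select("i", 2))   # 2 's' in 'missi'; the 2nd 'i' is at index 4
```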
2014
Conference article
Open Access
Cache-oblivious peeling of random hypergraphs
Belazzougui D, Boldi P, Ottaviano G, Venturini R, Vigna S
The computation of a peeling order in a randomly generated hypergraph is the most time-consuming step in a number of constructions, such as perfect hashing schemes, random r-SAT solvers, error-correcting codes, and approximate set encodings. While there exists a straightforward linear-time algorithm, its poor I/O performance makes it impractical for hypergraphs whose size exceeds the available internal memory. We show how to reduce the computation of a peeling order to a small number of sequential scans and sorts, and analyze its I/O complexity in the cache-oblivious model. The resulting algorithm requires O(sort(n)) I/Os and O(n log n) time to peel a random hypergraph with n edges. We experimentally evaluate the performance of our implementation of this algorithm in a real-world scenario by using the construction of minimal perfect hash functions (MPHF) as our test case: our algorithm builds an MPHF of 7.6 billion keys in less than 21 hours on a single machine. The resulting data structure is both more space-efficient and faster than that obtained with the current state-of-the-art MPHF construction for large-scale key sets.
DOI: 10.1109/dcc.2014.48
DOI: 10.48550/arxiv.1312.0526
Project(s): MIDAS, NADINE
See at:
arXiv.org e-Print Archive
| arxiv.org
| doi.org
| CNR IRIS
2014
Conference article
Restricted
Bicriteria data compression: efficient and usable
Farruggia A, Ferragina P, Venturini R
Lempel-Ziv's LZ77 algorithm is the de facto choice for compressing massive datasets (see e.g. Snappy in BigTable, Lz4 in Cassandra) because its algorithmic structure is flexible enough to guarantee very fast decompression speed at reasonable compressed-space occupancy. Recent theoretical results have shown how to design a bit-optimal LZ77-compressor, which minimizes the compressed size, and how to deploy it in order to design a bicriteria data compressor, namely an LZ77-compressor which trades compressed-space occupancy versus decompression time in a smooth and principled way. Preliminary experiments were promising but raised many algorithmic and engineering questions which have to be addressed in order to turn these algorithmic results into an effective and practical tool. In this paper we address these issues by first designing a novel bit-optimal LZ77-compressor which is simple, cache-aware and asymptotically optimal. We benchmark our approach by investigating several algorithmic and implementation issues over many dataset types and sizes, and against an ample class of classic (LZ-based, PPM-based and BWT-based) as well as engineered compressors (Snappy, Lz4, and Lzma2). We conclude by noting how our novel bicriteria LZ77-compressor improves the state of the art of the fast (de)compressors Snappy and Lz4.
DOI: 10.1007/978-3-662-44777-2_34
See at:
doi.org
| CNR IRIS
| link.springer.com
2014
Conference article
Restricted
Bicriteria data compression
Farruggia A, Frangioni A, Ferragina P, Venturini R
In this paper we address the problem of trading optimally, and in a principled way, the compressed size and decompression time of LZ77 parsings by introducing what we call the Bicriteria LZ77-Parsing problem. The goal is to determine an LZ77 parsing which minimizes the space occupancy in bits of the compressed file, provided that the decompression time is bounded by T. Symmetrically, we can exchange the role of the two resources and thus ask to minimize the decompression time provided that the compressed space is bounded by a fixed amount given in advance.
See at:
dl.acm.org
| CNR IRIS
2014
Conference article
Open Access
Partitioned Elias-Fano indexes
Ottaviano G, Venturini R
The Elias-Fano representation of monotone sequences has recently been applied to the compression of inverted indexes, showing excellent query performance thanks to its efficient random access and search operations. While its space occupancy is competitive with some state-of-the-art methods such as γ-δ-Golomb codes and PForDelta, it fails to exploit the local clustering that inverted lists usually exhibit, namely the presence of long subsequences of close identifiers. In this paper we describe a new representation based on partitioning the list into chunks and encoding both the chunks and their endpoints with Elias-Fano, hence forming a two-level data structure. This partitioning enables the encoding to better adapt to the local statistics of each chunk, thus exploiting clustering and improving compression. We present two partition strategies, with fixed- and variable-length chunks respectively. For the latter case we introduce a linear-time optimization algorithm which identifies the minimum-space partition up to an arbitrarily small approximation factor. We show that our partitioned Elias-Fano indexes offer significantly better compression than plain Elias-Fano, while preserving their query-time efficiency. Furthermore, compared with other state-of-the-art compressed encodings, our indexes exhibit the best compression ratio/query time trade-off.
DOI: 10.1145/2600428.2609615
See at:
www.di.unipi.it
| dl.acm.org
| doi.org
| CNR IRIS
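For reference, a minimal sketch of the plain Elias-Fano encoding that the partitioned index builds on: each value of a monotone sequence over a universe of size u is split into roughly log2(u/n) explicit low bits and a high part stored as unary-coded bucket gaps (about two bits per element). This illustrates only the base representation, not the two-level partitioned structure of the paper; names are illustrative.

```python
import math

def elias_fano_encode(values, universe):
    """Split each value into low bits (stored verbatim) and a unary-coded high part."""
    n = len(values)
    low_bits = max(0, int(math.floor(math.log2(universe / n))))
    low, high_unary, prev_bucket = [], [], 0
    for v in values:                            # values must be non-decreasing
        low.append(v & ((1 << low_bits) - 1))   # the low_bits least-significant bits
        bucket = v >> low_bits                  # the remaining high bits
        high_unary.append("0" * (bucket - prev_bucket) + "1")   # unary gap code
        prev_bucket = bucket
    return low_bits, low, "".join(high_unary)

low_bits, low, high = elias_fano_encode([3, 4, 7, 13, 14, 15, 21, 43], universe=44)
print(low_bits, low, high)   # 2 low bits per value, ~2 bits per value for the high part
```

Because the low parts are fixed-width and the high part is a short bitmap, random access and search stay simple, which is the property the partitioned variant preserves chunk by chunk while adapting to local statistics.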