2010
Journal article
Restricted
The compressed permuterm index
Ferragina P, Venturini R
The Permuterm index [Garfield 1976] is a time-efficient and elegant solution to the string dictionary problem in which pattern queries may include one wild-card symbol (the so-called Tolerant Retrieval problem). Unfortunately, the Permuterm index is space-inefficient because it quadruples the dictionary size. In this article we propose the Compressed Permuterm Index, which solves the Tolerant Retrieval problem in time proportional to the length of the searched pattern and in space close to the kth-order empirical entropy of the indexed dictionary. We also design a dynamic version of this index that efficiently supports insertions into, and deletions from, the dictionary of individual strings. The result is based on a simple variant of the Burrows-Wheeler Transform, defined on a dictionary of strings of variable length, that allows us to solve the Tolerant Retrieval problem efficiently via known (dynamic) compressed indexes [Navarro and Mäkinen 2007]. We complement our theoretical study with a significant set of experiments showing that the Compressed Permuterm Index supports fast queries within a space occupancy close to the one achievable by compressing the string dictionary via gzip or bzip. This improves known approaches based on Front-Coding [Witten et al. 1999] by more than 50% in absolute space occupancy, while guaranteeing comparable query time.
Source: ACM TRANSACTIONS ON ALGORITHMS, vol. 7 (issue 1), pp. 10-21
DOI: 10.1145/1868237.1868248
See at:
dl.acm.org
| ACM Transactions on Algorithms
| CNR IRIS
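As an aside, here is a minimal Python sketch of the classical (uncompressed) permuterm idea that the paper compresses: every rotation of each term, terminated by '$', is indexed, so a single-wild-card query a*b becomes a prefix search for b$a. This is only an illustration of the underlying trick, not the paper's BWT-based compressed data structure; function and variable names are hypothetical.

```python
import bisect

def build_permuterm(dictionary):
    """Index every rotation of term + '$' (the classical, uncompressed permuterm trick)."""
    rotations = []
    for term in dictionary:
        t = term + "$"                          # '$' marks the end of the term
        for i in range(len(t)):
            rotations.append((t[i:] + t[:i], term))
    rotations.sort()
    return rotations

def query(permuterm, pattern):
    """Answer a pattern containing at most one '*' wild-card, e.g. 'c*er'."""
    if "*" in pattern:
        a, b = pattern.split("*", 1)
        prefix = b + "$" + a                    # rotate so the wild-card falls at the end
    else:
        prefix = pattern + "$"                  # exact-match query
    lo = bisect.bisect_left(permuterm, (prefix,))
    out = set()
    while lo < len(permuterm) and permuterm[lo][0].startswith(prefix):
        out.add(permuterm[lo][1])
        lo += 1
    return sorted(out)

print(query(build_permuterm(["cover", "cutter", "cat"]), "c*er"))  # ['cover', 'cutter']
```

Indexing all rotations is exactly what quadruples the dictionary size; the Compressed Permuterm Index obtains the same query semantics within compressed space.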
2011
Journal article
Open Access
On optimally partitioning a text to improve its compression
Ferragina P, Nitto I, Venturini R
In this paper we investigate the problem of partitioning an input string T in such a way that compressing its parts individually via a base compressor C yields a compressed output that is shorter than applying C over the entire T at once. This problem was introduced in Buchsbaum et al. (Proc. of 11th ACM-SIAM Symposium on Discrete Algorithms, pp. 175-184, 2000; J. ACM 50(6):825-851, 2003) in the context of table compression, and then further elaborated and extended to strings and trees by Ferragina et al. (J. ACM 52:688-713, 2005; Proc. of 46th IEEE Symposium on Foundations of Computer Science, pp. 184-193, 2005) and Mäkinen and Navarro (Proc. of 14th Symposium on String Processing and Information Retrieval, pp. 229-241, 2007). Unfortunately, the literature offers only partial solutions: we know either a cubic-time algorithm for computing the optimal partition based on dynamic programming (Buchsbaum et al. in J. ACM 50(6):825-851, 2003; Giancarlo and Sciortino in Proc. of 14th Symposium on Combinatorial Pattern Matching, pp. 129-143, 2003), or a few heuristics that do not guarantee any bound on the efficacy of their computed partition (Buchsbaum et al. in Proc. of 11th ACM-SIAM Symposium on Discrete Algorithms, pp. 175-184, 2000; J. ACM 50(6):825-851, 2003), or algorithms that are efficient but work only in specific scenarios (such as the Burrows-Wheeler Transform, see e.g. Ferragina et al. in J. ACM 52:688-713, 2005; Mäkinen and Navarro in Proc. of 14th Symposium on String Processing and Information Retrieval, pp. 229-241, 2007) and achieve compression performance that might be worse than the optimal partitioning by an Ω(log n/log log n) factor. Therefore, computing the optimal solution efficiently is still open (Buchsbaum and Giancarlo in Encyclopedia of Algorithms, pp. 939-942, 2008). In this paper we provide the first algorithm which computes, in O(n log^{1+ε} n) time and O(n) space, a partition of T whose compressed output is guaranteed [...]
Source: ALGORITHMICA, vol. 61 (issue 1), pp. 51-74
DOI: 10.1007/s00453-010-9437-6
DOI: 10.48550/arxiv.0906.4692
DOI: 10.1007/978-3-642-04128-0_38
See at:
arXiv.org e-Print Archive
| Algorithmica
| doi.org
| CNR IRIS
| www.springerlink.com
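For intuition, here is a sketch of the cubic-time dynamic-programming baseline mentioned in the abstract, with zlib standing in for the base compressor C purely for illustration; the paper's O(n log^{1+ε} n)-time algorithm does not compute the partition this way. All names are hypothetical.

```python
import zlib

def optimal_partition(T, C=lambda piece: len(zlib.compress(piece))):
    """Cubic-time DP: dp[j] = min over i < j of dp[i] + |C(T[i:j])|."""
    n = len(T)
    dp = [0] + [float("inf")] * n          # dp[j] = best compressed size for T[:j]
    cut = [0] * (n + 1)                    # back-pointers to recover the partition
    for j in range(1, n + 1):
        for i in range(j):
            cost = dp[i] + C(T[i:j])       # compress the piece T[i:j] on its own
            if cost < dp[j]:
                dp[j], cut[j] = cost, i
    pieces, j = [], n                      # walk the back-pointers
    while j > 0:
        pieces.append((cut[j], j))
        j = cut[j]
    return dp[n], list(reversed(pieces))

cost, pieces = optimal_partition(b"a" * 64 + b"xyz" * 32)
print(cost, pieces)
```

Each of the O(n^2) table cells triggers a linear-time compression of a piece, hence the cubic running time that the paper's algorithm avoids.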
2009
Conference article
Restricted
On the bit-complexity of Lempel-Ziv compression
Ferragina P, Nitto I, Venturini R
One of the most famous and investigated lossless data-compression schemes is the one introduced by Lempel and Ziv about 30 years ago [37]. This compression scheme is known as a "dictionary-based compressor" and consists of squeezing an input string by replacing some of its substrings with (shorter) codewords, which are actually pointers to a dictionary of phrases built as the string is processed. Surprisingly enough, although many fundamental results are nowadays known about the speed and effectiveness of this compression process (see e.g. [23, 28] and references therein), "we are not aware of any parsing scheme that achieves optimality when the LZ77-dictionary is in use under any constraint on the codewords other than being of equal length" [28, p. 159]. Here optimality means achieving the minimum number of bits in compressing each individual input string, without any assumption on its generating source. In this paper we investigate three issues pertaining to the bit-complexity of LZ-based compressors, and we design algorithms which achieve bit-optimality in the compressed output size while taking efficient/optimal time and optimal space. These theoretical results are sustained by experiments that compare our novel LZ-based compressors against the most popular compression tools (like gzip and bzip2) and state-of-the-art compressors (like the booster of [14, 13]).
See at:
CNR IRIS
| www.siam.org
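To fix ideas, the sketch below shows a plain greedy LZ77 parser producing (distance, length, next-symbol) phrases. It is not the bit-optimal parser studied in the paper: the greedy longest-match rule ignores how many bits the variable-length codewords of each phrase will cost, which is precisely the gap the paper addresses. Names and the window size are illustrative.

```python
def lz77_greedy(s, window=4096):
    """Greedy longest-match LZ77 parse into (distance, length, next_char) phrases."""
    i, phrases = 0, []
    while i < len(s):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):      # candidate match starts
            k = 0
            while i + k < len(s) - 1 and s[j + k] == s[i + k]:
                k += 1                              # extend the copy as far as possible
            if k > best_len:
                best_len, best_dist = k, i - j
        nxt = s[i + best_len]                       # the literal closing the phrase
        phrases.append((best_dist, best_len, nxt))
        i += best_len + 1
    return phrases

print(lz77_greedy("abababbbc"))  # [(0, 0, 'a'), (0, 0, 'b'), (2, 4, 'b'), (6, 1, 'c')]
```

Once each triple is encoded with variable-length codes, the choice of phrases determines the output bit size, which is the quantity the bit-optimal parsers of the paper minimize.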
2014
Book
Restricted
Compressed data structures for strings. On searching and extracting strings from compressed textual data
Venturini R
Data compression is mandatory to manage massive datasets; indexing is fundamental to query them. However, their goals appear counterposed: the former aims at minimizing data redundancies, whereas the latter augments the dataset with auxiliary information to speed up query resolution. In this monograph we introduce solutions that overcome this dichotomy. We start by presenting the use of optimization techniques to improve the compression of classical data compression algorithms, then we move to the design of compressed data structures providing fast random access or efficient pattern matching queries on the compressed dataset. These theoretical studies are supported by experimental evidence of their impact in practical scenarios.
Source: ATLANTIS STUDIES IN COMPUTING
DOI: 10.2991/978-94-6239-033-1
See at:
doi.org
| CNR IRIS
| link.springer.com
2013
Other
Restricted
Cache-oblivious peeling of random hypergraphs
Belazzougui D, Boldi P, Ottaviano G, Venturini R, Vigna S
The computation of a peeling order in a randomly generated hypergraph is the most time-consuming step in a number of constructions, such as perfect hashing schemes, random r-SAT solvers, error-correcting codes, and approximate set encodings. While there exists a straightforward linear-time algorithm, its poor I/O performance makes it impractical for hypergraphs whose size exceeds the available internal memory. We show how to reduce the computation of a peeling order to a small number of sequential scans and sorts, and analyze its I/O complexity in the cache-oblivious model. The resulting algorithm requires O(sort(n)) I/Os and O(n log n) time to peel a random hypergraph with n edges. We experimentally evaluate the performance of our implementation of this algorithm in a real-world scenario by using the construction of minimal perfect hash functions (MPHF) as our test case: our algorithm builds an MPHF of 7.6 billion keys in less than 21 hours on a single machine. The resulting data structure is both more space-efficient and faster than that obtained with the current state-of-the-art MPHF construction for large-scale key sets.
See at:
CNR IRIS
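For context, a minimal in-memory sketch of the straightforward linear-time peeling procedure that the paper re-engineers for the cache-oblivious setting: repeatedly pick a vertex of degree one and remove it together with its unique incident hyperedge. This is only the baseline the abstract refers to, not the paper's scan-and-sort algorithm; names are hypothetical.

```python
from collections import defaultdict, deque

def peel(num_vertices, edges):
    """edges: list of hyperedges (tuples of vertices).
    Returns a peeling order of edge ids, or None if the hypergraph cannot be fully peeled."""
    incident = defaultdict(set)                 # vertex -> ids of still-alive incident edges
    for e, verts in enumerate(edges):
        for v in verts:
            incident[v].add(e)
    queue = deque(v for v in range(num_vertices) if len(incident[v]) == 1)
    order, alive = [], [True] * len(edges)
    while queue:
        v = queue.popleft()
        if len(incident[v]) != 1:
            continue                            # its only edge was peeled in the meantime
        e = next(iter(incident[v]))
        order.append(e)
        alive[e] = False
        for u in edges[e]:                      # detach the peeled edge from all its vertices
            incident[u].discard(e)
            if len(incident[u]) == 1:
                queue.append(u)
    return order if not any(alive) else None

# A 3-uniform hypergraph on 6 vertices with two edges: fully peelable.
print(peel(6, [(0, 1, 2), (2, 3, 4)]))          # e.g. [0, 1]
```

The random accesses into the incidence structure are what ruin I/O performance once the hypergraph no longer fits in internal memory; the paper replaces them with a small number of sequential scans and sorts.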
2013
Journal article
Open Access
On the bit-complexity of Lempel-Ziv compression
Ferragina P, Nitto I, Venturini R
One of the most famous and investigated lossless data-compression schemes is the one introduced by Lempel and Ziv about 30 years ago [38]. This compression scheme is known as a dictionary-based compressor and consists of squeezing an input string by replacing some of its substrings with (shorter) codewords, which are actually pointers to a dictionary of phrases built as the string is processed. Surprisingly enough, although many fundamental results are nowadays known about the speed and effectiveness of this compression process, "we are not aware of any parsing scheme that achieves optimality [...] under any constraint on the codewords other than being of equal length" [29, p. 159]. Here optimality means achieving the minimum number of bits in compressing each individual input string, without any assumption on its generating source. In this paper we investigate some issues pertaining to the bit-complexity of LZ77-based compressors, the most powerful variant of the LZ-compression scheme, and we design algorithms which achieve bit-optimality in the compressed output size while taking efficient/optimal time and optimal space.
Source: SIAM JOURNAL ON COMPUTING, vol. 42 (issue 4), pp. 1521-1541
DOI: 10.1137/120869511
DOI: 10.1137/1.9781611973068.84
See at:
SIAM Journal on Computing
| www.di.unipi.it
| doi.org
| epubs.siam.org
| CNR IRIS
| www.scopus.com
2013
Contribution to book
Restricted
Web search
Ferragina P, Venturini R
Faced with the massive amount of information on the Web, which nowadays includes not only texts but any kind of file (audio, video, images, etc.), Web users tend to lose their way when browsing, falling into what psychologists call "getting lost in hyperspace". Search engines alleviate this by presenting the most relevant pages that best match the user's information needs. Collecting a large part of the pages on the Web, extrapolating a user's information need expressed by means of often ambiguous queries, and establishing the importance of Web pages and their relevance for a query are just a few examples of the difficult problems that search engines address every day to achieve their ambitious goal. In this chapter, we introduce the concepts and the algorithms that lie at the core of modern search engines by providing running examples that simplify understanding, and we comment on some recent and powerful tools and functionalities that should increase the ability of users to match their information needs on the Web.
DOI: 10.1007/978-3-642-39652-6_5
See at:
CNR IRIS
| link.springer.com
2013
Conference article
Open Access
Dynamic compressed strings with random access
Grossi R, Raman R, Rao S, Venturini R
We consider the problem of storing a string S in dynamic compressed form, while permitting operations directly on the compressed representation of S: access a substring of S; replace, insert or delete a symbol in S; count how many occurrences of a given symbol appear in any given prefix of S (called the rank operation); and locate the position of the i-th occurrence of a symbol inside S (called the select operation). We discuss the time complexity of several combinations of these operations along with the entropy space bounds of the corresponding compressed indexes. In this way, we extend or improve the bounds of previous work by Ferragina and Venturini [TCS, 2007], Jansson et al. [ICALP, 2012], and Nekrich and Navarro [SODA, 2013].
DOI: 10.1007/978-3-642-39206-1_43
See at:
figshare.com
| doi.org
| CNR IRIS
| link.springer.com
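To pin down the semantics of the operations listed in the abstract, here is a deliberately naive, uncompressed Python rendering of access, insert, delete, rank and select; the paper's contribution is supporting this same interface on a compressed, dynamic representation of S within entropy space bounds, which this sketch does not attempt. The class name is hypothetical.

```python
class NaiveDynamicString:
    """Uncompressed stand-in that only fixes the meaning of the supported operations."""
    def __init__(self, s):
        self.s = list(s)

    def access(self, i, j):            # substring S[i..j)
        return "".join(self.s[i:j])

    def insert(self, i, c):            # insert symbol c before position i
        self.s.insert(i, c)

    def delete(self, i):               # delete the symbol at position i
        del self.s[i]

    def rank(self, c, i):              # occurrences of c in the prefix S[0..i)
        return self.s[:i].count(c)

    def select(self, c, k):            # position of the k-th (1-based) occurrence of c
        seen = 0
        for pos, x in enumerate(self.s):
            seen += x == c
            if seen == k:
                return pos
        raise ValueError("fewer than k occurrences of %r" % c)

S = NaiveDynamicString("mississippi")
print(S.rank("s", 5), S.select("i", 2))   # 2 's' in 'missi'; the 2nd 'i' is at index 4
```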
2014
Conference article
Open Access
Cache-oblivious peeling of random hypergraphs
Belazzougui D, Boldi P, Ottaviano G, Venturini R, Vigna S
The computation of a peeling order in a randomly generated hypergraph is the most time-consuming step in a number of constructions, such as perfect hashing schemes, random r-SAT solvers, error-correcting codes, and approximate set encodings. While there exists a straightforward linear-time algorithm, its poor I/O performance makes it impractical for hypergraphs whose size exceeds the available internal memory. We show how to reduce the computation of a peeling order to a small number of sequential scans and sorts, and analyze its I/O complexity in the cache-oblivious model. The resulting algorithm requires O(sort(n)) I/Os and O(n log n) time to peel a random hypergraph with n edges. We experimentally evaluate the performance of our implementation of this algorithm in a real-world scenario by using the construction of minimal perfect hash functions (MPHF) as our test case: our algorithm builds an MPHF of 7.6 billion keys in less than 21 hours on a single machine. The resulting data structure is both more space-efficient and faster than that obtained with the current state-of-the-art MPHF construction for large-scale key sets.
DOI: 10.1109/dcc.2014.48
DOI: 10.48550/arxiv.1312.0526
Project(s): MIDAS, NADINE
See at:
arXiv.org e-Print Archive
| arxiv.org
| doi.org
| CNR IRIS
2014
Conference article
Restricted
Bicriteria data compression: efficient and usable
Farruggia A, Ferragina P, Venturini R
Lempel-Ziv's LZ77 algorithm is the de facto choice for compressing massive datasets (see e.g. Snappy in BigTable, Lz4 in Cassandra) because its algorithmic structure is flexible enough to guarantee very fast decompression speed at reasonable compressed-space occupancy. Recent theoretical results have shown how to design a bit-optimal LZ77-compressor, which minimizes the compressed size, and how to deploy it in order to design a bicriteria data compressor, namely an LZ77-compressor which trades compressed-space occupancy versus decompression time in a smooth and principled way. Preliminary experiments were promising but raised many algorithmic and engineering questions which have to be addressed in order to turn these algorithmic results into an effective and practical tool. In this paper we address these issues by first designing a novel bit-optimal LZ77-compressor which is simple, cache-aware and asymptotically optimal. We benchmark our approach by investigating several algorithmic and implementation issues over many dataset types and sizes, and against an ample class of classic (LZ-based, PPM-based and BWT-based) as well as engineered compressors (Snappy, Lz4, and Lzma2). We conclude by noting how our novel bicriteria LZ77-compressor improves the state of the art of the fast (de)compressors Snappy and Lz4.
DOI: 10.1007/978-3-662-44777-2_34
See at:
doi.org
| CNR IRIS
| link.springer.com
2014
Conference article
Restricted
Bicriteria data compression
Farruggia A, Frangioni A, Ferragina P, Venturini R
In this paper we address the problem of trading optimally, and in a principled way, the compressed size and decompression time of LZ77 parsings by introducing what we call the Bicriteria LZ77-Parsing problem. The goal is to determine an LZ77 parsing which minimizes the space occupancy in bits of the compressed file, provided that the decompression time is bounded by T. Symmetrically, we can exchange the role of the two resources and thus ask to minimize the decompression time provided that the compressed space is bounded by a fixed amount given in advance.
See at:
dl.acm.org
| CNR IRIS
2014
Conference article
Open Access
Partitioned Elias-Fano indexes
Ottaviano G, Venturini R
The Elias-Fano representation of monotone sequences has recently been applied to the compression of inverted indexes, showing excellent query performance thanks to its efficient random access and search operations. While its space occupancy is competitive with some state-of-the-art methods such as γ-δ-Golomb codes and PForDelta, it fails to exploit the local clustering that inverted lists usually exhibit, namely the presence of long subsequences of close identifiers. In this paper we describe a new representation based on partitioning the list into chunks and encoding both the chunks and their endpoints with Elias-Fano, hence forming a two-level data structure. This partitioning enables the encoding to better adapt to the local statistics of each chunk, thus exploiting clustering and improving compression. We present two partition strategies, with fixed- and variable-length chunks respectively. For the latter case we introduce a linear-time optimization algorithm which identifies the minimum-space partition up to an arbitrarily small approximation factor. We show that our partitioned Elias-Fano indexes offer significantly better compression than plain Elias-Fano, while preserving their query-time efficiency. Furthermore, compared with other state-of-the-art compressed encodings, our indexes exhibit the best compression ratio/query time trade-off.
DOI: 10.1145/2600428.2609615
See at:
www.di.unipi.it
| dl.acm.org
| doi.org
| CNR IRIS
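For reference, a minimal sketch of the plain Elias-Fano encoding that the partitioned index builds on: each value of a monotone sequence over a universe of size u is split into roughly log2(u/n) explicit low bits and a high part stored as unary-coded bucket gaps (about two bits per element). This illustrates only the base representation, not the two-level partitioned structure of the paper; names are illustrative.

```python
import math

def elias_fano_encode(values, universe):
    """Split each value into low bits (stored verbatim) and a unary-coded high part."""
    n = len(values)
    low_bits = max(0, int(math.floor(math.log2(universe / n))))
    low, high_unary, prev_bucket = [], [], 0
    for v in values:                            # values must be non-decreasing
        low.append(v & ((1 << low_bits) - 1))   # the low_bits least-significant bits
        bucket = v >> low_bits                  # the remaining high bits
        high_unary.append("0" * (bucket - prev_bucket) + "1")   # unary gap code
        prev_bucket = bucket
    return low_bits, low, "".join(high_unary)

low_bits, low, high = elias_fano_encode([3, 4, 7, 13, 14, 15, 21, 43], universe=44)
print(low_bits, low, high)   # 2 low bits per value, ~2 bits per value for the high part
```

Because the low parts are fixed-width and the high part is a short bitmap, random access and search stay simple, which is the property the partitioned variant preserves chunk by chunk while adapting to local statistics.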