2009
Conference article
Open Access
On optimally partitioning a text to improve its compression
Ferragina P., Nitto I., Venturini R.In this paper we investigate the problem of partitioning an input string T in such a way that compressing individually its parts via a base-compressor gets a compressed output that is shorter than applying over the entire T at once. This problem was introduced in [2,3] in the context of table compression, and further elaborated and extended to strings and trees by [10,11,20], but it is still open how to efficiently compute the optimal partition [4]. In this paper we provide the first algorithm which is guaranteed to compute in O(n polylog(n)) time a partition of T whose compressed output is guaranteed to be no more than (1 + ε)-worse the optimal one, where ε is any positive constant.Source: Algorithms. 17th Annual European Symposium - ESA 2009, pp. 420–431, Copenhagen, Denmark, 7-9 September 2009
DOI: 10.1007/978-3-642-04128-0_38DOI: 10.48550/arxiv.0906.4692DOI: 10.1007/s00453-010-9437-6Metrics:
See at:
arXiv.org e-Print Archive | Algorithmica | doi.org | Algorithmica | doi.org | link.springer.com | CNR ExploRA
2010
Journal article
Open Access
On compact representations of all-pairs-shortest-path-distance matrices
Ferragina P., Nitto I., Venturini R.Let G be an unweighted and undirected graph of n nodes, and let be the n×n matrix storing the All-Pairs-Shortest-Path Distances in G. Since contains integers in [n] U +∞, its plain storage takes n. log(n+1) bits. However, a simple counting argument shows that n. /2 bits are necessary to store . In this paper we investigate the question of finding a succinct representation of that requires O(n. ) bits of storage and still supports constant-time access to each of its entries. This is asymptotically optimal in the worst case, and far from the information-theoretic lower bound by a multiplicative factor log_2 3. As a result O(1) bits per pairs of nodes in G are enough to retain constant-time access to their shortest-path distance. We achieve this result by reducing the storage of to the succinct storage of labeled trees and ternary sequences, for which we properly adapt and orchestrate the use of known compressed data structures. This approach can be easily and optimally extended to graphs whose edge weights are positive integers bounded by a constant value.Source: Theoretical computer science 411 (2010): 3293–3300. doi:10.1016/j.tcs.2010.05.021
DOI: 10.1016/j.tcs.2010.05.021Metrics:
See at:
Theoretical Computer Science | www.sciencedirect.com | CNR ExploRA
2010
Journal article
Restricted
The compressed permuterm index
Ferragina P., Venturini R.The Permuterm index [Garfield 1976] is a time-efficient and elegant solution to the string dictionary problem in which pattern queries may possibly include one wild-card symbol (called Tolerant Retrieval problem). Unfortunately the Permuterm index is space inefficient because it quadruples the dictionary size. In this article we propose the Compressed Permuterm Index which solves the Tolerant Retrieval problem in time proportional to the length of the searched pattern, and space close to the kth order empirical entropy of the indexed dictionary. We also design a dynamic version of this index that allows to efficiently manage insertion in, and deletion from, the dictionary of individual strings. The result is based on a simple variant of the Burrows-Wheeler Transform, defined on a dictionary of strings of variable length, that allows to efficiently solve the Tolerant Retrieval problem via known (dynamic) compressed indexes [Navarro and Mäkinen 2007]. We will complement our theoretical study with a significant set of experiments that show that the Compressed Permuterm Index supports fast queries within a space occupancy that is close to the one achievable by compressing the string dictionary via gzip or bzip. This improves known approaches based on Front-Coding [Witten et al. 1999] by more than 50% in absolute space occupancy, still guaranteeing comparable query time.Source: ACM transactions on algorithms 7 (2010): 10–21. doi:10.1145/1868237.1868248
DOI: 10.1145/1868237.1868248Metrics:
See at:
dl.acm.org | ACM Transactions on Algorithms | CNR ExploRA
2011
Journal article
Open Access
On optimally partitioning a text to improve its compression
Ferragina P., Nitto I., Venturini R.In this paper we investigate the problem of partitioning an input string T in such a way that compressing individually its parts via a base-compressor C gets a compressed output that is shorter than applying C over the entire T at once. This problem was introduced in Buchsbaum et al. (Proc. of 11th ACM-SIAM Symposium on Discrete Algorithms, pp. 175-184, 2000; J. ACM 50(6):825-851, 2003) in the context of table compression, and then further elaborated and extended to strings and trees by Ferragina et al. (J. ACM 52:688-713, 2005; Proc. of 46th IEEE Symposium on Foundations of Computer Science, pp. 184-193, 2005) and Mäkinen and Navarro (Proc. of 14th Symposium on String Processing and Information Retrieval, pp. 229-241, 2007). Unfortunately, the literature offers poor solutions: namely, we know either a cubic-time algorithm for computing the optimal partition based on dynamic programming (Buchsbaum et al. in J. ACM 50(6):825-851, 2003; Giancarlo and Sciortino in Proc. of 14th Symposium on Combinatorial Pattern Matching, pp. 129-143, 2003), or few heuristics that do not guarantee any bounds on the efficacy of their computed partition (Buchsbaum et al. in Proc. of 11th ACM-SIAM Symposium on Discrete Algorithms, pp. 175-184, 2000; J. ACM 50(6):825-851, 2003), or algorithms that are efficient but work in some specific scenarios (such as the Burrows-Wheeler Transform, see e.g. Ferragina et al. in J. ACM 52:688-713, 2005; Mäkinen and Navarro in Proc. of 14th Symposium on String Processing and Information Retrieval, pp. 229-241, 2007) and achieve compression performance that might be worse than the optimal-partitioning by a Ω(log n/log log n) factor. Therefore, computing efficiently the optimal solution is still open (Buchsbaum and Giancarlo in Encyclopedia of Algorithms, pp. 939-942, 2008). In this paper we provide the first algorithm which computes in O(nlog 1+ε n) time and O(n) space, a partition of T whose compressed output is guaranteeSource: Algorithmica 61 (2011): 51–74. doi:10.1007/s00453-010-9437-6
DOI: 10.1007/s00453-010-9437-6DOI: 10.48550/arxiv.0906.4692DOI: 10.1007/978-3-642-04128-0_38Metrics:
See at:
arXiv.org e-Print Archive | Algorithmica | doi.org | Algorithmica | doi.org | www.springerlink.com | CNR ExploRA
2009
Conference article
Restricted
On the bit-complexity of Lempel-Ziv compression
Ferragina P., Nitto I., Venturini R.One of the most famous and investigated lossless data-compression schemes is the one introduced by Lempel and Ziv about 30 years ago [37]. This compression scheme is known as "dictionary-based compressor" and consists of squeezing an input string by replacing some of its substrings with (shorter) codewords which are actually pointers to a dictionary of phrases built as the string is processed. Surprisingly enough, although many fundamental results are nowadays known about the speed and effectiveness of this compression process (see e.g. [23, 28] and references therein), "we are not aware of any parsing scheme that achieves optimality when the LZ77-dictionary is in use under any constraint on the codewords other than being of equal length" [28, pag. 159]. Here optimality means to achieve the minimum number of bits in compressing each individual input string, without any assumption on its generating source. In this paper we investigate three issues pertaining to the bit-complexity of LZ-based compressors, and we design algorithms which achieve bit-optimality in the compressed output size by taking efficient/optimal time and optimal space. These theoretical results will be sustained by some experiments that will compare our novel LZ-based compressors against the most popular compression tools (like gzip, bzip2) and state-of-the-art compressors (like the booster of [14, 13]).Source: Symposium on Discrete Algorithms, pp. 768–777, New York, 4-6 January 2009
See at:
www.siam.org | CNR ExploRA
2011
Conference article
Restricted
Space-efficient substring occurrence estimation
Orlandi Alessio, Venturini RossanoWe study the problem of estimating the number of occurrences of substrings in textual data: A text T on some alphabet Sigma of size sigma is preprocessed and an index I is built. The index is used in lieu of the text to answer queries of the form CountH(P), returning an approximated number of the occurrences of an arbitrary pattern P as a substring of T. The problem has its main application in selectivity estimation related to the LIKE predicate in textual databases. Our focus is on obtaining an algorithmic solution with guaranteed error rates and small footprint. To achieve that, we first enrich previous work in the area of compressed text-indexing providing an optimal data structure that requires (|T|log sigma/l) bits where l e 1 is the additive error on any answer. We also approach the issue of guaranteeing exact answers for sufficiently frequent patterns, providing a data structure whose size scales with the amount of such patterns. Our theoretical findings are sustained by experiments showing the practical impact of our data structures.Source: 30th symposium on Principles of Database Systems of Data, PODS 2011, pp. 95–106, Athens, Greece, 12-16 June 2011
DOI: 10.1145/1989284.1989300DOI: 10.1007/s00453-014-9936-yMetrics:
See at:
Algorithmica | doi.org | portal.acm.org | CNR ExploRA
2011
Conference article
Open Access
Distribution-aware compressed full-text indexes
Ferragina P., Sirén, J., Venturini R.In this paper we address the problem of building a compressed self-index that, given a distribution for the pattern queries and a bound on the space occupancy, minimizes the expected query-time within that index-space bound. We solve this problem by exploiting a reduction to the problem of finding a minimum weight K-link path in a particular Directed Acyclic Graph. Interestingly enough, our solution is independent of the underlying compressed index in use. Our experiments compare this optimal strategy with several other standard approaches, showing its effectiveness in practice.Source: Algorithms - ESA 2011. 19th Annual European Symposium, pp. 760–771, Saarbrucken, Germany, 5-9 September 2011
DOI: 10.1007/978-3-642-23719-5_64DOI: 10.1007/s00453-013-9782-3Project(s): MIDAS Metrics:
See at:
HELDA - Digital Repository of the University of Helsinki | Algorithmica | doi.org | Algorithmica | www.springerlink.com | CNR ExploRA
2012
Conference article
Open Access
Compressed String Dictionary Look-up with Edit Distance One
Belazzougui D., Venturini R.In this paper we present dierent solutions for the problem of indexing a dictionary of strings in compressed space. Given a pattern P, the index has to report all the strings in the dictionary having edit distance at most one with P. Our rst solution is able to solve queries in (almost optimal) O(|P|+occ) time where occ is the number of strings in the dictionary having edit distance at most one with P. The space complexity of this solution is bounded in terms of the k-th order entropy of the indexed dictionary. Our second solution further improves this space complexity at the cost of increasing the query timeSource: 23rd Annual Symposium on Combinatorial Pattern Matching, pp. 280–292, Helsinky, 1 Giugno 2012
DOI: 10.1007/978-3-642-31265-6_23Metrics:
See at:
hpc.isti.cnr.it | doi.org | CNR ExploRA
2014
Book
Restricted
Compressed data structures for strings. On searching and extracting strings from compressed textual data
Venturini R.Data compression is mandatory to manage massive datasets, indexing is fundamental to query them. However, their goals appear as counterposed: the former aims at minimizing data redundancies, whereas the latter augments the dataset with auxiliary information to speed up the query resolution. In this monograph we introduce solutions that overcome this dichotomy. We start by presenting the use of optimization techniques to improve the compression of classical data compression algorithms, then we move to the design of compressed data structures providing fast random access or efficient pattern matching queries on the compressed dataset. These theoretical studies are supported by experimental evidences of their impact in practical scenarios.Source: Amsterdam: Atlantis Press, 2014
DOI: 10.2991/978-94-6239-033-1Metrics:
See at:
doi.org | link.springer.com | CNR ExploRA
2013
Contribution to book
Restricted
Web search
Ferragina P., Venturini R.Faced with the massive amount of information on the Web, which includes not only texts but nowadays any kind of file (audio, video, images, etc.), Web users tend to lose their way when browsing the Web, falling into what psychologists call "getting lost in hyperspace". Search engines alleviate this by presenting the most relevant pages that better match the user's information needs. Collecting a large part of the pages in the Web, extrapolating a user information need expressed by means of often ambiguous queries, establishing the importance of Web pages and their relevance for a query, are just a few examples of the difficult problems that search engines address every day to achieve their ambitious goal. In this chapter, we introduce the concepts and the algorithms that lie at the core of modern search engines by providing running examples that simplify understanding, and we comment on some recent and powerful tools and functionalities that should increase the ability of users to match in the Web their information needs.Source: The power of algorithms. Inspiration and Examples in Everyday Life, edited by Giorgio Ausiello, Rossella Petreschi, pp. 107–137. Berlin: Springer-Verlag, 2013
DOI: 10.1007/978-3-642-39652-6_5Metrics:
See at:
link.springer.com | CNR ExploRA
2013
Conference article
Restricted
Compressed cache-oblivious string B-tree
Ferragina P., Venturini R.In this paper we address few variants of the well-known prefix-search problem in strings, and provide solutions for the cache-oblivious model which improve the best known results. In particular we close (asymptotically) the classic problem which asks for searching the set of strings sharing the longest common prefix with a queried pattern. Indeed, we provide an I/O-optimal solution matching the space lower bound for tries up to a constant multiplicative factor $(1+epsilon)$, for $epsilon>0$. Our solutions hinge upon few technicalities and, more interestingly, on a novel compressed storage scheme which adds to the elegant Locality Preserving Front-Coding (Bender {em et al.} 2006) the ability to decompress prefixes of the stored strings I/O-optimally still preserving its space bounds.Source: ESA 2013 - Algorithms. 21st Annual European Symposium, pp. 469–480, Sophia Antipolis, France, 2-4 September 2013
DOI: 10.1007/978-3-642-40450-4_40DOI: 10.1145/2903141Project(s): SoBigData Metrics:
See at:
doi.org | ACM Transactions on Algorithms | link.springer.com | CNR ExploRA
2013
Conference article
Open Access
Dynamic compressed strings with random access
Grossi R., Raman R., Rao S., Venturini R.We consider the problem of storing a string $S$ in dynamic compressed form, while permitting operations directly on the compressed representation of $S$: access a substring of $S$; replace, insert or delete a symbol in $S$; count how many occurrences of a given symbol appear in any given prefix of $S$ (called rank operation) and locate the position of the $i$th occurrence of a symbol inside $S$ (called select operation). We discuss the time complexity of several combinations of these operations along with the entropy space bounds of the corresponding compressed indexes. In this way, we extend or improve the bounds of previous work by Ferragina and Venturini [TCS,~2007], Jansson et al.mbox{} [ICALP,~2012], Nekrich and Navarro [SODA,~2013].Source: ICALP 2013 - Automata, Languages, and Programming. 40th International Colloquium, pp. 504–515, Riga, Latvia, 8-12 July 2013
DOI: 10.1007/978-3-642-39206-1_43Metrics:
See at:
figshare.com | doi.org | link.springer.com | CNR ExploRA
2013
Conference article
Restricted
Compressed static functions with applications
Belazzougui D., Venturini R.Given a set of integer keys from a bounded universe along with associated data, the dictionary problem asks to answer two queries: membership and retrieval. Membership has to tell whether a given element is in the dictionary or not; Retrieval has to return the data associated with the searched key. In this paper we provide time and space optimal solutions for three well-established relaxations of this basic problem: (Compressed) Static functions, Approximate membership and Relative membership.Source: 24th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 229–240, New orleans, Louisiana, USA, 5 gennaio 2013
Project(s): MIDAS
See at:
knowledgecenter.siam.org | CNR ExploRA
2014
Conference article
Restricted
Bicriteria data compression: efficient and usable
Farruggia A., Ferragina P., Venturini R.Lempel-Ziv's LZ77 algorithm is the de facto choice for compressing massive datasets (see e.g., Snappy in BigTable, Lz4 in Cassandra) because its algorithmic structure is flexible enough to guarantee very fast decompression speed at reasonable compressed-space occupancy. Recent theoretical results have shown how to design a bit-optimal LZ77-compressor which minimizes the compress size and how to deploy it in order to design a bicriteria data compressor, namely an LZ77-compressor which trades compressed-space occupancy versus its decompression time in a smoothed and principled way. Preliminary experiments were promising but raised many algorithmic and engineering questions which have to be addressed in order to turn these algorithmic results into an effective and practical tool. In this paper we address these issues by first designing a novel bit-optimal LZ77-compressor which is simple, cache-aware and asymptotically optimal. We benchmark our approach by investigating several algorithmic and implementation issues over many dataset types and sizes, and against an ample class of classic (LZ-based, PPM-based and BWT-based) as well as engineered compressors (Snappy, Lz4, and Lzma2). We conclude noticing how our novel bicriteria LZ77-compressor improves the state-of-the-art of fast (de) compressors Snappy and Lz4.Source: ESA 2014 - Algorithms. 22th Annual European Symposium, pp. 406–417, Wroclaw, Poland, 8-10 September 2014
DOI: 10.1007/978-3-662-44777-2_34Metrics:
See at:
doi.org | link.springer.com | CNR ExploRA
2014
Conference article
Restricted
Bicriteria data compression
Farruggia A., Frangioni A., Ferragina P., Venturini R.In this paper we address the problem of trading optimally, and in a principled way, the compressed size/decompression time of LZSE{} parsings by introducing what we call the {em Bicriteria LZSE{}-Parsing problem}. The goal is to determine an LZ77 parsing which minimizes the space occupancy in bits of the compressed file, provided that the decompression time is bounded by T. Symmetrically, we can exchange the role of the two resources and thus ask for minimizing the decompression time provided that the compressed space is bounded by a fixed amount given in advance.Source: SODA'14 - Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1582–1595, Portland, Oregon, US, 5-7 January 2014
See at:
dl.acm.org | CNR ExploRA
2016
Journal article
Restricted
Space-Efficient Substring Occurrence Estimation
Orlandi A., Venturini R.In this paper we study the problem of estimating the number of occurrences of substrings in textual data: A text on some alphabet of length is preprocessed and an index is built. The index is used in lieu of the text to answer queries of the form , returning an approximated number of the occurrences of an arbitrary pattern as a substring of . The problem has its main application in selectivity estimation related to the LIKE predicate in textual databases. Our focus is on obtaining an algorithmic solution with guaranteed error rates and small footprint. To achieve that, we first enrich previous work in the area of compressed text-indexing providing an optimal data structure that, for a given additive error , requires bits. We also approach the issue of guaranteeing exact answers for sufficiently frequent patterns, providing a data structure whose size scales with the amount of such patterns. Our theoretical findings are supported by experiments showing the practical impact of our data structures.Source: Algorithmica 74 (2016): 65–90. doi:10.1007/s00453-014-9936-y
DOI: 10.1007/s00453-014-9936-yDOI: 10.1145/1989284.1989300Metrics:
See at:
Algorithmica | doi.org | link.springer.com | CNR ExploRA
2016
Journal article
Restricted
Compressed Cache-Oblivious String B-Tree
Ferragina P., Venturini R.In this article, we study three variants of the well-known prefix-search problem for strings, and we design solutions for the cache-oblivious model which improve the best known results. Among these contributions, we close (asymptotically) the classic problem, which asks for the detection of the set of strings that share the longest common prefix with a queried pattern by providing an I/O-optimal solution that matches the space lower bound for tries up to a constant multiplicative factor of the form (1 + epsilon), for epsilon > 0. Our solutions hinge upon a novel compressed storage scheme that adds the ability to decompress prefixes of the stored strings I/O-optimally to the elegant locality-preserving front coding (Bender et al. 2006) still preserving its space bounds.Source: ACM transactions on algorithms 12 (2016). doi:10.1145/2903141
DOI: 10.1145/2903141DOI: 10.1007/978-3-642-40450-4_40Project(s): SoBigData Metrics:
See at:
dl.acm.org | doi.org | ACM Transactions on Algorithms | CNR ExploRA