2011
Journal article  Open Access

On optimally partitioning a text to improve its compression

Ferragina P., Nitto I., Venturini R.

Compression  Computer Science - Data Structures and Algorithms  Computer Science - Information Theory  General Computer Science  Information Theory (cs.IT)  E.4 Coding and Information Theory  FOS: Computer and information sciences  Analysis of Algorithms and Problem Complexity  Computer Science Applications  Dynamic Programming  Applied Mathematics  Information theory  Data Structures and Algorithms (cs.DS) 

In this paper we investigate the problem of partitioning an input string T in such a way that compressing individually its parts via a base-compressor C gets a compressed output that is shorter than applying C over the entire T at once. This problem was introduced in Buchsbaum et al. (Proc. of 11th ACM-SIAM Symposium on Discrete Algorithms, pp. 175-184, 2000; J. ACM 50(6):825-851, 2003) in the context of table compression, and then further elaborated and extended to strings and trees by Ferragina et al. (J. ACM 52:688-713, 2005; Proc. of 46th IEEE Symposium on Foundations of Computer Science, pp. 184-193, 2005) and Mäkinen and Navarro (Proc. of 14th Symposium on String Processing and Information Retrieval, pp. 229-241, 2007). Unfortunately, the literature offers poor solutions: namely, we know either a cubic-time algorithm for computing the optimal partition based on dynamic programming (Buchsbaum et al. in J. ACM 50(6):825-851, 2003; Giancarlo and Sciortino in Proc. of 14th Symposium on Combinatorial Pattern Matching, pp. 129-143, 2003), or few heuristics that do not guarantee any bounds on the efficacy of their computed partition (Buchsbaum et al. in Proc. of 11th ACM-SIAM Symposium on Discrete Algorithms, pp. 175-184, 2000; J. ACM 50(6):825-851, 2003), or algorithms that are efficient but work in some specific scenarios (such as the Burrows-Wheeler Transform, see e.g. Ferragina et al. in J. ACM 52:688-713, 2005; Mäkinen and Navarro in Proc. of 14th Symposium on String Processing and Information Retrieval, pp. 229-241, 2007) and achieve compression performance that might be worse than the optimal-partitioning by a Ω(log n/log log n) factor. Therefore, computing efficiently the optimal solution is still open (Buchsbaum and Giancarlo in Encyclopedia of Algorithms, pp. 939-942, 2008). In this paper we provide the first algorithm which computes in O(nlog 1+ε n) time and O(n) space, a partition of T whose compressed output is guarantee

Source: Algorithmica 61 (2011): 51–74. doi:10.1007/s00453-010-9437-6

Publisher: Springer Science + Business Media, New York , Stati Uniti d'America


[1] J.L. Bentley and M.D. McIlroy. Data compression with long repeated strings. Information Sciences, 135(1-2):1-11, 2001.
[2] A. L. Buchsbaum, D. F. Caldwell, K. W. Church, G. S. Fowler, and S. Muthukrishnan. Engineering the compression of massive tables: an experimental approach. In Procs ACM-SIAM SODA, pages 175-184, 2000.
[3] Adam L. Buchsbaum, Glenn S. Fowler, and Raffaele Giancarlo. Improving table compression with combinatorial optimization. J. ACM, 50(6):825- 851, 2003.
[4] A.L. Buchsbaum and R. Giancarlo. Table compression. In M.Y. Kao, editor, Encyclopedia of Algorithms, pages 939-942. Springer, 2008.
[5] M. Burrows and D. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994.
[6] F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R.E. Gruber. Bigtable: A distributed storage system for structured data. ACM Trans. Comput. Syst., 26(2), 2008.
[7] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, Second Edition. The MIT Press and McGraw-Hill Book Company, 2001.
[8] P. Ferragina, R. Giancarlo, and G. Manzini. The engineering of a compression boosting library: Theory vs practice in BWT compression. In Proc. 14th European Symposium on Algorithms (ESA '06), pages 756-767. Springer Verlag LNCS n. 4168, 2006.
[9] P. Ferragina, R. Giancarlo, and G. Manzini. The myriad virtues of wavelet trees. Information and Computation, 207:849-866, 2009.
[10] P. Ferragina, R. Giancarlo, G. Manzini, and M. Sciortino. Boosting textual compression in optimal linear time. Journal of the ACM, 52:688-713, 2005.
[11] P. Ferragina, F. Luccio, G. Manzini, and S. Muthukrishnan. Structuring labeled trees for optimal succinctness, and beyond. In Proc. 46th IEEE Symposium on Foundations of Computer Science (FOCS), pages 184-193, 2005.
[12] P. Ferragina, F. Luccio, G. Manzini, and S. Muthukrishnan. Compressing and searching xml data via two zips. In Proc. 15th International World Wide Web Conference (WWW), pages 751-760, 2006.
[13] P. Ferragina and R. Venturini. The compressed permuterm index. ACM Transactions on Algorithms (to appear), 2009.
[14] R. Giancarlo, A. Restivo, and M. Sciortino. From first principles to the burrows and wheeler transform and beyond, via combinatorial optimization. Theoretical Computer Science, 387(3):236-248, 2007.
[15] R. Giancarlo and M. Sciortino. Optimal partitions of strings: A new class of Burrows-Wheeler compression algorithms. In Proc. 14th Symposium on Combinatorial Pattern Matching (CPM '03), pages 129-143. SpringerVerlag LNCS n. 2676, 2003.
[16] P. G. Howard and J. S. Vitter. Analysis of arithmetic coding for data compression. Information Processing Management, 28(6):749-764, 1992.
[17] H. Kaplan, S. Landau, and E. Verbin. A simpler analysis of burrowswheeler-based compression. Theoretical Computer Science, 387(3):220-235, 2007.
[18] R. Kosaraju and G. Manzini. Compression of low entropy strings with Lempel-Ziv algorithms. SIAM Journal on Computing, 29(3):893-911, 1999.
[19] P. Kulkarni, F. Douglis, J.D. LaVoie, and J.M. Tracey. Redundancy elimination within large collections of files. In USENIX Annual Technical Conference, pages 59-72, 2004.
[20] V. Ma¨kinen and G. Navarro. Position-restricted substring searching. In Proc. 7th Latin American Symposium on Theoretical Informatics (LATIN), pages 703-714. Springer Verlag LNCS n. 3887, 2006.
[21] V. Ma¨kinen and G. Navarro. Implicit compression boosting with applications to self-indexing. In Procs 14th Symp. on String Processing and Information Retrieval (SPIRE), pages 229-241. Springer Verlag LNCS n. 4726, 2007.
[22] G. Manzini. An analysis of the Burrows-Wheeler transform. J. ACM, 48(3):407-430, 2001.
[23] A. Moffat and R.Y. Isal. Word-based text compression using the burrowswheeler transform. Information Processing Management, 41(5):1175-1192, 2005.
[24] Z. Ouyang, N.D. Memon, T. Suel, and D. Trendafilov. Cluster-based delta compression of a collection of files. In Procs 3rd Conference on Web Information Systems Engineering (WISE), pages 257-268. IEEE Computer Society, 2002.
[25] T. Suel and N. Memon. Algorithms for delta compression and remote file synchronization. In Khalid Sayood, editor, Lossless Compression Handbook. Academic Press, 2002.
[26] D. Trendafilov, N. Memon, and T. Suel. Compressing file collections with a TSP-based approach. Technical report, Technical Report TR-CIS-2004-02, Polytechnic University, 2004.
[27] B.D. Vo and K.-P. Vo. Compressing table data with column dependency. Theoretical Computer Science, 387(3):273-283, 2007.
[28] I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers, Los Altos, CA 94022, USA, second edition, 1999.

Metrics



Back to previous page
BibTeX entry
@article{oai:it.cnr:prodotti:68504,
	title = {On optimally partitioning a text to improve its compression},
	author = {Ferragina P. and Nitto I. and Venturini R.},
	publisher = {Springer Science + Business Media, New York , Stati Uniti d'America},
	doi = {10.1007/s00453-010-9437-6 and 10.48550/arxiv.0906.4692 and 10.1007/978-3-642-04128-0_38},
	journal = {Algorithmica},
	volume = {61},
	pages = {51–74},
	year = {2011}
}