2009
Conference article  Open Access

On optimally partitioning a text to improve its compression

Ferragina P., Nitto I., Venturini R.

Compression  Computer Science - Data Structures and Algorithms  Computer Science - Information Theory  General Computer Science  Information Theory (cs.IT)  FOS: Computer and information sciences  E.4 Coding and Information Theory  Analysis of Algorithms and Problem Complexity  Computer Science Applications  Dynamic Programming  Applied Mathematics  Data Structures and Algorithms (cs.DS)  Information theory 

In this paper we investigate the problem of partitioning an input string T in such a way that compressing individually its parts via a base-compressor gets a compressed output that is shorter than applying over the entire T at once. This problem was introduced in [2,3] in the context of table compression, and further elaborated and extended to strings and trees by [10,11,20], but it is still open how to efficiently compute the optimal partition [4]. In this paper we provide the first algorithm which is guaranteed to compute in O(n polylog(n)) time a partition of T whose compressed output is guaranteed to be no more than (1 + ε)-worse the optimal one, where ε is any positive constant.

Source: Algorithms. 17th Annual European Symposium - ESA 2009, pp. 420–431, Copenhagen, Denmark, 7-9 September 2009


[1] J.L. Bentley and M.D. McIlroy. Data compression with long repeated strings. Information Sciences, 135(1-2):1-11, 2001.
[2] A. L. Buchsbaum, D. F. Caldwell, K. W. Church, G. S. Fowler, and S. Muthukrishnan. Engineering the compression of massive tables: an experimental approach. In Procs ACM-SIAM SODA, pages 175-184, 2000.
[3] Adam L. Buchsbaum, Glenn S. Fowler, and Raffaele Giancarlo. Improving table compression with combinatorial optimization. J. ACM, 50(6):825- 851, 2003.
[4] A.L. Buchsbaum and R. Giancarlo. Table compression. In M.Y. Kao, editor, Encyclopedia of Algorithms, pages 939-942. Springer, 2008.
[5] M. Burrows and D. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994.
[6] F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R.E. Gruber. Bigtable: A distributed storage system for structured data. ACM Trans. Comput. Syst., 26(2), 2008.
[7] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, Second Edition. The MIT Press and McGraw-Hill Book Company, 2001.
[8] P. Ferragina, R. Giancarlo, and G. Manzini. The engineering of a compression boosting library: Theory vs practice in BWT compression. In Proc. 14th European Symposium on Algorithms (ESA '06), pages 756-767. Springer Verlag LNCS n. 4168, 2006.
[9] P. Ferragina, R. Giancarlo, and G. Manzini. The myriad virtues of wavelet trees. Information and Computation, 207:849-866, 2009.
[10] P. Ferragina, R. Giancarlo, G. Manzini, and M. Sciortino. Boosting textual compression in optimal linear time. Journal of the ACM, 52:688-713, 2005.
[11] P. Ferragina, F. Luccio, G. Manzini, and S. Muthukrishnan. Structuring labeled trees for optimal succinctness, and beyond. In Proc. 46th IEEE Symposium on Foundations of Computer Science (FOCS), pages 184-193, 2005.
[12] P. Ferragina, F. Luccio, G. Manzini, and S. Muthukrishnan. Compressing and searching xml data via two zips. In Proc. 15th International World Wide Web Conference (WWW), pages 751-760, 2006.
[13] P. Ferragina and R. Venturini. The compressed permuterm index. ACM Transactions on Algorithms (to appear), 2009.
[14] R. Giancarlo, A. Restivo, and M. Sciortino. From first principles to the burrows and wheeler transform and beyond, via combinatorial optimization. Theoretical Computer Science, 387(3):236-248, 2007.
[15] R. Giancarlo and M. Sciortino. Optimal partitions of strings: A new class of Burrows-Wheeler compression algorithms. In Proc. 14th Symposium on Combinatorial Pattern Matching (CPM '03), pages 129-143. SpringerVerlag LNCS n. 2676, 2003.
[16] P. G. Howard and J. S. Vitter. Analysis of arithmetic coding for data compression. Information Processing Management, 28(6):749-764, 1992.
[17] H. Kaplan, S. Landau, and E. Verbin. A simpler analysis of burrowswheeler-based compression. Theoretical Computer Science, 387(3):220-235, 2007.
[18] R. Kosaraju and G. Manzini. Compression of low entropy strings with Lempel-Ziv algorithms. SIAM Journal on Computing, 29(3):893-911, 1999.
[19] P. Kulkarni, F. Douglis, J.D. LaVoie, and J.M. Tracey. Redundancy elimination within large collections of files. In USENIX Annual Technical Conference, pages 59-72, 2004.
[20] V. Ma¨kinen and G. Navarro. Position-restricted substring searching. In Proc. 7th Latin American Symposium on Theoretical Informatics (LATIN), pages 703-714. Springer Verlag LNCS n. 3887, 2006.
[21] V. Ma¨kinen and G. Navarro. Implicit compression boosting with applications to self-indexing. In Procs 14th Symp. on String Processing and Information Retrieval (SPIRE), pages 229-241. Springer Verlag LNCS n. 4726, 2007.
[22] G. Manzini. An analysis of the Burrows-Wheeler transform. J. ACM, 48(3):407-430, 2001.
[23] A. Moffat and R.Y. Isal. Word-based text compression using the burrowswheeler transform. Information Processing Management, 41(5):1175-1192, 2005.
[24] Z. Ouyang, N.D. Memon, T. Suel, and D. Trendafilov. Cluster-based delta compression of a collection of files. In Procs 3rd Conference on Web Information Systems Engineering (WISE), pages 257-268. IEEE Computer Society, 2002.
[25] T. Suel and N. Memon. Algorithms for delta compression and remote file synchronization. In Khalid Sayood, editor, Lossless Compression Handbook. Academic Press, 2002.
[26] D. Trendafilov, N. Memon, and T. Suel. Compressing file collections with a TSP-based approach. Technical report, Technical Report TR-CIS-2004-02, Polytechnic University, 2004.
[27] B.D. Vo and K.-P. Vo. Compressing table data with column dependency. Theoretical Computer Science, 387(3):273-283, 2007.
[28] I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers, Los Altos, CA 94022, USA, second edition, 1999.

Metrics



Back to previous page
BibTeX entry
@inproceedings{oai:it.cnr:prodotti:44298,
	title = {On optimally partitioning a text to improve its compression},
	author = {Ferragina P. and Nitto I. and Venturini R.},
	doi = {10.1007/978-3-642-04128-0_38 and 10.48550/arxiv.0906.4692 and 10.1007/s00453-010-9437-6},
	booktitle = {Algorithms. 17th Annual European Symposium - ESA 2009, pp. 420–431, Copenhagen, Denmark, 7-9 September 2009},
	year = {2009}
}