2019
Article  Open Access

On optimally partitioning variable-byte codes

Pibiri G. E., Venturini R.

Computational Theory and Mathematics  Compression  Computer Science - Information Retrieval  indexing  Computer Science - Databases  Computer Science Applications  Variable-byte  Information Systems 

The ubiquitous Variable-Byte encoding is one of the fastest compressed representation for integer sequences. However, its compression ratio is usually not competitive with other more sophisticated encoders, especially when the integers to be compressed are small that is the typical case for inverted indexes. This paper shows that the compression ratio of Variable-Byte can be improved by 2× by adopting a partitioned representation of the inverted lists. This makes Variable-Byte surprisingly competitive in space with the best bit-aligned encoders, hence disproving the folklore belief that Variable-Byte is space-inefficient for inverted index compression. Despite the significant space savings, we show that our optimization almost comes for free, given that: we introduce an optimal partitioning algorithm that does not affect indexing time because of its linear-time complexity; we show that the query processing speed of Variable-Byte is preserved, with an extensive experimental analysis and comparison with several other state-of-the-art encoders.

Source: IEEE transactions on knowledge and data engineering (Print) 32 (2019): 1812–1823. doi:10.1109/TKDE.2019.2911288

Publisher: Institute of Electrical and Electronics Engineers,, New York, NY , Stati Uniti d'America


Citations

[1] Dropbox Techblog. https: //blogs.dropbox.com/tech/2016/09/improvingthe-performance-of-full-text-search/. Accessed on 15-04-2018.
[2] Protocol Bu ers - Google's data interchange format. https://github.com/google/protobuf. Accessed on 15-04-2018.
[3] RediSearch. https://github.com/RedisLabsModules/ RediSearch/blob/master/docs/DESIGN.md. Accessed on 15-04-2018.
[4] UpscaleDB. https://upscaledb.com/about01.html#compression. Accessed on 15-04-2018.
[5] D. Abadi, P. Boncz, S. Harizopoulos, S. Idreos, S. Madden, et al. The design and implementation of modern column-oriented database systems. Foundations and Trends R in Databases, 5(3):197{280, 2013.
[6] V. N. Anh and A. Mo at. Inverted index compression using word-aligned binary codes. Information Retrieval Journal, 8(1):151{166, 2005.
[7] V. N. Anh and A. Mo at. Index compression using 64-bit words. Software: Practice and Experience, 40(2):131{147, 2010.
[8] B. Bhattacharjee, L. Lim, T. Malkemus, G. Mihaila, K. Ross, S. Lau, C. McArthur, Z. Toth, and R. Sherkat. E cient index compression in DB2 LUW. Proceedings of the VLDB Endowment, 2(2):1462{1473, 2009.
[9] N. Bruno, N. Koudas, and D. Srivastava. Holistic twig joins: optimal XML pattern matching. In Proceedings of the 2002 ACM SIGMOD international conference on Management of data, pages 310{321. ACM, 2002.
[10] A. Buchsbaum, G. Fowler, and R. Giancarlo. Improving table compression with combinatorial optimization. Journal of the ACM, 50(6):825{851, 2003.
[11] M. Busch, K. Gade, B. Larson, P. Lok, S. Luckenbill, and J. Lin. Earlybird: Real-time search at twitter. In Proceedings of the 28th International Conference on Data Engineering (ICDE), pages 1360{1369. IEEE, 2012.
[12] S. Buttcher, C. Clarke, and G. Cormack. Information retrieval: implementing and evaluating search engines. MIT Press, 2010.
[13] F. Claude, A. Farin~a, M. A. Mart nez-Prieto, and G. Navarro. Universal indexes for highly repetitive document collections. Information Systems, 61:1{23, 2016.
[14] M. Curtiss, I. Becker, T. Bosman, S. Doroshenko, L. Grijincu, T. Jackson, S. Kunnatur, S. Lassen, P. Pronin, S. Sankar, G. Shen, G. Woss, C. Yang, and N. Zhang. Unicorn: A system for searching the social graph. In Proceedings of the Very Large Database Endowment (VLDB), volume 6, pages 1150{1161, 2013.
[15] J. Dean. Challenges in building large-scale information retrieval systems: invited talk. In Proceedings of the 2nd International Conference on Web Search and Data Mining (WSDM), 2009.
[16] B. Debnath, S. Sengupta, and J. Li. Skimpystash: RAM space skimpy key-value store on ash-based storage. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, pages 25{36. ACM, 2011.
[17] J. Duda. Asymmetric numeral systems: entropy coding combining speed of hu man coding with compression rate of arithmetic coding. arXiv preprint arXiv:1311.2540, 2013.
[18] P. Elias. E cient storage and retrieval by content and address of static les. Journal of the ACM, 21(2):246{260, 1974.
[19] P. Elias. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, 21(2):194{203, 1975.
[20] R. M. Fano. On the number of bits required to implement an associative memory. Memorandum 61, Computer Structures Group, MIT, 1971.
[21] P. Ferragina, I. Nitto, and R. Venturini. On optimally partitioning a text to improve its compression. Algorithmica, 61(1):51{74, 2011.
[22] J. Goldstein, R. Ramakrishnan, and U. Shaft. Compressing relations and indexes. In Proceedings of the 14th International Conference on Data Engineering (ICDE), pages 370{379, 1998.
[23] S. Golomb. Run-length encodings. IEEE Transactions on Information Theory, 12(3):399{401, 1966.
[24] V. Hristidis, Y. Papakonstantinou, and L. Gravano. E cient IR-style keyword search over relational databases. In Proceedings 2003 VLDB Conference, pages 850{861. Elsevier, 2003.
[25] D. Lemire and L. Boytsov. Decoding billions of integers per second through vectorization. Software: Practice and Experience, 45(1):1{29, 2013.
[26] D. Lemire, N. Kurz, and C. Rupp. Stream-VByte: faster byte-oriented integer compression. Information Processing Letters, 130:1{6, 2018.
[27] C. Manning, P. Raghavan, and H. Schutze. Introduction to Information Retrieval. Cambridge University Press, 2008.
[28] A. Mo at and M. Petri. ANS-based index compression. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM 2017, Singapore, November 06 - 10, 2017, pages 677{686, 2017.
[29] A. Mo at and M. Petri. Index compression using byte-aligned ANS coding and two-dimensional contexts. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, WSDM 2018, Marina Del Rey, CA, USA, February 5-9, 2018, pages 405{413, 2018.
[30] A. Mo at and L. Stuiver. Binary interpolative coding for e ective index compression. Information Retrieval Journal, 3(1):25{47, 2000.
[31] G. Ottaviano, N. Tonellotto, and R. Venturini. Optimal space-time tradeo s for inverted indexes. In Proceedings of the 8th Annual International ACM Conference on Web Search and Data Mining (WSDM), pages 47{56, 2015.
[32] G. Ottaviano and R. Venturini. Partitioned Elias-Fano indexes. In Proceedings of the 37th International Conference on Research and Development in Information Retrieval (SIGIR), pages 273{282, 2014.
[33] G. E. Pibiri and R. Venturini. Clustered Elias-Fano indexes. ACM Transactions on Information Systems, 36(1):1{33, 2017.
[34] G. E. Pibiri and R. Venturini. Inverted index compression. Encyclopedia of Big Data Technologies, pages 1{8, 2018.
[35] J. Plaisance, N. Kurz, and D. Lemire. Vectorized VByte decoding. In International Symposium on Web Algorithms (iSWAG), 2015.
[36] V. Raman, L. Qiao, W. Han, I. Narang, Y.-L. Chen, K.-H. Yang, and F.-L. Ling. Lazy, adaptive rid-list intersection, and its application to index anding. In Proceedings of the 2007 ACM SIGMOD international conference on Management of data, pages 773{784. ACM, 2007.
[37] F. Silvestri. Sorting out the document identi er assignment problem. In Proceedings of the 29th European Conference on IR Research (ECIR), pages 101{112, 2007.
[38] F. Silvestri and R. Venturini. VSEncoding: E cient coding and fast decoding of integer lists via dynamic programming. In Proceedings of the 19th International Conference on Information and Knowledge Management (CIKM), pages 1219{1228, 2010.
[39] A. Stepanov, A. Gangolli, D. Rose, R. Ernst, and P. Oberoi. Simd-based decoding of posting lists. In Proceedings of the 20th International Conference on Information and Knowledge Management (CIKM), pages 317{326, 2011.
[40] L. H. Thiel and H. Heaps. Program design for retrospective searches on large data bases. Information Storage and Retrieval, 8(1):1{20, 1972.
[41] A. Trotman. Compression, simd, and postings lists. In Proceedings of the 2014 Australasian Document Computing Symposium, page 50. ACM, 2014.
[42] S. Vigna. Quasi-succinct indices. In Proceedings of the 6th ACM International Conference on Web Search and Data Mining (WSDM), pages 83{92, 2013.
[43] J. Wang, C. Lin, R. He, M. Chae, Y. Papakonstantinou, and S. Swanson. MILC: inverted list compression in memory. Proceedings of the VLDB Endowment, 10(8):853{864, 2017.
[44] H. E. Williams and J. Zobel. Compressing integers for fast le access. The Computer Journal, 42(3):193{201, 1999.
[45] H. Yan, S. Ding, and T. Suel. Inverted index compression and query processing with optimized document ordering. In Proceedings of the 18th International Conference on World Wide Web (WWW), pages 401{410, 2009.
[46] Z. Zhang, J. Tong, H. Huang, J. Liang, T. Li, R. J. Stones, G. Wang, and X. Liu. Leveraging context-free grammar for e cient inverted index compression. In Proceedings of the 39th International Conference on Research and Development in Information Retrieval (SIGIR), pages 275{284, 2016.
[47] J. Zobel and A. Mo at. Inverted les for text search engines. ACM Computing Surveys, 38(2):1{56, 2006.
[48] M. Zukowski, S. Heman, N. Nes, and P. Boncz. Super-scalar RAM-CPU cache compression. In Proceedings of the 22nd International Conference on Data Engineering (ICDE), pages 59{70, 2006.


Back to previous page