2020
Article  Open Access

Techniques for Inverted Index Compression

Pibiri G. E., Venturini R.

efficiency  Computer Science - Information Retrieval  inverted index  compression 

The data structure at the core of large-scale search engines is the inverted index, which is essentially a collection of sorted integer sequences called inverted lists. Because of the many documents indexed by such engines and stringent performance requirements imposed by the heavy load of queries, the inverted index stores billions of integers that must be searched efficiently. In this scenario, index compression is essential because it leads to a better exploitation of the computer memory hierarchy for faster query processing and, at the same time, allows reducing the number of storage machines. The aim of this article is twofold: first, surveying the encoding algorithms suitable for inverted index compression and, second, characterizing the performance of the inverted index through experimentation.

Source: ACM computing surveys 53 (2020). doi:10.1145/3415148

Publisher: Association for Computing Machinery,, New York, N.Y. , Stati Uniti d'America


Citations

[1] Norman Abramson. 1963. Information theory and coding. McGraw-Hill.
[2] Vo Ngoc Anh and Alistair Mofat. 2005. Inverted Index Compression Using Word-Aligned Binary Codes. Information Retrieval Journal 8, 1 (2005), 151-166.
[3] Vo Ngoc Anh and Alistair Mofat. 2010. Index compression using 64-bit words. Software: Practice and Experience 40, 2 (2010), 131-147.
[4] Alberto Apostolico and A Fraenkel. 1987. Robust transmission of unbounded strings using Fibonacci representations. IEEE Transactions on Information Theory 33, 2 (1987), 238-245.
[5] Ricardo Baeza-Yates, Berthier de Araújo Neto Ribeiro, and others. 2011. Modern information retrieval. New York: ACM Press; Harlow, England: Addison-Wesley,.
[6] Paolo Boldi and Sebastiano Vigna. 2004. The WebGraph framework II: Codes for the World-Wide Web. In Data Compression Conference. 1.
[7] Paolo Boldi and Sebastiano Vigna. 2005. Codes for the World Wide Web. Internet Mathematics 2, 4 (2005), 407-429.
[8] Abraham Bookstein and Shmuel T Klein. 1989. Construction of optimal graphs for bit-vector compression. In Proceedings of the 13th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 327-342.
[9] Nieves R Brisaboa, Antonio Farina, Gonzalo Navarro, and Maria F Esteller. 2003. (S, C)-dense coding: An optimized compression code for natural language text databases. In International Symposium on String Processing and Information Retrieval. Springer, 122-136.
[10] Nieves R Brisaboa, Susana Ladra, and Gonzalo Navarro. 2013. DACs: Bringing direct access to variable-length codes. Information Processing & Management 49, 1 (2013), 392-404.
[11] Andrei Z. Broder, David Carmel, Michael Herscovici, Aya Sofer, and Jason Y. Zien. 2003. Eficient query evaluation using a two-level retrieval process. In Proceedings of the 12th ACM International Conference on Information and Knowledge Management. 426-434.
[12] Stefan Büttcher, Charles Clarke, and Gordon Cormack. 2010. Information retrieval: implementing and evaluating search engines. MIT Press.
[13] B. Barla Cambazoglu and Ricardo Baeza-Yates. 2015. Scalability Challenges in Web Search Engines. Morgan & Claypool Publishers.
[14] Samy Chambi, Daniel Lemire, Owen Kaser, and Robert Godin. 2016. Better bitmap performance with Roaring bitmaps. Software: practice and experience 46, 5 (2016), 709-719.
[15] David Clark. 1996. Compact Pat Trees. Ph.D. Dissertation. University of Waterloo.
[16] Intel Corporation. [last checked April 2019]. The Intel Intrinsics Guide, https://software.intel.com/sites/landingpage/ IntrinsicsGuide/.
[17] Bruce Croft, Donald Metzler, and Trevor Strohman. 2009. Search Engines: Information Retrieval in Practice (1st ed.). Addison-Wesley Publishing Company.
[18] J Shane Culpepper and Alistair Mofat. 2005. Enhanced byte codes with restricted prefix properties. In International Symposium on String Processing and Information Retrieval. Springer, 1-12.
[19] Michael Curtiss, Iain Becker, Tudor Bosman, Sergey Doroshenko, Lucian Grijincu, Tom Jackson, Sandhya Kunnatur, Soren Lassen, Philip Pronin, Sriram Sankar, Guanghao Shen, Gintaras Woss, Chao Yang, and Ning Zhang. 2013. Unicorn: A System for Searching the Social Graph. In Proceedings of the Very Large Database Endowment, Vol. 6. 1150-1161.
[20] Jefrey Dean. 2009. Challenges in building large-scale information retrieval systems: invited talk. In Proceedings of the 2nd International Conference on Web Search and Data Mining.
[21] Biplob Debnath, Sudipta Sengupta, and Jin Li. 2011. SkimpyStash: RAM space skimpy key-value store on flash-based storage. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, 25-36.
[22] Renaud Delbru, Stéphane Campinas, and Giovanni Tummarello. 2012. Searching web data: An entity retrieval and high-performance indexing model. Journal of Web Semantics 10 (2012), 33-58.
[23] Laxman Dhulipala, Igor Kabiljo, Brian Karrer, Giuseppe Ottaviano, Sergey Pupyrev, and Alon Shalita. 2016. Compressing Graphs and Indexes with Recursive Graph Bisection. In Proceedings of the 22nd International Conference on Knowledge Discovery and Data Mining. 1535-1544.
[24] Jarek Duda. 2009. Asymmetric numeral systems. CoRR abs/0902.0271 (2009). http://arxiv.org/abs/0902.0271
[25] Jarek Duda. 2013. Asymmetric numeral systems: Entropy coding combining speed of Hufman coding with compression rate of arithmetic coding. CoRR abs/1311.2540 (2013). http://arxiv.org/abs/1311.2540
[26] Jarek Duda, Khalid Tahboub, Neeraj J Gadgil, and Edward J Delp. 2015. The use of asymmetric numeral systems as an accurate replacement for Hufman coding. In 2015 Picture Coding Symposium (PCS). IEEE, 65-69.
[27] Peter Elias. 1974. Eficient Storage and Retrieval by Content and Address of Static Files. J. ACM 21, 2 (1974), 246-260.
[28] Peter Elias. 1975. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory 21, 2 (1975), 194-203.
[29] Robert Mario Fano. 1949. The transmission of information. Massachusetts Institute of Technology, Research Laboratory of Electronics.
[30] Robert Mario Fano. 1971. On the number of bits required to implement an associative memory. Memorandum 61, Computer Structures Group, MIT (1971).
[31] Peter Fenwick. 2003. Universal codes. Lossless Compression Handbook (2003), 55-78.
[32] Aviezri S Fraenkel and Shmuel T Klein. 1985. Robust universal complete codes as alternatives to Hufman codes . Department of Applied Mathematics, Weizmann Institute of Science.
[33] Robert Gallager and David Van Voorhis. 1975. Optimal source codes for geometrically distributed integer alphabets (corresp.). IEEE Transactions on Information theory 21, 2 (1975), 228-230.
[34] Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. 1998. Compressing Relations and Indexes. In Proceedings of the 14th International Conference on Data Engineering. 370-379.
[35] Solomon Golomb. 1966. Run-length encodings. IEEE Transactions on Information Theory 12, 3 (1966), 399-401.
[36] Rodrigo González, Szymon Grabowski, Veli Mäkinen, and Gonzalo Navarro. 2005. Practical implementation of rank and select queries. In Workshop on Eficient and Experimental Algorithms . 27-38.
[37] Vagelis Hristidis, Yannis Papakonstantinou, and Luis Gravano. 2003. Eficient IR-Style Keyword Search over Relational Databases. In Proceedings 2003 VLDB Conference. Elsevier, 850-861.
[38] David A Hufman. 1952. A method for the construction of minimum-redundancy codes. Proceedings of the IRE 40, 9 (1952), 1098-1101.
[39] Guy Jacobson. 1989. Succinct Static Data Structures. Ph.D. Dissertation. Carnegie Mellon University.
[40] Matti Jakobsson. 1978. Hufman Coding in Bit-Vector Compression. Inf. Process. Lett. 7, 6 (1978), 304-307.
[41] Aaron Kiely. 2004. Selecting the Golomb parameter in Rice coding. IPN progress report 42 (2004), 159.
[42] Daniel Lemire and Leonid Boytsov. 2015. Decoding billions of integers per second through vectorization. 45, 1 (2015), 1-29.
[43] Daniel Lemire, Owen Kaser, Nathan Kurz, Luca Deri, Chris O'Hara, François Saint-Jacques, and Gregory Ssi-Yan-Kai. 2018. Roaring bitmaps: Implementation of an optimized software library. Software: Practice and Experience 48, 4 (2018), 867-895.
[44] Daniel Lemire, Nathan Kurz, and Christoph Rupp. 2018. Stream-VByte: faster byte-oriented integer compression. Inform. Process. Lett. 130 (2018), 1-6.
[45] Daniel Lemire, Gregory Ssi-Yan-Kai, and Owen Kaser. 2016. Consistently faster and smaller compressed bitmaps with roaring. Software: Practice and Experience 46, 11 (2016), 1547-1569.
[46] Veli Mäkinen and Gonzalo Navarro. 2007. Rank and select revisited and extended. Theoretical Computer Science 387, 3 (2007), 332-347.
[47] Antonio Mallia, Giuseppe Ottaviano, Elia Porciani, Nicola Tonellotto, and Rossano Venturini. 2017. Faster BlockMax WAND with Variable-sized Blocks. In Proceedings of the International ACM Conference on Research and Development in Information Retrieval. 625-634.
[48] Antonio Mallia, Michał Siedlaczek, and Torsten Suel. 2019. An experimental study of index compression and DAAT query processing methods. In European Conference on Information Retrieval. Springer, 353-368.
[49] Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press.
[50] Alistair Mofat. 2016. Compressing Integer Sequences. In Encyclopedia of Algorithms. 407-412.
[51] Alistair Mofat. 2019. Hufman Coding. ACM Computing Surveys (CSUR) (2019), 35 pages. To appear.
[52] Alistair Mofat, Radford M. Neal, and Ian H. Witten. 1998. Arithmetic Coding Revisited. ACM Trans. Inf. Syst. 16, 3 (July 1998), 256-294.
[53] Alistair Mofat and Matthias Petri. 2017. ANS-Based Index Compression. In Proceedings of the ACM on Conference on Information and Knowledge Management. 677-686.
[54] Alistair Mofat and Matthias Petri. 2018. Index Compression Using Byte-Aligned ANS Coding and Two-Dimensional Contexts. In Proceedings of the 11th ACM International Conference on Web Search and Data Mining. 405-413.
[55] Alistair Mofat and Lang Stuiver. 1996. Exploiting Clustering in Inverted File Compression. In Data Compression Conference. 82-91.
[56] Alistair Mofat and Lang Stuiver. 2000. Binary Interpolative Coding for Efective Index Compression. Information Retrieval Journal 3, 1 (2000), 25-47.
[57] Alistair Mofat and Andrew Turpin. 2002. Compression and coding algorithms. Springer Science & Business Media.
[58] Alistair Mofat and Justin Zobel. 1992. Parameterised compression for sparse bitmaps. In Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 274-285.
[59] Leonardo of Pisa (known as Fibonacci). 1202. Liber Abaci.
[60] Giuseppe Ottaviano, Nicola Tonellotto, and Rossano Venturini. 2015. Optimal Space-time Tradeofs for Inverted Indexes. In International ACM Conference on Web Search and Data Mining. 47-56.
[61] Giuseppe Ottaviano and Rossano Venturini. 2014. Partitioned Elias-Fano Indexes. In Proceedings of the 37th International Conference on Research and Development in Information Retrieval. 273-282.
[62] Rasmus Pagh. 2001. Low redundancy in static dictionaries with constant query time. SIAM J. Comput. 31, 2 (2001), 353-363.
[63] Athanasios Papoulis. 1991. Probability, Random Variables, and Stochastic Processes (3rd ed.). McGraw-Hill.
[64] Richard Clark Pasco. 1976. Source coding algorithms for fast data compression. Ph.D. Dissertation. Stanford University Palo Alto, CA.
[65] Giulio Ermanno Pibiri. 2019. On Slicing Sorted Integer Sequences. CoRR abs/1907.01032 (2019). http://arxiv.org/abs/ 1907.01032
[66] Giulio Ermanno Pibiri. 2019. Space- and Time-Eficient Data Structures for Massive Datasets . Ph.D. Dissertation. University of Pisa.
[67] Giulio Ermanno Pibiri, Matthias Petri, and Alistair Mofat. 2019. Fast Dictionary-Based Compression for Inverted Indexes. In International ACM Conference on Web Search and Data Mining. 9.
[68] Giulio Ermanno Pibiri and Rossano Venturini. 2017. Clustered Elias-Fano indexes. ACM Transactions on Information Systems 36, 1, Article 2 (2017), 33 pages.
[69] Giulio Ermanno Pibiri and Rossano Venturini. 2017. Dynamic Elias-Fano Representation. In Proceedings of the 28-th Annual Symposium on Combinatorial Pattern Matching. 30:1-30:14.
[70] Giulio Ermanno Pibiri and Rossano Venturini. 2018. Inverted Index Compression. Encyclopedia of Big Data Technologies (2018), 1-8.
[71] Giulio Ermanno Pibiri and Rossano Venturini. 2019. On Optimally Partitioning Variable-Byte Codes. IEEE Transactions on Knowledge and Data Engineering (2019), 1-12.
[72] Jef Plaisance, Nathan Kurz, and Daniel Lemire. 2015. Vectorized VByte Decoding. In International Symposium on Web Algorithms.
[73] Vijayshankar Raman, Lin Qiao, Wei Han, Inderpal Narang, Ying-Lin Chen, Kou-Horng Yang, and Fen-Ling Ling. 2007. Lazy, adaptive rid-list intersection, and its application to index anding. In Proceedings of the 2007 ACM SIGMOD international conference on Management of data. ACM, 773-784.
[74] Robert Rice and J. Plaunt. 1971. Adaptive Variable-Length Coding for Eficient Compression of Spacecraft Television Data. IEEE Transactions on Communications 16, 9 (1971), 889-897.
[75] JJ Rissanen. 1979. Arithmetic codings as number representations. Acta Polytech. Scand. Math 31 (1979), 44-51.
[76] Jorma J Rissanen. 1976. Generalized Kraft inequality and arithmetic coding. IBM Journal of research and development 20, 3 (1976), 198-203.
[77] David Salomon. 2007. Variable-length Codes for Data Compression. Springer.
[78] Benjamin Schlegel, Rainer Gemulla, and Wolfgang Lehner. 2010. Fast integer compression using SIMD instructions. In Proceedings of the Sixth International Workshop on Data Management on New Hardware. ACM, 34-40.
[79] Claude Elwood Shannon. 1948. A mathematical theory of communication. Bell system technical journal 27, 3 (1948), 379-423.
[80] Fabrizio Silvestri. 2007. Sorting Out the Document Identifier Assignment Problem. In Proceedings of the 29th European Conference on IR Research. 101-112.
[81] Fabrizio Silvestri and Rossano Venturini. 2010. VSEncoding: Eficient Coding and Fast Decoding of Integer Lists via Dynamic Programming. In Proceedings of the 19th International Conference on Information and Knowledge Management. 1219-1228.
[82] Alexander Stepanov, Anil Gangolli, Daniel Rose, Ryan Ernst, and Paramjit Oberoi. 2011. SIMD-based decoding of posting lists. In Proceedings of the 20th International Conference on Information and Knowledge Management. 317-326.
[83] J. A. Storer and T. G. Szymanski. 1982. Data compression via textual substitution. 29, 4 (1982), 928-951.
[84] Larry H Thiel and HS Heaps. 1972. Program design for retrospective searches on large data bases. Information Storage and Retrieval 8, 1 (1972), 1-20.
[85] Andrew Trotman. 2003. Compressing inverted files. Information Retrieval 6, 1 (2003), 5-19.
[86] Andrew Trotman. 2014. Compression, SIMD, and postings lists. In Proceedings of the 2014 Australasian Document Computing Symposium. ACM, 50.
[87] Andrew Trotman and Kat Lilly. 2018. Elias Revisited: Group Elias SIMD Coding. In Proceedings of the 23rd Australasian Document Computing Symposium. ACM, 4.
[88] Peter van Emde Boas. 1975. Preserving Order in a Forest in less than Logarithmic Time. In Proceedings of the 16-th Annual Symposium on Foundations of Computer Science. 75-84.
[89] Peter van Emde Boas. 1977. Preserving Order in a Forest in Less Than Logarithmic Time and Linear Space. Inform. Process. Lett. 6, 3 (1977), 80-82.
[90] Sebastiano Vigna. 2013. Quasi-succinct indices. In Proceedings of the 6th ACM International Conference on Web Search and Data Mining. 83-92.
[91] Ian Witten, Alistair Mofat, and Timothy Bell. 1999. Managing gigabytes: compressing and indexing documents and images (2nd ed.). Morgan Kaufmann.
[92] Ian H Witten, Radford M Neal, and John G Cleary. 1987. Arithmetic coding for data compression. Commun. ACM 30, 6 (1987), 520-540.
[93] Hao Yan, Shuai Ding, and Torsten Suel. 2009. Inverted index compression and query processing with optimized document ordering. In Proceedings of the 18th International Conference on World Wide Web. 401-410.
[94] J. Zhang, X. Long, and T. Suel. 2008. Performance of compressed inverted list caching in search engines. In International World Wide Web Conference (WWW). 387-396.
[95] Justin Zobel and Alistair Mofat. 2006. Inverted files for text search engines. Comput. Surveys 38, 2 (2006), 1-56.
[96] Marcin Zukowski, Sándor Héman, Niels Nes, and Peter Boncz. 2006. Super-Scalar RAM-CPU Cache Compression. In Proceedings of the 22nd International Conference on Data Engineering. 59-70.


Back to previous page
Projects (via OpenAIRE)

BigDataGrapes
Big Data to Enable Global Disruption of the Grapevine-powered Industries


OpenAIRE