Article (Open Access)

Handling massive n-gram datasets efficiently

Pibiri G. E., Venturini R.

Algorithm engineering, Computer Science - Information Retrieval, Scalability, Computer Science - Databases, Efficiency, n-grams, satellite values, Kneser-Ney language models

Two fundamental problems concern the handling of large n-gram language models: indexing, that is, compressing the n-grams and associated satellite values without compromising their retrieval speed, and estimation, that is, computing the probability distribution of the n-grams extracted from a large textual source. Performing these two tasks efficiently is vital for several applications in the fields of Information Retrieval, Natural Language Processing, and Machine Learning, such as auto-completion in search engines and machine translation. Regarding the problem of indexing, we describe compressed, exact, and lossless data structures that simultaneously achieve high space reductions and no time degradation with respect to the state-of-the-art solutions and related software packages. In particular, we present a compressed trie data structure in which each word of an n-gram following a context of fixed length k, that is, its preceding k words, is encoded as an integer whose value is proportional to the number of words that follow such context. Since the number of words following a given context is typically very small in natural languages, we lower the space of representation to compression levels that were never achieved before, allowing the indexing of billions of strings. Despite the significant savings in space, our technique introduces a negligible penalty at query time. Specifically, the most space-efficient competitors in the literature, which are both quantized and lossy, occupy no less space than our trie data structure and are up to 5 times slower. Conversely, our trie is as fast as the fastest competitor while also retaining an advantage of up to 65% in absolute space. Regarding the problem of estimation, we present a novel algorithm for estimating modified Kneser-Ney language models, which have emerged as the de facto choice for language modeling in both academia and industry thanks to their relatively low perplexity.
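Returning to the indexing technique above: the core idea is that a word is not stored with its global vocabulary identifier but with its rank among the (few) distinct words observed after its context, which yields small integers that compress well. The following is a minimal in-memory sketch of that remapping; the function name `remap` and the toy word IDs are illustrative only, not part of the authors' implementation.

```python
from collections import defaultdict

def remap(ngrams):
    """Replace the last word of each n-gram by its rank among the
    distinct words observed after the same context (the first n-1
    words). Few words follow a given context in natural language,
    so the resulting integers are small and highly compressible."""
    successors = defaultdict(set)
    for g in ngrams:
        successors[g[:-1]].add(g[-1])
    # Assign rank 0, 1, 2, ... to the sorted successors of each context.
    ranks = {ctx: {w: r for r, w in enumerate(sorted(ws))}
             for ctx, ws in successors.items()}
    return [(g[:-1], ranks[g[:-1]][g[-1]]) for g in ngrams]

# Toy trigrams over word IDs: context (1, 2) has two successors, so
# they map to ranks 0 and 1 regardless of how large the IDs are.
trigrams = [(1, 2, 7), (1, 2, 9), (3, 2, 7), (1, 2, 7)]
print(remap(trigrams))
# → [((1, 2), 0), ((1, 2), 1), ((3, 2), 0), ((1, 2), 0)]
```

The ranks replace word IDs drawn from a vocabulary of millions with integers bounded by the number of successors of a context, which is what enables the compression levels reported in the abstract.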
Estimating such models from large textual sources poses the challenge of devising algorithms that make a parsimonious use of the disk. The state-of-the-art algorithm uses three sorting steps in external memory: we show an improved construction that requires only one sorting step by exploiting the properties of the extracted n-gram strings. With an extensive experimental analysis performed on billions of n-grams, we show an average improvement of 4.5 times on the total runtime of the previous approach.
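The estimation pipeline rests on a classic pattern: extract all n-grams, sort them once, then aggregate equal runs in a single sequential pass. The sketch below illustrates that sort-then-aggregate step in memory; the real algorithm operates in external memory on billions of n-grams, and `count_ngrams` is an assumed simplification, not the authors' code.

```python
def count_ngrams(words, n):
    """Count n-grams by sorting once and merging equal adjacent runs.
    After sorting, identical n-grams are contiguous, so a single
    linear scan suffices to produce (n-gram, count) pairs."""
    grams = sorted(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    counts = []
    for g in grams:
        if counts and counts[-1][0] == g:
            counts[-1] = (g, counts[-1][1] + 1)  # extend the current run
        else:
            counts.append((g, 1))                # start a new run
    return counts

text = "the cat sat on the cat".split()
print(count_ngrams(text, 2))
# → [(('cat', 'sat'), 1), (('on', 'the'), 1), (('sat', 'on'), 1), (('the', 'cat'), 2)]
```

Because sorting dominates the disk traffic of external-memory pipelines, collapsing three such sorting steps into one is what drives the reported 4.5-times runtime improvement.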

Source: ACM Transactions on Information Systems 37 (2019). doi:10.1145/3302913

Publisher: Association for Computing Machinery, New York, NY, United States of America


