Koerkamp R. G., Pibiri G. E.
Minimizers, Randomized algorithms, Sketching, Hashing
Motivation. Given a string S, a minimizer scheme is an algorithm defined by a triple (k,w,O) that samples a subset of k-mers (k-long substrings) from a string S. Specifically, it samples the smallest k-mer according to the order O from each window of w consecutive k-mers in S. Because consecutive windows can sample the same k-mer, the set of the sampled k-mers is typically much smaller than S. More generally, we consider substring sampling algorithms that respect a window guarantee: at least one k-mer must be sampled from every window of w consecutive k-mers. As a sampled k-mer is uniquely identified by its absolute position in S, we can define the density of a sampling algorithm as the fraction of distinct sampled positions. Good methods have low density which, by respecting the window guarantee, is lower bounded by 1/w. It is however difficult to design a sequence-agnostic algorithm with provably optimal density. In practice, the order O is usually implemented using a pseudo-random hash function to obtain the so-called random minimizer. This scheme is simple to implement, very fast to compute even in streaming fashion, and easy to analyze. However, its density is almost a factor of 2 away from the lower bound for large windows. Methods. In this work we introduce mod-sampling, a two-step sampling algorithm to obtain new minimizer schemes. Given a (small) parameter t, the mod-sampling algorithm finds the position p of the smallest t-mer in a window. It then samples the k-mer at position pod w. The lr-minimizer uses t = k-w and the mod-minimizer uses t≡ k (mod w). Results. These new schemes have provably lower density than random minimizers and other schemes when k is large compared to w, while being as fast to compute. Importantly, the mod-minimizer achieves optimal density when k → ∞. Although the mod-minimizer is not the first method to achieve optimal density for large k, its proof of optimality is simpler than previous work. We provide pseudocode for a number of other methods and compare to them. In practice, the mod-minimizer has considerably lower density than the random minimizer and other state-of-the-art methods, like closed syncmers and miniception, when k > w. We plugged the mod-minimizer into SSHash, a k-mer dictionary based on minimizers. For default parameters (w,k) = (11,21), space usage decreases by 15% when indexing the whole human genome (GRCh38), while maintaining its fast query time.
Source: LEIBNIZ INTERNATIONAL PROCEEDINGS IN INFORMATICS, vol. 312. Royal Holloway, London, United Kingdom, 2-4/09/2024
@inproceedings{oai:iris.cnr.it:20.500.14243/503503, title = {The {mod-minimizer}: a simple and efficient sampling algorithm for long k-Mers}, author = {Koerkamp R. G. and Pibiri G. E.}, doi = {10.4230/lipics.wabi.2024.11}, booktitle = {LEIBNIZ INTERNATIONAL PROCEEDINGS IN INFORMATICS, vol. 312. Royal Holloway, London, United Kingdom, 2-4/09/2024}, year = {2024} }