2022
Conference article  Open Access

On weighted k-mer dictionaries

Pibiri G. E.

K-Mers  Weights  Compression  Hashing 

We consider the problem of representing a set of k-mers and their abundance counts, or weights, in compressed space so that assessing membership and retrieving the weight of a k-mer is efficient. The representation is called a weighted dictionary of k-mers and finds application in numerous tasks in Bioinformatics that usually count k-mers as a pre-processing step. In fact, k-mer counting tools produce very large outputs that may result in a severe bottleneck for subsequent processing. In this work we extend the recently introduced SSHash dictionary (Pibiri, Bioinformatics 2022) to also store compactly the weights of the k-mers. From a technical perspective, we exploit the order of the k-mers represented in SSHash to encode runs of weights, hence allowing (several times) better compression than the empirical entropy of the weights. We also study the problem of reducing the number of runs in the weights to improve compression even further and illustrate a lower bound for this problem. We propose an efficient, greedy, algorithm to reduce the number of runs and show empirically that it performs well, i.e., very similarly to the lower bound. Lastly, we corroborate our findings with experiments on real-world datasets and comparison with competitive alternatives. Up to date, SSHash is the only k-mer dictionary that is exact, weighted, associative, fast, and small.

Source: WABI 2022 - International Workshop on Algorithms in Bioinformatics, Potsdam, Germany, 05-09/09/2022


Metrics



Back to previous page
BibTeX entry
@inproceedings{oai:it.cnr:prodotti:470242,
	title = {On weighted k-mer dictionaries},
	author = {Pibiri G. E.},
	doi = {10.4230/lipics.wabi.2022.9},
	booktitle = {WABI 2022 - International Workshop on Algorithms in Bioinformatics, Potsdam, Germany, 05-09/09/2022},
	year = {2022}
}

MobiDataLab
Labs for prototyping future Mobility Data sharing cloud solutions


OpenAIRE