Schmidt S, Khan S, Alanko Jn, Pibiri Ge, Tomescu Ai
k-mer sets Plain text compression Genomic sequences Graph algorithm Sequence analysis Minimum-cost flow Chinese postman problem
We propose a polynomial algorithm computing a minimum plain-text representation of k-mer sets, as well as an efficient near-minimum greedy heuristic. When compressing read sets of large model organisms or bacterial pangenomes, with only a minor runtime increase, we shrink the representation by up to 59% over unitigs and 26% over previous work. Additionally, the number of strings is decreased by up to 97% over unitigs and 90% over previous work. Finally, a small representation has advantages in downstream applications, as it speeds up SSHash-Lite queries by up to 4.26× over unitigs and 2.10× over previous work.
Source: GENOME BIOLOGY (ONLINE), vol. 24
@article{oai:it.cnr:prodotti:484067, title = {Matchtigs: minimum plain text representation of k-mer sets}, author = {Schmidt S and Khan S and Alanko Jn and Pibiri Ge and Tomescu Ai}, doi = {10.1186/s13059-023-02968-z}, year = {2023} }
MobiDataLab
Labs for prototyping future Mobility Data sharing cloud solutions
SAFEBIO
Safe and Complete Algorithms for Bioinformatics