2023
Conference article  Open Access

Spectrum preserving tilings enable sparse and modular reference indexing

Fan J., Khan J., Pibiri G. E., Patro R.

Reference indexing  Spectrum preserving tilings  Minimal perfect hashing  Pufferfish2 

The reference indexing problem for -mers is to pre-process a collection of reference genomic sequences so that the position of all occurrences of any queried -mer can be rapidly identified. An efficient and scalable solution to this problem is fundamental for many tasks in bioinformatics. In this work, we introduce the spectrum preserving tiling (SPT), a general representation of that specifies how a set of tiles repeatedly occur to spell out the constituent reference sequences in. By encoding the order and positions where tiles occur, SPTs enable the implementation and analysis of a general class of modular indexes. An index over an SPT decomposes the reference indexing problem for -mers into: (1) a -mer-to-tile mapping; and (2) a tile-to-occurrence mapping. Recently introduced work to construct and compactly index -mer sets can be used to efficiently implement the -mer-to-tile mapping. However, implementing the tile-to-occurrence mapping remains prohibitively costly in terms of space. As reference collections become large, the space requirements of the tile-to-occurrence mapping dominates that of the -mer-to-tile mapping since the former depends on the amount of total sequence while the latter depends on the number of unique -mers in. To address this, we introduce a class of sampling schemes for SPTs that trade off speed to reduce the size of the tile-to-reference mapping. We implement a practical index with these sampling schemes in the tool pufferfish2. When indexing over 30,000 bacterial genomes, pufferfish2 reduces the size of the tile-to-occurrence mapping from 86.3 GB to 34.6 GB while incurring only a 3.6 slowdown when querying -mers from a sequenced readset. Availability: pufferfish2 is implemented in Rust and available at https://github.com/COMBINE-lab/pufferfish2.

Source: RECOMB 2023 - 27th International Conference on Research in Computational Molecular Biology, pp. 21–40, Istanbul, Turkey, 16-19/04/2023

Publisher: Springer, Berlin , Germania


Metrics



Back to previous page
BibTeX entry
@inproceedings{oai:it.cnr:prodotti:480859,
	title = {Spectrum preserving tilings enable sparse and modular reference indexing},
	author = {Fan J. and Khan J. and Pibiri G. E. and Patro R.},
	publisher = {Springer, Berlin , Germania},
	doi = {10.1007/978-3-031-29119-7_2},
	booktitle = {RECOMB 2023 - 27th International Conference on Research in Computational Molecular Biology, pp. 21–40, Istanbul, Turkey, 16-19/04/2023},
	year = {2023}
}

MobiDataLab
Labs for prototyping future Mobility Data sharing cloud solutions


OpenAIRE