31 result(s)
Page Size: 10, 20, 50
Export: bibtex, xml, json, csv
Order by:

CNR Author operator: and / or
Typology operator: and / or
Language operator: and / or
Date operator: and / or
Rights operator: and / or
2025 Conference article Open Access OPEN
U-Index: a universal indexing framework for matching long patterns
Ayad Lorraine A. K., Fici G., Koerkamp R. G., Loukides G., Patro R., Pibiri G. E., Pissis S. P.
Motivation. Text indexing is a fundamental and well-studied problem. Classic solutions to this problem either replace the original text with a compressed representation, e.g., the FM-index and its variants, or keep it uncompressed but attach some redundancy - an index - to accelerate matching, e.g., the suffix array. The former solutions thus retain excellent compressed space, but are practically slow to construct and query. The latter approaches, instead, sacrifice space efficiency but are typically faster; for example, the suffix array takes much more space than the text itself for commonly used alphabets, like ASCII or DNA, but it is very fast to construct and query. Methods. In this paper, we show that efficient text indexing can be achieved using just a small extra space on top of the original text, provided that the query patterns are sufficiently long. More specifically, we develop a new indexing paradigm in which a sketch of a query pattern is first matched against a sketch of the text. Once candidate matches are retrieved, they are verified using the original text. This paradigm is thus universal in the sense that it allows us to use any solution to index the sketched text, like a suffix array, FM-index, or r-index. Results. We explore both the theory and the practice of this universal framework. With an extensive experimental analysis, we show that, surprisingly, universal indexes can be constructed much faster than their unsketched counterparts and take a fraction of the space, as a direct consequence of (i) having a lower bound on the length of patterns and (ii) working in sketch space. Furthermore, these data structures have the potential of retaining or even improving query time, because matching against the sketched text is faster and verifying candidates can be theoretically done in constant time per occurrence (or, in practice, by short and cache-friendly scans of the text). Finally, we discuss some important applications of this novel indexing paradigm to computational biology. We hypothesize that such indexes will be particularly effective when the queries are sufficiently long, and so we demonstrate applications in long-read mapping.Source: LEIBNIZ INTERNATIONAL PROCEEDINGS IN INFORMATICS, vol. 338. Venice, Italy, 22-24/07/2025
DOI: 10.4230/lipics.sea.2025.4
Project(s): EFRA via OpenAIRE
Metrics:


See at: drops.dagstuhl.de Open Access | CNR IRIS Open Access | CNR IRIS Restricted


2025 Conference article Open Access OPEN
Fast pseudoalignment queries on compressed colored de Bruijn graphs
Campanelli A., Pibiri G. E., Patro R.
Motivation. Indexes for the colored de Bruijn graph (c-dBG) play a crucial role in computational biology by facilitating complex tasks such as read mapping and assembly. These indexes map k-mers (substrings of length k) appearing in a large collection of reference strings to the set of identifiers of the strings where they appear. These sets, colloquially referred to as color sets, tend to occupy large quantities of memory, especially for large pangenomes. Our previous work thus focused on leveraging the repetitiveness of the color sets to improve the space effectiveness of the resulting index. As a matter of fact, repetition-aware indexes can be up to one order of magnitude smaller on large pangenomes compared to indexes that do not exploit such repetitiveness. Such improved space effectiveness, on the other hand, imposes an overhead at query time when performing tasks such as pseudoalignment that require the collection and processing of multiple related color sets. Methods. In this paper, we show how to avoid this overhead. We devise novel query algorithms tailored for the specific repetition-aware representations adopted by the Fulgor index, a state-of-the-art c-dBG index, to significantly improve its pseudoalignment efficiency and without consuming additional space. Results. Our results indicate that with increasing redundancy in the pangenomes, the compression factor provided by the Fulgor index increases, while the relative query time actually reduces. For example, while the space of the Fulgor index improves by 2.5× with repetition-aware compression and its query time improves by 1.6× on a collection of 5,000 Salmonella Enterica genomes, these factors become (6.1×,2.8×) and (11.2×,3.2×) for 50,000 and 150,000 genomes respectively. For an even larger collection of 300,000 genomes, we obtained an index that is 22.3× smaller and 2.2× faster.Source: LEIBNIZ INTERNATIONAL PROCEEDINGS IN INFORMATICS, vol. 344. University of Maryland, USA, 2025
DOI: 10.4230/lipics.wabi.2025.6
Metrics:


See at: drops.dagstuhl.de Open Access | CNR IRIS Open Access | CNR IRIS Restricted


2025 Journal article Open Access OPEN
The open-closed mod-minimizer algorithm
Koerkamp R. G., Liu D., Pibiri G. E.
Sampling algorithms that deterministically select a subset of k-mers are an important building block in bioinformatics applications. For example, they are used to index large textual collections, like DNA, and to compare sequences quickly. In such applications, a sampling algorithm is required to select one k-mer out of every window of w consecutive k-mers. The folklore and most used scheme is the random minimizer that selects the smallest k-mer in the window according to some random order. This scheme is remarkably simple and versatile, and has a density (expected fraction of selected k-mers) of 2/(w+1). In practice, lower density leads to faster methods and smaller indexes, and it turns out that the random minimizer is not the best one can do. Indeed, some schemes are known to approach optimal density 1/w when k→∞, like the recently introduced mod-minimizer (Groot Koerkamp and Pibiri, WABI 2024). In this work, we study methods that achieve low density when k≤w. In this small-k regime, a practical method with provably better density than the random minimizer is the miniception (Zheng et al., Bioinformatics 2021). This method can be elegantly described as sampling the smallest closed sycnmer (Edgar, PeerJ 2021) in the window according to some random order. We show that extending the miniception to prefer sampling open syncmers yields much better density. This new method—the open-closed minimizer—offers improved density for small k≤w while being as fast to compute as the random minimizer. Compared to methods based on decycling sets, that achieve very low density in the small-k regime, our method has comparable density while being computationally simpler and intuitive. Furthermore, we extend the mod-minimizer to improve density of any scheme that works well for small k to also work well when k>w is large. We hence obtain the open-closed mod-minimizer, a practical method that improves over the mod-minimizer for all k.Source: ALGORITHMS FOR MOLECULAR BIOLOGY, vol. 20 (issue 1)
DOI: 10.1186/s13015-025-00270-0
Project(s): ETH Research Grant ETH-1721, EFRA via OpenAIRE
Metrics:


See at: almob.biomedcentral.com Open Access | CNR IRIS Open Access | CNR IRIS Restricted


2024 Journal article Open Access OPEN
Fulgor: a fast and compact k-mer index for large-scale matching and color queries
Fan J, Khan J, Pratap Singh N, Pibiri Ge, Patro R
The problem of sequence identification or matching--determining the subset of reference sequences from a givencollection that are likely to contain a short, queried nucleotide sequence--is relevant for many important tasksin Computational Biology, such as metagenomics and pangenome analysis. Due to the complex nature of such analysesand the large scale of the reference collections a resource-efficient solution to this problem is of utmost importance.This poses the threefold challenge of representing the reference collection with a data structure that is efficientto query, has light memory usage, and scales well to large collections. To solve this problem, we describe an efficientcolored de Bruijn graph index, arising as the combination of a k-mer dictionary with a compressed inverted index. Theproposed index takes full advantage of the fact that unitigs in the colored compacted de Bruijn graph are monochromatic(i.e., all k-mers in a unitig have the same set of references of origin, or color). Specifically, the unitigs are keptin the dictionary in color order, thereby allowing for the encoding of the map from k-mers to their colors in as littleas 1 + o(1) bits per unitig. Hence, one color per unitig is stored in the index with almost no space/time overhead. Bycombining this property with simple but effective compression methods for integer lists, the index achieves verysmall space. We implement these methods in a tool called Fulgor, and conduct an extensive experimental analysisto demonstrate the improvement of our tool over previous solutions. For example, compared to Themisto--thestrongest competitor in terms of index space vs. query time trade-off--Fulgor requires significantly less space (upto 43% less space for a collection of 150,000 Salmonella enterica genomes), is at least twice as fast for color queries,and is 2-6× faster to construct.Source: ALGORITHMS FOR MOLECULAR BIOLOGY, vol. 19 (issue 3)
DOI: 10.1186/s13015-024-00251-9
Metrics:


See at: almob.biomedcentral.com Open Access | CNR IRIS Open Access | ISTI Repository Open Access | CNR IRIS Restricted


2024 Conference article Open Access OPEN
The {mod-minimizer}: a simple and efficient sampling algorithm for long k-Mers
Koerkamp R. G., Pibiri G. E.
Motivation. Given a string S, a minimizer scheme is an algorithm defined by a triple (k,w,O) that samples a subset of k-mers (k-long substrings) from a string S. Specifically, it samples the smallest k-mer according to the order O from each window of w consecutive k-mers in S. Because consecutive windows can sample the same k-mer, the set of the sampled k-mers is typically much smaller than S. More generally, we consider substring sampling algorithms that respect a window guarantee: at least one k-mer must be sampled from every window of w consecutive k-mers. As a sampled k-mer is uniquely identified by its absolute position in S, we can define the density of a sampling algorithm as the fraction of distinct sampled positions. Good methods have low density which, by respecting the window guarantee, is lower bounded by 1/w. It is however difficult to design a sequence-agnostic algorithm with provably optimal density. In practice, the order O is usually implemented using a pseudo-random hash function to obtain the so-called random minimizer. This scheme is simple to implement, very fast to compute even in streaming fashion, and easy to analyze. However, its density is almost a factor of 2 away from the lower bound for large windows. Methods. In this work we introduce mod-sampling, a two-step sampling algorithm to obtain new minimizer schemes. Given a (small) parameter t, the mod-sampling algorithm finds the position p of the smallest t-mer in a window. It then samples the k-mer at position pod w. The lr-minimizer uses t = k-w and the mod-minimizer uses t≡ k (mod w). Results. These new schemes have provably lower density than random minimizers and other schemes when k is large compared to w, while being as fast to compute. Importantly, the mod-minimizer achieves optimal density when k → ∞. Although the mod-minimizer is not the first method to achieve optimal density for large k, its proof of optimality is simpler than previous work. We provide pseudocode for a number of other methods and compare to them. In practice, the mod-minimizer has considerably lower density than the random minimizer and other state-of-the-art methods, like closed syncmers and miniception, when k > w. We plugged the mod-minimizer into SSHash, a k-mer dictionary based on minimizers. For default parameters (w,k) = (11,21), space usage decreases by 15% when indexing the whole human genome (GRCh38), while maintaining its fast query time.Source: LEIBNIZ INTERNATIONAL PROCEEDINGS IN INFORMATICS, vol. 312. Royal Holloway, London, United Kingdom, 2-4/09/2024
DOI: 10.4230/lipics.wabi.2024.11
Metrics:


See at: drops.dagstuhl.de Open Access | CNR IRIS Open Access | CNR IRIS Restricted


2024 Journal article Open Access OPEN
Where the patterns are: repetition-aware compression for colored de Bruijn graphs
Campanelli A., Pibiri G. E., Fan J., Patro R.
We describe lossless compressed data structures for the colored de Bruijn graph (or c-dBG). Given a collection of reference sequences, a c-dBG can be essentially regarded as a map from k-mers to their color sets. The color set of a k-mer is the set of all identifiers, or colors, of the references that contain the k-mer. While these maps find countless applications in computational biology (e.g., basic query, reading mapping, abundance estimation, etc.), their memory usage represents a serious challenge for large-scale sequence indexing. Our solutions leverage on the intrinsic repetitiveness of the color sets when indexing large collections of related genomes. Hence, the described algorithms factorize the color sets into patterns that repeat across the entire collection and represent these patterns once instead of redundantly replicating their representation as would happen if the sets were encoded as atomic lists of integers. Experimental results across a range of datasets and query workloads show that these representations substantially improve over the space effectiveness of the best previous solutions (sometimes, even dramatically, yielding indexes that are smaller by an order of magnitude). Despite the space reduction, these indexes only moderately impact the efficiency of the queries compared to the fastest indexes.Source: JOURNAL OF COMPUTATIONAL BIOLOGY
DOI: 10.1089/cmb.2024.0714
DOI: 10.1101/2024.07.09.602727
Project(s): EFRA via OpenAIRE
Metrics:


See at: Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari Open Access | Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari Open Access | IRIS Cnr Open Access | IRIS Cnr Open Access | Journal of Computational Biology Restricted | doi.org Restricted | CNR IRIS Restricted | CNR IRIS Restricted


2024 Conference article Open Access OPEN
PHOBIC: Perfect Hashing with Optimized Bucket sizes and Interleaved Coding
Hermann S., Lehmann H. -P., Pibiri G. E., Sanders P., Walzer S.
A minimal perfect hash function (or MPHF) maps a set of n keys to [n]:= {1, ..., n} without collisions. Such functions find widespread application e.g. in bioinformatics and databases. In this paper we revisit PTHash - a construction technique particularly designed for fast queries. PTHash distributes the input keys into small buckets and, for each bucket, it searches for a hash function seed that places its keys in the output domain without collisions. The collection of all seeds is then stored in a compressed way. Since the first buckets are easier to place, buckets are considered in non-increasing order of size. Additionally, PTHash heuristically produces an imbalanced distribution of bucket sizes by distributing 60% of the keys into 30% of the buckets. Our main contribution is to characterize, up to lower order terms, an optimal choice for the expected bucket sizes, improving construction throughput for space efficient configurations both in theory and practice. Further contributions include a new encoding scheme for seeds that works across partitions of the data structure and a GPU parallelization. Compared to PTHash, PHOBIC is 0.17 bits/key more space efficient for same query time and construction throughput. For a configuration with fast queries, our GPU implementation can construct an MPHF at 2.17 bits/key in 28 ns/key, which can be queried in 37 ns/query on the CPU.Source: LEIBNIZ INTERNATIONAL PROCEEDINGS IN INFORMATICS, vol. 308. London, UK, 02-04/09/2024
DOI: 10.4230/lipics.esa.2024.69
Project(s): EFRA via OpenAIRE
Metrics:


See at: drops.dagstuhl.de Open Access | CNR IRIS Open Access | CNR IRIS Restricted


2023 Conference article Open Access OPEN
Spectrum preserving tilings enable sparse and modular reference indexing
Fan J, Khan J, Pibiri Ge, Patro R
The reference indexing problem for -mers is to pre-process a collection of reference genomic sequences so that the position of all occurrences of any queried -mer can be rapidly identified. An efficient and scalable solution to this problem is fundamental for many tasks in bioinformatics. In this work, we introduce the spectrum preserving tiling (SPT), a general representation of that specifies how a set of tiles repeatedly occur to spell out the constituent reference sequences in. By encoding the order and positions where tiles occur, SPTs enable the implementation and analysis of a general class of modular indexes. An index over an SPT decomposes the reference indexing problem for -mers into: (1) a -mer-to-tile mapping; and (2) a tile-to-occurrence mapping. Recently introduced work to construct and compactly index -mer sets can be used to efficiently implement the -mer-to-tile mapping. However, implementing the tile-to-occurrence mapping remains prohibitively costly in terms of space. As reference collections become large, the space requirements of the tile-to-occurrence mapping dominates that of the -mer-to-tile mapping since the former depends on the amount of total sequence while the latter depends on the number of unique -mers in. To address this, we introduce a class of sampling schemes for SPTs that trade off speed to reduce the size of the tile-to-reference mapping. We implement a practical index with these sampling schemes in the tool pufferfish2. When indexing over 30,000 bacterial genomes, pufferfish2 reduces the size of the tile-to-occurrence mapping from 86.3 GB to 34.6 GB while incurring only a 3.6 slowdown when querying -mers from a sequenced readset. Availability: pufferfish2 is implemented in Rust and available at https://github.com/COMBINE-lab/pufferfish2.DOI: 10.1007/978-3-031-29119-7_2
Project(s): MobiDataLab via OpenAIRE
Metrics:


See at: CNR IRIS Open Access | link.springer.com Open Access | ISTI Repository Open Access | CNR IRIS Restricted


2023 Journal article Open Access OPEN
On weighted k-mer dictionaries
Pibiri Ge
We consider the problem of representing a set of k-mers and their abundance counts, or weights, in compressed space so that assessing membership and retrieving the weight of a k-mer is efficient. The representation is called a weighted dictionary of k-mers and finds application in numerous tasks in Bioinformatics that usually count k-mers as a pre-processing step. In fact, k-mer counting tools produce very large outputs that may result in a severe bottleneck for subsequent processing. In this work we extend the recently introduced SSHash dictionary (Pibiri in Bioinformatics 38:185-194, 2022) to also store compactly the weights of the k-mers. From a technical perspective, we exploit the order of the k-mers represented in SSHash to encode runs of weights, hence allowing much better compression than the empirical entropy of the weights. We study the problem of reducing the number of runs in the weights to improve compression even further and give an optimal algorithm for this problem. Lastly, we corroborate our findings with experiments on real-world datasets and comparison with competitive alternatives. Up to date, SSHash is the only k-mer dictionary that is exact, weighted, associative, fast, and small.Source: ALGORITHMS FOR MOLECULAR BIOLOGY, vol. 18 (issue 1)
DOI: 10.1186/s13015-023-00226-2
Metrics:


See at: almob.biomedcentral.com Open Access | CNR IRIS Open Access | ISTI Repository Open Access | CNR IRIS Restricted


2023 Journal article Open Access OPEN
Matchtigs: minimum plain text representation of k-mer sets
Schmidt S, Khan S, Alanko Jn, Pibiri Ge, Tomescu Ai
We propose a polynomial algorithm computing a minimum plain-text representation of k-mer sets, as well as an efficient near-minimum greedy heuristic. When compressing read sets of large model organisms or bacterial pangenomes, with only a minor runtime increase, we shrink the representation by up to 59% over unitigs and 26% over previous work. Additionally, the number of strings is decreased by up to 97% over unitigs and 90% over previous work. Finally, a small representation has advantages in downstream applications, as it speeds up SSHash-Lite queries by up to 4.26× over unitigs and 2.10× over previous work.Source: GENOME BIOLOGY (ONLINE), vol. 24
DOI: 10.1186/s13059-023-02968-z
Project(s): MobiDataLab via OpenAIRE, SAFEBIO via OpenAIRE
Metrics:


See at: genomebiology.biomedcentral.com Open Access | CNR IRIS Open Access | ISTI Repository Open Access | CNR IRIS Restricted


2023 Journal article Open Access OPEN
Locality-preserving minimal perfect hashing of k-mers
Pibiri Ge, Shibuya Y, Limasset A
Motivation: Minimal perfect hashing is the problem of mapping a static set of n distinct keys into the address space {1, ... , n} bijectively. It is well-known that n log2(e) bits are necessary to specify a minimal perfect hash function (MPHF) f, when no additional knowledge of the input keys is to be used. However, it is often the case in practice that the input keys have intrinsic relationships that we can exploit to lower the bit complexity of f. For example, consider a string and the set of all its distinct k-mers as input keys: since two consecutive k-mers share an overlap of k - 1 symbols, it seems possible to beat the classic log2(e) bits/key barrier in this case. Moreover, we would like f to map consecutive k-mers to consecutive addresses, as to also preserve as much as possible their relationship in the codomain. This is a useful feature in practice as it guarantees a certain degree of locality of reference for f, resulting in a better evaluation time when querying consecutive k-mers.Results: Motivated by these premises, we initiate the study of a new type of locality-preserving MPHF designed for k-mers extracted consecutively from a collection of strings. We design a construction whose space usage decreases for growing k and discuss experiments with a practical implementation of the method: in practice, the functions built with our method can be several times smaller and even faster to query than the most efficient MPHFs in the literature.Code Availability: https://github.com/jermp/lphashData Availability: https://zenodo.org/record/7239205Source: BIOINFORMATICS (OXF., ONLINE), vol. 39
DOI: 10.1093/bioinformatics/btad219
Metrics:


See at: academic.oup.com Open Access | CNR IRIS Open Access | ISTI Repository Open Access | ISTI Repository Open Access | CNR IRIS Restricted | CNR IRIS Restricted


2023 Conference article Open Access OPEN
Fulgor: a fast and compact {k-mer} index for large-scale matching and color queries
Fan J, Singh Np, Khan J, Pibiri Ge, Patro R
The problem of sequence identification or matching - determining the subset of reference sequences from a given collection that are likely to contain a short, queried nucleotide sequence - is relevant for many important tasks in Computational Biology, such as metagenomics and pan-genome analysis. Due to the complex nature of such analyses and the large scale of the reference collections a resource-efficient solution to this problem is of utmost importance. This poses the threefold challenge of representing the reference collection with a data structure that is efficient to query, has light memory usage, and scales well to large collections. To solve this problem, we describe how recent advancements in associative, order-preserving, k-mer dictionaries can be combined with a compressed inverted index to implement a fast and compact colored de Bruijn graph data structure. This index takes full advantage of the fact that unitigs in the colored de Bruijn graph are monochromatic (all k-mers in a unitig have the same set of references of origin, or "color"), leveraging the order-preserving property of its dictionary. In fact, k-mers are kept in unitig order by the dictionary, thereby allowing for the encoding of the map from k-mers to their inverted lists in as little as 1+o(1) bits per unitig. Hence, one inverted list per unitig is stored in the index with almost no space/time overhead. By combining this property with simple but effective compression methods for inverted lists, the index achieves very small space. We implement these methods in a tool called Fulgor. Compared to Themisto, the prior state of the art, Fulgor indexes a heterogeneous collection of 30,691 bacterial genomes in 3.8× less space, a collection of 150,000 Salmonella enterica genomes in approximately 2× less space, is at least twice as fast for color queries, and is 2-6 × faster to construct.Source: LEIBNIZ INTERNATIONAL PROCEEDINGS IN INFORMATICS, pp. 18:1-18:21. Houston, Texas (USA), 03-06/09/2023
DOI: 10.4230/lipics.wabi.2023.18
DOI: 10.1101/2023.05.09.539895
Metrics:


See at: drops.dagstuhl.de Open Access | Europe PubMed Central Open Access | Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari Open Access | CNR IRIS Open Access | doi.org Restricted | doi.org Restricted | CNR IRIS Restricted | pubmed.ncbi.nlm.nih.gov Restricted


2023 Conference article Open Access OPEN
Verifiable learning for robust tree ensembles
Calzavara S., Cazzaro L., Pibiri G. E., Prezza N.
Verifying the robustness of machine learning models against evasion attacks at test time is an important research problem. Unfortunately, prior work established that this problem is NP-hard for decision tree ensembles, hence bound to be intractable for specific inputs. In this paper, we identify a restricted class of decision tree ensembles, called large-spread ensembles, which admit a security verification algorithm running in polynomial time. We then propose a new approach called verifiable learning, which advocates the training of such restricted model classes which are amenable for efficient verification. We show the benefits of this idea by designing a new training algorithm that automatically learns a large-spread decision tree ensemble from labelled data, thus enabling its security verification in polynomial time. Experimental results on publicly available datasets confirm that large-spread ensembles trained using our algorithm can be verified in a matter of seconds, using standard commercial hardware. Moreover, large-spread ensembles are more robust than traditional ensembles against evasion attacks, while incurring in just a relatively small loss of accuracy in the non-adversarial setting.DOI: 10.1145/3576915.3623100
DOI: https://doi.org/10.1145/3576915.3623100
DOI: 10.48550/arxiv.2305.03626
Project(s): EFRA via OpenAIRE
Metrics:


See at: arXiv.org e-Print Archive Open Access | Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari Open Access | Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari Open Access | CNR IRIS Open Access | doi.org Restricted | CNR IRIS Restricted


2022 Journal article Open Access OPEN
Sparse and skew hashing of K-mers
Pibiri Ge
MOTIVATION: A dictionary of k-mers is a data structure that stores a set of n distinct k-mers and supports membership queries. This data structure is at the hearth of many important tasks in computational biology. High-throughput sequencing of DNA can produce very large k-mer sets, in the size of billions of strings-in such cases, the memory consumption and query efficiency of the data structure is a concrete challenge. RESULTS: To tackle this problem, we describe a compressed and associative dictionary for k-mers, that is: a data structure where strings are represented in compact form and each of them is associated to a unique integer identifier in the range [0,n). We show that some statistical properties of k-mer minimizers can be exploited by minimal perfect hashing to substantially improve the space/time trade-off of the dictionary compared to the best-known solutions. AVAILABILITY AND IMPLEMENTATION: https://github.com/jermp/sshash. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.Source: BIOINFORMATICS (OXF., ONLINE), vol. 38 (issue 1), pp. i185-i194
DOI: 10.1093/bioinformatics/btac245
Project(s): MobiDataLab via OpenAIRE
Metrics:


See at: academic.oup.com Open Access | CNR IRIS Open Access | ISTI Repository Open Access | CNR IRIS Restricted


2022 Conference article Open Access OPEN
On weighted k-mer dictionaries
Pibiri Ge
We consider the problem of representing a set of k-mers and their abundance counts, or weights, in compressed space so that assessing membership and retrieving the weight of a k-mer is efficient. The representation is called a weighted dictionary of k-mers and finds application in numerous tasks in Bioinformatics that usually count k-mers as a pre-processing step. In fact, k-mer counting tools produce very large outputs that may result in a severe bottleneck for subsequent processing. In this work we extend the recently introduced SSHash dictionary (Pibiri, Bioinformatics 2022) to also store compactly the weights of the k-mers. From a technical perspective, we exploit the order of the k-mers represented in SSHash to encode runs of weights, hence allowing (several times) better compression than the empirical entropy of the weights. We also study the problem of reducing the number of runs in the weights to improve compression even further and illustrate a lower bound for this problem. We propose an efficient, greedy, algorithm to reduce the number of runs and show empirically that it performs well, i.e., very similarly to the lower bound. Lastly, we corroborate our findings with experiments on real-world datasets and comparison with competitive alternatives. Up to date, SSHash is the only k-mer dictionary that is exact, weighted, associative, fast, and small.Source: LEIBNIZ INTERNATIONAL PROCEEDINGS IN INFORMATICS. Potsdam, Germany, 05-09/09/2022
DOI: 10.4230/lipics.wabi.2022.9
Project(s): MobiDataLab via OpenAIRE
Metrics:


See at: drops.dagstuhl.de Open Access | CNR IRIS Open Access | ISTI Repository Open Access | CNR IRIS Restricted


2021 Conference article Open Access OPEN
Fast and compact set intersection through recursive universe partitioning
Pibiri Ge
We present a data structure that encodes a sorted integer sequence in small space allowing, at the same time, fast intersection operations. The data layout is carefully designed to exploit word-level parallelism and SIMD instructions, hence providing good practical performance. The core algorithmic idea is that of recursive partitioning the universe of representation: a markedly different paradigm than the widespread strategy of partitioning the sequence based on its length. Extensive experimentation and comparison against several competitive techniques shows that the proposed solution embodies an improved space/time trade-off for the set intersection problem.DOI: 10.1109/dcc50243.2021.00037
Metrics:


See at: CNR IRIS Open Access | ieeexplore.ieee.org Open Access | ISTI Repository Open Access | doi.org Restricted | CNR IRIS Restricted | CNR IRIS Restricted


2021 Journal article Open Access OPEN
Rank/select queries over mutable bitmaps
Pibiri Ge, Kanda S
The problem of answering rank/select queries over a bitmap is of utmost importance for many succinct data structures. When the bitmap does not change, many solutions exist in the theoretical and practical side. In this work we consider the case where one is allowed to modify the bitmap via a flip(i) operation that toggles its i-th bit. By adapting and properly extending some results concerning prefix-sum data structures, we present a practical solution to the problem, tailored for modern CPU instruction sets. Compared to the state-of-the-art, our solution improves runtime with no space degradation. Moreover, it does not incur in a significant runtime penalty when compared to the fastest immutable indexes, while providing even lower space overhead.Source: INFORMATION SYSTEMS, vol. 99 (issue 101756)
DOI: 10.1016/j.is.2021.101756
DOI: 10.48550/arxiv.2009.12809
Project(s): BigDataGrapes via OpenAIRE
Metrics:


See at: arXiv.org e-Print Archive Open Access | Information Systems Open Access | CNR IRIS Open Access | ISTI Repository Open Access | www.sciencedirect.com Open Access | Information Systems Restricted | doi.org Restricted | CNR IRIS Restricted | CNR IRIS Restricted


2021 Contribution to conference Open Access OPEN
Compressed indexes for fast search of semantic data
Perego R, Pibiri Ge, Venturini R
The sheer increase in volume of RDF data demands efficient solutions for the triple indexing problem, that is devising a compressed data structure to compactly represent RDF triples by guaranteeing, at the same time, fast pattern matching operations. This problem lies at the heart of delivering good practical performance for the resolution of complex SPARQL queries on large RDF datasets. We propose a trie-based index layout to solve the problem and introduce two novel techniques to reduce its space of representation for improved effectiveness. The extensive experimental analysis reveals that our best space/time trade-off configuration substantially outperforms existing solutions at the state-of-the-art, by taking 30-60% less space and speeding up query execution by a factor of 2-81 times.DOI: 10.1109/icde51399.2021.00248
Project(s): BigDataGrapes via OpenAIRE
Metrics:


See at: CNR IRIS Open Access | ieeexplore.ieee.org Open Access | ISTI Repository Open Access | CNR IRIS Restricted | CNR IRIS Restricted


2021 Conference article Open Access OPEN
TSXor: a simple time series compression algorithm
Bruno A, Nardini Fm, Pibiri Ge, Trani R, Venturini R
Time series are ubiquitous in computing as a key ingredient of many machine learning analytics, ranging from classification to forecasting. Typically, the training of such machine learning algorithms on time series requires to access the data in temporal order for several times. Therefore, a compression algorithm providing good compression ratios and fast decompression speed is desirable. In this paper, we present TSXor, a simple yet effective lossless compressor for time series. The main idea is to exploit the redundancy/similarity between close-in-time values through a window that acts as a cache, as to improve the compression ratio and decompression speed. We show that TSXor achieves up to 3× better compression and up to 2× faster decompression than the state of the art on real-world datasets.DOI: 10.1007/978-3-030-86692-1_18
Metrics:


See at: CNR IRIS Open Access | link.springer.com Open Access | ISTI Repository Open Access | CNR IRIS Restricted | CNR IRIS Restricted


2020 Journal article Open Access OPEN
Compressed indexes for fast search of semantic data
Pibiri G. E., Perego R., Venturini R.
The sheer increase in volume of RDF data demands efficient solutions for the triple indexing problem, that is to devise a compressed data structure to compactly represent RDF triples by guaranteeing, at the same time, fast pattern matching operations. This problem lies at the heart of delivering good practical performance for the resolution of complex SPARQL queries on large RDF datasets. In this work, we propose a trie-based index layout to solve the problem and introduce two novel techniques to reduce its space of representation for improved effectiveness. The extensive experimental analysis, conducted over a wide range of publicly available real-world datasets, reveals that our best space/time trade-off configuration substantially outperforms existing solutions at the state-of-the-art, by taking 30 - 60% less space and speeding up query execution by a factor of 2-81× .Source: IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (PRINT), vol. 33 (issue 9), pp. 3187-3198
DOI: 10.1109/tkde.2020.2966609
DOI: 10.48550/arxiv.1904.07619
Project(s): BigDataGrapes via OpenAIRE
Metrics:


See at: arXiv.org e-Print Archive Open Access | IEEE Transactions on Knowledge and Data Engineering Open Access | CNR IRIS Open Access | ieeexplore.ieee.org Open Access | ISTI Repository Open Access | IEEE Transactions on Knowledge and Data Engineering Restricted | doi.org Restricted | CNR IRIS Restricted | CNR IRIS Restricted