26 result(s)
2019 Conference article Open Access
Fast dictionary-based compression for inverted indexes
Pibiri G. E., Petri M., Moffat A.
Dictionary-based compression schemes provide fast decoding, typically at the expense of reduced compression effectiveness compared to statistical or probability-based approaches. In this work, we apply dictionary-based techniques to the compression of inverted lists, showing that the high degree of regularity that these integer sequences exhibit is a good match for certain types of dictionary methods, and that an important new trade-off between compression effectiveness and compression efficiency can be achieved. Our observations are supported by experiments using the document-level inverted index data for two large text collections, and a wide range of other index compression implementations as reference points. Those experiments demonstrate that the gap between efficiency and effectiveness can be substantially narrowed.
Source: International Conference on Web Search and Data Mining, pp. 6–14, 11-15/02/2019
DOI: 10.1145/3289600.3290962


See at: ISTI Repository Open Access | dl.acm.org Restricted | doi.org Restricted | CNR ExploRA


2019 Doctoral thesis Open Access
Space and time-efficient data structures for massive datasets
Pibiri G. E.
This thesis concerns the design of compressed data structures for the efficient storage of massive datasets of integer sequences and short strings.

See at: etd.adm.unipi.it Open Access | ISTI Repository Open Access | CNR ExploRA


2020 Journal article Open Access
Practical trade-offs for the prefix-sum problem
Pibiri G. E., Venturini R.
Given an integer array A, the prefix-sum problem is to answer sum(i) queries that return the sum of the elements in A[0..i], knowing that the integers in A can be changed. It is a classic problem in data structure design with a wide range of applications in computing, from coding to databases. In this work, we propose and compare practical solutions to this problem, showing that new trade-offs between the performance of queries and updates can be achieved on modern hardware.
Source: Software, practice & experience (Print) (2020). doi:10.1002/spe.2918
DOI: 10.1002/spe.2918
DOI: 10.48550/arxiv.2006.14552
Project(s): BigDataGrapes via OpenAIRE


See at: arXiv.org e-Print Archive Open Access | Software Practice and Experience Open Access | Software Practice and Experience Restricted | doi.org Restricted | onlinelibrary.wiley.com Restricted | CNR ExploRA
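
To make the prefix-sum problem in the entry above concrete, here is a minimal sketch of one classic solution, the Fenwick tree (binary indexed tree), which answers sum(i) and applies updates in O(log n) time each. It is an illustration only and is not claimed to be the data structure proposed in the article; the class and method names are assumptions chosen for this example.

```python
class FenwickTree:
    """Prefix sums over a mutable integer array (illustrative sketch)."""

    def __init__(self, values):
        self.n = len(values)
        self.tree = [0] * (self.n + 1)  # internal nodes are 1-based
        for i, v in enumerate(values):
            self.update(i, v)

    def update(self, i, delta):
        """Add delta to A[i] (0-based index)."""
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i  # jump to the next node whose range covers i

    def sum(self, i):
        """Return A[0] + ... + A[i] (0-based, inclusive)."""
        i += 1
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i  # drop the lowest set bit
        return total
```

For example, after `ft = FenwickTree([3, 1, 4, 1, 5])`, the call `ft.sum(2)` covers the first three elements, and `ft.update(2, 6)` changes what later queries over that range return.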


2020 Journal article Open Access
Techniques for inverted index compression
Pibiri G. E., Venturini R.
The data structure at the core of large-scale search engines is the inverted index, which is essentially a collection of sorted integer sequences called inverted lists. Because of the many documents indexed by such engines and stringent performance requirements imposed by the heavy load of queries, the inverted index stores billions of integers that must be searched efficiently. In this scenario, index compression is essential because it leads to a better exploitation of the computer memory hierarchy for faster query processing and, at the same time, allows reducing the number of storage machines. The aim of this article is twofold: first, surveying the encoding algorithms suitable for inverted index compression and, second, characterizing the performance of the inverted index through experimentation.
Source: ACM computing surveys 53 (2020). doi:10.1145/3415148
DOI: 10.1145/3415148
DOI: 10.48550/arxiv.1908.10598
Project(s): BigDataGrapes via OpenAIRE


See at: arXiv.org e-Print Archive Open Access | ACM Computing Surveys Open Access | ISTI Repository Open Access | dl.acm.org Restricted | ACM Computing Surveys Restricted | doi.org Restricted | CNR ExploRA


2021 Conference article Open Access
Fast and compact set intersection through recursive universe partitioning
Pibiri G. E.
We present a data structure that encodes a sorted integer sequence in small space while allowing fast intersection operations. The data layout is carefully designed to exploit word-level parallelism and SIMD instructions, hence providing good practical performance. The core algorithmic idea is that of recursively partitioning the universe of representation: a markedly different paradigm from the widespread strategy of partitioning the sequence based on its length. Extensive experimentation and comparison against several competitive techniques show that the proposed solution embodies an improved space/time trade-off for the set intersection problem.
Source: DCC 2021 - IEEE Data Compression Conference, pp. 293–302, Online Conference, 23-26/03/2021
DOI: 10.1109/dcc50243.2021.00037


See at: ISTI Repository Open Access | doi.org Restricted | ieeexplore.ieee.org Restricted | CNR ExploRA


2021 Journal article Open Access
Rank/select queries over mutable bitmaps
Pibiri G. E., Kanda S.
The problem of answering rank/select queries over a bitmap is of utmost importance for many succinct data structures. When the bitmap does not change, many solutions exist on both the theoretical and the practical side. In this work we consider the case where one is allowed to modify the bitmap via a flip(i) operation that toggles its i-th bit. By adapting and properly extending some results concerning prefix-sum data structures, we present a practical solution to the problem, tailored for modern CPU instruction sets. Compared to the state of the art, our solution improves runtime with no space degradation. Moreover, it does not incur a significant runtime penalty when compared to the fastest immutable indexes, while providing even lower space overhead.
Source: Information systems (Oxf.) 99 (2021). doi:10.1016/j.is.2021.101756
DOI: 10.1016/j.is.2021.101756
DOI: 10.48550/arxiv.2009.12809
Project(s): BigDataGrapes via OpenAIRE


See at: arXiv.org e-Print Archive Open Access | Information Systems Open Access | ISTI Repository Open Access | Information Systems Restricted | doi.org Restricted | www.sciencedirect.com Restricted | CNR ExploRA
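
As a rough illustration of the three operations named in the entry above (flip, rank, select), the sketch below keeps the bitmap in 64-bit words together with a per-word count of set bits. It is deliberately naive: the counts are scanned linearly, whereas the article builds on prefix-sum data structures and CPU-specific instructions. The class name, the 0-based conventions, and the half-open rank range are assumptions made for this example.

```python
class MutableBitmap:
    """Bitmap supporting flip/rank/select (naive illustrative sketch)."""

    BLOCK = 64  # bits per word

    def __init__(self, n):
        self.n = n
        self.words = [0] * ((n + self.BLOCK - 1) // self.BLOCK)
        self.counts = [0] * len(self.words)  # number of 1 bits per word

    def flip(self, i):
        """Toggle bit i and keep the per-word count in sync."""
        w, b = divmod(i, self.BLOCK)
        self.words[w] ^= 1 << b
        self.counts[w] += 1 if (self.words[w] >> b) & 1 else -1

    def rank(self, i):
        """Number of 1 bits in positions [0, i)."""
        w, b = divmod(i, self.BLOCK)
        total = sum(self.counts[:w])  # linear scan; a prefix-sum structure would avoid it
        if b:
            total += bin(self.words[w] & ((1 << b) - 1)).count("1")
        return total

    def select(self, k):
        """Position of the k-th 1 bit, with k counted from 0."""
        for w, c in enumerate(self.counts):
            if k >= c:
                k -= c  # the answer is not in this word
                continue
            word = self.words[w]
            for b in range(self.BLOCK):
                if (word >> b) & 1:
                    if k == 0:
                        return w * self.BLOCK + b
                    k -= 1
        raise IndexError("bitmap has too few set bits")
```

Replacing the linear scan over `counts` with a searchable prefix-sum structure is the natural next step, and is the connection to prefix sums that the abstract points to.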


2022 Journal article Open Access
Sparse and skew hashing of K-mers
Pibiri G. E.
MOTIVATION: A dictionary of k-mers is a data structure that stores a set of n distinct k-mers and supports membership queries. This data structure is at the heart of many important tasks in computational biology. High-throughput sequencing of DNA can produce very large k-mer sets, on the order of billions of strings; in such cases, the memory consumption and query efficiency of the data structure are a concrete challenge. RESULTS: To tackle this problem, we describe a compressed and associative dictionary for k-mers, that is, a data structure where strings are represented in compact form and each of them is associated with a unique integer identifier in the range [0,n). We show that some statistical properties of k-mer minimizers can be exploited by minimal perfect hashing to substantially improve the space/time trade-off of the dictionary compared to the best-known solutions. AVAILABILITY AND IMPLEMENTATION: https://github.com/jermp/sshash. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Source: Bioinformatics (Oxf., Online) 38 (2022): i185–i194. doi:10.1093/bioinformatics/btac245
DOI: 10.1093/bioinformatics/btac245
Project(s): MobiDataLab via OpenAIRE


See at: academic.oup.com Open Access | ISTI Repository Open Access | CNR ExploRA


2022 Conference article Open Access
On weighted k-mer dictionaries
Pibiri G. E.
We consider the problem of representing a set of k-mers and their abundance counts, or weights, in compressed space so that assessing membership and retrieving the weight of a k-mer is efficient. The representation is called a weighted dictionary of k-mers and finds application in numerous tasks in Bioinformatics that usually count k-mers as a pre-processing step. In fact, k-mer counting tools produce very large outputs that may result in a severe bottleneck for subsequent processing. In this work we extend the recently introduced SSHash dictionary (Pibiri, Bioinformatics 2022) to also store compactly the weights of the k-mers. From a technical perspective, we exploit the order of the k-mers represented in SSHash to encode runs of weights, hence allowing (several times) better compression than the empirical entropy of the weights. We also study the problem of reducing the number of runs in the weights to improve compression even further and illustrate a lower bound for this problem. We propose an efficient greedy algorithm to reduce the number of runs and show empirically that it performs well, i.e., very close to the lower bound. Lastly, we corroborate our findings with experiments on real-world datasets and comparison with competitive alternatives. To date, SSHash is the only k-mer dictionary that is exact, weighted, associative, fast, and small.
Source: WABI 2022 - International Workshop on Algorithms in Bioinformatics, Potsdam, Germany, 05-09/09/2022
DOI: 10.4230/lipics.wabi.2022.9
Project(s): MobiDataLab via OpenAIRE


See at: drops.dagstuhl.de Open Access | ISTI Repository Open Access | CNR ExploRA


2023 Conference article Open Access
Spectrum preserving tilings enable sparse and modular reference indexing
Fan J., Khan J., Pibiri G. E., Patro R.
The reference indexing problem for k-mers is to pre-process a collection of reference genomic sequences so that the position of all occurrences of any queried k-mer can be rapidly identified. An efficient and scalable solution to this problem is fundamental for many tasks in bioinformatics. In this work, we introduce the spectrum preserving tiling (SPT), a general representation of the reference collection that specifies how a set of tiles repeatedly occur to spell out its constituent reference sequences. By encoding the order and positions where tiles occur, SPTs enable the implementation and analysis of a general class of modular indexes. An index over an SPT decomposes the reference indexing problem for k-mers into: (1) a k-mer-to-tile mapping; and (2) a tile-to-occurrence mapping. Recently introduced work to construct and compactly index k-mer sets can be used to efficiently implement the k-mer-to-tile mapping. However, implementing the tile-to-occurrence mapping remains prohibitively costly in terms of space. As reference collections become large, the space requirements of the tile-to-occurrence mapping dominate those of the k-mer-to-tile mapping, since the former depends on the total amount of sequence while the latter depends on the number of unique k-mers in the collection. To address this, we introduce a class of sampling schemes for SPTs that trade off speed to reduce the size of the tile-to-occurrence mapping. We implement a practical index with these sampling schemes in the tool pufferfish2. When indexing over 30,000 bacterial genomes, pufferfish2 reduces the size of the tile-to-occurrence mapping from 86.3 GB to 34.6 GB while incurring only a 3.6× slowdown when querying k-mers from a sequenced readset. Availability: pufferfish2 is implemented in Rust and available at https://github.com/COMBINE-lab/pufferfish2.
Source: RECOMB 2023 - 27th International Conference on Research in Computational Molecular Biology, pp. 21–40, Istanbul, Turkey, 16-19/04/2023
DOI: 10.1007/978-3-031-29119-7_2
Project(s): MobiDataLab via OpenAIRE


See at: link.springer.com Open Access | ISTI Repository Open Access | CNR ExploRA


2023 Journal article Open Access
On weighted k-mer dictionaries
Pibiri G. E.
We consider the problem of representing a set of k-mers and their abundance counts, or weights, in compressed space so that assessing membership and retrieving the weight of a k-mer is efficient. The representation is called a weighted dictionary of k-mers and finds application in numerous tasks in Bioinformatics that usually count k-mers as a pre-processing step. In fact, k-mer counting tools produce very large outputs that may result in a severe bottleneck for subsequent processing. In this work we extend the recently introduced SSHash dictionary (Pibiri in Bioinformatics 38:185-194, 2022) to also store compactly the weights of the k-mers. From a technical perspective, we exploit the order of the k-mers represented in SSHash to encode runs of weights, hence allowing much better compression than the empirical entropy of the weights. We study the problem of reducing the number of runs in the weights to improve compression even further and give an optimal algorithm for this problem. Lastly, we corroborate our findings with experiments on real-world datasets and comparison with competitive alternatives. To date, SSHash is the only k-mer dictionary that is exact, weighted, associative, fast, and small.
Source: Algorithms for molecular biology 18 (2023). doi:10.1186/s13015-023-00226-2
DOI: 10.1186/s13015-023-00226-2


See at: almob.biomedcentral.com Open Access | ISTI Repository Open Access | CNR ExploRA


2023 Journal article Open Access
Matchtigs: minimum plain text representation of k-mer sets
Schmidt S., Khan S., Alanko J. N., Pibiri G. E., Tomescu A. I.
We propose a polynomial algorithm computing a minimum plain-text representation of k-mer sets, as well as an efficient near-minimum greedy heuristic. When compressing read sets of large model organisms or bacterial pangenomes, with only a minor runtime increase, we shrink the representation by up to 59% over unitigs and 26% over previous work. Additionally, the number of strings is decreased by up to 97% over unitigs and 90% over previous work. Finally, a small representation has advantages in downstream applications, as it speeds up SSHash-Lite queries by up to 4.26× over unitigs and 2.10× over previous work.
Source: Genome biology (Online) 24 (2023). doi:10.1186/s13059-023-02968-z
DOI: 10.1186/s13059-023-02968-z
Project(s): MobiDataLab via OpenAIRE, SAFEBIO via OpenAIRE


See at: genomebiology.biomedcentral.com Open Access | ISTI Repository Open Access | CNR ExploRA


2023 Journal article Open Access
Locality-preserving minimal perfect hashing of k-mers
Pibiri G. E., Shibuya Y., Limasset A.
Motivation: Minimal perfect hashing is the problem of mapping a static set of n distinct keys into the address space {1, ... , n} bijectively. It is well-known that n log2(e) bits are necessary to specify a minimal perfect hash function (MPHF) f, when no additional knowledge of the input keys is to be used. However, it is often the case in practice that the input keys have intrinsic relationships that we can exploit to lower the bit complexity of f. For example, consider a string and the set of all its distinct k-mers as input keys: since two consecutive k-mers share an overlap of k - 1 symbols, it seems possible to beat the classic log2(e) bits/key barrier in this case. Moreover, we would like f to map consecutive k-mers to consecutive addresses, so as to also preserve as much as possible their relationship in the codomain. This is a useful feature in practice as it guarantees a certain degree of locality of reference for f, resulting in a better evaluation time when querying consecutive k-mers.
Results: Motivated by these premises, we initiate the study of a new type of locality-preserving MPHF designed for k-mers extracted consecutively from a collection of strings. We design a construction whose space usage decreases for growing k and discuss experiments with a practical implementation of the method: in practice, the functions built with our method can be several times smaller and even faster to query than the most efficient MPHFs in the literature.
Code Availability: https://github.com/jermp/lphash
Data Availability: https://zenodo.org/record/7239205
Source: Bioinformatics (Oxf., Online) 39 (2023): i534–i543. doi:10.1093/bioinformatics/btad219
DOI: 10.1093/bioinformatics/btad219


See at: academic.oup.com Open Access | ISTI Repository Open Access | CNR ExploRA


2023 Conference article Open Access
Fulgor: a fast and compact k-mer index for large-scale matching and color queries
Fan J., Singh N. P., Khan J., Pibiri G. E., Patro R.
The problem of sequence identification or matching, that is, determining the subset of reference sequences from a given collection that are likely to contain a short, queried nucleotide sequence, is relevant for many important tasks in Computational Biology, such as metagenomics and pan-genome analysis. Due to the complex nature of such analyses and the large scale of the reference collections, a resource-efficient solution to this problem is of utmost importance. This poses the threefold challenge of representing the reference collection with a data structure that is efficient to query, has light memory usage, and scales well to large collections. To solve this problem, we describe how recent advancements in associative, order-preserving, k-mer dictionaries can be combined with a compressed inverted index to implement a fast and compact colored de Bruijn graph data structure. This index takes full advantage of the fact that unitigs in the colored de Bruijn graph are monochromatic (all k-mers in a unitig have the same set of references of origin, or "color"), leveraging the order-preserving property of its dictionary. In fact, k-mers are kept in unitig order by the dictionary, thereby allowing for the encoding of the map from k-mers to their inverted lists in as little as 1+o(1) bits per unitig. Hence, one inverted list per unitig is stored in the index with almost no space/time overhead. By combining this property with simple but effective compression methods for inverted lists, the index achieves very small space. We implement these methods in a tool called Fulgor. Compared to Themisto, the prior state of the art, Fulgor indexes a heterogeneous collection of 30,691 bacterial genomes in 3.8× less space, a collection of 150,000 Salmonella enterica genomes in approximately 2× less space, is at least twice as fast for color queries, and is 2-6× faster to construct.
Source: WABI 2023 - 23rd International Workshop on Algorithms in Bioinformatics, pp. 18:1–18:21, Houston, Texas (USA), 03-06/09/2023
DOI: 10.4230/lipics.wabi.2023.18
DOI: 10.1101/2023.05.09.539895


See at: drops.dagstuhl.de Open Access | Europe PubMed Central Open Access | Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari Open Access | doi.org Restricted | doi.org Restricted | pubmed.ncbi.nlm.nih.gov Restricted | CNR ExploRA


2024 Journal article Open Access
Fulgor: a fast and compact k-mer index for large-scale matching and color queries
Fan J., Khan J., Pratap Singh N., Pibiri G. E., Patro R.
The problem of sequence identification or matching, that is, determining the subset of reference sequences from a given collection that are likely to contain a short, queried nucleotide sequence, is relevant for many important tasks in Computational Biology, such as metagenomics and pangenome analysis. Due to the complex nature of such analyses and the large scale of the reference collections, a resource-efficient solution to this problem is of utmost importance. This poses the threefold challenge of representing the reference collection with a data structure that is efficient to query, has light memory usage, and scales well to large collections. To solve this problem, we describe an efficient colored de Bruijn graph index, arising as the combination of a k-mer dictionary with a compressed inverted index. The proposed index takes full advantage of the fact that unitigs in the colored compacted de Bruijn graph are monochromatic (i.e., all k-mers in a unitig have the same set of references of origin, or color). Specifically, the unitigs are kept in the dictionary in color order, thereby allowing for the encoding of the map from k-mers to their colors in as little as 1 + o(1) bits per unitig. Hence, one color per unitig is stored in the index with almost no space/time overhead. By combining this property with simple but effective compression methods for integer lists, the index achieves very small space. We implement these methods in a tool called Fulgor, and conduct an extensive experimental analysis to demonstrate the improvement of our tool over previous solutions. For example, compared to Themisto, the strongest competitor in terms of index space vs. query time trade-off, Fulgor requires significantly less space (up to 43% less space for a collection of 150,000 Salmonella enterica genomes), is at least twice as fast for color queries, and is 2-6× faster to construct.
Source: Algorithms for molecular biology 19 (2024). doi:10.1186/s13015-024-00251-9
DOI: 10.1186/s13015-024-00251-9


See at: almob.biomedcentral.com Open Access | ISTI Repository Open Access | CNR ExploRA


2017 Journal article Open Access
Clustered Elias-Fano Indexes
Pibiri G. E., Venturini R.
State-of-the-art encoders for inverted indexes compress each posting list individually. Encoding clusters of posting lists offers the possibility of reducing the redundancy of the lists while maintaining a noticeable query processing speed.
Source: ACM transactions on information systems 36 (2017). doi:10.1145/3052773
DOI: 10.1145/3052773
Project(s): SoBigData via OpenAIRE


See at: ACM Transactions on Information Systems Open Access | ISTI Repository Open Access | dl.acm.org Restricted | ACM Transactions on Information Systems Restricted | CNR ExploRA


2017 Conference article Open Access
Dynamic Elias-Fano Representation
Pibiri G. E., Venturini R.
We show that it is possible to store a dynamic ordered set S(n,u) of n integers drawn from a bounded universe of size u in space close to the information-theoretic lower bound and yet preserve the asymptotic time optimality of the operations. Our results leverage the Elias-Fano representation of S(n,u), which takes EF(S(n,u)) = n⌈log(u/n)⌉ + 2n bits of space and can be shown to be less than half a bit per element away from the information-theoretic minimum. Considering a RAM model with memory words of Θ(log u) bits, we focus on the case in which the integers of S are drawn from a polynomial universe of size u = n^γ, for any γ = Θ(1). We represent S(n,u) with EF(S(n,u)) + o(n) bits of space and: 1. support static predecessor/successor queries in O(min{1+log(u/n), loglog n}); 2. make S grow in an append-only fashion by spending O(1) per inserted element; 3. support random access in O(log n/loglog n) worst-case, insertions/deletions in O(log n/loglog n) amortized and predecessor/successor queries in O(min{1+log(u/n), loglog n}) worst-case time. These time bounds are optimal.
Source: Annual Symposium on Combinatorial Pattern Matching, Warsaw, Poland, 4-6/07/2017
DOI: 10.4230/lipics.cpm.2017.30


See at: drops.dagstuhl.de Open Access | ISTI Repository Open Access | CNR ExploRA
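
The static Elias-Fano representation that the entry above builds on can be sketched in a few lines. The code below is a minimal, unoptimized illustration assuming a non-empty sorted list of integers in [0, u) with u >= n; it stores bits in plain Python lists rather than packed words, and the function names are hypothetical. It does not implement the dynamic operations of the paper. Each value is split into l = ⌈log2(u/n)⌉ low bits, stored verbatim, and a high part, stored in unary in a bit array of at most 2n + 1 bits, which is where the bound n⌈log(u/n)⌉ + 2n comes from.

```python
import math


def elias_fano_bits(n, u):
    """Space bound n * ceil(log2(u / n)) + 2n, in bits."""
    return n * math.ceil(math.log2(u / n)) + 2 * n


def ef_encode(values, u):
    """Encode a sorted list of integers in [0, u); returns (l, low, high)."""
    n = len(values)
    l = max(0, math.ceil(math.log2(u / n)))  # number of low bits per element
    low = [x & ((1 << l) - 1) for x in values]
    high = [0] * (n + (u >> l) + 1)
    for i, x in enumerate(values):
        high[(x >> l) + i] = 1  # unary code: the i-th 1 sits at (high part + i)
    return l, low, high


def ef_access(encoded, i):
    """Return the i-th value (0-based) by locating the i-th 1 in `high`."""
    l, low, high = encoded
    ones = -1
    for pos, bit in enumerate(high):
        ones += bit
        if bit and ones == i:
            return ((pos - i) << l) | low[i]
    raise IndexError(i)
```

The linear scan in `ef_access` stands in for a constant-time select structure, which a practical implementation would add on top of the `high` array.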


2017 Conference article Restricted
Efficient Data Structures for Massive N-Gram Datasets
Pibiri G. E., Venturini R.
The efficient indexing of large and sparse N-gram datasets is crucial in several applications in Information Retrieval, Natural Language Processing and Machine Learning. Because of the stringent efficiency requirements, dealing with billions of N-grams poses the challenge of introducing a compressed representation that preserves the query processing speed. In this paper we study the problem of reducing the space required by the representation of such datasets, maintaining the capability of looking up a given N-gram within microseconds. For this purpose we describe compressed, exact and lossless data structures that achieve, at the same time, high space reductions and no time degradation with respect to state-of-the-art software packages. In particular, we present a trie data structure in which each word following a context of fixed length k, i.e., its preceding k words, is encoded as an integer whose value is proportional to the number of words that follow such context. Since the number of words following a given context is typically very small in natural languages, we are able to lower the space of representation to compression levels that were never achieved before. Despite the significant savings in space, we show that our technique introduces a negligible penalty at query time.
Source: International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 615–624, Tokyo, Japan, 7-11/08/2017
DOI: 10.1145/3077136.3080798


See at: dl.acm.org Restricted | doi.org Restricted | CNR ExploRA


2019 Journal article Open Access
Handling massive n-gram datasets efficiently
Pibiri G. E., Venturini R.
Two fundamental problems concern the handling of large n-gram language models: indexing, that is, compressing the n-grams and associated satellite values without compromising their retrieval speed, and estimation, that is, computing the probability distribution of the n-grams extracted from a large textual source. Performing these two tasks efficiently is vital for several applications in the fields of Information Retrieval, Natural Language Processing, and Machine Learning, such as auto-completion in search engines and machine translation. Regarding the problem of indexing, we describe compressed, exact, and lossless data structures that simultaneously achieve high space reductions and no time degradation with respect to the state-of-the-art solutions and related software packages. In particular, we present a compressed trie data structure in which each word of an n-gram following a context of fixed length k, that is, its preceding k words, is encoded as an integer whose value is proportional to the number of words that follow such context. Since the number of words following a given context is typically very small in natural languages, we lower the space of representation to compression levels that were never achieved before, allowing the indexing of billions of strings. Despite the significant savings in space, our technique introduces a negligible penalty at query time. Specifically, the most space-efficient competitors in the literature, which are both quantized and lossy, do not take less space than our trie data structure and are up to 5 times slower. Conversely, our trie is as fast as the fastest competitor but also retains an advantage of up to 65% in absolute space. Regarding the problem of estimation, we present a novel algorithm for estimating modified Kneser-Ney language models that have emerged as the de-facto choice for language modeling in both academia and industry thanks to their relatively low perplexity performance. Estimating such models from large textual sources poses the challenge of devising algorithms that make a parsimonious use of the disk. The state-of-the-art algorithm uses three sorting steps in external memory: we show an improved construction that requires only one sorting step by exploiting the properties of the extracted n-gram strings. With an extensive experimental analysis performed on billions of n-grams, we show an average improvement of 4.5 times on the total runtime of the previous approach.
Source: ACM transactions on information systems 37 (2019). doi:10.1145/3302913
DOI: 10.1145/3302913
DOI: 10.48550/arxiv.1806.09447
DOI: 10.5281/zenodo.3257994
DOI: 10.5281/zenodo.3257995
Project(s): BigDataGrapes via OpenAIRE


See at: arXiv.org e-Print Archive Open Access | ZENODO Open Access | ZENODO Open Access | ISTI Repository Open Access | ACM Transactions on Information Systems Open Access | dl.acm.org Restricted | ACM Transactions on Information Systems Restricted | doi.org Restricted | CNR ExploRA


2018 Contribution to book Open Access
Inverted Index Compression
Pibiri G. E., Venturini R.
The data structure at the core of today's large-scale search engines, social networks and storage architectures is the inverted index, which can be regarded as a collection of sorted integer sequences called inverted lists. Because of the many documents indexed by search engines and stringent performance requirements dictated by the heavy load of user queries, the inverted lists often store several million (even billions of) integers and must be searched efficiently. In this scenario, compressing the inverted lists of the index appears as a mandatory design phase since it can introduce a twofold advantage over a non-compressed representation: feed faster memory levels with more data in order to speed up the query processing algorithms and reduce the number of storage machines needed to host the whole index. The scope of the chapter is to survey the most important encoding algorithms developed for efficient inverted index compression.
DOI: 10.1007/978-3-319-63962-8_52-1
DOI: 10.1007/978-3-319-77525-8_52


See at: ISTI Repository Open Access | link.springer.com Restricted | link.springer.com Restricted | CNR ExploRA


2020 Journal article Open Access
Compressed indexes for fast search of semantic data
Pibiri G. E., Perego R., Venturini R.
The sheer increase in volume of RDF data demands efficient solutions for the triple indexing problem, that is, devising a compressed data structure to compactly represent RDF triples while guaranteeing fast pattern matching operations. This problem lies at the heart of delivering good practical performance for the resolution of complex SPARQL queries on large RDF datasets. In this work, we propose a trie-based index layout to solve the problem and introduce two novel techniques to reduce its space of representation for improved effectiveness. The extensive experimental analysis, conducted over a wide range of publicly available real-world datasets, reveals that our best space/time trade-off configuration substantially outperforms existing state-of-the-art solutions, by taking 30-60% less space and speeding up query execution by a factor of 2-81×.
Source: IEEE transactions on knowledge and data engineering (Print) 33 (2020): 3187–3198. doi:10.1109/TKDE.2020.2966609
DOI: 10.1109/tkde.2020.2966609
DOI: 10.48550/arxiv.1904.07619
Project(s): BigDataGrapes via OpenAIRE


See at: arXiv.org e-Print Archive Open Access | IEEE Transactions on Knowledge and Data Engineering Open Access | ISTI Repository Open Access | IEEE Transactions on Knowledge and Data Engineering Restricted | doi.org Restricted | ieeexplore.ieee.org Restricted | CNR ExploRA