154 result(s)
2012 Other Open Access OPEN
Studying search shortcuts in a query log to improve retrieval over query sessions
Albakour M, Nardini F M, Adeyanju I, Kruschwitz U
In this paper, we study a state-of-the-art model to derive query suggestions from the search logs of a Web search engine in the context of its application for the session retrieval problem. Using the session track 2011 as an evaluation platform and the MSN 2006 logs, we analyze the characteristics of this model and how it can be optimized to better represent user information needs throughout the session.
Project(s): S-CUBE via OpenAIRE

See at: CNR IRIS Open Access | ISTI Repository Open Access | CNR IRIS Restricted


2012 Journal article Open Access OPEN
RICH: Research and Innovation for Cultural Heritage
Barone V, Licari D, Nardini F M
This paper describes RICH: a new architecture conceived and developed at the Scuola Normale Superiore, for collecting, promoting and sharing cultural heritage data. Starting with the observation that cultural heritage is a cross-cutting field of research where needs are often poorly integrated with each other, a new architecture is required, aimed at solving this integration issue. RICH provides a first step in this direction by addressing several needs in this field through state-of-the-art technologies such as 3D vision and virtualization. The paper outlines the principal building blocks of the architecture by planning and explaining each step in terms of functionality. We also report some preliminary experiences that are being carried out using this architecture.
Source: CONSERVATION SCIENCE IN CULTURAL HERITAGE (ONLINE), vol. 12, pp. 109-133
DOI: 10.6092/issn.1973-9494/3385

See at: Conservation Science in Cultural Heritage Open Access | conservation-science.unibo.it Restricted | CNR IRIS Restricted | CNR IRIS Restricted


2012 Conference article Restricted
RGU-ISTI-Essex at TREC 2011 session track
Adeyanju I, Nardini F M, Albakour M, Song D, Kruschwitz U
Mining query recommendation from query logs has attracted a lot of attention in recent years. We propose to use query recommendations extracted from the logs of a web search engine to solve the session track tasks. The runs are obtained by using the Search Shortcuts recommender system. The Search Shortcuts technique uses an inverted index and the concept of "successful sessions" present in a web search engine's query log to produce effective recommendations for both frequent and rare/unseen queries. We adapt the above technique as a query expansion tool and use it to expand the given queries for Session Track at TREC 2011. The expansion is generated by using a method which aims to consider all past queries in the session. The expansion terms obtained are then used to build a global, uniformly weighted, representation of the user session (RL2). Furthermore, the expansion terms are then combined with a ranked list of results in order to boost terms appearing more frequently in the final results lists (RL3). Finally, we also integrate dwell times and the weighting method obtained taking both result lists and clicks into account for assigning weights to the terms to expand the final query of the session. In addition to that, we submitted a baseline run. It is based on the observation that using the term "wikipedia" to expand the query resulted in a better retrieval performance for the tasks at last year's session track at TREC 2010.
Source: NIST SPECIAL PUBLICATION, pp. 1-12. Gaithersburg, Md. USA, 15-18 November 2011

See at: CNR IRIS Restricted | CNR IRIS Restricted | trec.nist.gov Restricted


2015 Journal article Restricted
Planning sightseeing tours using crowdsensed trajectories
Brilhante Igo R, De Macedo Josè Af, Nardini Fm, Perego R, Renso C
We present an application where semantically enriched trajectories obtained from crowdsensed data are used to build an advanced system for planning personalized sightseeing tours, called TRIPBUILDER. The interesting feature of TRIPBUILDER is that it uses Wikipedia content and trajectories of previous tourists collected by georeferenced Flickr photos in a complex spatio-temporal framework. The objective is to address, in an unsupervised way, the problem of suggesting a budgeted sightseeing tour based on the preferences of the tourist and the time available for the visit. We present a few highlights of how TRIPBUILDER works along with a research agenda where we discuss the role of semantically enriched trajectories and crowdsourced location data in planning itineraries.
Source: SIGSPATIAL SPECIAL, vol. 7 (issue 1), pp. 59-66
DOI: 10.1145/2782759.2782769

See at: dl.acm.org Restricted | SIGSPATIAL Special Restricted | CNR IRIS Restricted | CNR IRIS Restricted


2015 Conference article Restricted
MUSETS: Diversity-aware web query suggestions for shortening user sessions
Sydow M, Muntean Ci, Nardini Fm, Matwin S, Silvestri F
We propose MUSETS (multi-session total shortening) - a novel formulation of the query suggestion task, specified as an optimization problem. Given an ambiguous user query, the goal is to propose to the user a set of query suggestions that optimizes a diversity-aware objective function. The function models the expected number of query reformulations that a user would save until reaching a satisfactory query formulation. The function is diversity-aware, as it naturally enforces high coverage of different alternative continuations of the user session. For modeling the topics covered by the queries, we also use an extended query representation based on entities extracted from Wikipedia. We apply a machine learning approach to learn the model on a set of user sessions to be subsequently used for queries that are under-represented in historical query logs and present an evaluation of the approach.
DOI: 10.1007/978-3-319-25252-0_26

See at: doi.org Restricted | CNR IRIS Restricted | CNR IRIS Restricted | CNR IRIS Restricted | link.springer.com Restricted


2015 Conference article Open Access OPEN
Gamification in information retrieval: State of the art, challenges and opportunities
Muntean Ci, Nardini Fm
Gamification aims at applying game design principles and elements, such as points, badges, feedback or leaderboards, in non-gaming environments. An interesting goal of gamification is to combine and exploit the fun factor for targeting other aspects like achieving more accurate work, more cost-effective solutions and better retention rates. The application of gamification techniques to IR tasks poses interesting research challenges. In this paper, we propose an analysis of the state of the art in this field and we summarize interesting challenges and opportunities for the near future.
Source: CEUR WORKSHOP PROCEEDINGS, vol. 1404. Cagliari, Italy, 25-26/05/2015

See at: ceur-ws.org Open Access | CNR IRIS Open Access | CNR IRIS Restricted


2016 Conference article Open Access OPEN
Understanding human mobility during events in foursquare
Muntean Ci, Nardini Fm, Noulas A
Social events can generate high influxes of people moving between various locations in a city. They can have a considerable impact on the local economy, whether they are sport events, concerts or festivals. These events are capable of generating sudden changes in the activity landscape of a city, with the neighborhoods that host events becoming unusually busy and active compared to times of regular citizen activity. While event and anomaly detection more generally has been a topic of study in recent years, as has event recommendation for mobile users, progress has been slower towards building systems that are able to capture the sudden shift appropriately in this setting. In this work we exploit data from the location-based service Foursquare to study mobility during events in Chicago, and later expand our study to other cities as well. Our aim is to identify what differences emerge in terms of user mobility during events versus regular periods of human activity.
Source: CEUR WORKSHOP PROCEEDINGS. Venezia, Italy, 30-31 May 2016

See at: ceur-ws.org Open Access | CNR IRIS Open Access | CNR IRIS Restricted


2016 Book Open Access OPEN
Proceedings of the 7th Italian Information Retrieval Workshop
Di Nunzio G, Nardini F M, Orlando S
This volume contains the papers presented at IIR 2016 (the 7th Italian Information Retrieval Workshop) held on 30-31 May, 2016, in Venice, Italy. The purpose of the Italian Information Retrieval Workshop (IIR) is to provide a forum for stimulating and disseminating research in Information Retrieval, where Italian researchers (especially young ones) and researchers affiliated with Italian institutions can network and discuss their research results in an informal way. The general chair of IIR 2016 was Salvatore Orlando (Università Ca' Foscari Venezia), and the co-chairs of the program committee were Giorgio Maria Di Nunzio (Università di Padova) and Franco Maria Nardini (ISTI-CNR, Pisa). For this edition, the program committee accepted 26 papers for oral presentation and the final program also included four invited talks by Stefano Ceri (Politecnico di Milano), Alessandro Moschitti (Qatar Computing Research Institute, HBKU), Giorgio Satta (Università di Padova), and Rossano Venturini (Università di Pisa). The program co-chairs wish to thank all the members of the Program Committee (see below) as well as the local organizers Francesco Lettich (Università Ca' Foscari Venezia) and Claudio Silvestri (Università Ca' Foscari Venezia), the webmaster Salvatore Trani (ISTI-CNR, Pisa), and the publicity chair, Cristina Muntean (ISTI-CNR, Pisa).

See at: ceur-ws.org Open Access | CNR IRIS Open Access | CNR IRIS Restricted


2019 Conference article Open Access OPEN
Learning to Rank in Theory and Practice: From Gradient Boosting to Neural Networks and Unbiased Learning
Lucchese C, Nardini Fm, Pasumarthi Rk, Bruch S, Bendersky M, Wang X, Oosterhuis H, Jagerman R, De Rijke M
This tutorial aims to weave together diverse strands of modern Learning to Rank (LtR) research, and present them in a unified full-day tutorial. First, we will introduce the fundamentals of LtR, and an overview of its various sub-fields. Then, we will discuss some recent advances in gradient boosting methods such as LambdaMART by focusing on their efficiency/effectiveness trade-offs and optimizations. Subsequently, we will present TF-Ranking, a new open source TensorFlow package for neural LtR models, and how it can be used for modeling sparse textual features. Finally, we will conclude the tutorial by covering unbiased LtR -- a new research field aiming at learning from biased implicit user feedback. The tutorial will consist of three two-hour sessions, each focusing on one of the topics described above. It will provide a mix of theoretical and hands-on sessions, and should benefit both academics interested in learning more about the current state-of-the-art in LtR, as well as practitioners who want to use LtR techniques in their applications.
DOI: 10.1145/3331184.3334824

See at: dl.acm.org Open Access | CNR IRIS Open Access | ISTI Repository Open Access | NARCIS Restricted | doi.org Restricted | NARCIS Restricted | CNR IRIS Restricted | CNR IRIS Restricted


2011 Other Open Access OPEN
Query Log Mining to Enhance User Experience in Search Engines
Nardini Fm
The Web is the biggest repository of documents humans have ever built, and it grows in size every day. Users rely on Web search engines (WSEs) for finding information on the Web. By submitting a textual query expressing their information need, WSE users obtain a list of documents that are highly relevant to the query. Moreover, WSEs store this huge amount of user activity in query logs. Query log mining is the set of techniques aiming at extracting valuable knowledge from query logs. This knowledge represents one of the most used ways of enhancing the users' search experience. According to this vision, in this thesis we first show that the knowledge extracted from query logs suffers from aging effects, and we propose a solution to this phenomenon. Secondly, we propose new algorithms for query recommendation that overcome the aging problem. Moreover, we study new query recommendation techniques for efficiently producing recommendations for rare queries. Finally, we study the problem of diversifying Web search engine results. We define a methodology based on the knowledge derived from query logs for detecting when and how query results need to be diversified and we develop an efficient algorithm for diversifying search results.

See at: CNR IRIS Open Access | ISTI Repository Open Access | CNR IRIS Restricted


2020 Conference article Open Access OPEN
Predicting and explaining privacy risk exposure in mobility data
Naretto F, Pellungrini R, Monreale A, Nardini Fm, Musolesi M
Mobility data is a proxy of different social dynamics and its analysis enables a wide range of user services. Unfortunately, mobility data are very sensitive because the sharing of people's whereabouts may raise serious privacy concerns. Existing frameworks for privacy risk assessment provide tools to identify and measure privacy risks, but they often (i) have high computational complexity; and (ii) are not able to provide users with a justification of the reported risks. In this paper, we propose expert, a new framework for the prediction and explanation of privacy risk on mobility data. We empirically evaluate privacy risk on real data, simulating a privacy attack with a state-of-the-art privacy risk assessment framework. We then extract individual mobility profiles from the data for predicting their risk. We compare the performance of several machine learning algorithms in order to identify the best approach for our task. Finally, we show how it is possible to explain privacy risk prediction on real data, using two algorithms: Shap, a feature importance-based method and Lore, a rule-based method. Overall, expert is able to provide a user with the privacy risk and an explanation of the risk itself. The experiments show excellent performance for the prediction task.
DOI: 10.1007/978-3-030-61527-7_27
Project(s): XAI via OpenAIRE, SoBigData-PlusPlus via OpenAIRE

See at: discovery.ucl.ac.uk Open Access | CNR IRIS Open Access | link.springer.com Open Access | ISTI Repository Open Access | Lecture Notes in Computer Science Restricted | Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna Restricted | CNR IRIS Restricted | CNR IRIS Restricted


2022 Conference article Open Access OPEN
ReNeuIR: Reaching Efficiency in Neural Information Retrieval
Bruch S, Lucchese C, Nardini Fm
Perhaps the applied nature of information retrieval research goes some way to explain the community's rich history of evaluating machine learning models holistically, understanding that efficacy matters but so does the computational cost incurred to achieve it. This is evidenced, for example, by more than a decade of research on efficient training and inference of large decision forest models in learning-to-rank. As the community adopts even more complex, neural network-based models in a wide range of applications, questions on efficiency have once again become relevant. We propose this workshop as a forum for a critical discussion of efficiency in the era of neural information retrieval, to encourage debate on the current state and future directions of research in this space, and to promote more sustainable research by identifying best practices in the development and evaluation of neural models for information retrieval.
DOI: 10.1145/3477495.3531704

See at: dl.acm.org Open Access | CNR IRIS Open Access | ISTI Repository Open Access | doi.org Restricted | CNR IRIS Restricted | CNR IRIS Restricted


2023 Journal article Open Access OPEN
Efficient and effective tree-based and neural learning to rank
Bruch S, Lucchese C, Nardini Fm
As information retrieval researchers, we not only develop algorithmic solutions to hard problems, but we also insist on a proper, multifaceted evaluation of ideas. The literature on the fundamental topic of retrieval and ranking, for instance, has a rich history of studying the effectiveness of indexes, retrieval algorithms, and complex machine learning rankers, while at the same time quantifying their computational costs, from creation and training to application and inference. This is evidenced, for example, by more than a decade of research on efficient training and inference of large decision forest models in Learning to Rank (LtR). As we move towards even more complex, deep learning models in a wide range of applications, questions on efficiency have once again resurfaced with renewed urgency. Indeed, efficiency is no longer limited to time and space; instead it has found new, challenging dimensions that stretch to resource-, sample- and energy-efficiency with ramifications for researchers, users, and the environment. This monograph takes a step towards promoting the study of efficiency in the era of neural information retrieval by offering a comprehensive survey of the literature on efficiency and effectiveness in ranking, and to a limited extent, retrieval. This monograph was inspired by the parallels that exist between the challenges in neural network-based ranking solutions and their predecessors, decision forest-based LtR models, as well as the connections between the solutions the literature to date has to offer. We believe that by understanding the fundamentals underpinning these algorithmic and data structure solutions for containing the contentious relationship between efficiency and effectiveness, one can better identify future directions and more efficiently determine the merits of ideas. We also present what we believe to be important research directions in the forefront of efficiency and effectiveness in retrieval and ranking.
Source: FOUNDATIONS AND TRENDS IN INFORMATION RETRIEVAL, vol. 17 (issue 1), pp. 1-123
DOI: 10.1561/1500000071
DOI: 10.48550/arxiv.2305.08680

See at: CNR IRIS Open Access | ISTI Repository Open Access | Foundations and Trends® in Information Retrieval Restricted | doi.org Restricted | CNR IRIS Restricted | CNR IRIS Restricted


2023 Conference article Open Access OPEN
ReNeuIR at SIGIR 2023: the second workshop on Reaching Efficiency in Neural Information Retrieval
Bruch S, Mackenzie J, Maistro M, Nardini Fm
Multifaceted, empirical evaluation of algorithmic ideas is one of the central pillars of Information Retrieval (IR) research. The IR community has a rich history of studying the effectiveness of indexes, retrieval algorithms, and complex machine learning rankers and, at the same time, quantifying their computational costs, from creation and training to application and inference. As the community moves towards even more complex deep learning models, questions on efficiency have once again become relevant with renewed urgency. Indeed, efficiency is no longer limited to time and space; instead it has found new, challenging dimensions that stretch to resource-, sample- and energy-efficiency with ramifications for researchers, users, and the environment alike. Examining algorithms and models through the lens of holistic efficiency requires the establishment of standards and principles, from defining relevant concepts, to designing metrics, to creating guidelines for making sense of the significance of new findings. The second iteration of the ReNeuIR workshop aims to bring the community together to debate these questions, with the express purpose of moving towards a common benchmarking framework for efficiency.
DOI: 10.1145/3539618.3591922

See at: CNR IRIS Open Access | ISTI Repository Open Access | doi.org Restricted | CNR IRIS Restricted | CNR IRIS Restricted


2023 Journal article Open Access OPEN
An approximate algorithm for maximum inner product search over streaming sparse vectors
Bruch S, Nardini Fm, Ingber A, Liberty E
Maximum Inner Product Search or top-k retrieval on sparse vectors is well-understood in information retrieval, with a number of mature algorithms that solve it exactly. However, all existing algorithms are tailored to text and frequency-based similarity measures. To achieve optimal memory footprint and query latency, they rely on the near stationarity of documents and on laws governing natural languages. We consider, instead, a setup in which collections are streaming, necessitating dynamic indexing, and where indexing and retrieval must work with arbitrarily distributed real-valued vectors. As we show, existing algorithms are no longer competitive in this setup, even against naive solutions. We investigate this gap and present a novel approximate solution, called Sinnamon, that can efficiently retrieve the top-k results for sparse real-valued vectors drawn from arbitrary distributions. Notably, Sinnamon offers levers to trade off memory consumption, latency, and accuracy, making the algorithm suitable for constrained applications and systems. We give theoretical results on the error introduced by the approximate nature of the algorithm, and present an empirical evaluation of its performance on two hardware platforms and synthetic and real-valued datasets. We conclude by laying out concrete directions for future research on this general top-k retrieval problem over sparse vectors.
Source: ACM TRANSACTIONS ON INFORMATION SYSTEMS, vol. 42 (issue 2)
DOI: 10.1145/3609797

See at: dl.acm.org Open Access | CNR IRIS Open Access | ISTI Repository Open Access | ACM Transactions on Information Systems Restricted | CNR IRIS Restricted | CNR IRIS Restricted


2023 Book Open Access OPEN
String processing and information retrieval - SPIRE 2023 - Proceedings of the 30th International Symposium, Pisa, Italy, September 26-28, 2023
Nardini Fm, Pisanti N, Venturini R
The 30th International Symposium on String Processing and Information Retrieval (SPIRE) was held on September 26-28, 2023, in Pisa (Italy), followed by the 18th Workshop on Compression, Text, and Algorithms (WCTA) held on September 29, 2023. SPIRE started in 1993 as the South American Workshop on String Processing. It was held in Latin America until 2000. Then, SPIRE moved to Europe, and from then on, it has been held in Australia, Japan, the UK, Spain, Italy, Finland, Portugal, Israel, Brazil, Chile, Colombia, Mexico, Argentina, Bolivia, Peru, the USA, and France. SPIRE continues the long and well-established tradition of encouraging high-quality research at the broad nexus of string processing, information retrieval, and computational biology. This volume contains the accepted papers presented at SPIRE 2023. SPIRE 2023 received a total of 47 submissions. Each submission received at least three reviews. After an intensive discussion phase, the Scientific Program Committee accepted 31 papers. We thank all the authors for their valuable contributions and presentations at the conference and the Program Committee members for their valuable work during the review and discussion phases. We thank Springer for publishing the proceedings of SPIRE 2023 in the LNCS series and ACM SIGIR for sponsoring the conference. The scientific program of SPIRE 2023 includes invited talks by three eminent researchers in the field: Sebastian Bruch (Pinecone, USA), Inge Li Gørtz (Technical University of Denmark, Denmark), and Jakub Radoszewski (University of Warsaw, Poland). SPIRE 2023 had a Best Paper Award, sponsored by Springer. The award was announced during the conference. Finally, we thank the Local Organizing Committee members for making the conference successful.
DOI: 10.1007/978-3-031-43980-3

See at: CNR IRIS Open Access | ISTI Repository Open Access | CNR IRIS Restricted | CNR IRIS Restricted


2023 Book Open Access OPEN
Proceedings of the 13th Italian Information Retrieval Workshop (IIR 2023), Pisa, Italy, June 8-9, 2023
Nardini Fm, Tonellotto N, Faggioli G, Ferrara A
There were 33 papers submitted to this workshop. Out of these, 24 were accepted for this volume, 1 as a regular paper and 23 as short papers.
Source: CEUR WORKSHOP PROCEEDINGS, pp. I-II

See at: ceur-ws.org Open Access | CNR IRIS Open Access | ISTI Repository Open Access | CNR IRIS Restricted


2022 Journal article Open Access OPEN
Report on the 1st Workshop on Reaching Efficiency in Neural Information Retrieval (ReNeuIR 2022) at SIGIR 2022
Bruch S, Lucchese C, Nardini Fm
As Information Retrieval (IR) researchers, we not only develop algorithmic solutions to hard problems, but we also insist on a proper, multifaceted evaluation of ideas. The IR literature on the fundamental topic of retrieval and ranking, for instance, has a rich history of studying the effectiveness of indexes, retrieval algorithms, and complex machine learning rankers and, at the same time, quantifying their computational costs, from creation and training to application and inference. This is evidenced, for example, by more than a decade of research on efficient training and inference of large decision forest models in Learning to Rank (LtR). As we move towards even more complex, deep learning models in a wide range of applications, questions on efficiency have once again become relevant with renewed urgency. Indeed, efficiency is no longer limited to time- and space-efficiency; instead it has found new, challenging dimensions that stretch to resource-, sample- and energy-efficiency with ramifications for researchers, users, and the environment. As a step towards bringing together experts from industry and academia and creating a forum for a critical discussion and the promotion of efficiency in the era of Neural Information Retrieval (NIR), we held the ReNeuIR workshop on July 15th, 2022 as a hybrid event (in person in Madrid, Spain, along with online attendees) in conjunction with ACM SIGIR 2022. Recognizing the importance of this topic, approximately 80 participants answered our call and attended the workshop over three sessions. The event included a total of two keynotes and eight paper presentations, and concluded with a lively discussion where participants helped identify gaps in existing research and brainstormed future research directions. We had consensus in recognizing that efficiency is not simply latency, that a holistic, concrete definition of efficiency is needed to guide researchers and reviewers alike, and that more research is necessary in developing efficiency-centered evaluation metrics and standard benchmark datasets, platforms, and tools.
Source: SIGIR FORUM, vol. 56 (issue 2), pp. 1-14
DOI: 10.1145/3582900.3582916

See at: dl.acm.org Open Access | CNR IRIS Open Access | ISTI Repository Open Access | ACM SIGIR Forum Restricted | Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari Restricted | CNR IRIS Restricted | CNR IRIS Restricted


2024 Journal article Restricted
Report on the 13th Italian Information Retrieval Workshop (IIR 2023)
Faggioli G., Ferrara A., Nardini F. M., Tonellotto N.
The 13th Italian Information Retrieval Workshop is the thirteenth edition of the annual conference of the Italian information retrieval and recommender systems communities. The two days of the conference gathered interesting studies and research work on a wide range of topics on information retrieval, recommender systems, and natural language processing, such as Search and Ranking, Recommendation, Content Analysis, and Classification, Artificial Intelligence, NLP, Semantics, and Dialog, Domain-Specific Applications, Human Factors and Interfaces, and Evaluation. It was hosted by the National Research Council (CNR) of Italy and the University of Pisa in a conference facility in Pisa, Italy.
Source: SIGIR FORUM, vol. 57 (issue 2), pp. 1-12
DOI: 10.1145/3642979.3642990

See at: dl.acm.org Restricted | ACM SIGIR Forum Restricted | CNR IRIS Restricted | CNR IRIS Restricted


2024 Journal article Open Access OPEN
Bridging dense and sparse maximum inner product search
Bruch S., Nardini F. M., Ingber A., Liberty E.
Maximum inner product search (MIPS) over dense and sparse vectors has progressed independently in a bifurcated literature for decades; the latter is better known as top-k retrieval in Information Retrieval. This duality exists because sparse and dense vectors serve different end goals. That is despite the fact that they are manifestations of the same mathematical problem. In this work, we ask if algorithms for dense vectors could be applied effectively to sparse vectors, particularly those that violate the assumptions underlying top-k retrieval methods. We study clustering-based approximate MIPS where vectors are partitioned into clusters and only a fraction of clusters are searched during retrieval. We conduct a comprehensive analysis of dimensionality reduction for sparse vectors, and examine standard and spherical k-means for partitioning. Our experiments demonstrate that clustering-based retrieval serves as an efficient solution for sparse MIPS. As byproducts, we identify two research opportunities and explore their potential. First, we cast the clustering-based paradigm as dynamic pruning and turn that insight into a novel organization of the inverted index for approximate MIPS over general sparse vectors. Second, we offer a unified regime for MIPS over vectors that have dense and sparse subspaces, that is robust to query distributions.
Source: ACM TRANSACTIONS ON INFORMATION SYSTEMS, vol. 42 (issue 6), pp. 1-38
DOI: 10.1145/3665324
DOI: 10.48550/arxiv.2309.09013

See at: arXiv.org e-Print Archive Open Access | dl.acm.org Open Access | CNR IRIS Open Access | ACM Transactions on Information Systems Restricted | doi.org Restricted | CNR IRIS Restricted