2007 Journal article Restricted

Sorting out the document identifier assignment problem
Silvestri F
The compression of Inverted File indexes in Web Search Engines has received a lot of attention in recent years. Compressing the index not only reduces space occupancy but also improves the overall retrieval performance since it allows a better exploitation of the memory hierarchy. In this paper we empirically show that, in the case of collections of Web Documents, we can enhance the performance of compression algorithms by simply assigning identifiers to documents according to the lexicographical ordering of the URLs. We validate this assumption by comparing several assignment techniques and several compression algorithms on a fairly large document collection composed of about six million documents. The results are very encouraging since we can improve the compression ratio by up to 40% using an algorithm that takes about ninety seconds to finish using only 100 MB of main memory.
DOI: 10.1007/978-3-540-71496-5_12

See at: doi.org Restricted | CNR IRIS | www.springerlink.com
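
The identifier assignment described in the abstract above can be illustrated with a minimal sketch. It assumes that posting lists are stored as gaps between consecutive document identifiers, which the abstract does not state explicitly; the function names and the toy URLs are hypothetical and are not taken from the paper.

```python
def assign_ids_by_url(urls):
    """Give each distinct URL an integer id following lexicographic URL order."""
    return {url: doc_id for doc_id, url in enumerate(sorted(set(urls)))}


def d_gaps(doc_ids):
    """Turn a posting list into its first id followed by gaps between sorted ids."""
    ordered = sorted(doc_ids)
    if not ordered:
        return []
    return [ordered[0]] + [b - a for a, b in zip(ordered, ordered[1:])]


# Hypothetical toy collection: pages of the same site share a URL prefix.
urls = ["http://b.org/x", "http://a.com/1", "http://c.net/", "http://a.com/2"]
ids = assign_ids_by_url(urls)

# Posting list of a term that occurs only on the two a.com pages.
postings = [ids["http://a.com/2"], ids["http://a.com/1"]]
gaps = d_gaps(postings)
```

Under this assumption, pages of the same site receive neighbouring identifiers, so a term concentrated on one site tends to produce small gaps, which gap-oriented codes can usually store in fewer bits.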

2009 Book Restricted

Special Section: Scalable information systems
Lee W, Jianliang X, Jianzhong L, Silvestri F
As data and knowledge volumes keep increasing, and global means for information dissemination continue to diversify, new methods, modeling paradigms and structures are needed to efficiently support the mounting scalability requirements for the large variety of current and future data, information, and knowledge [1-3]. Grid computing, peer-to-peer technology, data and knowledge bases, distributed information retrieval technology, and networking technology should all converge to address the scalability concern. This special section compiles recent work on addressing scalability issues of distributed and peer-to-peer systems.
Source: FUTURE GENERATION COMPUTER SYSTEMS, pp. 51-52
DOI: 10.1016/j.future.2008.07.012

See at: Future Generation Computer Systems Restricted | CNR IRIS

2010 Journal article Restricted

Mining query logs: turning search usage data into knowledge
Silvestri F
Web search engines have stored in their logs information about users since they started to operate. This information often serves many purposes. The primary focus of this survey is on introducing the discipline of query mining by showing its foundations and by analyzing the basic algorithms and techniques that are used to extract useful knowledge from this (potentially) infinite source of information. We show how search applications may benefit from this kind of analysis by analyzing popular applications of query log mining and their influence on user experience. We conclude the paper by briefly presenting some of the most challenging current open problems in this field.
Source: FOUNDATIONS AND TRENDS IN INFORMATION RETRIEVAL, vol. 4 (issue 1-2), pp. 1-174
DOI: 10.1561/1500000013

See at: Foundations and Trends® in Information Retrieval Restricted | CNR IRIS | www.nowpublishers.com

2010 Journal article Restricted

A study on the effect of application and resource characteristics on the QOS in service provisioning environments
Varvarigou T, Tserpes K, Kyriazis D, Silvestri F, Psimogiannos N
This article deals with the problem of quality provisioning in business service-oriented environments, examining the resource selection process as an initial matching of the provided to the demanded QoS. It investigates how the application and resource characteristics affect the provided level of QoS, a relationship that intuitively exists but has not yet been mapped. To do so, it focuses on identifying the application and resource parameters that affect the customer-defined QoS parameters. The article realistically centres upon modeling a data mining application and simple PC nodes in order to study how they affect response times. It then proves the existence of these specific relations and maps them using simple artificial neural networks so as to be able to wrap them in a single mechanism for resource selection based on customer QoS requirements and real time provider QoS capabilities.
Source: INTERNATIONAL JOURNAL OF DISTRIBUTED SYSTEMS AND TECHNOLOGIES, vol. 1 (issue 1), pp. 55-75
DOI: 10.4018/jdst.2010090804

See at: International Journal of Distributed Systems and Technologies Restricted | CNR IRIS | www.igi-global.com

2007 Conference article Restricted

Know your neighbors: Web spam detection using the web topology
Castillo C, Donato D, Gionis A, Murdock V, Silvestri F
Web spam can significantly deteriorate the quality of search engine results. Thus there is a large incentive for commercial search engines to detect spam pages efficiently and accurately. In this paper we present a spam detection system that combines link-based and content-based features, and uses the topology of the Web graph by exploiting the link dependencies among the Web pages. We find that linked hosts tend to belong to the same class: either both are spam or both are non-spam. We demonstrate three methods of incorporating the Web graph topology into the predictions obtained by our base classifier: (i) clustering the host graph, and assigning the label of all hosts in the cluster by majority vote, (ii) propagating the predicted labels to neighboring hosts, and (iii) using the predicted labels of neighboring hosts as new features and retraining the classifier. The result is an accurate system for detecting Web spam, tested on a large and public dataset, using algorithms that can be applied in practice to large-scale Web data.
DOI: 10.1145/1277741.1277814

See at: dl.acm.org Restricted | doi.org | CNR IRIS
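
Method (ii) in the abstract above, propagating predicted labels to neighboring hosts, can be sketched in a simplified form. This is not the paper's propagation scheme: the single averaging step, the `weight` parameter, and the host names are assumptions made only for illustration.

```python
def smooth_predictions(scores, neighbors, weight=0.5):
    """Blend each host's spam score with the mean score of its linked hosts.

    scores:    dict host -> spam probability from a base classifier
    neighbors: dict host -> iterable of linked hosts
    weight:    hypothetical mixing factor for the neighborhood average
    """
    smoothed = {}
    for host, score in scores.items():
        linked = [scores[n] for n in neighbors.get(host, ()) if n in scores]
        if linked:
            neighborhood = sum(linked) / len(linked)
            smoothed[host] = (1 - weight) * score + weight * neighborhood
        else:
            # Hosts without scored neighbors keep their original prediction.
            smoothed[host] = score
    return smoothed


# Hypothetical example: host "c" links to two hosts predicted as spam.
base = {"a": 0.9, "b": 0.8, "c": 0.2}
links = {"c": ["a", "b"]}
result = smooth_predictions(base, links)
```

In this toy example the score of "c" moves towards that of its neighbors, reflecting the observation that linked hosts tend to belong to the same class.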

2007 Conference article Restricted

The impact of caching on search engines
Baeza-Yates R, Gionis A, Junqueira F, Murdock V, Plachouras V, Silvestri F
In this paper we study the trade-offs in designing efficient caching systems for Web search engines. We explore the impact of different approaches, such as static vs. dynamic caching, and caching query results vs. caching posting lists. Using a query log spanning a whole year we explore the limitations of caching and we demonstrate that caching posting lists can achieve higher hit rates than caching query answers. We propose a new algorithm for static caching of posting lists, which outperforms previous methods. We also study the problem of finding the optimal way to split the static cache between answers and posting lists. Finally, we measure how the changes in the query log affect the effectiveness of static caching, given our observation that the distribution of the queries changes slowly over time. Our results and observations are applicable to different levels of the data-access hierarchy, for instance, for a memory/disk layer or a broker/remote server layer.
DOI: 10.1145/1277741.1277775

See at: dl.acm.org Restricted | doi.org | CNR IRIS
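
As a rough illustration of static caching of query results, the sketch below fills a fixed-size cache with the most frequent queries of a past log and measures the hit rate on a later log. It is a plain frequency-based baseline, not the posting-list caching algorithm proposed in the paper; the function names and the toy logs are hypothetical.

```python
from collections import Counter


def build_static_cache(training_queries, capacity):
    """Select the `capacity` most frequent queries of a past log."""
    counts = Counter(training_queries)
    return {query for query, _ in counts.most_common(capacity)}


def hit_rate(cache, test_queries):
    """Fraction of later queries that are answered from the static cache."""
    if not test_queries:
        return 0.0
    hits = sum(1 for query in test_queries if query in cache)
    return hits / len(test_queries)


# Hypothetical logs: the cache is built on one period and tested on the next.
past_log = ["a", "a", "b", "c", "a", "b"]
later_log = ["a", "c", "b", "d"]
cache = build_static_cache(past_log, capacity=2)
rate = hit_rate(cache, later_log)
```

Because the cache content is fixed ahead of time, its effectiveness depends on how slowly the query distribution changes, which is the effect the abstract reports measuring.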

2007 Conference article Restricted

Challenges on distributed Web retrieval
Baeza-Yates R, Castillo C, Junqueira F, Plachouras V, Silvestri F
In the ocean of Web data, Web search engines are the primary way to access content. As the data is on the order of petabytes, current search engines are very large centralized systems based on replicated clusters. Web data, however, is always evolving. The number of Web sites continues to grow rapidly and there are currently more than 20 billion indexed pages. In the near future, centralized systems are likely to become ineffective against such a load, thus suggesting the need for fully distributed search engines. Such engines need to achieve the following goals: high quality answers, fast response time, high query throughput, and scalability. In this paper we survey and organize recent research results, outlining the main challenges of designing a distributed Web retrieval system.
Source: PROCEEDINGS - INTERNATIONAL CONFERENCE ON DATA ENGINEERING, pp. 6-20. Istanbul, Turkey, 15-20 April 2007

See at: CNR IRIS Restricted | CNR IRIS

2009 Contribution to book Restricted

Web search result caching and prefetching
Lempel R, Silvestri F, Donato D
Caching is a well-known concept in systems with multiple tiers of storage. For simplicity, consider a system storing N objects in relatively slow memory, that also has a smaller but faster memory buffer of [...] proposed a two-level caching scheme that combines caching of search results with the caching of frequently accessed postings lists. Prefetching of search engine results was studied from a theoretical point of view by Lempel and Moran in 2002 [6]. They observed that the work involved in query evaluation scales in a sub-linear manner with the number of results computed by the search engine. Then they proceeded to minimize the computations involved in query evaluations by optimizing the number of results computed per query. The optimization is based on a workload function that models both (i) the computations performed by the search engine to produce search results and (ii) the probabilistic manner by which users advance through result pages in a search session.
Source: Encyclopedia of Database Systems, edited by Ling Liu, M. Tamer Özsu, pp. 3501–3506. New York: Springer, 2009

See at: www.springerlink.com Restricted | CNR ExploRA

2007 Patent Metadata Only Access

System and Method for Detecting Spam Hosts Based on Propagating Prediction Labels
Debora Donato, Aristides Gionis, Vanessa Murdock, Fabrizio Silvestri

See at: CNR IRIS Restricted

2007 Patent Metadata Only Access

System and Method for Identifying Spam Hosts Using Stacked Graphical Learning
Debora Donato, Aristides Gionis, Vanessa Murdock, Fabrizio Silvestri

See at: CNR IRIS Restricted

2007 Patent Metadata Only Access

System and Method for Identifying Spam Hosts By Clustering The Host Graph
Debora Donato, Aristides Gionis, Vanessa Murdock, Fabrizio Silvestri

See at: CNR IRIS Restricted

2008 Journal article Restricted

Design trade-offs for search engine caching
Baeza-Yates R, Gionis A, Junqueira F, Murdock V, Plachouras V, Silvestri F
In this article we study the trade-offs in designing efficient caching systems for Web search engines. We explore the impact of different approaches, such as static vs. dynamic caching, and caching query results vs. caching posting lists. Using a query log spanning a whole year, we explore the limitations of caching and we demonstrate that caching posting lists can achieve higher hit rates than caching query answers. We propose a new algorithm for static caching of posting lists, which outperforms previous methods. We also study the problem of finding the optimal way to split the static cache between answers and posting lists. Finally, we measure how the changes in the query log influence the effectiveness of static caching, given our observation that the distribution of the queries changes slowly over time. Our results and observations are applicable to different levels of the data-access hierarchy, for instance, for a memory/disk layer or a broker/remote server layer.
Source: ACM TRANSACTIONS ON THE WEB, vol. 2 (issue 4), pp. 20-28
DOI: 10.1145/1409220.1409223

See at: ACM Transactions on the Web Restricted | CNR IRIS

2012 Conference article Restricted

Making your interests follow you on twitter
Pennacchiotti M, Silvestri F, Vahabi H, Venturini R
In this paper we introduce the task of tweet recommendation, the problem of suggesting tweets that match a user's interests and likes. We propose an Information-Retrieval-like model that leverages the content of the user's tweets and those of her friends, and that effectively retrieves a set of tweets that is personalized and varied in nature. Our approach could be easily leveraged to build, for example, a Twitter or Facebook timeline that collects messages that are of interest for the user, but that are not posted by her friends. We compare to typical approaches used in similar tasks, reporting significant gains in terms of overall precision, up to about +20%, on both a corpus-based evaluation and real world user study.
DOI: 10.1145/2396761.2396786

See at: dl.acm.org Restricted | doi.org | CNR IRIS

2012 Conference article Open Access

Prefetching query results and its impact on search engines
Jonassen S, Cambazoglu BB, Silvestri F
We investigate the impact of query result prefetching on the efficiency and effectiveness of web search engines. We propose offline and online strategies for selecting and ordering queries whose results are to be prefetched. The offline strategies rely on query log analysis and the queries are selected from the queries issued on the previous day. The online strategies select the queries from the result cache, relying on a machine learning model that estimates the arrival times of queries. We carefully evaluate the proposed prefetching techniques via simulation on a query log obtained from Yahoo! web search. We demonstrate that our strategies are able to improve various performance metrics, including the hit rate, query response time, result freshness, and query degradation rate, relative to a state-of-the-art baseline.
DOI: 10.1145/2348283.2348368
Project(s): COAST via OpenAIRE

2013 Conference article Restricted

Towards leveraging closed captions for news retrieval
Blanco R, De Francisci Morales G, Silvestri F
IntoNow from Yahoo! is a second screen application that enhances the way of watching TV programs. The application uses audio from the TV set to recognize the program being watched, and provides several services for different use cases. For instance, while watching a football game on TV it can show statistics about the teams playing, or show the title of the song performed by a contestant in a talent show. The additional content provided by IntoNow is a mix of editorially curated and automatically selected items. From a research perspective, one of the most interesting and challenging use cases addressed by IntoNow is related to news programs (newscasts). When a user is watching a newscast, IntoNow detects it and starts showing online news articles from the Web. This work presents a preliminary study of this problem, i.e., to find an online news article that matches the piece of news discussed in the newscast currently airing on TV, and display it in real-time.

See at: CNR IRIS Restricted | CNR IRIS

2017 Journal article Open Access

Tour recommendation for groups
Anagnostopoulos A, Atassi R, Becchetti L, Fazzone A, Silvestri F
Consider a group of people who are visiting a major touristic city, such as NY, Paris, or Rome. It is reasonable to assume that each member of the group has his or her own interests or preferences about places to visit, which in general may differ from those of other members. Still, people almost always want to hang out together and so the following question naturally arises: What is the best tour that the group could perform together in the city? This problem underpins several challenges, ranging from understanding people's expected attitudes towards potential points of interest, to modeling and providing good and viable solutions. Formulating this problem is challenging because of multiple competing objectives. For example, making the entire group as happy as possible in general conflicts with the objective that no member becomes disappointed. In this paper, we address the algorithmic implications of the above problem, by providing various formulations that take into account the overall group as well as the individual satisfaction and the length of the tour. We then study the computational complexity of these formulations, we provide effective and efficient practical algorithms, and, finally, we evaluate them on datasets constructed from real city data.
Source: DATA MINING AND KNOWLEDGE DISCOVERY, vol. 31 (issue 5), pp. 1157-1188
DOI: 10.1007/s10618-016-0477-7
Project(s): MULTIPLEX via OpenAIRE

2023 Conference article Open Access

Leveraging inter-rater agreement for classification in the presence of noisy labels
Bucarelli MS, Cassano L, Siciliano F, Mantrach A, Silvestri F
In practical settings, classification datasets are obtained through a labelling process that is usually done by humans. Labels can be noisy as they are obtained by aggregating the different individual labels assigned to the same sample by multiple, and possibly disagreeing, annotators. The inter-rater agreement on these datasets can be measured while the underlying noise distribution to which the labels are subject is assumed to be unknown. In this work, we: (i) show how to leverage the inter-annotator statistics to estimate the noise distribution to which labels are subject; (ii) introduce methods that use the estimate of the noise distribution to learn from the noisy dataset; and (iii) establish generalization bounds in the empirical risk minimization framework that depend on the estimated quantities. We conclude the paper by providing experiments that illustrate our findings.
Source: IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, pp. 3439-3448. Vancouver, CANADA, 17-24/06/2023
DOI: 10.1109/cvpr52729.2023.00335
Project(s): SoBigData-PlusPlus via OpenAIRE

See at: CNR IRIS Open Access | ieeexplore.ieee.org | ISTI Repository | CNR IRIS Restricted | CNR IRIS

2007 Journal article Restricted

Dynamic personalization of Web sites without user intervention
Baraglia R, Silvestri F
The Web is an integral part of today's business dealings. Companies and institutions exploit the Web to conduct their business; customers make daily use of the Net to perform all kinds of transactions. In addition, most users browse through pages of personal interest. The Web, as we know, is massive and its data collected from countless sources. Consequently, search tools should be able to accurately extract, filter, and select what is "hidden" from such tools.
Source: COMMUNICATIONS OF THE ACM, vol. 50 (issue 2), pp. 63-67
DOI: 10.1145/1216016.1216022

See at: dl.acm.org Restricted | Communications of the ACM | CNR IRIS

2009 Conference article Restricted

Entry pairing in inverted file
Lam H T, Perego R, Silvestri F, Quan N T M
This paper proposes to exploit content and usage information to rearrange an inverted index for a full-text IR system. The idea is to merge the entries of two frequently co-occurring terms, either in the collection or in the answered queries, to form a single, paired, entry. Since postings common to paired terms are not replicated, the resulting index is more compact. In addition, queries containing terms that have been paired are answered faster since we can exploit the pre-computed posting intersection. In order to choose which terms have to be paired, we formulate the term pairing problem as a Maximum-Weight Matching Graph problem, and we evaluate in our scenario efficiency and efficacy of both an exact and a heuristic solution. We apply our technique: (i) to compact a compressed inverted file built on an actual Web collection of documents, and (ii) to increase capacity of an in-memory posting list. Experiments showed that in the first case our approach can improve the compression ratio by up to 7.7%, while we measured a saving from 12% up to 18% in the size of the posting cache.
DOI: 10.1007/978-3-642-04409-0_50

See at: doi.org Restricted | CNR IRIS | link.springer.com | NARCIS
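
The pairing idea in the abstract above can be sketched as follows. The three-part layout of the paired entry and the function names are assumptions made for illustration; the abstract does not describe the physical layout of a paired entry, and the selection of which terms to pair (the matching problem) is not covered here.

```python
def pair_entries(postings_a, postings_b):
    """Merge two posting lists into one paired entry.

    Returns (only_a, common, only_b); postings shared by both terms
    are stored once in `common` instead of once per term.
    """
    a, b = set(postings_a), set(postings_b)
    return sorted(a - b), sorted(a & b), sorted(b - a)


def postings_of_first(paired):
    """Rebuild the posting list of the first term from a paired entry."""
    only_a, common, _ = paired
    return sorted(only_a + common)


# Hypothetical posting lists of two frequently co-occurring terms.
paired = pair_entries([1, 3, 5, 7], [3, 4, 7])
stored = sum(len(part) for part in paired)  # postings kept in the paired entry
```

In this sketch the `common` part is exactly the pre-computed intersection, so a query containing both terms can read it directly, and the paired entry stores five postings instead of the seven held by the two separate lists.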

2009 Conference article Restricted

Mining query logs
Orlando S, Silvestri F
Web Search Engines (WSEs) have stored in their query logs information about users since they started to operate. This information often serves many purposes. The primary focus of this tutorial is to introduce the discipline of query log mining. We will show its foundations, by giving a unified view on the literature on query log analysis, and also present in detail the basic algorithms and techniques that could be used to extract useful knowledge from this (potentially) infinite source of information. Finally, we will discuss how the extracted knowledge can be exploited to improve different quality features of a WSE system, mainly its effectiveness and efficiency.
DOI: 10.1007/978-3-642-00958-7_94

See at: CNR IRIS Restricted | CNR IRIS | link.springer.com