2009
Book
Restricted
Special Section: Scalable information systems
Lee W, Jianliang X, Jianzhong L, Silvestri F
As data and knowledge volumes keep increasing, and global means for information dissemination continue to diversify, new methods, modeling paradigms and structures are needed to efficiently support the mounting scalability requirements for the large variety of current and future data, information, and knowledge [1-3]. Grid computing, peer-to-peer technology, data and knowledge bases, distributed information retrieval technology, and networking technology should all converge to address the scalability concern. This special section compiles recent work on addressing scalability issues of distributed and peer-to-peer systems.
Source: FUTURE GENERATION COMPUTER SYSTEMS, pp. 51-52
DOI: 10.1016/j.future.2008.07.012
See at:
Future Generation Computer Systems
| CNR IRIS
2010
Journal article
Restricted
Mining query logs: turning search usage data into knowledge
Silvestri F
Web search engines have stored information about their users in their logs since they started to operate. This information serves many purposes. The primary focus of this survey is on introducing the discipline of query mining by showing its foundations and by analyzing the basic algorithms and techniques that are used to extract useful knowledge from this (potentially) infinite source of information. We show how search applications may benefit from this kind of analysis by examining popular applications of query log mining and their influence on user experience. We conclude the paper by briefly presenting some of the most challenging open problems in this field.
Source: FOUNDATIONS AND TRENDS IN INFORMATION RETRIEVAL, vol. 4 (issue 1-2), pp. 1-174
DOI: 10.1561/1500000013
See at:
Foundations and Trends® in Information Retrieval
| CNR IRIS
| www.nowpublishers.com
2010
Journal article
Restricted
A study on the effect of application and resource characteristics on the QOS in service provisioning environments
Varvarigou T, Tserpes K, Kyriazis D, Silvestri F, Psimogiannos N
This article deals with the problem of quality provisioning in business service-oriented environments, examining the resource selection process as an initial matching of the provided to the demanded QoS. It investigates how the application and resource characteristics affect the provided level of QoS, a relationship that intuitively exists but has not yet been mapped. To do so, it focuses on identifying the application and resource parameters that affect the customer-defined QoS parameters. The article centres on modeling a data mining application and simple PC nodes in order to study how they affect response times. It then demonstrates that these specific relations exist and maps them using simple artificial neural networks, so that they can be wrapped in a single mechanism for resource selection based on customer QoS requirements and real-time provider QoS capabilities.
Source: INTERNATIONAL JOURNAL OF DISTRIBUTED SYSTEMS AND TECHNOLOGIES, vol. 1 (issue 1), pp. 55-75
DOI: 10.4018/jdst.2010090804
See at:
International Journal of Distributed Systems and Technologies
| CNR IRIS
| www.igi-global.com
2007
Conference article
Restricted
Know your neighbors: Web spam detection using the web topology
Castillo C, Donato D, Gionis A, Murdock V, Silvestri F
Web spam can significantly deteriorate the quality of search engine results. Thus there is a large incentive for commercial search engines to detect spam pages efficiently and accurately. In this paper we present a spam detection system that combines link-based and content-based features, and uses the topology of the Web graph by exploiting the link dependencies among the Web pages. We find that linked hosts tend to belong to the same class: either both are spam or both are non-spam. We demonstrate three methods of incorporating the Web graph topology into the predictions obtained by our base classifier: (i) clustering the host graph, and assigning the label of all hosts in the cluster by majority vote, (ii) propagating the predicted labels to neighboring hosts, and (iii) using the predicted labels of neighboring hosts as new features and retraining the classifier. The result is an accurate system for detecting Web spam, tested on a large and public dataset, using algorithms that can be applied in practice to large-scale Web data.
DOI: 10.1145/1277741.1277814
See at:
dl.acm.org
| doi.org
| CNR IRIS
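Methods (i) and (ii) from the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the dictionary-based inputs, the single propagation step, and the blending weight `alpha` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def smooth_by_cluster(predictions, cluster_of):
    """Method (i), sketched: relabel each host with the majority label of its cluster.

    predictions: host -> label predicted by some base classifier (assumed given).
    cluster_of:  host -> cluster id from some clustering of the host graph (assumed given).
    """
    votes = defaultdict(Counter)
    for host, label in predictions.items():
        votes[cluster_of[host]][label] += 1
    majority = {c: counts.most_common(1)[0][0] for c, counts in votes.items()}
    return {host: majority[cluster_of[host]] for host in predictions}


def propagate_to_neighbors(spam_scores, neighbors, alpha=0.5):
    """Method (ii), sketched: one step blending a host's score with its neighbors' mean.

    alpha is a hypothetical mixing weight, not a value taken from the paper.
    """
    smoothed = {}
    for host, score in spam_scores.items():
        nbrs = neighbors.get(host, [])
        if nbrs:
            mean = sum(spam_scores[n] for n in nbrs) / len(nbrs)
            smoothed[host] = (1 - alpha) * score + alpha * mean
        else:
            smoothed[host] = score  # isolated host: nothing to propagate
    return smoothed
```

Both functions only post-process the output of a base classifier, which is why they can be layered on top of any content-based or link-based model.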
2007
Conference article
Restricted
The impact of caching on search engines
Baeza-Yates R, Gionis A, Junqueira F, Murdock V, Plachouras V, Silvestri F
In this paper we study the trade-offs in designing efficient caching systems for Web search engines. We explore the impact of different approaches, such as static vs. dynamic caching, and caching query results vs. caching posting lists. Using a query log spanning a whole year we explore the limitations of caching and we demonstrate that caching posting lists can achieve higher hit rates than caching query answers. We propose a new algorithm for static caching of posting lists, which outperforms previous methods. We also study the problem of finding the optimal way to split the static cache between answers and posting lists. Finally, we measure how the changes in the query log affect the effectiveness of static caching, given our observation that the distribution of the queries changes slowly over time. Our results and observations are applicable to different levels of the data-access hierarchy, for instance, for a memory/disk layer or a broker/remote server layer.
DOI: 10.1145/1277741.1277775
See at:
dl.acm.org
| doi.org
| CNR IRIS
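The static vs. dynamic distinction in the abstract above can be made concrete with a small simulation. This is a sketch under stated assumptions, not the algorithm proposed in the paper: the static part is filled with the most frequent queries of a training log, the dynamic part is a plain LRU cache, and the function name `hit_rate` and its parameters are hypothetical.

```python
from collections import Counter, OrderedDict


def hit_rate(train_log, test_log, static_size, dynamic_size):
    """Replay test_log against a static cache plus an LRU dynamic cache."""
    # Static cache: the most frequent training queries, never evicted.
    static = {q for q, _ in Counter(train_log).most_common(static_size)}
    lru = OrderedDict()  # dynamic cache, least recently used entry first
    hits = 0
    for q in test_log:
        if q in static:
            hits += 1
        elif q in lru:
            hits += 1
            lru.move_to_end(q)  # mark as most recently used
        elif dynamic_size > 0:
            lru[q] = True
            if len(lru) > dynamic_size:
                lru.popitem(last=False)  # evict the least recently used query
    return hits / len(test_log) if test_log else 0.0
```

Varying `static_size` and `dynamic_size` under a fixed total budget gives a rough feel for the split question the abstract raises, here for query results only.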
2007
Conference article
Restricted
Challenges on distributed Web retrieval
Baeza-Yates R, Castillo C, Junqueira F, Plachouras V, Silvestri F
In the ocean of Web data, Web search engines are the primary way to access content. As the data is on the order of petabytes, current search engines are very large centralized systems based on replicated clusters. Web data, however, is always evolving. The number of Web sites continues to grow rapidly and there are currently more than 20 billion indexed pages. In the near future, centralized systems are likely to become ineffective against such a load, thus suggesting the need for fully distributed search engines. Such engines need to achieve the following goals: high quality answers, fast response time, high query throughput, and scalability. In this paper we survey and organize recent research results, outlining the main challenges of designing a distributed Web retrieval system.
Source: PROCEEDINGS - INTERNATIONAL CONFERENCE ON DATA ENGINEERING, pp. 6-20. Istanbul, Turkey, 15-20 April 2007
See at:
CNR IRIS
2009
Contribution to book
Restricted
Web search result caching and prefetching
Lempel R, Silvestri F, Donato D
Caching is a well-known concept in systems with multiple tiers of storage. For simplicity, consider a system storing N objects in relatively slow memory, that also has a smaller but faster memory buffer of [...] proposed a two-level caching scheme that combines caching of search results with the caching of frequently accessed postings lists. Prefetching of search engine results was studied from a theoretical point of view by Lempel and Moran in 2002 [6]. They observed that the work involved in query evaluation scales in a sub-linear manner with the number of results computed by the search engine. Then they proceeded to minimize the computations involved in query evaluations by optimizing the number of results computed per query. The optimization is based on a workload function that models both (i) the computations performed by the search engine to produce search results and (ii) the probabilistic manner by which users advance through result pages in a search session.
Source: Encyclopedia of Database Systems, edited by Ling Liu, M. Tamer Özsu, pp. 3501–3506. New York: Springer, 2009
See at:
www.springerlink.com
| CNR ExploRA
2007
Patent
Metadata Only Access
See at:
CNR IRIS
2007
Patent
Metadata Only Access
See at:
CNR IRIS
2007
Patent
Metadata Only Access
See at:
CNR IRIS
2008
Journal article
Restricted
Design trade offs for search engine caching
Baeza-Yates R, Gionis A, Junqueira F, Murdock V, Plachouras V, Silvestri F
In this article we study the trade-offs in designing efficient caching systems for Web search engines. We explore the impact of different approaches, such as static vs. dynamic caching, and caching query results vs. caching posting lists. Using a query log spanning a whole year, we explore the limitations of caching and we demonstrate that caching posting lists can achieve higher hit rates than caching query answers. We propose a new algorithm for static caching of posting lists, which outperforms previous methods. We also study the problem of finding the optimal way to split the static cache between answers and posting lists. Finally, we measure how the changes in the query log influence the effectiveness of static caching, given our observation that the distribution of the queries changes slowly over time. Our results and observations are applicable to different levels of the data-access hierarchy, for instance, for a memory/disk layer or a broker/remote server layer.
Source: ACM TRANSACTIONS ON THE WEB, vol. 2 (issue 4), pp. 20-28
DOI: 10.1145/1409220.1409223
See at:
ACM Transactions on the Web
| CNR IRIS
2012
Conference article
Restricted
Making your interests follow you on twitter
Pennacchiotti M, Silvestri F, Vahabi H, Venturini R
In this paper we introduce the task of tweet recommendation, the problem of suggesting tweets that match a user's interests and likes. We propose an Information-Retrieval-like model that leverages the content of the user's tweets and those of her friends, and that effectively retrieves a set of tweets that is personalized and varied in nature. Our approach could be easily leveraged to build, for example, a Twitter or Facebook timeline that collects messages that are of interest to the user, but that are not posted by her friends. We compare to typical approaches used in similar tasks, reporting significant gains in terms of overall precision, up to about +20%, on both a corpus-based evaluation and a real-world user study.
DOI: 10.1145/2396761.2396786
See at:
dl.acm.org
| doi.org
| CNR IRIS
2013
Conference article
Restricted
Towards leveraging closed captions for news retrieval
Blanco R, De Francisci Morales G, Silvestri F
IntoNow from Yahoo! is a second screen application that enhances the way of watching TV programs. The application uses audio from the TV set to recognize the program being watched, and provides several services for different use cases. For instance, while watching a football game on TV it can show statistics about the teams playing, or show the title of the song performed by a contestant in a talent show. The additional content provided by IntoNow is a mix of editorially curated and automatically selected content. From a research perspective, one of the most interesting and challenging use cases addressed by IntoNow is related to news programs (newscasts). When a user is watching a newscast, IntoNow detects it and starts showing online news articles from the Web. This work presents a preliminary study of this problem, i.e., finding an online news article that matches the piece of news discussed in the newscast currently airing on TV, and displaying it in real time.
See at:
CNR IRIS
2017
Journal article
Open Access
Tour recommendation for groups
Anagnostopoulos A, Atassi R, Becchetti L, Fazzone A, Silvestri F
Consider a group of people who are visiting a major touristic city, such as NY, Paris, or Rome. It is reasonable to assume that each member of the group has his or her own interests or preferences about places to visit, which in general may differ from those of other members. Still, people almost always want to hang out together and so the following question naturally arises: What is the best tour that the group could perform together in the city? This problem underpins several challenges, ranging from understanding people's expected attitudes towards potential points of interest, to modeling and providing good and viable solutions. Formulating this problem is challenging because of multiple competing objectives. For example, making the entire group as happy as possible in general conflicts with the objective that no member becomes disappointed. In this paper, we address the algorithmic implications of the above problem, by providing various formulations that take into account the overall group as well as the individual satisfaction and the length of the tour. We then study the computational complexity of these formulations, we provide effective and efficient practical algorithms, and, finally, we evaluate them on datasets constructed from real city data.
Source: DATA MINING AND KNOWLEDGE DISCOVERY, vol. 31 (issue 5), pp. 1157-1188
DOI: 10.1007/s10618-016-0477-7
Project(s): MULTIPLEX
See at:
Archivio della ricerca- Università di Roma La Sapienza
| CNR IRIS
| link.springer.com
| ISTI Repository
| Data Mining and Knowledge Discovery
2023
Conference article
Open Access
Leveraging inter-rater agreement for classification in the presence of noisy labels
Bucarelli MS, Cassano L, Siciliano F, Mantrach A, Silvestri F
In practical settings, classification datasets are obtained through a labelling process that is usually done by humans. Labels can be noisy as they are obtained by aggregating the different individual labels assigned to the same sample by multiple, and possibly disagreeing, annotators. The inter-rater agreement on these datasets can be measured while the underlying noise distribution to which the labels are subject is assumed to be unknown. In this work, we: (i) show how to leverage the inter-annotator statistics to estimate the noise distribution to which labels are subject; (ii) introduce methods that use the estimate of the noise distribution to learn from the noisy dataset; and (iii) establish generalization bounds in the empirical risk minimization framework that depend on the estimated quantities. We conclude the paper by providing experiments that illustrate our findings.
Source: IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, pp. 3439-3448. Vancouver, Canada, 17-24 June 2023
DOI: 10.1109/cvpr52729.2023.00335
Project(s): SoBigData-PlusPlus
See at:
CNR IRIS
| ieeexplore.ieee.org
| ISTI Repository
2007
Journal article
Restricted
Dynamic personalization of Web sites without user intervention
Baraglia R, Silvestri F
The Web is an integral part of today's business dealings. Companies and institutions exploit the Web to conduct their business; customers make daily use of the Net to perform all kinds of transactions. In addition, most users browse through pages of personal interest. The Web, as we know, is massive and its data is collected from countless sources. Consequently, search tools should be able to accurately extract, filter, and select what is "hidden" from such tools.
Source: COMMUNICATIONS OF THE ACM, vol. 50 (issue 2), pp. 63-67
DOI: 10.1145/1216016.1216022
See at:
dl.acm.org
| Communications of the ACM
| CNR IRIS
2009
Conference article
Restricted
Entry pairing in inverted file
Lam H T, Perego R, Silvestri F, Quan N T M
This paper proposes to exploit content and usage information to rearrange an inverted index for a full-text IR system. The idea is to merge the entries of two frequently co-occurring terms, either in the collection or in the answered queries, to form a single, paired, entry. Since postings common to paired terms are not replicated, the resulting index is more compact. In addition, queries containing terms that have been paired are answered faster since we can exploit the pre-computed posting intersection. In order to choose which terms have to be paired, we formulate the term pairing problem as a Maximum-Weight Matching Graph problem, and we evaluate in our scenario the efficiency and efficacy of both an exact and a heuristic solution. We apply our technique: (i) to compact a compressed inverted file built on an actual Web collection of documents, and (ii) to increase the capacity of an in-memory posting list cache. Experiments showed that in the first case our approach can improve the compression ratio by up to 7.7%, while we measured a saving from 12% up to 18% in the size of the posting cache.
DOI: 10.1007/978-3-642-04409-0_50
See at:
doi.org
| CNR IRIS
| link.springer.com
| NARCIS
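The entry-pairing idea in the abstract above can be sketched in a few lines. Everything below is illustrative and assumed rather than taken from the paper: the three-part layout of a paired entry, the use of the number of shared postings as the edge weight, and the greedy selection, which is one simple heuristic for maximum-weight matching and not necessarily the heuristic the authors evaluated.

```python
from itertools import combinations


def pair_entry(postings_a, postings_b):
    """Merge two posting lists into one paired entry; shared postings are stored once."""
    a, b = set(postings_a), set(postings_b)
    return {
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
        "both": sorted(a & b),  # pre-computed intersection for queries with both terms
    }


def greedy_pairing(index):
    """Pick disjoint term pairs in decreasing order of shared postings (greedy matching)."""
    candidates = []
    for t, u in combinations(sorted(index), 2):
        shared = len(set(index[t]) & set(index[u]))
        if shared > 0:
            candidates.append((shared, t, u))
    # Heaviest pairs first; ties broken alphabetically to keep the result deterministic.
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))
    used, pairs = set(), []
    for _, t, u in candidates:
        if t not in used and u not in used:
            pairs.append((t, u))
            used.update((t, u))
    return pairs
```

In a toy index where "new" and "york" share three of their four postings each, the paired entry stores five postings instead of eight, which is the source of the space saving the abstract describes.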