124 result(s)
2007 Journal article Restricted
Sorting out the document identifier assignment problem
Silvestri F.
The compression of Inverted File indexes in Web Search Engines has received a lot of attention in recent years. Compressing the index not only reduces space occupancy but also improves the overall retrieval performance since it allows a better exploitation of the memory hierarchy. In this paper we empirically show that, for collections of Web documents, we can enhance the performance of compression algorithms by simply assigning identifiers to documents according to the lexicographical ordering of the URLs. We validate this assumption by comparing several assignment techniques and several compression algorithms on a fairly large document collection composed of about six million documents. The results are very encouraging since we can improve the compression ratio by up to 40% using an algorithm that takes about ninety seconds to finish using only 100 MB of main memory.
Source: Lecture notes in computer science 4425 (2007): 101–112. doi:10.1007/978-3-540-71496-5_12
DOI: 10.1007/978-3-540-71496-5_12


See at: doi.org Restricted | www.springerlink.com Restricted | CNR ExploRA
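
The idea summarised in the abstract above can be illustrated with a minimal sketch: documents get identifiers in the lexicographic order of their URLs, so pages from the same site receive nearby identifiers and the gaps between consecutive identifiers in a posting list tend to be small. The function names, and the use of a variable-byte size estimate as the compression measure, are assumptions made for illustration; they are not taken from the paper.

```python
from typing import Dict, List


def assign_ids_by_url(urls: List[str]) -> Dict[str, int]:
    """Assign docIDs 0..n-1 following the lexicographic order of the URLs."""
    return {url: doc_id for doc_id, url in enumerate(sorted(urls))}


def d_gaps(doc_ids: List[int]) -> List[int]:
    """Turn a posting list into its first docID followed by successive gaps."""
    ordered = sorted(doc_ids)
    if not ordered:
        return []
    return [ordered[0]] + [b - a for a, b in zip(ordered, ordered[1:])]


def vbyte_len(n: int) -> int:
    """Bytes needed to store n with 7 payload bits per byte (variable-byte)."""
    length = 1
    while n >= 128:
        n >>= 7
        length += 1
    return length


def posting_list_bytes(doc_ids: List[int]) -> int:
    """Estimated size of a posting list stored as variable-byte d-gaps."""
    return sum(vbyte_len(gap) for gap in d_gaps(doc_ids))


if __name__ == "__main__":
    urls = ["http://b.org/x", "http://a.org/2", "http://a.org/1", "http://c.org/"]
    ids = assign_ids_by_url(urls)
    # A term that occurs on both a.org pages gets two adjacent docIDs.
    print(posting_list_bytes([ids["http://a.org/1"], ids["http://a.org/2"]]))
```

Smaller gaps need fewer bytes under any gap-based code, which is why the ordering of identifiers matters for the compressed size.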


2009 Contribution to journal Restricted
Special Section: Scalable information systems
Lee W., Jianliang X., Jianzhong L., Silvestri F.
As data and knowledge volumes keep increasing, and global means for information dissemination continue to diversify, new methods, modeling paradigms and structures are needed to efficiently support the mounting scalability requirements for the large variety of current and future data, information, and knowledge [1-3]. Grid computing, peer-to-peer technology, data and knowledge bases, distributed information retrieval technology, and networking technology should all converge to address the scalability concern. This special section compiles recent work on addressing scalability issues of distributed and peer-to-peer systems.
DOI: 10.1016/j.future.2008.07.012


See at: Future Generation Computer Systems Restricted | CNR ExploRA


2010 Journal article Closed Access
Mining query logs: turning search usage data into knowledge
Silvestri F.
Web search engines have stored in their logs information about users since they started to operate. This information often serves many purposes. The primary focus of this survey is on introducing the discipline of query mining by showing its foundations and by analyzing the basic algorithms and techniques that are used to extract useful knowledge from this (potentially) infinite source of information. We show how search applications may benefit from this kind of analysis by analyzing popular applications of query log mining and their influence on user experience. We conclude the paper by briefly presenting some of the most challenging current open problems in this field.
Source: Foundations and trends in information retrieval 4 (2010): 1–174. doi:10.1561/1500000013
DOI: 10.1561/1500000013


See at: Foundations and Trends® in Information Retrieval Restricted | www.nowpublishers.com Restricted | CNR ExploRA


2010 Journal article Restricted
A study on the effect of application and resource characteristics on the QoS in service provisioning environments
Varvarigou T., Tserpes K., Kyriazis D., Silvestri F., Psimogiannos N.
This article deals with the problem of quality provisioning in business service-oriented environments, examining the resource selection process as an initial matching of the provided to the demanded QoS. It investigates how the application and resource characteristics affect the provided level of QoS, a relationship that intuitively exists but has not yet been mapped. To do so, it focuses on identifying the application and resource parameters that affect the customer-defined QoS parameters. The article centres upon modeling a data mining application and simple PC nodes in order to study how they affect response times. It then proves the existence of these specific relations and maps them using simple artificial neural networks so as to be able to wrap them in a single mechanism for resource selection based on customer QoS requirements and real-time provider QoS capabilities.
Source: International journal of distributed systems and technologies (Print) 1 (2010): 55–75. doi:10.4018/jdst.2010090804
DOI: 10.4018/jdst.2010090804


See at: International Journal of Distributed Systems and Technologies Restricted | www.igi-global.com Restricted | CNR ExploRA


2007 Conference article Restricted
Know your neighbors: Web spam detection using the web topology
Castillo C., Donato D., Gionis A., Murdock V., Silvestri F.
Web spam can significantly deteriorate the quality of search engine results. Thus there is a large incentive for commercial search engines to detect spam pages efficiently and accurately. In this paper we present a spam detection system that combines link-based and content-based features, and uses the topology of the Web graph by exploiting the link dependencies among the Web pages. We find that linked hosts tend to belong to the same class: either both are spam or both are non-spam. We demonstrate three methods of incorporating the Web graph topology into the predictions obtained by our base classifier: (i) clustering the host graph, and assigning the label of all hosts in the cluster by majority vote, (ii) propagating the predicted labels to neighboring hosts, and (iii) using the predicted labels of neighboring hosts as new features and retraining the classifier. The result is an accurate system for detecting Web spam, tested on a large and public dataset, using algorithms that can be applied in practice to large-scale Web data.
Source: 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 423–430, Amsterdam, Netherlands, 23-27 July 2007
DOI: 10.1145/1277741.1277814


See at: dl.acm.org Restricted | doi.org Restricted | CNR ExploRA
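
Method (i) in the abstract above, assigning one label to every host of a cluster by majority vote, can be sketched as follows. This is a simplified illustration under stated assumptions: the cluster assignment is taken as given (the clustering step itself is not shown), labels are plain strings, and `smooth_by_cluster` is a hypothetical name rather than anything from the paper.

```python
from collections import Counter, defaultdict
from typing import Dict


def smooth_by_cluster(predictions: Dict[str, str],
                      clusters: Dict[str, int]) -> Dict[str, str]:
    """Relabel every host with the majority base-classifier label of its cluster.

    predictions: host -> label predicted by the base classifier
    clusters:    host -> cluster identifier (assumed to be computed elsewhere)
    """
    members = defaultdict(list)
    for host, cluster_id in clusters.items():
        members[cluster_id].append(host)

    smoothed = dict(predictions)
    for hosts in members.values():
        votes = Counter(predictions[h] for h in hosts if h in predictions)
        if not votes:
            continue  # no base prediction in this cluster, nothing to vote on
        # On a tie, the label counted first wins (Counter keeps insertion order).
        majority_label, _ = votes.most_common(1)[0]
        for host in hosts:
            # Hosts without a base prediction also receive the cluster label.
            smoothed[host] = majority_label
    return smoothed


if __name__ == "__main__":
    base = {"a.com": "spam", "b.com": "spam", "c.com": "non-spam", "d.com": "non-spam"}
    groups = {"a.com": 0, "b.com": 0, "c.com": 0, "d.com": 1}
    print(smooth_by_cluster(base, groups))
```

The sketch relies on the observation stated in the abstract that linked hosts tend to share a class, so an outlier prediction inside a cluster is likely to be a classifier error.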


2007 Conference article Restricted
The impact of caching on search engines
Baeza-Yates R., Gionis A., Junqueira F., Murdock V., Plachouras V., Silvestri F.
In this paper we study the trade-offs in designing efficient caching systems for Web search engines. We explore the impact of different approaches, such as static vs. dynamic caching, and caching query results vs. caching posting lists. Using a query log spanning a whole year we explore the limitations of caching and we demonstrate that caching posting lists can achieve higher hit rates than caching query answers. We propose a new algorithm for static caching of posting lists, which outperforms previous methods. We also study the problem of finding the optimal way to split the static cache between answers and posting lists. Finally, we measure how the changes in the query log affect the effectiveness of static caching, given our observation that the distribution of the queries changes slowly over time. Our results and observations are applicable to different levels of the data-access hierarchy, for instance, for a memory/disk layer or a broker/remote server layer.
Source: 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 183–190, Amsterdam, Netherlands, 23-27 July 2007
DOI: 10.1145/1277741.1277775


See at: dl.acm.org Restricted | doi.org Restricted | CNR ExploRA


2007 Conference article Unknown
Challenges on distributed Web retrieval
Baeza-Yates R., Castillo C., Junqueira F., Plachouras V., Silvestri F.
In the ocean of Web data, Web search engines are the primary way to access content. As the data is on the order of petabytes, current search engines are very large centralized systems based on replicated clusters. Web data, however, is always evolving. The number of Web sites continues to grow rapidly and there are currently more than 20 billion indexed pages. In the near future, centralized systems are likely to become ineffective against such a load, thus suggesting the need for fully distributed search engines. Such engines need to achieve the following goals: high quality answers, fast response time, high query throughput, and scalability. In this paper we survey and organize recent research results, outlining the main challenges of designing a distributed Web retrieval system.
Source: IEEE 23rd International Conference on Data Engineering. ICDE 2007, pp. 6–20, Istanbul, Turkey, 15-20 April 2007

See at: CNR ExploRA


2009 Contribution to book Restricted
Web search result caching and prefetching
Lempel R., Silvestri F., Donato D.
Caching is a well-known concept in systems with multiple tiers of storage. For simplicity, consider a system storing N objects in relatively slow memory, that also has a smaller but faster memory buffer of [...] proposed a two-level caching scheme that combines caching of search results with the caching of frequently accessed postings lists. Prefetching of search engine results was studied from a theoretical point of view by Lempel and Moran in 2002 [6]. They observed that the work involved in query evaluation scales in a sub-linear manner with the number of results computed by the search engine. Then they proceeded to minimize the computations involved in query evaluations by optimizing the number of results computed per query. The optimization is based on a workload function that models both (i) the computations performed by the search engine to produce search results and (ii) the probabilistic manner by which users advance through result pages in a search session.
Source: Encyclopedia of Database Systems, edited by Ling Liu, M. Tamer Özsu, pp. 3501–3506. New York: Springer, 2009

See at: www.springerlink.com Restricted | CNR ExploRA


2007 Patent Unknown
System and Method for Detecting Spam Hosts Based on Propagating Prediction Labels
Debora Donato, Aristides Gionis, Vanessa Murdock, Fabrizio Silvestri
Source: 085804-0810000

See at: CNR ExploRA


2007 Patent Unknown
System and Method for Identifying Spam Hosts Using Stacked Graphical Learning
Debora Donato, Aristides Gionis, Vanessa Murdock, Fabrizio Silvestri
Source: 085804-080000

See at: CNR ExploRA


2007 Patent Unknown
System and Method for Identifying Spam Hosts By Clustering The Host Graph
Debora Donato, Aristides Gionis, Vanessa Murdock, Fabrizio Silvestri
Source: 085804-079000

See at: CNR ExploRA


2008 Journal article Restricted
Design trade-offs for search engine caching
Baeza-Yates R., Gionis A., Junqueira F., Murdock V., Plachouras V., Silvestri F.
In this article we study the trade-offs in designing efficient caching systems for Web search engines. We explore the impact of different approaches, such as static vs. dynamic caching, and caching query results vs. caching posting lists. Using a query log spanning a whole year, we explore the limitations of caching and we demonstrate that caching posting lists can achieve higher hit rates than caching query answers. We propose a new algorithm for static caching of posting lists, which outperforms previous methods. We also study the problem of finding the optimal way to split the static cache between answers and posting lists. Finally, we measure how the changes in the query log influence the effectiveness of static caching, given our observation that the distribution of the queries changes slowly over time. Our results and observations are applicable to different levels of the data-access hierarchy, for instance, for a memory/disk layer or a broker/remote server layer.
Source: ACM transactions on the web 2 (2008): 20–28. doi:10.1145/1409220.1409223
DOI: 10.1145/1409220.1409223


See at: ACM Transactions on the Web Restricted | CNR ExploRA


2012 Conference article Open Access
Prefetching query results and its impact on search engines
Jonassen S., Cambazoglu B. B., Silvestri F.
We investigate the impact of query result prefetching on the efficiency and effectiveness of web search engines. We propose offline and online strategies for selecting and ordering queries whose results are to be prefetched. The offline strategies rely on query log analysis and the queries are selected from the queries issued on the previous day. The online strategies select the queries from the result cache, relying on a machine learning model that estimates the arrival times of queries. We carefully evaluate the proposed prefetching techniques via simulation on a query log obtained from Yahoo! web search. We demonstrate that our strategies are able to improve various performance metrics, including the hit rate, query response time, result freshness, and query degradation rate, relative to a state-of-the-art baseline.
Source: 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 631–640, Portland, OR, USA, 12-16 August 2012
DOI: 10.1145/2348283.2348368
Project(s): COAST via OpenAIRE


See at: Norwegian Open Research Archives Open Access | ntnuopen.ntnu.no Open Access | dl.acm.org Restricted | doi.org Restricted | CNR ExploRA


2013 Contribution to conference Unknown
Towards leveraging closed captions for news retrieval
Blanco R., De Francisci Morales G., Silvestri F.
IntoNow from Yahoo! is a second screen application that enhances the way of watching TV programs. The application uses audio from the TV set to recognize the program being watched, and provides several services for different use cases. For instance, while watching a football game on TV it can show statistics about the teams playing, or show the title of the song performed by a contestant in a talent show. The additional content provided by IntoNow is a mix of editorially curated and automatically selected content. From a research perspective, one of the most interesting and challenging use cases addressed by IntoNow is related to news programs (newscasts). When a user is watching a newscast, IntoNow detects it and starts showing online news articles from the Web. This work presents a preliminary study of this problem, i.e., to find an online news article that matches the piece of news discussed in the newscast currently airing on TV, and display it in real time.
Source: 22nd international conference on World Wide Web Companion, pp. 135–136, Rio de Janeiro, Brazil, 13-17 May 2013

See at: CNR ExploRA


2017 Journal article Open Access
Tour recommendation for groups
Anagnostopoulos A., Atassi R., Becchetti L., Fazzone A., Silvestri F.
Consider a group of people who are visiting a major touristic city, such as NY, Paris, or Rome. It is reasonable to assume that each member of the group has his or her own interests or preferences about places to visit, which in general may differ from those of other members. Still, people almost always want to hang out together and so the following question naturally arises: What is the best tour that the group could perform together in the city? This problem underpins several challenges, ranging from understanding people's expected attitudes towards potential points of interest, to modeling and providing good and viable solutions. Formulating this problem is challenging because of multiple competing objectives. For example, making the entire group as happy as possible in general conflicts with the objective that no member becomes disappointed. In this paper, we address the algorithmic implications of the above problem, by providing various formulations that take into account the overall group as well as the individual satisfaction and the length of the tour. We then study the computational complexity of these formulations, we provide effective and efficient practical algorithms, and, finally, we evaluate them on datasets constructed from real city data.
Source: Data mining and knowledge discovery 31 (2017): 1157–1188. doi:10.1007/s10618-016-0477-7
DOI: 10.1007/s10618-016-0477-7
Project(s): MULTIPLEX via OpenAIRE


See at: Archivio della ricerca- Università di Roma La Sapienza Open Access | ISTI Repository Open Access | Data Mining and Knowledge Discovery Restricted | link.springer.com Restricted | CNR ExploRA


2023 Conference article Open Access
Leveraging inter-rater agreement for classification in the presence of noisy labels
Bucarelli M. S., Cassano L., Siciliano F., Mantrach A., Silvestri F.
In practical settings, classification datasets are obtained through a labelling process that is usually done by humans. Labels can be noisy as they are obtained by aggregating the different individual labels assigned to the same sample by multiple, and possibly disagreeing, annotators. The inter-rater agreement on these datasets can be measured while the underlying noise distribution to which the labels are subject is assumed to be unknown. In this work, we: (i) show how to leverage the inter-annotator statistics to estimate the noise distribution to which labels are subject; (ii) introduce methods that use the estimate of the noise distribution to learn from the noisy dataset; and (iii) establish generalization bounds in the empirical risk minimization framework that depend on the estimated quantities. We conclude the paper by providing experiments that illustrate our findings.
Source: CVPR - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3439–3448, Vancouver, Canada, 17-24 June 2023
DOI: 10.1109/cvpr52729.2023.00335
Project(s): SoBigData-PlusPlus via OpenAIRE


See at: ISTI Repository Open Access | ieeexplore.ieee.org Restricted | CNR ExploRA


2007 Journal article Restricted
Dynamic personalization of Web sites without user intervention
Baraglia R., Silvestri F.
The Web is an integral part of today's business dealings. Companies and institutions exploit the Web to conduct their business; customers make daily use of the Net to perform all kinds of transactions. In addition, most users browse through pages of personal interest. The Web, as we know, is massive and its data collected from countless sources. Consequently, search tools should be able to accurately extract, filter, and select what is "hidden" from such tools.
Source: Communications of the ACM 50 (2007): 63–67. doi:10.1145/1216016.1216022
DOI: 10.1145/1216016.1216022


See at: dl.acm.org Restricted | Communications of the ACM Restricted | CNR ExploRA


2009 Conference article Restricted
Entry pairing in inverted file
Lam H. T., Perego R., Silvestri F., Quan N. T. M.
This paper proposes to exploit content and usage information to rearrange an inverted index for a full-text IR system. The idea is to merge the entries of two frequently co-occurring terms, either in the collection or in the answered queries, to form a single, paired, entry. Since postings common to paired terms are not replicated, the resulting index is more compact. In addition, queries containing terms that have been paired are answered faster since we can exploit the pre-computed posting intersection. In order to choose which terms have to be paired, we formulate the term pairing problem as a Maximum-Weight Matching Graph problem, and we evaluate in our scenario the efficiency and efficacy of both an exact and a heuristic solution. We apply our technique: (i) to compact a compressed inverted file built on an actual Web collection of documents, and (ii) to increase the capacity of an in-memory posting list. Experiments showed that in the first case our approach can improve the compression ratio by up to 7.7%, while we measured a saving from 12% up to 18% in the size of the posting cache.
Source: WISE 2009 - Web Information Systems Engineering. 10th International Conference, pp. 511–522, Poznan, Poland, October 5-7 2009
DOI: 10.1007/978-3-642-04409-0_50


See at: doi.org Restricted | link.springer.com Restricted | NARCIS Restricted | CNR ExploRA
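
A paired entry of the kind described in the abstract above can be sketched as three posting lists: the postings shared by both terms, stored once, plus the postings exclusive to each term. The layout and the name `pair_entries` are assumptions for illustration only; the selection of which terms to pair (the maximum-weight matching step) is not shown.

```python
from typing import List, Tuple


def pair_entries(postings_a: List[int],
                 postings_b: List[int]) -> Tuple[List[int], List[int], List[int]]:
    """Merge two posting lists into one paired entry.

    Returns (common, only_a, only_b): docIDs containing both terms are stored
    once in `common` instead of being replicated in both lists.
    """
    set_a, set_b = set(postings_a), set(postings_b)
    common = sorted(set_a & set_b)
    only_a = sorted(set_a - set_b)
    only_b = sorted(set_b - set_a)
    return common, only_a, only_b


if __name__ == "__main__":
    a = [1, 4, 7, 9, 12]
    b = [4, 5, 9, 12, 20]
    common, only_a, only_b = pair_entries(a, b)
    # The conjunctive query "a AND b" is answered directly from `common`.
    print(common)
    # Postings stored: separate lists versus the paired entry.
    print(len(a) + len(b), len(common) + len(only_a) + len(only_b))
```

The full list of either term is still recoverable by merging `common` with the corresponding exclusive list, so pairing loses no information.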


2009 Conference article Restricted
Mining query logs
Orlando S., Silvestri F.
Web Search Engines (WSEs) have stored in their query logs information about users since they started to operate. This information often serves many purposes. The primary focus of this tutorial is to introduce the discipline of query log mining. We will show its foundations, by giving a unified view on the literature on query log analysis, and also present in detail the basic algorithms and techniques that could be used to extract useful knowledge from this (potentially) infinite source of information. Finally, we will discuss how the extracted knowledge can be exploited to improve different quality features of a WSE system, mainly its effectiveness and efficiency.
Source: Advances in Information Retrieval. 31st European Conference on IR Research - ECIR 2009, pp. 814–817, Toulouse, France, 6-9 April 2009
DOI: 10.1007/978-3-642-00958-7_94


See at: link.springer.com Restricted | CNR ExploRA


2004 Conference article Restricted
An online recommender system for large Web sites
Baraglia R., Silvestri F.
In this paper we propose a WUM recommender system, called SUGGEST 3.0, that dynamically generates links to pages that have not yet been visited by a user and might be of potential interest to them. Unlike the recommender systems proposed so far, SUGGEST 3.0 does not make use of any off-line component, and is able to manage Web sites made up of dynamically generated pages. To this end, SUGGEST 3.0 incrementally builds and maintains historical information by means of an incremental graph partitioning algorithm, requiring no off-line component. The main innovation proposed here is a novel strategy that can be used to manage large Web sites. Experiments, conducted in order to evaluate SUGGEST 3.0 performance, demonstrated that our system is able to anticipate users' requests that will be made farther in the future, introducing a limited overhead on the Web server activity.
Source: IEEE/WIC/ACM International Conference on WEB Intelligence - WI'2004, pp. 199–205, Beijing, China, 20-24 September 2004
DOI: 10.1109/wi.2004.10158


See at: ieeexplore.ieee.org Restricted | xplorestaging.ieee.org Restricted | CNR ExploRA