Page 1 of 7

2009 Journal article Restricted

Embedding and retrieving private metadata in electrocardiograms
Lucchese C, Kozat S, Vlachos M, Yu P S, Herle H
Due to the recent explosion of 'identity theft' cases, the safeguarding of private data has been the focus of many scientific efforts. Medical data contain a number of sensitive attributes, whose access the rightful owner would ideally like to disclose only to authorized personnel. One way of providing limited access to sensitive data is through means of encryption. In this work we follow a different path, by proposing the fusion of the sensitive metadata within the medical data. Our work is focused on medical time-series signals and in particular on Electrocardiograms (ECG). We present techniques that allow the embedding and retrieval of sensitive numerical data, such as the patient's social security number or birth date, within the medical signal. The proposed technique not only allows the effective hiding of the sensitive metadata within the signal itself, but it additionally provides a way of authenticating the data ownership or providing assurances about the origin of the data. Our methodology builds upon watermarking notions, and presents the following desirable characteristics: (a) it does not distort important ECG characteristics, which are essential for proper medical diagnosis, (b) it allows not only the embedding but also the efficient retrieval of the embedded data, (c) it provides resilience and fault tolerance by employing multistage watermarks (both robust and fragile). Our experiments on real ECG data indicate the viability of the proposed scheme.Source: JOURNAL OF MEDICAL SYSTEMS, vol. 33 (issue 4), pp. 241-259
DOI: 10.1007/s10916-008-9185-1
Metrics:

See at: Journal of Medical Systems Restricted | CNR IRIS | CNR IRIS | metapress.com

2010 Journal article Restricted

Rights protection of trajectory datasets with nearest-neighbor preservation
Lucchese C, Vlachos M, Yu P S, Rayan D
Companies frequently outsource datasets to mining firms, and academic institutions create repositories or share datasets in the interest of promoting research collaboration. Still, many practitioners have reservations about sharing or outsourcing datasets, primarily because of fear of losing the principal rights over the dataset. This work presents a way of convincingly claiming ownership rights over a trajectory dataset, without, at the same time, destroying the salient dataset characteristics, which are important for accurate search operations and data-mining tasks. The digital watermarking methodology that we present distorts imperceptibly a collection of sequences, effectively embedding a secret key, while retaining as well as possible the neighborhood of each object, which is vital for operations such as similarity search, classification, or clustering. A key contribution in this methodology is a technique for discovering the maximum distortion that still maintains such desirable properties. We demonstrate both analytically and empirically that the proposed dataset marking techniques can withstand a number of attacks (such a translation, rotation, noise addition, etc) and therefore can provide a robust framework for facilitating the secure dissemination of trajectory datasetsSource: VLDB JOURNAL, vol. 19 (issue 4), pp. 531-556
DOI: 10.1007/s00778-010-0178-6
Metrics:

See at: The VLDB Journal Restricted | CNR IRIS | CNR IRIS | www.springerlink.com

2008 Conference article Open Access

Ownership protection of shapes with geodesic distance preservation
Michail Vlachos, Claudio Lucchese, Deepak Rajan, Philip S Yu
Protection of one's intellectual property is a topic with important technological and legal facets. The significance of this issue is amplified nowadays due to the ease of data dissemination through the internet. Here, we provide technological mechanisms for establishing the ownership of a dataset consisting of multiple objects. The objects that we consider in this work are shapes (i.e., two dimensional contours), which abound in disciplines such as medicine, biology, anthropology and natural sciences. The protection of the dataset is achieved through means of embedding of an imperceptible ownership 'seal', that imparts only minute visual distortions. This seal needs to be embedded in the proper data space so that its removal or destruction is particularly difficult. Our technique is robust to many common transformations, such as data rotation, translation, scaling, noise addition and resampling. In addition to that, the proposed scheme also guarantees that important distances between the dataset shapes/objects are not distorted. We achieve this by preserving the geodesic distances between the dataset objects. Geodesic distances capture a significant part of the dataset structure, and their usefulness is recognized in many machine learning, visualization and clustering algorithms. Therefore, if a practitioner uses the protected dataset as input to a variety of mining, machine learning, or database operations, the output will be the same as on the original dataset. We illustrate and validate the applicability of our methods on image shapes extracted from anthropological and natural science data.DOI: 10.1145/1353343.1353379
Metrics:

See at: dl.acm.org Open Access | dl.acm.org Restricted | doi.org | CNR IRIS | CNR IRIS

2008 Conference article Restricted

Rights protection of trajectory datasets
Lucchese C, Vlachos M, Rayan D, Yu P S
This work presents a technique of convincingly claiming ownership rights over a trajectory dataset. The presented methodology distorts imperceptibly a collection of sequences, effectively embedding a secret key, while retaining as well as possible the neighborhood of each object, which is vital for operations such as similarity search, classification or clustering.

See at: CNR IRIS Restricted | CNR IRIS

2010 Book Restricted

Workshop Report - LSDS-IR'10
Lucchese C, Blanco R, Cambazoglou B
The size of theWeb as well as user bases of search systems continue to grow exponentially. Consequently, providing subsecond query response times and high query throughput become quite challenging for large-scale information retrieval systems. Distributing different aspects of search (e.g., crawling, indexing, and query processing) is essential to achieve scalability in large-scale information retrieval systems. The 8th Workshop on Large-Scale Distributed Systems for Information Retrieval (LSDS-IR'10) has provided a venue to discuss the current research challenges and identify new directions for distributed information retrieval. The workshop contained two industry talks as well as six research paper presentations. The hot topics in this year's workshop were collection selection architectures, application of MapReduce to information retrieval problems, similarity search, geographically distributed web search, and optimization techniques for search efficiency.DOI: 10.1145/1924475.1924486
Metrics:

See at: dl.acm.org Restricted | ACM SIGIR Forum | CNR IRIS | CNR IRIS

2011 Conference article Restricted

Direct local pattern sampling by efficient two-step random procedures
Boley M, Lucchese C, Paurat D, Gartner, T
We present several exact and highly scalable local pattern sampling algorithms. They can be used as an alternative to exhaustive local pattern discovery methods (e.g, frequent set mining or optimistic-estimator-based subgroup discovery) and can substantially improve efficiency as well as con- trollability of pattern discovery processes. While previous sampling approaches mainly rely on the Markov chain Monte Carlo method, our procedures are direct, i.e., non process- simulating, sampling algorithms. The advantages of these direct methods are an almost optimal time complexity per pattern as well as an exactly controlled distribution of the produced patterns. Namely, the proposed algorithms can sample (item-)sets according to frequency, area, squared fre- quency, and a class discriminativity measure. Experiments demonstrate that these procedures can improve the accuracy of pattern-based models similar to frequent sets and often also lead to substantial gains in terms of scalability.DOI: 10.1145/2020408.2020500
Project(s): LIFT via OpenAIRE

Metrics:

2011 Other Restricted

Direct pattern sampling with respect to pattern frequency
Lucchese C, Boley M, Gartner T, Paurat D
We present an exact and highly scalable sampling algorithm that can be used as an alternative to exhaustive local pattern discovery methods. It samples patterns according to their frequency of occurrence and can substantially improve efficiency and controllability of the pattern discovery processes. While previous sampling approaches mainly rely on the Markov chain Monte Carlo method, our procedure is direct, i.e. a non process-simulating sampling algorithm. The ad- vantages of this direct method are an almost optimal time complexity per pattern as well as an exactly controlled distribution of the produced pat- terns. In addition we present experimental results which demonstrate that these procedures can improve the accuracy of pattern-based models similar to frequent sets and often also lead to substantial gains in terms of scalabilityProject(s): LIFT via OpenAIRE

See at: CNR IRIS Restricted | CNR IRIS

2011 Book Restricted

LSDS-IR'11: the 9th workshop on large-scale and distributed systems for information retrieval
Lucchese C, Cambazoglu B B
The growth of the Web and user bases lead to important performance problems for large-scale Web search engines. The LSDS-IR '11 workshop focuses on research contributions re- lated to the scalability and efficiency of distributed information retrieval (IR) systems. The workshop also encourages contributions that propose different ways of leveraging diversity and multiplicity of resources available in distributed systems. More specifically, we are interested in novel applications, models, and architectures that deal with efficiency and scalability of distributed IR systems.DOI: 10.1145/2063576.2064054
Project(s): Advanced Search Services and Enhanced Technological Solutions for the European Digital Library
Metrics:

See at: dl.acm.org Restricted | doi.org | CNR IRIS | CNR IRIS

2012 Conference article Restricted

From chatter to headlines: harnessing the real-time web for personalized news recommendation
De Francisci M. G., Aristides G., Lucchese C.
We propose a new methodology for recommending interest- ing news to users by exploiting the information in their twit- ter persona. We model relevance between users and news articles using a mix of signals drawn from the news stream and from twitter: the prole of the social neighborhood of the users, the content of their own tweet stream, and topic popularity in the news and in the whole twitter-land. We validate our approach on a real-world dataset of ap- proximately 40k articles coming from Yahoo! News and one month of crawled twitter data. We train our model using a learning-to-rank approach and support-vector machines. The train and test set are drawn from Yahoo! toolbar log data. We heuristically identify 3 214 users of twitter in the log and use their clicks on news articles to train our system. Our methodology is able to predict with good accuracy the news articles clicked by the users and rank them higher than other news articles. The results show that the com- bination of various signals from real-time web and micro- blogging platforms can be a useful resource to understand user behavior.Source: Fifth ACM International Conference on Web Search and Data Mining, pp. 153–162, Seattle, Washington, USA, 8-12 February 2012
DOI: 10.1145/2124295.2124315
Metrics:

See at: doi.org Restricted | CNR ExploRA

2013 Conference article Open Access

Twitter anticipates bursts of requests for Wikipedia articles
Tolomei G, Orlando S, Ceccarelli D, Lucchese C
Most of the tweets that users exchange on Twitter make implicit mentions of named-entities, which in turn can be mapped to corresponding Wikipedia articles using proper Entity Linking (EL) techniques. Some of those become trending entities on Twitter due to a long-lasting or a sudden effect on the volume of tweets where they are mentioned. We argue that the set of trending entities discovered from Twitter may help predict the volume of requests for relating Wikipedia articles. To validate this claim, we apply an EL technique to extract trending entities from a large dataset of public tweets. Then, we analyze the time series derived from the hourly trending score (i.e., an index of popularity) of each entity as measured by Twitter and Wikipedia, respectively. Our results reveals that Twitter actually leads Wikipedia by one or more hours.DOI: 10.1145/2513577.2538768
Metrics:

2014 Journal article Restricted

Right-protected data publishing with provable distance-based mining
Lucchese C, Zoumpoulis S, Vlachos M, Freris N
Protection of one's intellectual property is a topic with important technological and legal facets. We provide mechanisms for establishing the ownership of a dataset consisting of multiple objects. The algorithms also preserve important properties of the dataset, which are important for mining operations, and so guarantee both right protection and utility preservation. We consider a right-protection scheme based on watermarking. Watermarking may distort the original distance graph. Our watermarking methodology preserves important distance relationships, such as: the Nearest Neighbors (NN) of each object and the Minimum Spanning Tree (MST) of the original dataset. This leads to preservation of any mining operation that depends on the ordering of distances between objects, such as NN-search and classification, as well as many visualization techniques. We prove fundamental lower and upper bounds on the distance between objects post-watermarking. In particular, we establish a restricted isometry property, i.e., tight bounds on the contraction/expansion of the original distances. We use this analysis to design fast algorithms for NN-preserving and MST-preserving watermarking that drastically prune the vast search space. We observe two orders of magnitude speedup over the exhaustive schemes, without any sacrifice in NN or MST preservation.Source: IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (PRINT), vol. 26 (issue 8), pp. 2014-2028
DOI: 10.1109/tkde.2013.90
Project(s): MININEXACT via OpenAIRE

Metrics:

See at: IEEE Transactions on Knowledge and Data Engineering Restricted | CNR IRIS | ieeexplore.ieee.org | Infoscience - EPFL scientific publications | CNR IRIS

2016 Conference article Open Access

Summarizing linked data RDF graphs using approximate graph pattern mining
Zneika M, Lucchese C, Vodislav D, Kotzinos D
The Linked Open Data (LOD) cloud brings together infor- mation described in RDF and stored on the web in (possi- bly distributed) RDF Knowledge Bases (KBs). The data in these KBs are not necessarily described by a known schema and many times it is extremely time consuming to query all the interlinked KBs in order to acquire the necessary in- formation. To tackle this problem, we propose a method of summarizing large RDF KBs using approximate RDF graph patterns and calculating the number of instances covered by each pattern. Then we transform the patterns to an RDF schema that describes the contents of the KB. Thus we can then query the RDF graph summary to identify whether the necessary information is present and if so its size, before deciding to include it in a federated query result.DOI: 10.5441/002/edbt.2016.86
Project(s): SoBigData via OpenAIRE

Metrics:

See at: CNR IRIS Open Access | CNR IRIS Restricted

2016 Conference article Open Access

RDF graph summarization based on approximate patterns
Zneika M, Lucchese C, Vodislav D, Kotzinos D
The Linked Open Data (LOD) cloud brings together information described in RDF and stored on the web in (possibly distributed) RDF Knowledge Bases (KBs). The data in these KBs are not necessarily described by a known schema and many times it is extremely time consuming to query all the interlinked KBs in order to acquire the necessary information. But even when the KB schema is known, we need actually to know which parts of the schema are used. We solve this problem by summarizing large RDF KBs using top-K approximate RDF graph patterns, which we transform to an RDF schema that describes the contents of the KB. This schema describes accurately the KB, even more accurately than an existing schema because it describes the actually used schema, which corresponds to the existing data. We add information on the number of various instances of the patterns, thus allowing the query to estimate the expected results. That way we can then query the RDF graph summary to identify whether the necessary information is present and if it is present in significant numbers whether to be included in a federated query result.Source: COMMUNICATIONS IN COMPUTER AND INFORMATION SCIENCE (PRINT), pp. 69-87. Grand Forks, ND, USA, 1-2 October, 2015
DOI: 10.1007/978-3-319-43862-7_4
Metrics:

2005 Conference article Restricted

Pushing tougher constraints in frequent pattern mining
Bonchi F, Lucchese C
In this paper we extend the state-of-art of the constraints that can be pushed in a frequent pattern computation. We introduce a new class of tough constraints, namely Loose Anti-monotone constraints, and we deeply characterize them by showing that they are a superclass of convertible anti-monotone constraints (e.g. constraints on average or median) and that they model tougher constraints (e.g. constraints on variance or standard deviation). Then we show how these constraints can be exploited in a level-wise Apriori-like computation by means of a new data-reduction technique: the resulting algorithm outperforms previous proposals for convertible constraints, and it is to treat much tougher constraints with the same effectiveness of easier ones.

See at: CNR IRIS Restricted | CNR IRIS

2007 Journal article Restricted

Extending the state-of-the-art of constraint-based pattern discovery
Lucchese C, Bonchi F
The constraint-based pattern discovery paradigm was introduced with the aim of providing to the user a tool to drive the discovery process towards potentially interesting patterns, with the positive side effect of achieving a more efficient computation. In this paper we review and extend the state-of-the-art of the constraints that can be pushed in a frequent pattern computation. We introduce novel data reduction techniques which are able to exploit convertible anti-monotone constraints (e.g., constraints on average or median) as well as tougher constraints (e.g., constraints on variance or standard deviation). A thorough experimental study is performed and it confirms that our framework outperforms previous algorithms for convertible constraints, and exploit the tougher ones with the same effectiveness. Finally, we highlight that the main advantage of our approach, i.e., pushing constraints by means of data reduction in a level-wise framework, is that different properties of different constraints can be exploited all together, and the total benefit is always greater than the sum of the individual benefits. This consideration leads to the definition of a general Apriori-like algorithm which is able to exploit all possible kinds of constraints studied so far.Source: DATA & KNOWLEDGE ENGINEERING, vol. 60 (issue 2), pp. 377-399
DOI: 10.1016/j.datak.2006.02.006
Metrics:

See at: Data & Knowledge Engineering Restricted | CNR IRIS | CNR IRIS

2006 Journal article Restricted

On condensed representations of constrained frequent patterns
Bonchi F, Lucchese C
Constrained frequent patterns and closed frequent patterns are two paradigms aimed at reducing the set of extracted patterns to a smaller, more interesting, subset. Although a lot of work has been done with both these paradigms, there is still confusion around the mining problem obtained by joining closed and constrained frequent patterns in a unique framework. In this paper we shed light on this problem by providing a formal definition and a thorough characterization. We also study computational issues and show how to combine the most recent results in both paradigms, providing a very efficient algorithm which exploits the two requirements (satisfying constraints and being closed) together at mining time in order to reduce the computation as much as possible.Source: KNOWLEDGE AND INFORMATION SYSTEMS, vol. 9 (issue 2), pp. 180-201

See at: CNR IRIS Restricted | CNR IRIS

2004 Conference article Restricted

On closed constrained frequent pattern mining
Bonchi F, Lucchese C
Constrained frequent patterns and closed frequent patterns are two paradigms aimed at reducing the set of extracted patterns to a smaller, more interesting, subset. Although a lot of work has been done with both these paradigms, there is still confusion around the mining problem obtained by joining closed and constrained frequent patterns in a unique framework. In this paper we shed light on this problem by providing a formal definition and a thorough characterization. Wealso study computational issues and show how to combine the most recent results in both paradigms, providing a very efficient algorithm which exploits the two requirements (satisfying constraints and being closed) together at mining time in order to reduce the computation as much as possible.

See at: csdl.computer.org Restricted | CNR IRIS | CNR IRIS

2004 Conference article Restricted

kDCI: on using direct count up to the third iteration
Lucchese C, Orlando S, Perego R
We thus introduced such technique in the last version of kDCI, which is level-wise hybrid algorithm. kDCI stores the dataset with an horizontal format to disk during the first iterations. After some iteration the dataset may become small enough (thanks to anti-monotone frequency pruning) to be stored in the main memory in a vertical format, and after that the algorithm goes on performing tid-lists intersections to retrieve itemsets supports, and searches among candidates are not needed anymore. Usually the dataset happens to be small enough at most at the fourth iteration.

See at: ftp.informatik.rwth-aachen.de Restricted | CNR IRIS | CNR IRIS

2008 Contribution to book Restricted

Mining@home: public resource computing for distributed data mining
Lucchese C, Orlando S, Talia D, Matroianni C, Barbalace D
Several kinds of scientific and commercial applications require the execution of a large number of independent tasks. One highly successful and low cost mechanism for acquiring the necessary compute power for these applications is the "public-resource computing", or "desktop Grid" paradigm, which exploits the computational power of private computers. So far, this paradigmhas not been applied to data mining applications for two main reasons. First, it is not trivial to decompose a data mining algorithm into truly independent sub-tasks. Second, the large volume of data involved makes it difficult to handle the communication costs of a parallel paradigm. In this paper, we focus on one of the main data mining problem: the extraction of closed frequent itemsets from transactional databases. We show that is possible to decompose this problem into independent tasks, which however need to share a large volume of data. We thus introduce a data-intensive computing network, which adopts a P2P topology based on super peers with caching capabilities, aiming to support the dissemination of large amounts of information. Finally, we evaluate the execution of our data mining job on such network.DOI: 10.1007/978-0-387-09455-7_16
Metrics:

See at: doi.org Restricted | CNR IRIS | CNR IRIS | www.springerlink.com

2006 Software Metadata Only Access

ExAMinerGEN
Bonchi F, Lucchese C
ExAMinerGEN: è un software per il mining di pattern frequenti capace di sfruttare tutti i vincoli studiati ad oggi in letteratura. Il software `e implementato utilizzando lo stato dell'arte delle tecniche per il pushing di vincoli nell'estrazione di pattern frequenti. Il software è efficiente, robusto, data aware (adatta il suo comportamento ai dati in analisi) e resource-aware (passa in memoria principale, cambiando strategia di mining, non appena il database di input è stato ridotto a sufficienza rispetto alle risorse disponibili). ExAMinerGEN è il motore di mining utilizzato dal sistema ConQueSt.

See at: CNR IRIS Restricted