2009
Journal article
Restricted
Embedding and retrieving private metadata in electrocardiograms
Lucchese C., Kozat S., Vlachos M., Yu P. S., Herle H.
Due to the recent explosion of 'identity theft' cases, the safeguarding of private data has been the focus of many scientific efforts. Medical data contain a number of sensitive attributes, whose access the rightful owner would ideally like to disclose only to authorized personnel. One way of providing limited access to sensitive data is by means of encryption. In this work we follow a different path, proposing the fusion of the sensitive metadata with the medical data. Our work focuses on medical time-series signals, and in particular on electrocardiograms (ECG). We present techniques that allow the embedding and retrieval of sensitive numerical data, such as the patient's social security number or birth date, within the medical signal. The proposed technique not only allows the effective hiding of the sensitive metadata within the signal itself, but additionally provides a way of authenticating data ownership or providing assurances about the origin of the data. Our methodology builds upon watermarking notions and presents the following desirable characteristics: (a) it does not distort important ECG characteristics, which are essential for proper medical diagnosis; (b) it allows not only the embedding but also the efficient retrieval of the embedded data; (c) it provides resilience and fault tolerance by employing multistage watermarks (both robust and fragile). Our experiments on real ECG data indicate the viability of the proposed scheme. (See the sketch after this record.)
Source: Journal of Medical Systems 33 (2009): 241–259. doi:10.1007/s10916-008-9185-1
DOI: 10.1007/s10916-008-9185-1
See at:
Journal of Medical Systems | metapress.com | CNR ExploRA
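A minimal sketch of the embed-and-retrieve idea in the record above, using a toy additive spread-spectrum watermark on a synthetic trace. The key, the chip construction and the amplitude are illustrative assumptions, and the detector is non-blind (it needs the original signal); the paper's multistage robust/fragile scheme is more elaborate and is tuned to leave diagnostic ECG features undistorted.

```python
# Hedged sketch, not the paper's exact scheme: embed a bit string into a
# 1-D signal as a low-amplitude spread-spectrum watermark, retrieve it
# with a matched-filter detector.
import numpy as np

def embed_bits(signal, bits, key=42, amplitude=0.01):
    rng = np.random.default_rng(key)              # secret key seeds the chips
    chips = rng.choice([-1.0, 1.0], size=(len(bits), len(signal)))
    wm = sum((1 if b else -1) * c for b, c in zip(bits, chips))
    return signal + amplitude * wm, chips

def retrieve_bits(marked, original, chips):
    residual = marked - original                  # non-blind detection
    return [bool(residual @ c > 0) for c in chips]

ecg = np.sin(np.linspace(0, 20 * np.pi, 4000))    # stand-in for an ECG trace
secret = [1, 0, 1, 1, 0, 0, 1, 0]                 # e.g. digits of an ID
marked, chips = embed_bits(ecg, secret)
assert retrieve_bits(marked, ecg, chips) == [bool(b) for b in secret]
```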
2010
Journal article
Closed Access
Rights protection of trajectory datasets with nearest-neighbor preservation
Lucchese C., Vlachos M., Yu P. S., Rajan D.
Companies frequently outsource datasets to mining firms, and academic institutions create repositories or share datasets in the interest of promoting research collaboration. Still, many practitioners have reservations about sharing or outsourcing datasets, primarily for fear of losing the principal rights over the dataset. This work presents a way of convincingly claiming ownership rights over a trajectory dataset without, at the same time, destroying the salient dataset characteristics that are important for accurate search operations and data-mining tasks. The digital watermarking methodology that we present imperceptibly distorts a collection of sequences, effectively embedding a secret key, while retaining as well as possible the neighborhood of each object, which is vital for operations such as similarity search, classification, or clustering. A key contribution in this methodology is a technique for discovering the maximum distortion that still maintains such desirable properties. We demonstrate both analytically and empirically that the proposed dataset marking techniques can withstand a number of attacks (such as translation, rotation, noise addition, etc.) and therefore can provide a robust framework for facilitating the secure dissemination of trajectory datasets. (See the sketch after this record.)
Source: The VLDB Journal 19 (2010): 531–556. doi:10.1007/s00778-010-0178-6
DOI: 10.1007/s00778-010-0178-6
See at:
The VLDB Journal | www.springerlink.com | CNR ExploRA
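The "maximum distortion that still maintains desirable properties" idea above can be sketched as a binary search over the embedding power, accepting a power only if every object keeps its nearest neighbor. The additive perturbation and the assumed monotonicity of NN preservation in the power are simplifications for illustration, not the paper's actual embedding.

```python
# Hedged sketch: find the largest watermark power that leaves every
# trajectory's nearest neighbor unchanged (assumes, as a simplification,
# that NN preservation is monotone in the power).
import numpy as np

def nn_indices(data):
    d = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    return d.argmin(axis=1)

def max_safe_power(data, mark, lo=0.0, hi=1.0, iters=30):
    ref = nn_indices(data)
    for _ in range(iters):                        # binary search on power
        mid = (lo + hi) / 2
        if np.array_equal(nn_indices(data + mid * mark), ref):
            lo = mid                              # NN preserved: push harder
        else:
            hi = mid
    return lo

rng = np.random.default_rng(0)
trajectories = rng.normal(size=(50, 64))          # 50 length-64 sequences
watermark = rng.normal(size=trajectories.shape)
print(max_safe_power(trajectories, watermark))
```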
2008
Conference article
Open Access
Ownership protection of shapes with geodesic distance preservation
Vlachos M., Lucchese C., Rajan D., Yu P. S.
Protection of one's intellectual property is a topic with important technological and legal facets. The significance of this issue is amplified nowadays due to the ease of data dissemination through the internet. Here, we provide technological mechanisms for establishing the ownership of a dataset consisting of multiple objects. The objects that we consider in this work are shapes (i.e., two-dimensional contours), which abound in disciplines such as medicine, biology, anthropology and the natural sciences. The protection of the dataset is achieved by embedding an imperceptible ownership 'seal' that imparts only minute visual distortions. This seal needs to be embedded in the proper data space so that its removal or destruction is particularly difficult. Our technique is robust to many common transformations, such as data rotation, translation, scaling, noise addition and resampling. In addition, the proposed scheme also guarantees that important distances between the dataset shapes/objects are not distorted. We achieve this by preserving the geodesic distances between the dataset objects. Geodesic distances capture a significant part of the dataset structure, and their usefulness is recognized in many machine learning, visualization and clustering algorithms. Therefore, if a practitioner uses the protected dataset as input to a variety of mining, machine learning, or database operations, the output will be the same as on the original dataset. We illustrate and validate the applicability of our methods on image shapes extracted from anthropological and natural science data. (See the sketch after this record.)
Source: 11th International Conference on Extending Database Technology - EDBT'08, pp. 276–286, Nantes, France, 25-30 March 2008
DOI: 10.1145/1353343.1353379
See at:
dl.acm.org | doi.org | CNR ExploRA
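A hedged sketch of the quantity being preserved above: Isomap-style geodesic distances, i.e. shortest paths over a k-nearest-neighbor graph of the objects. The circular point set is a stand-in for real shape feature vectors (it keeps the neighborhood graph connected), and k and the noise level are illustrative choices.

```python
# Sketch of geodesic distances as k-NN-graph shortest paths; a
# geodesic-preserving watermark must leave this matrix (nearly) intact.
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

def geodesic_distances(points, k=5):
    # shortest paths over the k-NN graph approximate geodesics (as in Isomap)
    knn = kneighbors_graph(points, k, mode="distance")
    return shortest_path(knn, method="D", directed=False)

t = np.linspace(0, 2 * np.pi, 40, endpoint=False)
objects = np.c_[np.cos(t), np.sin(t)]         # stand-ins for shape features
rng = np.random.default_rng(1)
marked = objects + 1e-3 * rng.normal(size=objects.shape)  # mild distortion
print(np.abs(geodesic_distances(objects) - geodesic_distances(marked)).max())
```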
2008
Conference article
Unknown
Rights protection of trajectory datasets
Lucchese C., Vlachos M., Rajan D., Yu P. S.
This work presents a technique for convincingly claiming ownership rights over a trajectory dataset. The presented methodology imperceptibly distorts a collection of sequences, effectively embedding a secret key, while retaining as well as possible the neighborhood of each object, which is vital for operations such as similarity search, classification or clustering.
Source: 24th International Conference on Data Engineering, pp. 1349–1351, Cancún, Mexico, 7-12 April 2008
See at:
CNR ExploRA
2010
Contribution to book
Restricted
Workshop Report - LSDS-IR'10
Lucchese C., Blanco R., Cambazoglu B. B.
The size of the Web as well as the user bases of search systems continue to grow exponentially. Consequently, providing subsecond query response times and high query throughput becomes quite challenging for large-scale information retrieval systems. Distributing different aspects of search (e.g., crawling, indexing, and query processing) is essential to achieve scalability in large-scale information retrieval systems. The 8th Workshop on Large-Scale Distributed Systems for Information Retrieval (LSDS-IR'10) has provided a venue to discuss the current research challenges and identify new directions for distributed information retrieval. The workshop contained two industry talks as well as six research paper presentations. The hot topics in this year's workshop were collection selection architectures, application of MapReduce to information retrieval problems, similarity search, geographically distributed web search, and optimization techniques for search efficiency.
DOI: 10.1145/1924475.1924486
See at:
dl.acm.org | ACM SIGIR Forum | CNR ExploRA
2011
Conference article
Restricted
Direct local pattern sampling by efficient two-step random procedures
Boley M., Lucchese C., Paurat D., Gartner T.
We present several exact and highly scalable local pattern sampling algorithms. They can be used as an alternative to exhaustive local pattern discovery methods (e.g., frequent set mining or optimistic-estimator-based subgroup discovery) and can substantially improve efficiency as well as controllability of pattern discovery processes. While previous sampling approaches mainly rely on the Markov chain Monte Carlo method, our procedures are direct, i.e., non-process-simulating, sampling algorithms. The advantages of these direct methods are an almost optimal time complexity per pattern as well as an exactly controlled distribution of the produced patterns. Namely, the proposed algorithms can sample (item-)sets according to frequency, area, squared frequency, and a class discriminativity measure. Experiments demonstrate that these procedures can improve the accuracy of pattern-based models similar to frequent sets and often also lead to substantial gains in terms of scalability. (See the sketch after this record.)
Source: ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD'11, pp. 582–590, San Diego, USA, 21-24 August 2011
DOI: 10.1145/2020408.2020500
Project(s): LIFT
See at:
ACM Digital Library | dl.acm.org | doi.org | research.monash.edu | CNR ExploRA
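For the frequency measure, the two-step procedure of the record above admits a very short implementation: draw a transaction t with probability proportional to 2^|t|, then return a uniformly random subset of it; the resulting itemset is distributed exactly proportionally to its support. The toy database below is illustrative.

```python
# Two-step direct sampling of itemsets proportionally to frequency
# (sketch after Boley et al.): P(F) = support(F) / sum_t 2^|t|.
import random

def sample_itemset_by_frequency(transactions, rng=random.Random(7)):
    weights = [2 ** len(t) for t in transactions]          # step 1: pick t
    t = rng.choices(transactions, weights=weights)[0]
    return frozenset(i for i in t if rng.random() < 0.5)   # step 2: subset

db = [{"a", "b"}, {"a", "b", "c"}, {"b", "c"}, {"a"}]
counts = {}
for _ in range(10000):
    s = sample_itemset_by_frequency(db)
    counts[s] = counts.get(s, 0) + 1
print(sorted(counts.items(), key=lambda kv: -kv[1])[:5])   # frequent sets dominate
```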
2011
Contribution to conference
Unknown
Direct pattern sampling with respect to pattern frequency
Lucchese C., Boley M., Gartner T., Paurat D.
We present an exact and highly scalable sampling algorithm that can be used as an alternative to exhaustive local pattern discovery methods. It samples patterns according to their frequency of occurrence and can substantially improve efficiency and controllability of the pattern discovery process. While previous sampling approaches mainly rely on the Markov chain Monte Carlo method, our procedure is direct, i.e., a non-process-simulating sampling algorithm. The advantages of this direct method are an almost optimal time complexity per pattern as well as an exactly controlled distribution of the produced patterns. In addition we present experimental results which demonstrate that these procedures can improve the accuracy of pattern-based models similar to frequent sets and often also lead to substantial gains in terms of scalability.
Source: Workshop on Knowledge Discovery, Data Mining and Machine Learning, in conjunction with LWA 2011. KDLM'11 - LWA 2011, Magdeburg, Germany, 28-30 September 2011
Project(s): LIFT
See at:
CNR ExploRA
2011
Contribution to conference
Restricted
LSDS-IR'11: the 9th workshop on large-scale and distributed systems for information retrieval
Lucchese C., Cambazoglu B. B.
The growth of the Web and of user bases leads to important performance problems for large-scale Web search engines. The LSDS-IR '11 workshop focuses on research contributions related to the scalability and efficiency of distributed information retrieval (IR) systems. The workshop also encourages contributions that propose different ways of leveraging the diversity and multiplicity of resources available in distributed systems. More specifically, we are interested in novel applications, models, and architectures that deal with the efficiency and scalability of distributed IR systems.
Source: New York: ACM, Association for Computing Machinery, 2011
DOI: 10.1145/2063576.2064054
Project(s): ASSETS
See at:
dl.acm.org | doi.org | CNR ExploRA
2012
Conference article
Restricted
From chatter to headlines: harnessing the real-time web for personalized news recommendation
De Francisci Morales G., Gionis A., Lucchese C.
We propose a new methodology for recommending interesting news to users by exploiting the information in their Twitter persona. We model relevance between users and news articles using a mix of signals drawn from the news stream and from Twitter: the profile of the social neighborhood of the users, the content of their own tweet stream, and topic popularity in the news and in the whole Twitter-land. We validate our approach on a real-world dataset of approximately 40k articles coming from Yahoo! News and one month of crawled Twitter data. We train our model using a learning-to-rank approach and support-vector machines. The train and test sets are drawn from Yahoo! toolbar log data. We heuristically identify 3 214 users of Twitter in the log and use their clicks on news articles to train our system. Our methodology is able to predict with good accuracy the news articles clicked by the users and rank them higher than other news articles. The results show that the combination of various signals from the real-time web and micro-blogging platforms can be a useful resource to understand user behavior. (See the sketch after this record.)
Source: Fifth ACM International Conference on Web Search and Data Mining, pp. 153–162, Seattle, Washington, USA, 8-12 February 2012
DOI: 10.1145/2124295.2124315
See at:
doi.org | CNR ExploRA
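A hedged sketch of one standard way to realize the learning-to-rank setup mentioned above: clicked articles are preferred to unclicked ones, preference pairs become difference vectors, and a linear SVM learns ranking weights. The random features are placeholders for the paper's social, content and popularity signals, and the paper's exact ranker configuration is not reproduced.

```python
# Pairwise-ranking sketch: learn w such that w.(clicked - skipped) > 0,
# then score articles with w. Features here are synthetic placeholders.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)
clicked = rng.normal(0.5, 1.0, size=(100, 4))   # features of clicked news
skipped = rng.normal(0.0, 1.0, size=(100, 4))   # features of skipped news

X = np.vstack([clicked - skipped, skipped - clicked])  # pairwise transform
y = np.array([1] * 100 + [-1] * 100)

ranker = LinearSVC().fit(X, y)

def score(article):
    return article @ ranker.coef_.ravel()       # higher score ranks higher

print(score(clicked[0]) > score(skipped[0]))
```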
2014
Journal article
Restricted
Right-protected data publishing with provable distance-based mining
Lucchese C., Zoumpoulis S., Vlachos M., Freris N.
Protection of one's intellectual property is a topic with important technological and legal facets. We provide mechanisms for establishing the ownership of a dataset consisting of multiple objects. The algorithms also preserve properties of the dataset that matter for mining operations, and so guarantee both right protection and utility preservation. We consider a right-protection scheme based on watermarking. Watermarking may distort the original distance graph. Our watermarking methodology preserves important distance relationships, such as the Nearest Neighbors (NN) of each object and the Minimum Spanning Tree (MST) of the original dataset. This leads to preservation of any mining operation that depends on the ordering of distances between objects, such as NN-search and classification, as well as many visualization techniques. We prove fundamental lower and upper bounds on the distance between objects post-watermarking. In particular, we establish a restricted isometry property, i.e., tight bounds on the contraction/expansion of the original distances. We use this analysis to design fast algorithms for NN-preserving and MST-preserving watermarking that drastically prune the vast search space. We observe two orders of magnitude speedup over the exhaustive schemes, without any sacrifice in NN or MST preservation. (See the sketch after this record.)
Source: IEEE Transactions on Knowledge and Data Engineering (Print) 26 (2014): 2014–2028. doi:10.1109/TKDE.2013.90
DOI: 10.1109/TKDE.2013.90
Project(s): MININEXACT
See at:
IEEE Transactions on Knowledge and Data Engineering | ieeexplore.ieee.org | Infoscience - EPFL scientific publications | CNR ExploRA
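A minimal sketch of the MST-preservation property the record above guarantees: after watermarking, the minimum spanning tree over pairwise distances should keep the same edge set. The generic random perturbation below stands in for the actual watermark embedding.

```python
# Check that a mild perturbation leaves the MST edge set unchanged.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mst_edges(data):
    mst = minimum_spanning_tree(squareform(pdist(data)))
    rows, cols = mst.nonzero()
    return {tuple(sorted(e)) for e in zip(rows, cols)}

rng = np.random.default_rng(3)
objects = rng.normal(size=(30, 8))
marked = objects + 1e-4 * rng.normal(size=objects.shape)  # toy "watermark"
print(mst_edges(objects) == mst_edges(marked))            # True if preserved
```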
2016
Contribution to conference
Open Access
Summarizing linked data RDF graphs using approximate graph pattern mining
Zneika M., Lucchese C., Vodislav D., Kotzinos D.
The Linked Open Data (LOD) cloud brings together information described in RDF and stored on the web in (possibly distributed) RDF Knowledge Bases (KBs). The data in these KBs are not necessarily described by a known schema, and many times it is extremely time-consuming to query all the interlinked KBs in order to acquire the necessary information. To tackle this problem, we propose a method of summarizing large RDF KBs using approximate RDF graph patterns and calculating the number of instances covered by each pattern. We then transform the patterns into an RDF schema that describes the contents of the KB. We can thus query the RDF graph summary to identify whether the necessary information is present and, if so, its size, before deciding to include it in a federated query result.
Source: EDBT'16 - 19th International Conference on Extending Database Technology, pp. 684–685, Bordeaux, France, 15-18 March 2016
DOI: 10.5441/002/edbt.2016.86
Project(s): SoBigData
See at:
CNR ExploRA
2016
Conference article
Open Access
RDF graph summarization based on approximate patterns
Zneika M., Lucchese C., Vodislav D., Kotzinos D.
The Linked Open Data (LOD) cloud brings together information described in RDF and stored on the web in (possibly distributed) RDF Knowledge Bases (KBs). The data in these KBs are not necessarily described by a known schema, and many times it is extremely time-consuming to query all the interlinked KBs in order to acquire the necessary information. But even when the KB schema is known, we still need to know which parts of the schema are actually used. We solve this problem by summarizing large RDF KBs using top-K approximate RDF graph patterns, which we transform into an RDF schema that describes the contents of the KB. This schema describes the KB accurately, even more accurately than an existing schema, because it reflects the schema actually used, which corresponds to the existing data. We add information on the number of instances of each pattern, allowing queries against the summary to estimate the size of the expected results. We can then query the RDF graph summary to identify whether the necessary information is present and, if so, whether it is present in sufficient numbers to be included in a federated query result. (See the sketch after this record.)
Source: 10th International Workshop on Information Search, Integration and Personalization (ISIP), pp. 69–87, Grand Forks, ND, USA, 1-2 October 2015
DOI: 10.1007/978-3-319-43862-7_4
See at:
hal.archives-ouvertes.fr | ISTI Repository | doi.org | Hyper Article en Ligne | link.springer.com | CNR ExploRA
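A hedged sketch of the per-pattern instance counts mentioned above, with a pattern reduced to a set of predicates and its count being the number of subjects carrying all of them. The paper mines such patterns approximately over large KBs; here both the triples and the pattern are toy inputs.

```python
# Count how many subjects instantiate a (predicate-set) pattern.
from collections import defaultdict

triples = [
    ("alice", "rdf:type", "Person"), ("alice", "foaf:name", "Alice"),
    ("bob", "rdf:type", "Person"), ("bob", "foaf:name", "Bob"),
    ("bob", "foaf:mbox", "b@x.org"),
]

props = defaultdict(set)                  # subject -> set of its predicates
for s, p, o in triples:
    props[s].add(p)

def pattern_count(pattern):
    return sum(1 for ps in props.values() if pattern <= ps)

print(pattern_count({"rdf:type", "foaf:name"}))   # 2 instances in this toy KB
```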
2005
Conference article
Unknown
Pushing tougher constraints in frequent pattern mining
Bonchi F., Lucchese C.
In this paper we extend the state of the art of the constraints that can be pushed into a frequent pattern computation. We introduce a new class of tough constraints, namely loose anti-monotone constraints, and we characterize them in depth by showing that they are a superclass of convertible anti-monotone constraints (e.g., constraints on average or median) and that they model tougher constraints (e.g., constraints on variance or standard deviation). Then we show how these constraints can be exploited in a level-wise Apriori-like computation by means of a new data-reduction technique: the resulting algorithm outperforms previous proposals for convertible constraints, and it is able to treat much tougher constraints with the same effectiveness as easier ones. (See the sketch after this record.)
Source: Advances in Knowledge Discovery and Data Mining, Pacific-Asia Conference, pp. 114–124, Hanoi, Vietnam, 18-20 May 2005
See at:
CNR ExploRA
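The loose anti-monotonicity of a variance constraint can be made concrete: dropping the value farthest from the mean never increases the population variance, so any satisfying set of two or more values has a satisfying subset one element smaller. A minimal sketch of that witness argument follows; the threshold and values are illustrative.

```python
# Loose anti-monotonicity witness for var(S) <= max_var: removing the
# element farthest from the mean cannot increase the variance.
from statistics import pvariance

def satisfies(values, max_var):
    return pvariance(values) <= max_var

def lam_witness(values, max_var):
    # assumes len(values) >= 2; returns a satisfying subset of size n-1
    mean = sum(values) / len(values)
    worst = max(values, key=lambda v: abs(v - mean))
    rest = list(values)
    rest.remove(worst)
    return rest if satisfies(rest, max_var) else None

prices = [10, 12, 11, 40]                 # item attribute values
print(satisfies(prices, 200), lam_witness(prices, 200))
```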
2007
Journal article
Restricted
Extending the state-of-the-art of constraint-based pattern discovery
Lucchese C., Bonchi F.
The constraint-based pattern discovery paradigm was introduced with the aim of providing the user with a tool to drive the discovery process towards potentially interesting patterns, with the positive side effect of achieving a more efficient computation. In this paper we review and extend the state of the art of the constraints that can be pushed into a frequent pattern computation. We introduce novel data-reduction techniques which are able to exploit convertible anti-monotone constraints (e.g., constraints on average or median) as well as tougher constraints (e.g., constraints on variance or standard deviation). A thorough experimental study is performed, and it confirms that our framework outperforms previous algorithms for convertible constraints and exploits the tougher ones with the same effectiveness. Finally, we highlight that the main advantage of our approach, i.e., pushing constraints by means of data reduction in a level-wise framework, is that different properties of different constraints can be exploited all together, and the total benefit is always greater than the sum of the individual benefits. This consideration leads to the definition of a general Apriori-like algorithm which is able to exploit all possible kinds of constraints studied so far.
Source: Data & Knowledge Engineering 60 (2007): 377–399. doi:10.1016/j.datak.2006.02.006
DOI: 10.1016/j.datak.2006.02.006
See at:
Data & Knowledge Engineering | CNR ExploRA
2006
Journal article
Unknown
On condensed representations of constrained frequent patterns
Bonchi F., Lucchese C.
Constrained frequent patterns and closed frequent patterns are two paradigms aimed at reducing the set of extracted patterns to a smaller, more interesting, subset. Although a lot of work has been done with both these paradigms, there is still confusion around the mining problem obtained by joining closed and constrained frequent patterns in a unique framework. In this paper we shed light on this problem by providing a formal definition and a thorough characterization. We also study computational issues and show how to combine the most recent results in both paradigms, providing a very efficient algorithm which exploits the two requirements (satisfying constraints and being closed) together at mining time in order to reduce the computation as much as possible.
Source: Knowledge and Information Systems 9 (2006): 180–201.
See at:
CNR ExploRA
2004
Conference article
Restricted
On closed constrained frequent pattern mining
Bonchi F., Lucchese C.
Constrained frequent patterns and closed frequent patterns are two paradigms aimed at reducing the set of extracted patterns to a smaller, more interesting, subset. Although a lot of work has been done with both these paradigms, there is still confusion around the mining problem obtained by joining closed and constrained frequent patterns in a unique framework. In this paper we shed light on this problem by providing a formal definition and a thorough characterization. We also study computational issues and show how to combine the most recent results in both paradigms, providing a very efficient algorithm which exploits the two requirements (satisfying constraints and being closed) together at mining time in order to reduce the computation as much as possible.
Source: ICDM'04 - Fourth IEEE International Conference on Data Mining, pp. 35–42, Brighton, UK, 1-4 November 2004
See at:
csdl.computer.org | CNR ExploRA
2004
Conference article
Restricted
kDCI: on using direct count up to the third iteration
Lucchese C., Orlando S., Perego R.
We thus introduced such a technique in the latest version of kDCI, which is a level-wise hybrid algorithm. kDCI stores the dataset on disk in a horizontal format during the first iterations. After a few iterations the dataset may become small enough (thanks to anti-monotone frequency pruning) to be stored in main memory in a vertical format; from then on the algorithm proceeds by performing tid-list intersections to retrieve itemset supports, and searches among candidates are no longer needed. Usually the dataset becomes small enough by the fourth iteration at the latest. (See the sketch after this record.)
Source: ICDM Workshop on Frequent Itemset Mining Implementations, pp. 1–1, Brighton, UK, 1 November 2004
See at:
ftp.informatik.rwth-aachen.de | CNR ExploRA
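A minimal sketch of the vertical phase described above: each item is mapped to the set of transaction ids (tids) containing it, and candidate supports come from tid-list intersections instead of per-transaction candidate searches. The data structures are simplified relative to kDCI's actual in-memory layout.

```python
# Vertical support counting via tid-list intersection.
transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]

tidlists = {}                                  # item -> set of tids
for tid, t in enumerate(transactions):
    for item in t:
        tidlists.setdefault(item, set()).add(tid)

def support(itemset):
    tids = set.intersection(*(tidlists[i] for i in itemset))
    return len(tids)

print(support({"a", "b"}))   # {a,b} occurs in transactions 0 and 2 -> 2
```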
2008
Contribution to book
Restricted
Mining@home: public resource computing for distributed data mining
Lucchese C., Orlando S., Talia D., Mastroianni C., Barbalace D.
Several kinds of scientific and commercial applications require the execution of a large number of independent tasks. One highly successful and low-cost mechanism for acquiring the necessary compute power for these applications is the "public-resource computing", or "desktop Grid", paradigm, which exploits the computational power of private computers. So far, this paradigm has not been applied to data mining applications for two main reasons. First, it is not trivial to decompose a data mining algorithm into truly independent sub-tasks. Second, the large volume of data involved makes it difficult to handle the communication costs of a parallel paradigm. In this paper, we focus on one of the main data mining problems: the extraction of closed frequent itemsets from transactional databases. We show that it is possible to decompose this problem into independent tasks, which however need to share a large volume of data. We thus introduce a data-intensive computing network, which adopts a P2P topology based on super-peers with caching capabilities, aiming to support the dissemination of large amounts of information. Finally, we evaluate the execution of our data mining job on such a network. (See the sketch after this record.)
Source: From Grids to Service and Pervasive Computing, edited by Thierry Priol, Marco Vanneschi, pp. 217–227, 2008
DOI: 10.1007/978-0-387-09455-7_16
See at:
doi.org | www.springerlink.com | CNR ExploRA
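One plausible way to obtain the "truly independent sub-tasks" mentioned above is a prefix-based projection, where each task mines the search subtree rooted at one item. This is a hedged sketch of that general divide-and-conquer idea under an assumed item ordering, not the paper's actual decomposition; it also ignores the closedness requirement and the super-peer dissemination network.

```python
# Split a frequent-itemset mining job into per-item projection tasks;
# each (seed, projection) pair can be mined independently by a worker.
def projections(transactions, min_supp):
    items = sorted({i for t in transactions for i in t})
    tasks = {}
    for item in items:
        # keep only items after `item` in the order: disjoint subtrees
        proj = [{x for x in t if x > item} for t in transactions if item in t]
        if len(proj) >= min_supp:              # seed item must be frequent
            tasks[item] = proj
    return tasks

db = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b", "c"}]
for seed, proj in projections(db, 2).items():
    print(seed, proj)                          # one independent task each
```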
2006
Software
Unknown
ExAMinerGEN
Bonchi F., Lucchese C.
ExAMinerGEN is a frequent pattern mining software tool able to exploit all the constraints studied in the literature to date. It is implemented using the state of the art of constraint-pushing techniques for frequent pattern extraction. The software is efficient, robust, data-aware (it adapts its behavior to the data under analysis) and resource-aware (it switches to main-memory processing, changing mining strategy, as soon as the input database has been reduced enough with respect to the available resources). ExAMinerGEN is the mining engine used by the ConQueSt system.
See at:
CNR ExploRA
2010
Conference article
Restricted
Mining top-K patterns from binary datasets in presence of noise
Lucchese C., Orlando S., Perego R.
The discovery of patterns in binary datasets has many applications, e.g. in electronic commerce, TCP/IP networking, Web usage logging, etc. Still, this is a very challenging task in many respects: overlapping vs. non-overlapping patterns, presence of noise, extraction of the most important patterns only. In this paper we formalize the problem of discovering the top-K patterns from binary datasets in the presence of noise as the minimization of a novel cost function. In accordance with the Minimum Description Length principle, the proposed cost function favors succinct pattern sets that may approximately describe the input data. We propose a greedy algorithm for the discovery of Patterns in Noisy Datasets, named PaNDa, and show that it outperforms related techniques on both synthetic and real-world data. (See the sketch after this record.)
Source: Tenth SIAM International Conference on Data Mining, pp. 165–176, Columbus, Ohio, US, April 29 - May 1 2010
See at:
www.siam.org | CNR ExploRA
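A hedged sketch of an MDL-style cost in the spirit of the one described above (not PaNDa's exact function): a candidate pattern set pays for its own description plus every cell of the binary matrix it fails to explain, so succinct approximate covers win.

```python
# MDL-flavored cost: description length of the patterns plus the number
# of mismatched cells (noise) between their cover and the data.
import numpy as np

def cost(data, patterns):
    covered = np.zeros_like(data, dtype=bool)
    size = 0
    for rows, cols in patterns:               # pattern = row set x column set
        covered[np.ix_(rows, cols)] = True
        size += len(rows) + len(cols)         # description length
    noise = np.logical_xor(covered, data.astype(bool)).sum()
    return size + noise

data = np.array([[1, 1, 0],
                 [1, 1, 1],
                 [0, 1, 1]])
print(cost(data, [([0, 1], [0, 1]), ([1, 2], [1, 2])]))  # 8 + 0 noise
```

A greedy miner in this spirit would repeatedly add the candidate pattern that lowers this cost the most and stop when no candidate improves it.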