Page 1 of 2

2004 Conference article Restricted

YaDT: Yet another Decision Tree builder
Salvatore Ruggieri
YaDT is a from-scratch main-memory implementation of the C4.5-like decision tree algorithm. Our presentation will be focused on the design principles that allowed for obtaining an extremely efficient system. Experimental results are reported comparing YaDT withWeka, dti, Xelopes and (E)C4.5.Source: PROCEEDINGS - INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE, pp. 260-265. Boca Raton, FL, USA, 15-17 November 2004

See at: CNR IRIS Restricted | CNR IRIS | www.computer.org

2010 Conference article Restricted

Frequent regular itemset mining
Ruggieri S
Concise representations of frequent itemsets sacrifice readability and direct interpretability by a data analyst of the concise patterns extracted. In this paper, we introduce an extension of itemsets, called regular, with an immediate semantics and interpretability, and a conciseness comparable to closed itemsets. Regular itemsets allow for specifying that an item may or may not be present; that any subset of an itemset may be present; and that any non-empty subset of an itemset may be present. We devise a procedure, called {\bf RegularMine}, for mining a set of regular itemsets that is a concise representation of frequent itemsets. The procedure computes a covering, in terms of regular itemsets, of the frequent itemsets in the class of equivalence of a closed one. We report experimental results on several standard dense and sparse datasets that validate the proposed approach.

See at: dl.acm.org Restricted | CNR IRIS | CNR IRIS

2012 Journal article Restricted

A complexity perspective on entailment of parameterized linear constraints
Eirinakis P, Ruggieri S, Subramani K, Wojciechowski P
Extending linear constraints by admitting parameters allows for more abstract problem modeling and reasoning. A lot of focus has been given to conducting research that demonstrates the usefulness of parameterized linear constraints and implementing tools that utilize their modeling strength. However, there is no approach that considers basic theoretical tools related to such constraints that allow for reasoning over them. Hence, in this paper we introduce satisf iability with respect to polyhedral sets and entailment for the class of parameterized linear constraints. In order to study the computational complexities of these problems, we relate them to classes of quantified linear implications. The problem of satisfiability with respect to polyhedral sets is then shown to be co-NP hard. The entailment problem is also shown to be co-NP hard in its general form. Nevertheless, we characterize some subclasses for which this problem is in P. Furthermore, we examine a weakening and a strengthening extension of the entailment problem. The weak entailment problem is proved to be NP complete. On the other hand, the strong entailment problem is shown to be co-NP hard.

See at: CNR IRIS Restricted | CNR IRIS | link.springer.com

2012 Conference article Restricted

Discovering gender discrimination in project funding
Romei A, Ruggieri S, Turini F
The selection of projects for funding can hide discriminatory decisions. We present a case study investigating gender discrimination in a dataset of scientific research proposals submitted to an Italian national call. The method for the analysis relies on a data mining classification strategy that is inspired by a legal methodology for proving evidence of social discrimination against protected-by-law groups.

See at: CNR IRIS Restricted | ieeexplore.ieee.org | CNR IRIS

2012 Conference article Restricted

Computational complexities of inclusion queries over polyhedral sets
Eirinakis P, Ruggieri S, Subramani K, Wojciechowski P
In this paper we discuss the computational complexities of procedures for inclusion queries over polyhedral sets. The polyhedral sets that we consider occur in a wide range of applications, ranging from logistics to program verification. The goal of our study is to establish boundaries between hard and easy problems in this context.

See at: CNR IRIS Restricted | CNR IRIS | www.cs.uic.edu

2012 Conference article Restricted

Deciding membership in a class of polyhedra
Ruggieri S
Parameterized linear systems allow for modelling and reasoning over classes of polyhedra. Collections of squares, rectangles, polytopes, and so on can readily be defined by means of linear systems with parameters in constant terms. In this paper, we consider the membership problem of deciding whether a given polyhedron belongs to the class defined by a parameterized linear system. As an example, we are interested in questions such as: "does a given polytope belong to the class of hypercubes?" We show that the membership problem is NP-complete, even when restricting to the 2-dimensional plane. Despite the negative result, the constructive proof allows us to devise a concise decision procedure using constraint logic programming over the reals, namely CLP(R), which searches for a characterization of all instances of a parameterized system that are equivalent to a given polyhedron.

See at: ebooks.iospress.nl Restricted | CNR IRIS | CNR IRIS

2012 Conference article Restricted

Subtree replacement in decision tree simplication
Ruggieri S
The current availability of efficient algorithms for decision tree induction makes intricate post-processing tech- niques worth to be investigated both for eciency and effectiveness. We study the simplification operator of subtree replacement, also known as grafting, originally implemented in the C4.5 system. We present a parametric bottom-up algorithm integrating grafting with the standard pruning operator, and analyze its complexity in terms of the number of nodes visited. Immediate instances of the parametric algorithm include extensions of error based, reduced error, minimum error, and pessimistic error pruning. Experimental results show that the computational cost of grafting is paid of by statis- tically significant smaller trees without accuracy loss.

See at: CNR IRIS Restricted | CNR IRIS | siam.omnibooksonline.com

2013 Contribution to book Restricted

Discrimination Data Analysis: A Multi-disciplinary Bibliography
Romei A, Ruggieri S
Discrimination data analysis has been investigated for the last fifty years in a large body of social, legal, and economic studies. Recently, discrimination discovery and prevention has become a blooming research topic in the knowledge discovery community. This chapter provides a multi-disciplinary annotated bibliography of the literature on discrimination data analysis, with the intended objective to provide a common basis to researchers from a multi-disciplinary perspective. We cover legal, sociological, economic and computer science references

See at: CNR IRIS Restricted | CNR IRIS

2010 Journal article Restricted

Typing linear constraints
Ruggieri S, Mesnard F
We present a type system for linear constraints over the reals intended for reasoning about the input-output directionality of variables. Types model the properties of definiteness, range width or approximation, lower and upper bounds of variables in a linear constraint. Several proof procedures are presented for inferring the type of a variable and for checking validity of type assertions. We rely on theory and tools for linear programming problems, linear algebra, parameterized polyhedra and negative constraints. An application of the type system is proposed in the context of the static analysis of constraint logic programs. Type assertions are at the basis of the extension of wellmoding from pure logic programming. The proof procedures (both for type assertion validity and for well-moding) are implemented and their computational omplexity is discussed. We report experimental results demonstrating the efficiency in practice of the proposed approach. © 2010 ACM.

See at: dl.acm.org Restricted | CNR IRIS | CNR IRIS

2013 Conference article Restricted

Data anonimity meets non-discrimination
Ruggieri S
We investigate the relation between t-closeness, a well-known model of data anonymization, and alpha-protection, a model of data discrimination. We show that t-closeness implies bd(t)-protection, for a bound function bd() depending on the discrimination measure at hand. This allows us to adapt an inference control method, the Mondrian multidimensional generalization technique, to the purpose of non-discrimination data protection. The parallel between the two analytical models raises intriguing issues on the interplay between data anonymization and nondiscrimination research in data mining.

See at: CNR IRIS Restricted | ieeexplore.ieee.org | CNR IRIS

2013 Conference article Open Access

Learning from polyhedral sets
Ruggieri S
Parameterized linear systems allow for modelling and reasoning over classes of polyhedra. Collections of squares, rectangles, polytopes, and so on, can readily be defined by means of linear systems with parameters. In this paper, we investigate the problem of learning a parameterized linear system whose class of polyhedra includes a given set of example polyhedral sets and it is minimal.

See at: CNR IRIS Open Access | ijcai.org | CNR IRIS Restricted

2014 Journal article Restricted

A multidisciplinary survey on discrimination analysis
Romei A, Ruggieri S
The collection and analysis of observational and experimental data represent the main tools for assessing the presence, the extent, the nature, and the trend of discrimination phenomena. Data analysis techniques have been proposed in the last 50 years in the economic, legal, statistical, and, recently, in the data mining literature. This is not surprising, since discrimination analysis is a multidisciplinary problem, involving sociological causes, legal argumentations, economic models, statistical techniques, and computational issues. The objective of this survey is to provide a guidance and a glue for researchers and anti-discrimination data analysts on concepts, problems, application areas, datasets, methods, and approaches from a multidisciplinary perspective. We organize the approaches according to their method of data collection as observational, quasi-experimental, and experimental studies. A fourth line of recently blooming research on knowledge discovery based methods is also covered. Observational methods are further categorized on the basis of their application context: labor economics, social profiling, consumer markets, and others.Source: KNOWLEDGE ENGINEERING REVIEW, vol. 29 (issue 5), pp. 582-638

See at: CNR IRIS Restricted | CNR IRIS | journals.cambridge.org

2014 Journal article Restricted

Decision tree building on multi-core using FastFlow
Aldinucci M, Ruggieri S, Torquati M
The whole computer hardware industry embraced the multi-core. The extreme optimisation of sequential algorithms is then no longer sufficient to squeeze the real machine power, which can be only exploited via thread-level parallelism. Decision tree algorithms exhibit natural concurrency that makes them suitable to be parallelised. This paper presents an in-depth study of the parallelisation of an implementation of the C4.5 algorithm for multi-core architectures. We characterise elapsed time lower bounds for the forms of parallelisations adopted and achieve close to optimal performance. Our implementation is based on the FastFlow parallel programming environment, and it requires minimal changes to the original sequential code. Copyright © 2013 John Wiley & Sons, Ltd. Copyright © 2013 John Wiley & Sons, Ltd.Source: CONCURRENCY AND COMPUTATION, vol. 26 (issue 3), pp. 800-820

See at: CNR IRIS Restricted | CNR IRIS | onlinelibrary.wiley.com

2017 Conference article Open Access

Efficiently clustering very large attributed graphs
Baroni A, Conte A, Patrignani M, Ruggieri S
Attributed graphs model real networks by enriching their nodes with attributes accounting for properties. Several techniques have been proposed for partitioning these graphs into clusters that are homogeneous with respect to both semantic attributes and to the structure of the graph. However, time and space complexities of state of the art algorithms limit their scalability to medium-sized graphs. We propose SToC (for Semantic-Topological Clustering), a fast and scalable algorithm for partitioning large attributed graphs. The approach is robust, being compatible both with categorical and with quantitative attributes, and it is tailorable, allowing the user to weight the semantic and topological components. Further, the approach does not require the user to guess in advance the number of clusters. SToC relies on well known approximation techniques such as bottom-k sketches, traditional graph-theoretic concepts, and a new perspective on the composition of heterogeneous distance measures. Experimental results demonstrate its ability to efficiently compute high-quality partitions of large scale attributed graphs.

See at: dl.acm.org Open Access | CNR IRIS | ISTI Repository | CNR IRIS Restricted | CNR IRIS

2004 Journal article Restricted

Bounded nondeterminism of logic programs
Pedreschi D, Ruggieri S
We introduce the notion of bounded nondeterminism for logic programs and queries. A program and a query have bounded nondeterminism if there are finitely many refutations for them via any selection rule. We offer a declarative characterization of the class of programs and queries that have bounded nondeterminism by defining bounded programs and queries. The characterization is provided in terms of Herbrand interpretations and level mappings, in the style of existing characterizations of universal termination. A direct application of the theoretical framework is concerned with the automatic generation of a terminating control. We present a transformational approach that given a bounded program and a bounded query yields a terminating program and query with the same set of refutations. Concerning the issue of automating the approach, by means of an example we sketch how an automatic method for proving left termination can be adapted to the purpose of inferring boundedness. Such an adaption reveals that the method does not scale up to medium/large sized programs due to scarce modularity of the required proof obligations. We provide then a modular refinement of boundedness for the significant class of well-moded programs and queries.Source: ANNALS OF MATHEMATICS AND OF ARTIFICIAL INTELLIGENCE, vol. 42 (issue 4), pp. 313-343

See at: CNR IRIS Restricted | CNR IRIS | www.springerlink.com

2011 Contribution to book Restricted

Who/where are my new customers?
Rinzivillo Salvatore, Ruggieri Salvatore
We present a knowledge discovery case study on customer classification having the objective of mining the distinctive characteristics of new customers of a service of tax return. Two general approaches are described. The first one, a symbolic approach, is based on extracting and ranking classification rules on the basis of significativeness measures defined on the 4-fold contingency table of a rule. The second one, a spatial approach, is based on extracting geographic areas with predominant presence of new customers.Source: STUDIES IN COMPUTATIONAL INTELLIGENCE (PRINT)

See at: CNR IRIS Restricted | CNR IRIS | link.springer.com

2011 Conference article Restricted

k-NN as an implementation of situation testing for discrimination discovery and prevention
Luong Binh Thanh, Ruggieri Salvatore, Turini Franco
With the support of the legally-grounded methodology of situation testing, we tackle the problems of discrimination discovery and prevention from a dataset of historical decisions by adopting a variant of k-NN classifi cation. A tuple is labeled as discriminated if we can observe a signi ficant di erence of treatment among its neighbors belonging to a protected-by-law group and its neighbors not belonging to it. Discrimination discovery boils down to extracting a classi fication model from the labeled tuples. Discrimination prevention is tackled by changing the decision value for tuples labeled as discriminated before training a classi fier. The approach of this paper overcomes legal weaknesses and technical limitations of existing proposals.

See at: CNR IRIS Restricted | CNR IRIS

2013 Journal article Restricted

Discrimination discovery in scientific project evaluation: a case study
Romei A, Ruggieri S, Turini F
Discovering contexts of unfair decisions in a dataset of historical decision records is a non-trivial problem. It requires the design of ad hoc methods and techniques of analysis, which have to comply with existing laws and with legal argumentations. While some data mining techniques have been adapted to the purpose, the state-of-the-art of research still needs both methodological refinements, the consolidation of a Knowledge Discovery in Databases (KDD) process, and, most of all, experimentation with real data. This paper contributes by presenting a case study on gender discrimination in a dataset of scientific research proposals, and by distilling from the case study a general discrimination discovery process. Gender bias in scientific research is a challenging problem, that has been tackled in the social sciences literature by means of statistical regression. However, this approach is limited to test an hypothesis of discrimination over the whole dataset under analysis. Our methodology couples data mining, for unveiling previously unknown contexts of possible discrimination, with statistical regression, for testing the significance of such contexts, thus obtaining the best of the two worlds. (C) 2013 Elsevier Ltd. All rights reserved.Source: EXPERT SYSTEMS WITH APPLICATIONS, vol. 40 (issue 15), pp. 6064-6079

See at: CNR IRIS Restricted | CNR IRIS | www.sciencedirect.com

2018 Other Open Access

Assessing the stability of interpretable models
Guidotti R, Ruggieri S
Interpretable classification models are built with the purpose of providing a comprehensible description of the decision logic to an external oversight agent. When considered in isolation, a decision tree, a set of classification rules, or a linear model, are widely recognized as human-interpretable. However, such models are generated as part of a larger analytical process, which, in particular, comprises data collection and filtering. Selection bias in data collection or in data pre-processing may affect the model learned. Although model induction algorithms are designed to learn to generalize, they pursue optimization of predictive accuracy. It remains unclear how interpretability is instead impacted. We conduct an experimental analysis to investigate whether interpretable models are able to cope with data selection bias as far as interpretability is concerned.Project(s): SoBigData via OpenAIRE

See at: arxiv.org Open Access | CNR IRIS | ISTI Repository | CNR IRIS Restricted

2020 Journal article Open Access

Causal inference for social discrimination reasoning
Qureshi B, Kamiran F, Karim A, Ruggieri S, Pedreschi D
The discovery of discriminatory bias in human or automated decision making is a task of increasing importance and difficulty, exacerbated by the pervasive use of machine learning and data mining. Currently, discrimination discovery largely relies upon correlation analysis of decisions records, disregarding the impact of confounding biases. We present a method for causal discrimination discovery based on propensity score analysis, a statistical tool for filtering out the effect of confounding variables. We introduce causal measures of discrimination which quantify the effect of group membership on the decisions, and highlight causal discrimination/favoritism patterns by learning regression trees over the novel measures. We validate our approach on two real world datasets. Our proposed framework for causal discrimination has the potential to enhance the transparency of machine learning with tools for detecting discriminatory bias both in the training data and in the learning algorithms.Source: JOURNAL OF INTELLIGENT INFORMATION SYSTEMS, vol. 54 (issue 2), pp. 425-437

See at: CNR IRIS Open Access | link.springer.com | ISTI Repository | CNR IRIS Restricted | CNR IRIS