## Kzi.polsl.pl

Spec. Issue on Feature Transformation and Subset Selection z Zupan, Marko Bohanec, Janez Demˇsar, Ivan Bratko While not explicitly intended for feature transformation, some methods for switching circuit design implicitly deal with this problem. Given a tab-ulated Boolean function, these methods construct a circuit that implementsthat function. In 1950s and 1960s, Ashenhurst [1] and Curtis [2] proposed afunction decomposition method that develops a switching circuit by construct-ing a nested hierarchy of tabulated Boolean functions. Both the hierarchy andthe functions themselves are discovered by the decomposition method and arenot given in advance. This is especially important from the viewpoint of fea-ture construction, since the outputs of such functions can be regarded as newfeatures not present in the original problem description.
The basic principle of function decomposition is the following. Let a tab- ulated function y = F (X) use a set of input features X = x1, . . . , xn. Thegoal is to decompose this function into y = G(A, H(B)), where A and B aresubsets of features in X such that A ∪ B = X. G and H are tabulated func-tions that are determined by the decomposition and are not predefined. Theirjoint complexity (determined by some complexity measure) should be lowerthan the complexity of F . Such a decomposition also discovers a new featurec = H(B). Since the decomposition can be applied recursively on H and G,the result in general is a hierarchy of features. For each feature in the hierarchy,there is a corresponding tabulated function (such as H(B)) that determinesthe dependency of that feature on its immediate descendants in the hierarchy.
Ashenhurst-Curtis decomposition was intended for switching circuit design of completely specified Boolean functions. Recently, Wan and Perkowski [3]extended the decomposition to handle incompletely specified Boolean func-tions. In the framework of feature extraction, Ross et al. [4] used a set of sim-ple Boolean functions to show the decomposition’s capability to discover andextract useful features. This article presents an approach to feature transfor- mation that is based on the extension of the function decomposition method byZupan et al. [5]. This method allows the decomposition to deal with functionsthat involve nominal (i.e., not necessarily binary) features, and is implementedin a system called HINT(Hierarchy INduction Tool).
The core of the decomposition algorithm is a single-step decomposition whichdecomposes a tabulated function y = F (X) into two possibly less complextabulated functions G and H, so that y = G(A, c) and c = H(B).
resulting functions G and H have to be consistent with F . For the purposeof decomposition, a set of features X is partitioned to two disjoint subsets Aand B, referred to as a free and bound set, respectively. For a given featurepartition, single-step decomposition discovers a new feature c and a tabularrepresentation of H and G.
For example, consider a function y = F (x1, x2, x3) as given in Table 1. The input features x1, x2, and the output feature y can take the values lo, med, hi;input feature x3 can take the values lo, hi.
Table 1: Tabulated function y = F (x1, x2, x3).
Suppose that we want to discover a description of a new feature c = H(x2, x3). For this purpose, we represent F by a partition matrix that usesthe values of x1 for row labels, and the combinations of values of x2 and x3for column labels (Figure 1.a). Partition matrix entries with no correspondinginstance in the tabulated representation of F are denoted with “-” and treatedas don’t-care.
Figure 1: Partition matrix (a) and the corresponding incompatibility graph (b)for the function from Table 1. The node labels of the incompatibility graphfound by graph coloring are circled.
Each column in the partition matrix denotes the behavior of F for a specific combination of x2 and x3. Columns that have pairwise equal row entries or atleast one row entry is a don’t-care are called compatible. The decompositionhas to assign each column a label that corresponds to a value of c. If twocolumns are compatible, they can be assigned the same label. To preserve theconsistency, different labels have to be used for incompatible columns.
The partition matrix column labels are found by coloring an incompati- bility graph (Figure 1.b). This has a distinct node for each column of thepartition matrix. Two nodes are connected if the corresponding columns ofpartition matrix are incompatible. For instance, the nodes hi,hi and lo,hiare connected because their corresponding columns are incompatible due tothe entries in the first row (hi=lo).
With optimal coloring of the incompatibility graph, the new feature c ob- tains a minimal set of values needed for consistent derivation of H and G fromF . The optimal coloring of the graph from Figure 1.b requires three differentcolors, i.e., three abstract values for c. The tabulated functions G and H canthen be derived straightforwardly from the labeled partition matrix.
The resulting decomposition is given in Figure 2. The following can be • The tabulated functions G and H are overall smaller than the original Figure 2: Decomposition of tabulated function from Table 1 using a new fea-ture c = H(x2, x3).
• By combining the features x2 and x3 we obtained a new feature c and the corresponding tabulated function H, which can be interpreted asc = MIN(x2, x3). Similarly, the function G that relates x1 and c to y canbe interpreted as y = MAX(x1, c).
• The decomposition generalizes some undefined entries of the partition matrix. For example, F (hi, lo, hi) is generalized to hi because the col-umn lo,hi has the same label as columns lo,lo and hi,lo.
So far, we have assumed that a partition of the features to free and boundsets is given. However, for each function F there are many possible partitions,each one yielding a different feature c and a different pair of functions G andH. In feature transformation, it is important to identify partitions that leadto simple but useful new features. Typically, we prefer features with a smallnumber of values, and those that yield functions G and H of low complexity.
The simplest measure for finding good feature partitions is the number of values required for the new feature c. That is, when decomposing F , a set of candidate partitions is examined and the one that yields c with a smallest setof possible values is selected for decomposition.
An alternative measure for the selection of partitions can be based on the complexity of functions. Let I(F ) denote a number of bits to encode somefunction F [6, 4]. Then, the best partition is the one that minimizes theoverall complexity of newly discovered functions H and G, i.e., the partitionwith minimal I(H) + I(G). The complexity of G and H should be lower thanthat of F , and decomposition takes place only if I(H) + I(G) < I(F ) for thebest partition.
The number of possible partitions increases exponentially with the number of input features. To limit the complexity of search, the decomposition shouldexamine only a subset of partitions. In HINT, the partitions examined areonly those with less than b features in the bound set, where b is a user-definedparameter usually set to 2 or 3.
The partition matrix can be very large. However, it is never explicitly represented in the program as HINT derives the incompatibility graph directlyfrom the data [6]. Partition matrix is thus a formal construct used for ex-planation or analysis of the decomposition algorithm. The coloring of theincompatibility graph is another potential bottleneck. HINT uses an efficientheuristic algorithm for graph coloring [6].
A special characteristic of function decomposition is its capability to identifyand remove redundant features and/or their values. Namely, whenever a fea-ture c = H(xi) is discovered such that c uses only a single value (i.e., H is aconstant function), the feature xi is redundant and can be removed from thedataset. Similarly, whenever a feature c = H(xi) uses less values than xi, theoriginal feature xi can be replaced by c.
Both these dataset transformations are particularly useful for data pre- processing, since they can reduce the size of the problem and increase thecoverage of the problem space by the learning examples. HINT removes re-dundant features in the increased order of their relevance: the less relevantfeatures are checked first, and if any type of redundancy mentioned above isdiscovered, the dataset is accordingly transformed. The features are estimatedby a ReliefF measure of relevance [7].
Figure 3: MONK1 feature hierachy as discovered by HINT.
With all the above ingredients, decomposition enables the discovery of featurehierarchies. Given a dataset, it is first pre-processed to remove redundant fea-tures and their values. Next, a single-step decomposition is used to decomposethe pre-processed tabulated function F to G and H. This process is recur-sively repeated on G and H until they can not be decomposed further, i.e.,their further decomposition would increase the overall complexity of resultingfunctions.
Let us illustrate this process with several examples. The first example is a well-known machine learning problem MONK1 [8]. The dataset uses six 2 to4-valued input features (x1 to x6) and contains 124 (of 435 possible) distinctlearning examples that partially define the target concept x1 = x2 OR x5 = 1.
HINT develops a feature hierarchy (Figure 3) that 1. correctly excludes irrelevant input features x3, x4, and x6, 2. transforms x5 to x by mapping four values of x 3. includes a new feature c and its tabular representation for x1 = x2, and 4. relates c and x with a tabular representation of the OR function.
In other words, the resulting hierarchy correctly represents the target concept.
Similarly as MONK1, the MONK2 learning problem [8] uses the same set of input features but defines the concept: xi = 1 for exactly two choices ofi ∈ {1, 2, . . . , 6}. The dataset contains 172 (of 435 possible) distinct learningexamples. Although the discovered structure (Figure 4) does not directly cor-respond to the original concept definition, it correctly reformulates the targetconcept by introducing features that count the number of ones in their argu-ments. Also note that all input features that use more than two values arereplaced by new binary features.
Figure 4: The feature hierarchy discovered for MONK2. Each node gives aname of the feature and cardinality of its set of values.
For a real-world example, consider a HOUSING dataset from a manage- ment decision support system for allocating housing loans. This system wasdeveloped for the Housing Fund of Slovenia and used since 1991 in 13 floats ofloans with a total value of approximately 90 million ECU. Basically, its task isto rank applicants into one of nine priority classes based on 12 input features.
The system uses a hierarchical decision model.
For this experiment, we wanted to assess the ability of decomposition to reconstruct the system’s model given a random selection of classified cases.
Figure 5: The feature structure discovered for housing loan allocation problem.
Each node gives a name of the feature and cardinality of its set of values.
Figure 6: Averaged classification accuracy of HINT and C4.5 for 10 experimentswith learning sets consisting of a random sample of p percents of the housingproblem space.
First, it was observed that in most cases when HINT was given 8000 or morelearning examples (4% coverage of problem space), it discovered the same fea-ture hierarchy (Figure 5). When presented to a domain expert, the discoveredfeatures were found meaningful. For instance, the feature c4 represents thepresent housing status of the applicant. Features c2, c6, and c8 respectivelyspecify the way of solving the housing problem, applicant’s current status, andhis/her social and health conditions. Thus, the structure of the original modelwas successfully reconstructed. This reconstruction takes about 2.5 minutesof processor time on a HP J210 Unix workstation.
The quality of the functions discovered for housing problem was then as- sessed by classification of the remaining examples that completely cover thisdomain but were not included in the learning set. The generalization of HINTwas assessed as a learning curve (Figure 6) and compared to that of the state-of-the-art learning tool C4.5 [9] that induces decision trees. For this domain,HINT generalizes well as it achieves 100% classification accuracy by using learn-ing sets that cover 3 or more percents of the problem space.
Housing problem domain can be viewed as a typical representative of prac- tical domains in the field of decision making that HINT was tested on. Similartests and experimental conclusion were for instance obtained by re-discoveryof hierarchical decision models for nursing school applications, job applicationevaluation, and performance evaluation of public enterprises [6].
Function decomposition can also be used for feature construction. This isdefined as a process of augmenting the feature space by inferring or creatingadditional features. We here show how the single-step decomposition can beused both for finding appropriate combinations of original features and usingthem to construct new features which are then added to the original dataset.
Again, the new features that use small number of values are preferred.
For example, consider again the dataset from Table 1. Suppose we want to augment it with a single new feature. For this purpose, single-step decomposi-tion examines all pairs of original input features and for each pair derives a newfeature. A new feature that has the fewest values is preferred. Such feature isc = H(x2, x3), and is defined by the example set from Figure 2. Function Hcan now be used for each example from Table 1 to obtain the corresponding Table 2: A dataset from Table 1 augmented with a new feature c = H(x2, x3).
value for c; such augmented dataset is shown in Table 2.
Obviously, an original dataset can be augmented with more than just a sin- gle new feature. We propose a schema where, using the single-step decompo-sition, m new features are added to a dataset: a candidate set of combinationsof the original features are examined, and corresponding m new features thathave the fewest values are selected. Ties are resolved arbitrarily. Only originalfeatures are used to generate new features. Note also that in the process nohierarchy of features is constructed. The feature construction and correspond-ing augmentation of the original dataset results in a new dataset which maythen be further used by some machine learning algorithm.
We evaluate this schema on the above-mentioned domains. For new fea- tures, only those that depend only on two original input features are consid-ered. For MONK1 and MONK2 domains, the benefit of adding new featureswas assessed by a 10-fold cross validation. For HOUSING, 10 experimentswere performed with a learning set consisting of a random sample of 1% ofthe problem space, while the test set consisted of the remaining 99%. Eachlearning set was first augmented with m new features as described above, andthen given to C4.5 [9] which was used to induce the target concept from theaugmented dataset. The induced decision tree was then tested on a test set,which was also augmented by the same set of new features, i.e., using theirdefinition as derived from the learning set.
In all the cases, a considerable increase of classification accuracy (Figure 7) and considerable decrease of the sizes of induced decision trees (Figure 8) occurwhen new features are added. In other words, adding new features improves ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ••••••••••••••• Figure 7: Classification accuracy of C4.5 as a function of number of new fea-tures m added to MONK1 and MONK2 (a) and housing loans dataset (b).
◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦m Figure 8: Number of nodes N of C4.5-induced decision trees as a function ofnumber of new features m added to MONK1 and MONK2 (a) and housingloans dataset (b).
both the classification accuracy and the transparency of decision trees. Theresults indicate that function decomposition discovers features that are relevantfor these domains. It is further interesting to note how well a simple featureselection criterion based on the number of new feature’s values performs interms of yielding augmented datasets that enable C4.5 to derive classifierswith high classification accuracies.
In the framework of feature transformation, function decomposition is a promis-ing approach to discover and construct new features that are either added tothe original dataset, or transform this dataset to a hierarchy of less complexdatasets. The function decomposition algorithm in HINT performs a kind ofgeneralization, so it can also be viewed as a machine learning algorithm. Theapproach described in this article is limited to consistent datasets and nom-inal features. It is therefore desired to extend the approach to discover newfeatures from noisy data, and from data that uses continuous features.
To handle noisy data, a minimal-error decomposition was recently pro- posed [6]. It is based on a representation of learning examples with classdistributions and uses successive column merging of partition matrix, so thatthe expected error of classification is minimized. So far, the method was eval-uated only as a machine learning tool, while further work is needed to assessits appropriateness for feature transformation problems.
[1] R. L. Ashenhurst, “The decomposition of switching functions,” tech. rep., Bell Laboratories BL-1(11), pages 541–602, 1952.
[2] H. A. Curtis, A New Approach to the Design of Switching Functions. Van [3] W. Wan and M. A. Perkowski, “A new approach to the decomposition of incompletely specified functions based on graph-coloring and local trans-formations and its application to FPGA mapping,” in Proc. of the IEEEEURO-DAC ’92, (Hamburg), pp. 230–235, 1992.
[4] T. D. Ross, M. J. Noviskey, D. A. Gadd, and J. A. Goldman, “Pattern the- oretic feature extraction and constructive induction,” in Proc. ML-COLT’94 Workshop on Constructive Induction and Change of Representation,(New Brunswick, New Jersey), July 1994.
[5] B. Zupan, M. Bohanec, I. Bratko, and J. Demˇsar, “Machine learning by function decomposition,” in Proc. Fourteenth International Conference onMachine Learning (J. D. H. Fisher, ed.), (San Mateo, CA), pp. 421–429,Morgan Kaufmann, 1997.
http://www-ai.ijs.si/BlazZupan/papers.html.
[7] I. Kononenko, “Estimating attributes,” in Proc. of the European Confer- ence on Machine Learning (ECML-94) (F. Bergadano and L. D. Raedt,eds.), vol. 784 of Lecture Notes in Artificial Intelligence, (Catania), pp. 171–182, Springer-Verlag, 1994.
[8] S. B. Thrun et al., “A performance comparison of different learning algo- rithms,” tech. rep., Carnegie Mellon University CMU-CS-91-197, 1991.
[9] J. R. Quinlan, C4.5: Programs for Machine Learning. Morgan Kaufmann zef Stefan Institute, Ljubljana, Slovenia.
His research interest include machine learning, intelligent data analysis and knowl-edge discovery in databases, computer-aided support for medical decision making,and medical informatics. He has his MS in computer science from University of Hous-ton, Texas, and PhD from University of Ljubljana, Slovenia. He can be reached atDepartment of Intelligent Systems, J. Stefan Institute, Jamova 39, SI-1000 Ljubl-jana, Slovenia; blaz.zupan@ijs.si; http://www-ai.ijs.si/BlazZupan.
Marko Bohanec is a research associate at Joˇ nia, and assistant professor in information systems at the Faculty of OrganizationalSciences, University of Maribor. His research and development interests are in de-cision support systems and machine learning. He obtained his PhD in ComputerScience at the Faculty of Electrical Engineering and Computer Science, University ofLjubljana. He has published in journals such as Machine Learning, Acta Psychologicaand Information & Management. Contact him at Department of Intelligent Systems,J. Stefan Institute, Jamova 39, SI-1000 Ljubljana, Slovenia; marko.bohanec@ijs.si;http://www-ai.ijs.si/MarkoBohanec/mare.html.
Ivan Bratko is a professor of computer science at the Faculty of Computer andInformation Science, University of Ljubljana, Slovenia. He heads the AI labora-tories at Joˇ zef Stefan Institute and the University. He is the chairman of ISSEK, International School for the Synthesis of Expert Knowledge.
research in machine learning, knowledge-based systems, qualitative modeling, intel-ligent robotics, heuristic programming and computer chess. His main interests inmachine learning have been in learning from noisy data, combining learning andqualitative reasoning, and various applications of machine learning and InductiveLogic Programming, including medicine, ecological modeling and control of dynamicsystems. He is the author of widely adopted text PROLOG Programming for Arti-ficial Intelligence (Addison-Wesley 1986, second edition 1990). He co-edited (withN. Lavraˇ c) Progress in Machine Learning (Sigma Press 1987) and co-authored (with c) KARDIO: a Study in Deep and Qualitative Knowledge for Expert Systems (MIT Press, 1989). Contact him at the Faculty of Computerand Information Science, University of Ljubljana, Trˇ Slovenia; ivan.bratko@fri.uni-lj.si.
sar is a research assistant at the Faculty of Computer and Information Science, University of Ljubljana, Slovenia. He has BSc in computer science from thesame faculty. His research interest are machine learning and intelligent data analy-sis. He can be reached at the Faculty of Computer and Information Science, Univer-sity of Ljubljana, Trˇ ska 25, SI-1000 Ljubljana, Slovenia; janez.demsar@fri.uni-lj.si;