
Decision Tree Classification

The second classifier used is a decision tree. A decision tree recursively subdivides regions of the feature space into subspaces, using a different threshold in each dimension to maximize class discrimination. Ideally, the process stops for a given subspace when it contains patterns of only one class. In practice, however, such a stopping criterion is sometimes unattainable or computationally prohibitive, and the algorithm stops instead when most of the patterns in a subspace belong to the same class. In this work we have used the C4.5 decision tree [146], an extension of the ID3 decision tree [145] that naturally deals with continuous data.
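As an illustration of this recursive partitioning (and not of the actual C4.5 implementation used in this work), the following Python sketch grows a tree until a node is nearly pure or too small. The find_best_split helper, the purity level and the minimum node size are hypothetical placeholders for the gain-ratio search and stopping rules described below.

```python
# Illustrative sketch of recursive partitioning (not the C4.5 code used in this work).
from collections import Counter

def majority_class(labels):
    """Most frequent class among the patterns reaching a node."""
    return Counter(labels).most_common(1)[0][0]

def build_tree(X, y, find_best_split, purity=0.95, min_size=5):
    """Split (X, y) recursively until a node is (nearly) pure or too small.

    `find_best_split` is a hypothetical helper returning (feature, threshold)
    chosen by the gain-ratio criterion described below, or None when no useful
    split exists; `purity` and `min_size` are illustrative stopping parameters,
    not values taken from the thesis.
    """
    counts = Counter(y)
    if len(y) <= min_size or counts.most_common(1)[0][1] / len(y) >= purity:
        return {"leaf": True, "label": majority_class(y)}
    split = find_best_split(X, y)
    if split is None:
        return {"leaf": True, "label": majority_class(y)}
    feature, threshold = split
    true_idx = [i for i, row in enumerate(X) if row[feature] <= threshold]
    false_idx = [i for i, row in enumerate(X) if row[feature] > threshold]

    def subtree(idx):
        return build_tree([X[i] for i in idx], [y[i] for i in idx],
                          find_best_split, purity, min_size)

    return {"leaf": False, "test": (feature, threshold),   # A_f <= t
            "true": subtree(true_idx), "false": subtree(false_idx)}
```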

Both the ID3 and C4.5 algorithms rely on an information criterion [43] to simultaneously determine the best attribute to split on and the threshold at which to split it. This criterion is based on the information gain, which measures how well a given attribute separates the training examples according to their target classification at each step. The main difference between the two algorithms is that ID3 uses the information gain itself, whereas C4.5 uses the gain ratio.

In our implementation, each internal node corresponds to a test $ T$ of the form $ A_f \leq t$ (feature $ A_f$ is less than or equal to a given threshold $ t$ ). This test has only two possible outcomes, true or false, and the issue is to find the best threshold at which to partition the data. The splitting criterion used is the gain ratio.

The information gained by a test $ T$ with $ \vert T\vert$ outcomes is defined by:

$\displaystyle Gain(D,T) = I(D)-E(D,T)$ (3.1)

where $ I(D)$ is the information associated with the class distribution of the set of patterns $ D$ (the database of mammograms) and $ E(D,T)$ is the entropy of the partition induced by the test, defined below. Mathematically, $ I(D)$ is given by the entropy:

$\displaystyle I(D) = -\sum_{c=1}^{\vert B\vert}p(D,B_c)\log_2(p(D,B_c))$ (3.2)

where $ p(D,B_c)$ is the probability that a pattern/mammogram belongs to the $ c^{th}$ class (in this work, $ \vert B\vert=4$ , as there are four BIRADS classes). On the other hand, $ E(D,T)$ refers to the entropy defined by the partition, which is calculated as:

$\displaystyle E(D,T) = \sum_{i=1}^{\vert T\vert} \frac{\vert D_i\vert}{\vert D\vert} I(D_i)$ (3.3)

where $ D_i$ are the partitions of the data $ D$ defined by the test $ T$ (the true and false instances). Thus, the information gain is simply the expected reduction in entropy caused by partitioning the examples according to an attribute. This is the criterion used by the ID$ 3$ decision tree.
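For instance, if the four BIRADS classes were equally represented in $ D$ , then $ I(D) = -4\cdot\frac{1}{4}\log_2\frac{1}{4} = 2$ bits, the maximum uncertainty for four classes. The following Python fragment is a minimal numerical sketch of Equations (3.1)-(3.3) for a binary test $ A_f \leq t$ ; the function names and the toy data are illustrative and do not come from the implementation used in this work.

```python
# Minimal sketch of Eqs. (3.1)-(3.3); names and data are illustrative only.
import math
from collections import Counter

def entropy(labels):
    """I(D): entropy of the class distribution of a set of patterns."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Gain(D, T) for the binary test 'feature value <= threshold'."""
    true_part = [l for v, l in zip(values, labels) if v <= threshold]
    false_part = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    # E(D, T): entropy of the two partitions, weighted by their sizes.
    expected = sum(len(p) / n * entropy(p) for p in (true_part, false_part) if p)
    return entropy(labels) - expected

# Toy example: one feature, four BIRADS-like classes.
values = [0.1, 0.4, 0.35, 0.8, 0.9, 0.7]
labels = ["I", "I", "II", "III", "IV", "III"]
print(information_gain(values, labels, threshold=0.5))
```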

As the above criterion is maximal when each resulting subset contains a single case, an additional term is used to penalize such splits. Thus, the split information ($ Sp(D,T)$ ) is defined as the potential information generated by partitioning a set of cases, i.e. by knowing into which subset a pattern falls, and is given by:

$\displaystyle Sp(D,T) = -\sum_{i=1}^{\vert T\vert} \frac{\vert D_i\vert}{\vert D\vert} \log_2 \left( \frac{\vert D_i\vert}{\vert D\vert} \right)$ (3.4)

The gain ratio criterion $ G_{RC}$ used in the C4.5 decision tree is defined by:

$\displaystyle G_{RC} = \frac{Gain(D,T)}{Sp(D,T)}$ (3.5)

This ratio is determined for every possible partition, and the split that gives the maximum $ G_{RC}$ is selected.
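A minimal, self-contained sketch of this search for a single continuous feature is given below, combining Equations (3.1)-(3.5). Taking the candidate thresholds as the midpoints between consecutive distinct feature values is a common choice that we assume here, since the text does not detail it; the function names are ours.

```python
# Sketch of Eqs. (3.1)-(3.5): pick the threshold with maximum gain ratio
# for one continuous feature. Names and the candidate-threshold choice are assumptions.
import math
from collections import Counter

def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels, threshold):
    """G_RC = Gain(D, T) / Sp(D, T) for the binary test 'value <= threshold'."""
    n = len(labels)
    parts = ([l for v, l in zip(values, labels) if v <= threshold],
             [l for v, l in zip(values, labels) if v > threshold])
    if not parts[0] or not parts[1]:
        return 0.0                                   # degenerate split, no information
    gain = _entropy(labels) - sum(len(p) / n * _entropy(p) for p in parts)
    split_info = -sum(len(p) / n * math.log2(len(p) / n) for p in parts)
    return gain / split_info

def best_threshold(values, labels):
    """Scan midpoints between consecutive distinct values, return (t, G_RC)."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(((t, gain_ratio(values, labels, t)) for t in candidates),
               key=lambda pair: pair[1], default=(None, 0.0))
```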

In order to obtain a more robust classifier, the boosting procedure described in [147] is used. The underlying idea of this machine-learning method is to combine simple classifiers into an ensemble whose performance is better than that of any individual member. To achieve this, boosting assigns a weight to each instance of the training data, reflecting its importance. Adjusting these weights causes the learner to focus on different instances, leading to different classifiers. To construct the first decision tree, the weights of all instances are initialized to the same value. The training data is then classified with this initial tree and an error rate is obtained by counting the misclassified instances. Using this error rate, the weights are recalculated so that those of misclassified instances are increased and those of correctly classified instances are decreased. This process is repeated until either all instances are correctly classified or convergence is achieved. The final classification aggregates the learned classifiers by voting, where each classifier's vote is weighted according to its accuracy (a function of its error rate).
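The following Python sketch illustrates this reweighting-and-voting loop with an AdaBoost.M1-style procedure. A weighted decision stump stands in for the boosted C4.5 trees that are actually combined, and the names, number of rounds and stopping conditions are our own assumptions rather than details of [147].

```python
# Illustrative AdaBoost.M1-style sketch of the boosting scheme described above.
import math
from collections import defaultdict

def weighted_majority(labels, weights):
    """Class with the largest total weight."""
    totals = defaultdict(float)
    for label, weight in zip(labels, weights):
        totals[label] += weight
    return max(totals, key=totals.get)

def train_stump(X, y, w):
    """Weighted one-level tree: choose (feature, threshold) minimizing weighted error."""
    best_err, best = float("inf"), None
    for f in range(len(X[0])):
        for t in sorted(set(row[f] for row in X)):
            left = [i for i, row in enumerate(X) if row[f] <= t]
            right = [i for i, row in enumerate(X) if row[f] > t]
            if not left or not right:
                continue
            l_lab = weighted_majority([y[i] for i in left], [w[i] for i in left])
            r_lab = weighted_majority([y[i] for i in right], [w[i] for i in right])
            pred = [l_lab if row[f] <= t else r_lab for row in X]
            err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
            if err < best_err:
                best_err, best = err, (f, t, l_lab, r_lab)
    if best is None:                       # no useful split: constant classifier
        lab = weighted_majority(y, w)
        return lambda row: lab
    f, t, l_lab, r_lab = best
    return lambda row: l_lab if row[f] <= t else r_lab

def boost(X, y, rounds=10):
    """Reweight instances and combine the learned classifiers by weighted vote."""
    n = len(y)
    w = [1.0 / n] * n                      # equal initial weights
    ensemble = []                          # (classifier, voting weight) pairs
    for _ in range(rounds):
        h = train_stump(X, y, w)
        miss = [h(row) != yi for row, yi in zip(X, y)]
        eps = sum(wi for wi, m in zip(w, miss) if m)
        if eps == 0:                       # all instances correctly classified
            ensemble.append((h, 1.0))
            break
        if eps >= 0.5:                     # weak learner no better than chance
            if not ensemble:
                ensemble.append((h, 1.0))
            break
        beta = eps / (1.0 - eps)
        # Decrease the weights of correctly classified instances; normalization
        # then increases the relative weight of the misclassified ones.
        w = [wi * (beta if not m else 1.0) for wi, m in zip(w, miss)]
        total = sum(w)
        w = [wi / total for wi in w]
        ensemble.append((h, math.log(1.0 / beta)))

    def classify(row):
        votes = defaultdict(float)
        for h, alpha in ensemble:          # vote weighted by classifier accuracy
            votes[h(row)] += alpha
        return max(votes, key=votes.get)
    return classify
```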

