Decision trees are used as predictive models for two main purposes: classification and regression. In classification tasks the goal is to label each observation with one of a limited number of categories. For example, we may want to classify credit card applications into two classes: high risk and low risk. In regression trees the objective is to estimate a continuous value, such as the expected revenue from each client in the coming year. In this article we limit the discussion to classification trees.
Before describing what a good decision tree looks like, let’s first see how predictions are made from tree models. Figure 1 shows a simple classification tree with two splits. DV is the name of the dependent variable, which represents the response to a marketing campaign.
Figure 1: Simple decision tree with two splits
The red highlighted section in each node shows the percentage of responders in that node. For example, node 8 has an expected response rate of 71.59%, while node 4 has only 1.94%. These values are the probabilities of response that the model outputs. If we set the cut-off at a probability of 0.5, then only records that satisfy the rules of node 8 receive a predicted response of “Yes”. In this case, 13% of the records will be classified by this tree as responders; this is the size of node 8 as shown in Figure 1. Applying the model this way results in a predicted response rate that is lower than the actual response rate of 23%, as shown in the root node in Figure 1.
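As a concrete sketch, the scoring logic just described can be written out in a few lines of Python. The response rates are the ones read off Figure 1 (and the 37.77% for node 7 quoted later in the article); the 0.5 cut-off is the one used in the text.

```python
CUTOFF = 0.5

def classify(prob_yes, cutoff=CUTOFF):
    """Turn a terminal node's response probability into a class label."""
    return "Yes" if prob_yes >= cutoff else "No"

# Response rates read off Figure 1 (nodes 4, 7 and 8; other nodes omitted)
node_probs = {4: 0.0194, 7: 0.3777, 8: 0.7159}

labels = {node: classify(p) for node, p in node_probs.items()}
print(labels)  # only node 8 clears the 0.5 cut-off
```

Since node 8 holds 13% of the records, exactly that share of the dataset ends up with a predicted “Yes” under this cut-off.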
In general, a good model has four characteristics:
- Accurate (pure terminal nodes)
- Robust (generalizes to new data)
- Small (simple)
- Sensible (its rules make sense to subject matter experts)
As shown in Figure 1, the probability assigned by a classification tree is the percentage of the target category in each terminal node. The purity of each terminal node therefore determines the accuracy of the tree’s predictions. For example, predictions from node 4 will be more accurate than those from node 7. Applying the cut-off value of 0.5 to node 4 makes the tree classify all records that satisfy its conditions as non-responders, with an error rate of only 1.94% (the percentage of “Yes” responders in that node). Applying the same rule to node 7 also classifies all its records as non-responders, but with an expected error rate of 37.77%.
The above example confirms our requirement that a decision tree model should have its terminal nodes as pure as possible.
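The error rates quoted above follow directly from the cut-off rule: once every record in a node receives the same label, the node’s error rate is simply the share of the other class in that node. A minimal sketch:

```python
def node_error_rate(p_yes, cutoff=0.5):
    """Error rate when all records in a node get the label implied by the
    cut-off: "Yes" if p_yes >= cutoff, else "No". The errors are then the
    records belonging to the other class."""
    return 1.0 - p_yes if p_yes >= cutoff else p_yes

# Response rates from Figure 1
print(node_error_rate(0.0194))  # node 4: all labelled "No", error 1.94%
print(node_error_rate(0.3777))  # node 7: all labelled "No", error 37.77%
print(node_error_rate(0.7159))  # node 8: all labelled "Yes", error 28.41%
```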
Figure 2: The tree model in Figure 1 with an additional split (off node 4)
Node 10 in Figure 2 represents only 0.57% of the records in the dataset. Although it isolates most of the responders from node 4, it is a very small node and is most likely the result of chance. If we use the rule defined by this node to score another dataset, we should not expect many records to actually fall in it and receive its probability prediction (a response rate of 18.28%). In other words, a tree containing this node is not robust: its predictions do not generalize to other datasets.
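One simple guard against such chance nodes is a minimum-size screen on the terminal nodes. The sketch below is illustrative: the 1% threshold and the shares of the nodes other than node 10 are assumptions, not values from the figure.

```python
def flag_unstable_nodes(node_shares, min_share=0.01):
    """Return terminal nodes whose share of the records falls below
    min_share (an illustrative threshold, not a standard value)."""
    return [node for node, share in node_shares.items() if share < min_share]

# Node 10 holds 0.57% of the records (Figure 2); the other shares are made up
shares = {8: 0.13, 9: 0.05, 10: 0.0057, 11: 0.29}
print(flag_unstable_nodes(shares))  # [10]
```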
This example shows that the robustness requirement is at odds with the accuracy requirement. Although we want the terminal nodes to be as pure as possible, we cannot allow them to become very small (in number of records). Striking this balance is often hard and requires some experimentation with the tree. Interactive tree software, such as that offered by Angoss, makes it much easier to reach the required balance between accuracy and robustness.
The next characteristic of a good tree model is that it should be simple, meaning it should not grow into a very large tree. To control the size of the tree we need to set limits on the number of branches in each split as well as on the depth of the tree; modern decision tree software should let the user control both easily. There are also automated methods for controlling tree size. Most of them grow a large tree first and then prune it back; others prune the tree while it is being grown.
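As an illustration of these controls, scikit-learn’s `DecisionTreeClassifier` exposes a depth limit, a minimum terminal-node size, and cost-complexity post-pruning (the grow-then-prune approach). The dataset here is synthetic and the parameter values are arbitrary choices for the sketch, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a response dataset
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

clf = DecisionTreeClassifier(
    max_depth=4,          # cap the depth of the tree
    min_samples_leaf=50,  # no terminal node smaller than 50 records
    ccp_alpha=0.001,      # cost-complexity pruning: grow large, prune back
    random_state=0,
)
clf.fit(X, y)

print(clf.get_depth(), clf.get_n_leaves())
```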
The last characteristic of a good tree model is that most of the rules extracted from the tree should make sense to subject matter experts. For example, the data scientist who constructed the tree should be able to explain its logic to their manager. A common data issue that decision trees expose (not cause!) is Simpson’s Paradox, where a trend observed in aggregated data reverses when the data is split by another variable.
Explaining trees to a non-analytical audience can be made easier by translating some of the tree rules into plain English or into charts, such as the profile chart shown in Figure 3.
Figure 3: Profile chart of the tree in Figure 2
In the profile chart shown in Figure 3, each bar represents a terminal node in the tree. The baseline (the dotted line in the middle of the chart) represents the response rate in the population (the root node), the height of each bar represents the percentage of “Yes” responses in that node, and the width of each bar represents the size of the node (its number of records).
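The ingredients of the chart can be computed directly from the terminal nodes. In this sketch the record counts are hypothetical (chosen so that the weighted average comes out near the 23% root-node rate); only the structure follows the figure: heights are response rates, widths are record shares, and the baseline is the population rate.

```python
def profile_chart_data(nodes):
    """nodes maps node id -> (response_rate, n_records).
    Returns per-node bar heights and widths plus the baseline."""
    total = sum(n for _, n in nodes.values())
    heights = {k: rate for k, (rate, _) in nodes.items()}
    widths = {k: n / total for k, (_, n) in nodes.items()}
    baseline = sum(rate * n for rate, n in nodes.values()) / total
    return heights, widths, baseline

# Hypothetical terminal nodes: (response rate, record count)
nodes = {4: (0.0194, 5200), 7: (0.3777, 3500), 8: (0.7159, 1300)}
heights, widths, baseline = profile_chart_data(nodes)
print(widths, baseline)
```

The bar widths always sum to 1, and the baseline is just the size-weighted average of the bar heights.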
The profile chart can quickly show the efficiency of the tree as a classification model by comparing it to the perfect classifier. The perfect classifier would be a tree with one split, as shown in Figure 4.
Figure 4: Perfect classification tree with the equivalent profile chart
The profile chart in this case comprises only two bars: one with zero response and one with 100% response.