Topic 15 Decision Trees using R

Some references: Boehmke and Greenwell (2019), Hastie et al. (2013) and Lantz (2019)

In this section we discuss tree-based methods for classification and regression.

  • Tree-based models are a class of non-parametric algorithms that work by partitioning the feature space into a number of smaller (non-overlapping) regions with similar response values using a set of splitting rules.

  • These involve stratifying or segmenting the predictor space into a number of simple regions.

  • To make a prediction, we typically use the mean or the mode of the training observations in the region to which the new observation belongs.

  • Since the set of splitting rules used to segment the predictor space can be summarized in a tree, these types of approaches are known as decision tree methods.

  • Classification and Regression Trees (CART) (Breiman et al. (1984)) is the most well-known decision tree algorithm.

    • CART uses binary recursive partitioning: Each split depends on the split above (before) it.

  • A basic decision tree partitions the training data into homogeneous subgroups (i.e., groups with similar response values) and then fits a simple constant in each subgroup (e.g., the mean of the within group response values for regression).

  • The subgroups (also called nodes) are formed recursively using binary partitions formed by asking simple yes-or-no questions about each feature.

  • This is done a number of times until a suitable stopping criterion is satisfied (e.g., a maximum depth of the tree is reached).

  • After all the partitioning has been done, the model predicts the output based on

      1. the average response value of all observations that fall in that subgroup (regression problems), or
      2. the class that has majority representation among the observations in that subgroup (classification problems).
  • Root node: the first subgroup, which contains all of the training data

  • Terminal node (or leaf node): a final subgroup that is not split any further

  • Internal node: a subgroup between the root and the terminal nodes

  • Branches: the connections between the nodes
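A minimal sketch of these ideas using the rpart package (the standard CART implementation that ships with R); the built-in iris and mtcars datasets stand in for a generic classification and regression problem. In the printed tree, node 1 is the root and rows marked with * are terminal (leaf) nodes:

```r
library(rpart)  # CART implementation; a recommended package shipped with R

# Classification tree: each leaf predicts the majority class in its subgroup
fit_cls <- rpart(Species ~ ., data = iris, method = "class")
print(fit_cls)  # node 1 is the root; rows marked * are terminal (leaf) nodes
predict(fit_cls, head(iris), type = "class")  # first rows fall in a pure setosa leaf

# Regression tree: each leaf predicts the mean response of its subgroup
fit_reg <- rpart(mpg ~ wt + hp, data = mtcars, method = "anova")
predict(fit_reg, head(mtcars))

# Base-graphics plot of the classification tree; branches connect the nodes
plot(fit_cls, margin = 0.1)
text(fit_cls, use.n = TRUE)
```

Here method = "class" and method = "anova" select a classification and a regression tree respectively; rpart decides when to stop splitting via its default control parameters (see ?rpart.control).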

We will apply the CART method to a credit risk example.

Two types of risk are associated with the bank's decision:

  • If the applicant is a good credit risk, i.e., is likely to repay the loan, then not approving the loan results in a loss of business to the bank.
  • If the applicant is a bad credit risk, i.e., is not likely to repay the loan, then approving the loan results in a financial loss to the bank.
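As a sketch of how such a model might be fitted, the code below runs CART on a small simulated dataset; the variable names (income, debt_ratio, age, risk) and the rule used to generate the labels are illustrative assumptions, not the actual credit data analysed in this example:

```r
library(rpart)

set.seed(1)
n <- 500
# Hypothetical applicant features (simulated for illustration only)
credit <- data.frame(
  income     = round(rlnorm(n, meanlog = 10, sdlog = 0.5)),
  debt_ratio = runif(n, 0, 1),
  age        = sample(21:70, n, replace = TRUE)
)
# Illustrative labelling rule: high debt ratio plus low income -> "bad" risk
credit$risk <- factor(ifelse(credit$debt_ratio > 0.6 & credit$income < 25000,
                             "bad", "good"))

# Fit a classification tree by binary recursive partitioning
fit <- rpart(risk ~ income + debt_ratio + age, data = credit, method = "class")
print(fit)

# Classify a new (hypothetical) applicant
predict(fit, data.frame(income = 20000, debt_ratio = 0.8, age = 35),
        type = "class")
```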

This analysis is an example and is not an exhaustive survey of the methods available for data description, visualisation, or machine learning using R.

References

Boehmke, Brad, and Brandon M. Greenwell. 2019. Hands-On Machine Learning with R. CRC Press. https://bradleyboehmke.github.io/HOML/.
Breiman, Leo, Jerome Friedman, Charles J. Stone, and Richard A. Olshen. 1984. Classification and Regression Trees. CRC Press.
Hastie, Trevor, Robert Tibshirani, Gareth James, and Daniela Witten. 2013. An Introduction to Statistical Learning: With Applications in R. Springer New York.
Lantz, Brett. 2019. Machine Learning with R, 3rd ed. Packt Publishing. https://app.knovel.com/hotlink/toc/id:kpMLRE000A/machine-learning-with/machine-learning-with.