Table of Contents

1 Decision Tree

1.1 Goals

  • Know what a decision tree is.

Pick the feature that reduces the entropy the most, and repeat until the entropy drops to 0: a greedy search.

  • Give examples of functions which can't be represented compactly (e.g. majority, parity)
  • Be able to fit a decision tree using a recursive greedy strategy.

Three steps: feature selection, decision tree generation, and decision tree pruning.

Feature selection: compute the information gain of each feature (as in ID3, C4.5, and CART).

The tree is generated by recursive searching.

  • Pruning a tree

The standard approach is therefore to grow a “full” tree, and then to perform pruning. This can be done using a scheme that prunes the branches giving the least increase in the error.

Cost function: CART measures node impurity with the Gini index; see the sketch after this list.

  • What is the information gain criterion, and why does it produce better splits than classification accuracy?
  • Be aware that decision trees can be unstable, in that the structure changes dramatically with respect to small changes in the training data.
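As a small illustration of the CART cost function mentioned above, here is a minimal Python sketch of the Gini impurity of a set of class labels (the function name `gini` and the toy labels are my own, purely for illustration):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2, where p_k is the fraction of class k."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

# A pure node has impurity 0; a 50/50 split of two classes has impurity 0.5.
print(gini(['yes', 'yes', 'yes']))        # 0.0
print(gini(['yes', 'no', 'yes', 'no']))   # 0.5
```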

1.2 Terminology

  • CART

Classification and regression trees, or CART models, also called decision trees, are defined by recursively partitioning the input space and defining a local model in each resulting region of input space. This can be represented by a tree, with one leaf per region.

1.3 Tree construction

To build a decision tree, you need to make a first decision on the dataset to dictate which feature is used to split the data. To determine this, you try every feature and measure which split will give you the best results. After that, you’ll split the dataset into subsets. The subsets will then traverse down the branches of the first decision node. If the data on the branches is the same class, then you’ve properly classified it and don’t need to continue splitting it. If the data isn’t the same, then you need to repeat the splitting process on this subset. The decision on how to split this subset is done the same way as the original dataset, and you repeat this process until you’ve classified all the data.

1.3.1 Pseudo code

```text
Check if every item in the dataset is in the same class:
    If so return the class label
    Else
        find the best feature to split the data
        split the dataset
        create a branch node
        for each split
            call createBranch and add the result to the branch node
        return branch node
```
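A runnable Python sketch of this recursive procedure, assuming the dataset is a list of rows with the class label in the last position and using information gain to choose the split feature; the names (`create_tree`, `best_feature`, the toy dataset) are illustrative, not from any particular library:

```python
from collections import Counter
from math import log2

def entropy(rows):
    """Shannon entropy of the class labels (last column of each row)."""
    counts = Counter(row[-1] for row in rows)
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def split(rows, feature, value):
    """Rows whose `feature` equals `value`, with that feature column removed."""
    return [row[:feature] + row[feature + 1:] for row in rows if row[feature] == value]

def best_feature(rows):
    """Index of the feature whose split gives the largest information gain."""
    base = entropy(rows)
    best_gain, best_idx = 0.0, -1
    for i in range(len(rows[0]) - 1):
        values = set(row[i] for row in rows)
        new_entropy = sum(len(sub) / len(rows) * entropy(sub)
                          for sub in (split(rows, i, v) for v in values))
        if base - new_entropy > best_gain:
            best_gain, best_idx = base - new_entropy, i
    return best_idx

def create_tree(rows, feature_names):
    labels = [row[-1] for row in rows]
    if labels.count(labels[0]) == len(labels):   # all rows share one class
        return labels[0]
    if len(rows[0]) == 1:                        # no features left: majority vote
        return Counter(labels).most_common(1)[0][0]
    idx = best_feature(rows)
    if idx == -1:                                # no split improves entropy
        return Counter(labels).most_common(1)[0][0]
    name = feature_names[idx]
    remaining = feature_names[:idx] + feature_names[idx + 1:]
    tree = {name: {}}
    for value in set(row[idx] for row in rows):
        tree[name][value] = create_tree(split(rows, idx, value), remaining)
    return tree

# Toy dataset: two binary features, label in the last column.
data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(create_tree(data, ['no surfacing', 'flippers']))
# {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
```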

1.3.2 Information gain

The measure of information of a set is known as Shannon entropy; entropy is defined as the expected value of the information. The information of an outcome \(x_i\) is \[l(x_i) = -\log_2 p(x_i)\]

  1. Entropy, or deviance, is the expected value of the information over all possible values of our class (a short code sketch follows the goals below).

    \[H = -\sum_{i=1}^n p(x_i) \log_2 p(x_i)\]

    Goals:

    • Understand the notion of entropy of a discrete random variable.
    • What is the largest possible entropy of a discrete random variable which takes on r possible values?
    • Know the definitions of joint entropy and conditional entropy.
    • Derive the chain rule for writing joint entropy as a sum of conditional entropies.
    • Show that the entropy of a set of independent random variables is the sum of the individual entropies.
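A small Python sketch of the entropy of a discrete random variable, using only the standard library (the helper name `entropy` is my own). Note that a variable taking r equally likely values attains the largest possible entropy, log_2 r bits, which answers the second goal above:

```python
from math import log2

def entropy(probs):
    """H(p) = -sum_i p_i * log2(p_i), ignoring zero-probability outcomes."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))           # fair coin: 1.0 bit
print(entropy([1.0, 0.0]))           # deterministic outcome: 0.0 bits
print(entropy([0.25] * 4), log2(4))  # uniform over r = 4 values: 2.0 bits = log2(r)
```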

1.3.3 Conditional entropy

\[H(Y|X) = H(X,Y) - H(X)\]

1.3.4 Mutual information

\[I(X;Y) = H(Y) - H(Y|X)\]
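A hedged sketch that estimates H(Y), the conditional entropy H(Y|X), and the mutual information I(X;Y) = H(Y) - H(Y|X) from paired lists of samples; this is exactly the information gain of splitting on X. The helper names are my own:

```python
from collections import Counter
from math import log2

def entropy_of_counts(counts):
    """Shannon entropy of an empirical distribution given by a list of counts."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def information_gain(xs, ys):
    """I(X;Y) = H(Y) - H(Y|X), estimated from paired samples xs, ys."""
    h_y = entropy_of_counts(list(Counter(ys).values()))
    h_y_given_x = 0.0
    n = len(xs)
    for x_value in set(xs):
        subset = [y for x, y in zip(xs, ys) if x == x_value]
        h_y_given_x += len(subset) / n * entropy_of_counts(list(Counter(subset).values()))
    return h_y - h_y_given_x

# X perfectly predicts Y -> gain equals H(Y); X independent of Y -> gain is 0.
ys = ['yes', 'yes', 'no', 'no']
print(information_gain(['a', 'a', 'b', 'b'], ys))  # 1.0
print(information_gain(['a', 'b', 'a', 'b'], ys))  # 0.0
```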

1.4 Pros and cons of trees

1.4.1 Pros:

  • they are easy to interpret.
  • they can easily handle mixed discrete and continuous inputs.
  • they are insensitive to monotone transformations of the inputs (because the split points are based on ranking the data points).
  • they perform automatic variable selection.
  • they are relatively robust to outliers.
  • they scale well to large data sets.
  • they can be modified to handle missing inputs.

1.4.2 Cons:

  • they do not predict very accurately compared to other kinds of model. This is in part due to the greedy nature of the tree construction algorithm.
  • A related problem is that trees are unstable: small changes to the input data can have large effects on the structure of the tree, due to the hierarchical nature of the tree-growing process, causing errors at the top to affect the rest of the tree.

2 Random Forest

Random forests are a machine learning algorithm that averages the predictions of many decision trees, each fit to a random resampling of the training data and restricted to a random subset of the input features at each split. They are widely used because they often perform very well with almost no parameter tuning.
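A minimal sketch of fitting a random forest with scikit-learn (assuming scikit-learn is installed; the dataset is synthetic and purely illustrative). The `max_features` argument controls the size of the random feature subset considered at each split:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data purely for illustration.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees; each split considers a random subset of sqrt(n_features) features.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```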

2.1 Context

This concept has the prerequisites:

  • decision trees (Random forests are ensembles of decision trees.)
  • bagging (Bagging is a part of the random forest algorithm.)

Bagging is a technique for reducing the variance of a learning algorithm by averaging the predictions obtained from random resamplings of the training data. It can improve the performance of unstable algorithms such as decision trees; see the sketch after this list.

  • generalization (Averaging over random resamplings is intended to improve generalization performance.)
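A from-scratch sketch of bagging, using scikit-learn only for the base decision trees (the helper names and toy data are my own): each tree is fit to a bootstrap resample of the rows, and predictions are combined by majority vote.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def fit_bagged_trees(X, y, n_trees=50, seed=0):
    """Fit each tree on a bootstrap resample (sampling rows with replacement)."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap indices
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def predict_majority(trees, X):
    """Majority vote over the ensemble's predictions."""
    votes = np.stack([t.predict(X) for t in trees])  # shape (n_trees, n_samples)
    return np.array([np.bincount(sample_votes).argmax()
                     for sample_votes in votes.T.astype(int)])

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
trees = fit_bagged_trees(X, y, n_trees=25)
print("training accuracy of the bagged ensemble:", (predict_majority(trees, X) == y).mean())
```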

2.2 Goals

  • Know the basic random forest algorithm
  • What effect does varying the number of features have? What are the advantages of larger or smaller values?
  • How do you determine the relevance of each of the input features to the classification?
  • How do you estimate out-of-sample error as the training is progressing? (See the sketch below.)
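Regarding the last two goals, a hedged scikit-learn sketch: `feature_importances_` gives one common measure of each input feature's relevance (mean impurity decrease across the trees), and `oob_score=True` estimates out-of-sample accuracy from the bootstrap samples each tree never saw (the out-of-bag samples). The synthetic dataset is purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=3, random_state=0)

forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)

print("out-of-bag accuracy estimate:", forest.oob_score_)
# Impurity-based importances, one per input feature, normalized to sum to 1.
for i, importance in enumerate(forest.feature_importances_):
    print(f"feature {i}: {importance:.3f}")
```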