How to Fit Classification and Regression Trees in R (2024)

When the relationship between a set of predictor variables and a response variable is linear, methods like multiple linear regression can produce accurate predictive models.

However, when the relationship between a set of predictors and a response is more complex, then non-linear methods can often produce more accurate models.

One such method is classification and regression trees (CART), which use a set of predictor variables to build decision trees that predict the value of a response variable.

If the response variable is continuous then we can build regression trees and if the response variable is categorical then we can build classification trees.
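A fitted regression tree is, at prediction time, just a chain of if/else rules on the predictors, where each leaf returns the mean response of the training observations that fell into it. The sketch below hand-codes such a chain; the 4.5-year split and the 225.83 and 502.81 leaf values echo the example that follows, while the 15-home-run split and 950.00 leaf are made up purely for illustration:

```r
#toy regression tree written by hand (splits partly hypothetical, not fitted):
#each root-to-leaf path returns that leaf's mean response
predict_toy_tree <- function(years, hmrun) {
  if (years < 4.5) {
    return(225.83)   #leaf: low-experience players
  }
  if (hmrun < 15) {  #hypothetical split point
    return(502.81)   #leaf: experienced players with fewer home runs
  }
  return(950.00)     #hypothetical leaf value
}

predict_toy_tree(7, 4)  #returns 502.81
```

Fitting a tree means learning these split variables, split points, and leaf values from data, which is what rpart does below.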

This tutorial explains how to build both regression and classification trees in R.

Example 1: Building a Regression Tree in R

For this example, we’ll use the Hitters dataset from the ISLR package, which contains various information about 263 professional baseball players.

We will use this dataset to build a regression tree that uses the predictor variables home runs and years played to predict the Salary of a given player.

Use the following steps to build this regression tree.

Step 1: Load the necessary packages.

First, we’ll load the necessary packages for this example:

library(ISLR)       #contains Hitters dataset
library(rpart)      #for fitting decision trees
library(rpart.plot) #for plotting decision trees

Step 2: Build the initial regression tree.

First, we’ll build a large initial regression tree. We can ensure that the tree is large by using a small value for cp, which stands for “complexity parameter.”

This means we will perform new splits on the regression tree as long as the overall R-squared of the model increases by at least the value specified by cp.
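To make that rule concrete, here is a small sketch of the stopping criterion using the rel error values that printcp reports for the first split in this example (1.00000 with no splits, 0.75325 after one split): the improvement of about 0.247 easily clears cp = .0001, so the split is kept.

```r
#sketch of the cp stopping rule: a split is kept only if it improves
#the model's relative error by at least cp
cp <- 0.0001
rel_error_before <- 1.00000  #rel error with 0 splits (from printcp output)
rel_error_after  <- 0.75325  #rel error after the first split
improvement <- rel_error_before - rel_error_after
improvement >= cp            #TRUE, so the first split is kept
```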

We’ll then use the printcp() function to print the results of the model:

#build the initial tree
tree <- rpart(Salary ~ Years + HmRun, data=Hitters,
              control=rpart.control(cp=.0001))

#view results
printcp(tree)

Variables actually used in tree construction:
[1] HmRun Years

Root node error: 53319113/263 = 202734

n=263 (59 observations deleted due to missingness)

           CP nsplit rel error  xerror    xstd
1  0.24674996      0   1.00000 1.00756 0.13890
2  0.10806932      1   0.75325 0.76438 0.12828
3  0.01865610      2   0.64518 0.70295 0.12769
4  0.01761100      3   0.62652 0.70339 0.12337
5  0.01747617      4   0.60891 0.70339 0.12337
6  0.01038188      5   0.59144 0.66629 0.11817
7  0.01038065      6   0.58106 0.65697 0.11687
8  0.00731045      8   0.56029 0.67177 0.11913
9  0.00714883      9   0.55298 0.67881 0.11960
10 0.00708618     10   0.54583 0.68034 0.11988
11 0.00516285     12   0.53166 0.68427 0.11997
12 0.00445345     13   0.52650 0.68994 0.11996
13 0.00406069     14   0.52205 0.68988 0.11940
14 0.00264728     15   0.51799 0.68874 0.11916
15 0.00196586     16   0.51534 0.68638 0.12043
16 0.00016686     17   0.51337 0.67577 0.11635
17 0.00010000     18   0.51321 0.67576 0.11615

Step 3: Prune the tree.

Next, we’ll prune the regression tree to find the optimal value to use for cp (the complexity parameter) that leads to the lowest test error.

Note that the optimal value for cp is the one that leads to the lowest xerror in the previous output, which represents the error on the observations from the cross-validation data.

#identify best cp value to use
best <- tree$cptable[which.min(tree$cptable[,"xerror"]),"CP"]

#produce a pruned tree based on the best cp value
pruned_tree <- prune(tree, cp=best)

#plot the pruned tree
prp(pruned_tree,
    faclen=0,   #use full names for factor labels
    extra=1,    #display number of obs. for each terminal node
    roundint=F, #don't round to integers in output
    digits=5)   #display 5 decimal places in output

[Plot: pruned regression tree]

We can see that the final pruned tree has six terminal nodes. Each terminal node shows the predicted salary of players in that node along with the number of observations from the original dataset that belong to that node.

For example, we can see that in the original dataset there were 90 players with less than 4.5 years of experience and their average salary was $225.83k.
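That $225.83k figure is simply the mean salary of the 90 observations in that terminal node; a regression tree's leaf prediction is always the mean response of the training observations it contains. A toy illustration with hypothetical salaries (not the actual Hitters data):

```r
#hypothetical salaries (in $1000s) for players falling into one terminal node;
#the node's prediction is the mean response of those observations
node_salaries <- c(150, 200, 275, 280)
prediction <- mean(node_salaries)
prediction  #226.25
```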


Step 4: Use the tree to make predictions.

We can use the final pruned tree to predict a given player’s salary based on their years of experience and average home runs.

For example, a player who has 7 years of experience and 4 average home runs has a predicted salary of $502.81k.


We can use the predict() function in R to confirm this:

#define new player
new <- data.frame(Years=7, HmRun=4)

#use pruned tree to predict salary of this player
predict(pruned_tree, newdata=new)

502.8079

Example 2: Building a Classification Tree in R

For this example, we’ll use the ptitanic dataset from the rpart.plot package, which contains various information about passengers aboard the Titanic.

We will use this dataset to build a classification tree that uses the predictor variables class, sex, and age to predict whether or not a given passenger survived.

Use the following steps to build this classification tree.

Step 1: Load the necessary packages.

First, we’ll load the necessary packages for this example:

library(rpart)      #for fitting decision trees
library(rpart.plot) #for plotting decision trees

Step 2: Build the initial classification tree.

First, we’ll build a large initial classification tree. We can ensure that the tree is large by using a small value for cp, which stands for “complexity parameter.”

This means we will perform new splits on the classification tree as long as the overall fit of the model increases by at least the value specified by cp.

We’ll then use the printcp() function to print the results of the model:

#build the initial tree
tree <- rpart(survived ~ pclass + sex + age, data=ptitanic,
              control=rpart.control(cp=.0001))

#view results
printcp(tree)

Variables actually used in tree construction:
[1] age    pclass sex

Root node error: 500/1309 = 0.38197

n= 1309

      CP nsplit rel error xerror     xstd
1 0.4240      0     1.000  1.000 0.035158
2 0.0140      1     0.576  0.576 0.029976
3 0.0095      3     0.548  0.578 0.030013
4 0.0070      7     0.510  0.552 0.029517
5 0.0050      9     0.496  0.528 0.029035
6 0.0025     11     0.486  0.532 0.029117
7 0.0020     19     0.464  0.536 0.029198
8 0.0001     22     0.458  0.528 0.029035

Step 3: Prune the tree.

Next, we’ll prune the classification tree to find the optimal value to use for cp (the complexity parameter) that leads to the lowest test error.

Note that the optimal value for cp is the one that leads to the lowest xerror in the previous output, which represents the error on the observations from the cross-validation data.

#identify best cp value to use
best <- tree$cptable[which.min(tree$cptable[,"xerror"]),"CP"]

#produce a pruned tree based on the best cp value
pruned_tree <- prune(tree, cp=best)

#plot the pruned tree
prp(pruned_tree,
    faclen=0,   #use full names for factor labels
    extra=1,    #display number of obs. for each terminal node
    roundint=F, #don't round to integers in output
    digits=5)   #display 5 decimal places in output

[Plot: pruned classification tree]

We can see that the final pruned tree has 10 terminal nodes. Each terminal node shows the number of passengers that died along with the number that survived.

For example, in the far left node we see that 664 passengers died and 136 survived.


Step 4: Use the tree to make predictions.

We can use the final pruned tree to predict the probability that a given passenger will survive based on their class, age, and sex.

For example, a male passenger who is in 1st class and is 8 years old has a survival probability of 11/29 = 37.9%.
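A classification tree's predicted probability for a leaf is just the class proportion among that leaf's training observations; here, 11 of the 29 passengers in the relevant leaf survived:

```r
#predicted survival probability for a leaf = survivors / total passengers in it
survived <- 11
total    <- 29
round(survived / total, 3)  #0.379, i.e. 37.9%
```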


