Introduction

Random forests have become a popular data mining algorithm because of their high prediction accuracy and their ability to handle a large number of input variables. However, this ensemble method is difficult to interpret. Analyzing every tree individually is inefficient. Moreover, the randomness involved in growing each tree means that individual trees provide little intuition as to why the overall model gives accurate predictions. This R Shiny app presents a new method to visualize the ensemble of trees generated by a random forest model.

Visualizations

This app adapts several common methods of interpreting random forests illustrated in (Leonawicz 2014): variable importance, proximity, and partial dependence. In addition, the Hamann similarity is derived as a measure of diversity among the ensemble of trees. It identifies trees that are more correlated with each other and provides a way to filter and present trees based on their similarity.

The Hamann similarity measure is suggested in (Gatnar 2005) to evaluate the diversity of all possible pairs of component classifiers. For each pair, we count the number of cases in which the two classifiers predict the same class, as shown in the first table below, and the number of cases in which they predict different classes, as shown in the second table below.

Predicted Same Class
             Correct   Incorrect
Correct      a1        0
Incorrect    0         d1

Predicted Different Class
             Correct   Incorrect
Correct      0         b2
Incorrect    c2        d2

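The Hamann similarity measure for the binary case is slightly modified to handle multiple classes: when both classifiers are incorrect, they may predict the same wrong class (d_1) or different wrong classes (d_2). The measure is defined as
\[ H=\frac{(a_{1}+d_{1})-(b_{2}+c_{2}+d_{2})}{a_{1}+d_{1}+b_{2}+c_{2}+d_{2}} \]
so the numerator is the number of cases on which the pair agrees minus the number on which it disagrees, and the denominator is the total number of cases. Hamann similarity is computed pairwise for each tree against all the remaining trees. The trees are then ranked based on Hamann similarity.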
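The counts in the two tables can be tallied directly from two vectors of predictions. The following is a minimal sketch in base R, not the app's actual code; the function name `hamann_similarity` and the argument names are hypothetical, and the assignment of b2 and c2 to the first and second tree is an assumption (H is symmetric in the two, so it does not affect the result).

```r
# Hamann similarity between two classifiers (e.g. two trees) on the same cases.
# truth, pred_a, pred_b: vectors of class labels of equal length (hypothetical names).
hamann_similarity <- function(truth, pred_a, pred_b) {
  stopifnot(length(truth) == length(pred_a), length(truth) == length(pred_b))
  truth  <- as.character(truth)
  pred_a <- as.character(pred_a)
  pred_b <- as.character(pred_b)

  same <- pred_a == pred_b   # both trees predict the same class
  a_ok <- pred_a == truth    # first tree is correct
  b_ok <- pred_b == truth    # second tree is correct

  a1 <- sum(same & a_ok)             # same class, both correct
  d1 <- sum(same & !a_ok)            # same class, both incorrect
  b2 <- sum(!same & a_ok & !b_ok)    # different classes, only the first correct
  c2 <- sum(!same & !a_ok & b_ok)    # different classes, only the second correct
  d2 <- sum(!same & !a_ok & !b_ok)   # different classes, both incorrect

  ((a1 + d1) - (b2 + c2 + d2)) / (a1 + d1 + b2 + c2 + d2)
}

# Toy example with made-up labels: a1 = 2, d1 = 2, b2 = 1, c2 = 0, d2 = 1
truth  <- c("OOK", "OOK",  "BPSK", "BPSK",  "OQPSK", "OQPSK")
tree_a <- c("OOK", "BPSK", "BPSK", "OOK",   "OQPSK", "BFSKA")
tree_b <- c("OOK", "BPSK", "BPSK", "OQPSK", "OOK",   "BFSKA")
h <- hamann_similarity(truth, tree_a, tree_b)   # (4 - 2) / 6
```

Because a1 + d1 is the number of agreements and b2 + c2 + d2 is the number of disagreements, H also equals `2 * mean(tree_a == tree_b) - 1`; the true labels only matter for splitting the counts into the individual table cells.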

Data

This app currently serves as a demo and only handles one specific data set, obtained from my MSc project work. There are 6 types of modulation signals (classes) in total: OOK, BPSK, OQPSK, and 3 types of BFSK. BFSKA is modulated at a medium intermediate frequency, BFSKB at a high intermediate frequency, and BFSKR2 at random intermediate frequencies within some range. There are 5 quantitative features and a total of 600 observations.

library(plyr)  # provides revalue()

# Read the comma-separated training data
urlTrain <- "datTrn_small.txt"
trainDat <- read.table(file = urlTrain, sep = ",", header = TRUE)

# Treat the class label as a factor and reset the row names to 1..n
trainDat$cl <- factor(trainDat$cl)
row.names(trainDat) <- 1:nrow(trainDat)

# Map the numeric class codes to modulation names
hashTable <- c("1" = "OOK", "2" = "BPSK", "3" = "OQPSK", "4" = "BFSKA",
               "5" = "BFSKB", "6" = "BFSKR2")
trainDat$cl <- revalue(trainDat$cl, hashTable)

Six sample rows of trainDat:

m1      m2       m3       m4  m5       cl
28.084  0.50321  0.55908  81  0.79173  BFSKA
55.311  0.63574  0.58586  81  0.75672  BFSKA
28.248  0.41817  0.48105  81  0.80030  BFSKA
46.533  0.72903  0.76270  81  0.77206  BFSKA
31.203  0.49942  0.76194  81  0.73678  BFSKA
31.698  0.88632  0.77689  81  0.76828  BFSKA


m1 to m5 are the input variables, and cl is the response variable (the modulation class).

Components of the App

We start by entering the number of trees. By default, the number of trees is 100. With 600 observations, 200 to 300 trees should be sufficient to achieve optimal predictions. This app currently does not scale well. The bottleneck is the Hamann similarity computation: because the similarity is computed for every pair of trees, the cost grows quadratically with the number of trees. I would not recommend fitting more than 500 trees. When we press the fit random forest button, we can track the computation progress in the upper right corner of the page.
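To see why the number of trees matters so much, the pairwise computation can be sketched in base R as below. This is an illustration under stated assumptions, not the app's implementation: `pairwise_hamann` is a hypothetical helper, and `preds` is a made-up matrix of predicted labels with one column per tree. The sketch uses the fact that the Hamann numerator is the number of cases on which two trees agree minus the number on which they disagree, so H for a pair equals twice the agreement rate minus one. If the forest is fitted with the randomForest package (an assumption; the text does not name the package), a matrix of this shape is available as `predict(fit, newdata, predict.all = TRUE)$individual`.

```r
# Pairwise Hamann similarity for all tree pairs.
# preds: n x T matrix of predicted class labels, one column per tree (hypothetical input).
pairwise_hamann <- function(preds) {
  n_trees <- ncol(preds)
  stopifnot(n_trees >= 1)
  H <- diag(1, n_trees)  # a tree always agrees with itself
  for (i in seq_len(n_trees - 1)) {
    for (j in (i + 1):n_trees) {
      agree <- mean(preds[, i] == preds[, j])  # share of cases with the same prediction
      H[i, j] <- H[j, i] <- 2 * agree - 1      # (agreements - disagreements) / n
    }
  }
  H
}

# Made-up predictions: 20 cases, 5 trees, 3 classes
set.seed(1)
classes <- c("OOK", "BPSK", "OQPSK")
preds <- matrix(sample(classes, 20 * 5, replace = TRUE), nrow = 20, ncol = 5)
H <- pairwise_hamann(preds)

# Number of distinct tree pairs for 100, 300 and 500 trees
n_pairs <- choose(c(100, 300, 500), 2)  # 4950 44850 124750
```

Going from 100 to 500 trees multiplies the number of pairs by roughly 25, and each pair requires a pass over all observations.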

After the model fitting and Hamann similarity computations are complete, a slider appears on the sidebar panel. This slider filters the trees of the fitted model based on Hamann similarity. The histogram on the main panel reflects the Hamann range selected with the slider: as we move the slider, the corresponding portion of the histogram is highlighted. The Hamann range statistics are shown to the left of the histogram, where the bracketed value is the percentile.
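The filtering step can be pictured with a few lines of base R. This is only a sketch: the per-tree scores and the slider range below are made-up values, and how the app condenses the pairwise similarities into one score per tree is not specified here.

```r
# Hypothetical per-tree Hamann scores (made-up values for illustration only)
scores <- c(0.42, 0.55, 0.61, 0.48, 0.70, 0.66, 0.52, 0.58)

# Range chosen with the slider (hypothetical values)
rng <- c(0.50, 0.65)

# Indices of the trees whose score falls inside the selected range
selected <- which(scores >= rng[1] & scores <= rng[2])

# Percentile of each range endpoint within the score distribution
pct <- 100 * ecdf(scores)(rng)
```

Here `selected` holds the trees that would be passed on to the plots, and `pct` plays the role of the bracketed percentile values next to the histogram.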

In addition to the Hamann range slider, we can choose the type of plot to render on the main panel. The 3 plot types are variable importance, proximity, and partial dependence. Each plot is rendered using only the trees selected by the Hamann range.

An additional selection box for an input variable is displayed on the sidebar panel below the plot type selection. The selected variable is required for the partial dependence plot.

Bibliography

Gatnar, Eugeniusz. 2005. “A Diversity Measure for Tree-Based Classifier Ensembles.” In Data Analysis and Decision Support, 30–38. Springer.

Leonawicz, Matthew. 2014. “Random Forest Examples with R and Shiny.” http://spark.rstudio.com/uafsnap/random_forest_example/.