Extended About on RF Ensemble Visualizer

Introduction
Visualizations
Data
Components of the App

Introduction

Random forests have become a very popular data mining algorithm because of their high prediction accuracy and their ability to handle large numbers of input variables. However, this ensemble method is difficult to interpret. Analyzing every tree individually is inefficient. Moreover, because each tree is grown with randomness, inspecting individual trees provides little intuition as to why the overall model gives accurate predictions. This R Shiny app presents a new method to visualize the ensemble of trees generated by a random forest model.

Visualizations

This app adapts several common methods of interpreting random forests illustrated in (Leonawicz 2014): variable importance, proximity, and partial dependence. In addition, Hamann similarity is derived as a measure of diversity amongst the ensemble of trees, to determine which trees are more correlated with each other. It provides a way to filter and present trees based on tree similarity.

The Hamann similarity measure is suggested in (Gatnar 2005) to evaluate the diversity of all possible pairs of component classifiers. It requires counting the number of cases in which the two classifiers predict the same class, as shown in the first table below, and the number of cases in which they predict different classes, as shown in the second table below.

Predicted Same Class

	Correct	Incorrect
Correct	a1	0
Incorrect	0	d1

Predicted Different Class

	Correct	Incorrect
Correct	0	b2
Incorrect	c2	d2

The binary-case Hamann similarity measure is slightly modified to incorporate multiple classes. The measure is defined as \[ H=\frac{(a_{1}+d_{1})-(b_{2}+c_{2}+d_{2})}{a_{1}+d_{1}+b_{2}+c_{2}+d_{2}} \] Hamann similarity is computed pairwise for each tree against all the remaining trees. The trees are then ranked based on Hamann similarity.
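The counting and ranking described above can be sketched in base R. This is a minimal illustration, not the app's actual code: the function name hamann_similarity, the toy prediction vectors, and the choice to score each tree by its mean similarity to the other trees are all assumptions made for the example. The rows of the tables are taken to refer to the first tree and the columns to the second, which does not affect H because b2 and c2 enter the formula symmetrically.

```r
# Hamann similarity between two trees (multi-class variant sketched from the
# tables above). pred_i, pred_j: predicted classes; truth: true classes.
hamann_similarity <- function(pred_i, pred_j, truth) {
  pred_i <- as.character(pred_i)
  pred_j <- as.character(pred_j)
  truth  <- as.character(truth)
  ok_i <- pred_i == truth   # first tree correct?
  ok_j <- pred_j == truth   # second tree correct?
  same <- pred_i == pred_j  # both trees predict the same class?

  a1 <- sum(same & ok_i)             # same class, both correct
  d1 <- sum(same & !ok_i)            # same class, both incorrect
  b2 <- sum(!same & ok_i & !ok_j)    # different class, only first tree correct
  c2 <- sum(!same & !ok_i & ok_j)    # different class, only second tree correct
  d2 <- sum(!same & !ok_i & !ok_j)   # different class, both incorrect

  ((a1 + d1) - (b2 + c2 + d2)) / (a1 + d1 + b2 + c2 + d2)
}

# Toy data (hypothetical): three trees predicting six cases
truth <- c("A", "A", "B", "B", "C", "C")
tree1 <- c("A", "A", "B", "C", "C", "A")
tree2 <- c("A", "B", "B", "C", "C", "B")
tree3 <- truth
preds <- cbind(tree1 = tree1, tree2 = tree2, tree3 = tree3)

# Pairwise similarity matrix; the diagonal is left as NA
n_trees <- ncol(preds)
H <- matrix(NA_real_, n_trees, n_trees,
            dimnames = list(colnames(preds), colnames(preds)))
for (i in seq_len(n_trees - 1)) {
  for (j in (i + 1):n_trees) {
    H[i, j] <- H[j, i] <- hamann_similarity(preds[, i], preds[, j], truth)
  }
}

# Assumed aggregation: score each tree by its mean similarity to the others
tree_score <- rowMeans(H, na.rm = TRUE)
ranking <- sort(tree_score, decreasing = TRUE)
```

In this toy example tree1 and tree2 agree on four of six cases, so their similarity is (4 - 2) / 6 = 1/3. H ranges from -1 (the trees never agree) to 1 (the trees always agree).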

Data

This app currently serves as a demo, and only handles a specific set of data. The data come from my MSc project work. There are 6 types of modulation signals (or classes) in total: OOK, BPSK, OQPSK, and 3 types of BFSK. BFSKA is modulated at a medium intermediate frequency, BFSKB is modulated at a high intermediate frequency, and BFSKR2 is modulated at random intermediate frequencies within some range. There are 5 quantitative features and a total of 600 observations.

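library(plyr)  # provides revalue()

# Read the comma-separated training data (the first row holds column names)
urlTrain <- "datTrn_small.txt"
trainDat <- read.table(file = urlTrain, sep = ",", header = TRUE)

# Treat the class label as a factor and reset the row names to 1..n
trainDat$cl <- factor(trainDat$cl)
row.names(trainDat) <- 1:nrow(trainDat)

# Map the numeric class codes to modulation names
hashTable <- c("1" = "OOK", "2" = "BPSK", "3" = "OQPSK", "4" = "BFSKA",
               "5" = "BFSKB", "6" = "BFSKR2")
trainDat$cl <- revalue(trainDat$cl, hashTable)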

m1	m2	m3	m4	m5	cl
28.084	0.50321	0.55908	81	0.79173	BFSKA
55.311	0.63574	0.58586	81	0.75672	BFSKA
28.248	0.41817	0.48105	81	0.80030	BFSKA
46.533	0.72903	0.76270	81	0.77206	BFSKA
31.203	0.49942	0.76194	81	0.73678	BFSKA
31.698	0.88632	0.77689	81	0.76828	BFSKA

m1 to m5 correspond to the input variables, and cl corresponds to the response variable.

Components of the App

We start by entering the number of trees. By default, the number of trees is 100. With 600 observations, 200-300 trees should be sufficient to achieve optimal predictions. This app currently does not scale well. The bottleneck occurs when deriving the Hamann similarity: because it is computed for every pair of trees, the number of comparisons grows quadratically with the number of trees. I would not recommend fitting more than 500 trees. When we press the fit random forest button, we can track the computation progress in the upper right corner of the page.
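To give a rough sense of how the pairwise Hamann computation scales, the number of tree pairs can be counted directly. This is only a back-of-the-envelope sketch in base R: it counts comparisons, assumes each unordered pair is evaluated once, and says nothing about the app's actual running time.

```r
# Number of unordered tree pairs for several forest sizes
n_trees <- c(100, 200, 300, 500)
n_pairs <- choose(n_trees, 2)

# Growth relative to the default of 100 trees
growth <- data.frame(n_trees = n_trees,
                     n_pairs = n_pairs,
                     relative_to_100 = n_pairs / n_pairs[1])
print(growth)
```

Going from 100 to 500 trees raises the pair count from 4950 to 124750, roughly a 25-fold increase for a 5-fold increase in trees.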

After the model fitting and Hamann similarity computations are complete, a slider appears on the side bar panel. This slider filters the trees of the fitted model based on Hamann similarity. The histogram on the main panel reflects the Hamann range selected with the slider. As we move the slider, portions of the histogram are highlighted. The Hamann range statistics are shown on the left side of the histogram, where the bracketed value corresponds to the percentile.
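The filtering step can be sketched in base R as follows. The per-tree scores, the variable names, and the use of the empirical distribution function for the bracketed percentile are all assumptions made for illustration; the app may compute these differently.

```r
# Hypothetical per-tree Hamann scores (one value per fitted tree)
tree_score <- c(0.62, 0.71, 0.55, 0.80, 0.68, 0.74, 0.59, 0.77)
names(tree_score) <- paste0("tree", seq_along(tree_score))

# Stand-in for the two ends of the slider
hamann_range <- c(0.60, 0.75)

# Indices of the trees whose score falls inside the selected range
selected <- which(tree_score >= hamann_range[1] &
                  tree_score <= hamann_range[2])

# Assumed percentile: share of trees with a score at or below each end
percentile <- ecdf(tree_score)(hamann_range)
```

Here four of the eight trees fall inside the range, and the two ends of the range sit at the 25th and 75th percentiles of the scores. The selected indices could then be used to restrict the plots to those trees.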

In addition to the Hamann range slider, we can choose the type of plot to render on the main panel. The 3 plot types are variable importance, proximity, and partial dependence. Each of these plots is rendered based on the trees selected from the Hamann range.

An additional input variable selection box is displayed on the side bar panel below the plot type selection. This input variable is required for the partial dependence plot.

Bibliography

Gatnar, Eugeniusz. 2005. “A Diversity Measure for Tree-Based Classifier Ensembles.” In Data Analysis and Decision Support, 30–38. Springer.

Leonawicz, Matthew. 2014. “Random Forest Examples with R and Shiny.” http://spark.rstudio.com/uafsnap/random_forest_example/.