Maximum Margin Interval Trees

Authors: Parismita Das, Alexandre Drouin, Torsten Hothorn and Toby Dylan Hocking


Overview

There are few R packages available for interval regression, a machine learning problem which is important in genomics and medicine. Like usual regression, the goal is to learn a function that inputs a feature vector and outputs a real-valued prediction. Unlike usual regression, each response in the training set is an interval of acceptable values (rather than one value). In the terminology of the survival analysis literature, this is regression with “left, right, and interval censored” output/response data.
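For concreteness, a censored response can be encoded as a pair of lower and upper limits; this is the two-column target.mat format used in the examples below. The following is a minimal sketch (not package code), with -Inf and Inf marking one-sided censoring:

## Minimal sketch (not package code): censored responses encoded as a
## two-column matrix of lower and upper limits, the target.mat format
## used throughout this document.
target.mat <- rbind(
  c(-Inf, 2.5),  # left censored: the true value is at most 2.5
  c( 1.0, Inf),  # right censored: the true value is at least 1.0
  c( 0.0, 1.0))  # interval censored: the true value lies in [0, 1]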

Max margin interval trees are a new type of nonlinear model for this problem (Drouin et al., 2017). A dynamic programming algorithm is used to compute the optimal solution in log-linear time. We show empirically that this algorithm achieves state-of-the-art speed and prediction accuracy in a benchmark of several data sets. The algorithm is implemented using the partykit package for building the tree structure and for visualization.

Specifically, the following learning algorithms are implemented in the mmit package:

  • mmit(): Max Margin Interval Trees
  • mmif(): random forests of Max Margin Interval Trees

For each algorithm, we implemented the following:

  • prediction on new data (predict.mmit(), predict.mmif())
  • hyperparameter selection via K-fold cross-validation (mmit.cv(), mmif.cv())
  • for single trees, minimum cost-complexity pruning (mmit.pruning())

This document is a short overview of the functionalities of the mmit package. An overview of the work done as part of Google Summer of Code is covered here




Installation

The mmit R package can be installed via the following R commands.

if(!require(devtools)) install.packages("devtools")
devtools::install_github("aldro61/mmit/Rpackage")

Building the tree

Breiman’s CART algorithm is widely used for learning decision trees, random forests and boosted decision trees. In the CART algorithm, trees are grown by recursively partitioning the leaves, each time maximizing some task-specific criterion. It can learn non-linear models from both numerical and categorical data of various scales, and it has a relatively low training time complexity. We therefore use the same framework to build maximum margin interval trees.

The Learning Setting

The training data is a set of labelled examples in which each target is an interval of acceptable values: \[S = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}\] where \(x_i \in \mathbb{R}^p\) is the feature vector (with \(p\) features) and \(y_i = [\underline{y_i}, \bar{y_i}]\) is the interval-valued target.

The Loss Function

We consider two types of loss, the linear hinge loss and the squared hinge loss. Because the targets are intervals, the loss is zero inside the interval and grows linearly or quadratically outside of it. For a model \(h\), we aim to minimise the convex objective \[\sum_{i=1}^{n}\left[\phi_l(\underline{y_i} - h(x_i) + \epsilon) + \phi_l(h(x_i) - \bar{y_i} + \epsilon)\right],\] where \(l\) is a convex function, \(\phi_l(x) = l((x)_+)\) is the corresponding hinge loss with \((x)_+\) denoting the positive part of \(x\), and \(\epsilon\) is a margin hyperparameter with a regularizing effect.

Linear hinge loss: \(l(x) = x\)

Squared hinge loss: \(l(x) = x^2\)
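To make the definition concrete, here is a minimal R sketch of the per-example loss (illustration only; example_loss is a hypothetical helper, not part of the mmit API):

## Per-example margin loss (illustration only; not part of the mmit API).
## phi_l(x) = l((x)_+), with l(x) = x for the linear hinge loss and
## l(x) = x^2 for the squared hinge loss.
example_loss <- function(pred, lower, upper, margin = 0, loss = "hinge") {
  pos <- function(x) pmax(x, 0)  # positive part (x)_+
  l <- if (loss == "hinge") identity else function(x) x^2
  l(pos(lower - pred + margin)) + l(pos(pred - upper + margin))
}
example_loss(0.5, lower = 0, upper = 1)                    # inside [0, 1]: loss 0
example_loss(3.0, lower = 0, upper = 1)                    # 2 above the interval: loss 2
example_loss(3.0, lower = 0, upper = 1, loss = "square")   # squared hinge: loss 4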

Splitting algorithm

As proposed in the paper, splits are selected using this margin-based loss function, which yields a sequence of convex optimization problems. A dynamic programming algorithm computes the optimal solution to all of these problems in log-linear time, yielding the optimal split point.

Partitioning Leaf

A leaf is partitioned into a left and a right child by solving two convex minimization problems, one per child, and choosing the split point that minimizes the sum of the costs of the left and right leaves.
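To illustrate what is being optimized, here is a brute-force sketch of split selection (leaf_cost and best_split are hypothetical helpers, reusing the example_loss sketch above; the package instead computes these costs with the dynamic programming algorithm described next):

## Brute-force illustration of split selection (not the package's
## log-linear dynamic programming implementation).
## The cost of a leaf is the minimum, over a constant prediction mu, of the
## summed per-example losses; the chosen split minimizes the sum of the
## left and right leaf costs.
leaf_cost <- function(lower, upper, margin = 0, loss = "hinge") {
  if (length(lower) == 0) return(0)
  bounds <- range(c(lower, upper), finite = TRUE)
  optimize(function(mu) sum(example_loss(mu, lower, upper, margin, loss)),
           interval = bounds)$objective
}
best_split <- function(x, lower, upper, margin = 0, loss = "hinge") {
  thresholds <- head(sort(unique(x)), -1)  # candidate split points on feature x
  costs <- vapply(thresholds, function(t) {
    left <- x <= t
    leaf_cost(lower[left], upper[left], margin, loss) +
      leaf_cost(lower[!left], upper[!left], margin, loss)
  }, numeric(1))
  list(threshold = thresholds[which.min(costs)], cost = min(costs))
}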

Dynamic programming

The dynamic programming algorithm is used to calculate the total loss of the optimal prediction. The sum of the hinge losses above is a convex piecewise function \(P_t(\mu)\) of the predicted value \(\mu\).

To track the minima of this function, the dynamic programming algorithm maintains a pointer to the left-most breakpoint that is to the right of all minima of \(P_t(\mu)\). The optimization step consists of finding the new position of this pointer after each hinge loss is added to the sum.

As an illustration (corresponding to a figure in Drouin et al., 2017), consider the first two steps of the dynamic programming algorithm for the data \(y_1 = 4, s_1 = 1, y_2 = 1, s_2 = -1\) and margin \(\epsilon = 1\), using the linear hinge loss \(l(x) = x\):

  • Left: The algorithm begins by creating a first breakpoint at \(b_{1,1} = y_1 - \epsilon = 3\), with corresponding function \(f_{1,1}(\mu) = \mu - 3\). At this time, we have \(j_1 = 2\) and thus \(b_{1,j_1} = \infty\). Note that the cost \(p_{1,1}\) before the first breakpoint is not yet stored by the algorithm.

  • Middle: The optimization step moves the pointer to the minimum (\(J_1 = j_1 - 1\)) and updates the cost function, \(M_1(\mu) = p_{1,2}(\mu) - f_{1,1}(\mu)\).

  • Right: The algorithm adds the second breakpoint at \(b_{2,1} = y_2 + \epsilon = 2\) with \(f_{2,1}(\mu) = \mu - 2\). The cost at the pointer is not affected by the new data point, so the pointer does not move.

mmit()

This function creates a maximum margin interval tree. It returns the learned tree as a party object.

Usage:

mmit(feature.mat, target.mat, max_depth = Inf, margin = 0, loss = "hinge", min_sample = 1)

Example:

library(mmit)
target.mat <- rbind(
  c(0,1), c(0,1), c(0,1),
  c(2,3), c(2,3), c(2,3))

feature.mat <- rbind(
  c(1,0,0), c(1,1,0), c(1,2,0),
  c(1,3,0), c(1,4,0), c(1,5,0))

colnames(feature.mat) <- c("a", "b", "c")
feature.mat <- data.frame(feature.mat)


out <- mmit(feature.mat, target.mat)

Output: the learned tree, returned as a party object.

Prediction

Prediction is done using predict.mmit(), which runs new data through the learned MMIT and returns real-valued predictions. When newdata is NULL, predictions for the training data are returned.

Usage:

predict.mmit(tree, newdata = NULL, perm = NULL)

Example:

library(mmit)
target.mat <- rbind(
  c(0,1), c(0,1), c(0,1),
  c(2,3), c(2,3), c(2,3))

feature.mat <- rbind(
  c(1,0,0), c(1,1,0), c(1,2,0),
  c(1,3,0), c(1,4,0), c(1,5,0))

colnames(feature.mat) <- c("a", "b", "c")
feature.mat <- data.frame(feature.mat)

tree <- mmit(feature.mat, target.mat)
pred <- predict.mmit(tree)

Output:

## [1] 0.5 0.5 0.5 2.5 2.5 2.5

Pruning

Overfitting happens when the learning algorithm continues to develop hypotheses that reduce the training set error at the cost of an increased test set error. To prevent overfitting of the decision trees, we implemented post-pruning, in which trees are pruned after they are built. The mmit.pruning() function implements a bottom-up minimum cost-complexity pruning algorithm.

Algorithm
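The criterion is the standard minimal cost-complexity score: each subtree \(T\) is evaluated by \[R_\alpha(T) = R(T) + \alpha|\tilde{T}|,\] where \(R(T)\) is the total training cost of the tree, \(|\tilde{T}|\) is its number of leaves and \(\alpha \ge 0\) is the regularization parameter. The bottom-up algorithm computes the critical values of \(\alpha\) at which collapsing an internal node becomes beneficial, together with the corresponding sequence of nested pruned trees; these are the \(\alpha\) values and trees returned by mmit.pruning() below.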


mmit.pruning()

Prunes the regression tree for censored data, returning all the alpha (regularization parameter) values and the corresponding pruned trees.

Usage:

mmit.pruning(tree)

Example:

library(mmit)
target.mat <- rbind(
  c(0,1), c(0,1), c(0,1),
  c(2,3), c(2,3), c(2,3))

feature.mat <- rbind(
  c(1,0,0), c(1,1,0), c(1,2,0),
  c(1,3,0), c(1,4,0), c(1,5,0))

colnames(feature.mat) <- c("a", "b", "c")
feature.mat <- data.frame(feature.mat)


tree <- mmit(feature.mat, target.mat)
pruned_tree <- mmit.pruning(tree)

Output:

## [1] "pruned tree for alpha = 0"

## [1] "pruned tree for alpha = 3"

K-fold Cross Validation

The mmit.cv() function performs a grid search to select the best hyperparameters of a regression tree for censored data via cross-validation. It outputs all the cross-validation results, the best model and the best parameters.

mmit.cv()

Usage:

mmit.cv(feature.mat, target.mat, param_grid, n_folds = 3, scorer = NULL, pruning = TRUE, future.seed = FALSE)

Example:

library(mmit)
data(neuroblastomaProcessed, package="penaltyLearning")
feature.mat <- data.frame(neuroblastomaProcessed$feature.mat)[1:45,]
target.mat <- neuroblastomaProcessed$target.mat[1:45,]
new.mat <- data.frame(neuroblastomaProcessed$feature.mat)[46:90,]
  
param_grid <- NULL
param_grid$max_depth <- c(Inf, 2)
param_grid$margin <- c(2, 3)
param_grid$min_sample <- c(10, 20)
param_grid$loss <- c("hinge", "square")
if(require(future)){ plan(multiprocess)}
set.seed(1)
fit <- mmit.cv(feature.mat, target.mat, param_grid, scorer = mse, future.seed = TRUE)
pred <- predict(fit, new.mat)

Output:

## Loading required namespace: future.apply
## [1] "result: "
##    max_depth margin min_sample   loss    alpha
## 13       Inf      2         20 square 5.977923
## [1] "best estimator: "

## [1] "prediction: "
## Warning in matrix(unlist(n), nrow = length(n), byrow = T): data length [35]
## is not a sub-multiple or multiple of the number of rows [45]
##  [1] 2.641224 2.641224 2.641224 2.641224 2.641224 2.641224 2.641224
##  [8] 2.641224 2.641224 2.641224 2.641224 2.641224 2.641224 2.641224
## [15] 2.641224 2.641224 2.641224 2.641224 2.641224 2.641224 2.641224
## [22] 2.641224 2.641224 2.641224 2.641224 2.641224 2.641224 2.641224
## [29] 2.641224 2.641224 2.641224 2.641224 2.641224 2.641224 2.641224
## [36] 2.641224 2.641224 2.641224 2.641224 2.641224 2.641224 2.641224
## [43] 2.641224 2.641224 2.641224

Random Forest

The random forest implemented in the mmit package follows the random forest approach described by Breiman. The mmif() function creates the collection of trees by sampling training examples with replacement and features without replacement. The predict.mmif() function computes the average of the predictions made by the trees of the forest, and mmif.cv() implements K-fold cross-validation for the random forest model.
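The following sketch illustrates this resampling scheme (grow_forest_sketch is a hypothetical illustration, not a package function; the real mmif() additionally supports parallel tree growing via the future package, see the future.seed argument below):

## Illustration only (not the mmif() internals): each tree is trained on a
## bootstrap sample of the examples (drawn with replacement) and a random
## subset of the features (drawn without replacement).
grow_forest_sketch <- function(feature.mat, target.mat, n_trees = 10,
                               n_features = ceiling(ncol(feature.mat)^0.5)) {
  lapply(seq_len(n_trees), function(i) {
    rows <- sample(nrow(feature.mat), replace = TRUE)  # examples, with replacement
    cols <- sample(ncol(feature.mat), n_features)      # features, without replacement
    mmit(feature.mat[rows, cols, drop = FALSE],
         target.mat[rows, , drop = FALSE])
  })
}
## The forest prediction is the average of the per-tree predictions, e.g.
## rowMeans(sapply(trees, predict.mmit, newdata = feature.mat))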

mmif()

Learns a random forest of Max Margin Interval Trees and returns the list of trees as output.

Usage:

mmif(feature.mat, target.mat, max_depth = Inf, margin = 0, loss = "hinge", min_sample = 1, n_trees = 10, n_features = ceiling(ncol(feature.mat)^0.5), future.seed = FALSE)

Example:

library(mmit)

data(neuroblastomaProcessed, package="penaltyLearning")
feature.mat <- data.frame(neuroblastomaProcessed$feature.mat)[1:45,]
target.mat <- neuroblastomaProcessed$target.mat[1:45,]
if(require(future)){ plan(multiprocess)}
set.seed(1)
trees <- mmif(feature.mat, target.mat, max_depth = Inf, margin = 2.0, loss = "hinge", min_sample = 1, future.seed = TRUE)

Output:

## Loading required package: future

predict.mmif()

Predictions with random forests of Max Margin Interval Trees.

Example:

library(mmit)

target.mat <- rbind(
  c(0,1), c(0,1), c(0,1),
  c(2,3), c(2,3), c(2,3))

feature.mat <- rbind(
  c(1,0,0), c(1,1,0), c(1,2,0),
  c(1,3,0), c(1,4,0), c(1,5,0))

colnames(feature.mat) <- c("a", "b", "c")
feature.mat <- data.frame(feature.mat)

forest <- mmif(feature.mat, target.mat)
pred <- predict(forest, feature.mat)

Output:

## [1] 0.75 1.15 1.35 1.95 1.95 1.95

mmif.cv()

Performs a grid search to select the best hyperparameters of mmif via cross-validation.

Usage:

mmif.cv(feature.mat, target.mat, param_grid, n_folds = 3, scorer = NULL, future.seed = FALSE)

Example:

library(mmit)

data(neuroblastomaProcessed, package="penaltyLearning")
feature.mat <- data.frame(neuroblastomaProcessed$feature.mat)[1:45,]
target.mat <- neuroblastomaProcessed$target.mat[1:45,]
test.mat <- data.frame(neuroblastomaProcessed$feature.mat)[46:90,]
 
param_grid <- NULL
param_grid$max_depth <- c(4, 3)
param_grid$margin <- c(2, 3)
param_grid$min_sample <- c(5, 20)
param_grid$loss <- c("hinge", "square")
param_grid$n_trees <- c(10)
param_grid$n_features <- c(ceiling(ncol(feature.mat)**0.5))
if(require(future)){ plan(multiprocess)}
set.seed(1)
fit <- mmif.cv(feature.mat, target.mat, param_grid, scorer = mse, future.seed = TRUE)
pred <- predict(fit, test.mat)

Output:

## [1] "result: "
## [1] "best estimators: "

## [1] "predictions: "
##  [1] 1.93148366 2.48334882 0.28933430 2.38273411 2.38273411 1.29647660
##  [7] 0.76599574 2.48334882 2.18214452 0.28933430 0.28933430 0.28933430
## [13] 0.81748850 2.18214452 1.36554482 0.28933430 0.28933430 1.45769361
## [19] 1.02275715 0.28933430 2.07341982 1.14140504 1.11895240 1.43189255
## [25] 2.28275923 0.28933430 1.38209917 1.61074822 0.73953441 2.44908185
## [31] 0.28933430 1.79829214 1.14223319 0.28933430 2.48334882 1.23579071
## [37] 0.28933430 0.01184083 0.01184083 0.73953441 0.28933430 0.73953441
## [43] 2.24849226 0.28933430 1.13053767

Other Functions

mse()

Calculates the mean squared error for interval targets: the per-example error is zero if the prediction lies inside the interval, and the squared distance to the nearest interval limit otherwise.

Usage:

mse(y_true, y_pred)

Example:

library(mmit)
y_true <- rbind(
  c(0,1), c(0,1), c(0,1),
  c(2,3), c(2,3), c(2,3))

y_pred <- c(0.5, 2, 0, 1.5, 3.5, 2.5)

out <- mse(y_true, y_pred)

Output:

## [1] 0.25
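In this example the per-example squared errors are 0, 1, 0, 0.25, 0.25 and 0 (the squared distance to the nearest interval limit, or zero when the prediction falls inside the interval), whose mean is 0.25.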

zero_one_loss()

Calculates the zero-one loss for interval targets, i.e., zero error if the prediction is inside the interval and one if it is outside.

Usage:

zero_one_loss(y_true, y_pred)

Example:

library(mmit)
y_true <- rbind(
  c(0,1), c(0,1), c(0,1),
  c(2,3), c(2,3), c(2,3))

y_pred <- c(0.5, 2, 0, 1.5, 3.5, 2.5)

out <- zero_one_loss(y_true, y_pred)

Output:

## [1] 0.5
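Here the predictions 2, 1.5 and 3.5 fall outside their respective intervals, so the loss is 3/6 = 0.5.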