Notes on Compilation in R

Luke Tierney
School of Statistics
University of Minnesota

Introduction

These are some very sketchy notes about trying to get a handle on how much compilation can be hoped to accomplish in R. I've taken one example and manually gone through a few of the optimizations that can probably be done with the aid of things like name spaces and some notion of sealing certain environments and bindings. I think it looks reasonably promising, but more examples are needed, along with more understanding of where the interpreter can be tuned.

A Simple Example

The examples below are all based on one simple function, a simplified version of dnorm, defined as

<compilation sample function>=
f<-function(x, mu=0,sigma=1) {
  (1/sqrt(2 * pi)) * exp(-0.5 * ((x - mu)/sigma)^2) / sigma
}

This function is a simplification of dnorm: in particular, it does no argument checking and relies on R's generic arithmetic rather than coercing its arguments to numeric. Some consequences of this difference are discussed in the section on strength reduction below.

The timing tests below all call f and its variants with the vector x defined by

<test data>=
x<-seq(0,3,len=5)
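
As a quick sanity check, f should agree with dnorm on this input up to rounding:

<check sample function against dnorm>=
all.equal(f(x), dnorm(x))   # should be TRUE, up to rounding error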

Timings are obtained by taking three runs and reporting the one with the median user CPU time (first value).
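
A helper along the following lines could automate that; this is just a sketch, not necessarily what was used for the timings shown below:

<median timing helper>=
timeMedian <- function(expr) {
  e <- substitute(expr)
  env <- parent.frame()
  runs <- vector("list", 3)
  user <- numeric(3)
  for (i in 1:3) {
    runs[[i]] <- system.time(eval(e, env))
    user[i] <- runs[[i]][1]
  }
  runs[[order(user)[2]]]    # the run with the median user CPU time
}

For example, timeMedian(for (i in 1:100000) f(x)) returns the timing vector for the median of three runs.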

Interpreter Optimizations

There are a few places where it may be useful to tweak the interpreter for performance. One such place is the test used to determine whether group method dispatch is needed for arithmetic operations. The test starts off with some potentially costly string comparisons; only then are the simpler checks tried that, for simple data, usually cause dispatch to be abandoned in favor of the default code. The following code can be placed at the beginning of DispatchGroup to cheaply reject dispatching for many standard cases:

<pretest in DispatchGroup>=
/* pre-test to avoid string computations when there is nothing to
   dispatch on */
if (args != R_NilValue && ! isObject(CAR(args)) &&
    (CDR(args) == R_NilValue || ! isObject(CADR(args))))
  return 0;

I believe this does not represent a semantic change, just a change in where dispatching is aborted. This should be double checked; if I got it wrong something similar will work.
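
As a quick R-level check that dispatch still happens when it should, a toy class with an Ops group method (the class name myclass is made up here) should still see its method called, since the pretest only rejects the case where neither argument is an object:

<check that group dispatch still occurs>=
Ops.myclass <- function(e1, e2) { cat("dispatching\n"); NextMethod() }
y <- structure(1:3, class = "myclass")
y + 1    # should print "dispatching"; plain numeric arguments skip dispatch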

Before adding this test the timings for the simple example were

<before pretest in dispatch>=
> system.time(for (i in 1:100000) f(x))
[1] 18.82  0.16 18.98  0.00  0.00
> system.time(for (i in 1:100000) dnorm(x))
[1] 3.91 0.21 4.12 0.00 0.00

After adding the test they become

<after pretest in dispatch>=
> system.time(for (i in 1:100000) f(x))
[1] 13.86  0.17 14.03  0.00  0.00
> system.time(for (i in 1:100000) dnorm(x))
[1] 3.92 0.16 4.12 0.00 0.00

This function is particularly numerically intensive; others are less likely to see such significant improvements.

Reducing Lookup Cost

Variable lookup can be quite costly. If the location of a variable in the environment can be determined at compile time, then the function can be written to look up the value directly in that location and environment searching can be avoided.

Because of the availability of sys.parent, assign, rm, and friends, along with the convention that loaded libraries shadow the base library, it is virtually impossible to determine the meaning of any variable in any piece of R code by looking at the code alone. It may, however, be possible and useful to add some mechanism, such as name space management together with the ability to seal certain environments and bindings, that would let a compiler determine binding locations. Let's suppose that this is the case.

To simulate the speed improvement that is possible by reducing lookup costs, here is a version of the sample function that binds all the globals it uses into an environment that is the immediate parent of its execution environment. This means that variable lookup will find all free variables in two frame lookups instead of having to look all the way down to the base environment.

<variable lookup simplification>=
f1<-local({
      exp <- exp
      mul <- get("*")
      sub <- get("-")
      expt <- get("^")
      uminus <- get("-")
      pi <- pi
      div <- get("/")
      sqrt <- sqrt
      function(x,mu=0,sigma=1) {
        mul(div(1, sqrt(mul(2, pi))),
            div(exp(mul(uminus(0.5), expt(div(sub(x,mu),sigma),2))),sigma))
          }})

The timing comparison of this function to the original f and dnorm is

<timings with local versions of globals>=
>  system.time(for (i in 1:100000) f(x))
[1] 14.07  0.19 14.26  0.00  0.00
>  system.time(for (i in 1:100000) f1(x))
[1] 12.68  0.12 12.90  0.00  0.00
> system.time(for (i in 1:100000) dnorm(x))
[1] 3.90 0.17 4.44 0.00 0.00

All timings so far have been done with no libraries loaded. The current lookup mechanism searches all loaded libraries before it searches the base library. Each library search is constant time because of hashing (at least in theory), but each loaded library adds one more hash table search. To illustrate this, we can load all the libraries in the standard distribution

<load all standard libraries>=
library("ctest")
library("lqs")  
library("mva")
library("splines")
library("tcltk")
library("eda")
library("modreg")
library("nls")
library("stepfun")
library("ts")

and repeat the timings:

<timings after loading libraries>=
>  system.time(for (i in 1:100000) f(x))
[1] 21.12  0.27 22.11  0.00  0.00
>  system.time(for (i in 1:100000) f1(x))
[1] 13.30  0.09 13.40  0.00  0.00
> system.time(for (i in 1:100000) dnorm(x))
[1] 4.85 0.16 5.01 0.00 0.00

There is a small increase in the timings for f1, possibly due to changes in heap occupancy; a larger change for the dnorm example, which involves one base lookup per iteration; and a huge increase for f, which involves nine base lookups per iteration if I counted right.

This may be a case where the interpreter can be improved. One option would be to maintain a cache of binding values in the global environment and flush the cache when it becomes invalid; another option would be to use a shallow binding strategy for global values. Both are a bit tricky to implement.
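
To make the caching idea a bit more concrete, here is a rough R-level sketch; the real thing would live in the C code for variable lookup, and the names lookupCache, cachedGet, and flushLookupCache are made up for this sketch:

<sketch of a global lookup cache>=
lookupCache <- new.env(hash = TRUE)

cachedGet <- function(name) {
  if (exists(name, envir = lookupCache, inherits = FALSE))
    get(name, envir = lookupCache, inherits = FALSE)
  else {
    val <- get(name)    # full search through the loaded libraries down to base
    assign(name, val, envir = lookupCache)
    val
  }
}

flushLookupCache <- function()   # must be called whenever a binding changes
  rm(list = ls(envir = lookupCache, all.names = TRUE), envir = lookupCache)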

Sealing Some Builtin Functions

The previous example actually does more than reduce lookup time: it also freezes the values of global symbols like exp at the values they had when f1 was created. These are really two separate issues. If, in addition to providing a mechanism for specifying that exp, say, really refers to the exp in the base package, we are also willing to provide a mechanism for saying that the value of that symbol must be the builtin exp primitive, then we can use that information to call the primitive function directly without going through any bindings.

To simulate this, we can take the body of the expression used to create f1 and substitute for its globals the values they have in the base package:

<sample function with builtins replaced by values>=
vals <- list(exp=exp, mul=get("*"), sub=get("-"), expt=get("^"),
             uminus=get("-"), pi=pi, div=get("/"), sqrt=sqrt)

f2 <- f
body(f2) <-
substitute(mul(div(1, sqrt(mul(2, pi))),
               div(exp(mul(uminus(0.5), expt(div(sub(x,mu),sigma),2))),sigma)),
           vals)

Returning to a session with no libraries loaded, the timings are

<timings with builtins replaced>=
> system.time(for (i in 1:100000) f(x))
[1] 14.26  0.22 15.01  0.00  0.00
> system.time(for (i in 1:100000) f1(x))
[1] 12.72  0.12 12.85  0.00  0.00
> system.time(for (i in 1:100000) f2(x))
[1] 11.75  0.20 11.98  0.00  0.00

Constant Folding

If we can identify some of the functions used in an expression, then we can also carry out at compile time any calculation that applies these functions to arguments known at compile time. In the example, the normalizing constant is a constant expression. We can replace it by its value,

<sample function with builtins and constant folding>=
f3 <- f
body(f3) <-
substitute(mul(c1,
               div(exp(mul(c2, expt(div(sub(x,mu),sigma),2))),sigma)),
           c(c1=1 / sqrt(2 * pi), c2=-0.5, vals))

which produces quite a significant savings in speed.

<timings with constant folding>=
> system.time(for (i in 1:100000) f3(x))
[1] 8.94 0.15 9.09 0.00 0.00

Because of numerical precision issues, there are limits to the amount of constant folding a compiler can do. But a compiler may be able to identify places where a rewrite would allow constant folding to occur, and it might optionally issue a message pointing this out.
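
For example (an illustration made up for these notes, with g1 and g2 as throwaway names), in x * 2 * pi the left-to-right grouping (x * 2) * pi leaves nothing to fold, while the rewritten x * (2 * pi) exposes a constant subexpression. Because floating point multiplication is not associative the two can differ in the last bits, which is why such a rewrite should at most be suggested rather than applied silently.

<constant folding after rewriting>=
g1 <- function(x) x * 2 * pi     # parses as (x * 2) * pi: nothing to fold
g2 <- function(x) x * (2 * pi)   # 2 * pi can be folded into a constant
all.equal(g1(x), g2(x))          # TRUE, though identical() need not be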

Byte Code Compilation

The overhead of the evaluation process can be further reduced by moving to a byte code representation of the code. I put together a very simple interpreter for a tiny fragment of an expression language and made a small change to my version of eval.c to use it. All this is only intended to check out concepts and possibilities at this point.

The opcodes of the interpreter and a little function for making up byte code arrays are given by

<byte code creation>=
idx <- 0
RETURN.OP <- idx; idx <- idx + 1
INIT.OP <- idx; idx <- idx + 1
ADD.OP <- idx; idx <- idx + 1
SUB.OP <- idx; idx <- idx + 1
MUL.OP <- idx; idx <- idx + 1
DIV.OP <- idx; idx <- idx + 1
EXPT.OP <- idx; idx <- idx + 1
SQRT.OP <- idx; idx <- idx + 1
EXP.OP <- idx; idx <- idx + 1

bc<-function(...) as.integer(c(...))

A byte coded version of the constant-folded version of the sample function is constructed by

<sample function in byte code>=
f4 <- local({
  forms <- formals(function(x, mu=0,sigma=1) {})
  code <- bc(INIT.OP, 7, 3,
             SUB.OP, 0, 1, 7,
             DIV.OP, 7, 2, 7,
             EXPT.OP, 7, 5, 7,
             MUL.OP, 4, 7, 7,
             EXP.OP, 7, 7,
             DIV.OP, 7, 2, 7,   # divide by sigma, as in f3
             MUL.OP, 3, 7, 7,
             RETURN.OP, 7)
  consts <- list(1 / sqrt(2 * pi), -0.5, 2)
  as.function(c(forms,list(list(as.name(".Code"),code,consts))), NULL)})
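
To help see what the encoding means, here is a purely illustrative R-level emulation of the little interpreter. The real interpreter is a small amount of C code added to eval.c and is not shown here; the function evalbc below and its operand conventions (source registers followed by a destination register, with the arguments in the first registers and the constants after them) are just assumptions made for this sketch.

<illustrative byte code emulator in R>=
evalbc <- function(fun, ...) {
  info <- body(fun)                    # list(.Code, code, constants)
  code <- info[[2]]
  consts <- info[[3]]
  forms <- formals(fun)
  args <- list(...)                    # positional matching only in this sketch
  if (length(args) < length(forms))    # fill in (constant) default arguments
    for (i in (length(args) + 1):length(forms))
      args[[i]] <- eval(forms[[i]])
  regs <- vector("list", code[2] + 1)  # registers 0, ..., code[2]
  regs[seq(along = args)] <- args
  regs[length(forms) + seq(along = consts)] <- consts
  reg <- function(i) regs[[i + 1]]     # register indices in the code are 0-based
  pc <- 4                              # skip the INIT instruction and its operands
  repeat {
    op <- code[pc]
    if (op == RETURN.OP) {
      return(reg(code[pc + 1]))
    }
    else if (op == EXP.OP || op == SQRT.OP) {
      f <- if (op == EXP.OP) exp else sqrt
      regs[[code[pc + 2] + 1]] <- f(reg(code[pc + 1]))
      pc <- pc + 3
    }
    else if (op >= ADD.OP && op <= EXPT.OP) {
      f <- get(switch(op - ADD.OP + 1, "+", "-", "*", "/", "^"))
      regs[[code[pc + 3] + 1]] <- f(reg(code[pc + 1]), reg(code[pc + 2]))
      pc <- pc + 4
    }
    else stop("unexpected opcode")
  }
}

With these conventions, evalbc(f4, x) should agree with f3(x).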

Timings for the byte coded version are:

<timings for byte code>=
> system.time(for (i in 1:100000) f3(x))
[1] 8.84 0.15 8.99 0.00 0.00
> system.time(for (i in 1:100000) f4(x))
[1] 7.17 0.18 7.35 0.00 0.00
> system.time(for (i in 1:100000) dnorm(x))
[1] 4.06 0.23 4.29 0.00 0.00

The INIT instruction here forces the arguments. In this case the function is strict in all its arguments and would force them in the first expressions, so this isn't doing anything wrong (I think it even preserves order of evaluation). But in general the issue of where to deal with argument forcing is a tricky one in compiling to byte code.
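
As a small illustration of that order in the interpreted setting (g and its noisy defaults are made up for this purpose), defaults with visible side effects show mu being forced before sigma; x, which has no default, is forced before either:

<argument forcing order illustration>=
g <- function(x, mu = {cat("forcing mu\n"); 0},
                 sigma = {cat("forcing sigma\n"); 1})
  (x - mu) / sigma
g(1:3)   # prints "forcing mu" and then "forcing sigma"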

Strength Reduction

One of the points where the sample function f differs from dnorm is in the range of arguments for which it produces answers without raising an error and in how the two handle objects with attributes. The dnorm function requires its arguments to be numeric and coerces them as if applying as.double.default to them (if I read the code right). The function f on the other hand will use generic arithmetic operations and only signal errors when one of those does. Signaling an error for non-numeric data might be preferable. In addition, insisting on simple numeric data might enable other optimizations.

One example of such an optimization is strength reduction: replacing x^2 by x * x. We can easily do this in the byte code version,

<byte code example with strength reduction>=
f5 <- local({
  forms <- formals(function(x, mu=0,sigma=1) {})
  code <- as.integer(c(INIT.OP, 7, 3,
                       SUB.OP, 0, 1, 7,
                       DIV.OP, 7, 2, 7,
                       MUL.OP, 7, 7, 7,   # strength reduction here
                       MUL.OP, 4, 7, 7,
                       EXP.OP, 7, 7,
                       DIV.OP, 7, 2, 7,   # divide by sigma, as in f3
                       MUL.OP, 3, 7, 7,
                       RETURN.OP, 7))
  consts <- list(1 / sqrt(2 * pi), -0.5, 2)
  as.function(c(forms,list(list(as.name(".Code"),code,consts))), NULL)})

and it pays off in the timings:

<timings for strength reduction>=
> system.time(for (i in 1:100000) f5(x))
[1] 6.18 0.18 6.37 0.00 0.00

This is another case where it might pay to add a test in the power code to handle squaring separately; the benefit of doing the reduction at compile time may not be large when compared to such a run-time test. Compile-time detection, on the other hand, is free at run time and so is definitely worthwhile when it is legitimate. If the arithmetic could involve dispatch to a user-defined power method, then there is no reason to assume this strength reduction is legitimate.
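
At the R level, the run-time version of that test would amount to something like the following sketch (the real test would go in the C code for the power operator; pow is just an illustrative name):

<run-time squaring test sketch>=
pow <- function(x, y)
  if (is.numeric(y) && length(y) == 1 && y == 2) x * x else x ^ y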

If the function f includes information indicating that only simple numerical arguments are acceptable, then further optimizations may be possible. For example, it may be possible to use lower level numerical primitives that avoid allocating as much space for intermediate results. Again a compiler could provide a notification if it sees a setting where additional information might allow it to do further optimization.

Open Issues

The simple byte code used so far does not allow for flow control or variable assignments or calling arbitrary functions, or anything else that would be useful--it is very primitive.

I used a register/frame form of byte code rather than a stack-based code. A stack-based code is probably worth trying since that seems to be what most people use, but it always seems less efficient to me.

Tricks for creating threaded code may be worth trying, but probably won't pay off unless we can compile down to non-generic arithmetic.