The sample function used throughout is a simplified version of dnorm, defined as

<compilation sample function>=
f <- function(x, mu = 0, sigma = 1) {
    (1/sqrt(2 * pi)) * exp(-0.5 * ((x - mu)/sigma)^2) / sigma
}

This function is simplified in the following ways:
- it does not support the log argument;
- it does not check its arguments with is.numeric;
- it does not coerce its arguments with as.real (applying arithmetic to the arguments coerces them in much the same way);
- it does not check that sigma is positive.
The timing tests below all use as the argument for f and its variants the vector x defined by

<test data>=
x <- seq(0, 3, len = 5)
Timings are obtained by taking three runs and reporting the one with the median user CPU time (first value).
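A small helper along these lines can automate the median-of-three convention; the name med3.time is my own, and the transcripts below were produced by hand rather than with it.

<illustration: median-of-three timing helper>=
med3.time <- function(expr) {
    expr <- substitute(expr)
    env <- parent.frame()
    times <- numeric(3)
    for (i in 1:3)
        times[i] <- system.time(eval(expr, env))[1]   # user CPU time
    sort(times)[2]                                    # median of the three runs
}

Then med3.time(for (i in 1:100000) f(x)) returns the middle user time of three runs.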
A cheap pre-test can be added to DispatchGroup to reject dispatching for many standard cases:

<pretest in DispatchGroup>=
/* pre-test to avoid string computations when there is nothing to
   dispatch on */
if (args != R_NilValue && ! isObject(CAR(args)) &&
    (CDR(args) == R_NilValue || ! isObject(CADR(args))))
    return 0;
I believe this does not represent a semantic change, just a change in where dispatching is aborted. This should be double checked; if I got it wrong something similar will work.
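An R-level picture of what the pre-test exploits: group dispatch in the arithmetic operators only matters when at least one operand carries a class attribute, which is what isObject tests for. The class myunit and its Ops method are made up for this illustration.

<illustration: when arithmetic needs group dispatch>=
Ops.myunit <- function(e1, e2) {
    ## delegate to the default arithmetic, then restore the class
    structure(NextMethod(), class = "myunit")
}
u <- structure(1:3, class = "myunit")

1:3 * 2   # neither operand is an object: the pre-test can skip dispatch
u * 2     # an object operand: DispatchGroup must still look for Ops.myunit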
Before adding this test the timings for the simple example were
<before pretest in dispatch>=
> system.time(for (i in 1:100000) f(x))
[1] 18.82 0.16 18.98 0.00 0.00
> system.time(for (i in 1:100000) dnorm(x))
[1] 3.91 0.21 4.12 0.00 0.00
After adding the test they become
<after pretest in dispatch>=
> system.time(for (i in 1:100000) f(x))
[1] 13.86 0.17 14.03 0.00 0.00
> system.time(for (i in 1:100000) dnorm(x))
[1] 3.92 0.16 4.12 0.00 0.00
This function is particularly numerically intensive; others are less likely to see such significant improvements.
Because of the availability of sys.parent, assign, rm, and friends, along with the convention that loaded libraries shadow the base library, it is virtually impossible to determine the meaning of any variable in any piece of R code from looking at the code alone.
It may however be possible and useful to add some mechanism, such as name space management and the ability to seal certain environments, that a compiler could then use to determine binding locations. Let's suppose that this is the case.
To simulate the speed improvement that is possible by reducing lookup costs, here is a version of the sample function that binds all the globals it uses into an environment that is the immediate parent of its execution environment. This means that variable lookup will find all free variables in two frame lookups instead of having to look all the way down to the base environment.
<variable lookup simplification>=
f1 <- local({
    exp <- exp
    mul <- get("*")
    sub <- get("-")
    expt <- get("^")
    uminus <- get("-")
    pi <- pi
    div <- get("/")
    sqrt <- sqrt
    function(x, mu=0, sigma=1) {
        mul(div(1, sqrt(mul(2, pi))),
            div(exp(mul(uminus(0.5), expt(div(sub(x,mu),sigma),2))),sigma))
    }
})
The timing comparison of this function to the original f and dnorm is

<timings with local versions of globals>=
> system.time(for (i in 1:100000) f(x))
[1] 14.07 0.19 14.26 0.00 0.00
> system.time(for (i in 1:100000) f1(x))
[1] 12.68 0.12 12.90 0.00 0.00
> system.time(for (i in 1:100000) dnorm(x))
[1] 3.90 0.17 4.44 0.00 0.00
All timings so far have been done with no libraries loaded. The current lookup mechanism searches all loaded libraries before it searches the base library. Each loaded library search is constant time because of hashing (at least in theory) but each library adds a hash table search. To illustrate this we can load all the libraries in the standard distribution
<load all standard libraries>=
library("ctest")
library("lqs")
library("mva")
library("splines")
library("tcltk")
library("eda")
library("modreg")
library("nls")
library("stepfun")
library("ts")
<timings after loading libraries>=
> system.time(for (i in 1:100000) f(x))
[1] 21.12 0.27 22.11 0.00 0.00
> system.time(for (i in 1:100000) f1(x))
[1] 13.30 0.09 13.40 0.00 0.00
> system.time(for (i in 1:100000) dnorm(x))
[1] 4.85 0.16 5.01 0.00 0.00
There is a small increase in the timings for f1, possibly due to changes in heap occupancy, a larger change in the dnorm example, which involves one base lookup per iteration, and a huge increase in f, which involves nine base lookups per iteration if I counted right.
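The growth of the search path is easy to see from within R; every entry before the base package is one more hash table that a lookup of a base variable like exp has to miss in.

<illustration: the search path>=
search()        # the global environment, each attached package, then base
find("exp")     # "package:base": every earlier table is searched and missed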
This may be a case where the interpreter can be improved. One option would be to maintain a cache of binding values in the global environment and flush the cache when it becomes invalid; another option would be to use a shallow binding strategy for global values. Both are a bit tricky to implement.
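A very rough R-level sketch of the caching idea is shown below; the real change would have to be made in the C evaluator, the names lookup.cache and cachedGet are invented here, and the hard part, invalidating the cache when assign, rm, library, or detach change what a name means, is simply ignored.

<illustration: caching global lookups>=
lookup.cache <- new.env()

cachedGet <- function(name) {
    ## reuse a previously found value if we have one
    if (exists(name, envir = lookup.cache, inherits = FALSE))
        get(name, envir = lookup.cache, inherits = FALSE)
    else {
        val <- get(name)              # full search down to the base package
        assign(name, val, envir = lookup.cache)
        val
    }
}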
Note that f1 not only fixes where globals like exp are found; it also fixes them at their values when f1 was created. These are really two separate issues. If in addition to providing a mechanism for specifying that exp, say, really refers to the exp in the base package we are also willing to provide a mechanism for saying that the value of that symbol must be the builtin exp primitive, then we can use that information to directly call the primitive function without going through any bindings.
To simulate this, we can take the body of the expression used to create f1 and substitute for its globals the values they have in the base package:
<sample function with builtins replaced by values>=
vals <- list(exp=exp, mul=get("*"), sub=get("-"), expt=get("^"),
             uminus=get("-"), pi=pi, div=get("/"), sqrt=sqrt)
f2 <- f
body(f2) <- substitute(mul(div(1, sqrt(mul(2, pi))),
                           div(exp(mul(uminus(0.5),
                                       expt(div(sub(x,mu),sigma),2))),sigma)),
                       vals)
Returning to a session with no libraries loaded, the timings are
<timings with builtins replaced>=
> system.time(for (i in 1:100000) f(x))
[1] 14.26 0.22 15.01 0.00 0.00
> system.time(for (i in 1:100000) f1(x))
[1] 12.72 0.12 12.85 0.00 0.00
> system.time(for (i in 1:100000) f2(x))
[1] 11.75 0.20 11.98 0.00 0.00
We can also fold the constant subexpressions 1 / sqrt(2 * pi) and -0.5 by hand,

<sample function with builtins and constant folding>=
f3 <- f
body(f3) <- substitute(mul(c1, div(exp(mul(c2,
                                           expt(div(sub(x,mu),sigma),2))),
                                   sigma)),
                       c(c1=1 / sqrt(2 * pi), c2=-0.5, vals))

which produces quite a significant savings in speed.
<timings with constant folding>=
> system.time(for (i in 1:100000) f3(x))
[1] 8.94 0.15 9.09 0.00 0.00
Because of issues of numerical precision there are limits to the amount of constant folding a compiler can do. But a compiler may be able to identify places where a rewrite might allow constant folding to occur and might optionally issue a message pointing this out.
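As a concrete instance of the rewrite issue, using the test data x already defined:

<illustration: rewrites and floating point>=
## x * 2 * pi parses as (x * 2) * pi, so no constant subexpression is
## available to fold; the rewrite x * (2 * pi) exposes one, but floating
## point multiplication is not associative, so the two forms need not
## agree in the last bits.
identical(x * 2 * pi, x * (2 * pi))       # may be TRUE or FALSE

## Non-associativity itself is easy to see:
(0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)    # FALSE in IEEE double arithmetic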
To explore byte compilation I implemented a simple byte code interpreter and modified eval.c to use it. All this is only intended to check out concepts and possibilities at this point.
The opcodes of the interpreter and a little function for making up byte code arrays are given by
<byte code creation>=
idx <- 0
RETURN.OP <- idx; idx <- idx + 1
INIT.OP   <- idx; idx <- idx + 1
ADD.OP    <- idx; idx <- idx + 1
SUB.OP    <- idx; idx <- idx + 1
MUL.OP    <- idx; idx <- idx + 1
DIV.OP    <- idx; idx <- idx + 1
EXPT.OP   <- idx; idx <- idx + 1
SQRT.OP   <- idx; idx <- idx + 1
EXP.OP    <- idx; idx <- idx + 1
bc <- function(...) as.integer(c(...))
A byte coded version of the constant-folded version of the sample function is constructed by
<sample function in byte code>=
f4 <- local({
    forms <- formals(function(x, mu=0, sigma=1) {})
    code <- bc(INIT.OP, 7, 3,
               SUB.OP, 0, 1, 7,
               DIV.OP, 7, 2, 7,
               EXPT.OP, 7, 3, 7,
               MUL.OP, 4, 7, 7,
               EXP.OP, 7, 7,
               MUL.OP, 3, 7, 7,
               RETURN.OP, 7)
    consts <- list(1 / sqrt(2 * pi), -0.5, 2)
    as.function(c(forms, list(list(as.name(".Code"), code, consts))), NULL)
})
Timings for the byte coded version are:
<timings for byte code>=
> system.time(for (i in 1:100000) f3(x))
[1] 8.84 0.15 8.99 0.00 0.00
> system.time(for (i in 1:100000) f4(x))
[1] 7.17 0.18 7.35 0.00 0.00
> system.time(for (i in 1:100000) dnorm(x))
[1] 4.06 0.23 4.29 0.00 0.00
The INIT
instruction here forces the arguments. In this case the
function is strict in all its arguments and would force them in the
first expressions, so this isn't doing anything wrong (I think it even
preserves order of evaluation). But in general the issue of where to
deal with argument forcing is a tricky one in compiling to byte code.
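A small illustration of the difficulty: under R's lazy evaluation an argument that is never used is never evaluated, so forcing everything on entry, as INIT does, can change the behavior of non-strict functions. The function g below exists only for this illustration.

<illustration: argument forcing and laziness>=
g <- function(x, y) if (x > 0) x else y

g(1, stop("y was forced"))   # returns 1; y is never evaluated
## a compiled g that forced both arguments on entry would signal the error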
One way in which f differs from dnorm is in the range of arguments for which it produces answers without raising an error and in how they handle objects with attributes. The dnorm function requires its arguments to be numeric and coerces them as if applying as.double.default to them (if I read the code right). The function f on the other hand will use generic arithmetic operations and only signal errors when one of those does. Signaling an error for non-numeric data might be preferable. In addition, insisting on simple numeric data might enable other optimizations.
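For instance, because f is built from the generic arithmetic operators, its handling of attributes and of non-numeric but coercible data comes from those operators; whether dnorm treats the same inputs identically would have to be checked against its C code.

<illustration: attribute handling through generic arithmetic>=
f(matrix(0:3, 2, 2))     # the dim attribute survives the arithmetic
f(c(a = 0, b = 1))       # so do names
f(TRUE)                  # a logical is coerced to numeric by the arithmetic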
One example of an optimization enabled by restricting to simple numeric data is strength reduction: replacing x^2 by x * x. We can easily do this in the byte code version,
<byte code example with strength reduction>=
f5 <- local({
    forms <- formals(function(x, mu=0, sigma=1) {})
    code <- as.integer(c(INIT.OP, 7, 3,
                         SUB.OP, 0, 1, 7,
                         DIV.OP, 7, 2, 7,
                         MUL.OP, 7, 7, 7,   # strength reduction here
                         MUL.OP, 4, 7, 7,
                         EXP.OP, 7, 7,
                         MUL.OP, 3, 7, 7,
                         RETURN.OP, 7))
    consts <- list(1 / sqrt(2 * pi), -0.5, 2)
    as.function(c(forms, list(list(as.name(".Code"), code, consts))), NULL)
})
and it pays off in the timings:
<timings for strength reduction>=
> system.time(for (i in 1:100000) f5(x))
[1] 6.18 0.18 6.37 0.00 0.00
This is another case where it might pay to do a test in the power code to handle squaring separately; the additional benefit of compile-time strength reduction may then not be large compared to such a run-time test. Compile-time detection, on the other hand, is free at run time and so is definitely worthwhile if it is legitimate. If the arithmetic could involve dispatch on a power method defined by a user, then there is no guarantee that this strength reduction is legitimate.
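The dispatch point can be made concrete with a class whose power method is not equivalent to repeated multiplication; the class symb and its Ops method are invented for this illustration.

<illustration: a power method that breaks strength reduction>=
Ops.symb <- function(e1, e2) {
    if (.Generic == "^")
        structure(paste("(", unclass(e1), ")^", e2, sep = ""), class = "symb")
    else if (.Generic == "*")
        structure(paste(unclass(e1), " * ", unclass(e2), sep = ""), class = "symb")
    else stop("unsupported operation for symb objects")
}
s <- structure("x - mu", class = "symb")

s^2      # "(x - mu)^2"
s * s    # "x - mu * x - mu", a different (and here wrong) expression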
If the function f
includes information indicating that only simple
numerical arguments are acceptable, then further optimizations may be
possible. For example, it may be possible to use lower level numerical
primitives that avoid allocating as much space for intermediate
results. Again a compiler could provide a notification if it sees a
setting where additional information might allow it to do further
optimization.
I used a register/frame form of byte code rather than a stack-based code. A stack-based code is probably worth trying since that seems to be what most people use, but it always seems less efficient to me.
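To make the contrast concrete, here is a rough sketch of what a stack-based code for the constant-folded, strength-reduced computation might look like, with a toy evaluator written in R rather than C. The opcode names, the DUP instruction, and the runstack function are all invented for this sketch, and no timings are claimed for it.

<illustration: a stack-based byte code>=
PUSHCONST.OP <- 0   # push consts[[operand]]
PUSHARG.OP   <- 1   # push the value of argument number operand
DUP.OP       <- 2   # duplicate the top of the stack
SUB.S.OP     <- 3   # pop b, pop a, push a - b
DIV.S.OP     <- 4   # pop b, pop a, push a / b
MUL.S.OP     <- 5   # pop b, pop a, push a * b
EXP.S.OP     <- 6   # pop a, push exp(a)
RET.S.OP     <- 7   # pop the result and return it

runstack <- function(code, consts, args) {
    stack <- vector("list", 16); sp <- 0; pc <- 1
    repeat {
        op <- code[pc]
        if (op == PUSHCONST.OP) {
            sp <- sp + 1; stack[[sp]] <- consts[[code[pc + 1]]]; pc <- pc + 2
        } else if (op == PUSHARG.OP) {
            sp <- sp + 1; stack[[sp]] <- args[[code[pc + 1]]]; pc <- pc + 2
        } else if (op == DUP.OP) {
            sp <- sp + 1; stack[[sp]] <- stack[[sp - 1]]; pc <- pc + 1
        } else if (op == RET.S.OP) {
            return(stack[[sp]])
        } else if (op == EXP.S.OP) {
            stack[[sp]] <- exp(stack[[sp]]); pc <- pc + 1
        } else {   # the binary operators pop two values and push one
            b <- stack[[sp]]; a <- stack[[sp - 1]]; sp <- sp - 1
            stack[[sp]] <- switch(op - SUB.S.OP + 1, a - b, a / b, a * b)
            pc <- pc + 1
        }
    }
}

## (1 / sqrt(2 * pi)) * exp(-0.5 * t * t) / sigma with t = (x - mu) / sigma
scode <- as.integer(c(PUSHCONST.OP, 1,    # 1 / sqrt(2 * pi)
                      PUSHARG.OP, 1,      # x
                      PUSHARG.OP, 2,      # mu
                      SUB.S.OP,
                      PUSHARG.OP, 3,      # sigma
                      DIV.S.OP,           # t
                      DUP.OP, MUL.S.OP,   # t * t (strength-reduced square)
                      PUSHCONST.OP, 2,    # -0.5
                      MUL.S.OP,
                      EXP.S.OP,
                      PUSHARG.OP, 3,      # sigma
                      DIV.S.OP,
                      MUL.S.OP,
                      RET.S.OP))
sconsts <- list(1 / sqrt(2 * pi), -0.5)
f6 <- function(x, mu = 0, sigma = 1) runstack(scode, sconsts, list(x, mu, sigma))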
Tricks for creating threaded code may be worth trying, but probably won't pay off unless we can compile down to non-generic arithmetic.