Notes on Name Spaces for R

Luke Tierney
School of Statistics
University of Minnesota

This note is a first attempt at outlining the issues and at starting a discussion on how to add name spaces to R.

[Whenever it says below ``XYZ is true in R'' this should be read as ``it is my, possibly completely incorrect, impression that XYZ is true in R''.]

Introduction

Packages written in R define a set of global variables that are made available when the package is attached after loading. Some of these variables are really only intended for internal use by the package, but there is currently no convenient mechanism for keeping them out of the global name space. Functions defined by packages also make use of many other globally defined variables. Usually the package author has a very clear idea of which package these variables should be found in, but there is no convenient way to ensure that they will be found in the intended package rather than in another package that happens to appear earlier in the search path. The objective of a name space mechanism for R is to provide a way of managing the set of global variables that a package uses (imports) and makes available (exports).

Put another way, a name space mechanism would give some static structure to the global variables used by a package, structure that protects the package's functions from variation in the current dynamic global environment, where loading packages and attaching frames can lead to unintended name conflicts. The base package is a case in point. Most functions that refer to a free variable exp, say, intend this to be the exp variable in the base package. Creating the function in a name space that uses the base package will ensure that this is the case.

Global variables need not all come from name spaces. If a function defined in a name space uses a global variable that is not found in the name space or any name spaces it uses, then the standard dynamic global environment is used.
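
The shadowing problem can be imitated with ordinary environments. The following is only a sketch under assumptions: user_frame and ns are hypothetical stand-ins for a frame that sits ahead of base on the search path and for a name space that uses base, and the code relies on new.env and baseenv as found in current versions of R. It stops at base and does not model the fall-through to the global environment.

```r
# Hypothetical stand-in for a frame ahead of base on the search path
# that happens to define its own exp.
user_frame <- new.env(parent = globalenv())
assign("exp", function(x) "not the exponential", envir = user_frame)

# Hypothetical stand-in for a name space that uses base: free variables
# are looked up in base before anything else.
ns <- new.env(parent = baseenv())

phi <- function(z) exp(-0.5 * z^2)

phi_dynamic <- phi
environment(phi_dynamic) <- user_frame   # exp is resolved through user_frame

phi_static <- phi
environment(phi_static) <- ns            # exp is resolved in base

print(phi_dynamic(0))   # the shadowing definition is used
print(phi_static(0))    # base's exp is used, giving 1
```

The same function body gives two different answers depending only on where its free variable exp is looked up, which is the variation a name space is meant to remove.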

Components and Features

A name space needs to support the following:

A way to refer to the name space. One option is to allow name spaces to be first class objects that are assigned to variables (in other name spaces). This would be the most powerful approach, but it would also be quite complex and probably be more trouble than it is worth.
Since name spaces are most useful for organizing packages, a simpler option would be to specify that each package has (at least optionally) an associated name space. Details of how to specify features of the name space could be merged with the current file structure of packages. I will assume this approach in the remainder of this note.
A way to specify the variables to be exported from the name space. This could be handled by the INDEX file in a package, or an additional EXPORTS file could be created.
A way to specify the packages used or imported by the package. In almost all cases this would include the base package, so this could be included by default; if so, it may be useful to provide a way to exclude the base package.
The current DESCRIPTION file could be augmented by allowing one or more Imports lines, or an IMPORTS file could be added.

Optional features might include:

The ability to selectively import a few variables from another package.
The ability to selectively import a few variables from another package, possibly under another name. For example, we might want to import a function simpson from a package integrate but refer to this function in our package as integral.
The ability to export variables under a different name.

An Example

[This is only intended as a simple illustration of the issues, not a good way to do anything.]

Suppose we have a package mynorm with code in a file mynorm.R:

<mynorm/R/mynorm.R>=
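    # Private constants and helpers (not listed in EXPORTS).
    c1 <- 1 / sqrt(2 * pi)
    lc1 <- log(c1)
    phi <- function(z) c1 * exp(-0.5 * z^2)
    lphi <- function(z) lc1 - 0.5 * z^2

    # Normal density, or its logarithm when log = TRUE.
    my.dnorm <- function(x, mu = 0, sigma = 1, log = FALSE) {
        z <- (x - mu) / sigma
        # The call log(sigma) still finds the function log, since
        # non-function bindings are skipped when looking up a function.
        if (log) lphi(z) - log(sigma)
        else phi(z) / sigma
    }

    # Normal distribution function by numerical integration of phi,
    # with -5 standing in for minus infinity. The variable integral
    # is imported (see the IMPORTS file).
    my.pnorm <- function(x, mu = 0, sigma = 1) {
        z <- (x - mu) / sigma
        integral(phi, -5, z)
    }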

We would like only my.pnorm and my.dnorm to be public, so the EXPORTS file would be

<mynorm/EXPORTS>=
my.dnorm
my.pnorm

Within the package we can use top level definitions like those for c1, lc1, and phi and keep them private. The name space mechanism should make sure that they are not visible outside the name space and will not be obscured by other definitions of those symbols in the global name space, for example in loaded packages.
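
One way to picture the separation between internal and exported variables is with two environments: one holding everything the package defines, and one holding only the names listed in EXPORTS. This is a sketch under assumptions, not a proposed implementation. make_exports is a hypothetical helper, the package code is reduced to a fragment of mynorm, and the values are simply copied into the export frame.

```r
# Internal frame: everything the package code defines lives here.
internals <- new.env(parent = baseenv())
evalq({
    c1 <- 1 / sqrt(2 * pi)
    phi <- function(z) c1 * exp(-0.5 * z^2)
    my.dnorm <- function(x, mu = 0, sigma = 1) {
        z <- (x - mu) / sigma
        phi(z) / sigma
    }
}, internals)

# Hypothetical helper: build an export frame holding only the listed names.
make_exports <- function(internals, export_names) {
    exports <- new.env(parent = emptyenv())
    for (nm in export_names)
        assign(nm, get(nm, envir = internals), envir = exports)
    exports
}

exports <- make_exports(internals, c("my.dnorm"))

print(ls(exports))                                        # only my.dnorm
print(exists("phi", envir = exports, inherits = FALSE))   # FALSE
print(get("my.dnorm", envir = exports)(0))                # still finds c1 and phi
```

The exported function keeps the internal frame as its environment, so it continues to see c1 and phi even though a user of the export frame cannot.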

The IMPORTS file might look like

<mynorm/IMPORTS>=
base
integrate integral=simpson

The global variable integral is intended to refer to an integration rule, say the one implemented by a function simpson in a package integrate. Importing the base name space (which could be made the default) means we are specifying that the variables pi, exp, *, etc., refer to the variables defined in base, not any others that might exist in the global search path ahead of base.
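
The renamed import can be imitated by a frame that binds integral to the function exported as simpson. Everything about the integrate package in this sketch is an assumption: its code is not part of this note, so simpson below is a plain composite Simpson's rule written only for illustration, and integrate_exports, imports and internals are hypothetical names.

```r
# Assumed stand-in for the exports of the integrate package.
integrate_exports <- new.env(parent = emptyenv())
assign("simpson", function(f, a, b, n = 200) {
    # Composite Simpson's rule on n (even) subintervals.
    x <- seq(a, b, length.out = n + 1)
    w <- c(1, rep(c(4, 2), length.out = n - 1), 1)
    (b - a) / (3 * n) * sum(w * f(x))
}, envir = integrate_exports)

# Imports frame for mynorm: base is the enclosure, and integral is bound
# to the function that integrate exports under the name simpson.
imports <- new.env(parent = baseenv())
assign("integral", get("simpson", envir = integrate_exports), envir = imports)

# Internal frame of mynorm, enclosed by its imports.
internals <- new.env(parent = imports)
evalq({
    c1 <- 1 / sqrt(2 * pi)
    phi <- function(z) c1 * exp(-0.5 * z^2)
    my.pnorm <- function(x, mu = 0, sigma = 1) {
        z <- (x - mu) / sigma
        integral(phi, -5, z)
    }
}, internals)

print(internals$my.pnorm(0))                   # close to 0.5
print(exists("simpson", envir = internals))    # FALSE: only integral is visible
```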

Name Spaces and Environments

Environments are the natural basis for a name space system. Currently, the body of a function defined in the global environment as

<sample function>=
f <- function(x) x+1

and called as f(2) is evaluated in an environment that looks like this:

 ------------
|    x = 2   |
 ------------
| Global Env |
 ------------

If the function is defined in a package name space foo that imports base and bar, then the evaluation environment for its body could be made to look like this:

 ---------------
|     x = 2     |
 ---------------
| foo internals |
 ---------------
|  bar exports  |
 ---------------
|     base      |
 ---------------
|  Global Env   |
 ---------------

That is, instead of giving the function a null environment, representing the global environment as the place to search for free variables, it is given an environment consisting of the internal frame for its name space foo, followed by the exports frame for bar and a frame representing the base package (which currently has its values stored in the SYMVALUE cell), and then the global environment.
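
The chain of frames in this picture can be built by hand from environments. The sketch below is an illustration under assumptions: bar_exports and foo_internals are hypothetical frames, helper and offset are made-up variables, and .BaseNamespaceEnv is used for the base frame because in current versions of R its enclosure is the global environment.

```r
# Hypothetical frames for the picture above.
bar_exports   <- new.env(parent = .BaseNamespaceEnv)
foo_internals <- new.env(parent = bar_exports)

assign("helper", function(x) x + 1, envir = bar_exports)   # exported by bar
assign("offset", 10, envir = foo_internals)                # private to foo

f <- function(x) helper(x) + offset
environment(f) <- foo_internals

# Walk the enclosures of f's environment down to the global environment.
frames <- list()
e <- environment(f)
repeat {
    frames[[length(frames) + 1]] <- e
    if (identical(e, globalenv())) break
    e <- parent.env(e)
}

print(length(frames))   # foo internals, bar exports, base, Global Env
print(f(2))             # 13
```

When f(2) is called, the frame holding x = 2 is placed in front of this chain, which gives the five-level picture above.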

A variation would use a representation like

 ---------------
|     x = 2     |
 ---------------
|      foo -----|---->| foo internals |
 ---------------       ---------------
|  Global Env   |     |  bar exports  |
 ---------------       ---------------
                      |     base      |
                       ---------------

This way environments would only directly refer to a name space, not its imports structure. This would simplify save/load code.

This approach can be quite straightforward to implement or very difficult, depending on the level of mutability allowed for the name spaces. Implementation should be fairly simple if name spaces are made read-only once the .First.lib of their package has been run. Allowing reloading of packages would complicate matters but might be quite important.

Mutation of Name Space Environments

Currently all environments, whether global or local, allow variable bindings to be changed by assign, allow new bindings to be created, also by assign, and allow bindings to be removed by rm (the base environment is exempt from removal). Name spaces could have their mutability restricted by ruling out the adding and removing of bindings or by ruling out any mutation at all. Restrictions could be applied to exported variables only or to both exported and internal variables.

Imposing restrictions has a number of benefits:

Preventing binding changes prevents inadvertent destruction of a global function.
Preventing the addition of new bindings and the removal of existing ones enables performance enhancements such as caching or pre-computation of binding locations.
Preventing all mutations means sharing can be done at the value level; this significantly simplifies implementation.
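
The different levels of restriction can be tried out with the environment locking functions found in current versions of R (lockEnvironment, lockBinding and unlockBinding). This is only a sketch of the behaviour on an ordinary environment. The variables simpson and tol are made up for illustration.

```r
ns <- new.env()
assign("simpson", function(f, a, b) NULL, envir = ns)
assign("tol", 1e-6, envir = ns)

# Level 1: rule out adding or removing bindings, but still allow
# assignment to existing ones.
lockEnvironment(ns)
assign("tol", 1e-8, envir = ns)                       # allowed
add_ok <- tryCatch({ assign("new", 1, envir = ns); TRUE },
                   error = function(e) FALSE)

# Level 2: also rule out changing the value of a particular binding.
lockBinding("tol", ns)
set_ok <- tryCatch({ assign("tol", 1, envir = ns); TRUE },
                   error = function(e) FALSE)

# Selective permission to assign, as a reload function might need.
unlockBinding("tol", ns)
assign("tol", 1e-4, envir = ns)

print(c(add_ok = add_ok, set_ok = set_ok))   # both FALSE
print(get("tol", envir = ns))
```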

There are also costs. The most important cost of completely eliminating the possibility to change a binding's value is that it becomes very difficult to implement a sensible mechanism for re-loading a package after changes have been made during development or after it has been unloaded temporarily, perhaps to reduce memory usage.

Currently an environment frame looks like this (chains in a hashed frame are analogous):

 -----------------      -----------------      ------
| name | val |  --|--->| name | val |  --|--->|.....
 -----------------      -----------------      ------

If name spaces are entirely immutable, then sharing can be based on creating new frames with identical values and the same structure can be used for export frames. If name spaces do not allow addition or removal of bindings but do allow assignment to existing bindings, then sharing has to be at the binding level. A representation for export frames like

 ---------------      ---------------      -----
| name |   |  --|--->| name |   |  --|--->|.....
 ---------------      ---------------      -----
         |                    |
        \|/                  \|/
 ----------------------     ----------------------
| orig. name | val |   |   | orig. name | val |   |
 ----------------------     ----------------------

should be sufficient. Here the cells containing the actual binding and the original names would be the binding cells from the internal environment of the owning package name space.
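
Sharing at the binding level can be imitated at user level with active bindings. The sketch below is a stand-in and not the representation described above, which would live in the internals of R. export_binding is a hypothetical helper, and each lookup in the export frame is simply forwarded to the binding in the internal frame, possibly under a different name.

```r
internals <- new.env()
assign("simpson", function(f, a, b) "version 1", envir = internals)

# Hypothetical helper: export the internal variable orig under export_name.
# The export frame holds no value of its own; every lookup is forwarded
# to the binding in the internal frame.
export_binding <- function(exports, internals, orig, export_name = orig) {
    makeActiveBinding(export_name,
                      function() get(orig, envir = internals),
                      exports)
    invisible(exports)
}

exports <- new.env()
export_binding(exports, internals, "simpson", export_name = "integral")
lockEnvironment(exports, bindings = TRUE)   # guard against stray assignments

before <- exports$integral(NULL, 0, 1)

# An assignment to the existing internal binding, e.g. by a reload function.
assign("simpson", function(f, a, b) "version 2", envir = internals)
after <- exports$integral(NULL, 0, 1)

print(c(before, after))   # the export frame sees the new value
```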

It may still be desirable to mark export frames as immutable to prevent inadvertent assignments. Selective permission to assign could be made available to a reload function.

Allowing name space environments to be fully mutable would be considerably more complicated to implement and would also raise some tricky semantic issues. For example, it is not clear what should happen in the example above if integral was imported as referring to simpson in name space integrate but simpson was then removed.

To allow a package name space to be constructed during loading, its internal name space frame must allow assignments that create bindings, at least until loading is complete. Restrictions could be imposed once .First.lib has been run. The export frame could be prepared in advance based on an EXPORTS file and locked against adding or removing bindings. A reload function would not be able to change the exports, except perhaps if no other name space depends on them.

Implications for Package Loading

Currently loading a package also attaches it to the global search path. These two could be separated. Loading a package with name space foo that uses bar should cause bar to be loaded but need not necessarily make bar part of the global name space.

With explicit declaration of all exports in an EXPORTS file, the loading of the corresponding package, or perhaps of parts of a package, could be done on demand as a kind of autoload.

Package loading could be split into two steps, load.library and attach.library. If library is called then

if the package is loaded and attached, do nothing
if the package is loaded but not attached, attach it
if the package is not loaded, then load and attach it.
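
The three cases can be written out as a small simulation. Nothing here touches real packages: registry is a hypothetical table of flags, and load_library, attach_library and my_library are made-up stand-ins for load.library, attach.library and library.

```r
# Hypothetical registry: one record per package with two flags.
registry <- new.env()

load_library <- function(pkg) {
    # Stand-in for reading the package code and building its name space.
    assign(pkg, list(loaded = TRUE, attached = FALSE), envir = registry)
    invisible(pkg)
}

attach_library <- function(pkg) {
    # Stand-in for putting the export frame on the global search path.
    state <- get(pkg, envir = registry)
    state$attached <- TRUE
    assign(pkg, state, envir = registry)
    invisible(pkg)
}

my_library <- function(pkg) {
    if (!exists(pkg, envir = registry, inherits = FALSE))
        load_library(pkg)       # not loaded: load, then attach below
    if (!get(pkg, envir = registry)$attached)
        attach_library(pkg)     # loaded but not attached: attach
    invisible(pkg)              # loaded and attached: nothing more to do
}

load_library("bar")    # e.g. loaded only because another package imports it
my_library("bar")      # loaded but not attached: gets attached
my_library("foo")      # not loaded: loaded and attached
my_library("foo")      # already loaded and attached: no change
```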

A list of package name spaces could be maintained as a weak list; that way once a package is not attached and no longer used by any other packages it can be garbage collected. Whether this should be done automatically or only as an option is not clear.

A reloading mechanism should be provided for package development and possibly for supporting purging large packages.

Some packages, data packages in particular, might not need a name space; some way to specify this should be provided (e.g. no EXPORTS file).

Implications for Save/load

A function with an environment that references a name space might be written to a file with save. The save format would need to be augmented to include descriptions of the packages needed by saved functions. The loading of these packages could be deferred until needed or attempted at load time. One issue is how to properly handle the search path.

Compilation

Compilation would benefit from having name spaces where bindings cannot be added or removed. This would allow pre-computation of binding locations and eliminate the need for search. But to be able to take advantage of this we would also need a sensible way to limit the programmatic creation of potentially shadowing bindings in local environments.

Compilation would also benefit from the ability to declare some bindings as constant, though this would probably be most useful for bindings in the base package. Making pi a constant might allow constant folding, and making exp a constant might allow some inlining.

Dispatching and UseMethod

Ideally we would like to be able to have packages create private classes and methods for those classes or to create classes and methods that can be exported and used selectively by other packages without necessarily becoming part of the global name space. This would be possible with CLOS or S4 style classes and generic functions but not with the current approach.

Currently classes are specified by a class attribute consisting of strings. Method dispatch is based on concatenating generic function and class names and searching for a function by the resulting name. This search is currently done in .GlobalEnv (not quite true but true for all practical purposes). With name spaces along the lines outlined here, this means that all methods for classes have to be exported and their export frames attached for them to be found by dispatching. It would be possible to have method search occur in the caller's environment or in the non-local part of the caller's environment (i.e. starting with the name space frames). This would almost allow the definition of private classes that are used entirely within a name space. But it would not allow the appropriate methods to be found if the members of a private class had generic functions called on them from outside the name space.
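
The effect of where the method search starts can be seen in a small imitation of dispatch by name. This is a sketch under assumptions and not how UseMethod is implemented: find_method is a hypothetical helper that concatenates the generic and class names and looks the result up starting from a given environment, and the class secret and generic describe are made up.

```r
# Hypothetical helper imitating dispatch by name: try generic.class for
# each class in turn, then generic.default, searching from env.
find_method <- function(generic, x, env) {
    for (cl in c(class(x), "default")) {
        name <- paste(generic, cl, sep = ".")
        if (exists(name, envir = env, mode = "function"))
            return(get(name, envir = env, mode = "function"))
    }
    NULL
}

# A name space frame with a private method for a private class.
# The empty enclosures keep the sketch independent of the session.
ns <- new.env(parent = emptyenv())
assign("describe.secret", function(x) "a secret object", envir = ns)

outside <- new.env(parent = emptyenv())

obj <- structure(list(), class = "secret")

inside_hit  <- find_method("describe", obj, ns)        # search starts in the name space
outside_hit <- find_method("describe", obj, outside)   # search starts elsewhere

print(is.function(inside_hit))   # TRUE
print(is.null(outside_hit))      # TRUE: the private method is not found
```

A search that starts in the caller's name space finds the private method, but a generic called on the same object from outside does not, which is the limitation described above.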

This situation is less than ideal, but is fairly simple to describe: classes and their methods must be considered global. We can probably live with this.

Other Issues

I do not at present see a sensible way of allowing the base package to have a private component to its work space.
Performance may be an issue. One option is to use caching along the lines of the global variable cache I am currently experimenting with. Disallowing the addition and removal of bindings will make this possible and simpler than it is for the globals since there will be no need to flush the cache. But it may still raise some tricky issues for threading. In any case, performance will probably no longer be an issue if name spaces can be combined with byte code compilation, but that remains to be seen.
Cyclic dependencies among name spaces should probably not be permitted.
Would it be useful to have a syntax like foo::bar to access the exported variable bar in name space foo?
Would it be useful to have the ability to programmatically access values of variables in a name space's internal frame?
