Notes on References, External Objects, or Mutable State for R

Luke Tierney
School of Statistics
University of Minnesota

Introduction

In principle R uses pass-by-value semantics in its function calls. If two different variable names x and y start out with equal values, there is (almost) no way for R-level operations on the value of x to change the value of y.
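
A minimal sketch of this behavior, using only base R. The variable and function names are chosen purely for illustration.

```r
# Two names start out with equal values.
x <- c(1, 2, 3)
y <- x

# Modifying x through a replacement operation leaves y untouched.
x[1] <- 100

# A function that modifies its argument changes only its own local copy.
zero_first <- function(v) {
  v[1] <- 0
  v
}
z <- zero_first(y)
```

After these operations y still holds its original three values; only x and the returned z differ from it.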

Pass-by-value is usually used by functional programming languages. It eliminates the possibility of side effects on values and thus eliminates a whole class of bugs that are often very difficult to track down. S and R are, however, a bit unusual among functional languages in relying very heavily on assignment to variables. In both S and R variables represent storage locations for their values, not the values themselves. This reduces some of the benefits of the functional model, especially when combined with the overloading of assignment, which is used both for creating new variable bindings and for changing existing ones.

Pure functional semantics can be very helpful for reasoning about programs, but many ideas are more easily expressed using concepts of mutable state. This is particularly true when a computation must deal with an external object like a piece of memory to be operated on by a library of external functions, a window on a work station, or a data base, for example.

External objects also often have lifetime management issues associated with them. Connections to a data base, for example, should be shut down when they are no longer needed. The user code may be primarily responsible for managing lifetimes, but it may be useful to provide backup or to entirely transfer the responsibility of lifetime management to the basic memory management system.

All of these points suggest that it may be worth considering bringing pass-by-reference semantics into R in a limited and carefully controlled way. The purpose of this note is to start a discussion of how this might be done.

Current Status

Data Types Passed By Reference

Pass-by-value semantics are not entirely universal in R. There appear to be four exceptions: Environments, symbols, specials and builtins. All four are mutable because their attributes can be modified. In addition, the value cell of a symbol is mutable but only through the assignment mechanism (as far as I know).

The interaction of attributes with environments was recently discussed on some of the mailing lists. For symbols, here are some examples:

<symbol examples>=
> x<-as.name("z")
> y<-x
> attr(x,"fred")<-"bob"
> x
z
attr(,"fred")
[1] "bob"
> y
z
attr(,"fred")
[1] "bob"

An additional peculiarity, perhaps a bug in the as.vector code that implements as.name:

<symbol examples>+=
> x<-as.name("z")
> x
z

The attributes of existing symbols are stripped by as.name. For consistency, it might be a good idea to make sure the attributes are preserved or that assignment of attributes to symbols be made illegal as it is for R_NilValue.

Similar things happen for specials (e.g. .Internal) and builtins (e.g. proc.time).

These four exceptions to pass by value semantics could be eliminated by ruling out assigning attributes in these cases. This would probably not be an issue for symbols, specials, or builtins, but name attributes on environments are currently being used for naming environments on the global search path.

The Alias Mechanism

There is also the .Alias mechanism. For now that may already be saying too much :-).

Environments

Environments can be used to represent mutable state, as Robert and Ross show in their lexical scoping paper. They are however not always the ideal choice, especially for representing external state.
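
As a small illustration of the idea (not code from the paper), a closure can keep its state in the environment in which it was created. The name make_counter is hypothetical.

```r
# make_counter is an illustrative name. The variable count lives in the
# environment created by each call to make_counter.
make_counter <- function() {
  count <- 0
  list(
    increment = function() {
      count <<- count + 1   # modifies the binding in the enclosing environment
      invisible(count)
    },
    value = function() count
  )
}

a <- make_counter()
b <- a            # copies the list, but both copies share one environment
a$increment()
a$increment()
current <- b$value()
```

Because a and b hold closures over the same environment, the increments made through a are visible through b.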

Examples

This section gives a few examples of settings where some form of reference behavior might be useful. In some cases mechanisms that might provide the required functionality are also described.

File Pointers and Data Base Connections

Several proposals for incorporating external connections into R are currently being considered. All that I have seen assume that connections will be closed explicitly once they are no longer needed. This is probably the right approach for most purposes. But it is error prone, and it would be nice to have some backup for cases where a user forgets to close a connection, or a connection is left open after an error by code that doesn't adequately protect open connections with appropriate on.exit code. Special purpose mechanisms can be added for handling each different kind of connection, but it would probably be better to have a general mechanism that can be re-used in many situations.
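
For reference, the explicit-close pattern protected by on.exit might look like the following sketch. It assumes the connection functions file, readLines and close as they exist in current versions of R, which are not necessarily what the proposals mentioned above describe. The helper name read_first_line is made up for the example.

```r
# read_first_line is an illustrative helper, not part of any proposal.
read_first_line <- function(path) {
  con <- file(path, open = "r")
  on.exit(close(con))        # runs on normal return and on error
  readLines(con, n = 1)
}

tmp <- tempfile()
writeLines(c("first", "second"), tmp)
line <- read_first_line(tmp)
unlink(tmp)
```

The protection only works if every function that opens a connection remembers to install the on.exit handler, which is exactly the weakness a general backup mechanism would address.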

The usual approach for managing external resources in garbage-collected systems is to associate a unique SEXPREC with each resource, keep a list of these in the collector, and check at the end of each collection whether any are no longer reachable from the reference graph. Resources that are no longer reachable can then be reclaimed by executing the appropriate code. This mechanism can be abstracted out into what is usually called a finalization mechanism. This allows an arbitrary piece of code to be associated with an allocated object; that code will be executed when the object is no longer reachable. The code can be either compiled C level code or interpreted code; with interpreted code an issue that arises is whether the garbage object should be "resurrected" to allow the finalization code to use its state.
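
A sketch of what finalization with interpreted code can look like, assuming the reg.finalizer function available in current versions of R (it accepts an environment or an external pointer). The names audit, release and open_resource are hypothetical, and the "resource" here is only a name stored in an environment.

```r
# audit records which resources have been released (illustrative only).
audit <- new.env()
audit$released <- character(0)

# The finalizer receives the object being finalized, so it can use its state.
release <- function(e) audit$released <- c(audit$released, e$name)

open_resource <- function(name) {
  handle <- new.env()
  handle$name <- name
  reg.finalizer(handle, release)
  handle
}

h <- open_resource("connection-1")
rm(h)             # the handle is no longer reachable from any variable
invisible(gc())   # a collection gives the finalizer a chance to run
```

Note that the finalizer is handed the unreachable object itself, which is one answer to the resurrection question raised above.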

A related issue usually comes under the heading of weak references. Suppose we want to be able to produce a list of all open files, but at the same time be able to reclaim file pointers that are no longer reachable from program variables. We could keep a list of all open files, but doing so with ordinary references will prevent open files from ever being garbage-collected. The alternative is to wrap the SEXPREC representing the open file in a special container that does not protect its contents from garbage collection. Once the contents are no longer reachable through ordinary references, they are garbage collected and replaced by NULL. The container can either be designed to allow us to query whether its contents are NULL or not, or to return its contents, which will be either the original value or NULL. Containers like this are called weak references or weak boxes.

COM Objects

A COM client interface for R might eventually allow us to write code something like this to extract the rectangular region containing a specified cell from a specified sheet in an Excel workbook object. (This is essentially a direct transliteration of the xlispstat COM interface.)

<COM client example 1>=
getRegion <- function(workbook, sheet = 1, row = 0, col = 0) {
  sheet <- getProperty(workbook, "Worksheets", sheet)
  range <- getProperty(getProperty(sheet, "Cells", row, col), "CurrentRegion")
  return(getProperty(range, "Value"))
}

By taking advantage of dispatching on the $ operator, it could be possible to allow code to be written that looks almost exactly the same as its VB counterpart (for better or worse):

<COM client example 2>=
getRegion <- function(workbook, sheet = 1, row = 0, col = 0) {
  return(workbook$Worksheets(sheet)$Cells(row, col)$CurrentRegion$Value)
}

In both of these examples a series of COM calls is used, each one returning a new COM object reference, before we finally get the value we are interested in. The other intermediate objects are of no interest. However, as these objects are received from COM, their reference counts are incremented. To ensure they are released, we need to be able to decrement their reference counts once we no longer need them. Otherwise a memory leak results. As near as I can tell, the Splus/COM interface leaves management of reference counts to the programmer, so code like this will not work. It would have to be surrounded with reference count releases protected with on.exit code to work correctly. VB does not require this; its memory management system takes care of the reference count management.

To do the same in R, we need a unique SEXPREC to represent each COM object. Once this becomes unreachable a GC can decrement the COM reference count before completing the release. Again this can be handled in the context of a general finalization mechanism.
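
The dispatch on the $ operator used in the second COM example can be tried out without any COM machinery. The sketch below uses a mock class, mock_ref, that is purely hypothetical: it stores properties in a list and does nothing about reference counts.

```r
# mock_ref is a hypothetical stand-in for a COM object reference.
mock_ref <- function(...) {
  structure(list(props = list(...)), class = "mock_ref")
}

# S3 method for `$`: look the name up among the stored properties.
# .subset2 is used so that this method does not dispatch to itself.
`$.mock_ref` <- function(x, name) .subset2(x, "props")[[name]]

# Build a tiny fake object graph: workbook -> sheet -> cell -> region.
region   <- mock_ref(Value = matrix(1:4, nrow = 2))
cell     <- mock_ref(CurrentRegion = region)
sheet1   <- mock_ref(Cells = function(row, col) cell)
workbook <- mock_ref(Worksheets = function(i) sheet1)

getRegion <- function(workbook, sheet = 1, row = 0, col = 0) {
  workbook$Worksheets(sheet)$Cells(row, col)$CurrentRegion$Value
}
value <- getRegion(workbook)
```

Each step of the chain produces an intermediate object that is dropped immediately, which is where a real interface would need the finalization support described above.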

Storage for Foreign Procedure Calls

There are many libraries that are based around collections of functions where one function is called for initializing a data structure then several other functions are called to perform operations on the structure. At the moment, R interfaces have to choose between creating one C function that does the initialization and manipulation in a single .C call and some jury-rigged mechanism for keeping memory at the C level alive across several .C calls (there may be other approaches I am not aware of). Neither approach is very satisfactory. It would be preferable to be able to create a foreign data object with one external call, manage it as the value of an R variable, and pass it on to other external calls as needed.

An example of this situation might be a lower-level interface to POSIX regular expressions. The C interface consists of three functions: regcomp for creating a compiled regular expression, regexec for using the compiled form, and regfree for releasing it. With a finalization mechanism the value returned by an interface to regcomp could be marked to be released with regfree when it is no longer needed.

Storage for Exported Event Handlers

Both Peter's Tcl/Tk interface and Duncan's CORBA and Java interfaces need at some point to allow external systems to trigger callbacks into R. Both need to give the external system some way to identify the R code to run. This requires some method of identifying at least an R closure and some level of lifetime management integration: the closure must not be garbage collected as long as it might be called from the external system.
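
One way to picture the requirement is a registry that hands out identifiers and keeps the registered closures reachable. This is only a sketch at the R level; the names registry, register_callback, invoke_callback and release_callback are hypothetical and do not correspond to any of the interfaces mentioned.

```r
# A hypothetical callback registry (illustrative names throughout).
registry <- new.env()
next_id <- 0

register_callback <- function(f) {
  next_id <<- next_id + 1
  id <- paste0("cb", next_id)
  assign(id, f, envir = registry)   # the registry keeps the closure reachable
  id                                # the external system would hold this id
}

invoke_callback <- function(id, ...) registry[[id]](...)

release_callback <- function(id) rm(list = id, envir = registry)

total <- 0
id <- register_callback(function(x) total <<- total + x)
invoke_callback(id, 5)
invoke_callback(id, 2)
release_callback(id)    # after this the closure may be collected
```

The weak point is the explicit release: if the external system never announces that it is done with an identifier, the closure is kept alive indefinitely.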

Exposing R Objects as Foreign Event Handlers

Taking one step further, both Java and CORBA think in terms of mutable objects and doing things like asking a vector to change its third element from a 7 to a 6. This is not the natural R way to think about things, so some bridging is needed. The most convenient approach from the CORBA/Java point of view is to have a mechanism for actually mutating a particular R object.
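
With what R offers at present, such a mutation can be approximated by holding the vector inside an environment. The following is a minimal sketch under that assumption; new_box, box_set and box_get are hypothetical helper names.

```r
# new_box, box_set and box_get are illustrative names.
new_box <- function(value) {
  box <- new.env()
  box$value <- value
  box
}

box_set <- function(box, i, x) {
  box$value[i] <- x     # updates the vector stored in the shared environment
  invisible(box)
}

box_get <- function(box) box$value

v <- new_box(c(3, 5, 7))
alias <- v              # a second name for the same box
box_set(alias, 3, 6)    # "change the third element from a 7 to a 6"
result <- box_get(v)
```

The change made through alias is visible through v, which is the behavior an external system would expect, though the vector itself is still copied internally when it is modified.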

R Objects with Mutable State

There are also some situations where having mutable state within R alone can be useful. In principle all could be addressed using environments, but this is usually not the most effective approach. The .Alias mechanism might also be an option, but I am a little uncomfortable with its semantics at this point (maybe just because I don't understand it yet).

Direct Array Manipulation

There may be times when it is useful to be able to directly manipulate the contents of an R array. The heuristics used to ensure pass-by-value semantics may in some cases result in very expensive copying; it might be useful to be able to guarantee that this will not happen.

Random Number Generation

Robert and Ross' paper on scoping gives examples of using environments to represent the mutable state of random number generators. An efficient implementation of flexible generator classes would require the use of a more efficient state representation, among other things. Mutable R data structures would be useful for this.
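
To make the environment-based approach concrete, here is a small sketch in that spirit. It is not the code from the paper: make_lcg is an illustrative name, and the multiplier and modulus are those of the Park-Miller "minimal standard" generator, chosen only to keep the example short.

```r
# make_lcg is an illustrative name. Each generator keeps its own state in
# the environment of the closure returned here.
make_lcg <- function(seed = 1) {
  m <- 2147483647    # 2^31 - 1
  a <- 16807
  state <- seed
  function() {
    state <<- (a * state) %% m   # update the generator's private state
    state / m
  }
}

g1 <- make_lcg(42)
g2 <- make_lcg(42)
u1 <- c(g1(), g1(), g1())
u2 <- c(g2(), g2(), g2())
```

The two generators advance independently, and equal seeds give equal streams. Every call goes through an environment lookup and a superassignment, which hints at the efficiency concern raised above.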

Window Representation

This isn't how the Tcl/Tk interface works now but something along these lines might be desirable in the future.

Suppose you create three windows. The third window is named "Fred". You create two variables,

<window interface examples>=
w3<-Windows(3)
wf<-Windows("Fred")

Both variables represent the same window on the screen. You would thus expect the following interaction:

<window interface examples>+=
> w3$backcolor
"white"
> w3$backcolor<-"green"
> w3$backcolor
"green"
> wf$backcolor
"green"

Changing w3's background color also changes wf's since both are the same physical window that happens to be known by different names.

Implementing this behavior is not too hard: the state can be stored in the external window management system. But suppose we want to be able to disconnect from the external system and preserve the state of our connection so we can re-establish it later. Suppose we also want to allow limited changes to inactive windows, such as changing their background color. We would like to have the same behavior as before: changing the background color in the R representation of w3 should change the color in the representation for wf since both represent the same physical window when the connection is re-established.
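
If the state is kept on the R side, one way to get this sharing today is to represent each window by an environment. The sketch below assumes exactly that; window_table, new_window and this version of Windows are hypothetical stand-ins and not an actual interface.

```r
# Hypothetical sketch: each window's state is held in an environment.
window_table <- list()

new_window <- function(name) {
  w <- new.env()
  w$name <- name
  w$backcolor <- "white"
  window_table[[length(window_table) + 1]] <<- w
  invisible(w)
}

# Look a window up by position or by name.
Windows <- function(key) {
  if (is.character(key)) {
    for (w in window_table) if (identical(w$name, key)) return(w)
    stop("no such window")
  }
  window_table[[key]]
}

new_window("One"); new_window("Two"); new_window("Fred")
w3 <- Windows(3)
wf <- Windows("Fred")
w3$backcolor <- "green"
```

Both w3 and wf refer to the same environment, so the color change made through w3 is seen through wf whether or not an external window system is connected.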

Possible Approaches

A Pointer Data Type

A simple step in this direction would be either to allow a representation of external pointers that can be passed around for use by C code, or a bit more generally allow a native data type that can consist of an allocation of memory of arbitrary size. A mechanism for tagging the memory with type information would help to ensure some level of type safety. With such a mechanism the C side of an R interface to regcomp might look something like

<body of simple regcomp interface>=
PROTECT(val = R_NewNativeMem(sizeof(regex_t), regexp_tag));
R_RegisterCFinalizer(val, regfree);
status = regcomp((regex_t *) R_NativeMemAddr(val), ...);

The interface for regexec could then type check the argument and then proceed to use it with something like

<body of simple regexec interface>=
R_CheckNativeMemType(val, regexp_tag);
regexec((regex_t *) R_NativeMemAddr(val), ...);

There are two choices on handling native data like this at the R level. One would be to make this new object another pass-by-reference object, like environments. This would be easy enough, but could lead to some confusion if users attempt to attach attributes to the object. A class attribute would be particularly useful in many situations. Programmers always have the option of wrapping a pass-by-reference object in a list and attaching attributes to the list wrapper, but perhaps we should do that for them right from the start and have the new object be in two parts,

 -----------      ---------------
| attr |  --|--->| native object |
 -----------      ---------------

The header part would obey standard pass-by-value semantics but duplicate would not copy the native object itself. Finalization would be based on the native object.
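
The two-part layout can be imitated at the R level with a list header wrapped around an environment, which gives a feel for how the semantics would work out. This is only an analogy: new_handle and the class name "handle" are hypothetical, and an environment stands in for the native object.

```r
# Hypothetical sketch: a list header (copied as usual) that points at an
# environment (never copied) standing in for the native object.
new_handle <- function(data) {
  native <- new.env()
  native$data <- data
  structure(list(native = native), class = "handle")
}

h1 <- new_handle(1:3)
h2 <- h1                                   # duplicates the header only
attr(h2, "label") <- "second header"       # affects h2's header, not h1's
assign("data", 4:6, envir = h2$native)     # affects the shared payload
```

Attributes set on h2 do not appear on h1, while a change to the payload made through h2 is visible through h1.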

A Reference Wrapper

The idea of having a wrapper object that stops copying could in principle be extended to allow arbitrary R objects to be passed by reference. This could be sufficient for making R objects available to external programs. Whether it would be possible to allow arbitrary R code to operate usefully on the contents of these wrappers and have the effect you intend is not obvious.

A Pass-By-Reference Bit

An alternative to a wrapper based approach for allowing R objects to be passed by reference would be to allow a pass-by-reference bit to be set in any object. This would ensure that the object is not copied. A mechanism for unsetting the bit should also be provided (but not for setting it; it must only be set on a clean copy). In principle this would allow a new object to be created in writable state, initialized by modifying assignments, and then locked. Conceptually this might be the cleanest approach, but I suspect it might be the hardest to implement since the current copying semantics are assumed in all sorts of places in the internal C code.
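
The create, initialize, then lock life cycle has a loose analogue for environments in current versions of R, where lockEnvironment is available. The sketch below shows only that analogue, not a per-object pass-by-reference bit; the name config and its fields are made up.

```r
# Create a mutable object and initialize it by modifying assignments.
config <- new.env()
config$tolerance <- 1e-8
config$max_iter  <- 100

# Lock it: no bindings can be added, and existing ones cannot be changed.
lockEnvironment(config, bindings = TRUE)

# A later attempt to modify the object signals an error.
result <- tryCatch({
  config$tolerance <- 1
  "modified"
}, error = function(e) "locked")
```

Once locked, the object can be shared freely without any risk that one holder changes what another sees.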

Issues

Introducing any kind of sharing will potentially complicate save/load quite a bit. For native data objects, as a first pass, marking native stuff as invalid on restore should do. In the longer term some mechanism for specifying pickle/unpickle operations would probably be needed. If mutability of R objects is allowed, then circular structures through more than environments become possible and something like the old-style save/restore code would be needed.

If copying of R objects is no longer done (conceptually) by default, we will need a mechanism for explicitly requesting a copy when we need one.
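
Environments already raise this question, since assigning one to a new name does not copy it. A sketch of an explicit copy operation for that case follows; copy_env is a hypothetical helper, it relies on list2env from current versions of R, and the copy it makes is shallow.

```r
# copy_env is an illustrative helper: it copies the bindings of e into a
# fresh environment with the same parent (a shallow copy).
copy_env <- function(e) {
  list2env(as.list(e, all.names = TRUE),
           envir = new.env(parent = parent.env(e)))
}

original <- new.env()
original$values <- c(1, 2, 3)

shared <- original              # another name for the same object
copy   <- copy_env(original)    # an explicitly requested, independent copy

original$values[2] <- 20
```

The change is visible through shared but not through copy. Any environments stored inside the original would still be shared with the copy, so a deep copy would need more work.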
