In principle R uses pass-by-value semantics in its function calls. If two different variable names x and y start out with equal values, there is (almost) no way for R-level operations on the value of x to change the value of y.
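A minimal sketch of what this means in practice; the names x, y, z and increment_first are illustrative only:

```r
x <- c(1, 2, 3)
y <- x              # y starts out with the same value as x
x[1] <- 100         # changing x does not touch y

increment_first <- function(v) {
  v[1] <- v[1] + 1  # changes only the function's own copy of the argument
  v
}
z <- increment_first(y)

print(x)  # 100 2 3
print(y)  # 1 2 3: unchanged by the assignment to x and by the call
print(z)  # 2 2 3
```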
Pass-by-value is usually used by functional programming languages. It eliminates the possibility of side effects on values and thus eliminates a whole class of bugs that are often very difficult to track down. S and R are, however, a bit unusual among functional languages in relying very heavily on assignment to variables. In both S and R, variables represent storage locations for their values, not the values themselves. This reduces some of the benefits of the functional model, especially when combined with the overloading of assignment, which is used both for creating new variable bindings and for changing existing ones.
Pure functional semantics can be very helpful for reasoning about programs, but many ideas are more easily expressed using concepts of mutable state. This is particularly true when a computation must deal with an external object, for example a piece of memory to be operated on by a library of external functions, a window on a workstation, or a database.
External objects also often have lifetime management issues associated with them. Connections to a database, for example, should be shut down when they are no longer needed. The user code may be primarily responsible for managing lifetimes, but it may be useful to provide a backup, or to transfer the responsibility for lifetime management entirely to the basic memory management system.
All of these points suggest that it may be worth considering bringing pass-by-reference semantics into R in a limited and carefully controlled way. The purpose of this note is to start a discussion of how this might be done.
The interaction of attributes with environments was recently discussed on some of the mailing lists. For symbols, here are some examples:
<symbol examples>=
> x<-as.name("z")
> y<-x
> attr(x,"fred")<-"bob"
> x
z
attr(,"fred")
[1] "bob"
> y
z
attr(,"fred")
[1] "bob"
An additional peculiarity, perhaps a bug in the as.vector code that implements as.name:
<symbol examples>+=
> x<-as.name("z")
> x
z
The attributes of existing symbols are stripped by as.name. For consistency, it might be a good idea to make sure the attributes are preserved or that assignment of attributes to symbols be made illegal as it is for R_NilValue.
Similar things happen for specials (e.g. .Internal) and builtins (e.g. proc.time).
These four exceptions to pass by value semantics could be eliminated by ruling out assigning attributes in these cases. This would probably not be an issue for symbols, specials, or builtins, but name attributes on environments are currently being used for naming environments on the global search path.
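For environments, the same sharing can be observed directly at the R level. A minimal sketch, assuming the behaviour of current R in which an environment is never duplicated on assignment; the names e1, e2 and count are illustrative only:

```r
e1 <- new.env()
e2 <- e1                        # both names refer to the same environment

attr(e1, "fred") <- "bob"       # attribute set through e1 ...
print(attr(e2, "fred"))         # ... is visible through e2: "bob"

assign("count", 1, envir = e1)  # the same holds for bindings
print(get("count", envir = e2)) # 1
```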
.Alias mechanism. For now that may already be saying too much :-).
on.exit code. Special purpose mechanisms can be added for handling each different kind of connection, but it would probably be better to have a general mechanism that can be re-used in many situations.
The usual approach for managing external resources in garbage-collected systems is to associate a unique SEXPREC with each resource, keep a list of these in the collector, and to check at the end of each collection whether any are no longer reachable from the reference graph. Resources that are no longer reachable can then be reclaimed by executing the appropriate code. This mechanism can be abstracted out into what is usually called a finalization mechanism. This allows an arbitrary piece of code to be associated with an allocated object; that code will be executed when the object is no longer reachable. The code can be either compiled C level code or interpreted code; with interpreted code an issue that arises is whether the garbage object should be "resurrected" to allow the finalization code to use its state.
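Current versions of R provide a facility of this kind for environments and external pointers as reg.finalizer. A minimal sketch of interpreted finalization code under that assumption; open_resource, close_resource and the closed log are hypothetical names, and the environment stands in for a real external resource:

```r
closed <- character(0)           # log of resources that have been released

# Finalizer: receives the unreachable object, so its state is still usable.
close_resource <- function(h) {
  closed <<- c(closed, get("name", envir = h))
}

open_resource <- function(name) {
  h <- new.env()                 # unique object representing the resource
  assign("name", name, envir = h)
  reg.finalizer(h, close_resource)
  h
}

con <- open_resource("db-connection")
rm(con)                          # drop the only reference
invisible(gc())                  # request a collection; finalizers run afterwards
print(closed)
```

The finalizer is passed the object being reclaimed, which is one answer to the resurrection question raised above: the object's state remains available while the finalization code runs. When the collector actually runs is not under the program's control, so code should not depend on the exact timing.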
A related issue usually comes under the heading of weak references. Suppose we want to be able to produce a list of all open files, but at the same time be able to reclaim file pointers that are no longer reachable from program variables. We could keep a list of all open files, but doing so with ordinary references will prevent open files from ever being garbage-collected. The alternative is to wrap the SEXPREC representing the open file in a special container that does not protect its contents from garbage collection. Once the contents are no longer reachable through ordinary references, they are garbage collected and replaced by NULL. The container can either be designed to allow us to query whether it is NULL or not, or to return its contents, which will be either the original value or NULL. Containers like this are called weak references or weak boxes.
xlispstat COM interface).
<COM client example 1>=
getRegion <- function(workbook, sheet = 1, row = 0, col = 0) {
    sheet <- getProperty(workbook, "Worksheets", sheet)
    range <- getProperty(getProperty(sheet, "Cells", row, col),
                         "CurrentRegion")
    return(getProperty(range, "Value"))
}
By taking advantage of dispatching on the $ operator, it could be possible to allow code to be written that looks almost exactly the same as its VB counterpart (for better or worse):
<COM client example 2>=
getRegion <- function(workbook, sheet = 1, row = 0, col = 0) {
    return(workbook$Worksheets(sheet)$Cells(row, col)$CurrentRegion$Value)
}
In both of these examples a series of COM calls is used, each one returning a new COM object reference, before we finally get the value we are interested in. The other intermediate objects are of no interest. However, as these objects are received from COM, their reference counts are incremented. To ensure they are released, we need to be able to decrement their reference counts once we no longer need them. Otherwise a memory leak results. As near as I can tell, the Splus/COM interface leaves management of reference counts to the programmer, so code like this will not work. It would have to be surrounded with reference count releases protected with on.exit code to work correctly. VB does not require this; its memory management system takes care of the reference count management.
To do the same in R, we need a unique SEXPREC to represent each COM object. Once this becomes unreachable, a GC can decrement the COM reference count before completing the release. Again this can be handled in the context of a general finalization mechanism.
C functions that does the initialization and manipulation in a single .C call and some jury-rigged mechanism for keeping memory at the C level alive across several .C calls (there may be other approaches I am not aware of). Neither approach is very satisfactory. It would be preferable to be able to create a foreign data object with one external call, manage it as the value of an R variable, and pass it on to other external calls as needed.
An example of this situation might be a lower-level interface to POSIX regular expressions. The C interface consists of three functions: regcomp for creating a compiled regular expression, regexec for using the compiled form, and regfree for releasing it. With a finalization mechanism the value returned by an interface to regcomp could be marked to be released with regfree when it is no longer needed.
.Alias mechanism might also be an option, but I am a little uncomfortable with its semantics at this point (maybe just because I don't understand it yet).
Suppose you create three windows. The third window is named "Fred". You create two variables,
<window interface examples>=
w3<-Windows(3)
wf<-Windows("Fred")
Both variables represent the same window on the screen. You would thus expect the following interaction:
<window interface examples>+=
> w3$backcolor
"white"
> w3$backcolor<-"green"
> w3$backcolor
"green"
> wf$backcolor
"green"
Changing w3's background color also changes wf's since both are the same physical window that happens to be known by different names.
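The interaction above can be imitated in current R by keeping each window's state in an environment and dispatching on the $ operator. This is only a sketch: Windows, window_names and window_state are hypothetical stand-ins for a real window system, and the "window" class is invented for illustration.

```r
# Hypothetical stand-in for the external window system: one shared
# environment of state per window, addressable by number or by name.
window_names <- c("One", "Two", "Fred")
window_state <- lapply(window_names, function(n) {
  state <- new.env()
  assign("backcolor", "white", envir = state)
  state
})

Windows <- function(id) {
  index <- if (is.character(id)) match(id, window_names) else id
  structure(list(state = window_state[[index]]), class = "window")
}

# Property access and assignment go to the shared state, not to the handle.
`$.window` <- function(x, name) {
  get(name, envir = unclass(x)$state, inherits = FALSE)
}
`$<-.window` <- function(x, name, value) {
  assign(name, value, envir = unclass(x)$state)
  x
}

w3 <- Windows(3)
wf <- Windows("Fred")
w3$backcolor <- "green"
print(wf$backcolor)   # "green": both handles refer to the same state
```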
Implementing this behavior is not too hard: the state can be stored in the external window management system. But suppose we want to be able to disconnect from the external system and preserve the state of our connection so we can re-establish it later. Suppose we also want to allow limited changes to inactive windows, such as changing their background color. We would like to have the same behavior as before: changing the background color in the R representation of w3 should change the color in the representation for wf since both represent the same physical window when the connection is re-established.
regcomp might look something like
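<body of simple regcomp interface>=
/* allocate collector-managed storage for the compiled pattern */
PROTECT(val = R_NewNativeMem(sizeof(regex_t), regexp_tag));
/* release the compiled pattern with regfree once val is unreachable */
R_RegisterCFinalizer(val, regfree);
status = regcomp((regex_t *) R_NativeMemAddr(val), ...);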
The interface for regexec could then type check the argument and then proceed to use it with something like
<body of simple regexec interface>=
R_CheckNativeMemType(val, regexp_tag);
regexec((regex_t *) R_NativeMemAddr(val), ...);
There are two choices on handling native data like this at the R level. One would be to make this new object another pass-by-reference object, like environments. This would be easy enough, but could lead to some confusion if users attempt to attach attributes to the object. A class attribute would be particularly useful in many situations. Programmers always have the option of wrapping a pass-by-reference object in a list and attaching attributes to the list wrapper, but perhaps we should do that for them right from the start and have the new object be in two parts,
    -----------      ---------------
    | attr | --|--->| native object |
    -----------      ---------------
The header part would obey standard pass-by-value semantics but duplicate would not copy the native object itself. Finalization would be based on the native object.
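The two-part layout can be approximated today with an ordinary list as the header and an environment standing in for the native object. A minimal sketch under that assumption; native, the "native_handle" class and the pattern field are illustrative names, not part of any proposed interface:

```r
# Stand-in for the native object: an environment is never duplicated.
native <- new.env()
assign("pattern", "a+b", envir = native)

# Header: an ordinary list carrying attributes and a pointer to the native part.
a <- structure(list(ref = native), class = "native_handle")
b <- a                                   # copies the header only
attr(b, "note") <- "set on b only"       # the header obeys pass-by-value

assign("pattern", "c*d", envir = b$ref)  # change made through b's header

print(attr(a, "note"))                   # NULL: a's header is untouched
print(get("pattern", envir = a$ref))     # "c*d": the native part is shared
```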
save/load quite a bit. For native data objects, as a first pass, marking native stuff as invalid on restore should do. In the longer term some mechanism for specifying pickle/unpickle operations would probably be needed. If mutability of R objects is allowed, then circular structures through more than environments become possible and something like the old-style save/restore code would be needed.