Package-Wide Variables/Cache in R Packages

It’s often useful to have a variable shared between all the functions in an R package. One obvious example is a package-wide cache. I’ve encountered this situation multiple times and always forget at least one important step in the process, so I thought I’d document it here for myself and anyone else who might run into the same issue. I set up a simple project on GitHub to demonstrate the various attempts you might make to solve the problem and, ultimately, a solution: https://github.com/trestletech/RCache. The rest of this article presumes some knowledge of authoring R packages. If that’s not you, check out RStudio’s guide to authoring R packages.

The fundamental problem comes down to R’s management of environments. (I’ll introduce the basics here, but for a thorough discussion, be sure to check out Hadley Wickham’s writings on the topic.) An environment is responsible for mapping names to the data they refer to. When I execute

> a <- "hello"
> a
[1] "hello"

there was an environment responsible for initially associating the length-one character vector containing the text hello with the variable named a. Later, when I go to reference a variable by name, an environment is responsible for retrieving the data I had associated with this variable. Functions have their own environments in which they run, which is why you observe behavior like:

> x <- 1
> foo <- function(){ 
  print(x)
  x <- 2 
  print(x) 
  x 
} 
> x 
[1] 1
> foo() 
[1] 1 
[1] 2 
[1] 2
> x
[1] 1

You can see that the variable named x, depending on which environment it’s in, can take one of two values in this example. Inside of the foo function, the x variable initially only existed in the parent environment, so it took on the value of x in that environment and was 1. Later, it was assigned a value of 2 and retained that value when it was returned. Outside of the function, however, x was associated with the value 1, and this binding was maintained despite another variable named x being created inside of a function’s environment.
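To make the "which environment is x in?" question concrete, here is a small sketch that asks R directly. The names `before` and `after` are just illustrative; `exists()` with `inherits = FALSE` checks only the function's own environment rather than walking up to its parents.

```r
x <- 1

foo <- function() {
  # Before the assignment there is no local x; a plain lookup of x would
  # fall through to the enclosing environment.
  before <- exists("x", envir = environment(), inherits = FALSE)

  x <- 2

  # The assignment created a new, local x inside foo's own environment.
  after <- exists("x", envir = environment(), inherits = FALSE)

  c(local_before = before, local_after = after)
}

result <- foo()
print(result)
print(x)  # the outer x still holds 1
```

The outer x is never touched; the function simply gained its own binding with the same name.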

This is a powerful feature of R, when properly understood. It allows you to create variables that exist only inside a function or, more relevant to today’s discussion, only within a particular package. So it should be possible to leverage these environments to create a variable named cache which is accessible to all the functions in my package, but is unlikely to be accidentally overwritten or modified by the user or, even more likely, to collide with another variable named cache used in some other package.

Example Package – rev. a804

For the rest of the demonstration, we’ll use the example of creating an R package which merely downloads files from the Internet. Of course, it might be appropriate to cache data in such a package so that the same URL won’t have to be retrieved remotely multiple times. One simple approach would be to use a list to associate character strings (such as URLs) with some data (such as the content of the web page). We can first create a function which will download data without using any cache:

#' Download a file.
#'
#' @importFrom httr GET
#' @importFrom httr content
#' @export
download <- function(url){
  content(GET(url))
}

 

Example Package – rev. a81e

Creating a variable that is available within all functions of your package is as simple as binding a variable to data outside of any functions in a .R file in your package.

cache <- list()

 

One might think we could then put data into that cache variable by assigning named elements to it from any function in the package that can access it. For instance:

download <- function(url){
  if (!is.null(cache[[url]])){
    return(cache[[url]])
  }

  file <- content(GET(url))
  cache[[url]] <- file

  file
}

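Unfortunately, just as x did inside foo in the earlier example, the assignment here creates a separate variable named cache which exists only inside this function call: R copies the package-level list, modifies the copy, and discards it when the function returns. If you were to build the package and run it, you’d find that the code: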

down <- download("http://www.gutenberg.org/cache/epub/2500/pg2500.txt")
RCache:::cache

would successfully download the book, Siddhartha, but the package cache would be empty. (If you’re unfamiliar with the ::: operator, it checks inside of the package — named on the left — for a variable — named on the right and returns it. So we can inspect the cache variable from code outside of our package.)
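You don’t need to build a package to see this behavior. The following is a minimal sketch at the top level of a script; `store` is a hypothetical helper standing in for the assignment inside download.

```r
cache <- list()

store <- function(key, value) {
  # This modifies a *local* copy of cache; the outer list is not changed.
  cache[[key]] <- value
  length(cache)
}

local_length <- store("a", 1)   # length of the copy inside the function
outer_length <- length(cache)   # length of the original list

print(local_length)
print(outer_length)
```

Inside the function the list has one element, but the outer cache is still empty once the call returns.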

Special Assignment – rev. 5548

In order to alter a variable created in the parent environment, one must use the special assignment operator, <<-. This will adjust a binding not in the current environment, but in a parent environment. So we can adjust our line in which we assign a value to the cache variable to use this operator:

   
  file <- content(GET(url))
  cache[[url]] <<- file

However, if we run this, we’ll find that we get an interesting error:

down <- download("http://www.gutenberg.org/cache/epub/2500/pg2500.txt")
Error in cache[[url]] <<- file :
  cannot change value of locked binding for 'cache'

What’s the deal? R has a concept of “locked bindings” which lets you forbid (or at least discourage) changes to variables by locking either a particular variable binding or an entire environment. In this case, the cache binding has been locked (R locks the bindings in a package’s namespace once the package has been loaded) and can’t be altered by the package’s functions. So we’ll need to take a different approach altogether.
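The same error can be reproduced outside of a package. In this sketch, `pkg` is a hypothetical stand-in for a package namespace (it is just an ordinary environment), and I lock the binding by hand with `lockBinding()` to imitate what happens when a package is loaded.

```r
# A plain environment playing the role of a package namespace (an assumption
# for illustration; real namespaces are set up by R when a package loads).
pkg <- new.env()
assign("cache", list(), envir = pkg)

store <- function(key, value) {
  cache[[key]] <<- value   # modify the binding found in an enclosing environment
  invisible(NULL)
}
environment(store) <- pkg  # make pkg the function's enclosing environment

# While the binding is unlocked, <<- works as hoped.
store("a", 1)
stored_before_lock <- length(get("cache", envir = pkg))

# Lock the binding, as R does for package namespaces, and try again.
lockBinding("cache", pkg)
msg <- tryCatch({
  store("b", 2)
  "no error"
}, error = function(e) conditionMessage(e))

print(stored_before_lock)
print(msg)
```

So `<<-` itself is doing the right thing; it is the lock on the binding that gets in the way.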

Environments – rev. 320e

It seems we can properly access a package-wide variable from within a function of a package, but we’re not allowed to overwrite it (or create new variables at that level from within a function). Perhaps we could leverage environments to solve this problem. As it turns out, environments were likely a cleaner solution to our problem all along. Instead of creating a cache list, we can create a cache environment:

cacheEnv <- new.env()

As long as this environment is created in one of our package’s .R files and not inside of a function, it will be accessible across our entire package. The cacheEnv binding itself is still locked, but that no longer matters: we never reassign cacheEnv, we only change the contents of the environment it refers to. We can do all the things with this environment that we’re used to doing in our regular R environment (whether we knew we were using environments or not): create new variables (assign), modify existing variables (assign), remove variables (rm), retrieve data associated with a variable (get), list the variables in an environment (ls), etc.

All of the functions mentioned above accept an envir argument which specifies in which environment you’d like to perform the operation. The default is your current environment, but you could just as easily point these functions at your new environment to do something like assign(url, file, envir=cacheEnv) to assign the value currently stored in the file variable to a new variable whose name is the value currently contained in the url variable within our cache environment. Then we could use get(url, envir=cacheEnv) to get the variable whose name matches the current value of the url variable in the cache environment.

For instance:

> url <- "http://mytext.com"
> file <- "This is the content I downloaded"
> cacheEnv <- new.env()
> assign(url, file, envir=cacheEnv)
> get(url, envir=cacheEnv)
[1] "This is the content I downloaded"
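The other helpers work the same way. Here is a short sketch that lists, checks for, and removes an entry in a cache environment, using the same made-up URL and content as above.

```r
cacheEnv <- new.env()
url <- "http://mytext.com"
assign(url, "This is the content I downloaded", envir = cacheEnv)

# List the names currently stored in the cache environment.
keys <- ls(cacheEnv)

# Check for a key; inherits = FALSE keeps the search inside cacheEnv only.
found <- exists(url, envir = cacheEnv, inherits = FALSE)

# Remove the entry. rm() takes the names as a character vector via `list`.
rm(list = url, envir = cacheEnv)
remaining <- length(ls(cacheEnv))

print(keys)
print(found)
print(remaining)
```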

Now we can incorporate this into our package by changing the download function to use these facilities:

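download <- function(url){
  # inherits=FALSE restricts the lookup to cacheEnv itself, so a url that
  # happens to match a name in an enclosing environment isn't a false hit.
  if (exists(url, envir=cacheEnv, inherits=FALSE)){
    return(get(url, envir=cacheEnv, inherits=FALSE))
  }

  # Cache miss: fetch remotely, then store the result for next time.
  file <- content(GET(url))
  assign(url, file, envir=cacheEnv)

  file
}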

And we finally have a working cache. When a URL is requested via our package’s download function, the data will first be stored in this package’s cacheEnv environment before being returned. The next time that URL is requested, the cacheEnv environment will be checked to see if we already downloaded the content of that URL. Because a variable by that name already exists in our cacheEnv environment, the value will be pulled from there and returned rather than retrieved remotely.
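If you’d like to convince yourself that the second request really is served from the cache, here is a sketch that runs without a network connection. Everything beyond cacheEnv is an assumption for illustration: `cached_download` takes a `fetch` argument standing in for content(GET(url)), `fake_fetch` counts how often it is called, and `clear_cache` is a hypothetical helper that is not part of the example package.

```r
cacheEnv <- new.env()

# Same logic as download(), but the remote fetch is passed in as `fetch`
# so that it can be replaced with a fake for demonstration.
cached_download <- function(url, fetch) {
  if (exists(url, envir = cacheEnv, inherits = FALSE)) {
    return(get(url, envir = cacheEnv, inherits = FALSE))
  }
  value <- fetch(url)
  assign(url, value, envir = cacheEnv)
  value
}

# Hypothetical helper: empty the cache by removing every stored name.
clear_cache <- function() {
  rm(list = ls(cacheEnv, all.names = TRUE), envir = cacheEnv)
  invisible(NULL)
}

# A fake fetcher that records how many times it was actually invoked.
calls <- 0
fake_fetch <- function(url) {
  calls <<- calls + 1
  paste("content of", url)
}

first  <- cached_download("http://example.com/a", fake_fetch)
second <- cached_download("http://example.com/a", fake_fetch)
calls_after_two <- calls      # the second request should not have fetched

clear_cache()
third <- cached_download("http://example.com/a", fake_fetch)
calls_after_clear <- calls    # after clearing, the fetch happens again

print(calls_after_two)
print(calls_after_clear)
```

A way to clear the cache is worth considering in a real package, since otherwise cached content lives for the whole R session.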

Conclusion

Hopefully you’ve learned a thing or two about environments in R and how to use them from within R packages. Environments can be very valuable tools for advanced R programmers. They’re one of the tried-and-true ways to replicate “pass-by-reference” programming in R (though that’s typically unexpected — thus discouraged — behavior for an R object). They also have some unique lookup properties being built on hashes that allow them to more expediently map character strings to data when there is a very large number of possible keys.
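As a quick illustration of the “pass-by-reference” point, this sketch passes a list and an environment to two otherwise identical functions (the function and variable names are made up for the example).

```r
mark_list <- function(lst) {
  lst$hit <- TRUE   # changes a local copy only
  invisible(NULL)
}

mark_env <- function(env) {
  env$hit <- TRUE   # changes the caller's environment in place
  invisible(NULL)
}

my_list <- list()
my_env <- new.env()

mark_list(my_list)
mark_env(my_env)

list_changed <- !is.null(my_list$hit)
env_changed <- isTRUE(my_env$hit)

print(list_changed)
print(env_changed)
```

The list comes back unchanged, while the environment keeps the modification made inside the function.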

Happy packaging!

5 Comments

  1. Alex Ishkin says:

    Hi Jeff,

    Thanks for the post, it is very clearly written! I use the environment-based cache in my packages as well. In my code, the cache environment is created by a function and explicitly placed into the global environment (so it is in the search path for whatever operations are executed in the console). I’d like to ask: if you just create the cache env somewhere in an R script (I guess in this case it is created when the package is loaded?), is it visible from the global environment or only within the namespace of the package?

    Thanks,
    Alex

    • Jeff Allen says:

      Thanks for the kind words, Alex. The way I have it set up, the cache variable is in the package’s namespace, not the global environment. So it’s accessible from within my package’s code, but not elsewhere. For instance:


      > library(RCache)
      > down <- download("http://www.gutenberg.org/cache/epub/2500/pg2500.txt")
      > cacheEnv
      Error: object 'cacheEnv' not found
      > RCache:::cacheEnv

  2. Hansi says:

    What about?

    item <- something()
    options(MY.LIB.CACHEDITEM=item)

    [snip]

    needItemAgain <- getOption("MY.LIB.CACHEDITEM")
    if(is.null(needItemAgain)){needItemAgain <- something()}

    • Jeff Allen says:

      I’d have a couple of concerns with that approach.

      1. One of my goals was to keep the cache variable/environment out of the global namespace with the intent that we don’t want the user or another package accidentally breaking our cache. You were careful to preface your variable name with MY.LIB, though, so you bypass that argument. This argument may just descend to a point at which I’m blindly subscribing to “best practice”-ism.
      2. I’d worry about the performance of this approach at scale. One of the benefits of environments being hashes is that they offer O(1) lookup regardless of size. I believe options are list-based (correct me if I’m wrong), which don’t offer such performance at scale. This may be a more theoretical argument, however, as the couple of times I’ve actually compared them, the lists outperformed the hashes until you got to a few hundred/thousand elements.

      All in all, I think this would be a perfectly functional approach (no pun intended) if you’re willing to accept some “pollution” in the global namespace.

      • Hansi says:

        1) Sure I guess it’s just a question of ownership of the cache. If releasing a public package I think your method would most likely be better received.

        2) Not 100% on this but assignment can be direct with name or as a list for multiple assignment. Internally I think it’s handled in the same/similar manner as environments: http://svn.r-project.org/R/trunk/src/main/options.c so there shouldn’t be any overhead.
