Untitled

… and other benefits

reproducibility
learning R
tutorial
Author

Aaron Wenger

Published

January 24, 2024

Organizing, managing, and executing reproducible research projects is hard because there are so many possible reasons why a project cannot be reproduced.

- No data? Not reproducible.
- No code? Not reproducible.
- Software is unavailable? Not reproducible.
- Computational environment is unspecified? Maybe.
- Software versions are unknown? Eventually no.

Just when we think we have one of these threats to reproducibility whipped, some unexpected development or complication throws it all back into doubt. I think this is especially true for someone like me: an intermediate R user who has no background in computer science or programming other than what I have learned on my own in the last couple of years. This post reflects on how I learned to use renv as part of a reproducible research workflow and how I have handled a couple of problems along the way.

Welcome renv

The renv package solves (more or less) the software versioning issue. With a couple commands all R package versions are recorded in a file, allowing the R package environment of the project to be rebuilt from scratch. I have used renv for several months now and have appreciated how much it has simplified the task of installing and staying current with package updates. It has also streamlined the storage of R packages on my computer.

In a nutshell, here’s how it works. Once renv is installed, init() creates a lockfile (renv.lock) that holds package version information and details about how packages were installed. For example, if a package is installed from GitHub, that is recorded along with the username that owns the repository. When the project is shared, the lockfile can be read by restore() to install all packages exactly as they are recorded. The snapshot() function adds additional packages to the lockfile as they are installed in the project, and status() reports packages that need to be installed and/or added to the lockfile. It’s really that easy!
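As a rough sketch of that cycle (certainly not the only way to use renv), the calls below show when each function typically comes into play. The package name "somepackage" is a placeholder, and the run_now flag is purely illustrative: it is only there so that sourcing the snippet by accident changes nothing. In practice each line is typed at the console at the moment described in its comment.

```r
# Guard flag (illustrative, not part of renv): keeps the calls below from
# running when this snippet is sourced as a whole.
run_now <- FALSE

if (run_now) {
  # Once per project: create renv.lock and a private project library.
  renv::init()

  # While working: install a package into the project library.
  renv::install("somepackage")

  # Report differences between what is installed, used, and recorded.
  renv::status()

  # Record the currently installed package versions in renv.lock.
  renv::snapshot()

  # On another machine or a fresh copy of the project:
  # reinstall the packages exactly as renv.lock lists them.
  renv::restore()
}
```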

First Headache

Learning to use renv and incorporating it into my workflow wasn’t without a couple of headaches. At the start, I couldn’t use renv::install(), which works similarly to base R’s install.packages() but is more flexible and intuitive. I would run renv::install("somepackage") and an error would be returned, to the effect that “package ‘somepackage’ is not available”. Yet utils::install.packages(), the “base” package installation function (utils is part of the R distribution), worked just fine.

Apparently many others have had this problem. What it came down to for me is that R and renv were using different download methods. These methods can be checked using getOption("download.file.method") and renv:::renv_download_method() for R and renv, respectively. It seems that for me (on Windows) the two available methods are curl and libcurl. These are a closely related command-line tool and software library, both created and maintained by the cURL (Client for URLs) project, which enable internet file transfers.
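A small diagnostic along these lines can put the two settings side by side. This is only a sketch: download_methods() is a hypothetical helper name, and renv:::renv_download_method() is an internal renv function (note the triple colon), so it may change or disappear between renv versions, which is why the call is wrapped in tryCatch().

```r
# Hypothetical helper: collect what R and renv each report about downloads.
download_methods <- function() {
  renv_method <- NA_character_
  if (requireNamespace("renv", quietly = TRUE)) {
    # Internal renv function, so guard against it changing in future versions.
    renv_method <- tryCatch(
      renv:::renv_download_method(),
      error = function(e) NA_character_
    )
  }

  list(
    # If the option is unset, download.file() falls back to "auto".
    r_option          = getOption("download.file.method", default = "auto"),
    # Was this build of R compiled with libcurl support?
    libcurl_supported = unname(capabilities("libcurl")),
    # Is a curl executable available on the PATH?
    curl_on_path      = unname(nzchar(Sys.which("curl"))),
    # What renv says it will use (NA if renv is missing or the call fails).
    renv_method       = renv_method
  )
}

str(download_methods())
```

If r_option and renv_method disagree, that mismatch is the kind of situation described above.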

I resolved this problem by including one line in my RStudio project .Rprofile file: Sys.setenv(RENV_DOWNLOAD_FILE_METHOD = "libcurl"). Because it lives in the project’s .Rprofile, this command runs every time the project is loaded in RStudio. A more robust solution, and the one I use now, dynamically retrieves the download method currently used by R: Sys.setenv(RENV_DOWNLOAD_METHOD = getOption("download.file.method")) (see this Stack Overflow question).
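One caveat with the dynamic version: outside RStudio the download.file.method option can be unset, in which case getOption() returns NULL and Sys.setenv() would most likely fail. A slightly more defensive sketch, assuming a hypothetical helper named sync_renv_download_method(), only sets the variable when the option holds a usable value:

```r
# Hypothetical helper: copy R's download method to renv only when it is set.
sync_renv_download_method <- function(method = getOption("download.file.method")) {
  # Accept a single, non-missing, non-empty string and nothing else.
  ok <- is.character(method) && length(method) == 1L &&
    !is.na(method) && nzchar(method)

  if (ok) {
    Sys.setenv(RENV_DOWNLOAD_METHOD = method)
  }

  # TRUE if the environment variable was set, FALSE if it was left alone.
  invisible(ok)
}

sync_renv_download_method()
```

When the option is unset, the call does nothing and renv falls back to its own default.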

Second Headache

That first problem seemed to happen again some months later. Again, renv::install() would return an error stating that the package was not available. Apparently, the repository being targeted by renv was the issue. I don’t know what caused it, but I am guessing some update in the backend of renv or its dependencies was responsible.

The call getOption("repos") returns the repository currently being used, which for me was something like https://packagemanager.posit.co/cran/. I resolved the problem by manually setting the “repos” option in my project’s .Rprofile file. Thus, after solving these two headaches, I start new RStudio projects using renv with my .Rprofile looking like this:

source("renv/activate.R")

Sys.setenv(RENV_DOWNLOAD_METHOD = getOption("download.file.method"))

options(repos = c(CRAN = "https://cloud.r-project.org"))
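To confirm that the repository setting took effect, and that a package is actually listed there, something like the following may help. It is a sketch only: package_is_listed() is a hypothetical helper, "somepackage" is a placeholder, and the lookup needs an internet connection, so the helper is defined here but not called.

```r
# Same repository URL as in the .Rprofile above.
options(repos = c(CRAN = "https://cloud.r-project.org"))

# Show which repositories R will now query.
print(getOption("repos"))

# Hypothetical helper: is `pkg` in the package index of the given repositories?
# Downloads the index, so it requires network access.
package_is_listed <- function(pkg, repos = getOption("repos")) {
  index <- utils::available.packages(repos = repos)
  pkg %in% rownames(index)
}
```

Calling package_is_listed("somepackage") returns FALSE when the repository does not list the package, which mirrors the “not available” error from a different angle.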

Unanticipated Benefits

The above problems were really not too hard to resolve, even for someone with a limited software knowledge base like me. The documentation and support provided by Kevin Ushey, principal developer of renv, and the rest of the Posit team really is superb. I have no doubt that this package will remain stable and functioning going forward.

As I have adopted renv into my normal project workflow I have discovered a few benefits beyond supporting research reproducibility. First, the renv::install() and renv::update() functions are very smooth and intuitive, much better than utils::install.packages(). In particular, I run renv::update() every week or month to automatically update all project packages - including renv itself! Both functions are faster than the utils function and provide more informative errors.
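A periodic update routine might look like the sketch below. update_project() is a hypothetical wrapper, not an renv function. It assumes a reasonably recent renv in which update() accepts check = TRUE to report available updates without installing them, and it follows a real update with snapshot() so the lockfile records the new versions.

```r
# Hypothetical wrapper around a periodic update; defined but not called here.
update_project <- function(dry_run = TRUE) {
  if (!requireNamespace("renv", quietly = TRUE)) {
    stop("renv is not installed")
  }

  if (dry_run) {
    # Only report which packages have newer versions available.
    return(renv::update(check = TRUE))
  }

  # Install the newer versions into the project library ...
  renv::update()
  # ... then record them in renv.lock.
  renv::snapshot()
}
```

Running update_project() previews the changes, and update_project(dry_run = FALSE) applies them.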

A second benefit is the efficient caching of R packages on my computer. After the first hundred or so installed packages, the disk space required becomes noticeable. Before renv, I occasionally installed packages twice in different locations and accumulated different versions over time. renv keeps a common cache for the device which it then links to in individual renv projects. This means that for a given package and version, it will only ever be installed once. My computer only has a 200GB hard drive so saving a GB here or there is very nice.
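To see where that shared cache lives, renv::paths$cache() returns its location. The helper below is a hypothetical convenience wrapper, and it assumes that each top-level folder inside the cache directory corresponds to one cached package, so the count is approximate.

```r
# Hypothetical helper: locate the renv cache and roughly count its packages.
cache_summary <- function() {
  if (!requireNamespace("renv", quietly = TRUE)) {
    message("renv is not installed; nothing to report")
    return(invisible(NULL))
  }

  cache_dir <- renv::paths$cache()

  # Assumption: one top-level folder per cached package.
  packages <- list.dirs(cache_dir, full.names = FALSE, recursive = FALSE)

  list(cache_dir = cache_dir, n_packages = length(packages))
}

cache_summary()
```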

A third benefit is the quick installation of new-to-me packages. Often dependencies are already built and stored in my cache, so all renv has to do is link to the cache for that dependency. Thus, downloads and builds are minimized and new packages are ready to use in mere moments.

I think all scientists and researchers who use R and who are committed to research reproducibility should adopt renv into their workflow.