--- title: "An introduction to simpleCache" author: "Nathan Sheffield" date: "`r Sys.Date()`" vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{An introduction to simpleCache} output: knitr:::html_vignette --- # An introduction to simpleCache ## Your first cache `simpleCache` has 2 main use cases: First, it can help you pick up where you left off in an R session, and second, it can help you parallelize code by enabling you to share results across R sessions. The workhorse of `simpleCache` is the eponymous `simpleCache()` function, which in the simplest case requires just two parameters: a cache name, and a block of code. The cache name should be considered unique and its underlying object immutable, while the block of code (or *instruction*) is the `R` code that generates the object you wish to cache. But before we start creating caches, it's important to tell `simpleCache` where to store the caches. `simpleCache` uses a global variable (`RCACHE.DIR`) for caches, and provides a setter function (`setCacheDir()`) to change this. To get started, choose a cache directory, and generate some random data. ```{r Try it out} library(simpleCache) cacheDir = tempdir() setCacheDir(cacheDir) simpleCache("normSamp", { rnorm(1e7, 0,1) }) ``` Now, watch what happens when we run that same function call again: ```{r} simpleCache("normSamp", { rnorm(1e7, 0,1) }) ``` Notice that the second call to `simpleCache()` doesn't re-run the `rnorm` calculation. In fact, it doesn't even re-load the cache, because it notices that it's already in memory. If the cache weren't already in memory, this call would load it from disk. This means you can put this code in multiple scripts and pull the same randomized data, without re-doing the compute work. You can also force a cache to reload using the `reload` option. This could be useful, for example, if you've loaded a cache and then accidentally changed it, and want to reset. By default, a call to `simpleCache()` will not reload an object that already exists in your environment. But you can always force it with the `reload` parameter: ```{r} normSamp = NA # Oops broke my object in memory. # Regular call won't reload because we have an object called normSamp already: simpleCache("normSamp", { rnorm(1e7, 0,1) }) # But we can force reload and get it back with reload=TRUE simpleCache("normSamp", { rnorm(1e7, 0,1) }, reload=TRUE) ``` What if we want to start over and blow that cache, getting a new random set? Use the `recreate` flag if you want to ensure that the cache is produced and overwritten even if it already exists: ```{r} simpleCache("normSamp", { rnorm(1e7, 0,1) }, recreate=TRUE) ``` With just those parameters (cache name, instruction, recreate, and reload), you should be able to make good use of `simpleCache`. The essence is: if the object exists in memory already: do nothing. If it does not exist in memory, but exists on disk: load it into memory. If it exists neither in memory or on disk: create it and store it to disk and memory. Now you've got the basics. But there's more if you want it: read on! ## Comparison to base R save() and load() Of course, R has base functions that accomplish this (`save()` and `load()`), so what does simpleCache add? Well, `simpleCache` is essentially a convenience wrapper around the base R functions. The first advantage is that we now require only a single function: `simpleCache()` handles both saving and loading. This means your script does not need to be written differently depending on whether it's generating or loading a cache, because the same function can do either, depending on whether the cache exists or not. The second advantage is that caches are keyed by cache name instead of by filename. So instead of putting a whole path to an Rdata file into `load()`, we just pass a unique identifier for the cache, and simpleCache handles the rest. Third, `simpleCache` tries to be smart: if you already have the object in memory, it won't re-load it. For big caches, this can save you time if you accidentally call `simpleCache()` multiple times on the same cache (or if you write functions to populate an R environment with a bunch of pre-existing data). Beyond that, `simpleCache` also offers several convenient options that just make it really easy to save and re-load R objects. Let's go into a bit more detail into these features. ## Cache names By default, the object will be loaded into a variable with the same name as the cache. You can change this behavior with the `assignTo` parameter: ```{r} simpleCache("normSamp", { rnorm(1e7, 0,1) }, assignTo="mySamp") ``` After doing this command, we have both `normSamp` (from the previous calls, not from this one) and `mySamp` (loaded in this call) in the workspace, and these objects are identical: ```{r} identical(normSamp, mySamp) ``` This `assignTo` concept is useful if you want to create caches but not load them, or load caches one at a time. Which leads us to... ## Creating but not loading caches It may be that you want to create a bunch of caches that are quite memory intensive, and you don't actually need them all in this particular R workspace at the same time. If you just create each object and save it, you'll end with all those objects in memory at the same time. Instead, you can use the `noload` parameter, which will create the caches but not load them into memory (so the object will be cached, but will not persist in this R environment). I use this frequently in a setup script to build caches that I will need later in individual scripts that will run on each one individually. Let's make 5 caches but not load them: ```{r} for (i in 1:5) { cacheName = paste0("normSamp_", i) simpleCache(cacheName, { rnorm(1e6, 0,1) }, recreate=TRUE, noload=TRUE) } ``` We've now produced 5 different sample data caches. They exist on disk, but not in memory. This could, for example, be done in an initial data-generation or setup script. We then may be interested in using these (same) caches in several downstream scripts, and we could do some iterative operation on them and use `assignTo` to avoid loading more than 1 at a time into memory: ```{r} overallMinimum = 1e6 # pick some high number to start for (i in 1:5) { cacheName = paste0("normSamp_", i) simpleCache(cacheName, assignTo="temp") overallMinimum = min(overallMinimum, temp) } message(overallMinimum) ``` In this code block, by assigning the caches to the variable `temp`, we only have 1 in memory at a time, because each cache load overwrites the previous one, which is exactly what we want in this case. We keep track of the minimum value of each one independently, and we've effectively calculated an overall minimum while loading only a single cache in memory at a time. ## Loading multiple caches If you've got a bunch of caches and you want them all in memory, you could just load all the caches into memory with this convenience alias: ```{r} loadCaches(paste0("normSamp_", 1:5)) ``` The disadvantage of doing it this way is that you've lost the advantage of using the single `simpleCache()` function for both saving and loading, but this may be desirable in some cases. By the way, once a cache is created, you no longer need to provide instructions: ```{r} simpleCache("normSamp") ``` `simpleCache` will load it if it can; if not, it will give you an error saying it requires an `instruction`. ## Timing cache creating If you want to record how long it takes to create a new cache, you can set `timer=TRUE`. ```{r} simpleCache("normSamp", { rnorm(1e6, 0,1) }, recreate=TRUE, timer=TRUE) ``` ## Complicated code So far, our examples have cached the result of a very simple instruction code block: the `rnorm` call to randomly generate some numbers. But really, simpleCache can be used to cache anything. The code block can be whatever you want; whatever it returns will be cached. For example, let's cache the result of a call to `t.test()`: ```{r} simpleCache("tResult", { dat2 = rnorm(1e5, 0.05,2) t.test(normSamp, dat2) }, recreate=TRUE) tResult tResult$p.value ``` The point is that the code could be quite complicated and time-consuming. You may only want to calculate it once, and then re-use the result in another script -- or in this same script next time you run it. `simpleCache` makes that, well, simple. That's the end of the basics. There are a few more advanced options as well, such as using a shared cache directory, submitting compute requests to a cluster using `batchtools`, tweaking the loading environment with the `loadEnvir` parameter (if you need to call `simpleCache()` from within a function), and tweaking the cache building resources with the `buildEnvir` parameter. But these options are more advanced and probably not needed for 95% of `simpleCache` use cases. If you do need more information, you can find further help in the other vignettes or in the detailed R function documentation (see `?simpleCache`). ```{r Clean up} deleteCaches("normSamp", force=TRUE) deleteCaches(paste0("normSamp_", 1:5), force=TRUE) deleteCaches("tResult", force=TRUE) ```