simpleCache
has 2 main use cases: First, it can help you
pick up where you left off in an R session, and second, it can help you
parallelize code by enabling you to share results across R sessions.
The workhorse of simpleCache
is the eponymous
simpleCache()
function, which in the simplest case requires
just two parameters: a cache name, and a block of code. The cache name
should be considered unique and its underlying object immutable, while
the block of code (or instruction) is the R
code
that generates the object you wish to cache.
But before we start creating caches, it’s important to tell
simpleCache
where to store the caches.
simpleCache
uses a global variable
(RCACHE.DIR
) for caches, and provides a setter function
(setCacheDir()
) to change this. To get started, choose a
cache directory, and generate some random data.
library(simpleCache)
cacheDir = tempdir()
setCacheDir(cacheDir)
simpleCache("normSamp", { rnorm(1e7, 0,1) })
## ::Creating cache:: /tmp/RtmpKckpGK/normSamp.RData
Now, watch what happens when we run that same function call again:
## ::Object exists (in .GlobalEnv):: normSamp
Notice that the second call to simpleCache()
doesn’t
re-run the rnorm
calculation. In fact, it doesn’t even
re-load the cache, because it notices that it’s already in memory. If
the cache weren’t already in memory, this call would load it from disk.
This means you can put this code in multiple scripts and pull the same
randomized data, without re-doing the compute work.
You can also force a cache to reload using the reload
option. This could be useful, for example, if you’ve loaded a cache and
then accidentally changed it, and want to reset. By default, a call to
simpleCache()
will not reload an object that already exists
in your environment. But you can always force it with the
reload
parameter:
normSamp = NA # Oops broke my object in memory.
# Regular call won't reload because we have an object called normSamp already:
simpleCache("normSamp", { rnorm(1e7, 0,1) })
## ::Object exists (in .GlobalEnv):: normSamp
# But we can force reload and get it back with reload=TRUE
simpleCache("normSamp", { rnorm(1e7, 0,1) }, reload=TRUE)
## ::Loading cache:: /tmp/RtmpKckpGK/normSamp.RData
What if we want to start over and blow that cache, getting a new
random set? Use the recreate
flag if you want to ensure
that the cache is produced and overwritten even if it already
exists:
## ::Creating cache:: /tmp/RtmpKckpGK/normSamp.RData
With just those parameters (cache name, instruction, recreate, and
reload), you should be able to make good use of
simpleCache
. The essence is: if the object exists in memory
already: do nothing. If it does not exist in memory, but exists on disk:
load it into memory. If it exists neither in memory or on disk: create
it and store it to disk and memory. Now you’ve got the basics.
But there’s more if you want it: read on!
Of course, R has base functions that accomplish this
(save()
and load()
), so what does simpleCache
add? Well, simpleCache
is essentially a convenience wrapper
around the base R functions. The first advantage is that we now require
only a single function: simpleCache()
handles both saving
and loading. This means your script does not need to be written
differently depending on whether it’s generating or loading a cache,
because the same function can do either, depending on whether the cache
exists or not. The second advantage is that caches are keyed by cache
name instead of by filename. So instead of putting a whole path to an
Rdata file into load()
, we just pass a unique identifier
for the cache, and simpleCache handles the rest. Third,
simpleCache
tries to be smart: if you already have the
object in memory, it won’t re-load it. For big caches, this can save you
time if you accidentally call simpleCache()
multiple times
on the same cache (or if you write functions to populate an R
environment with a bunch of pre-existing data).
Beyond that, simpleCache
also offers several convenient
options that just make it really easy to save and re-load R objects.
Let’s go into a bit more detail into these features.
By default, the object will be loaded into a variable with the same
name as the cache. You can change this behavior with the
assignTo
parameter:
## ::Object exists (in .GlobalEnv):: normSamp
After doing this command, we have both normSamp
(from
the previous calls, not from this one) and mySamp
(loaded
in this call) in the workspace, and these objects are identical:
## [1] TRUE
This assignTo
concept is useful if you want to create
caches but not load them, or load caches one at a time. Which leads us
to…
It may be that you want to create a bunch of caches that are quite
memory intensive, and you don’t actually need them all in this
particular R workspace at the same time. If you just create each object
and save it, you’ll end with all those objects in memory at the same
time. Instead, you can use the noload
parameter, which will
create the caches but not load them into memory (so the object will be
cached, but will not persist in this R environment). I use this
frequently in a setup script to build caches that I will need later in
individual scripts that will run on each one individually. Let’s make 5
caches but not load them:
for (i in 1:5) {
cacheName = paste0("normSamp_", i)
simpleCache(cacheName, { rnorm(1e6, 0,1) }, recreate=TRUE, noload=TRUE)
}
## ::Creating cache:: /tmp/RtmpKckpGK/normSamp_1.RData
## ::Creating cache:: /tmp/RtmpKckpGK/normSamp_2.RData
## ::Creating cache:: /tmp/RtmpKckpGK/normSamp_3.RData
## ::Creating cache:: /tmp/RtmpKckpGK/normSamp_4.RData
## ::Creating cache:: /tmp/RtmpKckpGK/normSamp_5.RData
We’ve now produced 5 different sample data caches. They exist on
disk, but not in memory. This could, for example, be done in an initial
data-generation or setup script. We then may be interested in using
these (same) caches in several downstream scripts, and we could do some
iterative operation on them and use assignTo
to avoid
loading more than 1 at a time into memory:
overallMinimum = 1e6 # pick some high number to start
for (i in 1:5) {
cacheName = paste0("normSamp_", i)
simpleCache(cacheName, assignTo="temp")
overallMinimum = min(overallMinimum, temp)
}
## ::Loading cache:: /tmp/RtmpKckpGK/normSamp_1.RData
## ::Loading cache:: /tmp/RtmpKckpGK/normSamp_2.RData
## ::Loading cache:: /tmp/RtmpKckpGK/normSamp_3.RData
## ::Loading cache:: /tmp/RtmpKckpGK/normSamp_4.RData
## ::Loading cache:: /tmp/RtmpKckpGK/normSamp_5.RData
## -5.14063077592056
In this code block, by assigning the caches to the variable
temp
, we only have 1 in memory at a time, because each
cache load overwrites the previous one, which is exactly what we want in
this case. We keep track of the minimum value of each one independently,
and we’ve effectively calculated an overall minimum while loading only a
single cache in memory at a time.
If you’ve got a bunch of caches and you want them all in memory, you could just load all the caches into memory with this convenience alias:
## ::Loading cache:: /tmp/RtmpKckpGK/normSamp_1.RData
## ::Loading cache:: /tmp/RtmpKckpGK/normSamp_2.RData
## ::Loading cache:: /tmp/RtmpKckpGK/normSamp_3.RData
## ::Loading cache:: /tmp/RtmpKckpGK/normSamp_4.RData
## ::Loading cache:: /tmp/RtmpKckpGK/normSamp_5.RData
The disadvantage of doing it this way is that you’ve lost the
advantage of using the single simpleCache()
function for
both saving and loading, but this may be desirable in some cases.
By the way, once a cache is created, you no longer need to provide instructions:
## ::Object exists (in .GlobalEnv):: normSamp
simpleCache
will load it if it can; if not, it will give
you an error saying it requires an instruction
.
If you want to record how long it takes to create a new cache, you
can set timer=TRUE
.
## ::Creating cache:: /tmp/RtmpKckpGK/normSamp.RData
## <00h 00m 0.0s>
So far, our examples have cached the result of a very simple
instruction code block: the rnorm
call to randomly generate
some numbers. But really, simpleCache can be used to cache anything. The
code block can be whatever you want; whatever it returns will be cached.
For example, let’s cache the result of a call to
t.test()
:
## ::Creating cache:: /tmp/RtmpKckpGK/tResult.RData
##
## Welch Two Sample t-test
##
## data: normSamp and dat2
## t = -7.6219, df = 105049, p-value = 2.52e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.06137061 -0.03626382
## sample estimates:
## mean of x mean of y
## -0.0004819423 0.0483352708
## [1] 2.519968e-14
The point is that the code could be quite complicated and
time-consuming. You may only want to calculate it once, and then re-use
the result in another script – or in this same script next time you run
it. simpleCache
makes that, well, simple.
That’s the end of the basics. There are a few more advanced options
as well, such as using a shared cache directory, submitting compute
requests to a cluster using batchtools
, tweaking the
loading environment with the loadEnvir
parameter (if you
need to call simpleCache()
from within a function), and
tweaking the cache building resources with the buildEnvir
parameter. But these options are more advanced and probably not needed
for 95% of simpleCache
use cases. If you do need more
information, you can find further help in the other vignettes or in the
detailed R function documentation (see ?simpleCache
).
## Deleting /tmp/RtmpKckpGK/normSamp.RData
## Deleting /tmp/RtmpKckpGK/normSamp_1.RData
## Deleting /tmp/RtmpKckpGK/normSamp_2.RData
## Deleting /tmp/RtmpKckpGK/normSamp_3.RData
## Deleting /tmp/RtmpKckpGK/normSamp_4.RData
## Deleting /tmp/RtmpKckpGK/normSamp_5.RData
## Deleting /tmp/RtmpKckpGK/tResult.RData