When discussing performance with colleagues, teaching, sending a bug report, or searching for guidance on mailing lists and here on Stack Overflow, a reproducible example (https://stackoverflow.com/help/mcve) is often asked for and always helpful.
What are your tips for creating an excellent example? How do you paste data structures from R in a text format? What other information should you include?
Are there other tricks in addition to using dput(), dump() or structure()? When should you include library() or require() statements? Which reserved words should one avoid, in addition to c, df, data, etc.?
How does one make a great R reproducible example?
A minimal reproducible example (https://stackoverflow.com/help/minimal-reproducible-example) consists of the following items:
- a minimal dataset, necessary to demonstrate the problem
- the minimal runnable code necessary to reproduce the error, which can be run on the given dataset
- the necessary information on the packages used, the R version, and the system it is run on
- in the case of random processes, a seed (set by set.seed()) for reproducibility [1]
For examples of good minimal reproducible examples, see the help files of the function you are using. In general, all the code given there fulfills the requirements of a minimal reproducible example: data is provided, minimal code is provided, and everything is runnable. Also look at questions on R with lots of upvotes.
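As a rough illustration of the four items listed above, a complete example could look like the sketch below. The question it poses, the dataset and the names `df_example` and `fit` are made up for illustration; only base R is assumed.

```r
# Hypothetical question: "How do I read the coefficients of this model?"

# 1. Seed, because part of the data is generated randomly
set.seed(42)

# 2. Minimal dataset: small, with named columns, built entirely in code
df_example <- data.frame(
  group = factor(rep(c("a", "b"), each = 6)),
  value = rnorm(12)
)

# 3. Minimal code that reproduces the behaviour being asked about
fit <- lm(value ~ group, data = df_example)
coef(fit)

# 4. Version information for whoever tries to rerun this
R.version.string
```

Everything a reader needs is in one block that can be pasted into a fresh session.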
Producing a minimal dataset
In most cases, this can easily be done by providing a vector or data frame with some values. Alternatively, you can use one of the built-in datasets, which are provided with most packages.
A comprehensive list of built-in datasets can be seen with library(help = "datasets"). There is a short description of every dataset, and more information can be obtained, for example, with ?mtcars, where 'mtcars' is one of the datasets in the list. Other packages might contain additional datasets.
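For readers who prefer to browse the list programmatically, here is a small sketch using base R only. The variable name `available` is arbitrary.

```r
# data() returns an object whose $results matrix has one row per dataset
available <- data(package = "datasets")$results
head(available[, c("Item", "Title")])

# Inspect one of the datasets before using it in an example
str(mtcars)
head(mtcars, 3)
```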
Making a vector is easy. Sometimes it is necessary to add some randomness to it, and there are a number of functions to do that. sample() can randomize a vector, or give a random vector with only a few values. letters is a useful vector containing the alphabet. This can be used for making factors.
A few examples:
- random values: x <- rnorm(10) for a normal distribution, x <- runif(10) for a uniform distribution, ...
- a permutation of some values: x <- sample(1:10) for the vector 1:10 in random order.
- a random factor: x <- sample(letters[1:4], 20, replace = TRUE)
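The snippets above can be combined into one runnable sketch. Note that sample() on letters returns a character vector, so the sketch wraps it in factor() for cases where the question is really about factors. The seed value and the variable names are arbitrary choices.

```r
set.seed(123)                      # make the random draws repeatable

x_norm <- rnorm(10)                # 10 draws from a standard normal
x_perm <- sample(1:10)             # the values 1..10 in random order

# A character vector with only four distinct values ...
x_chr <- sample(letters[1:4], 20, replace = TRUE)

# ... turned into a factor with a fixed set of levels
x_fac <- factor(x_chr, levels = letters[1:4])

table(x_fac)
```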
For matrices, one can use matrix(), e.g.:
matrix(1:10, ncol = 2)
Making data frames can be done using data.frame(). One should pay attention to naming the entries in the data frame, and to not making it overly complicated.
An example:
set.seed(1)
Data <- data.frame(
    X = sample(1:10),
    Y = sample(c("yes", "no"), 10, replace = TRUE)
)
For some questions, specific formats can be needed. For these, one can use any of the provided as.someType functions: as.factor, as.Date, as.xts, ... These can be combined with the vector and/or data frame tricks.
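A minimal sketch of combining a small data frame with type conversions follows. The `events` data and its column names are invented for illustration. as.xts is left out because it needs the xts package.

```r
events <- data.frame(
  id    = 1:4,
  when  = c("2020-01-15", "2020-02-01", "2020-02-20", "2020-03-05"),
  label = c("low", "high", "low", "mid"),
  stringsAsFactors = FALSE
)

# Convert columns to the types the question is actually about
events$when  <- as.Date(events$when)     # ISO dates need no format string
events$label <- as.factor(events$label)

str(events)
```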
Copy your data
If you have some data that would be too difficult to construct using these tips, then you can always make a subset of your original data, using head(), subset() or the indices. Then use dput() to give us something that can be put in R immediately:
> dput(iris[1:4, ]) # first four rows of the iris data set
structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6), Sepal.Width = c(3.5,
3, 3.2, 3.1), Petal.Length = c(1.4, 1.4, 1.3, 1.5), Petal.Width = c(0.2,
0.2, 0.2, 0.2), Species = structure(c(1L, 1L, 1L, 1L), .Label = c("setosa",
"versicolor", "virginica"), class = "factor")), .Names = c("Sepal.Length",
"Sepal.Width", "Petal.Length", "Petal.Width", "Species"), row.names = c(NA,
4L), class = "data.frame")
If your data frame has a factor with many levels, the dput output can be unwieldy because it will still list all the possible factor levels even if they aren't present in the subset of your data. To solve this issue, you can use the droplevels() function. Notice below how Species is a factor with only one level:
> dput(droplevels(iris[1:4, ]))
structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6), Sepal.Width = c(3.5,
3, 3.2, 3.1), Petal.Length = c(1.4, 1.4, 1.3, 1.5), Petal.Width = c(0.2,
0.2, 0.2, 0.2), Species = structure(c(1L, 1L, 1L, 1L), .Label = "setosa",
class = "factor")), .Names = c("Sepal.Length", "Sepal.Width",
"Petal.Length", "Petal.Width", "Species"), row.names = c(NA,
4L), class = "data.frame")
When using dput, you may also want to include only relevant columns:
> dput(mtcars[1:3, c(2, 5, 6)]) # first three rows of columns 2, 5, and 6
structure(list(cyl = c(6, 6, 4), drat = c(3.9, 3.9, 3.85), wt = c(2.62,
2.875, 2.32)), row.names = c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710"
), class = "data.frame")
One other caveat for dput is that it will not work for keyed data.table objects or for grouped tbl_df (class grouped_df) from dplyr. In these cases you can convert back to a regular data frame before sharing: dput(as.data.frame(my_data)).
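Before posting dput() output, it can be worth checking that it really rebuilds the object you meant to share. The sketch below does that round trip in base R for a small subset with unused factor levels dropped. The names `small`, `pasted` and `rebuilt` are arbitrary.

```r
# Subset the data and drop factor levels that no longer occur
small <- droplevels(iris[1:4, ])
levels(small$Species)   # "setosa"

# Capture what dput() would print, as a single string
pasted <- paste(capture.output(dput(small)), collapse = "\n")

# This mimics a reader pasting your dput() output into their session
rebuilt <- eval(parse(text = pasted))

# Compare the rebuilt object with the original
all.equal(small, rebuilt)
```

If the comparison reports differences, the object probably carries attributes that do not survive as text, and converting it with as.data.frame() first may help.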
Worst case scenario, you can give a text representation that can be read in using the text parameter of read.table:
zz <- "Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa"
Data <- read.table(text = zz, header = TRUE)
Producing minimal code
This should be the easy part but often isn't. What you should not do is:
- add all kinds of data conversions. Make sure the provided data is already in the correct format (unless that is the problem, of course)
- copy-paste a whole function/chunk of code that gives an error. First, try to locate which lines exactly result in the error. More often than not you'll find out what the problem is yourself.
What you should do is:
- add which packages should be used if you use any (using library())
- if you open connections or create files, add some code to close them or delete the files (using unlink())
- if you change options, make sure the code contains a statement to revert them back to the original ones (e.g. op <- par(mfrow=c(1,2)) ...some code... par(op))
- test run your code in a new, empty R session to make sure the code is runnable. People should be able to just copy-paste your data and your code in the console and get exactly the same as you have.
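The clean-up advice in the list above can be sketched as follows. This version uses options() rather than par() so that it runs without opening a graphics device; the file contents and the names `tmp` and `old_opts` are arbitrary.

```r
# Create a scratch file in the session's temporary directory
tmp <- tempfile(fileext = ".csv")
write.csv(head(mtcars, 3), tmp)

# Change an option, keeping the old value so it can be restored
old_opts <- options(digits = 3)

print(read.csv(tmp, row.names = 1))

# Clean up: restore the options and delete the file
options(old_opts)
unlink(tmp)
file.exists(tmp)   # FALSE once the file has been removed
```

options() returns the previous values of whatever it changes, which is why passing `old_opts` back restores the original state, in the same way as par(op).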
Give extra information
In most cases, just the R version and the operating system will suffice. When conflicts arise with packages, giving the output of sessionInfo() can really help. When talking about connections to other applications (be it through ODBC or anything else), one should also provide version numbers for those, and if possible also the necessary information on the setup.
If you are running R in RStudio, rstudioapi::versionInfo() can be helpful to report your RStudio version.
If you have a problem with a specific package, you may want to provide the version of the package by giving the output of packageVersion("name of the package").
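A short sketch of collecting this information with base R follows. The package "stats" is used only because it ships with every R installation; substitute the package your question is about. The rstudioapi call is omitted because it only works inside RStudio.

```r
R.version.string            # R version as a single line
Sys.info()[["sysname"]]     # operating system name
packageVersion("stats")     # version of one specific package

# Full picture: R version, platform, locale and attached packages
info <- sessionInfo()
info
```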
[1] Note: Results obtained after set.seed() differ between R >= 3.6.0 and previous versions, because the default method used by sample() changed in R 3.6.0. Do specify which R version you used for the random process, and don't be surprised if you get slightly different results when following old questions. To get the same result in such cases, you can call RNGversion() before set.seed() (e.g. RNGversion("3.5.2")).
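The effect described in the note can be checked with the sketch below, which assumes it is run on R 3.6.0 or later. There the two draws should differ. RNGversion("3.5.2") warns that the older sampler is in use, so the call is wrapped in suppressWarnings().

```r
# Draw with the defaults of the running R version
set.seed(1)
new_draw <- sample(1:10)

# Emulate the random number generators of R 3.5.2
suppressWarnings(RNGversion("3.5.2"))
set.seed(1)
old_draw <- sample(1:10)

# Switch back to the defaults of the running R version
RNGversion(as.character(getRversion()))

identical(new_draw, old_draw)
```

Resetting with the running version at the end keeps the rest of the session unaffected.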