r - How to create an example data set from private data (replacing variable names and levels with uninformative placeholders)?


To provide a reproducible example of an approach, a data set must often be
provided. Instead of building an example data set, I wish to use some of my own
data. However, this data cannot be released. I wish to replace variable (column)
names and factor levels with uninformative placeholders (e.g. V1...V5, L1...L5).

Is there an automated way to do this?

Ideally, this would be done in R, taking in a data.frame and producing this anonymous data.frame.

With such a data set, simply search
and replace variable names in your script and you have a publicly releasable
reproducible example.
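
As a rough illustration of that search-and-replace step, here is a minimal sketch. It assumes a hypothetical `lookup` vector mapping the private column names to their placeholders, and a script held as a character vector; both are made up for the example. It also assumes the column names are plain alphanumeric, since they are used directly as regular expressions.

```r
## Hypothetical lookup: names are the private column names,
## values are the placeholders that replace them
lookup <- c(name = "V1", passwd = "V3")

## A toy script that refers to the private column names
script <- c("fit <- lm(hiscore ~ name, data = df)",
            "table(df$passwd)")

## Replace each name only where it appears as a whole word,
## so that e.g. 'names' is not touched when replacing 'name'
for (old in names(lookup)) {
    script <- gsub(paste0("\\b", old, "\\b"), lookup[[old]], script, perl = TRUE)
}
script
```

Matching on word boundaries keeps the replacement from rewriting longer identifiers that merely contain a column name.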

Such a process may increase the inclusion of appropriate data in reproducible
examples and even the inclusion of reproducible examples in questions, comments
and bug reports.


I don't know whether there
was a function to automate this, but now there
is ;)

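## A function to anonymise columns in 'colIDs'
##    colIDs can be either column names or integer indices
anonymiseColumns <- function(df, colIDs) {
    ## Convert column names to integer positions so both forms work
    ids <- if(is.character(colIDs)) match(colIDs, names(df)) else colIDs
    for(id in ids) {
        ## One random letter per column
        prefix <- sample(LETTERS, 1)
        ## Assumption: the right-hand side is truncated in the source; an
        ## integer code per distinct value matches the output shown below
        suffix <- as.integer(factor(df[[id]]))
        df[[id]] <- paste(prefix, suffix, sep="")
        names(df)[id] <- paste("V", id, sep="")
    }
    df
}
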
## A data.frame containing sensitive information
df <- data.frame(
    name = rep(readLines(file.path(R.home("doc"), "AUTHORS"))[9:13], each=2),
    hiscore = runif(10, 99, 100),
    passwd = replicate(10, paste(sample(c(LETTERS, letters), 9), collapse=""))
)

## Anonymise it
df2 <- anonymiseColumns(df, c(1,3))

## Check that it worked
> head(df, 3)
           name  hiscore    passwd
1 Douglas Bates 99.96714
2 Douglas Bates 99.07243 gDOLNMyVe
3 John Chambers 99.55322 xIVPHDuEW

> head(df2, 3)
  V1  hiscore V3
1 Q1 99.96714 V8
2 Q1 99.07243 V2
3 Q2 99.55322
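
If you would rather keep factor columns as factors, with levels L1...Lk as the question asks, something along these lines might work. This is only a sketch: `anonymiseWithKey` and the toy `private` data are hypothetical names made up for illustration. It renames every column to V1...Vn, relabels factor and character columns, leaves numeric columns untouched, and returns a key so the same renaming can be applied to a script.

```r
## Sketch: rename all columns to V1..Vn and relabel factor/character
## columns with levels L1..Lk. 'anonymiseWithKey' is a hypothetical name.
anonymiseWithKey <- function(df) {
    ## Named vector: names are the original column names, values the new ones
    key <- setNames(paste0("V", seq_along(df)), names(df))
    for (i in seq_along(df)) {
        if (is.factor(df[[i]]) || is.character(df[[i]])) {
            f <- factor(df[[i]])
            levels(f) <- paste0("L", seq_along(levels(f)))
            df[[i]] <- f
        }
    }
    names(df) <- unname(key)
    list(data = df, key = key)
}

## Toy data standing in for the private data set
private <- data.frame(
    site  = c("north", "south", "north"),
    score = c(1.5, 2.5, 3.5),
    stringsAsFactors = TRUE
)

res <- anonymiseWithKey(private)
res$data   # columns V1 and V2; V1 is a factor with levels L1 and L2
res$key    # maps the original column names to the placeholders
```

Note that the relabelled levels keep the original level order, and numeric values are passed through unchanged, so check whether either of those could still identify the data before releasing it.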

