Data Loss and Reproducible Research

As long-time readers know, I’m a big fan of reproducible research (1, 2, 3, 4, 5, 6, 7). The aspect that I’m particularly interested in is keeping the data, software, and other supporting material together with the source for the actual paper. This is especially convenient with the Emacs/Org mode/Babel combination, as many of the above posts discuss.

One thing I hadn’t considered is its role in preventing data loss. UPI is reporting that scientific data is disappearing at an alarming rate. Researchers from the University of British Columbia looked at 516 research papers and tried to locate the data underlying the papers. They found that most of the data was available two years after the papers were published but that the availability of the data dropped off by about 17% for each year after the second.
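To get a feel for what a 17% annual drop means over a couple of decades, here is a rough back-of-the-envelope sketch. It assumes the drop compounds multiplicatively each year after the second, which is my reading of the figure and not something the UPI report spells out. The function name `fraction_available` and its parameters are made up for illustration, and the result is relative to whatever share of the data was available at the two-year mark.

```python
def fraction_available(years_since_publication, annual_drop=0.17, plateau_years=2):
    """Share of the year-two availability that remains after the given number of years.

    Assumption: availability holds steady for the first `plateau_years`
    and then shrinks by `annual_drop` (compounded) every year after that.
    """
    years_of_decay = max(0, years_since_publication - plateau_years)
    return (1 - annual_drop) ** years_of_decay


# Print the remaining share for a few paper ages.
for years in (2, 5, 10, 20):
    share = fraction_available(years)
    print(f"{years:>2} years after publication: {share:.1%} of the year-two availability")
```

Under that assumption, less than a quarter of the year-two availability is left a decade after publication, which is why "alarming" doesn't seem like an overstatement.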

The problem seems rooted in the fact that the data stays with the researchers. Researchers move, retire, or die, making it difficult or impossible to get in touch with them. Even when a researcher can be reached, the data may have been lost or trapped on old media that can no longer be read. Reproducible research by itself doesn’t help with any of those problems, of course, but it can help if the original files are held by the journals that publish the papers. The authors of the University of British Columbia study recommend that journals require authors to upload data to public servers as a condition of publication.

An even better solution is to require that authors generate reproducible research files [1] and escrow them with the journal or on public servers set up for that purpose. Keeping digital files readable is a difficult problem (just ask the Library of Congress), but it’s one that’s more suited to professionals than individual researchers, most of whom lack the necessary background and skills.

I’ve long believed that journals should insist that authors submit reproducible research files as part of the publication process. My original reason was to allow other researchers to verify and perhaps extend the original research. As the UPI report makes clear, keeping valuable scientific data from disappearing is an even better reason.

Update: dissapearing → disappearing

Footnotes:

[1] Ideally, this would be a single file containing the text of the paper, data, calculations, and programs used to manipulate the data. In some cases the data sets or programs may be too large to make that practical, but they should all be included in the reproducible research file set and be linked to from the main file.
