Data Loss and Reproducible Research

As long-time readers know, I'm a big fan of reproducible research (1, 2, 3, 4, 5, 6, 7). The aspect that I'm particularly interested in is keeping the data, software, and other supporting material together with the source for the actual paper. This is especially convenient with the Emacs/Org mode/Babel combination as many of the above posts discuss.

One thing I hadn't considered is its role in preventing data loss. UPI is reporting that scientific data is disappearing at an alarming rate. Researchers from the University of British Columbia looked at 516 research papers and tried to locate the data underlying the papers. They found that most of the data was available two years after the papers were published but that the availability of the data dropped off by about 17% for each year after the second.

The problem seems rooted in the fact that the data stays with the researchers. Researchers move, retire, or die, making it difficult or impossible to get in touch with them. Even when a researcher can be reached, the data may have been lost or trapped on old media that can no longer be read. Reproducible research by itself doesn't help with any of those problems, of course, but it can help if the original files are held by the journals that publish the papers. The authors of the University of British Columbia study recommend that journals require authors to upload data to public servers as a condition of publication.

An even better solution is to require that authors generate reproducible research files[1] and escrow them with the journal or on public servers set up for that purpose. Keeping digital files readable is a difficult problem—just ask the Library of Congress—but it's one that's more suited to professionals than individual researchers, most of whom lack the necessary background and skills.

I've long believed that journals should insist that authors submit reproducible research files as part of the publication process. That was to allow other researchers to verify and perhaps extend the original research. As the UPI report makes clear, keeping valuable scientific data from disappearing is an even better reason.

Update: dissapearing → disappearing



[1] Ideally, this would be a single file containing the text of the paper, the data, the calculations, and the programs used to manipulate the data. In some cases the data sets or programs may be too large to make that practical, but they should all be included in the reproducible research file set and linked to from the main file.
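As a sketch of what such a single file might look like in Org mode (the title, table, and source block here are illustrative, not taken from any actual paper):

```org
#+TITLE: Example Paper
#+AUTHOR: Author Name

* Data
The raw measurements live in a table inside the document itself.
#+NAME: measurements
| sample | value |
|--------+-------|
| a      |   1.2 |
| b      |   3.4 |

* Analysis
A Babel source block reads the table and computes the mean; evaluating
the block inserts the result directly into the document.
#+BEGIN_SRC python :var data=measurements :results value
return sum(row[1] for row in data) / len(data)
#+END_SRC
```

Because the text, data, and code travel together, anyone with the file can re-run the analysis and regenerate the results.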

  • Jean-Michel

    Very valuable series of posts about reproducible research.

    I'd just like some feedback: I use org-mode + babel, but when it comes to research involving 10 MB of raw data (text) that generates another 10 MB of results, I tend to keep that in several files, because I think it's too big for one file. The other concern is that, in rare cases, it's possible to lose data in an org-mode file (I haven't figured out how, but it has happened to me twice in 2 years; no irreversible loss, everything was easily recovered thanks to git, but not nice).

    What are your ideas about that?

    • jcs

      What are your ideas about that?

      That you're doing the sensible thing. If you have huge amounts of data, keeping them in your "main" file is unwieldy. The important thing from the reproducible research perspective is to think of all the files as a unit. In terms of the U of BC researchers' recommendation, that would mean uploading the total file set to the journal or public server.

      I haven't had any problems with Org mode losing data but I don't have huge amounts of data. Even here, though, you're doing the sensible thing: keeping everything in git ensures you can always back up to a good state. The Ten Simple Rules for Reproducible Computational Research paper that I wrote about previously stresses the importance of keeping everything version controlled.
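      A minimal sketch of that workflow, treating the paper, data, and results as one versioned unit (directory and file names are illustrative):

```shell
# Snapshot the entire reproducible-research file set in one git commit.
set -e
dir=$(mktemp -d)
cd "$dir"
git init -q
mkdir -p data results
printf '#+TITLE: Paper\n' > paper.org
printf 'sample,value\na,1.2\n' > data/raw.csv
printf 'mean,1.2\n' > results/summary.csv
git add -A
git -c user.name=example -c user.email=example@example.org \
    commit -q -m "Snapshot paper, data, and results together"
git log --oneline
```

      Any bad edit to the Org file can then be undone by checking out an earlier snapshot, and the whole directory is what would be uploaded to the journal or public server.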