Back in early July, I wrote a rant disagreeing with Robert Zaremba about retiring the use of PDFs. Zaremba believes that PDFs are no longer a good fit for today’s devices and that we should stop using them. I strongly disagreed. In the comments to my post, Mike Zamansky zinged me by noting that “PDFs are where data goes to die.” His point, of course, is that it’s pretty hard to get data out of a PDF.
Now, finally, I have a rejoinder: Tabula, a tool for extracting tables from PDFs as CSV data. Once you’ve got the data in that format, it’s easy to convert it into others, such as an Org mode table. That should be especially handy for researchers who like to write their papers in Org mode.
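To give a rough idea of how little work the CSV-to-Org step is, here's a minimal Python sketch. It isn't part of Tabula; the function name `csv_to_org_table` and the sample data are made up for illustration, and it assumes the first CSV row is a header.

```python
import csv
import io


def csv_to_org_table(csv_text: str, header: bool = True) -> str:
    """Render CSV text as an aligned Org mode table (illustrative helper)."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if not rows:
        return ""

    ncols = max(len(row) for row in rows)
    # Pad short rows and replace literal pipes, which would otherwise be
    # read as column separators (Org uses the \vert entity for this).
    rows = [
        [cell.strip().replace("|", r"\vert{}") for cell in row]
        + [""] * (ncols - len(row))
        for row in rows
    ]
    widths = [max(len(row[i]) for row in rows) for i in range(ncols)]

    def fmt(row):
        cells = (cell.ljust(width) for cell, width in zip(row, widths))
        return "| " + " | ".join(cells) + " |"

    lines = [fmt(row) for row in rows]
    if header and len(lines) > 1:
        # Horizontal rule separating the header row from the body.
        rule = "|" + "+".join("-" * (width + 2) for width in widths) + "|"
        lines.insert(1, rule)
    return "\n".join(lines)


if __name__ == "__main__":
    sample = "name,qty\napple,3\n"  # made-up data standing in for Tabula output
    print(csv_to_org_table(sample))
```

Paste the output into an Org buffer and it behaves like any other table. Org will happily realign it the next time you hit TAB inside it, so the padding here is just for readability.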
The problem of extracting table data from a PDF turns out to be surprisingly difficult. Follow the Tabula link to read about some of the problems. Regardless, Tabula can (usually) do it and help researchers capture data from a PDF in a relatively painless and accurate way.