Web Scraping With Org Mode

Beetle_b has an interesting short post about scraping a Web site that outlines a technique that may be useful to some Irreal readers. The TL;DR is that he wanted to capture and print the content from a Web page that had links to several subpages. There were about 20 subpages and they all had ads and lots of wasted space on them. That meant he’d have to print each subpage separately and would end up with a lot of wasted space on every printed page.

His solution was to first capture the links (with the judicious use of a keyboard macro) and then use alphapapa’s org-web-tools to capture the content at each link (without ads) and insert it into its own Org heading.
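For readers who'd like to see the link-gathering step spelled out, here is a rough sketch in Python. To be clear, this is not Beetle_b's method: he used a keyboard macro inside Emacs and let org-web-tools fetch the content. The sketch only shows the general idea of pulling the subpage links out of an index page and giving each one its own top-level Org heading. The names LinkCollector and links_to_org, and the example.com URLs, are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (url, title) pairs for every <a href="..."> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # URL of the anchor we are currently inside, if any
        self._text = []     # pieces of link text seen so far

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the index page's URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace; fall back to the URL if no text.
            title = " ".join("".join(self._text).split()) or self._href
            self.links.append((self._href, title))
            self._href = None


def links_to_org(html, base_url):
    """Return one top-level Org heading per link found in html."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return "\n".join(f"* [[{url}][{title}]]" for url, title in collector.links)
```

Feeding it the HTML of an index page gives a skeleton Org file with one heading per subpage. Inserting the ad-free content under each heading is the part that org-web-tools handles in Beetle_b's workflow, and the sketch makes no attempt to reproduce it. It also doesn't escape square brackets in link titles, which a real Org link would need.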

Many of us try to remain digital as much as we can, so we have no need to produce hard copy, but most of us have occasionally wanted to capture and preserve the content from a series of Web pages. Beetle_b’s method is an easy way to do that but, of course, you’re going to need to load org-web-tools first. It’s definitely worth a couple of minutes to read and bookmark his post against the day when you want to capture some content.
