Simple Web Scraping in Emacs Lisp

Today, we're going to do some industrial-strength web scraping with Emacs. I believe Scrapy and Beautiful Soup are the tools of choice for web scraping nowadays, but there's an elisp library that can achieve equally impressive results. To demonstrate, we're going to print out the most recently updated bands on Encyclopaedia Metallum from elisp.

First, let's retrieve the front page of Encyclopaedia Metallum, the premier reference website for metalheads. Metallum lacks a public API, so most of the site's user-submitted information is only available in HTML form.

(with-current-buffer (url-retrieve-synchronously "https://www.metal-archives.com")
  ...)

url.el also offers an asynchronous way to retrieve network resources, url-retrieve, but the simpler synchronous function is used here for didactic purposes.
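For the curious, here is a rough sketch of what the asynchronous route might look like with url-retrieve, which ships with Emacs. The my/ names are hypothetical helpers invented for this example, and the callback is assumed to want nothing more than the response body as a string.

```elisp
(require 'url)

(defun my/response-body ()
  "Return the text after the HTTP headers in the current response buffer."
  (save-excursion
    (goto-char (point-min))
    ;; The headers end at the first blank line; if there is none,
    ;; fall back to returning the whole buffer.
    (if (re-search-forward "^\r?\n" nil t)
        (buffer-substring-no-properties (point) (point-max))
      (buffer-string))))

(defun my/fetch-async (url callback)
  "Fetch URL in the background, then call CALLBACK with the response body."
  (url-retrieve
   url
   ;; url.el calls this in the response buffer with a STATUS plist,
   ;; followed by the extra arguments passed in the list below.
   (lambda (status cb)
     (if (plist-get status :error)
         (message "Fetch failed: %S" (plist-get status :error))
       (funcall cb (my/response-body))))
   (list callback)
   t))

;; Example use (hits the network):
;; (my/fetch-async "https://www.metal-archives.com"
;;                 (lambda (body) (message "Got %d characters" (length body))))
```

Passing the callback through url-retrieve's extra-arguments list, rather than closing over it, keeps the sketch working whether or not lexical-binding is enabled where you evaluate it.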

To parse the HTML, we'll use elquery, a library which started as a joking ripoff of jQuery, but then grew a test suite and amassed a whopping three stars on GitHub. Simply parse the response body with elquery-read-string, use some querySelector magic (elquery-$) to retrieve the band links from the updated bands table (#updatedBands), and get their text. This whole function could've been one-lined in JavaScript, but good luck getting V8 running in Emacs.

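(require 'elquery)
(with-current-buffer (url-retrieve-synchronously "https://www.metal-archives.com")
  ;; url.el leaves the HTTP headers at the top of the buffer; skip past them.
  (goto-char (point-min))
  (re-search-forward "^\r?\n" nil t)
  (message "%s" (mapcar #'elquery-text
                        (elquery-$ "#updatedBands .forceBreak a"
                                   (elquery-read-string (buffer-substring (point) (point-max)))))))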

Simply evaluate those two expressions, and your *Messages* buffer should be filled with brütality.
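If you also want to know where each band name links to, something like the following might do. This is only a sketch: my/band-links is a made-up name, and it assumes elquery-prop takes the attribute name as a string and returns that attribute's value.

```elisp
(require 'elquery)

(defun my/band-links (html)
  "Return a list of (NAME . HREF) pairs for the band links in HTML, a string."
  (mapcar (lambda (a)
            ;; Pair each link's text with its href attribute.
            (cons (elquery-text a) (elquery-prop a "href")))
          (elquery-$ "#updatedBands .forceBreak a"
                     (elquery-read-string html))))

;; Example use (hits the network):
;; (with-current-buffer (url-retrieve-synchronously "https://www.metal-archives.com")
;;   (message "%S" (my/band-links (buffer-string))))
```

Taking the HTML as a string argument keeps the parsing separate from the fetching, so the same function can be tried on a saved copy of the page.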

Elquery is a powerful library: we have an almost-full-fat recursive querySelector implementation and plenty of jQuery-inspired helpers to make HTML parsing and formatting as easy as possible from elisp. If you need to do some scraping or printing for your next elisp project, give elquery a whirl.