Fetching Web Pages into NotDeft

I'm increasingly using NotDeft not only for note taking, but also for capturing information from various sources. To some extent it already acts as a lightweight substitute for the likes of Evernote.

As explained in the documentation, Org mode's built-in capture protocol can be used to send snippets of text from a page open in a web browser into one's NotDeft note collection. Sometimes, however, we already have the URL of an interesting page in our clipboard, and we would like to fetch the entire page's textual content into NotDeft with a single command. NotDeft lacks such a command, but it doesn't take many lines of Emacs Lisp code to implement one, particularly if we have a good selection of Unix-style programs (i.e., ones that work together and “do one thing and do it well”) on hand to complement Emacs' own facilities.

We can use curl to fetch the desired document by its URL. To convert the HTML to plain text, we could use Emacs' built-in shr-render-region command. Or we might translate the HTML into the Org markup language in order to preserve more of the document's semantic content; shr appears to be capable of that too, as shown by html2org.el. Alternatively, if we have the Haskell-based pandoc tool available, we can invoke it with our favorite options to perform an HTML-to-Org translation.
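As a rough illustration of the shr route mentioned above, here is a minimal sketch of an HTML-to-plain-text helper. The function name my-html-to-text is made up for this example, and the sketch assumes an Emacs built with libxml2 support, since shr-render-region relies on it.

```elisp
(require 'shr)

(defun my-html-to-text (html)
  "Return a plain-text rendering of the HTML string, using shr.
Hypothetical helper; assumes Emacs was built with libxml2 support."
  (with-temp-buffer
    (insert html)
    (let ((shr-width 70)           ; wrap at a fixed column
          (shr-use-fonts nil)      ; measure in characters, not pixels
          (shr-inhibit-images t))  ; do not try to fetch any images
      ;; Replace the HTML source in the buffer with its rendering.
      (shr-render-region (point-min) (point-max)))
    ;; Drop the faces and other text properties that shr adds.
    (buffer-substring-no-properties (point-min) (point-max))))
```

The result is ordinary text with shr's line wrapping applied, so headings, links, and emphasis are flattened, which is the main reason one might prefer a translation into Org markup instead.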

For example, we might implement our command for importing a web page into NotDeft in terms of curl and pandoc as follows:

(defun my-notdeft-import-web-page (url &optional ask-dir)
  "Import the web page at URL into NotDeft.
Query for the target directory if ASK-DIR is non-nil.
Interactively, query for a URL, and set ASK-DIR if a prefix
argument is given. Choose a file name based on any document
<title>, or generate some unique name."
  (interactive "sPage URL: \nP")
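  (let* ((s (shell-command-to-string
             ;; Fetch the page with curl and pipe it through pandoc,
             ;; which translates the HTML into Org markup.
             (concat "curl --silent " (shell-quote-argument url) " | "
                     "pandoc" " -f html-native_divs-native_spans"
                     " -t org"
                     " --wrap=none --smart --normalize --standalone")))
         ;; Pick up the document title from any #+TITLE line.
         (title
          (and
           (string-match "^#\\+TITLE:[[:space:]]+\\(.+\\)$" s)
           (match-string 1 s))))
    ;; Create the note, named after the title when one was found.
    (notdeft-create-file
     (and ask-dir 'ask)
     (and title `(title ,title))
     "org" s)))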

In the above listing we request a --standalone document in the hope that pandoc will include a TITLE property in it, whose value we can then use for naming the output file. If a file so named already exists (perhaps because the page has been imported before), the notdeft-create-file function should report an error rather than overwrite the file.
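To see the title lookup in isolation, the same regular expression can be tried on a hand-written sample of Org text. The helper name my-org-title and the sample strings are invented for this sketch; real pandoc output will of course contain more than this.

```elisp
(defun my-org-title (org-text)
  "Return the value of the first #+TITLE line in ORG-TEXT, or nil.
Hypothetical helper using the same regexp as the import command."
  (and (string-match "^#\\+TITLE:[[:space:]]+\\(.+\\)$" org-text)
       (match-string 1 org-text)))

;; A document with a title line yields the title text.
(my-org-title "#+TITLE: Example Domain\n\nSome body text.")
;; => "Example Domain"

;; Without a title line the result is nil, and the import command
;; then leaves the choice of file name to notdeft-create-file.
(my-org-title "Some body text only.")
;; => nil
```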

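The above implementation of my-notdeft-import-web-page is current as of NotDeft revision 39938fe, curl version 7.52.1, and pandoc version 1.17.2. Note that pandoc 2.0 and later no longer accept the --smart and --normalize options, so with a newer pandoc those two flags would have to be dropped from the command line.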
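Since the motivating case was a URL already sitting in the clipboard, a thin wrapper could pick the URL up from there instead of prompting for it. The following is only a sketch: the names my-clipboard-url and my-notdeft-import-clipboard-url are made up, and the check for what counts as a URL is deliberately crude.

```elisp
(defun my-clipboard-url ()
  "Return the latest kill or clipboard text if it looks like a URL.
Return nil otherwise.  Only http and https URLs are recognized."
  ;; `current-kill' signals an error when there is nothing to
  ;; return, hence the `ignore-errors'.  The second argument
  ;; keeps the kill ring's yank pointer where it is.
  (let ((text (ignore-errors (current-kill 0 t))))
    (when (and text
               (string-match
                "\\`[ \t\n\r]*\\(https?://[^ \t\n\r]+\\)[ \t\n\r]*\\'"
                text))
      (match-string 1 text))))

(defun my-notdeft-import-clipboard-url (&optional ask-dir)
  "Import the page whose URL is in the clipboard into NotDeft.
Query for the target directory if ASK-DIR is non-nil, which
interactively means that a prefix argument was given."
  (interactive "P")
  (let ((url (my-clipboard-url)))
    (unless url
      (user-error "No http(s) URL found in the clipboard"))
    ;; Hand over to the import command defined earlier.
    (my-notdeft-import-web-page url ask-dir)))
```

Binding my-notdeft-import-clipboard-url to a key would then give the single-command import described at the outset, with a plain error message when the clipboard holds something other than a URL.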