What basic coding/tech skills meet the following criteria?

  • They are relatively easily acquired, with no need for a four-year CS degree;
  • They provide journalistically relevant and useful results;
  • They are reusable in a journalistic context.

In my opinion, there are three skills that meet these criteria:

  • Mapping.  Most news happens in a place.  Maps are ancient precisely because they are such an expressive and powerful form of data visualization.
  • Grabbing.  Thousands of websites have data gateways called APIs (Application Programming Interfaces) that allow you free access to some or all of that site’s data — as long as you can write a relatively simple program that can grab it and return it to you in a format you can use.
  • Scraping.  This is how you get the data out of all those sites that don’t offer APIs (read: “crappy government websites”).

(For a brief and entertaining video on these three, see “Do I Really Have To Learn How To Program?”)

I’ve done quite a bit of mapping, and some grabbing, but my experiences with scraping have been less successful, primarily because they proved to be less generalizable.  I’d be able to pick my way (often slowly and with much frustration) through a scraping tutorial and get results.  But at the end I did not feel that I could write a script on my own to scrape other things.

After taking a class on scraping at Journalism Interactive with Michelle Minkoff,  I decided to buy Paul Bradshaw’s book “Scraping for Journalists,” and take another run at it.  If you would like to read along, my notes as I pick through the book are after the jump.   I would also like to thank Michelle and Paul for giving me the inspiration to restart this blog.  I have been very busy with my new duties at INN, a network of 90+ investigative and community newsrooms, so I have not been devoting much time to adding to my own store of code-knowledge or developing tutorials to pass on what I’ve learned to others.  But it’s something that I enjoy and believe gives back something of value to my peers in the field, so I welcome the chance to begin anew.

Read-along, “Scraping for Journalists,” Paul Bradshaw

Introduction

“Scraping is faster than FOI, provides more granular results than most advanced searches – and allows you to grab data that organisations would rather you didn’t have.”

“I was moved to write this book when I noticed many journalists were trying to learn scraping and programming but struggling to get a foothold – or losing momentum once they did.”

[Well, damn!  — LW]

“Unlike general books about programming languages, everything you learn here will have a direct application for journalism, and each principle of programming will be related to their application in scraping for newsgathering…And unlike standalone guides and blog posts that cover particular tools or techniques, this book aims to give you skills that you can apply in new situations and with new tools.”

[Perfect.  If the book delivers on this, it is precisely what I was looking for.  — LW]

Chapter 2

You can write a very basic scraper by using Google Drive, selecting Create > Spreadsheet, and adapting this formula – it doesn’t matter where you type it:

=ImportHTML("ENTER THE URL HERE", "table", 1)

This formula will go to the URL you specify, look for a table, and pull the first one into your spreadsheet.
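
To make the "look for a table and pull the first one" step less magical, here is a rough Python sketch of the same idea. It is not how Google implements ImportHTML, and it is not from the book; the names TableGrabber, import_html_table and SAMPLE_PAGE are made up for illustration. It uses only Python's standard library, works on a small made-up page instead of a live URL, and assumes tidy HTML (every row and cell has a closing tag, no tables nested inside tables).

```python
from html.parser import HTMLParser


class TableGrabber(HTMLParser):
    """Collect the rows of every <table> in a page, in document order."""

    def __init__(self):
        super().__init__()
        self.tables = []   # each table is a list of rows
        self._row = None   # row being built, or None when outside a <tr>
        self._cell = None  # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def import_html_table(html, index=1):
    """Return table number `index` (1-based, like ImportHTML) as a list of rows."""
    parser = TableGrabber()
    parser.feed(html)
    return parser.tables[index - 1]


# Made-up sample page standing in for a real URL.
SAMPLE_PAGE = """
<html><body>
<table>
  <tr><th>Prison</th><th>Country</th></tr>
  <tr><td>Alcatraz</td><td>United States</td></tr>
  <tr><td>Robben Island</td><td>South Africa</td></tr>
</table>
</body></html>
"""

rows = import_html_table(SAMPLE_PAGE, 1)
for row in rows:
    print(row)
```

To point this at a real page you would first have to download the HTML yourself (for instance with urllib.request.urlopen) and pass the text in. Real-world pages are often messier than this sketch assumes, which may be one reason the formula works on some pages and not others.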

This does in fact work for the URL cited in the book:  “http://en.wikipedia.org/wiki/List_of_prisons”.  I got it to work on some pages, but not others:

It didn’t work here, for example:  http://dataforradicals.com/js/tabletop-to-datatables/

But it did work on this page of Massachusetts state education data: http://profiles.doe.mass.edu/state_report/selectedpopulations.aspx  (even though it was a dreaded .aspx).

“Trial and error, by the way, is a common way of learning in scraping – it’s quite typical not to get things right first time, and you shouldn’t be disheartened if things go wrong at first.”

“Don’t expect yourself to know everything there is to know about programming: half the fun is solving the inevitable problems that arise, and half the skill is in the techniques that you use to solve them (some of which I’ll cover here), and learning along the way.”

Word.

Describing the scrape-via-Google Spreadsheet-function procedure above, Bradshaw notes:

“Although this is described as a ‘scraper’ the results only exist as long as the page does. The advantage of this is that your spreadsheet will update every time the page does.”

I hadn’t thought of that; that’s pretty cool.

Chapter 3

This one shows how to use Google Docs to import XML into a spreadsheet.  Try cutting & pasting the following into a cell in a blank GDocs spreadsheet:

=importXML("http://openlylocal.com/councils.xml", "councils/council")

I got it to work on this URL — and the results are impressive — but I wasn’t able to find other .xml files on the web that it worked on.  I kept getting a spreadsheet with a single cell saying “N/A” that said “Parse Error” when I moused over it.  The book contains a link to an XML validator, but typing URLs in there also resulted in an error.  Bummer.
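
For comparison, here is a rough Python sketch of what the formula seems to be doing: parse the XML, find every element that matches the path, and turn each one into a row. It uses the standard library's xml.etree.ElementTree on a small made-up document, not the real openlylocal.com file. The function names import_xml and is_well_formed, the sample data and the council names are all invented for illustration. One possible cause of a "Parse Error" (an assumption on my part, not something the book confirms) is a file that is not well-formed XML, so the sketch also shows a quick way to check for that.

```python
import xml.etree.ElementTree as ET

# Made-up XML standing in for a real feed; the names are hypothetical.
SAMPLE_XML = """<councils>
  <council><name>Northtown Council</name><country>England</country></council>
  <council><name>Southvale Council</name><country>Wales</country></council>
</councils>"""


def import_xml(xml_text, path):
    """Return one row per element matching `path`, one column per child element."""
    root = ET.fromstring(xml_text)
    return [
        [(child.text or "").strip() for child in element]
        for element in root.findall(path)
    ]


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# ElementTree searches from the root element (<councils> here), so the
# path is just "council" where the spreadsheet formula says "councils/council".
rows = import_xml(SAMPLE_XML, "council")
for row in rows:
    print(row)

# A mismatched closing tag is enough to make the parser give up.
print(is_well_formed("<a><b></a>"))
```

If is_well_formed returns False for a file you downloaded, the problem is in the file, not in your formula or your path.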

 
