Scraping for Journalists

What basic coding/tech skills meet the following criteria?

  • They are relatively easily acquired, with no need for a four-year CS degree;
  • They provide journalistically relevant and useful results;
  • They are reusable in a journalistic context.

In my opinion, there are three skills that meet these criteria:

  • Mapping.  Most news happens in a place.  Maps are ancient precisely because they are such an expressive and powerful form of data visualization.
  • Grabbing.  Thousands of websites have data gateways called APIs (Application Programming Interfaces) that allow you free access to some or all of that site’s data — as long as you can write a relatively simple program that can grab it and return it to you in a format you can use.
  • Scraping.  This is how you get the data out of all those sites that don’t offer APIs (read: “crappy government websites”).
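
To make the difference between grabbing and scraping concrete, here is a minimal sketch in Python that uses only the standard library.  Everything in it is an assumption made for illustration: the JSON string stands in for what a hypothetical API might return, the HTML stands in for a page with no API, and "Example PD" is a made-up agency.  Real sites will be messier than this.

```python
import json
from html.parser import HTMLParser

# Grabbing: an API typically hands back structured data (often JSON).
# This string is a stand-in for a response from a hypothetical API.
api_response = '{"agencies": [{"name": "Example PD", "officers": 12}]}'
grabbed = json.loads(api_response)["agencies"]

# Scraping: the same facts locked inside an HTML table,
# the way a page without an API might present them.
page = """
<table>
  <tr><th>Agency</th><th>Officers</th></tr>
  <tr><td>Example PD</td><td>12</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collect the text of every <td> cell, one list per table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])      # start a new row
        elif tag == "td":
            self.in_cell = True       # text that follows belongs to a cell

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1].append(data.strip())


scraper = TableScraper()
scraper.feed(page)

# Skip rows with no <td> cells (the header row only has <th> cells).
scraped = [{"name": r[0], "officers": int(r[1])} for r in scraper.rows if r]

print(grabbed)
print(scraped)
```

Both routes end with the same list of records.  The grabbing half is one line because the API did the structuring for you; the scraping half has to rebuild that structure from the page's markup, which is why it tends to be the harder skill to generalize.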

(For a brief and entertaining video on these three, see “Do I Really Have To Learn How To Program?”)

I’ve done quite a bit of mapping, and some grabbing, but my experiences with scraping have been less successful, primarily because they proved to be less generalizable.  I’d be able to pick my way (often slowly and with much frustration) through a scraping tutorial and get results.  But at the end I did not feel that I could write a script on my own to scrape other things.

After taking a class on scraping at Journalism Interactive with Michelle Minkoff,  I decided to buy Paul Bradshaw’s book “Scraping for Journalists,” and take another run at it.  If you would like to read along, my notes as I pick through the book are after the jump.   I would also like to thank Michelle and Paul for giving me the inspiration to restart this blog.  I have been very busy with my new duties at INN, a network of 90+ investigative and community newsrooms, so I have not been devoting much time to adding to my own store of code-knowledge or developing tutorials to pass on what I’ve learned to others.  But it’s something that I enjoy and believe gives back something of value to my peers in the field, so I welcome the chance to begin anew.

Continue reading

Going Beyond Static Charts & Graphs


Not too long ago I was able to attend a demo of the data visualization toolkit Weave.  The person giving the demo was Georges Grinstein, one of the tool’s creators.  Georges hails from the University of Massachusetts at Lowell.

He showed a really amazing demo of foreclosure data from Lowell, MA.  For those of you who aren’t from Massachusetts, Lowell was one of America’s first industrial cities; massive textile mills once dominated the town.  The mill buildings are there — but the kind of jobs they once provided are long gone.  I have a lot of affection for Lowell because my grandmother lived there and my mother was raised there; my dad graduated from the University of Massachusetts at Lowell at the age of 40 with a degree in computer science.

But enough about me!  Let’s get back to dataviz:

So first off, check out this link:

Lowell Foreclosure Demo

Give it a while to load.  I also recommend loading it in either Firefox or Chrome.

Run your mouse over anything.  Anything at all.  Everything here is highly interactive; you can drill down on almost anything.  That’s pretty exciting all by itself, but now take a look at the menus in the upper right.

It’s not just interactive; it’s generative.  You can remix this, create your own visualizations, change what the dashboard looks like and what it displays, add your own data!

When I think about teaching beginners data visualization, one of the primary questions I ask myself is:  What am I teaching students that they can’t do more easily and quickly in Excel or PowerPoint?  What new vista am I helping them to see?

To me, one of the primary ways to depart from the “Excel box” is interactivity, but also generativity.  That’s why this is exciting.   :)

Notes from a newsroom

 

Newsroom sketch

Photo Credit: scriptingnews via Compfight cc

Last week I was lucky enough to meet with three folks who work in the newsroom of a daily newspaper.  That’s a big deal to me, because if my work isn’t useful to people who work in a newsroom or a mission-driven nonprofit, or to folks who want to change the world (or even just their little piece of it), I’m wasting my time.

I asked them: “What should my next step-by-step tutorial be?  What would be really useful for a beat reporter who doesn’t think of themselves as a techie to pick up?”  Remember, it wasn’t all that long ago that shooting video was considered a specialty task that print reporters didn’t do — and now everyone just points their iPhone at it and calls it a day.  (Okay, some do a great deal more than that!  But you get my point).

The folks who were kind enough to spend some time with me gave me the following hints:

  1. The Absurdly Illustrated Guide To Your First Tableau Public Charts ’n’ Graphs.  Tableau Public is a downloadable app that lets users transform datasets into classic data visualizations: bar charts, pie charts, scatterplots, time series, and more.  The end results are embeddable in a web page the way that a YouTube video is, and a few are interactive.
  2. The Absurdly Illustrated Guide To Your First Survey with Crosstabs — There are lots of survey tools out there, but only a few of them do “crosstabs” — that’s the ability to compare one survey answer against another one.  For instance, a survey that asked folks what their favorite flavor ice cream was but also asked their gender and had crosstabs could tell you that 47% of women liked black raspberry ice cream, but only 12% of men.  My job would be to pick the best and most web/mobile friendly tool out there and produce a tutorial on how to use it and serve it up on the web and mobile devices.
  3. The Absurdly Illustrated Guide To Your First ArcGIS Online Map.  ArcGIS is a “geographical information system” or GIS.  GIS predates web-based mapping systems like Google Maps by a couple of decades.  These systems used to be very, very expensive software used by specialists, and to some extent they still are.  But to get with the times, ArcGIS now has an online service too.  I’ve never used it, but hey, before I wrote my TileMill tutorial I had never used that either!  Writing tutorials is a spectacularly effective form of learning: if I understand it well enough to explain it to a total beginner, I probably understand it pretty well.
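
Since “crosstabs” in the second item may be a new term, here is a minimal sketch of what a survey tool does behind the scenes, written in Python with only the standard library.  The responses are invented purely for illustration (they are not real survey results), and the `crosstab` function is a hypothetical helper of my own, not part of any survey product.

```python
from collections import Counter

# Invented survey responses, for illustration only: (gender, favorite flavor).
responses = [
    ("woman", "black raspberry"),
    ("woman", "vanilla"),
    ("man", "vanilla"),
    ("man", "chocolate"),
    ("woman", "black raspberry"),
    ("man", "black raspberry"),
]


def crosstab(pairs):
    """Return {row: {column: percent of that row's respondents}}."""
    cell_counts = Counter(pairs)                     # answers per (row, column) pair
    row_totals = Counter(row for row, _ in pairs)    # respondents per row
    table = {}
    for (row, col), n in cell_counts.items():
        table.setdefault(row, {})[col] = round(100 * n / row_totals[row], 1)
    return table


table = crosstab(responses)
for gender, flavors in table.items():
    print(gender, flavors)
```

The key move is dividing each cell by its row total rather than by the whole sample; that is what lets you say “X% of women” instead of “X% of everyone.”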

One thing I’ve been thinking about lately that really made an impression on me is how important it will be for me to focus on “zero install” tools.  Many folks have computers and servers that their corporate IT department doesn’t allow them to install anything on for security reasons.  So my TileMill tutorial, which requires you to download an app, won’t work for folks in that situation.  But a tutorial on the data visualization platform ManyEyes would — you don’t have to download or install anything, you just work with the application from your web browser.

That’s super-useful information for me as I go forward and write more data visualization tutorials.

So, two things:

  1. If you’re reading this and you’re one of the folks who were kind enough to meet with me, you know who you are :)  I didn’t use your names here because I forgot to ask if I could, and not asking would be really rude! But if you’d like the credit for giving me so many smart ideas please let me know :)
  2. What about more tutorials?  If you’re reading this, wherever and whoever you are, and you have a burning desire to learn a specific data visualization or mapping tool or technique, please let me know.  I’m a noob like you (if you’re a noob), so I can’t guarantee I’ll do every one, but I do want to know!

Carl Malamud’s Ten Rules for Radicals

Illustration showing the face of public domain advocate Carl Malamud, with the word DATA under his face.

Recently I searched for the name of my current project, “Data for Radicals,” and through that magic we know as Serendipity on the Internet, up popped:

Ten Rules for Radicals

by none other than Carl Malamud.  To be honest, before I read “Ten Rules for Radicals,” all I knew about Carl was that my friends who were investigative journalists — particularly those who did the deep data and document dives through FOIA and other means — talked about him in hushed tones of awe.

Reading the title, I could not help but think that the essay, originally an address to the WWW2010 conference, represented one of those strange messages that happen between people of ideas, even if those people are separated by centuries, or thousands of miles, or other barriers, and even if they have never met.  Haven’t you ever had that feeling of picking up a book and feeling that the author is, in an uncanny and spooky way, speaking to you, directly to you?  I felt that way about this essay.  You can find it in full here, but interspersed within it are his “Ten Rules for Radicals,” which in the essay he illustrates with stories from his career.  You should do yourself the favor of reading the whole thing, but for my own edification, I am reprinting the ten rules below.

Just as I do when I am learning new code, I did not copy and paste these.  As I sit here, I am typing them word by word with my own ten fingers on my MacBook Air, sitting at my dining room table at 2:21 AM on Saturday, May 18.  (What can I say?  A dream woke me up and I couldn’t get back to sleep.)

Rule 1:  Call everything an experiment.
Rule 2: When the starting gun goes off, run really fast.  As a small player, the elephant can step on you, but you can outrun the elephant.
Rule 3: Eyeballs rule.  If a million people use your service, and on the Internet you can do that, you’ve got a lot more credibility than if you’re just issuing position papers and flaming The Man.
Rule 4: When the time comes, be nice.
Rule 5: Keep asking until they say yes.  Gordon Bell, the inventor of the VAX, once said that you should keep your vision, but modify your plan.
Rule 6: When you get the microphone, get to the point. Be clear about what you want.
Rule 7: Get standing.  Have some skin in the game, some reason you’re at the table.
Rule 8: Get them to threaten you.
Rule 9:  Look for overreaching, things that are just blatantly, obviously wrong or silly.
Rule 10: Don’t be afraid to fail.  It took Thomas Edison 10,000 times before he got the lightbulb right, and when he was asked about those failures, he said, “I have not failed, I’ve just found 10,000 ways that won’t work.” Fail. Fail often. And don’t forget, you can question authority.

If I could put these on stone tablets, or better yet for our era, put them on plastic tablets extruded by a 3D printer, I’d do it.  They’re a little long for a tattoo, though :)

The Absurdly Illustrated Guide To Sortable, Searchable Online Data Tables

Visual approaches to data are great — they can allow us to grasp complex issues at a glance, just the way this map from Clear Health Costs shows us the dramatic differences between what different hospitals charge for the same procedure.

But sometimes a simple, sortable and searchable table of data is all that’s really needed.  Using  code written by Chris L. Keller, I was able to create this sortable, searchable table of law enforcement agencies in Middlesex County, Massachusetts, along with the populations they serve, and how many full-time officers per capita there are.

As always, click on any image in this blog to see it full size.  I leave helpful annotations in the illustrations to these tutorials — so if ya can’t see em, click ‘em!

middlesex-county-law-enforcement

 

One of the nice things about this way of presenting data is that even though it’s simple, it can be quite revealing.  Click here to go to the live data table and use the up/down arrows to sort by the center column, which displays how many full-time officers per 1,000 people served there are.

Seven of the top ten are colleges and universities.  Tiny Lasell College has 7.8 officers per 1,000 students, though with only 1,800 students, its high ratio may have more to do with the minimum number of officers needed to staff three shifts so that someone is on duty 24 hours a day.  But scroll down to the bottom: two of the departments with the lowest ratios are also colleges.

Public colleges.  

Surprising?  Maybe not, but even so, it was a fact I did not immediately pick out when I downloaded the original data from UCRStats.com.  And the differences are dramatic; state colleges have among the lowest numbers of full-time officers per 1,000 people served of all of the departments listed.  So one of the things that higher tuition cost buys?  Bigger campus police forces.
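
If you want to check the “per 1,000” arithmetic yourself, here is a small sketch in Python.  The two function names are hypothetical helpers of my own; the only inputs are the Lasell College figures quoted above (7.8 officers per 1,000 and roughly 1,800 students), so the headcount it prints is an approximation derived from rounded numbers, not a figure taken from the data.

```python
def officers_per_1000(officers, population):
    """Full-time officers per 1,000 people served."""
    return officers / population * 1000


def implied_officers(rate_per_1000, population):
    """Work backwards from a per-1,000 rate to an approximate headcount."""
    return rate_per_1000 * population / 1000


# Figures quoted in the post: 7.8 officers per 1,000, about 1,800 students.
headcount = implied_officers(7.8, 1800)
print(round(headcount))
```

Working backwards like this is a handy sanity check on any per-capita ranking: a dramatic rate can come from a very small denominator, which fits the three-shifts explanation.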

Now I will teach you how to make your own sortable, searchable web data tables!

Continue reading

Creating Interactive, Web-based Data Tables With Tabletop.js


 

Sortable, searchable table of law enforcement agencies, Middlesex County, MA

Tabletop.js is a JavaScript library that lets you use Google spreadsheets as the data source for web apps.  It’s pretty neat, especially since we know there are so many simple but useful web and mobile apps we can create where setting up a full-on database is overkill.  What if you want to make a sortable, searchable list of craft breweries?  Or a schedule for a music festival?  Do you really have to bust out MySQL for that?

Well, with Tabletop.js, you don’t.  The other great thing about using Tabletop.js is that a lot more people know how to use a Google spreadsheet than know how to enter records into a database.  You can share responsibility for updating your web app with anyone who knows how to use Google Docs, or use Google Forms to let members of the public add new records to a list.
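
To show why a spreadsheet can be enough, here is a minimal sketch of the underlying idea.  To be clear, this is not Tabletop.js and not Chris Keller's code: it is plain Python using only the standard library, the CSV text is a stand-in for a spreadsheet exported as CSV, and the breweries and towns in it are invented placeholders.  The `load_rows`, `search`, and `sort_by` helpers are hypothetical names of my own.

```python
import csv
import io

# Stand-in for a spreadsheet exported as CSV; every row is an invented placeholder.
SHEET_CSV = """name,style,town
Example Brewing,IPA,Town A
Sample Ales,Stout,Town B
Placeholder Beer Co,Lager,Town A
"""


def load_rows(csv_text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def search(rows, term):
    """Keep rows where any cell contains the term, ignoring case."""
    term = term.lower()
    return [r for r in rows if any(term in value.lower() for value in r.values())]


def sort_by(rows, column, descending=False):
    """Return a new list of rows sorted on one column."""
    return sorted(rows, key=lambda r: r[column], reverse=descending)


rows = load_rows(SHEET_CSV)
for row in sort_by(search(rows, "town a"), "name"):
    print(row["name"], "-", row["style"])
```

Each spreadsheet row becomes one record, and sorting and searching are just operations on a list of records.  At this scale there is nothing a database would do for you that a list can't.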

Chris Keller wired up Tabletop.js to a nifty sortable, searchable, and nicely styled web-based data table, which I used to create a sortable, searchable table of law enforcement agencies in Middlesex County, Massachusetts.  I decided to do that because I live in Watertown, Massachusetts.  Until recently, very few people outside of eastern Massachusetts knew about Watertown: where it was, what it looked like, what happened there.  That was before the Tsarnaev brothers led police on a car chase through my town, culminating in a gunfight and explosions.

So I thought I’d do my latest plunge into civic data by creating a data table showing all the law enforcement agencies in my county, how many full-time police officers per 1,000 residents each police department has, and the population of each city or town.  (College police departments are also represented in this table.  I thought about taking them out, until I remembered that Sean Collier, the 26-year-old police officer who died, was a campus police officer at the Massachusetts Institute of Technology (MIT).)  You can see the table here: Sortable, searchable table of law enforcement agencies, Middlesex County, MA.

I’m a noob, and I had a tough time getting my mind around Tabletop.js, but it was worth it.  Also, once I finally got it to work…I happened to be wearing my “Watertown Strong” t-shirt.  Coincidence?  Nah.


 

 

Now that I’ve got it working,  I’ll be creating another Absurdly Illustrated Tutorial for those of you who want to try out Tabletop.js.  If you’d like to see my other tutorials, see:

The Absurdly Illustrated Guide To Your First Dynamic, Data-Driven Timeline
The Absurdly Illustrated Guide To Your First Data-Driven TileMill Map

The Absurdly Illustrated Guide To Your First Dynamic, Data-Driven Timeline


 

Gay marriage became legal today in Rhode Island, making marriage equality the law of the land in all of New England.

The Providence Journal published a detailed timeline of GLBT history in the state – but it’s text-only.  On this happy day, what could we do to spiff things up a little?  I decided to give the Projo’s timeline a little celebratory finery by creating a vertical, interactive timeline using Timeline.js.  Check it out.

Now I will show you how to do it!

Continue reading

The Pressthink of FvckTheMedia

I first heard the word “Pressthink” from NYU journalism professor Jay Rosen, who began a blog of the same name.

I’m not sure I ever heard Jay define the term “pressthink,” but I’ll try:

Pressthink:  A set of shared, embedded and often unspoken notions that guide the actions of a group of people creating news media for public consumption.

In other words, the “pressthink” of a publication and the people who work at it helps those people decide what is good work, or bad work; what’s worth doing and not doing; whether coverage is fair or not.  (Writing that makes me think that pressthink is inherently moral; perhaps it is the moral philosophy of a news organization.)

When I saw FvckTheMedia, an online publication formed by former employees of Boston alt-weekly The Phoenix in the wake of the paper’s sudden closure, I thought: what can you say about the pressthink of this little green shoot?

Continue reading

How Much Progress Have I Made?

Word count stats for my blog

 

[As always, click any image on this blog to make it bigger.]

Since I want to write a book on data visualization for beginners, I wanted to know how many words I’d already written here at #D4R.  The book and the blog will be different, of course, but things like The Insanely Illustrated Guide To Your First Data-Driven TileMill Map, as well as others, are very likely to appear there.  So it’s good for me to know what I’ve got in terms of source material.

These charts are created by a WordPress plugin called Word Stats.

The count is good news.  If my book is 40,000 to 60,000 words long, I probably need 80,000 words of source material.  But I’m well on my way.

Making Friends With The Stuck


Typically, a finished product, whether it’s a simple one or something as sophisticated as “Snow Fall,” the New York Times’ immersive multimedia piece on an avalanche, doesn’t give many indications about how it was made, or what challenges the developer faced in creating it.  An apparently simple site might have taken hours, while another site that you might guess took weeks took only hours, because there were many open-source tools available to start out with that did a lot of the heavy lifting.

Looking at the finished product — or a tutorial like my Insanely Illustrated Guide To Your First Data-Driven TileMill Map — doesn’t really give you insight into whether something took a little effort or a lot of effort.  Or whether it was frustrating or a breeze for the person who created it.

That’s because everybody who’s making things with code experiences The Stuck.

The Stuck happens when you read the instructions, and everything seems like it should be working right…but it’s not.

And you can’t figure out why.

It doesn’t matter if you’ve been coding for 3 months or 30 years: The Stuck, like The Force, is always with us.  How happy we are learning to code (or coding at all) has a lot to do with how we respond to The Stuck.

Continue reading