bit.ly link for this page: http://bit.ly/catscrape

A NICAR 2012 Hands-On Session

How your browser’s built-in tool can help you get data faster, locate hidden files, download cat videos, and understand HTML and the web faster than any HTML/CSS 101 class.

Time: Saturday, 3PM, Jeffersonian/Knickerbocker room

Taught By: Dan Nguyen of ProPublica (Personal blog: danwin.com / Twitter: @dancow)

Intended Audience: Anyone who knows how to use a web browser. No programming required, though it may give you some ideas why you might want to learn some programming.

Finding a cat photo with the web inspector's panel in Chrome

Content of the class

We’ll first learn about the web inspector, a program (operated completely with the mouse) that’s already part of every major modern browser. Then we’ll walk through a variety of websites to show how the web inspector can be used to find interesting files and information about a website.

It’s similar to doing View Source, just more interactive and intuitive.

How it’s of practical use

The Web Inspector is an all-purpose tool, used mostly by web developers to figure out what’s going on with their own or others’ websites.

But non-developers can use it too.

And journalists may find opportunities and new leads with it, even knowing very little about HTML. I’ll demonstrate a couple of examples of how it helped us at ProPublica collect data for our award-winning Dollars for Docs project.

How this is possible: An organization’s website often involves two parties:

  1. The people who want to put information on the web
  2. The web developers they hired to do this

Sometimes (i.e. in many, many government websites), Party #1 has little understanding of how the web works and Party #2 just doesn’t give a damn as long as Party #1 is OK with the way it looks.

The web inspector is a handy way to find anything that unintentionally slipped through that chasm. No “hacking” involved, just using good observation skills and the handy web inspector.

Some sample sites we’ll examine

Other useful tools

  • Firefox’s Tamper Data plugin - For sites that use POST requests (i.e. you can’t just futz around with the URL to change the options and variables), this plugin allows you to alter the request before you send it to the server

  • Google Chrome Scraper plugin - Once you get the hang of “Inspect Element”, you can use this tool to “Scrape similar” elements into a spreadsheet
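If you’re curious why POST requests need a tool like Tamper Data, here is a minimal Python sketch of the difference. The address and the option names (county, year) are made up for illustration, and nothing is actually sent over the network. It only builds the two kinds of requests and prints where the options end up.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical search options, invented for this illustration.
options = {"county": "Boone", "year": "2011"}

# GET: the options ride along in the address itself, so you can edit them in the URL bar.
get_request = Request("http://www.example.com/search?" + urlencode(options))

# POST: the address stays the same and the options travel in the request body,
# which the URL bar never shows.
post_request = Request(
    "http://www.example.com/search",
    data=urlencode(options).encode("ascii"),
)

for request in (get_request, post_request):
    print(request.get_method(), request.full_url, request.data)
```

In the POST case the options are invisible in the address, which is why you need something that lets you edit the request body before it goes out.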

 

Supplementary Reading

Getting Started

All the modern browsers have the web inspector and the steps to use it are largely the same.

Here’s all you need to know to start exploring webpages as you browse.

1. Opening the Web Inspector

The easiest way is to right-click on something on the webpage and select Inspect Element.

Right-click to open the inspector

Shortcut: In Chrome, you can also press Ctrl-Shift-J (in Windows) or Cmd-Opt-J (on Macs) to bring up the inspector.

This will pop open a new panel in the bottom half of your browser:

The inspector is in the bottom half.

2. Explore the source code interactively

After performing the Inspect Element command, the Inspector will show you the webpage’s source code. This is just the text that makes up the webpage.

Whatever you right-clicked on to Inspect Element will be highlighted in the source code.

Now move your mouse down to the inspector panel. As you hover over a line of source code, the corresponding element on the webpage will light up.

This is useful for discovering the actual addresses of the files (such as images) that you see on a page. Sometimes it reveals interesting metadata.
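For those wondering what this looks like once you do learn some programming, here is a rough Python sketch of the same idea: reading a page’s source code and pulling out the addresses of its images. The page source and the example.com addresses are invented for illustration. A real page would be messier.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageAddressFinder(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.addresses = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            # Relative addresses are resolved against the page's own URL.
            self.addresses.append(urljoin(self.page_url, src))


# Hypothetical page source, made up for this example.
SAMPLE_HTML = """
<html><body>
  <h1>Cat of the day</h1>
  <img src="/photos/cat-large.jpg" alt="A cat" oncontextmenu="return false">
  <img src="http://static.example.com/icons/paw.png">
</body></html>
"""

finder = ImageAddressFinder("http://www.example.com/cats/today.html")
finder.feed(SAMPLE_HTML)
for address in finder.addresses:
    print(address)
```

Note that the first image tries to block right-clicking, but its address is still sitting in the source code as plain text.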

3. Open the Web Inspector’s Network Panel

The inspector has several panels and functions. We were just on the Elements panel.

The next most useful panel is the Network panel: This shows the list of files that have been loaded into the webpage.

The network panel tab.

If you’re unfamiliar with HTML: it is just text. The images, icons, videos, and other non-text elements you see on a page are usually brought in as external files.

The HTML contains the code that describes where these elements come from. But source code can be hard to read.

So instead, we use the Network Panel to see a straightforward list of what’s brought in. Since you can sort by file type and size, this is sometimes the fastest way to locate multimedia and large data files.
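The same sort-by-size trick can be sketched in a few lines of Python, assuming you have saved the Network panel’s list as a HAR file (a JSON export that some inspectors offer). The sample data below is made up, and the function name largest_files is just a label for this illustration.

```python
import json

# A tiny, made-up HAR-style export; a real one would come from the inspector.
SAMPLE_HAR = json.dumps({
    "log": {
        "entries": [
            {"request": {"url": "http://www.example.com/index.html"},
             "response": {"content": {"size": 5120, "mimeType": "text/html"}}},
            {"request": {"url": "http://www.example.com/data/payments.xml"},
             "response": {"content": {"size": 734003, "mimeType": "text/xml"}}},
            {"request": {"url": "http://www.example.com/img/logo.png"},
             "response": {"content": {"size": 20480, "mimeType": "image/png"}}},
        ]
    }
})


def largest_files(har_text, limit=10):
    """Return (size, type, address) rows for the biggest files, largest first."""
    entries = json.loads(har_text)["log"]["entries"]
    rows = []
    for entry in entries:
        content = entry["response"]["content"]
        rows.append((content.get("size", 0),
                     content.get("mimeType", ""),
                     entry["request"]["url"]))
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows[:limit]


for size, mime_type, url in largest_files(SAMPLE_HAR):
    print(f"{size:>8}  {mime_type:<12}  {url}")
```

A large XML or JSON file near the top of such a list is often the raw data feeding whatever chart or table the page displays.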

Tip: If you open the Network Panel after the page has loaded, it may not show all of the files that were loaded. So reload the page with the Network Panel open to see the full list. Sometimes you might have to Shift-Reload (hold down the Shift key while you reload).

The network panel.

It’s that simple. But if you want ideas on what to look for, check out the guides in the Supplementary Reading section.

The Examples

(This section not finished)

This cat photo (500px)

http://500px.com/photo/5002471

Goal: Use the inspector to get the address of an image (to download) even when the website tries to disable right-click-copying.

This cat video

http://www.youtube.com/watch?v=7M-jsjLB20Y&feature=relmfu

Goal: Get the address of a YouTube video file to download.

YourOpenBook

http://youropenbook.org/

Goal: Find how YourOpenBook is getting data from Facebook.

Missouri’s federal stimulus map

http://transform.mo.gov/map/

Goal: Get the raw data behind an annoying Flash map.

Cephalon’s payments to health-care providers

http://www.cephalon.com/prebuilt/aspx/2011-Expenditures-to-Healthcare-Professionals.aspx?

Goal: Get the raw data behind an annoying Flash table.

Allergan payments

http://www.allergan.com/responsibility/hcp_partnership_payments/physician_payments.htm

Florida Department of Corrections

http://www.dc.state.fl.us/ActiveInmates/search.asp

Rhode Island’s restaurant inspections

http://food.ri.digitalhealthdepartment.com/search.cfm?advanced=advanced


