Inspect the Web with Your Browser's Web Inspector
bit.ly link for this page: http://bit.ly/catscrape
A NICAR 2012 Hands-On Session
How your browser’s built-in tool can help you get data faster, locate hidden files, download cat videos, and help you understand HTML and the web faster than any HTML/CSS 101 class.
Time: Saturday, 3PM, Jeffersonian/Knickerbocker room
Taught By: Dan Nguyen of ProPublica (Personal blog: danwin.com / Twitter: @dancow)
Intended Audience: Anyone who knows how to use a web browser. No programming required, though it may give you some ideas why you might want to learn some programming.
Content of the class
We’ll first learn about the web inspector, a program (operated completely with the mouse) that’s already part of every major modern browser. Then we’ll walk through a variety of websites to show how the web inspector can be used to find interesting files and information about a website.
It’s similar to doing View Source, just more interactive and intuitive.
How it’s of practical use
The Web Inspector is an all-purpose tool, used mostly by web developers to figure out what’s going on with their own or others’ websites.
But non-developers can use it too.
And journalists may find opportunities and new leads with it, even if they know very little about HTML. I’ll demonstrate a couple of examples of how it helped us at ProPublica collect data for our award-winning Dollars for Docs project.
How this is possible: An organization’s website often involves two parties:
- The people who want to put information on the web
- The web developers they hired to do this
Sometimes (i.e. in many, many government websites), Party #1 has little understanding of how the web works and Party #2 just doesn’t give a damn as long as Party #1 is OK with the way it looks.
The web inspector is a handy way to find anything that unintentionally slipped through that chasm. No “hacking” involved, just using good observation skills and the handy web inspector.
Some sample sites we’ll examine
- This cat photo (500px)
- This cat video
- YourOpenBook
- Missouri’s federal stimulus map
- Cephalon’s payments to health-care providers
- Allergan payments
- Florida Department of Corrections
- Rhode Island’s restaurant inspections
Other useful tools
- Firefox’s Tamper Data plugin - For sites that use POST requests (i.e. you can’t just futz around with the URL to change the options and variables), this plugin allows you to alter the request before you send it to the server
- Google Chrome Scraper plugin - Once you get the hang of “Inspect Element”, you can use this tool to “Scrape similar” elements into a spreadsheet
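To make the GET-versus-POST difference concrete, here is a minimal Python sketch. It is not how Tamper Data works internally; it only shows where the options travel in each kind of request. The address `http://example.com/search` and the option names are made up for illustration, and nothing is actually sent over the network.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical search form address and options, for illustration only
SEARCH_URL = "http://example.com/search"
OPTIONS = {"lastname": "smith", "page": "2"}


def build_get_request(url, options):
    """GET: the options are part of the URL, so you can edit them in the address bar."""
    return Request(url + "?" + urlencode(options))


def build_post_request(url, options):
    """POST: the options travel in the request body, so the URL never changes."""
    return Request(url, data=urlencode(options).encode("ascii"))


# Build both requests without sending them
get_req = build_get_request(SEARCH_URL, OPTIONS)
post_req = build_post_request(SEARCH_URL, OPTIONS)

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

With a GET request, changing `page=2` to `page=3` is just a matter of editing the URL. With a POST request the URL stays the same, which is why you need something that can edit the request body before it goes out.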
Supplementary Reading
- I’ve written a multi-part guide here: Meet Your Web Inspector
- If you’re wondering, “what relevance does this web inspector tool have to journalism?”, read this guide I wrote for ProPublica and the Dollars for Docs project: Reading Data from Flash Sites
- From HTML5 Rocks tutorials: Intro to Chrome Developer Tools
Getting Started
All the modern browsers have the web inspector and the steps to use it are largely the same.
Here’s all you need to know to start exploring webpages as you browse.
1. Opening the Web Inspector
The easiest way is to right-click on something on the webpage and select Inspect Element.
Shortcut: You can also press Ctrl-Shift-J (in Windows) or Cmd-Opt-J (on a Mac) to bring up the inspector.
This will pop open a new panel in the bottom-half of your browser:
2. Explore the source code interactively
After performing the Inspect Element command, the Inspector will show you the webpage’s source code. This is just the text that makes up the webpage.
Whatever you right-clicked on to Inspect Element will be highlighted in the source code.
Now move your mouse down to the inspector panel. As you hover over a line of the source code, the corresponding element on the webpage will light up.
This is useful for discovering the actual addresses of the files (such as images) that you see on a page. Sometimes it reveals interesting metadata.
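If you ever do want to try some programming, the same idea can be sketched in a few lines of Python: read the source code as text and pull out the address of every image. This is only a rough illustration, not what the inspector does under the hood. The sample HTML and the `example.com` addresses are invented for the demo.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageAddressFinder(HTMLParser):
    """Collects the address of every <img> tag found in a page's source code."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.addresses = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        for name, value in attrs:
            if name == "src" and value:
                # Turn relative addresses into full ones using the page's URL
                self.addresses.append(urljoin(self.base_url, value))


# Made-up page source, standing in for what the Elements panel shows
SAMPLE_HTML = """
<html><body>
  <h1>Cats</h1>
  <img src="/photos/cat-large.jpg" alt="A cat">
  <img src="http://cdn.example.com/icons/paw.png">
</body></html>
"""


def find_image_addresses(html, base_url):
    """Return the full address of each image referenced in the HTML text."""
    finder = ImageAddressFinder(base_url)
    finder.feed(html)
    return finder.addresses


if __name__ == "__main__":
    for address in find_image_addresses(SAMPLE_HTML, "http://example.com/gallery/"):
        print(address)
```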
3. Open the Web Inspector’s Network Panel
The inspector has several panels and functions. We were just on the Elements panel.
The next most useful panel is the Network panel: This shows the list of files that have been loaded into the webpage.
If you’re unfamiliar with HTML…HTML is just text. Whenever you see images, icons, videos, and other non-text elements – these are usually brought in as external files.
The HTML contains the code that describes where these elements come from. But source code can be hard to read.
So instead, we use the Network Panel to see a straightforward list of what’s brought in. Since you can sort by file type and size, this is sometimes the fastest way to locate multimedia and large data files.
Tip: If you open the Network Panel after the page has loaded, it may not show all of the files that were loaded. Reload the page with the Network Panel open to see the full list. Sometimes you might have to hold down the Shift key while reloading, which makes the browser fetch the files again instead of reusing its cached copies.
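Some browsers also let you save the Network Panel's list as a HAR file, which is just JSON text. Assuming you have one, the "sort by size" trick can be sketched in Python as below. The sample entries and `example.com` addresses are invented, and the sketch only reads the few HAR fields it needs (the request URL, the response content size, and the MIME type).

```python
import json

# Invented sample in the shape of a HAR export; a real file has many more fields
SAMPLE_HAR = json.loads("""
{"log": {"entries": [
  {"request": {"url": "http://example.com/index.html"},
   "response": {"content": {"size": 5120, "mimeType": "text/html"}}},
  {"request": {"url": "http://example.com/data/payments.xml"},
   "response": {"content": {"size": 2048000, "mimeType": "text/xml"}}},
  {"request": {"url": "http://example.com/img/logo.png"},
   "response": {"content": {"size": 20480, "mimeType": "image/png"}}}
]}}
""")


def files_by_size(har):
    """Return (size, type, address) rows for every loaded file, largest first."""
    rows = []
    for entry in har["log"]["entries"]:
        content = entry["response"]["content"]
        rows.append((
            content.get("size", 0),
            content.get("mimeType", ""),
            entry["request"]["url"],
        ))
    return sorted(rows, reverse=True)


if __name__ == "__main__":
    for size, mime_type, url in files_by_size(SAMPLE_HAR):
        print(size, mime_type, url)
```

In this made-up example the big XML file floats to the top, which is the same thing you are looking for when you sort the Network Panel by size: the large data file hiding behind the page.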
It’s that simple. But if you want ideas on what to look for, check out the guides in the Supplementary Reading section.
The Examples
(This section not finished)
This cat photo (500px)
http://500px.com/photo/5002471
Goal: Use the inspector to get the address of an image (to download) even when the website tries to disable right-click-copying.
This cat video
http://www.youtube.com/watch?v=7M-jsjLB20Y&feature=relmfu
Goal: Get the address of a YouTube video file to download.
YourOpenBook
http://youropenbook.org/
Goal: Find how YourOpenBook is getting data from Facebook.
Missouri’s federal stimulus map
http://transform.mo.gov/map/
Goal: Get the raw data behind an annoying Flash map.
Cephalon’s payments to health-care providers
http://www.cephalon.com/prebuilt/aspx/2011-Expenditures-to-Healthcare-Professionals.aspx?
Goal: Get the raw data behind an annoying Flash table.
Allergan payments
http://www.allergan.com/responsibility/hcp_partnership_payments/physician_payments.htm
Florida Department of Corrections
http://www.dc.state.fl.us/ActiveInmates/search.asp
Rhode Island’s restaurant inspections
http://food.ri.digitalhealthdepartment.com/search.cfm?advanced=advanced