As you know if you read my little corner of the Web, we recently switched Gizmos for Geeks over to a new platform. That was quite a task, and of course, errors cropped up. For example, we had to reroute a bunch of old URLs from the old platform to conform to the new ones, and we got some of them wrong. The problem is that when visitors or search engine crawlers try to access those ‘new’ URLs, they get 404 (Not Found) errors. That’s bad.
Luckily, we’re using Google’s Webmaster Tools (GWT), which generates a report of all the URLs it can’t find. Since Google first started crawling the new site, we’ve fixed a bunch of those links, but not all. Google is also still crawling the site, since there are thousands of URLs and it’s not done yet. So I don’t really want to go through the entire list by hand to figure out which ones still end up as 404s.
Time to write a script!
GWT provides a link to download the entire list of broken URLs in CSV format, so I pulled that down, stripped all of the columns except the URLs, and saved the result as a plain text file, one URL per line. Next up was a simple bash script that loops through the file and tries to access each URL. If the request comes back as a 404, the script sticks that URL into another file. Here’s the script; it’s pretty simple and should run under the Bourne shell (sh) too. It uses the very useful curl program, whose -I flag grabs just the headers from a web server for a particular URL without pulling down the whole page.
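#!/bin/bash
# Start with an empty output file.
: > Real404s.txt
# Read the GWT export, one URL per line.
while IFS= read -r LINE
do
    echo "$LINE"
    # -I asks for the headers only; -s hides curl's progress meter.
    curl -sI "$LINE" > t0
    # Match 404 on the HTTP status line only, not anywhere in the headers.
    T1=`grep '^HTTP/.* 404' t0`
    if [ -n "$T1" ] ; then
        echo "$LINE" >> Real404s.txt
    fi
done < 404sfromGWT.txt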
I ran this successfully on a WinXP box under the Cygwin environment. Yes, this is a quick and dirty script. Not elegant at all, but gets the job done. I can continue reusing this to see if Google has found any more broken links on my site.
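If I ever get tired of stripping the CSV columns by hand, something like the rough sketch below might do the whole job in one pass. Treat it as a sketch under assumptions, not a tested replacement: the function names (urls_from_csv, http_status, find_404s) are made up for illustration, and it assumes the URL sits in the first column of the downloaded CSV, that the file has a single header row, and that no URL contains a comma. It also asks curl to print just the numeric status code with -w '%{http_code}' instead of grepping a temp file.

```shell
#!/bin/bash
# Sketch only: function names and the "URL is in column 1" layout are assumptions.

# Print the first CSV column, skipping the header row and stripping
# Windows carriage returns and optional surrounding double quotes.
urls_from_csv() {
    tail -n +2 "$1" | tr -d '\r' | cut -d, -f1 | sed 's/^"//; s/"$//'
}

# Print only the numeric HTTP status for a URL (curl prints 000 if the request fails).
http_status() {
    curl -s -o /dev/null -I -w '%{http_code}' "$1"
}

# Write every URL from the CSV ($1) that still returns 404 into the output file ($2).
find_404s() {
    : > "$2"
    urls_from_csv "$1" | while IFS= read -r url
    do
        if [ "$(http_status "$url")" = "404" ]; then
            echo "$url" >> "$2"
        fi
    done
}
```

With those functions loaded, a run would look like find_404s export.csv Real404s.txt, where export.csv stands in for whatever the downloaded file is called. Comparing the status code as an exact string avoids false hits from a 404 that happens to appear elsewhere in the headers.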
Now, to go fix all of those URLs. *sigh*. Where’s my robot when I need him?
