Wget
From CleanPosts
wget
My blog is just a diary, with very few comments from the outside, so to make a copy of my blog I can just load all the posts into my browser at once and save the whole thing to disk. But the Elephant Bar is a blog with about 70 comments per day. I've been posting on the Elephant Bar for about three years, and I wanted to retrieve all my comments. But there are more than 2,000 articles, and to surf to the comments you have to follow links, and I didn't want to get Mouse Finger. So I set Linux to do it for me with the following command in a terminal session:
wget -nc -w 3 -r --random-wait -l 2 -np -E --domains=2164th.blogspot.com http://2164th.blogspot.com
-nc means no clobber. It means if wget has already downloaded a page, it won't download it again.
-w 3 means wait three seconds between downloads, so I don't hammer the target website. That allows the boys at the Elephant Bar to keep posting.
-r means "recursive"...that means I want wget to surf to any links it finds and download those too.
--random-wait varies the three second delay randomly (anywhere from half to one and a half times the wait, so 1.5 to 4.5 seconds) so the server on the other end doesn't think I'm a robot doing this, which I am.
-l 2 means I just want to surf to a depth of 2. That way I don't probe too deeply which would fill up my hard drive real quick. I wanted the comments, but I don't want to follow any links made in those comments. Otherwise I'd be downloading the whole internet.
-np means "no parent". Wget won't climb above the directory I started in. I just want to stay in the Elephant Bar, not go up to Blogger itself and start downloading everyone else's blog.
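-E means that when a page comes down as HTML but its URL ends in something funky like ".asp", wget tacks ".html" onto the saved filename so it opens properly in my browser. It doesn't convert the page itself.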
--domains=2164th.blogspot.com limits my crawl to the Elephant Bar, so I don't follow anything listed in the blogroll. And the last bit is the URL to the Elephant Bar itself.
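If the short flags are hard to remember, the same crawl can be spelled with wget's long option names. This is just a sketch of the command above, not a different method. The variable SITE and the function name mirror_blog are my own inventions for illustration, and --adjust-extension is the long name for -E in wget 1.12 and later (older versions call it --html-extension).

```shell
#!/bin/sh
# SITE and mirror_blog are illustrative names, not part of wget.
SITE="2164th.blogspot.com"

# Same crawl as the one-liner above, using long option names.
mirror_blog() {
    wget --no-clobber \
         --wait=3 \
         --random-wait \
         --recursive \
         --level=2 \
         --no-parent \
         --adjust-extension \
         --domains="$SITE" \
         "http://$SITE"
}

# Uncomment the next line to actually start the crawl.
# mirror_blog
```

Because of --no-clobber, running it a second time after an interruption skips the pages already on disk.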
Wget ran for about 24 hours and finally finished up. I now have a mirror of the EB blog on my hard drive, complete with all the comments, which I want to filter to get just my comments and repost them here on my blog. Fresh off that success, I'm stealing the entire website at www.textfiles.com. Linux is wunnerful.
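For the filtering step, a first pass could be as simple as listing which saved pages mention my name at all. This is only a rough sketch, assuming GNU grep and assuming my name appears as plain text in the saved HTML. NAME is a placeholder, and pages_with_my_comments is a made-up helper name. With -r, wget saves everything into a directory named after the host, which is where MIRROR points.

```shell
#!/bin/sh
# Placeholder: change NAME to the name you comment under.
NAME="YourCommentName"
# wget -r saves pages into a directory named after the host.
MIRROR="2164th.blogspot.com"

# Print the path of every saved .html page that contains the given text.
# $1 = text to look for (matched literally), $2 = directory to search
pages_with_my_comments() {
    grep -rlF --include='*.html' -- "$1" "$2"
}

# Usage: pages_with_my_comments "$NAME" "$MIRROR" > my_pages.txt
```

This only narrows the pile down to the pages worth opening. Pulling the individual comments out of each page would still need something that understands the blog's HTML.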

