A couple weeks ago I got an email from a friend asking for some tips on writing a scraper. As I was responding, I realized it would make a good post. I think writing scrapers is one of my specialties, and I’ve yet to come across a site I haven’t been able to pillage.
Here are a few thoughts:
I like Perl a lot for scraping, but PHP works too. Since scraping is kind of a rapid-prototype situation you should just use the tool you’re most comfortable with. The nice part about Perl for scraping is all the available modules to do various data mangling, and the ability to multi-thread it. I rarely write a scraper in PHP, but if you’re more comfortable with PHP by all means USE IT.
Scrape ALL THE DATA YOU CAN. You might think you only want one little bit of data from a site. Take everything anyway. Look, you’re pulling the page already, so you might as well save all of it in case you need it later. Dump everything into a database, and then pull out what you want from there into whatever you’re using it for. You’ll never regret having the full dump, but you will regret it when you need another bit of data and you didn’t scrape it.
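Here’s a rough sketch of the "dump everything" idea in Python, assuming SQLite as the database. The table name `raw_pages`, the file name `scrape.db`, and the helper names are all made up for illustration; use whatever storage you like.

```python
import sqlite3
import time
import urllib.request


def open_store(path="scrape.db"):
    """Open (or create) the SQLite file that holds the raw page dumps."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS raw_pages (
               url        TEXT PRIMARY KEY,
               fetched_at REAL NOT NULL,
               body       BLOB NOT NULL
           )"""
    )
    return conn


def save_page(conn, url, body):
    """Store the whole response body, replacing any earlier copy of the URL."""
    conn.execute(
        "INSERT OR REPLACE INTO raw_pages (url, fetched_at, body) VALUES (?, ?, ?)",
        (url, time.time(), body),
    )
    conn.commit()


def fetch(url):
    """Download a page and return its raw bytes, unparsed."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()
```

Something like `save_page(conn, url, fetch(url))` keeps the untouched page, so you can parse out a new field months later without hitting the site again.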
Be smart about re-indexing it. Say you scrape somesite.com/id=123. Don’t use id=123 as the ID in your own database. Some day the site owner you ripped is going to notice all the IDs are the same and could use that against you. Re-index it. Going back to my previous point, though, keep THEIR IDs stored alongside yours, so if you need to rip some more data later you still have their keys intact.
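One way to do the re-indexing, sketched in Python with SQLite: your own auto-generated `id` is the one you expose, and the site’s key sits in a private `source_id` column. The table and column names here are hypothetical.

```python
import sqlite3


def open_items(path="scrape.db"):
    """Create a table with our own key plus the source site's key."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS items (
               id        INTEGER PRIMARY KEY AUTOINCREMENT,
               source_id TEXT NOT NULL UNIQUE,
               title     TEXT
           )"""
    )
    return conn


def upsert_item(conn, source_id, title):
    """Insert or update a record by THEIR id and return OUR id for it."""
    conn.execute(
        "INSERT OR IGNORE INTO items (source_id, title) VALUES (?, ?)",
        (source_id, title),
    )
    conn.execute(
        "UPDATE items SET title = ? WHERE source_id = ?", (title, source_id)
    )
    conn.commit()
    row = conn.execute(
        "SELECT id FROM items WHERE source_id = ?", (source_id,)
    ).fetchone()
    return row[0]
```

Re-scraping the same `source_id` updates the existing row instead of creating a duplicate, so your own IDs stay stable.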
A lot of people worry about rotating or proxying IPs when they’re scraping. Don’t waste your time. I’ve yet to find a site (other than Google) that will actually be aware enough to block you for scraping. In the same vein, don’t piss-pound someone’s server when you’re scraping them. Put a sleep(1) in, for Christ’s sake.
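The sleep(1) advice looks roughly like this in Python. `polite_crawl` is a made-up name, and the `fetch` and `sleep` parameters exist only so you can swap in fakes when testing.

```python
import time
import urllib.request


def fetch(url):
    """Download a page and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()


def polite_crawl(urls, delay=1.0, fetch=fetch, sleep=time.sleep):
    """Yield (url, body) pairs, pausing `delay` seconds between requests."""
    for i, url in enumerate(urls):
        if i:
            # No need to wait before the very first request.
            sleep(delay)
        yield url, fetch(url)
```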
Distributing your scraper? That might make sense if you’re planning on spidering or scraping a lot of sites. Use a common database and make sure your scrapers never cross paths. Fetching the same page twice is a waste of your resources as well as the site’s.
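A minimal sketch of "never cross paths": every worker claims its next URL from one shared queue table inside a transaction, so no two workers get the same URL. SQLite stands in for the common database here; with scrapers on several machines you’d point this at a networked database instead. The table and function names are hypothetical.

```python
import sqlite3


def open_queue(path="queue.db"):
    """Open the shared queue; isolation_level=None means we run transactions by hand."""
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS queue (
               url        TEXT PRIMARY KEY,
               claimed_by TEXT
           )"""
    )
    return conn


def enqueue(conn, urls):
    """Add URLs to the queue, silently skipping ones already present."""
    conn.executemany(
        "INSERT OR IGNORE INTO queue (url) VALUES (?)", [(u,) for u in urls]
    )


def claim_next(conn, worker):
    """Hand one unclaimed URL to `worker`, or return None if none are left."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT url FROM queue WHERE claimed_by IS NULL LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE queue SET claimed_by = ? WHERE url = ?", (worker, row[0])
            )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return row[0] if row else None
```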
Don’t overthink or overcode it. This is like anything else. I’ve literally seen people spend a week writing a scraper with a million bells and whistles. Just make it WORK and run it.
Be aware of cookies. Sometimes you’ll want your scraper to happily accept cookies, other times you want to make sure it ignores them. Depends on the site you’re scraping. Be aware that some sites will have interstitials or other things that will mess with your scrape. Cookies may or may not be a way around those ads. Regardless, be aware of them and code accordingly.
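In Python’s standard library the accept-or-ignore choice comes down to whether you attach a cookie jar to your opener. A small sketch (`make_opener` is a made-up helper name):

```python
import http.cookiejar
import urllib.request


def make_opener(accept_cookies=True):
    """Return (opener, jar). With accept_cookies=False the jar is None."""
    if accept_cookies:
        # Cookies set by one response are sent back on later requests.
        jar = http.cookiejar.CookieJar()
        opener = urllib.request.build_opener(
            urllib.request.HTTPCookieProcessor(jar)
        )
        return opener, jar
    # A plain opener has no cookie handling, so Set-Cookie headers are ignored.
    return urllib.request.build_opener(), None
```

Call `opener.open(url)` for every request in a session; with the jar attached, a cookie handed out by an interstitial page is sent back on the next request.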
Use Live HTTP Headers in Firefox if you’re struggling with understanding why you can access a site in a browser but your Perl script doesn’t get the same responses.
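Once you’ve seen what the browser sends, the usual fix is to send the same headers from your script. A Python sketch follows; the header values are examples only, so paste in whatever your own browser actually shows.

```python
import urllib.request

# Example values only; copy the real ones from your browser's request.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}


def browser_like_request(url, referer=None):
    """Build a request that carries browser-style headers."""
    req = urllib.request.Request(url, headers=BROWSER_HEADERS)
    if referer:
        # Some sites check where you claim to be coming from.
        req.add_header("Referer", referer)
    return req
```

Pass the result to `urllib.request.urlopen(req)` and compare the response with what the browser gets. If they still differ, keep matching headers one at a time.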
Don’t forget APIs. Sometimes you don’t even need to scrape, you can pull the data with an API. One time I wanted a database of like every CD ever made. Amazon API for the win.
Finally, remember, if a browser can access it, you can scrape it. If you can’t figure it out, you’re not trying hard enough.

Good info, I’ve recently written several in PHP. Good point about grabbing all of the data you can; why waste time going back?
Regarding the last line, what about flash files? Or, god forbid, java applets?
How would you go about scraping those?
1) Extremely rare.
2) Most of the time it’s easier. The Java applet or Flash app is running on YOUR computer. That means it’s talking to a server and exchanging data. Sniff that traffic, figure out how to mimic it, and then connect to the server with your application that harvests data.
[...] Tips for building scrapers - good tips from Dave! [...]
Good stuff. I’d say though that Google isn’t even that good at rooting out scraper sites. Especially programming related sites. There’s a ton of forum type programming sites in their index. Maybe the code syntax throws off their ability to detect scraped content.
Nice post
Would you keep [a] tags and style tags etc in place in the content or would you strip them out - or by taking ‘everything you can’ do you mean scraping everything in the [body] tags?
I’ve tried using Perl’s HTML::TreeBuilder module to strip out content from [p] tags, but that massacres any line breaks etc. a bit too much.
Do you know of any example PHP scripts that could help me learn? I want to learn how to grab links, images, and content. Thanks