Archive for January, 2008

Had the credit card in hand..and then..

Apple released some pretty awesome stuff at MacWorld today. The MacBook Air is wicked. I want(ed) one. I had the credit card in hand. Thing is, not having an ExpressCard slot for an EVDO card is a total dealbreaker. WiFi just isn’t everywhere you want it to be, and even when it is, oftentimes it’s not free. I wish Apple would just put a Verizon or Sprint EVDO chipset in the thing and say activate it if you want to. I’d be fine with that.

Too bad :( I really wanted one.

Tips on Writing a Scraper

A couple weeks ago I got an email from a friend asking for some tips on writing a scraper. As I was responding, I realized it would make a good post. I think writing scrapers is one of my specialties, and I’ve yet to come across a site I haven’t been able to pillage.

Here’s a few thoughts:

I like Perl a lot for scraping, but PHP works too. Since scraping is kind of a rapid-prototype situation you should just use the tool you’re most comfortable with. The nice part about Perl for scraping is all the available modules to do various data mangling, and the ability to multi-thread it. I rarely write a scraper in PHP, but if you’re more comfortable with PHP by all means USE IT.

Scrape ALL THE DATA YOU CAN. You might think you only want one little bit of data from a site. Take everything you can and dump it into a database. Look, you’re pulling the page anyway, so you might as well save everything in case you need it later. Then pull what you want out of the database into whatever you’re using it for. You’ll never regret having the full dump, but you will regret it when you need another bit of data and you didn’t scrape it.
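Here is a minimal sketch of the "dump everything" idea, written in Python with SQLite purely for illustration (the post itself leans on Perl and PHP). The table name `raw_pages` and the helper names are made up for this example; the point is only that the whole page body gets stored, so any field can be extracted later without another fetch.

```python
import sqlite3
import time


def open_store(path=":memory:"):
    """Open (or create) a SQLite store for raw page dumps.

    The schema is hypothetical: one row per URL, holding the full body.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_pages ("
        " url TEXT PRIMARY KEY,"
        " fetched_at REAL NOT NULL,"
        " body TEXT NOT NULL)"
    )
    return conn


def save_page(conn, url, body):
    """Store the entire page, replacing any older copy of the same URL."""
    conn.execute(
        "INSERT OR REPLACE INTO raw_pages (url, fetched_at, body)"
        " VALUES (?, ?, ?)",
        (url, time.time(), body),
    )
    conn.commit()


def load_page(conn, url):
    """Return the stored body for a URL, or None if it was never scraped."""
    row = conn.execute(
        "SELECT body FROM raw_pages WHERE url = ?", (url,)
    ).fetchone()
    return row[0] if row else None
```

Parsing then becomes a separate step that reads from `raw_pages`, so changing your mind about which fields you need costs a query, not a re-scrape.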

Be smart about re-indexing it. Say you scrape somesite.com/id=123. Don’t put it into your database with id=123. Some day the site owner you ripped is going to notice all the IDs are the same and could use that against you. Re-index it. Going back to my previous point, keep THEIR IDs in your database as well, so if you need to rip some more data you still have their keys intact.
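One way to sketch the re-indexing advice, again in Python with SQLite as a stand-in: your own auto-assigned `id` is what you expose, and the site's original key is kept in a separate column for later lookups. The table and column names (`items`, `source_site`, `source_id`) are assumptions for this example.

```python
import sqlite3


def open_db(path=":memory:"):
    """Create a table whose primary key is ours, not the source site's."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"  # our own key
        " source_site TEXT NOT NULL,"
        " source_id TEXT NOT NULL,"               # their key, kept privately
        " title TEXT,"
        " UNIQUE (source_site, source_id))"
    )
    return conn


def upsert_item(conn, source_site, source_id, title):
    """Insert an item once per (site, source id) and return our own id."""
    conn.execute(
        "INSERT OR IGNORE INTO items (source_site, source_id, title)"
        " VALUES (?, ?, ?)",
        (source_site, source_id, title),
    )
    conn.commit()
    row = conn.execute(
        "SELECT id FROM items WHERE source_site = ? AND source_id = ?",
        (source_site, source_id),
    ).fetchone()
    return row[0]
```

Scraping the same source record twice returns the same local id, and `source_id` is still there when you need to go back for more.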

A lot of people worry about rotating or proxying IPs when they’re scraping. Don’t waste your time. I’ve yet to find a site (other than Google) that will actually be aware enough to block you for scraping. In the same vein, don’t piss-pound someone’s server when you’re scraping them. Put a sleep(1) in, for Christ’s sake.
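The sleep(1) advice might look something like this in Python. It is a sketch, not a finished crawler: `crawl` and its `fetch_fn`/`sleep_fn` parameters are names invented here so the pause can be swapped out or tested without hitting a real server.

```python
import time
import urllib.request


def fetch(url, timeout=30):
    """Fetch one URL and return the raw response bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def crawl(urls, fetch_fn=fetch, delay=1.0, sleep_fn=time.sleep):
    """Fetch each URL in turn, pausing `delay` seconds between requests."""
    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep_fn(delay)  # the sleep(1): be gentle with their server
        pages[url] = fetch_fn(url)
    return pages
```

Called as `crawl(list_of_urls)`, it makes at most one request per second plus however long each fetch takes.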

Distributing your scraper? That might make sense if you’re planning on spidering or scraping a lot of sites. Use a common database and make sure your scrapers never cross paths. Scraping the same page twice is a waste of your resources as well as those of the site you’re pillaging.
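A rough sketch of the "common database" idea: every scraper claims its next URL from one shared queue table, so no two workers grab the same page. SQLite is used here only to keep the example self-contained; with scrapers on several machines you would presumably point this at a shared database server instead. The `queue` table and the function names are hypothetical.

```python
import sqlite3


def open_queue(path=":memory:"):
    """Open the shared queue; autocommit mode so transactions are explicit."""
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS queue ("
        " url TEXT PRIMARY KEY,"
        " claimed_by TEXT)"
    )
    return conn


def enqueue(conn, urls):
    """Add URLs to the queue, silently skipping ones already present."""
    conn.executemany(
        "INSERT OR IGNORE INTO queue (url) VALUES (?)",
        [(u,) for u in urls],
    )


def claim_next(conn, worker):
    """Atomically claim one unclaimed URL for `worker`, or return None."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT url FROM queue WHERE claimed_by IS NULL"
            " ORDER BY url LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE queue SET claimed_by = ? WHERE url = ?",
                (worker, row[0]),
            )
        conn.execute("COMMIT")
        return row[0] if row else None
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

Each worker just loops on `claim_next` until it gets `None`, and the claim step guarantees the scrapers never cross paths.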

Don’t overthink or overcode it. This is like anything else. I’ve literally seen people spend a week writing a scraper with a million bells and whistles. Just make it WORK and run it.

Be aware of cookies. Sometimes you’ll want your scraper to happily accept cookies, other times you want to make sure it ignores them. Depends on the site you’re scraping. Be aware that some sites will have interstitials or other things that will mess with your scrape. Cookies may or may not be a way around those ads. Regardless, be aware of them and code accordingly.
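In Python's standard library the cookie decision comes down to whether you attach a cookie jar to your opener. This is only a sketch; `make_opener` is a helper name invented for the example.

```python
import http.cookiejar
import urllib.request


def make_opener(accept_cookies=True):
    """Build a URL opener that either keeps cookies or ignores them.

    Returns (opener, jar); jar is None when cookies are ignored.
    """
    if accept_cookies:
        jar = http.cookiejar.CookieJar()
        opener = urllib.request.build_opener(
            urllib.request.HTTPCookieProcessor(jar)
        )
        return opener, jar
    # A plain opener has no cookie handler, so Set-Cookie headers are dropped.
    return urllib.request.build_opener(), None
```

Use `opener.open(url)` in place of `urlopen(url)`. With the jar attached, cookies set by one response are sent back on later requests through the same opener, which is sometimes what gets you past an interstitial and sometimes exactly what you want to avoid.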

Use Live HTTP Headers in Firefox if you’re struggling with understanding why you can access a site in a browser but your Perl script doesn’t get the same responses.
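Once you can see what the browser sends, the usual fix is to send the same headers from your script. A sketch in Python follows; the header values below are placeholders, not something the post specifies, so swap in whatever your own browser actually sends.

```python
import urllib.request

# Placeholder values: replace with the headers captured from your browser.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (placeholder; copy your browser's string)",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.8",
}


def build_request(url, extra_headers=None):
    """Build a request carrying browser-like headers plus any extras."""
    headers = dict(BROWSER_HEADERS)
    headers.update(extra_headers or {})
    return urllib.request.Request(url, headers=headers)
```

For example, `build_request("http://somesite.com/id=123", {"Referer": "http://somesite.com/"})` can be passed straight to `urllib.request.urlopen`. If the responses still differ from the browser's, compare the two header sets line by line.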

Don’t forget APIs. Sometimes you don’t even need to scrape, you can pull the data with an API. One time I wanted a database of like every CD ever made. Amazon API for the win.

Finally, remember, if a browser can access it, you can scrape it. If you can’t figure it out, you’re not trying hard enough.

Props Week: Friday - Chris Winfield/10e20

To be honest I wasn’t sure who I was going to give props to today. As I was firing off some emails and IMs it occurred to me. Chris Winfield has been one of the most gracious, coolest people I’ve met in the industry since the first time I met him. I think we first met at SES San Jose this summer, and we’ve always grabbed a drink at every conference. Chris (and his lady) are always super nice. Maybe more importantly though, I can always fire off an email to Chris if I need a quick favor and he ALWAYS helps out. His company is without question where I would turn for social media marketing, in other words, getting on digg. I think Chris and 10e20 (coolest freaking company name ever too) have been making waves, and I expect they’ll continue. Good people.

ADD sidenote: It never ceases to amaze me how connected we are. Within a few seconds of receiving a message from one person, I can reach out and touch 10 other people in all different parts of the US/world with different mediums (IM, email, phone). It’s pretty badass when you think back to even 10 years ago when we weren’t all connected this seamlessly.

Props Week: Thursday - SEOmoz

And then this came out of left field. I don’t always agree with everything at SEOmoz, and sometimes they make me want to bash my head through my overpriced Apple display. But oftentimes the content they produce is ridiculously complete and fantastic. This post on internal linking from a few days back is a great example. I don’t know what any of it means, but I’m pretty sure it’s good. Or take Jane’s treatise on social media. Yes, it’s premium content, but I’ve seen it, and it’s obscene. And the sweet, sweet hate from Rebecca is good shit. If I knew someone just dipping their toes in SEO and I could only choose one site to send them to, it would have to be SEOmoz shoemoney.com, ok, SEOmoz.

Props Week: Wednesday - SEOBlackHat

I’ve always liked Quadszilla over at SEO Black Hat, and some of his most recent posts are really good. This particular post is a great method for building a new habit. I use the exact same technique for keeping track of my cardio workouts. I’m really consistent with lifting, but cardio is harder to stick with. Having a calendar in your face with big red X’s is hard to argue with. I’m not a member of the private SEO forum but I have heard really good things if you’re into the forum thing.