Well, if you have a somewhat popular website, you are eventually going to run into this problem. For those of you who are not familiar with website content scrapers: they are web bots, just like search engine bots, with the main difference that the content "stolen" from your website gets republished, fully, partially, or even auto-modified, on other websites. This type of attack is usually quite successful because the new website will start ranking and taking some of your traffic.
As far as I know, Google and Yahoo are still trying to figure out a good way to filter their results by improving their duplicate content filters. You can imagine how complicated it is to tell an authorized content copy, such as the manufacturer's descriptions on the products you sell, apart from a stolen one. Under the current algorithms, a fair amount of duplicate text, nearly a full paragraph, can be shown without any ranking penalty.
Now it's up to us to defend our content and block all scrapers once and for all!
I'd also like to mention that this script MUST NOT be used if you're offering any feeds. Some monitoring/statistics services may not work either! I'm using it on roughly 99% of my sites nowadays and it solved my content scraper issue.
My personal solution is to filter the HTTP User-Agent header (and hope it isn't forged), then check it against a list of known/accepted user agents, or kill the connection with a Forbidden message:
$session["user_agent"]=" ".$_SERVER['HTTP_USER_AGENT']; $browser=strtoupper($session["user_agent"]); switch($browser){ case (strpos($browser, "MSIE")>0) : { $session["browser"]="MSIE"; break; } case (strpos($browser, "FIREFOX")>0) : { $session["browser"]="FIREFOX"; break; } case (strpos($browser, "SAFARI")>0) : { $session["browser"]="SAFARI"; break; } case (strpos($browser, "CHROME")>0) : { $session["browser"]="CHROME"; break; } case (strpos($browser, "OPERA")>0) : { $session["browser"]="OPERA"; break; } case (strpos($browser, "MOZILLA")>0) : { $session["browser"]="MOZILLA"; break; } case (strpos($browser, "GOOGLE")>0) : { $session["browser"]="ENGINE"; break; } case (strpos($browser, "YAHOO!")>0) : { $session["browser"]="ENGINE"; break; } case (strpos($browser, "MSNBOT")>0) : { $session["browser"]="ENGINE"; break; } case (strpos($browser, "SITEMAPS")>0) : { $session["browser"]="ENGINE"; break; } case (strpos($browser, "SITEMAPS")>0) : { $session["browser"]="ENGINE"; break; } default : { // THIS IS UNKNOWN BROWSER. MOST PROBABLY SPAM BOT! header('HTTP/1.1 403 Forbidden'); die("THIS BROWSER VERSION IS CURRENTLY NOT SUPPORTED! "); break; } }
Please try using Internet Explorer, FireFox or Safari
You can see that this is not the most elegant way of preventing bad bots, but it's the only secure one I know of. If I find a service or a frequently updated list of Google datacenter IPs I will let you know and modify this version to work with it, but that's all I've got for now. If you want to use this code, copy it at the very top of your PHP script. Note that in order to send the 403 Forbidden correctly (in case this is a new browser or a PDA) we show an error message along with a support email or phone number. If you ever get such a call, you should add that browser to the list; we need all the quality visitors we can get!
Well I think that’s more than I need to share with you today, so I’ll call it a day!
P.S. I forgot to mention that this type of filter also prevents 99% of all spam bots, link bots, exploit bots and so on. You get the picture.
I would love to read more about your thoughts on Google’s duplicate content filter. I don’t know much about it. Do you place this in your website header?
@okinawa
Yes, you put it at the very top of the page, before anything else, so that it reads the user agent and returns the proper HTTP response code. If you print anything before this code it won't work!! (You should get a warning or an error saying that the headers were already sent.)
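To illustrate the placement (just a minimal sketch; "ua_filter.php" is a hypothetical file name for the snippet from the post), the filter has to run before any HTML output:

<?php
// Run the user-agent filter before any output is sent, otherwise
// header('HTTP/1.1 403 Forbidden') fails with "headers already sent".
require_once 'ua_filter.php'; // hypothetical file containing the switch() code above
?>
<html>
<head><title>My page</title></head>
<body>...</body>
</html>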
Venetsian
Please note that Lynx and other text-based browsers will be banned too by your PHP script.
This is not a good solution for blocking bots. It is common for spam bots to use the User-Agent of a real browser to get past this kind of filter. Also, although some mobile phone browsers are based on Opera or Safari, this script would block all the ones that don't identify themselves as a standard browser.
Yes, I agree to a certain extent, but it still blocks a large number of spam bots. You can always add more user agents to the allowed list, which solves that problem.
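For example, to let Lynx and other text browsers through (the earlier comment points out they are currently banned), you would just add another case to the switch; a small sketch, assuming the browser identifies itself with "Lynx" in its User-Agent string:

case (strpos($browser, "LYNX") !== false):
    // Text-mode browser - allow it through instead of hitting the default 403.
    $session["browser"] = "LYNX";
    break;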
I think the best way is a centralized spam bot detection system, but I don't think anybody will go for it: you would have to establish trust with that kind of organization before using such a service, which makes it quite complicated to arrange. If somebody does build it, it will solve the spam bot issue once and for all. If you do know which IPs are attacking you, you should edit your .htaccess file to deny their IP address ranges. If you don't, then... well, you are not protected.
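A minimal .htaccess sketch for denying specific addresses, using the classic Apache 2.2-style Order/Allow/Deny directives (the IPs below are placeholders, not real attackers; Apache 2.4 uses Require directives instead):

# Block known scraper/spam IP ranges (example addresses only)
Order Allow,Deny
Allow from all
Deny from 192.0.2.0/24
Deny from 198.51.100.15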
I have found that the best defense is a combination of mod_security on the server with Bad Behavior installed on the domain account(s). I'm currently in the process of porting all the rules that Bad Behavior uses into mod_security to try to reduce the extra overhead, but it's not an easy task.
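For anyone wondering what such a rule looks like, here is a rough ModSecurity 2.x sketch that denies any request whose User-Agent contains a blacklisted token (the rule id and the "badbot" string are placeholders, not Bad Behavior's actual rules):

# Deny requests whose User-Agent contains a blacklisted token (placeholder values)
SecRule REQUEST_HEADERS:User-Agent "@contains badbot" \
    "id:100001,phase:1,deny,status:403,msg:'Blocked bad bot User-Agent'"

Doing the check in mod_security keeps it at the web server level, so PHP never has to load for blocked requests.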