Hey, Would you like to work at Home ?? Just click here No need to pay, just register free and activate your account and get data Entry Work at your Home.

Wednesday, February 6, 2008

Deal with Bad Robots and Pesky Spambots

Using the Mod Rewrite URL Rewriting Engine to Deal with Bad Robots and Pesky Spambots


One of the greatest features of the APACHE server is Mod Rewrite. This optional module allows you to control URL access in an almost infinite manner of ways. Our task at hand though is to protect our server from wasteful accesses that for a variety of reasons can drag the server to its knees.


The problems with many robots and spambots can be broken down into a few areas:



  • They either ignore the robots.txt instructions file, or attempt to exploit it to find otherwise unlinked directories.



  • Due to programming errors, they can get caught in loops, attempting to access files that do not exist.



  • If they are what is called multi-threaded, they can launch an almost unlimited number of concurrent connections to your site creating a serious system load.



  • Do you really feel like paying for bandwidth when all somebody is doing is trying to get e-mail addresses out of your pages?


There was a time when I was using a browser detection in my Server Side Includes that would basically spill about 200K of garbage down the throat of any spambot that came our way. Okay, I confess that revenge felt good, but when I thought it over I realized that I was placing more strain on our server, and by providing a huge list of bogus e-mail addresses, was placing a strain on the SMTP server that the spammer would eventually hijack. It was then that I decided to start using the RewriteEngine module.

Any visiting spambot or what I feel is a problem robot is directed to:


problem.html

In this page, I explain why they ended up where they did. In the case of people attempting to capture the site for off-line viewing, I try to be of assistance. If somebody thinks enough of BNB to save it, I owe them something in return.

The elegance of this solution is that the offending 'bot never sees anything but that one small page. No matter what URL they request from our site, that is the only page they will ever see. It is handled at the server level and cannot be bypassed.









NOTE: In order to use this feature of the Apache Server, you must make sure that the server was installed with the mod_rewrite.o file. This is done by adding the line to the Configuration file before compiling the server.
AddModule modules/standard/mod_rewrite.o


THIS SOUNDS GREAT! HOW DO I DO IT?


As of this writing, here is my little rewrite instruction code:


RewriteEngine  on
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^CherryPicker [OR]
RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport*28 [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector
RewriteRule ^.*$ problem.html [L]

What this code basically says, is that if the HTTP_USER_AGENT from the beginning matches any of the listed values, to redirect them to the problem.html page.

There is a performance penalty for placing RewriteEngine directives in your .htaccess file, but I recommend doing so for the following reasons.


Since you are most likely not going to be dealing with a lot of spiders at once, and since they are not going to get anyplace anyway, what is called the Chicken & the Egg Problem is not going to be much of an issue. As you identify new 'bots, you can add them to the list without having to restart your server.


Note: Do NOT place any links to your site on the page the spiders or spambots are being redirected to! You can also protect individual directories by creating an .htaccess file in the directory you would to forbid access to.

No comments:

Your Ad Here