Weaponizing Web Crawlers

Recently, we have seen a high number of web application attacks carrying the user agents of known web bots. On its own, this is not a big deal. Changing a user agent is trivial, and using “Yahoo! Slurp” as a user agent can be a way of evading detection, since many people will instantly dismiss it as traffic that can be trusted. Our interest went up substantially when we realized that these attacks were coming from real web crawlers following links that included attack code. Additionally, since some of these were single-line defacements similar to ASPROX, they have a high likelihood of actually working. We were initially unable to find the site where the attack links were posted, so we assumed that this was a new method of spreading attacks without leaving an easily traceable route back to the attacker. Over the next few days the attacks kept pouring in, sometimes several within a few seconds.

At this point, I decided to do some testing by loading one site with attack links that pointed, payloads included, at another site I own. I used a combination of regular links and hidden “bot trap” links, with all payloads URL-encoded. It wasn't long before the web crawlers were attacking my site with XSS and SQL injection payloads. The only two drawbacks to this type of attack appear to be that the attacker has to wait for the bot to hit the attack site, and that the spiders only use GET requests, which is very reminiscent of ASPROX.

We were eventually able to locate one of the sites causing the XSS attacks. It turned out to be a posting on a hacking forum where members posted links to sites that displayed the Scan Alert “hacker safe” logo and that they had been able to inject. This type of attack has a few fairly interesting implications.

Blocking Yahoo might be a problem

By hiding behind the web crawlers, the attacker can create pages with the payloads or, even easier and possibly more effective, post to established forums to spread the payloads from essentially unblockable IP addresses. The addresses are unblockable because sites rely on these bots to list them in search results. Blocking a crawler could cause a site’s page rank to drop, which in turn hurts advertising revenue. Some crawlers are also known for ignoring robots.txt, so robots.txt alone cannot be relied on to keep them out.
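Before deciding how to treat this kind of traffic, it helps to know whether a request really comes from the crawler its user agent names or from someone spoofing it. One common way to check is forward-confirmed reverse DNS: look up the hostname for the IP address, confirm it sits under a domain the search engine uses, then resolve that hostname and confirm it maps back to the same IP. The following is only a minimal sketch. The function name and the domain suffixes are assumptions chosen for illustration, so check each search engine's own documentation for the domains its crawlers actually use. It also only handles IPv4.

```python
import socket

# Assumed hostname suffixes, for illustration only. Verify them against
# each search engine's published guidance before relying on them.
CRAWLER_SUFFIXES = (".crawl.yahoo.net", ".search.msn.com", ".googlebot.com")


def is_genuine_crawler(ip, reverse_lookup=None, forward_lookup=None):
    """Forward-confirmed reverse DNS check for an IPv4 address.

    The two lookup callables are injectable so the logic can be
    exercised without touching the network.
    """
    if reverse_lookup is None:
        # gethostbyaddr returns (hostname, aliases, addresses)
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        # gethostbyname_ex returns (hostname, aliases, ipv4_addresses)
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]

    try:
        hostname = reverse_lookup(ip).rstrip(".").lower()
        # Step 1: the PTR name must sit under a known crawler domain.
        if not hostname.endswith(CRAWLER_SUFFIXES):
            return False
        # Step 2: that name must resolve back to the original address.
        return ip in forward_lookup(hostname)
    except OSError:
        # Covers socket.herror and socket.gaierror (failed lookups).
        return False
```

Note that a positive result only rules out a spoofed user agent. As described above, a genuine crawler can still deliver an attack payload, so this check should feed into a decision about the request and not replace inspection of the request itself.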

Got IPS?

Automated blocking can be even more problematic. Users of IPS systems with simplistic rule sets risk blocking the bots, and with them possibly the search engines themselves. This is a problem we often find with IPS systems: they sacrifice accuracy for speed, so they must either greatly limit their rule sets or accept frequent false positives. This is in part why we encourage our users to block only activity backed by correlation logic and not just a particular rule. Alert Logic Threat Manager customers have the option of using both.

So who are the offenders?

We were able to determine that MSNbot and Yahoo! Slurp, along with several lesser-known bots like Cuil, were the major offenders. In addition, Webtrends will happily follow malicious links. We have not seen Googlebot follow any of these links, so it is likely that Google has some sort of detection in place to prevent this activity. Good job, Google! Until all the bots start following the rules, this will continue to be a possible risk.

So what is the moral of the story?

Input validation matters. It is heavily scrutinized when developing applications that users interact with directly, but it is often ignored when developing applications such as crawlers. User-defined input should always be validated before it is passed along, regardless of how it will be used.
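On the crawler side, that could mean vetting each discovered link before requesting it. The sketch below is an assumption about how such a check might look, not a description of how any real crawler works. The function names and the handful of patterns are purely illustrative and nowhere near a complete rule set. It decodes the URL's query string and path once (so doubly encoded payloads would slip through) and skips links that contain obvious XSS or SQL injection markers.

```python
import re
from urllib.parse import urlsplit, unquote, unquote_plus

# A deliberately small, illustrative set of patterns. A real filter
# would need far broader coverage and careful tuning.
SUSPICIOUS_PATTERNS = [
    re.compile(r"<\s*script", re.IGNORECASE),                        # inline script tag
    re.compile(r"\bunion\b.+\bselect\b", re.IGNORECASE),             # UNION-based SQLi
    re.compile(r"['\"]\s*(or|and)\s+\d+\s*=\s*\d+", re.IGNORECASE),  # ' OR 1=1
    re.compile(r";\s*(declare|exec)\b", re.IGNORECASE),              # stacked statements
]


def looks_malicious(url):
    """Return True if the decoded path or query matches a known-bad pattern."""
    parts = urlsplit(url)
    # Decode once: '+' means space in query strings but not in paths.
    decoded = unquote_plus(parts.query) + " " + unquote(parts.path)
    return any(pattern.search(decoded) for pattern in SUSPICIOUS_PATTERNS)


def filter_links(links):
    """Keep only the links a crawler should be willing to follow."""
    return [url for url in links if not looks_malicious(url)]
```

A crawler's fetch queue could then be populated from `filter_links(discovered_links)` so that flagged URLs are never requested. Matching on patterns like this will inevitably produce some false positives and false negatives, so it is a starting point and not a guarantee.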

Johnathan Norman
Monday 14, Jun 2010

