Originally published April 23, 2008; updated and republished May 18, 2008; updated and republished June 27, 2008.
UPDATE 06/27/2008: Twiceler is still behaving: entering the site at reasonable intervals, reading robots.txt, crawling like a spider rather than an elephant, and now leaving helpful notes that explain the purpose and duration of its crawls:
- crawl-8.cuill.com 64.1.215.164 27/Jun/2008 00:24:37 200 GET / HTTP/1.0 - Mozilla/5.0 (Twiceler-0.9 http://www.cuill.com/twiceler/robot.html) One-time, weeklong image crawl
- crawl-9.cuill.com 64.1.215.165 27/Jun/2008 00:05:44 200 GET / HTTP/1.0 - Mozilla/5.0 (Twiceler-0.9 http://www.cuill.com/twiceler/robot.html) One-time, weeklong image crawl
UPDATE 05/18/2008: Twiceler is better behaved, entering the site at reasonable intervals by reading robots.txt and leaving for an extended period after the first 403 response:
- 64.1.215.165 - - [18/May/2008:13:21:47 -0700] "GET /robots.txt HTTP/1.0" 403 1041 "-" "Mozilla/5.0 (Twiceler-0.9 http://www.cuill.com/twiceler/robot.html)"
- 64.1.215.164 - - [18/May/2008:15:14:22 -0700] "GET /robots.txt HTTP/1.0" 403 1041 "-" "Mozilla/5.0 (Twiceler-0.9 http://www.cuill.com/twiceler/robot.html)"
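Entries like the two above follow Apache's combined log format, so the interval between visits can be checked mechanically rather than by eye. The following is a minimal Python sketch, not the tooling used for this post; the names `COMBINED`, `parse_line`, and `robots_fetches` are made up for illustration, and the pattern is only meant to cover lines shaped like these.

```python
import re
from datetime import datetime

# Pattern for Apache's "combined" log format (illustrative, not exhaustive).
COMBINED = re.compile(
    r'(?P<ip>\d{1,3}(?:\.\d{1,3}){3}) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = COMBINED.search(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["status"] = int(rec["status"])
    # Timestamps look like 18/May/2008:13:21:47 -0700
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    return rec


def robots_fetches(lines, agent_substring="Twiceler"):
    """Parsed records for /robots.txt requests whose agent contains the substring."""
    recs = (parse_line(line) for line in lines)
    return [
        r for r in recs
        if r and r["path"] == "/robots.txt" and agent_substring in r["agent"]
    ]


# The two log entries quoted above.
LINES = [
    '64.1.215.165 - - [18/May/2008:13:21:47 -0700] "GET /robots.txt HTTP/1.0" '
    '403 1041 "-" "Mozilla/5.0 (Twiceler-0.9 http://www.cuill.com/twiceler/robot.html)"',
    '64.1.215.164 - - [18/May/2008:15:14:22 -0700] "GET /robots.txt HTTP/1.0" '
    '403 1041 "-" "Mozilla/5.0 (Twiceler-0.9 http://www.cuill.com/twiceler/robot.html)"',
]

if __name__ == "__main__":
    for r in robots_fetches(LINES):
        print(r["ip"], r["time"].isoformat(), r["status"])
```

Subtracting the `time` fields of consecutive records gives the gap between visits, which is one way to judge whether the intervals are reasonable.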
Cuill, a new Silicon Valley search engine startup, is running a rude, misbehaving, rogue robot named Twiceler.
Twiceler is unregistered and undocumented, ignores robots.txt, and changes the name it sends in {HTTP_USER_AGENT} when a regular expression blocks it.
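Because the robot changes its user-agent string once a pattern blocks it, a rule keyed on that string alone can be sidestepped. One way to spot this in the logs is to group requests by IP address and look for addresses that sent both a blocked and an unblocked agent. This is only a sketch: `BLOCK_RULE`, `agents_by_ip`, `evaders`, the sample addresses (taken from the 192.0.2.0/24 documentation range), and the second agent string are all invented for illustration and do not come from this site's logs.

```python
import re
from collections import defaultdict

# Hypothetical block rule of the kind a site might apply to HTTP_USER_AGENT.
BLOCK_RULE = re.compile(r"twiceler", re.IGNORECASE)


def agents_by_ip(records):
    """Map each IP address to the set of user-agent strings it has sent."""
    seen = defaultdict(set)
    for ip, agent in records:
        seen[ip].add(agent)
    return seen


def evaders(records, rule=BLOCK_RULE):
    """IPs that sent at least one agent matching the rule and one that does not."""
    flagged = []
    for ip, agents in agents_by_ip(records).items():
        matched = any(rule.search(a) for a in agents)
        unmatched = any(not rule.search(a) for a in agents)
        if matched and unmatched:
            flagged.append(ip)
    return sorted(flagged)


# Invented sample data: (ip, user agent) pairs. Only the first agent string
# is copied from the log excerpts; everything else is made up.
SAMPLE = [
    ("192.0.2.10", "Mozilla/5.0 (Twiceler-0.9 http://www.cuill.com/twiceler/robot.html)"),
    ("192.0.2.10", "Mozilla/5.0 (compatible; ExampleBot)"),
    ("192.0.2.20", "Mozilla/5.0 (compatible; ExampleBot)"),
]

if __name__ == "__main__":
    print(evaders(SAMPLE))
```

An address flagged this way is a candidate for blocking by IP instead of by name, since the name is the part the robot controls.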
Cuill asserts Twiceler runs from IP address ranges:
It does not seem like a wise strategy for a startup search engine company (or anyone, for that matter) to aggressively flout the directives of website administrators, particularly when you're running an unregistered and undocumented (rogue) robot.
Some important factors in judging whether a bot is spam:
Res: