
Twiceler Is A Rude, Misbehaving, And Rogue Robot

06/27/08

06:24:00 pm, Categories: SpamScam, Indexers, Tags: cuill, robots, twiceler

Originally published April 23, 2008: Updated and Republished May 18, 2008; Updated and Republished June 27, 2008:

UPDATE 06/27/2008: Twiceler is still behaving: it enters the site at reasonable intervals after reading robots.txt, crawls like a spider rather than an elephant, and has begun leaving helpful notes explaining its crawler's intention and duration.

UPDATE 05/18/2008: Twiceler is better behaved: it enters the site at reasonable intervals after reading robots.txt and leaves for an extended period on encountering its first 403 response.

Cuill, a new Silicon Valley search engine startup, is running the rude, misbehaving, and rogue robot Twiceler.

Twiceler is unregistered, undocumented, ignores robots.txt, and changes its name (the {HTTP_USER_AGENT} variable) in response to regular-expression blocking.

Cuill asserts Twiceler runs from IP address ranges:

  • 38.99.13.121-38.99.13.126
  • 38.99.44.101-38.99.44.106
  • 64.1.215.162-64.1.215.166
  • 208.36.144.6-208.36.144.10
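A minimal sketch of how a log-analysis script might test whether a visitor address falls inside the ranges listed above. The ranges are copied from the list; the function name `in_twiceler_range` is made up for illustration.

```python
import ipaddress

# Ranges copied from the list above (first and last address, inclusive).
TWICELER_RANGES = [
    ("38.99.13.121", "38.99.13.126"),
    ("38.99.44.101", "38.99.44.106"),
    ("64.1.215.162", "64.1.215.166"),
    ("208.36.144.6", "208.36.144.10"),
]

def in_twiceler_range(addr: str) -> bool:
    """Return True if addr falls inside one of the published ranges."""
    ip = ipaddress.IPv4Address(addr)
    return any(
        ipaddress.IPv4Address(lo) <= ip <= ipaddress.IPv4Address(hi)
        for lo, hi in TWICELER_RANGES
    )

print(in_twiceler_range("38.99.13.123"))  # True
print(in_twiceler_range("38.99.13.127"))  # False
```

A request that calls itself Twiceler but comes from outside these ranges may be an impostor, so the address check and the user-agent check are best read together.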

It does not seem like a wise strategy for a startup search engine company (or anyone, for that matter) to aggressively flout the directives of website administrators, particularly when you're running an unregistered and undocumented (rogue) robot.

Some important factors in judging if a bot is SPAM:

  • Is bot registered with robotstxt.org?
  • Is bot well documented and contact information provided?
  • Does bot read robots.txt file?
  • Does bot adhere to robots.txt file directives?
  • Is bot well behaved on site?
  • Does bot use a reasonable crawl/index rate and times?
  • Is bot part of a university or student research project?
  • Does bot add or subtract transparency:opaqueness?
  • Does bot enable or disable, directly or indirectly, censorship:surveillance?
  • Is bot, nation-state; third party nation-state; private; corporate; ngo; personal?
  • Is bot for commercial gain?
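The "does the bot adhere to robots.txt" question can be checked against your own logs. Below is a rough sketch using Python's standard `urllib.robotparser`; the robots.txt content and the bot names are hypothetical examples, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: Twiceler
Disallow: /

User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def violates_robots(user_agent: str, path: str) -> bool:
    """True if a logged request for path by user_agent breaks the rules above."""
    return not parser.can_fetch(user_agent, path)

print(violates_robots("Twiceler-0.9", "/index.html"))   # True
print(violates_robots("SomeOtherBot", "/index.html"))   # False
print(parser.crawl_delay("SomeOtherBot"))                # 10
```

Running each logged request through a check like this gives a simple count of how often a bot ignored the directives, which is more persuasive in a complaint than a general impression.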


6 comments

Comment from: Mo [Visitor] · http://bookams.com
This bot has been hammering my websites; I have denied access in cPanel. GO AWAY CUILL
05/19/08 @ 07:58
Comment from: PeterG22 [Visitor] · http://www.mildewhall.com
This accursed software mess has hit the same page on my web site over 9,000 times in 5 days. I've emailed CUILL several times with no response and have today faxed a strong letter of complaint to their office. I've also sent a copy to the registrars that manage their IP addresses in the vain hope that I can get this feral thing stopped. If I get no satisfaction I intend to invoice them $1 per hit.
05/27/08 @ 06:04
Comment from: admin [Member] Email
Peter,

I like your idea of invoicing (or just suing) them for any hit after their robot/spider/crawler or other automated indexer encounters its first 403, preferably with a redirect to a file containing a message confirming to the offending robot/spider/crawler its trespass [1] on your computer.

If your ISP provides access to mod_rewrite (assuming you do not have direct server access), you might try something like this in .htaccess [2]:
RewriteEngine On
# Let the notice file itself through so the redirect below does not loop
RewriteRule ^YourRudeRobotIsTresspassingMyCompter\.txt - [L]
RewriteCond %{HTTP_USER_AGENT} ^[\(\).]+[Tt][Ww][Ii][Cc][Ee][Ll][Ee][Rr][\-\(\).]+ [OR]
RewriteCond %{HTTP_USER_AGENT} ^[Tt]wiceler
RewriteRule ^.* YourRudeRobotIsTresspassingMyCompter.txt [R=301,L]
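Before deploying rules like these, it can help to see which User-Agent strings the two RewriteCond patterns would actually catch. Here is a small sketch that runs the same patterns through Python's `re` module; the sample User-Agent strings are hypothetical, and Python's regex engine is only assumed to behave like mod_rewrite's for patterns this simple.

```python
import re

# The two RewriteCond patterns from the .htaccess snippet, unchanged.
PATTERNS = [
    re.compile(r"^[\(\).]+[Tt][Ww][Ii][Cc][Ee][Ll][Ee][Rr][\-\(\).]+"),
    re.compile(r"^[Tt]wiceler"),
]

def would_redirect(user_agent: str) -> bool:
    """True if either pattern matches, mirroring the [OR] between the conditions."""
    return any(p.search(user_agent) for p in PATTERNS)

# Hypothetical User-Agent strings, for illustration only.
print(would_redirect("Twiceler-0.9"))                 # True
print(would_redirect("(Twiceler-0.9)"))               # True
print(would_redirect("Mozilla/5.0 (Twiceler-0.9)"))   # False
print(would_redirect("TWICELER"))                     # False
```

Both patterns are anchored with `^`, so a User-Agent that puts the bot name after some other prefix slips past them. If that matters on your site, an unanchored pattern such as `[Tt]wiceler` would be one way to widen the net.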

Eventually we should automate this along the lines of the up-and-coming machine-readable privacy policies. Then browsers and bots can just read the webmaster's policy and, unlike with robots.txt, automatically enforce it.

Of course this needs to be thoroughly discussed so the law of unintended consequences does not inadvertently add unwanted or unnecessary friction to Internet information flows!

We also want to balance stopping the rudeness with encouraging the incredibly innovative minds, such as those behind Twiceler. Who knows, maybe Twiceler can ensure its next-gen indexer incorporates such a machine-readable and enforceable access policy? (Oppressive, repressive, and thug governments, and some "mature" democracies, everywhere are lining up for a centralized machine-readable and enforceable access policy.)

Hope it helps. Stay cool and think technically; I have reason to believe the folks at Cuill are, well, cool people [3]. Some of Toocan’s postings, particularly in controversial areas, have caused horrendous and persistent attacks, ostensibly to change the post (the posts never change as a result of such abuses). After you get the hang of it, it’s actually fun analyzing the attacks, if sometimes irritating and always a waste of time and bandwidth.

-----notes-----

1. If you sue, it's always good to give notice, particularly when using a novel approach. You could even include a counter++ so the message reads "Twiceler has trespassed my computer 9,321 times in 17 seconds".

You could also just deny from [All Twiceler IPs]; the above rewrite is less rude since it just silently and politely returns a text message, which I like to be a big white page of nothing. Also, if browsers happen upon the text page, the modern ones will eventually just get a 304 and load the white page from their cache, saving bandwidth.

2. There are more elegant regexes, but I always use one that helps me remember what I blocked. Believe it or not, you can forget even a Twiceler. Most virtual irritants, like their physical counterparts, quickly go away when ignored.

3. It also helps to remember that some of these people are the kind of people who think society is somehow bettered because they move from California to New York to shave 30 nanoseconds off the time it takes the computer to execute a buy or sell order!
05/27/08 @ 10:55
Comment from: Samuele [Visitor] Email · http://www.ktcn.it
Hi!
For months Twiceler has been breaking down our servers by scanning our sites, and we have spent days trying to understand why we went from 256kb/s to 72Mb/s of bandwidth usage!
Our host has invoiced us 6,000 euros for the bandwidth usage, our clients say "No more hosting with you!", and our business is compromised. And we are a new small company trying to get started, with lots of difficulties.
OK, new companies that want to do research on new search engines are welcome, but why such an abnormal attack? Who is going to pay for the bandwidth usage?
We had to ban their IP addresses last night; our major client doesn't have faith in our work anymore.
06/11/08 @ 05:17
Comment from: webman1000 [Visitor] Email
Reading between the lines, it's quite obvious Twiceler's screwy design wasn't an accident. You have an Irish techie who convinced VCs to dump 33 million in small change into his concept, which was based mainly on their massive index size. But guess what: when the VC auditors came to see the index size for real, Mr. Tom had to fill it with something. So he came up with a spur-of-the-moment idea to run the system dictionary against every site's web directory, and presto, Cuil now has generated millions if not billions of indexes to unique IP addresses. Of course they are all 40(1,2,3) errors, but who cares, since he was selling index size to his VCs, not content. He passes the auditors' test, they open up the bank account, and it's muffins and chocolates forever (or at least until the money runs out and the VCs try to sell a goldmine full of fool's gold).
09/24/08 @ 20:08
Comment from: Tech Magazine [Visitor] Email · http://techmagazine.eu
This bot has been hitting my server like crazy.
12/25/08 @ 07:45
