July 31, 2008
On Monday a new search engine was launched. It's named Cuil but pronounced "cool" (huh?), and it got a lot of really bad reviews, which made me really happy.
So why do I hate Cuil?
Well, a few months ago we started noticing more and more traffic on our company websites, caused by a crawler called Twiceler. Twiceler was run by a company called "Cuil" and claimed to be some kind of experimental search engine robot. A few days later the same crawler also started affecting my personal websites.
The Twiceler bot is probably the most stupid crawler I've ever seen: it just downloads everything it can find, and it seems that it just won't ever stop. If a page uses dynamic input in its URL (a calendar, for example), it will download the same page 100,000 times or more, simply by following every dynamic link it can find without any kind of intelligent limit.
By downloading thousands of pages per hour on each website it can cause an incredible amount of traffic on a server, and dynamic scripts (written in Perl, Python or PHP, for example) start causing an immense CPU load that may even take your entire server down (as reported by several webmasters). Twiceler is really harmful and can cost both money and downtime. A well-written crawler such as Googlebot or Slurp (Yahoo) would never affect a website in such a malicious way.
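To make "intelligent limitation" a bit more concrete, here is a minimal Python sketch of one possible safeguard: a cap on how many query-string variants of the same path a crawler will fetch. This is not how Googlebot, Slurp or Twiceler actually work; the class name `PerPathBudget`, the limit of 20 and the example.com URLs are assumptions made up for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit


class PerPathBudget:
    """Allow only a fixed number of query-string variants per (host, path)."""

    def __init__(self, max_variants: int = 20) -> None:
        # Hypothetical limit; a real crawler would tune this per site.
        self.max_variants = max_variants
        self._seen = defaultdict(set)

    def should_fetch(self, url: str) -> bool:
        parts = urlsplit(url)
        key = (parts.netloc.lower(), parts.path)
        variants = self._seen[key]
        if parts.query in variants:
            return False  # this exact variant was already fetched
        if len(variants) >= self.max_variants:
            return False  # the budget for this path is used up
        variants.add(parts.query)
        return True


if __name__ == "__main__":
    budget = PerPathBudget(max_variants=20)
    # A calendar script that generates an endless supply of links.
    urls = [f"http://example.com/calendar.php?day={d}" for d in range(100000)]
    fetched = sum(budget.should_fetch(u) for u in urls)
    print(f"{fetched} of {len(urls)} calendar URLs would be fetched")
```

With a check like this in front of the download queue, an endless calendar costs the site a couple of dozen requests instead of a hundred thousand.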
After googling for Twiceler we found out that many webmasters had experienced such problems with Cuil. Of course we thought that such a crappy crawler - which doesn't seem to care about similar content, website performance, bandwidth and traffic costs - had to be some kind of malicious spam bot.
As the stupid Cuil/Twiceler bot just won't stop, the first thing you'll do as a webmaster or system administrator is set up a robots.txt file which tells Twiceler not to index any more pages (or at least blocks some of the directories that should not be indexed, such as those containing dynamic scripts).
Cuil claims that their Twiceler crawler respects the robots.txt file, but even days after we set it up nothing had changed. The damn bot continued indexing anything it could get and completely ignored all robots.txt rules (google for Twiceler and you'll see that this is what other webmasters are experiencing too).
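For reference, this is roughly what such robots.txt rules look like, together with a quick way to check what they allow using Python's standard `urllib.robotparser`. It is only a sketch: the paths and the example.com URLs are made up, and passing the check says nothing about whether a crawler actually obeys the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut Twiceler out completely and keep all
# other crawlers away from dynamic scripts. The paths are illustrative.
ROBOTS_TXT = """\
User-agent: Twiceler
Disallow: /

User-agent: *
Disallow: /cgi-bin/
Disallow: /calendar/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    checks = [
        ("Twiceler", "http://example.com/index.html"),
        ("Googlebot", "http://example.com/index.html"),
        ("Googlebot", "http://example.com/calendar/2008/07/"),
    ]
    for agent, url in checks:
        verdict = "allowed" if is_allowed(agent, url) else "disallowed"
        print(f"{agent:10} {url} -> {verdict}")
```

A polite crawler runs the same kind of check before every request. Judging by our logs, Twiceler either didn't, or didn't care about the answer.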
So finally we blocked the Cuil bot entirely on our servers, just as many other people recommend in webmaster forums. On our company servers we blocked all incoming connections that could be identified as coming from a Cuil/Twiceler bot; on my personal websites I blocked all of Cuil's IP addresses using .htaccess files.
It was a funny moment when the Cuil search engine went live on Monday and they claimed to have the world's biggest index. Of course they have! Their damn bot seems to be indexing each dynamic web page a million times, no matter if it's always the same content or if you're clearly saying that this page should not be indexed at all (via robots.txt).
Maybe this also explains the poor quality of their search results - their index may be the largest on this planet, but it's probably full of crap and duplicates.
If you're a webmaster or website owner and you're currently experiencing high bandwidth or traffic problems, then you should check your access_log, because there's a good chance that your problems are caused by Cuil. If this is the case, I can only recommend blocking all of Cuil's IP addresses on your server, because that seems to be the only thing that really works.
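If you'd rather not read the access_log by eye, a small Python script can do the counting. This is a minimal sketch that assumes Apache's combined log format (the one that ends with the quoted referer and user agent) and simply looks for "twiceler" in the user agent; the function name and the regular expression are my own, so adjust them if your log format differs.

```python
import re
import sys
from collections import Counter
from typing import Iterable

# Combined log format:
# ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def count_crawler_hits(lines: Iterable[str], needle: str = "twiceler") -> Counter:
    """Count requests per client IP whose User-Agent contains `needle`."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and needle in match.group("agent").lower():
            hits[match.group("ip")] += 1
    return hits


if __name__ == "__main__":
    # Usage: python count_hits.py /path/to/access_log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for ip, count in count_crawler_hits(log).most_common(20):
            print(f"{count:8d}  {ip}")
```

The IP addresses at the top of the output are the candidates for your firewall or .htaccess block list. Lines that don't fit the assumed format (for example, requests containing escaped quotes) are skipped rather than counted.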
To finish, I'd like to say that I think Cuil should start focusing on the quality of their algorithms and their content instead of relying completely on the marketing of doubtful numbers.
Alexander Higgins said on August 10, 2008:
Malicious indeed... but it gets worse. They don't stop at dynamic links. They start poking around looking for things that do not exist. They took one of my servers down and others have reported having their hosting accounts terminated.
oscar said on August 7, 2008:
I think the stupid behaviour of this robot is fully intentional. When they started promoting the launch of the site, they needed some evidence of the number of indexed pages and wanted to achieve the greatest number possible. They want to go for "quantity" (even if they state that their search engine is not based on quantities but on the relevance and importance of the content).
JohnSmith said on July 31, 2008:
Interesting report. I am quite surprised that Google ever hired the people now at Cuil. Based on your comments, the Cuil folks do not seem to be as talented as the other software engineers at Google.
Rob said on July 31, 2009:
I've had two sites stripped of their data by this robot. Somehow they are able to get into my admin scripts and follow all of the delete links on all of the records. There should be no way to access these pages, as they are protected by server-side code. My only choice is to add a new directory and protect it with .htaccess. I'm curious to see if anyone else has seen this behaviour and if they have been successful in any measures to protect themselves from it.
Duncan said on May 19, 2009:
Just seen this stupid robot trying to index hundreds of nonexistent pages, which have never existed and never will.
I've sent them some feedback and am actively blocking them on the firewall now.
Daniel said on November 7, 2009:
These Cuil guys somehow managed to own one of my users' accounts. Then their crawler started sending a GET request every 2 or 3 minutes to URLs that deleted my vulnerable user data (and should only be accessible to the logged-in user).
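I block them all by using this hard but effective .htaccess:
# Evaluate the allow rules first, then the deny rules, so the denies below win
order allow,deny
allow from all
# Partial IP addresses (written without "*") match the whole range
deny from 216.129.119
deny from 38.99.13
deny from 67.218.116
deny from crawl-3c.cuil.com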
And obviously contact a lawyer, since I have all the proof I need to establish that their crappy crawlers violated my security layer and started deleting my user info.
What a pathetic search engine ...
jimmy said on December 14, 2009:
I agree completely with all of you guys.
This is the worst crawler I have ever seen. I don't understand why this one is crawling my website 24x7...
It is really costing me a lot.
Now I have deliberately blocked all IPs from Cuil.
paul said on January 26, 2010:
Just found out why my website has been down for 5 days - it appears Cuil crawlers have been responsible, according to my web designer. What a crap search engine.
brad said on January 31, 2010:
Agreed, it's either malicious or retarded. Aaah, search bots these days! No respect, I tells ya!
I thought Cuil was defeated and sent off into no man's land to atone for their sins against our great Search-Father, Google. Why are they STILL stealing my CPU cycles?
Thanks for the IP advice Daniel, much appreciated, and into the firewall they go!
Koos said on March 15, 2010:
Their bot is in fact one of the worst things on the net. I've noticed they fetch my robots.txt up to 3 (three!) times a day, but completely ignore its content. Yesterday I saw three requests for robots.txt, at 15:38:05, 15:46:21 and 15:58:05. Three times within 20 minutes. WTF? There were three chances to notice "Disallow: /admin/", but it didn't. Of course, this directory is password protected and they do get a 401 error. However, that's not stopping them. I assume their bot is a 2 line shell script. Oh dear...
It reminds me of MSN; their bot obviously ignores robots.txt, too. Both services are blocked now.
netorignator said on April 25, 2010:
Asking them to go away via robots.txt will not work; that is only for reputable companies that follow the rules. You have to close the door to them via .htaccess.
They still try to get to my sites several times a day.