Technology, business and change
80legs and democratizing web crawling: It’s game-changing

I read a post about 80Legs today on Mashable and the only words that came to my mind when I read the article were “they’re changing the game”. They are. Here’s what they do:
80Legs is a service platform for web crawling and processing web content. We put over 50,000 computers to work for you to deliver exceptional crawling performance at incredibly low costs. Our service is easy to use and completely customizable, so you can crawl and process web content however you want, whenever you want.
To summarize: you get your very own web crawler at a more than affordable price: $2.00 per million pages crawled.
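To put that price in perspective, here is a quick back-of-the-envelope sketch. It assumes the quoted rate is the only cost (no minimum fee, no extra charge for processing), which I haven't verified, and the function name is mine.

```python
PRICE_PER_MILLION_PAGES = 2.00  # USD, the rate quoted above


def crawl_cost(pages: int) -> float:
    """Return the estimated cost in USD of crawling `pages` pages."""
    return pages / 1_000_000 * PRICE_PER_MILLION_PAGES


# Print the estimate for a few crawl sizes (sizes chosen for illustration)
for pages in (100_000, 1_000_000, 50_000_000):
    print(f"{pages:>12,} pages -> ${crawl_cost(pages):,.2f}")
```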
Why it’s game changing
Web crawling is a complex and expensive process: crawling a web page, extracting content and scaling the process to millions of pages is far from trivial or cheap. You not only need to build your crawler or use an existing one, you also need the infrastructure in place to support it all.
Now, anybody willing to build a niche search engine, extract specific information from a series of websites or build any web app involving crawling the Web at a large scale can do it at an affordable cost, without much technical knowledge.
It’s not 80Legs that’s game-changing, it’s the whole concept of democratizing web crawling and giving a chance to anybody with a great idea but without the money or the technical skills, to just make it happen.
It’s similar to blogging: it’s not about the platform you use (WordPress, blogger, TypePad, etc), it’s the fact that anybody can now have a voice and have an impact. Platforms die, ideas don’t.
It’s dead simple
I don’t want to talk too much about 80Legs because I haven’t really tested it (except creating an account and playing around) and I don’t like talking about stuff I haven’t tested, but it really looks dead simple. Simple as the process looks, it still seems to offer some decent advanced features.

I won’t go too much into details, but it allows a lot of customization so you can extract the information you want. Of course, you will need some technical knowledge if you want to use the crawled data and build a web app around it, but the hard part involving scaling issues is covered by a tool like 80Legs.
Spam is now more affordable than ever
As enthusiastic as I am about democratizing web crawling, there’s a huge downside: spam and content scraping are now more affordable than ever. It’s an easy way for amateur spammers to build email lists by crawling websites and extracting email addresses. I’m sure we could find dozens of other spamming issues, but I much prefer to focus on the positive aspects.
Overall, the concept is extremely interesting and anybody willing to build a web app involving web crawling should be looking into this.
10 important considerations when choosing the right CMS
Choosing the right CMS is a complex and long process, mainly because of the hundreds of CMS out there that promise they’ll change the way you do business online and that they’ll make your life so much easier.
Fair enough, but the vast majority of CMS don’t even deserve your attention, it’s as simple as that. I hear you: how can all these CMS survive if they don’t deserve my attention? Well, that’s quite simple: not knowing the important considerations when choosing a CMS leads to adopting a CMS that doesn’t fit your needs, and then you have to stick with it because the cost of changing platforms is too high.
On top of that, we have to face people obsessed with a single CMS like WordPress. This is dangerous because different CMS are required for different kinds of websites. Take WordPress for example: it’s great for building blogs, but for building a multilingual corporate website, it’s definitely not the right tool. If you are obsessed with WordPress, I know what you’re thinking at this very moment: “Yes, but what about this or that plugin to give you this or that functionality”. I say native support is way better than most plugins.
So, what are the important considerations when choosing a CMS?
1. What you need and what you’ll need
The first thing you should do is write down the feature set and functionalities you need right now and what you’ll need. What you’ll need is the game, not what you need.

What you’ll need is definitely the most important aspect in choosing a CMS because a web project is always in development. It’s important to think about the next phases of your project, as you don’t want to invest a lot of money later in bolting on features you knew right from the start you were going to need.
2. Multilingual support
If you need to support multiple languages, it’s important to consider it from the start. If multilingual support is really important to your business, you want a CMS that supports it natively, not by installing some third party plugin. CMS that support multiple languages natively are usually way better than any third party plugin you’ll install.
It’s important to be able to translate the content you write easily, but it should also be easy to translate the “static” part of the website like menus without having to touch the core files of the CMS.
The same goes with the backend of the CMS. Some CMS will offer a multilingual backend and it could turn out to be very important for your business if the people in charge of your website aren’t English native speakers.
3. SEO optimization

We should never have to rely on search traffic, but search engines are there and bring traffic, so it would be stupid to ignore them. Even though the vast majority of CMS are getting better and better at this, you want a CMS that will let you:
- Rewrite URLs to get search engine friendly URLs
- Take full control over the titles of the pages or articles
- Take full control over the keywords and description meta tags (even though they won’t affect your ranking, you want control over these tags)
- Generate code that is as XHTML compliant as possible
- Generate a sitemap automatically
A CMS that covers these aspects will at least ensure that your website is search engine friendly and that it follows the basic SEO guidelines.
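To make two of these points concrete (friendly URLs and automatic sitemap generation), here is a minimal Python sketch of what a CMS does behind the scenes. This isn't how any particular CMS implements it: the function names `slugify` and `build_sitemap` and the example.com URL are made up for illustration, and a real sitemap would usually carry more fields than `<loc>`.

```python
import re
import unicodedata
from xml.etree import ElementTree as ET


def slugify(title: str) -> str:
    """Turn a page title into a lowercase, hyphen-separated URL slug."""
    # Decompose accented characters and drop the accents ("é" becomes "e")
    ascii_title = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Replace every run of non-alphanumeric characters with a single hyphen
    return re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")


def build_sitemap(base_url: str, titles: list[str]) -> str:
    """Return a minimal sitemap XML string with one <url> entry per title."""
    urlset = ET.Element(
        "urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    )
    for title in titles:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = f"{base_url}/{slugify(title)}"
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical site and page titles, for illustration only
print(build_sitemap("https://example.com", ["Café Menu!", "About Us"]))
```

The point is that the CMS should do this for you every time a page is created, so nobody has to maintain URLs or the sitemap by hand.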
4. User friendliness

Depending on who is going to use the CMS to create the content, this might or might not be an issue. If a tech-savvy user or simply yourself is going to take care of your website, then user friendliness may not be a prerequisite. On the other hand, if you’re developing the website for non-tech-savvy users, a user friendly CMS is extremely important. You want to consider:
- WYSIWYG editor to easily create content
- Easy enough to create new pages
- Clear site structure in the backend
- Clear separation between the features most users will need and advanced features
5. Users, roles and permissions
Chances are that several people will have to edit the content of the website. The last thing you want is to authorize all users to modify every aspect of your website. For this specific reason, assigning roles and permissions to the different users is important. You want control over who is going to be able to edit what.
Most CMS offer this possibility so it shouldn’t be that much of a problem, but certain CMS offer a much deeper level of control over what users are allowed to do. In the end, it all depends on what you need to control.
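If you've never looked at how roles and permissions work, the idea boils down to something like the sketch below. The role names and permission names are invented for the example; every CMS has its own set and its own (usually much finer) model.

```python
# Hypothetical roles and permissions, for illustration only
ROLE_PERMISSIONS = {
    "author": {"edit_own_page"},
    "editor": {"edit_own_page", "edit_any_page", "publish"},
    "admin": {
        "edit_own_page",
        "edit_any_page",
        "publish",
        "edit_templates",
        "manage_users",
    },
}


def can(role: str, permission: str) -> bool:
    """Return True if the given role includes the given permission."""
    # Unknown roles get an empty permission set, so they can do nothing
    return permission in ROLE_PERMISSIONS.get(role, set())


print(can("author", "publish"))  # an author cannot publish in this setup
print(can("editor", "publish"))  # an editor can
```

The "deeper level of control" some CMS offer is basically a longer list of permissions and the ability to define your own roles.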
6. Versioning

This feature is by far one of the most important features to consider for your future CMS. Versioning will allow you to revert to a previous version of a page or article and will spare your web team some huge headaches. Everybody makes mistakes and you know someone is going to screw up one of your pages at some point, so why not consider this in your CMS choice?
Again, some CMS will allow you to go deeper and restore not only pages and articles, but also other important aspects of your website like CSS.
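Here is a tiny sketch of the idea, assuming the simplest possible model: every save is kept, and a revert just saves an old revision again so that nothing is ever lost. The class and method names are mine, not taken from any CMS.

```python
class VersionedPage:
    """A page that keeps every saved revision so mistakes can be undone."""

    def __init__(self, content: str) -> None:
        self._revisions = [content]  # revision 0 is the original content

    @property
    def content(self) -> str:
        """The current content is always the newest revision."""
        return self._revisions[-1]

    def save(self, content: str) -> None:
        """Store the new content as a new revision."""
        self._revisions.append(content)

    def revert_to(self, revision: int) -> None:
        """Restore an earlier revision by saving a copy of it as the newest one."""
        self._revisions.append(self._revisions[revision])


page = VersionedPage("Welcome to our website")
page.save("Welcmoe to uor webiste")  # someone screws up the page
page.revert_to(0)                    # and we go back to the good version
print(page.content)
```

Note that the bad revision stays in the history, which is handy when you want to know who broke what.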
7. Multiple website support
Nobody wants to install three CMS because three different websites live under the same root domain. Unfortunately, with certain CMS, you don’t have a choice and have to install multiple copies and maintain these copies separately. In fact, the painful part is not the installation, it’s the support afterwards and the upgrades. Multiple copies of the same CMS means you also have to replicate users, roles and permissions across all your different installations. You don’t want that.
If you know you will be building different websites with different functionality under the same root domain (.com), then you need to consider a CMS that will allow you to do this with a single installation.
8. Painless upgrades
Running the latest version of a CMS is important as upgrades often include important security patches. Unfortunately, a lot of CMS make the upgrade process so painful that most people and businesses simply decide not to upgrade. I understand.

The solution really isn’t to skip upgrades, the solution is to choose a CMS that makes upgrades painless. Certain CMS, WordPress for instance, will let you upgrade your installation with the click of a button. It is a bit more complex for some other CMS, but really what you’re looking for is an upgrade process that won’t break your website for days every time.
9. Open source and community support
Going with an open source solution versus a commercial solution is important to consider. A lot of Web businesses will try to sell their in-house commercial CMS and it’s not necessarily a bad thing, but keep in mind that it will probably cost more money the more you need specific features.
Going with an open source solution might cost you just as much money to customize the solution to your needs, but it will probably be cheaper in the long run as you will be able to benefit from third party plugins and modules. Open source solutions will also allow you to get free support from the community instead of having to pay $125 an hour or more for a consultant.
Both options are good, it simply depends on what your needs and budget are. If you want more freedom over what you’ll be able to do by yourself, an open source solution might be a better choice. Keep in mind that open source solutions will also allow you to test before “buying”, which might be impossible for commercial solutions.
10. Plugins and modules

There is no perfect CMS with all the features you need and will need. But that’s not a problem, what’s important is that you can easily develop or install modules for the features you need. Not only is it important to be able to install and develop modules, it’s also important to have a look at the existing database of plugins and modules available for your future CMS.
A mature CMS will have tons of modules and plugins already tested by other users and that can be a huge advantage over a new player in the CMS industry.
So that’s one more thing to make your decision even harder: if the CMS doesn’t have all the features you need, are there some great plugins available that you could use? Then again, this is assuming you are using an open source solution as the choice of plugins will probably be smaller for commercial solutions. On top of that, an open source solution will allow you to test before “buying”.
Web 2.0 and Web 3.0 : More than just timestamping the Web
I don’t like the terms Web 2.0 and Web 3.0 for a number of different reasons. First of all, while Web 2.0 is a widely used term, you would be surprised by the number of different answers you would receive by asking 15 different people: “What does Web 2.0 mean?” or “What is Web 2.0?”. This simply means that everybody has a different perception of what Web 2.0 represents and this is why I don’t like the term: no one really agrees on a definition. Of course we can go with Wikipedia’s definition, which I personally like, but you can be sure that not everyone agrees.
“Web 2.0” refers to a perceived second generation of web development and design, that facilitates communication, secure information sharing, interoperability, and collaboration on the World Wide Web.
Then, I’ve always seen the Web as a platform evolving every minute to become smarter and more useful to the users. The way we use the Web changes every day and that’s why the term Web 2.0 was only a buzzword I avoided using until recently. I refused to timestamp the Web. (more…)
IntenseDebate Review – Let’s comment
I recently tested IntenseDebate on my other blog because I have a crazy post with over 900 comments that needed a decent comment system. I had the choice of course to go with WP 2.7 threaded comments or with a third party plugin, but honestly I didn’t want to spend a day tweaking the CSS and all that to make it work properly. Also, I wanted to introduce “reputation”, so that people could rate comments. On top of that, I had the problem that I couldn’t possibly fit 900 comments on one single page, they had to be split across different pages. All of that, my friends, is IntenseDebate. (more…)
Google for webmasters by Google
Today, my friend Rarst, whom I am going to review on Monday because he won my quick contest the other day, wrote an interesting tweet. He basically recommended the following presentation by Google themselves:
The presenters are a little boring, but honestly it covers A LOT of things and also answers A LOT of the most common questions. It covers topics such as PageRank, duplicate content, ranking, Google Webmaster tools, etc.
Have a look, it’s worth it. The presentation is divided in topics so you can just have a look at the topics you’re interested in.
Google loves fast hosts
I talked about the Google bot recently and how the bastard killed one of my websites even on a decent reseller account. Seriously, I love the Google bot, it helps me get indexed.
What I want to talk about today is how a new host positively impacted my website and how it can positively impact your website.
Resource hungry
I talked about it: the reason I bought a reseller account is that one of my websites is terribly resource hungry. Before I got on the new host, a page could easily take 4-5 seconds to load and I thought this was a pain. In fact, I’m pretty sure I lost some traffic because of that: people would just go away! The main bottleneck was MySQL performance, which is SO much faster on the new host: it now only takes 1-2 seconds to load a page. To my great surprise, not only are my visitors happier, but the Google bot seems to like me a little more!
Google loves it
Let’s have a look at these two charts from Google Webmaster tools:
- Number of pages crawled per day
- Average time to download a page
The red arrow marks the point where I switched hosts. See how the average time spent downloading a page dropped significantly AND the number of pages crawled per day increased significantly! I mean, Google spends less time downloading a single page, so it uses the same total time available to download more pages! Isn’t it great?
How does that help your website?
This will help your website because even if Google spends the same total time on your website, it actually does a lot more during that time. Not only will your new pages get indexed faster, but your other pages will get updated more often.
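The reasoning is simple arithmetic, sketched below. It assumes Google gives the site a fixed amount of crawl time per day, which is my reading of the charts and not something Google documents this way; the one hour budget and the download times are made-up numbers, not the values from my charts.

```python
def pages_per_day(crawl_seconds_per_day: float, seconds_per_page: float) -> float:
    """Pages the crawler can fetch in a fixed daily time budget."""
    return crawl_seconds_per_day / seconds_per_page


# Hypothetical numbers: one hour of crawl time per day
BUDGET = 3600.0

before = pages_per_day(BUDGET, 2.0)  # slow host: 2 seconds per page
after = pages_per_day(BUDGET, 0.5)   # fast host: half a second per page

print(f"before: {before:.0f} pages/day, after: {after:.0f} pages/day")
```

Same budget, four times faster downloads, four times more pages crawled.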
Some stats?
The site went from 500 uniques a day to 1000 uniques a day in a single week, and it’s been increasing a little every day since then. That’s what I call a good result.
Do I have to go with a better host?
It all depends on the type of website you run. If you run a blog with not much traffic and you don’t update very often, that probably won’t make a difference. On the other hand, if you feel your website is really slow to load, you are getting some decent traffic and you update quite often, then I’d say go for it!
Be aware that reseller hosting costs something around $25 a month, so if your website doesn’t make $25 a month, don’t do the upgrade!
Google killed it!
What a day. I’m running a website with one of my friends and the project is going very, very well. Within three months, we managed to get a steady 500+ uniques every day and it’s going up every day. We almost reached 1,000 uniques today and we are quite happy about the results we are getting for not too much work! We recently (2 days ago) switched to a HostGator reseller account at $25 a month for this project because the database is so big (100,000+ entries) that performance and server load are now an issue. So we switched without any major issues and the awesome tech support team at HostGator helped us resolve some minor problems, but today, it went completely crazy!
Google Crawl rate
If you run a website that is updated very often, you will notice a new option in Google Webmaster tools: you will be offered the chance to speed up the Google Bot crawl rate for your site! This is the exact option:
We’ve detected that Googlebot is limiting the rate at which it crawls pages on your site to ensure it doesn’t use too much of your server’s resources. If your server can handle additional Googlebot traffic, we recommend that you choose Faster below.
A faster crawl will enable us to crawl your site quickly, but may put more load on your server.
How tempting is that? Being indexed faster, the dream of every webmaster! With that fresh new reseller account I decided to turn the faster crawl rate on.
What a bad idea
What a bad idea that was to turn the faster crawl rate on. Note that the website we run is very very heavy on resources, so what happened to my website may not happen to your website if you enable the option. It took a couple of hours before the Google Bot decided to crawl my site at top speed, but when it did, boom! No more website!
I checked my statcounter account around 5pm to notice no new visitors came to the website within the last 30 minutes. That’s really unusual when you get 500-1000 uniques a day, so I typed my domain name to see if there was anything wrong. The result:
500 – Internal server error
The evil 500 internal error! The error that tells you something bad has happened, but doesn’t tell you what it is, and there’s no way to find out! So I emailed HostGator and received an answer within a couple of minutes. The problem was that all 25 allowed processes were used, so no more requests could come in. The rep killed the processes and guess what? 5 minutes later, same thing! Eventually, it went back up and I made sure to select the normal crawl rate again.
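To see why a faster crawl can eat all 25 processes, here is a rough sketch. It assumes each request keeps one process busy for as long as the page takes to generate, so the number of busy processes is roughly the request rate times the seconds per request. The limit of 25 is the one the rep mentioned; the request rates and the 5 seconds per page are made-up numbers, not measurements.

```python
MAX_PROCESSES = 25  # the limit mentioned by the hosting rep


def busy_processes(requests_per_second: float, seconds_per_request: float) -> float:
    """Average number of processes in use at a given request rate."""
    return requests_per_second * seconds_per_request


def max_sustainable_rate(
    seconds_per_request: float, max_processes: int = MAX_PROCESSES
) -> float:
    """Highest request rate the process limit can absorb."""
    return max_processes / seconds_per_request


# Hypothetical: a heavy page that takes 5 seconds to generate
print(busy_processes(4, 5))     # normal traffic: under the limit
print(busy_processes(6, 5))     # visitors plus a fast crawl: over the limit
print(max_sustainable_rate(5))  # requests per second before things break
```

With heavy pages, it only takes a couple of extra requests per second from the bot to go over the limit, and that's when the 500 errors start.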
Be careful
So be careful if you check that option. Make sure your server can handle the evil Google bot!
Impact of Google’s mistake with Chrome
You guys are all aware of the mistake Google made with their license agreement for Chrome. Everybody blogged about it and I’m no exception: I also wrote an article about it last week. So, it was a big mistake and kind of a stupid one for a big company such as Google. Some might argue it was done on purpose just for the thing to go viral, but I’m not sure Google would do such a thing. We all agree that it was really dumb of the legal team to just copy and paste from the traditional license agreement template, but everybody still downloaded Chrome without asking too many questions, so I guess it didn’t turn out to be a big mistake for end users in the end. The real problem is with corporations, and it’s a mistake that will take time to fix.
Google Chrome banned
I work for a fairly big consulting/software company (7,000+ employees) and we are strictly forbidden to download and install Google Chrome on our computers, to protect the company’s intellectual property. I know, Google isn’t claiming rights to what you do with Chrome, they fixed the EULA, so why ban the browser? Well, the day Chrome was released, you can imagine that in a software consulting company everybody went totally mad and downloaded the new browser from Google just to test it. Somebody noticed the quite disturbing EULA mentioning Google was getting the rights to almost anything done with the browser and forwarded this to the legal department. Of course, it’s a big problem for a company when you transfer confidential and copyrighted material over a browser whose license automatically claims the copyright. The legal department answered within a couple of minutes and of course they advised not to install the browser to protect the company’s intellectual property. We then received a confirmation from higher management not to install Google Chrome.
Businesses are important
You see how quick companies are on the trigger. Even though Google changed the license agreement, we didn’t receive anything mentioning it was OK to install Google Chrome from now on. The company simply doesn’t care: the browser was a threat, that threat is eliminated, now let’s move on. It will take some time before things get fixed and we are allowed to download the browser. Now, why is it such a big mistake? Where Microsoft succeeded and where Firefox failed is in the business market. Almost every business uses Internet Explorer as the standard and every intranet/company portal/web application within these businesses has to be compatible with Internet Explorer for that reason. Firefox is extremely popular with end users/computer geeks, but failed to establish itself as a business browser and this is a problem. For a browser to completely dominate the market, it has to be popular with end users and also with companies. A lot of users use Internet Explorer at home because that’s what they use at work, it’s as simple as that. This is one thing I noticed: I like to test new software and download new stuff, but for most users it’s a pain!
So that’s it, I’m pretty sure Google Chrome is forbidden in a lot of businesses because of that first day license agreement. This is kind of bad and will take some time to fix. Google Chrome had an OK start with techies, but already has a bad reputation within businesses and that might turn out to be a problem in the future.
Browsershots – Test your web design for different browsers
First of all, I know it’s a shame, I last posted 4 days ago. That being said, I want to let you know about a tool I use a lot when designing websites. That little tool is called Browsershots and what it does is take screenshots of your website in a lot of different browsers. You know how hard it can be to test your design on IE6, IE7, Firefox 2, Firefox 3, Opera, Konqueror, etc. This will make your life so much easier and you won’t have to run 30 different browsers on 3 different machines or call your friends to ask them how your website looks on their machine!
How it works
First, head to the Browsershots website, obviously.
Then, it’s pretty straightforward: you enter the URL you want to test and the OS/browsers you’d like to test with.
Then, you just have to wait for the system to take the screenshots. The time to take them can vary, but it’s usually pretty fast. I selected a dozen browsers for seohorror.com and the estimated time was between 3 and 12 minutes.
Refresh, see the results and back to work
You should get something like this in the end:
You can click on the screenshots to see full-size images and it is a great way to identify problems with other browsers.
Perfection isn’t possible
You will usually notice some glitches with some browsers, but don’t try to be compatible with every browser out there. I usually concentrate on being compatible with IE 6.0 & 7.0, Firefox 2.0 & 3.0, and Safari 2.0 & 3.0. These are the main browsers and it’s important to be compatible with all of them. You could also analyse your traffic and see which browsers most of your users use; this might help you identify the important browsers to be compatible with.
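If your stats package doesn't give you a browser breakdown, you can get a rough one from the user agent strings in your server logs. Here is a minimal sketch: the matching is deliberately crude (a simple substring check, with Opera tested first because it can announce itself as MSIE), the function names are mine, and the three user agent strings are just illustrative samples.

```python
from collections import Counter

# Checked in order: Opera before MSIE, Chrome before Safari,
# because the later tokens also appear in the earlier browsers' strings
BROWSER_TOKENS = [
    ("Opera", "Opera"),
    ("MSIE", "Internet Explorer"),
    ("Firefox", "Firefox"),
    ("Chrome", "Chrome"),
    ("Safari", "Safari"),
]


def browser_family(user_agent: str) -> str:
    """Guess the browser family from a raw user agent string."""
    for token, family in BROWSER_TOKENS:
        if token in user_agent:
            return family
    return "Other"


def browser_share(user_agents: list[str]) -> dict[str, float]:
    """Return each browser family's share of the visits, most common first."""
    counts = Counter(browser_family(ua) for ua in user_agents)
    total = sum(counts.values())
    return {family: count / total for family, count in counts.most_common()}


# Illustrative sample; in practice these come from your access log
sample = [
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9) Gecko/2008052906 Firefox/3.0",
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_4; en-us) AppleWebKit/525.18 (KHTML, like Gecko) Version/3.1.2 Safari/525.20.1",
]
for family, share in browser_share(sample).items():
    print(f"{family}: {share:.0%}")
```

Any browser that shows up with a decent share in your own numbers deserves a spot in your test list.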
bbPress : A forum for your wordpress blog
I recently integrated a forum into one of my WordPress blogs and it really is a great thing to add if you have a “hot topic” people are talking about a lot. I will talk about a forum platform that integrates very well with WordPress for a good reason: it was developed by WordPress developers!
Tell me!
I’m actually surprised no one really talked about it because it deserves the spotlight. The web application I’m talking about is called bbPress. bbPress can be installed without a WordPress blog, but for that we already have some good forum platforms like vBulletin and phpBB. The really cool thing about bbPress is that it integrates with your WordPress blog’s database. You can share the same users and login so you don’t really have to manage two different things, it’s really an extension of your blog. It supports what made WordPress a success: templates, an easy administration interface and plugins.
Full of features?
bbPress doesn’t have the look of traditional forums running phpBB or vBulletin, it looks a little bit lighter, but that doesn’t mean fewer features. Users can have their own avatar or Gravatar, you can have users with different levels of security, they can also add signatures, smilies and all that good stuff you would expect from a forum. If it doesn’t have the feature you want, there’s probably a plugin doing it. It’s really like WordPress. When you look at WordPress without any plugins it’s a good platform, but nothing that impressive. It’s really when you start customizing your template and adding plugins that you understand its power. If you want to have a look at what a forum might look like, you can have a look here:
http://bbpress.org/about/examples/
In the examples you will find the wordpress.org support forums and also the technorati.com support forums. These are two pretty big and important forums so even if bbPress looks light at first sight, it’s packed with features and can scale to a very large forum very easily.
Here’s the official features list from bbPress.org:
- Fast and light: We keep our code lean so that you get the best experience possible.
- Simple interface: One of our biggest goals is to keep things simple and make things intuitive. Our dream is that you forget you’re even using the software.
- Customizable templates: Not everybody likes the same pair of pants, so we allow you to dress up your forums however you like.
- Highly extensible: bbPress can’t toast your bagels, but a plugin for it sure could!
- Spam protection: A bundled Akismet plugin offers you an amazing weapon against spam.
- RSS Feeds: You want feeds? We get ‘em; they’re everywhere.
- Easy integration with your blog: WordPress and bbPress are siblings, and they get along together a lot better than you and your brother did when you were kids!
So there you go, you can now integrate a nice forum into your wordpress blog without too much pain!





