Geek Thoughts: Search wars - which is the best?

Friday, 12 November 2004

Search wars - which is the best?

Microsoft just released the beta version of their new MSN Search engine, in an attempt to compete with the likes of Google, Yahoo or Ask Jeeves. As a result, the BBC decided to do a comparative test. The Register did a test run as well. What comes out of both is no big surprise: Microsoft offers an innovative user interface for setting advanced search parameters but falls short on the actual execution of the query, returning results that are less relevant than its competitors'. The interesting aspect of this is that the funky UI provided by MSN is just that: a fancy UI that makes it easy to build complex searches by adding special keywords to the search box. In practice, it should be quite easy for Google to learn from this and make their own advanced search feature more user friendly. But the best piece of advice is given by Tom Geoghegan in his BBC article: "All three [search engines] could take a leaf out of the butler's book. Ask Jeeves gave a great classification of raleigh into its different definitions." At the end of the day, people now expect a search engine to return lots of hits on fairly generic subjects, and what most of them want is to be able to refine that original search easily.

Another interesting point, made by The Register, is that most search engines, when given a search string like "John Leyden"+"blaster worm", will return a lot of information about John Leyden but very few articles written by him, which is probably what the user is interested in. This is probably due to the fact that web pages are written in HTML, which is essentially a page layout language. An HTML page does not give any information as to what the content means. It holds a lot of characters and presentation elements that mean something when read by a human being (or not, as the case may be), but it holds no information as to whether this collection of characters and presentation elements resolves into an article about John Leyden or an article written by John Leyden. In practice, the article about him probably contains the name John Leyden several times, whereas the article written by him contains his name once or twice at most. As a result, the article about him will be considered more relevant by a search engine but less so by a human being who understands the content of the pages he/she reads. The only way this could ever change would be if web sites started separating content from presentation explicitly, through the use of technologies like XML and XSLT, and search engines took advantage of it. If we were able to reliably perform XSLT rendering in the browser, you could build web sites that, for a given page, provide meaningful XML content and a link to a rendering stylesheet. A browser would apply the stylesheet to render the page on screen, while a search engine would discard the stylesheet and only take the XML content into account, thus being able to (at least partially) understand the content and present more relevant results to the user. Maybe in the future, we'll actually be able to find what we're looking for on the Internet?
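To make the "about" versus "written by" distinction concrete, here is a minimal Python sketch. It is only an illustration under assumptions: the element names (article, title, author, body), the file names, the stylesheet name and the sample text are all made up for the example, and no real search engine is claimed to work this way. It compares a naive ranking that counts how often a name appears with a lookup that reads an explicit author element.

```python
import xml.etree.ElementTree as ET

# Hypothetical pages. The vocabulary (<article>, <title>, <author>, <body>),
# the file names, article.xsl and the sample text are invented for illustration.
PAGES = {
    "profile.xml": """<?xml-stylesheet type="text/xsl" href="article.xsl"?>
<article>
  <title>Who is John Leyden?</title>
  <author>Jane Doe</author>
  <body>John Leyden is a journalist. John Leyden covers security.
  Readers know John Leyden for his reporting.</body>
</article>""",
    "blaster.xml": """<?xml-stylesheet type="text/xsl" href="article.xsl"?>
<article>
  <title>Blaster worm spreads</title>
  <author>John Leyden</author>
  <body>The blaster worm infected many machines this week.</body>
</article>""",
}


def naive_score(xml_text, name):
    """Count occurrences of the name anywhere in the page text,
    ignoring what each element means (a stand-in for layout-only indexing)."""
    root = ET.fromstring(xml_text)
    text = " ".join(root.itertext())
    return text.lower().count(name.lower())


def written_by(xml_text, name):
    """Check the explicit <author> element instead of counting words."""
    root = ET.fromstring(xml_text)
    author = root.findtext("author", default="")
    return author.strip().lower() == name.lower()


def rank_naive(pages, name):
    """Order page names by descending name frequency."""
    return sorted(pages, key=lambda p: naive_score(pages[p], name), reverse=True)


def find_written_by(pages, name):
    """Return only the pages whose declared author matches the name."""
    return [p for p in pages if written_by(pages[p], name)]


if __name__ == "__main__":
    print(rank_naive(PAGES, "John Leyden"))
    print(find_written_by(PAGES, "John Leyden"))
```

In this toy data set the frequency-based ranking puts the profile first, because the name appears there more often, while the author lookup returns only the article that declares John Leyden as its author. The stylesheet instruction at the top of each page is simply ignored by the parser, which mirrors the idea of a search engine discarding the presentation and keeping the content.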

1 comment:

Anonymous said...: MSN seem to be sucking up the entire internet in no time at all. I've had more hits from "msnbot" in the last 12 days than the next three User Agents my stats have listed. What's with that? Get the latest stuff at any cost?

If it's a money bucket for MS, I'm all for it ;-)

...Coofer Cat; 12 November, 2004 22:33