home  news  about us  jobs  contact us 

 advertise with: www.domain-b.com  l   www.prdomain.com     

web solutions  software  media & content  design   consulting   clients   resources  

 
 
RESOURCES
Keyword searches and labelling: the relevance of it all
Kiron Kasbekar*
Words are not enough to say how thankful we should be to search engines, which have made it so much easier to find information than ever before. They search through billions of Web pages to get us a choice of results we can sift through to find the information we need.

 

Having said that, we need to move on. For business executives searching for business information on the Web, scouring the Web has become a scourge.

Analysts have found that in a typical mid- to large-size company, millions of dollars are wasted each year trying to find critical information. According to a Washington Post Survey, 17 per cent of decision-makers spend 5 hours per workday and 7 per cent of them spend 5 hours per weekend day on the Internet.

Consider this: how often do we really need to search billions of pages? More often than not, the information we seek resides in a handful of websites. But since a search engine-based search is so wide-ranging, the relevant results get mixed up with a lot of irrelevant ones.

If we had been paying for these searches in cash, we would have said we were doing an over-kill. Since we don't have to pay, nobody bothers about the cost. But there is a cost to pay.

Wasting company time
We end up wasting a lot of time checking through many search results that seem relevant but actually are not. Going through even the first few pages of the results can take an entire afternoon.

The problem is compounded by human nature. We never stick to the straight and narrow. More often than not we get sidetracked by some interesting new information that has nothing to do with our original purpose. We may acquire a more holistic view of many things through such searching and browsing, but we may have become inefficient executives in the process.

Some search engines tell you they will arrange the results by order of relevance. Some searches list the results by 'relevance', even going to the extent of giving percentage score of relevance. When you actually open the document, you wonder how on earth the score was given.

The truth is that relevance is often determined in a mechanical way; for example, by the number of times your keywords occur in the 'description' of the document. The trouble is that this by no means assures that the main content of the document is relevant.

That is because software programmes are not yet intelligent enough to judge relevance. You need expert systems to judge relevance. And there is nothing yet to beat the expert human touch.

Limitations of keyword searches on search engines
To give you an example, you may see a 500-word article that discusses various things about a product, including its price, its distribution, its promotion, and so on, and not use the word 'marketing' even once. Yet, when you glance through the article, you know intuitively that the article is about marketing. So, if you have a system that allows you to label documents, you attach the 'marketing' label to the article so that when you next look for articles on marketing you will not miss this one.

The search engine faithful will exclaim, "Ah! But this is not the way to do a keyword search. If you want articles on marketing, you must enter not just the keyword 'marketing', but also other related keywords, such as 'product', 'advertising' and 'promotion'."

That is, of course, a wise thing to do; but it doesn't ensure relevance. You could, for example, find articles like this one: about a manager of an 'advertising' agency who is a 'product' of the Harvard Business School, who got a 'promotion' for his smart work in the finance department. When you actually scan the article and label it, you are unlikely to attach the 'marketing' tag to it. But the smartest keyword search will show it in the results for 'marketing'.

We have got so carried away by volume that every time we hear that a search engine now crawls through more billions of pages to get us results, we say, "Wow!" That wow factor can be debilitating. We are adding more hay to the haystack in which we must find the needle.

Search engines of course have their use, even when they conduct wide-ranging searches. But you often do not need to do a wide-ranging search.

Google, the largest, and arguably the best of them, currently indexes over 4 billion web pages. Compare that with the more than 84 billion pages that exist in what is called the 'deep Web' (mostly web pages served from databases). It is unlikely that you will get better results if Google indexes, say, 10 billion pages. Relevance is not a numbers game.

The ideal situation
My proposition is that if we want a really relevant search, we should not search an ever-growing volume of pages; instead, we should search only a relatively small number of relevant websites. If the base of the search (the list of websites to be searched) has greater relevance, the results of the search will have greater relevance.

The ideal situation would be to need to look at a single website for your entire information requirement. If you are in the business of, say, making cars, wouldn't it be great if all your information requirements were met by a website that provided news on all the cars in the world, all the materials and components that go into the car, the companies that make all these things, the dealers who sell the vehicles, on car marketing and logistics, car finance, capital market trends related to car and auto component companies, recruitment and salary trends in the industry in different countries, and so on?

Unfortunately, we do not live in an ideal world. But if we cannot find a single website that meets all our requirements, do we need to go to the other extreme and search a million websites? The practical thing to do is to find a number that is somewhere in between — and closer to one than to a million. Maybe a hundred websites? Two hundred? A thousand?

You gain nothing from duplication, except verification. When you are doing serious research, reading the same news in two or three different sources helps to identify discrepancies and mistakes. Other than that, going through multiple versions of the same information merely wastes management time and money.

The key then is to optimise on a number to ensure that you do not miss out on anything important. You need to select websites for their relevance to your requirements rather than by criteria like 'popularity', used by search engines. I am not saying that search engines are wrong in using popularity as an important criterion for sorting search results. Popularity acts as a proxy for credibility. If more people use, say, Reuters or Bloomberg for news than, say, a local city publication, then the users of the search engines are more likely to find the search engine ranking more satisfactory.

This applies to search engine users, who are an extremely diverse lot. The approach for business users must be different. They are in a position to decide the relevance of websites.

Count on experience
How do we decide which are the most relevant websites to check out? You need neither rocket science nor artificial intelligence to answer that question. The answer is simple: experience.

Experience tells each of us - at least those of us who are accustomed to doing research — where to search when we seek certain kinds of information. We may simply check out the Reuters, Bloomberg or Financial Times websites if we are looking for some recent business news (and The Wall Street Journal if we are paid subscribers to it). Or we may open the BBC, CNN or Guardian websites if we are looking for political news.

Now that may sound very limiting. But it's not always so. You will often find, especially with breaking news, that most of the obvious sites are carrying more or less the same information, based on a news release or disclosure.

If you need to delve deeper, look for background, read about trends, analysis and comments, you may need to expand your search through more websites, including more media sites, corporate sites of companies in and related to your industry, some stock exchange sites, maybe some regulatory agency sites. But you still do not need to run through billions of pages of information, including football club, and travel and tourism websites.

It might be argued that an individual's experience is limited - and that's why an impersonal keyword-based search across the Web is superior. The correct answer to the individual's limitations is not a Web-wide search; it to make the broad selection of websites to search on the basis of the experience of many people. And users should be allowed to keep adding to the list of sites to be searched.

Proper approach
It is unlikely that we will find a single, simple and unique product that will solve our information search needs. But we can certainly adopt the right approach to how we look for, store and retrieve information.

The order of search (and related management) should be this:
1. Routinely get news and other information feeds that get downloaded either as the most recent full content where possible or as links to the latest information directly to your server (either a web server or a local network server). Ensure that all the latest information from the desired sites is downloaded. Routinising and automating this process will straightaway save a massive amount of time and salary costs.
2. Allow labelling of the downloaded content and links by a central librarian or knowledge officer. Allow every user to add his or her personal labels to the documents and links. Personal labelling brings search in line with subjective, personal preferences. Three people may label a single article in three or more different ways - one may attach the 'marketing' label to it, another may attach 'marketing' and 'people', and the third may label it 'notes for next week's strategy conference'. Different people see different subjective relevance in content, which a mechanical keyword search can never fathom.
3. Allow users to bookmark documents and links to give them priority status.
4. Then allow users to first search through their book-marked documents and links by using single or multiple labels as search criteria. This means that the search is done on a small but relevant number of documents, and the search results are likely to be most relevant. This will also mean a reduced load on the server and on the network bandwidth.
5. If the search does not yield satisfactory results, let users search through all the downloaded content on the server.
6. If that too does not get the necessary results, go for a search engine search across the Web. Chances are that you will rarely need to do so.

Try labelling
Labelling is a good alternative to inefficient keyword searches across the Web. The question is: Somebody has to create all those labels — how do you create labels efficiently?

Labelling works through databases, which allow query-based searches that are flexible - you can narrow them down or expand them, but the search is disciplined by the way in which the content has been categorised through labels, or other fields.

Labelling requires a certain amount of experience and expertise, but not all that much. You can use external labelling services like Informachine from The Information Company, which is into various knowledge management solutions. Or you may deploy one or two people (or more, depending on the volume of content your company needs to download) to do the job. Or you can use a combination of an external service and your own internal people.

A new look at labelling
How should the labellers go about their job?

The mechanical thing to do is to take every proper and common noun in the text and make it a label. But that is not labelling. That is indexing, which your operating software can do anyway to allow a keyword search through the directory or database.

Labelling must use an understanding of a hierarchy of relevance. To apply this we must distinguish between two types of documents:
1. 'Short' documents, such as news reports and articles in the media, press releases, brochures, case studies, analyst and product presentations, white papers, office memos and invoices or purchase orders.
2. 'Long' documents, such as the World Bank's World Development Report, the Indian government's Economic Survey, government budget documents, corporate annual reports, sustainability reports and suchlike.

Short documents: A quick scan of short documents can tell you what the document is about. Such documents will often have a primary subject and theme, and there will be secondary and tertiary subjects. The primary subject and theme are what determine the greatest relevance of the document; the secondary and tertiary subjects indicate lower levels of relevance. The labelling must be aligned with this order of relevance.

For example, take a news report about the world's biggest steel producer, Mittal Steel, making an unsolicited bid for the world number two, Arcelor. The primary theme is M&A, the primary subjects are steel, Mittal Steel and Arcelor. The report may further discuss how trading in the stocks of these two companies was suspended, what has happened to the shares of other steel companies, and give background information on Mittal's earlier acquisition of LNM Steel and Arcelor's recent acquisition of Dofasco. These are of secondary or tertiary relevance, depending on whether your enterprise is more interested in stock markets or in steel.

You may not need to capture every single proper name in the report as a label. Users can always use a keyword search to find the minutiae.

Long documents: Long documents usually have an executive summary or overview that gives a bird-eye view of the document. It's not hard to decide that the World Development Report is about 'development' and the 'world' or 'all countries'. You can also quickly figure out the secondary labels, such as 'poverty', 'employment', 'literacy', and so on.

Where you really get stumped with long documents is the tertiary labelling. It would be a waste of a librarian or knowledge officer's time to ask them to find and label every single subject, such as individual welfare projects referred to in the report, or the names of towns and villages cited, or the names of people who find mention.

The simple and sensible approach to this problem is to not even try to do such tertiary subject labelling. Just leave it to automated indexing by standard software, such as IIS, and let users find such information through keyword searches.

That, of course, assumes that such long documents have already been downloaded and placed on your server, and you have a document management system that assures good housekeeping.

If you don't, then tens or hundreds or thousands of people in your organisation are probably wasting time visiting the Internet to get the same information multiple times, saving it in multiple places, and ending up not finding it when they desperately need it — and searching the Web all over again and wasting more time.

You can save millions of dollars by using the right software system and content provision services, which will not only drastically reduce the wastage of time and free up executive time for more important work, but also give the organisation many more degrees of freedom and flexibility in utilising the information that it already has, and adds every day, and in adding value to it to serve your corporate objectives.

* Kiron Kasbekar is Managing Director of The Information Company Pvt Ltd. A former Editor of The Economic Times, Bombay, Business Editor of The Times of India and Managing Editor of Business India, he is also a pioneer in creating business databases.


   A novel solution

 
 
The Informachine Information Management System

Do more things more efficiently:
Free your staff of the compulsion to spend an inordinate amount of time on surfing the Web.
Assure regular daily downloads from all websites of interest, irrespective of how busy they are with other work. First thing in the morning; with the option of an afternoon update.
Do centralised labelling, allow personalised labelling,
Add value through:
  o Annotation
  o Forwarding with comments
  o Forwarding with other attachments
  o Individual and group/community-wise alerts through desktop stickies within the local network and through emails outside
  o Creation of discussion groups around documents
  o Automatic generation of newsletters for groups and communities within the organisation
Label-based and keyword-based search, including advanced search through all the captured documents or through individually book-marked documents.


 

The Information Company Private Limited
 information creation, delivery, management, presentation