Enterprise Search Concepts: Implementing Stemming with Autonomy IDOL

Feature Description:
Purpose of lemmatizatiion/stemming is to enable a query with one word form to match documents that contain a different forms of the word.
In languages, some words have a common morphological root. Autonomy provides stemming algorithms that reduce words to this form. This process allows you to match concepts regardless of the grammatical use of words. In English for example, the words help, helpful, helping and helped can all be stripped to their stem help without significant loss of meaning.
Autonomy provides as standard, a set of stemming algorithms for the most commonly used languages. IDOL applies stemming after it discards stop words, both at index time (when content is stored in IDOL server) and at query time (IDOL removes stop words and stems query text before matching).

Solution approach:

There could be two approaches while implementing stemming through Autonnomy -

Using default stemming rules provided by Autonomy.
Create a Custom Stem File for a Language: You can override the default stemming rules for certain words in a given language by creating a language-specific stemming file.

Steps:
      a)   Create the file. This file is a list of words and their stems. Ex:
             [UTF8]
             mice mouse
             mouse mouse
            children child

     b) Open the IDOL server configuration file. In the [MyLanguage] section for the
         stemming file language, set the StemmingFile configuration parameter to
         the name of your stemming file. For example:
       [english]
       Encodings=ASCII:englishASCII,UTF8:englishUTF8
       Stoplist=engish.dat
       Stemming = true
       StemmingFile=english_stem.dat

Enterprise Search Concepts

Thursday, 4 August 2011

Implementing Stemming with Autonomy IDOL

1 comment: