Thursday, 4 August 2011

Implementing Stemming with Autonomy IDOL

Feature Description:
The purpose of lemmatization/stemming is to enable a query that uses one form of a word to match documents that contain different forms of that word.
In many languages, groups of words share a common morphological root. Autonomy provides stemming algorithms that reduce words to this root form, which lets you match concepts regardless of the grammatical use of words. In English, for example, the words help, helpful, helping and helped can all be stripped to their stem, help, without significant loss of meaning.
Autonomy provides, as standard, a set of stemming algorithms for the most commonly used languages. IDOL applies stemming after it discards stop words, both at index time (when content is stored in IDOL server) and at query time (IDOL removes stop words and stems query text before matching).
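
To make the order of operations concrete, here is a minimal Python sketch of the same idea: stop words are dropped first, the remaining words are stemmed, and the identical normalization runs at index time and at query time. It is a toy illustration only, not IDOL's actual algorithm. The stop list, the suffix rules and the function names (toy_stem, normalize) are all assumptions made up for this example.

```python
# Toy illustration only: this is NOT IDOL's stemming algorithm.
# The stop list and suffix rules below are hypothetical.
STOP_WORDS = {"the", "a", "is", "was", "to", "and"}
SUFFIXES = ("ful", "ing", "ed", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Discard stop words first, then stem what is left."""
    words = text.lower().split()
    kept = [w for w in words if w not in STOP_WORDS]
    return [toy_stem(w) for w in kept]


# Index time: the stored terms are the stemmed, stop-word-free terms.
index_terms = set(normalize("The assistant was helping the customer"))

# Query time: the query goes through the same normalization.
query_terms = normalize("helpful")

# "helpful" matches the document because both reduce to the stem "help".
matches = all(term in index_terms for term in query_terms)
```

Because both sides are reduced to the same stem, a query for one word form can match a document that contains another form.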

Solution approach:

There are two approaches to implementing stemming with Autonomy:
  1. Using default stemming rules provided by Autonomy.
  2. Creating a custom stemming file for a language: you can override the default stemming rules for certain words in a given language by creating a language-specific stemming file.
Steps:
      a) Create the file. It lists words and their stems, one pair per line. For example:
             [UTF8]
             mice mouse
             mouse mouse
             children child

      b) Open the IDOL server configuration file. In the [MyLanguage] section for the
         language that the stemming file applies to, set the StemmingFile configuration
         parameter to the name of your stemming file. For example:
       [english]
       Encodings=ASCII:englishASCII,UTF8:englishUTF8
       Stoplist=english.dat
       Stemming=true
       StemmingFile=english_stem.dat
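
The effect of such a file can be sketched in Python as a lookup table that is consulted before the default rules. This is only an illustration under stated assumptions: the file layout (an encoding marker in square brackets followed by "word stem" pairs) is inferred from the example above, and the names parse_stem_file, stem_with_overrides and default_stem are hypothetical, not part of IDOL.

```python
# Illustration only: how IDOL reads the stemming file internally is not
# documented here. The format is inferred from the example file above.
SAMPLE_STEM_FILE = """[UTF8]
mice mouse
mouse mouse
children child
"""


def parse_stem_file(text):
    """Return a {word: stem} dict, skipping the encoding marker line."""
    overrides = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("["):
            continue  # blank line or an encoding marker such as [UTF8]
        parts = line.split()
        if len(parts) != 2:
            continue  # ignore lines that are not a "word stem" pair
        word, stem = parts
        overrides[word.lower()] = stem.lower()
    return overrides


def default_stem(word):
    """Hypothetical stand-in for the default rules: drop a plural 's'."""
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word


def stem_with_overrides(word, overrides):
    """Use the custom stem if one is listed, else the default rules."""
    word = word.lower()
    if word in overrides:
        return overrides[word]
    return default_stem(word)


overrides = parse_stem_file(SAMPLE_STEM_FILE)
```

With this table, stem_with_overrides("Mice", overrides) gives "mouse" from the custom file, while a word that is not listed, such as "cats", falls through to the stand-in default rule.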

1 comment:

  1. How do you tell IDOL not to include stemmed words when doing hit highlighting with a view action?
