Showing posts with label Autonomy IDOL. Show all posts

Saturday, 13 August 2011

Access control list (ACL)

Access Control List:
A data set that grants permissions, or access rights, to each user or group for specific system objects, such as a directory or file.

FAST ESP, Autonomy IDOL or any other leading search product is able to utilize ACL information from the content repositories so that the same permissions apply to search results. This means that a user is only able to see the query results that he/she is entitled to view, based on his/her permissions towards the source content repository.
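
As a rough illustration of the idea (not how FAST ESP or Autonomy IDOL implement it internally), the sketch below trims a result list to the documents a user may see. The `Document` class, its fields, and the sample users and groups are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical search hit carrying the ACL copied from its repository."""
    doc_id: str
    allowed_users: set = field(default_factory=set)
    allowed_groups: set = field(default_factory=set)


def can_view(doc, user, groups):
    """True if the user, or any group the user belongs to, is on the ACL."""
    return user in doc.allowed_users or bool(doc.allowed_groups & set(groups))


def filter_results(results, user, groups):
    """Keep only the hits that the user is entitled to view."""
    return [doc for doc in results if can_view(doc, user, groups)]


results = [
    Document("hr-policy", allowed_groups={"hr"}),
    Document("press-release", allowed_groups={"everyone"}),
    Document("salary-review", allowed_users={"alice"}),
]
visible = filter_results(results, user="bob", groups=["everyone", "engineering"])
print([doc.doc_id for doc in visible])
```

Here the user bob sees only the press release, because neither his name nor his groups appear on the ACLs of the other two documents.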

Friday, 12 August 2011

Entity Extraction

Entity Extraction:

Entity extraction means detecting, extracting, and normalizing entities, such as names of people or companies, from documents. This adds more structure to the data and enables navigation or relevancy enhancements based on specific entities.

FAST ESP ships with predefined entity extractors. In Autonomy IDOL, entity extraction is implemented with a grammar file that the Eduction module processes via index tasks.

Offensive Content Filter

Offensive Content Filter:
The Offensive Content Filter is a document analysis tool to filter content regarded
as offensive.

In FAST ESP, the offensive content filter is implemented as a separate document processor that can be added to an ESP
pipeline. In Autonomy IDOL, it can be implemented using the Eduction module.

How it works:
Document content is run through filters and compared against a predefined dictionary. Matching terms can be added, replaced, or removed, or the entire document can be rejected.

The output of the filter is an overall score that indicates the likelihood that a document is offensive.
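
A minimal sketch of this kind of dictionary-based filter follows. It is not the ESP document processor or an Eduction grammar; the scoring rule (share of words found in the dictionary), the threshold, and the mask string are assumptions chosen for illustration.

```python
import re


def offensive_score(text, dictionary):
    """Fraction of words in the text that appear in the dictionary (0.0 to 1.0)."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in dictionary)
    return hits / len(words)


def filter_document(text, dictionary, reject_threshold=0.2, mask="***"):
    """Return (cleaned_text, score); cleaned_text is None if the document is rejected."""
    score = offensive_score(text, dictionary)
    if score >= reject_threshold:
        return None, score  # reject the entire document

    def replace(match):
        word = match.group(0)
        return mask if word.lower() in dictionary else word

    return re.sub(r"[A-Za-z']+", replace, text), score


# "darn" stands in for a real dictionary entry.
cleaned, score = filter_document("This darn printer is slow again today", {"darn"})
print(cleaned, round(score, 3))
```

A document with only an occasional match is cleaned by replacing the matching terms, while one whose score reaches the threshold is rejected outright.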

Friday, 5 August 2011

Synonym with Autonomy IDOL

Synonym
A synonym based search returns results which are conceptually similar to the query terms.

Solution Approaches:

  1. Enable synonym search in Autonomy IDOL
  2. Create a synonym database
Enable synonym search in Autonomy IDOL: Autonomy recommends this method if synonym matching is required for roughly a few hundred terms.

It is a three-step process:
  1. Set up a synonym file.
  2. Configure the IDOL server to use the synonym file.
  3. Execute the synonym query.

1. Set up a synonym file:
1. Create a text file and save it in the IDOL server's IDOL/content directory, using the custom file name that you specify in the [SynonymType] section of the IDOL server configuration file.
2. Create a section for each language type defined in the IDOL server configuration file.
For example:
[EnglishASCII]
[GermanUTF8]
3. In each section, create a line for each word for which you want to list synonyms (using the encoding of the associated language type).
For example:
[EnglishASCII]
cat
dog

[GermanUTF8]
Katze
Hund

4. List synonym strings next to each word and save the file. Separate the word and each string with commas (there must be no space before or after a comma). The individual terms can contain spaces but must not contain any punctuation.
For example:

[EnglishASCII]
cat,feline,grimalkin,moggy,mouser,puss,pussy,tabby
dog,bitch,cur,hound,mans best friend,mongrel,mutt,pooch,puppy

[GermanUTF8]
Katze,Mietze,Mietzekatze,Mietzekater,Kater,Mulle,Kätzchen
Hund,Wau Wau,Hündin,Töle,Kläffer,Hündchen,Welpe
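
To make the file layout concrete, here is a small sketch that reads text in this format into a nested dictionary. It only illustrates the layout described above (bracketed language sections, one comma-separated line per word); `parse_synonym_file` is a hypothetical helper, not part of IDOL.

```python
def parse_synonym_file(text):
    """Map each [Section] name to {word: [synonyms]} for the lines under it."""
    sections = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = sections.setdefault(line[1:-1], {})
            continue
        if current is None:
            continue  # ignore lines that appear before any section header
        # No spaces are allowed around commas, so a plain split is enough.
        word, *synonyms = line.split(",")
        current[word] = synonyms
    return sections


sample = """\
[EnglishASCII]
cat,feline,moggy,mouser
dog,hound,mans best friend

[GermanUTF8]
Katze,Kater,Kätzchen
"""
parsed = parse_synonym_file(sample)
print(parsed["EnglishASCII"]["dog"])
```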

To configure IDOL server to use a synonym file

1. Open the IDOL server configuration file in a text editor.
2. In the IDOL server configuration file's [FieldProcessing] section, set up a synonym process. This process allows IDOL server to determine when it must apply synonym settings.
For example:

[FieldProcessing]
0=SynonymMatch

3. Create a section for the listed synonym field process to create a property for the process (synonym properties always point to a defined synonym job). Identify the required fields to associate with the process.

For example:
[SynonymMatch]
Property=ApplySynonymMatch
PropertyFieldCSVs=*/DRETITLE,*/DRECONTENT

In this example, IDOL server returns documents for synonym queries only if their DRETITLE or DRECONTENT field values match the query.
(When identifying the fields, use the format /FieldName to match root-level fields, */FieldName to match all fields except root-level, or /Path/FieldName to match fields that the specified path points to.)

Note: - This should be implemented in [FieldProcessing] section of the IDOL config.

4. Create a section for the property to set the SynonymType parameter to the name of the synonym job that specifies which settings IDOL server must apply to synonym queries.

[ApplySynonymMatch]
SynonymType=Synonym_job

Note: - This should be implemented in [Properties] section of the IDOL config.

5. In the [Synonym] section of the IDOL server configuration file, list the synonym job whose settings apply when a synonym query is sent to IDOL server. You can set up multiple jobs in the [Synonym] section, but normally only one is required.
For example:

[Synonym]
0=Synonym_job

6. Define a section for the synonym job to specify the settings to apply to synonym queries. The section must have the same name as the synonym job.
For example:

[Synonym_job]
File=animals.txt
MaxExpandLevel=1

Note: - Information on "MaxExpandLevel":

Description
How many levels (0-3) of synonyms to display. Allows specifying how many levels of the synonym tree you want to show in the links field for query results. Enter 0 to display only direct synonyms, 1 to display direct synonyms and synonyms of the direct synonyms, and so on.

Example
The synonym file contains:
girl,young woman,lass,gal,schoolgirl,young lady,maiden,damsel
maiden,budding,fresh,pristine,new,raw,undeveloped,virgin
pristine,disinfected,germ-free,immaculate,pasteurized,purified,spotless,sterilized
Depending on the MaxExpandLevel level setting, a synonym query for the word "girl" is processed as follows:
MaxExpandLevel=0
Only directly related synonyms are added to a synonym query. If a synonym query, for example, contains the word "girl", the words "young woman", "lass", "gal", "schoolgirl", "young lady", "damsel" and "maiden" are added to it.
MaxExpandLevel=1
If a synonym query contains the word girl, direct synonyms for "girl" are added to the query ("young woman", "lass", "gal", "schoolgirl", "young lady", "damsel", "maiden") as well as synonyms of these direct synonyms ("budding", "fresh", "new", "raw", "undeveloped", "virgin", "pristine").
MaxExpandLevel=2
If a synonym query contains the word girl, direct synonyms for "girl" ("young woman", "lass", "gal", "schoolgirl", "young lady", "damsel", "maiden"), synonyms of the direct synonyms ("budding", "fresh", "new", "raw", "undeveloped", "virgin", "pristine") and synonyms of these synonyms are added to the query ("disinfected", "germ-free", "immaculate", "pasteurized", "purified", "spotless", "sterilized").
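
The level-by-level expansion described above can be sketched as a breadth-first walk over the synonym lines. This models the documented behaviour only and is not IDOL's implementation; it assumes that only the first word on each line acts as the lookup key, as in the "girl" example.

```python
def parse_synonym_lines(lines):
    """Map the first word of each comma-separated line to the rest of the line."""
    table = {}
    for line in lines:
        terms = [term.strip() for term in line.split(",") if term.strip()]
        if terms:
            table[terms[0]] = terms[1:]
    return table


def expand(term, table, max_expand_level):
    """Return the synonyms added to a query for `term` at the given level."""
    seen = {term}
    added = []
    frontier = [term]
    # Level 0 performs one lookup pass, level 1 two passes, and so on.
    for _ in range(max_expand_level + 1):
        next_frontier = []
        for word in frontier:
            for synonym in table.get(word, []):
                if synonym not in seen:
                    seen.add(synonym)
                    added.append(synonym)
                    next_frontier.append(synonym)
        frontier = next_frontier
    return added


table = parse_synonym_lines([
    "girl,young woman,lass,gal,schoolgirl,young lady,maiden,damsel",
    "maiden,budding,fresh,pristine,new,raw,undeveloped,virgin",
    "pristine,disinfected,germ-free,immaculate,pasteurized,purified,spotless,sterilized",
])
for level in range(3):
    print(level, expand("girl", table, level))
```

With this sample file, each additional level pulls in the synonyms of one more head word ("maiden" at level 1, "pristine" at level 2).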

7. Save the configuration file and restart IDOL server.

Execute Synonym Searches
After you create a synonym file and configure IDOL server to use it, you can turn any Query action that you send to IDOL server into a synonym query by adding &Synonym=true to it.

For example:
http://localhost:5552/action=Query&Text=Felix is a great mouser&Synonym=true

This query returns documents that conceptually match the term mouser, as well as documents that conceptually match any of the terms listed as synonyms for the term mouser in the synonym file.   
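
When a front end builds such a query, the text should be URL-encoded. The sketch below assembles the request string for the example above using only the Python standard library; the host and port come from the example, and `synonym_query_url` is a hypothetical helper name. It builds the URL but does not send it.

```python
from urllib.parse import quote, urlencode


def synonym_query_url(host, port, text):
    """Build a Query action URL with Synonym=true, percent-encoding the text."""
    params = urlencode(
        {"action": "Query", "Text": text, "Synonym": "true"},
        quote_via=quote,  # encode spaces as %20 rather than +
    )
    return f"http://{host}:{port}/{params}"


print(synonym_query_url("localhost", 5552, "Felix is a great mouser"))
```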

Implementation of Approach 2: Set up an Additional Synonym IDOL Server

Key steps to set up an additional IDOL server:

  1. Install the Synonym IDOL server.
  2. Create a synonym file and index it.
  3. Execute a synonym query.

Process to Install the Synonym IDOL Server
Install the IDOL server component following the installation instructions. If you install the Synonym IDOL server on the same machine as your existing IDOL server, ensure that the servers use different ports.

Create and Index a Synonym File
You can obtain the synonym file to store in your Synonym IDOL server by spidering a thesaurus site (using HTTP Connector) or by creating the file manually. A synonym file must be a text file that contains a reference field and a content field for each entry.

For example:

#DREREFERENCE Syn1.txt
#DRECONTENT cat feline grimalkin moggy mouser tabby siamese kitten
#DREENDDOC

#DREREFERENCE Syn2.txt
#DRECONTENT dog cur hound mongrel mutt pooch puppy
#DREENDDOC

Note: - If you use HTTP Connector to create the synonym file, the connector can also index the file. A manually created file can be indexed using a DREADD index action.
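
If the synonym groups are maintained elsewhere (for example in a spreadsheet), the file can be generated rather than typed. The sketch below writes entries in the same layout as the example above; `to_synonym_idx` is a hypothetical helper, and the layout simply mirrors that example rather than a full specification of the index file format.

```python
def to_synonym_idx(groups):
    """Render (reference, terms) pairs as reference/content/end-of-document entries."""
    lines = []
    for reference, terms in groups:
        lines.append(f"#DREREFERENCE {reference}")
        lines.append("#DRECONTENT " + " ".join(terms))
        lines.append("#DREENDDOC")
        lines.append("")  # blank line between entries, as in the example
    return "\n".join(lines)


groups = [
    ("Syn1.txt", ["cat", "feline", "grimalkin", "moggy", "mouser"]),
    ("Syn2.txt", ["dog", "cur", "hound", "mongrel", "mutt"]),
]
idx_text = to_synonym_idx(groups)
print(idx_text)
```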

Execute Synonym Searches
To execute synonym searches:
1. Send a query to the Synonym IDOL server.
For example: http://synonymServerHost:synonymServerPort/action=Query&Text=mouser
2. When the Synonym IDOL server returns the synonym results, add the results to the query string and send the newly formed query to the Content IDOL server (normally a front end is set up to do this).
For example: http://IDOLhost:port/action=Query&Text=mouser+(cat feline grimalkin moggy mouser tabby siamese kitten)
This query returns documents that conceptually match the term mouser, as well as documents that conceptually match any of the terms that the Synonym IDOL server lists as synonyms for the term mouser.
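
The front-end step can be sketched as two small functions: one combines the original term with the synonyms returned by the Synonym IDOL server, and the other builds the query for the Content IDOL server. Retrieving and parsing the synonym server's response is left out, and both function names are hypothetical.

```python
from urllib.parse import quote


def expanded_query_text(term, synonyms):
    """Combine the original term with the synonym list in parentheses."""
    return f"{term} ({' '.join(synonyms)})"


def content_query_url(host, port, text):
    """Build the Query action URL for the Content IDOL server."""
    return f"http://{host}:{port}/action=Query&Text={quote(text)}"


# Terms as they might come back from the Synonym IDOL server for "mouser".
synonyms = ["cat", "feline", "grimalkin", "moggy", "mouser", "tabby", "siamese", "kitten"]
text = expanded_query_text("mouser", synonyms)
print(text)
print(content_query_url("IDOLhost", "port", text))
```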


Thursday, 4 August 2011

Implementing Stemming with Autonomy IDOL

Feature Description:
The purpose of lemmatization/stemming is to enable a query with one word form to match documents that contain different forms of the word.
In many languages, some words share a common morphological root. Autonomy provides stemming algorithms that reduce words to this form. This process allows you to match concepts regardless of the grammatical use of words. In English, for example, the words help, helpful, helping and helped can all be stripped to their stem help without significant loss of meaning.
Autonomy provides, as standard, a set of stemming algorithms for the most commonly used languages. IDOL applies stemming after it discards stop words, both at index time (when content is stored in IDOL server) and at query time (IDOL removes stop words and stems query text before matching).

Solution approach:

There are two approaches to implementing stemming with Autonomy:
  1. Using default stemming rules provided by Autonomy.
  2. Create a Custom Stem File for a Language: You can override the default stemming rules for certain words in a given language by creating a language-specific stemming file.
Steps:
  a) Create the file. This file is a list of words and their stems. For example:
       [UTF8]
       mice mouse
       mouse mouse
       children child

  b) Open the IDOL server configuration file. In the [MyLanguage] section for the stemming file language, set the StemmingFile configuration parameter to the name of your stemming file. For example:
       [english]
       Encodings=ASCII:englishASCII,UTF8:englishUTF8
       Stoplist=english.dat
       Stemming=true
       StemmingFile=english_stem.dat
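
The effect of a custom stem file can be pictured as a lookup that takes precedence over the default rules. In the sketch below, `naive_stem` is a deliberately crude stand-in for Autonomy's stemming algorithms (it only strips a few suffixes), and the override handling is an assumption about behaviour, not IDOL code.

```python
def load_stem_overrides(lines):
    """Read 'word stem' pairs, skipping blank lines and [Encoding] headers."""
    overrides = {}
    for raw_line in lines:
        line = raw_line.strip()
        if not line or line.startswith("["):
            continue
        word, stem_form = line.split()
        overrides[word.lower()] = stem_form
    return overrides


def naive_stem(word):
    """Toy default stemmer: strip one common suffix if enough of the word remains."""
    for suffix in ("ing", "ful", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem(word, overrides):
    """Use the custom stem if one is listed, otherwise fall back to the default."""
    word = word.lower()
    return overrides.get(word, naive_stem(word))


overrides = load_stem_overrides(["[UTF8]", "mice mouse", "mouse mouse", "children child"])
for word in ("mice", "children", "helping", "helped", "helpful", "help"):
    print(word, "->", stem(word, overrides))
```

Irregular forms such as mice and children are resolved by the override list, while regular forms fall through to the default rules.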

Wednesday, 27 July 2011

Spell check with Autonomy IDOL.

Spell-Check:
Autonomy IDOL uses a term distancing algorithm to find correct spellings and suggest them. In this algorithm, IDOL server counts the number of edits (each edit being an insertion, deletion, or replacement of a single character) needed to turn the query term into a known term, and uses that count to find the nearest matching terms.
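
The edit count described here corresponds to the classic Levenshtein distance. The sketch below is a textbook implementation plus a simple nearest-term lookup; it illustrates the concept and is not IDOL's actual algorithm. The vocabulary and the `max_edits` cut-off are assumptions.

```python
def edit_distance(a, b):
    """Minimum number of single-character insertions, deletions, or replacements."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # replace, or keep if equal
            ))
        previous = current
    return previous[-1]


def suggest(term, vocabulary, max_edits=2):
    """Return the vocabulary term closest to `term`, or None if none is near enough."""
    candidates = [(edit_distance(term, word), word) for word in vocabulary]
    candidates = [pair for pair in candidates if pair[0] <= max_edits]
    return min(candidates)[1] if candidates else None


vocabulary = ["mouser", "hound", "tabby"]
print(suggest("mousr", vocabulary))
```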

The following is the minimum configuration required to activate spell check in Autonomy IDOL.

Index side
The following configuration parameters must be included in the [Server] section of the IDOL server configuration file:
  • SpellCheckMaxCheckTerms: The maximum size of a query (in number of terms) up to which the query is eligible for spell check. E.g. SpellCheckMaxCheckTerms = 200
  • SpellCheckIncorrectMaxDocOccs: The maximum number of documents in which a term can appear and still be considered a misspelling. E.g. SpellCheckIncorrectMaxDocOccs = 1
  • SpellCheckCorrectMinDocOccs: The minimum number of documents in which a term must appear to be offered as a spell-check suggestion (or to be matched by a wildcard term).
You can also use the UnstemmedMinDocOccs configuration parameter for this purpose. It sets the minimum number of documents in which a term must appear to be offered as a spell-check suggestion or to be matched by a wildcard term.
E.g.  SpellCheckCorrectMinDocOccs = 1
        UnstemmedMinDocOccs = 1
There are a few other configuration parameters related to spell check:
SpellCheckAlphaNumeric: Omits input terms containing numbers from being spell-checked. It is either true or false.
E.g. SpellCheckAlphaNumeric = true

SpellCheckCacheMaxSize: The maximum number of spelling corrections that IDOL server can store. The spelling corrections are stored in the IDOL>content>main>prx.db file.
E.g. SpellCheckCacheMaxSize = 6666

Query Side

Include spellcheck=true in the queries in order to instruct the IDOL server to check the spelling of the query terms and provide suggestions for any misspelled term.