Enterprise Search Concepts: 2011

Saturday 13 August 2011

Access control list (ACL)

Access Control List:
A data set that grants permissions, or access rights, to each user or group for a specific system object, such as a directory or file.

FAST ESP, Autonomy IDOL or any other leading search product is able to utilize ACL information from the content repositories so that the same permissions apply to search results. This means that a user is only able to see the query results that he/she is entitled to view, based on his/her permissions towards the source content repository.
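The idea of trimming results by ACL can be sketched in a few lines of Python. This is only an illustration of the concept, not how FAST ESP or IDOL implement it: the `Document` class, the `allowed` set, and the user and group names are all made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: str
    # Principals (user or group names) allowed to read the document,
    # assumed to be copied from the source repository's ACL at index time.
    allowed: set = field(default_factory=set)


def filter_by_acl(results, user, groups):
    """Keep only the hits whose ACL names the user or one of the user's groups."""
    principals = {user} | set(groups)
    return [doc for doc in results if doc.allowed & principals]


# Hypothetical result set returned by a query before security trimming.
hits = [
    Document("hr-policy.doc", {"hr-team"}),
    Document("lunch-menu.txt", {"everyone"}),
    Document("salaries.xls", {"alice"}),
]

visible = filter_by_acl(hits, user="bob", groups=["everyone", "hr-team"])
print([doc.doc_id for doc in visible])
```

Here bob sees the two documents shared with his groups, while salaries.xls stays hidden because only alice is listed on it.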

Friday 12 August 2011

Entity Extraction

Entity Extraction:

Entity extraction means detecting, extracting, and normalizing entities, such as names of people or companies, from documents. This adds more structure to the data and enables navigation or relevancy enhancements based on specific entities.

FAST ESP ships with predefined entity extractors. In Autonomy IDOL, entity extraction is implemented using a grammar file that is processed by the Eduction module via indextasks.

Offensive Content Filter

Offensive Content Filter:
The Offensive Content Filter is a document analysis tool to filter content regarded
as offensive.

The offensive content filter is implemented as a separate document processor that can be added to an ESP
pipeline. In Autonomy IDOL it can be implemented using the Eduction module.

How it works:
Document content is generally run through filters and compared to a pre-defined dictionary. Matching terms can be added, replaced or removed, or the entire document can be rejected.

The output of the filter is an overall score that indicates the likelihood that a document is offensive.
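A dictionary-based filter of this kind can be sketched as follows. The dictionary, the weights, and the scoring formula (weighted share of matching tokens) are my own assumptions for illustration; neither ESP nor IDOL documents its scoring this way.

```python
import re

# Hypothetical dictionary: term -> weight. A real deployment would load a
# curated, language-specific list.
OFFENSIVE_TERMS = {"darn": 1.0, "heck": 0.5}


def offensive_score(text, dictionary=OFFENSIVE_TERMS):
    """Return the summed weight of dictionary terms divided by the token count."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(dictionary.get(tok, 0.0) for tok in tokens) / len(tokens)


def mask_terms(text, dictionary=OFFENSIVE_TERMS):
    """Replace each dictionary term with asterisks of the same length."""
    def repl(match):
        word = match.group(0)
        return "*" * len(word) if word.lower() in dictionary else word
    return re.sub(r"[A-Za-z']+", repl, text)


print(offensive_score("darn it"))
print(mask_terms("Well, darn it"))
```

A pipeline could reject a document when the score crosses a chosen threshold, or index the masked text instead.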

Lemmatization

Lemmatization:

The purpose of lemmatization is to enable a query with one word form to match documents that contain a
different form of the word.

In English, lemmatization can occur for:

singular or plural forms for nouns.
positive, comparative, or superlative forms for adjectives.
tense and person for verbs.

For other languages, lemmatization also allows search across case and gender forms and other form
paradigms, depending on the grammatical features for the word forms.

Lemmatization allows a user to search for a term like car and get both documents that
contain the word car and documents that contain the word cars.

Lemmatization, stemming and wildcard search:
Lemmatization differs from stemming or wildcard search by being more precise. Different word forms are
mapped to each other by using a language specific dictionary, not by applying simple suffix chopping rules
(stemming) or partial string matches (wildcard search).
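The difference between the three techniques can be seen in a small Python sketch. The lemma dictionary and the suffix list below are toy data I made up for the example; real products use much larger language-specific resources.

```python
from fnmatch import fnmatch

# Tiny hypothetical lemma dictionary: word form -> lemma.
LEMMAS = {"car": "car", "cars": "car", "mice": "mouse", "mouse": "mouse",
          "better": "good", "good": "good"}


def lemmatize(word):
    """Dictionary lookup; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)


def naive_stem(word):
    """Crude suffix chopping, only to show where rule-based stemming falls short."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Lemmatization maps irregular forms to the same lemma.
print(lemmatize("mice"), lemmatize("mouse"))
# Suffix chopping handles car/cars but leaves mice and mouse apart.
print(naive_stem("cars"), naive_stem("mice"), naive_stem("mouse"))
# A wildcard query car* also matches unrelated words such as carpet.
print(fnmatch("carpet", "car*"))
```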

Friday 5 August 2011

Synonym with Autonomy IDOL

Synonym
A synonym based search returns results which are conceptually similar to the query terms.

Solution Approaches:

Enable synonym search in Autonomy IDOL
Create a synonym database

Enable synonym search in Autonomy IDOL: Autonomy recommends this method if synonym matching is required for approximately a few hundred terms.

It is a three-step process: 1. Set up a synonym file. 2. Configure the IDOL server to use the synonym file. 3. Execute the synonym query.

1. Set up a synonym file:
1. Create a text file and save it in the IDOL server's IDOL/content directory, using the custom file name (chosen by the user) that is specified in the IDOL server configuration file's [SynonymType] section.
2. Create sections for each language type defined in the IDOL server configuration file.
For example:
[EnglishASCII]
[GermanUTF8]
3. In each section, create a line for each word for which you want to list synonyms (using the encoding used for the associated language type).
Example:
[EnglishASCII]
cat
dog

[GermanUTF8]
Katze
Hund

4. List synonym strings next to each word and save the file. Separate the word and each string with commas (there must be no space before or after a comma). The individual terms can contain spaces but must not contain any punctuation.
For example:

[EnglishASCII]
cat,feline,grimalkin,moggy,mouser,puss,pussy,tabby
dog,bitch,cur,hound,mans best friend,mongrel,mutt,pooch,puppy

[GermanUTF8]
Katze,Mietze,Mietzekatze,Mietzekater,Kater,Mulle,Kätzchen
Hund,Wau Wau,Hündin,Töle,Kläffer,Hündchen,Welpe
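To check a synonym file before handing it to IDOL, a small parser helps. The following Python sketch reads the layout described in the steps above (language sections, one comma-separated line per word). The function name and the returned structure are my own; this is not an IDOL tool.

```python
def parse_synonym_file(text):
    """Parse sectioned, comma-separated synonym text into {section: {word: [synonyms]}}."""
    sections = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            # A new language section such as [EnglishASCII].
            current = sections.setdefault(line[1:-1], {})
            continue
        if current is None:
            raise ValueError(f"entry outside a language section: {line!r}")
        # The first item is the word, the rest are its synonyms.
        word, *synonyms = line.split(",")
        current[word] = synonyms
    return sections


sample = """[EnglishASCII]
cat,feline,mouser
dog,hound,mans best friend
"""
print(parse_synonym_file(sample))
```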

To configure IDOL server to use a synonym file

1. Open the IDOL server configuration file in a text editor.
2. In the IDOL server configuration file's [FieldProcessing] section, set up a synonym process. This process allows IDOL server to determine when it must apply synonym settings.
For example:

[FieldProcessing]
0=SynonymMatch

3. Create a section for the listed synonym field process and define a property for the process in it (synonym properties always point to a defined synonym job). Identify the fields to associate with the process.

For example:
[SynonymMatch]
Property=ApplySynonymMatch
PropertyFieldCSVs=*/DRETITLE,*/DRECONTENT

In this example, IDOL server returns documents for synonym queries only if their DRETITLE or DRECONTENT field values match the query.
(When identifying the fields, use the format /FieldName to match root-level fields, */FieldName to match all fields except root-level, or /Path/FieldName to match fields that the specified path points to.)

Note: This should be implemented in the [FieldProcessing] section of the IDOL config.

4. Create a section for the property and set its SynonymType parameter to the name of the synonym job, which specifies the settings IDOL server must apply to synonym queries.

[ApplySynonymMatch]
SynonymType=Synonym_job

Note: This should be implemented in the [Properties] section of the IDOL config.

5. In the IDOL server configuration file's [Synonym] section, list the synonym job whose settings apply when a synonym query is sent to IDOL server. Multiple jobs can be set up in the [Synonym] section, but normally only one is required.
For example:

[Synonym]
0=Synonym_job

6. Define a section for the synonym job to specify the settings to apply to synonym queries. The section must have the same name as the synonym job.
For example:

[Synonym_job]
File=animals.txt
MaxExpandLevel=1

Note: Information on MaxExpandLevel:

Description
How many levels (0-3) of synonyms to display. This lets you specify how many levels of the synonym tree to show in the links field for query results. Enter 0 to display only direct synonyms, 1 to display direct synonyms and synonyms of the direct synonyms, and so on.

Example
The synonym file contains:
girl, young woman, lass, gal, schoolgirl, young lady, maiden, damsel
maiden, budding, fresh, pristine, new, raw, undeveloped, virgin
pristine, disinfected, germ-free, immaculate, pasteurized, purified, spotless, sterilized
Depending on the MaxExpandLevel level setting, a synonym query for the word "girl" is processed as follows:
MaxExpandLevel=0
Only directly related synonyms are added to a synonym query. If a synonym query, for example, contains the word "girl", the words "young woman", "lass", "gal", "schoolgirl", "young lady", "damsel" and "maiden" are added to it.
MaxExpandLevel=1
If a synonym query contains the word girl, direct synonyms for "girl" are added to the query ("young woman", "lass", "gal", "schoolgirl", "young lady", "damsel", "maiden") as well as synonyms of these direct synonyms ("budding", "fresh", "new", "raw", "undeveloped", "virgin", "pristine").
MaxExpandLevel=2
If a synonym query contains the word girl, direct synonyms for "girl" ("young woman", "lass", "gal", "schoolgirl", "young lady", "damsel", "maiden"), synonyms of the direct synonyms ("budding", "fresh", "new", "raw", "undeveloped", "virgin", "pristine") and synonyms of these synonyms are added to the query ("disinfected", "germ-free", "immaculate", "pasteurized", "purified", "spotless", "sterilized").

7. Save the configuration file and restart IDOL server.
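The MaxExpandLevel behaviour described in step 6 can be sketched as a level-by-level walk of the synonym tree. This is my own Python illustration using the "girl" example data; it mimics the described expansion and is not IDOL's actual code.

```python
SYNONYMS = {
    "girl": ["young woman", "lass", "gal", "schoolgirl", "young lady", "maiden", "damsel"],
    "maiden": ["budding", "fresh", "pristine", "new", "raw", "undeveloped", "virgin"],
    "pristine": ["disinfected", "germ-free", "immaculate", "pasteurized",
                 "purified", "spotless", "sterilized"],
}


def expand(term, max_expand_level, synonyms=SYNONYMS):
    """Return the terms added to a query, following the synonym tree level by level."""
    added = []
    seen = {term}
    frontier = [term]
    # Level 0 follows one hop (direct synonyms), level 1 two hops, and so on.
    for _ in range(max_expand_level + 1):
        next_frontier = []
        for word in frontier:
            for syn in synonyms.get(word, []):
                if syn not in seen:
                    seen.add(syn)
                    added.append(syn)
                    next_frontier.append(syn)
        frontier = next_frontier
    return added


for level in (0, 1, 2):
    print(level, len(expand("girl", level)))
```

With this data the query grows from 7 added terms at level 0 to 14 at level 1 and 21 at level 2, which is a good reason to keep the level low.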

Execute Synonym Searches
After creating a synonym file and configuring IDOL server to use it, turn any Query action that you send to IDOL server into a synonym query by adding &Synonym=true to it.

For example:
http://localhost:5552/action=Query&Text=Felix is a great mouser&Synonym=true

This query returns documents that conceptually match the term mouser, as well as documents that conceptually match any of the terms listed as synonyms for the term mouser in the synonym file.
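When the query is built from code, the text should be URL-encoded. A minimal Python sketch, assuming the host, port and URL layout of the example above (the helper name is mine):

```python
from urllib.parse import quote


def synonym_query_url(host, port, text):
    """Build a Query action URL with Synonym=true, percent-encoding the query text."""
    return f"http://{host}:{port}/action=Query&Text={quote(text)}&Synonym=true"


print(synonym_query_url("localhost", 5552, "Felix is a great mouser"))
```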

Implementation of Approach 2: Set up an Additional Synonym IDOL Server

Key process to set up an additional IDOL server:

1. Install the Synonym IDOL server.
2. Create a synonym file and index it.
3. Execute a synonym query.

Process to Install the Synonym IDOL Server
Install the IDOL server component following the installation instructions. If the Synonym IDOL server is installed on the same machine as your existing IDOL server, ensure that the servers use different ports.

Create and Index a Synonym File
You can obtain the synonym file that you are going to store in your Synonym IDOL server by spidering a thesaurus site (using HTTP Connector) or by creating the file manually. A synonym file must be a text file that contains the following fields.

For example:

#DREREFERENCE Syn1.txt
#DRECONTENT cat feline grimalkin moggy mouser tabby siamese kitten
#DREENDDOC

#DREREFERENCE Syn2.txt
#DRECONTENT dog cur hound mongrel mutt pooch puppy
#DREENDDOC

Note: If HTTP Connector is used to create the synonym file, the connector can also be used to index it. A manually created file can be indexed using a DREADD index action.
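If the synonym groups already live in a spreadsheet or a script, the file can be generated instead of typed. This Python sketch writes records in the same layout as the example above; the function name and the input dictionary are assumptions of mine, and the output should be checked against your IDOL version's IDX requirements before indexing.

```python
def to_idx(synonym_groups):
    """Render {reference: [terms]} as IDX-style records, one document per group."""
    records = []
    for reference, terms in synonym_groups.items():
        records.append(
            f"#DREREFERENCE {reference}\n"
            f"#DRECONTENT {' '.join(terms)}\n"
            "#DREENDDOC\n"
        )
    # Records are separated by a blank line, as in the example.
    return "\n".join(records)


groups = {
    "Syn1.txt": ["cat", "feline", "grimalkin", "moggy", "mouser"],
    "Syn2.txt": ["dog", "cur", "hound", "mongrel", "mutt"],
}
print(to_idx(groups))
```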

Execute Synonym Searches

To execute synonym searches:
1. Send a query to the Synonym IDOL server.
For example: http://synonymServerHost:synonymServerPort/action=Query&Text=mouser
2. When the Synonym IDOL server returns the synonym results, add the results to the query string and send the newly formed query to the Content IDOL server (normally a front end is set up to do this).
For example: http://IDOLhost:port/action=Query&Text=mouser+(cat feline grimalkin moggy mouser tabby siamese kitten)
This query returns documents that conceptually match the term mouser, as well as documents that conceptually match any of the terms that the Synonym IDOL server lists as synonyms for the term mouser.
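The front-end step that merges the synonym results into the second query can be sketched like this in Python. The synonym list is passed in directly, so no server is contacted; the helper name, host name and port 9000 are placeholders I chose for the example.

```python
from urllib.parse import quote


def expanded_query_url(host, port, term, synonyms):
    """Append the synonym list in parentheses to the original term and encode it."""
    text = f"{term} ({' '.join(synonyms)})"
    return f"http://{host}:{port}/action=Query&Text={quote(text)}"


# Pretend these terms came back from the Synonym IDOL server for "mouser".
returned = ["cat", "feline", "grimalkin", "moggy", "mouser", "tabby", "siamese", "kitten"]
print(expanded_query_url("IDOLhost", 9000, "mouser", returned))
```

Spaces and parentheses are percent-encoded here, which is equivalent to the hand-written example once the server decodes the URL.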

Thursday 4 August 2011

Implementing Stemming with Autonomy IDOL

Feature Description:
The purpose of lemmatization/stemming is to enable a query with one word form to match documents that contain different forms of the word.
In languages, some words have a common morphological root. Autonomy provides stemming algorithms that reduce words to this form. This process allows you to match concepts regardless of the grammatical use of words. In English for example, the words help, helpful, helping and helped can all be stripped to their stem help without significant loss of meaning.
Autonomy provides, as standard, a set of stemming algorithms for the most commonly used languages. IDOL applies stemming after it discards stop words, both at index time (when content is stored in IDOL server) and at query time (IDOL removes stop words and stems query text before matching).

Solution approach:

There are two approaches to implementing stemming with Autonomy:

Use the default stemming rules provided by Autonomy.
Create a custom stem file for a language: you can override the default stemming rules for certain words in a given language by creating a language-specific stemming file.

Steps:
a) Create the file. This file is a list of words and their stems. For example:
[UTF8]
mice mouse
mouse mouse
children child

b) Open the IDOL server configuration file. In the [MyLanguage] section for the stemming file language, set the StemmingFile configuration parameter to the name of your stemming file. For example:
[english]
Encodings=ASCII:englishASCII,UTF8:englishUTF8
Stoplist=english.dat
Stemming=true
StemmingFile=english_stem.dat
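How a custom stem file overrides the default rules can be pictured with a short Python sketch. The `default_stem` function is a toy stand-in of mine for Autonomy's built-in stemmer, which is not public; only the override-first lookup is the point here.

```python
def parse_stem_file(text):
    """Read 'word stem' pairs, skipping blank lines and headers such as [UTF8]."""
    overrides = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("["):
            continue
        word, stem = line.split()
        overrides[word] = stem
    return overrides


def default_stem(word):
    """Toy stand-in for the built-in stemmer: strips a few English suffixes."""
    for suffix in ("ful", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def stem(word, overrides):
    """Custom entries win; everything else falls through to the default rules."""
    return overrides.get(word, default_stem(word))


overrides = parse_stem_file("[UTF8]\nmice mouse\nmouse mouse\nchildren child\n")
print([stem(w, overrides) for w in ("mice", "children", "helping", "helpful", "helped")])
```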

Who Moved My Cheese???

Who Moved My Cheese? was recently recommended to me as a fantastic read. Here are my thoughts after reading the book.

It is indeed a fantastic read. In a simple, realistic and effective manner, the author explains how change is an integral part of our lives and, best of all, how to deal with it. Each of us will definitely relate to one of the four central characters: Sniff, Scurry, Hem and Haw!

I recommend it as a must-read for everybody!

Wednesday 27 July 2011

Spell check with Autonomy IDOL.

Spell-Check:
Autonomy IDOL uses a term distancing algorithm to find correct spellings and suggest them. In this algorithm, IDOL server counts the number of edits (each edit being an insertion, deletion or replacement of a single character) between the query term and candidate terms to find the nearest matching terms.
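Counting single-character edits in this way is the classic Levenshtein distance. The Python sketch below shows that standard algorithm; I am assuming it is close in spirit to IDOL's term distancing, whose exact implementation is not published.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character insertions, deletions or replacements."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                   # delete ch_a
                current[j - 1] + 1,                # insert ch_b
                previous[j - 1] + (ch_a != ch_b),  # replace, or keep if equal
            ))
        previous = current
    return previous[-1]


def nearest_terms(term, vocabulary):
    """Return the vocabulary terms at the smallest edit distance from term."""
    distances = {candidate: edit_distance(term, candidate) for candidate in vocabulary}
    best = min(distances.values())
    return sorted(c for c, d in distances.items() if d == best)


print(edit_distance("kitten", "sitting"))
print(nearest_terms("serch", ["search", "perch", "server"]))
```

Both "search" and "perch" are one edit away from "serch", which is why document frequency is also needed to pick a sensible suggestion.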

The following is the minimum set of configuration settings required to activate spell check in Autonomy IDOL.

Index side
The following configuration parameters have to be included in the [server] section of the IDOL server configuration file:

SpellCheckMaxCheckTerms: The maximum size of a query (in number of terms) up to which the query is eligible for spell check. E.g. SpellCheckMaxCheckTerms = 200
SpellCheckIncorrectMaxDocOccs: The maximum number of documents a term can appear in and still be considered a misspelling. E.g. SpellCheckIncorrectMaxDocOccs = 1
SpellCheckCorrectMinDocOccs: The minimum number of documents a term must appear in to be offered as a spell check suggestion (or to be matched by a wildcard term).

The config parameter UnstemmedMinDocOccs can also be used for this purpose. It represents the minimum number of documents a term must appear in to be a spell check suggestion or to be matched by a wildcard term.
E.g. SpellCheckCorrectMinDocOccs = 1
UnstemmedMinDocOccs = 1
There are a few other config parameters related to spell check:
SpellCheckAlphaNumeric: Controls whether input terms containing numbers are omitted from spell check. It is either true or false.
E.g. SpellCheckAlphaNumeric = true

SpellCheckCacheMaxSize: The maximum number of spelling corrections that IDOL server can store. The spelling corrections are stored in the IDOL>content>main>prx.db file.
E.g. SpellCheckCacheMaxSize = 6666
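My reading of how the two document-occurrence thresholds work together is sketched below in Python. The index statistics, the function names and the threshold of 5 are invented for the example, and the real server logic may differ.

```python
# Hypothetical index statistics: term -> number of documents containing it.
DOC_COUNTS = {"search": 120, "serch": 1, "server": 45, "perch": 2}


def is_possible_misspelling(term, doc_counts, incorrect_max_doc_occs=1):
    """A term is a misspelling candidate if it appears in at most this many documents."""
    return doc_counts.get(term, 0) <= incorrect_max_doc_occs


def suggestion_pool(doc_counts, correct_min_doc_occs=5):
    """Only terms that appear in at least this many documents may be suggested."""
    return sorted(t for t, n in doc_counts.items() if n >= correct_min_doc_occs)


print(is_possible_misspelling("serch", DOC_COUNTS))
print(suggestion_pool(DOC_COUNTS))
```

With these numbers "serch" is treated as a likely misspelling, and only "search" and "server" are frequent enough to be offered as corrections.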

Query Side

Include spellcheck=true in the queries in order to instruct the IDOL server to check the spelling of the query terms and provide suggestions for any misspelled term.

Monday 25 July 2011

Ranking

Ranking determines the quality of a match between a query and a candidate document.
Search products consider the following parameters to determine the appropriate rank value:

Freshness - The age of the document at the point in time the query is issued.
Authority - The importance of a document as determined by links from other documents.
Quality - The assigned importance of a document.
Proximity - The distance between, and location of, query terms in the document. When a query contains multiple terms that are not detected as known phrases, the ranking process takes the relative position of the terms and determines the most relevant results based on how close the matching terms are to each other in the document.
Context - Different document fields, for example title, body, description, price, or type, may be assigned different relevance weights. This allows you to specify, for example, that a match in the title field of a document contributes more to the document's ranking value than a match in the body field.

The relevancy of a document is represented by its ranking value.
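To make the Context parameter concrete, here is a minimal Python sketch in which a match in the title counts three times as much as a match in the body. The weights and the simple term-count formula are assumptions of mine, not any vendor's ranking model.

```python
# Hypothetical field weights: a title match is worth three body matches.
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0}


def context_score(query_terms, doc_fields, weights=FIELD_WEIGHTS):
    """Sum the per-field counts of query terms, scaled by each field's weight."""
    score = 0.0
    for field_name, text in doc_fields.items():
        tokens = text.lower().split()
        weight = weights.get(field_name, 0.0)
        score += weight * sum(tokens.count(term.lower()) for term in query_terms)
    return score


doc = {"title": "Enterprise search", "body": "search engines index content for search"}
print(context_score(["search"], doc))
```

A full ranking value would combine a score like this with the freshness, authority, quality and proximity signals.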

Search Relevancy

In search, relevancy is the measure of how well the returned result set addresses the intent of the user's query.

Search products take the following concepts into consideration for effective relevancy:

Linguistics.
Ranking.
Navigation.
Sorting.

In future posts I will address the above concepts and their use for effective relevancy.

Saturday 23 July 2011

Federated Search

Problem statement
An employee in any organization looks for relevant information. This information could be held in one or more internal search engines and on public portals like Google, MSN, Yahoo etc. To find it, the employee has to search separately through the different internal and external search portals, which is very inconvenient.

Solution
Federated search provides a solution to this problem. It gives the user a single search form in which to enter a search query. The query is then submitted simultaneously to all the search engines, and the various result sets are combined into a unified result set.

Key issues to take care of while implementing federated search

Organization rules for combining results into the unified result set.
Security - Mapping security across all the internal enterprise search engines.
Duplicates detection and removal
Managing Facets
Taxonomy
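The fan-out and merge step, including duplicate removal, can be sketched in Python as follows. The two engines are stub functions returning made-up URLs and scores, and merging by raw score is a simplifying assumption, since scores from different engines are usually not directly comparable.

```python
from concurrent.futures import ThreadPoolExecutor


def federated_search(query, engines):
    """Query every engine in parallel and merge hits, keeping the best score per URL.

    engines maps an engine name to a callable(query) -> list of (url, score).
    """
    with ThreadPoolExecutor(max_workers=max(1, len(engines))) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in engines.items()}
        per_engine = {name: future.result() for name, future in futures.items()}

    merged = {}
    for name, hits in per_engine.items():
        for url, score in hits:
            # Duplicate detection by URL: keep only the highest-scoring copy.
            if url not in merged or score > merged[url][0]:
                merged[url] = (score, name)

    return sorted(
        ((url, score, name) for url, (score, name) in merged.items()),
        key=lambda hit: (-hit[1], hit[0]),
    )


# Stub engines standing in for an intranet index and a public web search.
def intranet(query):
    return [("http://intranet/a", 0.9), ("http://shared/doc", 0.4)]


def web(query):
    return [("http://shared/doc", 0.7), ("http://web/b", 0.5)]


for hit in federated_search("cheese", {"intranet": intranet, "web": web}):
    print(hit)
```

The organization rules mentioned in the list above would replace the plain sort, for example by normalizing scores per engine or boosting internal sources.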

Friday 22 July 2011

Enterprise Search

70%-80% of an organization's data is available in the form of unstructured data, e.g. Word documents, spreadsheets, email and web pages, to name a few. This content may be located on file servers, content management systems or websites; the remainder sits in structured data sources like databases.

Enterprise search aggregates data from all the unstructured and structured sources and facilitates the findability of relevant information within an organization in a unified manner.

The enterprise search product implemented for one of our clients features the following:

Integrates information from Web, internal, and external sources – providing a 360° view of market conditions.
Monitors events and information – alerting analysts and decision-makers to the latest, actionable intelligence.
Performs rapid search, discovery and advanced content analysis.
Enables Web 2.0 style collaboration with enterprise level features and security

General issues to be addressed while implementing enterprise search are:

Appropriate visualizations powered by facets
Organization Taxonomy
Security
Multiple systems
Timezones
Content sources

Hence, appropriate data analysis of all the content sources is required to facilitate useful information retrieval.

Search products available in the market: Microsoft FAST ESP, Autonomy IDOL, Endeca, Attivio, Google Search Appliance, Solr etc.

Of the enterprise search products listed above, I have worked with and consulted on Autonomy IDOL, Microsoft FAST ESP, Google Search Appliance, Attivio and Solr.