Showing posts with label Search Concepts. Show all posts

Saturday, 13 August 2011

Access control list (ACL)

Access Control List:
A data set that grants permissions, or access rights, to each user or group for a specific system object, such as a directory or file.

FAST ESP, Autonomy IDOL, and other leading search products can use ACL information from the content repositories so that the same permissions apply to search results. This means that a user sees only the query results that he/she is entitled to view, based on his/her permissions in the source content repository.
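
As a rough illustration of the idea (not how FAST ESP or Autonomy IDOL implement it internally), the sketch below trims a result list to the documents a user may read. The `Document` class, the `allowed` field, and the `filter_by_acl` function are hypothetical names invented for this example; it assumes each indexed document carries the set of users and groups that have read access in the source repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str
    title: str
    allowed: frozenset  # users or groups with read access (assumed to be indexed with the document)


def filter_by_acl(results, user, groups):
    """Keep only the documents whose ACL mentions the user or one of the user's groups."""
    principals = {user, *groups}
    return [doc for doc in results if principals & doc.allowed]


results = [
    Document("1", "Public handbook", frozenset({"everyone"})),
    Document("2", "HR salaries", frozenset({"hr"})),
    Document("3", "Alice's notes", frozenset({"alice"})),
]

visible = filter_by_acl(results, "alice", ["everyone", "engineering"])
print([doc.doc_id for doc in visible])  # ['1', '3']
```

Filtering after the query is the simplest form to show. A real product would more likely apply the ACL as part of the query itself so that hit counts and facets also respect the permissions.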

Friday, 12 August 2011

Entity Extraction

Entity Extraction:

Entity extraction means detecting, extracting, and normalizing entities, such as names of people or companies, from documents. This adds more structure to the data and enables navigation or relevancy enhancements based on specific entities.

FAST ESP ships with predefined entity extractors. In Autonomy IDOL, entity extraction is implemented with a grammar file that is processed by the Eduction module via index tasks.
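
To make the detect, extract, and normalize steps concrete, here is a minimal dictionary-based sketch. It is not the FAST ESP extractor or an IDOL Eduction grammar; the `ALIASES` table and the `extract_entities` function are assumptions made for illustration only.

```python
import re

# Hypothetical alias table: surface form (lowercase) -> normalized entity name.
ALIASES = {
    "microsoft corp": "Microsoft",
    "microsoft": "Microsoft",
    "autonomy": "Autonomy",
}

# Longest aliases first, so "microsoft corp" is preferred over "microsoft".
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(a) for a in sorted(ALIASES, key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)


def extract_entities(text):
    """Return the normalized entities found in text, in order of first appearance."""
    entities = []
    for match in _PATTERN.finditer(text):
        canonical = ALIASES[match.group(1).lower()]
        if canonical not in entities:
            entities.append(canonical)
    return entities


print(extract_entities("Microsoft Corp acquired FAST; AUTONOMY was bought later."))
# ['Microsoft', 'Autonomy']
```

The normalized names could then be stored in a separate document field and used as a navigator or as a relevancy boost.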

Offensive Content Filter

Offensive Content Filter:
The Offensive Content Filter is a document analysis tool to filter content regarded
as offensive.

In FAST ESP, the offensive content filter is implemented as a separate document processor that can be added to an ESP
pipeline. In Autonomy IDOL, it can be implemented using the Eduction module.

How it works:
Document content is run through the filter and compared against a predefined dictionary. Terms in that dictionary can be added, replaced, or removed, and an entire document can even be rejected.

The output of the filter is an overall score that indicates the likelihood that a document is offensive.
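
The dictionary comparison and the overall score can be sketched as follows. This is a simplified assumption of how such a filter might work, not the scoring used by FAST ESP or IDOL: the term list is a harmless placeholder, the score is simply the fraction of tokens found in the dictionary, and the function names and the 0.25 threshold are made up for the example.

```python
import re

# Placeholder dictionary; a real deployment would load a maintained term list.
OFFENSIVE_TERMS = {"heck", "darn", "blasted"}


def offensive_score(text):
    """Fraction of tokens that appear in the dictionary (0.0 for empty text)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in OFFENSIVE_TERMS)
    return hits / len(tokens)


def mask_terms(text):
    """Replace each dictionary term with asterisks of the same length."""
    def replace(match):
        word = match.group(0)
        return "*" * len(word) if word.lower() in OFFENSIVE_TERMS else word
    return re.sub(r"[A-Za-z']+", replace, text)


def should_reject(text, threshold=0.25):
    """Reject the whole document when the score reaches the (assumed) threshold."""
    return offensive_score(text) >= threshold


sample = "What the heck is this darn thing"
print(mask_terms(sample))     # What the **** is this **** thing
print(should_reject(sample))  # True (2 of 7 tokens match)
```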

Lemmatization

Lemmatization:

The purpose of lemmatization is to enable a query with one word form to match documents that contain a
different form of the word.

In English, lemmatization can occur for:
  1. singular or plural forms for nouns.
  2. positive, comparative, or superlative forms for adjectives.
  3. tense and person for verbs.
For other languages, lemmatization also allows search across case and gender forms and other form
paradigms, depending on the grammatical features for the word forms.

Lemmatization allows a user to search for a term like car and get both documents that
contain the word car and documents that contain the word cars.

Lemmatization, stemming and wildcard search:
Lemmatization differs from stemming or wildcard search by being more precise. Different word forms are
mapped to each other by using a language-specific dictionary, not by applying simple suffix-chopping rules
(stemming) or partial string matches (wildcard search).
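
The difference between the three approaches can be seen in a toy comparison. The tiny `LEMMAS` dictionary and the suffix rules below are invented for this sketch and are far smaller than what a real linguistics module uses.

```python
# Toy language-specific dictionary: word form -> base form (lemma).
LEMMAS = {
    "car": "car", "cars": "car",
    "good": "good", "better": "good", "best": "good",
    "run": "run", "runs": "run", "ran": "run", "running": "run",
}


def lemmatize(word):
    """Dictionary lookup; unknown words are returned unchanged (lowercased)."""
    return LEMMAS.get(word.lower(), word.lower())


def naive_stem(word):
    """Simple suffix chopping, with no knowledge of the language."""
    w = word.lower()
    for suffix in ("ing", "es", "s"):
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            return w[: -len(suffix)]
    return w


def wildcard_match(prefix, word):
    """Partial string match, equivalent to the query 'prefix*'."""
    return word.lower().startswith(prefix.lower())


print(lemmatize("cars"), naive_stem("cars"))        # car car
print(lemmatize("best"), naive_stem("best"))        # good best
print(lemmatize("running"), naive_stem("running"))  # run runn
print(wildcard_match("car", "carpet"))              # True, although carpet is unrelated to car
```

The stemmer handles the regular plural but misses the irregular forms, and the wildcard query matches an unrelated word, which is the loss of precision described above.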

Monday, 25 July 2011

Ranking

Ranking determines the quality of a match between a query and a candidate document.
Search products consider the following parameters to determine the appropriate rank value:
  1. Freshness - The age of the document at the point in time the query is issued.
  2. Authority - The importance of a document as determined by links from other documents.
  3. Quality - The assigned importance of a document.
  4. Proximity - The distance between, and location of, query terms in the document. When a query contains multiple terms that are not detected as known phrases, the ranking process takes the relative position of the terms and determines the most relevant results based on how close the matching terms are to each other in the document.
  5. Context - Different document fields, for example title, body, description, price, or type, may be assigned different relevance weights. This allows you to specify, for example, that a match in the title field of a document contributes more to the document's ranking value than a match in the body field.
The relevancy of the document is represented by its ranking value.
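
One way to picture how these parameters combine into a single ranking value is a weighted sum. The sketch below is only an illustration under assumed formulas and weights; the `Doc` fields, `FIELD_WEIGHTS`, `WEIGHTS`, and every scoring function are hypothetical and do not reflect the rank profile of any particular product.

```python
import math
from dataclasses import dataclass


@dataclass
class Doc:
    title: str
    body: str
    age_days: float     # freshness input
    inbound_links: int  # authority input
    quality: float      # assigned importance, 0..1


FIELD_WEIGHTS = {"title": 3.0, "body": 1.0}  # context: a title match counts more
WEIGHTS = {"freshness": 1.0, "authority": 0.5, "quality": 1.0, "proximity": 2.0, "context": 1.0}


def context_score(doc, terms):
    score = 0.0
    for field_name, weight in FIELD_WEIGHTS.items():
        tokens = getattr(doc, field_name).lower().split()
        score += weight * sum(tokens.count(t) for t in terms)
    return score


def proximity_score(doc, terms):
    """1 / smallest distance between the first two query terms in the body."""
    tokens = doc.body.lower().split()
    if len(terms) < 2:
        return 0.0
    first = [i for i, tok in enumerate(tokens) if tok == terms[0]]
    second = [i for i, tok in enumerate(tokens) if tok == terms[1]]
    if not first or not second:
        return 0.0
    gap = min(abs(i - j) for i in first for j in second)
    return 1.0 / max(gap, 1)


def rank(doc, query):
    terms = query.lower().split()
    parts = {
        "freshness": 1.0 / (1.0 + doc.age_days / 30.0),
        "authority": math.log1p(doc.inbound_links),
        "quality": doc.quality,
        "proximity": proximity_score(doc, terms),
        "context": context_score(doc, terms),
    }
    return sum(WEIGHTS[name] * value for name, value in parts.items())


doc_a = Doc("Enterprise search basics", "enterprise search finds documents", 10, 5, 0.5)
doc_b = Doc("Misc notes", "search is hard in a large enterprise", 10, 5, 0.5)
ranked = sorted([doc_a, doc_b], key=lambda d: rank(d, "enterprise search"), reverse=True)
print([d.title for d in ranked])  # doc_a ranks first: title match and adjacent terms
```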

Search Relevancy

In search, relevancy is the measure of how well the returned result set addresses the intent of the user's query.

Search products take the following concepts into consideration for effective relevancy:
  1. Linguistics.
  2. Ranking.
  3. Navigation.
  4. Sorting.
In future posts I will address the above concepts and their use for effective relevancy.

Saturday, 23 July 2011

Federated Search

Problem statement
An employee in any organization looks for relevant information. This information could be available through one or more internal search engines and through public portals such as Google, MSN, and Yahoo. To find it, the employee has to search separately through each internal and external search portal, which is very inconvenient.

Solution
Federated search provides a solution to this problem. It gives the user a single search form in which to enter a search query. The query is then submitted simultaneously to all search engines, and the various result sets are combined into a unified result set.
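
The fan-out and merge steps can be sketched as below. The two engine functions are stubs standing in for real search back ends, and the URLs, scores, and the `federated_search` function are all made up for this example. It also assumes that scores from different engines are directly comparable, which is rarely true in practice and is one reason the combining rules need care.

```python
from concurrent.futures import ThreadPoolExecutor


def intranet_engine(query):
    # Stub for an internal search engine; a real one would call its query API.
    return [
        {"url": "https://intranet.example/policy", "title": "Travel policy", "score": 0.9},
        {"url": "https://example.com/faq", "title": "FAQ", "score": 0.4},
    ]


def web_engine(query):
    # Stub for a public search engine.
    return [
        {"url": "https://example.com/faq", "title": "FAQ", "score": 0.7},
        {"url": "https://example.com/blog", "title": "Blog", "score": 0.5},
    ]


def federated_search(query, engines):
    """Send the query to every engine at once and merge the result sets."""
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        result_sets = list(pool.map(lambda engine: engine(query), engines))

    best = {}  # url -> best-scoring hit, which also removes duplicates
    for results in result_sets:
        for hit in results:
            current = best.get(hit["url"])
            if current is None or hit["score"] > current["score"]:
                best[hit["url"]] = hit
    return sorted(best.values(), key=lambda hit: hit["score"], reverse=True)


merged = federated_search("travel policy", [intranet_engine, web_engine])
for hit in merged:
    print(hit["score"], hit["url"])
```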




Key issues to address when implementing federated search

  1. Organizational rules for combining results into a unified result set.
  2. Security - Mapping security across all the internal enterprise search engines.
  3. Duplicates detection and removal
  4. Managing Facets
  5. Taxonomy

Friday, 22 July 2011

Enterprise Search

70%-80% of an organization's data is unstructured, for example Word documents, spreadsheets, email, and web pages, to name a few. This content may be located on file servers, content management systems, or websites. The remainder resides in structured data sources such as databases.

Enterprise search aggregates data from all the unstructured and structured sources and makes relevant information findable within an organization in a unified manner.

The enterprise search product implemented for one of our clients features the following:
  1. Integrates information from Web, internal, and external sources – providing a 360° view of market conditions.
  2. Monitors events and information – alerting analysts and decision-makers to the latest, actionable intelligence.
  3. Performs rapid search, discovery and advanced content analysis.
  4. Enables Web 2.0 style collaboration with enterprise level features and security
General issues to be addressed while implementing enterprise search are:
  1. Appropriate visualizations powered by facets
  2. Organization Taxonomy
  3. Security   
  4. Multiple systems
  5. Timezones
  6. Content sources
Hence, appropriate data analysis of all the content sources is required to facilitate useful information retrieval.
Search products available in the market include Microsoft FAST ESP, Autonomy IDOL, Endeca, Attivio, Google Search Appliance, and Solr.

Of the enterprise search products listed above, I have worked with and consulted on Autonomy IDOL, Microsoft FAST ESP, Google Search Appliance, Attivio, and Solr.