Friday, 1 May 2009

Oracle Text Indexing

Text Indexes


Oracle Text Indexing


There is a mechanism in the Oracle database to do more than just index a column. These are called domain indexes, one of which is a text index. (Others include Spatial and XML).



Limitations we Know About.


The limitations of the ordinary character column index are all too apparent to many database developers. The current indexes allow for a match from the first position in the column (surname LIKE ‘SMI%’). The match will find all surnames that start with the initial letters ‘SMI’ and have any number of characters following. To gain any benefit from the index the match must be exact in the initial few characters, including case and spacing. To find a match for a column where only the word ending is known (surname LIKE ‘%ING’ or the word does not start at the beginning position in the column (name LIKE ‘%JOHN%’).



Why?


Context indexes can search millions of rows of documents faster than most LIKE operations can run over 10000 rows of character data. You may (will) notice several new tables (created by and maintained by Oracle) for each text index, these hold the reverse tree blobs used to match tokens (search terms) to documents (columns, URLs, files etc....).



The user CTXSYS maintains all the context operations and metadata.



Starting with Context Indexes


At some stage it becomes necessary to search for a word in a column of text. This is where a domain index shows its strengths.



Superficially the index is created as would a normal index on a text column but includes the following :



CREATE INDEX idx_test_index ON mytab(mycol)
INDEXTYPE IS CTXSYS.CONTEXT;



There are many options available here to do with the storage and treatment of the column, but beyond the scope of this paper.



So there is a text index and what has changed.


For a start the way the data is queried has changed.


SELECT *
FROM mytab
WHERE contains (mycol, 'dummy') > 0;



This will return all rows from mytab where the column mycol has the word dummy anywhere contained anywhere along its length.



Example

MYTAB : MYCOL
This is the first row for test
This is the second row for test



Searching for rows with the word ‘this’ in would look something like:



SELECT *
FROM mytab
WHERE contains (mycol, 'this') > 0;



And the result...... both rows are returned (case is ignored, though there is the option in the contains clause to ask for a case match).



What if you need to narrow things down more and have more than one word to use in the search but you are unsure of their order or if there are words between your search terms?



SELECT *
FROM mytab
WHERE contains (mycol, 'this and second') > 0;




And the result would be the second row.



This is the second row for test



Keywords


From the above query you can see that keyword can be included in the search string to and the list of words is extensive but the most common words are

· AND

· OR

· NOT

· NEAR

· FUZZY

· %

· ?



If you are searching for a keyword you can wrap the word in curly braces {and} and it will not be parsed as an instruction, so you could have a search string that looks like



‘{cats} and {and} and {dogs}’




Scoring


Scoring helps to find the most relevant document. This is similar to the mechanism used by Google etc. to return the most relevant results.


It works by ‘counting’ the number of ‘hits’ in each document and ranking the documents in order of hits, unfortunately this means that – like Google – you may not get the results you want merely the results with the most matches of search terms.



Progressive Relaxation
You may be asked to return results, regardless of the accuracy, and this mechanism allows you to ensure that the ‘bad’ results appear last.

The search term is progressively relaxed – according to rules you set up – and the results gathered and returned according to the level of match they achieved.


SELECT /*+FIRST_ROWS(10)*/
*
FROM my_search_table
WHERE contains
(searchcolumn,
'odeon cinematransform((TOKENS, "", "", " "))transform((TOKENS,"", "", " and "))transform((TOKENS, "", "", " , "))transform((TOKENS,"$", "", " and "))'
,1) > 0;



In the query above we are searching for odeon cinema and the rules are

Search for the phrase ‘odeon cinema’
Search for the words ‘odeon’ and ‘cinema’ anywhere in the document.
Search for the words ‘odeon’ or ‘cinema’ anywhere in the document.
Search for words ending like ‘odeon’ and ‘cinema’


The query can be rewritten as


SELECT /*+FIRST_ROWS(10)*/
*
FROM my_search_table se
WHERE contains
(searchcolumn,
'odeon cinemaodeon and cinemaodeon, cinema$odeon and $cinema',1
) > 0;



Which may or may not be more apparent.



Advanced features:


Multi table queries:

So far we have only looked at single column indexes, but it is possible to have one index over multiple columns in the same table, multiple columns across multiple tables and master detail table / columns.



Ignoring words

You can create a personalised stoplist where you do not want the word ‘the’ to be indexed. An index created using this stoplist would return no results if ‘the’ was the only search term.



Thesaurus

You can create a thesaurus that holds all the abbreviations / expansions of words e.g. st – street , rd – road etc..... You would store either abbreviation or expansion and can query on either.



Fuzzy matching

In much the same way that SOUNDEX works you can ask for fuzzy matching either accepting the defaults by using the ? (question mark) operator, or a finer control using the fuzzy operator where you specify the ‘degree’ of fuzziness you require and how many steps removed from the original you are willing to look.



These are probably the most commonly used features of text searching, but there are many more, but as with all domain indexes can be tricky (infuriating) to implement.

No comments: