Here is an update on the work being done to integrate OpenFTS with OpenACS:
OpenFTS is a PostgreSQL-based search engine that makes use of the GiST interface available in PostgreSQL. It provides online full-text indexing of data and relevance ranking for database searching. It can find documents containing terms with the same linguistic root as the specified word, and it can also be used for indexing and searching multilingual and non-text documents. Currently, OpenFTS is implemented as a collection of Perl scripts.
We are working with Oleg Bartunov and Teodor Sigaev (XWare) to help them open source their search tools under the GPL. Dan is currently porting the Perl scripts to Tcl, and he is going to move some of the functionality into an AOLserver module.
OpenFTS uses PostgreSQL as a database backend where documents are stored as arrays of integers. The index access structure for the array of integers is constructed as an RD-Tree which is implemented using the GiST interface that is available in PostgreSQL. The RD-Tree is a variant of the R-Tree, a popular access method for spatial data. RD stands for "Russian Doll", which describes the transitive containment relation that is fundamental to the tree structure. The RD-Tree data structure implementation provides three predicates between sets: superset, subset, and overlap.
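To make the three set predicates concrete, here is a minimal Python sketch of what they mean for documents stored as arrays of integers. It only illustrates the semantics using plain Python sets; it does not model the RD-Tree or the GiST index, and the function names and sample IDs are made up for illustration.

```python
def superset(a, b):
    """True if every ID in b also appears in a (a contains b)."""
    return set(a) >= set(b)


def subset(a, b):
    """True if every ID in a also appears in b (a is contained in b)."""
    return set(a) <= set(b)


def overlap(a, b):
    """True if a and b share at least one ID."""
    return not set(a).isdisjoint(b)


# Hypothetical lexeme IDs for one document and one query.
doc = [3, 17, 42, 108]
query = [17, 42]

print(superset(doc, query))      # True: the document contains all query IDs
print(subset(query, doc))        # True: same relation seen from the query side
print(overlap(doc, [42, 999]))   # True: at least one ID in common
print(overlap(doc, [1, 2]))      # False: nothing in common
```

A search for documents containing all query terms corresponds to the superset test; a search for documents containing any of the terms corresponds to the overlap test.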
For indexing, a parser reads the document and converts it into a stream of lexemes. Morphology or stemming is then applied to get the base form of each lexeme, and finally an algorithm calculates an ID for each base form. The resulting array of integers is stored in the database.
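The indexing steps above can be sketched roughly as follows. This is an illustration only, not the OpenFTS code: the tokenizer, the toy suffix-stripping stemmer, and the use of CRC32 as the ID function are all assumptions chosen to keep the example short, and sorting and de-duplicating the array is a choice of this sketch.

```python
import re
import zlib


def parse(text):
    """Split a document into lowercase alphanumeric lexemes (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(lexeme):
    """Toy stemmer: strip a few English suffixes. A stand-in for real morphology."""
    for suffix in ("ing", "es", "s"):
        if lexeme.endswith(suffix) and len(lexeme) - len(suffix) >= 3:
            return lexeme[: -len(suffix)]
    return lexeme


def lexeme_id(base):
    """Map a base form to an integer ID. CRC32 is an assumption, not OpenFTS's algorithm."""
    return zlib.crc32(base.encode("utf-8"))


def index_document(text):
    """Return the sorted array of distinct integer IDs for a document."""
    return sorted({lexeme_id(stem(lexeme)) for lexeme in parse(text)})


# "Searching" and "searches" reduce to the same base form, so they share one ID.
print(index_document("Searching searches") == index_document("search"))
```

Because a search query goes through the same parse, stem, and ID steps, a query for "search" matches a document that only contains "searching".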
When a search query is received, the parser converts it into a stream of lexemes and morphology or stemming is applied to get the base form. Then, each lexeme is assigned an integer ID and finally, SQL queries are generated and executed.
A prominent feature of OpenFTS is the ability to rank documents according to the proximity of the query words to each other. This is accomplished by maintaining coordinate (position) information for the lexemes of each document. For example, if the query is "full text search", documents containing the phrase "full text search" will be ranked higher than documents where the words "full", "text", and "search" occur in different places.
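One simple way to picture proximity ranking is to score each document by the smallest span of word positions that covers all of the query words. The Python sketch below does exactly that. It is not the ranking function OpenFTS uses; it is a swapped-in illustration with hypothetical names, it ignores word order, and it works on raw words rather than stemmed lexeme IDs.

```python
from itertools import product


def positions(words):
    """Record the coordinates (word offsets) at which each word occurs."""
    pos = {}
    for i, word in enumerate(words):
        pos.setdefault(word, []).append(i)
    return pos


def min_span(doc_words, query_words):
    """Smallest distance between the first and last query word, or None if a word is missing."""
    pos = positions(doc_words)
    if any(q not in pos for q in query_words):
        return None
    combos = product(*(pos[q] for q in query_words))
    return min(max(combo) - min(combo) for combo in combos)


def rank(docs, query):
    """Return names of matching documents, tightest span first."""
    query_words = query.split()
    scored = []
    for name, text in docs.items():
        span = min_span(text.split(), query_words)
        if span is not None:
            scored.append((span, name))
    return [name for span, name in sorted(scored)]


docs = {
    "a": "full text search engine",
    "b": "full of text that users search",
    "c": "text only",
}
print(rank(docs, "full text search"))  # ['a', 'b']
```

Document "a" contains the words as a phrase (span 2), so it ranks above "b" (span 5); "c" lacks some of the query words and is not returned.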
Information about GiST support in PostgreSQL can be found here (http://www.sai.msu.su/~megera/postgres/gist/).