Forum OpenACS Development: Full Text Search

1: Full Text Search

Posted by Neophytos Demetriou on 06/25/01 12:17 PM

Here is an update on the work being done to integrate OpenFTS with OpenACS:

OpenFTS is a PostgreSQL-based search engine that makes use of the GiST interface available in PostgreSQL. It provides online full text indexing of data and relevance ranking for database searching. It can be used to find documents containing terms with the same linguistic root as the specified word and it can also be used for indexing/searching of multi-lingual and non-text documents. Currently, OpenFTS is implemented as a collection of PERL-scripts.

We are working with Oleg Bartunov and Teodor Sigaev (XWare) to help them open source their search tools under the GPL license. Dan, is currently porting the PERL-scripts into TCL and he is going to move some of the functionality into an aolserver module.

OpenFTS uses PostgreSQL as a database backend where documents are stored as arrays of integers. The index access structure for the array of integers is constructed as an RD-Tree which is implemented using the GiST interface that is available in PostgreSQL. The RD-Tree is a variant of the R-Tree, a popular access method for spatial data. RD stands for "Russian Doll", which describes the transitive containment relation that is fundamental to the tree structure. The RD-Tree data structure implementation provides three predicates between sets: superset, subset, and overlap.

For indexing, a parser is used that reads the document and converts it into a stream of lexemes. Then, morphology or stemming is applied in order to get the base form and finally, an algorithm calculates an ID for each of the lexemes. The resulting array of integers is stored into the database.

When a search query is received, the parser converts it into a stream of lexemes and morphology or stemming is applied to get the base form. Then, each lexeme is assigned an integer ID and finally, SQL queries are generated and executed.

A prominent feature of OpenFTS is the ability to rank documents according to proximity between the words of the search query -- this is accomplished by maintaining coordinate information of the lexemes of each document. For example, if the query is "full text search", documents containing the phrase "full text search" will be ranked higher than documents where words "full", "text", "search" occur in different places.

Information about GiST support in PostgreSQL can be found here (http://www.sai.msu.su/~megera/postgres/gist/).

2: Response to Full Text Search (response to 1)

Posted by Rafael Calvo on 06/26/01 07:05 AM

Neophytos,
Thanks for the update.
I would like to read a bit more on the text indexing procedure but the papers seem to be more on general indexing issues. Do you use any weighting scheme? so a term that appears often is heavier than otherwise?

For indexing, a parser is used that reads the document and converts it into a stream of lexemes. Then, morphology or stemming is applied in order to get the base form and finally, an algorithm calculates an ID for each of the lexemes. The resulting array of integers is stored into the database.

which indexes? the ID or the weights (ocurrences?) of the terms? Do you use your own stemming algorithms? any stopwords?
cheers

3: Response to Full Text Search (response to 1)

Posted by Neophytos Demetriou on 06/26/01 10:20 AM

OpenFTS stores each document as an array of integers (lexeme IDs). The index access structure of the array of integers is constructed as an RD-Tree which is implemented using the GiST interface that is available in PostgreSQL

Additionally, OpenFTS maintains 10 (this is configurable) indexing tables where it stores the frequency and the position of each of the lexemes in a document. The frequency of each lexeme is calculated as:

         occurences of the lexeme in the document
freq = ----------------------------------------------
            number of all lexemes in the document

The position is counted in number of lexemes from the beginning of the document. The position field can have negative values that weight more in ranking of the results.

OpenFTS supports stopwords and stemming. The default method for stemming uses Porter's algorithm.

4: What about content stored in the filesystem rather than in PG? (response to 1)

Posted by Jonathan Marsden on 07/04/01 01:50 PM

This sounds really useful, but OpenACS 4.x seems to be heading towards
increasing use of the filesystem for some kinds of content.

Has anyone given thought yet to how to full text search any such 'database-external' content as well as the PG content?

If not, how optional is the use of filesystem storage for content under the new scheme?

5: Response to Full Text Search (response to 1)

Posted by Neophytos Demetriou on 07/04/01 02:09 PM

OpenFTS already supports indexing of content stored in the filesystem as well as content stored in the database.

6: Response to Full Text Search (response to 1)

Posted by Don Baccus on 07/04/01 05:37 PM

And the content repository should be the hook for searching dynamic content regardless of where it's stored, DB or filesystem. There's no reason why static content couldn't be mapped via the content repository, too, now that it knows about stuff stored in the filesystem ...

7: Response to Full Text Search (response to 1)

Posted by Jonathan Marsden on 07/04/01 07:20 PM

> There's no reason why static content couldn't be mapped via the
> content repository, too, now that it knows about stuff stored in
> the filesystem ...

Sounds good to me :-)

OK. I have a fairly sizeable project for a non-profit to bid on, for which (from initial reading of the RFP) OpenACS 4.x using CMS and acs-workflow will be a good basis for - if I can get good searching to work too.

Question #1: Any idea when search will become usable (beta level OK) using OpenFTS? Would adding some (TCL? Perl? other?) developer hours help speed that up, or is the issue mostly licencing-related at this stage?

Question #2: How easy/hard/trivial/works already (!) is it for searches to only return results that a user actually has permission to view? And how horribly does doing that extra check on each result hurt in terms of perceived search performance?

Question #3: Are we really going to have a Beta1 of OpenACS 4.x on 01 September 2001? And do we expect it will have OpenFTS searching included in it? What can I usefully do to make that more likely?

I don't want to have to recommend ACS 4.2 and Intermedia, if I can realistically get a working system using OpenACS 4.x and OpenFTS!

8: Response to Full Text Search (response to 1)

Posted by mike blair on 07/05/01 07:56 PM

Can you point me to the perl or tcl scripts that actually index the text into the gist tables?

Thanks,

Mike

9: Response to Full Text Search (response to 1)

Posted by Carl Coryell-Martin on 07/05/01 07:57 PM

Is the code for OpenFTS available anywhere, I would like to play with it and perhaps help with the porting effort.

Thanks,

10: Response to Full Text Search (response to 1)

Posted by Neophytos Demetriou on 07/05/01 08:47 PM

OpenFTS code is not available yet. We are moving forward to release OpenFTS under the GPL license in the next few days. We have not decided on a date yet but it will be really soon (hopefully, end of next week). OpenFTS is already being used to index fts.postgresql.org which serves more than 160,000 messages and www.rambler.ru which serves more than 500,000 documents.

For an OpenACS search solution, we are going to port the site-wide-search package and replace Intermedia functionality with OpenFTS (Dan's TCL scripts). Dan has converted most of perl to tcl and he has integrated in the nstcl package, so that we have an aolserver type db interface to talk to pg without having to involve aolserver. We have written an OpenFTS primer which explains most of its functionality and contains other interesting information (e.g. how to write your own parser for OpenFTS, if the default parser does not suit your needs). Hopefully, the OpenACS search solution based on OpenFTS will be available by the end of July (alpha).

Thanks everyone for offering their help. As soon as OpenFTS is released you can contribute your own parsers or dictionaries, write documentation, etc. We might also need help with the integration of OpenACS with OpenFTS. I will post more details in this thread in the next few days.

11: Response to Full Text Search (response to 1)

Posted by Neophytos Demetriou on 07/10/01 09:39 AM

The ranking function uses methodology proposed by Andrew Kovalenko and Nickolay Kharin. Many thanks to Andrew and Nickolay who gave their permission to use their methodology in the ranking function of OpenFTS. We have registered OpenFTS at SourceForge and in less than 72 hours will be available for public use under the GPL license. An announcement will follow.

12: Response to Full Text Search (response to 1)

Posted by Jerry Asher on 07/11/01 06:58 PM

Neophytos,

OpenFTS looks promising, but can you tell me,

What do you mean by PostgreSQL-based and what is the domain of OpenFTS? In other words, is OpenFTS "limited" to searching only Postgres databases? If not, what is the effort required to enable OpenFTS to crawl other databases?

I ask as it is my understanding that OpenACS is intended to be somewhat database neutral, and I am trying to understand when an OpenFTS solution is appropriate and when it is not.

Thanks,

13: Response to Full Text Search (response to 1)

Posted by Don Baccus on 07/11/01 07:39 PM

If you'll read Neophytos' post which started this thread, you'll see that PG is used to store the indexing data used by the full-text search, and that it uses a sophisticated statistics-based indexing methodology known as GiST, which is available in PG but AFAIK no other
popular RDBMS.

The text can come from anywhere, but given that the guts require Postgres one can safely presume it will be used universally by those using the PG version of OpenACS and not at all by those using another RDBMS.

14: Response to Full Text Search (response to 1)

Posted by Neophytos Demetriou on 07/11/01 07:42 PM

Currently, OpenFTS is "limited" to indexing/searching PostgreSQL databases. In other words OpenFTS uses PostgreSQL specific features/packages -- for example, the intarray package under contrib. At this point I can not say how much effort would be needed to port OpenFTS to other RDBMS. However, I can say that most of the code can be reused as it is. We have written a primer which explains how OpenFTS works and includes links to interesting/related research papers. It should be available soon at SourceForge. The project site is now public (OpenFTS) but we have not upload the code yet.

15: Response to Full Text Search (response to 1)

Posted by Neophytos Demetriou on 08/09/01 06:55 PM

OpenFTS-Perl is ready for release. Here is Oleg's post to pgsql-hackers:



Hi,

I'm about to announce a release of OpenFTS search engine.

Short blurb:

  OpenFTS (Open Source Full Text Search engine)
  is an advanced PostgreSQL-based search engine
  that provides online indexing of data and relevance
  ranking for database searching. Close integration
  with database allows use of metadata to restrict
  search results.

This is actually what we use at fts.postgresql.org and
several other sites. Perl version is available for download,
TCL version (with AOL server support) will be available soon -
actually it just needs to wrap a release.

The OpenFTS project web site - http://openfts.sourceforge.net/

OpenFTS team:
  Oleg Bartunov       - Project Manager
  Teodor Sigaev       - Principal Developer
  Daniel Wickstrom    - Developer of TCL Version
  Neophytos Demetriou - Documentation Writer

Copyright:

  OpenFTS is Copyright 2000-2001 XWare and
  licensed under the GNU General Public License,
  version 2 (June 1991). This means you can use it
  and modify it in any way you want. If you choose to
  redistribute OpenFTS, you must do so under the
  terms of the GNU license.

        Regards,
                Oleg