Forum OpenACS Development: Response to OpenACS wish-list

Posted by Louis Zirkel on
Don, in response to your last point about Word documents: swish++ comes with a program called extract, which can be used to index binary documents such as Word files. From the README file:

  6. Index non-text files such as Microsoft Office documents
     A separate text-extraction utility "extract" is included to
     assist in indexing non-text files. It is essentially a
     more sophisticated version of the Unix strings(1) command,
     but employs the same word-determination heuristics used for
     indexing.

It's not the most elegant solution, but it seems workable to me. You could also use something such as antiword to convert a Word document to plain text and then run the result through the normal text indexing features.
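
To make the two approaches concrete, here is a rough Python sketch. The first function is a strings(1)-style extractor; it is only an illustration of the idea and not swish++'s actual extract code, and the word rule (three or more letters, lowercased) is my own assumption, not the heuristics swish++ uses. The second function just shells out to antiword, which writes the converted text to standard output; it assumes antiword is installed and on the PATH. The names extract_words and word_to_text are made up for this example.

```python
import re
import subprocess

# Runs of 4+ printable ASCII bytes; 4 is the default minimum length in strings(1).
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")

# Assumed word rule for illustration only: a letter followed by 2+ letters
# or apostrophes. swish++'s real heuristics are more involved.
WORD = re.compile(r"[A-Za-z][A-Za-z']{2,}")


def extract_words(data: bytes) -> list[str]:
    """Pull indexable words out of arbitrary binary data, strings(1)-style."""
    words = []
    for run in PRINTABLE_RUN.findall(data):
        # Each run is pure printable ASCII, so decoding cannot fail.
        words.extend(w.lower() for w in WORD.findall(run.decode("ascii")))
    return words


def word_to_text(path: str) -> str:
    """Convert a Word document to plain text by calling antiword.

    Assumes antiword is installed; raises CalledProcessError if it fails.
    """
    result = subprocess.run(
        ["antiword", path], capture_output=True, text=True, check=True
    )
    return result.stdout
```

For example, extract_words(b"\x00\x01Hello world\x00\xff\x02ab\x00indexing Word files\x03") gives ['hello', 'world', 'indexing', 'word', 'files']; the two-byte run "ab" is dropped because it is shorter than four characters. The text returned by word_to_text could be written to a temporary file and handed to the indexer like any other text document.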