Forum OpenACS Development: Re: document to text conversion in search indexer

Posted by Roger Metcalf on
Have you come up with a solution to the problem of re-converting documents during display? I need document conversion for indexing, especially for MS Word and PDF files. Can you share your search_content_filter for doing these conversions?
Posted by Tilmann Singer on
No solution for that problem yet, sorry. As mentioned in the discussion above, it would be necessary to store a text version alongside the binary somewhere, and the project I would have needed it for went another route before that was sorted out.
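
A minimal sketch of the "text version next to the binary" idea, under stated assumptions: the helper name cached_text, the cache directory, and the use of something like a revision id as the key are all hypothetical and not part of OpenACS. The conversion runs only on a cache miss; later requests (such as an abstract display) read the stored text.

```tcl
# Hypothetical helper, not part of OpenACS: return the text stored for $key,
# running $convert_script (a Tcl script whose result is the text) only when
# no stored copy exists yet. $key is assumed to be safe as a file name.
proc cached_text {cache_dir key convert_script} {
    set path [file join $cache_dir "$key.txt"]
    if {![file exists $path]} {
        # Cache miss: convert once, in the caller's scope, and store the result.
        set txt [uplevel 1 $convert_script]
        file mkdir $cache_dir
        set fp [open $path w]
        fconfigure $fp -encoding utf-8
        puts -nonewline $fp $txt
        close $fp
        return $txt
    }
    # Cache hit: read the stored text; no conversion program is started.
    set fp [open $path r]
    fconfigure $fp -encoding utf-8
    set txt [read $fp]
    close $fp
    return $txt
}
```

If the key changes whenever the binary changes (for example one key per revision), stale text is never served and no explicit invalidation is needed. Whether a file cache or a database column is the better place for the text is left open here.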

Below is my version of the search_content_filter, which does PDF and Word conversion, but it is only in a testing state. Among other things, the executables for the conversion need to be parameterized. Also, you obviously need to switch off the display of abstracts in search results; otherwise the conversion is triggered on every results page for each PDF and Word document found.

ad_proc search_content_filter {
    _txt
    _data
    mime
} {
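    Converts the content in the caller's data variable to plain text,
    for the MIME types handled below, and stores it in the caller's
    txt variable.

    @author Neophytos Demetriou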
} {
    upvar $_txt txt
    upvar $_data data

    ns_log notice "!> search_content_filter +++ $mime"

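    # Per MIME type: extension for the temp file and the conversion command.
    # The commands are evaluated later, once tmp_orig and tmp_txt are set.
    set file_ending(application/msword) doc
    set conversion_code(application/msword) {exec catdoc -d utf-8 $tmp_orig > $tmp_txt}

    set file_ending(application/pdf) pdf
    # Name the output file explicitly instead of relying on pdftotext's default.
    set conversion_code(application/pdf) {exec pdftotext -enc UTF-8 $tmp_orig $tmp_txt}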

    switch $mime {
        {text/plain} {
            set txt $data
        }
        {text/html} {
            set txt $data
        }
        {application/pdf} -
        {application/msword} {
            # convert to text

            # get tempfile name
            set tmpnam [ns_tmpnam]
            set tmp_orig "$tmpnam.$file_ending($mime)"
            set tmp_txt "$tmpnam.txt"

            # write original data to tmpfile, byte for byte
            # (binary translation, and no trailing newline from puts)
            set tmp_orig_fp [open $tmp_orig w]
            fconfigure $tmp_orig_fp -translation binary
            puts -nonewline $tmp_orig_fp $data
            close $tmp_orig_fp

            # call conversion program
            eval $conversion_code($mime)

            # read temporary text file
            set tmp_txt_fp [open $tmp_txt]
            fconfigure $tmp_txt_fp -encoding utf-8
            set txt [read $tmp_txt_fp]
            close $tmp_txt_fp

            # delete tmp files
            file delete $tmp_orig
            file delete $tmp_txt
        }
    }
}
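
One possible way to parameterize the conversion executables, as a sketch only: the converters table, the %in% and %out% placeholders, and the proc name converter_command are inventions of this example, and it assumes Tcl 8.5 or later (dict, {*}). In a real installation the table would presumably come from configuration, for instance package parameters, rather than being hard-coded.

```tcl
# Hypothetical configuration: one entry per MIME type with the temp-file
# extension and the command as a list of words. %in% and %out% are
# placeholders invented for this sketch.
set converters {
    application/msword {ext doc cmd {catdoc -d utf-8 %in% > %out%}}
    application/pdf    {ext pdf cmd {pdftotext -enc UTF-8 %in% %out%}}
}

# Build the argument list for exec, or return an empty list when no
# converter is configured for $mime.
proc converter_command {converters mime in out} {
    if {![dict exists $converters $mime]} {
        return {}
    }
    set cmd {}
    foreach word [dict get $converters $mime cmd] {
        # Substitute file names word by word, so names containing spaces
        # stay a single argument.
        lappend cmd [string map [list %in% $in %out% $out] $word]
    }
    return $cmd
}

# Show the command that would be run; nothing is executed here.
puts [converter_command $converters application/pdf /tmp/example.pdf /tmp/example.txt]
```

Inside the filter, the result could then be run with `exec {*}$cmd` in place of the `eval` call, which avoids evaluating a string that contains file names. The extension for the temp file is available as `dict get $converters $mime ext`.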