Forum OpenACS Development: Searching dynamic content by date range. Non-sequitur?

1: Searching dynamic content by date range. Non-sequitur?

Posted by Jerry Asher on 06/23/01 10:13 PM

So I'm playing with htDig and other site indexing tools, and have had pretty good results (http://www.theashergroup.com/demos/openacs/).

But I face a basic quandary and I'd like your help. Part of the quandary is an AOLserver quirk, and part is fundamental to sites that offer dynamic content. But on the thought that what I am doing may be useful to others in the ACS environment, I'd like to hear your thoughts on search metaphors, design, and APIs.

I would like to offer visitors the ability to search for content based on the date of the content. That's a completely understandable and useful capability when searching your garage, your tax records, or a static site: I know something happened last year between April Fools' Day and Guy Fawkes Day. What was it?

It doesn't make as much sense on a site that's more application than content (or it's not nearly as easy, or even doable, with a typical site index engine): find me the Amazon quote as of last April. Nice idea, but it can't be done by most site indexers, which are geared towards capturing and presenting the latest snapshot of a site.

It does make sense for certain ACS elements: find me the thread on javascript security holes that occurred around Halloween. The problem is that this is hard to do with the site index tools I've looked at so far when combined with our current bboard metaphor, and it's hard to do in general within AOLserver.

The site index tools I've seen (htDig mostly) can only deal with one date: the last modified date of the document. Yet a page of dynamic content is an assembly of pieces, each of which has a different "last modified" date, and most of those dates are unknown.

AOLserver does the cheap and mostly correct thing: for adp pages (tcl pages too?) it reports the current time and date as the last modified time and date, while for html pages it returns the actual last modified time and date as recorded in the file system.

Our bboard metaphor makes it hard, because, well, what is the date of a thread? Is it the date of the first post, or the date of the last post? It's really a date range.

So what's the answer?

Is the search by date metaphor just meaningless in a world of dynamic content?

Should we formally support an ACS interface so that each page can set its last modified date and time if it needs/wants to?

In an application like bboard, what should the last modified date be set to?

Should content assemblies like bboard have a special search indexing engine mode (presumably useful to more than just htDig but unknown) that can expose individual elements when they each have a meaningful last modified date time that would be of interest to folks searching a site?

What kind of an interface would you like to see?
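
To make the "each page sets its own last modified date" question concrete, here is a minimal Python sketch of one possible rule: treat an assembled page as being as new as its newest piece and format that as an HTTP Last-Modified value. The function name `last_modified_header` is hypothetical, and this is not an existing ACS or AOLserver call.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def last_modified_header(component_dates):
    """Return an HTTP Last-Modified value for a page built from several pieces.

    component_dates: timezone-aware datetimes, one per piece of content.
    Assumption: the page counts as modified when its newest piece was.
    """
    newest = max(component_dates)
    # usegmt=True produces the "GMT" suffix and requires a UTC datetime.
    return format_datetime(newest.astimezone(timezone.utc), usegmt=True)


# Two postings shown on the same (hypothetical) thread page.
posts = [
    datetime(2000, 10, 30, 9, 0, tzinfo=timezone.utc),
    datetime(2000, 11, 2, 18, 30, tzinfo=timezone.utc),
]
header = last_modified_header(posts)
```

A page that cannot name the dates of its pieces would simply not call such a function and keep the current-time behavior.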

2: Response to Searching dynamic content by date range. Non-sequitur? (response to 1)

Posted by Dave Bauer on 06/23/01 10:23 PM

For bboard it depends on what you are searching. For an entire thread the last modified date is the date the last post was entered. If you are searching individual postings, each one has a date. So there are two ways to search a bboard like this. Both are probably valid in different situations. This is why a search should probably search the database.
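
The two ways of searching can be sketched against a toy table. This is only an illustration in Python with SQLite: the `posts` table and its columns are assumptions, not the real bboard schema, and the thread query treats a thread as the range from its first post to its last post (the last post date is the `MAX(posted)` column).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (post_id INTEGER PRIMARY KEY, thread_id INTEGER,"
    " posted TEXT, body TEXT)"
)
conn.executemany(
    "INSERT INTO posts (thread_id, posted, body) VALUES (?, ?, ?)",
    [
        (1, "2000-10-29", "javascript security hole?"),
        (1, "2000-11-03", "patch is available"),
        (2, "2001-04-01", "unrelated thread"),
    ],
)


def threads_active_between(conn, start, end):
    """Threads whose first-to-last post range overlaps [start, end]."""
    return conn.execute(
        """
        SELECT thread_id, MIN(posted), MAX(posted)
        FROM posts
        GROUP BY thread_id
        HAVING MIN(posted) <= ? AND MAX(posted) >= ?
        ORDER BY thread_id
        """,
        (end, start),
    ).fetchall()


def posts_between(conn, start, end):
    """Individual postings dated inside [start, end]."""
    return conn.execute(
        "SELECT post_id, posted FROM posts"
        " WHERE posted BETWEEN ? AND ? ORDER BY post_id",
        (start, end),
    ).fetchall()


# ISO dates stored as text compare correctly as strings.
thread_hits = threads_active_between(conn, "2000-10-30", "2000-11-01")
post_hits = posts_between(conn, "2000-10-30", "2000-11-01")
```

Here the thread search finds thread 1 even though no single posting falls inside the requested days, which is the case where the two kinds of search give different answers.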

Perhaps we need each module to expose its content to a search in a certain format, and each module can decide what and how to offer that content. So we need a two-sided search API: one API so the search mechanism can search the database, and another API for the module to offer up content to the indexer. Unfortunately I have no idea how to actually build this.
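
One rough shape for such a two-sided API, sketched in Python. Everything here is hypothetical (`SearchItem`, `SearchableModule`, `SearchIndex`, `BboardModule`, and the URL format); it keeps the index in memory and only shows where the two sides meet.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional, Protocol


@dataclass(frozen=True)
class SearchItem:
    """One unit of content a module chooses to expose to the indexer."""
    module: str
    title: str
    text: str
    modified: date
    url: str


class SearchableModule(Protocol):
    """Side one: what a module implements to offer up its content."""
    def items_for_index(self) -> Iterable[SearchItem]: ...


class SearchIndex:
    """Side two: what callers use to search whatever the modules exposed."""

    def __init__(self) -> None:
        self._items: List[SearchItem] = []

    def reindex(self, modules: Iterable[SearchableModule]) -> None:
        self._items = [item for m in modules for item in m.items_for_index()]

    def search(self, word: str, start: Optional[date] = None,
               end: Optional[date] = None) -> List[SearchItem]:
        word = word.lower()
        return [
            item for item in self._items
            if word in item.text.lower()
            and (start is None or item.modified >= start)
            and (end is None or item.modified <= end)
        ]


class BboardModule:
    """A toy module that exposes each posting as its own item."""

    def __init__(self, postings):
        self._postings = postings  # (msg_id, subject, body, posted) tuples

    def items_for_index(self) -> Iterable[SearchItem]:
        for msg_id, subject, body, posted in self._postings:
            # The URL pattern is made up for this example.
            yield SearchItem("bboard", subject, body, posted,
                             f"/bboard/message?msg_id={msg_id}")


index = SearchIndex()
index.reindex([BboardModule([
    (1, "JS holes", "javascript security hole found", date(2000, 10, 30)),
    (2, "Old news", "javascript tips", date(1999, 1, 5)),
])])
hits = index.search("javascript", start=date(2000, 10, 1),
                    end=date(2000, 11, 30))
```

Because each module decides what an item is, a bboard could just as well yield one item per thread instead of one per posting.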

3: Response to Searching dynamic content by date range. Non-sequitur? (response to 1)

Posted by Janine Ohmer on 06/24/01 03:43 PM

aD's design for site-wide search simplifies things in what I feel
is a useful way. They define triggers on each table that they want
to have searched. Any time data is inserted, updated or deleted
from one of those tables, these triggers perform the same action
on the search table. So when it comes time to search, you only
have to do it in one (indexed) table. There are support tables
which supply information such as how to construct links for hits
on data from a particular table.
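
The trigger idea can be sketched as follows. This is not aD's actual schema: the table, column, and trigger names are invented for the example, and it uses SQLite through Python only so that it runs anywhere.

```python
import sqlite3

SCHEMA = """
CREATE TABLE bboard (
    msg_id  INTEGER PRIMARY KEY,
    subject TEXT,
    body    TEXT,
    posted  TEXT
);
-- The single table that searches run against.
CREATE TABLE search_content (
    source_table TEXT,
    source_id    INTEGER,
    content      TEXT,
    modified     TEXT,
    PRIMARY KEY (source_table, source_id)
);
-- Support table: how to build a link for a hit from each source table.
CREATE TABLE search_sources (
    source_table TEXT PRIMARY KEY,
    url_prefix   TEXT
);
INSERT INTO search_sources VALUES ('bboard', '/bboard/message?msg_id=');

CREATE TRIGGER bboard_search_ins AFTER INSERT ON bboard BEGIN
    INSERT INTO search_content
    VALUES ('bboard', NEW.msg_id, NEW.subject || ' ' || NEW.body, NEW.posted);
END;
CREATE TRIGGER bboard_search_upd AFTER UPDATE ON bboard BEGIN
    UPDATE search_content
    SET content = NEW.subject || ' ' || NEW.body, modified = NEW.posted
    WHERE source_table = 'bboard' AND source_id = OLD.msg_id;
END;
CREATE TRIGGER bboard_search_del AFTER DELETE ON bboard BEGIN
    DELETE FROM search_content
    WHERE source_table = 'bboard' AND source_id = OLD.msg_id;
END;
"""


def make_db():
    """Create an in-memory database with the tables and triggers above."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def search(conn, word):
    """Search the one search table and build a link for every hit."""
    return conn.execute(
        """
        SELECT s.url_prefix || c.source_id, c.modified
        FROM search_content c
        JOIN search_sources s ON s.source_table = c.source_table
        WHERE c.content LIKE ?
        ORDER BY c.modified
        """,
        (f"%{word}%",),
    ).fetchall()


conn = make_db()
conn.execute(
    "INSERT INTO bboard VALUES (1, 'JS holes', 'javascript security', '2000-10-30')"
)
conn.execute(
    "UPDATE bboard SET body = 'javascript security, patched' WHERE msg_id = 1"
)
hits = search(conn, "patched")
```

The application code only ever writes to `bboard`; the triggers keep `search_content` in step, and the search side never needs to know the structure of the source table.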

This makes it fairly trivial to control what gets indexed and what
doesn't. The drawback is duplication of data, of course, but
personally I'm ok with that since the alternative is trying to "teach"
an indexer about the structure of your particular database - Not
Fun.

Conceptually, I think what's needed is to take a sophisticated
search engine like htDig and translate its search algorithm into
SQL. So instead of generating a regexp or whatever it does now
to turn your query into something to be executed, it would have to
write SQL instead. Then it shouldn't be *too* hard to run that
SQL against the database and return the results. Of course I'm
probably overlooking something horrendously complicated or
this would have been done already! :)
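
As a very small illustration of "write SQL instead", the sketch below turns a list of words plus an optional date range into a parameterized query. It is not htDig's algorithm (no stemming, fuzzy matching, or ranking); the `search_content` table and the function name `build_search_sql` are assumptions made for the example.

```python
import sqlite3


def build_search_sql(query, start=None, end=None):
    """Turn 'word1 word2' plus an optional date range into (sql, params)."""
    clauses, params = [], []
    for term in query.split():
        # Every term must appear somewhere in the content.
        # Note: % and _ inside a term act as LIKE wildcards in this sketch.
        clauses.append("content LIKE ?")
        params.append(f"%{term}%")
    if start is not None:
        clauses.append("modified >= ?")
        params.append(start)
    if end is not None:
        clauses.append("modified <= ?")
        params.append(end)
    where = " AND ".join(clauses) if clauses else "1 = 1"
    # Only fixed clause text is formatted in; user input stays in params.
    sql = f"SELECT source_id FROM search_content WHERE {where} ORDER BY modified"
    return sql, params


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE search_content (source_id INTEGER, content TEXT, modified TEXT)"
)
conn.executemany(
    "INSERT INTO search_content VALUES (?, ?, ?)",
    [
        (1, "javascript security hole", "2000-10-30"),
        (2, "javascript tips", "2000-10-31"),
        (3, "javascript security revisited", "2001-05-01"),
    ],
)
sql, params = build_search_sql("javascript security", "2000-10-01", "2000-11-30")
rows = conn.execute(sql, params).fetchall()
```

The hard parts of a real engine (relevance ranking, word variants, performance on large tables) are exactly what this leaves out.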

BTW I agree with Dave, searching static pages just isn't good
enough for the kind of sites we are building here. Not to mention
that writing all the dynamic pages to disk represents even more
duplication than the search table method does!