Forum OpenACS Development: Searching dynamic content by date range. Non-sequitur?

So I'm playing with htDig and other site indexing tools, and have had
pretty good results (http://www.theashergroup.com/demos/openacs/).

But I face a basic quandary and I'd like your help.  Part of the
quandary is an AOLserver quirk, and part is fundamental to sites that
offer dynamic content.  But based on the thought that what I am doing
may be useful to others in the ACS environment, I'd like to hear your
thoughts on search metaphors, design, and APIs.

I would like to offer visitors the ability to search for content
based on the date of the content.  Now that's a completely
understandable and useful capability when searching your garage, your
tax records, and when searching a static site.  I know something
happened last year between April Fools Day and Guy Fawkes Day.  What
was it?

It doesn't make as much sense on a site that's more application than
content (or it's not nearly as easy, or even doable, with a typical
site index engine): find me the Amazon quote as of last April.  Nice
idea, but it can't be done by most site indexers, which are geared
more towards capturing and presenting the latest snapshot of a site.

It does make sense for certain ACS elements: find me the thread on
javascript security holes that occurred around Halloween.

The problem is that this is hard to do with the site index tools I've
looked at so far when combined with our current bboard metaphor, and
it's hard to do in general within AOLserver.

The site index tools I've seen (htDig mostly) can only deal with one
date: the last modified date of the document.  Yet a page that
contains dynamic content is an assembly of content elements, each of
which has a different "last modified" date, and most of those dates
are unknown.

AOLserver does the cheap and mostly correct thing: for adp pages (tcl
pages too?) the current time and date are reported as the last
modified time and date, while for html pages the actual last modified
time and date recorded in the file system are returned.

Our bboard metaphor makes it hard, because, well, what is the date of
a thread?  Is it the date of the first post, or the date of the last
post?  It's really a date range.

So what's the answer?

Is the search by date metaphor just meaningless in a world of dynamic
content?

Should we formally support an ACS interface so that each page can set
its last modified date and time if it needs/wants to?

In an application like bboard, what should the last modified date be
set to?

Should content assemblies like bboard have a special search indexing
engine mode (presumably useful to more than just htDig but unknown)
that can expose individual elements when they each have a meaningful
last modified date time that would be of interest to folks searching
a site?

What kind of an interface would you like to see?

For bboard it depends on what you are searching. For an entire thread, the last modified date is the date the last post was entered. If you are searching individual postings, each one has its own date.  So there are two ways to search a bboard like this, and both are probably valid in different situations. This is why a search should probably search the database.
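
To make the two kinds of search concrete, here is a minimal sketch using Python's sqlite3 module. The posts table, its columns, and both function names are made up for illustration; they are not the real bboard schema. One query filters individual postings by their own date, the other filters threads by the date of their last post.

```python
import sqlite3

# Hypothetical, simplified table: NOT the actual bboard data model.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (
    post_id   INTEGER PRIMARY KEY,
    thread_id INTEGER NOT NULL,
    posted_on TEXT NOT NULL,   -- ISO dates, so string comparison orders correctly
    body      TEXT NOT NULL
);
INSERT INTO posts (thread_id, posted_on, body) VALUES
    (1, '2000-10-28', 'javascript security hole found'),
    (1, '2000-11-02', 'patch available'),
    (2, '2000-04-01', 'April Fools announcement');
""")


def posts_between(start, end):
    """Individual postings whose own date falls in [start, end]."""
    rows = conn.execute(
        "SELECT post_id FROM posts"
        " WHERE posted_on BETWEEN ? AND ? ORDER BY post_id",
        (start, end),
    ).fetchall()
    return [r[0] for r in rows]


def threads_last_modified_between(start, end):
    """Threads whose last post (their 'last modified' date) falls in [start, end]."""
    rows = conn.execute(
        "SELECT thread_id FROM posts GROUP BY thread_id"
        " HAVING MAX(posted_on) BETWEEN ? AND ? ORDER BY thread_id",
        (start, end),
    ).fetchall()
    return [r[0] for r in rows]
```

With this sample data, a search for the last week of October finds the first posting but not its thread, because the thread was last modified in November. That is the sort of difference the two modes would need to make visible to the person searching.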

Perhaps we need each module to expose its content to a search in a certain format, and each module can decide what content to offer and how to offer it. So we need a two-sided search API: one API that the search mechanism provides so a module can search the database, and another API for the module to offer up content to the indexer. Unfortunately I have no idea how to actually build this.
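
One possible shape for such a two-sided API, sketched in Python purely as an illustration. Every name here (SearchItem, register_provider, search, bboard_items) is hypothetical and not part of ACS. Each module registers a function that offers up its content, and the search side queries everything that was offered, optionally by date range. It scans in memory to keep the sketch short; a real version would presumably push the filtering into the database.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Dict, Iterable, List, Optional


@dataclass(frozen=True)
class SearchItem:
    """One searchable element offered by a module (hypothetical format)."""
    module: str
    url: str
    text: str
    last_modified: date


# Module side of the API: each module registers a content provider.
_providers: Dict[str, Callable[[], Iterable[SearchItem]]] = {}


def register_provider(module: str,
                      provider: Callable[[], Iterable[SearchItem]]) -> None:
    _providers[module] = provider


# Search side of the API: one entry point across all registered modules.
def search(term: str,
           start: Optional[date] = None,
           end: Optional[date] = None) -> List[SearchItem]:
    hits = []
    for provider in _providers.values():
        for item in provider():
            if term.lower() not in item.text.lower():
                continue
            if start is not None and item.last_modified < start:
                continue
            if end is not None and item.last_modified > end:
                continue
            hits.append(item)
    return sorted(hits, key=lambda item: item.last_modified)


def bboard_items() -> Iterable[SearchItem]:
    """Example provider: the module decides what it exposes and how."""
    yield SearchItem("bboard", "/bboard/msg?id=1",
                     "javascript security holes", date(2000, 10, 30))
    yield SearchItem("bboard", "/bboard/msg?id=2",
                     "security patch released", date(2001, 1, 15))


register_provider("bboard", bboard_items)
```

The point of the split is that the indexer never needs to know a module's table layout; it only sees the items the module chooses to offer, each with its own last modified date.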

aD's design for site-wide search simplifies things in what I feel
is a useful way.  They define triggers on each table that they want
to have searched.  Any time data is inserted, updated or deleted
from one of those tables, these triggers perform the same action
on the search table.  So when it comes time to search, you only
have to do it in one (indexed) table.  There are support tables
which supply information such as how to construct links for hits
on data from a particular table.
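
A rough sketch of that trigger arrangement, using SQLite through Python's sqlite3 module rather than aD's actual code. The table names, columns, and URL template below are assumptions made up for the example. Triggers on a source table mirror every insert, update, and delete into one search table, and a support table says how to build a link for each hit.

```python
import sqlite3

# All table, column, and trigger names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bboard (msg_id INTEGER PRIMARY KEY, subject TEXT, body TEXT);

-- The single table that searches run against.
CREATE TABLE search_index (
    source_table TEXT NOT NULL,
    source_id    INTEGER NOT NULL,
    content      TEXT NOT NULL,
    PRIMARY KEY (source_table, source_id)
);

-- Support table: how to construct a link for a hit from each source table.
CREATE TABLE search_links (
    source_table TEXT PRIMARY KEY,
    url_template TEXT NOT NULL
);
INSERT INTO search_links VALUES ('bboard', '/bboard/message?msg_id={id}');

CREATE TRIGGER bboard_ai AFTER INSERT ON bboard BEGIN
    INSERT INTO search_index
    VALUES ('bboard', NEW.msg_id, NEW.subject || ' ' || NEW.body);
END;
CREATE TRIGGER bboard_au AFTER UPDATE ON bboard BEGIN
    UPDATE search_index
       SET content = NEW.subject || ' ' || NEW.body
     WHERE source_table = 'bboard' AND source_id = OLD.msg_id;
END;
CREATE TRIGGER bboard_ad AFTER DELETE ON bboard BEGIN
    DELETE FROM search_index
     WHERE source_table = 'bboard' AND source_id = OLD.msg_id;
END;
""")


def search(term):
    """Return links for every indexed row whose content contains term."""
    rows = conn.execute(
        "SELECT l.url_template, s.source_id"
        "  FROM search_index s JOIN search_links l USING (source_table)"
        " WHERE s.content LIKE ? ORDER BY s.source_id",
        (f"%{term}%",),
    ).fetchall()
    return [template.format(id=source_id) for template, source_id in rows]
```

Adding another module to the search would mean three more triggers and one more row in the links table, with no change to the search function itself.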

This makes it fairly trivial to control what gets indexed and what
doesn't.  The drawback is duplication of data, of course, but
personally I'm ok with that since the alternative is trying to "teach"
an indexer about the structure of your particular database - Not
Fun.

Conceptually, I think what's needed is to take a sophisticated
search engine like htDig and translate its search algorithm into
SQL.  So instead of generating a regexp or whatever it does now
to turn your query into something to be executed, it would have to
write SQL instead.  Then it shouldn't be *too* hard to run that
SQL against the database and return the results.  Of course I'm
probably overlooking something horrendously complicated or
this would have been done already! :)
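
For what it's worth, the simplest version of "write SQL instead" might look like the sketch below. This is not htDig's algorithm, just a plain all-words-must-match translation, and the function names plus the default search_index table and content column are assumptions for the example. Each word in the query becomes one parameterized LIKE clause.

```python
def _escape_like(term):
    """Escape LIKE wildcards so user input is matched literally."""
    return term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")


def query_to_sql(query, table="search_index", column="content"):
    """Turn a free-text query into (sql, params), requiring every word to match.

    table and column are interpolated directly into the SQL, so they must be
    trusted identifiers from the application, never user input.
    """
    terms = query.split()
    if not terms:
        raise ValueError("empty query")
    where = " AND ".join(f"{column} LIKE ? ESCAPE '\\'" for _ in terms)
    sql = f"SELECT source_table, source_id FROM {table} WHERE {where}"
    params = [f"%{_escape_like(term)}%" for term in terms]
    return sql, params
```

The returned pair can be handed to any database driver that uses ? placeholders. Ranking, stemming, and fuzzy matching are where it would likely get horrendously complicated.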

BTW  I agree with Dave, searching static pages just isn't good
enough for the kind of sites we are building here.  Not to mention
that writing all the dynamic pages to disk represents even more
duplication than the search table method does!