Michael:
Let's start with your first two ideas:
1) Query optimization
2) Improved presentation of query result sets
It seems to me that these two are the simplest to think about
first, since they need only to interface with search and not any
other part of OpenACS.
The query optimization requires taking the search request and
passing it through OpenCyc to try to figure out the context of the
request. Java is a good example: does the requestor want to
know about the programming language, coffee, or the island in
Indonesia?
One way to assist this contextualization is through the
localization of the originating context. This can help with point 2.
On the OpenACS site, for example, it is most likely that people
are interested in the programming language. On Photo.Net,
newbies are probably interested in the island, but people who
are familiar with Philip and Alex's Guide and its ties to photo.net
might be interested in the programming language. In neither
case is coffee a likely context.
Step 1 requires no knowledge of the contents of the
collection. Localization, as described above, does. In order to be
able to localize, it's necessary to do Step 3, which is to have
OpenCyc process the entire contents of a collection so that it
"understands" what is in the collection.
In 1998 and 1999 I was looking for money to build a search
engine with some very unusual characteristics -- the main ones
were collaboration, the inclusion of business rules engines to
drive processes like spider scheduling, and incorporation of
WordNet to establish context. One example I used was Java.
After entering the search request, the user would be presented
with something like:
Found 3 different meanings for Java:
- a computer programming language (2,000,000 pages, 5,967
sites, 19 categories)
- a synonym for coffee (116,000 pages, 423 sites, 3
categories)
- an island in Indonesia (23,000 pages, 111 sites, 9
categories)
The pages, sites, and categories references were links to
presentations of results. The order of the presentation, above,
was dependent on the number of pages returned. Given some
knowledge of the context of the query (for example, that the user
is searching for Java from a travel site), the results about the
island are almost certainly the ones that were wanted, and these
would be presented first (or, perhaps, exclusively).
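A rough sketch of that ordering in Python, for illustration only. The
Sense record, its topic tag, the context_hint parameter, and the
rank_senses function are all hypothetical names, and the page counts are
illustrative round numbers. Nothing here uses WordNet or OpenCyc; it
only shows the "order by page count unless the context says otherwise"
rule.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sense:
    label: str   # human-readable meaning shown to the user
    topic: str   # hypothetical tag matched against the originating context
    pages: int   # illustrative page count for this meaning


def rank_senses(senses: List[Sense],
                context_hint: Optional[str] = None,
                exclusive: bool = False) -> List[Sense]:
    """Order senses by page count; senses matching the context go first."""
    by_pages = sorted(senses, key=lambda s: s.pages, reverse=True)
    if context_hint is None:
        return by_pages
    matching = [s for s in by_pages if s.topic == context_hint]
    if not matching:
        # Unknown context: fall back to the plain page-count ordering.
        return by_pages
    if exclusive:
        return matching
    others = [s for s in by_pages if s.topic != context_hint]
    return matching + others


JAVA_SENSES = [
    Sense("a synonym for coffee", "food", 116_000),
    Sense("an island in Indonesia", "travel", 23_000),
    Sense("a computer programming language", "computing", 2_000_000),
]

if __name__ == "__main__":
    # A search for "Java" coming from a travel site.
    for sense in rank_senses(JAVA_SENSES, context_hint="travel"):
        print(f"- {sense.label} ({sense.pages:,} pages)")
```

With no hint the programming language comes first; with the "travel"
hint the island moves to the top, and exclusive=True drops the other
meanings entirely.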
A simple category structure would enable localization, too. If
the user searched from within the "travel" category, then the
scope of the search would automatically be localized to travel.
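Category-based localization can be sketched the same way. This is a
minimal illustration, assuming a tiny in-memory list of documents that
each carry a single category; the DOCUMENTS data and the search function
are made up for the example and are not part of OpenACS.

```python
from typing import Dict, List, Optional

# Hypothetical collection: each document records its category and its words.
DOCUMENTS: List[Dict[str, str]] = [
    {"title": "Backpacking across Java", "category": "travel",
     "text": "java island volcano temple"},
    {"title": "Java threads tutorial", "category": "programming",
     "text": "java threads language"},
    {"title": "Brewing a good cup of java", "category": "food",
     "text": "java coffee brewing"},
]


def search(query: str,
           documents: List[Dict[str, str]],
           category: Optional[str] = None) -> List[str]:
    """Return titles whose text contains every query term.

    If a category is given (the one the user searched from), only
    documents in that category are considered.
    """
    terms = query.lower().split()
    hits = []
    for doc in documents:
        if category is not None and doc["category"] != category:
            continue
        words = doc["text"].lower().split()
        if all(term in words for term in terms):
            hits.append(doc["title"])
    return hits


if __name__ == "__main__":
    print(search("java", DOCUMENTS))                     # all three titles
    print(search("java", DOCUMENTS, category="travel"))  # only the island
```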
Category organizers, in my way of thinking, should spend as
much time thinking about metadata as they do about the
category structure. For example, I have a personal interest in
chocolate. If I were responsible for the chocolate category at a
site that incorporated both categories and a page-based search
engine, I would want to create a thesaurus of concepts that
describe chocolate to aid a classification engine. With OpenCyc,
I would use the OpenCyc tools to do this work using assertions
like:
is a type of: white, ivory, milk, dark, bittersweet, truffle,
bonbon, ganache
is a manufacturer of: Callebaut, Valrhona, Nestle, Hershey
has(?) holidays: Easter, Valentine's Day
is not related: labrador retriever
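To make the idea concrete, here is a minimal sketch of how such a
thesaurus might feed a classifier. It is plain Python, not OpenCyc
assertion syntax; the dictionary layout and the chocolate_score function
are assumptions made for illustration. A page scores one point per
thesaurus term it mentions, and any "is not related" term vetoes the
page outright.

```python
import re
from typing import Dict, List

# The relations and terms come from the list above; the structure is hypothetical.
CHOCOLATE_THESAURUS: Dict[str, List[str]] = {
    "is a type of": ["white", "ivory", "milk", "dark", "bittersweet",
                     "truffle", "bonbon", "ganache"],
    "is a manufacturer of": ["Callebaut", "Valrhona", "Nestle", "Hershey"],
    "has holidays": ["Easter", "Valentine's Day"],
    "is not related": ["labrador retriever"],
}


def chocolate_score(text: str,
                    thesaurus: Dict[str, List[str]] = CHOCOLATE_THESAURUS) -> int:
    """Count thesaurus terms found in text; 0 if an excluded term appears."""
    lowered = text.lower()

    def mentions(term: str) -> bool:
        # Whole-word (or whole-phrase) match, case-insensitive.
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        return re.search(pattern, lowered) is not None

    if any(mentions(term) for term in thesaurus["is not related"]):
        return 0
    return sum(
        1
        for relation, terms in thesaurus.items()
        if relation != "is not related"
        for term in terms
        if mentions(term)
    )


if __name__ == "__main__":
    print(chocolate_score("Valrhona dark chocolate truffle for Easter"))
    print(chocolate_score("My chocolate labrador retriever loves Easter"))
```

The second page mentions both chocolate and Easter but is about a dog,
which is exactly the kind of false match the "is not related" entry is
there to catch.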
I think that this is enough to give you a glimpse of some of
the ways I have been thinking about using this.
Likewise, how do you picture the "chunking" of results into
"contexts"? How is that like or unlike search results pages that
are out there now? For example, I'm not really clear yet on how,
from a user's perspective "a collection of phrases and keywords
grouped into the smallest possible number of headings" differs
from Yahoo! or from Northern Light.
One of the challenges with taxonomies like Yahoo!'s is that
there is a tendency to want to describe things using a single
word. When that happens, related words tend to get separated
and entries naturally get segregated. For example: Associations,
Organization, & Clubs. To my way of thinking the distinctions
between these concepts are not useful when organizing the
contents of the Internet -- at least to most users. However, to
professional categorizers, they are different, and, while
linguistically correct and precise, are a nightmare from a
usability standpoint.
My main problem with Northern Light is that, after using it for
a while, you'll notice that the range of concepts it uses for its
categories is quite limited, is often self-referential (e.g., there is a
chocolate "custom search folder" within the chocolate results
set), and the confidence ratings are often nonsensical (e.g., the
first result within the chocolate custom search folder for
chocolate has a confidence rating of only 85%).
From my POV the purposes of chunking are to quickly
eliminate the contexts you know don't apply, and to present
contexts you may not have been aware of (serendipity). NL
doesn't do this for me, nor does Teoma or WiseNut.