|
What are the parts of an IR system?
In general, IR systems need to do the following:
- Manage document collections. A logical document collection might
consist of, for example, a set of URLs with associated rules for
spidering, transformation, etc. Collection change management and
notification are critical for keeping collections, and the systems
that depend on them, up to date.
- Indexing and searching. A collection is typically preprocessed, or
"indexed", to aid in rapid and relevant search and retrieval of
information extracted from the base collection.
- Query processing. A structured or unstructured query may be expanded
via thesaurus lookup or some other method of analysis. Distributed
search of multiple indexes requires a merge of relevant results
from each source in a reasonably consistent manner.
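As a minimal sketch of the query expansion step described above,
assuming a tiny hand-written thesaurus (the THESAURUS table and the
expand_query name are hypothetical, not part of any GIR interface):

```python
# Hypothetical thesaurus; a real system would load one from a curated resource.
THESAURUS = {
    "car": ["automobile", "vehicle"],
    "fast": ["quick", "rapid"],
}


def expand_query(query, thesaurus=THESAURUS):
    """Return the query terms plus any thesaurus synonyms, without duplicates."""
    expanded = []
    for term in query.lower().split():
        # Keep the original term first, then append its synonyms (if any).
        for candidate in [term] + thesaurus.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_query("fast car"))
```

Terms with no thesaurus entry pass through unchanged, so the expanded
query is always a superset of the original one.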
What parts of IR can run simultaneously and what parts must run serially?
Nearly all components of an IR system can run in parallel, although
contention can be a problem in places (e.g., ensuring that two newly
identified terms are not accidentally given the same term ID).
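The term ID contention mentioned above can be illustrated with a
short sketch. It assumes a single shared in-memory dictionary guarded
by a lock; the TermDictionary class and index_documents function are
hypothetical names, and a real grid deployment would need a
distributed equivalent.

```python
import threading


class TermDictionary:
    """Maps terms to unique integer IDs; safe to call from several threads."""

    def __init__(self):
        self._ids = {}
        self._lock = threading.Lock()

    def get_id(self, term):
        # The lock makes "check, then assign" atomic, so two threads that
        # discover new terms at the same moment cannot receive the same ID.
        with self._lock:
            if term not in self._ids:
                self._ids[term] = len(self._ids)
            return self._ids[term]


def index_documents(dictionary, documents):
    """Tokenize each document and register its terms (runs in parallel)."""
    for doc in documents:
        for term in doc.split():
            dictionary.get_id(term)


dictionary = TermDictionary()
batches = [["grid retrieval index"], ["index query grid"], ["query ranking"]]
workers = [
    threading.Thread(target=index_documents, args=(dictionary, batch))
    for batch in batches
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
```

Only the ID assignment is serialized here; tokenizing and any other
per-document work still proceeds in parallel.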
What new models for IR are enabled by the Grid?
Most IR systems have a single method for indexing and query
processing. These systems treat all documents, from all sources,
equally. GIR will enable federations of document collections, each
tuned for optimal performance. GIR's security model will also allow
document collections to be "published" to the grid, but with access
control lists that might limit who can query the collection or to
whom results may be shown.
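To make the access control idea concrete, here is a small sketch
under our own assumptions: each published collection carries two
lists, one for who may query it and one for who may see results. The
PublishedCollection class, its field names, and the search function
are hypothetical and do not describe an actual GIR security API.

```python
from dataclasses import dataclass, field


@dataclass
class PublishedCollection:
    name: str
    documents: list
    may_query: set = field(default_factory=set)  # who may submit queries
    may_view: set = field(default_factory=set)   # who may be shown results


def search(collection, querier, viewer, term):
    """Run a substring search, enforcing both access control lists."""
    if querier not in collection.may_query:
        raise PermissionError(f"{querier} may not query {collection.name}")
    if viewer not in collection.may_view:
        return []  # the query was allowed, but nothing may be shown
    return [doc for doc in collection.documents if term.lower() in doc.lower()]


reports = PublishedCollection(
    name="reports",
    documents=["Grid security overview", "Annual budget"],
    may_query={"alice", "portal"},
    may_view={"alice"},
)
print(search(reports, "alice", "alice", "grid"))
```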
What's wrong with WAIS, similar systems, and Z39.50?
WAIS (Wide Area Information Servers) was an implementation of
coarse-grained networked information retrieval from the early 1990s.
WAIS, which is no longer actively maintained, relied on Z39.50, a
bibliographic standard for the exchange of queries and results among
systems. GIR will leverage the vast body of experience in abstract
interfaces for IR defined by the Z39.50 community. Though the
granularity of GIR interfaces is expected to be finer, there are many
areas in common as well.
Isn't Google good enough?
Google uses a monolithic model that provides for no tuning for
specific sub-collections, no security model, and no alternate
retrieval methods. GIR will make use of many of the IR methods that
Google utilizes but, unlike Google, will be open source,
standards-compliant, and more geared towards the virtual organization
(VO) model of dynamic, transient computing resources.
Doesn't Oracle already do this?
Oracle, Sun, IBM and other companies have grid computing products.
These companies also have IR products. We are unaware of efforts to
implement GIR, however, and existing IR products from these and other
companies tend to follow a monolithic model.
What is relevance feedback?
Relevance feedback is a process that takes a result document, or part
of one, selected by an end user and, using any of several techniques,
re-queries the database to locate documents closely related to that
input document.
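One classic relevance feedback technique is the Rocchio method, which
moves the query vector towards documents the user marked as relevant.
The sketch below is a simplified, positive-feedback-only variant with
raw term counts as weights; the function names and the alpha and beta
defaults are our assumptions, not a statement of what GIR uses.

```python
from collections import Counter, defaultdict


def term_vector(text):
    """Represent text as a bag of lowercase terms with raw counts."""
    return Counter(text.lower().split())


def rocchio(query_vec, relevant_vecs, alpha=1.0, beta=0.75):
    """Return a new query: alpha * query + beta * mean(relevant documents)."""
    new_query = defaultdict(float)
    for term, weight in query_vec.items():
        new_query[term] += alpha * weight
    for vec in relevant_vecs:
        for term, weight in vec.items():
            new_query[term] += beta * weight / len(relevant_vecs)
    return dict(new_query)


query = term_vector("grid search")
relevant = [term_vector("grid computing resources")]
print(rocchio(query, relevant))
```

The reweighted query would then be submitted to the index again, so
terms drawn from the chosen document pull in closely related
documents.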
What is consensus-based ranking?
In GIR, multiple algorithms or IR approaches might be applied to the
same data set. For example, a company might decide to index a set of
documents using the Vector Space Model, but use different term
weighting methods in three different indexes. GIR would query the
three indexes and get three ranked sets of documents. GIR will
provide merging and ranking methods to determine the overall ranking
from the merged results. We propose consensus-based ranking as one
method for this, in which documents that are agreed upon as highly
ranked by the different systems will be highly ranked in the merged
output set.
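One simple way to approximate the consensus idea described above is a
Borda count, in which each index "votes" for documents according to
their rank position. This is offered only as an illustrative stand-in;
the borda_merge function and the document IDs are hypothetical, and
the ranking method GIR adopts may differ.

```python
def borda_merge(rankings):
    """Merge ranked lists: documents ranked highly by many lists score highest."""
    all_docs = {doc for ranking in rankings for doc in ranking}
    n = len(all_docs)
    scores = {doc: 0 for doc in all_docs}
    for ranking in rankings:
        for position, doc in enumerate(ranking):
            # First place earns n points, second n - 1, and so on.
            # Documents missing from a ranking earn nothing from it.
            scores[doc] += n - position
    # Sort by descending score; break ties by document ID for stability.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


index_a = ["d1", "d2", "d3"]
index_b = ["d2", "d1", "d4"]
index_c = ["d2", "d3", "d1"]
print(borda_merge([index_a, index_b, index_c]))
# prints ['d2', 'd1', 'd3', 'd4']
```

Here d2 comes first because two of the three indexes rank it at the
top, while d4, returned by only one index, comes last.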
How do harvesters/spiders work, and how does GIR help?
On the Web, programs are used to identify and retrieve files (HTML,
text, PDF and other formats as desired). These programs are called
crawlers, harvesters or spiders. GIR will include a Collection
Management service that provides a standardized set of services and
access APIs. In practice, this service will wrap spiders and will
also be able to read documents from local or networked disks,
databases, etc. GIR will help spiders by making them distributed.
Rather than creating a single network bottleneck, grid elements will
be able to build their own indexes with "near" documents.
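As a rough sketch of the distribution idea above, suppose "near" is
measured by network latency and that each grid element has already
measured its latency to each document source. Both are assumptions on
our part, as are the assign_urls function, the node names, and the
example URLs.

```python
def assign_urls(latency_ms):
    """Give each URL to the grid element with the lowest measured latency.

    latency_ms maps url -> {node_name: latency in milliseconds}.
    Returns node_name -> list of URLs that node should harvest and index.
    """
    assignment = {}
    for url, by_node in latency_ms.items():
        nearest = min(by_node, key=by_node.get)
        assignment.setdefault(nearest, []).append(url)
    return assignment


# Hypothetical measurements from two grid elements to three sources.
measurements = {
    "http://example.org/a": {"node-east": 12, "node-west": 80},
    "http://example.org/b": {"node-east": 95, "node-west": 20},
    "http://example.com/c": {"node-east": 15, "node-west": 70},
}
print(assign_urls(measurements))
```

Each grid element then spiders and indexes only its own share, so no
single node has to fetch every document.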
What is personalized IR and how does GIR help?
We envision a future in which everyone has their own IR system,
attuned to their ongoing information needs, preferences and
interests. On the grid, it's feasible to implement grid elements that
have knowledge of a user's history and needs, and that take these
needs into account using the merging and ranking algorithms already
in place for combining sets from federated collections.
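A minimal sketch of how a user's interests might be folded into
ranking, assuming a profile is simply a table of term weights learned
from the user's history. The personalize function, the profile
format, and the weight parameter are all hypothetical.

```python
def personalize(results, profile, weight=0.5):
    """Re-rank results by adding a bonus for terms the user cares about.

    results: list of (doc_id, base_score, set_of_terms) tuples.
    profile: dict mapping term -> interest weight for this user.
    """
    rescored = []
    for doc_id, score, terms in results:
        interest = sum(profile.get(term, 0.0) for term in terms)
        rescored.append((doc_id, score + weight * interest))
    rescored.sort(key=lambda pair: -pair[1])
    return [doc_id for doc_id, _ in rescored]


results = [
    ("a", 1.0, {"grid", "storage"}),
    ("b", 0.9, {"grid", "retrieval"}),
]
profile = {"retrieval": 1.0}  # this user has shown interest in retrieval
print(personalize(results, profile))
# prints ['b', 'a']
```

With an empty profile the original ordering is preserved, so the
personalization step can sit after any existing merge.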
|