GIR

Frequently asked questions

What are the parts of an IR system?

In general, IR systems need to do the following:

  1. Manage document collections. A logical document collection might consist of, for example, a set of URLs with associated rules for spidering, transformation, etc. Collection change management and notification are critical for keeping collections, and the systems that depend on them, up to date.
  2. Index and search. A collection is typically preprocessed, or "indexed", to support rapid and relevant search and retrieval of information extracted from the base collection.
  3. Process queries. A structured or unstructured query may be expanded via thesaurus lookup or some other method of analysis. Distributed search across multiple indexes requires merging the relevant results from each source in a reasonably consistent manner.
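
As an illustration of the thesaurus lookup mentioned in step 3, here is a minimal sketch of query expansion. The thesaurus contents and the function name expand_query are assumptions made up for this example; they are not part of any GIR interface.

```python
# Hypothetical thesaurus mapping a term to related terms (illustrative only).
THESAURUS = {
    "car": ["automobile", "vehicle"],
    "grid": ["cluster"],
}


def expand_query(query, thesaurus=THESAURUS):
    """Return the query terms plus their thesaurus entries, without duplicates."""
    expanded = []
    for term in query.lower().split():
        # Keep the original term first, then any related terms.
        for candidate in [term] + thesaurus.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_query("Grid car search"))
# ['grid', 'cluster', 'car', 'automobile', 'vehicle', 'search']
```

A real system would probably weight expanded terms lower than the original ones; this sketch treats them equally for brevity.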

What parts of IR can run simultaneously and what parts must run serially?

Nearly all components of an IR system can run in parallel. Contention can still be a problem where components share state: for example, the system must ensure that two newly identified terms are not accidentally given the same term ID.
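
The term ID problem can be sketched as follows. This is one simple way to avoid duplicate IDs on a single machine, using a lock around the shared dictionary; the class name TermDictionary is hypothetical, and a grid-wide implementation would need a distributed equivalent that this sketch does not attempt.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class TermDictionary:
    """Assigns each distinct term a unique integer ID (hypothetical helper)."""

    def __init__(self):
        self._ids = {}
        self._lock = threading.Lock()

    def get_id(self, term):
        # The lock makes "check, then assign" one atomic step, so two
        # threads can never hand out the same ID to different terms.
        with self._lock:
            if term not in self._ids:
                self._ids[term] = len(self._ids)
            return self._ids[term]


terms = ["grid", "index", "query", "grid", "merge", "index"] * 100
dictionary = TermDictionary()

# Eight indexing workers look up term IDs at the same time.
with ThreadPoolExecutor(max_workers=8) as pool:
    ids = list(pool.map(dictionary.get_id, terms))
```

Everything outside the locked section (parsing, tokenizing, scoring) can still proceed in parallel; only the ID assignment is serialized.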

What new models for IR are enabled by the Grid?

Most IR systems have a single method for indexing and query processing. These systems treat all documents, from all sources, equally. GIR will enable federations of document collections, each tuned for optimal performance. GIR's security model will also allow document collections to be "published" to the grid, but with access control lists that might limit who can query the collection or to whom results may be shown.
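
To make the access control idea concrete, here is a minimal sketch of a published collection with two separate lists: who may query, and who may be shown results. The names CollectionACL and run_query are assumptions for illustration; GIR's actual security model is not specified here.

```python
from dataclasses import dataclass, field


@dataclass
class CollectionACL:
    """Hypothetical access control lists for one published collection."""
    query_users: set = field(default_factory=set)
    view_users: set = field(default_factory=set)

    def can_query(self, user):
        return user in self.query_users

    def can_view_results(self, user):
        return user in self.view_users


def run_query(acl, requester, recipient, results):
    """Return results only if the requester may query and the recipient may view."""
    if not acl.can_query(requester):
        raise PermissionError(f"{requester} may not query this collection")
    return list(results) if acl.can_view_results(recipient) else []


acl = CollectionACL(query_users={"alice"}, view_users={"alice", "bob"})
print(run_query(acl, "alice", "bob", ["doc1", "doc2"]))    # ['doc1', 'doc2']
print(run_query(acl, "alice", "carol", ["doc1", "doc2"]))  # []
```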

What's wrong with WAIS or similar and Z39.50?

WAIS (Wide Area Information Servers) was an implementation of coarse-grained networked information retrieval from the early 1990s. WAIS, which is no longer actively maintained, relied on Z39.50, a standard that originated in the bibliographic community for exchanging queries and results among systems.

GIR will leverage the vast body of experience in abstract interfaces for IR defined by the Z39.50 community. Though the granularity of GIR interfaces is expected to be finer, there are many common areas as well.

Isn't Google good enough?

Google uses a monolithic model that provides no tuning for specific sub-collections, no security model, and no alternate retrieval methods. GIR will make use of many of the IR methods that Google uses but, unlike Google, will be open source, standards-compliant, and geared more toward the virtual organization (VO) model of dynamic, transient computing resources.

Doesn't Oracle already do this?

Oracle, Sun, IBM and other companies have grid computing products. These companies also have IR products. We are unaware of any efforts by them to implement GIR, however, and existing IR products from these and other companies tend to follow a monolithic model.

What is relevance feedback?

Relevance feedback is a process that takes a document, or part of a document, that an end user has identified in the results and, using any of several techniques, re-queries the database to locate documents closely related to that input.
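
One simple technique for this is to treat the chosen document itself as the new query and rank the collection by similarity to it. The sketch below does that with cosine similarity over raw term counts. It is an illustrative assumption, not the method GIR prescribes, and the function names and sample documents are made up.

```python
import math
from collections import Counter


def term_vector(text):
    """Count whitespace-separated, lowercased terms."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def more_like_this(feedback_doc, collection, top_n=2):
    """Rank documents in the collection by similarity to the feedback document."""
    query = term_vector(feedback_doc)
    scored = [(cosine(query, term_vector(text)), doc_id)
              for doc_id, text in collection.items()]
    # Highest score first; ties broken by document ID for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored[:top_n] if score > 0]


collection = {
    "d1": "grid computing shares distributed resources",
    "d2": "cooking pasta with tomato sauce",
    "d3": "distributed search on grid resources",
}
print(more_like_this("grid resources for distributed search", collection))
# ['d3', 'd1']
```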

What is consensus-based ranking?

In GIR, multiple algorithms or IR approaches might be applied to the same data set. For example, a company might decide to index a set of documents using the Vector Space Model, but use different term weighting methods in three different indexes. GIR would query the three indexes and get three ranked sets of documents. GIR will provide merging and ranking methods to determine the overall ranking from the merged results. We propose consensus-based ranking as one method for this, in which documents that are agreed upon as highly ranked by the different systems will be highly ranked in the merged output set.
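
As a rough sketch of how such a merge could work, the example below combines three ranked lists with a Borda count: each index awards a document more points the higher it ranks, and the totals decide the merged order. Borda count is used here only as a stand-in; the text above does not say which consensus method GIR will use, and the function name consensus_rank is hypothetical.

```python
from collections import defaultdict


def consensus_rank(rankings):
    """Merge ranked lists of document IDs using a Borda count."""
    all_docs = {doc for ranking in rankings for doc in ranking}
    scores = defaultdict(int)
    for ranking in rankings:
        for position, doc in enumerate(ranking):
            # Top position earns len(all_docs) points, the next one fewer,
            # and so on. Documents absent from a ranking earn nothing from it.
            scores[doc] += len(all_docs) - position
    # Highest total first; ties broken by document ID for stable output.
    return sorted(all_docs, key=lambda doc: (-scores[doc], doc))


# Ranked results from three indexes built with different term weightings.
rankings = [
    ["a", "b", "c", "d"],
    ["b", "a", "d", "c"],
    ["a", "c", "b"],
]
print(consensus_rank(rankings))
# ['a', 'b', 'c', 'd']
```

Document "a" is ranked first or second by every index, so it leads the merged list, while "d" is ranked low where it appears and is missing from one index, so it comes last.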

How do harvesters/spiders work, and how does GIR help?

On the Web, programs called crawlers, harvesters or spiders identify and retrieve files (HTML, text, PDF and other formats as desired). GIR will include a Collection Management service that provides a standardized set of services and access APIs. In practice, this service will wrap spiders and will also be able to read documents from local or networked disks, databases, etc.

GIR will help spiders by making them distributed. Rather than a single network bottleneck, grid elements will be able to build their own indexes with "near" documents.
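
One way to picture "near" documents is to assign each URL to the grid element with the lowest estimated network latency to that URL's host. The sketch below assumes such latency estimates are available. The node names, hosts, and numbers are all made up for illustration and do not describe any real GIR deployment.

```python
from urllib.parse import urlparse

# Hypothetical round-trip estimates (ms) from each grid node to each host.
LATENCY_MS = {
    "node-eu": {"example.org": 20, "example.com": 120},
    "node-us": {"example.org": 110, "example.com": 15},
}


def assign_urls(urls, latency=LATENCY_MS):
    """Assign each URL to the node with the lowest estimated latency to its host."""
    assignment = {node: [] for node in latency}
    for url in urls:
        host = urlparse(url).hostname
        # Hosts with no estimate count as infinitely far away; if no node
        # has an estimate, the first node listed receives the URL.
        nearest = min(latency,
                      key=lambda node: latency[node].get(host, float("inf")))
        assignment[nearest].append(url)
    return assignment


urls = [
    "http://example.org/a",
    "http://example.com/b",
    "http://example.org/c",
]
print(assign_urls(urls))
# {'node-eu': ['http://example.org/a', 'http://example.org/c'], 'node-us': ['http://example.com/b']}
```

Each node would then spider and index only its own list, so no single network link carries the whole crawl.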

What is personalized IR and how does GIR help?

We envision a future where people will all have their own IR systems that are attuned to their ongoing information needs, preferences and interests. On the grid, it's feasible to implement grid elements that have knowledge of the history of a user and her needs, and take these needs into account using merging and ranking algorithms already in place for combining sets from federated collections.

 
