GIR-WG Charter

Administration

Name

Grid Information Retrieval (GIR or GridIR)

Chairs

Greg Newby, PhD
Arctic Region Supercomputing Center
gbnewby at arsc.edu

Nassib Nassar
Etymon Systems, Inc.
nassar at etymon.com

Yangwoo Paul Kim, PhD
Dongguk University
ywkim at dongguk.edu

Secretary

Sousan Karimi
MCNC
sousan at mcnc.org

Mailing List

Subscription details and user interface available at http://www.gir-wg.org/wg_list.html .

Description and Objectives

Purpose

The GridIR WG will focus on establishing the requirements, specifications, reference implementations and best practices in supporting Information Retrieval (IR) services on the Grid. Grid IR services will be needed by users, applications and portals to provide documents, document extracts, answers or other data items to satisfy information needs.

Goals

The GridIR WG will focus on the following:

1. Establish the requirements for Grid IR services:

GridIR will be defined as a set of grid services which, together, constitute a complete IR system, including:

  • Harvesters, to gather network-based documents
  • Indexers, to build data- and file-structures for retrieval
  • Index processors, to determine post-indexing term and document weights
  • Query processors, to take user queries and gather results
  • Integrators, for ranking results from different sources
  • Renderers, to take results and organize or present them
  • Many other sub-systems and control systems
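
As a rough illustration of how these components might fit together, the following Python sketch chains a toy indexer, index processor, query processor and renderer over a small in-memory collection. The function names, the whitespace tokenization and the TF-IDF weighting are all assumptions made for illustration; the charter does not prescribe any of them, and a real harvester would gather the documents from the network.

```python
import math
from collections import Counter, defaultdict


def index(docs):
    """Indexer: build an inverted index mapping term -> {doc_id: term count}."""
    inverted = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(text.lower().split()).items():
            inverted[term][doc_id] = count
    return inverted


def weight(inverted, n_docs):
    """Index processor: turn raw counts into TF-IDF weights after indexing."""
    weighted = {}
    for term, postings in inverted.items():
        idf = math.log(n_docs / len(postings)) + 1.0
        weighted[term] = {d: tf * idf for d, tf in postings.items()}
    return weighted


def query(weighted, text):
    """Query processor: sum the weights of matching terms per document."""
    scores = defaultdict(float)
    for term in text.lower().split():
        for doc_id, w in weighted.get(term, {}).items():
            scores[doc_id] += w
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


def render(results):
    """Renderer: present ranked results as plain text lines."""
    return [f"{rank}. {doc_id} ({score:.2f})"
            for rank, (doc_id, score) in enumerate(results, start=1)]


# Hard-coded stand-in for harvester output (hypothetical documents).
docs = {
    "d1": "grid services for information retrieval",
    "d2": "grid security infrastructure",
    "d3": "retrieval of multimedia documents",
}
ranked = query(weight(index(docs), len(docs)), "grid retrieval")
lines = render(ranked)
```

Here the document that matches both query terms ("d1") ranks first. In a grid setting each function would be a separate service, possibly running at a different site.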

GridIR will also need to impose requirements on the IR service specific to the Grid, including:

  • Rapid update schedules for datasets
  • Federation of datasets from multiple sources
  • Enabling local policy for dataset content access, based on Grid security infrastructure
  • Sophisticated localized indexing and query processing appropriate for each dataset
  • Sophisticated post-hoc results ranking
  • Efficient use of computational resources (e.g., multiple harvesters feeding one indexer)
  • Multimedia capabilities (incorporation of special-purpose IR systems into one meta-system)
  • Rapid rendering and context-switching, including data visualization of results and multiple 'views' of data based on different user profiles
  • Consensus-based results generation from multiple retrieval algorithms to select best-of-breed algorithms

2. Define a set of GridIR specifications:

The Open Grid Services Architecture (OGSA), along with technologies such as the Web Services Flow Language (WSFL), provides a framework for linking loosely coupled grid services together to form more advanced services. Though these technologies provide the infrastructure, each service description must be created by stakeholder communities to ensure the required functionality. The GridIR WG will develop an overarching IR architecture, detail service-level requirements, establish independent service models, and develop interface specifications for the various independent IR-related services, all with an eye towards tying those services together into an integrated whole. The WG will work to develop a plug-and-play architecture for GridIR in which the Grid infrastructure enables rapid integration of standards-compliant IR modules. In many cases, GridIR will allow communication between modules (e.g., for multiple harvesters feeding an indexer). Anticipated individual services include crawlers, indexers, search engines and presentation engines.
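
The plug-and-play idea can be sketched in Python with structural interfaces: any module that satisfies the interface can be swapped in. The `Harvester` and `Indexer` protocols, the toy classes and the `feed` helper below are hypothetical names chosen for illustration. They are not part of OGSA, WSFL or any GridIR specification, and real services would be described by their service interfaces rather than by in-process classes.

```python
from typing import Iterable, Protocol, Tuple

Document = Tuple[str, str]  # (document id, text)


class Harvester(Protocol):
    """Assumed interface: anything that can yield documents."""
    def harvest(self) -> Iterable[Document]: ...


class Indexer(Protocol):
    """Assumed interface: anything that can accept documents."""
    def add(self, doc: Document) -> None: ...


class ListHarvester:
    """Toy harvester that yields documents from an in-memory list."""

    def __init__(self, docs):
        self._docs = list(docs)

    def harvest(self):
        yield from self._docs


class SetIndexer:
    """Toy indexer: maps each term to the set of document ids containing it."""

    def __init__(self):
        self.postings = {}

    def add(self, doc):
        doc_id, text = doc
        for term in text.lower().split():
            self.postings.setdefault(term, set()).add(doc_id)


def feed(harvesters: Iterable[Harvester], indexer: Indexer) -> int:
    """Drain every harvester into one indexer; return the number of documents fed."""
    count = 0
    for harvester in harvesters:
        for doc in harvester.harvest():
            indexer.add(doc)
            count += 1
    return count


# Two harvesters (e.g., at different sites) feeding a single indexer.
site_a = ListHarvester([("a1", "grid services"), ("a2", "information retrieval")])
site_b = ListHarvester([("b1", "grid retrieval")])
indexer = SetIndexer()
fed = feed([site_a, site_b], indexer)
```

Because `feed` depends only on the two interfaces, either toy class could be replaced by another compliant module without changing the calling code.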

3. Support and Evaluate GridIR Reference Implementations

Reference implementations of the GridIR specifications open numerous areas for investigation. The reference implementations will address many of the following IR considerations:

  • Extremely large collections (billions of documents)
  • Documents in plain text, HTML, XML
  • Multimedia documents (video, audio, other non-text formats)
  • Documents in multiple languages; queries in multiple languages
  • A wide variety of harvesting methods
  • Numerous fundamental IR algorithm components (Boolean, Vector Space Model, probabilistic IR, PageRank, Latent Semantic Indexing...)
  • Flexible local policy for what documents are allowed
  • Sub-document retrieval, linguistic approaches, question answering
  • Long and short queries; document filtering

Solutions exist for most of these IR techniques, although some do not scale well or are poorly suited to the distributed processing of the Grid.

GridIR will benefit from past experiences in networked IR. For example, Z39.50 offers the ability to send a query to multiple IR engines. GridIR will take Z39.50 further by layering IR on the Grid security and authentication infrastructure, and by providing sophisticated techniques for merging and ranking the results from the engines.
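
One simple way to merge results from several engines, shown here only as an example, is to rescale each engine's scores to a common range and add them (often called CombSUM). The sketch below assumes each engine returns a mapping from document id to a raw score; the engine data and function names are hypothetical, and the charter does not specify which merging technique GridIR will use. Security and authentication are outside the scope of this sketch.

```python
from collections import defaultdict


def normalize(results):
    """Min-max scale one engine's scores to [0, 1] so engines are comparable."""
    if not results:
        return {}
    lo, hi = min(results.values()), max(results.values())
    if hi == lo:
        # All scores equal: treat every document as equally relevant.
        return {doc: 1.0 for doc in results}
    return {doc: (s - lo) / (hi - lo) for doc, s in results.items()}


def merge(result_sets):
    """CombSUM: add normalized scores across engines, then rank descending."""
    combined = defaultdict(float)
    for results in result_sets:
        for doc, score in normalize(results).items():
            combined[doc] += score
    # Ties broken by document id so the ordering is deterministic.
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical raw scores from two engines that use different scales.
engine_a = {"d1": 12.0, "d2": 8.0, "d3": 4.0}
engine_b = {"d2": 0.9, "d4": 0.5, "d1": 0.1}
merged = merge([engine_a, engine_b])
```

With these inputs, "d2" ranks first because both engines score it well, while "d1" is ranked highly by only one engine. Normalizing first matters because raw scores from independent engines are generally not comparable.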

To support the evaluation of the reference implementations, the GridIR WG will promote the development of test suites that can be used to validate an implementation and provide a basis for comparing implementations.

4. Establish Best Practices for GridIR

The GridIR WG will establish best practices for implementing and using GridIR by collecting and disseminating experiences. Furthermore, the GridIR WG will ensure that the best practices are consistent with the work of the other Grid Services groups, which define services needed to implement GridIR, and of the Portals, Users, and Applications groups, which will use the GridIR services.

Milestones

  • GridIR Requirements Document - Stakeholder-driven list of service-level requirements for building a grid-based IR system. Revised draft by GGF7, finalize by GGF8.
  • GridIR Architecture Document - Describes the overall system, composed of integrated grid services, along with usage scenarios, etc. First draft by GGF7, revisions by GGF8, finalize by GGF9.
  • GridIR Services Document - Describes each service in detail, with an emphasis on WSDL interface specification. First draft by GGF7, revisions by GGF8, finalize by GGF9.

Website

http://www.gir-wg.org/
