GRID IR - GRID Information Retrieval
Matthew J. Dovey
History and Background
Information retrieval (IR) is the science and practice of identifying documents or sub-documents that meet information needs. Usually, IR deals with textual documents in semi-structured (e.g., HTML, XML) or unstructured (plain text) format. However, there are branches of IR now working on multimedia formats such as pictures, audio or video, both in raw and structured formats (such as MPEG-7). Information needs might include questions to answer, a topical description, specifications dealing with time and space, or other needs with varying levels of specificity. Information needs are expressed as queries. Thus, at its base level, IR consists of matching queries to the best documents (or document fragments) and ranking the matches.
IR systems are not the same as typical database systems: databases emphasize organization and retrieval of very structured data (tables, different data types), whilst IR focuses on less structured data or documents. Databases often receive frequent updates and can have complex relations among data, albeit of a limited type. IR systems, on the other hand, usually have relatively infrequent updates and are interested in identifying and utilising the many types of relationships between data. Also, whilst databases focus on the retrieval of sets and lists corresponding to fairly strict criteria, IR focuses on relevance ranking based on weighting factors, so that only part of the (potentially huge) response set is presented to the user.
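As a minimal sketch of relevance ranking by weighting factors, the following Python example scores a toy collection with a simple smoothed TF-IDF weight and presents only the top k documents. The scoring formula, the function names (`tokenize`, `rank`) and the sample documents are illustrative assumptions, not part of any GridIR specification.

```python
import math
from collections import Counter


def tokenize(text):
    """Split text into lowercase whitespace-separated terms."""
    return text.lower().split()


def rank(query, docs, k=2):
    """Return the top-k (doc_id, score) pairs for a query.

    docs maps doc_id -> text. The score is a sum of term frequency
    times a smoothed inverse document frequency (an assumed weighting).
    """
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    n = len(docs)
    scores = {}
    for doc_id, term_counts in counts.items():
        score = 0.0
        for term in tokenize(query):
            df = sum(1 for c in counts.values() if term in c)
            if df == 0:
                continue  # term occurs nowhere in the collection
            idf = math.log((n + 1) / (df + 1)) + 1.0
            score += term_counts[term] * idf
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))[:k]


# Hypothetical sample collection.
DOCS = {
    "d1": "grid computing on a network of computers",
    "d2": "information retrieval ranks documents against queries",
    "d3": "retrieval of documents on the grid",
}

top = rank("grid retrieval", DOCS, k=2)
```

Unlike a database query, which would return every row satisfying a strict predicate, `rank` orders all partially matching documents and truncates the list, so the user sees only the most relevant part of the response set.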
Quite simply, GridIR is IR on the Grid! Grid computing is the accomplishment of computational tasks on a set of computers connected by a network. This is similar to distributed computing, except with a more finely grained implementation for task assignment and coordination among the grid elements. Grid computing also has a security model built in, such that a desired level of security and access control may be implemented at all levels of the grid infrastructure. GridIR is a newly proposed initiative to implement a specific architecture for realizing IR on the OGSA grid computing platform. Traditional IR models are broken into constituent pieces and described as OGSA Grid services. A model for interaction among these services describes the GridIR system.
GridIR offers the opportunity for improved approaches to network-based federated databases. In its current conception, GridIR fundamentally allows user information needs to be matched to documents by document collections, indexes and query engines, all of which exist as Grid services. Most IR systems have a single method for indexing and query processing, and such systems treat all documents, from all sources, equally. However, nearly all components of an IR system can run in parallel. GridIR will enable federations of document collections, each tuned for optimal performance. GridIR's security model will also allow document collections to be "published" to the grid, but with access control lists that might limit who can query the collection or to whom results may be shown.
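The combination of federation, parallel querying and access control lists can be sketched as follows. This is a hypothetical in-process model, not the GridIR or OGSA interface: the `Collection` class, the `federated_search` function, the rule that an empty access list means "public", and the sample data are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class Collection:
    """A published document collection with an optional access list."""
    name: str
    docs: dict                                  # doc_id -> text
    allowed: set = field(default_factory=set)   # empty set = public (assumption)

    def may_query(self, user):
        return not self.allowed or user in self.allowed

    def search(self, term):
        """Return sorted ids of documents containing the term."""
        term = term.lower()
        return sorted(
            doc_id for doc_id, text in self.docs.items()
            if term in text.lower().split()
        )


def federated_search(user, term, collections):
    """Query, in parallel, every collection the user is allowed to see."""
    visible = [c for c in collections if c.may_query(user)]
    if not visible:
        return {}
    with ThreadPoolExecutor(max_workers=len(visible)) as pool:
        hits = list(pool.map(lambda c: c.search(term), visible))
    return {c.name: h for c, h in zip(visible, hits)}


# Hypothetical sample federation.
public = Collection("public", {"p1": "grid services", "p2": "plain text"})
private = Collection("private", {"s1": "secure grid index"}, allowed={"alice"})

alice_hits = federated_search("alice", "grid", [public, private])
bob_hits = federated_search("bob", "grid", [public, private])
```

Here each collection answers with its own search method, so a real deployment could tune indexing and query processing per collection while the caller only sees the collections its credentials permit.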
Overall, use of the GRID can offer the following benefits to IR systems: the use of divide and conquer approaches to collections, indexing and querying, thus allowing IR on larger collections or the use of more complex techniques that would be too slow or complex on non-GRID systems; the ability to tune collections for better retrieval; event-triggered re-indexing via push or pull architectures; security at the collection, query and document level; and mechanisms for collecting, weighing and ranking results from different sources, so that multiple algorithms or IR approaches might be applied to the same data set, with merging and ranking methods then determining the overall ranking of the merged results. IR would also be of use to the GRID in general, in that it could provide: embedded and integrated indexing of content that is agnostic about content type, markup, size and location; the publication of local documents or other collections in indexed and searchable formats; a framework for updating collections by push or pull mechanisms; and a resource discovery infrastructure for changeable, indexable content.
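One possible way to collect, weigh and rank results from different sources is sketched below. It assumes each source returns (document id, raw score) pairs on its own scale, normalises each list by its maximum score, applies an optional per-source weight and sums the contributions. The function name `merge_results`, the normalisation scheme and the sample scores are assumptions for illustration; the source does not prescribe a particular merging method.

```python
def merge_results(result_lists, weights=None, k=3):
    """Merge per-source ranked lists into one overall ranking.

    result_lists maps source name -> list of (doc_id, raw_score).
    weights maps source name -> weight (default 1.0 for each source).
    """
    weights = weights or {}
    merged = {}
    for source, results in result_lists.items():
        if not results:
            continue
        top = max(score for _, score in results)
        weight = weights.get(source, 1.0)
        for doc_id, score in results:
            # Max-normalise so sources with different scales are comparable.
            normalised = score / top if top > 0 else 0.0
            merged[doc_id] = merged.get(doc_id, 0.0) + weight * normalised
    # Highest combined score first; ties broken by document id.
    return sorted(merged.items(), key=lambda pair: (-pair[1], pair[0]))[:k]


# Hypothetical results from two query engines over overlapping data.
RESULTS = {
    "engine_a": [("a1", 10.0), ("x", 5.0)],
    "engine_b": [("x", 0.8), ("b1", 0.4)],
}

overall = merge_results(RESULTS)
```

A document found by several sources (here `x`) accumulates score from each, which is one simple way to reward agreement between independent algorithms applied to the same data set.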
The GRID IR architecture is still very much a work in progress and will be the primary output of a new Working Group proposed to the Global Grid Forum. It is based on OGSA Grid Services and satisfies the requirements for distributed and asynchronous operation. It uses standard, simple formats for the exchange of queries and documents, and satisfies the need to be independent of content and query type. It reuses previously standardized query and content types where possible (e.g. Z39.50 RPN queries, attribute sets, etc.) and takes advantage of existing work to move existing IR standards (such as Z39.50) to the Web Services platform. GRID IR breaks down into the following basic GRID Services:
Within GRID IR, GRID technologies will offer new techniques and powerful information retrieval systems, whilst IR will offer the GRID new ways of handling and discovering information. We envision a future where people will all have their own IR systems that are attuned to their ongoing information needs, preferences and interests. On the grid, it is feasible to implement grid elements that have knowledge of the history of a user and her needs, and that take these needs into account using merging and ranking algorithms already in place for combining sets from federated collections. This can offer new ways of researching IR whilst also allowing new ways of handling information, both within GRID research projects and in general.
 GRID-IR. http://www.gridir.org