GIR Minutes

GGF7, March 4-7, 2003, Tokyo, Japan

GIR WORKING GROUP
GGF7, Tokyo, Japan
March 5, 2003

Discussants:
Kevin Gamiel (MCNC/CNIDR)
Gregory Newby (Arctic Region Supercomputing Center)

Participants:
Steve Fisher (RAL)
Tom Jackson (University of York)
Kazuo Kojo
Domenico Laforenza (Information Science and Technologies Institute)
Patricio Ortiz (Dep. Physics & Astronomy, University of Leicester)
Nassib Nassar (Etymon)
(and others)

Mr. Gamiel began by asking the meeting participants to record their names on the attendance list. He then directed the participants' attention to the GGF's IP policy statement and mentioned that the statement could be found linked from the GGF web site. Next he presented the agenda for the meeting, which he said had been posted for about a month [ed.: on the GridIR web site]. Following this he asked the participants how many of them were familiar with GridIR. About half indicated that they were. He then asked if there were any questions about the agenda. No one responded.

Mr. Gamiel proceeded to discuss the status of the working group. He presented the "GIR WG Status" and "GIR WG Charter" slides.

Dr. Newby next presented several slides: "What is GridIR?," "What is GridIR, part 2," "Where are we now?," and "Old Business: Relationships to other WG/RGs."

Dr. Laforenza asked whether the group had any people from Google or Yahoo involved with GridIR.

Dr. Newby replied, mentioning first that he did not expect very much involvement from the search engine vendors. He stated that he had talked with someone from Google and that Google do not see the problem [ed.: that is, the problem(s) addressed by GridIR]. He continued that Google's view is, "monolithic systems are good." He said that Google do not see GridIR as "competition" but that this might change if the working group were to be successful. He also said that GridIR is in a difficult situation because Google's system provides high quality results, while GridIR in the reference implementations would not return results quite so fast as Google.

Mr. Gamiel added to this by pointing out that GridIR has open source systems. He mentioned that CNIDR have the Isite and Isearch software, and that Etymon have the Amberfish software. He stated that the working group welcome other vendors as well. He said also that there had been initial communications with OCLC and that they had shown tremendous interest. He continued that OCLC have been around and involved for years with standards such as Z39.50, and that they seem to be interested in pushing GridIR. He said that the working group look forward to having them join the group.

Dr. Fisher stated that monolithic systems do well with static documents. He added that in particular communities there are data sets not available to Google, etc., that GridIR might be able to help make available.

Dr. Newby replied that, yes, this is exactly the working group's point. He also mentioned that GridIR is not specifically concerned with databases (although it could interact with them) and would include XML structured and unstructured data as well.

Mr. Gamiel added that GridIR enables what is sometimes called "deep web searching" and searching of metadata that would lead the searcher to data.

Dr. Newby proceeded to the next slide, "Old Business: Relevant Research and Products to GridIR." He pointed out that GridIR is not a database such as the Oracle and DB2 systems. Those systems, he explained, address problems separate from those addressed by GridIR and in particular do not focus on the problems related to GridIR's collection manager and query type functionality.

Mr. Gamiel presented the "Reference Implementation Status" slide. He requested that participants contribute suggestions for IR systems to be interfaced with the GridIR reference implementation currently under development. He stated that CNIDR would like to work with people to get systems integrated with the reference implementation. Dr. Newby requested that participants contribute data collections as well.

Dr. Newby next presented the "Use Scenarios" and "Draft Requirements Document" slides. He then opened a discussion on the Requirements document.

Mr. Gamiel followed with an overview and some discussion of each requirement listed in the document, beginning with the "Distributed," "Asynchronous notification," and "Event-driven operation" requirements.

Dr. Fisher pointed out that the "Distributed" support requirement should be classified as a "must" rather than "should." Mr. Gamiel confirmed that the correction was accurate.

Mr. Gamiel discussed the "Service-level authentication," "Index-level authentication," "Record-level authentication," and "Encryption" requirements. Dr. Fisher asked if "authentication" on some of the requirements should really be "authorization." Mr. Gamiel replied, yes. Dr. Newby added that there was a component of both: authentication at a high level, and maybe authorization at a low level.

Mr. Gamiel discussed the "Data transport mechanism and protocol agnostic," "Service persistence," and "Query persistence" requirements. Dr. Newby said that the query processor is one of the most interesting parts of GridIR, and that it is very hard to come by in other IR systems. He pointed out that it would enable standing queries, etc., and that this is very different from typical IR systems. He added that it would rely very much on Grid push-pull capabilities.

Mr. Gamiel continued with a discussion of the "Metadata services" requirement. Dr. Ortiz stated that different communities use different words for things that are conceptually similar, and he asked whether GridIR will use metadata in that sense. Mr. Gamiel replied that this would typically take place in the query processor, which is able to address query expansion. Dr. Newby pointed out that people might not realize why this does not already happen when in theory it seems that it should. The problem, he explained, is that it is difficult (i.e. computationally expensive) to do query expansion, synonyms, etc. He added that this is a known research problem, and GridIR does not attempt to solve it; however, it does change the IR framework so that it would be very acceptable to have a smaller collection with vocabulary mapping and perform more expensive queries, by doing event-driven asynchronous queries. Mr. Gamiel said that from the beginning the intention has been to split up the traditional monolithic IR system, freeing researchers to concentrate on particular components.

Mr. Gamiel continued with a discussion of the "Data collection description services," "Data collection scheduling services," "Data collection delivery services," and "Document content type independent" requirements. Dr. Newby mentioned that MIME types would be used to communicate document types.

Mr. Gamiel then proceeded with the remaining requirements: "Document transformation capabilities," "Index generation services based on input collections," "Index query services generating result sets," "Query type independent," "Result type independent," "Result set delivery services," "Data presentation services from result sets," "Index scan services," "Peer-to-peer communication model," "Client-server communication model," "Merging," and "Multi-lingual capable."

Dr. Newby next discussed the "Requirements Overview" slide. He said that the group would discover what is missing or needs elaboration via the reference implementation process. He added that it would be very important to work with others on the reference implementation in order to test out the requirements.

Dr. Newby posed two questions to the participants to gauge consensus: (1) "Based on what you have read and seen, does GridIR seem to be on the right track? Have we made a reasonable and clear case?" (2) "Does the Requirements document seem to be heading in the right direction?" The majority indicated a positive response to both questions.

Dr. Newby asked the participants to subscribe to the mailing list [ed.: if they have not already] and said that the mailing list has had fairly low traffic so far. He also presented the "Further new business?" slide and asked if anyone had new items of business to bring up. (There were none.)

Mr. Gamiel asked if there were any questions or concerns at this point about the current state of things or where the group is headed.

Dr. Kojo stated that the project is very interesting. He said his only concern was that it was too huge to be handled by the limited resources of the people present. Dr. Newby said that Google makes search seem very easy, as if it were a solved problem. He said that the group's experience is that it is not only difficult to create a single search engine but that there are significant challenges with query processing, etc. Therefore he agreed that it could be huge but said that the group is trying to pare down the problem. Mr. Gamiel replied that the point was well taken. He said that when the group broke the problem down, it felt that, for example, although the collection manager could be split off into other working groups, at the moment it was not being handled in the GGF.

Dr. Ortiz said that different communities use different terms and that he wants to use one word to search for something. He said that the scientific and other communities using GridIR will have to make an effort to put some knowledge into the search engines. Mr. Gamiel replied that, yes, this has been a problem going back to WAIS; a huge problem. He said it was a library science problem, a computer science problem, etc., rolled into one. He added that the group hopes to address it to some extent through the query processor, etc.

Mr. Gamiel then presented a set of slides, "GridIR Architecture Tutorial" (prepared at MCNC/CNIDR by Mr. Jeremiah Morris), including the following slides: "Architectural Overview," "Document Updates: Configuration," "Document Updates: Update Detection," "Document Updates: Notification," "Document Updates: Persistent Query," "User Query: Database Metadata Discovery," "User Query: Anatomy of an Indexer," "User Query: Submittal," and "User Query: Creation and Expansion." At this point Mr. Gamiel ended the presentation due to shortness of time, and he said that the slides would be available on the web site.

Dr. Newby said that the group welcomes help and values input of the participants. He added that he expects some sort of reference implementation to be ready by around GGF8.

Dr. Jackson enquired about the group working with the semantic grid people. Mr. Gamiel replied that, yes, he agreed that this is a good idea, and that the group would make contact with them.
