GGF7, March 4-7, 2003, Tokyo, Japan
GIR WORKING GROUP
GGF7, Tokyo, Japan
March 5, 2003
Discussants:
Kevin Gamiel (MCNC/CNIDR)
Gregory Newby (Arctic Region Supercomputing Center)
Participants:
Steve Fisher (RAL)
Tom Jackson (University of York)
Kazuo Kojo
Domenico Laforenza (Information Science and Technologies Institute)
Patricio Ortiz (Dep. Physics & Astronomy, University of Leicester)
Nassib Nassar (Etymon)
(and others)
Mr. Gamiel began by asking the meeting participants
to record their names on the attendance list. He then directed the
participants' attention to the GGF's IP policy statement and mentioned
that the statement could be found linked from the GGF web site.
Next he presented
the agenda for the meeting, which he said had been posted for about
a month [ed.: on the GridIR web site]. Following this
he asked the participants how many of them were familiar with GridIR.
About half indicated that they were. He then asked if there were
any questions about the agenda. There were none.
Mr. Gamiel proceeded
to discuss the status of the working group. He presented the "GIR
WG Status" and "GIR WG Charter" slides.
Dr. Newby
next presented several slides: "What is GridIR?," "What
is GridIR, part 2," "Where are we now?," and "Old
Business: Relationships to other WG/RGs."
Dr. Laforenza asked
whether the group had any people from Google or Yahoo involved with
GridIR.
Dr. Newby replied, mentioning first that he did
not expect very much involvement from the search engine vendors.
He stated that he had talked with someone from Google and that Google
do not see the problem [ed.: that is, the problem(s) addressed by
GridIR]. He continued that Google's view is, "monolithic systems
are good." He said
that Google do not see GridIR as "competition" but that
this might change if the working group were to be successful. He
also said that GridIR is in a difficult situation because Google's
system provides high-quality results, while the GridIR reference
implementations would not return results quite as fast as Google's.
Mr.
Gamiel added to this by pointing out that GridIR has open source
systems. He mentioned that CNIDR have the Isite and Isearch software,
and that Etymon have the Amberfish software. He stated that the
working group welcome other vendors as well. He said also that there
had been initial communications with OCLC and that they had shown
tremendous interest. He continued that OCLC have been around and
involved for years with standards such as Z39.50, and that they
seem to be interested in pushing GridIR. He said that the working
group look forward to having them join.
Dr. Fisher stated that
monolithic systems do well with static documents. He added that
in particular communities there are data sets not available to Google,
etc., that GridIR might be able to help make available.
Dr. Newby
replied that, yes, this is exactly the working group's point. He
also mentioned that GridIR is not specifically concerned with databases
(although it could interact with them) and would include XML structured
and unstructured data as well.
Mr. Gamiel added that GridIR enables
what is sometimes called "deep web searching" and searching
of metadata that would lead the searcher to data.
Dr. Newby proceeded
to the next slide, "Old
Business: Relevant Research and Products to GridIR." He pointed
out that GridIR is not a database such as the Oracle and DB2 systems.
Those systems, he explained, address problems separate from those
addressed by GridIR and in particular do not focus on the problems
related to GridIR's collection manager and query type functionality.
Mr.
Gamiel presented the "Reference Implementation Status" slide.
He requested that participants contribute suggestions for IR systems
to be interfaced with the GridIR reference implementation currently
under development. He stated that CNIDR would like to work with
people to get systems integrated with the reference implementation.
Dr. Newby requested that participants contribute data collections
as well.
Dr. Newby next presented the "Use Scenarios" and "Draft
Requirements Document" slides. He then opened a discussion
on the Requirements document.
Mr. Gamiel followed with an overview
and some discussion of each requirement listed in the document,
beginning with the "Distributed," "Asynchronous notification,"
and "Event-driven operation" requirements.
Dr.
Fisher pointed out that the "Distributed" support
requirement should be classified as a "must" rather than "should." Mr.
Gamiel confirmed that the correction was accurate.
Mr. Gamiel discussed
the "Service-level authentication," "Index-level
authentication," "Record-level
authentication," and "Encryption" requirements. Dr.
Fisher asked if "authentication" on
some of the requirements should really be "authorization." Mr.
Gamiel replied, yes. Dr. Newby added that there was a component
of both: authentication at a high level, and maybe authorization
at a low level.
Mr. Gamiel discussed the "Data transport mechanism
and protocol agnostic," "Service persistence," and "Query
persistence" requirements. Dr. Newby said that the query processor
is one of the most interesting parts of GridIR, and that it is very
hard to come by in other IR systems. He pointed out that it would
enable standing queries, etc., and that this is very different from
typical IR systems. He added that it would rely very much on Grid
push-pull capabilities.
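[ed.: The following sketch was not presented at the meeting. It is an
illustration only of what a "standing query" could look like: a query is
registered once, and the query processor is notified whenever a document
is added or updated, rather than the user re-running the search. The
names QueryProcessor, register, and on_document_update are hypothetical
and are not taken from the GridIR requirements or the reference
implementation; the all-terms-present matching rule is likewise an
assumption.]

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List


@dataclass
class StandingQuery:
    """A query that stays registered and fires on matching updates."""
    terms: FrozenSet[str]
    callback: Callable[[str], None]


class QueryProcessor:
    """Hypothetical event-driven processor holding persistent queries."""

    def __init__(self) -> None:
        self._queries: List[StandingQuery] = []

    def register(self, terms, callback) -> None:
        # Store the query; nothing is searched at registration time.
        normalized = frozenset(t.lower() for t in terms)
        self._queries.append(StandingQuery(normalized, callback))

    def on_document_update(self, doc_id: str, text: str) -> None:
        # Called (pushed) when a collection reports a new or changed document.
        words = set(text.lower().split())
        for query in self._queries:
            # Assumed rule: every query term must occur in the document.
            if query.terms <= words:
                query.callback(doc_id)


# Usage: register once, then let update events drive the notifications.
hits: List[str] = []
processor = QueryProcessor()
processor.register(["grid", "retrieval"], hits.append)
processor.on_document_update("doc-1", "Grid information retrieval minutes")
processor.on_document_update("doc-2", "Unrelated text")
```

[ed.: In this sketch only "doc-1" triggers the callback. A real system
would presumably deliver notifications through Grid services rather than
an in-process callback.]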
Mr. Gamiel continued with a discussion of
the "Metadata services" requirement. Dr. Ortiz stated
that different communities use different words for things that are
conceptually similar, and he asked whether GridIR will use metadata
in that sense. Mr. Gamiel replied that this would typically take
place in the query processor, which is able to address query expansion.
Dr. Newby pointed out that people might not realize why this does
not already happen when in theory it seems that it should. The problem,
he explained, is that it is difficult (i.e., computationally expensive)
to perform query expansion, synonym matching, etc. He added that this is a
known research problem, and GridIR does not attempt to solve it; however,
it does change the IR framework so that it would be acceptable
to have a smaller collection with vocabulary mapping and to perform
more expensive queries, by doing event-driven asynchronous queries.
Mr. Gamiel said that from the beginning the intention has been to
split up the traditional monolithic IR system, freeing researchers
to concentrate on particular components.
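[ed.: The following sketch was not presented at the meeting. It is an
illustration only of query expansion through a vocabulary mapping, in
which one search word is expanded to the terms that another community
uses for the same concept. The vocabulary entries, the sample collection,
and the function names expand_query and search are hypothetical, and
simple substring matching stands in for a real index.]

```python
from typing import Dict, Iterable, List, Set

# Hypothetical community vocabulary: search word -> equivalent terms.
VOCABULARY: Dict[str, Set[str]] = {
    "star": {"stellar object"},
}

# Hypothetical small collection: document id -> text.
COLLECTION: Dict[str, str] = {
    "a": "Catalogue of stellar objects",
    "b": "Survey of galaxies",
    "c": "Bright star list",
}


def expand_query(terms: Iterable[str], vocabulary: Dict[str, Set[str]]) -> Set[str]:
    """Return the original terms plus any mapped equivalents."""
    expanded: Set[str] = set()
    for term in terms:
        key = term.lower()
        expanded.add(key)
        expanded |= vocabulary.get(key, set())
    return expanded


def search(collection: Dict[str, str], terms: Iterable[str],
           vocabulary: Dict[str, Set[str]]) -> List[str]:
    """Return sorted ids of documents containing any expanded term."""
    wanted = expand_query(terms, vocabulary)
    return sorted(
        doc_id
        for doc_id, text in collection.items()
        if any(term in text.lower() for term in wanted)
    )


without_mapping = search(COLLECTION, ["star"], {})
with_mapping = search(COLLECTION, ["star"], VOCABULARY)
```

[ed.: Here the unexpanded search finds only document "c", while the
expanded search also finds document "a". The extra work per query is the
kind of cost that asynchronous, event-driven queries would make easier to
tolerate.]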
Mr. Gamiel continued with
a discussion of the "Data
collection description services," "Data collection scheduling
services," "Data collection delivery services," and "Document
content type independent" requirements. Dr. Newby mentioned
that MIME types would be used to communicate document types.
Mr. Gamiel then proceeded with the remaining requirements: "Document
transformation capabilities," "Index generation services
based on input collections," "Index query services generating
result sets," "Query type independent," "Result type independent,"
"Result set delivery services," "Data presentation services from
result sets," "Index scan services," "Peer-to-peer communication model,"
"Client-server communication model," "Merging," and "Multi-lingual
capable."
Dr. Newby next discussed the "Requirements
Overview" slide.
He said that the group would discover what is missing or needs elaboration
via the reference implementation process. He added that it would
be very important to work with others on the reference implementation
in order to test out the requirements.
Dr. Newby posed two questions
to the participants to gauge consensus: (1) "Based on what
you have read and seen, does GridIR seem to be on the right track?
Have we made a reasonable and clear case?" (2) "Does
the Requirements document seem to be heading in the right direction?" The
majority indicated a positive response to both questions.
Dr. Newby
asked the participants to subscribe to the mailing list [ed.: if
they have not already] and said that the mailing list has had fairly
low traffic so far. He also presented the "Further new business?" slide
and asked if anyone had new items of business to bring up. (There
were none.)
Mr. Gamiel asked if there were any questions or
concerns at this point about the current state of things or where
the group is headed.
Dr. Kojo stated that the project is very interesting.
He said his only concern was that it was too huge to be handled
by the limited resources of the people present. Dr. Newby said that
Google makes search seem very easy, as if it were a solved problem. He said that
the group's experience is that it is not only difficult to create
a single search engine but that there are significant challenges
with query processing, etc. Therefore he agreed that it could be
huge but that the group is trying to pare down the problem. Mr.
Gamiel replied that the point was well taken. He said that when
the group broke the problem down, it felt that, for example, although
the collection manager could be split off into other working groups,
at the moment it was not being handled in the GGF.
Dr. Ortiz said
that different communities use different terms and that he wants
to use one word to search for something. He said that the scientific
and other communities using GridIR will have to make an effort to
put some knowledge into the search engines. Mr. Gamiel replied that,
yes, this has been a problem going back to WAIS; a huge problem.
He said it was a library science problem, a computer science problem,
etc., rolled into one. He added that the group hopes to
address it to some extent through the query processor, etc.
Mr. Gamiel
then presented a set of slides, "GridIR Architecture Tutorial" (prepared
at MCNC/CNIDR by Mr. Jeremiah Morris), including the following slides: "Architectural
Overview," "Document
Updates: Configuration," "Document Updates: Update Detection," "Document
Updates: Notification," "Document Updates: Persistent
Query," "User Query: Database Metadata Discovery," "User
Query: Anatomy of an Indexer," "User Query: Submittal," and "User
Query: Creation and Expansion." At this point Mr. Gamiel ended
the presentation due to shortness of time, and he said that the
slides would be available
on the web site.
Dr. Newby said that the group welcomes help and
values the input of the participants. He added that he expects some
sort of reference implementation to be ready by around GGF8.
Dr.
Jackson enquired about the group working with the semantic grid
people. Mr. Gamiel replied that, yes, he agreed that this is a good
idea, and that the group would make contact with them.