A SCALABLE FRAMEWORK AND PROTOTYPE FOR CAS E-SCIENCE

Based on the Small-World model of CAS e-Science and the power law of the Internet, this paper presents a scalable CAS e-Science Grid framework based on virtual regions, called the Virtual Region Grid Framework (VRGF). VRGF takes the virtual region and the layer as its logical management units. In VRGF, the intra-virtual region mode is pure P2P, and the inter-virtual region mode is centralized. Therefore, VRGF is a decentralized framework with some P2P properties. Furthermore, VRGF achieves satisfactory performance in resource organization and location at a small cost and is well adapted to the complicated and dynamic features of scientific collaborations. We have implemented a demonstration VRGF-based Grid prototype, SDG.


INTRODUCTION
"E-Science is about global collaboration in key areas of science and the next generation of infrastructure that will enable it" (Taylor, 2002). E-Science enables scientists to generate, analyze, share, and discuss their insights, experiments, and results in a more effective manner. These experiments involve geographically distributed and heterogeneous resources such as computational resources, scientific instruments, databases, and applications. The data in these experiments are usually massive and distributed across numerous institutions for various reasons, including the inherent distribution of data sources; large-scale storage and computational requirements; the need to ensure high availability and fault tolerance of data; and caching to provide faster access. CAS (the Chinese Academy of Sciences) e-Science (Nan, 2002) is built upon the mass scientific data resources of the Scientific Database (SDB), in which multi-disciplinary scientific data are accumulated through the course of scientific activities in the CAS. The nodes in CAS e-Science are located at various institutes, each of which has a specific research domain. Scientific activities in the CAS have Small-World characteristics. For example, in the scientific collaboration graph, the nodes are research institutes, and two research institutes are connected if they have the same service properties and research domains. The vision of CAS e-Science is to take full advantage of valuable data resources by benefiting from advanced information technologies, Grid technology and P2P technology in particular.
Representative Grid systems, such as Globus (Foster, 1997) and Web services (W3C, 2003), provide flexible and uniform access interfaces to all types of resources through grid services or web services and are used to build grid applications. However, in these systems, the computing mode is client/server, and services are published and discovered in a centralized mode, which scales poorly and creates a single point of failure. Some P2P systems, e.g. Napster (Napster website), are distributed file systems with a centralized mode, in which a server manages the directory. They provide strong capabilities for managing open file-sharing systems. However, because they still use a centralized mode, they have a single point of failure. Other P2P systems, such as Gnutella (Gnutella website) (Zeinalipour, 2002) and Freenet (Clarke, 2002), are pure P2P systems with a completely decentralized structure. They have the great advantage of having no bottlenecks and good robustness. However, they face challenges such as security, network bandwidth, and architectural design, and they make it difficult to search for services that are clustered together as described by WSDL (Christensen, 2001) and GSDL (Foster, 2002). They also have poor scalability and low efficiency because of the exponential growth of redundant information. VDHA (Huang, 2002) is a virtual and dynamic hierarchical architecture, in which Grid nodes are grouped virtually. It has a decentralized architecture with some P2P properties and also has scalable, autonomous, exact, and full service discovery properties. However, it lacks a mathematical model, and the classifying strategy of the virtual organization is not clear. Sun's Project JXTA 2.0 Super-Peer Virtual Network (Traversat, 2004) is similar to VDHA.
We use the "Small-World and power law" model as a theoretical foundation and present a grid framework based on virtual regions, called the Virtual Region Grid Framework (VRGF). This framework is decentralized and scalable. Based on it, we implement a scalable Grid system, SDG (Scientific Data Grid), which combines the advantages of P2P and C/S. The rest of the paper is organized as follows. In Section 2, the VRGF model and its related protocols are described. Section 3 lays out an implementation prototype of VRGF. Finally, we give the conclusion and outline future work.

Definition and Related Concepts
Definition 1 (Power Law) - the probability that a node in a random graph has degree $k$ is expressed as $P(k) \propto k^{-\gamma}$, where $\gamma > 0$ is a constant exponent. In a network, a few nodes are of a high degree, and many nodes are of a low degree.
Therefore, there is a high probability of finding related information through the high degree nodes (Faloutsos, 1999).
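As a rough numerical illustration of Definition 1, the following sketch normalizes a power-law degree distribution over a finite range of degrees and compares the probability mass at low and high degrees. The exponent value 2.5, the degree cutoff of 100, and the function name `power_law_pmf` are assumptions chosen for illustration; the paper does not specify them.

```python
def power_law_pmf(max_degree, gamma=2.5):
    """Return [P(1), ..., P(max_degree)] with P(k) proportional to k**(-gamma).

    gamma=2.5 is an assumed exponent, not a value taken from the paper.
    """
    weights = [k ** (-gamma) for k in range(1, max_degree + 1)]
    total = sum(weights)
    return [w / total for w in weights]


pmf = power_law_pmf(100)

# Probability that a node has a low degree (1 to 3).
low_degree_mass = sum(pmf[:3])
# Probability that a node has a high degree (50 to 100).
high_degree_mass = sum(pmf[49:])

print(f"P(degree <= 3)  = {low_degree_mass:.4f}")
print(f"P(degree >= 50) = {high_degree_mass:.4f}")
```

Under these assumed parameters, almost all nodes have a very low degree and only a tiny fraction act as high-degree hubs, which is the property the framework relies on when it routes requests through a small number of well-connected nodes.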
Definition 2 (Small-World) - the Small-World is a network topology that has a large clustering coefficient and a small average path length (Kleinberg, 2000).
Definition 3 (Service Similarity) - the service similarity of two nodes is the similarity of the service properties and the research domains between the two nodes. The service similarity function between two nodes is expressed as a weighted combination of the similarity of their service properties and the similarity of their research domains, where $\mu_1$ and $\mu_2$ are the weights.
Definition 4 (Host) - a client host is a device (such as a desktop computer, PDA, or mobile computer) that is used to log into a Grid system.

Definition 5 (Virtual Region) - a Virtual Region is formed virtually by Grid nodes based on the service properties and the research domains of the nodes. Nodes within the same virtual region have high service similarity.
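To make Definitions 3 and 5 concrete, here is a minimal sketch of how a weighted service similarity could be used to group nodes into virtual regions. Everything specific in it is an assumption made for illustration: the weighted-sum form, the Jaccard measure for each component, the equal weights, the threshold of 0.5, the greedy grouping rule, and the node names and attributes. The paper's own similarity function and classifying strategy may differ.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (an assumed component measure)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0


def service_similarity(n1, n2, mu1=0.5, mu2=0.5):
    """Weighted sum of service-property and research-domain similarity.

    mu1 and mu2 play the role of the weights in Definition 3; the
    values 0.5/0.5 are assumptions.
    """
    return (mu1 * jaccard(n1["services"], n2["services"])
            + mu2 * jaccard(n1["domains"], n2["domains"]))


def group_into_regions(nodes, threshold=0.5):
    """Greedily place each node in the first region whose first member
    is similar enough; otherwise start a new region."""
    regions = []
    for node in nodes:
        for region in regions:
            if service_similarity(node, region[0]) >= threshold:
                region.append(node)
                break
        else:
            regions.append([node])
    return regions


# Hypothetical nodes, not taken from the paper.
nodes = [
    {"name": "A", "services": {"query", "compute"}, "domains": {"computer science"}},
    {"name": "B", "services": {"query"}, "domains": {"computer science"}},
    {"name": "C", "services": {"sequence-search"}, "domains": {"biology"}},
]
regions = group_into_regions(nodes)
region_names = [[n["name"] for n in region] for region in regions]
print(region_names)
```

With these inputs, nodes A and B share a domain and one service and end up in the same region, while C forms a region of its own.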
Definition 6 (Grid Node) - a grid node is an ordinary node in the Grid system.
Definition 7 (Head Node) - a head node is a Grid node that manages the virtual region. The highest-performance node in the virtual region is chosen as the head node and sits at the logical center of the virtual region. The head node provides the properties of the virtual region and the interface for inter-virtual region communication.
Definition 8 (Active Grid Node) - an active grid node is the Grid node that handles nodes joining the Grid. Each virtual region has an active grid node. In this framework, we use the head node as the active grid node. A node can take any active grid node as an entrance node to join the Grid system.
Definition 9 (Layer) - the virtual regions of layer $L_i$ are composed of the head nodes of layer $L_{i-1}$.

The Description of VRGF
The nodes of CAS e-Science are usually located in institutes. The institutes are typically grouped into virtual regions according to specific domains, and several virtual regions share a more general common domain. This is similar to Small-World networks (Duncan, 1999). Two characteristics distinguish Small-World networks: first, a small average path length, typical of random graphs; second, a large clustering coefficient that is independent of network size. The clustering coefficient of a node measures the fraction of pairs of its neighbors that are connected to each other. One can picture a Small-World as a graph constructed by loosely connecting a set of almost complete subgraphs. A Small-World example of scientific activity is the scientific collaboration graph, where the nodes are scientists, and two scientists are connected if they have the same research field. Such graphs with a Small-World character in scientific collaborations can span a variety of different domains, including physics, biomedical research, mathematics, and computer science. Likewise, CAS e-Science, in which nodes are institutes and edges are relationships among institutes having the same research domains, is a Grid system with the Small-World character.
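The two Small-World measures can be computed directly on a small graph. The sketch below uses a hypothetical six-node graph made of two complete triangles joined by a single edge, which mirrors the picture of loosely connected, almost complete subgraphs. The graph and the function names are illustrative assumptions and do not come from the paper.

```python
from collections import deque
from itertools import combinations


def clustering_coefficient(graph, node):
    """Fraction of pairs of a node's neighbors that are themselves linked."""
    neighbors = graph[node]
    if len(neighbors) < 2:
        return 0.0
    pairs = list(combinations(neighbors, 2))
    linked = sum(1 for u, v in pairs if v in graph[u])
    return linked / len(pairs)


def average_path_length(graph):
    """Mean shortest-path length over all ordered pairs of reachable nodes,
    computed with one breadth-first search per source node."""
    total, count = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        count += len(dist) - 1
    return total / count


# Two triangles (a, b, c) and (d, e, f) joined by the single edge c-d.
graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
mean_clustering = sum(clustering_coefficient(graph, n) for n in graph) / len(graph)
path_length = average_path_length(graph)
print(round(mean_clustering, 3), round(path_length, 3))
```

In this toy graph the mean clustering coefficient is about 0.78 while the average path length is only 1.8 hops, so dense local clusters coexist with short global paths.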
According to the Small-World model of CAS e-Science and the power law of the Internet, the nodes in VRGF are grouped virtually into virtual regions based on their service properties and research domains. Virtual regions are virtually hierarchical, with one root layer, several middle layers, and the lowest layer (layer 0). Figure 1 shows the network topology of VRGF. The network topology has many layers, and each layer is composed of virtual regions, which include many Grid nodes. Any node of the Grid system belongs to one or more virtual regions. All physical nodes are in layer 0 virtual regions. Among the nodes of layer 0, exactly one node (called the head node) in each virtual region is chosen to form the upper-layer (layer 1) virtual regions. From the nodes in each of these upper-layer virtual regions, one is chosen to form the next upper-layer (layer 2) virtual regions in the same way, and this is repeated until a root layer containing only one node is formed. In each virtual region, the node with the highest performance is chosen to be the head node (also the active grid node), which belongs not only to the lower layer but also to the upper layer. All active grid nodes in each layer are connected. Nodes can join and leave a virtual region dynamically.
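The layer construction described above can be sketched as a simple loop: pick the highest-performance node of every region as its head, group the heads into the regions of the next layer, and stop once a single root node remains. How heads are grouped into upper-layer regions is not spelled out here, so this sketch simply groups consecutive heads in chunks of `fanout`; that rule, the `performance` field, and all node names are assumptions for illustration only.

```python
def node(name, performance):
    """Build a hypothetical node record."""
    return {"name": name, "performance": performance}


def build_layers(layer0, fanout=2):
    """Return a list of layers; each layer is a list of regions and each
    region is a list of nodes. The last layer holds the single root node."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2 so the hierarchy shrinks")
    layers = [layer0]
    current = layer0
    while True:
        # One head per region: the node with the highest performance.
        heads = [max(region, key=lambda n: n["performance"]) for region in current]
        if len(heads) == 1:
            layers.append([heads])  # root layer with exactly one node
            return layers
        # Assumed grouping rule: consecutive heads form an upper-layer region.
        current = [heads[i:i + fanout] for i in range(0, len(heads), fanout)]
        layers.append(current)


layer0 = [
    [node("n0", 1), node("n1", 5)],
    [node("n2", 3), node("n3", 2)],
    [node("n4", 9), node("n5", 4)],
    [node("n6", 6), node("n7", 7)],
]
layers = build_layers(layer0)
names = [[[n["name"] for n in region] for region in layer] for layer in layers]
for index, layer in enumerate(names):
    print(f"layer {index}: {layer}")
```

In a real deployment the grouping of heads would presumably follow the more general common domain shared by several virtual regions rather than list order.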
In VRGF, nodes within a virtual region have high service similarity, and nodes in different virtual regions have low service similarity. A specific service request is therefore more likely to be satisfied within a virtual region than across virtual regions, and the search space for locating a service shrinks from all nodes of the Grid to the nodes of a single virtual region.
Thus, the VRGF topology has several properties: (1) high performance in locating services, with efficient avoidance of request flooding; (2) high scalability and robustness; (3) easy transfer of data streams to the active virtual region.

Figure 1. VRGF network topology
The implementation architecture includes four layers. The bottom layer is a fabric layer, which provides fundamental functions such as security management. The next layer is a transport layer, which is responsible for communication between nodes. It contains many transport protocols, such as SOAP, XML, etc. The next layer is the core layer, which deals with the messages received from or sent to the transport layer. The top layer is the application layer, which deals with client-side and server-side tasks or applications.
One working scenario is as follows. The client queries data in the Grid by sending a query message (task), which indicates the service name, search model, domain knowledge, and so on, to the head node. Then, according to the service similarity, the head node locates the related virtual regions, whose head nodes decompose the query message in their core layer and dispatch it to the nodes of the relevant virtual region. After querying their data, these nodes send response messages to the head node, which sends the result to the client.
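The working scenario can be sketched as a single routing function: the entry head node filters virtual regions by similarity to the query, each selected region's head dispatches the query to its member nodes, and the collected responses are returned to the client. The data layout, the Jaccard measure on domains, the threshold, and every name below are hypothetical and serve only to illustrate the message flow; the actual SDG message formats are not shown in the paper.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (an assumed similarity measure)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0


def handle_query(query, regions, threshold=0.5):
    """Route a query through head nodes and collect matching records."""
    results = []
    for region in regions:
        # The entry head node keeps only regions similar enough to the query.
        if jaccard(query["domains"], region["domains"]) < threshold:
            continue
        # The region's head node dispatches the query to its member nodes.
        for member in region["nodes"]:
            for record in member["services"].get(query["service"], []):
                results.append((region["head"], member["name"], record))
    return results


# Hypothetical regions and data.
regions = [
    {"head": "bio-head", "domains": {"biology"},
     "nodes": [
         {"name": "bio-1", "services": {"lookup": ["gene-A"]}},
         {"name": "bio-2", "services": {"lookup": ["gene-B"], "align": ["x"]}},
     ]},
    {"head": "chem-head", "domains": {"chemistry"},
     "nodes": [
         {"name": "chem-1", "services": {"lookup": ["compound-C"]}},
     ]},
]
query = {"service": "lookup", "domains": {"biology"}}
results = handle_query(query, regions)
print(results)
```

Because the chemistry region is filtered out before any of its nodes are contacted, the request is never flooded to unrelated nodes.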

CONCLUSIONS AND FUTURE WORK
VRGF adopts the Small-World model and the power law to organize virtual regions based on service properties and research domains, so the nodes within a virtual region have high service similarity. Therefore, VRGF is built on a mathematical model.
VRGF can solve scaling and autonomy problems and offers high-performance, accurate discovery of resources and services. There is a high probability of satisfying a specific service request within a virtual region, and the complexity of locating services is reduced from all nodes of the Grid to the nodes of a single virtual region. We have implemented a demonstration prototype.
Our further work will focus on completing the SDG, on enriching the services of SDG, and on implementing the mechanism for locating services.
is the similarity function based on domain text. The keyword-based matching algorithm is adopted in 1 because ICT and ISCAS have the same research domain name.