GRID Networking workplan: Release 1: 26-Sept-2000
Prepared by UK-HEP-GRID-NETWORK group
Contact Information:
Persons involved in preparation of this release
P.Clarke UCL
R.Cranfield UCL
G.Fayers ICSTM
J.Gordon CLRC-RAL
R.Hughes-Jones U.Manchester
P.Jeffreys CLRC-RAL
P.Kummer CLRC-Daresbury
J.MacAllister U.Oxford
D.Salmon CLRC-RAL
A.Sansum CLRC-RAL
R.Tasker CLRC-Daresbury
M.Parsons EPSS
M.Westhead EPSS
S.Saliah Brunel
Context
The purpose of this document is to identify and detail a programme of GRID-related networking workpackages which we believe should be undertaken as part of the UK initiative to develop a GRID capability.
The document includes all areas of investigation which we believe are necessary for at least the first stage of R&D. An attempt is made to quantify the staff-year resources necessary to carry out this programme.
It is important to note that this has been prepared without any constraint coming from the known available resources, i.e. it is intended to be complete rather than fitted to a limited number of staff-years. Therefore it follows that at this stage there is no implication that the persons involved in the preparation of this document have the necessary resources to carry out all of this programme. This issue is addressed separately.
Introduction
There are several clear and distinct GRID networking issues:
- Sustained high rate, robust bulk data transfer
- Use of differentiated services for time critical requirements
- Monitoring and Resource prediction metrics for use by matchmaking service
- Data flow Modelling
These items form the basis of workpackages described below.
1. Sustained bulk file transfer
GRID middleware needs to intelligently distribute data across several sites in order to be in a position to efficiently match user requests with resources. This requires reliable transfer of data entities. In the early stages for pragmatic reasons this is likely to be limited to file oriented transfer (as opposed to object oriented transfer) but in either case needs to reliably move/copy files across the network.
Generally this is done in anticipation of user requests, and thus is not time critical on the scale of minutes or hours.
Typical raw/ESD file sizes will be 1 – 10 TBytes to start with, going up to 100s of TBytes later. To set the scale: transferring a 1 TByte file "overnight" (i.e. in 10 hours maximum) requires a sustained rate of 300 Mbits/sec over the WAN. To put this in context, the ac.uk WAN provision will be 2.4 Gbits/s in 2001, rising to 10 Gbits/s in a few years; the GRID requirements thus represent a substantial fraction of this.
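The scale argument above can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 TByte = 10^12 bytes) and no protocol overhead; the function name is illustrative only. Under those assumptions the bare payload rate comes out somewhat below the 300 Mbits/sec quoted above, so that figure presumably includes headroom for overhead and retransmission (this reading is an assumption, not something the text states).

```python
def required_rate_mbit_s(size_tbytes: float, hours: float) -> float:
    """Sustained payload rate in Mbit/s needed to move size_tbytes in `hours`.

    Assumes decimal units (1 TByte = 10**12 bytes) and no protocol overhead.
    """
    bits = size_tbytes * 1e12 * 8
    seconds = hours * 3600
    return bits / seconds / 1e6


if __name__ == "__main__":
    rate = required_rate_mbit_s(1, 10)
    print(f"1 TByte in 10 h needs {rate:.0f} Mbit/s of payload")
    # Share of the 2001 ac.uk WAN provision taken by one 300 Mbit/s transfer.
    print(f"300 Mbit/s on a 2.4 Gbit/s link: {300 / 2400:.1%}")
```

The same function gives the rate for the larger files mentioned above, for example `required_rate_mbit_s(10, 10)` for a 10 TByte file.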
Issues:
- Knowledge of availability of bandwidth on bulk network provision. This is a monitoring and prediction issue. We need to use/adapt monitoring tools to predict average sustained capacity.
- Robustness of protocols for long integral transfers. We need to demonstrate that large data sets can be reliably transferred, and that protocols are resilient against interruption, congestion and failure. This requires phased testing of large transfers using dedicated PVCs where the background conditions can be controlled.
- Currently we have little experience of sustained sourcing, transfer and sinking of data at very high rates (> 600 Mbits/s). It is essential to gain experience in this core requirement of GRID operation. Partnerships with industry will be the only way to access the required bandwidth.
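As an illustration of what "resilient against interruption" can mean in practice, the sketch below resumes a partially completed copy and verifies the result with a checksum. It is a minimal stand-in that uses local files in place of a network transfer; the function names are hypothetical and it does not represent any of the ftp packages to be evaluated.

```python
import hashlib
import os
import tempfile


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def resume_copy(src: str, dst: str, chunk_size: int = 1 << 20) -> int:
    """Copy src to dst, continuing from whatever dst already holds.

    Returns the number of bytes written in this call. Assumes any bytes
    already in dst are a correct prefix of src; the digest check afterwards
    is what actually confirms this.
    """
    done = os.path.getsize(dst) if os.path.exists(dst) else 0
    written = 0
    with open(src, "rb") as fin, open(dst, "ab") as fout:
        fin.seek(done)
        for chunk in iter(lambda: fin.read(chunk_size), b""):
            fout.write(chunk)
            written += len(chunk)
    return written


def demo() -> bool:
    """Simulate an interrupted transfer, resume it, and verify the result."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "source.dat")
        dst = os.path.join(tmp, "copy.dat")
        payload = os.urandom(300_000)
        with open(src, "wb") as f:
            f.write(payload)
        # Pretend the first attempt stopped after 100,000 bytes.
        with open(dst, "wb") as f:
            f.write(payload[:100_000])
        written = resume_copy(src, dst, chunk_size=64 * 1024)
        return written == 200_000 and sha256_of(src) == sha256_of(dst)
```

Restarting from the last byte received, rather than from the beginning, matters more as file sizes grow: a failure nine hours into a ten-hour transfer would otherwise cost the whole night.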
Accordingly we propose the following workpackages:
| # | Description | Deliverable | Resources required |
| Fast ftp package tests | | | |
| 1 | | Report | 0.1 SY (assuming 5 packages) |
| 2 | Measure performance of suitable packages in "laboratory" conditions | Measurements of throughput as a function of ftp parameters (e.g. window size, number of parallel threads). Report. | 0.3 SY |
| 3 | Measure performance between remote sites. Measurements should run at speeds ranging from 10 Mbit/s up to 1 Gbit/s, for short periods where possible. 3.1 between collaborating UK sites using temporary PVC; 3.2 between collaborating UK sites on ac.uk WAN; 3.3 between UK and CERN on WAN and PVC; 3.4 between UK and US on WAN (and PVC if available) | Measurements of throughput as a function of ftp parameters. Measurements as a function of time of day. Measurements as a function of RTT, packet loss, and other WAN performance metrics. Report | PVC: 10 Mbit/s (00), 50 Mbit/s (01), > 100 Mbit/s (01-02). 1.5 SY |
| 4 | 4.1 Live demonstration: Identify candidate experiment(s) to demonstrate live fast data transfer. | Demonstration. Report | Collaborators at CERN, SLAC, FNAL. 0.2 SY |
| Meta data brokering packages | | | |
| 5 | 5.1 Obtain, install and understand candidate packages. 5.2 Appraise applicability for further use | | 0.1 SY |
| 6 | Technical demonstration [e.g. mock repository and catalogue, access from test applications] 6.1 Demonstrate within laboratory environment; 6.2 Demonstrate between remote sites within UK; 6.3 Demonstrate from UK to SLAC and/or FNAL | | 0.5 SY |
| 7 | 7.1 Live demonstration: Identify candidate experiment(s) to demonstrate live meta data access. | | 0.2 SY |
For relevant information obtained so far see appendix.
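The measurements above vary window size, number of parallel streams and RTT because, for a window-limited TCP transfer with no loss, throughput is bounded by window / RTT. The following sketch applies that standard relation; the function names are illustrative, and any RTT passed in is the caller's own assumption rather than a measured value.

```python
def window_bytes_for_rate(rate_mbit_s: float, rtt_ms: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to sustain the rate."""
    return rate_mbit_s * 1e6 / 8 * rtt_ms / 1e3


def max_rate_mbit_s(window_bytes: float, rtt_ms: float, streams: int = 1) -> float:
    """Upper bound on throughput for a fixed window per stream, ignoring loss."""
    return streams * window_bytes * 8 / (rtt_ms / 1e3) / 1e6


if __name__ == "__main__":
    # Hypothetical 100 ms round-trip time, chosen only for illustration.
    rtt = 100.0
    print(f"window for 300 Mbit/s: {window_bytes_for_rate(300, rtt):,.0f} bytes")
    # 65535 bytes is the largest TCP window without the window-scale option.
    print(f"one stream, 65535-byte window: {max_rate_mbit_s(65535, rtt):.2f} Mbit/s")
    print(f"eight such streams: {max_rate_mbit_s(65535, rtt, streams=8):.2f} Mbit/s")
```

This is only an upper bound: packet loss and congestion control reduce the achieved rate further, which is why the measurements are also planned as a function of packet loss.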
2. QoS / Differentiated services
Requirements for QoS
- Sustained large scale bulk file transfers are a requirement of (HEP) GRID activities. The evidence available suggests that such transfers do not consistently run to completion.
- In a "best effort" environment sustained large scale bulk transfers will constantly compete with other traffic exhibiting different traffic profiles/characteristics. Depending on total load this could/will result in the non-optimal operation of these other protocols and their applications. Within the network, packets from all protocols are currently dealt with identically.
- During normal GRID operations it will from time to time be essential for a task to obtain small amounts of data from a remote site, or to cause processing to be initiated remotely and obtain results. This may happen, for instance, if a particular entity does not exist at the point of request. Such demands have a time-bound constraint and must be serviced at high priority.
Existing Capabilities
The mechanisms to meet the above requirements are not currently in place within the network.
- Basic network resilience is currently not sufficient to meet the demands of sustained data transfer.
- Recent experience with CAR+WRED within JANET suggests that even the basic congestion avoidance strategies and tools are at an early developmental stage of deployment.
- Differentiated networking services (diffserv) are widely seen as the way forward to provide a mechanism to deal appropriately with data flows which have differing time-bound constraints. Diffserv is expected to scale across the Internet but in the first instance will appear as single provider solutions and later through bilateral agreements as multi-provider solutions. In the long term the bandwidth broker will provide such services and as such has been identified as a critical element within Internet2.
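From an application's point of view, diffserv amounts to marking packets with a code point (DSCP) carried in the upper six bits of the former IPv4 TOS byte. The sketch below shows how a program might request such a marking on a socket. The code point values for Expedited Forwarding and AF11 are the standard ones; the idea of using EF for time-critical requests is an assumption made for illustration, `socket.IP_TOS` is not available on every platform, and whether any router honours the marking depends entirely on provider policy.

```python
import socket

# Standard code points: Expedited Forwarding, Assured Forwarding class 1
# (low drop precedence), and default best effort.
DSCP_EF = 46
DSCP_AF11 = 10
DSCP_BEST_EFFORT = 0


def dscp_to_tos(dscp: int) -> int:
    """Place a 6-bit DSCP in the upper six bits of the former IPv4 TOS byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in the range 0..63")
    return dscp << 2


def mark_socket(sock: socket.socket, dscp: int) -> None:
    """Ask the OS to mark outgoing IPv4 packets on this socket with the DSCP.

    This only sets the field in the IP header; it does not guarantee any
    particular treatment inside the network.
    """
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))
```

A time-critical request channel might call `mark_socket(sock, DSCP_EF)` while bulk transfers stay at best effort; the actual class assignments would be one outcome of the test bed work.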
Issues and Futures
- Either the reliability within the network must be improved, or applications must be developed that are resilient to such service breaks. The latter seems the most likely as the network is typically a complex association of service providers.
- A test bed for testing traffic management techniques needs to be established in association with UKERNA and the router vendors. [This is a problematic area as most large service providers run their networks at 50 - 70% utilisation and have no such issue with congestion etc.; further, they view traffic management techniques as additional, and largely unwelcome, complexity. Only by convincing these providers that the GRID will introduce such bandwidth demands will traffic management be raised on their development agenda.]
- Current deployment of diffserv is limited and services making use of diffserv are not yet available. Proprietary implementations exist which operate only on a limited set of network hardware. Coordination/collaboration with Internet2 is vital to ensure relevant experiences and expertise can be introduced with SuperJANET4.
- The development of the bandwidth broker is crucial for the future - more complex - deployment of diffserv.
- GRID Matchmaking services need to be able to match the priority requirements of applications to network capability, for example diffserv utilisation based upon the prevailing state of the network. More generally, specific patterns of GRID QoS requirements need to be determined, as this will allow engineered solutions to be developed in the network.
| # | Description | Deliverable | Resources Required |
| Quality of Service | | | |
| 1 | | Report / Report | 0.05 SY / 0.1 SY |
| 2 | | Report / Configuration manuals; measurements; reports. / Configuration manuals; measurements; reports. / Monitored statistical and analytical reports | 0.1 SY / 0.3 SY + routers / 0.2 SY + coordination with JANET operational staff / 0.1 SY + long-term stats collection |
| 3 | | Early community experience; configuration manuals; measurements and reports. / Configuration manuals; measurements and reports / as (i) above / as (ii) above / Report (confidentiality?) / Report (confidentiality?) | 0.4 SY + routers / 0.2 SY / 0.2 SY + coordination with JANET operational staff / 0.2 SY + coordination with JANET and other (i.e. ESnet, Abilene) operational staff / 0.1 SY / 0.1 SY |
| 4 | | Research papers; technical reports; input into IETF process; testbed deployment to demonstrate proof of purpose etc. | 1.0+ SY |
| 5 | GRID matchmaking. | Report / Measurements and report / Report | 0.2 SY / 0.2 SY + test environment / 0.1 SY |
3. Monitoring and Resource prediction metrics
Many tools already exist worldwide for the purposes of monitoring the state of the international networks. It is evident that the successful long term operation of a GRID will rely crucially upon maintaining and evolving these services.
On a long timescale it is necessary to identify long standing bottlenecks and routing and peering problems, and where necessary lobby or take direct action to rectify the situation. This can only be done based upon a detailed knowledge of the history of performance indicators.
On a much shorter timescale, the middleware needs to be aware of available resources on the scale of hours in order to decide how best to service user requests. This requires the development of metrics to predict prevailing bandwidth conditions.
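One very simple candidate for such a metric is an exponentially weighted moving average of recent throughput measurements, which gives a one-step-ahead forecast that tracks prevailing conditions while damping short spikes. The sketch below is offered only as an illustration of the kind of predictor that could be evaluated; it is not a method proposed in this workplan, and the smoothing factor `alpha` is an arbitrary illustrative value.

```python
from typing import Iterable, Optional


def ewma_forecast(samples_mbit_s: Iterable[float], alpha: float = 0.3) -> Optional[float]:
    """One-step-ahead throughput forecast from past measurements.

    alpha is the weight given to the newest sample; larger values react
    faster to change but smooth less. Returns None if there are no samples.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    forecast: Optional[float] = None
    for sample in samples_mbit_s:
        if forecast is None:
            forecast = float(sample)
        else:
            forecast = alpha * sample + (1 - alpha) * forecast
    return forecast
```

Testing the effectiveness of a metric like this would mean comparing each forecast against the throughput actually achieved in the following interval.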
| # | Description | Deliverable | Resources Required |
| Monitoring | | | |
| 1 | | Statements | 0.1 SY |
| 2 | 2.1 Survey existing network monitoring packages and decide on applicability of each for GRID monitoring and whether a package can be adapted for the multi-threaded n-dimensional and high performance nature of the GRID. 2.2 Survey existing multi-dimensional statistics packages and techniques with a view to application to GRID monitoring. | Report | 0.2 SY |
| 3 | 3.1 Install and test the packages which appear most promising for GRID networking. 3.2 Adapt any packages which appear to have excellent applicability to GRID monitoring. | Report | 0.1 SY per package for a thorough survey. |
| 4 | 4.1 Define GRID monitoring metrics for resource availability prediction. 4.2 Test the effectiveness of the metric(s) chosen. | Report | 0.2 SY |
| 5 | 5.1 Investigate the applicability of different types of monitoring data, e.g. 5.1.1 Historical data. 5.1.2 Long Term Recent. 5.1.3 Short Term Recent. 5.1.4 Live. | Report | 0.1 SY |
| 6 | 6.1 Investigate what can be done in general for protocol sensitive monitoring. | Report | 0.1 SY |
| 7 | 7.1 Set up monitoring tests using existing ICFAMON packages (a) to provide comparison with existing experience and (b) to have something in place at the outset. | Install and report. | 0.1 SY |
| 8 | 8.1 Define and test the presentation of GRID monitoring data. | Report | 0.3 SY |
5. Data flow modelling
It is envisaged that data flow modelling (computer modelling) may be a very important tool relevant to:
- Specification of GRID components
- Optimisation of static characteristics, e.g. network topology choice, resource location
- Optimisation of dynamic characteristics, e.g. replication and caching strategy
- Planning for evolution
Much work in this area has already taken place within the Monarch project, and UK groups have direct experience of more detailed modelling of LHC experiment data flow.
The initial requirement will be to determine the applicability of such modelling to the GRID problem. This should be done in parallel with work to appraise the use of Monarch tools. The ensuing work will then require the implementation and evolution of a suitable model to provide input to GRID design.
NOT FORMULATED YET
Last modified Wed 26 November 2003.