GridPP 22nd Collaboration Meeting @ UCL
=======================================

Wednesday 1st April

Discussion Session: Implications of the Experiment priorities and plans
-----------------------------------------------------------------------

Chair: Pete Clarke
Panel: Chris Brew, Raja Nandakumar, Graeme Stewart, Dave Colling, Glenn Patrick

PC commenced with an introduction to networking and the planned 10 gig resilient link from RAL to CERN. Re the 4 x 1 gig backup to OPN, the default would be to lie idle. PC asked whether we should rely on the extra capacity for normal use or, rather, for more 'inventive' use. CB advised that he wasn't clear that the network fibre was the bottleneck; he believed that end-to-end connectivity was the issue. RN noted that at each end there were multiple servers, so it should be ok. PC then suggested that we get the link in place and see how we use it. DB suggested that it might be better to leave it idle. CB countered that idle resources generally don't work when you switch them on.

PC asked how the Tier-1 would reply to experiment priorities in the event of a total loss of the Tier-1. AS advised that there was little there that was different to what he would expect: CASTOR re-start was seen as a priority, which wasn't too concerning; the Tier-1 team was a large team with prioritised start orders. The only resource contention within the team would be the database area - staff resources were light there - and priorities for the database team needed to be identified. The question was asked whether there was a smaller subset of kit to bring up on behalf of the experiments. This was difficult to answer; however, the storage infrastructure could be helpful. GS noted that recovering the RAL share of raw data would be a priority - it might be possible to hold this on disk beforehand. AS asked if more hardware could be helpful. This was unclear.
DC advised that the experiment view was that the Tier-1 should be custodial - therefore custodial copies would be good - if RAL went down then this could be absorbed elsewhere; however, if they lost Fermilab or Brookhaven then no - losing Fermilab would be very serious. AS advised that communications were much improved; also, getting info out to VOs was part of normal ops. Experiments would want to know the uncertainty in the estimates. CB agreed that a round-table co-ordinated effort would be required if a major incident happened. AS reported that weekends were a challenge - small incidents could become big problems. CB noted that no-one was on call on behalf of the experiments at weekends.

JG asked about phone numbers - did anyone have a listing? RN advised that for LHCb there was a Grid Ops expert. DB advised that a large list of phone numbers was available on the GridPP website, relating to all sites. PC asked who generally held the phone numbers. DB noted that it had been recommended for people to print out the GridPP database and keep it handy as hardcopy. GS commented that a lot of the numbers were probably out of date now. PC suggested that the list should be tested out - ban phone and email and see how you manage to contact people.

Stuart Wakefield asked about the recent power outage: what had been the typical response, and had there been an incident report? AS advised that the typical response was 12 hours, and yes, a 'Post Mortem' was posted at the moment. Staff had been onsite within the hour; a second power cut had happened at 08:40 am, but no 'all clear' had been given by Buildings & Maintenance until around 10:00 am. They didn't get going again until about 10:20 am and MyProxy was back up around 2:00 pm.

JC asked about the 'resilience' of people - was there one person switching manual switches in the experiments? DC advised that they had the Data Ops team - they worked in computing shifts and there was a series of people involved.
GS reported that for ATLAS, there was a Distributed Computing expert on call and a team of 8 people. In the event of a major incident they could draw other people in, not just the person on call. JC asked how many people could switch the Tier-2 to the Tier-1. GS reported 4-5 people - but this would only be done after a couple of days in order to provide maximum resources to the experiments - info was on the twiki. CB commented on a lack of documentation centrally in CMS on disaster planning; however, there was the CMS twiki and the email system. RN advised that there was one person on shift during working hours - the central issue for LHCb was the link between them and CERN. If this went down overnight, LHCb didn't need to take critical action; it could be attended to the next day. There was therefore one shift person and an expert on call, plus the DIRAC team in various places. LHCb did analysis at the Tier-1, and the whole load could be taken over for LHCb at CERN. Over a day they would likely investigate an issue.

DC noted that there were weaker parts of the system. JC advised that there were Ops Teams but no clear picture overall. RN reminded the meeting that the GGUS ticketing system was available and the LHCb VO Ops contact was called if necessary. GS strongly advised NOT sending an alarm ticket to ATLAS via GGUS, as this would not be dealt with quickly enough to respond to an emergency effectively. JG noted that there was a well-defined 24/7 number available at CERN.

PC noted that there were various issues/questions involved here: having another designated Tier-1 in the UK? Having one or two sites as hot spares? JG advised that you could receive raw data and keep it for some time, and it stays in the UK. PC noted it didn't have to be a 'reliable' site for that. DC commented, however, that custodial data was a responsibility. PC asked whether we do some important functions in the UK relating to this. CB confirmed yes - CPU & storage at various sites.
GS also confirmed that the Tier-2s provide CPU & storage, and are fairly reliable. TD noted in addition that Oracle database people at a site would also be essential. PC suggested that if this was important, fallback time could be put in, in order to study it. CB advised that failover to a different Tier-1 was preferred. TD confirmed that 24/7 cover could be called upon at short notice. GS advised that there were other issues in relation to critical service - we couldn't do the whole thing at Glasgow, for example. It was suggested that a temporary Tier-1 should be available, to hold and then transfer data; once RAL was back up, the data could be copied back. DK advised that Taiwan have another machine room - they've moved the kit - but it would depend on what the 'disaster' was.

DB suggested that the most likely 'disaster' was that Oracle or CASTOR stops working for a significant period of time. The question was, you can't get anything only off tape - but at PPD there was a lot of disk - we could look on PPD as a large disk pool for a few weeks - could that be a plan? CB noted that if there was tape anywhere else they may go for it, but might consider PPD if all other Tier-1s were full. DC suggested that this could be taken to the experiments and asked. DB asked whether, if we had to clear disk space at PPD, the network was there to enable its use. CB advised that network connections were being worked on at the moment. Jens Jensen noted that we had to weigh the risk against the actual likelihood of it happening - this might not be worth spending six months on.

CB advised that if there was a meltdown on the Tier-1 batch system then it would be easy to configure the batch to use CASTOR. AS noted that if batch capacity burnt out we could use V-LAN and store it with the CB. GS advised that, regarding the ATLAS Computing Model, the Tier-1 had custodial responsibility for data.
Tape would eventually become too expensive and we may move to solid state disks - if we get rid of tape then we get rid of tape access copies. The Tier-2 capacity could fill in - if we were three Tier-1s down, then we could use the Tier-2. ATLAS have experienced five Tier-1s down and loss of custodial data - disk might be full in five years and network bandwidth was an issue. JG noted that the UK needed to know how to respond if five Tier-1s were down - Edinburgh and Daresbury have tape systems, both on JANET. GS suggested putting two copies at the Tier-2s - it didn't have to be tape.

It was suggested that staff effort could be allocated to building another Tier-1 site - but this would mean 'stealing' Tier-1 staff, which would leave RAL vulnerable. RN noted that LHCb would prefer to have services at the same place, rather than distributed. GS advised that you could have a catalogue at one site and a server at another - but this would depend on reliability.

PC drew the discussion to a close and thanked the panel, and also the floor, for their contributions.