GridPP PMB Minutes 450 (23.01.2012)
===================================

Present: Dave Britton (Chair), Jeremy Coles, Andrew Sansum, Dave Colling, Glenn Patrick, Roger Jones, Pete Gronbech, Steve Lloyd, Robin Middleton, Dave Kelsey (Suzanne Scott - Minutes)

Apologies: Tony Doyle, Pete Clarke, Tony Cass, John Gordon, Neil Geddes

1. RAL SRM upgrade issue & RAL DNS issue
=========================================

AS reported that the failure and suspension of the SRM upgrade occurred during what was supposed to be a transparent intervention, with the site at risk. The problem was twofold:

1. one of the SRM database tables couldn't update and performance fell; this took some time to fix by unpinning the table;
2. on the server they had been testing, the hostname had been returned in the long form, but when deployed the hostname was returned in the short form, and this wasn't fixed everywhere.

An analysis had been done and the upgrade was rolled out again; things had been ok so far. DB considered this to be a 'wisdom' issue - usually 'wisdom' was passed on - was there a mechanism for ensuring they could keep track of knowledge to make sure it didn't get lost? AS advised that they needed to do things in standard ways - they had pinned tables to stop Oracle making choices. It was spotted by the database team. They do have 'change' meetings and do discuss issues around passing on knowledge. The hostname issue in particular was a mismatch between the testbed and production environments, and they hadn't got to the bottom of that yet. It took a long time to identify that it was a DNS problem.

In relation to the dialogue with the network people, AS reported that he had met Robin Tasker on Thursday and had a conversation during which AS expressed concern. AS would follow this up with an email summary to RT. DB noted that it was important to maintain a good relationship but that we were obligated to point these things out.

2. AOCB
========

- Re the date of the next PMB meeting, DB would be at Manchester for an NGS review, therefore he proposed to cancel next week's meeting on 30th January. The following week was 6th February, and DB had a Viva at 2.00 pm. He could be present at the PMB meeting for one hour, but suggested that this start at 12.30 pm. This was agreed.
- AHM paper: RJ had the materials and would do this tonight. DC noted he had an extension until Wednesday. JC volunteered to help with editing.

STANDING ITEMS
==============

SI-1 Dissemination Report
--------------------------

SL presented Neasan O'Neill's report as follows:

NO was travelling again and was in Amsterdam until Thursday.

1) He had contacted all of the PP masterclasses; mixed responses had been received so far to GridPP content
2) He was working on the website
3) The CHEP 2010 journal was now published; papers were uploaded and added to the website
4) He was looking into a UK NGI stand at the EGI community forum which would take place in Munich at the end of March
5) He was writing a CVMFS news item
6) He was investigating a news item on future strategies' workshops and findings

SI-2 ATLAS weekly review & plans
---------------------------------

RJ reported that they had been affected by DNS issues over the weekend; there were some problems with data cleanup procedures, which go to 'broker-off' if too full - the threshold should go down to 70% as protection; there was some trouble with smaller sites; the recent 'off' for ATLAS was now finished, this had gone smoothly.

SI-3 CMS weekly review & plans
-------------------------------

DC reported that the webpages were inconsistent and that there had been a series of problems at RAL - everything was ok apart from that.

ACTION 450.1 DC to send the spreadsheet accounting numbers to December, to SL.

DC noted re the accounting that they may have to start again in January. The CMS metrics had been fairly constant.

SI-4 LHCb weekly review & plans
--------------------------------

GP noted that things had been quiet, there was MC production happening.

SI-5 User Co-ordination Issues
-------------------------------

GP advised that there had been DNS issues over the weekend. There were no other issues to report.

SI-6 Production Manager's Report
---------------------------------

JC reported as follows:

1) Generally very quiet week as seen on the ops dashboard. Durham appears in several reports as having problems. Last week availability/reliability figures were given for sites under target and no explanation was given for UCL-HEP (95%: 75%). The reason is now known to be “most problems were due to our lcg-CE (now decommissioned) and some overloading of DPM pool nodes, which we are working around for now by limiting the number of analysis jobs”.

2) It has been noted that SL4 support will stop from 2nd February. There are still some sites running on this OS version but the end-of-life date is known. The last LCG-CEs are being closed shortly.

3) T2K is now back following the earthquake in Japan. They announced last week that they intend to start storing files on T2 SEs again very shortly ahead of processing. They discovered in the last data processing campaign that only half of the data is worth analysing and are currently removing unneeded data and where possible moving remaining files into a T2KORGDISK disk token. They are being invited to this week's storage meeting to discuss their approach.

4) Hardware purchasing was discussed at last week’s ops meeting with useful input from Martin Bly. All sites were aware of the deadlines. Some drives are difficult to order at the moment (due to Thailand floods) but 2TB nearline SAS appears fine and is what most sites are opting to buy though it does not offer the best unit of storage per pound. It was noted that having to buy now is not optimal for many sites but required.

For information:

A) There is an operations TEG F2F meeting today in Amsterdam: https://indico.cern.ch/conferenceDisplay.py?confId=161833. One of the most discussed areas on TB-SUPPORT is the future of YAIM.

B) There is currently an effort to update UK NGI sites in the GOCDB. Several sites such as the PPS ones no longer have any resources but remain active in GOCDB.

C) Some sites rely on tarball installs for their WNs but plans for a UMD/EMI tarball release for the worker nodes are still unclear.

ACTION 450.2 Re SL6, JC to come back to the PMB with regard to plans & schedules.

SI-7 Tier-1 Manager's Report
-----------------------------

AS reported as follows:

Fabric:

1) FY11 procurements
   - One disk delivery received
   - One disk delivery tomorrow (Tuesday 24th Jan)
   - One CPU delivery received
   - One CPU delivery tomorrow (Tuesday 24th Jan)

2) Tier-1 network bypass of the site firewall failed at 12:15 Monday 16th January. Re-instated by site network team at 14:01. Partial failure on UKLIGHT -> SAR router.

3) Site network DNS problems Sunday 22nd January 04:00-12:00 caused a knock-on effect on Tier-1. Several DNS servers not responding. Site networking team on-call responded. Cause not yet reported to us.

Service:

1) Summary of operations is at: https://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2012-01-18

2) CASTOR
   a) Upgrade to the ATLAS SRM on Tuesday 17th January ran into problems. Upgrade had to be reverted. Problem traced to statistics being locked on one of the SRM tables and also a problem with hostname unexpectedly returning short-form name on target hosts. Upgrades re-scheduled for this week. First SRM (cms) now upgraded successfully.

3) 3D
   Completed upgrade to ATLAS 3D service; upgrade was successful but overran the scheduled downtime slightly.

Staff:

1) Grid team leader post ongoing.

2) Recruitments underway
   * Vasily Savin (Fabric sysadmin) started on Monday 16th.
   * One sysadmin post will start at the start of February
   * One Grid Team member - expected to start today.
   * Database post recruitment underway.

SI-8 LCG Management Board Report
---------------------------------

DB noted there had been no meeting.

AOB
===

PG reported that he had received Quarterly Reports, however 3 or 4 were still outstanding - could he remind everyone to do their reports asap. He had emailed all PIs re the budgets and the ordering of kit. JC noted issues at Manchester. DB advised that the Force10 pricing was now on the portal, it was cheaper than expected. DB reminded everyone to spend the money successfully within timescale.

DB advised that all GridPP4 grants (Tranche 1) needed to be extended by six months to enable crossover to Tranche 2. A letter would come from STFC to authorise this. DB noted we could start working towards issuing the new grants soon - DB would deal with this.

REVIEW OF ACTIONS
=================

436.12 DB to produce a financial proposal for adjustments to the Tier-2 staffing profile over the term of GRIDPP4. Ongoing.

438.8 TC to advise when it is a good time to move to vidyo - early adopters were possible. Ongoing.

438.9 AS to contact relevant site managers to ask whether or not they would be interested in having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to what they want and why. Ongoing.

448.4 ALL to send thoughts/suggestions to DB regarding the replacement for GP in the User Co-ordinator position (not necessarily based at RAL). Ongoing.

448.7 RJ/PC to draw-up GridPP guidelines in relation to a Data Management Policy. Ongoing.

449.1 AS to document the recent network incidents at RAL. Ongoing.

449.2 JC to devise a provisional Agenda for GridPP28 (for the two main days Tue & Wed). Ongoing.

ACTIONS AS OF 23.01.12
======================

436.12 DB to produce a financial proposal for adjustments to the Tier-2 staffing profile over the term of GRIDPP4.

438.8 TC to advise when it is a good time to move to vidyo - early adopters were possible.

438.9 AS to contact relevant site managers to ask whether or not they would be interested in having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to what they want and why.

448.4 ALL to send thoughts/suggestions to DB regarding the replacement for GP in the User Co-ordinator position (not necessarily based at RAL).

448.7 RJ/PC to draw-up GridPP guidelines in relation to a Data Management Policy.

449.1 AS to document the recent network incidents at RAL.

449.2 JC to devise a provisional Agenda for GridPP28 (for the two main days Tue & Wed).

450.1 DC to send the spreadsheet accounting numbers to December, to SL.

450.2 Re SL6, JC to come back to the PMB with regard to plans & schedules.

There will be no PMB meeting on Monday 30th January 2012. The next PMB meeting will take place at 12:30 pm on Monday 6th February.