GRIDPP24 COLLABORATION MEETING @ RHUL
=====================================

Discussion Session - 14th April 2010
------------------------------------

Topic: The short/medium/long-term future of CASTOR
---------------------------------------------------

Chair: Tony Doyle
Panel: Matthew Viljoen, Shaun de Witt, Andrew Sansum, Raja Nandakumar, Alberto Pace

TD introduced the speakers and advised that talks would last 5 minutes each, with slides, followed by discussion.

MV, CASTOR Team Leader at RAL, presented on the RAL CASTOR upgrade plans and timeline. He advised that this had been endorsed yesterday, as a proposal, by the PMB. The decision had been made to go ahead with the plans, but not actually upgrade yet - this would be decided at a later date. MV presented slides as follows:

- current context and proposal
- reason for the change
- upgrade plans including test systems and timeline (from Gantt Chart), plus VO participation

TD asked if there were any questions on the upgrade and timeline?

RN asked that if this were an open system for test by a VO, could the VO not perform a full-scale test? MV advised that this would happen in two ways: firstly, in the testing phase pre-production, and secondly, in post-upgrade testing.

RN asked whether stopping writing into the CASTOR instance would make the rollback easier? MV indicated that it was possible - they could take it offline but hadn't discussed it yet.

It was asked whether CERN could provide a certified migration path? MV advised that they do testing, and this matches the production at RAL, however RAL needed to test on its own setup as much as possible.

Dave Wallom asked whether there was an alternative to the proposal? MV advised that they could stay where they were on 2.1.7 and not upgrade at all; they could also ditch CASTOR and move to an alternative.

TD noted that at CERN version 2.1.9 had been running successfully for three months now.

SdW next presented on CASTOR history and previous lessons learned - CASTOR had had a chequered past and there was a long timescale proposed between 2.1.7 and 2.1.9. SdW presented slides as follows:

- CASTOR availability historically, with timeline - from 2.0 (in 2005) to present (2.1.7) and lessons learned with each upgrade
- in their experience, other minor upgrades had the potential to cause wider problems and issues

RN commented that the last major issues were seen during the move from 2.1.6 to 2.1.7. Furthermore, the move from 2.1.7 to 2.1.8 at CERN had led to loss of data - did MV have any answers to these issues? MV noted that some of these were the result of Oracle architecture. Alberto Pace noted that there had also been operational issues at CERN.

Clarification on the 'big ID' problem was requested. SdW explained that CASTOR and the SRM couldn't access upgraded data due to ID recognition errors by Oracle and its caching system.

TD asked whether the Database Group had a plan in terms of upgrades on the database side? MV confirmed yes, they had already started to test the upgrade path.

AS next presented on CASTOR pros and cons. He advised that we had been here before - we had been on dCache previously and had moved to CASTOR for the tape management system: we were on familiar territory - the dCache installation had taken a long time.
AS presented slides as follows:

current context
---------------
- CASTOR testing from 2005
- the team were now working well but were vulnerable to Oracle and SAN issues
- it had taken 2-3 years to get CASTOR to full production capability
- the same team had worked on CASTOR since its inception
- it had taken a lot of staff effort to keep CASTOR running
- despite its chequered past it was working well now

to replace CASTOR
-----------------
- this would be a huge project over years
- we would need to be convinced that the replacement was adequate
- it was important to remain agile but the grass was not always greener
- we need 20 petabytes of data by 2012
- we need to run the best storage solution we can for the project

Dave Colling commented that if we went by past history and were terrified of upgrading because it had broken so many times, this would lead to atrophy. He felt that a more cautious approach would be better than the one outlined by MV - if it didn't work this time then it was clear we couldn't stick with the same product.

Dave Wallom noted that STFC provided data services to other people - were they aligned with CASTOR? MV advised that STFC were rolling out a new instance for facilities like Diamond, and were moving to 2.1.9 shortly, that was the plan. DW commented that if they were moving all STFC operations onto CASTOR, it would be unwise for us to change. AS agreed, however he cautioned that we still had to deliver to the project.

Ewan MacMahon asked whether there was any support in CASTOR to dump metadata into neutral territory to help recatalogue? SdW advised yes: via the nameserver and tape robot. MV noted however that this would need a bespoke solution involving development effort.

Stephen Burke asked what, given the 5-year timescale, was missing and what was there in the future that we actually needed, as opposed to what we get? What could another solution provide?
AS said we couldn't answer that question, it was down to experiment requirements.

Dave Wallom asked about the STFC roadmap? When would CASTOR not work due to scale? AS advised that you had to deliver to the project you were on - lessons could give a bigger solution, but we can't know yet.

AP, from the CERN IT Department, next presented on CASTOR status, showing slides as follows:

- since 2008 there had been 40 minor releases in production phase
- effort was focussed on reducing the operational cost and simplifying the architecture
- today CERN were running 2.1.9, and they considered it to be stable; SRM v2.9 had been released
- several years of stable operation were expected
- they hoped to improve CASTOR interfaces for physics analysis

Stephen Burke asked whether this meant that the 2.1.9 upgrade was the last one for a while? AP advised that reasons for upgrading were always: extra features (which were driven by the experiments); and better tools for CASTOR operation. Currently they had reached a complete set of features and did not have a date for 2.1.10 as yet.

John Gordon commented that even the title 2.1.10 betokened a problem with either the product, or the naming convention. AP confirmed that new versions of CASTOR did not change sub-functionality.

Dave Wallom asked if anyone outside of HEP were using CASTOR? AP confirmed no-one.

RN asked if there were any major changes for 2.1.10? AP advised that there was no date for it, therefore no functionality.

TD noted that in earlier stages there had been plans to productise - had that been given up now? AP confirmed there were no plans to productise CASTOR.

Dave Wallom asked what the sustainability of CASTOR was over the next 5-10 years, as only three Tier-1 sites plus CERN were actually using it?
AP answered that it was attached to LHC physics and its technology had been developed to address current requirements; it was therefore sustainable in the long term. However, major technological changes would be likely to force a major shift in CASTOR. The model of hierarchical storage could itself be questioned, and this would certainly happen within a 10-year horizon.

Dave Wallom asked how he could ascertain details of the specification of CASTOR? He knew of lots of other companies out there who also do hierarchical storage and it would be good to know and also to compare the performance specifications. He wanted to be clear on what the benefit was of developing this ourselves, rather than buying it in from elsewhere. AP responded that the achievement of CASTOR was its ability to expose to the end user both performance and availability of the hardware, also its efficiency in using the resources - this meant that the performance of CASTOR was better than other commercial systems.

Tony Cass commented that CERN had evaluated another system in the 1990s and had analysed usage patterns, and our usage pattern differed from that of other communities. TC also noted that we could change CASTOR according to our requirements.

It was asked how tied CASTOR was to Oracle? SdW responded that three-quarters of the logic was stored in Oracle procedures - you can't separate them.

Dave Wallom observed that CASTOR took a lot of staff to run it, it was of large value - was it cost-effective if eventually it was only working at one site? If CERN were the only ones using CASTOR - at what point would CERN decide to ditch it and use something else? TD noted that this was an ongoing evaluation and that things did change over time.

Dave Britton commented that there were three things wrong with CASTOR: complexity / fragility / expense. He noted we have had problems with its inherent fragility in the past, which occurred largely due to its complexity.
It was expensive for GridPP in terms of staff and Oracle licences etc. Furthermore, DB observed that in the global situation, three sites were using CASTOR and seven out of eight Tier-1s were using other products, which also must have their issues. RN advised that if one looked at the GGUS tickets, they would tell a story. DB noted that there might well be a better system out there, but it would be two years of work to move from one place to another.

For the long term, DB advised that in 2-4 years we would need to weigh up the fragility/complexity/expense of CASTOR for the future - things have been very bad with CASTOR in the past and maybe this would be the moment that the grass was indeed greener. We would need to look at the global context.

John Gordon commented that open software was very agile, and dCache was fully open source.

RN next presented on LHCb and the proposed CASTOR upgrade. He presented slides as follows:

- there was a long history of LHCb requesting to move to 2.1.9
- LHCb would prefer to upgrade to 2.1.9 as soon as possible
- LHCb were working with CERN in relation to the access process and xrootd

TD brought the discussion session to a close, thanked the floor for their questions, and thanked the panel for their presentations.