GridPP Inefficient Jobs Policy
Full Policy Document (Version 18 March 2008)
All UK sites are given flexibility to deal with stalled jobs, so that their CPUs are occupied more fully overall, according to the following definition of a stalled job:
Any job consuming less than 10 minutes of CPU time over a given 6-hour period (efficiency < 10/360 ≈ 0.028) is considered stalled.
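The definition above can be expressed as a simple check. The following is a minimal sketch, not part of the policy: the names `STALL_WINDOW`, `STALL_CPU_LIMIT`, `STALL_EFFICIENCY` and `is_stalled` are hypothetical, and it assumes the site can already measure how much CPU time a job consumed during its most recent 6-hour window.

```python
from datetime import timedelta

# Thresholds taken from the policy text.
STALL_WINDOW = timedelta(hours=6)
STALL_CPU_LIMIT = timedelta(minutes=10)

# Equivalent efficiency threshold: 10 minutes / 360 minutes.
STALL_EFFICIENCY = STALL_CPU_LIMIT / STALL_WINDOW


def is_stalled(cpu_in_window: timedelta) -> bool:
    """Return True if a job counts as stalled under the policy definition.

    `cpu_in_window` is the CPU time the job consumed during one full
    6-hour observation window (how this is measured is an assumption
    left to the site's batch system tooling).
    """
    return cpu_in_window < STALL_CPU_LIMIT
```

For example, `is_stalled(timedelta(minutes=3))` flags the job as stalled, while a job that used exactly 10 minutes of CPU in the window is not flagged, because the policy threshold is strictly "less than".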
The following intervention scheme should be applied:
- Either
  - If the site identifies that the stall is due to a well-known problem, e.g. a hanging lcg-cp command, then the jobs may be deleted at once, with the user or VO being informed of the problem. (Note: the site should attempt to identify the SE, SURL or LFN involved, to help debug the underlying data management issue.)
- Or
  - In cases where the reasons for stalling are not obvious (e.g. a binary just hanging), sites should contact users or VO production teams, informing them of the number of stalled jobs at the site and providing as much information as possible to help debug the problem. (An example of such an email is given in Appendix 1 of the full Policy.)
    In this case:
    - The user or VO should respond within 6 hours, stating whether the site should simply delete the jobs or take further debugging steps.
    - If no response is received within 6 hours, the site may delete the jobs, informing the user of the action taken.
If no contact is possible with the user or VO because their contact information is unavailable, the site may delete the stalled jobs from the batch system immediately. If the site deletes more than 20 jobs in this way in any 24-hour period, it should raise a GGUS ticket against the user or VO for future reference.
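The intervention scheme above can be sketched as a small decision function. This is an illustrative sketch only, not an official GridPP tool: `Action`, `next_action` and `needs_ggus_ticket` are hypothetical names, and the order of the checks (missing contact details first, then known problems) is an assumption, since the policy does not say which rule takes precedence when both apply.

```python
from enum import Enum
from typing import Optional

# Values taken from the policy text.
RESPONSE_DEADLINE_HOURS = 6
GGUS_TICKET_THRESHOLD = 20  # deletions without contact per 24-hour period


class Action(Enum):
    DELETE_NO_CONTACT = "delete immediately; no contact details available"
    DELETE_AND_INFORM = "delete at once and inform the user or VO"
    CONTACT_AND_WAIT = "contact the user or VO and wait for a reply"
    FOLLOW_REPLY = "delete or debug further, as the user or VO requests"
    DELETE_AFTER_TIMEOUT = "delete and inform the user of the action taken"


def next_action(
    known_problem: bool,
    contact_available: bool,
    hours_since_contact: Optional[float] = None,
    reply_received: bool = False,
) -> Action:
    """Pick the next step for a batch of stalled jobs.

    `hours_since_contact` is None until the site has emailed the user or VO.
    """
    if not contact_available:
        return Action.DELETE_NO_CONTACT  # assumption: checked first
    if known_problem:
        return Action.DELETE_AND_INFORM
    if hours_since_contact is None:
        return Action.CONTACT_AND_WAIT
    if reply_received:
        return Action.FOLLOW_REPLY
    if hours_since_contact >= RESPONSE_DEADLINE_HOURS:
        return Action.DELETE_AFTER_TIMEOUT
    return Action.CONTACT_AND_WAIT  # still inside the 6-hour window


def needs_ggus_ticket(no_contact_deletions_last_24h: int) -> bool:
    """True once more than 20 jobs were deleted without contact in 24 hours."""
    return no_contact_deletions_last_24h > GGUS_TICKET_THRESHOLD
```

For instance, `next_action(known_problem=False, contact_available=True, hours_since_contact=7.0)` selects deletion after the timeout, and `needs_ggus_ticket(21)` indicates that a GGUS ticket should be raised, whereas exactly 20 deletions does not.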
The Production Manager should be informed of any issues that arise in implementing this draft policy.
Future policy and all inefficient-job parameters will be reviewed annually, around March, by the DTeam.
Last modified Thu 20 March 2008