Job pairs failing
I am running a job on StarExec, and I am having trouble debugging my job submission. I am getting a lot of job-pair failures with the statuses "benchmark error (12)" (pair id 52343076) and "run script error (11)" (pair id 52345975). The generated output log is empty, and I can run the benchmarks on my machine without any issues. I am not sure whether the problem is on the StarExec side or whether I have set up my script incorrectly for StarExec. (For the StarExec admins, the job id is 6709 if you want to look into this.) Does anybody know what "benchmark error (12)" and "run script error (11)" mean, and what their likely causes are? Any suggestions for how I can debug these kinds of errors?
lianah- Posts : 11
Join date : 2014-04-23
Re: Job pairs failing
I'm looking into it, Liana, and will let you know as soon as I figure out what the problem is.
Aaron
Re: Job pairs failing
Thank you for the quick reply. I tried a couple of other jobs, and I wonder whether the problem is related to all2.queue. I ran test jobs on long.queue and all1.queue with a small timeout and have not seen these error messages so far. The StarExec log for some of the failed jobs also suggests that something may not be mounted correctly:
"02/23/15 08:02:23 AM CST: WORKING_DIR is /export/starexec/sandbox2
chown: cannot access `/export/starexec/sandbox2': No such file or directory"
I will use the all1.queue for now.
lianah- Posts : 11
Join date : 2014-04-23
Re: Job pairs failing
Right. The problem looks like a missing sandbox2 directory. I had not noticed that the errors are localized to nodes on all2.q (so thanks for pointing that out). But some all2.q nodes have sandbox2 directories. We'll look into it. If running on all.q is a workaround, then please do that for now.
Aaron
Re: Job pairs failing
Just an update: it looks like one of the local sandbox or sandbox2 directories is somehow getting blown away while a solver is running on a compute node. Our scripts do not recreate this directory after the solver runs, and a solver does have the power to destroy that directory when it runs (not that it should), since the directory is chown'ed to the sandbox or sandbox2 user before the solver executes. I do not know whether anyone's solver is (accidentally) doing this, but that could explain the missing directories.
We should have a fix out tomorrow that simply checks for this directory and creates it if it is missing.
Aaron
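For readers curious what a "check for the directory and create it if missing" step might look like, here is a minimal sketch in Python. It is not StarExec's actual script: the function name `ensure_sandbox` and the optional `uid`/`gid` parameters are hypothetical, and the demo uses a temporary directory rather than the real `/export/starexec/sandbox2` path.

```python
import os
import tempfile
from pathlib import Path


def ensure_sandbox(path, uid=None, gid=None):
    """Create the sandbox directory if it is missing.

    Returns True if the directory had to be created, False if it already existed.
    uid/gid are hypothetical parameters: when both are given, ownership is handed
    to that user, mirroring the chown step described in the thread (this usually
    requires root privileges).
    """
    sandbox = Path(path)
    created = not sandbox.is_dir()
    if created:
        # Recreate the directory (and any missing parents) before the next run.
        sandbox.mkdir(parents=True, exist_ok=True)
    if uid is not None and gid is not None:
        os.chown(sandbox, uid, gid)
    return created


if __name__ == "__main__":
    # Demo against a throwaway location, not a real compute node path.
    with tempfile.TemporaryDirectory() as root:
        target = Path(root) / "sandbox2"
        print(ensure_sandbox(target))  # directory was missing, so it is created
        print(ensure_sandbox(target))  # directory now exists, nothing to do
```

Running a check like this before every job pair, rather than only at node setup, would cover the case where a previous solver removed the directory.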
Re: Job pairs failing
We deployed a fix for this problem this morning, so you should no longer see pairs failing for this reason (let me know if you still do).
Aaron