[Dxspider-support] Cluster freezing up after short while
Dirk Koopman
djk at tobit.co.uk
Sun Sep 16 21:15:02 BST 2012
On 16/09/12 11:20, Aurelio - PC5A wrote:
> Hello,
>
> The machine runs under Ubuntu. I will try an upgrade. Tnx
>
> Ian suggested to check for corrupt dupe and or user file.
Quite correct. First place to look.
> I looked in the FAQ's on the wiki page but didn't find any hints how to
> deal with that (repair, delete..?)
>
Really?
Searching the mailing list from www.dxcluster.org will find you hundreds
of hits.
I thought I had put something on the FAQ page *ages* ago, but it may
have got mislaid in Ian's disk crash.
Anyway, let's do all of them at once:
* stop the node.
* rm /spider/data/dupefile
* rm /spider/data/users.v*
* perl /spider/data/user_asc
That will remove the dupefile and rebuild the user file.
Also you should consider:
* rm /spider/data/qsl.v1
A corrupt QSL file is a distinct possibility if it always hangs on an
incoming PC11 or PC61.
* restart the node.
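The file steps above can be sketched as a small shell function. This is
only a sketch under assumptions: the function name and the backup
directory are hypothetical, it moves the files aside instead of using rm
(a swapped-in, more cautious variant), it leaves qsl.v1 alone, and it
assumes the node has already been stopped by whatever means you normally
use.

```shell
# Sketch only: run with the node stopped, as the user that owns the files.
# reset_spider_data and the backup.* directory are hypothetical names.
reset_spider_data() {
    data_dir="${1:-/spider/data}"
    backup_dir="$data_dir/backup.$(date +%Y%m%d%H%M%S)"
    mkdir -p "$backup_dir" || return 1

    # Move the dupe file and the user database files out of the way
    # (the original instructions simply rm them).
    for f in "$data_dir"/dupefile "$data_dir"/users.v*; do
        if [ -e "$f" ]; then
            mv "$f" "$backup_dir"/ || return 1
        fi
    done

    # Rebuild the user file from the ASCII export, if one is present.
    if [ -f "$data_dir/user_asc" ]; then
        perl "$data_dir/user_asc"
    else
        echo "no user_asc in $data_dir, user file not rebuilt" >&2
    fi
}
```

Called as 'reset_spider_data /spider/data', then restart the node. Once
the node is running happily the backup directory can be deleted.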
Also, what does 'top' say (not just the load average, but the first 10
processes)? A recent problem I have been experiencing in the day job is
very slow writing of data to an ext4 filesystem. This shows itself by
writing jobs being in wait state 'D' a lot. Normally one never sees this
because a process should not spend any time in this state. If this is
happening then you will see a 'jbd2' in state 'D' a *lot*.
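A quick way to look for processes stuck in state 'D' without watching
'top' is to filter 'ps' output. A minimal sketch, assuming a procps-style
'ps' that accepts '-eo stat,pid,comm'; the function name d_state is
hypothetical.

```shell
# Print only the processes whose state starts with 'D' (uninterruptible
# sleep). Expects 'ps -eo stat,pid,comm' style input on stdin, with the
# state in the first column and a header on the first line.
d_state() {
    awk 'NR > 1 && $1 ~ /^D/ { print }'
}

# Usage (run it several times, a single sample proves little):
#   ps -eo stat,pid,comm | d_state
```

If a 'jbd2' line shows up on most runs, slow filesystem writes are worth
investigating before blaming the node.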
If it still hangs (or even before you do the above), try a:
ps axl
Most of the time you will see something like this:
F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
0 1005 2282 2281 20 0 70816 50788 ep_pol S ? 714:39 /usr/bin/perl -w /spider/perl/cluster.pl
The two columns of interest are 'STAT' and 'WCHAN'. This says that the
node is 'S' for 'sleeping' and waiting for input in 'ep_pol' (yours will
probably say 'select').
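To pull just those two columns for the node out of the 'ps axl' listing,
something like the following may help. It is a sketch, not a tested
recipe for every 'ps': the function name node_wait_state is hypothetical,
and it assumes the header line names the columns PID, STAT and WCHAN as
in the listing above.

```shell
# Read 'ps axl' style output on stdin and report the state and wait
# channel of any process whose command line mentions cluster.pl.
# Column positions are looked up by name from the header line.
node_wait_state() {
    awk '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        /cluster\.pl/ {
            print "pid=" $(col["PID"]), "stat=" $(col["STAT"]), "wchan=" $(col["WCHAN"])
        }
    '
}

# Usage ("ww" asks ps not to truncate the command column):
#   ps axlww | node_wait_state
```

Run it a few times while the node appears hung. A steady 'S' with a
select/epoll style wait channel suggests the node is idle and waiting for
input; a 'D' points back at the disk.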
Hope this helps.
Dirk