[Dxspider-support] Cluster freezing up after short while

Dirk Koopman djk at tobit.co.uk
Sun Sep 16 21:15:02 BST 2012


On 16/09/12 11:20, Aurelio - PC5A wrote:
> Hello,
>
> The machine runs under Ubuntu. I will try an upgrade. Tnx
>
> Ian suggested to check for corrupt dupe and or user file.

Quite correct. First place to look.

> I looked in the FAQ's on the wiki page but didn't find any hints how to
> deal with that (repair, delete..?)
>

Really?

Searching the mailing list from www.dxcluster.org will find you hundreds 
of hits.

I thought I had put something on the FAQ page *ages* ago, but it may 
have got mislaid in Ian's disk crash.

Anyway, let's do all of them at once:

* stop the node.
* rm /spider/data/dupefile
* rm /spider/data/users.v*
* perl /spider/data/user_asc

That will remove the dupefile and rebuild the user file.

Also you should consider:

* rm /spider/data/qsl.v1

A corrupt qsl.v1 is a distinct possibility if the node always hangs on 
an incoming PC11 or PC61.

* restart the node.
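
If you would rather not delete anything outright, the same clean-up can be sketched as a small shell function that moves the suspect files into a dated subdirectory instead of using rm. This is only a sketch under assumptions: the function name set_aside_spider_files, the "aside.*" directory name and the optional "qsl" argument are all made up for illustration and are not part of DXSpider.

```shell
#!/bin/sh
# Hypothetical helper (not part of DXSpider): move the suspect files
# aside instead of deleting them. Run it only while the node is stopped.
#   $1 = data directory (e.g. /spider/data)
#   $2 = "qsl" to also set aside qsl.v1 (optional)
set_aside_spider_files() {
    dir="$1"
    aside="$dir/aside.$(date +%Y%m%d%H%M%S)"
    mkdir -p "$aside" || return 1

    # dupefile and every users.v* file; an unmatched glob is skipped
    # by the -e test below.
    for f in "$dir"/dupefile "$dir"/users.v*; do
        if [ -e "$f" ]; then
            mv "$f" "$aside"/ || return 1
        fi
    done

    # qsl.v1 only on request.
    if [ "$2" = "qsl" ] && [ -e "$dir/qsl.v1" ]; then
        mv "$dir/qsl.v1" "$aside"/ || return 1
    fi

    # Print where the files went so they can be inspected or removed later.
    echo "$aside"
}
```

With the node stopped, something like "set_aside_spider_files /spider/data qsl" followed by "perl /spider/data/user_asc" covers the same ground as the steps above; then restart the node as usual.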

Also, what does 'top' say (not just the load average, but the first 10 
processes)? A recent problem I have been experiencing in the day job is 
very slow writing of data to an ext4 filesystem. This shows itself by 
writing jobs being in wait state 'D' a lot. Normally one never sees 
this, because a process should not spend any time in this state. If this 
is happening then you will see a 'jbd2' in state 'D' a *lot*.
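
A quick way to look for this without watching 'top' is to filter the process list for anything whose state starts with 'D'. The following is a rough sketch; the function name list_d_state is made up for illustration, and it assumes the state is in the first column, as it is with "ps -eo stat,pid,comm" on a procps-based Linux system.

```shell
#!/bin/sh
# Hypothetical helper: read "ps -eo stat,pid,comm" style output on stdin
# and print only the processes in uninterruptible sleep (state 'D').
list_d_state() {
    # Skip the header line; keep rows whose first column begins with D
    # (this also catches variants such as "D+" or "Ds").
    awk 'NR > 1 && $1 ~ /^D/'
}
```

Run it a few times in a row, e.g. "ps -eo stat,pid,comm | list_d_state". An occasional hit means little; a 'jbd2' entry that shows up nearly every time points at the slow-write problem described above.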

If it still hangs (or even before you do the above), try a:

ps axl

Most of the time you will see something like this:

F   UID   PID  PPID PRI  NI    VSZ   RSS WCHAN  STAT TTY        TIME COMMAND
0  1005  2282  2281  20   0  70816 50788 ep_pol S    ?        714:39 /usr/bin/perl -w /spider/perl/cluster.pl

The two columns of interest are 'STAT' and 'WCHAN'. This says that the 
node is 'S' for 'sleeping' and waiting for input in 'ep_pol' (yours will 
probably say 'select').
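
To pull just those two columns out for the node process, something along these lines should do. It is a sketch only: the function name show_node_state is made up, and it assumes the "ps axl" layout shown above, with a header line naming the columns and no empty fields before STAT.

```shell
#!/bin/sh
# Hypothetical helper: read "ps axl" output on stdin and print the STAT
# and WCHAN values for any process running cluster.pl.
show_node_state() {
    awk '
        # Header line: remember which field numbers hold STAT and WCHAN.
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == "STAT")  s = i
                if ($i == "WCHAN") w = i
            }
            next
        }
        # Data lines: report only the DXSpider node process.
        /cluster\.pl/ && s && w { print "STAT=" $s, "WCHAN=" $w }
    '
}
```

Usage would be "ps axl | show_node_state". For the example line above it prints "STAT=S WCHAN=ep_pol". If the node has hung, whatever appears in place of those two values is the interesting part.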

Hope this helps.

Dirk




