[Dxspider-support] Need node partners.....again!

Dirk Koopman djk at tobit.co.uk
Fri Jun 5 12:05:28 BST 2015


On 05/06/15 06:25, Pascal Stevenhaagen [PB1SAM] wrote:
> Most of the times the cluster stops and starts correct.
> Once in a while it goes wrong.
> You can also remove the file by yourself, like
> rm /spider/local/cluster.lck
>
> To make a batch, simply create a text file with a name you like,
> like /spider/spiderstart.sh
> Then add the following lines:
>
> #!/bin/bash
> rm /spider/local/cluster.lck
> /usr/bin/perl -w /spider/perl/cluster.pl
>
>
> Close and save the file, give it execute permissions.
> Now edit your inittab file, and find the line
>
> ##Start DXSpider on bootup and respawn it should it crash
> DX:3:respawn:/bin/su -c "/usr/bin/perl -w /spider/perl/cluster.pl" sysop >/dev/tty7
>
>
> Change "/usr/bin/perl -w /spider/perl/cluster.pl" to
> "/spider/spiderstart.sh"
>

I am getting rather concerned by this. The cluster.lck file is there for
a reason. If the node has crashed or otherwise stopped without removing
the .lck file, then restarting the node will not fail. The reason for
this is that the .lck file contains the process id of the cluster.pl
that created it. On startup, cluster.pl reads its .lck file and, if no
process with the id in that file exists, it just starts normally. You
don't need to remove the .lck file first. The new process will replace
the old process id in that file with its own. All sorts of support
programs (e.g. create/update_sysop.pl) rely on that lock file being
there to prevent accidents/corruptions occurring.
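
As a rough illustration of the check described above, here is a minimal shell sketch. It is not the actual cluster.pl code (which is Perl and may differ in detail); the function name lock_status and the four result words are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper (not part of DXSpider): report the state of a
# pid lock file. Prints one of: missing, invalid, stale, running.
lock_status() {
    lockfile=$1
    pid=''

    # No lock file at all: nothing has claimed to be running.
    [ -f "$lockfile" ] || { echo missing; return; }

    # The lock file is assumed to hold a process id on its first line.
    read -r pid < "$lockfile" || true
    case $pid in
        ''|*[!0-9]*) echo invalid; return ;;
    esac

    # "kill -0" sends no signal, it only tests whether the process can
    # be signalled. It also fails for a live process owned by another
    # user, so fall back to /proc (Linux only) in that case.
    if kill -0 "$pid" 2>/dev/null || [ -d "/proc/$pid" ]; then
        echo running
    else
        echo stale
    fi
}
```

Something like "lock_status /spider/local/cluster.lck" would then print "stale" after a crash and "running" while a node is still up. Only in the second case is the lock really protecting anything, and in neither case is there a reason to delete the file by hand.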

So if another cluster.pl is started and it complains about there being 
another process already running, then the chances are strong that it is 
not lying. Removing the .lck file and then just starting another 
cluster.pl will certainly corrupt things like the userfile - but it will 
also fail anyway because it also won't be able to start up listeners on 
ports like 27754 or 7300/8000 (depending on which you use).
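
To see whether a port is already taken before starting a second cluster.pl, something along these lines could be used. This is only a sketch: port_has_listener is a name invented for this example, and it assumes the usual Linux net-tools netstat column layout (local address in field 4, state in field 6).

```shell
#!/bin/sh
# Hypothetical helper: read "netstat -tln" style output on stdin and
# succeed (exit 0) if some TCP socket is in LISTEN state on the port.
port_has_listener() {
    awk -v port="$1" '
        # Field 4 is the local address (ends in ":port"),
        # field 6 is the socket state.
        $1 ~ /^tcp/ && $6 == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
        END { exit !found }
    '
}
```

For example:

netstat -tln | port_has_listener 7300 && echo "port 7300 is already in use"

If that reports a listener, then something (most likely the existing node) is still holding the port.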

So we need to get to the bottom of this. Please bear in mind that, just
now, there are 387 nodes in the DXSpider compatible network. All bar
about 40 of those are running DXSpider and have zero problems with .lck
files (that have been reported at least).

I do not approve of "work arounds" like the one detailed above. If they 
are a) necessary and b) actually work, then I want to know why and how 
so that the main line code can be changed to make the work around 
unnecessary.

If cluster.pl complains that it is already running then it is a simple 
matter to check:

ps ax | grep cluster

will produce something like:

10764 pts/53   S+     0:00 grep --color=auto cluster
19194 pts/48   S+    24:27 perl ./cluster.pl

That says that there is a node currently running. If you look at the 
cluster.lck:

cat local/cluster.lck

gives:

19194

You can see that it agrees with the "ps ax". This means that it is
running and if you can't get in then there is something else wrong. You
can investigate "hanging" in the first instance by looking at this:

netstat -tapn | grep -P '7300|27754'
(Not all processes could be identified, non-owned process info
  will not be shown, you would have to be root to see it all.)
tcp        0      0 0.0.0.0:7300 0.0.0.0:*               LISTEN      19194/perl
tcp        0      0 127.0.0.1:27754 0.0.0.0:*               LISTEN      19194/perl
tcp        0      0 127.0.0.1:27754 127.0.0.1:40010         ESTABLISHED 19194/perl
tcp        0      0 127.0.0.1:40010 127.0.0.1:27754         ESTABLISHED 19264/perl

The two numbers to the right of 'tcp' are the number of bytes in the
receive and transmit queues. On a quiet system they will be 0. Even on a
busy system:

tcp        0      0 82.103.135.24:7300 0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:7300 0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:27754 0.0.0.0:*               LISTEN
tcp        0      0 82.103.135.24:7300 213.138.110.143:42281   ESTABLISHED
tcp        0      0 82.103.135.24:7300 99.6.147.52:49377       ESTABLISHED
tcp        0      0 82.103.135.24:7300 83.162.186.242:36370    ESTABLISHED
tcp        0      0 82.103.135.24:39326 195.171.43.144:7300     ESTABLISHED
tcp        0      0 82.103.135.24:7300 217.146.110.41:36623    ESTABLISHED
tcp        0      0 82.103.135.24:7300 93.142.192.186:51161    ESTABLISHED
tcp        0      0 82.103.135.24:7300 217.160.22.169:49550    ESTABLISHED
tcp        0      0 82.103.135.24:7300 85.220.185.162:49179    ESTABLISHED
tcp        0      0 82.103.135.24:7300 178.128.167.25:54418    ESTABLISHED
tcp        0      0 82.103.135.24:7300 164.126.146.32:49172    ESTABLISHED
tcp        0      0 82.103.135.24:7300 92.72.61.178:51181      ESTABLISHED
tcp        0      0 82.103.135.24:7300 83.252.226.34:59837     ESTABLISHED
tcp        0      0 82.103.135.24:7300 107.211.218.32:2416     ESTABLISHED
tcp        0      0 82.103.135.24:7300 89.140.118.183:55942    ESTABLISHED
tcp        0      0 82.103.135.24:7300 80.0.168.159:59331      ESTABLISHED
tcp        0      0 82.103.135.24:7300 89.168.61.217:1091      ESTABLISHED
tcp        0      0 82.103.135.24:7300 93.95.80.107:4308       ESTABLISHED
tcp        0      0 82.103.135.24:7300 2.230.223.137:49356     ESTABLISHED
tcp        0      0 127.0.0.1:27754 127.0.0.1:50397         TIME_WAIT
tcp        0      0 82.103.135.24:7300 79.141.97.6:52724       ESTABLISHED
tcp        0      0 82.103.135.24:7300 68.100.98.221:65350     ESTABLISHED
tcp        0      0 82.103.135.24:7300 80.217.42.87:49317      ESTABLISHED
tcp        0      0 82.103.135.24:7300 99.82.248.159:1400      ESTABLISHED
tcp        0      0 82.103.135.24:7300 108.72.240.121:53807    ESTABLISHED
tcp        0      0 82.103.135.24:7300 92.13.176.38:49173      ESTABLISHED
tcp        0      0 82.103.135.24:7300 109.69.104.145:39727    ESTABLISHED
tcp        0      0 82.103.135.24:7300 58.162.248.200:63144    ESTABLISHED
tcp        0      0 82.103.135.24:7300 86.26.145.167:2664      ESTABLISHED
tcp        0      0 82.103.135.24:7300 83.4.1.15:5738          ESTABLISHED
tcp        0      0 82.103.135.24:7300 92.4.126.139:52101      ESTABLISHED
tcp        0      0 82.103.135.24:7300 86.167.103.21:56777     ESTABLISHED
tcp        0      0 82.103.135.24:7300 37.228.211.66:49286     ESTABLISHED
tcp        0      0 82.103.135.24:7300 79.106.20.5:62846       ESTABLISHED
tcp        0      0 82.103.135.24:7300 184.1.71.111:58854      FIN_WAIT2
tcp        0      0 82.103.135.24:7300 108.85.7.188:64788      ESTABLISHED
tcp        0      0 82.103.135.24:7300 212.159.40.67:61489     ESTABLISHED
tcp        0      0 82.103.135.24:7300 202.154.141.28:49388    ESTABLISHED
tcp        0      0 82.103.135.24:7300 87.81.158.136:56937     ESTABLISHED
tcp        0      0 82.103.135.24:7300 78.70.174.109:64745     ESTABLISHED
tcp        0      0 82.103.135.24:7300 193.53.39.133:2791      ESTABLISHED
tcp        0      0 82.103.135.24:7300 91.3.231.33:49730       ESTABLISHED
tcp        0      0 82.103.135.24:7300 73.26.162.240:51183     ESTABLISHED
tcp        0      0 82.103.135.24:7300 78.25.123.189:17219     ESTABLISHED
tcp        0      0 82.103.135.24:7300 216.54.125.50:3243      ESTABLISHED
tcp        0      0 82.103.135.24:7300 87.114.78.109:63748     ESTABLISHED
tcp        0      0 82.103.135.24:7300 78.1.230.118:53454      ESTABLISHED
tcp        0      0 82.103.135.24:7300 84.106.116.54:3161      ESTABLISHED
tcp        0      0 82.103.135.24:7300 82.0.27.159:1177        ESTABLISHED
tcp        0      0 82.103.135.24:7300 204.235.44.74:58853     ESTABLISHED
tcp        0      0 82.103.135.24:7300 184.1.71.111:58852      TIME_WAIT
tcp        0      0 82.103.135.24:7300 68.47.234.14:51087      ESTABLISHED
tcp        0      0 82.103.135.24:7300 70.178.167.206:51860    ESTABLISHED
tcp        0      0 82.103.135.24:7300 68.100.96.156:64788     ESTABLISHED
tcp        0      0 82.103.135.24:7300 71.123.183.231:59821    ESTABLISHED
tcp        0      0 82.103.135.24:7300 188.221.68.33:49760     ESTABLISHED
tcp6       0      0 2a00:9080:1:5cf::1:7300 :::*                    LISTEN
tcp6       0      0 2a00:9080:1:5cf::1:7300 2001:41c8:51:457::60624 ESTABLISHED
tcp6       0      0 2a00:9080:1:5cf::1:7300 2a01:260:8033:1:c:27973 ESTABLISHED
tcp6       0      0 2a00:9080:1:5cf::1:7300 2a01:7e00::f03c:9:36535 ESTABLISHED

They will be (mostly) 0. If you have anything else then this needs to be 
investigated.
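
Rather than reading through a long listing by eye, the non-zero queues can be picked out with a small filter. This is just a sketch: nonzero_queues is a made-up name, and it assumes the usual netstat layout with Recv-Q in field 2 and Send-Q in field 3.

```shell
#!/bin/sh
# Hypothetical helper: read "netstat -tan" style output on stdin and
# print only the TCP lines whose Recv-Q (field 2) or Send-Q (field 3)
# is non-zero. Header lines and wrapped fragments are skipped because
# their first field does not start with "tcp".
nonzero_queues() {
    awk '$1 ~ /^tcp/ && ($2 != 0 || $3 != 0)'
}
```

For example:

netstat -tan | nonzero_queues

No output means every queue was empty at that moment. A line that keeps showing up on repeated runs is the one to look at.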

Dirk G1TLH
