[Dxspider-support] Need node partners.....again!
Dirk Koopman
djk at tobit.co.uk
Fri Jun 5 12:05:28 BST 2015
On 05/06/15 06:25, Pascal Stevenhaagen [PB1SAM] wrote:
> Most of the time the cluster stops and starts correctly.
> Once in a while it goes wrong.
> You can also remove the file yourself, like
> rm /spider/local/cluster.lck
>
> To make a batch, simply create a text file with a name you like,
> like /spider/spiderstart.sh
> Then add the following lines:
>
> #!/bin/bash
> rm /spider/local/cluster.lck
> /usr/bin/perl -w /spider/perl/cluster.pl
>
>
> Close and save the file, give it execute permissions.
> Now edit your inittab file, and find the line
> ##Start DXSpider on bootup and respawn it should it crash
> DX:3:respawn:/bin/su -c "/usr/bin/perl -w /spider/perl/cluster.pl" sysop >/dev/tty7
>
>
> Change "/usr/bin/perl -w /spider/perl/cluster.pl" to
> "/spider/spiderstart.sh"
I am getting rather concerned by this. The cluster.lck file is there for
a reason. If the node has crashed or otherwise stopped without removing
the .lck file, then restarting the node will not fail. The reason for
this is that the .lck file contains the process id of the cluster.pl
that created it. On startup cluster.pl reads its .lck file and, if no
process with that id exists, it will just start normally. You
don't need to remove the .lck file first. The new process will replace the
old process id in that file with its own. All sorts of support programs
(e.g. create/update_sysop.pl) rely on that lock file being there to
prevent accidents/corruptions occurring.
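The check described above can be sketched in shell. This is only an
illustration of the idea, not cluster.pl's actual code: the function name
lock_status is made up, and the use of "kill -0" (with /proc as a fallback
when the process belongs to another user) is an assumption about how one
might test for a live process on Linux.

```shell
#!/bin/bash
# Hypothetical helper (not part of DXSpider): report whether the process id
# stored in a lock file still belongs to a live process.
lock_status() {
    local lockfile="$1" pid
    if [ ! -f "$lockfile" ]; then
        echo "no lock file"
        return 0
    fi
    # The lock file is assumed to hold just the pid, as shown by "cat".
    pid=$(tr -d '[:space:]' < "$lockfile")
    case "$pid" in
        ''|*[!0-9]*) echo "lock file does not hold a pid"; return 0 ;;
    esac
    # kill -0 sends no signal, it only tests whether the pid can be signalled.
    # It also fails for a live process owned by someone else, hence /proc.
    if kill -0 "$pid" 2>/dev/null || [ -d "/proc/$pid" ]; then
        echo "running: $pid"
    else
        echo "stale: $pid"
    fi
}
```

For example, "lock_status /spider/local/cluster.lck" would print
"stale: <pid>" after a crash, in which case a normal restart should simply
work without touching the file.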
So if another cluster.pl is started and it complains about there being
another process already running, then the chances are strong that it is
not lying. Removing the .lck file and then just starting another
cluster.pl will certainly corrupt things like the userfile - but it will
also fail anyway because it also won't be able to start up listeners on
ports like 27754 or 7300/8000 (depending on which you use).
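One way to see which process is holding the listener ports is to pull the
LISTEN lines out of the netstat output. The following is a rough sketch
only: listener_pids is a made-up name, and it assumes the usual Linux
"netstat -tapn" column layout with PID/Program name in the seventh column
(run as root, otherwise that column may just show "-").

```shell
#!/bin/bash
# Hypothetical helper: read "netstat -tapn" output on stdin and print
# "local-address pid" for every listening TCP socket.
listener_pids() {
    awk '$1 ~ /^tcp/ && $6 == "LISTEN" {
        split($7, owner, "/")   # "19194/perl" -> owner[1] = "19194"
        print $4, owner[1]
    }'
}
```

Used as "netstat -tapn 2>/dev/null | listener_pids", the pid printed
against port 7300 or 27754 can then be compared with the contents of
cluster.lck.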
So we need to get to the bottom of this. Please bear in mind that, just now,
there are 387 nodes in the DXSpider compatible network. All bar about 40
of those are running DXSpider and have zero problems with .lck files
(that have been reported at least).
I do not approve of "work arounds" like the one detailed above. If they
are a) necessary and b) actually work, then I want to know why and how
so that the main line code can be changed to make the work around
unnecessary.
If cluster.pl complains that it is already running then it is a simple
matter to check:
ps ax | grep cluster
will produce something like:
10764 pts/53 S+ 0:00 grep --color=auto cluster
19194 pts/48 S+ 24:27 perl ./cluster.pl
That says that there is a node currently running. If you look at the
cluster.lck:
cat local/cluster.lck
gives:
19194
You can see that it agrees with the "ps ax". This means that it is
running and if you can't get in then there is something else wrong. You
can investigate "hanging" in the first instance by looking at this:
netstat -tapn | grep -P '7300|27754'
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp 0 0 0.0.0.0:7300 0.0.0.0:* LISTEN 19194/perl
tcp 0 0 127.0.0.1:27754 0.0.0.0:* LISTEN 19194/perl
tcp 0 0 127.0.0.1:27754 127.0.0.1:40010 ESTABLISHED 19194/perl
tcp 0 0 127.0.0.1:40010 127.0.0.1:27754 ESTABLISHED 19264/perl
The two numbers to the right of the 'tcp' are the number of bytes in the
receive and transmit queues. On a quiet system they will be 0. Even on a busy
system:
tcp 0 0 82.103.135.24:7300 0.0.0.0:* LISTEN
tcp 0 0 127.0.0.1:7300 0.0.0.0:* LISTEN
tcp 0 0 127.0.0.1:27754 0.0.0.0:* LISTEN
tcp 0 0 82.103.135.24:7300 213.138.110.143:42281 ESTABLISHED
tcp 0 0 82.103.135.24:7300 99.6.147.52:49377 ESTABLISHED
tcp 0 0 82.103.135.24:7300 83.162.186.242:36370 ESTABLISHED
tcp 0 0 82.103.135.24:39326 195.171.43.144:7300 ESTABLISHED
tcp 0 0 82.103.135.24:7300 217.146.110.41:36623 ESTABLISHED
tcp 0 0 82.103.135.24:7300 93.142.192.186:51161 ESTABLISHED
tcp 0 0 82.103.135.24:7300 217.160.22.169:49550 ESTABLISHED
tcp 0 0 82.103.135.24:7300 85.220.185.162:49179 ESTABLISHED
tcp 0 0 82.103.135.24:7300 178.128.167.25:54418 ESTABLISHED
tcp 0 0 82.103.135.24:7300 164.126.146.32:49172 ESTABLISHED
tcp 0 0 82.103.135.24:7300 92.72.61.178:51181 ESTABLISHED
tcp 0 0 82.103.135.24:7300 83.252.226.34:59837 ESTABLISHED
tcp 0 0 82.103.135.24:7300 107.211.218.32:2416 ESTABLISHED
tcp 0 0 82.103.135.24:7300 89.140.118.183:55942 ESTABLISHED
tcp 0 0 82.103.135.24:7300 80.0.168.159:59331 ESTABLISHED
tcp 0 0 82.103.135.24:7300 89.168.61.217:1091 ESTABLISHED
tcp 0 0 82.103.135.24:7300 93.95.80.107:4308 ESTABLISHED
tcp 0 0 82.103.135.24:7300 2.230.223.137:49356 ESTABLISHED
tcp 0 0 127.0.0.1:27754 127.0.0.1:50397 TIME_WAIT
tcp 0 0 82.103.135.24:7300 79.141.97.6:52724 ESTABLISHED
tcp 0 0 82.103.135.24:7300 68.100.98.221:65350 ESTABLISHED
tcp 0 0 82.103.135.24:7300 80.217.42.87:49317 ESTABLISHED
tcp 0 0 82.103.135.24:7300 99.82.248.159:1400 ESTABLISHED
tcp 0 0 82.103.135.24:7300 108.72.240.121:53807 ESTABLISHED
tcp 0 0 82.103.135.24:7300 92.13.176.38:49173 ESTABLISHED
tcp 0 0 82.103.135.24:7300 109.69.104.145:39727 ESTABLISHED
tcp 0 0 82.103.135.24:7300 58.162.248.200:63144 ESTABLISHED
tcp 0 0 82.103.135.24:7300 86.26.145.167:2664 ESTABLISHED
tcp 0 0 82.103.135.24:7300 83.4.1.15:5738 ESTABLISHED
tcp 0 0 82.103.135.24:7300 92.4.126.139:52101 ESTABLISHED
tcp 0 0 82.103.135.24:7300 86.167.103.21:56777 ESTABLISHED
tcp 0 0 82.103.135.24:7300 37.228.211.66:49286 ESTABLISHED
tcp 0 0 82.103.135.24:7300 79.106.20.5:62846 ESTABLISHED
tcp 0 0 82.103.135.24:7300 184.1.71.111:58854 FIN_WAIT2
tcp 0 0 82.103.135.24:7300 108.85.7.188:64788 ESTABLISHED
tcp 0 0 82.103.135.24:7300 212.159.40.67:61489 ESTABLISHED
tcp 0 0 82.103.135.24:7300 202.154.141.28:49388 ESTABLISHED
tcp 0 0 82.103.135.24:7300 87.81.158.136:56937 ESTABLISHED
tcp 0 0 82.103.135.24:7300 78.70.174.109:64745 ESTABLISHED
tcp 0 0 82.103.135.24:7300 193.53.39.133:2791 ESTABLISHED
tcp 0 0 82.103.135.24:7300 91.3.231.33:49730 ESTABLISHED
tcp 0 0 82.103.135.24:7300 73.26.162.240:51183 ESTABLISHED
tcp 0 0 82.103.135.24:7300 78.25.123.189:17219 ESTABLISHED
tcp 0 0 82.103.135.24:7300 216.54.125.50:3243 ESTABLISHED
tcp 0 0 82.103.135.24:7300 87.114.78.109:63748 ESTABLISHED
tcp 0 0 82.103.135.24:7300 78.1.230.118:53454 ESTABLISHED
tcp 0 0 82.103.135.24:7300 84.106.116.54:3161 ESTABLISHED
tcp 0 0 82.103.135.24:7300 82.0.27.159:1177 ESTABLISHED
tcp 0 0 82.103.135.24:7300 204.235.44.74:58853 ESTABLISHED
tcp 0 0 82.103.135.24:7300 184.1.71.111:58852 TIME_WAIT
tcp 0 0 82.103.135.24:7300 68.47.234.14:51087 ESTABLISHED
tcp 0 0 82.103.135.24:7300 70.178.167.206:51860 ESTABLISHED
tcp 0 0 82.103.135.24:7300 68.100.96.156:64788 ESTABLISHED
tcp 0 0 82.103.135.24:7300 71.123.183.231:59821 ESTABLISHED
tcp 0 0 82.103.135.24:7300 188.221.68.33:49760 ESTABLISHED
tcp6 0 0 2a00:9080:1:5cf::1:7300 :::* LISTEN
tcp6 0 0 2a00:9080:1:5cf::1:7300 2001:41c8:51:457::60624 ESTABLISHED
tcp6 0 0 2a00:9080:1:5cf::1:7300 2a01:260:8033:1:c:27973 ESTABLISHED
tcp6 0 0 2a00:9080:1:5cf::1:7300 2a01:7e00::f03c:9:36535 ESTABLISHED
They will be (mostly) 0. If you have anything else then this needs to be
investigated.
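On a busy node it is easier to filter for the interesting lines than to
read the whole list. A minimal sketch, assuming the second and third
columns are the receive and transmit queues as described above
(nonzero_queues is a made-up name):

```shell
#!/bin/bash
# Hypothetical helper: read netstat output on stdin and print only the TCP
# lines whose receive or transmit queue (columns 2 and 3) is not 0.
nonzero_queues() {
    awk '$1 ~ /^tcp/ && ($2 != 0 || $3 != 0)'
}
```

For example: netstat -tapn | grep -P '7300|27754' | nonzero_queues
No output means every queue was 0 at that moment. A connection that keeps
showing up with a growing queue is the one to look at.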
Dirk G1TLH