[Dxspider-support] Followup - Hang hang and some more hang...

Eric Carling eric at g0cgl.f2s.com
Wed Nov 16 12:46:05 GMT 2005


 
From: Lee Sawkins
Date: 11/15/05 07:06:16
To: The DXSpider Support list
Subject: Re: [Dxspider-support] Followup - Hang hang and some more hang...
 
  
Hi Dirk
 
Kelly ran his test software with my Spider cluster.
 
What I see is 100% CPU usage, and the system is too busy to accept any
other commands.  This is to be expected: there is simply not enough
CPU power to respond quickly to all the commands.
 
On Kelly's node the problem may simply be that one of his users has a
badly behaving client program that is overloading the cluster.
I once had a user whose client was echoing everything back.  This
caused 100% CPU usage for most of one DX contest weekend and made
things really sluggish.
 
By the way, my Windows 2K Spider cluster hit a maximum of 60 telnet
links the other day.  Whether this was the maximum possible or just a
coincidence, I don't know.
 
On AR Cluster nodes, the sysops have to trim their databases about once
per month to try to keep the nodes responsive to user commands.  They
leave 30 days worth of Dx Spots for their users.  I have never heard
even one complaint about lack of Dx Spot history.  It takes the AR
Cluster node down for about 30 minutes to do this maintenance.
Something to think about if you are going to use databases.  I really
like the way Spider is now.  It simply runs with almost no maintenance.
 
Would it not be possible for the sh/dx or sh/mydx commands to default to
looking back, say, 5 days maximum instead of 100?  I believe almost all of
the commands such as "sh/dx 15" which take a very long time to execute
are simply mistakes by the users.  The bad thing is that the response
takes so long that the users will enter the command a couple more
times, just to be sure the node isn't ignoring them!  If users really
want to look back 100 days, how about "sh/dx/dxxx", where the xxx can be
100 or any other number of days they want?  I am sure this would
cut out 99% of the problems.  Maybe make it a sysop-selectable item.
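As a rough illustration of the idea, a days-limit parser along these lines would do it.  (The function name, the exact "/dNNN" syntax, and the default value here are hypothetical, not DXSpider's actual parser.)

```python
# Sketch of the suggestion above: cap the default sh/dx lookback at a
# sysop-selectable number of days, with an explicit /dNNN override.
# Names and syntax are illustrative, not DXSpider's real implementation.
import re

DEFAULT_LOOKBACK_DAYS = 5   # sysop-selectable default instead of 100

def lookback_days(command: str) -> int:
    """Return how many days of spots a sh/dx variant should search."""
    # An explicit "/d100"-style option overrides the capped default.
    m = re.search(r'/d(\d+)', command, re.IGNORECASE)
    if m:
        return int(m.group(1))
    return DEFAULT_LOOKBACK_DAYS

print(lookback_days("sh/dx 15"))        # → 5   (capped default)
print(lookback_days("sh/dx/d100 15"))   # → 100 (user asked explicitly)
```

The point is that an accidental "sh/dx 15" then costs at most a 5-day scan, while a deliberate deep search is still possible.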
 
Lee
 
 
Dirk Koopman wrote:
>
> On Sun, 2005-11-13 at 14:42 -0500, charlie carroll wrote:
> > An observation.  Kelly sent me his load test software.  The test robots
> > issue a series of commands once logged in.  One command causes my node
> > to noticeably slow down: sh/dx/10 15.  This means show me 10 entries
> > from the spot database, each of which contains the text "15".  As a result,
> > my node has to open and search every spot log looking for the text
> > "15".  This takes a noticeable amount of time.  If I have several robots
> > connected doing the same thing, my node seemingly grinds to a halt until
> > each of these searches is processed.  One other observation: a search
> > such as sh/dx/10 3d2 completes a lot quicker.
> >
> > Dirk can probably explain why the search for 3d2 completes faster than
> > the search for the text 15.
> >
>
> Ok. You have put your finger on (just) one of the reasons why I want to
> start looking at something a bit more sophisticated than flat files (for
> searching spots) on the one hand and more reliable than DB_File (for
> things like user file access) on the other.
>
> The sh/dx query (as someone else has pointed out) is wrong. However, you
> could argue that I could be a bit cleverer and catch that sort of thing.
> Having said that, what is happening is that the failure case for a
> search (i.e. when none, or fewer than the requested number, of matches
> for the required prefix are found) will search the last 100 days' worth
> of spots.  Linearly.  This is a comparatively slow operation.
>
> This is not a problem in normal use. But doing several failed searches
> will slow it down to a crawl, because it isn't a multithreaded process.
> Again, the tradeoff is simplicity of writing and ease (and consistency)
> of debugging versus possible performance gains. If I had wanted something
> that was blindingly fast, I would have written it in C - if I had had
> the time. The usual problem: you can have two of fast, good or cheap
> (read: easy to program, modify, port and update).
>
> Using a simple database system like SQLite means that this sort of thing
> would not happen. A query would complete very quickly (given that the
> indexing is sane). It is also a module written in C and so is inherently
> quicker for *its* (completely different) failure cases (there is always
> at least one for tasks like these).
>
> I must say that I am really curious as to why certain people only manage
> a "few" (say 60 or so connections) and others (including me BTW) seem
> not to have any limitations at all. I believe Angel regularly has 150 or
> more on his box.
>
> I muse gently whether, since there are no limitations in the code as to
> the number of connections, we should be looking somewhere else. And
> since you have had essentially the same problems with Linux and Windows
> (no more than 60 active connections), I wonder whether we are looking at
> a (stateful) firewall "active table" resource filling up problem?
>
> Dirk G1TLH
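The indexed-database approach Dirk describes could be sketched like this, using Python's sqlite3 module.  The spots table, its columns, and the sample data are hypothetical, not DXSpider's actual schema:

```python
# Minimal sketch of the indexed-database idea from the quoted message.
# The schema and data are made up for illustration only.
import sqlite3
import time

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE spots (
                 dxcall TEXT, freq REAL, spotter TEXT, t INTEGER)""")
# Without an index, every sh/dx-style query is a linear scan, much like
# reading 100 days of flat spot files; with one, the lookup is fast.
con.execute("CREATE INDEX idx_spots_call_time ON spots (dxcall, t)")

now = int(time.time())
con.executemany("INSERT INTO spots VALUES (?,?,?,?)",
                [("3D2RR", 21015.0, "G1TLH", now - 60),
                 ("ZL1AAA", 14005.0, "VE7CC", now - 120)])

# Rough equivalent of "sh/dx/10 3d2": the last 10 spots whose DX call
# starts with 3D2, newest first.
rows = con.execute(
    """SELECT dxcall, freq, spotter FROM spots
       WHERE dxcall LIKE '3D2%' ORDER BY t DESC LIMIT 10""").fetchall()
print(rows)   # → [('3D2RR', 21015.0, 'G1TLH')]
```

With sane indexing (as Dirk puts it), the failure case - a pattern that matches nothing - returns immediately instead of scanning every record.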
 


More information about the Dxspider-support mailing list