<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=iso-8859-1">
<META content="IncrediMail 1.0" name=GENERATOR></HEAD>
<BODY style="BACKGROUND-POSITION: 0px 0px; FONT-SIZE: 12pt; MARGIN: 5px 10px 10px; FONT-FAMILY: Arial" bgColor=#ffffff background="" scroll=yes ORGYPOS="0">
<TABLE id=INCREDIMAINTABLE cellSpacing=0 cellPadding=2 width="100%" border=0>
<TBODY>
<TR>
<TD id=INCREDITEXTREGION style="FONT-SIZE: 12pt; CURSOR: auto; FONT-FAMILY: Arial" width="100%">
<DIV>
<DIV id=receivestrings>
<DIV dir=ltr style="FONT-SIZE: 11pt"><I><B>From:</B></I> <A href="mailto:ve7cc@shaw.ca">Lee Sawkins</A></DIV>
<DIV dir=ltr style="FONT-SIZE: 11pt"><I><B>Date:</B></I> 11/15/05 07:06:16</DIV>
<DIV dir=ltr style="FONT-SIZE: 11pt"><I><B>To:</B></I> <A href="mailto:dxspider-support@dxcluster.org">The DXSpider Support list</A></DIV>
<DIV dir=ltr style="FONT-SIZE: 11pt"><I><B>Subject:</B></I> Re: [Dxspider-support] Followup - Hang hang and some more hang...</DIV></DIV>
<DIV> </DIV><SPAN id=receiveimages> </SPAN>
<DIV>Hi Dirk</DIV>
<DIV> </DIV>
<DIV>Kelly ran his test software with my Spider cluster.</DIV>
<DIV> </DIV>
<DIV>What I see is 100% CPU usage and the system is too busy to accept any</DIV>
<DIV>other commands. This is what is to be expected. There is simply not</DIV>
<DIV>the CPU power to respond quickly enough to all the commands.</DIV>
<DIV> </DIV>
<DIV>On Kelly's node the problem may simply be that one of his users has a</DIV>
<DIV>badly behaved client program that is overloading the cluster.</DIV>
<DIV>I once had a user that was echoing everything back. This caused 100%</DIV>
<DIV>CPU usage for most of one DX contest weekend and made things really</DIV>
<DIV>sluggish.</DIV>
<DIV> </DIV>
<DIV>By the way, my Windows 2K Spider cluster hit a maximum of 60 telnet</DIV>
<DIV>links the other day. Whether this was the maximum possible or just a</DIV>
<DIV>coincidence, I don't know.</DIV>
<DIV> </DIV>
<DIV>On AR Cluster nodes, the sysops have to trim their databases about once</DIV>
<DIV>per month to try to keep the nodes responsive to user commands. They</DIV>
<DIV>leave 30 days' worth of Dx Spots for their users. I have never heard</DIV>
<DIV>even one complaint about lack of Dx Spot history. It takes the AR</DIV>
<DIV>Cluster node down for about 30 minutes to do this maintenance.</DIV>
<DIV>Something to think about if you are going to use databases. I really</DIV>
<DIV>like the way Spider is now. It simply runs with almost no maintenance.</DIV>
<DIV> </DIV>
<DIV>Would it not be possible for the sh/dx or sh/mydx commands to default to</DIV>
<DIV>looking back, say, 5 days maximum instead of 100? I believe almost all of</DIV>
<DIV>the commands such as "sh/dx 15" which take a very long time to execute</DIV>
<DIV>are simply mistakes by the users. The bad thing is that the response</DIV>
<DIV>takes so long that the users will enter the command a couple more</DIV>
<DIV>times, just to be sure the node isn't ignoring them! If users really</DIV>
<DIV>want to look back 100 days, how about "sh/dx/dxxx", where the xxx can be</DIV>
<DIV>100 or any other number of days that they want? I am sure this would</DIV>
<DIV>cut down 99% of the problems. Maybe it could be a sysop-selectable item.</DIV>
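<DIV> </DIV>
<DIV>A rough sketch of that idea, in Python rather than Spider's own Perl and</DIV>
<DIV>not taken from the Spider source. The names DEFAULT_DAYS, MAX_DAYS and</DIV>
<DIV>parse_lookback, and the "/dNNN" qualifier itself, are only assumptions</DIV>
<DIV>used to illustrate the proposal:</DIV>
<PRE>
```python
import re

DEFAULT_DAYS = 5    # hypothetical sysop-selectable default look-back
MAX_DAYS = 100      # the current look-back limit mentioned in this thread


def parse_lookback(command):
    """Split a sh/dx style command into (days_to_search, search_text).

    Recognises a hypothetical "/dNNN" qualifier, e.g. "sh/dx/d100 3d2".
    Any other qualifier (such as the "/10" count) is ignored here.
    """
    verb, _, rest = command.strip().partition(" ")
    days = DEFAULT_DAYS
    # verb looks like "sh/dx/d100"; qualifiers start after the second part
    for qualifier in verb.split("/")[2:]:
        m = re.fullmatch(r"d(\d+)", qualifier.lower())
        if m:
            # never look back further than the hard limit
            days = min(int(m.group(1)), MAX_DAYS)
    return days, rest.strip()
```
</PRE>
<DIV>With this, a mistaken "sh/dx 15" would only ever scan DEFAULT_DAYS of</DIV>
<DIV>spots, while "sh/dx/d100 15" would still allow the full search for those</DIV>
<DIV>who really want it.</DIV>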
<DIV> </DIV>
<DIV>Lee</DIV>
<DIV> </DIV>
<DIV> </DIV>
<DIV>Dirk Koopman wrote:</DIV>
<DIV>></DIV>
<DIV>> On Sun, 2005-11-13 at 14:42 -0500, charlie carroll wrote:</DIV>
<DIV>> > An observation. Kelly sent me his load test software. The test robots</DIV>
<DIV>> > issue a series of commands once logged in. One command causes my node</DIV>
<DIV>> > to noticeably slow down.... sh/dx/10 15. This means show me 10 entries</DIV>
<DIV>> > from the spot database, each of which contains the text "15". As a result,</DIV>
<DIV>> > my node has to open and search every spot log searching for the text</DIV>
<DIV>> > "15". This takes a noticeable amount of time. If I have several robots</DIV>
<DIV>> > connected doing the same thing, my node seemingly grinds to a halt until</DIV>
<DIV>> > each of these searches is processed. One other observation... a search</DIV>
<DIV>> > such as sh/dx/10 3d2 completes a lot quicker.</DIV>
<DIV>> ></DIV>
<DIV>> > Dirk can probably explain why the search for 3d2 completes faster than</DIV>
<DIV>> > the search for the text 15.</DIV>
<DIV>> ></DIV>
<DIV>></DIV>
<DIV>> Ok. You have put your finger on (just) one of the reasons why I want to</DIV>
<DIV>> start looking at something a bit more sophisticated than flat files (for</DIV>
<DIV>> searching spots) on the one hand and more reliable than DB_File (for</DIV>
<DIV>> things like user file access) on the other.</DIV>
<DIV>></DIV>
<DIV>> The sh/dx query (as someone else has pointed out) is wrong. However, you</DIV>
<DIV>> could argue that I could be a bit cleverer and catch that sort of thing.</DIV>
<DIV>> Having said that, what is happening is that the failure case for a</DIV>
<DIV>> search (i.e. none, or fewer than the requested number, of the required prefix)</DIV>
<DIV>> will search the last 100 days' worth of spots. Linearly. This is a</DIV>
<DIV>> comparatively slow operation.</DIV>
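<DIV> </DIV>
<DIV>To illustrate the failure case described above, here is a toy sketch in</DIV>
<DIV>Python (not the Spider code). It assumes one list of callsigns per day,</DIV>
<DIV>newest day first, as a stand-in for the daily flat files, and it assumes</DIV>
<DIV>the argument is matched as a callsign prefix. The function name and its</DIV>
<DIV>parameters are made up for the example:</DIV>
<PRE>
```python
def search_spots(days, prefix, limit=10, max_days=100):
    """Scan per-day spot lists newest first, stopping once limit is reached.

    days: list of lists of callsigns, days[0] being the newest day.
    Returns (matches, days_scanned).
    """
    wanted = prefix.upper()
    matches = []
    scanned = 0
    for day in days[:max_days]:
        scanned += 1
        for call in day:
            if call.upper().startswith(wanted):
                matches.append(call)
                if len(matches) == limit:
                    # success case: stop as soon as enough spots are found
                    return matches, scanned
    # failure case: every day up to max_days has been read
    return matches, scanned
```
</PRE>
<DIV>With one 3D2 spot per day, a search for "3d2" with a limit of 10 stops</DIV>
<DIV>after 10 days, whereas a search for "15" matches nothing and reads all</DIV>
<DIV>100 days before giving up.</DIV>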
<DIV>></DIV>
<DIV>> This is not a problem in normal use. But doing several failed searches</DIV>
<DIV>> will slow it down to a crawl, because it isn't a multithreaded process.</DIV>
<DIV>> Again, the tradeoff is simplicity for writing, ease (and consistency) of</DIV>
<DIV>> debugging versus possible performance gains. If I had wanted something</DIV>
<DIV>> that was blindingly fast, I would have written it in C - if I had had</DIV>
<DIV>> time. The usual problem: you can have two of fast, good or cheap (read:</DIV>
<DIV>> easy to program/modify, portable and easy to update).</DIV>
<DIV>></DIV>
<DIV>> Using a simple database system like SQLite means that this sort of thing</DIV>
<DIV>> would not happen. A query would complete very quickly (given that the</DIV>
<DIV>> indexing is sane). It is also a module written in C and so is inherently</DIV>
<DIV>> quicker for *its* (completely different) failure cases (there is always</DIV>
<DIV>> at least one for tasks like these).</DIV>
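<DIV> </DIV>
<DIV>As a rough illustration of the indexed approach, here is a sketch using</DIV>
<DIV>Python's standard sqlite3 module. The table layout, the column names and</DIV>
<DIV>the index name are assumptions for the example, not a proposed Spider</DIV>
<DIV>schema:</DIV>
<PRE>
```python
import sqlite3

# Newest spots for one prefix; the (prefix, ts) index serves both the
# lookup and the ordering.
QUERY = ("SELECT call, freq_khz, ts FROM spots "
         "WHERE prefix = ? ORDER BY ts DESC LIMIT ?")


def make_spot_db(spots):
    """Build an in-memory spot table from (prefix, call, freq_khz, ts) rows."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE spots "
               "(prefix TEXT, call TEXT, freq_khz REAL, ts INTEGER)")
    db.execute("CREATE INDEX idx_spots_prefix_ts ON spots (prefix, ts)")
    db.executemany("INSERT INTO spots VALUES (?, ?, ?, ?)", spots)
    db.commit()
    return db


def latest_spots(db, prefix, limit=10):
    """Return up to limit of the newest spots for a prefix."""
    return db.execute(QUERY, (prefix.upper(), limit)).fetchall()


def query_plan(db, prefix, limit=10):
    """Return SQLite's own description of how it would run the query."""
    rows = db.execute("EXPLAIN QUERY PLAN " + QUERY,
                      (prefix.upper(), limit)).fetchall()
    # the last column of each row is the human-readable detail text
    return " ".join(str(row[-1]) for row in rows)
```
</PRE>
<DIV>A search for a prefix that has no spots is answered from the index and</DIV>
<DIV>returns an empty list without reading the whole table, and query_plan()</DIV>
<DIV>can be used to check that the index is actually being used.</DIV>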
<DIV>></DIV>
<DIV>> I must say that I am really curious as to why certain people only manage</DIV>
<DIV>> a "few" connections (say 60 or so) and others (including me BTW) seem</DIV>
<DIV>> not to have any limitations at all. I believe Angel regularly has 150 or</DIV>
<DIV>> more on his box.</DIV>
<DIV>></DIV>
<DIV>> I muse gently, since there are no limitations in the code as to the</DIV>
<DIV>> number of connects, whether we should be looking somewhere else. And</DIV>
<DIV>> since you have had essentially the same problems with linux and windows</DIV>
<DIV>> (no more than 60 active connections), I wonder whether we are looking at</DIV>
<DIV>> a (stateful) firewall "active table" resource filling up problem?</DIV>
<DIV>></DIV>
<DIV>> Dirk G1TLH</DIV></DIV>
<DIV> </DIV></TD></TR>
</TBODY></TABLE></BODY></HTML>