[Dxspider-support] Character encoding?

Thu Oct 25 01:31:19 CEST 2007

Hi Dirk,

thanks for the reply. This is what I expected, and as an IT guy I fully 
understand the pain and know that it is (nearly) impossible to find a 
solution in the current network, which is composed of elements running so 
many different software packages and versions.

Technically, one solution is a new link protocol that uses UTF-8 internally 
and converts to a traditional single-byte character set at the borders to 
ordinary dxcluster PC messages. iso-8859-1 seems to be a safe choice. 
Strings from old-style nodes can likewise be converted to UTF-8 on 
arrival. This would be transparent to the existing network, so it wouldn't 
cause harm. However, nodes outside our control can still alter text, 
causing pain for dupe filtering.
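
To make the border idea concrete, here is a minimal sketch in Python (not 
DXSpider code; the function names are hypothetical and only illustrate the 
conversion). It assumes text is Unicode internally, is sent as UTF-8 on the 
new link, and is encoded as iso-8859-1 towards old-style nodes. Note that 
iso-8859-1 does not contain ő and ű, so the legacy direction is lossy for 
Hungarian text.

```python
# Sketch only: these helpers are hypothetical, not part of DXSpider.

def to_legacy(text: str) -> bytes:
    """Encode internal text for an old-style single-byte link.

    Characters that iso-8859-1 cannot represent are replaced by '?'.
    """
    return text.encode("iso-8859-1", errors="replace")


def from_legacy(raw: bytes) -> str:
    """Decode bytes arriving from an old-style node.

    Assumes the sender meant iso-8859-1. Every byte value maps to some
    character, so this never fails, but the guess may be wrong.
    """
    return raw.decode("iso-8859-1")


def to_new_link(text: str) -> bytes:
    """Encode internal text for the (assumed) UTF-8 link protocol."""
    return text.encode("utf-8")


if __name__ == "__main__":
    sample = "éáőúűüö"
    legacy = to_legacy(sample)
    print(legacy)               # ő and ű come out as '?'
    print(from_legacy(legacy))  # what the far side would display
    print(to_new_link(sample))  # multi-byte UTF-8 on the new link
```

A spot that crosses such a border no longer matches the original byte for 
byte, which is exactly the dupe filtering pain mentioned above.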

Just quick thoughts. To be honest, as dxclusters are used today, character 
encoding is not a major issue from the end user's point of view, thanks to 
the commonly used English-like text. Maybe I'm wrong.

Regards... Béla

P.S: Yes, like #E3

Dirk Koopman wrote:
> Bela Markus wrote:
>   
>> Hi,
>>
>> what about character encoding within a SPIDER network and in DXCLUSTER 
>> in general? While in practice I do not see special characters in spots 
>> and announcements, are there any restrictions, conventions or rules?
>>     
>
> This is a huuuuuuuuge can of worms. In essence, there are no standards, 
> conventions or rules. Unlike you, I see lots of special characters in 
> spots and announces. It causes a number of headaches when I am trying to 
> de-duplicate them and I don't always get it right.
>
> Then there are the problems of perl versions. There is a partition wall 
> between 5.8.1 and earlier versions of perl. Anything before 5.8.1 will 
> not handle utf8 (my preferred solution) properly and treats any 
> characters as single byte, locale based, values. So whatever éáőúűüö is 
> in your locale, that's what gets used. If someone else's (receiving) 
> locale treats that byte value as something else - well, that's just tough!
>
> And then there are the authors out there that will insist on trying to 
> "clean up" text. My rule has been, for a very long time, that provided 
> the PC sentence that comes in is valid, it goes out *exactly* as it 
> comes in, with only the hop count being modified. The problem is that 
> there is an implementation out there (and I don't know whose it is) that 
> converts locale stuff to utf8 or vice versa (I haven't worked out which 
> way either). This really does not help!
>
>   
>> My system is running on CentOS 5 with UTF-8 and Hungarian locales. A 
>> quick try in local announcements shows that special characters like 
>> éáőúűüö etc. are OK. In the log file they are converted to single byte 
>> codes. No more examination done.
>>     
>
> I would like to go to utf8. But I can't see this happening for quite a 
> while yet. And if I *do* go down that road it will mean that everyone 
> will *have* to upgrade to a modern perl. Which will cause many people 
> that are (still) running on things like Redhat 6.2 more pain than the 
> occasional dupe spot or announce.
>
> When you say single byte codes, I am presuming you mean things like %E3?
>
> Dirk