[Dxspider-support] Character encoding?

Dirk Koopman djk at tobit.co.uk
Wed Oct 24 14:41:57 CEST 2007


Bela Markus wrote:
> Hi,
> 
> what is about character encoding within a SPIDER network and in general 
> in DXCLUSTER? While in practice I do not see special characters in spots 
> and announcements, are there any restrictions, conventions or rules?

This is a huuuuuuuuge can of worms. In essence, there are no standards, 
conventions or rules. Unlike you, I see lots of special characters in 
spots and announces. It causes a number of headaches when I am trying to 
de-duplicate them and I don't always get it right.

Then there are the problems of perl versions. There is a partition wall 
between 5.8.1 and earlier versions of perl. Anything before 5.8.1 will 
not handle utf8 (my preferred solution) properly and treats all 
characters as single-byte, locale-based values. So whatever éáőúűüö is 
in your locale, that's what gets used. If someone else's (receiving) 
locale treats that byte value as something else - well, that's just tough!
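
To make the "that's just tough" case concrete, here is a minimal sketch in Python (used purely for illustration, it is not what the cluster software runs). The choice of ISO-8859-2 for the sender and ISO-8859-1 for the receiver is an assumption; any two single-byte locales that disagree on a byte value show the same effect.

```python
# The Hungarian letter o-with-double-acute (U+0151), written as an escape
# so the example does not depend on the encoding of this file.
letter = "\u0151"

# A sender on an ISO-8859-2 (Central European) locale emits one byte.
raw = letter.encode("iso8859_2")

# The receiver interprets that same byte under its own locale.
as_latin2 = raw.decode("iso8859_2")   # same locale: the letter survives
as_latin1 = raw.decode("iso8859_1")   # Western locale: a different letter

# In utf8 the same letter is two bytes, not one.
as_utf8 = letter.encode("utf-8")

print(raw.hex(), as_utf8.hex(), ascii(as_latin2), ascii(as_latin1))
```

The byte on the wire never changes; only the reader's idea of what it means does.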

And then there are the authors out there that will insist on trying to 
"clean up" text. My rule has been, for a very long time, that provided 
the PC sentence that comes in is valid, it goes out *exactly* as it 
comes in, with only the hop count being modified. The problem is that 
there is an implementation out there (and I don't know whose it is) that 
converts locale stuff to utf8 or vice versa (I haven't worked out which 
way either). This really does not help!
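
A rough sketch of why this hurts de-duplication, again in Python for illustration only. Both functions are hypothetical: `reencode_latin2_to_utf8` stands in for the unknown converting implementation (whose direction I have not worked out), and `is_dupe` is a naive byte-for-byte comparison, not the actual de-duplication code.

```python
def reencode_latin2_to_utf8(text: bytes) -> bytes:
    """Hypothetical 'clean up' step: read locale bytes as ISO-8859-2, emit utf8."""
    return text.decode("iso8859_2").encode("utf-8")


def is_dupe(a: bytes, b: bytes) -> bool:
    """Naive duplicate check: two texts are dupes only if byte-identical."""
    return a == b


# A comment containing o-umlaut (U+00F6), as sent from an ISO-8859-2 locale.
original = "k\u00f6sz\u00f6n\u00f6m".encode("iso8859_2")

via_passthrough = original                          # node that forwards untouched
via_converter = reencode_latin2_to_utf8(original)   # node that "cleans up"

print(is_dupe(original, via_passthrough))  # same bytes arrive
print(is_dupe(original, via_converter))    # different bytes arrive
```

The same text arriving by two routes no longer compares equal, so it gets through as two separate items.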

> 
> My system is running on CentOS 5 with UTF-8 and Hungarian locales. A 
> quick try in local announcements shows that special characters like 
> éáőúűüö etc. are OK. In the log file they are converted to single byte 
> codes. No more examination done.

I would like to go to utf8. But I can't see this happening for quite a 
while yet. And if I *do* go down that road it will mean that everyone 
will *have* to upgrade to a modern perl. Which will cause many people 
that are (still) running on things like Redhat 6.2 more pain than the 
occasional dupe spot or announce.

When you say single byte codes, I am presuming you mean things like %E3?
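
If that is what is meant, each %XX is simply one byte written as two hex digits. A small Python sketch of that reading follows. It is an assumption that the log uses this percent form, and the ISO-8859-2 interpretation of the bytes is likewise assumed.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

# Decode the percent form back to the raw single byte.
raw = unquote_to_bytes("%E3")

# Going the other way: e-acute (U+00E9) as an ISO-8859-2 byte, then escaped.
e_acute = "\u00e9".encode("iso8859_2")
escaped = quote_from_bytes(e_acute)

# ASCII passes through unchanged; only the high byte is escaped.
mixed = quote_from_bytes("hall\u00f3".encode("iso8859_2"))

print(raw.hex(), escaped, mixed)
```

Note that the escaped form records only the byte value, not which locale produced it.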

Dirk


