Erlang/OTP Forums

Erlang questions mailing list ~ The importance of Basic Unicode Understanding in Erlang

Guest
Posted: Tue Sep 27, 2011 3:37 pm
Hi there everyone.

I've recently done some work where, due to circumstances, Unicode woes were had by everyone. It caught me by surprise, and I figure that if it hasn't bitten you yet, it might sooner or later. As such, I published a blog post on the issue yesterday: http://ferd.ca/will-the-real-unicode-wrangler-please-stand-up.html

It doesn't go into advanced details; it's about some very simple warnings. When dealing with strings, the binary_to_list, list_to_binary and iolist_to_binary functions are to be avoided. The length function is no longer safe, and neither are the comparison operators. When using io:format, "~s" is no longer what we want all the time, but rather "~ts", and so on. This partial support is even more awkward for countries and languages that depend on some Unicode characters for everyday use, because Erlang source files are always assumed to be Latin-1, even though the Erlang shell is fine with Unicode.
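To make the warnings above concrete, here is a minimal sketch of the byte-oriented functions next to their counterparts in the `unicode` module. The module name `unicode_pitfalls` and the sample string are only illustrative; the code points are written as hex literals so the example does not depend on the source file encoding.

```erlang
-module(unicode_pitfalls).
-export([demo/0]).

demo() ->
    %% A two-character string as a list of code points:
    %% U+20AC (euro sign) followed by the digit 5.
    Str = [16#20AC, $5],

    %% list_to_binary/1 only accepts integers in 0..255, so it fails here.
    badarg = try list_to_binary(Str) catch error:Reason -> Reason end,

    %% unicode:characters_to_binary/1 encodes the code points as UTF-8.
    Bin = unicode:characters_to_binary(Str),
    <<16#E2, 16#82, 16#AC, $5>> = Bin,

    %% Two code points in the list, four bytes in the binary.
    2 = length(Str),
    4 = byte_size(Bin),

    %% binary_to_list/1 returns raw bytes, not code points;
    %% unicode:characters_to_list/1 decodes the UTF-8 back.
    [16#E2, 16#82, 16#AC, $5] = binary_to_list(Bin),
    Str = unicode:characters_to_list(Bin),

    %% "~ts" is the Unicode-aware variant of "~s" and accepts
    %% code points above 255.
    Str = lists:flatten(io_lib:format("~ts", [Str])),
    ok.
```

Calling `unicode_pitfalls:demo()` should return `ok` if every match holds.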

I'm no expert in i18n affairs, but we currently have no standard library way to do basic operations such as calculating the length of strings, splitting binaries or lists by grapheme clusters, performing normalisation, converting strings to uppercase/lowercase/titlecase, comparing strings, reversing them, etc. We have to rely on external libraries. While these libraries are not bad, standard implementations are usually nicer for everyone. I'm also in no position to push people to implement the libraries we need when I'm offering no monetary incentive myself.

As such, I felt like having (yet another) discussion of the issues with Unicode, and of what we think would be the ideal way to solve the problem within Erlang. Any opinions?

--
Fred H
Guest
Posted: Tue Sep 27, 2011 6:48 pm
I'm replying to this with erlang-questions in CC.

On 2011-09-27, at 13:43 PM, Michael Uvarov wrote:

> Hi,
>
> There are two ways to solve this problem:
> use Erlang for working with data;
> use a native (C or C++) implementation.
> The first variant is safe, but slow.
> The second variant is fast, but if we choose it we will have problems
> with concurrency and stability of our application (NIFs and drivers can
> crash the whole VM).
>
> Why is the Erlang implementation slow?
> Strings in Erlang are simple lists, 16 bytes per char, not stored contiguously.

You're forgetting the binary format, which lets you have any number of bytes per character. There are currently match types for utf8, utf16 and utf32, and it can support BOMs if I recall.
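As a rough illustration of the utf8 match type mentioned above, the following sketch walks a UTF-8 binary one code point at a time using the bit syntax. The module and function names are made up for the example, and the error term `invalid_utf8` is an arbitrary choice.

```erlang
-module(utf8_walk).
-export([code_points/1]).

%% Decode a UTF-8 binary into a list of code points,
%% consuming one (variable-width) character per step.
code_points(Bin) when is_binary(Bin) ->
    code_points(Bin, []).

%% The /utf8 type matches 1 to 4 bytes and binds C to the code point.
code_points(<<C/utf8, Rest/binary>>, Acc) ->
    code_points(Rest, [C | Acc]);
code_points(<<>>, Acc) ->
    lists:reverse(Acc);
%% Anything left over is not valid UTF-8.
code_points(_Invalid, _Acc) ->
    erlang:error(invalid_utf8).
```

The same pattern should work with `/utf16` and `/utf32` for the other encodings.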

> They are slow. If you work with advanced Unicode algorithms
> (normalization, collation, splitting) and i18n (locale-dependent
> algorithms), you also need a global store for Unicode character data and CLDR data.
>
> The second way is to use NIFs and ICU. ICU is fast and well-tested. ICU
> allows multithreading; you only need to have a copy of the resource for each
> thread.
> But ICU uses UTF-16, which is not nicely formatted in the Erlang shell.
> Also, the code of a NIF must be very simple and well-tested.
>
> ICU also has an API for processing dates and formatting messages in a
> third format (the first and second being the printf format and Erlang's
> io:format). But it is closer to a gettext application.
>
> There are a few ports of ICU for Erlang. Basho has icu4e (a NIF for basic
> functions, no locales). There is also the Starling driver. And I am
> writing my own implementation of NIFs with locales.
>
> --
> Best regards,
> Uvarov Michael

I must admit I'm not knowledgeable enough about the rest of this post, but I find it instructive. You bring up the point of locales, which is also pretty interesting. What's a smart way to handle locales? Should they be VM-specific or process-specific?

--
Fred Hébert
http://www.erlang-solutions.com

_______________________________________________
erlang-questions mailing list
erlang-questions@erlang.org
http://erlang.org/mailman/listinfo/erlang-questions

Post received from mailinglist
steved
Posted: Tue Sep 27, 2011 6:54 pm
User Joined: 29 Apr 2008 Posts: 78
Have said it before, will say it again. "Strings" are a really bad way
of thinking about text. Not just in Erlang but in any language/
platform. This is because the encoding is almost universally handled
implicitly (but differently). As long as that implicit data remains
consistent the problem with not being able to derive the character set
encoding remains largely hidden - and so we sweep it under the carpet
and bemoan the inconsistencies resulting as being the result of using
a platform or language. But obviously, it isn't, as the issue is
pandemic and deeply ingrained.

/s
Guest
Posted: Wed Sep 28, 2011 9:14 pm
On 09/27/2011 05:37 PM, Fr
Guest
Posted: Wed Sep 28, 2011 10:52 pm
On 29/09/2011, at 10:14 AM, Richard Carlsson wrote:
> - The "good old length and comparison functions" are not broken, they just answer much simpler questions than what you're asking. length(S) tells you how many code points are in string S, no more, no less. Not glyphs, not graphemes, not abstract characters. Code points.

I should point out that the question "how many characters are there" is locale-dependent.

My mother's father, looking at the place name "LJubljana" would have seen 7 letters.
I see 9. (There are in fact 7 Unicode code points when the name is written with the
digraph characters U+01C7 and U+01C9. Who said one code point couldn't count as more
than one letter?) Looking at my father's middle name, "Æneas", I see
5 letters. (Unicode agrees with me.) Other people see 6.

This means that there is no such thing as a "unicode" function
grapheme_length :: String → Integer
but only a function
grapheme_length :: String × Locale → Integer

This is only the beginning of the problems!
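The two signatures above can be sketched in Erlang as a letter count that takes the locale as an argument. Everything here is hypothetical: the module name, the locale atoms, and the tiny hand-written digraph table standing in for real locale data such as CLDR. It also leans on `string:lowercase/1`, which only appeared in OTP releases (20 onward) later than this thread.

```erlang
-module(letter_count).
-export([letters/2]).

%% Hypothetical table of digraphs that a locale counts as ONE letter.
%% Real data would come from a locale database, not a hard-coded list.
digraphs(hr) -> ["lj", "nj"];
digraphs(_Other) -> [].

%% letters(String, Locale) -> number of letters as seen in that locale.
letters(String, Locale) ->
    count(string:lowercase(String), digraphs(Locale), 0).

count([], _Digraphs, N) ->
    N;
count(Str, Digraphs, N) ->
    case [D || D <- Digraphs, lists:prefix(D, Str)] of
        [Hit | _] ->
            %% A digraph starts here: skip it whole, count it once.
            count(lists:nthtail(length(Hit), Str), Digraphs, N + 1);
        [] ->
            %% Ordinary case: one code point, one letter.
            count(tl(Str), Digraphs, N + 1)
    end.
```

With this table, `letter_count:letters("LJubljana", hr)` gives 7 while any other locale atom gives 9, which is the disagreement described above. A table like this is far too crude for real use; it only shows why the locale has to be a parameter.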

> Similar for comparisons.

And again, similar for comparisons.


Guest
Posted: Thu Sep 29, 2011 11:20 am
On 2011-09-28, at 17:14 PM, Richard Carlsson wrote:

> On 09/27/2011 05:37 PM, Fr
Guest
Posted: Thu Sep 29, 2011 11:36 am
On 2011-09-28, at 23:14 , Richard Carlsson wrote:
>
> - The "good old length and comparison functions" are not broken, they just answer much simpler questions than what you're asking. length(S) tells you how many code points are in string S, no more, no less. Not glyphs, not graphemes, not abstract characters. Code points. Similar for comparisons. And for some applications, this is all you need. It's only when you want to apply "human" ideas of order and visual appearance that you need to use special library functions
I'm not sure I agree about that: let's imagine you send a name to a third-party tool, and that tool happens to have very precise ideas about normalization (e.g. it's an OS X API and it *will* manipulate only NFD strings). You send an NFC UTF-8 bytestring, you get back an NFD UTF-8 bytestring, and you decode it to Unicode code points. The two Unicode sequences are canonically equivalent, but not equal. This has little to do with "human ideas of order and visual appearance", now does it?
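The NFC/NFD round trip described above can be reproduced with a single accented letter. This is a sketch only: it relies on the normalization functions in the `unicode` module, which were added to OTP in releases (20 onward) later than this thread, and the module and function names `norm_demo`, `equivalent/2` and `demo/0` are invented for the example.

```erlang
-module(norm_demo).
-export([demo/0, equivalent/2]).

%% Compare two strings for canonical equivalence by normalizing
%% both sides to the same form (NFC here) before comparing.
equivalent(A, B) ->
    unicode:characters_to_nfc_list(A) =:= unicode:characters_to_nfc_list(B).

demo() ->
    Nfc = [16#E9],         %% precomposed: U+00E9 (e with acute)
    Nfd = [$e, 16#301],    %% decomposed: e + U+0301 (combining acute)

    %% Plain term comparison sees two different lists.
    false = (Nfc =:= Nfd),

    %% Each form converts to the other.
    Nfd = unicode:characters_to_nfd_list(Nfc),
    Nfc = unicode:characters_to_nfc_list(Nfd),

    %% After normalization they compare equal.
    true = equivalent(Nfc, Nfd),
    ok.
```

Normalizing at the boundary, right after decoding what the third-party tool returned, is one way to keep `=:=` and pattern matching meaningful in the rest of the program.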

> - and if you do this, you should _know_ that this is what you're doing; not hope that a primitive function like length(S) will guess what kind of information you want it to compute.
I really don't agree. A good API should make doing the common and right thing *easy* (and fool-proof), and the uncommon (and usually wrong) thing harder. Most string APIs do the exact opposite regarding Unicode, and that's bonkers. How often do you need the length in code points of a Unicode sequence? The length of a UTF-encoded binary stream, yes; the number of grapheme clusters (for elision or length cutoff of some character field), yes; but the number of code points? I don't think I've ever had this need.
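The three notions of length in this paragraph (bytes, code points, grapheme clusters) can be put side by side as follows. This sketch assumes the grapheme-aware `string:length/1` and `string:slice/3`, which were added to OTP in releases (20 onward) later than this thread and use the default, locale-independent cluster rules. The module name `text_len` and its two functions are hypothetical.

```erlang
-module(text_len).
-export([lengths/1, truncate/2]).

%% Report the three different "lengths" of the same text.
lengths(Chars) ->
    Bin = unicode:characters_to_binary(Chars),
    #{bytes       => byte_size(Bin),                           %% UTF-8 storage size
      code_points => length(unicode:characters_to_list(Bin)),  %% what length/1 counts
      graphemes   => string:length(Chars)}.                    %% user-perceived characters

%% Cut the text to at most Max grapheme clusters, e.g. for elision
%% or a length-limited field, without splitting a base character
%% from its combining marks.
truncate(Chars, Max) ->
    string:slice(Chars, 0, Max).
```

For the input `[$e, 16#301, $a]` (a decomposed e with acute followed by "a"), `lengths/1` reports 4 bytes, 3 code points and 2 grapheme clusters, and `truncate/2` with a limit of 1 keeps the accent attached to its base letter.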

> - There's nothing strange about having to use ~ts instead of ~s in format strings: similar changes have to be made in C code to handle wide characters and multibyte encodings. Backwards compatibility with the existing codebase is simply a necessary thing. Yes, you have to update your source code if you want to make it work on Unicode.
On the other hand, Erlang is not C: it's not like I have to do pointer arithmetic or consider collection ownership to know when to manually release my arrays.
