Erlang/OTP Forums

Erlang: length of UTF-8 string

vinnitu
Posted: Fri Jun 05, 2009 2:48 pm
User Joined: 31 May 2007 Posts: 14
Hi.

What is the right way to determine the length of a UTF-8 string?

If I use length(), it returns 4 for two Russian letters but 2 for two English letters.

How can I get the same length in both cases?

Thanks

seanmc
Posted: Mon Jun 08, 2009 8:22 am
User Joined: 03 Aug 2007 Posts: 10
Hi Vinnitu,

I can't find any examples here to verify this but try:

string:len("example russian characters").

//Sean.

alexarnon
Posted: Wed Jun 24, 2009 8:14 am
User Joined: 26 Jan 2008 Posts: 14
Try: length(xmerl_ucs:from_utf8("example russian here")).
Using string:len(...) will simply return the original string/list's length.
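
To make the difference concrete, here is a minimal sketch, assuming the string arrived as a list of raw UTF-8 bytes (which is what the original question describes). The module name bytes_vs_codepoints is made up for illustration, and the bytes are written out numerically so the result does not depend on the source file encoding.

```erlang
-module(bytes_vs_codepoints).
-export([demo/0]).

%% Hypothetical demo module, not part of OTP.
demo() ->
    %% Two Cyrillic letters (U+043F, U+0440) as a list of raw UTF-8 bytes.
    Bytes = [208, 191, 209, 128],
    %% Decode the bytes into a list of Unicode code points.
    CodePoints = xmerl_ucs:from_utf8(Bytes),
    %% length/1 on the byte list counts bytes; on the decoded list it
    %% counts code points.
    {length(Bytes), CodePoints, length(CodePoints)}.
```

Calling bytes_vs_codepoints:demo() should give {4, [1087,1088], 2}: four bytes, two code points.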

rvirding
Posted: Thu Jun 25, 2009 11:47 pm
User Joined: 30 Aug 2006 Posts: 452 Location: Stockholm, Sweden
What do you mean when you say the "length of a utf-8 string"? Do you mean the number of code points, or how many bytes it takes in the current encoding? Or something else?

There is no built-in way to do this safely; it very much depends on how you store your string. If it is a list, then it is probably the length of the list you want, as it is recommended that you use code points in lists. If it is a binary, then the size of the binary gives the number of bytes in the encoding used.
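
As a small illustration of the two storage choices, the sketch below keeps the same two code points once as a list and once as a UTF-8 binary. The module name len_kinds is hypothetical. The /utf8 bit syntax type is used so the example does not depend on the source file encoding.

```erlang
-module(len_kinds).
-export([demo/0]).

%% Hypothetical demo module, not part of OTP.
demo() ->
    %% Two Cyrillic code points (U+043F, U+0440) stored as a list.
    List = [1087, 1088],
    %% The same two code points stored as a UTF-8 encoded binary.
    Bin = <<1087/utf8, 1088/utf8>>,
    %% List length counts code points; byte_size/1 counts encoded bytes.
    {length(List), byte_size(Bin)}.
```

Here len_kinds:demo() should return {2, 4}: two code points in the list, four bytes in the binary.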

Allan
Posted: Mon Jun 29, 2009 4:37 pm
User Joined: 29 Jun 2009 Posts: 30
vinnitu wrote:
What is the right way to determine the length of a UTF-8 string?

Since 5.6/OTP R12B, Erlang has had some Unicode support. A standard string in Erlang is either a list of Unicode code points or a binary containing UTF-8 encoded code points.

So I guess that you have a UTF-8 binary and want to know its length in code points / characters.
The easiest way to get this is to use length(unicode:characters_to_list(Utf8_binary)).
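
One caveat: unicode:characters_to_list does not always return a list. For invalid or truncated input it returns an {error, ...} or {incomplete, ...} tuple, and length/1 would then fail with badarg. A minimal sketch of a wrapper that handles this, assuming the input is a binary that is supposed to be UTF-8 (the module name utf8_len and the invalid_utf8 atom are made up for illustration):

```erlang
-module(utf8_len).
-export([codepoints/1]).

%% Hypothetical helper, not part of OTP.
%% Returns {ok, N} when Bin is valid UTF-8, where N is the number of
%% code points, and {error, invalid_utf8} otherwise.
codepoints(Bin) when is_binary(Bin) ->
    case unicode:characters_to_list(Bin, utf8) of
        Chars when is_list(Chars) ->
            {ok, length(Chars)};
        {error, _Decoded, _Rest} ->
            %% A byte sequence that is not valid UTF-8 was found.
            {error, invalid_utf8};
        {incomplete, _Decoded, _Rest} ->
            %% The binary ends in the middle of a multi-byte character.
            {error, invalid_utf8}
    end.
```

For example, utf8_len:codepoints(<<208,191,209,128>>) should give {ok, 2}, while a truncated input such as <<208>> gives {error, invalid_utf8}.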

baryluk
Posted: Tue Aug 18, 2009 10:06 am
User Joined: 05 Aug 2009 Posts: 48
This depends on what you mean by length. Unicode allows many character modifiers, which can come before or after a code point, so 20 bytes of UTF-8 can be a single character. If you want to know the width on the screen, I think there are some functions for that.
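
For readers on a newer release: OTP 20 and later provide string:length/1, which counts grapheme clusters (user-perceived characters) rather than code points. It postdates this thread. A minimal sketch of the three different "lengths" for the letter e followed by a combining acute accent (the module name grapheme_len is hypothetical):

```erlang
-module(grapheme_len).
-export([demo/0]).

%% Hypothetical demo module; needs OTP 20 or later for string:length/1.
demo() ->
    %% "e" (U+0065) followed by a combining acute accent (U+0301):
    %% two code points that display as one character.
    Chars = [101, 769],
    Bin = unicode:characters_to_binary(Chars),
    %% Bytes in UTF-8, code points, and grapheme clusters respectively.
    {byte_size(Bin), length(Chars), string:length(Chars)}.
```

Here grapheme_len:demo() should return {3, 2, 1}: three bytes, two code points, one grapheme cluster. This does not give the display width on screen, which is a separate question.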
All times are GMT