Erlang/OTP Forums

Yaws mailing list: Unicode and json:encode

Guest
Posted: Wed May 09, 2007 9:10 am
json:encode assumes that any strings containing characters > 255 are
lists of Unicode scalar values, and encodes them as UTF-8. This isn't
enough, though, because Unicode scalar values from 128 to 255 also
need to be encoded to produce an equivalent UTF-8 string. For example,
LATIN SMALL LETTER A WITH DIAERESIS (U+00E4) is encoded in UTF-8 as
<<16#C3, 16#A4>>. It's better to leave the strings as-is and allow the
user to encode the result if necessary. json:encode will then work
correctly with both strings of Unicode scalar values and strings of
UTF-8 code units (bytes).

I've attached a patch fixing this problem (I also removed comment
parsing).
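
A minimal sketch of the point above, assuming a modern OTP where the standard `unicode` module is available (it postdates this thread). The module name `utf8_demo` and both function names are made up for illustration and are not part of Yaws.

```erlang
-module(utf8_demo).
-export([to_utf8/1, raw_bytes/1]).

%% Encode a list of Unicode scalar values as a UTF-8 binary.
to_utf8(CodePoints) when is_list(CodePoints) ->
    unicode:characters_to_binary(CodePoints, unicode, utf8).

%% Emit each value in 0..255 as a single raw byte, with no encoding step.
%% This is what a string looks like on the wire if it is left as-is.
raw_bytes(CodePoints) when is_list(CodePoints) ->
    list_to_binary(CodePoints).
```

For U+00E4, `utf8_demo:to_utf8([16#E4])` gives `<<16#C3, 16#A4>>`, while `utf8_demo:raw_bytes([16#E4])` gives the single byte `<<16#E4>>`. For values below 128 the two functions agree, which is why the problem only shows up from 128 upwards.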



Post received from mailing list
Guest
Posted: Wed May 09, 2007 11:32 pm
Brian Templeton wrote:
> json:encode assumes that any strings containing characters > 255 are
> lists of Unicode scalar values, and encodes them as UTF-8. This isn't

Patch looks good. Does anyone on the list have an opinion here?


/klacke

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
_______________________________________________
Erlyaws-list mailing list
Erlyaws-list@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/erlyaws-list
Gaspar
Posted: Thu May 10, 2007 7:50 pm
Hi there!

Claes Wikstrom wrote:
> Brian Templeton wrote:
>> json:encode assumes that any strings containing characters > 255 are
>> lists of Unicode scalar values, and encodes them as UTF-8. This isn't
>
> Patch looks good. Does anyone on the list have an opinion here?
>
>
> /klacke

We implemented Unicode encoding in the json module because it is sometimes
too difficult to walk the whole data structure going out in a request, or
going to the client in the server's answer, and convert Unicode lists to
UTF-8 or some other encoding.

This change will at least break some things in our application, as we expect
that strings will be converted to Unicode automagically.
Anyway, maybe it's better to provide more consistent support for
outgoing data encoding? That is, allow the programmer to pass some options to
the jsonrpc:call and yaws_jsonrpc:handler functions, which will control
automatic encoding of data?


About the patch: it seems to break some logic too.
In this part, only the line with Unicode detection should be removed.
If an illegal list is passed into the encode function, it should report
an error there and not fail later in encode_string with a strange message.

encode(L) when is_list(L) ->
    case is_string(L) of
        yes -> encode_string(L);
        unicode -> encode_string(xmerl_ucs:to_utf8(L));
        no -> exit({json_encode, {not_string, L}})
    end;


/Gaspar
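
The option-passing idea above could look roughly like the following. This is only a sketch under stated assumptions: the module `json_string_opts`, the function `prepare_string/2`, and the `{encoding, utf8 | none}` option are hypothetical names invented here, and nothing like them is claimed to exist in jsonrpc or yaws_jsonrpc. It uses the `unicode` module from current OTP and does no JSON escaping; it only decides which bytes a string turns into.

```erlang
-module(json_string_opts).
-export([prepare_string/2]).

%% Hypothetical helper: choose the outgoing representation of a string
%% from a caller-supplied option list.
%%   {encoding, utf8} -> return the UTF-8 bytes as a list
%%   {encoding, none} -> return the list untouched (the default)
prepare_string(Chars, Opts) when is_list(Chars), is_list(Opts) ->
    case proplists:get_value(encoding, Opts, none) of
        utf8 ->
            case unicode:characters_to_binary(Chars, unicode, utf8) of
                Bin when is_binary(Bin) ->
                    binary_to_list(Bin);
                _Error ->
                    %% Same error shape as in the encode/1 clause above.
                    exit({json_encode, {not_string, Chars}})
            end;
        none ->
            Chars
    end.
```

With this shape, an application that relies on automatic conversion would pass `[{encoding, utf8}]`, and one that already holds UTF-8 bytes would pass nothing.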




--
Gaspar Chilingarov

System Administrator,
Network security consulting

t +37493 419763 (mob)
i 63174784
e nm@web.am

Guest
Posted: Fri May 11, 2007 6:27 pm
Gaspar Chilingarov <nm@web.am> writes:

> Hi there!
>
> Claes Wikstrom wrote:
>> Brian Templeton wrote:
>>> json:encode assumes that any strings containing characters > 255 are
>>> lists of Unicode scalar values, and encodes them as UTF-8. This isn't
>>
>> Patch looks good. Does anyone on the list have an opinion here?
>>
>>
>> /klacke
>
[...]

> About the patch: it seems to break some logic too.
> In this part, only the line with Unicode detection should be removed.
> If an illegal list is passed into the encode function, it should report
> an error there and not fail later in encode_string with a strange message.
>
> encode(L) when is_list(L) ->
>     case is_string(L) of
>         yes -> encode_string(L);
>         unicode -> encode_string(xmerl_ucs:to_utf8(L));
>         no -> exit({json_encode, {not_string, L}})
>     end;
>

I don't see why L should be checked in both is_string and
encode_string, so I've modified encode_string to exit with the same
not_string error previously used in encode, since you prefer that one.

I've also discovered another Unicode-related bug: the json module only
considers characters < 16#FFFF to be Unicode characters (< 65000 in
the is_string function). I can only assume that json.erl hasn't been
updated since before 2001, when Unicode 3.1 was published.
0..16#FFFF is the Basic Multilingual Plane; all values from 0 to
16#10FFFF are valid Unicode code points. Patch is attached.
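
As an illustration of the range fix described above, here is a sketch of a classifier that accepts the full code point range. It is not the is_string function from json.erl and not the attached patch; the module name `json_is_string` is made up, and the `yes` / `unicode` / `no` return values are assumed from the `case` expression quoted earlier in the thread.

```erlang
-module(json_is_string).
-export([is_string/1]).

%% Classify a term:
%%   yes     - a proper list of integers, all in 0..255
%%   unicode - a proper list of integers in 0..16#10FFFF, at least one > 255
%%   no      - anything else
%% Note: surrogates (16#D800..16#DFFF) are code points but not scalar
%% values; this sketch does not reject them.
is_string(L) when is_list(L) -> is_string(L, yes);
is_string(_) -> no.

is_string([], Acc) ->
    Acc;
is_string([C | T], Acc) when is_integer(C), C >= 0, C =< 255 ->
    is_string(T, Acc);
is_string([C | T], _Acc) when is_integer(C), C > 255, C =< 16#10FFFF ->
    is_string(T, unicode);
is_string(_, _) ->
    no.
```

With an upper bound of 65000, a character outside the Basic Multilingual Plane such as U+1D11E would be classified as `no`; with the bound at 16#10FFFF it is classified as `unicode`.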



Guest
Posted: Fri May 11, 2007 10:18 pm
Gaspar Chilingarov wrote:
> This change will at least break some things in our application, as we expect
> that strings will be converted to Unicode automagically.
> Anyway, maybe it's better to provide more consistent support for
> outgoing data encoding? That is, allow the programmer to pass some options to
> the jsonrpc:call and yaws_jsonrpc:handler functions, which will control
> automatic encoding of data?
>
> About the patch: it seems to break some logic too.
> In this part, only the line with Unicode detection should be removed.
> If an illegal list is passed into the encode function, it should report
> an error there and not fail later in encode_string with a strange message.
>
> encode(L) when is_list(L) ->
>     case is_string(L) of
>         yes -> encode_string(L);
>         unicode -> encode_string(xmerl_ucs:to_utf8(L));
>         no -> exit({json_encode, {not_string, L}})
>     end;

If the output is supposed to be encoded as UTF-8, then it should be:

encode(L) when is_list(L) ->
    case is_string(L) of
        no -> exit({json_encode, {not_string, L}});
        _ -> encode_string(xmerl_ucs:to_utf8(L))
    end;

Encoding strings containing only U+0000 to U+00FF as ISO-8859-1, and other
strings as UTF-8, is definitely wrong.
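
To make the objection concrete, the sketch below imitates the mixed behaviour and checks whether the result is well-formed UTF-8. It assumes the `unicode` module from current OTP; the module `mixed_encoding_demo` and its functions are illustrative names, not code from json.erl.

```erlang
-module(mixed_encoding_demo).
-export([mixed/1, always_utf8/1, is_valid_utf8/1]).

%% Imitates the criticised behaviour: raw bytes when every value is
%% =< 255, UTF-8 only when some value is larger.
mixed(Chars) when is_list(Chars) ->
    case lists:all(fun(C) -> C =< 255 end, Chars) of
        true  -> list_to_binary(Chars);
        false -> unicode:characters_to_binary(Chars, unicode, utf8)
    end.

%% The consistent alternative: always encode as UTF-8.
always_utf8(Chars) when is_list(Chars) ->
    unicode:characters_to_binary(Chars, unicode, utf8).

%% True when Bin decodes cleanly as UTF-8. On failure,
%% unicode:characters_to_list/2 returns an error or incomplete tuple
%% instead of a list.
is_valid_utf8(Bin) when is_binary(Bin) ->
    is_list(unicode:characters_to_list(Bin, utf8)).
```

A receiver that expects UTF-8 can decode `mixed([16#E4, 16#20AC])` but not `mixed([16#E4, $b])`, so the encoding of U+00E4 depends on what else happens to be in the string. `always_utf8/1` produces decodable output in both cases.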

--
David Hopwood <david.hopwood@industrial-designers.co.uk>

