Re: New draft (Was: I-D ACTION:draft-klensin-unicode-escapes-00.txt

"Tim Bray" <tbray@textuality.com> Wed, 31 January 2007 00:45 UTC

Message-ID: <517bf110701301645u1a0e5658v3f8beeca2a1136ce@mail.gmail.com>
Date: Tue, 30 Jan 2007 16:45:52 -0800
From: Tim Bray <tbray@textuality.com>
To: John C Klensin <john-ietf@jck.com>
Subject: Re: New draft (Was: I-D ACTION:draft-klensin-unicode-escapes-00.txt
In-Reply-To: <875A124D75A8B481E176CF06@p3.JCK.COM>
MIME-Version: 1.0
Content-Type: text/plain; charset="ISO-8859-1"; format="flowed"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
References: <875A124D75A8B481E176CF06@p3.JCK.COM>
Cc: discuss@apps.ietf.org

Pardon me for being late to this party; I was on vacation in
Australia.  I think this is a positive contribution.

First, a detail point: in section 5.4, it's probably relevant that the
Java Language Specification
(http://java.sun.com/docs/books/jls/third_edition/html/lexical.html#95413p)
makes clear that a Java character literal or variable represents, not a
Unicode character, but a UTF-16 code unit.  I guess the conclusion
is that it may be OK in certain circumstances to use \uNNNN, but it's
not OK to explain that by calling out to Java.
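
To make the Java point concrete, here is a minimal sketch (the class
and method names are mine, purely for illustration) showing that a
character outside the Basic Multilingual Plane takes two Java chars,
so a single four-digit escape cannot name it:

```java
class CodeUnitDemo {
    // Number of UTF-16 code units (Java chars) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF is outside the BMP, so Java
        // source has to spell it as a surrogate pair of two escapes.
        String clef = "\uD834\uDD1E";

        System.out.println(codeUnits(clef));                            // 2
        System.out.println(codePoints(clef));                           // 1
        System.out.println(Integer.toHexString(clef.codePointAt(0)));   // 1d11e
        System.out.println(Character.isHighSurrogate(clef.charAt(0)));  // true
    }
}
```

One character, two chars: that is why the Java escape is better
described as naming a UTF-16 code unit than a Unicode character.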

Second: I think the discussion shows that the syntax problems
around representing Unicode characters in ASCII and other
Unicode-oblivious texts are tricky; witness the issues with delimiters
and ABNF/case.  This is further evidence, were any needed, that IETF
Working Groups SHOULD NOT specify Internet protocols which may be used
to transfer text but are not capable of representing the Unicode
character set.  They can avoid that by specifying either hard-wired
UTF-8 or XML, both of which have cracked this nut.

So here's a proposed recasting of the second paragraph of 1.1:

  When one moves to Unicode [Unicode] [ISO10646], where characters
   occupy two or more octets and may be coded in several different
   forms, the question of escapes becomes even more complicated.  In
   particular, we have seen fairly extensive use of both hexadecimal
   representations of the UTF-8 encoding [RFC3629] of a character and
   variations on the U+NNNN[N[N]] notation commonly used in conjunction
   with the Unicode Standard.

   New protocols that are required to carry textual content SHOULD be
   designed in such a way that the full repertoire of Unicode
   characters may be represented in that text; UTF-8 and XML are both
   good options.

   This document proposes that existing protocols being
   internationalized SHOULD use some contextually-appropriate
   variation of the U+NNNN[N[N]] notation unless other considerations
   outweigh those described here.
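
For readers who want to see the two escape styles named in that
paragraph side by side, here is a rough sketch.  The class and method
names are hypothetical, and the space-separated octet layout is just
one possible rendering, not something the draft specifies:

```java
import java.nio.charset.StandardCharsets;

class EscapeForms {
    // U+NNNN[N[N]] style: the code point itself, in four to six
    // uppercase hex digits.
    static String toUPlus(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    // Hex-of-UTF-8 style: each octet of the UTF-8 encoding as two
    // uppercase hex digits, separated by spaces (an assumed layout).
    static String toUtf8Hex(int codePoint) {
        byte[] octets = new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : octets) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toUPlus(0xE9));       // U+00E9
        System.out.println(toUtf8Hex(0xE9));     // C3 A9
        System.out.println(toUPlus(0x1D11E));    // U+1D11E
        System.out.println(toUtf8Hex(0x1D11E));  // F0 9D 84 9E
    }
}
```

The first form names the character directly; the second names the
octets of one particular encoding of it, so a reader has to decode
UTF-8 by hand to learn which character is meant.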