rfc3987bis and RFC 6365

Discussion:

Peter Saint-Andre

2012-06-07 18:42:20 UTC

<hat type='individual'/>

At IETF 84, we discussed the desirability of aligning the terminology in
3987bis with RFC 6365 ("Terminology Used in Internationalization in the
IETF"). This is ticket #85 in the tracker:

http://trac.tools.ietf.org/wg/iri/trac/ticket/85

I've completed a review of both documents and have a few suggestions...

1. In Section 1.3, cite RFC 6365 and specify that terms are to be
understood as defined in that document unless otherwise specified (in
fact, now that we have RFC 6365 it's not clear why we're citing RFC
2130, RFC 2277, or ISO 10646). I suggest:

OLD
The following definitions are used in this document; they follow the
terms in [RFC2130], [RFC2277], and [ISO10646].

NEW
Various terms used in this document are defined in [RFC6365] and
[RFC3986]. In addition, we define the following terms for use in
this document.

2. Don't define anew in rfc3987bis terms that are defined in RFC 6365.
That would mean removing the following definitions from Section 1.3:

- character
- character repertoire
- character encoding (use "character encoding scheme" or "character
encoding form" instead)
- charset

3. Do we really need to define "octet", "sequence of characters", and
"sequence of octets"?

4. Strangely, RFC 6365 does not define "UCS", so I suppose it's OK to
define that here.

5.

Peter

--
Peter Saint-Andre
https://stpeter.im/

Peter Saint-Andre

2012-06-07 19:47:32 UTC

Sorry, I sent this email before completing the message. Addendum below.

5. Various sections use the term "character encoding" when, per RFC
6365, "character encoding scheme" seems to be meant (examples include
Sections 4, 4.1, 5.2, 5.3, 5.4, 7.1, 7.3, 7.4, and 7.7).

The other uses of i18n terminology throughout the document appear to be
consistent with RFC 6365, but it would be helpful if other folks check
as well.

Peter

--
Peter Saint-Andre
https://stpeter.im/

John C Klensin

2012-06-08 07:29:47 UTC

--On Thursday, June 07, 2012 13:47 -0600 Peter Saint-Andre

Post by Peter Saint-Andre
4. Strangely, RFC 6365 does not define "UCS", so I suppose
it's OK to define that here.

I can't speak for Paul's reasoning because I don't think we
discussed it explicitly, but omitting it from 6365 was
deliberate on my part. There were two reasons. The first is
that "universal character set" is itself ambiguous as to whether
it refers to the Unicode/10646 code set or some other attempt.
One can define that problem away if one assumes that the readers
will carefully refer to the definitions even when they think
they know what a term means (my experience indicates that rarely
happens but YMMV). Second and far more important, I think we do
ourselves and our audience no favors by using essentially
synonymous terms interchangeably to refer to the same thing. It
does not help with understanding and may cause confusion. The
practice at the time RFC 2277 was written was to call that thing
"ISO 10646" (not correct when 2277 was written, but see below).
Once we discovered (more or less around the time RFCs 3454 and
3490 were coming together) that we had clear requirements for
property tables (and at the time, encodings) that were not part
of ISO/IEC 10646 itself, the practice shifted toward calling
that thing "Unicode". We've gotten most of the community used
to seeing those two terms as mostly interchangeable and being
clear about the distinction when it is important. Introducing
"UCS" to the mix adds no value and risks reopening the mini-flap
about our combining "character repertoire", "code set" (or
"CCS"), and "encoding" into "charset" in RFC 2277 (and, earlier,
RFC 1341 and its successors).

(Massive nit-pick follows, but these things actually are
important if one wants a clear and useful definition)

I don't believe 3987bis should define "UCS"; I believe it should
get rid of the term entirely even if that means rewriting some
sentences rather than just performing string substitution. As
an example of the desirability of doing this, please read the
first paragraph of Section 2.1 [draft-ietf-iri-3987bis-11].
First, despite the earlier definition and the use of "Universal
Character Set" in the Abstract [1], it notes "Universal Character
Set" in parentheses, and then cites [ISO10646]. The intervening
comma implies that those are two separate definitions, adding to
the potential confusion. Second, this definition (and the
other definitions, see [1] below) appears to pretend that
Unicode and ISO/IEC 10646 are the same, which they are not. RFC
6365 was extremely careful about the relationship, which is
another reason to use it rather than defining new terms.

There is an incidental problem about what "primarily" means in
the key sentence. There doesn't seem to be any nearby
explanation. If there isn't one, it should be dropped.

Recommendation: In Section 2.1,

Old:
The IRI syntax extends the URI syntax in [RFC3986] by
extending the class of unreserved characters, primarily
by adding the characters of the UCS (Universal Character
Set, [ISO10646]) beyond U+007F, subject...

New:
The IRI syntax extends the URI syntax in [RFC3986] by
extending the class of unreserved characters by adding
the characters (code points) of ISO/IEC 10646
[ISO10646] outside the ASCII repertoire, subject...

No "primarily", no expecting the user to either know all this
stuff (in which case a large chunk of this document would be
unnecessary) or to go running off to figure out what "U+007F"
means in this context, etc. If one doesn't find
citation-as-object offensive (the editors of this document
apparently do not) and also notes that "URI syntax" is used
without a citation in Section 1.2 (perhaps there should be a
citation there, but, if it is there it is not needed here and if
it is not needed there, then it isn't needed here either) the
above can be further shortened to

New (short version):
The IRI syntax extends the URI syntax by extending the
class of unreserved characters by adding the characters
(code points) of [ISO10646] outside the ASCII
repertoire, subject...
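
As an illustration only, the "outside the ASCII repertoire" condition in
the proposed text can be read as a simple code-point test. The sketch
below is Python; the function names are hypothetical, and the further
restrictions that the draft applies (the elided "subject..." clause) are
deliberately not modeled.

```python
from typing import List


def is_outside_ascii(ch: str) -> bool:
    """Return True if the single character ch has a code point above U+007F."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    return ord(ch) > 0x7F


def non_ascii_code_points(text: str) -> List[str]:
    """List the code points in text that lie outside ASCII, in U+XXXX form."""
    return [f"U+{ord(c):04X}" for c in text if is_outside_ascii(c)]


# Hypothetical sample input: only the u-umlaut is outside ASCII.
sample = "D\u00fcrst"
found = non_ascii_code_points(sample)
```

Here `found` holds one entry, for U+00FC. Whether such a code point is
actually allowed in an IRI depends on the rest of the grammar, which
this test does not attempt to capture.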

In spot-checking the document further, I realized that "plain
text" is actually not defined anywhere. If it is going to be
used at all, I think it deserves a definition or citation.
Peter's comment that most of its uses should actually be
"running text" still applies.

------------------

[1] Adding to the mess, the usage of "Universal Character Set"
in the Abstract is followed by a citation of Unicode _and_ ISO
10646 (note that the latter is simply wrong, even though in
popular use in the IETF), one that tries to avoid the RFC
Editor's "no citations in Abstracts" rule by changing the
brackets to parens (something that should never fly). But it
effectively leaves us with three nearly-identical definitions:
(i) "Universal Character Set" in the Abstract, defined by
reference as "(Unicode/ISO 10646)", (ii) the actual definition
in Section 1.3, which is of UCS, not "Universal Character Set",
defined as (behold!) "Universal Character Set" which is then
defined as "ISO/IEC 10646" (correctly, not "ISO 10646") "and the
Unicode Standard", and (iii) the new inline definition in
Section 2.1.

best,
john

Peter Saint-Andre

2012-06-08 14:44:56 UTC

Post by John C Klensin
--On Thursday, June 07, 2012 13:47 -0600 Peter Saint-Andre

Post by Peter Saint-Andre
4. Strangely, RFC 6365 does not define "UCS", so I suppose
it's OK to define that here.

Yes, I was going to suggest that, but I wasn't sure what to propose in
its place (i.e., "ISO/IEC 10646" or "Unicode" -- the latter has the
benefit of being much more familiar to most people who would read this
specification).

Post by John C Klensin
I believe it should
get rid of the term entirely even if that means rewriting some
sentences rather than just performing string substitution. As
an example of the desirability of doing this, please read the
first paragraph of Section 2.1 [draft-ietf-iri-3987bis-11].
First, despite the earlier definition and the use of "Universal
Character Set" in the Abstract [1], it notes "Universal Character
Set" in parentheses, and then cites [ISO10646]. The intervening
comma implies that those are two separate definitions, adding to
the potential confusion. Second, this definition (and the
other definitions, see [1] below) appears to pretend that
Unicode and ISO/IEC 10646 are the same, which they are not. RFC
6365 was extremely careful about the relationship, which is
another reason to use it rather than defining new terms.
There is an incidental problem about what "primarily" means in
the key sentence. There doesn't seem to be any nearby
explanation. If there isn't one, it should be dropped.
Recommendation: In Section 2.1,
Old:
The IRI syntax extends the URI syntax in [RFC3986] by
extending the class of unreserved characters, primarily
by adding the characters of the UCS (Universal Character
Set, [ISO10646]) beyond U+007F, subject...
New:
The IRI syntax extends the URI syntax in [RFC3986] by
extending the class of unreserved characters by adding
the characters (code points) of ISO/IEC 10646
[ISO10646] outside the ASCII repertoire, subject...

Works for me.

Peter

--
Peter Saint-Andre
https://stpeter.im/

John C Klensin

2012-06-08 19:15:34 UTC

--On Friday, June 08, 2012 08:44 -0600 Peter Saint-Andre

Post by Peter Saint-Andre
...

Post by John C Klensin
(Massive nit-pick follows, but these things actually are
important if one wants a clear and useful definition)
I don't believe 3987bis should define "UCS";

Yes, I was going to suggest that, but I wasn't sure what to
propose in its place (i.e., "ISO/IEC 10646" or "Unicode" --
the latter has the benefit of being much more familiar to most
people who would read this specification).

As long as we are talking about either code points or
repertoires they are pretty much interchangeable (as I hope
everyone reading this knows). It seems to me it would be
rational to pick whatever term would be most familiar to the
audience, insert a parenthetical or other note on first use that
points to the other one (and maybe says "same for code points
and repertoire"), and then just move on and be consistent.

john

Martin J. Dürst

2012-10-20 07:39:48 UTC

Hello Peter, others,

Implemented in my editorial copy. Many thanks for the actual text proposal.

Post by Peter Saint-Andre
2. Don't define anew in rfc3987bis terms that are defined in RFC 6365.
- character
- character repertoire

Done.

Post by Peter Saint-Andre
- character encoding (use "character encoding scheme" or "character
encoding form" instead)
- charset

These two are not that simple. For background, please check
http://www.w3.org/TR/charmod/#sec-Digital.

Here is what we currently have for "character encoding":
A method of representing a sequence
of characters as a sequence of octets (maybe with variants). Also,
a method of (unambiguously) converting a sequence of octets into a
sequence of characters.

The problem with 'charset' as defined in RFC 6365 (and elsewhere) is
that it's purely one-way, from octets to characters. But there's the
other direction, too.
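
To make the two directions concrete, here is a minimal Python sketch.
It uses UTF-8 purely as an example of a character encoding; the sample
string and variable names are assumptions for illustration, not
anything taken from the draft.

```python
# A sequence of characters (five characters, one of them non-ASCII).
text = "D\u00fcrst"

# Direction 1: characters -> octets.
octets = text.encode("utf-8")

# Direction 2: octets -> characters (unambiguous for valid input).
round_trip = octets.decode("utf-8")

# Octets that are not valid in the chosen encoding are rejected rather
# than silently mapped to some character.
try:
    b"\xff".decode("utf-8")
    invalid_rejected = False
except UnicodeDecodeError:
    invalid_rejected = True
```

The definition of "character encoding" quoted above covers both
`encode` and `decode` here, whereas a purely octets-to-characters
definition would only describe the second step.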

The problem with "character encoding scheme" or "character encoding
form" is that they are much more specialized terms.

RFC 6365 has this to say after the definition of "charset":

Many protocol definitions use the term "character set" in their
descriptions. The terms "charset", or "character encoding scheme"
and "coded character set", are strongly preferred over the term
"character set" because "character set" has other definitions in
other contexts, particularly outside the IETF. When reading IETF
standards that use "character set" without defining the term, they
usually mean "a specific combination of one CCS with a CES",
particularly when they are talking about the "US-ASCII character
set".

Of course, per http://www.w3.org/MarkUp/html-spec/charset-harmful
and as above, we sure don't want to use "character set". And we indeed
want something to denote "a specific combination of one CCS with a CES"
(or in some cases actually a combination of more than one CCS...), so
neither "coded character set" (CCS) nor "character encoding scheme"
(CES) will do, despite the suggestions above. So we just ended up with
"character encoding", using a simple term for a very central concept,
also in line with http://www.w3.org/TR/charmod/.
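
A rough Python sketch of the "one CCS with a CES" combination follows.
It assumes Unicode as the coded character set and shows two different
encoding schemes applied to the same code point; mapping Python's
`ord` and `encode` onto the CCS and CES steps is my simplification for
illustration, not terminology from RFC 6365.

```python
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

# Coded character set step: character -> integer (code point).
code_point = ord(ch)

# Character encoding scheme step: code point -> octets.
# The same code point yields different octets under different schemes.
as_utf8 = ch.encode("utf-8")
as_utf16be = ch.encode("utf-16-be")

summary = {
    "code point": f"U+{code_point:04X}",
    "UTF-8 octets": as_utf8.hex(),
    "UTF-16BE octets": as_utf16be.hex(),
}
```

What the text calls a "character encoding" is the whole path from
character to octets (and back), which is why neither the CCS nor the
CES alone names it.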

As a result of this, we only use "charset" when it's used as a label,
with a narrowed definition: "The name of a parameter or attribute used
to identify a character encoding."
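
The narrowed sense of "charset" as a label can be sketched with
Python's codec registry, which resolves a label to an encoding. Note
the assumptions: the helper name is hypothetical, and Python's alias
table is its own, not the IANA charset registry.

```python
import codecs


def encoding_for_label(label: str) -> str:
    """Resolve a charset label to Python's canonical codec name.

    Raises LookupError if the label is not known to Python.
    """
    return codecs.lookup(label).name


# Different spellings of the label identify the same character encoding.
canonical = {encoding_for_label(lbl) for lbl in ("UTF-8", "utf8", "utf_8")}
```

Here the label is only a name; the encoding it identifies is the thing
that actually converts between characters and octets.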

I guess we could just drop the narrowing definition of "charset", but we
can't drop "character encoding".

Post by Peter Saint-Andre
3. Do we really need to define "octet", "sequence of characters", and
"sequence of octets"?

Good questions. RFC 6365 uses "octet" without defining it, so I guess we
can drop it. I think we can also drop "sequence of characters" and
"sequence of octets", but I'd like to get Larry's okay for these.

Post by Peter Saint-Andre
4. Strangely, RFC 6365 does not define "UCS", so I suppose it's OK to
define that here.

Following discussions later in this thread, I'm trying to get rid of
this. But it needs some more thought.

Regards, Martin.