Discussion:
Standardizing on IDNA 2003 in the URL Standard
Anne van Kesteren
2013-08-19 12:37:16 UTC
Permalink
http://lists.w3.org/Archives/Public/www-archive/2013Aug/0008.html
might be of interest to readers of these lists.


--
http://annevankesteren.nl/
Peter Saint-Andre
2013-08-19 16:35:44 UTC
Permalink
On 8/19/13 6:37 AM, Anne van Kesteren wrote:
> http://lists.w3.org/Archives/Public/www-archive/2013Aug/0008.html
> might be of interest to readers of these lists.

Hi Anne, thanks for the heads-up.

Given that IDNA 2003 is tied to Unicode 3.2 (via stringprep), I'm
curious to know more about what you mean by "IDNA 2003 ... without
restrictions to a particular Unicode version".

Do you have a preferred venue for discussion of this topic?

Peter

--
Peter Saint-Andre
https://stpeter.im/
Anne van Kesteren
2013-08-19 17:01:56 UTC
Permalink
On Mon, Aug 19, 2013 at 5:35 PM, Peter Saint-Andre <***@stpeter.im> wrote:
> Given that IDNA 2003 is tied to Unicode 3.2 (via stringprep), I'm
> curious to know more about what you mean by "IDNA 2003 ... without
> restrictions to a particular Unicode version".

As far as I can tell from implementations what it means is that the
NFKC normalization algorithm from Unicode is the one defined in the
latest edition of Unicode rather than that of Unicode 3.2. I don't
think the other tables from Stringprep have been modified, but I
haven't exhaustively tested that. I probably should.
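
To make that concrete, here is a quick Python sketch. CPython's
unicodedata happens to ship the Unicode 3.2 tables (for Stringprep's
benefit), and the code point is just an illustrative one:

    # NFKC under Unicode 3.2 vs. the latest Unicode the interpreter knows.
    # U+1F100 (DIGIT ZERO FULL STOP) was added in Unicode 5.2 with a
    # compatibility decomposition to "0.".
    import unicodedata

    s = "\U0001F100"
    # Unassigned in Unicode 3.2, so 3.2-era NFKC leaves it alone:
    print(unicodedata.ucd_3_2_0.normalize("NFKC", s) == s)  # True
    # A current Unicode database decomposes it:
    print(unicodedata.normalize("NFKC", s))  # '0.'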


> Do you have a preferred venue for discussion of this topic?

Not really. Wherever people pay attention I suppose :-)


--
http://annevankesteren.nl/
Mark Davis ☕
2013-08-19 17:31:53 UTC
Permalink
On Mon, Aug 19, 2013 at 7:01 PM, Anne van Kesteren <***@annevk.nl> wrote:

> As far as I can tell from implementations what it means is that the
> NFKC normalization algorithm from Unicode is the one defined in the
>

Rather than promoting different, arbitrary modifications of IDNA2003, I
would recommend instead using the TR46 specification, which provides a
migration path from IDNA2003 to IDNA2008. It is, with some small
exceptions, compatible with IDNA2003.

"To satisfy user expectations for mapping, and provide maximal
compatibility with IDNA2003, this document specifies a mapping for use with
IDNA2008. In addition, to transition more smoothly to IDNA2008, this
document provides a Unicode algorithm for a standardized processing that
allows conformant implementations to minimize the security and
interoperability problems caused by the differences between IDNA2003 and
IDNA2008. This Unicode IDNA Compatibility Processing is structured
according to IDNA2003 principles, but extends those principles to Unicode
5.2 and later. It also incorporates the repertoire extensions provided by
IDNA2008."

For more see http://unicode.org/reports/tr46/.
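
For a rough idea of what that processing looks like in code, here is a
sketch using the third-party Python "idna" package, which implements
IDNA2008 with optional UTS #46 preprocessing (the domain is illustrative):

    # The UTS #46 mapping (uts46=True) lowercases and normalizes before
    # IDNA2008 validation; strict IDNA2008 by itself never maps.
    import idna

    print(idna.encode("BÜCHER.example", uts46=True))
    # b'xn--bcher-kva.example'
    try:
        idna.encode("BÜCHER.example")  # no mapping: uppercase is DISALLOWED
    except idna.IDNAError as e:
        print("rejected:", e)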

Mark <https://plus.google.com/114199149796022210033>
— Il meglio è l'inimico del bene —
Shawn Steele
2013-08-19 20:32:00 UTC
Permalink
I concur. We use the IDNA2008 + TR46 behavior.

-Shawn

Anne van Kesteren
2013-08-20 12:32:23 UTC
Permalink
On Mon, Aug 19, 2013 at 6:31 PM, Mark Davis ☕ <***@macchiato.com> wrote:
> Rather than promoting different, arbitrary modifications of IDNA2003, I
> would recommend instead using the TR46 specification, which provides a
> migration path from IDNA2003 to IDNA2008. It is, with some small exceptions,
> compatible with IDNA2003.

Last I checked with implementers there was not much interest in that.
And to be clear, it's not different and arbitrary. The modifications
have been in place since IDNA2003 support landed in browsers, as
should have been clear to the original authors of IDNA2003 too. Nobody
is going to arbitrarily freeze their Unicode implementation.

(Aside: ToASCII in IDNA2003 applies to domain labels. Having it apply
to domain names in UTS #46 is somewhat confusing.)
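
To illustrate the distinction with CPython's built-in encodings.idna
module (an IDNA2003 implementation): ToASCII takes one label, so a whole
name has to be split first. The helper here is hypothetical glue, not
anything from a spec:

    # IDNA2003 ToASCII operates on individual labels, not whole names.
    from encodings import idna as idna2003

    def to_ascii_name(name):  # hypothetical helper, for illustration only
        return b".".join(idna2003.ToASCII(label) for label in name.split("."))

    print(to_ascii_name("bücher.example"))  # b'xn--bcher-kva.example'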


On Mon, Aug 19, 2013 at 9:32 PM, Shawn Steele
<***@microsoft.com> wrote:
> I concur. We use the IDNA2008 + TR46 behavior.

Interesting. Last I checked Internet Explorer, that was not the case.
Since which version is this deployed? Does it depend on the operating
system? What variation of TR46 is implemented?


On Mon, Aug 19, 2013 at 11:36 PM, Vint Cerf <***@google.com> wrote:
> It seems to me that we would serve the community well if we work towards a
> well-defined and timely transition to IDNA2008. It has a key property of
> independence from any particular version of UNICODE (which was the primary
> reason for moving in that direction). It also has a canonical representation
> of domain labels which is also a powerful standardizing element. We are all
> aware of the potential for some backward incompatibility with IDNA2003 but
> the committee that developed IDNA2008 discussed these issues at length and
> obviously concluded that the features of IDNA2008 were superior over all to
> the status quo. It is a disservice in the long run to delay adoption of the
> newer design, especially given the huge expansion of the TLD space - all
> these TLDs should be developed and evolved on the IDNA2008 principles.

I don't think the committee has carefully considered the compatibility
impact. Deployed domains would become invalid. Long-standing practice
of case folding (e.g. the idea that http://EXAMPLE.COM/ and
http://example.com/ are identical) is suddenly something that is no
longer decided upon by IDNA but needs to be decided somehow at the
application-level. And when the Unicode consortium provided such
profiling for applications in the form of
http://unicode.org/reports/tr46/ that was frowned upon. It's not at
all clear what transition path is envisioned here.
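
(For what it's worth, that case folding sits inside IDNA2003's Nameprep
step, as a quick check with CPython's built-in module shows; a sketch,
not anything normative:)

    # Under IDNA2003, case folding happens inside the protocol (Nameprep);
    # under IDNA2008 it is left to the application.
    from encodings import idna as idna2003

    print(idna2003.nameprep("EXAMPLE"))  # 'example'
    print(idna2003.nameprep("Bücher"))   # 'bücher'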


--
http://annevankesteren.nl/
Jungshik SHIN (신정식)
2013-08-20 14:46:27 UTC
Permalink
On Aug 20, 2013 at 5:33 AM, "Anne van Kesteren" <***@annevk.nl> wrote:
>
> On Mon, Aug 19, 2013 at 6:31 PM, Mark Davis ☕ <***@macchiato.com> wrote:
> > Rather than promoting different, arbitrary modifications of IDNA2003, I
> > would recommend instead using the TR46 specification, which provides a
> > migration path from IDNA2003 to IDNA2008. It is, with some small
> > exceptions, compatible with IDNA2003.
>
> Last I checked with implementers there was not much interest in that.

Chrome is interested. It is very long overdue.

Gervase Markham
2013-08-21 09:07:18 UTC
Permalink
[I'm also not on many of these lists...]

On 20/08/13 15:46, Jungshik SHIN (신정식) wrote:
>
> On Aug 20, 2013 at 5:33 AM, "Anne van Kesteren" <***@annevk.nl
> <mailto:***@annevk.nl>> wrote:
>> Last I checked with implementers there was not much interest in that.

In the case of Mozilla, if it was something I said which gave you that
impression, I apologise. That's not correct.

> Chrome is interested. It is very long overdue.

We are also interested. Sticking with a single version of Unicode is
untenable; given that, implementing anything other than IDNA2008 would
just be some mish-mash which would behave differently to everyone else.
Our implementation was held up for quite some time by licensing problems
with idnkit2 (now resolved), and it's now held up (I believe) due to
lack of time on the part of the main engineer in this area. (Patches
welcome.) But, insofar as I have any say, we do want to move to
IDNA2008, perhaps with some compatibility mitigations from TR46. (We've
not yet developed a precise plan.)

With regard to any incompatibilities, particularly around sharp-S and
final sigma, my understanding and expectation is that the registries
most concerned with those characters (e.g. the Greek registry for final
sigma) were in agreement that IDNA2008 was the correct way forward, and
that any breakage caused by the switch was better than the breakage
caused by not moving. If I became aware that this was not the case, my
view might perhaps change. But I believe that it is. If there is a
phishing problem in any particular TLD due to this change, then I place
the blame for that squarely on the registry concerned.

This is https://bugzilla.mozilla.org/show_bug.cgi?id=479520 .

Gerv
Shawn Steele
2013-08-21 16:14:46 UTC
Permalink
> But I believe that it is. If there is a phishing problem in any particular TLD due to this change, then I place the blame for that squarely on the registry concerned.

Historically users blamed the browsers, not the registrars for things like the paypal-with-cyrillic-a homograph.
John C Klensin
2013-08-21 16:39:03 UTC
Permalink
--On Wednesday, August 21, 2013 16:14 +0000 Shawn Steele
<***@microsoft.com> wrote:

>> But I believe that it is. If there is a phishing problem in
>> any particular TLD due to this change, then I place the blame
>> for that squarely on the registry concerned.
>
> Historically users blamed the browsers, not the registrars for
> things like the paypal-with-cyrillic-a homograph.

Shawn, you can generalize from that to "historically, users
blame either the software with which they directly interact or
their first-hop ISP" without any loss of
information. Taking the Eszett problem as an example, if a
registry decides to register a label containing an Eszett but
block a similar one containing an "ss" (a rational, but probably
not optimal, strategy by Mark's reasoning or mine), then the
complaints will be about inaccessibility from an IDNA2003 or
IDNA2008-with-UTR46-transition=on browser. If they allow "ss"
but not Eszett, then someone using an IDNA2008 browser (with no
transition tools) will be happy but someone expecting "ss" to
just work will be unhappy with all browsers.
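
The divergence is easy to reproduce; a sketch contrasting CPython's
built-in IDNA2003 codec with the third-party "idna" package (an IDNA2008
implementation), with an illustrative label:

    # The same Eszett label yields different A-labels under IDNA2003
    # (which maps ß to ss) and IDNA2008 (where ß is PVALID and kept).
    from encodings import idna as idna2003
    import idna  # third-party IDNA2008 implementation

    print(idna2003.ToASCII("straße"))  # b'strasse'
    print(idna.encode("straße"))       # b'xn--strae-oqa'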

That situation of course has the potential to provide clear
feedback to registries, even though well down in the tree. If
they sell or otherwise allocate and delegate names that often
don't "work", they are likely to have trouble with their own
customers and constituencies. Whether it is better to have
browsers (and other UIs) lead or follow is not a simple question
(although I clearly have biases about the right answer).

This is ultimately a "lose either way" situation, a problem that
was reasonably well understood and accepted when the IDNABIS WG
made its decisions. The question is where on the curve one
wants to fall and when. That question has no easy answers
although it is clear to me that "IDNA2003 forever" isn't one of
the reasonable ones.

best,
john
Shawn Steele
2013-08-21 17:05:21 UTC
Permalink
John Cowan
2013-08-21 18:07:21 UTC
Permalink
Shawn Steele scripsit:

> A non-final sigma isn't (my understanding) a valid form of the word,

Alas, things are not so simple. φιλος would be appropriate if the
semantic is 'friendship', but φιλοσ, with a non-final sigma, would
be appropriate as an abbreviation of φιλοσοφία 'philosophy'.
The Unicode rule is to downcase capital sigma to a non-final form if
a letter follows and to a final form otherwise, but this is just a
convention that dumb computers can follow rather than the whole truth.
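
That convention is what Unicode default casing implements; Python's
str.lower(), for instance, applies the positional rule (illustration
only, and no smarter than the convention itself):

    # Default Unicode casing picks final vs. non-final sigma by position.
    print("ΦΙΛΟΣ".lower())  # 'φιλος' - end of word, final sigma (ς)
    print("ΣΟΦΙΑ".lower())  # 'σοφια' - a letter follows, non-final sigma (σ)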

> Eszett is less clear, because using eszett or ss influences the
> pronunciation (at least in Germany, in Switzerland that can be
> different). I imagine it's rather worse if you're Turkish and prefer
> different i's.

Actually, missing diacritics aren't a big problem in Turkish for native
speakers, because of the vowel-harmony rules, which mean that most
words contain either the front vowels e, i, ö, and ü, or else the back
vowels a, ı (dotless i), o, and u, but not both in the same word.

> For German, nobody is ever going to expect fußball.ch and fussball.ch
> to go different place. And nobody's going to be surprised if
> fußball.de and fussball.de end up at the same site.

Well, there are minimal pairs like Buße 'fine' vs. Busse 'buses', but
that's livable, particularly because in Switzerland and Liechtenstein
they are both spelled "Busse" anyway.

--
"But I am the real Strider, fortunately," John Cowan
he said, looking down at them with his face ***@ccil.org
softened by a sudden smile. "I am Aragorn son http://www.ccil.org/~cowan
of Arathorn, and if by life or death I can
save you, I will." --LotR Book I Chapter 10
Shawn Steele
2013-08-21 18:09:56 UTC
Permalink
> Alas, things are not so simple. φιλος would be appropriate if the semantic is 'friendship', but φιλοσ, with a non-final sigma, would be appropriate as an abbreviation of φιλοσοφία 'philosophy'.

Thanks, I
John C Klensin
2013-08-21 20:48:56 UTC
Permalink
--On Wednesday, August 21, 2013 17:05 +0000 Shawn Steele
<***@microsoft.com> wrote:

> IMO, the eszett & even more so, final sigma, are somewhat
> display issues. My personal opinion is we need a display
> standard (yes, that's not easy

Indeed. But it might be worth some effort.

> A non-final sigma isn't (my understanding) a valid form of the
> word, so you shouldn't ever have both registered. It could
> certainly be argued that 2003 shouldn't have done this
> mapping. If these are truly mutually exclusive, then the
> biggest problem with 2003 isn't a confusing canonical form,
> but rather that it doesn't look right in the 2003 canonical
> form. However there's no guarantee in DNS that I can have a
> perfect canonical form for my label. Microsoft for example,
> is a proper noun, however any browser nowadays is going to
> display microsoft.com, not Microsoft.com. (Yes, that's
> probably not "as bad" as the final sigma example).

Right. But I think that you are at risk of confusing two
issues. One is that, if the needs of the DNS were the only
thing that drove Unicode decisions, we all had perfect hindsight
and foresight, and it was easy to make retroactive or flag day
corrections, probably all position-dependent (isolated, initial,
medial, final in the general case) character variations would be
assigned only a code point for the base character with the
positional stuff viewed strictly as a display issue (possibly
with an overriding qualifier codepoint). That would have meant
no separate code point for a final sigma in Greek; no separate
code points for final Kaf, Mem, Nun, Pe, or Tsadi in Hebrew; and
so on, i.e., the way the basic Arabic block was handled before
the presentation forms were added. If things had been done that
way, some of these things would have been entirely a display
issue, with the only difficult question for IDNA one of whether
to allow the presentation qualifier so as to permit preserving
word distinctions in concatenated strings -- in a one-case
script, selective use of final or initial character forms would
provide the equivalent of using "DigitalResearch" or "SonyStyle"
as a distinctive domain name.

But it wasn't done that way. I can identify a number of reasons
why it wasn't and indeed why, on balance, it might have been a
bad idea. I assume Mark or some other Unicode expert would have
a longer list of such reasons than I do. So we cope. To a
first order approximation, the IDNA2003 method of coping was to
try to map all of the alternate presentation forms together...
except when it didn't. And, to an equally good approximation,
IDNA2008 deals with it by disallowing the alternate presentation
forms... except when it doesn't. The working group was
convinced that the second choice was less evil (or at least less
of a problem) than the first one, but I don't think anyone would
really argue that either choice is ideal, especially when it
cannot be applied consistently without a lot of additional
special-case, code point by code point, rules.

Hard problem but, if we come back to the question from Anne that
started this thread, I don't think there is any good basis to
argue that the IDNA2003 approach is fundamentally better. It is
just the approach that we took first, before we understood the
problems with it.

> Eszett is less clear, because using eszett or ss influences
> the pronunciation (at least in Germany, in Switzerland that
> can be different). I imagine it's rather worse if you're
> Turkish and prefer different i's. For German, nobody is ever
> going to expect fußball.ch and fussball.ch to go different
> place.

I suspect that there are other possible examples that don't have
that property. But that is something on which Marcos should
comment. Clearly it is within the power of the registry to
arrange for "same place" if that is what they want to do. And,
if they do that for all such names, this whole discussion is
moot in practice.

>...
> For words that happen to be similar, there's no expectation
> that a DNS name is available. AAA Plumbing and all the other
> AAA whatever's out there aren't going to be surprised that
> AAA.com is already taken.

Surprised? Probably not. Willing to fight over who is the
"real" AAA? Yes, and we have seen that sort of thing repeatedly.

> So why's German more special that
> Turkish or English?

Because "ß" is really a different letter than the "ss"
sequence. And dotless i is really a different letter than the
dotted one, just as "o" and "0" or "l" and "1" are. If a
registry decides that the potential for spoofing and other
problems outweighs the advantages of keeping them separate and
potentially allocating them separately and either delegates them
to the same entity or blocks one string from each pair, I think
that is great. If they make some other decision, that is great
too. Where I have a problem is when a browser (or other lookup
application) makes that decision, essentially blocking one of
the strings, and makes it on behalf of the user without any
consideration of local issues or conventions.

I might even suggest that, because "O" and "0" and "l" and "1"
are more confusable (and hence spoofing-prone) than "ß" and
"ss", if you were being logically consistent, you would map all
domain labels containing zeros into ones containing "o"s and
ones containing "1" into ones containing "l". That would
completely prevent the "MICR0S0FT" spoof and a lot of others at
the price of making a lot of legitimate labels invalid or
inaccessible -- just like the "ß" case. And, like "ß",
treating 0 or 1 as display issues would not only not help very
much, it would astonish users of European digits.

best,
john
John Cowan
2013-08-21 21:30:12 UTC
Permalink
John C Klensin scripsit:

> That would have meant no separate code point for a final sigma in
> Greek;

See my earlier post for why that's not possible.

> no separate code points for final Kaf, Mem, Nun, Pe, or Tsadi in
> Hebrew;

This is even less possible. In Hebrew, a pe at the end of a word is
always /f/. But in Yiddish, there is a contrast between /p/ and /f/ in
final position that Hebrew does not have, and in that case a non-final
pe in final position is used for /p/, sometimes but not always with a
dagesh (embedded dot).

In short, Greek and Hebrew positional variants require AI-hard algorithms.

--
John Cowan ***@ccil.org http://ccil.org/~cowan
If I have seen farther than others, it is because I am surrounded by dwarves.
--Murray Gell-Mann
John C Klensin
2013-08-22 01:50:47 UTC
Permalink
--On Wednesday, August 21, 2013 17:30 -0400 John Cowan
<***@mercury.ccil.org> wrote:

> John C Klensin scripsit:
>
>> That would have meant no separate code point for a final
>> sigma in Greek;
>
> See my earlier post for why that's not possible.
>
>> no separate code points for final Kaf, Mem, Nun, Pe, or Tsadi
>> in Hebrew;
>
> This is even less possible. In Hebrew, a pe at the end of a
> word is always /f/. But in Yiddish, there is a contrast
> between /p/ and /f/ in final position that Hebrew does not
> have, and in that case a non-final pe in final position is
> used for /p/, sometimes but not always with a dagesh (embedded
> dot).
>
> In short, Greek and Hebrew positional variants require AI-hard
> algorithms.

John,

As I said, I know a rather long list of reasons why the model I
tried to describe was not reasonable or desirable and suggested
that people more expert would have an even longer list.

Your note only reinforced that impression.

The point was merely to suggest that there are no simplistic and
still fully general solutions to the relevant set of problems.
Your note reinforces that suggestion too.

thanks,

john
Anne van Kesteren
2013-08-22 10:52:24 UTC
Permalink
On Wed, Aug 21, 2013 at 6:05 PM, Shawn Steele
<***@microsoft.com> wrote:
> I'd much prefer a mechanism to suggest a preferred display form. That'd solve things like the Turkish I issue as well.

Yeah, that seems more sensible, avoids breaking a ton of URLs, and has
less potential for spoofing (given appropriate safeguards).

I don't buy the argument that then we should also alias 1 and l and o
and 0. Shawn's argument is about not making the status quo worse. We
could maybe make it better still, but that seems separate and, given
what's deployed and relied upon, unlikely.


--
http://annevankesteren.nl/
Gervase Markham
2013-08-22 12:15:11 UTC
Permalink
On 22/08/13 11:52, Anne van Kesteren wrote:
> Yeah, that seems more sensible, avoids breaking a ton of URLs, and has
> less potential for spoofing (given appropriate safeguards).

Can you (or e.g. Google - Jungshik?) put some metrics behind that "ton"?

The question is: how many domain names are there out there with live web
pages which use any one of the characters permitted in IDNA2003 but not
permitted in IDNA2008+TR46 (i.e. the non-alphabetic characters)?

Gerv
Gervase Markham
2013-08-21 16:54:12 UTC
Permalink
On 21/08/13 17:14, Shawn Steele wrote:
>> But I believe that it is. If there is a phishing problem in any
>> particular TLD due to this change, then I place the blame for that
>> squarely on the registry concerned.
>
> Historically users blamed the browsers, not the registrars for things
> like the paypal-with-cyrillic-a homograph.

Historically, this is true. If it happens again, we plan to put up a
significantly more robust defence, based on a decade of experience since
then of what the problem is and who should be solving it.

Gerv
Patrik Fältström
2013-08-21 18:21:44 UTC
Permalink
On 21 aug 2013, at 11:07, Gervase Markham <***@mozilla.org> wrote:

> With regard to any incompatibilities, particularly around sharp-S and
> final sigma, my understanding and expectation is that the registries
> most concerned with those characters (e.g. the Greek registry for final
> sigma) were in agreement that IDNA2008 was the correct way forward, and
> that any breakage caused by the switch was better than the breakage
> caused by not moving.

Correct. The incompatibility is that sharp-S and final sigma in IDNA2008 are not mapped to other characters. This implies that those characters can finally be used in, for example, domain names.

When communicating specifically with people registering domain names in German, they did acknowledge that domain names that would use sharp-S have so far been registered with double-S instead, but now they have to introduce the sharp-S as well. This the registrant can do themselves before someone else grabs the domain at sunrise or similar.

But the problem is more or less the same as what happened when introducing IDN in Sweden, Germany and other countries, where earlier, for example, ä was mapped to a or ae depending on what the registrant wanted. We will always get those problems when new characters are introduced in DNS. We have done it before, and we will do it again.

I.e. the message was clear that the problems these kinds of mapping errors create are compensated for many times over by a) getting a 1:1 mapping between A-label and U-label, b) getting a Unicode-version-independent standard and c) having the ability to use sharp-S in domain names and not only double-S.

Patrik
Marcos Sanz
2013-08-20 13:55:13 UTC
Permalink
idna-update-***@alvestrand.no wrote on 20/08/2013 14:32:23:

> On Mon, Aug 19, 2013 at 9:32 PM, Shawn Steele
> <***@microsoft.com> wrote:
> > I concur. We use the IDNA2008 + TR46 behavior.
>
> Interesting. Last I checked Internet Explorer, that was not the case.

At this side of the keyboard, ß is still not supported in IE10/Win7-SP1

Best,
Marcos

> Since which version is this deployed? Does it depend on the operating
> system? What variation of TR46 is implemented?
Shawn Steele
2013-08-20 16:31:02 UTC
Permalink
> >> I concur. We use the IDNA2008 + TR46 behavior.
>
>> Interesting. Last I checked Internet Explorer, that was not the case.
>
> At this side of the keyboard, ß is still not supported in IE10/Win7-SP1

Yes, that's the + TR46 behavior.

We're not changing the spoofable entries at this time due to security concerns. You can register the ss version and it'll get there. As a complete digression, IMO IDN/DNS should allow for a "display" form mechanism because the NFKC part of the mapping is more than a bit destructive and there're a lot of other inpu
Shawn Steele
2013-08-20 16:37:40 UTC
Permalink
> At this side of the keyboard, ß is still not supported in IE10/Win7-SP1

(To be clear, with the transitional support of TR46, www.Fußball.de will take you to www.fussball.de - which I strongly suspect is what the current owners of www.fussball.de expect, particularly for any of their Swiss visitors. We don't want IE to hijack those users and take them to another site, particularly if the tar
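
A sketch of that transitional behavior, using the third-party Python
"idna" package (the keyword names are that package's; the domain is
illustrative):

    # TR46 transitional processing keeps the IDNA2003 mapping of ß to ss;
    # nontransitional processing keeps ß, per IDNA2008.
    import idna

    print(idna.encode("fußball.de", uts46=True, transitional=True))
    # b'fussball.de'
    print(idna.encode("fußball.de", uts46=True, transitional=False))
    # b'xn--fuball-cta.de'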
Andrew Sullivan
2013-08-20 16:06:20 UTC
Permalink
I'm pretty sure I'm not on many of these lists, so I bet this mail
won't go everywhere. Nevertheless,

On Tue, Aug 20, 2013 at 01:32:23PM +0100, Anne van Kesteren wrote:
> (Aside: ToASCII in IDNA2003 applies to domain labels. Having it apply
> to domain names in UTS #46 is somewhat confusing.)

Or "broken". It can't apply to domain names, of course, because
that's not how the DNS works; but one might be forgiven for wondering
whether not understanding the details of an underlying technical
problem is a barrier to having an opinion in this space.

> I don't think the committee has carefully considered the compatibility
> impact. Deployed domains would become invalid.

The IDNABIS wg did not take that decision lightly. In my opinion, we
concluded that some deployed domains were just _broken_, and that we
were eventually going to endure this pain, and that it would be better
to do it earlier rather than later.

> Long-standing practice
> of case folding (e.g. the idea that http://EXAMPLE.COM/ and
> http://example.com/ are identical) is suddenly something that is no
> longer decided upon by IDNA but needs to be decided somehow at the
> application-level.

Well, sort of. There's nothing in IDNA2008 that prevents the OS from
providing a generic facility for this (which is apparently what the
current generation of Windows does).

The point was to take this mapping out of the _protocol_ and put it
into local rules that could be made locale-sensitive. The reason for
this is that, while it is impossible in general to provide case
folding rules where lower-case accented characters get mapped to upper
case without accents and then get case folded again (thereby losing
data), it _might_ be possible to do this in a locale-sensitive way if
one knew enough about the environment. For instance, in some writing
systems for French, it is standard practice to fold LATIN SMALL LETTER
E WITH ACUTE to LATIN CAPITAL LETTER E (not all French systems, of
course. Some fold to LATIN CAPITAL LETTER E WITH ACUTE). Now, if the
LATIN CAPITAL LETTER E is next downcased, what should you get? The
general rule will of course be LATIN SMALL LETTER E, but if you had a
clever program that could do intelligent things with the string
"ECOLE", the folding might be LATIN SMALL LETTER E WITH ACUTE, or the
folding might try both and see what happens. This example is a little
contrived -- the French example seems silly -- but examples in other
scripts and languages are in my view considerably more compelling. I
don't think that UTS#46 is actually different in this regard, although
it proposes uniform mapping rules in all cases.
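
A miniature of that data loss, with the accent-stripping uppercase step
standing in for a hypothetical locale-specific French convention:

    # An accent-stripping uppercase convention destroys information that
    # a later downcase cannot recover.
    import unicodedata

    def french_upper_no_accents(s):  # hypothetical locale rule
        decomposed = unicodedata.normalize("NFD", s.upper())
        return "".join(c for c in decomposed if not unicodedata.combining(c))

    folded = french_upper_no_accents("école")  # 'ECOLE'
    print(folded.lower())                      # 'ecole' - the acute is gone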

IDNA2003 doesn't handle this case really well, because it can't
possibly. There's simply no room for locale in IDNA2003.

> And when the Unicode consortium provided such
> profiling for applications in the form of
> http://unicode.org/reports/tr46/ that was frowned upon.

I think the history is a little more complicated than that.

Best regards,

A

--
Andrew Sullivan
***@anvilwalrusden.com
John C Klensin
2013-08-20 19:33:45 UTC
Permalink
--On Tuesday, August 20, 2013 15:55 +0200 Marcos Sanz
<***@denic.de> wrote:

> idna-update-***@alvestrand.no wrote on 20/08/2013 14:32:23:
>
>> On Mon, Aug 19, 2013 at 9:32 PM, Shawn Steele
>> <***@microsoft.com> wrote:
>> > I concur. We use the IDNA2008 + TR46 behavior.
>>
>> Interesting. Last I checked Internet Explorer, that was not
>> the case.
>
> At this side of the keyboard, ß is still not supported in
> IE10/Win7-SP1

But that is completely consistent with IDNA2008 + UTR46 when the
most IDNA2003-like profile (or, if you prefer, stage of
transition) of UTR46 is used. One can debate endlessly whether
UTR46 is a good idea (and the IDNABIS WG did), but ultimately
[1] it was intended to provide an environment as much like that
of IDNA2003 as possible. That includes:

--strict backward compatibility with the interpretation
of strings that are valid with either IDNA2003 or
IDNA2008 and

-- continued support for strings that were valid in
IDNA2003 but that mapped into other strings before being
converted to ASCII strings using Punycode, where those
target strings are valid under IDNA2008

If one accepts that kind of compatibility as a primary goal,
then the fact that "ß" was mapped to "ss" in IDNA2003 means
that mapping must be preserved forever and one will never [2]
actually be able to store an Eszett in the DNS.

The bottom line, at least IMO, is that one can adopt either of
two philosophical models. In one, whatever decisions were made
in building the IDNA2003 standard and the name strings those
decisions allowed are inviolable. Arguments that errors were
made, that those strings create risks, or that the rules
prohibit orthographically-reasonable strings are simply
irrelevant if they conflict with absolute compatibility. The
other (at the risk of showing my biases) is to assume that we are
human, that mistakes will get made, and that, if they are
significant, we should figure out how to correct them and move
on.

As others have suggested, the latter includes realizing that
some labels and practices that were allowed under IDNA2003 were
simply a bad idea and we should move away from them as soon as
possible rather than encouraging their use in even more
contexts. Coming back to the comment that started this note, it
also means that, if the relevant language communities decide,
for example, that Eszett is important as a character or that
zero-width joiners and non-joiners are critical, we need to
figure out how to accommodate them even if the accommodation is
not perfect and doesn't solve all problems. And, in each case,
we need to remember that the Internet is growing and reaching
more communities and more people within almost every community,
making transition now, even if painful, much less painful than
transition in the future.

FWIW, without at least some measure of the latter model, we
would be stuck with HTTP 1.0, HTML 1 (or at least 3), and ISO
8859-1 forever. The decision to interpret a string of non-ASCII
octets in content as, by default, a good candidate for UTF-8
rather than Latin-1 is, at least IMO, ultimately an incompatible
change of far more sweeping impact and consequences than this
IDNA2003 -> IDNA2008 transition.

In an odd way, while I would have preferred to see a much more
rapid transition, I think that exactly what should be happening
is happening. The various registries --both the
ICANN-supervised ones and many others at the root and various
other levels-- are prohibiting (and not renewing) strings that
do not conform with IDNA2008. Registries that want to support
labels that are problematic from a transition standpoint have
devised, or are devising, procedures to lower the odds of
strings that pose difficulties falling into hostile hands, just
as many of them do for potentially-confusing strings. The right
time to transition systems that look up names involves tricky
questions including the "pain now or more pain later"
considerations mentioned above. And where UTR 46 and/or RFC
5895 fit into transition strategies (as distinct from localized
mapping strategies), or not, is obviously part of that
transition question.

Anne, coming back to your original question, I don't know what
question you and your colleagues asked that got the "everyone is
still on IDNA2003" answer. Especially given the information
from Microsoft, I suspect it was close to "are you fully
supporting IDNA2008" for which as "no" answer might lead to a
"using IDNA2003" answer despite their telling us that they are
running IDNA2008 with UTR 46. Others have pointed out that
"IDNA2003 with the version restriction eliminated" may be a
sensible statement in individual cases but, because the Nameprep
profile of Stringprep is not simply Unicode Case Folding plus
NFKC, it leaves enough open to local interpretation that it is
not a plausible candidate for a statement in a standard that is
intended to promote interoperability.

Against that backdrop, I believe you should interpret what you
are seeing, not as "everyone is committed to IDNA2003"
(obviously not true as soon as exceptions are introduced) and
"IDNA2003 with exceptions forever" but as slow transition. If
you want a standard that works going forward, make the
assumption that the folks who designed IDNA2008 were not fools
and that browsers should be moving, and eventually will move
(unless you discourage them) in the IDNA2008 direction. Whether
you want to discuss transition or not is up to you. If you want
to follow Mark's recommendation (and Microsoft's lead) and
suggest IDNA2008 plus UTR 46, I suggest you do so in a way that
really constitutes a transition strategy rather than an "IDNA
2003 forever" one, i.e., that you address the issues of when
"transition processing" gets turned off and the localization
issues (especially about case folding) mentioned by others. If
not, you and your working group put us all at risk of many
internationalized email applications working differently than
web browsers do, in a fork between IETF and W3C i18n standards,
divergence between assumptions and norms used by those who
create DNS names and those who look them up, and so on. I hope
we can agree that those would be bad outcomes.

regards,
john

-----------

[1] I hope Mark will more or less agree with this
characterization; it is as accurate and neutral as I know how to
make it.

[2] This is associated with one of the key criticisms of UTR 46
that has not been discussed so far: It has been described as a
transition strategy, but there is really no mechanism in it for
deciding when to adopt the IDNA2008 model and rules in favor of
strict backward-compatibility with as many names that were valid
under IDNA2003 as possible. In reality, saying "we use UTR 46"
or "we conform to UTR 46" is somewhat underspecified because UTR
46 can be used strictly for local mapping, with what it calls
"transition processing" (which is where Eszett disappears),
and/or with other optional features such as flagging, but
continuing to look up, strings that contain punctuation or
symbol characters. Either of those latter options makes a
so-called "IDNA2008 + UTR46" implementation non-conforming with
IDNA2008.
Mark Davis ☕
2013-08-21 15:01:42 UTC
Permalink
Mark <https://plus.google.com/114199149796022210033>
— Il meglio è l'inimico del bene —


On Tue, Aug 20, 2013 at 9:33 PM, John C Klensin <***@jck.com> wrote:

>
>
> --On Tuesday, August 20, 2013 15:55 +0200 Marcos Sanz
> <***@denic.de> wrote:
>
> > idna-update-***@alvestrand.no wrote on 20/08/2013 14:32:23:
> >
> >> On Mon, Aug 19, 2013 at 9:32 PM, Shawn Steele
> >> <***@microsoft.com> wrote:
> >> > I concur. We use the IDNA2008 + TR46 behavior.
> >>
> >> Interesting. Last I checked Internet Explorer, that was not
> >> the case.
> >
> > At this side of the keyboard, ß is still not supported in
> > IE10/Win7-SP1
>
> But that is completely consistent with IDNA2008 + UTR46 when the
> most IDNA2003-like profile (or, if you prefer, stage of
> transition) of UTR46 is used. One can debate endlessly whether
> UTR46 is a good idea (and the IDNABIS WG did), but ultimately
> [1] it was intended to provide an environment as much like that
> of IDNA2003 as possible. That includes:
>
> --strict backward compatibility with the interpretation
> of strings that are valid with either IDNA2003 or
> IDNA2008 and
>
> -- continued support for strings that were valid in
> IDNA2003 but that mapped into other strings before being
> converted to ASCII strings using Punycode, where those
> target strings are valid under IDNA2008
>
> If one accepts that kind of compatibility as a primary goal,
> then the fact that "ß" was mapped to "ss" in IDNA2003 means
> that mapping must be preserved forever and one will never [2]
> actually be able to store an Eszett in the DNS.
>
> The bottom line, at least IMO, is that one can adopt either of
> two philosophical models. In one, whatever decisions were made
> in building the IDNA2003 standard and the name strings those
> decisions allowed are inviolable. Arguments that errors were
> made, that those strings create risks, or that the rules
> prohibit orthographically-reasonable strings are simply
> irrelevant if they conflict with absolute compatibility. The
> other (at the risk of showing my biases) is to assume that we are
> human, that mistakes will get made, and that, if they are
> significant, we should figure out how to correct them and move
> on.
>
> As others have suggested, the latter includes realizing that
> some labels and practices that were allowed under IDNA2003 were
> simply a bad idea and we should move away from them as soon as
> possible rather than encouraging their use in even more
> contexts. Coming back to the comment that started this note, it
> also means that, if the relevant language communities decide,
> for example, that Eszett is important as a character or that
> zero-width joiners and non-joiners are critical, we need to
> figure out how to accommodate them even if the accommodation is
> not perfect and doesn't solve all problems. And, in each case,
> we need to remember that the Internet is growing and reaching
> more communities and more people within almost every community,
> making transition now, even if painful, much less painful than
> transition in the future.
>

The key migration issue is whether people are comfortable having
implementations go to different IP addresses for IDNs containing 'ß' (or
the other 3 related characters). The transitional form in TR46 is for those
who are concerned with that problem. If the registries either bundled 'ss'
with 'ß' or blocked (once either was registered, the other could not be), then
the ambiguous addressing issue would not be a problem. So it is a matter of
waiting for the significant registries to do that.


> FWIW, without at least some measure of the latter model, we
> would be stuck with HTTP 1.0, HTML 1 (or at least 3), and ISO
> 8859-1 forever. The decision to interpret a string of non-ASCII
> octets in content as, by default, a good candidate for UTF-8
> rather than Latin-1 is, at least IMO, ultimately an incompatible
> change of far more sweeping impact and consequences than this
> IDNA2003 -> IDNA2008 transition.
>

That's not a particularly good analogy. ASCII is and remains ASCII in
UTF-8; that's one of its virtues. Latin 1 was just one of many encodings
that used the high bit for different purposes, so UTF-8 was simply one of
many such encodings. It did not represent a backwards incompatibility with
existing standards.

>
> In an odd way, while I would have preferred to see a much more
> rapid transition, I think that exactly what should be happening
> is happening. The various registries --both the
> ICANN-supervised ones and many others at the root and various
> other levels-- are prohibiting (and not renewing) strings that
> do not conform with IDNA2008. Registries that want to support
> labels that are problematic from a transition standpoint have
> devised, or are devising, procedures to lower the odds of
> strings that pose difficulties falling into hostile hands, just
> as many of them do for potentially-confusing strings. The right
> time to transition systems that look up names involves tricky
> questions including the "pain now or more pain later"
> considerations mentioned above. And where UTR 46 and/or RFC
> 5895 fit into transition strategies (as distinct from localized
> mapping strategies), or not, is obviously part of that
> transition question.
>

I agree with that, and it is the scenario envisioned for TR46. That is,
once all (significant) registries move to IDNA2008, then clients can
impose stricter controls on the characters, excluding the characters that
are disallowed in IDNA2008. Because the registries will have moved, the
number of failing URLs would be acceptable.

>
> Anne, coming back to your original question, I don't know what
> question you and your colleagues asked that got the "everyone is
> still on IDNA2003" answer. Especially given the information
> from Microsoft, I suspect it was close to "are you fully
> supporting IDNA2008" for which as "no" answer might lead to a
> "using IDNA2003" answer despite their telling us that they are
> running IDNA2008 with UTR 46. Others have pointed out that
> "IDNA2003 with the version restriction eliminated" may be a
> sensible statement in individual cases but, because the Nameprep
> profile of Stringprep is not simply Unicode Case Folding plus
> NFKC, it leaves enough open to local interpretation that it is
> not a plausible candidate for a statement in a standard that is
> intended to promote interoperability.
>
> Against that backdrop, I believe you should interpret what you
> are seeing, not as "everyone is committed to IDNA2003"
> (obviously not true as soon as exceptions are introduced) and
> "IDNA2003 with exceptions forever" but as slow transition. If
> you want a standard that works going forward, make the
> assumption that the folks who designed IDNA2008 were not fools
> and that browsers should be moving, and eventually will move
> (unless you discourage them) in the IDNA2008 direction. Whether
> you want to discuss transition or not is up to you. If you want
> to follow Mark's recommendation (and Microsoft's lead) and
> suggest IDNA2008 plus UTR 46, I suggest you do so in a way that
> really constitutes a transition strategy rather than an "IDNA
> 2003 forever" one, i.e., that you address the issues of when
> "transition processing" gets turned off and the localization
> issues (especially about case folding) mentioned by others. If
> not, you and your working group put us all at risk of many
> internationalized email applications working differently than
> web browsers do, in a fork between IETF and W3C i18n standards,
> divergence between assumptions and norms used by those who
> create DNS names and those who look them up, and so on. I hope
> we can agree that those would be bad outcomes.
>
> regards,
> john
>
> -----------
>
> [1] I hope Mark will more or less agree with this
> characterization; it is as accurate and neutral as I know how to
> make it.
>

Yes, thanks.

>
> [2] This is associated with one of the key criticisms of UTR 46
> that has not been discussed so far: It has been described as a
> transition strategy, but there is really no mechanism in it for
> deciding when to adopt the IDNA2008 model and rules in favor of
> strict backward-compatibility with as many names that were valid
> under IDNA2003 as possible. In reality, saying "we use UTR 46"
> or "we conform to UTR 46" is somewhat underspecified because UTR
> 46 can be used strictly for local mapping, with what it calls
> "transition processing" (which is where Eszett disappears),
> and/or with other optional features such as flagging, but
> continuing to look up, strings that contain punctuation or
> symbol characters. Either of those latter options makes a
> so-called "IDNA2008 + UTR46" implementation non-conforming with
> IDNA2008.
>

Yes, it is the latter two options that can disappear under the right
conditions (as above).
Anne van Kesteren
2013-08-21 15:45:51 UTC
Permalink
On Wed, Aug 21, 2013 at 4:01 PM, Mark Davis ☕ <***@macchiato.com> wrote:
> I agree with that, and it is the scenario envisioned for TR46. That is, once
> all (significant) registries move to IDNA2008, then clients can impose
> stricter controls on the characters, excluding the characters that are
> disallowed in IDNA2008. Because the registries will have moved, the number
> of failing URLs would be acceptable.

I doubt that would be true for subdomains. E.g. I know people using
http://☺.example.com/ as domain (forgot whether that particular code
point is excluded, but you get the idea).
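
(For what it's worth, Punycode itself has no trouble with that code
point; it's IDNA2008's code point rules that exclude it. A quick sketch:)

    # Punycode happily round-trips U+263A; IDNA2008 DISALLOWs the code
    # point at the protocol level instead.
    print("☺".encode("punycode"))     # b'74h' -> A-label xn--74h
    print(b"74h".decode("punycode"))  # '☺'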

It's also not true for URLs in resources that depend on the mapping to
happen. Especially for uppercase/lowercase I would expect that to be
fairly common. And URLs in resources should remain
locale-insensitive. That they depend on encodings to some extent is
bad enough.


--
http://annevankesteren.nl/
Mark Davis ☕
2013-08-21 16:01:51 UTC
Permalink
> It's also not true for URLs in resources that depend on the mapping to
> happen.

TR46 really has 3 parts:

1. transitional handling for the 4 ambiguous characters
2. inclusion of symbols
3. client-side mapping (aka lowercasing)

Parts #1 and #2 are transitional in supporting IDNA2003 on the path to
IDNA2008.

Part #3 (client-side mapping) is something that is permitted by IDNA2008,
and is thus optional for even a fully IDNA2008-compliant implementation.
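
A sketch of part #3 in isolation, using the third-party Python "idna"
package (uts46_remap is that package's name for the mapping step, as far
as I know; the domain is illustrative):

    # Client-side mapping (case folding etc.) happens before strict
    # IDNA2008 processing; IDNA2008 itself never maps.
    import idna

    mapped = idna.uts46_remap("ÖBB.example")  # the optional mapping step
    print(mapped)                             # 'öbb.example'
    print(idna.encode(mapped))                # b'xn--bb-eka.example'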



Mark <https://plus.google.com/114199149796022210033>
— Il meglio è l'inimico del bene —
Gervase Markham
2013-08-22 08:34:48 UTC
Permalink
On 21/08/13 16:45, Anne van Kesteren wrote:
> I doubt that would be true for subdomains. E.g. I know people using
> http://☺.example.com/ as domain (forgot whether that particular code
> point is excluded, but you get the idea).

Shame for them. The writing has been on the wall here for long enough
that they should not be at all surprised when this stops working.

Gerv
Anne van Kesteren
2013-08-22 10:37:58 UTC
Permalink
On Thu, Aug 22, 2013 at 9:34 AM, Gervase Markham <***@mozilla.org> wrote:
> On 21/08/13 16:45, Anne van Kesteren wrote:
>> I doubt that would be true for subdomains. E.g. I know people using
>> http://☺.example.com/ as domain (forgot whether that particular code
>> point is excluded, but you get the idea).
>
> Shame for them. The writing has been on the wall here for long enough
> that they should not be at all surprised when this stops working.

I don't think that's at all true. I doubt anyone realizes this. I
certainly didn't until I put long hours into investigating the IDNA
situation.

Furthermore, we generally preserve compatibility on the web so URLs
and documents remain working.
http://www.w3.org/Provider/Style/URI.html It's one of the more
important parts of this platform.


--
http://annevankesteren.nl/
Gervase Markham
2013-08-22 11:02:23 UTC
Permalink
On 22/08/13 11:37, Anne van Kesteren wrote:
>> Shame for them. The writing has been on the wall here for long enough
>> that they should not be at all surprised when this stops working.
>
> I don't think that's at all true. I doubt anyone realizes this. I
> certainly didn't until I put long hours into investigating the IDNA
> situation.

It's not been possible to register names like ☺☺☺.com for some time now;
that's a big clue. The fact that Firefox (and other browsers, AFAIAA)
refuses to render such names as Unicode is another one. (Are your
friends really using http://xn--74h.example.com/ ?)

Those two things, plus the difficulty of typing such names, means that
their use is going to be pretty limited. (Even the guy who is trying to
flog http://xn--19g.com/ , and is doing so on the basis of the fact that
this particular one is actually easy to type on some computers, has not
in the past few years managed to find a "Macintosh company with a
vision" to take it off his hands.)

> Furthermore, we generally preserve compatibility on the web so URLs
> and documents remain working.
> http://www.w3.org/Provider/Style/URI.html It's one of the more
> important parts of this platform.

(The domain name system is about more than just the web.)

IIRC, we must have broken a load of URLs when we decided that %-encoding
in URLs should always be interpreted as UTF-8 (in RFC 3986), whereas
beforehand it depended on the charset of the page or form producing the
link. Why did we do that? Because the new way was better for the future,
and some breakage was acceptable to attain that goal.

So what is the justification for removal of non-letter characters?
Reduction of attack surface. When characters are divided into scripts,
we can enforce no-script-mixing rules to keep the number of possible
spoofs, lookalikes and substitutions tractable for humans to reason
about in the case of a particular TLD and its allowed characters. If we
allowed 3,254 extra random glyphs in every TLD, this would not be so.

Gerv
Anne van Kesteren
2013-08-22 11:11:55 UTC
Permalink
On Thu, Aug 22, 2013 at 12:02 PM, Gervase Markham <***@mozilla.org> wrote:
> It's not been possible to register names like ☺☺☺.com for some time now;
> that's a big clue.

I don't think it is. There are sites out there that rely on underscores
working in subdomains. You cannot register a domain name with an underscore.


> (Are your friends really using http://xn--74h.example.com/ ?)

Yeah (with "example" replaced). Renders fine in Safari, too.


> IIRC, we must have broken a load of URLs when we decided that %-encoding
> in URLs should always be interpreted as UTF-8 (in RFC 3986), whereas
> beforehand it depended on the charset of the page or form producing the
> link. Why did we do that? Because the new way was better for the future,
> and some breakage was acceptable to attain that goal.

Actually, I don't think we did. And the reason for that is that the
non-ASCII usage was primarily in the query string. And as it happens,
we still use the character encoding to go from code points to
percent-escaped bytes there. The IETF STD doesn't admit to
this, which is part of the reason why we have
http://url.spec.whatwg.org/ now.
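
A sketch of that encoding dependence; urllib here just makes the
byte-level difference visible (the value is illustrative):

    # The same query value percent-encodes differently depending on the
    # document's encoding, which is why query strings stayed
    # encoding-dependent rather than becoming UTF-8-only.
    from urllib.parse import quote

    print(quote("é", encoding="utf-8"))         # '%C3%A9'
    print(quote("é", encoding="windows-1252"))  # '%E9'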


--
http://annevankesteren.nl/
Mark Davis ☕
2013-08-22 11:38:35 UTC
Permalink
I think this conversation is devolving a bit. There were many controversial
topics and a huge number of exchanges during the course of the development
of IDNA2008. Dredging them up and repeating them here will not do anyone
any good, and the flood of emails just causes listeners to tune out.

So I'd like to bump up a level, and get back to the real question, which
is how to move the clients (browsers, search engines, etc.) forward.

There are three major options for clients:

1. Move immediately to IDNA2008.
2. Stay on IDNA2003.
3. Move to TR46+IDNA2008 as a transition to IDNA2008.

Recent history has shown that the major clients are reluctant to do #1
because of compatibility concerns. I don't think anyone really wants #2,
because it has an archaic Unicode version, but people are sticking with
that if they see #1 as the only other choice.

That effectively leaves #3. And certainly major players like IE have shown
that it can be deployed effectively.



Mark <https://plus.google.com/114199149796022210033>
— Il meglio è l'inimico del bene —
Anne van Kesteren
2013-08-22 12:17:58 UTC
Permalink
On Thu, Aug 22, 2013 at 12:38 PM, Mark Davis ☕ <***@macchiato.com> wrote:
> There are three major options for clients:
>
> 1 Move immediately to IDNA2008.
> 2 Stay on IDNA2003.
> 3 Move to TR46+IDNA2008 as a transition to IDNA2008.
>
> Recent history has shown that the major clients are reluctant to do #1
> because of compatibility concerns. I don't think anyone really wants #2,
> because it has an archaic Unicode version, but people are sticking with that
> if they see #1 as the only other choice.
>
> That effectively leaves #3. And certainly major players like IE have shown
> that it can be deployed effectively.

2 as deployed is not stuck on an archaic Unicode version. 3 might be
interesting, depending on what variant is chosen. E.g. Gerv has been
suggesting that we in Gecko should implement a different variety from
Internet Explorer... (I'm not a fan.)

Overall though I feel compatibility is downplayed way too much. It's
very bad to break deployed content.


--
http://annevankesteren.nl/
Andrew Sullivan
2013-08-22 15:17:10 UTC
Permalink
On Thu, Aug 22, 2013 at 01:17:58PM +0100, Anne van Kesteren wrote:
> On Thu, Aug 22, 2013 at 12:38 PM, Mark Davis ☕ <***@macchiato.com> wrote:

> > 2 Stay on IDNA2003.

> 2 as deployed is not stuck on an archaic Unicode version.

Right. 2 as deployed instead has _new_ compatibility problems as new
registries and names compatible with IDNA2008 but that don't work
correctly under IDNA2003 come online. Since that's where all the
growth in Unicode is, this position represents the trade-off of
preventing a few things breaking right now (including a number of
names that are impossible to type, like those with smileys and so on)
at the cost of breaking future things more and more, as the IDNA2003
assumption of Unicode 3.2 shows more and more strain.

It seems to me that one possible explanation for the success of IPv4
was the early willingness to say, "These early ones didn't work.
We're breaking them, even if you're using them. Sorry." I very
strongly agree that preserving compatibility is extremely important.
But if you're going to have to break things -- and given what we've
learned, _some_ stuff needs to break -- the time to do it is as soon
as possible. The problem will only be worse over time.

Best,

A

--
Andrew Sullivan
***@anvilwalrusden.com
Marc Blanchet
2013-08-22 15:32:06 UTC
Permalink
On 2013-08-22, at 11:17, Andrew Sullivan <***@anvilwalrusden.com> wrote:

> On Thu, Aug 22, 2013 at 01:17:58PM +0100, Anne van Kesteren wrote:
>> On Thu, Aug 22, 2013 at 12:38 PM, Mark Davis ☕ <***@macchiato.com> wrote:
>
>>> 2 Stay on IDNA2003.
>
>> 2 as deployed is not stuck on an archaic Unicode version.
>
> Right. 2 as deployed instead has _new_ compatibility problems as new
> registries and names compatible with IDNA2008 but that don't work
> correctly under IDNA2003 come online. Since that's where all the
> growth in Unicode is, this position represents the trade off of
> preventing a few things breaking right now (including a number of
> names that are impossible to type, like those with smileys and so on)
> at the cost of breaking future things more and more, as the IDNA2003
> assumption of Unicode 3.2 shows more and more strain.

right, as one of the authors of nameprep/stringprep and co-chair of idn (v1), we did talk about future versions of Unicode (I was the one who brought this topic up at the time), as many on this list remember. However, the basic IDNA2003 design took a more open/liberal approach to new code points. IDNA2008 is tighter about which new code points are allowed, and therefore has better rules for managing new Unicode versions.

So continuing to use IDNA2003 will put more potentially wrong code points into the system, which will result in more compatibility problems in the future.

I agree with Vint that, for the good of the Internet DNS future, we ought to move as soon as possible to IDNA2008 in order to avoid increasingly more compatibility problems in the future.

Marc.

Gervase Markham
2013-08-22 12:28:40 UTC
Permalink
On 22/08/13 13:17, Anne van Kesteren wrote:
> 2 as deployed is not stuck on an archaic Unicode version. 3 might be
> interesting, depending on what variant is chosen. E.g. Gerv has been
> suggesting that we in Gecko should implement a different variety from
> Internet Explorer... (I'm not a fan.)

You mean with regard to UseSTD3ASCIIRules? Happy to change my proposal
in that regard. See the bug.

Gerv
Anne van Kesteren
2013-08-22 12:36:56 UTC
Permalink
On Thu, Aug 22, 2013 at 1:28 PM, Gervase Markham <***@mozilla.org> wrote:
> You mean with regard to UseSTD3ASCIIRules? Happy to change my proposal
> in that regard. See the bug.

With regard to all the things that would be different, including
treatment of e.g. ß. We should provide a consistent platform to the
world, not one that keeps changing under your feet. And the
compatibility that is important here is not compatibility with
IDNA2003 per se, but with DNS as deployed and support for it in
clients.

As far as UseSTD3ASCIIRules is concerned, I haven't checked if TR46 is
safe when it comes to
https://www.w3.org/Bugs/Public/show_bug.cgi?id=23009 if you turn that
flag off. We want "_" in the output, but never "/" (should just fail).
(Currently we have "/" in the output, but that causes downstream
security bugs due to parsing the serialization yielding different
results.)


--
http://annevankesteren.nl/
Gervase Markham
2013-08-22 13:05:34 UTC
Permalink
On 22/08/13 13:36, Anne van Kesteren wrote:
> As far as UseSTD3ASCIIRules is concerned, I haven't checked if TR46 is
> safe when it comes to
> https://www.w3.org/Bugs/Public/show_bug.cgi?id=23009 if you turn that
> flag off.

AIUI, assuming we write our replacement for the STD3ASCIIRules to
disallow "/" in hostnames, we should be fine. When UseSTD3ASCIIRules is
false, "℁" (U+2101) will map to "a/s", and then the "/" will be disallowed.

TR46 section 4.1:

"If UseSTD3ASCIIRules=false, then the validity tests for ASCII
characters are not provided by the table status values, but are
implementation-dependent. For example, if an implementation allows the
characters [\u002Da-zA-Z0-9] and also the underbar (_), then it needs to
use the table values for UseSTD3ASCIIRules=false, and test for any other
ASCII characters as part of its validity criteria. *These ASCII
characters may have resulted from a mapping*: for example, a U+005F ( _
) LOW LINE (underbar) may have originally been a U+FF3F ( _ ) FULLWIDTH
LOW LINE."

(Emphasis mine.)
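
A rough Python sketch of that two-step behavior (the check_label helper is
hypothetical, not from any spec; NFKC happens to agree with the UTS46
mapping for U+2101, so it stands in for the mapping step here):

    import unicodedata

    def check_label(label):
        # Hypothetical UseSTD3ASCIIRules replacement: allow "_", but
        # disallow "/" and any other ASCII outside [-a-zA-Z0-9_].
        mapped = unicodedata.normalize('NFKC', label)
        for ch in mapped:
            if ord(ch) < 0x80 and not (ch.isalnum() or ch in '-_'):
                raise ValueError('disallowed ASCII character %r' % ch)
        return mapped

    print(check_label('foo_bar'))    # ok: the underscore survives
    try:
        check_label('\u2101')        # maps to 'a/s', so '/' is rejected
    except ValueError as exc:
        print(exc)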

Gerv
Gervase Markham
2013-08-22 12:59:55 UTC
Permalink
On 22/08/13 13:17, Anne van Kesteren wrote:
> On Thu, Aug 22, 2013 at 12:38 PM, Mark Davis ☕ <***@macchiato.com> wrote:
>> There are three major options for clients:
>>
>> 1 Move immediately to IDNA2008.
>> 2 Stay on IDNA2003.
>> 3 Move to TR46+IDNA2008 as a transition to IDNA2008.
>>
>> Recent history has shown that the major clients are reluctant to do #1
>> because of compatibility concerns. I don't think anyone really wants #2,
>> because it has an archaic Unicode version, but people are sticking with that
>> if they see #1 as the only other choice.
>>
>> That effectively leaves #3. And certainly major players like IE have shown
>> that it can be deployed effectively.
>
> 2 as deployed is not stuck on an archaic Unicode version.

Are you sure that "as deployed" is interoperable, or have different
browsers done the "add new Unicode to IDNA2003" step differently?

Have you been arguing for 2 because you don't want 1? I'm not sure
anyone's been arguing for 1. It's always been about 3.

Gerv
Anne van Kesteren
2013-08-22 15:11:15 UTC
Permalink
On Thu, Aug 22, 2013 at 1:59 PM, Gervase Markham <***@mozilla.org> wrote:
> Are you sure that "as deployed" is interoperable, or have different
> browsers done the "add new Unicode to IDNA2003" step differently?

Relatively certain, though I've not tested extensively. Unassigned
code points are allowed, so for that Unicode 3.2 does not matter. The
other case where Unicode 3.2 matters is normalization. Browsers just
use their internal NFKC algorithm for that, which is not bound to any
particular version of Unicode, it's whatever the latest version of
Unicode is they implement.
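For example, Python's unicodedata tracks a recent Unicode version, so it
shows the kind of post-3.2 NFKC mapping I mean (U+1F21A was only added in
Unicode 5.2):

    import unicodedata

    print(unicodedata.unidata_version)  # whatever Unicode your Python ships
    # U+1F21A SQUARED CJK UNIFIED IDEOGRAPH-7121 did not exist in Unicode
    # 3.2, but a latest-Unicode NFKC maps it to U+7121:
    print(unicodedata.normalize('NFKC', '\U0001F21A') == '\u7121')  # True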


> Have you been arguing for 2 because you don't want 1? I'm not sure
> anyone's been arguing for 1. It's always been about 3.

I argued for 1 because I've previously gotten signals from Apple &
Google that they don't see much benefit in moving. It seems in the
case of Google this might have been incorrect. It's also still unclear
to me what the drawback of IDNA2003 is given existing practice. What
Vint Cerf keeps saying is true, IDNA2003 is bad because it relies on
Unicode 3.2, but I don't think IDNA2003 as written is what's under
discussion here, which makes matters confusing. What matters is
IDNA2003 as implemented and deployed throughout the DNS.


On Thu, Aug 22, 2013 at 2:05 PM, Gervase Markham <***@mozilla.org> wrote:
> AIUI, assuming we write our replacement for the STD3ASCIIRules to
> disallow "/" in hostnames, we should be fine. When UseSTD3ASCIIRules is
> false, "℁" (U+2101) will map to "a/s", and then the "/" will be disallowed.

I think we should write the actual rules in the standard rather than
have each implementer come up with his own UseSTD3ASCIIRules
replacement. The standard should be fully deterministic. Exact
algorithms from a /domain name/ to an /ASCII domain name/ and a
/Unicode domain name/. As well as when either would return failure.
I.e. the rules we want the URL parser to use (not necessarily the
address bar I suppose, that can be "magic").
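Something of this shape, in other words (a deliberately over-simplified
Python sketch of "IDNA2003 as deployed": case folding plus latest-Unicode
NFKC, then Punycode; a real algorithm would also need the prohibition
tables, bidi rules, and explicit failure cases):

    import unicodedata

    def domain_to_ascii(label):
        # Map first (case fold + latest-Unicode NFKC), then encode.
        mapped = unicodedata.normalize('NFKC', label.casefold())
        if all(ord(ch) < 0x80 for ch in mapped):
            return mapped
        return 'xn--' + mapped.encode('punycode').decode('ascii')

    print(domain_to_ascii('FöO'))  # an xn-- A-label for 'föo'
    print(domain_to_ascii('FOO'))  # 'foo'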


--
http://annevankesteren.nl/
Andrew Sullivan
2013-08-22 16:26:27 UTC
Permalink
On Thu, Aug 22, 2013 at 04:11:15PM +0100, Anne van Kesteren wrote:
> discussion here which makes matters confusing. What matters is
> IDNA2003 as implemented and deployed throughout the DNS.

Except it's _not_ deployed throughout the DNS. The ASCII form is
what's in the DNS. For the overwhelming majority of cases of valid,
actually deployed IDNA2003 labels that we have ever found, there will
be no change. And the applications are still doing the work of
translating those labels to Unicode.

IDNA2008 is supposed not only to reduce the number of code points that
are permitted by the protocol. Among other things, it's also designed
to improve the underlying normalization (NFC, which is better for
these purposes than NFKC according to UTC documents); to permit the
use of certain joiners that our Arabic-script using colleagues insist
are extremely important to them (you should hear the reaction when I
tell Arabic-using people that browsers aren't planning to do IDNA2008
yet); to ensure that every U-label has exactly one A-label and
conversely (which is not true under IDNA2003); and still to make
possible the kind of mapping that is required in IDNA2003 while yet
permitting more locale-sensitive treatment in the unusual cases where
that is appropriate.
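The round-trip point is easy to see with Python's built-in codec, which
implements the IDNA2003 rules (nameprep folds ß to "ss", so two different
U-forms collapse onto one A-label):

    # Python's built-in 'idna' codec is IDNA2003/nameprep:
    print('faß'.encode('idna'))    # b'fass'
    print('fass'.encode('idna'))   # b'fass' -- the same label
    print(b'fass'.decode('idna'))  # 'fass': the ß spelling is unrecoverable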

Given the places the Internet is growing, and if we assume that domain
names will continue to be at all important, the number of IDNs
actually deployed today is a tiny percentage of what it will be in the
near future, especially as more IDN TLDs come online. We need to fix
the known issues before it really is absolutely too late to do
anything.

Best,

A
--
Andrew Sullivan
***@anvilwalrusden.com
John C Klensin
2013-08-23 12:25:05 UTC
Permalink
--On Thursday, August 22, 2013 12:26 -0400 Andrew Sullivan
<***@anvilwalrusden.com> wrote:

> On Thu, Aug 22, 2013 at 04:11:15PM +0100, Anne van Kesteren
> wrote:
>> discussion here which makes matters confusing. What matters is
>> IDNA2003 as implemented and deployed throughout the DNS.
>
> Except it's _not_ deployed throughout the DNS. The ASCII-form
> is what's in the DNS. For the overwhelming majority of cases
> of valid, actually deployed IDNA2003 labels that we have ever
> found, there will be no change. And the applications are
> still doing the work of translating those labels to Unicode.
>...

Let me add a bit to this and see if I can make a useful
suggestion.

When the IDNA2003 discussions were occurring, the main rationale
for the various mappings (CaseFolding, NFKC, etc.) was precisely
what Anne mentioned early in the thread -- to give the users
what they would expect if, e.g., they typed FöO.example.com
rather than föo.example.com. IDNA2008 (especially RFC 5895 and
arguably UTR 46) are consistent with that view about user typing
and the user experience.

The place where this gets knotty is that, whether it got written
down or not, there was a general expectation among most of the
IDNA2003 participants that "real" canonical-form URLs -- the
stuff that gets transmitted between systems, would appear in
hrefs, etc. -- would have their domain components in
ASCII-encoded form, matching what, as Andrew notes, is deployed
in the DNS. From that ASCII-encoded and DNS perspective, things
like Eszett are non-problems because it simply could not be
encoded under IDNA2003 -- it could be mapped to "ss" from user
input, but there was no way that ToUnicode(string) could even
produce a label containing one -- Punycode-encoded strings that
could include a representation of a Eszett character could not
exist prior to IDNA2008, so, from the DNS point of view, their
addition wasn't even an incompatible change.
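Concretely, in Python (the standard-library codec implements IDNA2003; the
third-party idna package implements IDNA2008, and faß.de is the standard
ß example):

    import idna  # third-party IDNA2008 implementation (pip install idna)

    print('faß.de'.encode('idna'))  # b'fass.de': IDNA2003 maps ß away
    print(idna.encode('faß.de'))    # b'xn--fa-hia.de': IDNA2008 encodes ß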

Again from that perspective, where we got into trouble was that
browsers, presumably responding to the demands of page authors,
not only allowed native-character domain name labels in URLs but
even allowed the non-canonical forms. People took advantage of
that, as they will, and we ended up where we are today. But
that isn't an IDNA2008 problem because, from a good practices
standpoint, it, especially having non-canonical forms and
depending on mapping, was a bad idea even for IDNA2003. On the
DNS registration side, several parties took advantage of the
mappings and sold/ delegated native-character labels that could
note be mapped back from their Punycode-encoded forms -- another
thing that was clearly a bad practice at the time, but they were
no more deterred than some page authors (and email users, btw)
were.

Suggestions, at least as a starting point for some discussion:

(1) Move toward IDNA2008 terminology. We got rid of the
IDNA2003 terminology because it just got too clumsy when people
tried to be unambiguous about what they were talking about. In
the process, stop thinking about "IDNA2003 without Unicode
version restrictions". While the intent is clear, as others
have pointed out, that phrase can be used to describe enough
different things to be a potential source of interoperability
problems. As noted below, while IDNA2008 terminology is
necessary, it may not be sufficient. Note that this suggestion
doesn't require that anyone do anything different, only that we
change how we talk about it.

(2) For those who don't already, try to understand the reasons
for moving away from IDNA2003 rather than just saying "lots of
people are still using it" (whether that is correct or not).
Several of those reasons have been pointed out in this
discussion. For the benefit of those who didn't see it in this
multiple-list discussion, Olaf Kolkman recently reminded those
on the IDNA-update list about the discussion in RFC 4690,
especially Section 5.3,
http://tools.ietf.org/html/rfc4690#section-5.3.

(3) For strings that are valid under both IDNA2003 and IDNA2008,
try to remember in our various conversations that what has often
been called "preserving backward compatibility" or "preserving
IDNA2003 behavior" is also "ignoring what the document or user
specified and doing something else instead".

(4) Define a canonical form for the domain name part of a URL
and specify its use wherever that is feasible from a production
and user interface standpoint. For closeness to the DNS and
what actually appears there, that means that IDNs appear as
A-labels. If you decide you need to support native character
forms (as encoded UTF-8 or in IRIs) for whatever reason,
possibly including the considerations of RFC 6055, the canonical
form should allow IDNs only as U-labels. Noting that things like
certificates and their DNS analogues aren't, in general, going
to work with strings that require mappings to get to labels,
U-labels (and A-labels) are always safe and unambiguous, even
where other things might be plausible.

(5) For input from users, existing documents, etc., you will
almost certainly need support for a certain amount of mapping
(even if only case folding where that is appropriate).
Encourage designs that keep that as local as possible, i.e.,
that involve early conversions to U-labels and retention of the
U-labels. Then borrow from some of this thread or the comment
about flags in UTR46 and consider when and how aggressively to
warn whomever is relevant that depending on those mappings is
dangerous and may lead to trouble. Personally, I'd favor being
much more aggressive with page authors than with users and would
leave those who don't have much control over what is actually
going on to their own devices. Gerv and others may have better
ideas.

(6) Search engines and other things that return links should
return only canonical forms as discussed in (4) when those are
possible. Obviously, it isn't for strings that are disallowed
entirely, but this is important as a "get the users used to it"
transition step for strings that map into valid U-labels. There
is little reason for them to try to preserve forms that require
mapping, even if they found a particular resource by going
through a link that did. Similarly, when a domain name is
displayed back to a user, it should be displayed in canonical
form with either A-labels or U-labels. If that isn't what the
user typed, the difference can be a small security clue and
source of education for users who are paying attention. I
believe that some systems are doing those things already.

(7) IMO, UTR46 needs some work. The suggestions above lay the
foundation for what I believe is the most important substantive
piece of that work, and complement Mark's recent notes. I
believe that UTR46 is in need of serious discussion of when it
is plausible to shut off the "transition" machinery. Mark's
recent notes provide most of the information and text that I
believe need to be in the spec itself. It is almost trivial by
comparison, but I think it should contain some strong language
explaining why it is unreasonable to claim conformance with or
application of UTR46 without a statement as to which (if any)
transition mechanisms are being applied (e.g., whether a domain
name containing Eszett, ZWJ, or ZWNJ will be looked up or
changed into something else that the user didn't specify). I'll
respond separately to some of the details of those notes, but
want to start with the observation that my thinking, at least,
has evolved considerably in the last three or four years and
that I think we are now quibbling about details rather than
having major disagreements.

best,
john





Jungshik SHIN (신정식)
2013-08-23 05:59:20 UTC
Permalink
On Thu, Aug 22, 2013 at 4:11 AM, Anne van Kesteren <***@annevk.nl> wrote:

> On Thu, Aug 22, 2013 at 12:02 PM, Gervase Markham <***@mozilla.org>
> wrote:
> > It's not been possible to register names like ☺☺☺.com for some time now;
> > that's a big clue.
>
> I don't think it is. There are sites out there that rely on underscores
> working in subdomains. You cannot register a domain name with an underscore.
>
>
> > (Are your friends really using http://xn--74h.example.com/ ?)
>
> Yeah (with "example" replaced). Renders fine in Safari, too.
>
>
> > IIRC, we must have broken a load of URLs when we decided that %-encoding
> > in URLs should always be interpreted as UTF-8 (in RFC 3986), whereas
> > beforehand it depended on the charset of the page or form producing the
> > link. Why did we do that? Because the new way was better for the future,
> > and some breakage was acceptable to attain that goal.
>
> Actually, I don't think we did. And the reason for that is that the
> non-ASCII usage was primarily in the query string.


Well, there are tons of URLs whose path parts have non-ASCII characters.
They're very common in Korea, for instance.


Gervase Markham
2013-08-23 10:03:49 UTC
Permalink
On 23/08/13 10:49, Vint Cerf wrote:
> If we go down the IDNA2008 + TR46 path, I think we ought to be very
> explicit about a date certain to drop TR46 treatment so as to eliminate
> mapping and to instantiate the uniqueness properties of IDNA2008. Is
> that possible and what timeframe makes sense? The longer we wait, the
> harder it will be to get there.

What exactly do you mean by "TR46 treatment"?

As Mark says, TR46 is 3 things:

1) Different treatment for the 4 deviations (ZWJ, ZWNJ, eszett, sigma)
2) Allowance of non-letters
3) A mapping, e.g. casefolding

The use of a mapping such as the one defined in 3) is permitted by
IDNA2008; I would expect that mapping to continue being used
indefinitely, as a valid part of the system.
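For reference, the third-party Python idna package exposes that mapping as
a separate step. I believe the call looks like the following, though treat
the exact API names as an assumption:

    import idna  # pip install idna

    # The TR46 map (casefolding, NFC, deviation handling) on its own:
    print(idna.uts46_remap('FöO.Example', std3_rules=False,
                           transitional=False))  # 'föo.example'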

So are you asking about when we should drop 1) and 2)?

We should consider the option of not tying the two together.

Gerv
Mark Davis ☕
2013-08-23 10:19:02 UTC
Permalink
There are two different issues.

A. The mapping is purely a client-side issue, and is allowed by IDNA2008.
So that is not a problem for compatibility.

The most important feature of 'no mapping' IMO is on the registry side: to
make certain that registries either disallow mapping during the
registration process, or that they very clearly show that the resulting
domain name is different than what the user typed. While an orthogonal
issue to the client-side we're discussing here, it is worth a separate
initiative.


B. The transitional incompatibilities are:

1. Non-letter support
2. 4 deviation characters

Both of these are just dependent on registry adoption. The faster that
happens, the shorter the transition period can be. Note the transition for
each of these is independent, and can proceed on a different timescale.
Moreover, terminating the transition period doesn't need all registries to
buy in.

1. The TR46 non-letter support can be dropped in clients once the major
registries disallow non-IDNA2008 URLs. I say URLs, because the registries
need to not only disallow them in SLDs (eg http://☃.com), they *also*
need to forbid their subregistries from having them in Nth-level domains
(that is, disallow http://☃.blogspot.ch/ = xn--n3h.blogspot.ch).
2. The TR46 deviation character support can be dropped in clients once
the major registries that allow them provide a bundle or block approach to
labels that include them, so that new clients can be guaranteed that URLs
won't go to a different location than they would under IDNA2003. The
bundle/block needs to last while there are a significant number of IDNA2003
clients out in the world. Because newer browsers have automatic updates,
this can be far faster than it would have been a few years ago.



Mark <https://plus.google.com/114199149796022210033>
— Il meglio è l’inimico del bene —


Mark Davis ☕
2013-08-23 11:27:05 UTC
Permalink
All of the 4 characters are important and deserve support. And I agree that
they are a priority.

For the transition, TR46 supports the desired display for users, which is
far more important than distinguishing two different sites. That is, if you
type "größer.at" you'll see "größer.at" as the display in your address bar,
and if you type "grösser.at" you'll see "grösser.at" as the display. They
will both go to the same IDNA2003 address (xn--grsser-xxa.at) until we can
flip off the transitional bit.

(Note that we might be able to get general agreement among major clients to
support these on a per-TLD basis. So if .AT bundle/blocked ß in all of its
subdomains, then ß could be allowed in any .at domain name.)
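In code terms, the transitional flag is what makes the two spellings
collide (sketched with the third-party Python idna package; the flag names
are as I understand that API, so treat them as an assumption):

    import idna  # pip install idna

    # Transitional processing folds ß to ss, so both spellings resolve to
    # the same (IDNA2003-compatible) A-label:
    print(idna.encode('größer.at', uts46=True, transitional=True))
    print(idna.encode('grösser.at', uts46=True, transitional=True))
    # Non-transitional processing preserves ß, giving a different A-label:
    print(idna.encode('größer.at', uts46=True, transitional=False))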


Mark <https://plus.google.com/114199149796022210033>
— Il meglio è l’inimico del bene —


Shawn Steele
2013-08-23 17:05:42 UTC
Permalink
> (Note that we might be able to get general agreement among major clients to support these on a per-TLD basis. So if .AT bundle/blocked ß in all of its subdomains, then ß could be allowed in any .at domain name.)

I’m not going to be in favor of doing it on a per-TLD basis. Too confusing.
Gervase Markham
2013-08-23 13:15:05 UTC
Permalink
On 23/08/13 11:19, Mark Davis ☕ wrote:
> 1. The TR46 non-letter support can be dropped in clients once the major
> registries disallow non-IDNA2008 URLs. I say URLs, because the
> registries need to not only disallow them in SLDs (eg http://☃.com),
> they /also/ need to forbid their subregistries from having them in
> Nth-level domains (that is, disallow http://☃.blogspot.ch/ =
> xn--n3h.blogspot.ch).

This is not my area of expertise, but I am not aware of a registry which
attempts to define by contract what their customers may or may not put
into the DNS "below" the domain they have purchased.

The way to make such domains not exist is for them to first not work in
browsers; I'm not sure we can do it the other way around.

> 2. The TR46 deviation character support can be dropped in clients once
> the major registries that allow them provide a bundle or block
> approach to labels that include them, so that new clients can be
> guaranteed that URLs won't go to a different location than they
> would under IDNA2003. The bundle/block needs to last while there are
> a significant number of IDNA2003 clients out in the world. Because
> newer browsers have automatic updates, this can be far faster than
> it would have been a few years ago.

I would be greatly blessed if someone were to put together four lists,
one for each deviation character, of registries which allow that
character, their current approach to the compatibility risk (bundling,
blocking, nothing etc.), and their opinion (if they have one) on whether
or not IDNA2008 is the way to go.

Gerv
Cary Karp
2013-08-23 15:17:38 UTC
Permalink
Quoting Gerv:

> On 23/08/13 11:19, Mark Davis ☕ wrote:
>> 1. The TR46 non-letter support can be dropped in clients once the major
>> registries disallow non-IDNA2008 URLs. I say URLs, because the
>> registries need to not only disallow them in SLDs (eg http://☃.com),
>> they /also/ need to forbid their subregistries from having them in
>> Nth-level domains (that is, disallow http://☃.blogspot.ch/ =
>> xn--n3h.blogspot.ch).
>
> This is not my area of expertise, but I am not aware of a registry which
> attempts to define by contract what their customers may or may not put
> into the DNS "below" the domain they have purchased.

They don't because they can't; conceivably one level down but
certainly not Nth-level. The concept itself is arguably antithetical
to one of the fundamental attributes of the DNS but there is no
practicable way to implement such a constraint, anyway. As a timeline
effect, condition 1 is identical to never.

> The way to make such domains not exist is for them to first not work in
> browsers; I'm not sure we can do it the other way around.

Right.

/Cary
John C Klensin
2013-08-23 15:46:41 UTC
Permalink
--On Friday, August 23, 2013 14:15 +0100 Gervase Markham
<***@mozilla.org> wrote:

> On 23/08/13 11:19, Mark Davis ☕ wrote:
>> 1. The TR46 non-letter support can be dropped in clients
>> once the major registries disallow non-IDNA2008 URLs. I say
>> URLs, because the registries need to not only disallow
>> them in SLDs (eg http://☃.com), they /also/ need to
>> forbid their subregistries from having them in Nth-level
>> domains (that is, disallow http://☃.blogspot.ch/ =
>> xn--n3h.blogspot.ch).
>
> This is not my area of expertise, but I am not aware of a
> registry which attempts to define by contract what their
> customers may or may not put into the DNS "below" the domain
> they have purchased.

Gerv,

At least historically, I am aware of such registries. In the
pre-ICANN period, Section 3 of RFC 1591 contained the statement

"Most of these same concerns are relevant when a
sub-domain is delegated and in general the principles
described here apply recursively to all delegations of
the Internet DNS name space."

which was intended to make the sort of relationship we need here
just about mandatory. My vague recollection is that ICANN, in
its early days, attempted to impose similar "recursive
application" requirements in its contracts with registries.
That effort largely floundered for the delegation-only
registries that are probably a superset of what Mark considers
"major" because of the difficulties with imposed requirements
and enforcement (especially with ccTLDs but, in practice, with
many gTLDs as well). I note in particular that, as far as I can
tell, the Applicant Guidebook does not impose any such
requirement on current-round new gTLD applicants, implying that
it is already too late to effectively "forbid" much of anything.

On the other hand, within enterprise-level domains (those whose
subdomains make up the FQDN case), my experience has been that
naming conventions and restrictions of various sorts are both
common and enforced. Certainly not in all cases, but in enough
to be significant.

> The way to make such domains not exist is for them to first
> not work in browsers; I'm not sure we can do it the other way
> around.

That is precisely the chicken-and-egg problem I referred to in
my earlier note. If nothing else, a browser-first approach has
the advantage of having to convince under a dozen implementer
communities while getting most registries (including zone
administrators deep in the tree) to behave in a particular way
requires convincing perhaps hundreds of millions of entities,
most of whom are not following these lists (and most of whom
don't care about issues that extend beyond their local languages
and scripts).



--On Friday, August 23, 2013 10:19 -0400 John Cowan
<***@mercury.ccil.org> wrote:

> Mark Davis ☕ scripsit:
>
>> *also* need to forbid their subregistries from having them in
>> Nth-level domains
>> (that is, disallow http://☃.blogspot.ch/ =
>> xn--n3h.blogspot.ch).
>
> Through what technical or social means would that be arranged?
> TLD registries have never had any control over their
> subregistries' use of names that I know of; I should think it
> would have to be implemented by contract between the registry
> and the subregistry, and many existing subregistries might
> well balk.

As noted above, "never" is too strong. But, yes, in today's
world, contracts would be required and we already have empirical
experience with the "balking" part.

As Gerv suggests, the most effective mechanism involves
developers of browsers (and other applications that use the DNS)
making the transition in some way and thereby causing bottom-up
pressure on registries to avoid doing things that can cause name
conflicts or reference ambiguity.

That could take many forms. While I hate the idea of requiring
dual lookups, a browser that wished to be extra-careful about
conflicting names could look up both interpretations and, if it
found more than one (or two with different RR Sets), could
reasonably come back to the user with a "this may be a problem
or an attack, which one did you really want?" message, perhaps
even with a "if you don't like this story, complain to the
registry". Especially for FQDNs, that would be far more
effective and more reliable than waiting on the registries and
hoping that they are all doing what we would like them to do.
(Yes, I understand the implementation problems with this,
especially when there is no guarantee that all zones in a
particular tree will consistently use one interpretation. But
it would still be easier than convincing millions (or hundreds
of millions) of zone administrators and tabulating their status.)
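A sketch of what such a dual lookup might look like (hypothetical helper
in Python, using the standard library plus the third-party idna package;
error handling and real RRset comparison elided):

    import socket
    import idna  # pip install idna

    def careful_lookup(name):
        # Resolve both interpretations of the same native-character name.
        forms = {
            'IDNA2003': name.encode('idna').decode('ascii'),
            'IDNA2008': idna.encode(name).decode('ascii'),
        }
        results = {}
        for rules, alabel in forms.items():
            try:
                infos = socket.getaddrinfo(alabel, None)
                results[rules] = sorted({info[4][0] for info in infos})
            except socket.gaierror:
                results[rules] = None
        if results['IDNA2003'] != results['IDNA2008']:
            print('possible conflict or attack:', forms)
        return results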

best,
john
Gervase Markham
2013-08-23 16:44:46 UTC
Permalink
On 23/08/13 14:15, Gervase Markham wrote:
> I would be greatly blessed if someone were to put together four lists,
> one for each deviation character, of registries which allow that
> character, their current approach to the compatibility risk (bundling,
> blocking, nothing etc.), and their opinion (if they have one) on whether
> or not IDNA2008 is the way to go.

And, while I'm wishing, a write-up of how the current major web browsers
handle the four deviation characters.

Gerv
John Cowan
2013-08-23 14:19:12 UTC
Permalink
Mark Davis ☕ scripsit:

> *also* need to forbid their subregistries from having them in Nth-level
> domains
> (that is, disallow http://☃.blogspot.ch/ = xn--n3h.blogspot.ch).

Through what technical or social means would that be arranged? TLD
registries have never had any control over their subregistries'
use of names that I know of; I should think it would have to be
implemented by contract between the registry and the subregistry,
and many existing subregistries might well balk.

--
A rose by any other name may smell as sweet, but if you called it
an onion you'd get cooks very confused. --RMS
John Cowan    http://www.ccil.org/~cowan    ***@ccil.org
Vint Cerf
2013-08-23 11:01:36 UTC
Permalink
Mark,

thanks for the refinement. Is it possible that the browser makers would
agree to discuss a plan and schedule to achieve these various objectives?
My impression regarding the 4 deviation characters is that the Arabic users
would benefit from IDN2008 treatment of the zero width characters so that
seems to have some priority. The sharp-S continues to foster debate but it
is a legitimate character and is used in normal cases and seems to deserve
support. The trailing sigma question continues to produce controversy
although the IDNA2008 committee finally concluded it deserved support.

vint


Andrew Sullivan
2013-08-23 15:23:10 UTC
Permalink
On Fri, Aug 23, 2013 at 12:19:02PM +0200, Mark Davis ☕ wrote:

> registries disallow non-IDNA2008 URLs. I say URLs, because the registries
> need to not only disallow them in SLDs (eg http://☃.com), they
> *also* need to forbid their subregistries from having them in Nth-level
> domains
> (that is, disallow http://☃.blogspot.ch/ = xn--n3h.blogspot.ch).

This isn't something that they do today. Indeed, there is nothing to
prevent a site from putting a label there that is just the relevant
raw UTF-8 bits. The thing we use to avoid this is "it doesn't work".

In a different context, Dennis Jennings has been arguing for similar
rules as well, and it's a mistake. We do not _want_ deep labels to
have to follow the same rules as for names at the second or third
levels. For instance, we want top-level domain registries to permit
only LDH-labels (of some sort, including A-labels). But LDH-labels
don't include underscores. Does that mean that we'd want to ban (say)
SRV or DKIM TXT records? I think not.

The DNS is not a global database with consistent policy. That's a
deep down design feature, not a bug, and if people think that it
_needs_ to have a consistent policy, then we need a different naming
system.

Best,

A

--
Andrew Sullivan
***@anvilwalrusden.com
Vint Cerf
2013-08-23 09:49:46 UTC
Permalink
If we go down the IDNA2008 + TR46 path, I think we ought to be very
explicit about a date certain to drop TR46 treatment so as to eliminate
mapping and to instantiate the uniqueness properties of IDNA2008. Is that
possible and what timeframe makes sense? The longer we wait, the harder it
will be to get there.

v
John C Klensin
2013-08-23 15:13:09 UTC
Permalink
--On Friday, August 23, 2013 12:19 +0200 Mark Davis ☕
<***@macchiato.com> wrote:

> There are two different issues.
>
> A. The mapping is purely a client-side issue, and is allowed
> by IDNA2008. So that is not a problem for compatibility.

Agreed, with a few qualifications. First, for reasons
explained by others in this thread, IDNA2008 allows mapping to
correspond to well-understood local needs. Global and
non-selective use of the same mapping in every instance of a
particular browser, or by all browsers, is inconsistent with
that intent. That distinction is purely philosophical in the
vast majority of cases but may be quite important to the
exceptions; we should not lose track of it. Second, UTR46 uses
the terms "ToASCII" and "toUnicode" to describe operations that
are subtly different from the "ToASCII" and "ToUnicode" of
IDNA2003. That invites a different type of confusion and claims
of compatibility where interoperation doesn't exist. IMO, UTR46
and our general situation would benefit from changes in that
terminology. In addition, while a large and important fraction
of IDNA2003's Nameprep profile of StringPrep is identical to
NFKC compatibility mapping, that is NFKC mapping as of Unicode
3.2. Even if one uses UTR46 or some other set of rules to
preserve the Unicode 3.2-based NFKC mappings, it would probably
be appropriate to have a serious discussion of whether the needs
of the user and implementer communities are better served by
applying NFKC (and potentially Case Folding) to characters added
after Unicode 3.2. In addition to the purely theoretical
concerns, NFKC maps certain little-used Han characters onto
others. IDNA2008 disallows those characters, leaving the option
of permitting some of them (with little disruption) open in the
future if the language communities are convinced that they are
important. Mapping them out as soon as they appear in Unicode
would then leave us with a new version of the Eszett problem as
well as the risk that IDNA and UTR46 would diverge on how they
are handled.
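One concrete example of the kind of character at issue (Kangxi radicals
are such a case: NFKC maps them onto unified ideographs, while IDNA2008
simply disallows them):

    import unicodedata

    # U+2F00 KANGXI RADICAL ONE has a compatibility mapping to U+4E00:
    print(unicodedata.normalize('NFKC', '\u2F00') == '\u4E00')  # True
    # A blanket latest-Unicode NFKC silently rewrites it; IDNA2008 instead
    # leaves it disallowed, so it could still be permitted later if needed.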

> The most important feature of 'no mapping' IMO is on the
> registry side: to make certain that registries either disallow
> mapping during the registration process, or that they very
> clearly show that the resulting domain name is different than
> what the user typed. While an orthogonal issue to the
> client-side we're discussing here, it is worth a separate
> initiative.

Agreed. Most of that initiative has been underway since before
IDAN2008 was approved although application to FQDNs raises other
issues (see below).

> B. The transitional incompatibilities are:
>
> 1. Non-letter support
> 2. 4 deviation characters
>
> Both of these are just dependent on registry adoption. The
> faster that happens, the shorter the transition period can be.
> Note the transition for each of these is independent, and can
> proceed on a different timescale. Moreover, terminating the
> transition period doesn't need all registries to buy in.

Good. The question is how many. I wish, probably more often
than most, that the situation was still as it was when (and
before) RFC 1591 was published in 1994 and it was realistic to
believe that a requirement could be imposed, top-down and
recursively, on all DNS nodes. That situation no longer exists,
with decisions being made on the basis of short-term economic
interests (including the costs of trying to monitor and enforce
rules) and others being made because "registries" (zone
administrators) are too busy with other priorities to pay
attention. That, in turn, leaves us with a nasty
chicken-and-egg problem: from one point of view, it is easy to
say "transition ends when most of the registries enforce the
IDNA2008 rules". From another, the problem looks more like
"most registries will enforce the rules only when not doing so
becomes painful, i.e., when their users/customers complain that
the names they are using are not predictably accessible". If we
end up with an environment in which everyone is waiting for
everyone else, the losers are the users of the Internet.

> 1. The TR46 non-letter support can be dropped in clients
> once the major registries disallow non-IDNA2008 URLs. I say
> URLs, because the registries need to not only disallow them
> in SLDs (eg http://☃.com), they *also*need to forbid their
> subregistries from having them in Nth-level domains
> (that is, disallow http://☃.blogspot.ch/ =
> xn--n3h.blogspot.ch).

See above. It is a reality of our current situation that
"forbidding" for the DNS is ineffective, just as an effort by
IETF to "require" conformance to its standards or one by the
Unicode consortium to "forbid" applications from designing and
quietly adopting and applying a fifth normalization form would
be ineffective. We can, at most, try to persuade.

Also, as part of my mini-campaign for consistent terminology and
its consistent use, the DNS community would describe what you
are talking about as fully-qualified domain names (FQDNs) in the
domain-part of URLs. When you use the term "URL" instead, you
include the path, query, and fragment parts of URLs. As others
have pointed out, the use of non-ASCII characters is popular in
those tail elements in many parts of the world and queries can,
and often do, contain domain names. To the extent that is a
problem, it is not our problem -- neither IDNA2008 (including
RFC 5895) nor UTR 46 addresses it.

> 2. The TR46 deviation character
> support can be dropped in clients once the major registries
> that allow them provide a bundle or block approach to
> labels that include them, so that new clients can be
> guaranteed that URLs won't go to a different location than
> they would under IDNA2003. The bundle/block needs to last
> while there are a significant number of IDNA2003 clients
> out in the world. Because newer browsers have automatic
> updates, this can be far faster than it would have been a
> few years ago.

As a strategy, I believe that "bundle or block" is the right
thing to do and that it would be better to not have similar
FQDNs that identify different systems ("go to different
locations" is a little web-specific for my taste). However,
that is part of the far more general set of "similarity",
"confusability", and "variant" problems that continue to tie
ICANN in knots. Viewing the handful of "deviation characters"
as special involves picking out a tiny fraction of the problem
and assuming it is worth solving separately. Many of the
entities that have to deal with the whole system, including
ICANN and many "major registries", just don't see things that
way because they see any general adoption of "bundle or block"
rules as involving important economic and user demand tradeoffs,
not as a technical matter associated with IDNA2003 -> IDNA2008
transition.

I have one other major concern about UTR 46. More because of
the way it is written, with its own tables and operations and
use of IDNA2003 terminology, rather than its intent, it can
easily be interpreted as a substitute for IDNA2008 (with the
latter used only as a final check on label validity) rather than
a mapping and transitional add on for it. Since many of us seem
to be in agreement that it should ultimately be a collection of
IDNA2008-conformant mapping rules, it seems to me that the
specification would be stronger if it were constructed more as a
"migrating to IDNA2008" one than as a "migrating [reluctantly?]
away from IDNA2003" one. Changing the terminology and tone a
bit could go a long way in that direction.

Again, I see most of these issues as being more about details
and presentation than about fundamentals. If Mark were
interested in forming a small editorial group to make changes
along the lines I've outlined, and thought it would be useful,
I'd be happy to join in the effort.

best,
john
Gervase Markham
2013-08-23 16:49:15 UTC
Permalink
On 23/08/13 16:13, John C Klensin wrote:
> Agreed, with a few qualifications. First, for reasons
> explained by others in this thread, IDNA2008 allows mapping to
> correspond to well-understood local needs. Global and
> non-selective use of the same mapping in every instance of a
> particular browser, or by all browsers, is inconsistent with
> that intent. That distinction is purely philosophical in the
> vast majority of cases but may be quite important to the
> exceptions; we should not lose track of it.

I'm happy to maintain the distinction, while noting that "if it works in
one place, it works everywhere" has been a touchstone of Firefox's
implementation of IDN. (This is why, for example, we have avoided the
idea that a particular IDN domain name's functioning or Unicode display
should depend on which languages are installed on your computer.)

Gerv
Shawn Steele
2013-08-23 19:25:40 UTC
Permalink
> 2. The TR46 deviation character
> support can be dropped in clients once the major registries
> that allow them provide a bundle or block approach to labels that
> include them, so that new clients can be
> guaranteed that URLs won't go to a different location than
> they would under IDNA2003. The bundle/block needs to last
> while there are a significant number of IDNA2003 clients
> out in the world. Because newer browsers have automatic
> updates, this can be far faster than it would have been a
> few years ago.

That's not necessarily true. If someone's subdomain uses an IDNA2003 domain name that lands on a different machine, the client applications will get the bug reports that they broke the customer when that customer's URL stops working. Blocking at the registry could have a similar issue (if the user wants to move from one to another). Also, we cannot enforce that the registrars bundle-or-block.

However, if the registrars DO bundle the names with the IDNA2003 form (other variations could be blocked), and the client software continues to resolve using the transitional rules for these deviation characters, the user will always land on the right machine, no matter which way they enter the name. The "worst" problem could be that the address bar looks a little different from what the user entered, which could be considered a UI problem. And if the clients really thought that UI issue was worth fixing, they could ensure that they didn't map the form the users entered for these few characters.

-Shawn
Mark Davis ☕
2013-08-24 12:40:30 UTC
Permalink
There's been a flurry of activity on this list. I'm on vacation, and won't
be able to respond much for a bit, but I'll make just a couple of brief
comments.

With reference to your comments below, I think that many people's views
have evolved in the last four years. I'm sure that the Unicode Consortium
would be glad to work together on improving UTS46. As you say, we are in a
bit of a chicken-and-egg situation between registries and browsers, so a
clearer path forward to IDNA2008 would be great. (And in retrospect, I so
wish that IDNA2003 had been built along the IDNA2008 architecture—would
have saved us all so much pain!)

The key is an effective transition plan for #2/#3.
I put out some strawman ideas on this list, but clearly there needs to be
more discussion. I think everyone recognizes that we won't get to zero
"breaking" IDNA2003 URLs; the goal should be to get to a small enough
number that the major players feel comfortable flipping the switch on the
remaining ones.

Back on Sept 9.

Mark


John C Klensin <***@jck.com> wrote:
[snip...]

(7) IMO, UTR46 needs some work. The suggestions above lay the
foundation for what I believe is the most important substantive
piece of that work, and complement Mark's recent notes. I
believe that UTR46 is in need of serious discussion of when it
is plausible to shut off the "transition" machinery. Mark's
recent notes provide most of the information and text that I
believe need to be in the spec itself. It is almost trivial by
comparison, but I think it should contain some strong language
explaining why it is unreasonable to claim conformance with or
application of UTR46 without a statement as to which (if any)
transition mechanisms are being applied (e.g., whether a domain
name containing Eszett, ZWJ, or ZWNJ will be looked up or
changed into something else that the user didn't specify). I'll
respond separately to some of the details of those notes, but
want to start with the observation that my thinking, at least,
has evolved considerably in the last three or four years and
that I think we are now quibbling about details rather than
having major disagreements.
...

[snip...]

Again, I see most of these issues as being more about details
and presentation than about fundamentals. If Mark were
interested in forming a small editorial group to make changes
along the lines I've outlined, and thought it would be useful,
I'd be happy to join in the effort.

====




Mark <https://plus.google.com/114199149796022210033>

*— Il meglio è l’inimico del bene —*
John C Klensin
2013-08-24 14:49:34 UTC
Permalink
Mark,

Excellent. Have a good vacation; let's talk after the 3rd.

john


--On Saturday, August 24, 2013 14:40 +0200 Mark Davis ☕
<***@macchiato.com> wrote:

> There's been a flurry of activity on this list. I'm on
> vacation, and won't be able to respond much for a bit, but
> I'll make just a couple of brief comments.
>
> With reference to your comments below, I think that many
> people's views have evolved in the last four years. I'm sure
> that the Unicode Consortium would be glad to work together on
> improving UTS 46. As you say, we are in a bit of a chicken and
> egg situation between registries and browsers, so a clearer
> path forward to IDNA2008 would be great. (And in retrospect, I
> so wish that IDNA2003 had been built along the IDNA2008
> architecture—would have saved us all so much pain!)
>
> The key is an effective transition plan for #2/#3.
> I put out some strawman ideas on this list, but clearly there
> needs to be more discussion. I think everyone recognizes that
> we won't get to zero "breaking" IDNA2003 URLs; the goal should
> be to get to a small enough number that the major players feel
> comfortable flipping the switch on the remaining ones.
>
> Back on Sept 9.
Jiankang Yao
2013-08-26 06:27:01 UTC
Permalink
From: Mark Davis ☕
Date: 2013-08-24 20:40
To: John C Klensin
CC: PUBLIC-***@W3.ORG; ***@w3.org; IDNA update work; Anne van Kesteren; Vint Cerf; www-tag.w3.org
Subject: Re: Standardizing on IDNA 2003 in the URL Standard
>>...

> With reference to your comments below, I think that many people's views have evolved in the last four years. I'm sure that the Unicode Consortium would be glad to work together on improving UTS 46.
>
>

Is it possible to standardize UTS46 via the IETF process? For example, making the main contents of UTS46 into an RFC.


Jiankang Yao
John C Klensin
2013-08-26 11:54:26 UTC
Permalink
--On Monday, August 26, 2013 14:27 +0800 Jiankang Yao
<***@cnnic.cn> wrote:

>> With reference to your comments below, I think that many
>> people's views have evolved in the last four years. I'm sure
>> that the Unicode Consortium would be glad to work together on
>> improving UTS 46.
>
> Is it possible to standardize UTS46 via the IETF process? For
> example, making the main contents of UTS46 into an RFC.

If it could get consensus, there is only one problem in
principle: the IETF normally insists on change control over its
standards, so the mapping tables that are effectively part of
UTR46 would either need to be frozen on the IDNA2003 mappings or
made part of the IDNA2008 review and update process. Of course,
I have no idea whether the Unicode Consortium would want to give
up their change control.

However, and with the understanding that this is my personal
opinion only, some parts of UTR46 effectively say that one
should ignore the specifications of IDNA2008 and do what it says
instead. That is the issue I referred to earlier as really
being a replacement for IDNA2008 that draws on both IDNA2003 and
IDNA2008 rather than being just a supplement to IDNA2008. I
think that would need to be clarified in order to be acceptable
to the IETF (at least without reopening IDNA2008, which I don't
think anyone wants to do and which would considerably disrupt
the ICANN side of things).

Recommendation: let's see if we can focus on the possible
changes to UTR46 that Mark, myself, and others have been
pointing toward and whether consensus on them is possible rather
than the details of who standardizes what. The formalities of
standardization are much less important than who is actually
adopting what (in essence, that is the issue that started this
thread) and getting stuck on them at this point would, I
believe, not be in anyone's best interests.

john
JFC Morfin
2013-08-26 17:01:00 UTC
Permalink
At 08:27 26/08/2013, Jiankang Yao wrote:
>>With reference to your comments below, I think that many people's
>>views have evolved in the last four years. I'm sure that the Unicode
>>Consortium would be glad to work together on improving UTS 46.
>
>Is it possible to standardlize UTS46 based on IETF process? for
>eample, making the main contents of UTS46 as the RFC.

John has commented from an "insider" point of view. From an
external/global OpenUse point of view, this could certainly be a
good, but risky, thing endangering the IDNA2008 consensual
compromise. The reasons why it might be the case call for general
considerations that need to be documented in order to be clear to everyone.

I began doing it at:
http://iucg.org/wiki/OpenStand_context_vs._standardizing_IDNA2003
Comments are welcome before I publish it as a draft.

jfc
JFC Morfin
2013-10-26 18:18:00 UTC
Permalink
http://www.icann.org/en/news/press/releases/release-23oct13-en
jfc
Anne van Kesteren
2014-01-15 16:26:22 UTC
Permalink
On Sat, Aug 24, 2013 at 1:40 PM, Mark Davis ☕ <***@macchiato.com> wrote:
> I put out some strawman ideas on this list, but clearly there needs to be
> more discussion. I think everyone recognizes that we won't get to zero
> "breaking" IDNA2003 URLs; the goal should be to get to a small enough number
> that the major players feel comfortable flipping the switch on the remaining
> ones.
>
> Back on Sept 9.

It's been a couple of months. Any updates for us?

Things I found not addressed by IDNA2003 that
http://url.spec.whatwg.org/#concept-host-parser papers over (a sketch
follows the list):

* Percent-decoding
* Rejecting certain ASCII code points to ensure idempotency, but not
e.g. "_" as that would break sites
* Lowercasing the ASCII code points as IDNA2003 only applies if
there's a non-ASCII code point

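A minimal sketch of that pre-processing in Python, assuming a UTF-8
reading of percent-escapes (the function name is illustrative):

    from urllib.parse import unquote

    def preprocess_host(host):
        # Percent-decode first, e.g. "%41.com" -> "A.com".
        host = unquote(host)
        # The idempotency checks that reject code points such as "%"
        # (but deliberately not "_") would go here; omitted in this sketch.
        # Lowercase only the ASCII range, since the IDNA2003 mapping step
        # is skipped when every code point is ASCII.
        return ''.join(c.lower() if c.isascii() else c for c in host)

    print(preprocess_host('%41.com'))  # 'a.com'
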
I have not checked what of that can be removed if we use UTS #46
instead. Certainly referencing IDNA2008 directly does not work, as
"A.com" does not become "a.com", which would presumably break too many
scripts.


--
http://annevankesteren.nl/
Andrew Sullivan
2014-01-15 17:19:26 UTC
Permalink
On Wed, Jan 15, 2014 at 04:26:22PM +0000, Anne van Kesteren wrote:

> I have not checked what of that can be removed if we use UTS #46
> instead. Certainly referencing IDNA2008 directly does not work, as
> "A.com" does not become "a.com", which would presumably break too many
> scripts.

IDNA2008 has no effect at all on "A.com" or "a.com".

IDNA2008 does say that "Aà.com" is not PVALID. I _believe_ that under
IDNA2003 that becomes "aà.com". The reason IDNA2008 doesn't do that
is because you can't tell whether "Aà.com" is supposed to be "àà.com"
or "aà.com", so IDNA2008 tries to say "don't do that".

IDNA2008 also says that you can do some local mapping. In my opinion,
UTS #46 goes too far with compatibility attempts to IDNA2003, but I'm
prepared to accept that lots of other people disagree.

A

--
Andrew Sullivan
***@anvilwalrusden.com
Jiankang Yao
2014-01-16 02:07:36 UTC
Permalink
From: Andrew Sullivan
Date: 2014-01-16 01:19
To: Anne van Kesteren
CC: Mark Davis ?; PUBLIC-***@W3.ORG; ***@w3.org; John C Klensin; IDNA update work; Vint Cerf; www-tag.w3.org
Subject: Re: Standardizing on IDNA 2003 in the URL Standard

>IDNA2008 has no effect at all on "A.com" or "a.com".

>IDNA2008 does say that "Aà.com" is not PVALID.
>

Which section of the IDNAbis RFCs says that "Aà.com" is not PVALID?

I checked Aà.com in the Verisign conversion tool; it shows "xn--a-sfa.com":

http://mct.verisign-grs.com/convertServlet?input=A%C3%A0.com

Does it follow IDNA2003 instead of IDNAbis?


Jiankang Yao
Andrew Sullivan
2014-01-16 02:52:27 UTC
Permalink
On Thu, Jan 16, 2014 at 10:07:36AM +0800, Jiankang Yao wrote:
>
> Which section of the IDNAbis RFCs says that "Aà.com" is not PVALID?

Upper case characters are not PVALID. This is because of the Unstable
rule (category B in RFC 5892): toNFKC(toCaseFold(toNFKC(cp))) != cp. In
appendix B.1 it's illustrated in this entry:

003A..0060 ; DISALLOWED # COLON..GRAVE ACCENT

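A minimal illustration of the Unstable test in Python, assuming
str.casefold() as a stand-in for Unicode case folding:

    import unicodedata

    def is_unstable(cp):
        # RFC 5892 category B (Unstable): a code point is excluded when
        # it does not survive NFKC -> case fold -> NFKC.
        inner = unicodedata.normalize('NFKC', cp)
        return unicodedata.normalize('NFKC', inner.casefold()) != cp

    print(is_unstable('A'))  # True: 'A' folds to 'a', hence not PVALID
    print(is_unstable('a'))  # False
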
> I checked Aà.com in the Verisign conversion tool; it shows "xn--a-sfa.com":
>
> http://mct.verisign-grs.com/convertServlet?input=A%C3%A0.com
>
> Does it follow IDNA2003 instead of IDNAbis?

I have no idea. Maybe it has a bug. Or maybe it does case mapping
before it touches the string; that's what RFC 5895 suggests. In that
case, of course, the Verisign conversion tool includes some things
that applications are supposed to do.

A

--
Andrew Sullivan
***@anvilwalrusden.com
Jiankang Yao
2014-01-16 03:03:18 UTC
Permalink
Aà.com has been registered.
http://www.verisigninc.com/en_US/products-and-services/register-domain-names/whois/index.xhtml

Your search for Aà.com returns the below results:



Whois Server Version 2.0


Domain Name: AÀ.COM (XN--A-SFA.COM)
Registrar: INTERNET.BS CORP.
Whois Server: whois.internet.bs
Referral URL: http://www.internet.bs
Name Server: NS1.SEDOPARKING.COM
Name Server: NS2.SEDOPARKING.COM
Status: clientTransferProhibited
Updated Date: 12-sep-2013
Creation Date: 22-feb-2009
Expiration Date: 22-feb-2014

Jiankang Yao
Paul Hoffman
2014-01-16 03:42:04 UTC
Permalink
On Jan 15, 2014, at 7:03 PM, Jiankang Yao <***@cnnic.cn> wrote:

> Aà.com has been registered.

Verisign has some registration policy for some TLD. How does that affect the discussion of the URL standard? This thread seems to have started as a discussion of updating the URL standard, but has morphed badly.

--Paul Hoffman
John C Klensin
2014-01-16 06:33:09 UTC
Permalink
--On Wednesday, January 15, 2014 19:42 -0800 Paul Hoffman
<***@vpnc.org> wrote:

> On Jan 15, 2014, at 7:03 PM, Jiankang Yao <***@cnnic.cn>
> wrote:
>
>> Aà.com has been registered.
>
> Verisign has some registration policy for some TLD. How does
> that affect the discussion of the URL standard? This thread
> seems to have started as a discussion of updating the URL
> standard, but has morphed badly.

It might be relevant because Anne seems to quite often argue
that existing practice, if that is what has happened, has to be
preserved forever. But, to add a little bit to Paul's comment and
those of Andrew and John L., you (Jiankang) can't infer that
"Aà.com" has actually been registered from the tool at
http://www.verisigninc.com/en_US/products-and-services/register-domain-names/whois/index.xhtml.
First, when I put "Aà.com" into the tool, what I
get back is

Your search for Aà.com returns the below results:
[...]
Domain Name: AÀ.COM (XN--A-SFA.COM)
Registrar: INTERNET.BS CORP.

Now, converting the request you made into all upper-case in that
display is, IMO, stupid: it is an answer, valid or not, to a
question different from the one that was asked. Even IDNA2003
doesn't encourage doing that.

More important, you don't know what was "registered", all you
know is what is found in the database when you look up that
string. Consider a search for fuß.com: the response says:

Your search for fuß.com returns the below results:
[...]
Domain Name: FUSS.COM

I think that is ill-advised because I think they should have
responded to that lookup with a "no, but applications will map
it into 'FUSS.COM', which has the following record..." if they
believe IDNA2003 is universal or with "no, but some applications
may map it into 'FUSS.COM' so a registration of 'fuß.com' would
be blocked. The record for 'FUSS.COM' is..." if they are
conforming to IDNA2008. For the latter case, regardless of what
one might think about mapping, IDNA2008 clearly allows a
registry policy that would block "fuß.com" if "fuss.com" is
registered.

What is actually "registered" in the DNS isn't, for the first
case, "XN--A-SFA.COM". First, some issues about
case-preservation notwithstanding, the label is "xn--a-sfa".
Under IDNA2003, mapping "xn--a-sfa" through TOASCII yields
exactly the same thing that mapping it to an A-label does under
IDNA2008: "aà". I hope it is obvious that neither RFC 5895
nor UTR 46 affects the mapping from the "xn--" form to native
characters; certainly both IDNA2003 and IDNA2008 discourage or
outright prohibit making any sort of upper case conversion in
that direction (for the reason Andrew first mentioned).

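A quick check with Python's built-in punycode codec (the "xn--" ACE
prefix stripped by hand) bears out the round trip:

    # The label round-trips either way once the "xn--" prefix is stripped.
    print('a\u00e0'.encode('punycode'))  # b'a-sfa' -> label "xn--a-sfa"
    print(b'a-sfa'.decode('punycode'))   # 'aà'
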
So, the Verisign search may be an application that maps before
doing the actual lookup or it may be non-conforming to IDNA2008.
In either event, its report of what is going on is misleading
and, as others have pointed out, what they are doing has nothing
to do with what IDNA2008 (or IDNA2003) might require.

best,
john
Anne van Kesteren
2014-01-16 10:39:17 UTC
Permalink
On Thu, Jan 16, 2014 at 6:33 AM, John C Klensin <***@jck.com> wrote:
> It might be relevant because Anne seems to quite often argue
> that existing practice has to be preserved forever if that were
> what happened.

My main interest is indeed compatibility with deployed content and
what's also very relevant here is
http://www.w3.org/Provider/Style/URI.html I think. Phasing out domain
names goes very much counter to the notion of URL persistence.

The sentiment in this thread that we can keep changing the rules also
strikes me as bad. Going from IDNA2003 to UTS46 to just IDNA2008 to
maybe something else in the future. The whole reason we set standards
is stability. So you can build on top of a foundation you know will
not change. We take this pretty seriously in most places. Not taking
it seriously for something as fundamental as domain names strikes me
as wrong.

(There's a larger issue with ICANN just removing certain ccTLDs. If
the Netherlands ever stops being an independent entity, would ".nl" and
the set of URLs related to "annevankesteren.nl" just disappear? This
happened to several other countries and it makes me sad.)


--
http://annevankesteren.nl/
Gervase Markham
2014-01-16 10:53:23 UTC
Permalink
On 16/01/14 10:39, Anne van Kesteren wrote:
> The sentiment in this thread that we can keep changing the rules also
> strikes me as bad.

I think that's unfair. IDNA2008 was created because IDNA2003 was
inadequate, in a number of documented ways. No-one is saying that we
should just keep changing the rules again and again. UTS46 is, among
other things, the mapping layer which IDNA2008 says should be
implemented, so they are not "two competing standards". Yes, there are a
couple of levels within UTS46 regarding how you deal with the four
exception characters, but that's hardly creating a whole new standard.
And more about those below.

> The whole reason we set standards
> is stability. So you can build on top of a foundation you know will
> not change. We take this pretty seriously in most places. Not taking
> it seriously for something as fundamental as domain names strikes me
> as wrong.

Fixing on IDNA2003 would permanently block all those scripts which have
been added to Unicode since 3.2 (is that right?) from ever being used in
domain names, and so block users of those languages from having IDN
names. If you decide to "fix" that, then you aren't using IDNA2003 any
more, and you are "changing the rules" in a way to which you have
indicated opposition - and worse, in a non-standard way.

It has always been my understanding, and I've had confirmation certainly
from the Germans, that the backwardly-incompatible changes in IDNA2008
relating to the four exception chars - Greek sigma, Eszett, ZWJ and ZWNJ
- are endorsed by the registries of the languages most affected. In
other words, as people closest to the problem, they still think changing
is less bad than sticking with IDNA2003. That should count for a lot.

Gerv
Anne van Kesteren
2014-01-16 11:17:10 UTC
Permalink
On Thu, Jan 16, 2014 at 10:53 AM, Gervase Markham <***@mozilla.org> wrote:
> UTS46 is, among
> other things, the mapping layer which IDNA2008 says should be
> implemented,

No, it is not. Many people in this thread have voiced their opposition to
UTS46 and the desire to move away from it entirely.


> Fixing on IDNA2003 would permanently block all those scripts which have
> been added to Unicode since 3.2 (is that right?)

No, that is wrong, and that's not how we implement IDNA2003 in Gecko.


> If you decide to "fix" that, then you aren't using IDNA2003 any
> more, and you are "changing the rules" in a way to which you have
> indicated opposition - and worse, in a non-standard way.

It's not worse if it's fully backwards compatible and mostly
interoperable across all major clients. At that point the standard is
just wrong.


> It has always been my understanding, and I've had confirmation certainly
> from the Germans, that the backwardly-incompatible changes in IDNA2008
> relating to the four exception chars - Greek sigma, Eszett, ZWJ and ZWNJ
> - are endorsed by the registries of the languages most affected. In
> other words, as people closest to the problem, they still think changing
> is less bad than sticking with IDNA2003. That should count for a lot.

If that was all that had changed, I might be more optimistic. I refer
you to my earlier email about simple things as lowercasing.


--
http://annevankesteren.nl/
Gervase Markham
2014-01-16 11:36:06 UTC
Permalink
On 16/01/14 11:17, Anne van Kesteren wrote:
> On Thu, Jan 16, 2014 at 10:53 AM, Gervase Markham <***@mozilla.org> wrote:
>> UTS46 is, among
>> other things, the mapping layer which IDNA2008 says should be
>> implemented,
>
> Not is not. Many people in this thread have voiced their opposition to
> UTS46 and the desire to move away from it entirely.

Let me be more precise in my words. IDNA2008 suggests (and in practice
does not work usefully without) an application-level mapping layer,
which it does not define, for things like casefolding. UTS46 is one such
mapping layer, and one with the property that it retains as much
compatibility with IDNA2003 as possible.

>> Fixing on IDNA2003 would permanently block all those scripts which have
>> been added to Unicode since 3.2 (is that right?)
>
> No that is wrong and that's not how we implement IDNA2003 in Gecko.

Well, that's what IDNA2003 says to do:
https://tools.ietf.org/html/rfc3490#ref-UNICODE

> It's not worse if it's fully backwards compatible and mostly
> interoperable across all major clients. At that point the standard is
> just wrong.

And having a standard fixed to Unicode 3.2 is not also "just wrong"?

> If that was all that had changed, I might be more optimistic. I refer
> you to my earlier email about simple things as lowercasing.

And I refer you to my comments above. Problems like lowercasing (for
better or worse) are punted by IDNA2008 and are labelled as an
application-level problem. In practice, what everyone should do for best
interoperability is implement the same application-level mappings, and
implement ones which are as compatible as possible with IDNA2003.
Hence.... UTS46.

Gerv
Anne van Kesteren
2014-01-16 11:48:45 UTC
Permalink
On Thu, Jan 16, 2014 at 11:36 AM, Gervase Markham <***@mozilla.org> wrote:
> On 16/01/14 11:17, Anne van Kesteren wrote:
>> It's not worse if it's fully backwards compatible and mostly
>> interoperable across all major clients. At that point the standard is
>> just wrong.
>
> And having a standard fixed to Unicode 3.2 is not also "just wrong"?

The point is that in practice, it isn't fixed to Unicode 3.2. I have
yet to encounter an IDNA2003 implementation that does that. It turns
out the setup we have in practice is a compatible evolution.


> And I refer you to my comments above. Problems like lowercasing (for
> better or worse) are punted by IDNA2008 and are labelled as an
> application-level problem. In practice, what everyone should do for best
> interoperability is implement the same application-level mappings, and
> implement ones which are as compatible as possible with IDNA2003.
> Hence.... UTS46.

I think I did mention earlier on UTS46 might be okay, depending on the
details. I am hoping to hear from Mark on the matter.


--
http://annevankesteren.nl/
Mark Davis ☕
2014-01-16 13:24:07 UTC
Permalink
> The point is that in practice, it [IDNA2003] isn't fixed to Unicode 3.2.

It is not unlikely that an implementation that you think is following
IDNA2003 (with a non-standard, larger repertoire) is actually following UTS
46.

If you were reverse-engineering to find out which standard an
implementation was following, you'd need to query certain characters to see
if they were supported, and how. UTS 46 also allows two 'modes', for
transitional and not, that you'd have to test. There is a table in
http://unicode.org/reports/tr46/#Table_IDNA_Comparisons that illustrates
this. (You'd have to look at the data tables to get a full listing.) And,
of course, it is clearly possible for an implementation to be
non-conformant to all of the standards we are talking about (IDNA2003, UTS
46, and IDNA2008).

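One concrete, non-authoritative way to run such probes is with the
third-party Python "idna" package (an IDNA2008 implementation plus the
UTS 46 mapping layer); the probe strings here are illustrative:

    import idna  # third-party package: pip install idna

    def probe(domain):
        try:
            return idna.encode(domain, uts46=True).decode('ascii')
        except idna.IDNAError as e:
            return 'rejected ({})'.format(e)

    print(probe('fa\u00df.de'))  # xn--fa-hia.de: deviation character kept
    print(probe('\u2615.com'))   # rejected: the symbol is DISALLOWED
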
As previously noted, however, casing differences and the 4 deviation
characters take some careful checking, since there is a difference between
what the implementation accepts and what goes out 'over the wire'. And the
implementation may also not be using the latest version of Unicode, which
would make a difference for UTS 46 and IDNA2008.

BTW, there's an online demo of Unicode properties that can be used to see
differences. The categories are slightly different than what is shown in
the above chart, but you can get a sense for the differences:

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=\p{any}&abb=on&g=idna2003+uts46+idna2008

One way to look at UTS 46 is as a migration layer to support client
implementations during the transition of registries from IDNA2003 to
IDNA2008, plus a mapping layer that can be used with straight IDNA2008.

> I think I did mention earlier on UTS46 might be okay, depending on the
details. I am hoping to hear from Mark on the matter.

I'm not sure what specific questions you have about UTS 46. Can you
reiterate them?




Mark <https://google.com/+MarkDavis>

*— Il meglio è l’inimico del bene —*


On Thu, Jan 16, 2014 at 12:48 PM, Anne van Kesteren <***@annevk.nl>wrote:

> On Thu, Jan 16, 2014 at 11:36 AM, Gervase Markham <***@mozilla.org>
> wrote:
> > On 16/01/14 11:17, Anne van Kesteren wrote:
> >> It's not worse if it's fully backwards compatible and mostly
> >> interoperable across all major clients. At that point the standard is
> >> just wrong.
> >
> > And having a standard fixed to Unicode 3.2 is not also "just wrong"?
>
> The point is that in practice, it isn't fixed to Unicode 3.2. I have
> yet to encounter an IDNA2003 implementation that does that. It turns
> out the setup we have in practice is a compatible evolution.
>
>
> > And I refer you to my comments above. Problems like lowercasing (for
> > better or worse) are punted by IDNA2008 and are labelled as an
> > application-level problem. In practice, what everyone should do for best
> > interoperability is implement the same application-level mappings, and
> > implement ones which are as compatible as possible with IDNA2003.
> > Hence.... UTS46.
>
> I think I did mention earlier on UTS46 might be okay, depending on the
> details. I am hoping to hear from Mark on the matter.
>
>
> --
> http://annevankesteren.nl/
>
>
Anne van Kesteren
2014-01-16 14:27:03 UTC
Permalink
On Thu, Jan 16, 2014 at 1:24 PM, Mark Davis ☕ <***@macchiato.com> wrote:
> It is not unlikely that an implementation that you think is following
> IDNA2003 (with a non-standard, larger repertoire) is actually following UTS
> 46.

I know for a fact that Gecko has not changed its implementation (but
has updated Unicode since the release of IDNA2003, doh). It "passes"
the Pile of Poo Test™:

<a href="http://💩.com/">test</a>
<script>alert(document.querySelector("a").host)</script>

Alerts: xn--ls8h.com

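For reference, that answer matches a bare punycode conversion of
U+1F4A9 with the ACE prefix added, e.g. with Python's built-in codec:

    label = '\U0001F4A9'  # PILE OF POO
    print('xn--' + label.encode('punycode').decode('ascii') + '.com')
    # xn--ls8h.com
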
Chrome alerts the same and reportedly has updated to UTS46 (compatible
mode), so as you point out the differences are probably minor and
require checking of some obscurer code points.


> There is a table in
> http://unicode.org/reports/tr46/#Table_IDNA_Comparisons

That is an interesting table. Ⅎ (line c) seems indeed disallowed in
Chrome, yet 㛼 (line d) which should also be disallowed per that table
works fine. Both work fine in Firefox. Both Chrome and Firefox map ！
(line b) to ! and do not cause parsing to fail because of it, even
though the table suggests it should. (Presumably due to it making
assumptions about ASCII that browsers do not share.)

Firefox and Safari map ؂ (line i) and Chrome does not.


> One way to look at UTS 46 is as a migration layer to support client
> implementations during the transition of registries from IDNA2003 to
> IDNA2008, plus a mapping layer that can be used with straight IDNA2008.

I'm not sure what this means. Do you think we will ever stop mapping
U+3002 to U+002E? Or A to a?


>> I think I did mention earlier on UTS46 might be okay, depending on the
> details. I am hoping to hear from Mark on the matter.
>
> I'm not sure what specific questions you have about UTS 46. Can you
> reiterate them?

You keep talking about UTS 46 as if it were a migration layer, which
suggests it might go away. That does not really seem acceptable to me.

It enforces DNS length restrictions on domain names (IDNA2003 did the
same), which do not appear to be enforced in browsers. They're
fine with a label longer than a hundred code points. I don't think
this should be outlawed at the parsing layer because the name might be
used outside the DNS.

I wish it contained the actual ASCII restrictions we need in practice
rather than deferring those to the application, but I suppose I can
define those in the URL Standard and use UseSTD3ASCIIRules=false.

Another wish I have is that the algorithms are a bit clearer in terms
of input and output. What argument does ToASCII take? What about
ToUnicode?

E.g. how would you replace "domain to ASCII" and "domain to Unicode"
in http://url.spec.whatwg.org/#concept-host-parser with UTS46 and
ensure the algorithm still has the same kind of expected output? It's
not entirely clear to me how to make use of your work.


--
http://annevankesteren.nl/
Mark Davis ☕
2014-01-16 15:32:25 UTC
Permalink
I will be brief, because I don't have much time for this topic this week.
(It should teach me to be quiet...)

Mark <https://google.com/+MarkDavis>

*— Il meglio è l’inimico del bene —*


On Thu, Jan 16, 2014 at 3:27 PM, Anne van Kesteren <***@annevk.nl> wrote:

> On Thu, Jan 16, 2014 at 1:24 PM, Mark Davis ☕ <***@macchiato.com> wrote:
> > It is not unlikely that an implementation that you think is following
> > IDNA2003 (with a non-standard, larger repertoire) is actually following
> UTS
> > 46.
>
> I know for a fact that Gecko has not changed its implementation (but
> has updated Unicode since the release of IDNA2003, doh). It "passes"
> the Pile of Poo Test™:
>
> <a href="http://💩.com/">test</a>
> <script>alert(document.querySelector("a").host)</script>
>

The problem is, as Andrew and others have said, IDNA2003 does not specify
*how* one would update to a new version of Unicode: that is, exactly which
new characters would be accepted and which not, and how to case-map them.


> Alerts: xn--ls8h.com
>
> Chrome alerts the same and reportedly has updated to UTS46 (compatible
> mode), so as you point out the differences are probably minor and
> require checking of some obscurer code points.
>
>
> > There is a table in
> > http://unicode.org/reports/tr46/#Table_IDNA_Comparisons
>
> That is an interesting table. Ⅎ (line c) seems indeed disallowed in
> Chrome, yet 㛼 (line d) which should also be disallowed per that table
> works fine. Both work fine in Firefox. Both Chrome and Firefox map ！
> (line b) to ! and do not cause parsing to fail because of it, even
> though the table suggests it should. (Presumably due to it making
> assumptions about ASCII that browsers do not share.)
>

I'd have to look at those cases.

>
> Firefox and Safari map ؂ (line i) and Chrome does not.
>
>
> > One way to look at UTS 46 is as a migration layer to support client
> > implementations during the transition of registries from IDNA2003 to
> > IDNA2008, plus a mapping layer that can be used with straight IDNA2008.
>
> I'm not sure what this means. Do you think we will ever stop mapping
> U+3002 to U+002E?

> Or A to a?
>

I'm assuming that you mean the ASCII characters (I'm not going to check
whether you have just look-alikes). ASCII case mapping is covered at a
different level.

I don't think clients would stop mapping, and IDNA2008 permits it.
That's why I said "plus a mapping layer that can be used with straight
IDNA2008".

>
> >> I think I did mention earlier on UTS46 might be okay, depending on the
> > details. I am hoping to hear from Mark on the matter.
> >
> > I'm not sure what specific questions you have about UTS 46. Can you
> > reiterate them?
>
> You keep talking about UTS 46 as if it were a migration layer, which
> suggests it might go away. That does not really seem acceptable to me.
>

UTS 46 will stay around, if only for the mapping layer.

Whether the rest would be used by clients really depends on the progress
made by registries. As for the deviation-character support, I think
implementations could stop supporting them if the affected registries
enforced bundle-or-block. As to the additional symbols, implementations
could stop supporting them if the registries forbade them.

>
> It enforces DNS length restrictions on domain names (IDNA2003 did the
> same), which does not appear to be implemented in browsers. They're
> fine with a label longer than a hundred code points. I don't think
> this should be outlawed at the parsing layer because the name might be
> used outside the DNS.
>

That was never a topic of discussion in any of the standards discussions.

>
> I wish it contained the actual ASCII restrictions we need in practice
> rather than deferring those to the application, but I suppose I can
> define those in the URL Standard and use UseSTD3ASCIIRules=false.
>
> Another wish I have is that the algorithms are a bit clearer in terms
> of input and output. What argument does ToASCII take? What about
> ToUnicode?
>
> E.g. how would you replace "domain to ASCII" and "domain to Unicode"
> in http://url.spec.whatwg.org/#concept-host-parser with UTS46 and
> ensure the algorithm still has the same kind of expected output?


http://unicode.org/reports/tr46/#ToASCII

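One concrete (though non-normative) answer to the input/output question
is how the third-party Python "idna" package exposes it: the ToASCII
direction takes a string domain plus flags and returns ASCII bytes, and
decoding is the reverse. A sketch:

    import idna  # third-party package: pip install idna

    ace = idna.encode('fa\u00df.de', uts46=True)  # str in, ASCII bytes out
    print(ace)                                    # b'xn--fa-hia.de'
    print(idna.decode(ace))                       # 'faß.de'
    print(idna.encode('fa\u00df.de', uts46=True,
                      transitional=True))         # b'fass.de'
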
If there are specific areas where you find the spec unclear, I suggest that
you provide feedback as instructed at the top of the spec. Subsequent
versions can then clarify those points.


> It's
> not entirely clear to me how to make use of your work.
>

You may not have meant a singular 'you', but just for clarity: it's not
"my" work; it is the work of the Unicode consortium, with many individuals
and companies involved.

>
>
> --
> http://annevankesteren.nl/
>
Shawn Steele
2014-01-16 17:15:05 UTC
Permalink
> UTS 46 will stay around, if only for the mapping layer.

> Whether the rest would be used by clients really depends on the progress made by registries.
> As for the deviation-character support, I think implementations could stop supporting them if the affected
> registries enforced bundle-or-block.

I’m not sure that’s trivial. Would all of the next layers enforce them? (Blogspot.com, for example: I have no clue what, if anything, Blogspot does for IDN, but it’s a place that allows random users to create domain names.) How would we know?

Some of the registrars also originally stated that they didn’t want to bundle.

> As to the additional symbols, implementations could stop supporting them if the registries forbade them.

Same thing.

-Shawn
Anne van Kesteren
2014-01-17 14:03:38 UTC
Permalink
On Thu, Jan 16, 2014 at 4:32 PM, Mark Davis ☕ <***@macchiato.com> wrote:
>> Or A to a?
>
> I'm assuming that you mean the ascii characters (I'm not going to check
> whether you have just look-alikes.). ASCII case mapping is covered at a
> different level.

It is covered by the UTS46 mapping table actually.


> I don't think clients would stop
> mapping, and IDNA2008 permits it. That's why I said "
> plus a mapping layer that can be used with straight IDNA2008"

Okay. It is sometimes a bit unclear to me if people are considering
the whole of UTS46 to be transitional or just its Transitional
Processing.


>> Another wish I have is that the algorithms are a bit clearer in terms
>> of input and output. What argument does ToASCII take? What about
>> ToUnicode?
>>
>> E.g. how would you replace "domain to ASCII" and "domain to Unicode"
>> in http://url.spec.whatwg.org/#concept-host-parser with UTS46 and
>> ensure the algorithm still has the same kind of expected output?
>
> http://unicode.org/reports/tr46/#ToASCII

Right, that algorithm does not really define what input it takes, that
it takes UseSTD3ASCIIRules as a flag, that it takes some kind of marker
to indicate which processing is desired, or what exactly it returns.


> If there are specific areas where you find the spec unclear, I suggest that
> you provide feedback as instructed at the top of the spec. Subsequent
> versions can then clarify those points.

I just did. And I think I might have before as I remember seeing that
form. Might I suggest getting a public list or Bugzilla instance so
both sides can track what is going on?


--
http://annevankesteren.nl/
Andrew Sullivan
2014-01-16 14:13:06 UTC
Permalink
Hi,

On Thu, Jan 16, 2014 at 11:48:45AM +0000, Anne van Kesteren wrote:
> The point is that in practice, it isn't fixed to Unicode 3.2. I have
> yet to encounter an IDNA2003 implementation that does that. It turns
> out the setup we have in practice is a compatible evolution.

Maybe. First, see Mark Davis's remarks. Second, please tell me what
these implementations are supposed to do with, say, the 2,088
characters that were added in Unicode 6.0, among which are the emoji
symbols and the Rupee Sign. Do they all do the same thing? How do
you know? Why do you think they always will? The behaviour is
undefined under IDNA2003. As you noted in this thread, the point of
having standards is exactly that you have an answer to these things so
that everyone can interoperate without asking everyone else who has
ever implemented the same functionality, "Pssst! What did you do with
U+1F301?"

The reason to go to IDNA2008 is that it is supposed to provide an
answer to this sort of question in a completely general way. Despite
the fact that IDNA2008 came out before Unicode 6.0, it has an answer
to the question, "What do I do with the emoji?" And sure enough, the
(non-normative) derived properties database that IANA helpfully
provides lists exactly what you would expect those characters to do.
But your implementation wouldn't need to wait for the registry to be
updated to know.

I cannot take seriously the argument that this is all about
compatibility if that argument depends on using a standard that simply
leaves out thousands of characters, and under which applications have
to make up their own handling rules for those. That is no promise of
compatibility at all.

I believe stability of URIs is really important, and I think backward
compatibility with deployed code is just as important. But if there
is any opportunity to fix this properly, it is now. If we don't
embrace that, the problem will be much worse in the future --
especially as more IDNs show up at the top level of the DNS. In my
opinion, there is a responsibility to embrace IDNA2008 now, because it
is the best approach we were able to come up with given the conflicts
between internationalization and localization.

Best regards,

Andrew


--
Andrew Sullivan
***@crankycanuck.ca
John C Klensin
2014-01-16 17:24:57 UTC
Permalink
Hi.

With the understanding that I'm not really saying anything that
Mark, Andrew, and a few others haven't said but that a different
perspective may be worthwhile...

(1) If only because there are other protocols and actors in this
drama than web browsers, this continuing discussion leads us in
the direction of having four "standards":

(i) IDNA2008, plus or minus
application-instance-specific or platform-specific use
of RFC 5895.
(ii) IDNA2003
(iii) IDNA2008 + the mapping (as distinct from
compatibility) part of UTR46
(iv) IDNA2003 + Unspecified adaptations for Unicode
versions later than 3.2 + UTR 46

Given that there are non-web i18n applications --notably the
now-deploying email specs and the work on various
security-related and other specs in PRECIS -- simply having four
"standards" is not going to be popular with users who be
certainly be astonished when what they see as "the same thing"
behaves differently in different contexts. IMO, the only thing
that has saved us from an explosion about that so far is that
the significantly different behaviors among the above are mostly
edge cases.

The important difference between case (iv) and the others is
that, as others have pointed out, case (iv) is not one case and
no one actually knows what it actually means. Yet, as I
understand it, that is precisely what Anne is proposing to
specify. In terms of a standard, that comes pretty close to
"Unicode 3.2 is standardized and we hope that no properties of
it will change; for characters included in later versions of
Unicode, do what you like". I can't think of anything kind to
say about that.

As to the first three, I remain concerned that there are a few
characters that are PVALID (or CONTEXTJ) under IDNA2008 that
UTS46 essentially prohibits using in any separate and distinct
ways. There is no doubt in my mind that the maximally
conservative path is precisely that prohibition, preferably
enforced by registry rules that prevent separate registration of
both the IDNA2008-permitted character and whatever it would be
mapped to under IDNA2003 or UTS46. But those who decide to go
with that plan need to recognize two things, for better or worse:

(i) There are hundreds of thousands, if not millions, of
separately-administered and controlled registries in the DNS.
If the criterion for getting rid of mappings that preempt the
use of the relevant IDNA2008-permitted characters becomes "all
DNS registries prohibit independent registration of both them
and the characters that formerly mapped to them" (or even "proof
that most registries prohibit...", then anyone who believes that
point is different from "never" is deluding themselves. Worse,
each succeeding year in which web page authors believe that they
can and should depend on the mappings being present makes
discontinuing those mappings (ever) in browsers less possible.

(ii) Some people feel very strongly about the independent
availability of those characters and, regardless of what "we"
might believe, do not see confusion or conflicts within the
context of their languages (or, e.g., "their" new gTLDs). We
also know that disagreements about how a particular language is
represented in Unicode have led, in a few places, to very
serious discussions of legislative or judicial action against
the Unicode Consortium or banning the use of Unicode in those
areas. Fortunately for those of us who favor open international
standards, those efforts have never gone anywhere. But,
especially where there are conflicting standards, I see a real
possibility of some government taking the position that a
browser that de facto prohibits characters that they think
necessary and that are allowed by one of the standards is
anti-competitive and/or insulting to the national culture. If
the country or region involved were in any way economically or
culturally significant, I'd assume that browser vendors --
especially those whose existence depends on either market share
in relevant areas or on the perception that they are "good guys"
that leads to contributions -- would rapidly discover a need to
either be compatible with the standard that supported the
relevant national characters or to go to the considerable
expense and aggravation of creating a one-off implementation
that would accommodate the national demands.

---------------

FWIW, I continue to believe that the right way forward is one
that is largely consistent with all of the present approaches in
the long run. It would be something like:

(1) Advise web page authors and tool-builders that hrefs, things
that map into them (e.g., IRIs), or equivalent that depend on
mappings are just a bad idea, have been a bad idea since
IDNA2003 was introduced, and that uses of them should be revised
out of existence as quickly as possible. In other words,
unambiguously deprecate the practice without necessarily
stopping uses of it from working.

(2) Advise browser implementers to support a pair of "no
mapping" switches, one for user input and the other for hrefs
and equivalent. Ideally, those switches should have values of
"yes, map", "no, don't map", and "warn in cases where mapping is
about to be applied and then do it". By default, the "user
input" one should start at "yes" and the "href" one should start
with "warm" with the expectation of possibly migrating the "no"
over time, but it should be possible for users and those
specifying system configurations or national localizations to
set them differently.

That combination allows everyone to move forward and lets
browsers be agile relative to evolving usage and demands. For
example, if a government did impose a requirement wrt
independent use of characters in a particular language, that
could be handled as a localization matter rather than a browser
revision, regardless of what one thought of the merits of their
position. People working with sufficiently old HTML files could
set switches appropriately so that those pages would continue to
work in their environments. And it would allow us to start
moving away from the "four competing standards" situation
because it really does provide the migration path that we don't
have now (and that has led to various versions of what some of
us describe as "IDNA2003, more or less, forever".

best,
john
Anne van Kesteren
2014-01-17 13:23:44 UTC
Permalink
On Thu, Jan 16, 2014 at 6:24 PM, John C Klensin <***@jck.com> wrote:
> The important difference between case (iv) and the others is
> that, as others have pointed out, case (iv) is not one case and
> no one actually knows what it actually means. Yet, as I
> understand it, that is precisely what Anne is proposing to
> specify. In terms of a standard, that comes pretty close to
> "Unicode 3.2 is standardized and we hope that no properties of
> it will change; for characters included in later versions of
> Unicode, do what you like". I can't think of anything kind to
> say about that.

That is not what I'm proposing though. It might be important to
distinguish UI from DNS.

What's important for interoperability in domain names is translation
of a sequence of code points to a sequence of bytes that can be used
within the DNS. If you take IDNA2003, an updated version of Unicode,
and assume the same algorithms defined in IDNA2003 apply, you have an
algorithm that defines just that. (UTS46 in compatibility mode appears
to be basically that, minus a couple of exceptions I should probably
investigate at some point.)

Then there's another aspect which is UI. Making sure the user is not
spoofed, etc. Browsers already differ what they are willing to show to
the user in Unicode and what they will show in "ASCII". E.g. Chrome
has a policy where it will only use ToUnicode if the code points can
reasonably be assumed to be within a range that the user's locale
matches. See http://wiki.whatwg.org/wiki/URL#UI for some pointers.

I guess both of these are what you later call "href" and "user input".
Given that these are already decoupled I don't really see why we
should not have mapping in "href" consistent with what we provide now.
And frankly, I don't see it going away.


--
http://annevankesteren.nl/
Andrew Sullivan
2014-01-17 15:56:51 UTC
Permalink
On Fri, Jan 17, 2014 at 02:23:44PM +0100, Anne van Kesteren wrote:
>
> What's important for interoperability in domain names is translation
> of a sequence of code points to a sequence of bytes that can be used
> within the DNS.

This is part of where we disagree. What is important for
interoperability is not only what you say, but also a reversible
translation so that when you get the octets used in the DNS back, they
can always be turned back into the sequence of code points you started
with. IDNA2003 doesn't have that property, which is the reason for
the backward incompatibility.

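A minimal illustration with Python's built-in "idna" codec, which
implements IDNA2003:

    # The IDNA2003 mapping step folds ß to "ss" before encoding, so the
    # original spelling cannot be recovered from the wire form.
    print('fu\u00df.com'.encode('idna'))  # b'fuss.com'
    print(b'fuss.com'.decode('idna'))     # 'fuss.com' -- not 'fuß.com'
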
> If you take IDNA2003, an updated version of Unicode,
> and assume the same algorithms defined in IDNA2003 apply you have an
> algorithm that defines just that.

No, because there aren't algorithms defined in IDNA2003. There's a
list of code points that are "out"; everything else is allowed. We'd
actually have to go over the new code points in order to get the new
definition you're talking about.

To get the kind of algorithm-based approach you're talking about
there, you have to move to IDNA2008.

> Then there's another aspect which is UI. Making sure the user is not
> spoofed, etc.

Surely this is quite a different problem to the above, though, no?
(You're never going to be able to "make sure", of course. All you can
do is get it more or less right.)

Different user agents of course do these things differently. I sort
of hate the approaches widely used, but I acknowledge that they're
better than nothing.

A

--
Andrew Sullivan
***@anvilwalrusden.com
Bjoern Hoehrmann
2014-01-17 16:11:05 UTC
Permalink
* Andrew Sullivan wrote:
>On Fri, Jan 17, 2014 at 02:23:44PM +0100, Anne van Kesteren wrote:
>>
>> What's important for interoperability in domain names is translation
>> of a sequence of code points to a sequence of bytes that can be used
>> within the DNS.
>
>This is part of where we disagree. What is important for
>interoperability is not only what you say, but also a reversible
>translation so that when you get the octets used in the DNS back, they
>can always be turned back into the sequence of code points you started
>with. IDNA2003 doesn't have that property, which is the reason for
>the backward incompatibility.

I read Anne as saying, for the purposes of this discussion, he cares
about the definition of a `uint8_t* f(codepoint_t* input) { ... }`
function and not user interface or other issues. There was no impli-
cation in the quoted text whether he cares about `f` being injective.
(He might have said something about this elsewhere, but not here).
--
Björn Höhrmann · mailto:***@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
John C Klensin
2014-01-17 16:22:59 UTC
Permalink
--On Friday, January 17, 2014 10:56 -0500 Andrew Sullivan
<***@anvilwalrusden.com> wrote:

>...
>> If you take IDNA2003, an updated version of Unicode,
>> and assume the same algorithms defined in IDNA2003 apply you
>> have an algorithm that defines just that.
>
> No, because there aren't algorithms defined in IDNA2003.
> There's a list of code points that are "out"; everything else
> is allowed. We'd actually have to go over the new code points
> in order to get the new definition you're talking about.

Just to clarify, there is a general assumption about where that
list of exclusions came from and how to derive the mappings
listed in Stringprep. That assumption is close enough to true
that people assume they can generate lists equivalent to
Stringprep/Nameprep for Unicode versions after 3.2. That second
assumption is, I assume, where the "IDNA2003 plus updates" claim
comes from. The difficulty is that the assumption is not quite
correct: some tuning was done to meet the needs of IDNs that
don't apply to other i18n cases and that tuning is Unicode
3.2-specific. And that, in turn, is why various of us have
described "IDNA2003 plus updates for later versions" as a
non-specification that has elements of "do what you like and
leave everyone else guessing".

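Python's stringprep module makes that version pinning concrete: its
tables were generated from Unicode 3.2, while the rest of the runtime
has moved on. A small check:

    import stringprep, unicodedata

    cp = '\U0001F4A9'  # PILE OF POO, assigned in Unicode 6.0
    print(stringprep.in_table_a1(cp))  # True: unassigned in Unicode 3.2
    print(unicodedata.category(cp))    # 'So': assigned in current Unicode
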
I agree with your main point, however: IDNA2008 was driven by
two fundamental design decisions different from those underlying
IDNA2003:

(i) Reversibility of the two label representations for
the reasons you summarized.

(ii) Shifting from normative tables tied to a version of
Unicode (i.e., Stringprep/Nameprep) to a rule set
intended to be largely independent of version changes.

Just about everything else is either details or side-effects.
Those may still be troublesome, but the above were, and remain,
the main differences between the two generations of IDNA
standards.

best,
john

p.s. I know I still owe the lists responses to a couple of other
notes -- will get to them as soon as possible.
Patrik Fältström
2014-01-17 18:28:58 UTC
Permalink
On 17 jan 2014, at 17:22, John C Klensin <***@jck.com> wrote:

> I agree with your main point, however: IDNA2008 was driven by
> two fundamental design decisions different from those underlying
> IDNA2003:
>
> (i) Reversibility of the two label representations for
> the reasons you summarized.
>
> (ii) Shifting from normative tables tied to a version of
> Unicode (i.e., Stringprep/Nameprep) to a rule set
> intended to be largely independent of version changes.

Let me say that I think the main problem for this discussion to move forward is that too many things are discussed at the same time. And that was another reason why IDNA2008 was developed to replace IDNA2003.

Let me try to give my "from the top of my head" perspective of the various issues. Others (Andrew, John, Mark) might add things of course:

1. Algorithmic definition of what status each Unicode Codepoint has

IDNA2003 is defined by an explicit list of code points, based on Unicode 3.2. Because of this, it formally cannot be applied to other versions of Unicode. Sure, it is possible to try to guess what algorithm was behind the tables, and then apply that algorithm to later versions of Unicode, but the result is not certain.

In reality, that is exactly what IDNA2008 is. A set of rules that leads to as much backward compatibility as possible.

2. Mapping, like case folding, NFC etc

IDNA2003 did include some mapping. IDNA2008 does not, for various reasons.

Some people do have the view it is really important mapping is uniform across applications, operating systems and cultures. Some do think a subset of the mappings must be 1:1. Some think the best mappings are done with the help of a locale (that by definition is different for different users).

3. Backward compatibility

A few code points have changed status so that they are, when applying IDNA2008 algorithms, not backward compatible. For each such code point (character) some think it would be preferred to have the same management of them.

This should include both information on what to do with them, and how to one day phase out these special rules.

...and as I said, possibly more.

For me personally, I think the most important thing is (1). Having algorithms is good, having 1:1 between A-label and U-label is good. Separating 1, 2 and 3 above is good.

For 2, I think we will never get the same mappings in all contexts. But within a context (cultural and/or application) we might have the same.

For 3, I think we have not heard enough about how, for example, .DE has taken care of the issue(s). It is much more important to listen to them than to, for example, .COM.

But I am pretty sure we need to keep the things separated to be able to move forward.

Patrik
Patrik Fältström
2014-01-17 18:34:10 UTC
Permalink
On 17 jan 2014, at 19:28, Patrik Fältström <***@frobbit.se> wrote:

> Let me say that I think the main obstacle to moving this discussion forward is that too many things are being discussed at the same time. And that was another reason why IDNA2008 was developed to replace IDNA2003.

I forgot one thing. IRIs do add a whole set of separate issues, but I am sure other people on this list do know more about those details than I do.

I limited this email to domain names.

Patrik

John C Klensin
2014-01-16 17:46:21 UTC
Permalink
--On Thursday, January 16, 2014 11:36 +0000 Gervase Markham
<***@mozilla.org> wrote:

>...
>> If that was all that had changed, I might be more optimistic.
>> I refer you to my earlier email about simple things as
>> lowercasing.
>
> And I refer you to my comments above. Problems like
> lowercasing (for better or worse) are punted by IDNA2008 and
> are labelled as an application-level problem. In practice,
> what everyone should do for best interoperability is implement
> the same application-level mappings, and implement ones which
> are as compatible as possible with IDNA2003. Hence.... UTS46.

Two things:

While there are certainly people who find other aspects of UTR46
distasteful, there are, IMO, only two core objections, neither
of which relates to UTR46 as a simple mapping layer for user
input. One is that UTR 46 specifies, as transitional, some
relationships that map IDNA2008-permitted characters into other
IDNA2008-permitted characters (an effective blocking function
for the former set) without spelling out plausible and
relatively short-term plans for ending that blocking situation.
The use of mappings that hide or block IDNA2008-permitted labels
unquestionably violates the intent of IDNA2008 and the intent of
making those characters available.
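
In code terms, the objection is to the transitional mode. A
sketch, again assuming the third-party Python "idna" package:

    import idna

    idna.encode('faß.de', uts46=True, transitional=True)
    # b'fass.de': an IDNA2008-permitted character is mapped to other
    # IDNA2008-permitted characters, so the ß label stays unreachable

    idna.encode('faß.de', uts46=True, transitional=False)
    # b'xn--fa-hia.de': the label IDNA2008 intended to make available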

The other is that, after earning the scars of many years of
experience with a variety of systems and user interfaces, some
of us have concluded that the use of unambiguous and stable
canonical forms "on the wire" and in contexts that are supposed
to be persistent is really important. The distinction between
mapping for something typed or otherwise specified directly by
the user and a mapping requirement for domains or URLs/URIs
stored in documents, search or DNS examination programs, and the
like keeps getting lost in this set of discussions, but is
really, seriously, important.

best,
john

And, again,
John Cowan
2014-01-16 17:55:58 UTC
Permalink
John C Klensin scripsit:

> The distinction between mapping for something typed or otherwise
> specified directly by the user and a mapping requirement for domains
> or URLs/URIs stored in documents, search or DNS examination programs,
> and the like keeps getting lost in this set of discussions, but is
> really, seriously, important.

I'm not so sure. In the end, URLs in documents tend to be typed by
the user too; it's just a different kind of user. You could argue that
document editors should do the mapping themselves, but then you're back
to the old stand.

> best,
> john
>
> And, again,

Ceterum censeo ...

--
A few times, I did some exuberant stomping about, John Cowan
like a hippo auditioning for Riverdance, though ***@ccil.org
I stopped when I thought I heard something at http://ccil.org/~cowan
the far side of the room falling over in rhythm
with my feet. --Joseph Zitt
John C Klensin
2014-01-16 19:04:21 UTC
Permalink
--On Thursday, January 16, 2014 12:55 -0500 John Cowan
<***@mercury.ccil.org> wrote:

> John C Klensin scripsit:
>
>> The distinction between mapping for something typed or
>> otherwise specified directly by the user and a mapping
>> requirement for domains or URLs/URIs stored in documents,
>> search or DNS examination programs, and the like keeps
>> getting lost in this set of discussions, but is really,
>> seriously, important.
>
> I'm not so sure. In the end, URLs in documents tend to be
> typed by the user too, it's just a different kind of user.

But there is always something of a transformation process to get
it into the document. For example, users don't type UTF-8; they
type stuff that gets mapped via various procedures into UTF-8 or
something else.

> You could argue that document editors should do the mapping
> themselves, but then you're back to the old stand.

Maybe I am "back to the old stand" -- I'm just trying to explain
a perspective that has some history of being useful. That
history, for me and even for i18n issues specifically, extends
back to the late 60s, which is, indeed, very "old stand".
However, I think there are ultimately two cases as far as
document editors are concerned:

(1) The mapping that might be used is trivial -- either the
ASCII cases or things like full-width East Asian characters
(many ASCII characters fall into this category only if one is
willing to assume that, e.g., "A" always means/maps to "a"
rather than any of the decorated lower-case forms that, in
various localized writing system contexts, lose their
decorations when being mapped to upper case). For most or all
of these cases, it ought to be trivial for document editors to
simply enter the canonical forms. If there is some reason why
they don't and mapping is needed, that is ok too.
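
The trivial class is essentially what NFKC and simple lowercasing
already cover; a standard-library sketch:

    import unicodedata

    unicodedata.normalize('NFKC', 'ｅｘａｍｐｌｅ')  # full-width forms -> 'example'
    'EXAMPLE'.lower()                                # plain ASCII case -> 'example'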

(2) The more complex cases in which mappings can turn a
character into a non-obvious alternative. For these cases, the
document author/editor had better know what she is doing. The
reality is that those mappings may be done or not done,
unpredictably and depending on environment and circumstances,
and the decisions may have inadvertent blocking side-effects.
If, for example, a label that contains ZWNJ is registered and
(as UTS46 and other things recommend as a reasonable option) the
same string without ZWNJ is blocked, then an IDN resolving
engine that maps ZWNJ to nothing prevents use of the name
(similarly for sharp-S, etc.). For these cases, if the document
editor knows what is going on, then specifying exactly what is
intended (in A-label or at least U-label form) is the best and
least risky thing she can do. We betray the trust that implies
--trust that she is, in fact, smart enough to know what she is
doing-- if we second-guess her canonical strings by mapping
them to something else. Conversely, if the document editor
doesn't have a clue, it is not clear to me that we are doing
either him or his users/readers a favor by encouraging ambiguity
in an identifier that they have been told is not ambiguous.
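
The ZWNJ hazard is easy to demonstrate. A sketch, assuming the
built-in Python codec (IDNA2003) and the third-party "idna"
package (IDNA2008); the Latin label is contrived, since ZWNJ is
legitimately needed only in certain scripts:

    import idna

    'ab\u200ccd.example'.encode('idna')
    # IDNA2003's nameprep maps ZWNJ to nothing: b'abcd.example'

    idna.encode('ab\u200ccd.example')
    # IDNA2008: raises, because ZWNJ lacks the required joining
    # context (CONTEXTJ) here -- and a mapping client that silently
    # strips it would resolve a different, possibly blocked, name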

At least that is what it looks like from here. YMMV.

john
Bjoern Hoehrmann
2014-01-16 16:44:24 UTC
Permalink
* Anne van Kesteren wrote:
>The sentiment in this thread that we can keep changing the rules also
>strikes me as bad. Going from IDNA2003 to UTS46 to just IDNA2008 to
>maybe something else in the future. The whole reason we set standards
>is stability. So you can build on top of a foundation you know will
>not change. We take this pretty seriously in most places. Not taking
>it seriously for something as fundamental as domain names strikes me
>as wrong.

Browser vendors care about not making changes they will have to revert.
Arguments that something should not be changed because of standards and
stability beyond that are rarely made by them. In fact, many of them
want to institutionalise that people cannot know what will change next
in their "Living Standards". Try finding examples of things that we can
rely on not changing that do not fall under my first sentence above.
--
Björn Höhrmann · mailto:***@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Vint Cerf
2013-08-22 11:38:00 UTC
Permalink
I think it is time to start a serious campaign to move to the IDNA2008
standard for the simple reason that it removes the dependence on a fixed
and now very old version of Unicode. Opinions about backward compatibility
vary. I am more sanguine than others about accepting incompatibility with
past choices - the non-letter characters may be cute, but their cost is
too high and their utility too low, as I see it.

As the TLD space expands and IDNs become more popular, canonical
representations and decoupling from versions of Unicode are essential for
stability, uniformity and interoperability. It will only get messier with
time if we don't get going on this objective.

vint


On Thu, Aug 22, 2013 at 7:02 AM, Gervase Markham <***@mozilla.org> wrote:

> On 22/08/13 11:37, Anne van Kesteren wrote:
> >> Shame for them. The writing has been on the wall here for long enough
> >> that they should not be at all surprised when this stops working.
> >
> > I don't think that's at all true. I doubt anyone realizes this. I
> > certainly didn't until I put long hours into investigating the IDNA
> > situation.
>
> It's not been possible to register names like ☺☺☺.com for some time now;
> that's a big clue. The fact that Firefox (and other browsers, AFAIAA)
> refuses to render such names as Unicode is another one. (Are your
> friends really using http://xn--74h.example.com/ ?)
>
> Those two things, plus the difficulty of typing such names, means that
> their use is going to be pretty limited. (Even the guy who is trying to
> flog http://xn--19g.com/ , and is doing so on the basis of the fact that
> this particular one is actually easy to type on some computers, has not
> in the past few years managed to find a "Macintosh company with a
> vision" to take it off his hands.)
>
> > Furthermore, we generally preserve compatibility on the web so URLs
> > and documents remain working.
> > http://www.w3.org/Provider/Style/URI.html It's one of the more
> > important parts of this platform.
>
> (The domain name system is about more than just the web.)
>
> IIRC, we must have broken a load of URLs when we decided that %-encoding
> in URLs should always be interpreted as UTF-8 (in RFC 3986), whereas
> beforehand it depended on the charset of the page or form producing the
> link. Why did we do that? Because the new way was better for the future,
> and some breakage was acceptable to attain that goal.
>
> So what is the justification for removal of non-letter characters?
> Reduction of attack surface. When characters are divided into scripts,
> we can enforce no-script-mixing rules to keep the number of possible
> spoofs, lookalikes and substitutions tractable for humans to reason
> about in the case of a particular TLD and its allowed characters. If we
> allowed 3,254 extra random glyphs in every TLD, this would not be so.
>
> Gerv
>
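
A crude illustration of the no-script-mixing check Gerv describes.
Python's standard library does not expose the Unicode Script
property, so this heuristic sketch infers it from character names:

    import unicodedata

    def rough_scripts(label):
        # Heuristic only: the first word of a character's Unicode
        # name is usually its script (LATIN, CYRILLIC, GREEK, ...).
        return {unicodedata.name(ch).split()[0] for ch in label if ch.isalpha()}

    rough_scripts('pаypal')
    # {'LATIN', 'CYRILLIC'} -- the second character is Cyrillic 'а',
    # a classic lookalike that a mixed-script rule would reject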
Markus Scherer
2013-08-21 19:39:10 UTC
Permalink
On Wed, Aug 21, 2013 at 8:01 AM, Mark Davis ☕ <***@macchiato.com> wrote:

> The key migration issue is whether people are comfortable having
> implementations go to different IP addresses for IDNs containing 'ß' (or
> the other 3 related characters). The transitional form in TR46 is for those
> who are concerned with that problem. If the registries either bundled 'ss'
> with 'ß' or blocked (once either was registered, the other could not be), then
> the ambiguous addressing issue would not be a problem. So it is a matter of
> waiting for the significant registries to do that.
>

That's right. For ß vs. ss in particular, German speakers have always been
confused about exactly when to use which, the rules changed with the 1996
reform, and Swiss users never use ß at all -- therefore, distinguishing the
two in an otherwise case-insensitive context is just a mistake.
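
The "bundled or blocked" registry policy Mark describes above fits
in a few lines. A toy model only; real registries handle far more
equivalences than ß/ss:

    def variants(label):
        # Toy model: treat only the ß/ss equivalence.
        return {label, label.replace('ß', 'ss')}

    registered = {'fass.de'}

    def can_register(label):
        # Refuse a label if any variant of it is already taken.
        return registered.isdisjoint(variants(label))

    can_register('faß.de')  # False: 'fass.de' is taken, so the two
                            # spellings can never point to different hosts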

See also http://xn--amt-golener-land-mlb.de/

(Apparently the original http://amt-golssener-land.de/ has since turned
into a redirect to the homepage for a larger administrative unit.)

markus