Discussion: why use IRIs?
Peter Saint-Andre
2012-06-21 18:28:42 UTC
<hat type='individual'/>

I've been thinking about IRIs, and I'm wondering: why would a protocol
"upgrade" from URIs to IRIs? (If it really is an "upgrade" -- a topic
for another time.)

Consider HTTP. It has always used URIs for retrieving documents and
linking and such. Why would it change to use IRIs? Section 1.2 of
3987bis describes some necessary conditions for such a change, but
doesn't really motivate why the HTTP community would want to do so. Yes,
there is text in Section 1.1 about representing the words of natural
languages, but URIs can be used to represent those words right now. I
grant that the current mechanism for such representation isn't pretty,
but do the addressing elements of a protocol like HTTP need to be
pretty, or can we simply depend on the presentation software (e.g., web
browsers) to make things look nice for the user? (Certainly we do that
with structural elements like the HTML document format, why not also
with addressing elements like URIs?) I realize that these questions get
back to the matter of "protocol element" vs. "presentation", but I guess
what I'm saying is that I don't yet think we've really explained why we
need to make IRIs a first-class protocol element (or why a given
protocol would want to make the switch from URI-only to IRI).
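For concreteness, the "current mechanism" in question is percent-encoding of UTF-8 bytes (RFC 3986); a minimal Python sketch of what that representation looks like on the wire:

```python
# Non-ASCII words travel in URIs as percent-encoded UTF-8 bytes,
# readable by machines but not pretty for people.
from urllib.parse import quote, unquote

word = "résumé"
encoded = quote(word)
print(encoded)            # r%C3%A9sum%C3%A9
print(unquote(encoded))   # résumé
```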

Furthermore, 3987bis doesn't really explain what would be involved in
the change from URI-only to IRI in any given protocol. I suppose spec
writers in a technology community like HTTP would need to figure it out,
but IMHO some guidelines would be helpful.

Peter
--
Peter Saint-Andre
https://stpeter.im/
Bjoern Hoehrmann
2012-06-21 20:01:12 UTC
Post by Peter Saint-Andre
I've been thinking about IRIs, and I'm wondering: why would a protocol
"upgrade" from URIs to IRIs? (If it really is an "upgrade" -- a topic
for another time.)
Looking at http://lists.w3.org/Archives/Public/uri/2001Jun/0027.html I
wonder whether you are two days late or one day early with the question,
depending on whether you ignore leap days. The "upgrade" question does
not seem very relevant to me, I would rather ask about new protocols and
go from there. I see no reason why http://björn.höhrmann.de/ should be
an error in any new protocol or format that does not suffer
compatibility problems if it allows non-ASCII literals in "other places".
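For reference, an IRI host such as björn.höhrmann.de already has a deterministic ASCII form via IDNA (punycode), per RFC 3987 section 3.1. A small sketch using Python's built-in IDNA 2003 codec:

```python
# Map an IRI host to its ASCII (URI) form: each non-ASCII label
# becomes an "xn--" punycode label. Python's "idna" codec
# implements IDNA 2003, which is enough for a sketch.
host = "björn.höhrmann.de"
ascii_host = host.encode("idna").decode("ascii")
print(ascii_host)   # the all-ASCII xn-- form, usable in any URI context
# The mapping is lossless; decoding recovers the original host:
print(ascii_host.encode("ascii").decode("idna") == host)   # True
```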
Post by Peter Saint-Andre
I guess what I'm saying is that I don't yet think we've really explained
why we need to make IRIs a first-class protocol element (or why a given
protocol would want to make the switch from URI-only to IRI).
URIs are technical debt. If we could wish them away, we would, as having
them and also IRIs as "first-class" protocol elements is very expensive.

How would you like it if URIs could use only 20 of the 26 letters in the
English alphabet and you would have to encode, decode and convert them
all the time, or use awkward transliterations to avoid having to do so?
Post by Peter Saint-Andre
Furthermore, 3987bis doesn't really explain what would be involved in
the change from URI-only to IRI in any given protocol. I suppose spec
writers in a technology community like HTTP would need to figure it out,
but IMHO some guidelines would be helpful.
You just change specifications and software and content as needed. If
there are problems in doing so, there does not seem to be much that we
could say on how to address those as they would be technology-specific.
--
Björn Höhrmann · mailto:***@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Martin J. Dürst
2012-06-25 09:22:23 UTC
Hello Peter,

I think Björn already gave very good answers to your questions.
Post by Peter Saint-Andre
<hat type='individual'/>
I've been thinking about IRIs, and I'm wondering: why would a protocol
"upgrade" from URIs to IRIs?
As Björn said, it's really more about new protocols than about upgrades.
Also, different protocols (and formats) can upgrade in different ways.
Sometimes, this can be done formally with extensions, at other times
it's done gradually and sooner or later gets accepted in a spec. For
other cases, of course, it may never happen.
Post by Peter Saint-Andre
(If it really is an "upgrade" -- a topic
for another time.)
Consider HTTP. It has always used URIs for retrieving documents and
linking and such.
[There are some reports of clients just sending UTF-8, which I think
would mean using IRIs. But that has never reached the spec.]
Post by Peter Saint-Andre
Why would it change to use IRIs? Section 1.2 of
3987bis describes some necessary conditions for such a change, but
doesn't really motivate why the HTTP community would want to do so. Yes,
there is text in Section 1.1 about representing the words of natural
languages, but URIs can be used to represent those words right now. I
grant that the current mechanism for such representation isn't pretty,
but do the addressing elements of a protocol like HTTP need to be
pretty, or can we simply depend on the presentation software (e.g., web
browsers) to make things look nice for the user?
I think the real motivation would be people looking at HTTP traces and
preferring to see Unicode rather than lots of %HH strings. Of course the
number of people looking at HTTP traces is low, and they are not end users.

In general, the motivation to use IRIs is highest closer to end users
and content-oriented people such as document authors, and decreases the
lower one goes in the protocol stack.

Another motivation may be compression.
http://ja.wikipedia.org/wiki/青山学院大学 is quite a bit shorter than
http://ja.wikipedia.org/wiki/%E9%9D%92%E5%B1%B1%E5%AD%A6%E9%99%A2%E5%A4%A7%E5%AD%A6.
So maybe we can sell that to HTTP 2.0. But I'm somewhat skeptical. Only
a tiny bit of creative thinking would have been needed to transition
various header fields in HTTP from the hopelessly outdated iso-8859-1
(Latin-1) to UTF-8, but it didn't happen :-(.
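The size difference is easy to quantify, since each of the 18 UTF-8 bytes of the six kanji triples under percent-encoding; a quick Python check (the lengths in the comments are my own arithmetic, not from the thread):

```python
# Compare the IRI with its percent-encoded URI equivalent.
from urllib.parse import quote

title = "青山学院大学"
iri = "http://ja.wikipedia.org/wiki/" + title
uri = "http://ja.wikipedia.org/wiki/" + quote(title)
print(len(iri))   # 35 characters
print(len(uri))   # 83 characters
```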

The best motivation would be streamlining. EAI does a lot of
streamlining for e-mail; if it weren't for all the legacy baggage, it
would be a joy to implement. For HTTP, if browsers use Unicode
internally, and servers use it internally, what's the need for this
weird %HH stuff anyway? (It's still needed to escape reserved
characters, though.)
Post by Peter Saint-Andre
(Certainly we do that
with structural elements like the HTML document format, why not also
with addressing elements like URIs?) I realize that these questions get
back to the matter of "protocol element" vs. "presentation", but I guess
what I'm saying is that I don't yet think we've really explained why we
need to make IRIs a first-class protocol element (or why a given
protocol would want to make the switch from URI-only to IRI).
Furthermore, 3987bis doesn't really explain what would be involved in
the change from URI-only to IRI in any given protocol. I suppose spec
writers in a technology community like HTTP would need to figure it out,
but IMHO some guidelines would be helpful.
As I said at the start of this mail, I think it depends a lot on the
specific protocol. The conditions we give in Section 1.2 are general
considerations that apply to any protocol/format. Protocol-specific
considerations should do the rest, and I'm not sure it makes sense to
write much about this.

But when looking at Section 1.2, I realized that the first sentence
might have been the motivation for your mail. This sentence says:
IRIs are designed to allow protocols and software that deal with URIs
to be updated to handle IRIs.
I think that this puts too much emphasis on "update", but I'm not yet
sure how to fix that.

Regards, Martin.
Peter Saint-Andre
2012-06-29 18:27:45 UTC
Hi Martin, thanks for the clarification. I have a few comments inline.
Post by Martin J. Dürst
Hello Peter,
I think Björn already gave very good answers to your questions.
Post by Peter Saint-Andre
<hat type='individual'/>
I've been thinking about IRIs, and I'm wondering: why would a protocol
"upgrade" from URIs to IRIs?
As Björn said, it's really more about new protocols than about upgrades.
Also, different protocols (and formats) can upgrade in different ways.
Sometimes, this can be done formally with extensions, at other times
it's done gradually and sooner or later gets accepted in a spec. For
other cases, of course, it may never happen.
Post by Peter Saint-Andre
(If it really is an "upgrade" -- a topic
for another time.)
Consider HTTP. It has always used URIs for retrieving documents and
linking and such.
[There are some reports of clients just sending UTF-8, which I think
would mean using IRIs. But that has never reached the spec.]
Do you think it should reach the spec?
Post by Martin J. Dürst
Post by Peter Saint-Andre
Why would it change to use IRIs? Section 1.2 of
3987bis describes some necessary conditions for such a change, but
doesn't really motivate why the HTTP community would want to do so. Yes,
there is text in Section 1.1 about representing the words of natural
languages, but URIs can be used to represent those words right now. I
grant that the current mechanism for such representation isn't pretty,
but do the addressing elements of a protocol like HTTP need to be
pretty, or can we simply depend on the presentation software (e.g., web
browsers) to make things look nice for the user?
I think the real motivation would be people looking at HTTP traces and
preferring to see Unicode rather than lots of %HH strings. Of course the
number of people looking at HTTP traces is low, and they are not end users.
In general, the motivation to use IRIs is highest closer to end users
and content-oriented people such as document authors, and decreases the
lower one goes in the protocol stack.
It seems to me that end users can be shielded from what you call "this
weird %HH stuff" (after all, we don't show them "this weird
angle-bracket stuff" either), but what you say about document authors
and operations people makes sense. Perhaps it would be good to capture
that in the spec.
Post by Martin J. Dürst
Another motivation may be compression.
http://ja.wikipedia.org/wiki/青山学院大学 is quite a bit shorter than
http://ja.wikipedia.org/wiki/%E9%9D%92%E5%B1%B1%E5%AD%A6%E9%99%A2%E5%A4%A7%E5%AD%A6.
So maybe we can sell that to HTTP 2.0. But I'm somewhat skeptical. Only
a tiny bit of creative thinking would have been needed to transition
various header fields in HTTP from the hopelessly outdated iso-8859-1
(Latin-1) to UTF-8, but it didn't happen :-(.
The best motivation would be streamlining. EAI does a lot of
streamlining for e-mail; if it weren't for all the legacy baggage, it
would be a joy to implement. For HTTP, if browsers use Unicode
internally, and servers use it internally, what's the need for this
weird %HH stuff anyway? (It's still needed to escape reserved
characters, though.)
Post by Peter Saint-Andre
(Certainly we do that
with structural elements like the HTML document format, why not also
with addressing elements like URIs?) I realize that these questions get
back to the matter of "protocol element" vs. "presentation", but I guess
what I'm saying is that I don't yet think we've really explained why we
need to make IRIs a first-class protocol element (or why a given
protocol would want to make the switch from URI-only to IRI).
Furthermore, 3987bis doesn't really explain what would be involved in
the change from URI-only to IRI in any given protocol. I suppose spec
writers in a technology community like HTTP would need to figure it out,
but IMHO some guidelines would be helpful.
As I said at the start of this mail, I think it depends a lot on the
specific protocol. The conditions we give in Section 1.2 are general
considerations that apply to any protocol/format. Protocol-specific
considerations should do the rest, and I'm not sure it makes sense to
write much about this.
But when looking at Section 1.2, I realized that the first sentence
IRIs are designed to allow protocols and software that deal with URIs
to be updated to handle IRIs.
I think that this puts too much emphasis on "update", but I'm not yet
sure how to fix that.
Well, "update" is not "upgrade", so perhaps I have read too much into
the text. However, I think we could change it to read:

IRIs are designed to allow protocols and software that deal with URIs
to also handle IRIs if desired.

Peter
--
Peter Saint-Andre
https://stpeter.im/
John C Klensin
2012-07-02 23:33:43 UTC
--On Monday, June 25, 2012 18:22 +0900 "\"Martin J. Dürst\""
Post by Martin J. Dürst
Hello Peter,
I think Björn already gave very good answers to your
questions.
Martin, Björn, Peter,
Post by Martin J. Dürst
Post by Peter Saint-Andre
<hat type='individual'/>
I've been thinking about IRIs, and I'm wondering: why would a
protocol "upgrade" from URIs to IRIs?
As Björn said, it's really more about new protocols than
about upgrades. Also, different protocols (and formats) can
upgrade in different ways. Sometimes, this can be done
formally with extensions, at other times it's done gradually
and sooner or later gets accepted in a spec. For other cases,
of course, it may never happen.
...
For whatever it is worth, I don't find that answer particularly
helpful. My problem with it is one that we have discussed
pieces of before. If the requirement were to make something
that was coupled closely enough to URIs to be a UI overlay, then
we have one set of issues. The WG has moved beyond that into
precisely what you are commenting on above and that the key
draft seems to reflect -- a new protocol element to be used
primarily in new, or radically updated/upgraded, protocols.

But, if we are going to define a new protocol element for new
uses, then why stick with the basic URI syntax framework? We
already know that causes problems. It is hard to localize
because it contains a lot of ASCII characters that are special
sometimes and not others, that may have non-Latin-script
lookalikes, and because parsing is method-dependent. That
method-dependency makes it very hard to create variations that
are appropriate to the local writing system because one has to
be method-sensitive at too many different points. If some
protocols are to permit only IRIs, some only URIs, and some
both, it would also be beneficial to be able to determine which
is which, rather than wondering whether an IRI that actually
contains only ASCII characters (and no escapes) is actually an
IRI or is just the URI it looks like. Again, as long as IRIs
were just a UI overlay, it made no difference. But it does make a
difference for a protocol element.

I continue to believe that makes a strong case for doing
something that gets us internationalization by moving away from
the URI syntax model, probably to something that explicitly
identifies the data elements that make up a particular URI. If,
for example, one insisted that domain names be identified as
such wherever they appear, the mess about whether something can
or should be given IDNA treatment (even if only to verify
U-label syntax) and the associated RFC 6055 considerations
become much easier to handle than if one has to guess whether
something might be a domain name or something else with periods
in it.

Stated a little differently, if IRIs are protocol elements that
are intended to support new protocols, then it seems to me that
it is not obvious that the URI syntax is a constraint.
Certainly the WG has not had a serious discussion about what the
advantages of that constraint are and whether they outweigh the
disadvantages.

best,
john
Martin J. Dürst
2012-07-03 06:19:22 UTC
Hello John,
Post by John C Klensin
(sorry - sent from wrong address)
[Sorry for forwarding as a moderator, I missed this one at first.
I have added the other address to the ignore list, so you should be able
to post from either in the future.]
Post by John C Klensin
--On Monday, June 25, 2012 18:22 +0900 "\"Martin J. Dürst\""
Post by Martin J. Dürst
As Björn said, it's really more about new protocols than
about upgrades. Also, different protocols (and formats) can
upgrade in different ways. Sometimes, this can be done
formally with extensions, at other times it's done gradually
and sooner or later gets accepted in a spec. For other cases,
of course, it may never happen.
...
For whatever it is worth, I don't find that answer particularly
helpful. My problem with it is one that we have discussed
pieces of before. If the requirement were to make something
that was coupled closely enough to URIs to be a UI overlay, then
we have one set of issues. The WG has moved beyond that into
precisely what you are commenting on above and that the key
draft seems to reflect -- a new protocol element to be used
primarily in new, or radically updated/upgraded, protocols.
It looks as if some of the discussion in the IRI WG might have led to
the assumption that we are moving to calling IRIs a "Protocol Element"
starting with the revision of RFC 3987. This is wrong.

RFC 3987 defines IRIs as a protocol element. Please see the first line
of the abstract at http://tools.ietf.org/html/rfc3987.

Also, please note that IRIs have been working, and are working, for a
long time in protocols/formats that are in no way new. The prime
example here is HTML (of course, there it works with some warts, but
no more warts than the average HTML feature).
Post by John C Klensin
But, if we are going to define a new protocol element for new
uses, then why stick with the basic URI syntax framework? We
already know that causes problems.
First, as said above, your presumption is wrong. Second, other solutions
have been shown to have problems too.
Post by John C Klensin
It is hard to localize
because it contains a lot of ASCII characters that are special
sometimes and not others, that may have non-Latin-script
lookalikes, and because parsing is method-dependent. That
method-dependency makes it very hard to create variations that
are appropriate to the local writing system because one has to
be method-sensitive at too many different points.
The fact that URI/IRI characters are sometimes special and sometimes not
comes from the fact that URIs/IRIs combine a lot of different
components, and from the desire of people to not have to escape more
than absolutely necessary. You can always just go ahead and escape all
delimiters, and be on the safe side, if you don't want to complicate
your life. This is completely independent of IRIs.
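The "escape all delimiters" option is a one-liner in most libraries; a Python sketch:

```python
# Escaping every reserved character in a data component removes the
# "special sometimes and not others" ambiguity, at the cost of looks.
from urllib.parse import quote

value = "a/b?c=d#e"
print(quote(value, safe=""))   # a%2Fb%3Fc%3Dd%23e
```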

The problem with non-Latin-script (or for that matter, even Latin
script) lookalikes is already present (and not solved (*)) in domain
names. It's also a problem in internationalized email addresses, because
there's ＠ (U+FF20 FULLWIDTH COMMERCIAL AT), a full-width variant of @.

[(*) IDNA 2003 had a partial solution, but IDNA 2008 abandoned it.]

As for method-dependent parsing, do you mean scheme-dependent parsing?
Given the wide variety of different syntax that all the various URI/IRI
schemes deal with, the amount of parsing that can be done generically is
actually pretty amazing, I'd think.
Post by John C Klensin
If some
protocols are to permit only IRIs, some only URIs, and some
both, it would also be beneficial to be able to determine which
is which, rather than wondering whether an IRI that actually
contains only ASCII characters (and no escapes) is actually an
IRI or is just the URI it looks like.
There is no "only IRIs". IRIs always include URIs. With that tweak,
let's rewrite the above sentence in two different ways:

If some protocols/formats/applications are to permit only ASCII domain
names, and others both ASCII and internationalized domain names, it
would also be beneficial to be able to determine which is which, rather
than wondering whether an IDN that actually contains only ASCII
characters is actually an IDN or is just the ASCII domain name it looks
like.

If some protocols/formats/applications are to permit only ASCII email
addresses, and others both ASCII and internationalized email addresses,
it would also be beneficial to be able to determine which is which,
rather than wondering whether an internationalized email address that
actually contains only ASCII characters is actually an internationalized
email address or is just the ASCII email address it looks like.

I don't see a problem, but if IRIs have a problem, so do IDNs and
internationalized email addresses.
Post by John C Klensin
I continue to believe that makes a strong case for doing
something that gets us internationalization by moving away from
the URI syntax model, probably to something that explicitly
identifies the data elements that make up a particular URI. If,
for example, one insisted that domain names be identified as
such wherever they appear, the mess about whether something can
or should be given IDNA treatment (even if only to verify
U-label syntax) and the associated RFC 6055 considerations
become much easier to handle than if one can to guess whether
something might be a domain name or something else with periods
in it.
This problem has three levels of difficulty.

1) For those schemes that follow the generic syntax (e.g. http,
ftp,...), the domain name is easy to find.

2) There are a few schemes that don't use generic syntax, but use
domain names. A typical example is mailto:. Here you need
scheme-specific processing.

3) Many URI schemes are open-ended. The typical example is the query
part of the http scheme, which can contain domain names or even
(suitably encoded) whole URIs. Here is an example; please note the
"www.ietf.org" at the end:
http://www.google.com/search?as_q=URI&as_sitesearch=www.ietf.org
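The three levels show up directly in a generic-syntax parser; a sketch with Python's standard library, where level 1 succeeds but the level-3 domain name remains opaque query data:

```python
# A generic-syntax parser finds the authority host (level 1), but the
# domain name inside the query (level 3) is just data; nothing marks
# "www.ietf.org" as a domain name.
from urllib.parse import urlsplit, parse_qs

u = "http://www.google.com/search?as_q=URI&as_sitesearch=www.ietf.org"
parts = urlsplit(u)
print(parts.hostname)          # www.google.com
print(parse_qs(parts.query))   # {'as_q': ['URI'], 'as_sitesearch': ['www.ietf.org']}
```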

It is rather trivial to come up with a kind of format/data structure for
this. I'll give a concrete example using XML, but of course, JSON or
some other popular format would also work. The details are mostly
bike-shedding.

<IRI>
  <scheme>http</scheme>
  <host type='dns'>
    <label>www</label>
    <label>google</label>
    <label>com</label>
  </host>
  <path>
    <segment>search</segment>
  </path>
  <query>
    <parameter>
      <name>as_q</name>
      <value>URI</value>
    </parameter>
    <parameter>
      <name>as_sitesearch</name>
      <value type='dns'>
        <label>www</label>
        <label>ietf</label>
        <label>org</label>
      </value>
    </parameter>
  </query>
</IRI>

Note that this duly identifies DNS 'stuff'. It's probably not too
difficult for anybody to figure out why
people/applications/formats/protocols use URIs/IRIs rather than
something like the example above. I'm leaving this as an "exercise for
the reader".
Post by John C Klensin
Stated a little differently, if IRIs are protocol elements that
are intended to support new protocols, then it seems to me that
it is not obvious that the URI syntax is a constraint.
Certainly the WG has not had a serious discussion about what the
advantages of that constraint are and whether they outweigh the
disadvantages.
I hesitate to refer to the charter of the IRI WG
(http://datatracker.ietf.org/wg/iri/charter/) because some aspects of it
(in particular the milestones) are hopelessly out of date. I see no
indication whatsoever about removing the URI syntax constraint, and many
indications that strongly (although not explicitly) contradict such a
proposal.


Please note that while IRIs are intended for new protocols (in the sense
that new protocols should preferably use IRIs and not just URIs), they
are also intended for "gradual" updates where that's appropriate, and
they are already used in many protocols/formats.


Regards, Martin.
Mark Nottingham
2012-07-04 06:13:12 UTC
I tend to agree with Peter.

The experience of using IRIs as identifiers in Atom was, IME, a disaster. Identifiers need to be resistant to spoofing and mistakes. Exposing a significant portion of the Unicode character plane in them doesn't do anyone any good.

As a presentation element? Fine, but AFAIK we Don't Do That Here. In places where Users touch (e.g., HTML)? Sure, but We Don't Do That here.

There may be a *few* places in protocols that are user-visible, but AFAICT we're not doing a lot of new protocols recently (thank goodness).
Post by Bjoern Hoehrmann
How would you like it if URIs could use only 20 of the 26 letters in the
English alphabet and you would have to encode, decode and convert them
all the time, or use awkward transliterations to avoid having to do so?
URIs already have a constrained syntax; you can't use certain characters in certain places. As long as people can put IRIs into HTML and browser address bars, I don't think they'll care.
Post by Martin J. Dürst
I think the real motivation would be people looking at HTTP traces and
preferring to see Unicode rather than lots of %HH strings. Of course the
number of people looking at HTTP traces is low, and they are not end users.
Is this use case really worth the pain, inefficiency, and very likely security vulnerabilities caused by transcoding from IRIs to URIs and back when hopping from HTTP 2.0 to 1.1 and back? I don't think so.


My English-centric .02; ŸṀṂṼ.

Regards,


--
Mark Nottingham http://www.mnot.net/
Bjoern Hoehrmann
2012-07-04 07:42:36 UTC
Post by Mark Nottingham
I tend to agree with Peter.
[...]
This doesn't really help me understand where you see problems with IRIs.
Could you take a simple example like http://björn.höhrmann.de/ and tell
me of some places where I should be unable to use that even though I can
use http://bjoern.hoehrmann.de/ in the same place, without arguing about
limitations of deployed protocols, software, or hardware, and without
arguing about issues that would arise anyway when displaying URIs, and
why I should be unable to use the non-URI IRI there?

Unhelpful arguments in the sense above would be "HTTP/2.0 should stick
to URIs because using IRIs there is a hassle when HTTP/2.0
implementations interact with HTTP/1.1 implementations", as that relies
on limitations of HTTP/1.1 implementations, or "IRIs with zero-width
spaces can be confused with ones without such spaces" as you'd have the
same issue when you turn URIs into IRIs "for display", and so on.
--
Björn Höhrmann · mailto:***@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
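[Editor's note: for readers unfamiliar with the mechanics behind Björn's example, an IRI such as http://björn.höhrmann.de/ maps to a URI by IDNA-encoding the host and percent-encoding non-ASCII octets elsewhere (RFC 3987, section 3.1). The following is a minimal sketch of that mapping using Python's standard library; it deliberately ignores userinfo, ports, IP literals, and scheme-specific rules, and is not the normative algorithm.]

```python
from urllib.parse import quote, urlsplit, urlunsplit

def iri_to_uri(iri):
    """Simplified IRI-to-URI conversion in the spirit of RFC 3987,
    section 3.1: IDNA-encode the host, percent-encode non-ASCII in the
    remaining components. Ignores userinfo, ports, and IP literals."""
    parts = urlsplit(iri)
    # str.encode('idna') applies nameprep + Punycode per label.
    host = parts.hostname.encode('idna').decode('ascii') if parts.hostname else ''
    # quote() never escapes unreserved characters; additionally keep the
    # sub-delims and extra pchars that are legal in these components.
    keep = "/~!$&'()*+,;=:@"
    return urlunsplit((parts.scheme,
                       host,
                       quote(parts.path, safe=keep),
                       quote(parts.query, safe=keep + '?'),
                       quote(parts.fragment, safe=keep + '?')))
```

With this sketch, http://björn.höhrmann.de/ becomes http://xn--bjrn-6qa.xn--hhrmann-90a.de/, while the all-ASCII http://bjoern.hoehrmann.de/ passes through unchanged, which is exactly the asymmetry the thread is arguing about.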
Mark Nottingham
2012-07-05 00:42:47 UTC
Permalink
Post by Bjoern Hoehrmann
This doesn't really help me understand where you see problems with IRIs.
Could you take a simple example like http://björn.höhrmann.de/ and tell
me of some places where I should be unable to use that even though I can
use http://bjoern.hoehrmann.de/ in the same place, without arguing about
limitations of deployed protocols, software, or hardware, and without
arguing about issues that would arise anyway when displaying URIs, and
why I should be unable to use the non-URI IRI there?
In protocols as identifiers, like an entry ID in Atom. They aren't exposed to end users, and really not to authors either.

Humans have lots of ideas about equivalence and transcription in text which machines are blissfully unaware of.

Regards,

--
Mark Nottingham http://www.mnot.net/
Roy T. Fielding
2012-07-05 20:10:52 UTC
Permalink
Post by Bjoern Hoehrmann
This doesn't really help me understand where you see problems with IRIs.
Could you take a simple example like http://björn.höhrmann.de/ and tell
me of some places where I should be unable to use that even though I can
use http://bjoern.hoehrmann.de/ in the same place, without arguing about
limitations of deployed protocols, software, or hardware, and without
arguing about issues that would arise anyway when displaying URIs, and
why I should be unable to use the non-URI IRI there?
The harm in the above example is how many aliases are created by
inconsistent encoding of the characters, how difficult we make
it for servers to route based on Host (or equivalents), and how
much risk we want to allow for less-interoperable forms. These
are all trade-offs; not hard rules.

The main problem with IRIs as protocol elements is aliasing and invalid
characters, not spoofing. Aliases create security holes if various
routines within the server + OS normalize them in different ways,
reduce cache efficiency, and interfere with page rank. Invalid UTF-8
sometimes results in the whole code sequence being ignored and other
times in only the valid part of the sequence being ignored (leaving the
next byte to be misinterpreted by the next round of parsing).
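[Editor's note: the aliasing Roy describes can be made concrete with a small illustration (the strings are hypothetical examples, not drawn from the thread): Unicode offers two canonically equivalent spellings of "ö", and they percent-encode to different URI strings.]

```python
import unicodedata
from urllib.parse import quote, unquote

# Two canonically equivalent spellings of the same visible character:
nfc = '\u00f6'                           # "ö" as a single code point
nfd = unicodedata.normalize('NFD', nfc)  # "o" + U+0308 combining diaeresis

print(quote(nfc))   # %C3%B6
print(quote(nfd))   # o%CC%88

# Byte-wise, the two percent-encoded forms never compare equal, so a
# server component that normalizes one way while another normalizes
# differently (or not at all) sees two names for the "same" resource --
# the aliasing and cache-efficiency problem described above.
assert unquote(quote(nfc)) != unquote(quote(nfd))
# Only after Unicode normalization do the decoded forms coincide:
assert unicodedata.normalize('NFC', unquote(quote(nfd))) == nfc
```

Only the origin server knows which normalization its filesystem or routing layer applies, which is the point about keeping the aliasing problem at the origin rather than in every intermediary.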

These problems can exist with pct-encoded UTF-8 as well, but they are
usually harmless if the origin server consistently redirects non-encoded
non-ASCII to the pct-encoded form and then uses a consistent routine
to do name mapping from URI form to native labels. In other words,
they are less of a problem because only the origin server needs to
deal with invalid or aliased pct-encodes, and intermediaries that
secure or load-balance based on the target URI can just work on the
pct-encoded patterns (leaving the UTF-8 form to be redirected by the
origin or some server-side intermediary).

IRIs are not used in HTML or XML. All references in those languages
are parsed as arbitrary strings with language-specific delimiting
and then converted to either a URI or something vaguely like it.
IRIs are not used in browser Location bars -- those are just arbitrary
string parsers that occasionally spit out a URI reference as a result.
IRIs are not used in waka because they would make gateways and fast
pattern matching more difficult and error-prone, which I consider
more of a concern than the potential saving in bytes.

In short, I believe that what potential users of the IRI protocol want
is a set of consistent presentation rules for displaying arbitrary
strings that might include pct-encodes and IDNA, and a simple routine
for converting an arbitrary string reference to a URI reference.
I think the idea of treating IRIs as a separate identifier space has
been harmful to its adoption by folks who already implement non-ASCII
identifiers via presentation and conversion. It is also confusing
to those who want to create new URI schemes but think that they also
need to define IRI schemes.

....Roy

Martin J. Dürst
2012-07-04 08:49:19 UTC
Permalink
Hello Mark,
Post by Mark Nottingham
I tend to agree with Peter.
The experience of using IRIs as identifiers in Atom was, IME, a disaster.
Can you be specific? Can you provide pointers?
Post by Mark Nottingham
Identifiers need to be resistant to spoofing and mistakes.
It's easy to create spoofing identifiers using ASCII/English only.

It's also not too difficult to create spoofing/mistake-resistant
identifiers in other scripts or languages, for people who are better
versed in these scripts/languages. This may be difficult to understand
for "English-centric" people, but it's indeed the case.
Post by Mark Nottingham
Post by Bjoern Hoehrmann
How would you like it if URIs could use only 20 of the 26 letters in the
english alphabet and you would have to encode, decode and convert them
all the time, or use awkward transliterations to avoid having to do so?
URIs already have a constrained syntax; you can't use certain characters in certain places.
Yes. But not being able to use certain punctuation is different from not
being able to use characters in the basic alphabet/character repertoire
of the language. It's easy to replace spaces with hyphens or whatever.
It's a different thing to replace one letter with another, or just drop it.
Post by Mark Nottingham
As long as people can put IRIs into HTML and browser address bars, I don't think they'll care.
Post by Bjoern Hoehrmann
I think the real motivation would be people looking at HTTP traces and
preferring to see Unicode rather than lots of %HH strings. Of course the
number of people looking at HTTP traces is low, and they are not end users.
Is this use case really worth the pain,
For that specific case, I'm not sure. That's why I used "would". But I
also don't think the pain would be that high.
Post by Mark Nottingham
inefficiency,
Conversion would indeed cost some cycles. But using raw bytes instead of
%-encoding would save bytes (which, these days, as far as I have
followed the SPDY debates so far, seems to be the more important side of
the tradeoff).
Post by Mark Nottingham
and very likely security vulnerabilities caused by transcoding from IRIs to URIs and back when hopping from HTTP 2.0 to 1.1 and back? I don't think so.
There are quite a lot of places where security blunders can happen. That
conversion step wouldn't be the first one and wouldn't be the last one.
And using %-encoding for basic ASCII characters is already allowed
today, so the basic security vulnerability (firewalls can't just check
on character strings) already exists today.
Post by Mark Nottingham
My English-centric .02; ŸṀṂṼ.
您里可变 (this is not real Chinese, but just four roughly corresponding
characters put together).

Regards, Martin.
David Clarke
2012-07-04 09:45:16 UTC
Permalink
I've been reading this thread with interest. I'm wondering how the
originator would feel if URIs had been defined to use digits and
punctuation only with no alphabetic characters?

From the point of view of someone who doesn't natively use the Latin
alphabet, that is equivalent to what he is proposing. Most literate
people in the world are able to use the Latin alphabet, but will be
better at recognising errors in their native script (when programming
etc), and more likely to be able to remember host names, without error,
that are in their native script.

As far as spoofing goes, in most typefaces, there are already confusions
between 1 (DIGIT ONE), l (LOWER CASE LATIN LETTER L), I (UPPER CASE
LATIN LETTER I) and between O (UPPER CASE LATIN LETTER O) and 0 (DIGIT
ZERO). Would it be reasonable to propose removal of those characters
from URLs to reduce spoofing?
Bjoern Hoehrmann
2012-07-04 10:36:33 UTC
Permalink
Post by David Clarke
I've been reading this thread with interest. I'm wondering how the
originator would feel if URIs had been defined to use digits and
punctuation only with no alphabetic characters?
The spoofing problem seems to be a sidetrack here, as far as humans go
only the "cookie domain" really matters; and for machines there is less
of a spoofing and more of a robustness problem: machines would not be
fooled, but they might implement conversions and comparisons
incorrectly. And domain names can have non-ASCII even in URIs; whether
you display them and how is an issue either way.
Post by David Clarke
As far as spoofing goes, in most typefaces, there are already confusions
between 1 (DIGIT ONE), l (LOWER CASE LATIN LETTER L), I (UPPER CASE
LATIN LETTER I) and between 0 (UPPER CASE LATIN LETTER O) and 0 (DIGIT
ZERO). Would it be reasonable propose removal of those characters from
URLs to reduce spoofing?
The typical response to that would be that people do not want to make it
any worse. And as you note, forcing people to choose characters from a
rather limited set might actually make it harder for them to avoid some
spoofed address as they do not readily recognize what is being encoded.
--
Björn Höhrmann · mailto:***@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/