Hello John,
Post by John C Klensin
(sorry - sent from wrong address)
[Sorry for forwarding as a moderator, I missed this one at first.
I have added the other address to the ignore list, so you should be able
to post from either in the future.]
Post by John C Klensin
--On Monday, June 25, 2012 18:22 +0900 "\"Martin J. Dürst\""
Post by Martin J. Dürst
As Björn said, it's really more about new protocols than
about upgrades. Also, different protocols (and formats) can
upgrade in different ways. Sometimes, this can be done
formally with extensions, at other times it's done gradually
and sooner or later gets accepted in a spec. For other cases,
of course, it may never happen.
...
For whatever it is worth, I don't find that answer particularly
helpful. My problem with it is one that we have discussed
pieces of before. If the requirement were to make something
that was coupled closely enough to URIs to be a UI overlay, then
we have one set of issues. The WG has moved beyond that into
precisely what you are commenting on above and that the key
draft seems to reflect -- a new protocol element to be used
primarily in new, or radically updated/upgraded, protocols.
It looks as if some of the discussion in the IRI WG might have led to
the assumption that we are moving to calling IRIs a "Protocol Element"
starting with the revision of RFC 3987. This is wrong.
RFC 3987 defines IRIs as a protocol element. Please see the first line
of the abstract at http://tools.ietf.org/html/rfc3987.
Also, please note that IRIs have long worked, and still work, in
protocols/formats that are in no way new. The prime
example here is HTML (of course, there it works with some warts, but
no more warts than the average HTML feature).
Post by John C Klensin
But, if we are going to define a new protocol element for new
uses, then why stick with the basic URI syntax framework? We
already know that causes problems.
First, as said above, your presumption is wrong. Second, other solutions
have been shown to have problems too.
Post by John C Klensin
It is hard to localize
because it contains a lot of ASCII characters that are special
sometimes and not others, that may have non-Latin-script
lookalikes, and because parsing is method-dependent. That
method-dependency makes it very hard to create variations that
are appropriate to the local writing system because one has to
be method-sensitive at too many different points.
The fact that URI/IRI characters are sometimes special and sometimes not
comes from the fact that URIs/IRIs combine a lot of different
components, and from the desire of people to not have to escape more
than absolutely necessary. You can always just go ahead and escape all
delimiters, and be on the safe side, if you don't want to complicate
your life. This is completely independent of IRIs.
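
A minimal sketch of the "escape all delimiters" approach, using Python's standard urllib.parse module. The sample value is hypothetical; passing safe="" tells quote() to percent-encode every reserved character, including "/", which it leaves alone by default.

```python
from urllib.parse import quote, unquote

# Hypothetical component value that happens to contain URI delimiters.
value = "a/b?c=d&e#f"

# safe="" escapes every reserved character, so the result can be dropped
# into any component without being mistaken for structure.
escaped = quote(value, safe="")
print(escaped)  # a%2Fb%3Fc%3Dd%26e%23f

# Unescaping recovers the original value unchanged.
print(unquote(escaped) == value)  # True
```

The cost is readability: the escaped form is always safe, but it is longer and harder to read than a form that escapes only what is strictly necessary.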
The problem with non-Latin-script (or for that matter, even Latin
script) lookalikes is already present (and not solved (*)) in domain
names. It's also a problem in internationalized email addresses, because
there's ＠ (U+FF20), a full-width variant of @.
[(*) IDNA 2003 had a partial solution, but IDNA 2008 abandoned it.]
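
The full-width lookalike can be illustrated with Python's standard unicodedata module. This is only a sketch: the helper name is_compat_lookalike is made up for illustration, and NFKC normalization catches compatibility variants such as U+FF20 but not cross-script lookalikes.

```python
import unicodedata

FULLWIDTH_AT = "\uFF20"  # FULLWIDTH COMMERCIAL AT
ASCII_AT = "@"

def is_compat_lookalike(a: str, b: str) -> bool:
    """Hypothetical helper: True if a and b differ but are equal under NFKC."""
    nfkc = lambda s: unicodedata.normalize("NFKC", s)
    return a != b and nfkc(a) == nfkc(b)

print(unicodedata.name(FULLWIDTH_AT))               # FULLWIDTH COMMERCIAL AT
print(is_compat_lookalike(FULLWIDTH_AT, ASCII_AT))  # True

# Cyrillic small a (U+0430) looks like Latin "a" but is not a
# compatibility variant, so normalization alone does not catch it.
print(is_compat_lookalike("\u0430", "a"))           # False
```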
As for method-dependent parsing, do you mean scheme-dependent parsing?
Given the wide variety of different syntax that all the various URI/IRI
schemes deal with, the amount of parsing that can be done generically is
actually pretty amazing, I'd think.
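
As a rough illustration of how far generic parsing goes, the sketch below applies the regular expression from RFC 3986, Appendix B, to two schemes with very different syntax. The function name split_generic and the address someone@example.org are made up for illustration.

```python
import re

# Regular expression from RFC 3986, Appendix B. It splits any URI
# reference into its five generic components without knowing the scheme.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def split_generic(ref: str) -> dict:
    """Hypothetical helper: return the generic components of a URI reference."""
    m = URI_RE.match(ref)  # every part is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

for ref in ("http://www.ietf.org/rfc/rfc3987.txt#section-1",
            "mailto:someone@example.org"):
    print(split_generic(ref))
```

The same expression handles both: for the http URI it finds an authority, for the mailto URI it reports none and leaves the address in the path, where scheme-specific processing would take over.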
Post by John C Klensin
If some
protocols are to permit only IRIs, some only URIs, and some
both, it would also be beneficial to be able to determine which
is which, rather than wondering whether an IRI that actually
contains only ASCII characters (and no escapes) is actually an
IRI or is just the URI it looks like.
There is no "only IRIs". IRIs always include URIs. With that tweak,
let's rewrite the above sentence in two different ways:
If some protocols/formats/applications are to permit only ASCII domain
names, and others both ASCII and internationalized domain names, it
would also be beneficial to be able to determine which is which, rather
than wondering whether an IDN that actually contains only ASCII
characters is actually an IDN or is just the ASCII domain name it looks
like.
If some protocols/formats/applications are to permit only ASCII email
addresses, and others both ASCII and internationalized email addresses,
it would also be beneficial to be able to determine which is which,
rather than wondering whether an internationalized email address that
actually contains only ASCII characters is actually an internationalized
email address or is just the ASCII email address it looks like.
I don't see a problem, but if IRIs have a problem, so do IDNs and
internationalized email addresses.
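
The point that IRIs always include URIs can be sketched as follows. This is a simplification of the IRI-to-URI mapping in RFC 3987, Section 3.1: it only percent-encodes the UTF-8 bytes of non-ASCII characters and ignores normalization and the option of converting the host part with IDNA. The function name iri_to_uri and the example.org addresses are made up for illustration.

```python
def iri_to_uri(iri: str) -> str:
    """Simplified sketch: percent-encode the UTF-8 bytes of non-ASCII characters."""
    out = []
    for ch in iri:
        if ord(ch) < 128:
            out.append(ch)  # ASCII passes through untouched
        else:
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
    return "".join(out)

# An IRI with only ASCII characters maps to itself: it already is a URI.
print(iri_to_uri("http://example.org/rose"))       # http://example.org/rose

# A non-ASCII character is replaced by its percent-encoded UTF-8 bytes.
print(iri_to_uri("http://example.org/ros\u00e9"))  # http://example.org/ros%C3%A9
```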
Post by John C Klensin
I continue to believe that makes a strong case for doing
something that gets us internationalization by moving away from
the URI syntax model, probably to something that explicitly
identifies the data elements that make up a particular URI. If,
for example, one insisted that domain names be identified as
such wherever they appear, the mess about whether something can
or should be given IDNA treatment (even if only to verify
U-label syntax) and the associated RFC 6055 considerations
become much easier to handle than if one has to guess whether
something might be a domain name or something else with periods
in it.
This problem has three levels of difficulty.
1) For those schemes that follow the generic syntax (e.g. http,
ftp,...), the domain name is easy to find.
2) There are a few schemes that don't use generic syntax, but use
domain names. A typical example is mailto:. Here you need
scheme-specific processing.
3) Many URI schemes are open-ended. The typical example is the query
part of the http scheme, which can contain domain names or even
(suitably encoded) whole URIs. This is an example, please note the
"www.ietf.org" at the end:
http://www.google.com/search?as_q=URI&as_sitesearch=www.ietf.org
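
The three levels can be sketched with Python's standard urllib.parse module. The helper names and the address someone@example.org are made up for illustration, and the mailto handling is deliberately naive (a single address, no header fields).

```python
from urllib.parse import urlsplit, parse_qs

def host_generic(uri: str):
    """Level 1: generic syntax, the host sits in the authority component."""
    return urlsplit(uri).hostname

def host_mailto(uri: str) -> str:
    """Level 2: scheme-specific rule, the domain follows the last '@'."""
    parts = urlsplit(uri)
    if parts.scheme != "mailto":
        raise ValueError("not a mailto URI")
    return parts.path.rpartition("@")[2]

url = "http://www.google.com/search?as_q=URI&as_sitesearch=www.ietf.org"

print(host_generic(url))                           # www.google.com
print(host_generic("mailto:someone@example.org"))  # None
print(host_mailto("mailto:someone@example.org"))   # example.org

# Level 3: the query value is just a string. Nothing in the syntax says
# that "www.ietf.org" is a domain name; only the application knows.
print(parse_qs(urlsplit(url).query)["as_sitesearch"])  # ['www.ietf.org']
```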
It is rather trivial to come up with a kind of format/data structure for
this. I'll give a concrete example using XML, but of course, JSON or
some other popular format would also work. The details are mostly
bike-shedding.
<IRI>
  <scheme>http</scheme>
  <host type='dns'>
    <label>www</label>
    <label>google</label>
    <label>com</label>
  </host>
  <path>
    <segment>search</segment>
  </path>
  <query>
    <parameter>
      <name>as_q</name>
      <value>URI</value>
    </parameter>
    <parameter>
      <name>as_sitesearch</name>
      <value type='dns'>
        <label>www</label>
        <label>ietf</label>
        <label>org</label>
      </value>
    </parameter>
  </query>
</IRI>
Note that this duly identifies DNS 'stuff'. It's probably not too
difficult for anybody to figure out why
people/applications/formats/protocols use URIs/IRIs rather than
something like the example above. I'm leaving this as an "exercise for
the reader".
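
For comparison, the sketch below turns the structured form back into the one-line form using Python's standard xml.etree.ElementTree module. The element names follow the example above, with the path element properly closed. The function names are made up for illustration, and no percent-encoding is applied, which is enough for this example only.

```python
import xml.etree.ElementTree as ET

# The structured example from above, written compactly.
STRUCTURED = """
<IRI>
  <scheme>http</scheme>
  <host type='dns'><label>www</label><label>google</label><label>com</label></host>
  <path><segment>search</segment></path>
  <query>
    <parameter><name>as_q</name><value>URI</value></parameter>
    <parameter><name>as_sitesearch</name>
      <value type='dns'><label>www</label><label>ietf</label><label>org</label></value>
    </parameter>
  </query>
</IRI>
"""

def dotted(elem) -> str:
    """Join the label children of a DNS-typed element with dots."""
    return ".".join(label.text for label in elem.findall("label"))

def to_iri(root) -> str:
    """Hypothetical serializer: rebuild the one-line form (no escaping)."""
    path = "".join("/" + seg.text for seg in root.find("path").findall("segment"))
    pairs = []
    for param in root.find("query").findall("parameter"):
        value = param.find("value")
        text = dotted(value) if value.get("type") == "dns" else value.text
        pairs.append(param.findtext("name") + "=" + text)
    return (root.findtext("scheme") + "://" + dotted(root.find("host"))
            + path + "?" + "&".join(pairs))

print(to_iri(ET.fromstring(STRUCTURED)))
# http://www.google.com/search?as_q=URI&as_sitesearch=www.ietf.org
```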
Post by John C Klensin
Stated a little differently, if IRIs are protocol elements that
are intended to support new protocols, then it seems to me that
it is not obvious that the URI syntax is a constraint.
Certainly the WG has not had a serious discussion about what the
advantages of that constraint are and whether they outweigh the
disadvantages.
I hesitate to refer to the charter of the IRI WG
(http://datatracker.ietf.org/wg/iri/charter/) because some aspects of it
(in particular the milestones) are hopelessly out of date. I see no
indication whatsoever about removing the URI syntax constraint, and many
indications that strongly (although not explicitly) contradict such a
proposal.
Please note that while IRIs are intended for new protocols (in the sense
that new protocols should preferably use IRIs and not just URIs), they
are also intended for "gradual" updates where that's appropriate, and
they are already used in many protocols/formats.
Regards, Martin.