iri issue tracker
2012-07-19 22:04:09 UTC
#131: Using document charset causes interoperability problems
As reported by Dave Thaler...
URIs and/or IRIs can appear in many contexts: in normal text in an email
message, in a PDF file or Word doc, or whatever else.
Allowing the charset used to interpret them to vary complicates frameworks
considerably, since the doc charset now has to be passed from whatever
extracts the URI from the document (HTML or otherwise) to whatever else
needs to know the interpretation (normalizer code, comparison code,
whatever). Various API frameworks already have "Uri" classes that take in
a URI-like string and let you get the URI form, the IRI form, or the
individual components. The constructor would need to change, since
you cannot correctly interpret an IRI(bis) without knowing the document
charset.
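To illustrate the API impact, here is a rough Python sketch. The `Uri` class, its `doc_charset` parameter and the `to_uri` method are hypothetical names made up for this note and are not taken from any real framework; the example URL is also invented. It only shows that the same string yields different URIs depending on a charset the caller has to supply.

```python
from urllib.parse import quote

# Characters left untouched: URI delimiters, plus "%" for existing escapes.
SAFE = ":/?#[]@!$&'()*+,;=%"


class Uri:
    """Hypothetical class for illustration; not any real framework's API."""

    def __init__(self, text, doc_charset="utf-8"):
        # The extra parameter is the point: whoever extracts the string
        # from a document must pass the document charset along with it.
        self.text = text
        self.doc_charset = doc_charset

    def to_uri(self):
        # Simplification: no fragment or IDN handling. Everything before
        # "?" is pct-encoded as UTF-8, the query in the document charset.
        head, sep, query = self.text.partition("?")
        return (quote(head, safe=SAFE, encoding="utf-8")
                + sep
                + quote(query, safe=SAFE, encoding=self.doc_charset))


text = "http://example.com/search?q=\u00e9"
without_charset = Uri(text)                         # caller did not know it
with_charset = Uri(text, doc_charset="iso-8859-1")  # caller passed it along

# Comparison code that only sees the resulting URIs treats them as different.
print(without_charset.to_uri())
print(with_charset.to_uri())
print(without_charset.to_uri() == with_charset.to_uri())
```

Any code path that constructs such an object without the charset silently falls back to the default, which is where the inconsistent behavior would come from.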
I'm not yet convinced that's a change worth making. Currently everything
assumes UTF-8. With this change, we'll get random behavior until
everything is updated, which is a state worse than today in my view.
Example:
http://www.sw.it.aoyama.ac.jp/non-existent?é
If the charset were iso-8859-1 then under RFC 3987 as I understand it,
this would become:
http://www.sw.it.aoyama.ac.jp/non-existent?%C3%83%C2%A9
In other words, you have to convert iso-8859-1 to UTF-8 and then pct-
encode the UTF-8.
But as I understand 3987bis it would become:
http://www.sw.it.aoyama.ac.jp/non-existent?%C3%A9
which would then be passed around via various APIs and protocols that
would not pass the charset along with it. As such it would be interpreted
by the receiving code as pct-encoded UTF-8:
http://www.sw.it.aoyama.ac.jp/non-existent?é
which of course it isn't.
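The values in this example can be reproduced with a few lines of Python. This is only a sketch under one assumption that is mine and not stated in the report: the query as stored in the document is the byte pair C3 A9, and the document is labeled iso-8859-1. The variable names are illustrative, and which result corresponds to RFC 3987 or to 3987bis follows the reading given above, not the text of either spec.

```python
from urllib.parse import quote, unquote

# Assumption: the query bytes as they sit in the document.
raw_query = b"\xc3\xa9"
doc_charset = "iso-8859-1"

# What the document actually says once decoded with its declared charset:
# two characters, U+00C3 and U+00A9.
chars = raw_query.decode(doc_charset)

# RFC 3987 reading: convert to UTF-8, then pct-encode the UTF-8 bytes.
rfc3987_form = quote(chars, safe="", encoding="utf-8")

# 3987bis reading: pct-encode the bytes in the document charset.
bis_form = quote(chars, safe="", encoding=doc_charset)

# A receiver that is not told the charset assumes pct-encoded UTF-8.
seen_by_receiver = unquote(bis_form, encoding="utf-8")
round_trip_3987 = unquote(rfc3987_form, encoding="utf-8")

print(rfc3987_form)               # %C3%83%C2%A9
print(bis_form)                   # %C3%A9
print(seen_by_receiver == chars)  # False: the receiver sees one character
print(round_trip_3987 == chars)   # True: UTF-8 survives the hand-off
```

Only the UTF-8 form round-trips through a receiver that has no charset information.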
As such, we should make the RFC 3987 behavior (UTF-8, NOT the doc charset)
required for everything that doesn't explicitly pass the charset along
with the URI.
--
-----------------------+--------------------------------------
Reporter: stpeter@… | Owner: draft-ietf-iri-3987bis@…
Type: defect | Status: new
Priority: major | Milestone:
Component: 3987bis | Version:
Severity: - | Keywords:
-----------------------+--------------------------------------
Ticket URL: <http://trac.tools.ietf.org/wg/iri/trac/ticket/131>
iri <http://tools.ietf.org/wg/iri/>