Hello Dave,
Sorry to be very late with my answer.
Post by Dave Thaler
Personally I dislike the change to allow using the document charset and prefer the 3987
behavior.
I also very much dislike this! I very much wish we could fix this!
Just in case you know a way to convince the IE team at Microsoft to fix
this, please tell us.
Some background on how browsers ended up handling query encoding the
way they do is in a P.S. to this mail.
Post by Dave Thaler
On the question of "other than HTML", URIs and/or IRIs can appear in many contexts...
In normal text in an email message, or in a PDF file or Word doc or whatever else.
Yes indeed.
Post by Dave Thaler
Allowing it to vary complicates frameworks considerably since now the doc charset
has to be passed from whatever extracts the URI from the document (HTML or otherwise)
and whatever else needs to know the interpretation (normalizer code, comparison code,
whatever). Various API frameworks already have various sorts of "Uri" classes that
take in a URI-like string and let you do things like get the URI form or the IRI form,
or various components or whatever. Of course those would have to change for
any bis, but this also means the constructor needs to change since you cannot
correctly interpret an IRI(bis) without knowing the document charset.
This is indeed a very important point. Libraries and tooling are too
often overlooked.
I think the current draft also doesn't say anything about cases where
"document charset" information is not available (e.g. when you type a
query part into a browser bar, or when a query part appears on a napkin).
We should make sure it says that in that case, UTF-8 is used.
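To illustrate the library point, here is a minimal Python sketch of what an IRI-to-URI helper might have to look like if the special case in the draft stays. The function name `map_query`, the `doc_charset` parameter, and the `UCS_BASED` set are assumptions made up for this illustration and do not come from the draft. Only the query component is converted, and a missing charset falls back to UTF-8 as suggested above.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Charsets treated as "UCS-based" in this sketch (an assumption, not a list from the draft).
UCS_BASED = {"utf-8", "utf-16", "utf-32"}

def map_query(iri, doc_charset=None):
    """Percent-encode the query component of `iri` (hypothetical helper).

    The document charset is used only for http/https and only when it is
    not UCS-based; in every other case, including when no charset is
    known, UTF-8 is used. Other components are left untouched here.
    """
    parts = urlsplit(iri)
    charset = (doc_charset or "utf-8").lower()
    if parts.scheme not in ("http", "https") or charset in UCS_BASED:
        charset = "utf-8"
    # Keep query delimiters and existing %-escapes as they are.
    query = quote(parts.query, safe="=&;/?:@!$'()*+,%", encoding=charset)
    return urlunsplit(parts._replace(query=query))

print(map_query("http://example.org/non-existent?résumé", "iso-8859-1"))
print(map_query("http://example.org/non-existent?résumé"))
```

The extra `doc_charset` argument is exactly the piece that every caller between the document parser and the normalizer or comparison code would have to pass along.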
Post by Dave Thaler
I'm not yet convinced that's a change worth making.
Do you see a chance to convince the IE team to fix this?
We'd then also have to convince the Mozilla and WebKit folks.
If we can't convince them, then our only hope is that UTF-8 content is
increasing steadily on the Web (IEEE Spectrum showed a graph provided by
Mark Davis that had UTF-8 (without pure ASCII) at 60%). I think we
should be careful to make sure that we write the spec so that it doesn't
make things overly complicated in a world where essentially all Web
pages are UTF-8.
Regards, Martin.
P.S.: And here is the story of why query parts are treated the (odd!)
way they are in browsers.
In the mid-'90s, Web pages in all kinds of encodings started to show
up. CGI scripts took in data from forms, and there was a serious
problem: In what encoding should the form data be sent back to the
server? It was the most frequent question asked on mailing lists related
to I18N and the Web, and at the Unicode conference.
RFC 2070 (HTML I18N, now historic) introduced the accept-charset
attribute (see http://tools.ietf.org/html/rfc2070#section-5.1), but that
was not implemented by browsers. A convention started to emerge, which
was that the character encoding of the document containing the form
would be used.
This was taken over by HTML4 (see
http://www.w3.org/TR/html4/interact/forms.html#adef-accept-charset (*)),
although the accept-charset attribute was moved from individual fields
to the form element itself. The accept-charset attribute for a long time
was not implemented, but it finally got implemented when Mozilla got
totally re-implemented, mostly according to spec, and then it spread to
other browsers, to the extent that it ended up in HTML5 (see
http://www.w3.org/TR/html5/the-form-element.html#attr-form-accept-charset).
So for forms, we are all set: you can have a page in Shift_JIS with a
form that uses UTF-8 for application/x-www-form-urlencoded, which means
that you can display the query part as an IRI, or you could have the
reverse, which means that you have to use %-encoding for the Shift_JIS
bytes.
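As a rough illustration of the two cases, the following Python sketch encodes the same form field once as UTF-8 and once as Shift_JIS. The field name `q` and the value are made-up examples, and `urlencode` here only stands in for what a browser does when it serializes a form as application/x-www-form-urlencoded.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical form data: one text field named "q".
fields = {"q": "日本語"}

# Form with accept-charset="UTF-8": the bytes behind the %-escapes are UTF-8,
# so the query part can be shown as an IRI.
as_utf8 = urlencode(fields, encoding="utf-8")

# Form submitted in Shift_JIS: the %-escapes carry Shift_JIS bytes.
as_sjis = urlencode(fields, encoding="shift_jis")

print(as_utf8)
print(as_sjis)

# The server has to decode with the same encoding the form used.
print(parse_qs(as_sjis, encoding="shift_jis"))
```

Decoding the Shift_JIS variant as UTF-8 on the server side would not give back the original text, which is why the encoding has to be agreed on somehow.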
The problem with all this is that browser makers thought that a query
part in a complete IRI (e.g. in the href attribute of an <a> element or
the src attribute of an <img> element) is just like a form, and so
should use the document charset. RFC 3987 nowhere mentions that the
query part should be treated differently from the rest of the IRI, but
in hindsight, it might have been a good idea to put a big reminder into
RFC 3987, saying "all this also applies to the query part". And of
course there's no accept-charset attribute for <a> or <img>.
(*) There was a small tweak, in that in RFC 2070, the accept-charset
attribute was on each (textual) form element, but for HTML4, we moved it
to the form element itself.
Post by Dave Thaler
-Dave
-----Original Message-----
Sent: Tuesday, July 10, 2012 4:27 AM
To: Larry Masinter
Subject: Re: [iri] #128: use of the term 'origin'
Post by Larry Masinter
does this apply to any format other than HTML? I'm not sure that this
applies to anything else... Within image/svg+xml, for example? The notion of
document charset doesn't apply to some formats.
Hello Larry,
Very good idea to test this. I tested the various browsers that I have, looking
at the actual requests in Wireshark, everything on Windows 7.
The test consisted of the attached SVG file in iso-8859-1 with a link to an
existing domain but a non-existing page with a query part with non-ASCII
characters.
GET /non-existent?r%C3%A9sum%C3%A9 HTTP/1.1\r\n This means the
query part is sent as percent-encoded UTF-8.
GET /non-existent?r%E9sum%E9 HTTP/1.1\r\n This means that the query
part is sent as percent-encoded iso-8859-1.
GET /non-existent?r\351sum\351 HTTP/1.1\r\n This means that the query
part is sent as RAW iso-8859-1.
GET /non-existent?r%E9sum%E9 HTTP/1.1\r\n This means that the query
part is sent as percent-encoded iso-8859-1.
GET /non-existent?r%E9sum%E9 HTTP/1.1\r\n This means that the query
part is sent as percent-encoded iso-8859-1.
With the exception of Opera, SVG seems to follow HTML. But there are SVG
user agents that are not browsers. If somebody has one of these, please run
this test and tell us what you got.
Also, there are formats other than HTML and SVG.
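For anybody repeating the test, this small Python sketch shows how the three forms seen in the captures come about, and how a captured query part could be sorted into one of them. The `classify` helper and its labels are hypothetical names for this illustration; note that a pure-ASCII query would also be reported as UTF-8, since it cannot be told apart.

```python
from urllib.parse import quote, unquote_to_bytes

query = "résumé"

pct_utf8 = quote(query, encoding="utf-8")         # percent-encoded UTF-8
pct_latin1 = quote(query, encoding="iso-8859-1")  # percent-encoded iso-8859-1
raw_latin1 = query.encode("iso-8859-1")           # raw iso-8859-1 bytes (0xE9 is octal \351)

def classify(raw_query: bytes) -> str:
    """Sort the query bytes from a captured request line into one of three bins."""
    if any(b > 0x7F for b in raw_query):
        return "raw"  # non-ASCII bytes sent without %-encoding
    decoded = unquote_to_bytes(raw_query)
    try:
        decoded.decode("utf-8")
        return "percent-encoded UTF-8"
    except UnicodeDecodeError:
        return "percent-encoded, not UTF-8"

for sample in (pct_utf8.encode("ascii"), pct_latin1.encode("ascii"), raw_latin1):
    print(sample, "->", classify(sample))
```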
Regards, Martin.
Post by Larry Masinter
-----Original message-----
Sent: Mon, Jun 11, 2012 19:38:45 GMT+00:00
Subject: Re: [iri] #128: use of the term 'origin'
#128: use of the term 'origin'
While reviewing 3987bis for i18n terminology, I came across this
For compatibility with existing deployed HTTP infrastructure, the
following special case applies for schemes "http" and "https" and
IRIs whose origin has a document charset other than one which is UCS-
based (e.g., UTF-8 or UTF-16). In such a case, the "query" component
of an IRI is mapped into a URI by using the document charset rather
than UTF-8 as the binary representation before pct-encoding. This
mapping is not applied for any other scheme or component.
The term 'origin' could be ambiguous here. It doesn't seem to be
referencing the Web Origin Concept (RFC 6454) but instead seems to be
based on the "document" (broadly construed) in which the http or https
URL is found (e.g., as a hyperlink in an HTML document or perhaps as
running text in an email message). It would be good to make that clear.
One way to remove the ambiguity would be to change "origin" here to
something else, but even then I think we'd need additional text. I
For compatibility with existing deployed HTTP infrastructure, the
following special case applies for the schemes "http" and "https"
when an IRI is found in a document whose charset is not based on UCS
(e.g., not UTF-8 or UTF-16). In such a case, the "query" component
of an IRI is mapped into a URI by using the document charset rather
than UTF-8 as the binary representation before pct-encoding. This
mapping is not applied for any other scheme or component.
--
-----------------------+---------------------------------------
Type: defect | Status: new
Keywords: |
-----------------------+---------------------------------------
Ticket
URL:<http://trac.tools.ietf.org/wg/iri/trac/ticket/128#comment:1>
iri<http://tools.ietf.org/wg/iri/>