Clarifying the URL Standard goals
Anne van Kesteren
2012-11-07 21:51:46 UTC
I listened to the audio recording of the meeting and I feel my emails
have been largely misunderstood. I thought I would try again.

As far as the interests of the IETF seem to go, this is what
http://url.spec.whatwg.org/ attempts to do:

* Define the string syntax for URLs (IRI-references, if you wish).

* Define parsing the string syntax into a model (IRI-references ->
URI, if you wish).

* Define serializing the model back into a string syntax.


The string syntax seems non-controversial.

The parsing seems controversial, but the plan is to add an option
there to only parse conforming string syntax and bail on the first
error. (Going past the first error is useful for e.g. browsers /
search engines / wget so they can interoperate with content, but also
for URL validators that want to highlight more than one error in the
URL.)
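
The two parsing modes described above can be sketched as follows. This is only an illustration: the function name, the `stop_at_first` parameter, and the two stand-in error conditions (space and backslash) are assumptions made for the example, not the rules of the URL Standard.

```python
from typing import List, Tuple

# Assumption: two stand-in error conditions, not the spec's actual rule set.
SUSPECT = {" ": "space is not allowed", "\\": "backslash is not allowed"}


def find_syntax_errors(url: str, stop_at_first: bool = False) -> List[Tuple[int, str]]:
    """Return (index, message) pairs for characters flagged as errors.

    With stop_at_first=True the scan bails on the first error (strict mode);
    otherwise it keeps going, as a validator highlighting every error would.
    """
    errors: List[Tuple[int, str]] = []
    for index, char in enumerate(url):
        if char in SUSPECT:
            errors.append((index, SUSPECT[char]))
            if stop_at_first:
                break
    return errors


all_errors = find_syntax_errors("http://a b\\c")
first_only = find_syntax_errors("http://a b\\c", stop_at_first=True)
```

Here `all_errors` holds one entry per offending character, while `first_only` holds a single entry.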

The model is currently only documented as a function of the parsing
algorithm. My plan was to align implementations on parsing first
before documenting what model that implied. I already indicated how
this model is incompatible with URI. E.g. a query with a lone "%" can
be the output of the parser, but is definitely not a URI. (This is
why "fixup" is the wrong way to look at this algorithm I think.)
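
To make the lone "%" case concrete: in RFC 3986 a "%" is only valid as the start of a percent-encoded triplet ("%" followed by two hex digits). A minimal check, with a hypothetical helper name, might look like this:

```python
import re

# A "%" is "lone" when it is not followed by exactly two hex digits.
LONE_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def has_lone_percent(query: str) -> bool:
    """True if the query contains a "%" that does not start a %XX triplet."""
    return LONE_PERCENT.search(query) is not None


lone = has_lone_percent("%")         # query from http://www.w3.org/?%
escaped = has_lone_percent("q=%25")  # properly escaped percent sign
```

A query for which `has_lone_percent` returns True cannot be part of a valid URI, even though the parser may produce it.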

Serializing seems non-controversial.


(I am not sure if it is worth mentioning, but I am doing this work
largely by myself and thus far unpaid. Although browser vendors have
indicated they appreciate what I am doing, I am not representing any
of them, and I definitely care about software other than browsers. My
experience is that what browsers do leaks throughout the ecosystem and
I would rather have what is leaked documented and confined than everyone
having to reverse engineer each other in a race to the bottom.)
--
http://annevankesteren.nl/
Julian Reschke
2012-11-08 08:22:52 UTC
Post by Anne van Kesteren
...
* Define parsing the string syntax into a model (IRI-references ->
URI, if you wish).
...
But then, many would prefer if this was exposed as different
functions/algorithms -- parsing into components, and relative
resolution. I hear you don't need that in browsers, but that shouldn't
mean you don't need it elsewhere.

Best regards, Julian
Martin J. Dürst
2012-11-08 10:11:58 UTC
Hello Anne,

Many thanks for your email.

[somewhat reordered]
Post by Anne van Kesteren
(I am not sure if it is worth mentioning, but I am doing this work
largely by myself and thus far unpaid. Although browser vendors have
indicated they appreciate what I am doing,
Not only is it worth mentioning, I suspect it may also show a
symptomatic problem in the area of URI/IRI/URLs: Everybody (including of
course the browser vendors) thinks it's important, but nobody feels it's
important enough to spend valuable employee time. Or is it pure
coincidence that you have time for this work now that you are not
employed? (Sorry if I'm reading too much into the situation.)

I would also want to take this occasion to thank you again for your
work. This is something I wish had been happening two or three years
ago, but we can't change that anymore.
Post by Anne van Kesteren
I am not representing any
of them, and I definitely care about software other than browsers. My
experience is that what browsers do leaks throughout the ecosystem and
I would rather have what is leaked documented and confined than everyone
having to reverse engineer each other in a race to the bottom.)
Even if it's only for the browsers, documenting and confining makes a
lot of sense.
Post by Anne van Kesteren
I listened to the audio recording of the meeting and I feel my emails
have been largely misunderstood. I thought I would try again.
As far as the interests of the IETF seem to go, this is what
http://url.spec.whatwg.org/ attempts to do:
* Define the string syntax for URLs (IRI-references, if you wish).
My understanding is that this is a "procedural" definition, i.e. one
finds out whether something is a URL by running through a certain
number of steps (written in pseudocode in the spec).

I think this is something good to have. But some people want a more
top-down description, for which the syntax in RFC 3986/3987/3987bis
should be more suited.

So it would be good to do the following:
- Make sure that URL syntax and IRI-reference syntax are aligned.
(we started working on that)
- Verify that both descriptions are indeed the same
(this is somewhat more formal than "make sure", but doesn't
have to be done for every small change)
- Make it explicit that these two are the same by cross-references
Post by Anne van Kesteren
* Define parsing the string syntax into a model (IRI-references ->
URI, if you wish).
If you say "model", it looks more like "URI pieces" than just "URI".
Also, if this is about the DOM, and you say "URI", does this mean that
when accessing URL parts in the DOM, these (e.g. the path part) are all
%-escaped? Or are Unicode characters preserved in the DOM? (the latter
would be much better for many usages)
Post by Anne van Kesteren
* Define serializing the model back into a string syntax.
The string syntax seems non-controversial.
Yes, we just have to agree on some details, and on what descriptions of
the syntax we provide (and where).
Post by Anne van Kesteren
The parsing seems controversial, but the plan is to add an option
there to only parse conforming string syntax and bail on the first
error. (Going past the first error is useful for e.g. browsers /
search engines / wget so they can interoperate with content, but also
for URL validators that want to highlight more than one error in the
URL.)
You talk about errors here. Are these strict validity errors? My
impression was that at least in an earlier version of your spec, there
was a distinction between "not valid" and something I might call here
"absolutely hopeless". The former might include spaces and e.g. "\". I'm
not sure about the latter, but maybe something like "www......com" would
be an example.

Do you still have this (essentially three-level) distinction, or did I
get that wrong?
Post by Anne van Kesteren
The model is currently only documented as a function of the parsing
algorithm. My plan was to align implementations on parsing first
before documenting what model that implied. I already indicated how
this model is incompatible with URI. E.g. a query with a lone "%" can
be the output of the parser, but is definitely not a URI. (This is
why "fixup" is the wrong way to look at this algorithm I think.)
What do you mean by "a query with a lone "%" can be the output of the
parser"? Does that mean that it is sent as such to the server? What do
servers do with it? Would it hurt to escape it to %25? Or is that done
in a later stage?
Post by Anne van Kesteren
Serializing seems non-controversial.
One thing I'm worried about is the dependency of the query part on the
document encoding. In the (very) long term, the Web seems to converge on
using UTF-8. You are a very vocal proponent of that direction, and I of
course fully agree. But if we stay with the current spec, we may end up
having to write http://www.google.com?q=%E6%97%A5%E6%9C%AC
forever rather than the much more readable (for those who care)
http://www.google.com?q=日本.


Regards, Martin.
Anne van Kesteren
2012-11-08 11:52:14 UTC
On Thu, Nov 8, 2012 at 11:11 AM, "Martin J. Dürst" wrote:
Or is it pure coincidence that you
have time for this work now that you are not employed? (Sorry if I'm reading
too much into the situation.)
Yeah, coincidence. In fact, I started work in the month before leaving
Opera after hearing about yet another bug report in the URL layer and
realising we had not come far with URLs since 2008.
Post by Anne van Kesteren
As far as the interests of the IETF seem to go, this is what
http://url.spec.whatwg.org/ attempts to do:
* Define the string syntax for URLs (IRI-references, if you wish).
My understanding is that this is a "procedural" definition, i.e. one finds
out whether something is a URL by running through a certain number of steps
(written in pseudocode in the spec).
I do not think it is procedural, but it does not use ABNF:
http://url.spec.whatwg.org/#writing It uses a style we have been using
in WHATWG/W3C land to define how to write things. For
parsing/processing we then use a procedural definition.
- Make sure that URL syntax and IRI-reference syntax are aligned.
(we started working on that)
- Verify that both descriptions are indeed the same
(this is somewhat more formal than "make sure", but doesn't
have to be done for every small change)
- Make it explicit that these two are the same by cross-references
I think our main problem will be the concept of "relative schemes" the
URL Standard has. A relative reference can be resolved successfully
only if the base URL has such a scheme. I'm not really sure how to
bring that any closer to what STD 66 expects to happen. E.g.

base: customscheme://test
input: ?test

results in failure rather than customscheme://test/?test.
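
A rough sketch of that behaviour, for query-only references. The function name, the use of None to model failure, and the contents of RELATIVE_SCHEMES are assumptions for illustration; the authoritative list of relative schemes is whatever the URL Standard defines.

```python
from typing import Optional
from urllib.parse import urlsplit

# Assumption: an illustrative set of "relative schemes"; check the URL
# Standard for the actual list.
RELATIVE_SCHEMES = {"ftp", "file", "gopher", "http", "https", "ws", "wss"}


def resolve_query_only(base: str, reference: str) -> Optional[str]:
    """Resolve a '?query' reference against base; None models failure."""
    if not reference.startswith("?"):
        raise ValueError("this sketch only handles '?query' references")
    parts = urlsplit(base)
    if parts.scheme not in RELATIVE_SCHEMES:
        # The base URL does not have a relative scheme, so resolution fails.
        return None
    # Keep the base path (an empty path becomes "/"); the base query and
    # fragment are replaced by the reference.
    path = parts.path or "/"
    return f"{parts.scheme}://{parts.netloc}{path}{reference}"


custom = resolve_query_only("customscheme://test", "?test")
http = resolve_query_only("http://example.org", "?test")
```

In this sketch `custom` is None (failure), whereas `http` is "http://example.org/?test". The host example.org is a placeholder.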
If you say "model", it looks more like "URI pieces" than just "URI".
I thought that is the way Roy described what is defined in STD 66.
Also, if this is about the DOM, and you say "URI", does this mean that when
accessing URL parts in the DOM, these (e.g. the path part) are all
%-escaped? Or are Unicode characters preserved in the DOM? (the latter would
be much better for many usages)
They are not preserved.
You talk about errors here. Are these strict validity errors? My impression
was that at least in an earlier version of your spec, there was a
distinction between "not valid" and something I might call here "absolutely
hopeless". The former might include spaces and e.g. "\". I'm not sure about
the latter, but maybe something like "www......com" would be an example.
Do you still have this (essentially three-level) distinction, or did I get
that wrong?
You got it right.

I replaced the "invalid flag" with a "fatal error flag" and plan to
introduce the concept of errors. Fatal errors are hopeless (e.g. input
does not have a scheme and there's no base either), errors are about
not matching the string syntax.
What do you mean by "a query with a lone "%" can be the output of the
parser"? Does that mean that it is sent as such to the server? What do
servers do with it? Would it hurt to escape it to %25? Or is that done in a
later stage?
Servers appear to not pay much attention to it. http://www.w3.org/?%
is an example. Escaping it to %25 seems dangerous compatibility wise.
Post by Anne van Kesteren
Serializing seems non-controversial.
One thing I'm worried about is the dependency of the query part on the document
encoding. In the (very) long term, the Web seems to converge on using UTF-8.
You are a very vocal proponent of that direction, and I of course fully
agree. But if we stay with the current spec, we may end up having
to write http://www.google.com?q=%E6%97%A5%E6%9C%AC forever rather than the
much more readable (for those who care) http://www.google.com?q=日本.
Well, we can write the latter in documents encoded as either utf-8 or
utf-16 (but please don't encode your documents as utf-16), because we
know the www.google.com endpoint speaks utf-8. It will still be parsed
into percent-encoded bytes which is somewhat useless and wasteful, but
not a huge problem if we provide API surface to get utf-8 out of it
again.
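
As an illustration of the round trip discussed here, Python's standard library can stand in for the "API surface" (it is not the API the URL Standard would define). It shows the UTF-8 percent-encoded form of 日本, how a different document encoding would change the bytes, and how to get the text back:

```python
from urllib.parse import quote, unquote

text = "日本"

# Percent-encode the UTF-8 bytes of the text.
utf8_form = quote(text, encoding="utf-8")  # "%E6%97%A5%E6%9C%AC"

# The same text under another encoding yields different bytes, which is
# the document-encoding dependency of the query part.
sjis_form = quote(text, encoding="shift_jis")

# Decoding the percent-encoded bytes as UTF-8 recovers the original text.
decoded = unquote(utf8_form, encoding="utf-8")
```

Decoding only works if the receiver knows which encoding produced the bytes, which is why an endpoint that is known to speak utf-8 makes the readable form safe to write.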

Whether we want to improve that situation even more is kinda tricky.
On the one hand yes, because it would obviously be cleaner to just
transmit utf-8 rather than the ugly percent-encoded bytes, but on the
other hand a lot of existing infrastructure would have to change. So
much so that it is not entirely clear to me the ROI is worth it.
--
http://annevankesteren.nl/
Julian Reschke
2012-11-08 12:26:45 UTC
Post by Anne van Kesteren
...
Post by Martin J. Dürst
What do you mean by "a query with a lone "%" can be the output of the
parser"? Does that mean that it is sent as such to the server? What do
servers do with it? Would it hurt to escape it to %25? Or is that done in a
later stage?
Servers appear to not pay much attention to it. http://www.w3.org/?%
is an example. Escaping it to %25 seems dangerous compatibility wise.
...
That's something we should test (and potentially eliminate) instead of
making it mandatory.
Post by Anne van Kesteren
...
Best regards, Julian
Martin J. Dürst
2012-11-13 09:13:46 UTC
Post by Julian Reschke
Post by Anne van Kesteren
...
Post by Martin J. Dürst
What do you mean by "a query with a lone "%" can be the output of the
parser"? Does that mean that it is sent as such to the server? What do
servers do with it? Would it hurt to escape it to %25? Or is that done in a
later stage?
Servers appear to not pay much attention to it. http://www.w3.org/?%
is an example. Escaping it to %25 seems dangerous compatibility wise.
I don't think http://www.w3.org/?% is a good example, because that page
doesn't accept query parts at all. A better example would be a page that
actually depends on a query part.

I just tested with http://www.google.com/?q=% and
http://www.google.com/?q=%25.

On Opera, I get the same result (a page with "%" in the search box), but
the difference in the address field stays. When requesting the actual
search, the difference (% vs. %25) stays if I keep JavaScript on. It
disappears when I switch it off.

On Safari, an input of http://www.google.com/?q=% gets turned into
http://www.google.com/?q=%25. As confirmed with Wireshark, the query
leaves the browser as GET /?q=%25 HTTP/1.1. That means that Safari does
convert the % to a %25.

Chrome behaves more like Opera. I didn't find a way to switch off
JavaScript (maybe I didn't look long enough), but that part isn't at the
core of what we are testing here.

Firefox is the same as Opera. IE also seems to be the same or similar.
Post by Julian Reschke
That's something we should test (and potentially eliminate) instead of
making it mandatory.
Did you mean the above tests, or something else?

I'd definitely also like to see some tests on the server side. For
example, what happens in Apache when it receives "%" vs. "%25"? What
about other servers, frameworks, and so on.
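
One server-side data point, offered only as an example of how a single decoding library behaves: Python's standard query-string parser leaves a "%" that does not start a valid escape untouched, so "%" and "%25" end up as the same value. This says nothing about Apache or other servers and frameworks, which would need their own tests.

```python
from urllib.parse import parse_qs

# Query strings as they would arrive for /?q=% and /?q=%25.
raw = parse_qs("q=%")
escaped = parse_qs("q=%25")

# Both decode to a single "%" for the parameter "q".
same = raw == escaped
```

A framework built on this parser would therefore not be able to tell the two requests apart after decoding.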

Regards, Martin.