URLs should express output-format, not source-format

I’m the kind of guy who always tries to optimize the way I work, especially when it comes to programming. Over the last few years I’ve worked with (and against) a multitude of CMSs, frameworks and APIs, and I often find myself thinking about how things could be better, easier and more intuitive.

One of these things is how to format a URL.

This post isn’t really focused on server technicalities; it is more a discussion of an idea. I’ll nonetheless assume that the reader has some familiarity with MIME types and basic webserver settings (like handlers and URL-rewriting techniques).

Why

As far as I’m concerned, any URL should give a hint of what you will find there. By stripping the file extension and being left with something that looks like a directory, or by using the scripting language’s extension, I believe we work against this. File extensions give an important hint about what the content is. What do you expect when you see “.jpg”, “.gif” or “.png”? Yes, images. Then what about “.html”, “.txt” and “.odf”? It’s most likely some kind of text, perhaps formatted. So what do “.php”, “.py” and “.aspx” tell us? Not much: they are mostly a directive to the webserver, telling it to use some kind of parser to compile/process the contents before outputting them. And the output might be HTML, XML or PDF as well as anything else.

A couple of years ago a script extension had a certain “coolism” about it, saying “I know how to develop fancy websites”, but nowadays people mostly expect some kind of interactivity anyway.

How

All relevant webservers today have the possibility to map any kind of extension to any kind of (supported) language parser. E.g. with Apache you can use the AddHandler directive like this:

AddHandler php5-script .html .xml .json

This parses all .html, .xml and .json files in the given directory as PHP 5.

Now, let’s say you have an API which allows you (or perhaps other users) to query a service for information, and allows it to be delivered as XML, JSON or HTML. This is usually solved by adding a parameter like “of” (output format) to the request to tell which format you’d like. I prefer to keep everything as logical as possible and to reduce the need to look things up in manuals as far as reasonably possible, so instead of sending GET requests like this:

http://example.com/gateway.php?id=15&key=11341&of=xml

http://example.com/gateway.php?id=15&key=11341&of=json


http://example.com/gateway.php?id=15&key=11341&of=html

I’d prefer:

http://example.com/gateway.xml?id=15&key=11341

http://example.com/gateway.json?id=15&key=11341


http://example.com/gateway.html?id=15&key=11341
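To make the idea concrete, here is a minimal Python sketch of a gateway that picks its output format from the extension in the URL instead of from an “of” parameter. Everything in it is an assumption made for illustration: the FORMATS table, the lookup() stub and the handle() function are hypothetical names, and a real service would fetch real data and sit behind a real webserver.

```python
import json
from html import escape
from posixpath import splitext
from urllib.parse import parse_qs, urlsplit
from xml.etree import ElementTree as ET

# Hypothetical table: URL extension -> MIME type sent in Content-Type.
FORMATS = {
    ".xml": "application/xml",
    ".json": "application/json",
    ".html": "text/html",
}


def lookup(record_id):
    """Stand-in for the real data source; returns a fixed dummy record."""
    return {"id": record_id, "name": "example"}


def render(record, ext):
    """Serialize the same record in the format named by the extension."""
    items = sorted(record.items())
    if ext == ".json":
        return json.dumps(dict(items))
    if ext == ".xml":
        root = ET.Element("record")
        for key, value in items:
            ET.SubElement(root, key).text = str(value)
        return ET.tostring(root, encoding="unicode")
    # Only ".html" is left, because handle() checks FORMATS first.
    rows = "".join(
        "<dt>{}</dt><dd>{}</dd>".format(escape(k), escape(str(v)))
        for k, v in items
    )
    return "<dl>{}</dl>".format(rows)


def handle(url):
    """Return (status, content_type, body) for a gateway URL."""
    parts = urlsplit(url)
    _, ext = splitext(parts.path)
    if ext not in FORMATS:
        return 404, "text/plain", "unknown output format"
    query = parse_qs(parts.query)
    record = lookup(query.get("id", [""])[0])
    return 200, FORMATS[ext], render(record, ext)


status, content_type, body = handle(
    "http://example.com/gateway.json?id=15&key=11341"
)
print(status, content_type, body)
```

The data lookup is identical for all three URLs; only the last step, the serialization, depends on the extension. The Content-Type is still sent explicitly, so the extension stays a hint for humans rather than something the client has to trust.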

This applies just as much to those sexy search-friendly URLs: just append .html to that pretty URL. Google says that it at least won’t negatively affect your ranking, so why not?

Possible problems

There are some obvious problems with this approach:

1: What if you want different parsers (let’s say PHP and Python) for different files which serve the same kind of output? How do we tell the server which to use when?

– Possible solutions are to either separate them into different folders (the cleanest approach), or to expand the extension to e.g. .php.html and .py.html, perhaps in combination with some mod_rewrite-style tricks.

2: What about the overhead for static files, which are still sent through an interpreter although nothing is changed?

– If you are running a service which is sensitive to these kinds of margins, you should use dedicated servers or CDNs to deliver the static content no matter what.

3: How far shall we take it? Should we use extensions like .html5 and .xhtml etc.?

– I believe that these differences are mostly relevant to the software parsing the content, so in these cases it may suffice to stick with e.g. .html and use the MIME type to send further details.
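The double-extension idea from problem 1 can be sketched in a few lines of Python. This is only an illustration of the naming scheme, not of how any particular webserver resolves handlers: the PARSERS table and the function name are made up for the example.

```python
from posixpath import basename

# Hypothetical table: inner extension -> parser the server should use.
PARSERS = {"php": "php-interpreter", "py": "python-interpreter"}


def split_double_extension(path):
    """Return (parser, output_format) for paths like /page.php.html.

    The last extension names the output format shown to the user; an
    optional second-to-last extension names the parser. Either value
    is None when the path does not carry it.
    """
    parts = basename(path).split(".")
    if len(parts) >= 3 and parts[-2] in PARSERS:
        return PARSERS[parts[-2]], parts[-1]
    if len(parts) >= 2:
        return None, parts[-1]
    return None, None


print(split_double_extension("/reports/summary.php.html"))
print(split_double_extension("/reports/summary.py.html"))
```

With a rewrite rule in front, the inner extension could be hidden from the public URL again, so that users only ever see the output format.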

Summary

Although my examples might not pinpoint all the benefits, I do believe there is something to the idea of keeping the file extension: it assists external users and might reduce some clutter.

What are the pros and cons of an approach like this? I’ve barely touched on a few in this post, but I hope I was able to mention some of the most essential ones.


22 Comments

  1. evmar says:

    Bad title: post has the how but none of the why.

  2. Rofl_Waffler says:

    Fourth paragraph.

  3. > File-extensions gives an important hint regarding what the content is. What do you expect when you see “.jpg”, “.gif” or “.png”? Yes, images. Then what about “.html”, “.txt” and “.odf”? It’s most likely some kind of text, perhaps formatted. So what does “.php”, “.py” and “.aspx” tell us? Not much [...]

  4. hobbified says:

    Not URLs, not even URL’s, but URL’s. Bleah.

  5. kzielinski says:

    Apache Cocoon http://cocoon.apache.com takes these ideas to their logical conclusion. They have the neat idea of being able to output the same page in multiple formats, so ask for it as HTML and that’s what you get; change the extension to .pdf and you get a PDF.

    The project has been doing this for years, and I thought it was a brilliant solution the first time I saw it. Also they completely separate the URL space from the underlying filesystem.

  6. modden says:

    That’s indeed something I should look closer into. Thanks!

  7. lol-dongs says:

    1. mod_rewrite
    2. front controller

    the era of people using separate files to handle each URL path should be long gone. Besides, keeping PHP in a file that ends in .json is annoying for the developer. 1 + 2 is TRWTDI.

  8. modden says:

    As I’ve also commented to the post, it’s not the server-side files I’m mainly thinking about (although my examples unfortunately imply that); it’s the URLs as they’re presented to the users.

  9. modden says:

    Haha, not good at all. Fixed in the original title now.

  10. I’m not sure I entirely agree with this.

    URLs are meant to be identifiers or “locations” for resources. Does the identifier / location of a resource change because we get it in a different format?

    I can agree that it makes sense in UIs where end users are exposed to URLs and seeing an extension clarifies what they are going to get in a useful way. It’s going too far, however to conclude this is the “right” thing to do generally.

  11. hobbified says:

    I guess it’s WordPress’s fault. Thanks for fixing it. :)

  12. bishopolis says:

    Yeah. I stop reading English when the writer stops writing English. This one shouldn’t have survived past the title, but I kept reading.

    No one’s perfect, but I do hope that someone trying to promote a new method will spend 2 minutes proofing the write-up before self-posting to something like Reddit. I got nearly a BINGO, scoring for word-like letter combinations like “URI’s”. I hoped that simple gaffes like “atleast”, “it mostly” or “we works” would give bonus points, but I didn’t see what I had on my Bingo card.

    Did anyone see the usual Tysonic kidgin blunders like “emails” when used as a noun, “irregardless” and “incidences”? I think that’s all I needed.

    Kudos for getting “URLs” and “scripting-language’s extension”, but considering the thrown gauntlet of a self-post, do we applaud the would-be tenor for clearing his proverbial throat?

  13. Adam says:

    The issue here is that extensions are for the host server, not the client browser. The client browser should use the Content-Type HTTP header to decide what to do with it.

    Even then, browsers are built to expect cruddy data, so if you feed it an image/png and the image _actually_ is a jpeg, the browser will figure it out from the image.

    I don’t believe the extension should define the output format, since the extension is defined on the server which, as you say, could interpret it in different ways.

    AFAIK, according to the W3C specs, URLs should be pretty much disconnected from the data that comes from them, i.e. you should be able to navigate to http://wobble.blip.com/hooo?haa#hee!!!.xml=woo and have the document render correctly.

  14. eadmund says:

    Extraordinarily bad advice: fail to ignore it at your own peril!

    URLs are uniform _resource_ locators; a resource doesn’t have a particular format. That should be obvious: an image could be a JPEG or PNG or TIFF or GIF and yet contain exactly the same data–and thus be the same resource.

  15. michaelo says:

    Hey Adam, and thanks for a good response.

    I completely agree with you that the Content-Type is the most important one, especially with regard to the UAs. And I also agree that the URLs shouldn’t necessarily (ideally) have to match any real files at all (I see from my examples that this didn’t come through well enough), but that’s not the real issue here. I’m thinking of the URLs as presented to the clients (which the servers might parse as anything they’d like). UAs should also completely ignore the file extension (except for e.g. “application/octet-stream”, where the extension might provide a hint if no good hints are given by the file content).

    Indeed, browsers are very forgiving creatures.

    My intentions with this post were primarily for the users, not the user agents. In essence I’d like the extension to comply with the content, not define it (although it indirectly will if used consistently).

  16. S Jones says:

    Isn’t your idea completely handled by Apache (or IIS) URL rewriting? mod_rewrite: http://httpd.apache.org/docs/2.0/misc/rewriteguide.html

    You can use that to make static-looking URLs (mydomain.com/graphics/somefile.jpg) which actually point to dynamic data (mydomain.com/graphicserver.php?get=somefile.jpg).

    Also, in my opinion, making the scripted/dynamic content look like a static file is OK until it is parameterized. Once you start adding foo=bar parameters, you might as well be honest about it being scripted.

  17. michaelo says:

    Hey S,

    That’s correct, and probably one of the best approaches, as it helps with the separation of URLs and the physical files.

    Now, I do not think that it is dishonest to serve “static” file types with parameters, although I agree that it might seem a little weird; as far as I’ve experienced, that’s mainly because it’s not what we’re used to. But that’s something I’d like to see change. I think it’s a very good point you’re making there nonetheless.

  18. Alejo says:

    I don’t think you should aim for URLs that specify the particular representation (what you refer to as ‘output’) used.

    For instance, if the user is loading an image, he shouldn’t have to use different URLs depending on whether he wants to get the version in JPEG or in PNG format. He should be able to just give the URL that uniquely identifies the image; it is the job of the HTTP headers (ie. the content negotiation) to figure out which of the available representations is preferred.

    The same goes for the language, he shouldn’t have to switch from ‘index.en’ (or, worse, ‘index.en.html’) to ‘index.es’ (or ‘index.es.html’) to change the language; instead the right headers for picking the language should be used.

    For my CMS (svnwiki, http://wiki.freaks-unidos.net/svnwiki) the approach that I’m taking is to not show the format in the URL by default and use Apache’s Content Negotiation module to handle these things smartly. I really like the approach offered by Apache’s Content Negotiation module, where you can say ‘index’ (pick the best language and format), ‘index.en’ (pick the best format, but force that language) or ‘index.en.html’ (to fully specify language and format). This works very beautifully with the ‘Baked, not Fried’ approach (http://wiki.freaks-unidos.net/pre-rendered%20model).

  19. dwdwdw says:

    There are a number of sides to this, none of which I’m really versed in.

    I guess it depends on your notion of a resource: “microsoft.com” could be thought of as a resource, however in practice it is useful to include subaddresses like “msdn” or “/docs/blah.html” when referring to this resource. Similarly, notions such as file format, natural language and character encoding used can be thought of as indexes for projections of a multi-dimensional root resource.

    There is a greyish line here that isn’t well defined across the different Internet protocols. The closest we have to a standard would be HTTP, which most would rather not duplicate the “Content-Accept” and “Accept-Language” functionality of when defining their HTTP URLs. However, that doesn’t make doing so invalid.

    Including as many specifics of a resource as possible in the URL has several advantages. For example, simplistic software that handles URLs can be trivially taught to address language and encoding variations of a resource without extra code. You can see this today e.g. in the W3C’s feed validator. In this scenario the resource’s projections are often included as child parts of the URL’s path. You can see some examples of this in the GData APIs, or anywhere you’ve ever seen something like “/docs/help/en.html”. Such a subaddress does not occlude the existence of “/docs/help”; in fact there’s no reason why the two styles can’t be used as part of a single scheme.

    There are other issues, some theoretical and some practical, but I have no idea where I read about any of this, sorry.

  20. eadmund says:

    > The closest we have to a standard would be HTTP, which most would rather not duplicate the “Content-Accept” and “Accept-Language” functionality of when defining their HTTP URLs.

    Well, those are there for _exactly_ this kind of usage pattern. Tools which don’t support them are borken, don’t you think?

  21. dwdwdw says:

    If you take that approach, then there probably isn’t a single conforming HTTP implementation on the entire Internet. I wouldn’t call it broken, no.

  22. michaelo says:

    Hey Alejo,

    Thanks for a well thought out comment.

    I don’t think we are too far away from each other. I agree totally regarding the language, as this is possible to check via e.g. the browser settings. And the language doesn’t change the file format. But the language and format headers come from two different sides: the format is “forced” by the server, while the language is requested by the client.

    What I didn’t mention is that I believe it’s good practice to have fallbacks, either by “best fit from what I can sniff” (content negotiation), or sometimes by directly defined defaults. What I still mean is that the extension is an important hint to the users when considering what the resource is. And when data is available in multiple formats, it should/could be possible to override the default via the extension, as the URI points to the same data but we might want it in a different format.

    It’s also important to mention that the software requesting the resource should of course prefer the Content-Type header over the extension.

    I’m checking out svnwiki, it seems like a good project that might come in handy. Thanks for the links!
