Monday, May 11th, 2009

Hixie discusses the addition of HTML5 “microdata”

Category: HTML, Standards

Ian Hickson has chatted about an addition to HTML5, “microdata”:

Annotate structured data that HTML has no semantics for, and which nobody has annotated before, and may never again, for private use or use in a small self-contained community.

He goes on to detail a number of scenarios such as this subset:

  • A group of users want to mark up their iguana collections so that they can write a script that collates all their collections and presents them in a uniform fashion.
  • A scholar and teacher wants other scholars (and potentially students) to be able to easily extract information about what he teaches to add it to their custom applications.
  • The list of specifications produced by W3C, for example, and various lists of translations, are produced by scraping source pages and outputting the result. This is brittle. It would be easier if the data was unambiguously obtainable from the source pages. This is a custom set of properties, specific to this community.

and then shows how one could take:

  <section>
   <h1>Hedral</h1>
   <p>Hedral is a male american domestic shorthair, with a fluffy black
   fur with white paws and belly.</p>
   <img src="hedral.jpeg" alt="" title="Hedral, age 18 months"
        class="photo"/>
  </section>

and extract:

Cat name:    "Hedral"
Description: "Hedral is a male american domestic shorthair, with a fluffy black fur with white paws and belly."
Image:       ""

Here is where the fun begins as Ian walks through the issues with the microformat-esque approach, namely the overloaded “class”:

there is no way for a parser to know which classes are properties of cats and which are just for styling (e.g. ‘photo’ used in this example).

I have to admit that I think the baby could be thrown out with the bathwater here. I would hate to see class="cat" type="cat" for example!

Many iterations on, and we see:

  Page 1:
   <section item="">
    <h1 property="">Hedral</h1>
    <p property="com.damowmow.desc">Hedral is a male american domestic
    shorthair, with a fluffy black fur with white paws and belly.</p>
    <img property="com.damowmow.img" src="hedral.jpeg" alt="" title="Hedral, age 18 months"/>
   </section>

  Page 2:
   <body item="">
    <p>I love my cats. My oldest cat is <span property="">Silver</span>. <span property="com.damowmow.desc">Silver is <span property="com.damowmow.age">11</span> years old and refuses to eat
    alone, always waiting for either Yellow or Blue to eat with
    him.</span></p>
   </body>

  Page 3:
   <h2>My Cats</h2>
   <dl>
    <dt>Schr&ouml;dinger
    <dd item="">
     <meta property="" content="Schr&ouml;dinger">
     <meta property="com.damowmow.age" content="9">
     <p property="com.damowmow.desc">Orange male.
    <dt>Erwin
    <dd item="">
     <meta property="" content="Lord Erwin">
     <meta property="com.damowmow.age" content="3">
     <p property="com.damowmow.desc">Siamese color-point.
     <img property="com.damowmow.img" alt="" src="/images/erwin.jpeg"/>
   </dl>

I don’t miss the com.* world of Java. I hate the verboseness. It looks so ugly to compare “” to “cat”. Is it just me?
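As a rough illustration of how a consumer might read the Page 1 markup above in a single pass, here is a minimal Python sketch built on the standard library's html.parser. It is not Hixie's algorithm: the value rules (src for img, content for meta, text content otherwise) are a simplification, the sketch assumes well-formed markup with non-nested items, and the names com.example.cat and com.example.name are hypothetical stand-ins for the item and property values that appear blank above.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they carry their value in an attribute.
VOID_TAGS = {"img", "meta"}


class ItemExtractor(HTMLParser):
    """Collect property/value pairs for each element that has an `item` attribute."""

    def __init__(self):
        super().__init__()
        self.items = []   # one dict of properties per item found
        self._open = []   # stack of (tag, property name or None, text chunks)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "item" in attrs:
            self.items.append({})
        prop = attrs.get("property")
        if tag in VOID_TAGS:
            if prop is not None and self.items:
                # Assumption: img contributes its src, meta its content.
                value = attrs.get("src") if tag == "img" else attrs.get("content")
                self.items[-1][prop] = value
            return
        self._open.append((tag, prop, []))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._open:
            return
        # Assumes well-formed input, so the closing tag matches the stack top.
        _, prop, chunks = self._open.pop()
        if prop is not None and self.items:
            # Collapse runs of whitespace, including line breaks, to single spaces.
            self.items[-1][prop] = " ".join("".join(chunks).split())

    def handle_data(self, data):
        # Text belongs to every open element that declares a property.
        for _, prop, chunks in self._open:
            if prop is not None:
                chunks.append(data)


def extract_items(html):
    """Parse an HTML string and return a list of property dicts, one per item."""
    parser = ItemExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


# Hypothetical names fill in the item and name values left blank in the post.
PAGE_1 = """
<section item="com.example.cat">
 <h1 property="com.example.name">Hedral</h1>
 <p property="com.damowmow.desc">Hedral is a male american domestic
 shorthair, with a fluffy black fur with white paws and belly.</p>
 <img property="com.damowmow.img" src="hedral.jpeg" alt="" title="Hedral, age 18 months"/>
</section>
"""

items = extract_items(PAGE_1)
```

Here items holds a single dict with three entries: the name, the description, and the image source. Nothing is looked up outside the markup itself, which is the point of putting the property names directly on the elements.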

Hixie then concludes:

To address this use case and its scenarios, I’ve added to HTML5 a simple
syntax (three new attributes) based on RDFa. It doesn’t have the full
power of RDF, because that didn’t seem to be necessary to address the use
cases. It doesn’t really have anything in common with Microformats; I
didn’t find the Microformats syntax to be very convenient. (This was also
the experience with eRDF.)

I expect the syntax will need adjustments over the coming weeks to address
issues that I overlooked. I look forward to such feedback.

I also found the following Tweets interesting (via @kevinmarks and @diveintomark) as I wrote this :)

@hixie @kidehen URLs are useful, as they resolve. all else is stamp collecting.

@kidehen In practice few people really understand the subtleties of URN vs URI vs IRI vs URL vs Web Address vs Hypertext Reference vs…

All this crap about HTML5 “gatekeepers” is hiLARious. For 6 years, they BEGGED the W3C to work on HTML+1. The W3C said no.

Posted by Dion Almaer at 6:13 am


I may be way off the mark here, but is this not trying to overload a mark-up language into something that holds information for more than simple web display?

An alternative would be something along the lines of the S1000D specification which is aimed at storing the data in its entirety and then you use players or transforms to display the information which you are interested in.

Of course, whilst S1000D is open, there aren’t many players out there and it is heavily biased towards technology.

Comment by jjs105 — May 11, 2009

I’m a big fan of microformats. But why can’t we learn the lesson we learned from CSS and use a separate file to associate properties to DOM Nodes?

Something along the lines of:

section : “” {

h1: “”;
h1 + p: “com.damowmow.desc”;
img: “com.damowmow.pic”;

}

Of course this could be polished… but I think if we’re gonna write something in stone, why can’t we “keep it separated”, but tied?

Hmmm.. I should cross-post this to the mailing-list, I guess. But I just don’t have the attention to spare to keep up with replies. :-(

Comment by andr3 — May 11, 2009

The link to Ian’s message is broken… it should point here:

Comment by davidlantner — May 11, 2009

I think we need to accept that there are a lot of different needs on the web, and that we can’t shoehorn everything into html. And it’s not just the same old complaint of “not everything is a document, what about the web app love?”. Here too, we clearly see the failings of html as a data format.

I know that this had been tried before with xslt and just never took off, but I think we need a better way of separating content from visuals. And css is just not there.

I know we’ll likely always need some simple way of making a web page, but really, is it even simple now?

I would suggest an alternative, but it seems futile. Sorry, I just get kind of depressed about web technology sometimes :(

Comment by genericallyloud — May 11, 2009

If the following is too ugly, verbose, and brings back com.* from Java land:


Maybe if an ancestor element specifies item="", then all descendant elements could have “com.damowmow.*” inherited as the default “namespace” by all descendant property attributes so you could simply write property="img". Name collisions could happen, but I don’t believe they happen so frequently, but when they do, the “fully qualified” property name could be written as specified: property="com.damowmow.img"

Comment by westonruter — May 11, 2009
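The inheritance rule westonruter proposes can be sketched in a few lines of Python. This is only an illustration of the suggestion, not anything in the draft: the function names are hypothetical, the item value com.damowmow.cat is an assumed example, and both "everything before the last dot is the namespace" and "a name containing a dot is already fully qualified" are assumptions made for the sketch.

```python
def namespace_of(item_type):
    """Return the assumed default namespace of an item type.

    Assumption: the namespace is everything before the last dot,
    so "com.damowmow.cat" gives "com.damowmow".
    """
    return item_type.rpartition(".")[0]


def resolve_property(name, default_namespace):
    """Expand a short property name using the ancestor item's namespace.

    Assumption: a name that already contains a dot is fully qualified
    and is returned unchanged.
    """
    if "." in name:
        return name
    return f"{default_namespace}.{name}"


ns = namespace_of("com.damowmow.cat")               # hypothetical item value
short = resolve_property("img", ns)                 # inherits the namespace
full = resolve_property("com.damowmow.img", ns)     # already qualified
```

Under these assumptions, property="img" and property="com.damowmow.img" resolve to the same name inside the item, and a property from another vocabulary keeps its own prefix.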

Couldn’t xmlns be used for this?

Comment by TNO — May 11, 2009

As we speak I’m overloading HTML to meet a requirement of the webapp I’m developing to be able to parse HTML with a view to the data being part of a class. I’ve had to do this many times in the past and therefore am very glad to see standards being developed for this.

TNO, the only problem with NS is that you can only create new elements, not outline NS’s attributes on existing HTML elements. In theory you could add an extra DTD to expand on XHTML, but without a standard for this, browsers’ treatment of this approach might not be “as expected”.

Comment by RoryH — May 11, 2009

This seems like one of the least bad things about Java to bash. The com.* format is just convention. Java doesn’t force you to put your packages in ‘com.*’ or ‘org.*’; this is just a strategy for avoiding namespace collisions in a large ecosystem by using domains as unique identifiers. Import directives take the pain away.

Since there is no import directive in this proposal, it does seem to make reading/writing worse.

Comment by cromwellian — May 11, 2009

What cromwellian said. Furthermore, reverse domain is not limited to Java, it’s also used by the OS X preference/plist system (generally), by a lot of ActionScript libraries, and elsewhere. And it’s great, for what it is.

I would argue that, if this is going to be in HTML 5, it should be inferred that where a reverse domain is not specified, it is that of the current domain (eg. <section item=”cat”> on would be interpreted as

Comment by eyelidlessness — May 11, 2009

XML Data Islands is the answer

Comment by Ajaxerex — May 11, 2009

Designers are never going to use or care about crap like this.

Comment by mjuhl — May 11, 2009

I thought this was what the data-* attributes in HTML5 were for. I don’t quite see the point behind adding a more specific alternative to the data attribute family.

I say let everyone solve it in their own way. The spec doesn’t need to cater to this use case more than it already does.

Comment by Joeri — May 12, 2009

The verbosity isn’t that bad, though I agree it is a bit ugly if you have a long domain name. The problem is that there isn’t really a better alternative — prefixes, imports, and defaulting names to particular domains based on other declarations all have even worse problems, like authors getting them wrong (many people just don’t understand prefixes), to copy-and-paste failing (when people forget to copy the import declarations), to conflicts (e.g. when you’re using multiple vocabularies).

@andr3: I considered using Selectors for this (it was suggested a couple of times in the discussion), the problem is that it doesn’t really work. People really want something they can parse trivially, while streaming through the markup, they don’t want something that they need to parse into a DOM, cascade selectors over, etc. Using an external “semantic sheet” would mean that the document would lose meaning when you lose the connection to that sheet. It would also involve a lot of indirection, which confuses authors (see how many authors only use .class, for instance) — instead of just saying “this is a name”, you’d often say “this is a name” in one place and “names are com.example.names” in another.

@mjuhl: I agree with you that most people won’t use this. Some will, though. There was enough demand for something like this to justify putting it in the spec, IMHO.

@Joeri: It’s different than data-*=”” in that the data-* attributes are not intended for reuse by other people. data-*=”” is just for scripts to have somewhere to hang information that they need (e.g. “is this row open?”, “where was I up to when animating this?”), whereas microdata is for remixing information.

Comment by Hixie — May 12, 2009

Why not just move to XML + CSS? HTML will never be able to encompass all possible data structures. Keep some reserved tagnames and attributes for basic structures and special elements (img, canvas, etc) and that’s all. My dream mark-up:

I'm a cat

Privilege identifiers, use attributes for data. Or something like that.

Comment by ricardobeat — May 12, 2009

Gaa. useless code tag. Let’s try again:

< type="meow" color="white">
<p.description>I'm a cat</p>
< src="cat.jpg" />

Comment by ricardobeat — May 12, 2009

@TNO, westonruter: this is basically what RDFa does; property=”f:img”. The prefixes are defined as xmlns attributes. Short, simple, looks nice, I don’t get why some people are so opposed to that (especially not if this verbose crap is the alternative).

Comment by Grauw — May 13, 2009

It seems like it’s just reinventing xml namespaces minus all the good stuff such as schema validation. FYI, when is com.* anything a verbosity issue? It’s used once at the top of your java file, and usually automatically imported by your IDE. That was a funny comment.

Comment by ilazarte — May 13, 2009

The comments here are quite interesting:

Comment by TNO — May 13, 2009
