Ian Bicking: a blog :: Of Microformats and the Semantic Web

{ 2007 08 14 }

Of Microformats and the Semantic Web

I was talking a little with Daniel Krech (author of rdflib) about Semantic Web stuff and microformats and what they all mean. And he was saying that microformats were nice, because you could do something with them, but it would be nice to see that generalized.

By "generalized" I think he meant a general way of expressing arbitrary relationships. As an example, in hCard you can do:

<span class="tel">
  <span class="type">home</span>:
  <span class="value">773-555-3821</span>
</span>

The hCard specification (itself leaning heavily on vCard) defines tel, type, and there’s a general pattern of what value means. But if you want to describe some new kind of structure, there’s no way to do that really; there’s no marital status format, for instance (which would be useful for a singles search engine, as an example).
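
To make the "you could do something with them" part concrete, here is a minimal sketch of pulling a phone number out of hCard-style markup using only the Python standard library. The sample markup, the `TelExtractor` class, and the `extract_tels` helper are all assumptions for illustration; this is not a complete hCard parser, just the tel/type/value pattern described above.

```python
from html.parser import HTMLParser

# Assumed sample markup: a phone number using the hCard class names
# tel, type, and value.
SAMPLE = """
<div class="vcard">
  <span class="tel">
    <span class="type">home</span>:
    <span class="value">773-555-3821</span>
  </span>
</div>
"""


class TelExtractor(HTMLParser):
    """Collect (type, value) pairs from elements with class "tel".

    Hypothetical helper: it assumes every start tag has a matching end
    tag, so void elements such as <br> would confuse the depth tracking.
    """

    def __init__(self):
        super().__init__()
        self.stack = []      # class lists of the currently open elements
        self.current = None  # record being filled while inside a "tel"
        self.results = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        self.stack.append(classes)
        if "tel" in classes and self.current is None:
            self.current = {"type": "", "value": "", "depth": len(self.stack)}

    def handle_endtag(self, tag):
        # Closing the element that opened the "tel" finishes the record.
        if self.current is not None and len(self.stack) == self.current["depth"]:
            self.results.append((self.current["type"], self.current["value"]))
            self.current = None
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if self.current is None or not self.stack:
            return
        innermost = self.stack[-1]
        for key in ("type", "value"):
            if key in innermost:
                self.current[key] += data.strip()


def extract_tels(html):
    parser = TelExtractor()
    parser.feed(html)
    return parser.results


print(extract_tels(SAMPLE))
```

The code only works because the class names carry an agreed meaning. Nothing in it could discover what a "marital status" class would mean.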

So I started thinking: can you really generalize it? And I started to think about Joe Gregorio’s attack on WADL:

Here is the very first example in the WADL specification.

That WADL file is a description of a search interface. But here is how you should really do it. That’s an OpenSearch document, that also describes a search interface.

Q: What’s the difference?

A: A mime-type.

Q: That doesn’t seem like much, does it make a difference?

A: Yes, it makes a big difference. When you get an OpenSearch document there is a whole data model and a set of interactions you know are possible because you read the OpenSearch specification. By reading that spec you know how to construct search queries. When I get a WADL document it might describe anything, from how to construct a search, to the APP, to JEP, to XML-RPC.

…

So when I say the difference is a ‘mime-type’, what I mean is that there is an entire spec somewhere which describes what that document means, and that meaning may include hypertext functionality, ala (X)HTML, XForms, and OpenSearch.
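
One way to picture Joe's point in code is a consumer that picks its behaviour from the media type alone. This is only a sketch: the handler functions and their return strings are made up for illustration, and the only things taken from the real world are the two registered media type strings.

```python
# Hypothetical handlers: each one stands in for code written against a
# specific spec, so it already knows what the document means.
def handle_opensearch(body):
    return "search description: we know how to build queries"


def handle_atom(body):
    return "feed: we know about entries, links, and dates"


HANDLERS = {
    "application/opensearchdescription+xml": handle_opensearch,
    "application/atom+xml": handle_atom,
}


def dispatch(content_type, body):
    """Choose a handler from the Content-Type header value alone."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    handler = HANDLERS.get(media_type)
    if handler is None:
        # No spec behind the type that we know of, so no shared meaning.
        return None
    return handler(body)
```

A handler for a generic description format could be registered in the same table, but it would still have to read the document to find out what kind of thing is being described, which is the difference the quote is pointing at.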

This made me think of shared understanding more than explicit descriptions. OpenSearch, APP, and Atom are very well described, but I think that’s only half of it: they are useful when they describe something that many people already understand.

Digressing slightly, one "semantic markup" ideal that still bugs me is <b> and <i> vs. <strong> and <em>. When I compose text I choose to make some words bold and some italic. I have no idea what "strong" and "emphasis" are even supposed to mean. When I’m composing text, I don’t actually know why I choose one or the other. If I sat down and thought about it I’m sure I could come up with a set of rules that describe when bold is appropriate and when italic is appropriate. But that is reflecting on my choice, it is not describing my choice. There is no intermediate semantic meaning between what I am saying and bold and italic. I think in bold and italic. Readers in turn find meaning in the text itself; they do not parse my writing into semantic markup in their brain.

I think there’s some connection between this and the shared understanding that microformats represent and that a more generalized RDF model does not. I know what hCard means; not just in an intellectual way, but I can imagine a dozen functional uses of it without hardly trying, and of course I am entirely clear on what contact information means. Moreover, I know what it means without actually figuring out what it means; if you asked me to articulate what contact information means I’d have to think a little, and I’m sure many people would come up with bad answers or be stumped. And yet they all actually understand what it means.

Bringing this back to Joe’s post, if I write something that produces or consumes Atom, Atompub, or OpenSearch, I understand the why of my code. With both WADL and RDF my code is divorced from the why. This isn’t about my personal understanding either; explaining it to me doesn’t serve any purpose, because any exchange format has to make sense to many, many people to be useful. Even an education campaign won’t fix this: education by description is far inferior to education by doing, and there’s no "doing" to WADL and RDF right now.

That said, what is sufficiently obvious in the future may not be obvious now. Maybe we’ll all get smarter. Maybe someone will pioneer this stuff in a way that is really useful (Facebook?), and grow the public’s intuition about describing relationships in an abstract way. But until then I think microformats are going about this the right way, describing those things that are most easily describable.

7 Comments

Jeff Shell says:

August 14, 2007 at 1:27 pm

Regarding bold, italics, strong, and em:

On your comments form, you make special emphasis of “never” in “Your email is never published nor shared.” You didn’t do that because you thought ‘never’ would be more pretty in italics, right? You most likely did it to emphasize the point that you’d never share an email address. And that’s how I interpreted it. In fact, before I filled in my email address, I looked up there to see whether or not it would be published if I didn’t provide a URL, and took comfort in quickly seeing the emphasized never.

You parse semantic markup in rich text all the time. When formatting changes, you apply a reason. RFCs don’t capitalize MUST and SHOULD because the author is thinking in upper-case versus lower-case. They’re putting a strong emphasis on those words. As a reader, you take special notice of those words being formatted that way and immediately recognize that they carry a special importance. So I think that readers do parse writing into semantic markup inside their brains. We just don’t recognize it as such. But the way in which our brains scan text and assemble the words and phrases together does take into account when, where, and why words are highlighted. If there’s no meaning to bold and italic phrasing in text, then randomly apply them to bits of phrases and see how your brain interprets it.

I think that ‘strong’ and ‘emphasis’ are very valid and usable tags – far more so than ‘b’ and ‘i’. I trained my brain to use them a long long time ago as CSS was first appearing on the scene. I realized that I would want to emphasize things differently according to the writing, and instead of italics I might do a yellow background (like a highlighter pen). Or, especially in print media, I might change typefaces entirely and jump from Filosofia to Base 12 Small Caps Bold Italic.
Bruce says:

August 14, 2007 at 3:34 pm

What Jeff said on semantic vs. presentational tags. You might have internalized b and i to the point where you forget the semantics, but they’re still there. And try asking a blind user what “bold” or “italic” means.

As for generalizing microformats, it’s not rocket science. RDFa does just this. But it does require reexamining some sacred cows in the microformats world (like namespaces are bad, real users don’t care about extensibility, that overloading existing properties like class and title is a good long-term solution, etc.).

To get a small sense of the problem with the current microformats approach, consider this: the in-development citation microformat cannot use the most obvious and intuitive property in the world for a book or an article — title — because it’s already taken by hCard. Instead they use “fn” from hCard. Likewise with the new hAudio effort, which had to invent new properties to indicate a title.

Never mind if you want the marital status stuff you note, or want to mix that with some hCard content, which is a thoroughly practical thing that real users want to do but cannot with microformats absent going through an absolutely tortuous process.
Luke Opperman says:

August 14, 2007 at 3:59 pm

Ok, so I agree that content types that refer to a shared why are the way to go. But the question is really what model we use to process and describe a given why, (and ultimately, what shared syntax, such as RDFa vs microformat).
- While I agree with Joe’s example, unfortunately HTTP content types are explicitly not fine-grained enough when we’re talking about microformats or (more generally) any approach for layering multiple semantic descriptions into a document.
- Joe’s concern with generalized content types (“when I get an X document, it could describe anything”) applies directly to HTML once you add microformats. If mixed documents are a goal then we need a way of specifying sub-document types. (I think it’s a desirable goal, even if the sources of the mixed content also live in their own canonical documents which have a single most-appropriate content type – an alternative half-way approach would be a mechanism to specify “this block contains a vCard that is available over there in a vCard-typed document”.)
- In hCard (and microformats more generally?), sub-document types are specified by containers with a class attribute that is an agreed-upon string (“vcard” in the case of hCard).
- In RDF, types are specified with XML namespaces, and those types are defined as ontologies. Using RDFa in the case of vCard, rather than a class “tel” contained in a class “vcard”, you would have an element with property="v:tel" in a document with xmlns:v="http://www.w3.org/2006/vcard/ns".
That is, I think you’ve got the levels wrong if your concern with RDF is that it’s “too general” – RDF(a) maps to microformats-in-general, a specific RDF ontology (broadly, content type) maps to a given microformat such as hCard. The why is no less shared – as you say, your processing code needs to decide “Ah, I know what vCard means, I will process this”.

From a modelling perspective, I’m much more confident that RDF is a solid general approach for describing mixed data than microformats is. In particular, microformats fail for me in appearing to be strictly hierarchical (in large part by having only implied-by-parent namespaces) – if we accept mixed content types in a single document, is there any reason to expect the semantic content will not overlap in a given document? (The other big win for RDF for me here is the ability to be clear about whether the semantic content we’re describing is “about” the document we’re currently in or some other resource.)

(Also, everything Bruce said. :) )
infidel says:

August 14, 2007 at 4:09 pm

… but wouldn’t you hope everyone on a singles search site were single? What good would a marital status indicator do other than to point out the cheaters?
Ian Bicking says:

August 14, 2007 at 4:36 pm

To Jeff:

I should note here that I replied with a [whole other post](https://ianbicking.org/2007/08/14/reflection-and-description-of-meaning/).

To Bruce:

What Jeff said on semantic vs. presentational tags. You might have internalized b and i to the point where you forget the semantics, but they’re still there. And try asking a blind user what “bold” or “italic” means.

If I don’t know what the semantics are, they aren’t there. At least they aren’t there in my writing, because I write with the semantics I understand.

As for blind users, I don’t know. I assume if they are using a screen reader there is some intonation which represents these styles. If we change every <i> on the web to <em> it won’t help them any. If we change just the right <i> tags to <em> then it would help. But that’s unlikely to happen. Are we going to have two buttons on every HTML composer? Plus little popup warnings “did you really mean to use that kind of italic?”

As for generalizing microformats, it’s not rocket science. RDFa does just this.

My argument, at least here, is not that you couldn’t make the formats extensible. And you are right, as a format it may very well be reasonable to do so. As a means of exchanging information it might not be as feasible. That is, being able to describe information isn’t the same as people being able to usefully exchange information. Microformats are tackling the low-hanging fruit of information exchange, where the non-extensibility doesn’t (at least yet) seem so bad. That doesn’t mean it’s the right choice, but it might still be the right strategy. That they take over a global namespace (CSS classes) might seem rude, but it might also be the only way to force some sense of shared meaning. URIs alone don’t build shared meaning.

To Luke:

There is some support for the “over there” model in Microformats, using <a rel="something" href="location_of_metadata" rel="nofollow">. Unfortunately rel is not very extensible. There’s also <a rev="something" href="location_of_what_this_metadata_describes" rel="nofollow">, (hah, the dumb regex puts in nofollow) which is kind of an interesting means of annotation, though it seems to be on the way out in Microformats, and isn’t much used anywhere else either.

I would really like to see an extensible rev/rel. And maybe a more general for, like <label for="id-of-thing-being-labelled">, maybe like <a rel="tag" for="piece-of-content" href="tag_uri" rel="nofollow">. Right now in a Microformat you can only do that through containment, I think.

But if you link to something that isn’t HTML, you are basically creating a dead link for most users. This is where Microformats beats XML formats. Except, of course, that most Microformats are dead for most readers ;)

A more technical problem is that HTML is a naive format handled in naive ways by most consumers and producers. XML namespaces require sophisticated parsing to map between the serialization and the “real” name of something — where the serialization is something like v:tel and the real name is {http://www.w3.org/2006/vcard/ns}tel. RDF makes that even worse if it puts serialization in attributes, so XML parsing alone still doesn’t provide real names. This is just too sophisticated for HTML.
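
Here is a small sketch of that gap between serialization and real name, using Python's standard `xml.etree.ElementTree`. The sample document and the `expand_curie` helper are assumptions made up for illustration; the helper is a simplified stand-in for what an RDFa-aware consumer has to do itself, and it ignores namespace scoping by treating the prefix map as flat.

```python
import io
import xml.etree.ElementTree as ET

# Assumed sample: the same name once as an element and once inside an
# attribute value, the way RDFa-style markup would carry it.
DOC = """
<div xmlns:v="http://www.w3.org/2006/vcard/ns">
  <v:tel>773-555-3821</v:tel>
  <span property="v:tel">773-555-3821</span>
</div>
"""

root = ET.fromstring(DOC)

# The parser resolves element names to their "real" names for us...
element_name = root[0].tag

# ...but an attribute value is just a string, prefix and all.
attribute_value = root[1].get("property")

# Recovering the prefix map takes a second pass over the namespace
# declaration events, since the parsed tree does not keep the prefixes.
prefixes = dict(
    ns for _event, ns in ET.iterparse(io.StringIO(DOC), events=("start-ns",))
)


def expand_curie(value, prefix_map):
    """Hypothetical helper: turn 'v:tel' into '{namespace-uri}tel'."""
    prefix, _, local = value.partition(":")
    return "{%s}%s" % (prefix_map[prefix], local)


print(element_name)
print(attribute_value)
print(expand_curie(attribute_value, prefixes))
```

Even with a real XML parser the consumer has to do the attribute half by hand, and most things that read HTML are not using a real XML parser in the first place.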

To infidel:

but wouldn’t you hope everyone on a singles search site were single? What good would a marital status indicator do other than to point out the cheaters?

A singles search site that searched the entire web for single people! On the web no one can look for your wedding band. But on the web with a marital status microformat…?
Luke Opperman says:

August 15, 2007 at 7:30 pm

For the moment my reply on namespaces is just going to be that there is at least a subset of tools that handle namespaces. Since, as you say, the semantic content is currently dead to users, let’s choose the appropriate conceptual model and push tools to meet us rather than hobbling ourselves to re-invent namespaces in an ad-hoc fashion. I’m arguing that this is not a case of YAGNI: historically we’ve been down the “hierarchical is good enough!” path too many times and found it lacking, and as I argued in my last comment, in this particular domain it’s all too easy to see structural overlap occurring in mixed documents. So in this case an ad-hoc low-hanging fruit policy works for me only if there’s a visible path out. I don’t see that in microformats; I see it already solved in RDF(a).

My pessimism says you’re backing the right horse for what’s going to take off first, without some heavy evangelizing from the RDFa side and maybe even then.

Based on your response to Bruce, I think I hear you saying the point of your post is less about this discussion of extensible formats but about extensible ontologies (shared why) – in the post, how are we all going to agree to add marital status to vCard. My answer is clear: RDF, overlapping HTML-style-ignore-what-you-don’t-know semantic overlays via RDFa, and overlapping/mapping between ontologies. You find a spec/ontology that models marital status, you describe your content to the extent you care about in both vCard and that ontology, and then we’re just faced with mapping between ontologies for those who have reason to model both.

I find thinking in terms of RDF appropriate because I can easily map it to relational terms where different ontologies are views and the resources we’re describing are keys. RDF (and certainly relational theory) at least gives a fairly clear model for creating mappings between ontologies, even in the likely case of both being views on top of each application’s specific entity model. For me, the semantic web is the relational web – hence, hierarchy is insufficient and clear “about” (key) support is fundamental.
Aristotle Pagaltzis says:

August 20, 2007 at 11:29 am

There’s a very simple way to decide whether you should use <i> or <em>: imagine that someone read the text out loud. Should they speak the italicized words flatly or emphatically?

Actually, that is precisely what screen readers do when they encounter these tags. And that is why I do make a distinction between them. Picking among the two options isn’t just ivory-tower faffing; it has real value for a segment of the audience.
