158 weeks ago
How much metadata do we need?
I was surprised to find out that most of the work labeled "semantic web" has in fact been done on metadata. Metadata are data about data, basically a description of the data. But why does anyone think it matters? We have had a feature in MS Word for entering metadata about a document for years, and I still haven't seen that feature used even once. One can argue that once we have search engines that make use of such descriptions, people will start filling metadata out, but that's simply not true! We have metadata fields in HTML, but search engines discard them for a reason. When data is used by a computer to calculate relevance but is not seen by a human most of the time, it is a natural target for abuse! As Clay Shirky says: "People lie."
On a side note, I am sure that a collective ontology is not practical either. Just ask yourself: what would you use it for? When the web was born, it was adopted because it had at least one killer app: the telephone book. I can see how a standardized ontology format can benefit certain user groups, but the masses do not care.
Back to metadata. Chances are you've heard about the Dublin Core Metadata Initiative. They even have a Firefox extension that lights up when a page has embedded Dublin Core metadata. Then, when you click it, a native window pops up with a nice listing of the description fields available. This is nifty, yet useless. I didn't stumble upon many pages that had such info available, and when I did, clicking that nice Dublin Core icon was obligatory (after all, that's what I installed the extension for). The window readily showed, but it never gave me any new information I wanted to know. Actually, if something mattered, it was already on the page itself; there was no need to dig down into metadata. Does this technology suggest people should stop signing documents and showing titles on pages, and just shove all this information another click away? Surely not, so I guess we are expected to duplicate the data so AI can pick it up. It might be good for some things, but it has nothing to offer to the mainstream.
I insist that if we want to make data more semantic in a practical way, then we should build conventions and technology that display the same dataset that is processed by semantic-aware software. This is important: such technology would make sure the same data are shown to the user and processed by search engines and the like. And that means more presentation control should shift to the user (and this way we would get an all-skinnable web too, yay!).
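To make that concrete, here is a minimal Python sketch, not a real implementation. The `Record` type, the `render` and `index_terms` functions and the choice of fields are all hypothetical names picked for illustration. The only point is that the display and the search index both read the very same fields, so there is no hidden place left to lie in.

```python
from dataclasses import dataclass, fields
from typing import Set


@dataclass(frozen=True)
class Record:
    # Hypothetical record type; the field names are illustrative only.
    title: str
    author: str
    preview: str


def render(record: Record) -> str:
    """What the reader sees: every field, in declaration order."""
    return "\n".join(
        f"{f.name}: {getattr(record, f.name)}" for f in fields(record)
    )


def index_terms(record: Record) -> Set[str]:
    """What a search engine would index: words from the very same fields."""
    terms: Set[str] = set()
    for f in fields(record):
        terms.update(getattr(record, f.name).lower().split())
    return terms


record = Record(
    title="How much metadata do we need?",
    author="Sergey",
    preview="Metadata can be handled differently, this article describes how.",
)
page = render(record)
terms = index_terms(record)
```

Since both functions walk the same fields, anything the indexer picks up is also sitting on the page in front of the reader.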
Let me take this idea further: as the display logic is more user-controlled than it is in the HTML world, we will need client-side software that is somewhat aware of the data it is displaying (can you smell semantics yet?). So basically an article in such a hypothetical data format would need to expose the following fields (remember, we want to keep things atomic and separated). Keep in mind this is just an example, a draft, and as such it is bound to be missing some things; please speak up in the comments:
- datatype definition: http://standard-entity.org/schemas/text/std-article.xml
This would be a URL that is recognised by the client. If the client has special rules for displaying documents in this schema, it will use its custom code; if not, it can fall back to a somewhat generic display with a template extracted from the definition itself. Please note that I am not suggesting using the XML Schema format for this.
- title: How much metadata do we need?
This is the article title. It will be used as the contents of the “title” tag in HTML and as the shortest possible identifier when returning search results, listings and such. It is also shown as the title of the article in whatever template is used (so it's implausible to cheat here).
- preview: Metadata can be handled differently, this article describes how.
An optional field used to give a somewhat more extended preview of what the article is about.
- author: Sergey
Used in the template as well as in search.
- ??: ??
Probably more generic fields.
- body: http://server.com/article-itself.html
This gives an address from which to fetch the article text itself (it should be kept separate so as not to make the format a mess). The article-body file should be "plain" HTML -- just a set of paragraphs with the necessary formatting; anything beyond that is against the whole idea of neatness.
- body-content-type: text/html
This specifies what format to expect from the article body; it can be something other than text/html.
- default-stylesheet: ...
Only if you wish to see the article as the author suggested. The user is welcome to override this.
This is just an idea, but it is obvious how this would improve the web experience while having considerably fewer drawbacks than the whole RDF thing. Also, I can tell exactly why this is useful and why the metadata will be relevant -- it's good because it gives the user more control and helps content creators keep things tidy, while keeping metadata true, because it's the same data as everything else. It is not a separate entity anymore; correct metadata are inherent in this system.
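The field list above can be sketched in a few lines of Python. Treat this as a rough draft under loud assumptions: the "name: value" line syntax, the function names (`parse_descriptor`, `render_article`, `render_generic`, `display`) and the `user_renderer` hook are all invented here for illustration. The generic fallback simply lists the fields instead of extracting a template from the definition, and `default-stylesheet` is left out because its value is not specified.

```python
from typing import Callable, Dict, Optional

# Field names and values come from the draft list; the "name: value"
# line syntax is an assumption made only for this sketch.
DESCRIPTOR_TEXT = """\
datatype definition: http://standard-entity.org/schemas/text/std-article.xml
title: How much metadata do we need?
preview: Metadata can be handled differently, this article describes how.
author: Sergey
body: http://server.com/article-itself.html
body-content-type: text/html
"""

Descriptor = Dict[str, str]
Renderer = Callable[[Descriptor], str]


def parse_descriptor(text: str) -> Descriptor:
    """Split each non-empty line at its first colon into name and value."""
    result: Descriptor = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, _, value = line.partition(":")
        result[name.strip()] = value.strip()
    return result


def render_article(d: Descriptor) -> str:
    """Hypothetical custom display for the article schema."""
    return (
        f"{d['title'].upper()}\n"
        f"by {d['author']}\n\n"
        f"{d.get('preview', '')}\n"
        f"[body at {d['body']}, as {d['body-content-type']}]"
    )


def render_generic(d: Descriptor) -> str:
    """Simplified fallback: just list every field the descriptor exposes."""
    return "\n".join(f"{name}: {value}" for name, value in d.items())


# Schemas this hypothetical client has special display rules for.
CUSTOM_RENDERERS: Dict[str, Renderer] = {
    "http://standard-entity.org/schemas/text/std-article.xml": render_article,
}


def display(d: Descriptor, user_renderer: Optional[Renderer] = None) -> str:
    """A user override wins, then a known schema, then the generic fallback."""
    if user_renderer is not None:
        return user_renderer(d)
    renderer = CUSTOM_RENDERERS.get(d.get("datatype definition", ""), render_generic)
    return renderer(d)


descriptor = parse_descriptor(DESCRIPTOR_TEXT)
```

Calling `display(descriptor)` uses the article-specific display, while a descriptor with an unrecognised datatype URL falls through to the plain field listing. Passing `user_renderer` stands in for the user overriding how things look.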
This seems like it's miles away, but in fact it's not. I'll develop the ideas presented here next time.