ginger's thoughts

Silvia's blog

Tag: media-fragments

Media Fragment URI Specification in Last Call WD

After two years of effort, the W3C Media Fragment WG has now created a Last Call Working Draft document. This means that the working group is fairly confident that they have addressed all the required issues for media fragment URIs and their implementation on HTTP and is asking for outside experts and groups for input. This is the time for you to get active and proof-read the specification thoroughly and feed back all the concerns that you have and all the things you do not understand!

The media fragment (MF) URI specification specifies two types of MF URIs: those created with a URI fragment (”#”), e.g. video.ogv#t=10,20 and those with a URI query (”?”), e.g. video.ogv?t=10,20. There is a fundamental difference between the two that needs to be appreciated: with a URI fragment you can specify a subpart of a resource, e.g. a subpart of a video, while with a URI query you will refer to a different resource, i.e. a “new” video. This is an important difference to understand for media fragments, because only some things that we want to achieve with media fragments can be achieved with ”#”, while others can only be achieved by transforming the resource into a different new bitstream.

This all sounds very abstract, so let me give you an example. Say you want to retrieve a video without its audio track. Say you’d rather not download the audio track data, since you want to save on bandwidth. So, you are only interested to get the video data. The URI that you may want to use is video.ogv#track=video. This means that you don’t want to change the video resource, but you only want to see the video. The user agent (UA) has two options to resolve such a URI: it can either map that request to byte ranges and just retrieve those - or it can download the full resource and ignore the data it has not been requested to display.

Since we do not want the extra bytes of the audio track to be retrieved, we would hope the UA can do the byte range requests. However, most Web video formats will interleave the different tracks of a media resource in time such that a video track will results in a gazillion of smaller byte ranges. This makes it impractical to retrieve just the video through a ”#” media fragment. Thus, if we really want this functionality, we have to make the server more intelligent and allow creation of a new resource from the existing one which doesn’t contain the audio. Then, the server, upon receiving a request such as video.ogv#track=video can redirect that to video.ogv?track=video and actually serve a new resource that satisfies the needs.

This is in fact exactly what was implemented in a recently published Firefox Plugin written by Jakub Sendor - also described in his presentation “Media Fragment Firefox plugin”.

Media Fragment URIs are defined for four dimensions:

  • temporal fragments
  • spatial fragments
  • track fragments
  • named fragments

The temporal dimension, while not accompanied with another dimension, can be easily mapped to byte ranges, since all Web media formats interleave their tracks in time and thus create the simple relationship between time and bytes.

The spatial dimension is a very complicated beast. If you address a rectangular image region out of a video, you might want just the bytes related to that image region. That’s almost impossible since pixels are encoded both aggregated across the frame and across time. Also, actually removing the context, i.e. the image data outside the region of interest may not be what you want - you may only want to focus in on the region of interest. Thus, the proposal for what to do in the spatial dimension is to simply retrieve all the data and have the UA deal with the display of the focused region, e.g. putting a dark overlay over the regions outside the region of interest.

The track dimension is similarly complicated and here it was decided that a redirect to a URI query would be in order in the demo Firefox plugin. Since this requires an intelligent server - which is available through the Ninsuna demo server that was implemented by Davy Van Deursen, another member of the MF WG - the Firefox plugin makes use of that. If the UA doesn’t have such an intelligent server available, it may again be most useful to only blend out the non-requested data on the UA similar to the spatial dimension.

The named dimension is still a largely undefined beast. It is clear that addressing a named dimension cannot be done together with the other dimensions, since a named dimension can represent any of the other dimensions above, and even a combination of them. Thus, resolving a named dimension requires an understanding of either the UA or the server what the name maps to. If, for example, a track has a name in a media resource and that name is stored in the media header and the UA already has a copy of all the media headers, it can resolve the name to the track that is being requested and take adequate action.

But enough explaining - I have made a screencast of the Firefox plugin in action for all these dimensions, which explains things a lot more concisely than word will ever be able to - enjoy:

And do not forget to proofread the specification and send feedback to public-media-fragment@w3.org.

W3C Media Annotations API standard

Recently, I was asked to review the W3C Media Annotations specifications as they are about to go into Last Call (a state that comes before the request for implementations at the W3C).

The W3C Media Annotations group has defined a set of metadata that they believe is representative and common for media resources. The ontology consist of the following fields:

  • ma:identifier: a URI or string to identify a resource
  • ma:title: a string providing the title of the resource
  • ma:language: a language code describing the language used in the resource
  • ma:locator: the URI at which the resource can be accessed
  • ma:contributor: a URI or string identifying the contributor and the nature of the contribution
  • ma:creator: a URI or string identifying an author
  • ma:createDate: a date of creation or publication of the resource
  • ma:location: a string or geo code identifying where the resource has been shot/recorded
  • ma:description: a string describing the content of the resource
  • ma:keyword: a word or word combination providing a topic, keyword or tag representing the resource
  • ma:genre: a string providing the genre of the resource
  • ma:rating: rating value, including the rating scale
  • ma:relation: a URI and string identifying a related resource and the relationship
  • ma:collection: a URI or string providing the name of a collection to which the resource belongs
  • ma:copyright: a URI or string with the copyright statement.
  • ma:license: a string or URI with the usage license
  • ma:publisher: a string or URI with the publisher of the resource
  • ma:targetAudience: a URI and classification string providing the issuer of the classification and the classification value
  • ma:fragments: a list of string and URI values that identify media fragments and their type
  • ma:namedFragments: a list of string and URI values the provide names to media fragments
  • ma:frameSize: a width - height pair in pixels
  • ma:compression: a string providing the compression algorithm
  • ma:duration: a float to provide the resource duration in seconds
  • ma:format String: the mime type of the resource
  • ma:samplingrate: a float with the audio sampling rate
  • ma:framerate: a float with the video frame rate
  • ma:bitrate: a float providing the average bit rate in kbps
  • ma:numTracks: an int of the number of tracks

Note that some of these fields are not single values, but simple constructs of multiple values. Thus, they are actually more complex than name-value pairs that, e.g. are typically used in HTML meta headers or in Dublin Core. I regard this as an issue for implementations.

The fields were chosen as typical metadata being available about media resources. The media fragments fields are a bit dubious in this respect, but could be useful in future.

The metadata is determined either from within the resource itself or from a metadata collection about the resource. As such, the document maps several existing metadata and media resource formats to this interface, amongst them:

As they didn’t have a mapping table for Ogg content, I offered the following:

MAWGRelationOgg propertiesHow to do the mappingDatatype
Descriptive Properties (Core Set)
Identification
ma:identifierexactNameName field in skeleton header (new)String
ma:titleexactTitleTITLE field in vorbiscomment headerString
exactTitleTitle field in skeleton header (new)String
relatedAlbumALBUM title in vorbiscomment headerString
ma:languageexactLanguageLanguage field in skeleton header (new)language code
ma:locatorexactfile URI from systemURI
Creation
ma:contributorexactArtist, PerformerARTIST and PERFORMER vorbiscomment headersStrings
ma:creatorrelatedOrganizationORGANIZATION field in vorbiscomment header
ma:createDateexactDateDATE field in vorbiscomment headerISO date format
ma:locationexactLocationLOCATION field in vorbiscomment headerString
Content description
ma:descriptionexactDescriptionDESCRIPTION field in vorbiscomment headerString
ma:keywordN/A
ma:genreexactGenreGENRE field in vorbiscomment headerString
ma:ratingN/A
Relational
ma:relationrelatedVersion, TracknumberVERSION (version of a title), TRACKNUMBER (CD track) fields in vorbiscomment headerStrings
ma:collectionrelatedAlbumALBUM field of vorbiscomment headerString
Rights
ma:copyrightexactCopyrightCOPYRIGHT field of vorbiscomment headerString
ma:licenseexactLicenseLICENSE field of vorbiscomment headerString
Distribution
ma:publisherrelatedOrganizationORGNIZATION field of vorbiscomment headerString
ma:targetAudiencemore specificRoleRole field of Skeleton header (new)String
Fragments
ma:fragmentsN/A
ma:namedFragmentsN/A
Technical Properties
ma:frameSizeexactextract from binary header of video trackint, int (width x height)
ma:compressionexactContent-typeContent-type field of Skeleton headerMIME type
ma:durationexactcalculate as duration = last_sample_time - first_sample_time of OggIndex header of skeletonFloat (or rather: rational - rational)
ma:formatexactContent-typeContent-type field of Skeleton headerMIME type
ma:samplingrateexactcalculate as granulerate = granulerate_numerator / granulerate_denominator of Skeleton headerRational (or rather int / int)
ma:framerateexactcalculate as granulerate = granulerate_numerator / granulerate_denominator of Skeleton headerRational (or rather int / int)
ma:bitrateexactcalculate as bitrate = length_of_segment / duration from OggIndex headers of skeletonFloat
ma:numTracksexactTracknumberTRACKNUMBER field of vorbiscomment header (track number on album)Int

You will notice that the table mentions 4 fields in skeleton with a “new” marker - they are actually proposed fields in skeleton - a bit of coding will be necessary to introduce them into software. The space for these fields already exists in message header fields, so it won’t require a change of the skeleton format.

In the second specification of the Media Annotations WG, the group offers a standard API to access (i.e. read) the defined fields. They also intend to create an API to write the fields, but I doubt that will be easy because of the vast amount of file types they intend to support.

There is basically a single function that allows the extraction of metadata: MAObject[] getProperty(in DOMString propertyName, in optional DOMString sourceFormat, in optional DOMString subtype, in optional DOMString language, in optional DOMString fragment );

I proposed it may be possible to include this into HTML5 as follows: interface HTMLMediaElement : HTMLElement { ... getter MAObject getProperty(in DOMString propertyName, in optional unsigned long trackIndex); ... }

This would either extract the property for a particular track in a media resource or for the complete resource if no track index is given. The only problem I see is that the returned object is different depending on the requested property - the MAObject is only a parent class for the returned object types. I am not sure it is therefore possible to specify this easily in HTML5.

Overall I thought the specification was a nice piece of work. I am not sure I agree with all the chosen fields, but that is always an issue with metadata. The most important fields are there and that’s what matters.

W3C Technical Plenary / Advisory Committee Meetings Week 2008

I spent last week in France, near Cannes, at the W3C TPAC meeting. This is the one big meeting that the W3C has every year to bring together all (or most) of the technical working groups and other active groups at the W3C.

It was not my first time at a standards body meeting - I have been part of ISO/MPEG before and also of IETF, and spoken with people at IEEE and SMPTE. However, this time was different. I felt like I was with people that spoke my language. I also felt like my experience was valued and will help solving some of the future challenges for the Web. I am very excited to be an invited expert on the Media Fragments and Media Annotations working groups and be able to provide input into HTML5.

In the Media Fragments working group we are developing a URI addressing scheme that enables direct linking to media fragments, in particular temporal and spatial segments. Experience from our earlier temporal URI scheme is one of the inputs to the scheme. Currently it looks likely that we will choose a scheme that has ”#” in it and then require changes to browsers, Web proxys, and servers to enable delivery of media fragments.

In the Media Annotations working group we are deciding upon an ontology to generically describe media resources - something based on Dublin Core but more extended and more appropriate for audio and video. We are currently looking at Adobe’s XMP specification.

As for HTML5 - there was not much of a discussion at the TPAC meeting about the audio and video elements (unless I missed it by attending the other groups). However, from some of the discussions it became clear to me that they are still in very early stage of specification and much can be done to help define the general architecture of how to publish video on the Web and its metadata, help define javascript APIs and DOM models, and help define accessibility.

I actually gave a lightning talk about the next challenges of HTML5 video at TPAC (see my “video slides”) which points out the need for standard definitions of video structure and annotations together with an API to reach them. I had lots of discussions with people afterwards and also learnt a lot more about how to do accessibility for Web video. I should really write it up in an article…

Of course, I also met a lot of cool people at TPAC, amongst them Larry Masinter, Ian Hickson, and Tim Berners-Lee - past and new heros of Web standards. :-) It was totally awesome and I am very grateful to Mozilla for sending me there and enabling me to learn more about the greater picture of video accessibility and the role it plays on the Web.