ginger's thoughts

Silvia's blog

Tag: video-accessibility

WebVTT as a W3C Recommendation

Three weeks ago I attended TPAC, the annual meeting of W3C Working Groups. One of the meetings was of the Timed Text Working Group (TT-WG), that has been specifying TTML, the Timed Text Markup Language. It is now proposed that WebVTT be also standardised through the same Working Group.

How did that happen, you may ask, in particular since WebVTT and TTML have in the past been portrayed as rival caption formats? How will the WebVTT spec that is currently under development in the Text Track Community Group (TT-CG) move through a Working Group process?

I’ll explain first why there is a need for WebVTT to become a W3C Recommendation, and then how this is proposed to be part of the Timed Text Working Group deliverables, and finally how I can see this working between the TT-CG and the TT-WG.

Advantages of a W3C Recommendation

TTML is a XML-based markup format for captions developed during the time that XML was all the hotness. It has become a W3C standard (a so-called “Recommendation”) despite not having been implemented in any browsers (if you ask me: that’s actually a flaw of the W3C standardisation process: it requires only two interoperable implementations of any kind – and that could be anyone’s JavaScript library or Flash demonstrator – it doesn’t actually require browser implementations. But I digress…). To be fair, a subpart of TTML is by now implemented in Internet Explorer, but all the other major browsers have thus far rejected proposals of implementation.

Because of its Recommendation status, TTML has become the basis for several other caption standards that other SDOs have picked: the SMPTE’s SMPTE-TT format, the EBU’s EBU-TT format, and the DASH Industry Forum’s use of SMPTE-TT. SMPTE-TT has also become the “safe harbour” format for the US legislation on captioning as decided by the FCC. (Note that the FCC requirements for captions on the Web are actually based on a list of features rather than requiring a specific format. But that will be the topic of a different blog post…)

WebVTT is much younger than TTML. TTML was developed as an interchange format among caption authoring systems. WebVTT was built for rendering in Web browsers and with HTML5 in mind. It meets the requirements of the element and supports more than just captions/subtitles. WebVTT is popular with browser developers and has already been implemented in all major browsers (Firefox Nightly is the last to implement it – all others have support already released).

As we can see and as has been proven by the HTML spec and multiple other specs: browsers don’t wait for specifications to have W3C Recommendation status before they implement them. Nor do they really care about the status of a spec – what they care about is whether a spec makes sense for the Web developer and user communities and whether it fits in the Web platform. WebVTT has obviously achieved this status, even with an evolving spec. (Note that the spec tries very hard not to break backwards compatibility, thus all past implementations will at least be compatible with the more basic features of the spec.)

Given that Web browsers don’t need WebVTT to become a W3C standard, why then should we spend effort in moving the spec through the W3C process to become a W3C Recommendation?

The modern Web is now much bigger than just Web browsers. Web specifications are being used in all kinds of devices including TV set-top boxes, phone and tablet apps, and even unexpected devices such as white goods. Videos are increasingly omnipresent thus exposing deaf and hard-of-hearing users to ever-growing challenges in interacting with content on diverse devices. Some of these devices will not use auto-updating software but fixed versions so can’t easily adapt to new features. Thus, caption producers (both commercial and community) need to be able to author captions (and other video accessibility content as defined by the HTML5 element) towards a feature set that is clearly defined to be supported by such non-updating devices.

Understandably, device vendors in this space have a need to build their technology on standardised specifications. SDOs for such device technologies like to reference fixed specifications so the feature set is not continually updating. To reference WebVTT, they could use a snapshot of the specification at any time and reference that, but that’s not how SDOs work. They prefer referencing an officially sanctioned and tested version of a specification – for a W3C specification that means creating a W3C Recommendation of the WebVTT spec.

Taking WebVTT on a W3C recommendation track is actually advantageous for browsers, too, because a test suite will have to be developed that proves that features are implemented in an interoperable manner. In summary, I can see the advantages and personally support the effort to take WebVTT through to a W3C Recommendation.

Choice of Working Group

FAIK this is the first time that a specification developed in a Community Group is being moved into the recommendation track. This is something that has been expected when the W3C created CGs, but not something that has an established process yet.

The first question of course is which WG would take it through to Recommendation? Would we create a new Working Group or find an existing one to move the specification through? Since WGs involve a lot of overhead, the preference was to add WebVTT to the charter of an existing WG. The two obvious candidates were the HTML WG and the TT-WG – the first because it’s where WebVTT originated and the latter because it’s the closest thematically.

Adding a deliverable to a WG is a major undertaking. The TT-WG is currently in the process of re-chartering and thus a suggestion was made to add WebVTT to the milestones of this WG. TBH that was not my first choice. Since I’m already an editor in the HTML WG and WebVTT is very closely related to HTML and can be tested extensively as part of HTML, I preferred the HTML WG. However, adding WebVTT to the TT-WG has some advantages, too.

Since TTML is an exchange format, lots of captions that will be created (at least professionally) will be in TTML and TTML-related formats. It makes sense to create a mapping from TTML to WebVTT for rendering in browsers. The expertise of both, TTML and WebVTT experts is required to develop a good mapping – as has been shown when we developed the mapping from CEA608/708 to WebVTT. Also, captioning experts are already in the TT-WG, so it helps to get a second set of eyes onto WebVTT.

A disadvantage of moving a specification out of a CG into a WG is, however, that you potentially lose a lot of the expertise that is already involved in the development of the spec. People don’t easily re-subscribe to additional mailing lists or want the additional complexity of involving another community (see e.g. this email).

So, a good process needs to be developed to allow everyone to contribute to the spec in the best way possible without requiring duplicate work. How can we do that?

The forthcoming process

At TPAC the TT-WG discussed for several hours what the next steps are in taking WebVTT through the TT-WG to recommendation status (agenda with slides). I won’t bore you with the different views – if you are keen, you can read the minutes.

What I came away with is the following process:

  1. Fix a few more bugs in the CG until we’re happy with the feature set in the CG. This should match the feature set that we realistically expect devices to implement for a first version of the WebVTT spec.
  2. Make a FSA (Final Specification Agreement) in the CG to create a stable reference and a clean IPR position.
  3. Assuming that the TT-WG’s charter has been approved with WebVTT as a milestone, we would next bring the FSA specification into the TT-WG as FPWD (First Public Working Draft) and immediately do a Last Call which effectively freezes the feature set (this is possible because there has already been wide community review of the WebVTT spec); in parallel, the CG can continue to develop the next version of the WebVTT spec with new features (just like it is happening with the HTML5 and HTML5.1 specifications).
  4. Develop a test suite and address any issues in the Last Call document (of course, also fix these issues in the CG version of the spec).
  5. As per W3C process, substantive and minor changes to Last Call documents have to be reported and raised issues addressed before the spec can progress to the next level: Candidate Recommendation status.
  6. For the next step - Proposed Recommendation status - an implementation report is necessary, and thus the test suite needs to be finalized for the given feature set. The feature set may also be reduced at this stage to just the ones implemented interoperably, leaving any other features for the next version of the spec.
  7. The final step is Recommendation status, which simply requires sufficient support and endorsement by W3C members.

The first version of the WebVTT spec naturally has a focus on captioning (and subtitling), since this has been the dominant use case that we have focused on this far and it’s the part that is the most compatibly implemented feature set of WebVTT in browsers. It’s my expectation that the next version of WebVTT will have a lot more features related to audio descriptions, chapters and metadata. Thus, this seems a good time for a first version feature freeze.

There are still several obstacles towards progressing WebVTT as a milestone of the TT-WG. Apart from the need to get buy-in from the TT-WG, the TT-CG, and the AC (Adivisory Committee who have to approve the new charter), we’re also looking at the license of the specification document.

The CG specification has an open license that allows creating derivative work as long as there is attribution, while the W3C document license for documents on the recommendation track does not allow the creation of derivative work unless given explicit exceptions. This is an issue that is currently being discussed in the W3C with a proposal for a CC-BY license on the Recommendation track. However, my view is that it’s probably ok to use the different document licenses: the TT-WG will work on WebVTT 1.0 and give it a W3C document license, while the CG starts working on the next WebVTT version under the open CG license. It probably actually makes sense to have a less open license on a frozen spec.

Making the best of a complicated world

WebVTT is now proposed as part of the recharter of the TT-WG. I have no idea how complicated the process will become to achieve a W3C WebVTT 1.0 Recommendation, but I am hoping that what is outlined above will be workable in such a way that all of us get to focus on progressing the technology.

At TPAC I got the impression that the TT-WG is committed to progressing WebVTT to Recommendation status. I know that the TT-CG is committed to continue developing WebVTT to its full potential for all kinds of media-time aligned content with new kinds already discussed at FOMS. Let’s enable both groups to achieve their goals. As a consequence, we will allow the two formats to excel where they do: TTML as an interchange format and WebVTT as a browser rendering format.

Summary Video Accessibility Talk

I’ve just got off a call to the UK Digital TV Group, for which I gave a talk on HTML5 video accessibility (slides best viewed in Google Chrome).

The slide provide a high-level summary of the accessibility features that we’ve developed in the W3C for HTML5, including:

  • Subtitles & Captions with WebVTT and the track element
  • Video Descriptions with WebVTT, the track element and speech synthesis
  • Chapters with WebVTT for semantic navigation
  • Audio Descriptions through synchronising an audio track with a video
  • Sign Language video synchronized with a main video

I received some excellent questions.

The obvious one was about why WebVTT and not TTML. While for anyone who has tried to implement TTML support, the advantages of WebVTT should be clear, for some the decision of the browsers to go with WebVTT still seems to be bothersome. The advantages of CSS over XSL-FO in a browser-context are obvious, but not as much outside browsers. So, the simplicity of WebVTT and the clear integration with HTML have to speak for themselves. Conversion between TTML and WebVTT was a feature that was being asked for.

I received a question about how to support ducking (reduce the volume of the main audio track) when using video descriptions. My reply was to either use video descriptions with WebVTT and do ducking during the times that a cue is active, or when using audio descriptions (i.e. actual audio tracks) to add an additional WebVTT file of kind=metadata to mark the intervals in which to do ducking. In both cases some JavaScript will be necessary.

I received another question about how to do clean audio, which I had almost forgotten was a requirement from our earlier media accessibility document. “Clean audio” consists of isolating the audio channel containing the spoken dialog and important non-speech information that can then be amplified or otherwise modified, while other channels containing music or ambient sounds are attenuated. I suggested using the mediagroup attribute to provide a main video element (without an audio track) and then the other channels as parallel audio tracks that can be turned on and off and attenuated individually. There is some JavaScript coding involved on top of the APIs that we have defined in HTML, but it can be implemented in browsers that support the mediagroup attribute.

Another question was about the possibilities to extend the list of @kind attribute values. I explained that right now we have a proposal for a new text track kind=“forced” so as to provide forced subtitles for sections of video with foreign language. These would be on when no other subtitle or caption tracks are activated. I also explained that if there is a need for application-specific text tracks, the kind=“metadata” would be the correct choice.

I received some further questions, in particular about how to apply styling to captions (e.g. color changes to text) and about how closely the browser are able to keep synchronization across multiple media elements. The earlier was easily answered with the ::cue pseudo-element, but the latter is a quality of implementation feature, so I had to defer to individual browsers.

Overall it was a good exercise to summarize the current state of HTML5 video accessibility and I was excited to show off support in Chrome for all the features that we designed into the standard.

My crazy linux.conf.au week

In January I attended the annual Australian Linux and Open Source conference (LCA). But since I was sick all of January and had a lot to catch up on, I never got around to sharing all the talks that I gave during that time.

Drupal Down Under

It started with a talk at Drupal Down Under, which happened the weekend before LCA. I gave a talk titled “HTML5 video specifications” (video, slides).

I spoke about the video and audio element in HTML5, how to provide fallback content, how to encode content, how to control them from JavaScript, and briefly about Drupal video modules, though the next presentation provided much more insight into those. I explained how to make the HTML5 media elements accessible, including accessible controls, captions, audio descriptions, and the new WebVTT file format. I ran out of time to introduce the last section of my slides which are on WebRTC.

Linux.conf.au

On the first day of LCA I gave a talk both in the Multimedia Miniconf and the Browser Miniconf.

Browser Miniconf

In the Browser Miniconf I talked about “Web Standardisation – how browser vendors collaborate, or not” (slides). Maybe the most interesting part about this was that I tried out a new slide “deck” tool called impress.js. I’m not yet sure if I like it but it worked well for this talk, in which I explained how the HTML5 spec is authored and who has input.

I also sat on a panel of browser developers in the Browser Miniconf (more as a standards than as a browser developer, but that’s close enough). We were asked about all kinds of latest developments in HTML5, CSS3, and media standards in the browser.

Multimedia Miniconf

In the Multimedia Miniconf I gave a “HTML5 media accessibility update” (slides). I talked about the accessibility problems of Flash, how native HTML5 video players will be better, about accessible video controls, captions, navigation chapters, audio descriptions, and WebVTT. I also provided a demo of how to synchronize multiple video elements using a polyfill for the multitrack API.

I also provided an update on HTTP adaptive streaming APIs as a lightning talk in the Multimedia Miniconf. I used an extract of the Drupal conference slides for it.

Main conference

Finally, and most importantly, Alice Boxhall and myself gave a talk in the main linux.conf.au titled “Developing Accessible Web Apps - how hard can it be?” (video, slides). I spoke about a process that you can follow to make your Web applications accessible. I’m writing a separate blog post to explain this in more detail. In her part, Alice dug below the surface of browsers to explain how the accessibility markup that Web developers provide is transformed into data structures that are handed to accessibility technologies.

3rd W3C Web and TV Workshop, Hollywood

Curious about any new requirements that the TV community may have for HTML5 video, I attended the W3C Web and TV Workshop in Hollywood last week. It’s already the third of its kind and was also the largest to date showing an increasing interest of the TV community to converge with the Web community.

The Workshop Aim

I went into the Workshop not quite knowing what to expect. My previous contact with members of this community was restricted to email exchanges on the W3C Web and TV Interest Group (IG) mailing list. I knew there was some interest in video accessibility (well: particularly captions) and little knowledge of existing HTML5 specifications around text tracks and why the browsers were going with WebVTT. So I had decided to attend the workshop to get a better understanding of the community, it’s background, needs, and issues, and to hopefully teach some of the ways of HTML5. For that reason I had also submitted a WebVTT presentation/demo.

As it turned out, the workshop had as its key target the facilitation of communication between the TV and the HTML5 community. The aim was to identify features that need to be added to the HTML5 video element to satisfy the needs of the TV community. I obviously came to the right workshop.

The process that is being used by the W3C in the Interest Group is to have TV community members express their needs, then have HTML5 experts express how these needs can be satisfied with existing HTML5 features, then make trial implementations and identify any shortcomings, then move forward to progress these through HTML5 or HTML.next. This workshop clearly focused on the first step: expressing needs.

Often times it was painful for me to watch presenters defending their requirements and trying to impress on the audience how important a certain feature is to them when that features actually already has a HTML5 specification, but just not yet a browser implementations. That there were so few HTML5 video experts present and that they were given very little space to directly reply to the expressed needs and actually explain what is already possible (or specified to be possible) was probably one of the biggest drawbacks of the workshop.

To be fair, detailed technical discussions were not possible in a room with 150 attendees with a panel sitting at the front discussing topics and taking questions. Solving a use case with existing HTML5 markup and identifying the gaps requires smaller break-out groups of a maximum of maybe 20 people and sufficient HTML5 knowledge in the room. Ultimately they require a single person to try to implement it using JavaScript alone, and, failing that, writing browser extensions. Only such code actually proves that a feature is missing.

Now, the video features of HTML5 are still continuing to change almost on a daily basis. Much development is, for example, happening around real-time communication features and around the track element as we speak. So, focusing on further requirements finding around HTML5 video for now is probably a good thing.

The TV Community Approach

Before I move on to some of the topics covered by the workshop, I have to express some concern about the behaviour that I observed with lots of the TV community folks. Many people tried pushing existing solutions from other spaces into the Web unchanged with a claim of not re-inventing the wheel and following paved cowpaths, which are some of the underlying design principles for HTML5. I can understand where such behaviour originates thinking that having solved the same problems elsewhere before, those solutions should apply here, too. But I would like to warn people of this approach.

If we blindly apply solutions that were not developed for HTML5 into HTML we will end up with suboptimal solutions that will hurt us further down the track. The principles of not re-inventing the wheel and following paved cowpaths were introduced for features that were already implemented by browsers or in de-facto standard use by JavaScript libraries. They were not created for new features in HTML. The video element is a completely new feature in HTML thus everything around it is new.

I would therefore like to see some more respect given to HTML5 and the complexities involved in finding the best possible technical solutions for the Web given that the video element does not stand alone in HTML5, but is part of a much larger picture of technical capabilities on the Web where many of the requested features for TV applications may already be solved by existing HTML markup that is not part of the video element.

Also, HTML5 is not just about the HTML markup, but also about CSS and JavaScript and HTTP. There are several layers of technology involved in creating a Web application: not only a separation of work between client and servers, but also between the Operating System, the media framework, the browser, browser plugins, and JavaScript has to be balanced. To get this balance right is a fine art that will take many discussion, many experiments and sometimes several design approaches. We need patience and calm to work through this, not a rushed adoption of existing solutions from other spaces.

New Requirements

Now let’s get to the take-aways I had from the workshop’s sessions:

Session 1 / Content Provider and Consumer Perspective:

The sessions participants postulate that we will see the creation of application stores for TV applications similar to how we have experienced this for mobile phones and tablets. People enjoy collecting apps like they collect badges. Right now, the app store domain is dominated by native apps and now Web apps. The reason is that we haven’t got a standard platform for setting up Web app stores with Web apps that work in all browsers on all operating systems. Thus, developers have to re-deploy their app for many environments.

While essentially an orthogonal need to HTML standardisation, this seems to be one of the key issues that keep Web apps back from making big market inroads and W3C may do well in setting up a new WG to define a standard Web app manifest format and JS APIs.

Session 2+3 / Multi-screen TV in the Home Network:

Several technologies of hybrid TV broadcast and set-top-box Web content delivery were being pointed out, including the European HbbTV and the Japanese Hybridcast, the latter of which gave an in-depth demo.

Web purists would probably say that it would be simpler to just deliver all content over the Web and not have to worry about any further technical challenges encountered by having to synchronize content received via two vastly different delivery mechanisms. I personally believe this development is one of business models: we don’t yet know exactly how to earn money from TV content delivered over the Internet, but we do know how to do so with TV content. So, hybrids allow the continuation of existing income streams while allowing the features to be augmented with those people enjoy from the Internet.

Should requirements that emerge from such a use case for HTML5 video be taken seriously? I think they absolutely should. What I see happening is that a new way of using the Web is starting to emerge. The new way is video-focused rather than text-focused. We receive our Web content by watching video programming online - video channels, not Web pages are the core content that we consume in the living room. Video channels are where we start our browsing experience from. Search may still be our first point of call, but it will be search for video content or a video-centric app rather than search for a Web site.

And it will be a matter of many interconnected devices in the house that contribute to the experience: the 5.1 stereos that are spread all over the house and should receive our video’s sound, the different screens in the different areas of our house between which we move around, and remote controls, laptops or tablets that function as remote controls and preview stations and are used to determine our viewing experience and provide a back-channel to the publishers.

We have barely begun to identify how such interconnected devices within a home fit within the server-client-based view of the Web world, and the new Web Sockets functionality. The Home Networking Task Force of the Web and TV IG is looking at the issues and analysing existing protocols and standards that solve this picture. But I have a gnawing feeling that the best solution will be something new that is more Web-specific and fits better with the technology layers of the Web.

Session 4 / Synchronized Metadata:

The TV environment offers many data services, some of which have been legally prescribed. This session analysed TV needs and how they can be satisfied with current HTML5.

Subtitles and closed captioning support are one of the key requirements that have been legally prescribed to allow for equal access of non-native speakers, and blind and vision-impaired users to TV content. After demonstration of some key features defined into the HTML5 track element and the WebVTT format, it was generally accepted that HTML5 is making big progress in this space, in particular that browsers are in the process of implementing support for the track element. A concern still exists for complete coverage of all the CEA-608/708 features in WebVTT.

Further concern was raised for support of audio descriptions and audio translations, in particular since no browser has as yet committed to implementing the HTML5’s media multitrack API with the @mediagroup attribute. In this context I am excited to see first JavaScript polyfills emerge (see captionator.js & mediagroup.js).

Another concern was that many captions are actually delivered as raster images (in particular DVD captions) and how that would work in the Web context. The proposal was to use WebVTT and encode the raster images as data-URIs included in timed cues, then render them by JavaScript as an overlay. This is something to explore further.

Demos were shown using WebVTT to synchronize ads with videos, to display related metadata from a user’s life log with videos, to display thumbnails along a video’s timeline, and to show the rendering of text descriptions through screen readers. General agreement by the panel was that WebVTT offers many opportunities and that this area will continue to need further development and that we will see new capabilities on the Web around metadata that were not previously possible on TV.

Session 5 / Content Format and Codecs: DASH and Codec standards

The introduction of HTTP adaptive streaming into HTML5 was one of the core issues that kept returning in the discussions. This panel focused on MPEG DASH, but also mentioned the need for programmatic implementation of adaptive streaming functionality.

The work around MPEG DASH would require specifications of how to use DASH with WebM and Ogg Theora, as well as a specification of a HTML5 profile for DASH, which would limit the functionality possible in DASH files to the ones needed in a HTML5 video element. One criticism of DASH was its verbosity. Another was its unclear patent position. Panel attendees with included Qualcomm, Apple and Microsoft made very clear that their position is pro a royalty-free use of DASH.

The work around a programmatic implementation for adaptive streaming would require at least a JavaScript API to measure the quality of service of a presented video element and a JavaScript API to feed the video element with chunks of (encrypted) video content on the fly. Interestingly enough, there are existing experiments both around Video metrics and MediaSource extensions, so we can expect some progress in this space, even if these are not yet a strong focus of the HTML WG.

I would personally support the creation of Community Group at the W3C around HTTP adaptive streaming and DASH. I think it would work towards alleviating the perceived patent issues around DASH and allow the right members of the community to participate in preparing a specification for HTML5 without requiring them to become W3C members.

Session 6 / Content Protection and DRM

A core concern of the TV community is around content protection. The requirements in this space seem, however, very confused.

The key assumption here is that Web browsers should support the decoding of DRM-protected content in the HTML5 video element because the video element provides a desirable JavaScript API, accessibility features (the track element), default controls, and the possibility to synchronize multiple media elements. However, at the same time, the video element is part of the core content of a Web page and thus allows direct access to the image content in a canvas etc, so some of its functionality is not desirable.

The picture is further confused by requests for authentication, authorization, encryption, obfuscation, same-origin, secure transmission, secure decryption key delivery, unique content identification and other “content protection” techniques without a clear understanding of what is already possible on the Web and what requirements to content publishers actually have for delivering their content on the Web. This is further complicated by the fact that there are many competing solutions for DRM systems in the market with no clear standard that all browsers could support.

A thorough analysis of the technologies and solutions available in this space as well as an analysis of the needs for HTML5 is required before it becomes clear what solution HTML5 browsers may need to support. There seemed to be agreement in the group, though, that browsers would not need to implement DRM solutions, but rather only hand through the functionality of the platform on which they are running (including the media frameworks and operating system functionalities). How this is supposed to work was, however, unclear.

Session 7 / Web & TV: Additional Device & User Requirements

This was a catch-all session for topics that had not been addressed in other sessions. Among the topics addressed in this group were:

  • Parental Guidance: how to deal with ratings in an internationally inconsistent ratings landscape, how to deliver the ratings with the content, and how to enforce the viewing restrictions
  • Emergency Notifications: how to replicate on the Web the emergency notification functionality of TV by providing text overlays to alert users
  • TV channels: how to detect what channels of programming are available to users

Overall, the workshop was a worthwhile experience. It seems there is a lot of work still ahead for making HTML5 video the best it can be on the Web.

Recent developments around WebVTT

People have been asking me lots of questions about WebVTT (Web Video Text Tracks) recently. Questions about its technical nature such as: are the features included in WebVTT sufficient for broadcast captions including positioning and colors? Questions about its standardisation level: when is the spec officially finished and when will it move from the WHATWG to the W3C? Questions about implementation: are any browsers supporting it yet and how can I make use of it now?

I’m going to answer all of these questions in this post to make it more efficient than answering tweets, emails, and skype and other phone conference requests. It’s about time I do a proper post about it.

Implementations

I’m starting with the last area, because it is the simplest to answer.

No, no browser has as yet shipped support for the element and therefore there is no support for WebVTT in browsers yet. However, implementations are in progress. For example, Webkit has recently received first patches for the track element, but there is still an open bug for a WebVTT parser. Similarly, Firefox can now parse the track element, but is still working on the element’s actual functionality.

However, you do not have to despair, because there are now a couple of JavaScript polyfill libraries for either just the track element or for video players with track support. You can start using these while you are waiting for the browsers to implement native support for the element and the file format.

Here are some of the libraries that I’ve come across that will support SRT and/or WebVTT (do leave a comment if you come across more):

  • Captionator - a polyfill for track and SRT parsing (WebVTT in the works)
  • js_videosub - a polyfill for track and SRT parsing
  • jscaptions - a polyfill for track and SRT parsing
  • LeanBack player - a video player with track and SRT, SUB, DFXP, and soon full WebVTT parsing support
  • playr - a video player that includes track and WebVTT parsing
  • MediaElementJS - a video player that includes track and SRT parsing
  • Kaltura’s video player - a video player that includes track and SRT parsing

I am actually most excited about the work of Ronny Mennerich from LeanbackPlayer on WebVTT, since he has been the first to really attack full support of cue settings and to discuss with Ian, me and the WHATWG about their meaning. His review notes with visual description of how settings are to be interpreted and his demo will be most useful to authors and other developers.

Standardisation

Before we dig into the technical progress that has been made recently, I want to answer the question of “maturity”.

The WebVTT specification is currently developed at the WHATWG. It is part of the HTML specification there. When development on it started (under its then name WebSRT), it was also part of the HTML5 specification of the W3C. However, there was a concern that HTML5 should be independent of the chosen captioning format and thus WebVTT currently only exists at the WHATWG.

In recent months - and particularly since browser vendors have indicated that they will indeed implement support for WebVTT as their implementation of the element - the question of formal standardization of WebVTT at the W3C has arisen. I’m involved in this as a Google contractor and we’ve put together a proposed charter for a WebVTT Working Group at the W3C.

In the meantime, standardization progresses at the WHATWG productively. Much feedback has recently been brought together by Ian and changes have been applied or at least prepared for a second feature set to be added to WebVTT once the first lot is implemented. I’ve captured the potentially accepted and rejected new features in a wiki page.

Many of the new features are about making the WebVTT format more useful for authoring and data management. The introduction of comments, inline CSS settings and default cue settings will help authors reduce the amount of styling they have to provide. File-wide metadata will help with the exchange of management information in professional captioning scenarios and archives.

But even without these new features, WebVTT already has all the features necessary to support professional captioning requirements. I’ve prepared a draft mapping of CEA-608 captions to WebVTT to demonstrate these capabilities (CEA-608 is the TV captioning standard in the US).

So, overall, WebVTT is in a great state for you to start implementing support for it in caption creation applications and in video players. There’s no need to wait any longer - I don’t expect fundamental changes to be made, but only new features to be added.

New WebVTT Features

This takes us straight to looking at the recently introduced new features.

  • Simpler File Magic: Whereas previously the magic file identifier for a WebVTT file was a single line with “WEBVTT FILE”. This has now been changed to a single line with just “WEBVTT”.
  • Cue Bold Span: The element has been introduced into WebVTT, thus aligning it somewhat more with SRT and with HTML.
  • CSS Selectors: The spec already allowed to use the names of tags, the classes of tags, and the voice annotations of tags as CSS selectors for ::cue. ID selector matching is now also available, where the cue identifier is used.
  • text-decoration support: The spec now also supports the CSS text-decoration property for WebVTT cues, allowing functionality such as blinking text and bold.

Further to this, the email identifies the means in which WebVTT is extensible:

  • Header area: The WebVTT header area is defined through the “WEBVTT” magic file identifier as a start and two empty lines as an end. It is possible to add into this area file-wide information header information.
  • Cues: Cues are defined to start with an optional identifier, and then a start/end time specification with ”—>” separator. They end with two empty lines. Cues that contain a ”—>” separator but don’t parse as valid start/end time are currently skipped. Such “cues” can be used to contain inline command blocks.
  • Inline in cues: Finally, within cues, everything that is within a “tag”, i.e. between "", and does not parse as one of the defined start or end tags is ignored, so we can use these to hide text. Further, text between such start and end tags is visible even if the tags are ignored, so wen can introduce new markup tags in this way.

Given this background, the following V2 extensions have been discussed:

  • Metadata: Enter name-value pairs of metadata into the header area, e.g.

    WEBVTT
    Language=zh
    Kind=Caption
    Version=V1_ABC
    License=CC-BY-SA
    
    1
    00:00:15.000 --> 00:00:17.950
    first cue
  • Inline Cue Settings: Default cue settings can come in a “cue” of their own, e.g.

    WEBVTT
    
    DEFAULTS --> D:vertical A:end
    
    00:00.000 --> 00:02.000
    This is vertical and end-aligned.
    
    00:02.500 --> 00:05.000
    As is this.
    
    DEFAULTS --> A:start
    
    00:05.500 --> 00:07.000
    This is horizontal and start-aligned.
    
  • Inline CSS: Since CSS is used to format cue text, a means to do this directly in WebVTT without a need for a Web page and external style sheet is helpful and could be done in its own cue, e.g.

    WEBVTT
    
      STYLE -->
      ::cue(v[voice=Bob]) { color: green; }
      ::cue(c.narration) { font-style: italic; }
      ::cue(c.narration i) { font-style: normal; }
    
      00:00.000 --> 00:02.000
      <v Bob>Welcome.
    
      00:02.500 --> 00:05.000
      <c .narration>To <i>WebVTT</i>.
    
  • Comments: Both, comments within cues and complete cues commented out are possible, e.g.

    WEBVTT
    
     COMMENT -->
     00:02.000 --> 00:03.000
     two; this is entirely
     commented out
     
     00:06.000 --> 00:07.000
     this part of the cue is visible
     <! this part isn't >
     <and neither is this>
    

Finally, I believe we still need to add the following features:

  • Language tags: I’d like to add a language tag that allows to mark up a subpart of cue text as being in a different language. We need this feature for mixed-language cues (in particular where a different font may be necessary for the inline foreign-language text). But more importantly we will need this feature for cues that contain text descriptions rather than captions, such that a speech synthesizer can pick the correct language model to speak the foreign-language text. It was discussed that this could be done with a xxx type of markup.
  • Roll-up captions: When we use timestamp objects and the future text is hidden, then is un-hidden upon reaching its time, we should allow the cue text to scroll up a line when the un-hidden text requires adding a new line. This is the typical way in which TV live captions have been displayed and so users are acquainted with this display style.
  • Inline navigation: For chapter tracks the primary use of cues are for navigation. In other formats - in particular in DAISY-books for blind users - there are hierarchical navigation possibilities within media resources. We can use timestamp objects to provide further markers for navigation within cues, but in order to make these available in a hierarchical fashion, we will need a grouping tag. It would be possible to introduce a
  • Default caption width: At the moment, the default display size of a caption cue is 100% of the video’s width (height for vertical directions), which can be overruled with the “S” cue setting. I think it should by default rather be the width (height) of the bounding box around all the text inside the cue.

Aside from these changes to WebVTT, there are also some things that can be improved on the element. I personally support the introduction of the source element underneath the track element, because that allows us to provide different caption files for different devices through the @media media queries attribute and it allows support for more than just one default captioning format. This change needs to be made soon so we don’t run into trouble with the currently empty track element.

I further think a oncuelistchange event would be nice as well in cases where the number of tracks is somehow changed - in particular when coming from within a media file.

Other than this, I’m really very happy with the state that we have achieved this far.

WebVTT explained

On Wednesday, I gave a talk at Google about WebVTT, the Web Video Text Track file format that is under development at the WHATWG for solving time-aligned text challenges for video.

I started by explaining all the features that WebVTT supports for captions and subtitles, mentioned how WebVTT would be used for text audio descriptions and navigation/chapters, and explained how it is included into HTML5 markup, such that the browser provides some default rendering for these purposes. I also mentioned the metadata approach that allows any timed content to be included into cues.

The talk slides include a demo of how the element works in the browser. I’ve actually used the Captionator polyfill for HTML5 to make this demo, which was developed by Chris Giffard and is available as open source from GitHub.

The talk was recorded and has been made available as a Google Tech talk with captions and also a separate version with extended audio descriptions.

The slides of the talk are also available (best to choose the black theme).

I’ve also created a full transcript of the described video.

Get the WebVTT specification from the WHATWG Website.

Accessibility to Web video for the Vision-Impaired

In the past week, I was invited to an IBM workshop on audio/text descriptions for video in Japan. Geoff Freed and Trisha O’Connell from WGBH, and Michael Evans from BBC research were the other invited experts to speak about the current state of video accessibility around the world and where things are going in TV/digital TV and the Web.

The two day workshop was very productive. The first day was spent with presentations which were open to the public. A large vision-impaired community attended to understand where technology is going. It was very humbling to be part of an English-spoken workshop in Japan, where much of the audience is blind, but speaks English much better than my average experience with English in Japan. I met many very impressive and passionate people that are creating audio descriptions, adapting NVDA for the Japanese market, advocating to Broadcasters and Government to create more audio descriptions, and perform fundamental research for better tools to create audio descriptions. My own presentation was on “HTML5 Video Descriptions”.

On the second day, we only met with the IBM researchers and focused discussions on two topics:

  1. How to increase the amount of video descriptions
  2. HTML5 specifications for video descriptions

The first topic included concerns about guidelines for description authoring by beginners, how to raise awareness, who to lobby, and what production tools are required. I personally was more interested in the second topic and we moved into a smaller breakout group to focus on these discussions.

HTML5 specifications for video descriptions Two topics were discussed related to video descriptions: text descriptions and audio descriptions. Text descriptions are descriptions authored as time-aligned text snippets and read out by a screen reader. Audio descriptions are audio recordings either of a human voice or even of a TTS (text-to-speech) synthesis - in either case, they are audio samples.

For a screen reader, the focus was actually largely on NVDA and people were very excited about the availability of this open source tool. There is a concern about how natural-sounding a screen reader can be made and IBM is doing much research there with some amazing results. In user experiment between WGBH and IBM they found that the more natural the voice sounds, the more people comprehend, but between a good screen reader and an actual human voice there is not much difference in the comprehension level. Broadcasters and other high-end producers are unlikely to accept TTS and will prefer the human voice, but for other materials - in particular for the large majority of content on the Web - TTS and screen readers can make a big difference.

An interesting lesson that I learnt was that video descriptions can be improved by 30% (i.e. 30% better comprehension) if we introduce extended descriptions, i.e. descriptions that can pause the main video to allow for a description be read for something that happens in the video, but where there is no obvious pause to read out the description. So, extended descriptions are one of the major challenges to get right.

We then looked at the path that we are currently progressing on in HTML5 with WebSRT, the TimedTrack API, the elements and the new challenges around a multitrack API.

For text descriptions we identified a need for the following:

  • extension marker on cues: often it is very clear to the author of a description cue that there is no time for the cue to be read out in parallel to the main audio and the video needs to be paused. The proposal is for introduction of an extension marker on the cue to pause the video until the screen reader is finished. So, a speech-complete event from the screen reader API needs to be dealt with. To make this reliable, it might make sense to put a max duration on the cue so the video doesn’t end up waiting endlessly in case the screen reader event isn’t fired. The duration would be calculated based on a typical word speaking rate.
  • importance marker on cues: the duration of all text cues being read out by screen readers depends on the speed set-up of the screen reader. So, even when a cue has been created for a given audio break in the video, it may or may not fit into this break. For most cues it is important that they are read out completely before moving on, but for some it’s not. So, an importance maker could be introduced that determines whether a video stops at the end of the cue to allow the screen reader to finish, or whether the screen reader is silenced at that time no matter how far it has gotten.
  • ducking during cues: making the main audio track quieter in relation to the video description for the duration of a cue such as to allow the comprehension of the video description cue is important for comprehension
  • voice hints: an instruction at the beginning of the text description file for what voice to choose such that it won’t collide with e.g. the narrator voice of a video - typically the choice will be for a female voice when the narrator is male and the other way around - this will help initialize the screen reader appropriately
  • speed hints: an indicator at the beginning of a text description toward what word rate was used as the baseline for the timing of the cue durations such that a screen reader can be initialized with this
  • synthesis directives: while not a priority, eventually it will make for better quality synchronized text if it is possible to include some of the typical markers that speech synthesizers use (see e.g. SSML or speech CSS), including markers for speaker change, for emphasis, for pitch change and other prosody. It was, in fact, suggested that the CSS3’s speech module may be sufficient in particular since Opera already implements it.

This means we need to consider extending WebSRT cues with an “extension” marker and an “importance” marker. WebSRT further needs header-type metadata to include a voice and a speed hint for screen readers. The screen reader further needs to work more closely with the browser and exchange speech-complete events and hints for ducking. And finally we may need to allow for CSS3 speech styles on subparts of WebSRT cues, though I believe this latter one is not of high immediate importance.

For audio descriptions we identified a need for:

  • external/in-band descriptions: allowing external or in-band description tracks to be synchronized with the main video. It would be assumed in this case that the timeline of the description track is identical to the main video.
  • extended external descriptions: since it’s impossible to create in-band extended descriptions without changing the timeline of the main video, we can only properly solve the issue of extended audio descriptions through external resources. One idea that we came up with is to use a WebSRT file with links to short audio recordings as external extended audio descriptions. These can then be synchronized with the video and pause the video at the correct time etc through JavaScript. This is probably a sufficient solution for now. It supports both, sighted and vision-impaired users and does not extend the timeline of the original video. As an optimization, we can also do this through a single “virtual” resource that is a concatenation of the individual audio cues and is addressed through the WebSRT file with byte ranges.
  • ducking: making the main audio track quieter in relation to the video description for the duration of a cue such as to allow the comprehension of the video description cue is important for comprehension also with audio files, though it may be more difficult to realize
  • separate loudness control: making it possible for the viewer to separately turn the loudness of an audio description up/down in comparison to the main audio

For audio descriptions, we saw the need for introduction of a multitrack video API and markup to synchronize external audio description tracks with the main video. Extended audio descriptions should be solved through JavaScript and hooking up through the TimedTrack API, so mostly rolling it by hand at this stage. We will see how that develops in future. Ducking and separate loudness controls are equally needed here, but we do need more experiments in this space.

Finally, we discussed general needs to locate accessibility content such as audio descriptions by vision-impaired user:

  • the need for accessible user menus to turn on/off accessibility content
  • the introduction of dedicated and standardized keyboard short-cuts to turn on and manipulate the volume of audio descriptions (and captions)
  • the introduction of user preferences for automatically activating accessibility content; these could even learn from current usage, such that if a user activates descriptions for a video on one Website, the preferences pick this up; different user profiles are already introduced by ISO in “Access for all” and used in websites such as teachersdomain
  • means to generally locate accessibility content on the web, such as fields in search engines and RSS feeds
  • more generally there was a request to have caption on/off and description on/off buttons be introduced into remote controls of machines, which will become prevalent with the increasing amount of modern TV/Internet integrated devices

Overall, the workshop was a great success and I am keen to see more experimentation in this space. I also hope that some of the great work that was shown to us at IBM with extended descriptions and text descriptions will become available - if only as screencasts - so we can all learn from it to make better standards and technology.

WebSRT and HTML5 media accessibility

On 23rd July, Ian Hickson, the HTML5 editor, posted an update to the WHATWG mailing list introducing the first draft of a platform for accessibility for the HTML5

What I want to do here is to summarize what was introduced, together with the improvements that I and some others have proposed in follow-up emails, and list some of the media accessibility needs that we are not yet dealing with.

For those wanting to only selectively read some sections, here is a clickable table of contents of this rather long blog post:

THE WebSRT TIMED TEXT FORMAT

The first and to everyone probably most surprising part is the new file format that is being proposed to contain out-of-band time-synchronized text for video. A new format was necessary after the analysis of all relevant existing formats determined that they were either insufficient or hard to use in a Web environment.

The new format is called WebSRT and is an extension to the existing SRT SubRip format. It is actually also the part of the new specification that I am personally most uncomfortable with. Not that WebSRT is a bad format. It’s just not sufficient yet to provide all the functionality that a good time-synchronized text format for Web media should. Let’s look at some examples.

WebSRT is composed of a sequence of timed text cues (that’s what we’ve decided to call the pieces of text that are active during a certain time interval). Because of its ancestry of SRT, the text cues can optionally be numbered through. The content of the text cues is currently allowed to contain three different types of text: plain text, minimal markup, and anything at all (also called “metadata”).

In its most simple form, a WebSRT file is just an ordinary old SRT file with optional cue numbers and only plain text in cues:

  1
  00:00:15.00 --> 00:00:17.95
  At the left we can see...

  2
  00:00:18.16 --> 00:00:20.08
  At the right we can see the...

  3
  00:00:20.11 --> 00:00:21.96
  ...the head-snarlers

A bit of a more complex example results if we introduce minimal markup:

  00:00:15.00 --> 00:00:17.95 A:start
  Auf der <i>linken</i> Seite sehen wir...

  00:00:18.16 --> 00:00:20.08 A:end
  Auf der <b>rechten</b> Seite sehen wir die....

  00:00:20.11 --> 00:00:21.96 A:end
  <1>...die Enthaupter.

  00:00:21.99 --> 00:00:24.36 A:start
  <2>Alles ist sicher.
  Vollkommen <b>sicher</b>.

and add to this a CSS to provide for some colors and special formatting:

    ::cue { background: rgba(0,0,0,0.5); } 
    ::cue-part(1) { color: red; } 
    ::cue-part(2, b) { font-style: normal; text-decoration: underline; } 

Minimal markup accepts , , and a timestamp in <>, providing for italics, bold, and ruby markup as well as karaoke timestamps. Any further styling can be done using the CSS pseudo-elements ::cue and ::cue-part, which accept the features ‘color’, ‘text-shadow’, ‘text-outline’, ‘background’, ‘outline’, and ‘font’.

Note that positioning requires some special notes at the end of the start/end timestamps which can provide for vertical text, line position, text position, size and alignment cue setting. Here is an example with vertically rendered Chinese text, right-aligned at 98% of the video frame:

  00:00:15.00 --> 00:00:17.95 A:start D:vertical L:98%
  在左边我们可以看到...

  00:00:18.16 --> 00:00:20.08 A:start D:vertical L:98%
  在右边我们可以看到...

  00:00:20.11 --> 00:00:21.96 A:start D:vertical L:98%
  ...捕蝇草械.

  00:00:21.99 --> 00:00:24.36 A:start D:vertical L:98%
  一切都安全.
  非常地安全.

Finally, WebSRT files can be authored with abstract metadata inside cues, which practically means anything at all. Here’s an example with HTML content:

  00:00:15.00 --> 00:00:17.95 A:start
  <img src="pic1.png"/>Auf der <i>linken</i> Seite sehen wir...

  00:00:18.16 --> 00:00:20.08 A:end
  <img src="pic2.png"/>Auf der <b>rechten</b> Seite sehen wir die....

  00:00:20.11 --> 00:00:21.96 A:end
  <img src="pic3.png"/>...die <a href="http://members.chello.nl/j.kassenaar/
elephantsdream/subtitles.html">Enthaupter</a>.

  00:00:21.99 --> 00:00:24.36 A:start
  <img src="pic4.png"/>Alles ist <mark>sicher</mark>.<br/>Vollkommen <b>sicher</b>.

Here is another example with JSON in the cues:

  00:00:00.00 --> 00:00:44.00
  {
    slide: intro.png,
    title: "Really Achieving Your Childhood Dreams" by Randy Pausch, 
             Carnegie Mellon University, Sept 18, 2007
  }

  00:00:44.00 --> 00:01:18.00
  {
    slide: elephant.png,
    title: The elephant in the room...
  }

  00:01:18.00 --> 00:02:05.00
  {
    slide: denial.png,
    title: I'm not in denial...
  }

What I like about WebSRT:

  1. it allows for all sorts of different content in the text cues - plain text is useful for texted audio descriptions, minimal markup is useful for subtitles, captions, karaoke and chapters, and “metadata” is useful for, well, any data.
  2. it can be easily encapsulated into media resources and thus turned into in-band tracks by regarding each cue as a data packet with time stamps.
  3. it is not verbose

Where I think WebSRT still needs improvements:

  1. break with the SRT history: since WebSRT and SRT files are so different, WebSRT should get its own MIME type, e.g. text/websrt, and file extensions, e.g. .wsrt; this will free WebSRT for changes that wouldn’t be possible by trying to keep conformant with SRT
  2. introduce some header fields into WebSRT: the format needs
    • file-wide name-value metadata, such as author, date, copyright, etc
    • language specification for the file as a hint for font selection and speech synthesis
    • a possibility for style sheet association in the file header
    • a means to identify which parser is required for the cues
    • a magic identifier and a version string of the format
  3. allow innerHTML as an additional format in the cues with the CSS pseudo-elements applying to all HTML elements
  4. allow full use of CSS instead of just the restricted features and also use it for positioning instead of the hard to understand positioning hints
  5. on the minimum markup, provide a neutral structuring element such as <span @id @class @lang> to associate specific styles or specific languages with a subpart of the cue

Note that I undertook some experiments with an alternative format that is XML-based and called WMML to gain most of these insights and determine the advantages/disadvantages of a xml-based format. The foremost advantage is that there is no automatism with newlines and displayed new lines, which can make the source text file more readable. The foremost disadvantages are verbosity and that there needs to be a simple encoding step to remove all encapsulating header-type content from around the timed text cues before encoding it into a binary media resource.

ASSOCIATING EXTERNAL TIMED TEXT RESOURCES WITH A VIDEO

Now that we have a timed text format, we need to be able to associate it with a media resource in HTML5. This is what the element was introduced for. It associates the timestamps in the timed text cues with the timeline of the video resource. The browser is then expected to render these during the time interval in which the cues are expected to be active.

Here is an example for how to associate multiple subtitle tracks with a video:

  <video src="california.webm" controls>
    <track label="English" kind="subtitles" src="calif_eng.wsrt" srclang="en">
    <track label="German" kind="subtitles" src="calif_de.wsrt" srclang="de">
    <track label="Chinese" kind="subtitles" src="calif_zh.wsrt" srclang="zh">
  </video>

In this case, the UA is expected to provide a text menu with a subtitle entry with these three tracks and their label as part of the video controls. Thus, the user can interactively activate one of the tracks.

Here is an example for multiple tracks of different kinds:

  <video src="california.webm" controls>
    <track label="English" kind="subtitles" src="calif_eng.wsrt" srclang="en">
    <track label="German" kind="captions" src="calif_de.wsrt" srclang="de">
    <track label="French" kind="chapter" src="calif_fr.wsrt" srclang="fr">
    <track label="English" kind="metadata" src="calif_meta.wsrt" srclang="en">
    <track label="Chinese" kind="descriptions" src="calif_zh.wsrt" srclang="zh">
  </video>

In this case, the UA is expected to provide a text menu with a list of track kinds with one entry each for subtitles, captions and descriptions through the controls. The chapter tracks are expected to provide some sort of visual subdivision on the timeline and the metadata tracks are not exposed visually, but are only available through the JavaScript API.

Here are several ideas for improving the specification:

  • is currently only defined for WebSRT resources - it should be made generic and then browsers can compete on the formats for which they provide support. WebSRT could be the baseline format. A @type attribute could be added to hint at the MIME type of the provided resource.
  • needs a means for authors to mark certain tracks as active, others as inactive. This can be overruled by browser settings e.g. on @srclang and by user interaction.
  • karaoke and lyrics are supported by WebSRT, but aren’t in the HTML5 spec as track kinds - they should be added and made visible like subtitles or captions.

EXPOSING A LIST OF TimedTracks TO JAVASCRIPT

This is where we take an extra step and move to a uniform handling of both in-band and out-of-band timed text tracks. Futher, a third type of timed text track has been introduced in the form of a MutableTimedTrack - i.e. one that can be authored and added through JavaScript alone.

The JavaScript API that is exposed for any of these track type is identical. A media element now has this additional IDL interface:

interface HTMLMediaElement : HTMLElement {
...
  readonly attribute TimedTrack[] tracks;
  MutableTimedTrack addTrack(in DOMString label, in DOMString kind, 
                                 in DOMString language);
};

A media element thus manages a list of TimedTracks and provides for adding TimedTracks through addTrack().

The timed tracks are associated with a media resource in the following order:

  1. The element children of the media element, in tree order.
  2. Tracks created through the addTrack() method, in the order they were added, oldest first.
  3. In-band timed text tracks, in the order defined by the media resource’s format specification.

The IDL interface of a TimedTrack is as follows:

interface TimedTrack {
  readonly attribute DOMString kind;
  readonly attribute DOMString label;
  readonly attribute DOMString language;
  readonly attribute unsigned short readyState;
           attribute unsigned short mode;
  readonly attribute TimedTrackCueList cues;
  readonly attribute TimedTrackCueList activeCues;
  readonly attribute Function onload;
  readonly attribute Function onerror;
  readonly attribute Function oncuechange;
};

The first three capture the value of the @kind, @label and @srclang attributes and are provided by the addTrack() function for MutableTimedTracks and exposed from metadata in the binary resource for in-band tracks.

The readyState captures whether the data is available and is one of “not loaded”, “loading”, “loaded”, “failed to load”. Data is only availalbe in “loaded” state.

The mode attribute captures whether the data is activate to be displayed and is one of “disabled”, “hidden” and “showing”. In the “disabled” mode, the UA doesn’t have to download the resource, allowing for some bandwidth management.

The cues and activeCues attributes provide the list of parsed cues for the given track and the subpart thereof that is currently active.

The onload, onerror, and oncuechange functions are event handlers for the load, error and cuechange events of the TimedTrack.

Individual cues expose the following IDL interface:

interface TimedTrackCue {
  readonly attribute TimedTrack track;
  readonly attribute DOMString id;
  readonly attribute float startTime;
  readonly attribute float endTime;
  DOMString getCueAsSource();
  DocumentFragment getCueAsHTML();
  readonly attribute boolean pauseOnExit;
  readonly attribute Function onenter;
  readonly attribute Function onexit;
  readonly attribute DOMString direction;
  readonly attribute boolean snapToLines;
  readonly attribute long linePosition;
  readonly attribute long textPosition;
  readonly attribute long size;
  readonly attribute DOMString alignment;
  readonly attribute DOMString voice;
};

The @track attribute links the cue to its TimedTrack.

The @id, @startTime, @endTime attributes expose a cue identifier and its associated time interval. The getCueAsSource() and getCueAsHTML() functions provide either an unparsed cue text content or a text content parsed into a HTML DOM subtree.

The @pauseOnExit attribute can be set to true/false and indicates whether at the end of the cue’s time interval the media playback should be paused and wait for user interaction to continue. This is particularly important as we are trying to support extended audio descriptions and extended captions.

The onenter and onexit functions are event handlers for the enter and exit events of the TimedTrackCue.

The @direction, @snapToLines, @linePosition, @textPosition, @size, @alignment and @voice attributes expose WebSRT positioning and semantic markup of the cue.

My only concerns with this part of the specification are:

  • The WebSRT-related attributes in the TimedTrackCue are in conflict with CSS attributes and really should not be introduced into HTML5, since they are WebSRT-specific. They will not exist in other types of in-band or out-of-band timed text tracks. As there is a mapping to do already, why not rely on already available CSS features.
  • There is no API to expose header-specific metadata from timed text tracks into JavaScript. This such as the copyright holder, the creation date and the usage rights of a timed text track would be useful to have available. I would propose to add a list of name-value metadata elements to the TimedTrack API.
  • In addition, I would propose to allow media fragment hyperlinks in a

RENDERING TimedTracks

The third part of the timed track framework deals with how to render the timed text cues in a Web page. The rendering rules are explained in the HTML5 rendering section.

I’ve extracted the following rough steps from the rendering algorithm:

  1. All timed tracks of a media resource that are in “showing” mode are rendered together to avoid overlapping text from multiple tracks.
  2. The timed tracks cues that are to be rendered are collected from the active timed tracks and ordered by the timed track order first and by their start time second. Where there are identical start times, the cues are ordered by their end time, earliest first, or by their creation order if all else is identical.
  3. Each cue gets its own CSS box.
  4. The text in the CSS boxes is positioned and formated by interpreting the positioning and formatting instructions of WebSRT that are provided on the cues.
  5. An anonymous inline CSS box is created into which all the cue CSS boxes are wrapped.
  6. The wrapping CSS box gets the dimensions of the video viewport. The cue CSS boxes are positioned so they don’t overlap. The text inside the cue CSS boxes inside the wrapping CSS box is wrapped at the edges if necessary.

To overcome security concerns with this kind of direct rendering of a CSS box into the Web page where text comes potentially from a different and malicious Web site, it is required to have the cues come from the same origin as the Web page.

To allow application of a restricted set of CSS properties to the timed text cues, a set of pseudo-selectors was introduced. This is necessary since all the CSS boxes are anonymous and cannot be addressed from the Web page. The introduced pseudo-selectors are ::cue to address a complete cue CSS box, and ::cue-part to address a subpart of a cue CSS box based on a set of identifiers provided by WebSRT.

I have several issues with this approach:

  • I believe that it is not a good idea to only restrict rendering to same-origin files. This will disallow the use of external captioning services (or even just a separate caption server of the same company) to link to for providing the captions to a video. Henri Sivonen proposed a means to overcome this by parsing every cue basically as its own HTML document (well, the body of a document) and then rendering these in iFrame-manner into the Web page. This would overcome the same-origin restriction. It would also allow to do away with the new ::cue CSS selectors, thus simplifying the solution.
  • In general I am concerned about how tightly the rendering is tied to WebSRT. Step 4 should not be in the HTML5 specification, but only apply to WebSRT. Every external format should provide its own mapping to CSS. As it is specified right now, other formats, such as e.g. 3GPP in MPEG-4 or Kate in Ogg, are required to map their format and positioning information to WebSRT instructions. These are then converted again using the WebSRT to CSS mapping rules. That seems overkill.
  • I also find step 6 very limiting, since only the video viewport is regarded as a potential rendering area - this is also the reason why there is no rendering for audio elements. Instead, it would make a lot more sense if a CSS box was provided by the HTML page - the default being the video viewport, but it could be changed to any area on screen. This would allow to render music lyrics under or above an audio element, or render captions below a video element to avoid any overlap at all.

SUMMARY AND FURTHER NEEDS

We’ve made huge progress on accessibility features for HTML5 media elements with the specifications that Ian proposed. I think we can move it to a flexible and feature-rich framework as the improvements that Henri, myself and others have proposed are included.

This will meet most of the requirements that the W3C HTML Accessibility Task Force has collected for media elements where the requirements relate to accessibility functionality provided through alternative text resources.

However, we are not solving any of the accessibility needs that relate to alternative audio-visual tracks and resources. In particular there is no solution yet to deal with multi-track audio or video files that have e.g. sign language or audio description tracks in them - not to speak of the issues that can be introduced through dealing with separate media resources from several sites that need to be played back in sync. This latter may be a challenge for future versions of HTML5, since needs for such synchoronisation of multiple resources have to be explored further.

In a first instance, we will require an API to expose in-band tracks, a means to control their activation interactively in a UI, and a description of how they should be rendered. E.g. should a sign language track be rendered as pciture-in-picture? Clear audio and Sign translation are the two key accessibility needs that can be satisfied with such a multi-track solution.

Finally, another key requirement area for media accessibility is described in a section called “Content Navigation by Content Structure”. This describes the need for vision-impaired users to be able to navigate through a media resource based on semantic markup - think of it as similar to a navigation through a book by book chapters and paragraphs. The introduction of chapter markers goes some way towards satisfying this need, but chapter markers tend to address only big time intervals in a video and don’t let you navigate on a different level to subchapters and paragraphs. It is possible to provide that navigation through providing several chapter tracks at different resolution levels, but then they are not linked together and navigation cannot easily swap between resolution levels.

An alternative might be to include different resolution levels inside a single chapter track and somehow control the UI to manage them as different resolutions. This would only require an additional attribute on text cues and could be useful to other types of text tracks, too. For example, captions could be navigated based on scenes, shots, coversations, or individual captions. Some experimentation will be required here before we can introduce a sensible extension to the given media accessibility framework.

"HTML5 Audio And Video Accessibility, Internationalisation And Usability" talk at Mozilla Summit

For 2 months now, I have been quietly working along on a new Mozilla contract that I received to continue working on HTML5 media accessibility. Thanks Mozilla!

Lots has been happening - the W3C HTML5 accessibility task force published a requirements document, the Media Text Associations proposal made it into the HTML5 draft as a element, and there are discussions about the advantages and disadvantages of the new WebSRT caption format that Ian Hickson created in the WHATWG HTML5 draft.

In attending the Mozilla Summit last week, I had a chance to present the current state of development of HTML5 media accessibility and some of the ongoing work. I focused on the following four current activities on the technical side of things, which are key to satisfying many of the collected media accessibility requirements:

  1. Multitrack Video Support
  2. External Text Tracks Markup in HTML5
  3. External Text Track File Format
  4. Direct Access to Media Fragments

The first three now already have first drafts in the HTML5 specification, though the details still need to be improved and an external text track file format agreed on. The last has had a major push ahead with the Media Fragments WG publishing a Last Call Working Draft. So, on the specification side of things, major progress has been made. On the implementation - even on the example implementation - side of things, we still fall down badly. This is where my focus will lie in the next few months.

Follow this link to read through my slides from the Mozilla 2010 summit.

Media Fragment URI Specification in Last Call WD

After two years of effort, the W3C Media Fragment WG has now created a Last Call Working Draft document. This means that the working group is fairly confident that they have addressed all the required issues for media fragment URIs and their implementation on HTTP and is asking for outside experts and groups for input. This is the time for you to get active and proof-read the specification thoroughly and feed back all the concerns that you have and all the things you do not understand!

The media fragment (MF) URI specification specifies two types of MF URIs: those created with a URI fragment (”#”), e.g. video.ogv#t=10,20 and those with a URI query (”?”), e.g. video.ogv?t=10,20. There is a fundamental difference between the two that needs to be appreciated: with a URI fragment you can specify a subpart of a resource, e.g. a subpart of a video, while with a URI query you will refer to a different resource, i.e. a “new” video. This is an important difference to understand for media fragments, because only some things that we want to achieve with media fragments can be achieved with ”#”, while others can only be achieved by transforming the resource into a different new bitstream.

This all sounds very abstract, so let me give you an example. Say you want to retrieve a video without its audio track. Say you’d rather not download the audio track data, since you want to save on bandwidth. So, you are only interested to get the video data. The URI that you may want to use is video.ogv#track=video. This means that you don’t want to change the video resource, but you only want to see the video. The user agent (UA) has two options to resolve such a URI: it can either map that request to byte ranges and just retrieve those - or it can download the full resource and ignore the data it has not been requested to display.

Since we do not want the extra bytes of the audio track to be retrieved, we would hope the UA can do the byte range requests. However, most Web video formats will interleave the different tracks of a media resource in time such that a video track will results in a gazillion of smaller byte ranges. This makes it impractical to retrieve just the video through a ”#” media fragment. Thus, if we really want this functionality, we have to make the server more intelligent and allow creation of a new resource from the existing one which doesn’t contain the audio. Then, the server, upon receiving a request such as video.ogv#track=video can redirect that to video.ogv?track=video and actually serve a new resource that satisfies the needs.

This is in fact exactly what was implemented in a recently published Firefox Plugin written by Jakub Sendor - also described in his presentation “Media Fragment Firefox plugin”.

Media Fragment URIs are defined for four dimensions:

  • temporal fragments
  • spatial fragments
  • track fragments
  • named fragments

The temporal dimension, while not accompanied with another dimension, can be easily mapped to byte ranges, since all Web media formats interleave their tracks in time and thus create the simple relationship between time and bytes.

The spatial dimension is a very complicated beast. If you address a rectangular image region out of a video, you might want just the bytes related to that image region. That’s almost impossible since pixels are encoded both aggregated across the frame and across time. Also, actually removing the context, i.e. the image data outside the region of interest may not be what you want - you may only want to focus in on the region of interest. Thus, the proposal for what to do in the spatial dimension is to simply retrieve all the data and have the UA deal with the display of the focused region, e.g. putting a dark overlay over the regions outside the region of interest.

The track dimension is similarly complicated and here it was decided that a redirect to a URI query would be in order in the demo Firefox plugin. Since this requires an intelligent server - which is available through the Ninsuna demo server that was implemented by Davy Van Deursen, another member of the MF WG - the Firefox plugin makes use of that. If the UA doesn’t have such an intelligent server available, it may again be most useful to only blend out the non-requested data on the UA similar to the spatial dimension.

The named dimension is still a largely undefined beast. It is clear that addressing a named dimension cannot be done together with the other dimensions, since a named dimension can represent any of the other dimensions above, and even a combination of them. Thus, resolving a named dimension requires an understanding of either the UA or the server what the name maps to. If, for example, a track has a name in a media resource and that name is stored in the media header and the UA already has a copy of all the media headers, it can resolve the name to the track that is being requested and take adequate action.

But enough explaining - I have made a screencast of the Firefox plugin in action for all these dimensions, which explains things a lot more concisely than word will ever be able to - enjoy:

And do not forget to proofread the specification and send feedback to public-media-fragment@w3.org.

Accessibility support in Ogg and liboggplay

At the recent FOMS/LCA in Wellington, New Zealand, we talked a lot about how Ogg could support accessibility. Technically, this means support for multiple text tracks (subtitles/captions), multiple audio tracks (audio descriptions parallel to main audio track), and multiple video tracks (sign language video parallel to main video track).

Creating multitrack Ogg files The creation of multitrack Ogg files is already possible using one of the muxing applications, e.g. oggz-merge. For example, I have my own little collection of multitrack Ogg files at http://annodex.net/~silvia/itext/elephants_dream/multitrack/. But then you are stranded with files that no player will play back.

Multitrack Ogg in Players As Ogg is now being used in multiple Web browsers in the new HTML5 media formats, there are in particular requirements for accessibility support for the hard-of-hearing and vision-impaired. Either multitrack Ogg needs to become more of a common case, or the association of external media files that provide synchronised accessibility data (captions, audio descriptions, sign language) to the main media file needs to become a standard in HTML5.

As it turn out, both these approaches are being considered and worked on in the W3C. Accessibility data that are audio or video tracks will in the near future have to come out of the media resource itself, but captions and other text tracks will also be available from external associated elements.

The availability of internal accessibility tracks in Ogg is a new use case - something Ogg has been ready to do, but has not gone into common usage. MPEG files on the other hand have for a long time been used with internal accessibility tracks and thus frameworks and players are in place to decode such tracks and do something sensible with them. This is not so much the case for Ogg.

For example, a current VLC build installed on Windows will display captions, because Ogg Kate support is activated. A current VLC build on any other platform, however, has Ogg Kate support deactivated in the build, so captions won’t display. This will hopefully change soon, but we have to look also beyond players and into media frameworks - in particular those that are being used by the browser vendors to provide Ogg support.

Multitrack Ogg in Browsers Hopefully gstreamer (which is what Opera uses for Ogg support) and ffmpeg (which is what Chrome uses for Ogg support) will expose all available tracks to the browser so they can expose them to the user for turning on and off. Incidentally, a multitrack media JavaScript API is in development in the W3C HTML5 Accessibility Task Force for allowing such control.

The current version of Firefox uses liboggplay for Ogg support, but liboggplay’s multitrack support has been sketchy this far. So, Viktor Gal - the liboggplay maintainer - and I sat down at FOMS/LCA to discuss this and Viktor developed some patches to make the demo player in the liboggplay package, the glut-player, support the accessibility use cases.

I applied Viktor’s patch to my local copy of liboggplay and I am very excited to show you the screencast of glut-player playing back a video file with an audio description track and an English caption track all in sync:

elephants_dream_with_audiodescriptions_and_captions

Further developments There are still important questions open: for example, how will a player know that an audio description track is to be played together with the main audio track, but a dub track (e.g. a German dub for an English video) is to be played as an alternative. Such metadata for the tracks is something that Ogg is still missing, but that Ogg can be extended with fairly easily through the use of the Skeleton track. It is something the Xiph community is now working on.

Summary This is great progress towards accessibility support in Ogg and therefore in Web browsers. And there is more to come soon.

Government Report: "Access to Electronic Media for the Hearing and Vision Impaired"

Today was the last day to provide a submission and input to the Australian Government’s discussion report on “Access to Electronic Media for the Hearing and Vision Impaired: Approaches for Consideration”.

The report explains the Australian Government’s existing regulatory framework for accessibility to audio-visual content on TV, digital TV, DVDs, cinemas, and the Internet, and provides an overview about what it is planning to do over the next 3-5 years.

It is interesting to read that according to the Australian Bureau of Statistics about 2.67 million Australians - one in every eight people - have some form of hearing loss and 284,000 are completely or partially blind. Also, it is expected that these numbers will increase with an ageing population and obesity-linked diabetes are expected to continue to increase these numbers.

For obvious reasons, I was particularly interested in the Internet-related part of the report. It was the second-last section (number five), and to be honest, I was rather disappointed: only 3 pages of the 40 page long report concerned themselves with Internet content. Also, the main message was that “at this time the costs involved with providing captions for online content were deemed to represent an undue financial impost on a relatively new and developing service.”

Audio descriptions weren’t even touched with a stick and both were written off with “a lack of clear online caption production and delivery standard and requirements”. There is obviously a lot of truth to the statements of the report - the Internet audio-visual content industry is still fairly young compared to e.g. TV, and there are a multitude of standards rather than a single clear path.

However, I believe the report neglected to mention the new HTML5 video and audio elements and the opportunity they provide. Maybe HTML5 was excluded because it wasn’t expected to be relevant within the near future. I believe this is a big mistake and governments should pay more attention to what is happening with HTML5 audio and video and the opportunities they open for accessibility.

In the end, I made a submission because I wanted the Australian Government to wake up to the HTML5 efforts and I wanted to correct a mistake they made with claiming MPEG-2 was “not compatible with the delivery of closed audio descriptions”.

I believe a lot more can be done with accessibility for Internet content than just “monitor international developments” and industry partnership with disability representative groups. I therefore proposed to undertake trials in particular with textual audio descriptions to see if they could be produced in a similar manner to captions, which would make their cost come down enormously. Also I suggested actually aiming for WCAG 2.0 conformance within the next 5 years - which for audio-visual content means at minimum captions and audio descriptions.

You can read the report here and my 4 page long submission here.

Tutorial on HTML5 open video at LCA 2010

During last week’s LCA, Jan Gerber, Michael Dale and I gave a 3 hour tutorial on how to publish HTML5 video in an open format.

We basically taught people how to create and publish Ogg Theora video in HTML5 Web pages and how to make them work across browsers, including much of the available tools and libraries. We’re hoping that some people will have learnt enough to include modules in CMSes such as Drupal, Joomla and Wordpress, which will easily support the publishing of Ogg Theora.

I have been asked to share the material that we used. It consists of:

Note that if you would like to walk through the exercises, you should install the following software beforehand:

You might need to look for packages of your favourite OS (e.g. Windows or Mac, Ubuntu or Debian).

The exercises include:

  • creating a Ogg video from an editor
  • transcoding a video using http://firefogg.org/
  • creating a poster image using OggThumb
  • writing a first HTML5 video Web page with Ogg Theora
  • publishing it on a Web Server, with correct MIME type & Duration hint
  • writing a second HTML5 video Web page with Ogg Theora & MP4 to cover Safari/Webkit
  • transcoding using ffmpeg2theora in a script
  • writing a third HTML5 video Web page with Cortado fallback
  • writing a fourth Web page using “Video for Everybody”
  • writing a fifth Web page using “mwEmbed”
  • writing a sixth Web page using firefogg for transcoding before upload
  • and a seventh one with a progress bar
  • encoding srt subtitles into an Ogg Kate track
  • writing an eighth Web page using cortado to display the Ogg Kate track

For those that would like to see the slides here immediately, a special flash embed:

Enjoy!

Manifests for exposing the structure of a Composite Media Resource

In the previous post I explained that there is a need to expose the tracks of a time-linear media resource to the user agent (UA). Here, I want to look in more detail at different possibilities of how to do so, their advantages and disadvantages.

Note: A lot of this has come out of discussions I had at the recent W3C TPAC and is still in flux, so I am writing this to start discussions and brainstorm.

Declarative Syntax vs JavaScript API

We can expose a media resource’s tracks either through a JavaScript function that can loop through the tracks and provide access to the tracks and their features, or we can do this through declarative syntax.

Using declarative syntax has the advantage of being available even if JavaScript is disabled in a UA. The markup can be parsed easily and default displays can be prepared without having to actually decode the media file(s).

OTOH, it has the disadvantage that it may not necessarily represent what is actually in the binary resource, but instead what the Web developer assumed was in the resource (or what he forgot to update). This may lead to a situation where a “404” may need to be given on a media track.

A further disadvantage is that when somebody copies the media element onto another Web page, together with all the track descriptions, and then the original media resource is changed (e.g. a subtitle track is added), this has not the desired effect, since the change does not propagate to the other Web page.

For these reasons, I thought that a JavaScript interface was preferable over declarative syntax.

However, recent discussions, in particular with some accessibility experts, have convinced me that declarative syntax is preferable, because it allows the creation of a menu for turning tracks on/off without having to even load the media file. Further, declarative syntax allows to treat multiple files and “native tracks” of a virtual media resource in an identical manner.

Extending Existing Declarative Syntax

The HTML5 media elements already have declarative syntax to specify multiple source media files for media elements. The element is typically used to list video in mpeg4 and ogg format for support in different browsers, but has also been envisaged for different screensize and bandwidth encodings.

The elements are generally meant to list different resources that contribute towards the media element. In that respect, let’s try using it for declaring a manifest of tracks of the virtual media resource on an example:

  <video>
    <source id='av1' src='video.3gp' type='video/mp4' media='mobile' lang='en'
                     role='media' >
    <source id='av2' src='video.mp4' type='video/mp4' media='desktop' lang='en'
                     role='media' >
    <source id='av3' src='video.ogv' type='video/ogg' media='desktop' lang='en'
                     role='media' >
    <source id='dub1' src='video.ogv?track=audio[de]' type='audio/ogg' lang='de'
                     role='dub' >
    <source id='dub2' src='audio_ja.oga' type='audio/ogg' lang='ja'
                     role='dub' >
    <source id='ad1' src='video.ogv?track=auddesc[en]' type='audio/ogg' lang='en'
                     role='auddesc' >
    <source id='ad2' src='audiodesc_de.oga' type='audio/ogg' lang='de'
                     role='auddesc' >
    <source id='cc1' src='video.mp4?track=caption[en]' type='application/ttaf+xml'
                     lang='en' role='caption' >
    <source id='cc2' src='video.ogv?track=caption[de]' type='text/srt; charset="ISO-8859-1"'
                     lang='de' role='caption' >
    <source id='cc3' src='caption_ja.ttaf' type='application/ttaf+xml' lang='ja'
                     role='caption' >
    <source id='sign1' src='signvid_ase.ogv' type='video/ogg; codecs="theora"'
                     media='desktop' lang='ase' role='sign' >
    <source id='sign2' src='signvid_gsg.ogv' type='video/ogg; codecs="theora"'
                     media='desktop' lang='gsg' role='sign' >
    <source id='sign3' src='signvid_sfs.ogv' type='video/ogg; codecs="theora"'
                     media='desktop' lang='sfs' role='sign' >
    <source id='tad1' src='tad_en.srt' type='text/srt; charset="ISO-8859-1"'
                     lang='en' role='tad' >
    <source id='tad2' src='video.ogv?track=tad[de]' type='text/srt; charset="ISO-8859-1"'
                     lang='de' role='tad' >
    <source id='tad3' src='tad_ja.srt' type='text/srt; charset="EUC-JP"' lang='ja'
                     role='tad' >
  </video>

Note that this somewhat ignores my previously proposed special itext tag for handling text tracks. I am doing this here to experiment with a more integrative approach with the virtual media resource idea from the previous post. This may well be a better solution than a specific new text-related element. Most of the attributes of the itext element are, incidentally, covered.

You will also notice that some of the tracks are references to tracks inside binary media files using the Media Fragment URI specification while others link to full files. An example is video.ogv?track=auddesc[en]. So, this is a uniform means of exposing all the tracks that are part of a (virtual) media resource to the UA, no matter whether in-band or in external files. It actually relies on the UA or server being able to resolve these URLs.

”type” attribute

“media” and “type” are existing attributes of the element in HTML5 and meant to help the UA determine what to do with the referenced resource. The current spec states:

The “type” attribute gives the type of the media resource, to help the user agent determine if it can play this media resource before fetching it.

The word “play” might need to be replaced with “decode” to cover several different MIME types.

The “type” attribute was also extended with the possibility to add the “charset” MIME parameter of a linked text resource - this is particularly important for SRT files, which don’t handle charsets very well. It avoids having to add an additional attribute and is analogous to the “codecs” MIME parameter used by audio and video resources.

”media” attribute

Further, the spec states:

The “media” attribute gives the intended media type of the media resource, to help the user agent determine if this media resource is useful to the user before fetching it. Its value must be a valid media query.

The “mobile” and “desktop” values are hints that I’ve used for simplicity reasons. They could be improved by giving appropriate bandwidth limits and width/height values, etc. Other values could be different camera angles such as topview, frontview, backview. The media query aspect has to be looked into in more depth.

”lang” attribute

The above example further uses “lang” and “role” attributes:

The “lang” attribute is an existing global attribute of HTML5, which typically indicates the language of the data inside the element. Here, it is used to indicate the language of the referenced resource. This is possibly not quite the best name choice and should maybe be called “hreflang”, which is already used in multiple other elements to signify the language of the referenced resource.

”role” attribute

The “role” attribute is also an existing attribute in HTML5, included from ARIA. It currently doesn’t cover media resources, but could be extended. The suggestion here is to specify the roles of the different media tracks - the ones I have used here are:

  • “media”: a main media resource - typically contains audio and video and possibly more
  • “dub”: a audio track that provides an alternative dubbed language track
  • “auddesc”: a audio track that provides an additional audio description track
  • “caption”: a text track that provides captions
  • “sign”: a video-only track that provides an additional sign language video track
  • “tad”: a text track that provides textual audio descriptions to be read by a screen reader or a braille device

Further roles could be “music”, “speech”, “sfx” for audio tracks, “subtitle”, “lyrics”, “annotation”, “chapters”, “overlay” for text tracks, and “alternate” for a alternate main media resource, e.g. a different camera angle.

Track activation

The given attributes help the UA decide what to display.

It will firstly find out from the “type” attribute if it is capable of decoding the track.

Then, the UA will find out from the “media” query, “role”, and “lang” attributes whether a track is relevant to its user. This will require checking the capabilities of the device, network, and the user preferences.

Further, it could be possible for Web authors to influence whether a track is displayed or not through CSS parameters on the element: “display: none” or “visibility: hidden/visible”.

Examples for track activation that a UA would undertake using the example above:

Given a desktop computer with Firefox, German language preferences, captions and sign language activated, the UA will fetch the original video at video.ogv (for Firefox), the German caption track at video.ogv?track=caption[de], and the German sign language track at signvid_gsg.ogv (maybe also the German dubbed audio track at video.ogv?track=audio[de], which would then replace the original one).

Given a desktop computer with Safari, English language preferences and audio descriptions activated, the UA will fetch the original video at video.mp4 (for Safari) and the textual audio description at tad_en.srt to be displayed through the screen reader, since it cannot decode the Ogg audio description track at video.ogv?track=auddesc[en].

Also, all decodeable tracks could be exposed in a right-click menu and added on-demand.

Display styling

Default styling of these tracks could be:

  • video or alternate video in the video display area,
  • sign language probably as picture-in-picture (making it useless on a mobile and only of limited use on the desktop),
  • captions/subtitles/lyrics as overlays on the bottom of the video display area (or whatever the caption format prescribes),
  • textual audio descriptions as ARIA live regions hidden behind the video or off-screen.

Multiple audio tracks can always be played at the same time.

The Web author could also define the display area for a track through CSS styling and the UA would then render the data into that area at the rate that is required by the track.

How good is this approach?

The advantage of this new proposal is that it builds basically on existing HTML5 components with minimal additions to satisfy requirements for content selection and accessibility of media elements. It is a declarative approach to the multi-track media resource challenge.

However, it leaves most of the decision on what tracks are alternatives of/additions to each other and which tracks should be displayed to the UA. The UA makes an informed decision because it gets a lot of information through the attributes, but it still has to make decisions that may become rather complex. Maybe there needs to be a grouping level for alternative tracks and additional tracks - similar to what I did with the second itext proposal, or similar to the and elements of SMIL.

A further issue is one that is currently being discussed within the Media Fragments WG: how can you discover the track composition and the track naming/uses of a particular media resource? How, e.g., can a Web author on another Web site know how to address the tracks inside your binary media resource? A HTML specification like the above can help. But what if that doesn’t exist? And what if the file is being used offline?

Alternative Manifest descriptions

The need to manifest the track composition of a media resource is not a new one. Many other formats and applications had to deal with these challenges before - some have defined and published their format.

I am going to list a few of these formats here with examples. They could inspire a next version of the above proposal with grouping elements.

Microsoft ISM files (SMIL subpart)

With the release of IIS7, Microsoft introduced “Smooth Streaming”, which uses chunking on files on the server to deliver adaptive streaming to Silverlight clients over HTTP. To inform a smooth streaming client of the tracks available for a media resource, Microsoft defined ism files: IIS Smooth Streaming Server Manifest files.

This is a short example - a longer one can be found here:

<?xml version=

The model of a time-linear media resource for HTML5

HTML5 has been criticised for not having a timing model of the media resource in its new media elements. This article spells it out and builds a framework of how we should think about HTML5 media resources. Note: these are my thoughts and nothing offical from HTML5 - just conclusions I have drawn from the specs and from discussions I had.

What is a time-linear media resource?

In HTML5 and also in the Media Fragment URI specification we deal only with audio and video resources that represent a single timeline exclusively. Let’s call such Web resources a time-linear media resource.

The Media Fragment requirements document actually has a very nice picture to describe such resources - replicated here for your convenience:

Model of a Media Resource

The resource can potentially consist of any number of audio, video, text, image or other time-aligned data tracks. All these tracks adhere to a single timeline, which tends to be defined by the main audio or video track, while other tracks have been created to synchronise with these main tracks.

This model matches with the world view of video on YouTube and any other video hosting service. It also matches with video used on any video streaming service.

Background on the choice of “time-linear”

I’ve deliberately chosen the word “time-linear” because we are talking about a single, gap-free, linear timeline here and not multiple timelines that represent the single resource.

The word “linear” is, however, somewhat over-used, since the introduction of digital systems into the world of analog film introduced what is now known as “non-linear video editing”. This term originates from the fact that non-linear video editing systems don’t have to linearly spool through film material to get to a edit point, but can directly access any frame in the footage as easily as any other.

When talking about a time-linear media resource, we are referring to a digital resource and therefore direct access to any frame in the footage is possible. So, a time-linear media resource will still be usable within a non-linear editing process.

As a Web resource, a time-linear media resource is not addressed as a sequence of frames or samples, since these are encoding specific. Rather, the resource is handled abstractly as an object that has track and time dimensions - and possibly spatial dimensions where image or video tracks are concerned. The framerate encoding of the resource itself does not matter and could, in fact, be changed without changing the resource’s time, track and spatial dimensions and thus without changing the resource’s address.

Interactive Multimedia

The term “time-linear” is used to specify the difference between a media resource that follows a single timeline, in contrast to one that deals with multiple timelines, linked together based on conditions, events, user interactions, or other disruptions to make a fully interactive multi-media experience. Thus, media resources in HTML5 and Media Fragments do not qualify as interactive multimedia themselves because they are not regarded as a graph of interlinked media resources, but simply as a single time-linear resource.

In this respect, time-linear media resources are also different from the kind of interactive mult-media experiences that an Adobe Shockwave Flash, Silverlight, or a SMIL file can create. These can go far beyond what current typical video publishing and communication applications on the Web require and go far beyond what the HTML5 media elements were created for. If your application has a need for multiple timelines, it may be necessary to use SMIL, Silverlight, or Adobe Flash to create it.

Note that the fact that the HTML5 media elements are part of the Web, and therefore expose states and integrate with JavaScript, provides Web developers with a certain control over the playback order of a time-linear media resource. The simple functions pause(), play(), and the currentTime attribute allow JavaScript developers to control the current playback offset and whether to stop or start playback. Thus, it is possible to interrupt a playback and present, e.g. a overlay text with a hyperlink, or an additional media resource, or anything else a Web developer can imagine right in the middle of playing back a media resource.

In this way, time-linear media resources can contribute towards an interactive multi-media experience, created by a Web developer through a combination of multiple media resources, image resources, text resources and Web pages. The limitations of this approach are not yet clear at this stage - how far will such a constructed multi-media experience be able to take us and where does it become more complicated than an Adobe Flash, Silverlight, or SMIL experience. The answer to this question will, I believe, become clearer through the next few years of HTML5 usage and further extensions to HTML5 media may well be necessary then.

Proper handling of time-linear media resources in HTML5

At this stage, however, we have already determined several limitations of the existing HTML5 media elements that require resolution without changing the time-linear nature of the resource.

1. Expose structure

Above all, there is a need to expose the above painted structure of a time-linear media resource to the Web page. Right now, when the

We need a means to expose the available tracks inside a time-linear media resource and allow the UA some control over it - e.g. to choose whether to turn on/off a caption track, to choose which video track to display, or to choose which dubbed audio track to display.

I’ll discuss in another article different approaches on how to expose the structure. Suffice for now that we recognise the need to expose the tracks.

2. Separate the media resource concept from actual files

A HTML page is a sequence of HTML tags delivered over HTTP to a UA. A HTML page is a Web resource. It can be created dynamically and contain links to other Web resources such as images which complete its presentation.

We have to move to a similar “virtual” view of a media resource. Typically, a video is a single file with a video and an audio track. But also typically, caption and subtitle tracks for such a video file are stored in other files, possibly even on other servers. The caption or subtitle tracks are still in sync with the video file and therefore are actual tracks of that time-linear media resource. There is no reason to treat this differently to when the caption or subtitle track is inside the media file.

When we separate the media resource concept from actual files, we will find it easier to deal with time-linear media resources in HTML5.

3. Track activation and Display styling

A time-linear media resource, when regarded completely abstractly, can contain all sorts of alternative and additional tracks.

For example, the existing elements inside a video or audio element are currently mostly being used to link to alternative encodings of the main media resource - e.g. either in mpeg4 or ogg format. We can regard these as alternative tracks within the same (virtual) time-linear media resource.

Similarly, the elements have also been suggested to be used for alternate encodings, such as for mobile and Web. Again, these can be regarded as alternative tracks of the same time-linear media resource.

Another example are subtitle tracks for a main media resource, which are currently discussed to be referenced using the element. These are in principle alternative tracks amongst themselves, but additional to the main media resource. Also, some people are actually interested in displaying two subtitle tracks at the same time to learn translations.

Another example are sign language tracks, which are video tracks that can be regarded as an alternative to the audio tracks for hard-of-hearing users. They are then additional video tracks to the original video track and it is not clear how to display more than one video track. Typically, sign language tracks are displayed as picture-in-picture, but on the Web, where video is usually displayed in a small area, this may not be optimal.

As you can see, when deciding which tracks need to be displayed one needs to analyse the relationships between the tracks. Further, user preferences need to come into play when activating tracks. Finally, the user should be able to interactively activate tracks as well.

Once it is clear, what tracks need displaying, there is still the challenge of how to display them. It should be possible to provide default displays for typical track types, and allow Web authors to override these default display styles since they know what actual tracks their resource is dealing with.

While the default display seems to be typically an issue left to the UA to solve, the display overrides are typically dealt with on the Web through CSS approaches. How we solve this is for another time - right now we can just state the need for algorithms for track activiation and for default and override styling.

Hypermedia

To make media resources a prime citizens on the Web, we have to go beyond simply replicating digital media files. The Web is based on hyperlinks between Web resources, and that includes hyperlinking out of resources (e.g. from any word within a Web page) as well as hyperlinking into resources (e.g. fragment URIs into Web pages).

To turn video and audio into hypervideo and hyperaudio, we need to enable hyperlinking into and out of them.

Hyperlinking into media resources is fortunately already being addressed by the W3C Media Fragments working group, which also regards media resources in the same way as HTML5. The addressing schemes under consideration are the following:

  • temporal fragment URI addressing: address a time offset/region of a media resource
  • spatial fragment URI addressing: address a rectangular region of a media resource (where available)
  • track fragment URI addressing: address one or more tracks of a media resource
  • named fragment URI addressing: address a named region of a media resource
  • a combination of the above addressing schemes

With such addressing schemes available, there is still a need to hook up the addressing with the resource. For the temporal and the spatial dimension, resolving the addressing into actual byte ranges is relatively obvious across any media type. However, track addressing and named addressing need to be resolved. Track addressing will become easier when we solve the above stated requirement of exposing the track structure of a media resource. The name definition requires association of an id or name with temporal offsets, spatial areas, or tracks. The addressing scheme will be available soon - whether our media resources can support them is another challenge to solve.

Finally, hyperlinking out of media resources is something that is not generally supported at this stage. Certainly, some types of media resources - QuickTime, Flash, MPEG4, Ogg - support the definition of tracks that can contain HTML marked-up text and thus can also contain hyperlinks. But standardisation in this space has not really happened yet. It seems to be clear that hyperlinks out of media files will come from some type of textual track. But a standard format for such time-aligned text tracks doesn’t yet exist. This is a challenge to be addressed in the near future.

Summary

The Web has always tried to deal with new extensions in the simplest possible manner, providing support for the majority of current use cases and allowing for the few extraordinary use cases to be satisfied by use of JavaScript or embedding of external, more complex objects.

With the new media elements in HTML5, this is no different. So far, the most basic need has been satisfied: that of including simple video and audio files into Web pages. However, many basic requirements are not being satisfied yet: accessibility needs, codec choice, device-independence needs are just some of the core requirements that make it important to extend our view of

This post has created the concept of a “media resource”, where we keep the simplicity of a single timeline. At the same time, it has tried to classify the list of shortcomings of the current media elements in a way that will help us address these shortcomings in a Web-conformant means.

If we accept the need to expose the structure of a media resource, the need to separate the media resource concept from actual files, the need for an approach to track activation, and the need to deal with styling of displayed tracks, we can take the next steps and propose solutions for these.

Further, understanding the structure of a media resources allows us to start addressing the harder questions of how to associate events with a media resource, how to associate a navigable structure with a media resource, or how to turn media resources into hypermedia.

HTML5 Video element discussions at TPAC meetings

Last week’s TPAC (2009 W3C Technical Plenary / Advisory Committee) meetings were my second time at a TPAC and I found myself becoming highly involved with the progress on accessibility on the HTML5 video element. There were in particular two meetings of high relevanct: the Video Accessibility workshop and Friday’s HTML5 breakout group on the video element.

HTML5 Video Accessibility Workshop

The week started on Sunday with the “HTML5 Video Accessibility workshop” at Stanford University, organised by John Foliot and Dave Singer. They brought together a substantial number of people all representing a variety of interest groups. Everyone got their chance to present their viewpoint - check out the minutes of the meeting for a complete transcript.

The list of people and their discussion topics were as follows:

Accessibility Experts

  • Janina Sajka, chair of WAI Protocols and Formats: represented the vision-impaired community and expressed requirements for a deeply controllable access interface to audio-visual content, preferably in a structured manner similar to DAISY.
  • Sally Cain, RNIB, Member of W3C PF group: expressed a deep need for audio descriptions, which are often overlooked besides captions.
  • Ken Harrenstien, Google: has worked on captioning support for video.google and YouTube and shared his experiences, e.g. http://www.youtube.com/watch?v=QRS8MkLhQmM, and automated translation.
  • Victor Tsaran, Yahoo! Accessibility Manager: joined for a short time out of interest.

Practicioners

  • John Foliot, professor at Stanford Uni: showed a captioning service that he set up at Stanford University to enable lecturers to publish more accessible video - it uses humans for transcription, but automated tools to time-align, and provides a Web interface to the staff.
  • Matt May, Adobe: shared what Adobe learnt about accessibility in Flash - in particular that an instream-only approach to captions was a naive approach and that external captions are much more flexible, extensible, and can fit into current workflows.
  • Frank Olivier, Microsoft: attended to listen and learn.

Technologists

  • Pierre-Antoine Champin from Liris (France), who was not able to attend, sent a video about their research work on media accessibility using automatic and manual annotation.
  • Hironobu Takagi, IBM Labs Tokyo, general chair for W4A: demonstrated a text-based audio description system combined with a high-quality, almost human-sounding speech synthesizer.
  • Dick Bulterman, Researcher at CWI in Amsterdam, co-chair of SYMM (group at W3C doing SMIL): reported on 14 years of experience with multimedia presentations and SMIL (slides) and the need to make temporal and spatial synchronisation explicit to be able to do the complex things.
  • Joakim S

Web Directions South 2009 talk on HTML5 video

Yesterday, I gave a talk on the HTML5 video element at Web Directions South.

The title was “Taking HTML5

_This talk focuses on the efforts engaged by W3C to improve the new HTML 5 media elements with mechanisms to allow people to access multimedia content, including audio and video. Such developments are also useful beyond accessibility needs and will lead to a general improvement of the usability of media, making media discoverable and generally a prime citizen on the Web.

Silvia will discuss what is currently technically possible with the HTML5 media elements, and what is still missing. She will describe a general framework of accessibility for HTML5 media elements and present her work for the Mozilla Corporation that includes captions, subtitles, textual audio annotations, timed metadata, and other time-aligned text with the HTML5 media elements. Silvia will also discuss work of the W3C Media Fragments group to further enhance video usability and accessibility by making it possible to directly address temporal offsets in video, as well as spatial areas and tracks._

Here are my slides:

Download the pdf from here.

There was also a video recording and I will add that here as soon as it is published.

UPDATE: The video is available on Tinyvid:

I’m not going to try and upload this 50min long video to YouTube - with it’s 10 min limit, I won’t get very far.

WebJam 2009 talk on video accessibility

On Wednesday evening I gave a 3 min presentation on video accessibility in HTML5 at the WebJam in Sydney. I used a video as my presentation means and explained things while playing it back. Here is the video, without my oral descriptions, but probably still useful to some. Note in particular how you can experience the issues of deaf (HoH), blind (VI) and foreign language users:

The Ogg version is here.

New proposal for captions and other timed text for HTML5

The first specification for how to include captions, subtitles, lyrics, and similar time-aligned text with HTML5 media elements has received a lot of feedback - probably because there are several demos available.

The feedback has encouraged me to develop a new specification that includes the concerns and makes it easier to associate out-of-band time-aligned text (i.e. subtitles stored in separate files to the video/audio file). A simple example of the new specification using srt files is this:

<video src="video.ogv" controls>
   <itextlist category="CC">
     <itext src="caption_en.srt" lang="en"/>
     <itext src="caption_de.srt" lang="de"/>
     <itext src="caption_fr.srt" lang="fr"/>
     <itext src="caption_jp.srt" lang="jp"/>
   </itextlist>
 </video>

By default, the charset of the itext file is UTF-8, and the default format is text/srt (incidentally a mime type the still needs to be registered). Also by default the browser is expected to select for display the track that matches the set default language of the browser. This has been proven to work well in the previous experiments.

Check out the new itext specification, read on to get an introduction to what has changed, and leave me your feedback if you can!

The itextlist element You will have noticed that in comparison to the previous specification, this specification contains a grouping element called “itextlist”. This is necessary because we have to distinguish between alternative time-aligned text tracks and ones that can be additional, i.e. displayed at the same time. In the first specification this was done by inspecting each itext element’s category and grouping them together, but that resulted in much repetition and unreadable specifications.

Also, it was not clear which itext elements were to be displayed in the same region and which in different ones. Now, their styling can be controlled uniformly.

The final advantage is that association of callbacks for entering and leaving text segments as extracted from the itext elements can now be controlled from the itextlist element in a uniform manner.

This change also makes it simple for a parser to determine the structure of the menu that is created and included in the controls element of the audio or video element.

Incidentally, a patch for Firefox already exists that makes this part of the browser. It does not yet support this new itext specification, but here is a screenshot that Felipe Corr

W3C Workshop/Barcamp on HTML5 Video Accessibility

Web accessibility veteran John Foliot of Stanford University and Apple’s QuickTime EcoSystem Manager Dave Singer are organising a W3C Workshop/Barcamp on Video Accessibility on the Sunday before the W3C’s annual combined technical plenary meeting TPAC.

The workshop will take place on 1st November at Stanford University - see details on the Workshop. If you read the announcement, you will see that this is about understanding all the issues around video (and audio) accessibility, understanding existing approaches, and trying to find solutions for HTML5 that all browser vendors will be able to support.

The workshop is run under the W3C Hypertext Coordination Group and registration is required.

W3C membership is not required in order to participate in the gathering. However, you are required to contribute your knowledge actively and constructively to the Workshop. You must come prepared to present on one of the questions in this document to help inform the discussion and make progress on proposing solutions.

I am very excited about this workshop because I think it is high time to move things forward.

If I can get my travel sorted, I will present my results on the video accessibility work that I did for Mozilla. It will cover both: out-of-band accessibility data for video elements, as well as in-line accessibility data and how to expose a common API in the Web browser for them. I have recently experimented with encoding srt and lrc files in Ogg and displaying them in Firefox by using the patches that were contributed by OggK and Felipe into Firefox. More about this soon.

Tracking Status of Video Accessibility Work

Just a brief note to let everyone know about a new wikipage I created for my Mozilla work about video accessibility, where I want to track the status and outcomes of my work. You can find it at https://wiki.mozilla.org/Accessibility/Video_a11y_Aug09. It lists the following sections: Test File Collection, Specifications, Demo implementations using JavaScript, Related open bugs in Mozilla, and Publications.

Open Standards for Sign Languages

Looking at accessibility for video includes sign language. It is a most fascinating area to get into and an area that still leaves a lot to formalise and standardise. A lot has happened in recent years and a lot still needs to be done.

Sign languages are different languages to spoken languages: they emerged in parallel to spoken languages in communities whose boundaries may not overlap with the boundaries of spoken languages. However, most developed means to translate spoken language artifacts (i.e. letters) into sign language artifacts (i.e. signs). So, a typical signer will speak/write at least 3-4 “languages”: the spoken language of their hearing peers, lip reading of that spoken language, letter signs of the spoken language, and finally the native sign language of the community they live in.

Encoding sign language in the computer is a real challenge. Firstly, there is the problem of enumerating all available languages. Then there is the challenge to find an alphabet to represent all “characters” that can be used in sign across many (preferably all) sign languages. Then there is the need to encode these characters in a way that computers can deal with. And finally, there is the need to find a screen representation of the characters. In this blog post, I want to describe the status for all of these.

Currently, sign language can only be represented as a video track by recording sign speakers. Once a sign character list together with an encoding and representation means for them and a specification of the different sign languages is available, it is possible to encode sign sentences in computer-readable form. Further, programs can be written that can present sign sentences on screen, that translate between different sign languages, and between sign and spoken languages. Also, avatars can be programmed that actually present animated sign sentences.

Imagine a computer that instead of presenting letters in your spoken language uses sign language characters and has keys with signs on them instead of letters. To a sign speaker this would be a lot more natural, since for most sign is their mother tongue.

Listing all existing sign languages It was a challenge to create codes for all existing spoken languages - the current list of language codes has only been finalised in 1998.

Until the 1980s, scientists assumed that it is impossible to develop as rich a language with signs as with writing and speaking. Thus, the native languages of deaf people were often regarded as inferior to spoken languages. In many countries it was even prohibited to teach the language in schools for the deaf and instead they were taught to speak an oral language and read lips. In France this prohibition was only lifted in 1991! Only in about 1985 was it proven that sign languages are indeed as rich as spoken languages and deserve the right to be called a “language” and be treated as a fully capable means of communication.

So, there hasn’t actually been much time to map out a list of all sign languages. The best list I was able to find is in Wikipedia. It lists 28 N/S American, 38 European, 34 Asia-Pacific-AU/NZ, 30 African, and 13 Middle Eastern sign languages - in summary 143 sign languages. This list contains 177 sign languages.

Interestingly, there is also a new International Sign Language in development called Gestuno which is in use in international events (Olympics, conferences etc.) but has only a limited vocabulary.

In 1999 the Irish National Body, Deaf Action Committee for SignWriting, proposed the addition of sign language codes to ISO-639-2. Instead, a single code entered the list: sgn for sign language. In 2001, this led to the development of IETF language extension codes in RFC 3066 for 22 sign languages. In September 2006, this standard was replaced by RFC 4646, which defines 135 subtags for sign languages, including one for the International Sign Language and a generic “sgn” one.

While not complete, the current IANA subtag language registry now regards sign languages as valid derivatives of a country’s languages and therefore handles them identically to spoken languages. It’s also extensible such that any sign language not yet registered can still be specified.

Characters for sign languages The written word is very powerful for preserving and sharing information. For a very long time there has been no written representation of sign languages. This is not surprising considering that there are still indigenous spoken languages that have no written representation. Also, the written representation of the spoken language around the community of a sign language would have served the sign community sufficiently for most purposes - except for the accurate capture of their thoughts and sign communications. It would always be a foreign language.

To move sign languages into the 20th century, the invention of characters for signs was necessary.

It is relatively easy to map the alphabets of spoken languages to signs (e.g. American (ASL) manual alphabet, British, Australian and NZ (AUSLAN) manual alphabet, or German manual finger alphabet, also see fingerspelling). Interesting the AUSLAN manual alphabet is a two-handed one while the ASL one is single-handed.

Fonts are available for these alphabets, too, e.g. British Sign Font, American Sign Font, French Sign Font and more.

The real challenge lies in capturing the proper signs deaf people use to communicate amongst themselves.

This is rather challenging, since sign languages uses the hands, head and body, with constantly changing movements and orientations for communication. Thus, while spoken language only has one dimension (sound) over time, sign languages have “three dimensions” and capturing this in characters is difficult. Many sign languages to this date don’t have a widely used written form, e.g. AUSLAN. Mostly in use nowadays are sequences of photos or videos - which of course cannot be computer processed easily.

Two main writing systems have been developed: the phonemic Stokoe notation and the iconic SignWriting.

Stokoe notation was created by William Stokoe for ASL in 1960, with Latin letters and numbers used for the shapes they have in fingerspelling, and iconic glyphs to transcribe the position, movement, and orientation of the hands. Adaptations were made to other sign languages to include further phonemes not found in ASL. Stokoe notation is written left-to-right on a page and can be typed with the proper font installed. It has a Unicode/ASCII mapping, but does not easily apply to other sign languages than ASL since it does not capture all possible signs. It has no representation for facial and body expressions and is therefore a relatively poor representation for sign.

SignWriting was created by Valerie Sutton in 1974, a dancer who had two years earlier developed DanceWriting and later developed MimeWriting, SportsWriting, and ScienceWriting. SignWriting is a writing system which uses visual symbols to represent the handshapes, movements, and facial expressions of sign languages. It is a generic sign alphabet with a list of symbols that can be used to write any sign language in the world.

SignWriting can be easily learnt by signers and is more popular now than Stokoe. Signers compose the symbols together in a spatial way to represent their signs. They then write the composed symbols from top to bottom on a page, similar to other iconic character sets. SignWriting currently supports 73 different sign languages, whose dictionaries and encyclopedias are captured in SignPuddle. This will eventually allow the creation of complete corpora for all sign languages.

Unicode encoding of SignWriting and visual representation Because of its unique challenges of having to cover the spatial combination of symbols as a new symbol rather than just the sequential combination of symbols, it took a while to get a Unicode representation of SignWriting.

About a year ago, on 19th September 2008, Valerie Sutton released the International SignWriting Alphabet (ISWA 2008).

A binary representation of SignWriting is defined in ISWA 2008. It is based on a representing 639 base symbols and their potential 6 fill and 16 rotation variants in 61,343 code points, that completely cover the subset of 35023 valid symbol codes. The spatial aspect of SignWriting are encoded in a 2-dimensional coordinate system. The dimensions go from -1919 through 1919 to place the top left corner of the symbol.

SignWriting base symbols are encoded in plane 4 of Unicode, which provides 65,536 code points, easily covering the defined 61,343 Binary SignWriting code points. Further special control and number characters are used to encode the spatial layout.

Visual Representation of SignWriting Valerie Sutton created over 35k individual PNG images for ISWA 2008, which have been reformatted for standard color & reduced file size, and renamed to the character code. They are a font used to represent the signs. The images can be accessed on Valerie’s server.

Closing After learning all this today, I have to say that Valerie Sutton has just turned into a new idol of mine. The achievements with SignWriting and the possibilities it will enable are massive.

Now I just have to figure out what to do when we hit on a sign language track that has been encoded in SignWriting and it represents captions. Maybe it is possible to display sign as overlay but on the left side of the video. This would be similar to some other languages that go from top to bottom rather than left to right.

Updated video accessibility demo

Just a brief note to share that I have updated the video accessibility demo at http://www.annodex.net/~silvia/itext/elephant_no_skin.html.

It should now support ARIA and tab access to the menu, which I have simply put next to the video. I implemented the menu by learning from YUI. My Firefox 3.5.3 actually doesn’t tab through it, but then it also doesn’t tab through the YUI example, which I think is correct. Go figure.

Also, the textual audio descriptions are improved and should now work better with screenreaders.

I have also just prepared a recorded audio description of “Elephants Dreams” (German accent warning).

You can also download the multitrack Ogg Theora video file that contains the original audio and video track plus the audio description as an extra track, created using oggz-merge.

As soon as some kind soul donates a sign language track for “Elephants Dream”, I will have a pretty complete set of video accessibility tracks for that video. This will certainly become the basis for more video a11y work!

Demo of deep hyperlinking into HTML5 video

In an effort to give a demo of some of the W3C Media Fragment WG specification capabilities, I implemented a HTML5 page with a video element that reacts to fragment offset changes to the URL bar and the

Demo Features

The demo can be found on the Annodex Web server. It has the following features:

If you simply load that Web page, you will see the video jump to an offset because it is referred to as “elephants_dream/elephant.ogv#t=20”.

If you change or add a temporal fragment in the URL bar, the video jumps to this time offset and overrules the video’s fragment addressing. (This only works in Firefox 3.6, see below - in older Firefoxes you actually have to reload the page for this to happen.) This functionality is similar to a time linking functionality that YouTube also provides.

When you hit the “play” button on the video and let it play a bit before hitting “pause” again - the second at which you hit “pause” is displayed in the page’s URL bar . In Firefox, this even leads to an addition to the browser’s history, so you can jump back to the previous pause position.

Three input boxes allow for experimentation with different functionality.

  • The first one contains a link to the current Web page with the media fragment for the current video playback position. This text is displayed for cut-and-paste purposes, e.g. to send it in an email to friends.

  • The second one is an entry box which accepts float values as time offsets. Once entered, the video will jump to the given time offset. The URL of the video and the page URL will be updated.

  • The third one is an entry box which accepts a video URL that replaces the . It is meant for experimentation with different temporal media fragment URLs as they get loaded into the

Javascript Hacks

You can look at the source code of the page - all the javascript in use is actually at the bottom of the page. Here are some of the juicy bits of what I’ve done:

Since Web browsers do not support the parsing and reaction to media fragment URIs, I implemented this in javascript. Once the video is loaded, i.e. the “loadedmetadata” event is called on the video, I parse the video’s @currentSrc attribute and jump to a time offset if given. I use the @currentSrc, because it will be the URL that the video element is using after having parsed the @src attribute and all the containing elements (if they exist). This function is also called when the video’s @src attribute is changed through javascript.

This is the only bit from the demo that the browsers should do natively. The remaining functionality hooks up the temporal addressing for the video with the browser’s URL bar.

To display a URL in the URL bar that people can cut and paste to send to their friends, I hooked up the video’s “pause” event with an update to the URL bar. If you are jumping around through javascript calls to video.currentTime, you will also have to make these changes to the URL bar.

Finally, I am capturing the window’s “hashchange” event, which is new in HTML5 and only implemented in Firefox 3.6. This means that if you change the temporal offset on the page’s URL, the browser will parse it and jump the video to the offset time.

Optimisation

Doing these kinds of jumps around on video can be very slow when the seeking is happening on the remote server. Firefox actually implements seeking over the network, which in the case of Ogg can require multiple jumps back and forth on the remote video file with byte range requests to locate the correct offset location.

To reduce as much as possible the effort that Firefox has to make with seeking, I referred to Mozilla’s very useful help page to speed up video. It is recommended to deliver the X-Content-Duration HTTP header from your Web server. For Ogg media, this can be provided through the oggz-chop CGI. Since I didn’t want to install it on my Apache server, I hard coded X-Content-Duration in a .htaccess file in the directory that serves the media file. The .htaccess file looks as follows:

<Files "elephant.ogv"> Header set X-Content-Duration "653.791" </Files>

This should now help Firefox to avoid the extra seek necessary to determine the video’s duration and display the transport bar faster.

I also added the @autobuffer attribute to the

ToDos

This is only a first and very simple demo of media fragments and video. I have not made an effort to capture any errors or to parse a URL that is more complicated than simply containing “#t=”. Feel free to report any bugs to me in the comments or send me patches.

Also, I have not made an effort to use time ranges, which is part of the W3C Media Fragment spec. This should be simple to add, since it just requires to stop the video playback at the given end time.

Also, I have only implemented parsing of the most simple default time spec in seconds and fragments. None of the more complicated npt, smpte, or clock specifications have been implemented yet.

The possibilities for deeper access to video and for improved video accessibility with these URLs are vast. Just imagine hooking up the caption elements of e.g. an srt file with temporal hyperlinks and you can provide deep interaction between the video content and the captions. You could even drive this to the extreme and jump between single words if you mark up each with its time relationship. Happy experimenting!

UPDATE: I forgot to mention that it is really annoying that the video has to be re-loaded when the @src attribute is changed, even if only the hash changes. As support for media fragments is implemented in

Thanks go to Chris Double and Chris Pearce from Mozilla for their feedback and suggestions for improvement on an early version of this.

The different aspects of video accessibility

In the last week, I have received many emails replying to my request for feedback on the video accessibility demo. Thanks very much to everyone who took the time.

Interestingly, I got very little feedback on the subtitles and textual audio annotation aspects of my demo, actually, even though that was the key aspect of my analysis. It’s my own fault, however, because I chose a good looking video player skin over an accessible one.

This is where I need to take a step back and explain about the status of HTML5 video and its general accessibility aspects. Some of this is a repetition of an email that I sent to the W3C WAI-XTECH mailing list.

Browser support of HTML5 video

The HTML5 video tag is still a rather new tag that has not been implemented in all browsers yet - and not all browsers support the Ogg Theora/Video codec that my demo uses. Only the latest Firefox 3.5 release will support my demo out of the box. For Chrome and Opera you will have to use the latest nightly build (which I am not even sure are publicly available). IE does not support it at all. For Safari/Webkit you will need the latest release and install the XiphQT quicktime component to provide support for the codec.

My recommendation is clearly to use Firefox 3.5 to try this demo.

Standardisation status of HTML5 video

The standardisation of the HTML5 video tag is still in process. Some of the attributes have not been validated through implementations, some of the use cases have not been turned into specifications, and most importantly to the topic of interest here, there have been very little experiments with accessibility around the HTML5 video tag.

Accessibility of video controls

Most of the comments that I received on my demo were concerned with the accessibility of the video controls.

In HTML5 video, there is a attribute called @controls. If it is available, the browser is expected to display default controls on top of the video. Here is what the current specification says:

“This user interface should include features to begin playback, pause playback, seek to an arbitrary position in the content (if the content supports arbitrary seeking), change the volume, and show the media content in manners more suitable to the user (e.g. full-screen video or in an independent resizable window).”

In Firefox 3.5, the controls attribute currently creates the following controls:

  • play/pause button (toggles between the two)
  • slider for current playback position and seeking (also displays how much of the video has currently been downloaded)
  • duration display
  • roll-over button for volume on/off and to display slider for volume
  • FAIK fullscreen is not currently implemented

Further, the HTML5 specification prescribes that if the @controls attribute is not available, “user agents may provide controls to affect playback of the media resource (e.g. play, pause, seeking, and volume controls), but such features should not interfere with the page’s normal rendering. For example, such features could be exposed in the media element’s context menu.”

In Firefox 3.5, this has been implemented with a right-click context menu, which contains:

  • play/pause toggle
  • mute/unmute toggle
  • show/hide controls toggle

When the controls are being displayed, there are keyboard shortcuts to control them:

  • space bar toggles between play and pause
  • left/right arrow winds video forward/back by 5 sec
  • CTRL+left/right arrow winds video forward/back by 60sec
  • HOME+left/right jumps to beginning/end of video
  • when focused on the volume button, up/down arrow increases/decreases volume

As for exposure of these controls to screen readers, Mozilla implemented this in June, see Marco Zehe’s blog post on it. It implies having to use focus mode for now, so if you haven’t been able to use keyboard for controlling the video element yet, that may be the reason.

New video accessibility work

My work is actually meant to take video accessibility a step further and explore how to deal with what I call time-aligned text files for video and audio. For the purposes of accessibility, I am mainly concerned with subtitles, captions, and audio descriptions that come in textual form and should be read out by a screen reader or made available to braille devices.

I am exploring both, time-aligned text that comes within a video file, but also those that are available as external Web resources and are just associated to the video through HTML. It is this latter use case that my demo explored.

To create a nice looking demo, I used a skin for the video player that was developed by somebody else. Now, I didn’t pay attention to whether that skin was actually accessible and this is the source of most of the problems that have been mentioned to me thus far.

A new, simpler demo I have now developed a new demo that uses the default player controls which should be accessible as described above. I hope that the extra button that I implemented for the menu with all the text tracks is now accessible through a screen reader, too.

UPDATE: Note that there is currently a bug in Firefox that prevents tabbing to the video element from working. This will be possible in future.

More video accessibility work

It’s already old news, but I am really excited about having started a new part-time contract with Mozilla to continue pushing the HTML5 video and audio elements towards accessibility.

My aim is two-fold: firstly to improve the HTML5 audio and video tags with textual representations, and secondly to hook up the Ogg file format with these accessibility features through an Ogg-internal text codec.

The textual representation that I am after is closely based on the itext elements I have been proposing for a while. They are meant to be a simple way to associate external subtitle/caption files with the HTML5 video and audio tags. I am initially looking at srt and DFXP formats, because I think they are extremes of a spectrum of time-aligned text formats from simple to complex. I am preparing a specification and javascript demonstration of the itext feature and will then be looking for constructive criticism from accessibility, captioning, Web, video and any other expert who cares to provide input. My hope is to move the caption discussion forward on the WHATWG and ultimately achieve a cross-browser standard means for associating time-aligned text with media streams.

The Ogg-internal solution for subtitles - and more generally for time-aligned text - is then a logical next step towards solving accessibility. From the many discussions I have had on the topic of how best to associate subtitles with video I have learnt that there is a need for both: external text files with subtitles, as well as subtitles that are multiplexed with the media into a single binary fie. Here, I am particularly looking at the Kate codec as a means of multiplexing srt and DFXP into Ogg.

Eventually, the idea is to have a seamless interface in the Web Browser for dealing with subtitles, captions, karaoke, timed metadata, and similar time-aligned text. The user interaction should be identical no matter whether the text comes from within a binary media file or from a secondary Web resource. Once this seamless interface exists, hooking up accessibility tools such as screen readers or braille devices to the data should in theory be simple.

Progress on captions for HTML5 video

Paul Rouget this week published another example implementation for using srt with HTML5 video with a javascript library. This is at least the fourth javascript implementation that I know of for attaching srt subtitles to the video element.

It is great to see such a huge need for this. At the same time I am also worried about the amount of incompatible implementations of this feature. It will inhibit search engines from realising which text relates to and describes a particular video. It will also inhibit accessibility technology such as screen readers or braille devices from realising there is text that would be necessary to be rendered.

A standard means of associating srt (or other format) subtitle files with the video tag is really necessary. So, where are we at with this?

Recently, Greg Millam from Google posted a proposal to WHATWG, that shares a lot of elements with the proposal that has been previously discussed between Mozilla, Xiph, and Opera, the current state of which is summarised in the Mozilla wiki. No implementation into a Browser has been made yet, but initial implementations in javascript exist. I think that we will ultimately come out with a harmonised solution between the browser vendors. It just needs implementation work and continuous improvement.

At the same time, in-band captions that come multiplexed within the Ogg file are also being progressed. At Xiph we are now focusing on using Ogg Kate for these purposes - it really don’t make much sense to invent another codec when Ogg Kate is already so close to solving most problems. So, between the developer of Ogg Kate and myself, we are preparing a Google Summer of Code project that should see a implementation for Firefox 3.1 that is capable of extracting the text from an Ogg file that has a Kate track and displaying that track as though it was a srt file. If you are interested, shoot me an email!

UPDATE: Firefox 3.1 is apparently now called Firefox 3.5 - sorry guys. :-)

ANOTHER UPDATE: My post seemed to imply that Firefox 3.5 will have Ogg Kate support. This is not the case. There is a patch for Firefox and liboggplay to provide Ogg Kate support into Firefox and this patch will be the basis of the Summer of Code project. The student will then work mostly on implementing a comprehensive javascript library to display Ogg Kate encoded time-aligned text (read: captions, Karaoke etc) in the Web browser. This is a proof-of-concept and a first step towards standardising the handling of time-aligned text in Web browsers that suppor the HTML5 video tag.

LCA 2009 talk on video accessibility

During the LCA 2009 Multimedia Miniconf, I gave a talk on video accessibility. Videos have been recorded, but haven’t been published yet. But here are the talk slides:

Lca2009 Video A11y

View more presentations or upload your own. (tags: deaf captions)

I basically gave a very brief summary of my analysis of the state of video accessibility online and what should be done. More information can be found on the Mozilla wiki.

The recommendation is to support the most basic and possibly the most widely online used of all subtitle/captioning formats first: srt. This will help us explore how to relate to out-of-band subtitles for a HTML5 video tag - a proposal of which has been made to the WHATWG and is presented in the slides. It will also help us create Ogg files with embedded subtitles - a means of encapsulation has been proposed in the Xiph wiki.

Once we have experience with these, we should move to a richer format that will also allow the creation of other time-aligned text formats, such as ticker text, annotations, karaoke, or lyrics.

Further, there is non-text accessibility data for videos, e.g. sign language recordings or audio annotations. These can also be multiplexed into Ogg through creating secondary video and audio tracks.

Overall, we aim to handle all such accessibility data in a standard way in the Web browser to achieve a uniform experience with text for video and a uniform approach to automating the handling of text for video. The aim is:

  • to have a default styling of time-aligned text categories,
  • to allow styling override of time-aligned through CSS,
  • to allow the author of a Web page with video to serve a multitude of time-aligned text categories and turn on ones of his/her choice,
  • to automatically use the default language and accessibility settings of a Web browser to request appropriate time-aligned text tracks,
  • to allow the consumer of a Web page with video to manually select time-aligned text tracks of his/her choice, and
  • to do all of this in the same way for out-of-band and in-line time-aligned text.

At the moment, none of this is properly implemented. But we are working on a liboggtext library and are furher discussing how to include out-of-band text with the video in the Webpage - e.g. should it go into the Webpage DOM or in a separate browsing context.

If you feel strongly about video a11y, get involved at http://lists.xiph.org/mailman/listinfo/accessibility.

Embedding time-aligned text into Ogg

As part of my accessibility work for Mozilla and Xiph, it is necessary to define how time-aligned text such as subtitles, captions, or annotations, are encapsulated into Ogg. In the fansubber community this is called “hard subtitles” as opposed to “soft subtitles” which are subtitles that stay in a text file and are loaded separately to the video file into a media player and synchronised with the video by the media player. (as per comment below, all text annotations are “soft” - or also “closed”.)

I can hear you ask: so how do I do subtitles/captions with Ogg now? Well, it would have been possible to simply choose one subtitling format and map that into Ogg, then ask everyone to just use that one format and be done. But which one to choose? And why prefer a simpler one over a more complex one? And why just do subtitles and not any other time-aligned text?

So, instead, I analysed what types of time-aligned text “codecs” I have come across. Each one would have a multitude of text formats to capture the text data, because it is easy to invent a new format and standardisation hasn’t really happened in this space yet.

I have come up with the following list of typical time-aligned text codecs:

  • CC: closed captions (for the deaf)
  • SUB: subtitles
  • TAD: textual audio descriptions (for the blind - to be transferred to braille or TTS)
  • KTV: karaoke
  • TIK: ticker text
  • AR: active regions
  • NB: metadata & semantic annotations
  • TRX: transcripts / scripts
  • LRC: lyrics
  • LIN: linguistic markup
  • CUE: cue points, DVD style chapter markers and similar navigational landmarks

Let me know if you can think of any other classes of video/audio-related time-aligned text.

All of these texts can be represented in text files with some kind of time marker, and possibly some header information to set up the interpretation environment. So, the simplest way of creating a representation of these inside Ogg was to define a generic mapping for time-aligned text into Ogg.

The Xiph wiki holds the current draft specification for mapping text codecs into Ogg. For anyone wanting to map a text codec into Ogg, this should provide the framework. The idea is to separate the text codec’s data into header data and into timed text segments (which can have all sorts of styling and other information with it). Then, the mapping is simple. An example for srt is described on the wiki page.

The specification is still in draft status, because we’re still expecting feedback. In fact, what we now need is people trying an implementation and providing fixes to the specification.

To map your text codec of choice into Ogg, you will probably requrie further mapping specifications. Dependent on how complex your text codec of choice is, these additional mapping specifications may be rather simple or quite complicated. In the case of srt, it should be trivial. Considering the massive amount of srt already freely available online, the srt mapping may well have a really large impact. Enough hits. Let me know if you’re coding up something!

My next duty is to look for a representation that is generic enough to provide representations for any of the above listed text codecs. This representation is what will need to be available to a Web Browser when working with a Web video that has related text. Current contenders are OggKate and W3C TimedText, but I am not sure if either are too restrictive. I am indeed looking for the next generation of captioning technology that will be able to provide any type of time-aligned text that relates to audio/video.

W3C Technical Plenary / Advisory Committee Meetings Week 2008

I spent last week in France, near Cannes, at the W3C TPAC meeting. This is the one big meeting that the W3C has every year to bring together all (or most) of the technical working groups and other active groups at the W3C.

It was not my first time at a standards body meeting - I have been part of ISO/MPEG before and also of IETF, and spoken with people at IEEE and SMPTE. However, this time was different. I felt like I was with people that spoke my language. I also felt like my experience was valued and will help solving some of the future challenges for the Web. I am very excited to be an invited expert on the Media Fragments and Media Annotations working groups and be able to provide input into HTML5.

In the Media Fragments working group we are developing a URI addressing scheme that enables direct linking to media fragments, in particular temporal and spatial segments. Experience from our earlier temporal URI scheme is one of the inputs to the scheme. Currently it looks likely that we will choose a scheme that has ”#” in it and then require changes to browsers, Web proxys, and servers to enable delivery of media fragments.

In the Media Annotations working group we are deciding upon an ontology to generically describe media resources - something based on Dublin Core but more extended and more appropriate for audio and video. We are currently looking at Adobe’s XMP specification.

As for HTML5 - there was not much of a discussion at the TPAC meeting about the audio and video elements (unless I missed it by attending the other groups). However, from some of the discussions it became clear to me that they are still in very early stage of specification and much can be done to help define the general architecture of how to publish video on the Web and its metadata, help define javascript APIs and DOM models, and help define accessibility.

I actually gave a lightning talk about the next challenges of HTML5 video at TPAC (see my “video slides”) which points out the need for standard definitions of video structure and annotations together with an API to reach them. I had lots of discussions with people afterwards and also learnt a lot more about how to do accessibility for Web video. I should really write it up in an article…

Of course, I also met a lot of cool people at TPAC, amongst them Larry Masinter, Ian Hickson, and Tim Berners-Lee - past and new heros of Web standards. :-) It was totally awesome and I am very grateful to Mozilla for sending me there and enabling me to learn more about the greater picture of video accessibility and the role it plays on the Web.

Video Accessibility for Firefox

Ogg has struggled for the last few years to recommend the best format to provide caption and subtitle support for Ogg Theora. The OGM fork had a firm focus on using subtitles in SRT, SSA or VobSub format. However, in Ogg we have always found these too simplistic and wanted a more comprehensive solution. The main aim was to have timed text included into the video stream in a time-aligned fashion. Writ, CMML, and now Kate all do this. And yet, we have still not defined which is the one format that we want everybody to support as the caption/subtitle format.

With Ogg Theora having been chosen by Mozilla as the baseline video codec for Firefox and the HTML5

As a first step in this direction, Mozilla have contracted me to analyse the situation and propose a way forward.

The contract goes beyond simple captions and subtitles though: it analyses all accessibility requirements for video, which includes audio annotations for the blind, sign language video tracks, and also transcripts, karaoke, and metadata tracks as more generic timed text example tracks. The analysis will thus be about how to enable a framework for creating a timed text track in Ogg and which concrete formats should be supported for each of the required functionalities.

While I can do much of the analysis myself, a decision on how to move forward can only be made with lots of community input. The whole process of this analysis will therefore be an open one with information being collected on the Mozilla Wiki, see https://wiki.mozilla.org/Accessibility/Video_Accessibility .

An open mailing list is also set up at Xiph to create a discussion forum for video accessibility: accessibility@lists.xiph.org. Join there if you’d like to provide input. I am particularly keen for people with disabilities to join because we need to get it right for them!

I am very excited about this project and feel honoured for being supported to help solve accessibility issues for Ogg and Firefox! Let’s get it right!