Embedding HTML5 polyglot + schema.org microdata into NewsML-G2

Summary

XHTML5 (HTML5 polyglot delivered with the Content-Type header "application/xhtml+xml"), extended with schemas for structured data markup in microdata syntax, is a powerful data structure for delivering news content in a standardized way. Since HTML5 polyglot documents are well-formed XML, they can be parsed with XPath and easily embedded into NewsML-G2 contentSet/inlineXML elements. Both the HTML5 polyglot markup and the structured data can be checked with validation services. An open question is how a document in the inlineXML element can be identified as HTML5 polyglot, which is needed to distinguish it from XHTML 1.x. An extension of the NewsML-G2 specification might be required.
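
As a minimal sketch of the embedding described above, the following Python snippet wraps a small XHTML5 document in a contentSet/inlineXML fragment and queries the result with XPath. The helper name embed_in_inline_xml and the sample markup are assumptions made for illustration. The fragment is not a complete NewsML-G2 newsItem, and the contenttype value alone does not settle the open question of telling HTML5 polyglot apart from XHTML 1.x.

```python
import xml.etree.ElementTree as ET

NAR_NS = "http://iptc.org/std/nar/2006-10-01/"  # NewsML-G2 namespace
XHTML_NS = "http://www.w3.org/1999/xhtml"       # HTML namespace


def embed_in_inline_xml(xhtml: str) -> str:
    """Wrap an XHTML5 document in a contentSet/inlineXML fragment (hypothetical helper)."""
    if xhtml.lstrip().startswith(("<?xml", "<!DOCTYPE")):
        # Neither an XML declaration nor a DOCTYPE may appear inside an element.
        raise ValueError("strip the XML declaration and DOCTYPE before embedding")
    ET.fromstring(xhtml)  # raises ET.ParseError if the markup is not well-formed XML
    return (
        f'<contentSet xmlns="{NAR_NS}">'
        f'<inlineXML contenttype="application/xhtml+xml">{xhtml}</inlineXML>'
        "</contentSet>"
    )


# Sample document, shortened for illustration.
xhtml = (
    f'<html xmlns="{XHTML_NS}">'
    '<head><meta charset="UTF-8"/><title>Bacon ipsum dolor ...</title></head>'
    "<body><p>Short loin...</p></body>"
    "</html>"
)

# Parse the wrapped fragment and address the embedded title with a namespaced path.
root = ET.fromstring(embed_in_inline_xml(xhtml))
ns = {"nar": NAR_NS, "h": XHTML_NS}
title = root.find("nar:inlineXML/h:html/h:head/h:title", ns)
print(title.text)
```

Building the wrapper by string concatenation, rather than re-serializing a parsed tree, keeps the embedded markup byte-for-byte as delivered, including its default namespace declaration.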

Definitions

XHTML5

XHTML5 is a synonym for "HTML5 serialized as XML". Serializing HTML5 this way is fully compliant with the W3C specification, and it is supported by virtually all common browsers.

HTML5 polyglot

"A document that uses polyglot markup is a document that is a stream of bytes that parses into identical document trees ... when processed either as HTML or when processed as XML. Polyglot markup that meets a well-defined set of constraints is interpreted as compatible, regardless of whether it is processed as HTML or as XHTML, per the HTML5 specification. Polyglot markup uses a specific DOCTYPE, namespace declarations, and a specific case—normally lower case but occasionally camel case—for element and attribute names. Polyglot markup uses lower case for certain attribute values. Further constraints include those on void elements, named entity references, and the use of scripts and style.“

Source: Polyglot Markup: A robust profile of the HTML5 vocabulary

HTML5 polyglot is basically HTML5 serialized as XML and is a subset of valid HTML5. It is identified by the HTTP header "Content-Type: application/xhtml+xml", the doctype declaration "<!DOCTYPE html>" and the HTML namespace "http://www.w3.org/1999/xhtml". There are several other namespaces for elements not covered by the HTML specification; see http://www.w3.org/TR/2011/WD-html5-20110405/namespaces.html
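
The document-level markers named above can be checked mechanically. The Python sketch below is a heuristic under stated assumptions, not a polyglot validator: the function name looks_like_polyglot is hypothetical, the HTTP header is outside its scope, and it only tests for the doctype declaration, XML well-formedness and the namespace of the root element.

```python
import re
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# The doctype is matched case-sensitively, as it appears in the text above.
DOCTYPE_RE = re.compile(r"^\s*<!DOCTYPE html\s*>")


def looks_like_polyglot(markup: str) -> bool:
    """Heuristic check (hypothetical helper): doctype, well-formedness, root namespace."""
    if not DOCTYPE_RE.match(markup):
        return False
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        # Unclosed void elements or undefined named entities end up here.
        return False
    return root.tag == f"{{{XHTML_NS}}}html"


sample = (
    "<!DOCTYPE html>\n"
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Bacon ipsum dolor ...</title></head>"
    "<body><p>Short loin...<br/></p></body>"
    "</html>"
)

print(looks_like_polyglot(sample))                          # well-formed, namespaced
print(looks_like_polyglot(sample.replace("<br/>", "<br>")))  # unclosed void element
```

Because the doctype cannot be carried along inside an inlineXML element, a check of this kind only applies to the standalone document, which is one reason the identification question remains open.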

schema.org Vocabulary and Microdata

Pro and Contra for Using HTML5 polyglot + schema.org Microdata with NewsML-G2

HTML5 polyglot

schema.org microdata

https://schema.org/docs/extension.html

Alternatives

Resources

HTML5

Specification

W3C HTML Polyglot Candidate Recommendation

Articles

Benefits of polyglot XHTML

Validators

validator.nu

W3C Markup Validation Service

Microdata

Specification

schema.org

IPTC rNews

Validators

Google Structured Data Testing Tool

Bing Markup Validator

Yandex Structured Data Validator

Structured Data Linter

Examples

<html xmlns="http://www.w3.org/1999/xhtml">
   <head>
       <meta charset="UTF-8"/>
       <title>Bacon ipsum dolor ...</title>
   </head>
   <body>
   <article itemscope="itemscope" itemtype="http://schema.org/NewsArticle">
       <h1 itemprop="headline">Bacon ipsum...
       </h1>
       <div itemprop="description">
           <p>Short loin...
           </p>
       </div>
       <!-- itemprop sits on the element that carries itemscope, so the ImageObject item itself becomes the value of associatedMedia -->
       <figure itemprop="associatedMedia" itemscope="itemscope" itemtype="http://schema.org/ImageObject">
           <img itemprop="image" src="Forelle.jpg" width="841" height="581"/>
           <figcaption>
               <div itemprop="headline">Europäische Forelle</div>
               <div itemprop="description">
                   <p>Die Bachforelle...</p>
               </div>
               <div itemprop="author">Don Alfonso</div>
           </figcaption>
       </figure>
       <div itemprop="articleBody">
           <p>Meatloaf ham..
           </p>
           <figure itemscope="itemscope" itemtype="http://schema.org/ImageObject">
               <img itemprop="image" src="Flunder.jpg" width="398" height="240"/>
               <figcaption>
                   <div itemprop="headline">Flunder</div>
                   <div itemprop="description">
                       <p>Sie ist flach, flach, flach...</p>
                   </div>
                   <div itemprop="author">Don Fernando</div>
               </figcaption>
           </figure>
           <p>Chuck corned beef.</p>
       </div>
       <div itemscope="itemscope" itemtype="http://www.w3.org/2004/02/skos/corefoobar">
           <span itemprop="definition">Explanation...</span>
       </div>
   </article>
   </body>
</html>
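
To illustrate how microdata in such a document could be read with ordinary XML tooling, here is a rough Python sketch. It is a simplification under stated assumptions, not the full microdata extraction algorithm: the helper names text_of and item_properties are hypothetical, itemref and multi-valued itemprop attributes are ignored, and the sample is a shortened variant of the example in which itemprop is placed directly on the figure element.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
NS = {"h": XHTML_NS}


def text_of(el):
    """Concatenate all text inside an element and collapse whitespace."""
    return " ".join("".join(el.itertext()).split())


def item_properties(item):
    """Collect the itemprop values of one item; nested items become nested dicts."""
    props = {}

    def walk(el):
        for child in el:
            name = child.get("itemprop")
            if name is not None:
                if child.get("itemscope") is not None:
                    value = item_properties(child)   # nested item
                elif child.tag == f"{{{XHTML_NS}}}img":
                    value = child.get("src")         # URL-valued property
                else:
                    value = text_of(child)           # plain text property
                props.setdefault(name, []).append(value)
            if child.get("itemscope") is None:
                walk(child)  # properties of a nested item belong to that item only

    walk(item)
    return props


SAMPLE = """
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><meta charset="UTF-8"/><title>Bacon ipsum dolor ...</title></head>
  <body>
    <article itemscope="itemscope" itemtype="http://schema.org/NewsArticle">
      <h1 itemprop="headline">Bacon ipsum...</h1>
      <figure itemprop="associatedMedia" itemscope="itemscope" itemtype="http://schema.org/ImageObject">
        <img itemprop="image" src="Forelle.jpg" width="841" height="581"/>
        <figcaption><div itemprop="author">Don Alfonso</div></figcaption>
      </figure>
    </article>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE)
# ElementTree supports a limited XPath subset, including attribute predicates.
article = root.find(".//h:article[@itemtype='http://schema.org/NewsArticle']", NS)
props = item_properties(article)
print(props["headline"])
print(props["associatedMedia"][0]["image"])
```

The same paths work unchanged when the document sits inside an inlineXML element, since the HTML namespace of the embedded elements is preserved.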