WebKR 2008: XML
Representing, accessing, and transforming data

Eyal Oren, eyal@cs.vu.nl

Overview: data on the Web

  • HTML: Web documents
  • CSS: reusable layouts
  • XML: arbitrary Web data
  • XSD/DTD: XML schemas
  • XQuery/XPath: accessing XML data
  • XSLT: transforming XML data

HTML

  • W3C standard for hypertext markup language (1992, 1999)
  • Defines set of tags and their mandatory/optional interpretation
    • Example tags: <html>, <head>, <body>, <h1>, <p>, <b>
    • Head: metadata
    • Body: structure and content
  • With HTML one also defines page layout:
    • Absolute: absolute sizes and positioning
    • Relative: percentage sizes and positioning
    • For one page (inline)
    • For sets of pages (external CSS)

Example HTML: VU website

        <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
        <html>
          <head>
            <title>Vrije Universiteit Amsterdam</title>
            ...
          </head>
        
          <body>
            <h1>Example of HTML document</h1>
        
            <p id="top">
              Here is a paragraph with a sample reference to the <a 
              href="http://www.few.vu.nl">FEW section of the VU</a>. 
            </p>
        
            <p>
              Another paragraph, referring to the <a href="#top">top of this page</a>.
            </p>
          </body>
        </html>
      

HTML: inline layout

<p align="center">
<hr width="20%">
<img height="212px">
<table cellspacing="50%">
<frameset cols="20%,80%">

HTML: external layout (CSS)

  • W3C standard for cascading style sheets (1998)
        <html>
          <head>
            <title>Example</title>
            <link rel="stylesheet" href="style.css" type="text/css"/>
          </head>
          <body>
          </body>
        </html>
      
        h1 {
          font-size: 120%;
          font-weight: bold;
        }
        img.bordered {
          border: 1px solid black
        }
      

Why external CSS?

  • Separate style from content
  • CSS defines presentation model
    • Reuse layout
    • Different style for different situations/media/people
    • Reduce bandwidth: CSS cached on client side, smaller HTML pages

XML

  • W3C standard for extensible markup language (1998)
  • Extensible set of tags
  • Describe and exchange arbitrary information (not just Web documents)
  • Markup follows syntactic rules: check for well-formedness
  • Document schema may be defined in DTD or XSD: check for validity

XML in the Web stack

XML in the Web stack (2)

XML syntax (1)

  • Hierarchy of elements (XML element tree)
  • Elements have names (tags), values (content) and attributes
  • Elements can be nested
        <?xml version="1.0" encoding="UTF-8"?>
        
        <country name= "The Netherlands" > 
          <geography>
            <capital name= "Amsterdam" >
              <remark> The Hague is the seat of the government </remark> 
            </capital>
            <neighboring_country> Germany </neighboring_country>
            <neighboring_country> Belgium </neighboring_country>
          </geography> 
        </country>
      

XML syntax (2)

  • Prolog: XML declaration header (optional in XML 1.0, but recommended)
    <?xml version="1.0" encoding="utf-8"?>
  • Elements: basic components
    • start-tag, end-tag and content
    • the root must be unique
    • content can be text, other elements (nested) or nothing
  • Attributes: name-value pair inside tag
    <person firstname="John" lastname="Smith"/>

XML syntax (3)

        <?xml version="1.0" encoding="UTF-8"?>
        <?xml-stylesheet type="text/css" href="style.css"?>
        
        <country name= "The Netherlands" > 
          <geography>
            <capital name= "Amsterdam" >
              <remark> the seat of the government is The Hague </remark> 
            </capital>
            <neighboring_country> Germany </neighboring_country>
            <neighboring_country> Belgium </neighboring_country>
          </geography> 
        
          <!-- Should be extended with other data -->
        </country>
      

XML syntax (4)

  • comments: ignored by parser
    <!-- comment -->
  • processing instructions: passed to application
                <?xml-stylesheet type="text/css" href="style.css"?>
              
  • XML is well-formed if:
    • nesting is well-balanced
    • attribute names unique within element
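
A minimal sketch of a well-formedness check, assuming Python's standard `xml.etree.ElementTree` parser is an acceptable stand-in for "the parser". The helper name `is_well_formed` and the three sample strings are illustrative and do not come from the slides.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical samples: one correct, one with crossed nesting,
# one with a repeated attribute name.
balanced = "<name><first>Ada</first><last>Lovelace</last></name>"
crossed = "<name><first>Ada<last>Lovelace</first></last></name>"
duplicate_attr = '<person id="1" id="2"/>'

for label, doc in [("balanced", balanced),
                   ("crossed", crossed),
                   ("duplicate_attr", duplicate_attr)]:
    print(label, is_well_formed(doc))
```

The parser only checks syntax here; whether the document follows a schema is a separate question.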

Well-formed?

  •             <name>
                  <firstName>Vincent
                  <lastName>van Gogh
                  </firstName>
                  </lastName>
                </name>
              
  •             <name>
                <firstName>Vincent</firstName>
                <lastName>van Gogh</lastName>
              

XML Namespaces

  • Combining documents can lead to naming collisions: book title and recipe title
  • Namespaces provide a naming context (a URI) for elements and attributes
  • Namespace prefixes provide shorthand notation
        <collection
          xmlns:books="http://www.oclc.org/books/1.0/"
          xmlns:webpage="http://www.w3c.org/html/1.0/">
        
          <book>
            <books:title>Gulliver's travels</books:title>
          </book>
          <web>
            <webpage:title>My first homepage</webpage:title>
          </web>
        </collection>
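
A small sketch of how a namespace-aware parser sees the two `title` elements above, assuming Python's `xml.etree.ElementTree`. The short prefixes `b` and `w` in the lookup table are arbitrary choices for this example; only the URIs have to match the document.

```python
import xml.etree.ElementTree as ET

doc = """<collection
  xmlns:books="http://www.oclc.org/books/1.0/"
  xmlns:webpage="http://www.w3c.org/html/1.0/">
  <book><books:title>Gulliver's travels</books:title></book>
  <web><webpage:title>My first homepage</webpage:title></web>
</collection>"""

root = ET.fromstring(doc)

# The parser replaces each prefix by its URI, so the two titles get
# different names even though both are written as "title".
qualified_names = [el.tag for el in root.iter() if el.tag.endswith("title")]

# Prefixes used in queries are local to this mapping (an assumption of
# the example); they need not equal the prefixes used in the document.
ns = {"b": "http://www.oclc.org/books/1.0/",
      "w": "http://www.w3c.org/html/1.0/"}
book_titles = [el.text for el in root.findall(".//b:title", ns)]
page_titles = [el.text for el in root.findall(".//w:title", ns)]

print(qualified_names)
print(book_titles, page_titles)
```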
      

XML data model (1)

  • Well-formed: if document syntactically correct
  • Syntax: alphabet and grammar: '<', '/', '&', etc.
  • Data model: how to interpret syntax

XML data model (2)

  • Ordered labelled tree
  • Exactly one root, no cycles
  • Every non-root node has one parent
  • Every node can have a label
  • Order is important for elements
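
The tree view can be made concrete with a short sketch, assuming Python's `xml.etree.ElementTree`. The helper `outline` and the shortened country document are illustrative only.

```python
import xml.etree.ElementTree as ET

doc = """<country name="The Netherlands">
  <geography>
    <capital name="Amsterdam"/>
    <neighboring_country>Germany</neighboring_country>
    <neighboring_country>Belgium</neighboring_country>
  </geography>
</country>"""

def outline(node, depth=0):
    """Return the element labels in document order, indented by depth."""
    lines = ["  " * depth + node.tag]
    for child in node:  # children are visited in document order
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(doc)
tree_lines = outline(root)
print("\n".join(tree_lines))

# Every non-root element appears exactly once as somebody's child,
# so a child -> parent map has one entry per non-root element.
parent_of = {child: parent for parent in root.iter() for child in parent}
print(len(parent_of), root in parent_of)
```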

XML data model (3)

  • XML data model: limited meaning
    • elements with names
    • parent-child relations
    • element values
  • Not:
    • concepts/classes
    • concept properties
    • class hierarchy
  • XML defines document structure

Beyond XML syntax

  • XML represents structured information
  • XML allows for arbitrary structures
  • Structure can be agreed upon and described using DTDs or XSDs
  • Validity of instance documents can be verified against these schemas
  • XML document is valid if: well-formed and conforms to XSD/DTD

Structuring using DTDs

  • DTD: document type definition
  • Associated using 'DOCTYPE' statement
  • Element nesting, order, multiplicity, attributes
  • Only few datatypes
  •           <!ELEMENT country (geography, people, economy)>
              <!ATTLIST country
                name CDATA #REQUIRED>
              <!ELEMENT geography (capital, neighboring_country*)>
              <!ELEMENT capital (remark*)>
              <!ATTLIST capital
                name CDATA #REQUIRED>
              <!ELEMENT remark (#PCDATA)>
              <!ELEMENT neighboring_country (#PCDATA)>
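
Python's standard library does not validate against DTDs, so the following is only a hand-coded sketch of what "valid" adds on top of "well-formed". It encodes two of the element declarations above as regular expressions over child element names; the table `CONTENT_MODELS` and the function `conforms` are assumptions of this example, and attributes and `#PCDATA` are ignored.

```python
import re
import xml.etree.ElementTree as ET

# Content models from the DTD, as regular expressions over the
# space-separated names of an element's children:
#   geography (capital, neighboring_country*)
#   capital   (remark*)
CONTENT_MODELS = {
    "geography": r"capital( neighboring_country)*",
    "capital": r"(remark( remark)*)?",
}

def conforms(element) -> bool:
    """Check an element and all its descendants against CONTENT_MODELS."""
    model = CONTENT_MODELS.get(element.tag)
    if model is not None:
        children = " ".join(child.tag for child in element)
        if re.fullmatch(model, children) is None:
            return False
    return all(conforms(child) for child in element)

# Both samples are well-formed; only the first follows the DTD's order.
valid = ET.fromstring(
    "<geography><capital name='Amsterdam'/>"
    "<neighboring_country>Germany</neighboring_country></geography>")
invalid = ET.fromstring(
    "<geography><neighboring_country>Germany</neighboring_country>"
    "<capital name='Amsterdam'/></geography>")

print(conforms(valid), conforms(invalid))
```

A real validator would read the DTD itself instead of a hand-written table.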
            

Structuring using XSDs

  • W3C standard for XML Schema Definition (2001)
  • Schema language like DTD, but:
    • Expressed in XML
    • Several datatypes
    • Richer grammar
        <complexType name="capital">
          <sequence>
            <element name="name" type="string"/>
            <element ref="remark" maxOccurs="unbounded"/>
          </sequence>
        </complexType>
      

XSD grammar

  • Cardinality: minOccurs, maxOccurs
  • Content models: choice, sequence, all
  • Attribute values: default, fixed
        <complexType name="WindowsType">
          <sequence>
            <element name="version" type="string" minOccurs="0"
              maxOccurs="1" default="W98"/>
            <element name="includedBrowser" type="string"
              minOccurs="0" maxOccurs="1" fixed="Internet Explorer"/>
          </sequence>
        </complexType>
      
        <schema
          xmlns="http://www.w3.org/2001/XMLSchema"
          xmlns:xsd="http://www.w3.org/2001/XMLSchema"
          xmlns:po="http://www.example.com/purchaseOrder"
          targetNamespace="http://www.example.com/purchaseOrder">
        
          <element name="purchaseOrder" type="po:type"/>
          <element name="comment"       type="string"/>
          <element name="anotherComment" type="xsd:string"/>
          <!-- etc. -->
        </schema>
      

Summary HTML and XML

  • HTML: fixed set of tags, represent document structure and layout, presentation model defined in CSS
  • XML: arbitrary set of tags, schema may be specified in DTD or XSD
    • well-formed: correct syntax, nested tags
    • valid: conforms to schema definition
  • XHTML: HTML 4 reformulated in XML

  • How to represent XML data in HTML (web page)?
  • How to transform XML document?
  • How to query XML data?

Querying XML

  • Database: formulate query (SQL), run on data
                SELECT student_id, name FROM student WHERE firstyear = '2006'
              
  • HTML: use fragment IDs to point to part of document
                http://www.cia.gov/cia/publications/factbook/index.html#chiefsofstate
              
  • XML: formulate query (XQuery), run on data
    /country/geography/capital[@name="Amsterdam"]

XQuery path expressions (XPath)

  • Describes how to reach (set of) elements
  • Traverses the XML element tree
/country/geography/capital[@name="Amsterdam"]

XQuery and XPath

  • Selection: element path (SQL SELECT)
  • Filtering: brackets '[]' (SQL WHERE)
  • Descendants: '//' (anywhere below the current node)
  • Attribute names prepended with '@'
  • All country elements that neighbour Germany
    /country[geography/neighboring_country="Germany"]
  • All capital elements
    //capital

XQuery results:

//capital
        <xql:result>
          <capital name="Algiers"/>
          <capital name="Amsterdam">
            <remark> the seat of the government is The Hague </remark>
          </capital>
          <capital name="Berlin"/>
          <capital name="Bogota"/>
          <capital name="Buenos Aires"/>
           ....
        </xql:result>
      

XQuery filters:

  • Filter expressions can use values
    /country[geography/neighboring_country="Germany"]
  • But also sequence numbers
    //author[3]
  • Relative order
    //author[3]/book[last()]
  • Negation
    //author[3]/book[not(@title)]
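
The filter kinds can be tried out with Python's `xml.etree.ElementTree`, which implements only a subset of XPath: paths are written relative to the parsed root, and functions such as `not()` are not available. The two-country document below is made up for this sketch.

```python
import xml.etree.ElementTree as ET

doc = """<countries>
  <country name="The Netherlands">
    <geography>
      <capital name="Amsterdam"/>
      <neighboring_country>Germany</neighboring_country>
      <neighboring_country>Belgium</neighboring_country>
    </geography>
  </country>
  <country name="Spain">
    <geography>
      <capital name="Madrid"/>
      <neighboring_country>Portugal</neighboring_country>
      <neighboring_country>France</neighboring_country>
    </geography>
  </country>
</countries>"""
root = ET.fromstring(doc)

# Descendant search, like //capital
capitals = [c.get("name") for c in root.findall(".//capital")]

# Value filter, like //geography[neighboring_country="Germany"]
borders_germany = root.findall(".//geography[neighboring_country='Germany']")

# Position filters, like neighboring_country[1] and [last()];
# positions count among siblings under the same parent.
first = [n.text for n in root.findall(".//neighboring_country[1]")]
last = [n.text for n in root.findall(".//neighboring_country[last()]")]

# Attribute filter, like //capital[@name="Madrid"]
madrid = root.findall(".//capital[@name='Madrid']")

print(capitals, len(borders_germany), first, last, len(madrid))
```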

Note about querying

  • What do we query? The datamodel (tree)
  • What does the datamodel mean?
  • We need to know element names, attributes, and nesting
  • Querying requires:
    • Agreement on meaning of elements/attributes
    • Agreement on document structure
    • Fulfilled if document conforms to DTD/XSD

Exercise

        <?xml version="1.0" encoding="iso-8859-1" ?>
        <country name= "The Netherlands" > 
          <geography>
            <capital name= "Amsterdam" >
              <remark> The Hague is the seat of the government </remark> 
            </capital>
            <neighboring_country>Germany</neighboring_country>
            <neighboring_country>Belgium</neighboring_country>
          </geography> 
        </country>
      
  • Give the (XPath) expression for the element neighboring_country
  • Give the expression for the second element neighboring_country
  • Give the expression for the value of the name attribute in the capital element
  • Give the expression for the text value of the remark tag

Exercise answers

  • Give the (XPath) expression for the element neighboring_country
    /country/geography/neighboring_country
  • Give the expression for the second element neighboring_country
    /country/geography/neighboring_country[2]
  • Give the expression for the name attribute in the capital element
    /country/geography/capital/@name
  • Give the expression for the text value of the remark tag
    /country/geography/capital/remark/text()
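
The answers can be checked with a short sketch, assuming Python's `xml.etree.ElementTree`. That module's XPath subset has no `@name` or `text()` result steps, so the attribute and text values are read with `.get()` and `.text` after selecting the element. The paths are relative to the root element `country`, and the XML declaration is left out for brevity.

```python
import xml.etree.ElementTree as ET

doc = """<country name="The Netherlands">
  <geography>
    <capital name="Amsterdam">
      <remark> The Hague is the seat of the government </remark>
    </capital>
    <neighboring_country>Germany</neighboring_country>
    <neighboring_country>Belgium</neighboring_country>
  </geography>
</country>"""
root = ET.fromstring(doc)  # root is the <country> element

# /country/geography/neighboring_country
neighbors = [n.text for n in root.findall("geography/neighboring_country")]

# /country/geography/neighboring_country[2]
second = root.find("geography/neighboring_country[2]").text

# /country/geography/capital/@name
capital_name = root.find("geography/capital").get("name")

# /country/geography/capital/remark/text()
remark_text = root.find("geography/capital/remark").text.strip()

print(neighbors, second, capital_name, remark_text)
```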

Transforming XML: XSLT

  • W3C standard for XSL Transformations, a stylesheet language for transforming XML (1999)
  • From XML to XML or (X)HTML
  • Use XPath for element addressing (selecting)
  • Use XML for output construction (rendering)

XSLT overview

  • Define XSLT templates, point from XML to XSLT
  •           <?xml version="1.0" encoding="iso-8859-1"?>
              <?xml-stylesheet type="text/xsl" href="foo.xsl"?>
              <countries>
                ...
              </countries>
            
  • Templates applied to XPath addresses
  •           <xsl:template match="/country/geography/capital">
                <html>
                  <body>
                    <b> Name: </b> 
                    <xsl:value-of select="@name"/>
                  </body>
                </html>
              </xsl:template>
            
  • Templates can be invoked by other templates
  •           <xsl:template match="/country/geography">
                <xsl:apply-templates select="capital"/>
              </xsl:template>
            

Exercise (1): what is the output of the XSLT?

        <?xml version="1.0" encoding="iso-8859-1"?>
        <countries>
          <country name="The Netherlands">
            <geography>
              <capital name="Amsterdam">
                <remark>The Hague is the seat of the government</remark>
              </capital>
              <neighboring_country> Germany </neighboring_country>
              <neighboring_country> Belgium </neighboring_country>
            </geography>
          </country>
          <country name="France">
            <geography>
              <capital name="Paris">
                <remark>There is more to France than just Paris</remark>
              </capital>
              <neighboring_country>Germany</neighboring_country>
              <neighboring_country>Belgium</neighboring_country>
              <neighboring_country>Spain</neighboring_country>
              <neighboring_country>Italy</neighboring_country>
              <neighboring_country>Switzerland</neighboring_country>
            </geography>
          </country>
        </countries>
      

Exercise (1): what is the output of the XSLT?

        <xsl:stylesheet version="1.0"
          xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
          <xsl:output method="html" encoding="iso-8859-1" indent="yes"/>
        
          <xsl:template match="text()">
          </xsl:template>
        
          <xsl:template match="/">
            <html>
              <head>
                <title>
                  <xsl:text>first example of XSLT transformation</xsl:text>
                </title>
              </head>
              <body>
                <xsl:apply-templates/>
              </body>
            </html>
          </xsl:template>
        
          <xsl:template match ="country/geography/capital">
            <p>
              <b>Name of the capital</b>:
              <xsl:value-of select = "@name"/>
            </p>
          </xsl:template>
        
        </xsl:stylesheet>
      

Answer:

        Name of the capital: Amsterdam
        Name of the capital: Paris
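
Python's standard library has no XSLT processor, so the sketch below only imitates the control flow of the stylesheet. It is not XSLT. An element with a matching template is handed to that template; any other element is traversed child by child, and text is dropped, as the empty `text()` template does. Matching on the element name alone instead of the full pattern `country/geography/capital` is a simplification, and `TEMPLATES`, `capital_template` and `apply_templates` are names invented for this example.

```python
import xml.etree.ElementTree as ET

doc = """<countries>
  <country name="The Netherlands">
    <geography>
      <capital name="Amsterdam">
        <remark>The Hague is the seat of the government</remark>
      </capital>
      <neighboring_country>Germany</neighboring_country>
    </geography>
  </country>
  <country name="France">
    <geography>
      <capital name="Paris">
        <remark>There is more to France than just Paris</remark>
      </capital>
      <neighboring_country>Spain</neighboring_country>
    </geography>
  </country>
</countries>"""

def capital_template(node):
    # Plays the role of <xsl:template match="country/geography/capital">
    return f"<p><b>Name of the capital</b>: {node.get('name')}</p>"

TEMPLATES = {"capital": capital_template}

def apply_templates(node):
    """Use a matching template if there is one, else recurse into children."""
    template = TEMPLATES.get(node.tag)
    if template is not None:
        return [template(node)]
    output = []
    for child in node:  # element children only; text nodes are ignored
        output.extend(apply_templates(child))
    return output

paragraphs = apply_templates(ET.fromstring(doc))
html = "<html><body>" + "".join(paragraphs) + "</body></html>"
print(html)
```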
      

Exercise (2): how to get the following result?

        <?xml version="1.0" encoding="iso-8859-1"?>
        <countries>
          <country name="The Netherlands">
            <geography>
              <capital name="Amsterdam">
                <remark>The Hague is the seat of the government</remark>
              </capital>
              <neighboring_country> Germany </neighboring_country>
              <neighboring_country> Belgium </neighboring_country>
            </geography>
          </country>
          <country name="France">
            <geography>
              <capital name="Paris">
                <remark>There is more to France than just Paris</remark>
              </capital>
              <neighboring_country>Germany</neighboring_country>
              <neighboring_country>Belgium</neighboring_country>
              <neighboring_country>Spain</neighboring_country>
              <neighboring_country>Italy</neighboring_country>
              <neighboring_country>Switzerland</neighboring_country>
            </geography>
          </country>
        </countries>
      

Exercise (2): how to get the following result?

        Country: The Netherlands
        Neighboring country: Germany
        Neighboring country: Belgium
        Country: France
        Neighboring country: Germany
        Neighboring country: Belgium
        Neighboring country: Spain
        Neighboring country: Italy
        Neighboring country: Switzerland

Answer:

        <xsl:stylesheet version="1.0"
          xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
          <xsl:output method="html" encoding="iso-8859-1" indent="yes"/>
        
          <xsl:template match="text()">
          </xsl:template>
        
          <xsl:template match="/">
            <html>
             <head>
                <title>
                  <xsl:text>Second example of XSLT transformation</xsl:text>
                </title>
              </head>
              <body>
                <p>
                  <xsl:apply-templates/>
                </p>
              </body>
            </html>
          </xsl:template>
        
          <xsl:template match="country">
            <b>Country</b>: 
            <xsl:value-of select="@name"/>
            <hr/>
            <xsl:apply-templates/>
          </xsl:template>
        
          <xsl:template match="neighboring_country">
            <b>Neighboring country:</b>
            <xsl:value-of select="text()"/>
            <hr/>
          </xsl:template>
        </xsl:stylesheet>
      

Summary

  • HTML: markup language for Web documents (fixed set of tags)
  • XML: extensible markup language (arbitrary tags)
  • XSD: schema language (define structure and constraints)
  • XQuery: query XML data (documents)
  • XPath: point to (set of) XML elements
  • XSLT: transform XML documents (using template stylesheets)

Assignment (XML)

  • Design an XML vocabulary for some domain
    • Create XSD document
    • Create XML data document
    • Validate XML against XSD
    • Transform XML into XHTML using XSLT
  • Tools
    • Mozilla/Firefox: XML & XSLT
    • XML, XPath, XSLT standards
    • XSLT tutorial
  • Submit
    • Email (subject: WBKR XML) with URLs of
      • XSD file
      • XML file (linked to XSLT file)
    • Until Tuesday 26 Feb 2008, 15:00h