XML

Tags

An opening tag gives the name of the element, within <> brackets.

<computer>

A closing tag is always required. The closing tag has a slash before the name.

The name must be spelled exactly like the opening tag. The capitalization must be the same. COMPUTER or Computer would be wrong in this example. You can use capital letters, if you wish, but the opening and closing tag must match.

</computer>

You may NOT put spaces before the name. You may put spaces at the end, before the closing >

<computer   >

</computer   >
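One way to check these tag rules is to hand small fragments to a parser and see which ones it accepts. The sketch below assumes Python's standard xml.etree.ElementTree module; the helper name is_well_formed is made up for this illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

matching = is_well_formed("<computer></computer>")
wrong_case = is_well_formed("<computer></Computer>")            # capitalization differs
trailing_spaces = is_well_formed("<computer   ></computer   >")  # spaces before >
leading_space = is_well_formed("< computer></computer>")         # space before the name

print(matching, wrong_case, trailing_spaces, leading_space)
```

Only the first and third fragments should be accepted.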

Elements

An element is an opening tag, its matching closing tag, and everything contained between them.

In the example below, the element content is only text. Text is always contained within an element. Text is sometimes called Parsed Character Data or PCDATA. Parsed means that the parser examines the text, looking for < characters and element tags. PCDATA is an SGML term. XML is derived from an older language: Standard Generalized Markup Language (SGML).

An element that only contains text is said to have simple content.

            
<computer>
   My computer is black and gray.
</computer>
            
          

An element that only contains elements is said to have element content.
In this example, computer has element content, and color has text content.

            
<computer>
   <color>
      black
   </color>
   <color>
      gray
   </color>
</computer>
            
          

An element that contains elements and text is said to have mixed content.
In this example, computer has mixed content.

computer is the parent of the color elements and of the text: My computer is old. They are the children of the computer element. If there were several levels of elements, we would use the terms descendants and ancestors. We say that the color elements and the text: My computer is old. are siblings.

            
<computer>
   <color>
      black
   </color>
   <color>
      gray
   </color>
   My computer is old.
</computer>
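To see the parent and child relationships from a program, here is a minimal sketch using Python's standard xml.etree.ElementTree module. Note one assumption about that particular library: it does not store the loose text as a separate child; it attaches text that follows an element to that element's tail attribute. The color values are written on one line here to keep the output tidy.

```python
import xml.etree.ElementTree as ET

doc = """<computer>
   <color>black</color>
   <color>gray</color>
   My computer is old.
</computer>"""

computer = ET.fromstring(doc)

# Iterating over an element visits its child elements in document order.
colors = [child.text for child in computer]

# ElementTree keeps the text that follows the last <color> in its "tail".
trailing_text = computer[-1].tail.strip()

print(computer.tag, colors, trailing_text)
```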
            
          

There must be exactly ONE root element in an XML document. All the other elements and all text must be somewhere within the root element.

            
<computer>
   
   If computer is the top level, root element,
   then the other elements and the text must be here, 
   inside it.
   
</computer>
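The single root rule can be tried out with a parser. This is a small sketch assuming Python's standard xml.etree.ElementTree module; the helper name parse_error is made up for this illustration, and the exact wording of the error message depends on the parser.

```python
import xml.etree.ElementTree as ET

def parse_error(xml_text):
    """Return the parser's error message, or None if the text is well formed."""
    try:
        ET.fromstring(xml_text)
        return None
    except ET.ParseError as err:
        return str(err)

one_root = "<computer><color>black</color></computer>"
two_roots = "<computer></computer><computer></computer>"
text_outside = "<computer></computer>stray text"

for sample in (one_root, two_roots, text_outside):
    print(parse_error(sample))
```

Only the first sample should come back with no error.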
            
          

Attributes

You can put attributes in the start tag of an element (Not in the closing tag at the end of an element).

Put a space before the attribute name,
then give the name of the attribute,
then an equal sign,
then either an opening single quote or a double quote,
then the value,
then a closing quote, which must match the opening quote exactly.
The opening and closing quotes must both be single quotes, or must both be double quotes.

If you use an attribute, it must always have a value, and the value must always be in quotes
(If the value is just an empty string, the value must still be in quotes).

The name of the attribute must be unique within the tag where it appears.

Attributes can be coded in any order. The order of the A, B, and C attributes does not matter in the example.
The order of elements may sometimes be important, but the order of attributes is not.

            
<name type="corporate">ABC Manufacturing Company</name>
<name type="person">Charles Proteus Steinmetz</name>
<name type="">Whatch callit</name>

<sample A="hello" B="goodbye" C="hello again">
<sample C="hello again" A="hello" B="goodbye">
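The attribute rules above can be checked the same way. The following sketch assumes Python's standard xml.etree.ElementTree module, which exposes attributes as a dictionary; the helper name accepts is made up for this illustration, and closing tags have been added so that each fragment is a complete element.

```python
import xml.etree.ElementTree as ET

def accepts(xml_text):
    """Return True if the parser accepts xml_text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

first = ET.fromstring('<sample A="hello" B="goodbye" C="hello again"></sample>')
second = ET.fromstring("<sample C='hello again' A='hello' B='goodbye'></sample>")

# Dictionaries compare equal regardless of the order the keys were added.
same_attributes = first.attrib == second.attrib

empty_value = accepts('<name type="">x</name>')            # empty string, still quoted
no_value = accepts('<name type>x</name>')                  # attribute without a value
no_quotes = accepts('<name type=person>x</name>')          # value not in quotes
mixed_quotes = accepts('<name type="person\'>x</name>')    # quotes do not match
duplicate = accepts('<name type="a" type="b">x</name>')    # name used twice

print(same_attributes, empty_value, no_value, no_quotes, mixed_quotes, duplicate)
```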
            
          

Names

Names of elements and attributes must meet the following rules:

  • case sensitive: Joe JOE joe are different names
  • start with a letter or underscore _
  • letters, numbers, underscores, hyphens and periods are allowed
  • may not contain spaces
  • colon : is only allowed when using namespaces (We will study namespaces)
  • may not start with xml in upper or lower case letters

Examples:

            
fred
_MARY-1st.Scotland
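The naming rules can be written down as a small checker. This is only a sketch: it covers the rules as listed above for plain ASCII names, while the XML specification also allows many non-ASCII letters. The names NAME_PATTERN and is_simple_name are made up for this illustration, and the colon is left out because it is reserved for namespaces.

```python
import re

# ASCII-only approximation of the rules listed above:
# first character is a letter or underscore, the rest are letters,
# digits, underscores, hyphens or periods. No spaces, no colon.
NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")

def is_simple_name(name):
    """Return True if name follows the simplified rules above."""
    if not NAME_PATTERN.fullmatch(name):
        return False
    # Names beginning with "xml" in any mix of cases are reserved.
    return not name.lower().startswith("xml")

for candidate in ("fred", "_MARY-1st.Scotland", "1st", "my name", "XmlThing"):
    print(candidate, is_simple_name(candidate))
```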
            
          

Whitespace

In HTML and XHTML, the extra whitespace is stripped out of the text.

<p>
       This paragraph         has
             extra            space.
</p>
          

will display in HTML as:
This paragraph has extra space.

In XML PCDATA, the whitespace is NOT stripped out. When you look at this example displayed in a browser, you will only see a single space, as is done in HTML. The XML still has the extra spaces, but the browser does not show them.
Attribute values DO have their new line characters replaced with a space.
You can use View Source to see the XML.
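A parser shows the difference between what is stored and what a browser displays. The sketch below assumes Python's standard xml.etree.ElementTree module; the note attribute is made up for this illustration, and the collapsing step only imitates what an HTML display does.

```python
import xml.etree.ElementTree as ET

doc = """<p note="line one
line two">
       This paragraph         has
             extra            space.
</p>"""

p = ET.fromstring(doc)

text_as_parsed = p.text                 # extra spaces and new lines are kept
collapsed = " ".join(p.text.split())    # what an HTML-style display would show
note = p.get("note")                    # new line in the attribute became a space

print(repr(text_as_parsed))
print(collapsed)
print(repr(note))
```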

Empty element

Sometimes you have an empty element.
Make sure you put nothing in it, not even a space.
An empty element can be abbreviated by using one self closing tag. Sometimes this is called a singleton tag.
A self closing tag is a single tag, with a slash before the > at the end.
These two examples are entirely equivalent in XML.

<processed></processed>
<processed/>
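A quick way to confirm that the two forms are equivalent, and that a single space makes a difference, is to parse them. This sketch assumes Python's standard xml.etree.ElementTree module, which reports the text of an element with no content as None.

```python
import xml.etree.ElementTree as ET

long_form = ET.fromstring("<processed></processed>")
short_form = ET.fromstring("<processed/>")
with_space = ET.fromstring("<processed> </processed>")

# Both empty forms parse to an element with no text and no children.
both_empty = long_form.text is None and short_form.text is None
same_output = ET.tostring(long_form) == ET.tostring(short_form)

# The element containing a space is not empty: its text is that space.
space_kept = with_space.text == " "

print(both_empty, same_output, space_kept)
```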

Entity and Character references

Entity references and character references are used to insert a single character into your document.
Entity and character references must begin with an ampersand (&) and end with a semicolon(;).

The following are the only entity references predefined in XML.

Entity reference    Resulting character
&amp;               &
&lt;                <
&gt;                >
&apos;              '
&quot;              "

You are required to use   &amp; for &   and   &lt; for <   everywhere they appear in the text and attribute values of your document (CDATA sections and comments are the exception). Otherwise the parser would think they were the start of a tag or of a reference. The other entity references are optional.

You are also permitted to use character references for any Unicode character, by using the decimal or hexadecimal number for the character. This example uses #169 or #xA9 to produce the character ©

               Character reference    Resulting character
Decimal        &#169;                 ©
Hexadecimal    &#xA9;                 ©
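When text is produced by a program, the required references are usually inserted by a library function rather than by hand. The following sketch assumes Python's standard xml.sax.saxutils.escape function and xml.etree.ElementTree module; the sample sentence is made up for this illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "AT&T says 1 < 2"

# escape() replaces & and < (and also >) with entity references.
escaped = escape(raw)

# The parser turns the references back into the original characters.
round_trip = ET.fromstring("<t>" + escaped + "</t>").text

# Decimal and hexadecimal character references give the same character.
copyright_text = ET.fromstring("<t>&#169; and &#xA9;</t>").text

print(escaped)
print(round_trip)
print(copyright_text)
```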

Parsed and unparsed data

The text in your document is called parsed character data or PCDATA. This XML text is parsed. That means the XML parser looks for < and & characters; if it finds one, it assumes it is a tag or reference, and processes it.

If you have a lot of < and & characters, you can just tell the parser to not look at the data. Then the text will be unparsed CDATA (no P because it is not parsed).

To make text CDATA, so it will not be parsed,
<![CDATA[   is put first
Your character data is put next
]]>   is put at the end.

The character sequence   ]]>   is not allowed in the text of an XML document, except at the end of CDATA.

Look at the source code of the CDATA example. You will see that the mathematical expression uses   <   rather than   &lt;
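The same idea can be shown in a few lines: the parser hands back identical text whether the characters were protected by a CDATA section or by entity references. This is a sketch assuming Python's standard xml.etree.ElementTree module; the math element and its expression are made up for this illustration.

```python
import xml.etree.ElementTree as ET

cdata_form = ET.fromstring(
    "<math><![CDATA[ if a < b && b < c then a < c ]]></math>"
)
escaped_form = ET.fromstring(
    "<math> if a &lt; b &amp;&amp; b &lt; c then a &lt; c </math>"
)

# Without CDATA or references, the bare < and & are rejected.
try:
    ET.fromstring("<math> if a < b && b < c then a < c </math>")
    unprotected_ok = True
except ET.ParseError:
    unprotected_ok = False

print(cdata_form.text == escaped_form.text, unprotected_ok)
```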

Comments

XML uses SGML comments, just like HTML.

Comments start with <!-- and end with -->
There are two limitations on comments:

  • Comments may not be inside tags
  • Comments may not contain --

<!-- This comment is for humans to read, not for computers. -->
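Both limitations can be tried out with a parser. The sketch assumes Python's standard xml.etree.ElementTree module; the helper name accepts and the doc element are made up for this illustration.

```python
import xml.etree.ElementTree as ET

def accepts(xml_text):
    """Return True if the parser accepts xml_text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

plain_comment = accepts("<doc><!-- for humans to read --></doc>")
double_hyphen = accepts("<doc><!-- not -- allowed --></doc>")   # contains --
inside_tag = accepts("<doc <!-- inside a tag --> ></doc>")      # comment in a tag

print(plain_comment, double_hyphen, inside_tag)
```

Only the first fragment should be accepted.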

XML declaration

The XML declaration tells that this is an XML document.

  • The XML declaration is not required, but I suggest you use it.
  • The XML declaration MUST be on the FIRST line, with no blank line before it.
  • The XML declaration MUST start at the FIRST character of the line, with no blank space before it.
  • <?xml starts the declaration
  • ?> ends the declaration
  • The attributes must be in order:
    1. version (required)
    2. encoding (optional)
    3. standalone (optional)
    It seems strange that the attributes must be in order in the XML declaration, but can be in any order in other XML statements. Perhaps, before the document has been declared to be XML, the full XML parser is not yet active.
  • version must be 1.0 or 1.1. I suggest 1.0.
  • encoding has several possible values. I suggest UTF-8.
  • standalone is not usually used. It is yes if the document does not depend on declarations in an external Document Type Definition (DTD), and no if it may. I suggest you omit it until we study Document Type Definitions.

<?xml version="1.0" encoding="UTF-8"?>
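The placement and ordering rules for the declaration can be checked with a parser. This sketch assumes Python's standard xml.etree.ElementTree module and passes the documents as bytes, the way they would be read from a file; the helper name accepts and the doc element are made up for this illustration.

```python
import xml.etree.ElementTree as ET

def accepts(xml_bytes):
    """Return True if the parser accepts xml_bytes, False otherwise."""
    try:
        ET.fromstring(xml_bytes)
        return True
    except ET.ParseError:
        return False

declared = accepts(b'<?xml version="1.0" encoding="UTF-8"?><doc></doc>')
omitted = accepts(b'<doc></doc>')                                         # declaration left out
space_first = accepts(b' <?xml version="1.0" encoding="UTF-8"?><doc></doc>')
blank_line_first = accepts(b'\n<?xml version="1.0" encoding="UTF-8"?><doc></doc>')
wrong_order = accepts(b'<?xml encoding="UTF-8" version="1.0"?><doc></doc>')

print(declared, omitted, space_first, blank_line_first, wrong_order)
```

Only the first two documents should be accepted.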

Processing Instructions

Processing instructions are not used very much. They look like the XML declaration, except that they may appear anywhere in the document, and instead of xml, some other name is used to identify the processor that will be handling the information. The contents can be anything your processor needs. The processor name must meet the requirements for XML names.

<?funny-process clown joke?>
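A program that wants to act on a processing instruction has to read its name and contents from the parsed document. The sketch below assumes Python's standard xml.dom.minidom module, which keeps processing instructions as nodes; the circus root element is made up so that the example is a complete document.

```python
from xml.dom import minidom

document = minidom.parseString("<?funny-process clown joke?><circus></circus>")

# The processing instruction comes before the root element,
# so it is the first child of the document node.
instruction = document.firstChild

is_pi = instruction.nodeType == instruction.PROCESSING_INSTRUCTION_NODE
target = instruction.target   # the processor name
data = instruction.data       # everything after the name

print(is_pi, target, data)
```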

Lecture notes

Lecture notes are the material used by the instructor in class.

These on-line web pages give most of the information about what you need to do and how to do it.
The book gives the most complete and accurate set of information.
You may wish to look at the week 1 lecture notes, to understand better why we use XML, and what XML is good for.

Reference page

The link is to a summary of the rules for writing well formed XML. You may print this summary for use in this class, if you wish.