Introduction to XML

What is XML?

  • XML stands for eXtensible Mark-up Language.
  • XML is a mark-up language (like HTML).
  • XML was designed to describe data.
  • XML tags are not predefined. You must define your own tags or use an existing specification.
  • XML uses a Document Type Definition (DTD) or an XML Schema to describe the structure of its contents.
  • XML was designed to carry data.
  • XML was designed to describe data and to focus on what data is.

XML does not DO anything

Maybe it is a little hard to understand, but XML does not DO anything. XML was created to structure, store, and send information.

The following example is an address book entry for John Doe, stored as XML:

<person>
	<firstname>John</firstname>
	<surname>Doe</surname>
	<age>24</age>
	<address type="work">
		<streetnumber>10</streetnumber>
		<street>The Street</street>
		<town>London</town>
	</address>
</person>

The address book entry holds all the information relevant to John Doe in a defined structure. But still, this XML document does not DO anything. It is just pure information wrapped in XML tags. Someone must write a piece of software to send, receive, understand or display it.
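As a minimal sketch of such a piece of software, the following Python reads the address book entry with the standard library's xml.etree.ElementTree module and prints a one-line summary. The describe helper and the ENTRY constant are names invented for this illustration, and the type attribute is quoted so that the parser accepts it.

```python
import xml.etree.ElementTree as ET

# The address book entry from the text, with the attribute value quoted.
ENTRY = """
<person>
    <firstname>John</firstname>
    <surname>Doe</surname>
    <age>24</age>
    <address type="work">
        <streetnumber>10</streetnumber>
        <street>The Street</street>
        <town>London</town>
    </address>
</person>
""".strip()


def describe(xml_text):
    """Return a one-line summary of a <person> entry (hypothetical helper)."""
    person = ET.fromstring(xml_text)
    address = person.find("address")
    return "{} {}, age {}, {} address: {} {}, {}".format(
        person.findtext("firstname"),
        person.findtext("surname"),
        person.findtext("age"),
        address.get("type"),
        address.findtext("streetnumber"),
        address.findtext("street"),
        address.findtext("town"),
    )


print(describe(ENTRY))
# John Doe, age 24, work address: 10 The Street, London
```

The XML itself stays passive; all of the behaviour lives in the program that reads it.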

XML is free and extensible

Opting for XML is a bit like choosing SQL for databases: you still have to build your own database and your own programs and procedures that manipulate it, but there are many tools available and many people who can help. And since XML is license-free, you can build your own software around it without paying anybody anything. The large and growing support means that you are also not tied to a single vendor.

XML tags are not predefined. You must "invent" your own tags.

The tags used to mark up HTML documents and the structure of HTML documents are predefined. The author of an HTML document can only use tags that are defined in the HTML standard (like <p>, <h1>, etc.).

XML allows authors to define their own tags and their own document structure.

The tags in the example above (like <firstname> and <street>) are not defined in any XML standard. These tags are "invented" by the author of the XML document.

XML is not a replacement for HTML.

It is important to understand that XML is not a replacement for HTML. In future Web development it is most likely that XML will be used to describe the data, while HTML will be used to format and display the same data.

XML can be thought of as a cross-platform, software and hardware independent tool for transmitting information.

How can XML be used?

It is important to understand that XML was designed to store, carry, and exchange data. XML was not designed to display data.

XML can Separate Data from display

When HTML is used to display data, the data is stored inside your HTML. With XML, data can be stored in separate XML files. This way you can concentrate on using a display-dedicated language for display, and be sure that changes in the underlying data will not require any changes to your display code.

For example you could use XML data inside HTML, using HTML only for formatting and displaying the data and XML as the transport layer.

XML is used to Exchange Data

With XML, data can be exchanged between incompatible systems.

In the real world, computer systems and databases contain data in incompatible formats. One of the most time-consuming challenges for developers has been to exchange data between such systems over the Internet.

Converting the data to XML can greatly reduce this complexity and create data that can be read by many different types of applications and computer systems.
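The round trip can be sketched in Python as follows. One system turns a record into XML text and another turns the text back into a record. The helper names record_to_xml and xml_to_record are made up for this example, and a flat dictionary stands in for whatever internal format each system really uses.

```python
import xml.etree.ElementTree as ET


def record_to_xml(tag, record):
    """Serialize a flat dict as an XML string (hypothetical helper)."""
    root = ET.Element(tag)
    for key, value in record.items():
        child = ET.SubElement(root, key)
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")


def xml_to_record(xml_text):
    """Rebuild a flat dict from the XML produced above (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}


# The sending system's own representation.
sender = {"firstname": "John", "surname": "Doe", "age": 24}
message = record_to_xml("person", sender)
print(message)

# The receiving system only needs the XML text, not the sender's code.
print(xml_to_record(message))
```

Note that everything arrives as text: the receiver gets the age as the string "24" and must decide for itself how to interpret it.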

XML can be used to Share Data

With XML, plain text files can be used to share data.

Since XML data is stored in plain text format, XML provides a software- and hardware-independent way of sharing data.

This makes it much easier to create data that different applications can work with. It also makes it easier to expand or upgrade a system to new operating systems, servers and applications.

XML can be used to Store Data

With XML, plain text files can be used to store data. XML can also be used to store data in files or in databases. Applications can be written to store and retrieve information from the store, and generic applications can be used to display the data.

XML can make your Data more Useful

With XML, your data is available to more users.

Since XML is independent of hardware, software and application, you can make your data available to third parties with little intervention from the author of the XML data (if a standard XML DTD or Schema is used, the recipient needs no help from the author to interpret the data). Third-party applications and users can access your XML files as data sources, just as they access databases. Your data can be made available to all kinds of "reading machines" (agents), and it is easier to make your data available for the blind, or people with other disabilities.

If Developers have Sense

The future might give us word processors, spreadsheet applications and databases that can read each other's data in a pure text format, without any conversion utilities in between. Microsoft has started down this road with the release of its office suite XML format. This is not to say that Microsoft is leading the field, though: both StarOffice (a cross-platform, free, Microsoft-compatible office suite) and Apple's OS X depend heavily on XML as their preferred data format.

There is nothing special about XML

There is nothing special about XML. It is just plain text with the addition of some XML tags enclosed in angle brackets.

Software that can handle plain text can also handle XML. In a simple text editor, the XML tags will be visible and will not be handled specially.

In an XML-aware application however, the XML tags can be handled specially. The tags may or may not be visible, or have a functional meaning, depending on the nature of the application.

XML Syntax

The syntax rules of XML are very simple and very strict. The rules are very easy to learn, and very easy to use.

Because of this, creating software that can read and manipulate XML is very easy to do.

An example XML document

XML documents use a self-describing and simple syntax.

<?xml version="1.0" encoding="utf-8"?>
<person>
	<firstname>John</firstname>
	<surname>Doe</surname>
	<age>24</age>
	<address type="work">
		<streetnumber>10</streetnumber>
		<street>The Street</street>
		<town>London</town>
	</address>
</person>

The first line in the document - the XML declaration - defines the XML version and the character encoding used in the document. In this case the document conforms to the 1.0 specification of XML and uses UTF-8, a character encoding of the Unicode standard.

The next line describes the root element of the document:

<person>

The next 8 lines describe 4 child elements of the root (firstname, surname, age, and address):

<firstname>John</firstname>
<surname>Doe</surname>
<age>24</age>
<address type="work">
	<streetnumber>10</streetnumber>
	<street>The Street</street>
	<town>London</town>
</address>

In the example the address element also has three child elements (streetnumber, street, town). This type of data structure is called a hierarchical or tree structure.

And finally the XML file ends with the closing of the root element:

</person>
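To make the tree structure visible, here is a small Python sketch that parses the example document and prints one element name per line, indented by depth. The outline function is a hypothetical helper written for this illustration, and the type attribute is quoted so that the document parses.

```python
import xml.etree.ElementTree as ET

# The example document as bytes, so the encoding declaration is honoured.
DOCUMENT = b"""<?xml version="1.0" encoding="utf-8"?>
<person>
    <firstname>John</firstname>
    <surname>Doe</surname>
    <age>24</age>
    <address type="work">
        <streetnumber>10</streetnumber>
        <street>The Street</street>
        <town>London</town>
    </address>
</person>
"""


def outline(element, depth=0):
    """Return the element names of a tree, indented two spaces per level."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(DOCUMENT)
print("\n".join(outline(root)))
# person
#   firstname
#   surname
#   age
#   address
#     streetnumber
#     street
#     town
```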

All XML elements must have a closing tag

With XML, it is illegal to omit the closing tag. In HTML some elements do not have to have a closing tag. The following code is legal in HTML:

<p>This is a paragraph
<p>This is another paragraph

In XML all elements must have a closing tag, in order for them to be legal XML:

<p>This is a paragraph</p>
<p>This is another paragraph</p>

NOTE You might have noticed from the previous example that the XML declaration did not have a closing tag. This is not an error. The declaration is not an element, so the closing-tag rule does not apply to it; it tells the XML processor which version and encoding to expect in what follows.
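A quick way to see the closing-tag rule enforced is to hand both versions to an XML parser. The sketch below uses Python's standard xml.etree.ElementTree. The is_well_formed helper is invented for this example, and the paragraphs are wrapped in a <doc> element (an assumption) because a document needs a single root.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# HTML-style paragraphs without closing tags.
unclosed = "<doc><p>This is a paragraph<p>This is another paragraph</doc>"
# The same paragraphs with every element closed.
closed = "<doc><p>This is a paragraph</p><p>This is another paragraph</p></doc>"

print(is_well_formed(unclosed))  # False
print(is_well_formed(closed))    # True
```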

XML tags are case sensitive

Unlike HTML, XML tags are case sensitive. With XML, the tag <Letter> is different from the tag <letter>. Opening and closing tags must therefore be written with the same case:

<Message>This is incorrect</message>
<message>This is correct</message>
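The following Python sketch shows both consequences of case sensitivity: two tags that differ only in case are treated as different elements, and a start tag cannot be closed by an end tag written in a different case. The <notes> wrapper and its contents are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# <Letter> and <letter> are two unrelated element names.
doc = ET.fromstring(
    "<notes><Letter>formal</Letter><letter>informal</letter></notes>"
)
print(doc.findtext("Letter"))  # formal
print(doc.findtext("letter"))  # informal

# A start tag and an end tag that differ in case do not match.
try:
    ET.fromstring("<Message>This is incorrect</message>")
    mismatch_accepted = True
except ET.ParseError:
    mismatch_accepted = False

print(mismatch_accepted)  # False
```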

All XML elements must be properly nested

Improper nesting of tags makes no sense in XML and will make the document illegal. In HTML some elements can be improperly nested within each other:

<b><i>This text is bold and italic</b></i>

In XML all elements must be properly nested within each other:

<b><i>This text is bold and italic</i></b>
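A parser can also report where the nesting goes wrong. This Python sketch returns the position that xml.etree.ElementTree attaches to its ParseError. The first_error function is a hypothetical helper, and the exact column reported depends on the underlying parser.

```python
import xml.etree.ElementTree as ET


def first_error(xml_text):
    """Return (line, column) of the first parse error, or None if there is none."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as error:
        return error.position
    return None


bad = "<b><i>This text is bold and italic</b></i>"
good = "<b><i>This text is bold and italic</i></b>"

print(first_error(bad))   # a (line, column) tuple pointing into the first line
print(first_error(good))  # None
```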

All XML documents must have a root element

All XML documents must contain a single tag pair to define a root element. All other elements must be within this root element.

All elements can have sub elements (child elements). Sub elements must be correctly nested within their parent element:

<root>
	<child>
		<subchild>.....</subchild>
	</child>
</root> 
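The single-root rule can be checked the same way. In this Python sketch, root_tag is a hypothetical helper that returns the name of the root element, or None when the parser rejects the text; the two-root input is invented for the example.

```python
import xml.etree.ElementTree as ET


def root_tag(xml_text):
    """Return the root element's name, or None if the text does not parse."""
    try:
        return ET.fromstring(xml_text).tag
    except ET.ParseError:
        return None


single = "<root><child><subchild>.....</subchild></child></root>"
double = "<child>one</child><child>two</child>"  # two top-level elements

print(root_tag(single))  # root
print(root_tag(double))  # None
```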

Attribute values must always be quoted

With XML, it is illegal to omit quotation marks around attribute values. XML elements can have attributes in name/value pairs just like in HTML. In XML the attribute value must always be quoted. Study the two XML documents below. The first one is incorrect, the second is correct:

<?xml version="1.0" encoding="utf-8"?>
<person>
	<address type=work>
	</address>
</person>

<?xml version="1.0" encoding="utf-8"?>
<person>
	<address type="work">
	</address>
</person>

The error in the first document is that the type attribute in the address element is not quoted.

This is correct: type="work" This is incorrect: type=work
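As a sketch of how a parser treats the two forms, the Python below reads the quoted version and shows the attribute it finds, then tries the unquoted version and records whether it was accepted. The variable names are chosen for this illustration only.

```python
import xml.etree.ElementTree as ET

quoted = '<person><address type="work"></address></person>'
unquoted = "<person><address type=work></address></person>"

# The quoted form parses, and the attribute is available by name.
address = ET.fromstring(quoted).find("address")
print(address.attrib)  # {'type': 'work'}

# The unquoted form is rejected before any data can be read.
try:
    ET.fromstring(unquoted)
    unquoted_accepted = True
except ET.ParseError:
    unquoted_accepted = False

print(unquoted_accepted)  # False
```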

Comments in XML

The syntax for writing comments in XML is similar to that of HTML.

<!-- This is a comment -->

XML Elements

The main carrier of information in an XML document is an element. An element is a single unit of storage that has a role to play in the overall document structure. An element can contain data or other elements.

To understand XML terminology, you have to know how relationships between XML elements are named, and how element content is described.

Imagine that this is a table of contents:

My First Anatomy Book
Introduction to the body.
	What does my blood do?
	What do my bones do?
Putting it all together
	How do I breathe?
	How do I walk?

Imagine that this XML document describes the book:

<toc>
	<title>My First Anatomy Book</title>
	<meta date="2004-10-01"></meta>
	<section>Introduction to the body.
		<paragraph>What does my blood do?</paragraph>
		<paragraph>What do my bones do?</paragraph>
	</section>
	<section>Putting it all together
		<paragraph>How do I breathe?</paragraph>
		<paragraph>How do I walk?</paragraph>
	</section>
</toc>

In the example, toc is the root element; title, meta and section are child elements of toc. toc is the parent element of title, meta and section. title, meta and section are siblings (or sister elements) because they have the same parent.

Elements have Content

Elements can have different content types. An XML element is everything from (including) the element's start tag to (including) the element's end tag. An element can have element content, mixed content, simple content, or empty content. An element can also have attributes.

In the example above, toc has element content because it contains other elements. section has mixed content because it contains both text and other elements. paragraph has simple content (or text content) because it contains only text. meta has empty content because there is nothing between its start and end tags; the only information it carries is in its date attribute.
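The four content types can be told apart mechanically. The Python sketch below classifies each element of a shortened, well-formed copy of the table-of-contents document. The content_type and has_text helpers are invented for this example, and treating whitespace-only text as "no text" is an assumption made here so that indentation does not count as content.

```python
import xml.etree.ElementTree as ET

TOC = """
<toc>
    <title>My First Anatomy Book</title>
    <meta date="2004-10-01"/>
    <section>Introduction to the body.
        <paragraph>What does my blood do?</paragraph>
        <paragraph>What do my bones do?</paragraph>
    </section>
</toc>
""".strip()


def has_text(value):
    """True if the string holds something other than whitespace."""
    return value is not None and value.strip() != ""


def content_type(element):
    """Classify an element as element, mixed, simple or empty content."""
    children = list(element)
    # Text can sit before the first child or after any child (its tail).
    text = has_text(element.text) or any(has_text(c.tail) for c in children)
    if children and text:
        return "mixed"
    if children:
        return "element"
    if text:
        return "simple"
    return "empty"


root = ET.fromstring(TOC)
for element in root.iter():
    print(element.tag, "->", content_type(element))
```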

Element Naming

XML elements must follow these naming rules:

  • Names can contain letters, numbers, and other characters
  • Names must not start with a number or punctuation character
  • Names must not start with the letters xml (or XML or Xml )
  • Names cannot contain spaces

Take care when you "invent" element names and follow these simple rules:

  • Any name can be used, no words are reserved, but the idea is to make names descriptive. Names with an underscore separator are nice.

Non-English letters like éäé are perfectly legal in the XML specification, but not all software vendors support them.

The colon (":") should not be used in element names because it is reserved to be used for something called namespaces (more later).
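The naming advice above can be turned into a rough check. The Python sketch below is a simplified, ASCII-only approximation rather than the full rule from the XML specification: it allows letters, digits, underscores, hyphens and full stops, requires a letter or underscore at the start, and rejects colons and the xml prefix. The name is_reasonable_name is hypothetical.

```python
import re

# Simplified, ASCII-only approximation of the naming rules in the text.
# The real specification also permits many non-English letters.
NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def is_reasonable_name(name):
    """Return True if the name follows the simplified rules above."""
    if name.lower().startswith("xml"):
        return False  # names must not start with xml, XML, Xml, ...
    return NAME_PATTERN.fullmatch(name) is not None


for candidate in ["first_name", "street-number", "2ndname", "xmlData",
                  "first name", "book:title"]:
    print(candidate, is_reasonable_name(candidate))
```

Only the first two candidates pass; the others start with a digit, start with xml, contain a space, or contain a colon.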

XML Attributes

XML elements can have attributes in the start tag, just like HTML. Attributes are used to provide additional information about elements.

In HTML you can create tags like this:

<IMG SRC="toc.gif">

The SRC attribute provides additional information about the IMG element.

In HTML (and in XML) attributes provide additional information about elements:

<img src="computer.gif">
<a href="demo.asp">

Attributes often provide information that is not a part of the data. In the example below, the file type is irrelevant to the data, but important to the software that wants to manipulate the element:

<file type="gif">computer.gif</file>

Quote Styles, "female" or 'female'?

Attribute values must always be enclosed in quotes, but either single or double quotes can be used. For a person's sex, the person tag can be written like this:

<person sex="female">

or like this:

<person sex='female'>

NOTE If the attribute value itself contains double quotes, enclose it in single quotes, like in this example:
<person sex='male "apart from Tuesday when you can call me Tracey" '/>

NOTE If the attribute value itself contains single quotes, enclose it in double quotes, like in this example:
<person sex="male 'apart from Tuesday when you can call me Tracey'" />
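When attribute values are generated by a program, the choice of quote style can be left to a library. As a sketch, Python's standard xml.sax.saxutils.quoteattr picks a quote character that does not clash with the value. The sample values below are shortened versions of the ones in the notes above.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

plain = quoteattr("female")
with_double = quoteattr('male "Tracey"')
with_single = quoteattr("male 'Tracey'")

print(plain)        # "female"
print(with_double)  # 'male "Tracey"'
print(with_single)  # "male 'Tracey'"

# Whichever quote style is used, the parser returns the same value.
element = ET.fromstring("<person sex=%s/>" % with_double)
print(element.get("sex"))  # male "Tracey"
```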

Use of Elements vs. Attributes

Data can be stored in child elements or in attributes. Take a look at these examples:

<person sex="female">
	<firstname>Anna</firstname>
	<surname>Smith</surname>
</person>
<person>
	<sex>female</sex>
	<firstname>Anna</firstname>
	<surname>Smith</surname>
</person>

In the first example sex is an attribute. In the second, sex is a child element. Both examples provide the same information.

There are no rules about when to use attributes, and when to use child elements.
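Because both styles are legal, software that reads such data sometimes has to cope with either. A minimal Python sketch, assuming the two layouts shown above; the sex_of helper is invented for this example:

```python
import xml.etree.ElementTree as ET

as_attribute = (
    '<person sex="female">'
    "<firstname>Anna</firstname><surname>Smith</surname>"
    "</person>"
)
as_element = (
    "<person><sex>female</sex>"
    "<firstname>Anna</firstname><surname>Smith</surname>"
    "</person>"
)


def sex_of(person):
    """Look for a sex attribute first, then fall back to a <sex> child."""
    value = person.get("sex")
    if value is None:
        value = person.findtext("sex")
    return value


print(sex_of(ET.fromstring(as_attribute)))  # female
print(sex_of(ET.fromstring(as_element)))    # female
```

Agreeing on one style per vocabulary avoids the need for this kind of fallback.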

Summary

XML is not so much a language as a standardized set of rules for adding structure to any form of data using a system of markup tags. Anyone can create their own markup vocabulary (formally described by an XML Schema), and XML ensures that the structure will be intelligible to anyone else who reads the XML Schema document. More importantly, referring to an XML Schema enables XML-aware software to automatically manipulate the data without needing advance knowledge of the structure.