Lecture 1 Notes

Markup

[Figure: typewritten data with red-pen markup]

Markup comes from the bad old days before word processors. If you needed a brochure, you'd type it on a typewriter, and then literally mark it up with a red pen to tell the typesetter what you wanted it to look like. The typesetter would follow your instructions and return a finished document to you:

How to Buy a Wrench

There are two kinds of wrenches: wrenches with fixed size, and adjustable wrenches.

In this instance, we're using markup not only to show how text should be presented (italic versus normal text), but also to tell how the document is structured: some of the words form a heading, the other words are just ordinary text.

The idea of using markup to impose structure on otherwise anonymous data is such a good one that people came up with a standardized way to create markups for general use. This method was called the Standard Generalized Markup Language, or SGML. SGML really isn't a language in and of itself; it is more of a “rulebook” that tells you how to develop these markup languages. Any markup language that follows the SGML rulebook is called an application of SGML.

The most widely known application of SGML is a language used to mark up text for delivery and presentation on the World Wide Web. That language is HTML, the HyperText Markup Language. In HTML, we can mark up the example above to send to a web browser instead of a typesetter:

<h3>How to Buy a Wrench</h3>
<p>
There are two kinds of wrenches: wrenches with fixed size, and
<i>adjustable</i> wrenches.
</p>

There are many other applications of SGML, but they're mostly found in large corporations and government agencies. That's because the SGML rulebook is very complex, which makes it hard to learn. For example, SGML allows optional opening and closing tags. Quick: is </li> required or not? How about <body>? Additionally, it's difficult (and expensive!) to develop tools that can manage data that's marked up according to those rules.

HTML Doesn't Do It All

While HTML is a good thing, it doesn't solve all our problems. Consider the following two tables. While the data is structured into rows and cells, there's nothing to tell you (other than your intuition) that the first table gives maximum and minimum temperatures, while the second table gives current and maximum capacities for water reservoirs.

<table border="1">
<tr>
  <td>Chicago</td><td>13</td><td>6</td>
</tr>
<tr>
  <td>Dallas</td><td>60</td><td>20</td>
</tr>
</table>
<table border="1">
<tr>
  <td>Calero</td><td>5538</td><td>10050</td>
</tr>
<tr>
  <td>Uvas</td><td>6095</td><td>9935</td>
</tr>
</table>

XML Solves the Problems

To solve the complexity issue, XML was designed as a subset of SGML. It eliminates the features that make SGML difficult to learn and parse while retaining 90% of the power of SGML. Tools that analyze and display XML are easier to write, and are widespread and inexpensive. Since XML is a subset of SGML, it lets you devise any set of tags you wish, thus solving the problem of differentiating what would otherwise be anonymous numbers:

<temperatures>
<city name="Chicago">
    <max>13</max><min>6</min>
</city>
<city name="Dallas">
    <max>60</max><min>20</min>
</city>
</temperatures>
<water-banks>
<reservoir name="Calero">
   <current>5538</current><capacity>10050</capacity>
</reservoir>
<reservoir name="Uvas">
   <current>6095</current><capacity>9935</capacity>
</reservoir>
</water-banks>
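
As a rough illustration of why named tags help, here is a minimal Python sketch that reads the temperature document with the standard library's xml.etree.ElementTree module. The function name read_temperatures and the choice of Python are assumptions made for this example; they are not part of the original notes.

```python
import xml.etree.ElementTree as ET

# The temperature document from above, held in a string.
TEMPERATURES = """
<temperatures>
<city name="Chicago">
    <max>13</max><min>6</min>
</city>
<city name="Dallas">
    <max>60</max><min>20</min>
</city>
</temperatures>
"""

def read_temperatures(xml_text):
    """Return {city name: (max, min)}, using the element names as labels."""
    root = ET.fromstring(xml_text)
    result = {}
    for city in root.findall("city"):
        high = int(city.findtext("max"))
        low = int(city.findtext("min"))
        result[city.get("name")] = (high, low)
    return result

print(read_temperatures(TEMPERATURES))
# {'Chicago': (13, 6), 'Dallas': (60, 20)}
```

The program never has to guess which number is the maximum: it asks for the <max> element by name, which the anonymous <td> cells of the HTML table could not offer.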

An XML Document and its Terminology

Consider the following example:

<p>Here is some <b>important</b> and
<i>useful</i> information.</p>

The <p> element is the parent of five children:

  1. The text Here is some
  2. The <b> element
  3. The text and
  4. The <i> element
  5. The text information.

Each of these children is the sibling of the other children. Note that the <b> and <i> elements also have children.
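
To see the parent and child relationships from a program's point of view, the following sketch lists the children of the <p> element using Python's standard xml.dom.minidom module. The helper name describe_children is hypothetical, chosen only for this illustration.

```python
from xml.dom import minidom

# The example paragraph from above.
SNIPPET = "<p>Here is some <b>important</b> and\n<i>useful</i> information.</p>"

def describe_children(xml_text):
    """Return a list of (kind, value) pairs for the root element's children."""
    root = minidom.parseString(xml_text).documentElement
    described = []
    for node in root.childNodes:
        if node.nodeType == node.TEXT_NODE:
            described.append(("text", node.data))
        elif node.nodeType == node.ELEMENT_NODE:
            described.append(("element", node.tagName))
    return described

children = describe_children(SNIPPET)
print(len(children))                      # 5
print([kind for kind, _ in children])
# ['text', 'element', 'text', 'element', 'text']
```

Running the same function on "<b>important</b>" would show that the <b> element has a single child of its own, the text important.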

Use Lowercase

HTML didn't care whether you wrote your element names or attribute names in uppercase or lowercase. XHTML is case-sensitive; all element and attribute names must be lowercase.

HTML
<OL Type="A">
<li>item one</li>
<li>item two</LI>
</oL>
XHTML
<ol type="A">
<li>item one</li>
<li>item two</li>
</ol>

Notice that the attribute value can be uppercase. Some people use uppercase element names because they stand out better from the surrounding text; but it turns out that all lowercase is easier to read. Hey, it's an imperfect world.
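
The sketch below feeds both list examples to Python's standard XML parser to show what case sensitivity means in practice. Note the assumption: a generic XML parser only checks that opening and closing tags match exactly; the all-lowercase requirement itself comes from the XHTML rules and would need a validating tool to enforce. The helper name is_well_formed is made up for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the standard-library XML parser accepts the text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# The two list examples from above, each joined onto one line.
MIXED_CASE = '<OL Type="A"><li>item one</li><li>item two</LI></oL>'
LOWER_CASE = '<ol type="A"><li>item one</li><li>item two</li></ol>'

print(is_well_formed(MIXED_CASE))  # False: <li> and </LI> are different names
print(is_well_formed(LOWER_CASE))  # True
```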

Nest Elements Properly

If you have nested elements (one element inside another), you must end the inner element before the outer one. Older browsers do their best to display improperly nested HTML; XML tools will reject any XHTML document that has a nesting error.

Incorrect
<b>Outer and <i>inner</b> elements</i>
Correct
<b>Outer and <i>inner</i> elements</b>
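
Here is a small sketch of how an XML tool reacts to the nesting error, again using Python's standard parser as a stand-in for "XML tools" in general. The function name nesting_error is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

def nesting_error(xml_text):
    """Return the parser's error message, or None if the text parses cleanly."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return str(err)
    return None

INCORRECT = "<b>Outer and <i>inner</b> elements</i>"
CORRECT = "<b>Outer and <i>inner</i> elements</b>"

print(nesting_error(INCORRECT))  # a "mismatched tag" message with line and column
print(nesting_error(CORRECT))    # None
```

Unlike a forgiving browser, the parser stops at the first mismatched closing tag and produces no document at all.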

Rules for Attributes

  1. Attribute values must be enclosed in quote marks. You can use either double or single quotes to enclose the value of an attribute, but they must be there.
  2. Attribute names must be unique.
  3. Attributes must be separated by whitespace (spaces, tabs, or new lines).
Incorrect
<a href=page2.html>
<a href="page2.html" name="b"
   href="abc.html">
<a href="page2.html"name="b">
Correct
<a href="page2.html">
<a href="page2.html" name='b'>

Finally, all attributes must have both a name and a value. For those attributes in HTML that didn't require values, you must duplicate the attribute name as the value. Here are some examples:

HTML
<dl compact>
<option selected>
<td nowrap>
XHTML
<dl compact="compact">
<option selected="selected">
<td nowrap="nowrap">
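
The attribute rules above can be checked mechanically. In this sketch each start tag from the examples is given a closing tag (an addition made here so that the attribute syntax is the only thing being tested) and then passed to Python's standard XML parser. The names is_well_formed and EXAMPLES are hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the standard-library XML parser accepts the text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Start tags from the examples above, each with a closing tag added.
EXAMPLES = {
    "unquoted value":     '<a href=page2.html></a>',
    "duplicate name":     '<a href="page2.html" name="b" href="abc.html"></a>',
    "missing whitespace": '<a href="page2.html"name="b"></a>',
    "no value":           '<option selected></option>',
    "quoted, separated":  '<a href="page2.html" name=\'b\'></a>',
    "name as value":      '<option selected="selected"></option>',
}

for label, text in EXAMPLES.items():
    print(label, is_well_formed(text))
# The first four print False; the last two print True.
```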

All Opening Tags Must Have Closing Tags

This is a big one.

Incorrect
<p>
Paragraph one
<p>
Paragraph two
Correct
<p>
Paragraph one
</p>
<p>
Paragraph two
</p>

What, then, are we to do with elements like <br> and <img>, which don't have any content, and thus don't need any closing tags in HTML? We can do one of two things: we can put in a closing tag, or we can use a “shorthand form” by placing a / before the > of the element, as in the following examples.

<br></br>
<br />
<img src="wsp.png" alt="WaSP logo"></img>
<img src="wsp.png" alt="WaSP logo" />

You'll note that we've put a blank before the slash; this keeps older browsers from freaking out when they encounter one of these shorthand elements. You should use the shorthand form only for empty elements (elements that don't have content). If you want a paragraph with no text in it, use the opening-and-closing-tag form. This reminds people who read your source that the <p> element is still a container element—one that can have content, but doesn't happen to at this moment. This, too, will prevent older browsers from freaking out.

Not Recommended
<p />
Recommended
<p></p>
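
As far as an XML parser is concerned, the closing-tag form and the shorthand form describe the same element; the recommendation above is for the benefit of human readers and older browsers. The sketch below illustrates this with Python's standard parser. The helper name parsed_form is an assumption for this example.

```python
import xml.etree.ElementTree as ET

def parsed_form(xml_text):
    """Summarize an element as (tag, attributes, text, number of children)."""
    elem = ET.fromstring(xml_text)
    return (elem.tag, dict(elem.attrib), elem.text, len(elem))

print(parsed_form("<br></br>") == parsed_form("<br />"))  # True
print(parsed_form('<img src="wsp.png" alt="WaSP logo"></img>')
      == parsed_form('<img src="wsp.png" alt="WaSP logo" />'))  # True
print(parsed_form("<p></p>") == parsed_form("<p />"))  # True
```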

No Double Dashes in Comments

You cannot put two hyphens in a row inside an XHTML comment, but you may use equal signs, underscores, or hyphens with spaces between them.

Incorrect
<!-- Comments ------- like this -->
Correct
<!-- Comments - - - - like this -->
<!-- Comments ======= like this -->
<!-- Comments _______ like this -->
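
The comment rule can be tried out with the sketch below. Each comment from the examples is wrapped in a <p> element (an addition made here, since a parser needs a root element) and handed to Python's standard XML parser. The names used are hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the standard-library XML parser accepts the text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

COMMENTS = [
    "<!-- Comments ------- like this -->",
    "<!-- Comments - - - - like this -->",
    "<!-- Comments ======= like this -->",
    "<!-- Comments _______ like this -->",
]

for comment in COMMENTS:
    print(comment, is_well_formed("<p>" + comment + "</p>"))
# Only the first comment, with hyphens in a row, is rejected.
```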

You Must Encode < and & Symbols

You can't put a < or & directly into the text of your XHTML. You must instead use &lt; and &amp;. And yes, the semicolon at the end of each of these entities is required! You don't have to encode a greater-than sign as &gt; in ordinary text, because it is ambiguous only in the rare sequence ]]>, where it must be escaped. However, we recommend that you always do so; this will keep your markup looking symmetrical.

Incorrect
<p>
He & I graphed the
inequality x + 3 < y
</p>
Correct
<p>
He &amp; I graphed the
inequality x + 3 &lt; y
</p>
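
If your text comes from a program rather than a keyboard, the encoding can be done for you. The sketch below uses escape from Python's standard xml.sax.saxutils module, which replaces &, < and > with their entities. The function name to_paragraph is an assumption for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

RAW_TEXT = "He & I graphed the inequality x + 3 < y"

def to_paragraph(text):
    """Wrap plain text in a <p> element, encoding the special characters."""
    return "<p>" + escape(text) + "</p>"

markup = to_paragraph(RAW_TEXT)
print(markup)
# <p>He &amp; I graphed the inequality x + 3 &lt; y</p>

# Parsing the markup gives back the original text.
print(ET.fromstring(markup).text == RAW_TEXT)  # True
```

Wrapping RAW_TEXT in <p> tags without calling escape would make the parser raise a ParseError at the bare &.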

By the way, all XML processors also accept &quot; as a synonym for double quotes. This lets you do things like:

<img src="hello.jpg" alt="Mrs. O'Hara says &quot;Hi&quot; to us!" />
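
Attribute values can be quoted by a program as well. This sketch uses quoteattr from Python's standard xml.sax.saxutils module, which picks the quote marks and inserts &quot; where needed; the variable names are made up for this example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

# Text containing both an apostrophe and double quotes.
ALT_TEXT = 'Mrs. O\'Hara says "Hi" to us!'

quoted = quoteattr(ALT_TEXT)
print(quoted)
# "Mrs. O'Hara says &quot;Hi&quot; to us!"

# Build the element and confirm the parser recovers the original text.
img = ET.fromstring('<img src="hello.jpg" alt=' + quoted + ' />')
print(img.get("alt") == ALT_TEXT)  # True
```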

Additional reading

XML in 10 Points