(ready to attack) rex © v2.0 lud July 23, 1998

A real goodie on HTML

author: Arjun Ray
mail: aray@interactrx.com


Veronica wrote:
|
| Toby Speight wrote:
| > You're falling into the embedded-formatting-commands model of HTML,
| > which causes all sorts of confusion.
|
| (The what? please explain!)
			

It's the idea that a tag constitutes a command or directive, or more generally that a tag by itself conveys semantic information. It doesn't.

A tag is just a marker, like a parenthesis: its role is syntactic only, to group with its matching parenthesis the contents in between, where the "block" as a whole is given a name for semantic purposes.

Such a name (called the generic identifier) is included in the start-tag because it's syntactically convenient to do so, but this name is not a property of the *tag*. It belongs to the *element* for which the tag serves as a delimiter. In other words, all semantics apply to elements: the tags merely locate where these elements are.

Similarly, the scope of the element determines the scope of the semantics to be applied, so proper semantic processing depends critically on such scopes being identifiable.

The syntax to scope an element has three parts: a start, content, and an end. The content in turn consists of other elements (also with generic identifiers as "hooks" to semantics) and possibly text. The elements form a containment hierarchy that is actually a tree, as a data structure, with all text data at the leaf nodes. In fact, a document with SGML markup is no more than a linearized representation of such a tree, where all text is embedded in markup.

In terms of semantic information content, these representations are exactly equivalent:

   1.
     <HTML>
       <HEAD>
         <TITLE>Example</TITLE></HEAD>
       <BODY>
         <H1>Hello World!</H1></BODY></HTML>

   2.
     <HTML>
       <HEAD>
         <TITLE>Example</></>
       <BODY>
         <H1>Hello World!</></></>

   3.
     ((HTML)
       ((HEAD)
         ((TITLE) '(Example)))
       ((BODY)
         ((H1) '(Hello World!))))
			

#1 is a "normalized" form with all omissible tags included.

#2 happens to be valid HTML!

#3 makes the *tree* more evident at the expense of inconvenient syntax.

But #1 is best understood in terms of #3.
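
To make the equivalence of #1 and #3 concrete, here is a minimal Python sketch. It is not anything from the specs: the tuple layout and the names `doc`, `to_tags` and `to_sexpr` are assumptions made up for illustration. One tree is held in memory and then linearized two ways.

```python
# A hypothetical minimal element tree: (name, children), where each child
# is either another (name, children) tuple or a plain text string (a leaf).
doc = ("HTML", [
    ("HEAD", [("TITLE", ["Example"])]),
    ("BODY", [("H1", ["Hello World!"])]),
])

def to_tags(node):
    """Linearize the tree as in form #1: explicit start- and end-tags."""
    if isinstance(node, str):
        return node
    name, children = node
    inner = "".join(to_tags(c) for c in children)
    return f"<{name}>{inner}</{name}>"

def to_sexpr(node):
    """Linearize the same tree as in form #3: parenthesized groups."""
    if isinstance(node, str):
        return f"'({node})"
    name, children = node
    inner = " ".join(to_sexpr(c) for c in children)
    return f"(({name}) {inner})"

print(to_tags(doc))
print(to_sexpr(doc))
```

Both outputs come from the same `doc`; only the surface syntax differs, which is the sense in which the representations carry the same information.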

See http://www.oasis-open.org/cover/general.html

Contrast this paradigm ("all text embedded in markup") with another where all markup is embedded in text. Here, each markup construct is intended to function by itself, as an independent unit of semantic information. The data structure underlying this is a linked list with two types of nodes (text and markup) in arbitrary order. A linearized representation is basically trivial, and semantic processing is correspondingly well suited to "stream mode": do-something-one-tag-at-a-time, and no-tag-no-action.

It's possible to treat the linear representation of a tree in tag-at-a-time mode (if only to reconstruct the tree!). But lists don't correspond to trees in general, and the mere fact of a linear representation doesn't cancel or invalidate the tree, much less mandate a tag-at-a-time approach.
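
The two readings of the same linear input can be sketched with Python's standard `html.parser` module. The class names `TagAtATime` and `TreeBuilder` are hypothetical, and the tree builder assumes well-nested input with explicit end-tags; it is an illustration, not a conforming SGML parser.

```python
from html.parser import HTMLParser

class TagAtATime(HTMLParser):
    """Stream mode: record each tag or text run as an independent event."""
    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        if data.strip():
            self.events.append(("text", data.strip()))

class TreeBuilder(HTMLParser):
    """Tree mode: use a stack to rebuild the containment hierarchy."""
    def __init__(self):
        super().__init__()
        self.root = ("#root", [])
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self.stack[-1][1].append(node)   # attach to the current parent
        self.stack.append(node)          # descend into the new element

    def handle_endtag(self, tag):
        # Assumption: input is well nested, so the top of the stack matches.
        self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.stack[-1][1].append(data.strip())

src = "<ul><li>Item 1</li><li>Item 2</li></ul>"

flat = TagAtATime()
flat.feed(src)
flat.close()

tree = TreeBuilder()
tree.feed(src)
tree.close()

print(flat.events)   # a flat list of eight events
print(tree.root)     # a nested structure with text at the leaves
```

The event list knows nothing about which text belongs to which element; the tree does, and that is the information any scoped semantics would need.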

The surface syntax of HTML doesn't clarify which paradigm should apply. Either one could. For instance, it could be argued that all tags are between '<' and '>', and end-tags are distinguished by the presence of a "cancellation operator", '/', so that </FOO> parses as...

   </FOO>  ::   <  +  /FOO  +  >  or  { '<' { '/' 'FOO' } '>' }
			

neither of which reflects the correct parse:

   </FOO>  ::   </  +  FOO  +  >
			

(There's plenty of rubbish like this on the Web: it seems to be a "theory" many people are prone to fall into. More on this below.)
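
As a rough illustration of the correct parse, here is a small Python sketch that splits a simple tag into its delimiters and generic identifier. The function name `split_tag` and the regular expression are assumptions for illustration; attributes and the rest of SGML tag syntax are deliberately ignored.

```python
import re

# In the SGML reference concrete syntax, "<" opens a start-tag (STAGO),
# "</" opens an end-tag (ETAGO), and ">" closes either one (TAGC).
# "</" is tried first, so it is matched as a single delimiter.
TAG = re.compile(r"(</|<)([A-Za-z][A-Za-z0-9.-]*)?\s*(>)")

def split_tag(text):
    """Split one simple tag into (open delimiter, generic identifier, close delimiter)."""
    m = TAG.fullmatch(text)
    if m is None:
        raise ValueError(f"not a simple tag: {text!r}")
    return m.group(1), m.group(2) or "", m.group(3)

print(split_tag("</FOO>"))  # ('</', 'FOO', '>')
print(split_tag("<FOO>"))   # ('<', 'FOO', '>')
print(split_tag("</>"))     # ('</', '', '>')
```

The slash never attaches to the name: it is part of the opening delimiter, which is also why an empty end-tag like `</>` still makes sense.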

The fact is that the only existing formal specification for HTML identifies the correct paradigm: an HTML document is a tree of elements with text, not a list of tags and text. But there's another giveaway, which has to do with characteristic usage in the two paradigms.

With a tree of elements, the structural relation of containment -- an essentially descriptive function -- is built into the syntax, and the semantics come in as a "late binding" of element names to procedures.
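
A minimal Python sketch of that "late binding" follows. The tree shape, the handler tables and the function `process` are all hypothetical, chosen only to show names being looked up at processing time.

```python
# Hypothetical element tree: (name, children), with text strings at the leaves.
doc = ("List", [("Item", ["Item 1"]), ("Item", ["Item 2"])])

def render_list(children, render):
    return "\n".join(render(c) for c in children)

def render_item(children, render):
    return "  * " + "".join(render(c) for c in children)

# One binding of names to procedures: a bulleted plain-text rendering.
bullet_handlers = {"List": render_list, "Item": render_item}

# Another binding for the very same names: a one-line inline rendering.
inline_handlers = {
    "List": lambda children, render: "; ".join(render(c) for c in children),
    "Item": lambda children, render: "".join(render(c) for c in children),
}

def process(node, handlers):
    """Walk the tree; the element name is resolved to a procedure only here."""
    if isinstance(node, str):
        return node
    name, children = node
    return handlers[name](children, lambda c: process(c, handlers))

print(process(doc, bullet_handlers))
print(process(doc, inline_handlers))   # Item 1; Item 2
```

The document itself never says "indent" or "draw a bullet"; it only says what things are, and each table decides what to do about it.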

The names are just identifiers, and so by and large they tend to be nouns. With a list of tags, there's no further structure beyond the sequencing, and so the names -- in terms of what they're supposed to convey -- are typically verbs. This is a characteristic difference...

...between descriptive markup such as...

   <List>
     <Item>Item 1</Item>
     <Item>Item 2</Item>
   </List>
			

...and procedural markup such as

   <IndentIn>
     <Bullet>Item 1
     <Bullet>Item 2
   <IndentOut>
			

Unfortunately, HTML doesn't help much here. Many of its elements have utterly impenetrable names, such as UL, OL, LI, TR, TD, DL, DT, DD, etc. For all anyone might care to know, "UL" could easily be "Indent" in Sanskrit, and "LI" could be "Bullet" in Swahili. Still, reading the specs should make clear that HTML is largely about nouns, not verbs. But as most people don't bother with specs...

Ahem. Sanskrit? Swahili? Why not "Mosaic"?

It so happens that the current crop of browsers are procedural markup processors: their MO is basically one-tag-at-a-time, with contortions to handle problems such as two-pass parsing of tables (so it's no surprise that they make heavy weather of getting such things right).

Faced with figuring out what "UL" means, someone might just look at what Mosaic did with it, and conclude that the <UL> was a "command" that Mosaic seemed to obey dutifully -- and "consistently" in the sense that the same observable result was independent of context. In fact, it was probably extremely fortuitous that HTML elements were named so obscurely, because there's nothing obviously "wrong" in believing that...

   <UL>
      Indented stuff
   </UL>
			

is actually just "computerese" for...

   <IndentIn>
      Indented stuff
   <IndentOut>
			

whereas something like...

   <List>
      Indented stuff
   </List>
			

might just give pause for thought about the plain meanings of words.

When Toby mentioned the "embedded-formatting-commands model", he was referring to the tendency to think of tags as (angle brackets around) verbs. It's a theory; obscure names help; it seems superficially valid since the Mosaic spawn in fact have precisely that implementation strategy; but the specs say otherwise.

[Aside: IMHO, the prospects for good CSS implementations in the Mosaic spawn are poor. Stylesheets also constitute a kind of late binding of properties to elements, and the inheritance model relies on that in critical ways. But the Mosaic spawn prefer to deal with tags one at a time in isolation, which is why their tag-salad parsers have to be helped along with explicit end-tags quite often. No surprise there.]
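
Why inheritance needs the tree can be sketched in a few lines of Python. This is a deliberately simplified model, not the CSS cascade: the class `Element` and its `computed` method are hypothetical, and every property is treated as inherited, which real CSS does not do.

```python
class Element:
    """Hypothetical tree node: a name, its own style declarations, and a parent."""
    def __init__(self, name, parent=None, **style):
        self.name = name
        self.parent = parent
        self.style = style

    def computed(self, prop, default=None):
        # Walk up the containment hierarchy until some ancestor declares
        # the property. Assumption: every property inherits.
        node = self
        while node is not None:
            if prop in node.style:
                return node.style[prop]
            node = node.parent
        return default

body = Element("BODY", color="black")
ul = Element("UL", parent=body, color="navy")
li = Element("LI", parent=ul)
p = Element("P", parent=body)

print(li.computed("color"))  # navy
print(p.computed("color"))   # black
```

Without knowing that the LI is inside the UL, there is nothing to walk up, and a processor that sees tags one at a time has no parent to ask.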

Veronica wrote:
| As I asked in another post, what about:
|
| ..bla <i> text1 <b> text2 </i>
| text3 </b> bla..
|
| How does that survive your cute little "nesting" theory?

Well, it isn't Toby's nesting theory. It's the paradigm of hierarchic structure in all SGML applications. The trouble is in confusing "it works for me" with "it's what I meant", which really was...

..bla <i-on> text1 <b-on> text2 <i-off>
text3 <b-off> bla..

Right?
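
For what it's worth, the difference shows up as soon as the tags are checked against a stack. The following Python sketch is a rough illustration only: `check_nesting` is a hypothetical helper, it handles attribute-free tags, and its recovery step is crude.

```python
import re

def check_nesting(markup):
    """Return a list of (end_tag, expected) mismatches; empty means well nested."""
    stack, problems = [], []
    # "</" is tried before "<", so the end-tag delimiter is matched whole.
    for opener, name in re.findall(r"(</|<)([A-Za-z][A-Za-z0-9-]*)>", markup):
        name = name.lower()
        if opener == "<":
            stack.append(name)                 # element opened
        elif stack and stack[-1] == name:
            stack.pop()                        # innermost element closed
        else:
            problems.append((name, stack[-1] if stack else None))
            if name in stack:
                stack.remove(name)             # crude recovery, illustration only
    return problems

overlapping = "..bla <i> text1 <b> text2 </i>\ntext3 </b> bla.."
nested = "..bla <i> text1 <b> text2 </b></i><b> text3 </b> bla.."

print(check_nesting(overlapping))  # [('i', 'b')]
print(check_nesting(nested))       # []
```

The first input closes I while B is still open, so no tree of elements corresponds to it; the second expresses the same rendering intent with properly contained elements.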

:ar

