Writing HTML


  1. Introduction
  2. Current tools force you to learn HTML
  3. HTML Basics
    1. The standard structure of a HTML document
    2. Header elements
    3. Normal text
    4. Lists
    5. Inline images
    6. Horizontal rules
    7. Hypermedia links
    8. All about URL's
    9. Special characters
    10. Preformatted and other special paragraph types
  4. Forms
  5. Tips
  6. Good luck - and have fun!

Introduction

HTML underlies all the WWW’s pretty surface

Strange as it may seem, underneath the slick point-and-click user interface of Mosaic lies an ASCII “markup language” that can be edited on character mode terminals. When you think about it, though, it makes a lot of sense.

The WWW runs on a large and extremely heterogeneous mass of computers. There’s no common operating system. There’s no common word size. There’s no consensus on whether the least significant byte of a number comes before or after the most significant byte. Sometimes, WWW pages even get passed around via 7-bit links.

One implication of all this heterogeneity is that 7-bit ASCII is about the only data representation that can be understood by every machine in the Web. Two Windows machines running identical software could pass their internal binary data representation of Web pages back and forth over a network, but they’re going to have to do some translation before they can talk to a machine with different word sizes or byte ordering - or perhaps even just a different program version. While they could have specialized code to do translations to and from the internal formats of every machine or program that they might need to talk to, it’s easier and more reliable to create a sort of programming language that allows them to describe the Web page.

A Web editor then can use this language to describe a document, without having to know anything about the internal data structures of the Web viewers that will display it. A Web server can store documents from a variety of machines without having to know anything about their architecture - nor need it know anything about the machine that requests a document. A Web viewer can display documents that were generated on a Windows machine just as easily as it can display documents that were generated on a Mac or a Unix workstation.

As you probably already know, the particular language used by the WWW is called HTML, which stands for hypertext markup language. The hypertext part means that a Web page can contain references to other Web pages or to various net resources like gophers and ftp sites. The markup part comes from the days when book and magazine editors made special marks on their authors’ manuscripts to tell the typesetters how to format the text. This process was called markup, and the term was adopted when people started inserting formatting instructions into their computer files.

HTML is a member of the SGML family of markup languages

Over the past few decades, there have been a lot of different markup languages. Each has had an idiosyncratic solution to the common problem of differentiating the text to be formatted from descriptions of how to format it. Some have required all formatting commands to be at the start of a line and to start with a special character like . or ;, while others have enclosed formatting statements in slashes like /this/. SGML, or Standard Generalized Markup Language, is partly an attempt to stem the markup chaos by defining a common format that all markup languages can follow.

However, the important difference between SGML and older types of markup languages is not what the markup looks like but what it says. Originally, markup tags said things like ‘make this bold’ or ‘center this’ or ‘use this font’: they were concerned with how the text should look. If you wanted to change the way chapter headings looked, you had to go through and change each example. In SGML, on the other hand, the markup would say ‘this is a chapter heading’. If you change the appearance of chapter headings, the appearance of all the chapter headings changes, without any possibility of error.
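To make the difference concrete, here is a rough sketch in HTML itself. The chapter title is made up for illustration. The first form only says how the text should look; the second says what the text is, and leaves its appearance to the presenting software.

```html
<!-- Presentational: the author picks the look -->
<b>Chapter 1: Chutneys</b>

<!-- Descriptive: the author says this is a first level heading,
     and the browser decides what such headings look like -->
<h1>Chapter 1: Chutneys</h1>
```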

More generally, SGML markup contains meta-information, or information about the text. This information doesn’t have to be presentation commands. As in HTML and the WWW, it can be hypertext links. It can say things like ‘what follows is an author’s name’ or ‘here is a quote from this source’. It can contain revision history or it can say ‘here is something only for those who want all the details’.

Obviously, in the end this information does affect the presentation. The important point is that how the meta-information affects the presentation is up to the presenting software - and the user. A division has been made between content and presentation, and presentation has been made to depend on content.

This division between content and presentation should be quite familiar to anyone who’s used a word processor (like Microsoft Word) with style sheets. Of course, where a style sheet or a document file format is often specific to a particular program running under a particular operating system, SGML is a way to encode meta-information in ASCII so that it can be used by different programs on different systems.

However, SGML is not a markup language per se; rather, SGML is a format that markup languages can follow. In programming terms, SGML is an object class, while particular “SGML compliant” markup languages, like HTML, are instances of the class. In practice, this means very little more than that HTML tags look like SGML tags, and that a HTML document has both a “head”, which contains information about the document as a whole, like its title and its author, and a “body”, which contains the actual text.

You certainly don’t need to know all this about the nature of SGML to write Web pages, but as you wander the Web, you’re bound to run into references to SGML from time to time. Now you know what it is.

HTML is easy to write; hard to read

SGML commands are enclosed in angle brackets, like <this>. Most commands come in pairs that mark the beginning and end of a part of the text. The end command is a repetition of the start command, except that there is a forward slash between the opening bracket and the command name. For example, the title of a HTML document called "Habanero-Mango Chutney" would look like <title>Habanero-Mango Chutney</title>. Similarly, a word or phrase that Mosaic shows in bold type would look like <b>bold</b> type in HTML.

As you might imagine, it’s not too hard to mark up your text, but if you’re like most people, all the tags in brackets get in the way of the text. Proofreading a heavily marked-up text is very difficult. Remember, though, that the markup is meant to be read by a computer, not by you - and that there really isn’t any alternative to mixing formatting instructions in with the text. The only sort of format description that can be easily passed between different types of computers is an ASCII description.

It’s not “archaic”, it’s “retro”

If you’re much under 30, you may never have had to use a markup language before. You may accept the necessity of ASCII markup and still find all the tags messy and awkward; a painful reminder of the days before personal computers.

And of course they are awkward and painful, but to those of us to whom terminal-based text editors and early markup languages represented freedom from both typos and retyping second and third drafts, there’s something nostalgic about markup languages. Don’t get me wrong - I love how easy my GUI word processor makes complex layouts, and would never go back to runoff or anything like it - but I still get a bit of a kick out of how the ’90’s reality of heterogeneous networks has made ’70’s word processing technology new again.

Current tools force you to learn HTML

However, as I’m writing this in the summer of 1994, it really doesn’t much matter whether you like, tolerate, or despise HTML: If you want to be a Web-spinner, you have to learn HTML. No one has yet written a true ‘Web processor’, a WYSIWYG word processor that ‘just happens’ to read and write HTML files. The current exponential growth of the Web is naturally spawning a lot of Web writing tools, so I’m sure we’ll see good tools someday, but for now we have to use word processors, programming editors, or ill-conceived "HTML editors" that display the tags, not the effects.

HTML editors -- not ‘WWW editors’

Many people do prefer a HTML editor over a word processor like Word, an unadorned text editor like Windows’ Notepad, or a programmer’s editor like Unix’s vi or those in a Borland language’s IDE. For one thing, most of the HTML editors will do word wrap within a paragraph; while almost all word processors can wrap even "non-document" files, Notepad and the programmer’s editors cannot. Once you’ve gotten used to an editor wrapping your paragraphs for you, doing it by hand seems exceedingly annoying. Of course, Web browsers don’t care if your HTML source file is neatly word wrapped (they’ll fit the text to the viewer’s screen, and will ignore any line breaks in the source) but it’s sure easier for you, the author, to read your source text if words never start on one line and end on another.

More substantially, it is easier to start writing HTML with a HTML editor than with a basic text editor because the HTML editor typically offers some sort of menu of tags. This can be helpful, because the HTML tag set is hard to learn, having all the consistency and predictability of BASIC. Some tags, like <b> and <dl>, are just initials. Others, like <pre> or <img>, are arbitrarily abbreviated. There are also some, like <blockquote> and <address>, which are full words or phrases. (And, of course, obviously a bulleted list is an "unordered list", or <ul>.)

The final advantage of a HTML editor is that when it inserts tags for you, it inserts both the beginning and the end tag, thus greatly reducing the chance that your whole document will end up in the <h1> (first level header) style, or that a bold word will become three bold paragraphs.
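As a small illustration of the sort of mistake this prevents (the text is invented for the example), here is a header whose end tag has been forgotten, followed by the corrected version. With the first form, a browser has no way to know where the header was supposed to stop, so the text after it may well be drawn in the header style too.

```html
<!-- Oops: no end tag, so the following text may be swallowed by the header -->
<h1>Habanero-Mango Chutney
This was supposed to be normal text.

<!-- Both tags present: the header ends where it should -->
<h1>Habanero-Mango Chutney</h1>
This is normal text.
```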

You see the markup, not the effect.

On the other hand, if you’re already quite comfortable with a word processor that will wrap raw text ("non-document") files or with some programming editor, you may well prefer to use that to write HTML than to learn to use an entirely new editor that offers only a little ‘mnemonic relief’ in creating your Web pages. Yes, a menu of tags and automatic insertion of closing tags puts less burden on your memory, but a non-WYSIWYG "HTML editor" that displays tags and that uses a single font for all text (whether plaintext or emphasized, 1st level headers or preformatted source code listings) just isn’t all that helpful. All those tags make your file hard to read - <b>this</b> doesn’t look much like this!

You still have to read HTML

This points to another problem with a non-WYSIWYG editor: While a tag menu may make it easier to insert a numbered list, it won’t help much when you come back to your document in a week or a month and can’t remember that the <ol> tag is a numbered list. That is, since the HTML editor shows the tags and not their effects, you still have to learn to read HTML to do any sort of serious Web spinning. And, since the HTML editor let you start writing without knowing all that much HTML, you might quickly find yourself in a position where you can’t read your own work!

There are a few WYSIWYG solutions

As you can gather, I don’t use any of the existing HTML editors. I think the very idea is wrong: We need word processors that read and write HTML without our having to know HTML, not specialized text editors that only make it a little easier to add HTML tags. While such "second generation" tools are starting to appear, they really aren’t all that good yet, and here and now you do still have to know HTML to create Web pages.

‘Piggyback editors’

One approach to WYSIWYG Web editing is to somehow piggyback off an existing word processor. This has two advantages: Users don’t have to learn a whole new program, and programmers have to write a lot less code. The latter advantage undoubtedly accounts for the fact that the first WYSIWYG approaches to editing HTML were extensions of existing word processors.

Unfortunately, there are also two disadvantages. The first is that converting a file from your word processor’s format to HTML can be slow. Even if conversion is relatively fast, inserting an extra step between editing and previewing slows you down and makes you less productive. The second, more substantial, disadvantage is that HTML formats that don’t correspond directly to the word processor’s formats are either not supported or only supported in an awkward and round-about way. For example, inserting a hypertext link with a RTF to HTML converter requires you to use a special format for the ‘pointer’ and a different special format for the anchor.

RTF to HTML converters

One example of a WYSIWYG piggyback is a RTF to HTML translation program. This sort of program takes a RTF [or Rich Text Format] file generated by a word processing program and compiles it to an HTML file. I haven’t actually used one of these programs, so I don’t know anything that I haven’t read on Usenet, but:

At least in principle, this is a "portable" solution. You should be able to use a RTF to HTML translator on just about any system that has a word processor that can generate RTF files, because the translator is a "batch" operation which requires little or no localization to compile for just about any platform. Or, you could take a RTF file from your own machine and convert it to HTML on the system that you will use to distribute your Web pages. However, RTF has been said to stand for Redefine The Format, and portability may be more theoretical than actual: You may find that any particular translator will only work with one particular version of one particular word processor.

Another disadvantage is that using a translator turns a simple ‘reality test’ into a three step operation: Save the document; translate it; load it into a Web viewer like Mosaic. (It’s even worse if you have to upload the file to another system before you can translate it.) While this may be less trouble than using a non-WYSIWYG editor that forces you to decode HTML tags in your head, it’s certainly more of an annoyance and productivity decreaser than a word processor add-in that can save documents directly to HTML.

cu_html

Lets you use W4W; maps Word styles to WWW styles
cu_html, which was developed at The Chinese University of Hong Kong, whence the "cu" in the name, is just such an add-in for Word for Windows. cu_html automatically translates character attributes like bold and italic to the appropriate HTML tags. It maps standard W4W paragraph styles like Normal and Heading 1 to the HTML equivalents, and adds W4W styles for HTML paragraph types like numbered and bulleted lists, preformatted text, horizontal rules, and addresses. It adds tools to the toolbar and Tools menu to insert images and links, and to save the W4W document as a HTML file. In my opinion, cu_html is the best currently available tool for writing HTML on a Windows system.

A little buggy and only supports a subset of HTML
This doesn’t mean I think it’s perfect, though! It has some bugs, particularly in the way it handles "block formats" like preformatted and address, and only supports some of HTML: no forms; no character attributes except bold, italic, and underline; no blockquotes; and no menu, directory or definition lists. (I’ll explain these terms later.) While you can do an awful lot without these missing constructs, the definition list is particularly useful. Between the bugs and the lack of definition lists, you might have to do some manual touchup of your HTML.

Can only write HTML; can’t read it
This will bring you face-to-face with cu_html’s second biggest problem: It can only translate from W4W to HTML, not the other way around. If you make any changes in the HTML - from adding a definition list to changing a link’s path - you will lose them next time you change the document and tell cu_html to save it as a HTML file.

To some extent you can avoid this problem by only using cu_html for Web pages that don’t need any of the features that it doesn’t fully support, but it’s hard to avoid it entirely. If you develop your pages on a Windows system but upload them to a Unix system for distribution, you’ll find yourself tempted to make trivial changes directly to the Unix files because it’s so much easier than changing the cu_html "source", compiling it to HTML, then uploading the new file to your distribution system and replacing the old file. Of course, once you do this, your W4W document no longer matches your Web document, and if you don’t make changes to both documents, you risk losing that "trivial change" the next time you make a more substantial change to the source.

Slow compiler
A final problem with cu_html is that it’s rather slow. It may not be too bad, if you have the latest Pentium system with a video accelerator card, but on older, more humble systems, watching cu_html make several passes through your document, changing one feature at a time, quickly goes from interesting to annoying.

HoT MetaL

HoT MetaL is a new (as I write this) nearly-WYSIWYG editor for Windows and X-Windows (Unix) systems. As you might expect from the cutesy way the authors embedded HTML in the name HoT MetaL, it suffers to some extent from the same conceptual failings as "HTML editors": It’s too much a HTML tool and not enough a writing tool. Thus, its default mode shows HTML tags in silly little hexagons, and you insert HTML "entities" instead of "foreign characters" or "special symbols". Still, it does have a WYSIWYG mode that turns off the tags and shows your document somewhat the way Mosaic will.

Only sort of WYSIWYG

HoT MetaL doesn’t do what you might expect of a WYSIWYG Web editor, which is to find your Mosaic configuration file and use the font choices stored there. Instead, it chooses some fonts of its own, seemingly at random. Of course, you can choose your own fonts in HoT MetaL and perhaps its bizarre defaults aren’t wholly a bad thing, since you have no control over what fonts your readers will have selected, and seeing your document with strange fonts may help guard against any unconscious assumption that your readers will see what you do, but you really shouldn’t have to choose fonts in both Mosaic and your editor. Perhaps by the time you read this, there will be a new version of HoT MetaL which fixes this problem.

Perhaps, too, a new version will show images and hyperlinks more like the way Mosaic will actually show them. Instead of showing an image, or the standard inline image ‘place holder’ bitmap, HoT MetaL just shows the image file name in square brackets [src: like this]. Similarly, instead of showing a hypertext anchor in blue, it shows the link data in square brackets [href: like this] and doesn’t do anything special to indicate the anchor’s length. (What’s with the "src" and "href" you ask? That’s HTML showing through.)
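For the curious, the markup behind those bracketed labels looks roughly like this. The file names are placeholders of my own, not anything HoT MetaL produces; src and href are simply the parameter names of the image and anchor tags.

```html
<!-- An inline image: src names the image file to display -->
<img src="chutney.gif">

<!-- A hypertext anchor: href names the document the link leads to -->
<a href="recipes.htm">More recipes</a>
```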

Slow and picky

If these were the only problems, you might find HoT MetaL usable, but it’s also incredibly slow and perversely picky. On a 386-33, it’s so far from being able to keep up with my (relatively slow) typing as to be ludicrous. Of course, if your machine is an order of magnitude faster than this four year old clunker, then HoT MetaL might be merely annoyingly sluggish.

If so, you might find as I did that it can only read perhaps 10% of the HTML files that Mosaic can display with no problem at all. The authors claim that it conforms to the nascent HTML 2.0 standards, but it seems to me to reject perfectly fine files. In any event, as far as I’m concerned, a Web editor that can’t read everything that a Web browser can display is useless. It’s one thing to correct "bad" HTML, but it should be able to read it.

State-of-the-art?

Why do I spend so much time on a bad editor? Sad as it may seem, in some ways HoT MetaL is the state of the art in Web editors in the late summer of 1994. It’s the first even vaguely WYSIWYG "native" HTML editor. There will probably be better editors available by the time you read this, but given where they’re coming from, they may have a long way to go for some time yet.

Expect to do manual touchup

Given that the best available Web editors don’t support all of HTML, or are slow and picky, you should expect to have to do manual touchup - or even to write HTML by hand - for some time to come. And, as I’ve already mentioned, if you are distributing your pages on a different system than you’re writing them on, you may find yourself doing some manual touchup well into the 21st Century.

The bottom line here is you’ll have to learn to read and write HTML if you want to be a Web spinner. Future tools will greatly diminish this necessity, but I suspect that you’ll always have to have at least some familiarity with HTML if you want to publish anything more than the very most basic documents on the Web.

HTML Basics

All HTML files consist of a mixture of text to be displayed and HTML tags, which describe how to display the text. Normally, extra whitespace (spaces, tabs, and line breaks) is ignored, and text is displayed with a single space between each word, no matter how much whitespace separates them in the HTML source file. Text is always wrapped to fit within the reader’s browser’s window in the reader’s choice of fonts: line breaks in the HTML source are treated just like any other whitespace, and a paragraph break must be explicitly marked with the HTML <p> tag.
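Here is a small sketch of what this means in practice. The text is invented for the example, and the ragged spacing is deliberate.

```html
Habanero-Mango      Chutney
is   hot,
        sweet, and     easy to make.
<p>
It also keeps well.
```

A browser should show this as just two paragraphs, "Habanero-Mango Chutney is hot, sweet, and easy to make." and "It also keeps well.", each wrapped to fit the reader’s window. Only the <p> tag produces a break; the line breaks and extra spaces in the source do not.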

Tags are always set off from the surrounding text by angle brackets, or the less-than and greater-than signs. Most tags come in "begin" and "end" pairs: e.g., <i>e.g.</i>. That is, the end tag looks just like the begin tag, except that it has a slash between the opening bracket and the tag name. Just to keep things legible, I’ll speak of tag pairs, instead of always writing out the begin and end form. That is, I’ll say "an ADDRESS tag pair" instead of "an <address></address> pair".

There are a few tags which appear by themselves, not paired with an end tag. Since most tags do come in pairs, I’ll take particular care to point out the exceptions as they come up.

HTML is case insensitive: <HTML> means exactly the same thing as <html> or even <hTmL>. However, many Web servers are running on Unix systems, which are case sensitive. This will never affect HTML interpretation, but will affect your hyperlinks: My.gif is not the same file as my.gif or MY.GIF.
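A quick sketch of the distinction, using made-up file names: the case of a tag name never matters, but the case of a file name inside a tag may.

```html
<!-- These three lines mean exactly the same thing to a browser -->
<B>hot</B>
<b>hot</b>
<b>hot</B>

<!-- On a Unix server, these may refer to three different files -->
<img src="My.gif">
<img src="my.gif">
<img src="MY.GIF">
```

The safest habit is probably to pick one convention for file names (all lower case, say) and stick to it in both your file names and your links.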

Some begin tags can take parameters, which come between the tag name and the closing bracket like this: <dl compact>. Some tags, like character attributes and most paragraph types, are never parameterized. Others, like definition lists, have optional parameters that will alter their appearance, if your reader’s browser supports that option. Still others, like anchors and images, absolutely require certain parameters (like where the hyperlink should go to, or what image file to display) and can also take other, optional parameters.
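The three cases might look something like this. The file names and text are placeholders, and the tags inside the list are only there to give it some contents.

```html
<!-- No parameters: character attributes never take any -->
<b>bold</b>

<!-- An optional parameter: a browser that supports it shows the list more compactly -->
<dl compact>
<dt>Habanero
<dd>A very hot chili pepper
</dl>

<!-- Required parameters: href for an anchor, src for an image.
     The alt text on the image is an optional extra. -->
<a href="chutney.htm">Chutney recipe</a>
<img src="mango.gif" alt="A mango">
```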

The standard structure of a HTML document

All HTML documents have a certain standard structure. Currently, there is no great incentive to follow this structure: Mosaic will treat any file that ends in .htm as an HTML file, even if it contains no HTML tags; Mosaic will not interpret a file that does not end in .htm as HTML, even if it’s totally compliant and follows the rules to the last detail. However, Mosaic is neither a finished product nor the only possible Web browser: Future browsers may interpret any compliant file as an HTML file, regardless of its name, or may just "work better" with compliant documents than with non-compliant documents.

It’s even possible that future browsers will totally refuse to interpret non-compliant documents as HTML: While the majority of existing Web pages are non-compliant in one way or another, if the current exponential growth of the Web continues, all currently existing Web pages will be just a small fraction of all Web pages at some point in the relatively near future. That is, if everyone starts writing compliant documents, all that is lost in switching to a browser that can’t handle non-compliant documents will be a few of the oldest documents. While I personally suspect that market forces will be unkind to browsers that don’t handle non-compliant documents at least as well as current ones do, it’s still a good idea not to knowingly write non-compliant documents.

<html> </html>

All HTML documents should be contained within an HTML tag pair. That is, documents should start with <html> and end with </html>. While this level of compliance is optional so far, as current versions of Mosaic don’t particularly care if text appears outside of a HTML tag pair, this behavior is not guaranteed, and one should not place text outside of the HTML tag pair.

In particular, it is at least possible that there may someday be a reason to have two or more SGML content types in one document. For example, documents might contain the same text marked up in HTML and in "HTML2". If this ever happens, browsers probably will ignore any text that’s outside of a SGML content type tag pair.

<head> </head>

All HTML documents are divided into a header which contains the title and other information about the document, and a body which contains the actual document. That is, the HTML tag pair should contain a HEAD tag pair, followed by a BODY tag pair. See the Header elements section for information on the <title> tag and other tags that belong only in the document header.

While you should not place display text anywhere outside the body section, this too is currently optional, as Mosaic will format and display any text that’s not in a tag, whether it’s in the body of the document or not. Also, while you can get away with not using the HEAD tag pair, it’s strongly recommended: using it allows software like Web-crawling robots to look at only the information that belongs in the header, without having to retrieve and/or parse the whole document.

<body> </body>

The body of the document should contain the actual contents of the Web page. The tags that appear within the body do not separate the document into sections. Rather, they’re either special parts of the text, like images or forms elements, or they’re tags that say something about the text they enclose, like character attributes or paragraph styles.

Headings and paragraphs
In some ways, a HTML text is a series of paragraphs. Within a paragraph, the text will be wrapped to fit upon the reader’s screen. In most cases, any line breaks that may appear in the source file are totally ignored, although they do make the source easier to read. In the couple of places where line breaks do matter, they can be either a DOS-style CR-LF pair or a Unix-style bare LF.

Paragraphs are separated either by an explicit paragraph break command, <p>, or by paragraph style commands. The paragraph style determines both the font used for the paragraph and any special indenting. Paragraph styles include several levels of section headers, five types of lists, three different "block formats", and the normal, or default paragraph style. Any text outside of an explicit paragraph style command will be rendered in the normal style.
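As a rough sketch, with invented text, here is a stretch of body text that uses both ways of separating paragraphs. The section headers end the paragraph before them on their own, so the <p> is only needed between the two normal paragraphs.

```html
<h1>Chutneys</h1>
Chutneys are spiced relishes.
<p>
They go well with grilled food.
<h2>Habanero-Mango Chutney</h2>
This one is not for the timid.
```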

<address> </address>
The last part of the document body should be an ADDRESS tag pair, which contains information about the author and, often, the document’s copyright date and/or revision history. While the address block is not a required part of the document in the same way that the header or the body is, official style guides urge that all documents have one. In current practice, while "most" documents use the HTML, HEAD, and BODY tag pairs, almost all documents have address blocks - perhaps because the address block is actually visible.
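A typical address block might look something like the following. The name, mail address, and date are placeholders of mine, not a required format.

```html
<address>
Jane Author, jane@example.com. Last revised August 1994.
</address>
```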

Figure 1 - The standard structure of a HTML document

<html>
    <head>
        <title>The document title</title>
    </head>
    <body>
        Text and markup
        <address>
            Author and version info
        </address>
    </body>
</html>

Header elements

<title> </title>

Every document should have a title. How browsers show the title varies from system to system and browser to browser. It may be the window title, or it may appear in a pane within the window, but the title should be short: 64 characters or less. The title should not contain anything but simple text: No markup or line breaks. Well-written browsers will probably ignore any extraneous characters, while partially debugged ones may act rather strangely if there is anything ‘funny’ in the title.

The title should appear in the head section, marked off with a TITLE tag pair. For example, <title>Lime-Jerked Chicken</title>. Mosaic actually has such an easy-going parser that the title can appear anywhere in the document, even after the </html>, but future browsers might not be quite so clever and accommodating.

Other <head> elements

There are a few optional HTML elements which may only appear in the document’s header. That is, they are not required, but if you do use them, they must be within the HEAD tag pair. The header elements that browsers pay attention to are the <base> and <isindex> tags. Both are empty, or solitary, tags which do not have a closing </...> tag and thus do not enclose any text.

The base tag contains the current document’s URL, or Uniform Resource Locator; browsers can use it to find "local URL’s". See the section All About URL’s for more information about the base statement.
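Here is a minimal sketch of a base tag at work. The server name and file names are made up for the example; the base tag takes the document’s URL in its href parameter.

```html
<html>
<head>
<title>Habanero-Mango Chutney</title>
<!-- The full URL of this very document (a made-up address) -->
<base href="http://www.example.com/recipes/chutney.htm">
</head>
<body>
<!-- A local URL: with the base above, a browser can resolve it to
     http://www.example.com/recipes/mango.gif -->
<img src="mango.gif">
</body>
</html>
```

The point is that the image can still be found even if a reader has saved a copy of the page somewhere else, since the browser works out the local URL from the base and not from wherever the copy happens to live.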

The isindex tag tells browsers that this document is an index document, which means that the server can support keyword searches based on the document’s URL. Searches are passed back to the Web server by concatenating a question mark and one or more keywords to the document URL and then requesting this extended URL. This is very similar to one of the ways that forms data is returned, and you can see the section on Form action and method attributes for somewhat more information.
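In outline, an index document might look like this. The title and text are invented, and the page is only useful if the server behind it really does know how to handle searches.

```html
<html>
<head>
<title>Recipe index</title>
<!-- Tells the browser that this document can be searched by keyword -->
<isindex>
</head>
<body>
Enter one or more keywords to search the recipes.
</body>
</html>
```

Assuming this document lives at the made-up URL http://www.example.com/recipes/index.htm, a search for the keywords mango and chutney would ask the server for http://www.example.com/recipes/index.htm?mango+chutney, with a plus sign between the keywords.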

There are other header elements, like <nextid> and <link>, that are included in HTML for the benefit of editing and cataloging software. They have no visible effect; browsers simply ignore them, and so does this book.

Normal text

Any paragraph within the body that’s not explicitly tagged is Normal Text

Most Web pages are composed mostly of unadorned, or plain, text. Any text that appears outside of a format command tag pair is displayed as plaintext. That is, any text outside of a character attribute tag pair or paragraph style is plaintext.

Word-wrapped by the client, depending on the user’s window and font sizes

Plaintext, like every other type of paragraph style except the preformatted style, is wrapped at display time, to fit in the reader’s window, using the reader’s choice of fonts. I know I’ve said this before, but it bears repeating, as what you see is not likely to be what the reader gets. A larger or smaller font or window size will result in a totally different number of words on each line, so don’t do anything like change the wording of a sentence to make the line breaks come at particularly appropriate places!

<br>

If line breaks are important, as in postal addresses or poetry, you can use the <br> command to insert a line break. Subsequent text will appear one line down, on the left margin.

For example, to keep

PC Techniques Bookstream
7721 East Gray Road, Suite 204
Scottsdale, Arizona 85260-6912

from coming out as

PC Techniques Bookstream 7721 East Gray Road, Suite 204 Scottsdale, Arizona 85260-6912

you would write

PC Techniques Bookstream<br>
7721 East Gray Road, Suite 204<br>
Scottsdale, Arizona 85260-6912<br>

The <br> command is one of the few empty HTML tags; it never has a closing </br>. While the empty header elements make assertions about the document as a whole, empty body elements can be thought of as special characters. That is, a <br> can be thought of as a special "end of line" character.

<p>

The <br> command causes a linebreak within a paragraph, but more commonly what we want to do is to separate one paragraph from another. We can do this by enclosing each paragraph in a P tag pair, starting the paragraph with <p> and ending it with </p>. The actual appearance of the paragraphs will depend on your reader’s Web browser, and perhaps on her preferences: paragraph breaks may be shown with an extra line or half line of spacing, a leading indent, or both.

The </p> tag is actually optional, and most documents simply put a <p> at the beginning or end of each paragraph, or by itself on a line between two paragraphs.
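A minimal sketch of two paragraphs, each enclosed in a P tag pair, might look like this. The wording is only filler, and leaving off the closing tags should give the same result.

```html
<p>
This is the first paragraph. The line breaks in the source
are ignored, and the browser wraps the text to fit the window.
</p>

<p>
This is the second paragraph. How much space separates it from
the first is up to the reader's browser.
</p>
```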

Logical and physical attributes

Character attribute commands let you emphasize words or phrases within a paragraph. HTML has two different types of character attributes: Physical and logical. Physical attributes include the familiar bold, italic, and underline, as well as a tty attribute for monospaced text, such as appears on a CRT or in a program’s source file.

Logical attributes are different. In keeping with the SGML concept of describing the content, not the formatting, logical attributes let you describe what sort of emphasis you want to put on a word or phrase, but leave the actual appearance up to the reader, or her browser. That is, where a <b>bold</b> word will always appear in bold type, an <em>emphasized</em> word may be italicized, underlined, bolded, or colored, as the reader prefers.

Web style guides suggest that you should use logical attributes whenever you can, but there’s a slight problem: some current browsers only support some physical attributes, and few or no logical attributes. Since Web browsers simply ignore any HTML tag that they don’t "understand", being "nice" and deferring to your readers’ sensibilities by using logical attributes runs the risk that they will not see any emphasis at all! The compromise I have adopted is to start using logical attributes as the current version of Windows Mosaic supports them; since X-Mosaic seems to always be a step or two ahead of Windows Mosaic, this means that I’m only using the logical attributes that most people can see.

List of physical attributes

Attribute   Tag    Sample                                     Effect
Bold        <b>    Some <b>bold</b> text.                     Some bold text.
Italic      <i>    Some <i>italicized</i> text.               Some italicized text.
Underline   <u>    Some <u>underlined</u> text.               Some underlined text.
TTY         <tt>   Some <tt>monospaced (tty)</tt> text.       Some monospaced (tty) text.

Nesting of attributes is allowed, though the results will vary from browser to browser. Some can, for example, show bold italic text, while others will only show the innermost attribute. (That is, <b><i>bold italic</i></b> may show as bold italic on one browser and as plain italic on another.) If you do use nested attributes, be sure to place the end tags in reverse order of the start tags; that is, don’t ever write something like <b><i>bold italic</b></i>! While this will work on some Web browsers, it may cause problems with others.

As of the time I’m writing this, Windows Mosaic supports the bold and italic attributes, but not the underline or tty attributes.
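Put together, a paragraph using the physical attributes might look something like the sketch below. The sentences are made up for illustration, and how each attribute actually appears will depend on the browser.

```html
<p>
Some <b>bold</b>, <i>italic</i>, and <tt>monospaced</tt> text.

<p>
<!-- Nested attributes: the inner tag is closed first -->
A <b><i>bold italic</i></b> phrase, with the end tags in reverse order of the start tags.
```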

List of logical attributes

Attribute    Tag        Use and/or interpretation                                           Typical rendering
Citation     <cite>     Titles of books and films                                           Italic
Code         <code>     Source code fragments                                               Monospaced
Definition   <dfn>      A word being defined                                                Italic
Emphasis     <em>       Emphasize a word or phrase                                          Italic
Keyboard     <kbd>      Something the user should type, word-for-word                       Bold monospaced
Sample       <samp>     Computer status messages                                            Monospaced
Strong       <strong>   Strong emphasis                                                     Bold
Variable     <var>      A description of something the user should type, like <filename>    Italic

An important point to bear in mind here is that even if current browsers arbitrarily decide that <em> will show as italic and <kbd> as Courier, future browsers will probably defer more to their users’ wishes. That is, you shouldn’t conclude that citations, definitions, and variables all look alike so you should just ignore them and use italic: Future browsers will probably let their users select fonts for each logical attribute, just as they now can for each paragraph style.

As of the time I’m writing this, Windows Mosaic supports only the <em> and <strong> attributes.
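Here is a short sketch of the logical attributes in use. The command name, the status message, and the sentences themselves are invented for the example; the point is only that each tag describes what the phrase is, not how it should look.

```html
<p>
See <cite>Alice In Wonderland</cite> for details.

<p>
An <dfn>anchor</dfn> is the part of a link that the reader clicks on.

<p>
Type <kbd>edit</kbd> <var>filename</var> and press Enter.
If the file is missing, you will see <samp>File not found</samp>.

<p>
This is <em>important</em>, and this is <strong>very important</strong>.
The statement <code>count = count + 1;</code> is a source code fragment.
```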

Headings

HTML includes six levels of section headers, <h1> through <h6>. While these are typically short phrases that fit on a line or two, the various headers are actually full-fledged paragraph types: they can even contain line and paragraph break commands. (As with most other paragraph types, any linebreaks or extra whitespace in the source file will be ignored.)

Since they are paragraph types, you mark a phrase as a header by putting it within a tag pair. You do not include the section’s text within the header tag pair: Header tags are not outlining commands.

Although many documents include the title (or an extended version of it) at the top of the document body as a first level heading, how you use headings is entirely up to you. There is no requirement that you use a <h1> before you use a <h2>, or that a <h4> follow only a <h3> or another <h4>. (There is, however, at least a chance that skipping header levels will confuse non-commercial format conversion software, such as might convert HTML to PostScript. It’s up to you to decide whether to worry about this possibility.)

Please remember, though, that you can assume nothing about a header’s actual appearance, not even that any header will be larger than plaintext.
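To make the "headers are not outlining commands" point concrete, here is a small sketch with invented section text. Only the short phrases sit inside the header tag pairs; the text of each section follows its header.

```html
<h1>Writing HTML</h1>
<p>
This text introduces the whole document. It comes after the header, not inside it.

<h2>Normal text</h2>
<p>
This text belongs to the section, but only the phrase above is marked as a header.
```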

Lists

It’s a bit of an understatement to say that HTML has extensive support for lists. In fact, HTML has five different list types, which certainly meshes with all my stereotypes about Switzerland, where the WWW began.

All five list elements can be thought of as a sort of paragraph type, as the list is enclosed within a list tag pair. The first four list types share a common syntax, and differ only in how they format their list elements; the "description" list is unique in that each list element has two parts, a tag and a description of the tag.

All five list types display some sort of element tag - whether a number, a bullet, or a few words - on the left margin, and then the actual list elements appear indented. List elements do not have to fit on a single line or consist of a single paragraph: they may contain <p> and <br> tags.

Lists can be nested, but the appearance of a nested list depends on the browser. For example, some browsers use different bullets for inner lists than for outer lists, and some browsers do not indent nested lists. However, Mosaic and Lynx, which are probably the most common graphical and text mode browsers, do indent nested lists: the tags of a nested list line up with the elements of the outer list, and the elements of the nested list are further indented. For example,

* This is the first element of a bulleted list.
  * This is the first element of a nested list
  * This is the second element of the nested list
* This is the third element of the main list.

In the four list types with ‘simple’ list elements, the list item tag, <li>, is used to mark the start of each list element. Like the line break tag, <br>, there is no corresponding </li> tag; unlike the <br> tag, the <li> tag always appears at the start of a list element, not at the end.

Thus, all simple lists look something like

<ListType>

<li>
There isn’t really any ListType list, however the OL, UL, DIR, and
MENU lists all follow this format.

<li>
Since whitespace is ignored, you can keep your source legible by
putting blank lines between your list elements. Sometimes, I like to put the &lt;li&gt; tags on their own lines, too.

<li>
(If I hadn’t used the ampersand quotes in the previous list element,
the "&lt;li&gt;" would have been interpreted as the start of a new
list element.)

</ListType>

Numbered

In HTML, numbered lists are referred to as ordered lists, and so the list type tag is <ol>. As with other lists, numbered lists can be nested, but some browsers (especially older versions of Mosaic) get confused by the close of a nested list, and start numbering the subsequent elements of the outer list from 1.
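A nested numbered list might be written like the following sketch; the steps themselves are just filler. On a browser that handles nesting properly, the last item of the outer list should be numbered 3.

```html
<ol>
<li>Write the document.
<li>Check it in more than one browser:
  <ol>
  <li>A graphical browser, such as Mosaic.
  <li>A text mode browser, such as Lynx.
  </ol>
<li>Publish it.
</ol>
```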

Bulleted

If a numbered list is an ordered list, what could an unnumbered, bulleted list be but an unordered list? Isn’t it just so obvious? (Excuse me while I mutter darkly about refugees from the BASIC ‘Design’ Labs.)

All sarcasm aside, the tag for an unordered (bulleted) list is <ul>. While bulleted lists can be nested, you should keep in mind that the list nesting may not be visible: Some browsers indent nested lists; some don’t. Some use multiple bullet types; others don’t.

Directory and menu lists

The directory and menu lists are special types of unordered lists. The menu list, <menu>, is meant to be visually more compact than a standard unordered list: Menu list items should all fit on a single line. The directory list, <dir>, is supposed to be even more compact: All list items should be less than 20 characters long, so that the list can be displayed in three (or more) columns.

In practice, I’m not sure if I’ve ever seen these lists in use, and their implementation is still spotty: Current versions of Mosaic do not create multiple columns for a <dir> list, and while they let you choose a directory list font and a menu list font, they do not actually use these fonts.

Description

The description list, or <dl>, does not use the <li> tag the way the simple lists do. Each description list element has two parts, a tag and its description. Each tag begins with a <dt> tag, and each description with a <dd> tag. As with the <li> tag, these appear at the start of the list element, and are not paired with </dt> or </dd> tags.

The description list looks a lot like any other list, except that instead of a bullet or a number, the list tag consists of your text. Description lists are intended to be used for something like a glossary, where a short tag is followed by an indented definition, but the format is fairly flexible. For example, a long tag will wrap, just like any other paragraph, although it should not contain line or paragraph breaks. (Mosaic will indent any <dt> text after a line or paragraph break, as if it were the <dd> text.) Further, you needn’t actually supply any tag text: <dt><dd> will produce an indented paragraph.
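A small glossary, for instance, could be sketched like this. The two entries are only examples; any short tags and longer descriptions would do.

```html
<dl>
<dt>Anchor
<dd>The highlighted text or image that the reader clicks on to follow a link.
<dt>URL
<dd>A uniform resource locator, which tells the browser where to find a document.
</dl>
```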

Compact vs standard

Normally, a description list puts the tags on one line, and starts the indented descriptions on the next:

Tag 1
    Description 1
Tag 2
    Description 2.

If you’d like a tighter look, you can ask for a <dl compact>. If the tags are very short, some browsers will start the descriptions on the same line as the tags:

Tag 1 Description 1
Tag 2 Description 2

However, most browsers do not support the compact attribute, and will simply ignore it: for example, with current versions of Windows Mosaic, a <dl compact> will always look like a <dl>, even if the tags are very short.

Inline images

With nothing but text attributes, section headers, and lists, we can build attractive looking documents, in a device independent way. We may not know exactly how they will look on any given reader’s screen, but we can be sure that they will be functionally equivalent to what we see on ours. The next step is to add pictures.

<img ...>

The <img> tag lets us insert inline images into our text. This tag is rather different from the tags we’ve seen so far. Not only is <img> an empty tag that always appears alone, not as part of a pair, the <img> tag has a number of parameters between the opening <img and the closing >. That is, the image file name and some optional modifiers are included in the tag, not between an <img> and a </img> tag pair. You can think of an <img> as a single character, fully specified by its single tag.

Every <img> tag must have a src= parameter. This specifies a URL, or uniform resource locator, which points to a .gif or .xbm bitmap file. See the section All About URL’s, below, for more information; for now, I will simply note that when the bitmap file is in the same directory as the HTML document, the filename is an adequate URL. For example, <img src=MySmilingFace.gif> would insert a picture of my smiling face.

Of course, some people turn off inline images because they have a slow connection to the Web. When they do so, all your images, no matter what size, will be replaced with a standard graphic. This isn’t so bad if the picture is essentially ancillary to the text, but if you’ve used small inline images as "bullets" in a list or as section dividers, the substitution of the placeholder graphic will usually make your page look rather strange. Some people avoid using graphics as structural elements for this reason; others simply don’t worry about people with slow connections; still others include a note at the top of the page saying that all the images on the page are small, and inviting people with inline images off to turn them on and reload the page.

It’s also important to remember that some people use text-only browsers, like Lynx, to navigate the WWW. If you include a short description of your image with the alt= parameter, text-only browsers can show something in place of your graphic. For example, <img src=MySmilingFace.gif alt="A picture of the author">.

Since the alt parameter has spaces in it, we have to put it within quotes. In general, you can put any parameter value in quotes, but only have to do so if it includes spaces. If your parameter value includes a < or a >, you’ll have to use the ‘escape’ mechanism described below, in the Special Characters section.

Mixing images and text

You can mix text and images within a paragraph; an image does not constitute a paragraph break. However, Web browsers like Mosaic will not wrap paragraphs around images; they will display a single line of text to the left and/or right of an image. Normally, any text in the same paragraph as an image will be lined up with the bottom of the image, and will wrap normally below the image. This works well if the text is essentially a caption for the image, or if the image is a decoration at the start of a paragraph. However, when the image is a part of a header, you may want the text to be centered vertically in the image, or to be lined up with the top of the image. In these cases, you can use the optional align= parameter to specify align=top, align=middle, or align=bottom.
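The sketch below shows the align= parameter in both situations. The file name logo.gif is hypothetical, and the header and caption text are made up.

```html
<!-- An image in a header, with the header text centered vertically on it -->
<h2><img src="logo.gif" alt="[Logo]" align="middle"> Inline images</h2>

<!-- An image followed by a single line of text, lined up with the top of the picture -->
<p>
<img src="MySmilingFace.gif" alt="A picture of the author" align="top">
The author, smiling.
```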

Multiple images per line

Since an image is treated in some ways as a single (rather large) character, you can have more than one image on a single line. In fact, you can have as many images on a line as will fit in your reader’s window! Of course, if you put ‘too many’ images on a single line, the browser will wrap the line, and your images will appear on two or more lines - so don’t specify a series of images that rely for their effect on all being on one line. Conversely, if you don’t want images to appear on the same line, be sure to place a <br> or <p> between them.

IsMap

The optional IsMap parameter allows you to place hyperlinks to other documents “in” a bitmapped image. See the Many anchors in an image section for somewhat more detail.

Summary of <img> parameters

Parameter   Required?   Value
SRC         Yes         URL
ALT         No          A text string
ALIGN       No          TOP, MIDDLE, or BOTTOM
ISMAP       No          None

Horizontal rules

The <hr> tag draws a horizontal rule, or line across the screen, to separate parts of your text. It’s fairly common to use one between the ‘body of the body’ and the ADDRESS block at the end. It’s also fairly common to put a rule before and after a form, to help set off the user entry areas from the plaintext. Beyond these two cases, there are no real standards; you can use as many or as few rules as your text and taste dictate.

Many people use small inline images for decoration and separation, instead of rules. While using images this way lets you customize your pages’ appearance, it also makes them take longer to transfer - and it makes them look horrible with inline images off.

Hypermedia links

Text and decorations make up a single page. The ability to add links to other Web pages or to entirely different sorts of documents is what makes the Web a hypermedia system. The special sort of highlight that your reader clicks on to traverse a hypermedia link is called an anchor, and all links are created with the anchor tag, <a>.

Links to other documents

While you can define a link to another point within the current page, most links are to other documents. Links to points within a document are very similar to links to whole documents, but they are slightly more complicated, so we will talk about them later, in the section on Links to anchors.

Each link has two parts: The visible part, or anchor, which the user clicks on, and the invisible part, which tells the browser where to go. The anchor is the text in between the <a> and </a> tags of the A tag pair, while the actual link data appears in the <a> tag.

Just as the <img> tag had a src= parameter which specified an image file, so does the <a> tag have an href= parameter which specifies the hypermedia reference. Thus, "<a href=SomeFile.Type>click here</a>" is a link to "SomeFile.Type", with the visible anchor "click here".

Browsers will generally use the linked document’s filename extension to decide how to display the linked document. For example, .htm files will be interpreted and displayed as HTML, whether they come from a http server, a ftp server, or a gopher site. Conversely, a link can be to any sort of file, not just HTML files: You use the same <a href=FileName.Type> tag to plant a link to another Web page as to a large bitmap, a sound file, or a movie.

Images as hot-spots

Since inline images are in many ways just big characters, there’s no problem with using an image in an anchor. The anchor can include text on either side of the image, or the image can be an anchor by itself. Mosaic shows an image anchor by drawing a blue border around the image (or around the placeholder graphic). The image in the anchor can be a picture of whatever is being linked to, or the anchor can simply point to another copy of the image itself: <a href=image.gif><img src=image.gif></a>.

Thumbnail images
One sort of ‘picture of the link’ is a thumbnail image. This is a tiny image, perhaps 100 pixels in the smaller dimension, which is either a condensed version of a larger image or of a detail which stands for the whole. The chief advantage of a thumbnail image is that it can be transmitted quickly, even over slow lines, leaving it up to the reader to decide which larger images to request. A secondary issue is aesthetic: large images take up an awful lot of the screen, but a small image does not dominate the screen the way a larger one does.
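A thumbnail link might be sketched as follows. Both file names are hypothetical: SmallPicture.gif stands for the thumbnail, and BigPicture.gif for the full-sized image that it links to.

```html
<p>
<a href="BigPicture.gif"><img src="SmallPicture.gif" alt="[Thumbnail]"></a>
Click on the thumbnail to see the full-sized picture.
```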

Why link an image to itself?
There are actually two very good reasons for an inline image to point to itself. For one, many people do turn off inline images to improve performance over a slow network link. If the inline image is an anchor for itself, these people can then click on the placeholder graphic to see what they missed.

There is also the implementation issue that (current versions of) Windows Mosaic do not "realize a palette" before displaying an image. Presumably this isn’t a problem on a 24-bit display, but most people have 256-color displays, and Windows maps the colors in inline images to the closest colors in the current palette. This can make for some funny looking images! If, however, the image is linked to itself, the reader can simply click on the image to load it into her favorite viewer, which will probably handle the colors much better.

Many anchors in an image
The <img> tag’s optional IsMap parameter allows you to turn rectangular regions of a bitmap image into clickable anchors. Clicking on these parts of the image will activate an appropriate URL, and there is also a default URL for when the user clicks on an area outside of any specially declared rectangle. While forms let you do this a bit more flexibly, the IsMap approach a) doesn't require any custom programming, just a simple text file that defines the rectangles and their URL's, and b) may work even on browsers that do not support forms. See http://wintermute.ncsc.uiuc.edu:8080/map-tutorial/image-maps.htm to learn how to do this.

Links to anchors

When a href specifies a filename, the link is to the whole document. If the document is HTML, it will replace the current document, and the reader will be placed at the top of the new document. Often this is just what you want; sometimes it’s not. Sometimes you’d rather have a link go to a specific section of a document.

Doing this takes two anchor tags: One that defines a name for a location, and one that points to that name. These two tags can be in the same document or in different documents, but you always need to have the two tags: You can’t say something like ‘this link is to Act III, Scene 1 of Hamlet’.

Declaring an anchor
To define an anchor name, you use the name parameter: <a name=AnchorName>. You can attach this name to a phrase, not just a single point, by following the <a> tag with a </a> tag. While current browsers work just fine with or without any anchor text and a </a>, and do not do anything special to show named anchor phrases, future browsers may highlight them in some special way, and may require the </a>. As with other compliance issues, it’s probably best to go the extra mile and include a </a> to forestall possible future problems.

Linking to an anchor in the current document
To then use this name, you simply insert an <a href=...> tag as usual, except that instead of a filename, you use a # followed by an anchor name. For example, <a href=#AnchorName> refers to the example in the last paragraph.

Names do not have to be defined before they are used: It’s actually fairly common for lengthy documents to have a table of contents with links to names defined later in the document. It’s also worth noting that while tag and parameter names are not case sensitive, anchor names are case sensitive: <a href=#anchorname> will not take you to the AnchorName example.
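For instance, a table of contents that links to names defined later in the same document could be sketched like this. The anchor names Lists and Images are invented for the example.

```html
<h1>Contents</h1>
<ul>
<li><a href="#Lists">Lists</a>
<li><a href="#Images">Inline images</a>
</ul>

<!-- The names are defined further down, after the links that use them -->
<h2><a name="Lists">Lists</a></h2>
<p>
The text of the section on lists goes here.

<h2><a name="Images">Inline images</a></h2>
<p>
The text of the section on images goes here.
<!-- Anchor names are case sensitive: href="#images" would not find "Images" -->
```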

Linking to an anchor in a different document
You can also link to specific places in any other HTML document, anywhere in the world - provided, of course, that it contains named anchors. To do this, you simply add the # and the anchor name after the URL that tells where the document can be found. For example, to plant a link to the anchor named "Section 1" in a file named "Complex.htm" in the same directory as the current file, you could use <a href="Complex.htm#Section 1">. Similarly, if the named anchor was in http://www.another.org/Complex.htm, you’d use <a href="http://www.another.org/Complex.htm#Section 1">.

Summary of <a> syntax

To                                            Use
Link to another document                      <a href="URL">highlighted anchor text</a>
Name an anchor                                <a name="Anchor Name">normal text</a>
Link to a named anchor in this document       <a href="#Anchor Name">highlighted anchor text</a>
Link to a named anchor in another document    <a href="URL#Anchor Name">highlighted anchor text</a>

All about URL’s

So far I’ve skirted around the edges of URL’s, using local file names as much as I could, to keep the discussion of inline images and hypermedia links as simple as possible. However, there’s nothing very "World Wide" about a Web spun in a single directory so, having covered URL-users, it’s time to cover URL’s themselves.

Local and ’global’ references

When you type a filename into an application like a word processor or spreadsheet, it can be short, and assumed to be in the current directory, or long, and anywhere in the file name space. There is a similar distinction with URL’s.

Just as a complete DOS file name starts with a drive letter followed by a colon, so a full URL starts with a resource type - HTTP, FTP, GOPHER, &c - followed by a colon. If the name doesn’t have a colon in it, it’s assumed to be a local reference, which is a filename on the same file system as the current document. Thus, <a href=Another.htm> refers to the file "Another.htm" in the same directory as the current file, while <a href=/html/File.htm> refers to the file "File.htm" in the top-level directory "html". One thing to note here is that a URL always uses "/", the Unix-style forward slash, as a directory separator even when the files are on a Windows machine which normally uses "\", the DOS-style backwards slash.

Local URL’s can be very convenient when you have several HTML files with links to each other, or when you have a large number of inline images. If you ever have to move them all to another directory, or to another machine, you don’t have to change all the URL’s.

The <base> statement

One drawback of local URL’s is that if someone makes a copy of your document, the local URL’s will no longer work. Adding the optional <base> statement to the <head> section of your document will help eliminate this problem. While many browsers do not yet support it, the intent of the base statement is precisely to provide a context for local URL’s.

The base statement is like the <img> statement in that it’s a so-called empty tag, without a concluding </base> tag that encloses some text. It requires a href parameter - e.g. <base href=http://www.imaginary.org/index.htm> - which should contain the URL of the document itself. When a browser that supports the base statement encounters a URL that doesn’t contain a protocol and path, it will look for it relative to the base URL, instead of relative to wherever it actually loaded the document from.
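In context, a document with a base statement might look like the sketch below. The title and the file Another.htm are made up; www.imaginary.org is the same imaginary site as in the example above.

```html
<html>
<head>
<title>A document with a base URL</title>
<base href="http://www.imaginary.org/index.htm">
</head>
<body>
<!-- A browser that supports the base statement should look for this file at
     http://www.imaginary.org/Another.htm, wherever the copy was loaded from -->
<a href="Another.htm">Another page</a>
</body>
</html>
```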

Reading and constructing URL’s

Where a local URL is just a filename, a ’global’ URL specifies an instance of one of several resource types, which may be located on any Internet machine, anywhere in the world. The wide variety of resources is reflected in a complex URL syntax. For example, while most URL’s consist of a resource type followed by a colon, two forward slashes, a machine name, another forward slash, and a resource name, others consist only of a resource type, a colon, and the resource name.

Broadly speaking, the resource-type://machine-name/resource-name URL form is used with centralized resources, where there’s a single server that supplies the document to the rest of the net, using a particular protocol. Thus, "http://www.another.org/Complex.htm" means ‘use the Hypertext Transfer Protocol to get the file Complex.htm from the main WWW directory on the machine www.another.org’, while "ftp://foo.bar.net/pub/www/editors/README" means ‘use the File Transfer Protocol to get the file /pub/www/editors/README from the machine foo.bar.net’.

Conversely, many resource types are distributed. We don’t all get our news or mail from the same central server, but from the nearest one of many news and mail servers. URL’s for distributed resources use the simpler form resource-type:resource-name. For example, "news:comp.infosystems.www.providers" refers to the Usenet newsgroup comp.infosystems.www.providers, which is a good place to look for further information about writing HTML.

Use of www vs machine-name

In the HTTP domain, you’ll often see "machine names" like "www.foo.org". This usually does not mean that there’s a machine named www.foo.org that you can ftp or telnet to; "www" is an alias that a webmaster can set up when she registers her server. Using the www alias makes sense, because machines come and go, but sites (and, we hope, the Web) last for quite a while. If URL’s refer to www at the site and not to a specific machine, the httpd server and all the HTML files can be moved to a new machine simply by changing the www alias, without having to update all the URL’s.

A Partial Table Of URL Resource Types

Resource   Interpretation                Format
HTTP       Hypertext Transfer Protocol   http://machine-name/file-name
FTP        File Transfer Protocol        ftp://machine-name/file-name
GOPHER     Gopher                        gopher://machine-name/file-name
NEWS       Internet News                 news:group-name
TELNET     Logon to a remote system      telnet://machine-name
MAILTO     Normal Internet email         mailto:user-name@machine-name
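Several of these resource types can be mixed freely in one page of links, as in the sketch below. The http, ftp, and news URL's are the ones used as examples earlier in this section; the mailto address is made up.

```html
<ul>
<li><a href="http://www.another.org/Complex.htm">A Web page</a>
<li><a href="ftp://foo.bar.net/pub/www/editors/README">A file on a ftp server</a>
<li><a href="news:comp.infosystems.www.providers">A Usenet newsgroup</a>
<li><a href="mailto:someone@foo.bar.net">An email address</a>
</ul>
```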

Often possible to just cut ’n paste

There’s no doubt that URL’s are complex little beasts. It’s fairly easy to make your link fail by typing a . where you need a / or vice versa. Fortunately, in many cases it’s possible to let a computer generate a URL for you to cut and paste.

For example, if you’re putting together your own pages of Neat Things To Click On, you can (with some versions of Mosaic) copy the current URL out of a text box in the header pane, next to the NCSA logo. Similarly, the "menu editor" has a way to copy the current URL to an editable field, where you can copy the URL to the Clipboard and then paste it to some other application.

In the character-mode world, if you type "=" at any Gopher menu, you will get some technical information about that menu which includes a URL that you can put into your HTML files.

Copy vs reference

In general, if you run across some interesting and useful (or just plain fun and weird) information anywhere on the Net, it’s much better to describe it and link to it than to copy it. With a link, any changes will automatically be seen by your readers. If you make a copy, any changes to the original will be invisible until you update your copy.

Naturally, this causes a problem if you want to add HTML markup to an existing document to make it easier on the eyes or to make it easier to jump to specific sections! If the document is relatively static - like Alice In Wonderland or the US Constitution - there’s probably no alternative to making a copy and marking it up. However, with something that changes often, like a FAQ [Frequently Asked Questions] file, probably the best thing to do is to try to come up with some automatic way to, say, build a clickable table of contents.

FTP sites

Local ftp vs symbolic links:

While a URL can specify a ftp-able file on the same machine as the .htm document, anonymous ftp login can consume a fair number of cycles and slow the host down. If the machine is running Unix, you can do yourself and the other users of the machine a favor by using a symbolic link to let people get your ftp-able files via the normal WWW http.

While security concerns ensure that anonymous ftp users have no access to the regular file system, so that ftp has to have its own, self-contained file system without any links to the regular file system, you can have links from the regular file system. That is, you can’t establish a link from a ftp directory to a .public_html directory, but you can create a link from a html directory to a ftp directory.

For example, in my html directory, I did ln -s ~ftp/pub/user/jon ftp, which created a ftp ‘directory’ in my html directory. I can thus create URL’s that look like "ftp/README" instead of "ftp://deeptht.armory.com/pub/user/jon/README". The shorter URL is not only easier to read, but is also easier for the host to process.

Raw ftp

Somewhat similarly, while downloading files through Mosaic is awfully convenient, it can also be pretty slow, for large files. You may be able to save your readers some time by including "raw ftp" instructions in the text around your reference.

Special characters

Since < and > have special meanings in SGML, there has to be an escape mechanism that lets you include them in your text without causing syntax errors. Similarly, since URL’s with embedded spaces need to be quoted, there has to be an escape mechanism to include a quote character in a URL without closing the quotes. Finally, while the default character set for the Web is ISO Latin-1, which includes European language characters like é and ß in the range from 128 to 255, it’s not uncommon to pass around snippets of HTML in 7-bit email, or to edit them on dumb terminals, so the escape mechanism also has to include a way to specify high-bit characters using only 7-bit characters.

Two forms: numeric and symbolic

There are two ways to specify an arbitrary character: numeric and symbolic. To include the copyright symbol, ©, which is character number 169, you can use &#169;. That is, &#, then the number of the character you want to include, and a closing semicolon. The numeric method is very general, but not very easy to read.

The symbolic form is much easier to read, but its use is restricted to the four low-bit characters with special meaning in SGML and to the European-language characters. To use the other symbols in the ISO Latin-1 character set, like ® and the various currency symbols, you have to use the numeric form. The symbolic escape is like the numeric escape, except there’s no #. For example, to insert é, you would use &eacute;: that is, &, the character name, and a closing semicolon. You should be aware that symbol names are case sensitive: &Eacute; is É, not é, while &EAcute; is no character at all, and will show up as &EAcute;!

There is a complete table of all characters with HTML names, as well as all the other high-bit characters in the ISO Latin-1 character set, in Appendix B, HTML Special Characters.
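
To make the two forms concrete, here is a short sketch that uses each kind of escape. The sample sentences are invented; the escapes themselves are the ones described above.

```html
<!-- The four low-bit characters with special meaning -->
To start a tag, type &lt;; to end it, type &gt;.
An ampersand is &amp;, and a double quote is &quot;.

<!-- Numeric form: works for any character in ISO Latin-1 -->
Copyright &#169; 1994

<!-- Symbolic form: names are case sensitive -->
caf&eacute; (lower case e-acute), &Eacute;cole (upper case E-acute)
```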

ISO Latin-1 vs Windows ANSI

Of course, if you compose your HTML documents on a system like Windows that essentially uses the ISO Latin-1 character set, and you don’t plan to send your HTML via email, you can just insert the high-bit characters directly into your HTML files: Except for the multiply and divide symbols (× and ÷), Windows ANSI is ISO Latin-1.

Since HTTP, the WWW’s document transfer protocol, is always 8-bit, you only need to use the escape mechanism for the high-bit characters if you plan to quote your document in email, or if you are editing it on an ASCII terminal. If you can use the high-bit characters directly, you will only have to use the & escape mechanism on the relatively few occasions when you have to quote <, >, ", or &.

Preformatted and other special paragraph types

HTML has three special "block" formats. Any plaintext within them is supposed to appear in a distinctive font. While they can be used simply as paragraph styles, they can also enclose headers and lists. It’s worth noting, though, that current browsers don’t really do a very good job of living up to the spec, here, and using these block formats for anything but paragraph styles will probably not do what it should.

<BlockQuote> </BlockQuote>

The intent of the block quote is to set off an extended quote from normal text. That is, a BLOCKQUOTE tag pair does not imply indented, single-spaced, and italicized; rather, it’s just meant to change the default, plaintext font. Browser support for block quotes is currently pretty weak.
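
A minimal sketch of a block quote used as a simple paragraph style follows. The quoted sentence is the subtitle of this chapter’s introduction; the sentence that introduces it is made up.

```html
<p>
As the introduction to this chapter puts it:
<blockquote>
HTML underlies all the WWW's pretty surface.
</blockquote>
```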

<pre> / </pre>

Everything in a preformatted block will appear in a monospaced font. The PRE tag pair is also the only HTML element that pays any attention to line breaks in the source file: Any line break in a preformatted block will be treated just the same as a <br> elsewhere. The <pre> tag is probably most commonly used for samples of source code, but it’s also the default format for any non-HTML text: any file without an .htm or .HTM extension will be displayed as if it were within a PRE tag pair.

Preformatted is a block format, and any HTML markup will be processed, so you can have anchors as well as bold or italic monospaced text within your preformatted block. Headers and lists in preformatted blocks tend to confuse current browsers, though.

The initial <pre> tag can have an optional width= parameter. Browsers will not trim lines to this length; the intent is to allow the browser to select a monospaced font that will allow the maximum line length to fit in the browser window.
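
Here is a small sketch of a preformatted block that uses the width= parameter along with some markup inside the block. The width of 40 and the listing itself are arbitrary choices for illustration, and the file names are hypothetical.

```html
<pre width=40>
Name        Size   Description
README        2K   <b>Start here</b>
<a href="notes.txt">notes.txt</a>     5K   Release notes
</pre>
```

Note that the tags take up room in the source but not on the screen, so columns that contain markup will not line up in your editor the way they do in the browser.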

<Address> </Address>

The third block format is the address format, <address>. This is generally displayed in italics, and is intended for information about the document, like its creation date and revision history, and how to get in touch with the document’s author. Official style guides say that every document should have one.

Many people put a horizontal rule, <hr>, between the body of the document and the address block. If you include a link to your home page or to a page that lets the reader send mail to you, you don’t ‘have’ to include a lot of information on each individual page.
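
Putting these pieces together, the end of a document might look something like this sketch. The file name home.html is a hypothetical home page; the date and author are taken from this document’s own footer.

```html
<hr>
<address>
Written August 1994 by Jon Shemitz (jon@midnightbeach.com).
<a href="home.html">Back to my home page</a>
</address>
```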

Forms

Everything we have seen so far corresponds to traditional publishing: You create a hypermedia document, and others read it. You are a producer, your readers are consumers. Your readers can only see what you have put out for them.

With HTML forms, though, that begins to change. You can create a form that lets your readers search a database using any criteria they like. Or you can create a form that lets them critique your Web pages, or your new software. Or - and this is what excites business people - you can use forms to sell things over the net.

It’s pretty easy to create forms. However, to actually use them you’ll need a program that runs on your Web server to process the information that the user’s client sends back to you. For simple things like a "comments page", you can probably use an existing program; for anything more complex, you’ll probably need a custom program. While I will briefly describe the way forms data looks to the receiving program, any discussion of forms programming is quite beyond the scope of this book.

As of the summer of 1994, forms are still a relatively new feature that does not work with Mac Mosaic, only with Windows and X-Windows Mosaic, but this condition will probably not last long.

<form> </form>

All input widgets must appear within a FORM tag pair. When a user clicks on a submit button or an image map, the contents of all the widgets in the form will be sent to the program that you specify in the <form> tag. HTML widgets include single and multi-line text boxes, radio buttons and check boxes, pull down lists, image maps, a couple of standard buttons, and a hidden widget that might be used to identify the form to a program that can process several forms.

Within your form, you can use any other HTML elements, including headers, images, rules, and lists. This gives you a fair amount of control over your forms’ appearance, but you should always remember that the user’s screen size and font choices will strongly affect the actual appearance of your form.

While you can have more than one form on a page, you cannot nest one form within another.

Form action and method attributes

Nothing gets sent to your Web server until the user presses the Submit button or clicks on an image map. What happens then depends on the action, method, and EncType parameters of the <form> tag.

The action parameter gives a URL to which the form’s content will be sent. This is most commonly in the cgi-bin directory of a Web server. If you do not specify an action parameter, the contents will be sent to the current document’s URL.

The method parameter tells how to send the form’s contents. There are two possibilities here: get and post. If you do not specify a method, get will be used. Get and post both format the form’s data identically; they differ only in how they pass the form’s data to the program that uses that data.

Get and post both send the form’s contents as a single long text vector consisting of a list of WidgetName=WidgetValue pairs, each separated from its successor by an ampersand. For example, "Name=Jon+Shemitz&Address=jon%40midnightbeach.com". (Spaces are sent as + signs, and any &, =, or other special character in a widget name or value is sent as a % sign followed by its two-digit hexadecimal character code; any ‘bare &’ and any = sign can therefore be taken as a separator.) You will not necessarily get a name and a value for every widget in the form: While empty text is explicitly sent as a WidgetName= with an empty value, unselected radio buttons and check boxes don’t send even their name.

Where get and post differ is that the get method creates a "query URL" which consists of the action URL, a question mark, and the formatted form data, whereas the post method sends the formatted form data to the action URL in a special data block. The Web server parses the query URL that a get method creates and passes the form data to the form processing program in an environment variable (QUERY_STRING, under the CGI). This places a limit on form data length that the post method does not have.
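
As a sketch of the difference, here is the same small form written with each method. The action URL /cgi-bin/comments and the widget names are hypothetical; you would substitute whatever program your own server provides.

```html
<!-- get: the data travels in the URL, for example
     /cgi-bin/comments?Name=Jon&Topic=forms -->
<form action="/cgi-bin/comments" method="get">
Name: <input name="Name">
Topic: <input name="Topic">
<input type="submit">
</form>

<!-- post: same widgets and same formatted data, but the data
     is sent to the action URL in a separate data block -->
<form action="/cgi-bin/comments" method="post">
Name: <input name="Name">
Topic: <input name="Topic">
<input type="submit">
</form>
```

For a short form like this either method will do; post is the safer choice once a form has long text fields.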

Currently, all forms data is sent in plaintext. This creates a security problem, which will be discussed below. The optional EncType parameter offers a possible solution: Though currently this only allows you to ratify the plaintext default, in the future there will probably be values that call for an encrypted transmission. For example, in August of 1994, a Massachusetts startup demonstrated forms transmission using PGP (Pretty Good Privacy) data encryption: Presumably, this depends on a custom Mosaic client that understands a new EncType parameter, but by the time you read this there may be public standards for encrypting forms data.

Widgets

While from the users’ point of view, there are seven types of Web widgets, all are generated by one of three HTML elements.

The input element

The <input> element is the most versatile, and the most complex. It can create single-line text boxes, radio buttons, check boxes, image maps, the two standard buttons, and the hidden widget. It’s somewhat like the <img> tag in that it appears by itself, not as part of a tag pair, and has some optional parameters. Of these, the type= parameter determines both the widget type and the meaning of the other parameters; if there’s no type parameter, the input tag generates a text box. Except for the standard buttons, all widgets must have a name.

Text boxes
If there is no type parameter or if the type parameter is "text", the input widget will be a text box. The "password" input type is just like the text type, except that the value shows only as a series of asterisks. All text areas must have a name. Text areas always report their value, even if it is empty.

Syntax of the TEXT and PASSWORD input types:
| Attribute | Required? | Format | Meaning |
|---|---|---|---|
| Type | No | Type="Text" or Type="Password" | Determines what type of widget this will be. Default is "text". |
| Name | Yes | Name="WidgetName" | Identifies the widget. |
| Value | No | Value="Default text" | Lets you supply a default value. Cannot contain html commands. |
| Size | No | Size=Cols | Width (in characters) of a single line text area. Default is 20. |
| Size | No | Size=Cols,Rows | Height and width (in characters) of a multi-line text area. |
| MaxLength | No | MaxLength=Chars | Longest value a single line text area can return. Default is unlimited. |
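
For instance, a pair of text boxes might be written like this sketch. The action URL and widget names are hypothetical, and the sizes are arbitrary.

```html
<form action="/cgi-bin/login" method="post">
User name: <input type="text" name="User" size=20 maxlength=32>
<br>
Password: <input type="password" name="Password" size=20>
<br>
<input type="submit">
</form>
```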

Check boxes and radio buttons
Check boxes and radio buttons are created by an input tag with a "checkbox" or "radio" type. Both must have a name and a value parameter, and may be initially checked. The name parameter is the widget’s symbolic name, used in returning a value to your Web server, not its on-screen tag. For that, you use normal HTML text, next to the <input> tag. Since the display tag is not part of the <input> tag, Mosaic check boxes and radio buttons operate differently from their dialog box kin: You cannot toggle a widget by clicking on its text, you have to click on the widget itself.

Radio buttons are grouped by giving them identical names. Only one (or none) of the group can be checked at any one time; clicking a radio button will turn off whichever button in the name group was already on.

Check boxes and radio buttons return their value if and only if they are checked. An unchecked widget is completely silent.

Syntax of the CHECKBOX and RADIO types
| Attribute | Required? | Format | Meaning |
|---|---|---|---|
| Type | Yes | Type="CheckBox" or Type="Radio" | Determines what type of widget this will be. Default is "text". |
| Name | Yes | Name="WidgetName" | A unique identifier for a checkbox; a group identifier for radio buttons. |
| Value | Yes | Value="WidgetValue" | The value is sent iff the widget is checked. |
| Checked | No | Checked | If this attribute is present, the widget starts out checked. |
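
A sketch of one check box and one group of radio buttons follows. The names and values are invented for illustration, and the fragment is meant to sit inside a FORM tag pair.

```html
<input type="checkbox" name="MailingList" value="yes" checked>
Add me to the mailing list
<br>
<input type="radio" name="Platform" value="win" checked> Windows
<input type="radio" name="Platform" value="mac"> Mac
<input type="radio" name="Platform" value="x"> X-Windows
```

If the user submits the form without touching these widgets, the form data will include MailingList=yes and Platform=win. The on-screen labels (Windows, Mac, and so on) are never sent.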

Image Maps
Image maps are created with the "image" type code. They return their name and a pair of numbers that represents the position that the user clicked on: The form handling program is responsible for interpreting this pair of numbers. Since this program can do anything you want with the click position, you are not restricted to rectangular anchors as with <img ismap>.

Clicking on an image map, like clicking on a Submit button, will send all form data to the Web server.

Syntax of the IMAGE type
| Attribute | Required? | Format | Meaning |
|---|---|---|---|
| Type | Yes | Type="Image" | Determines what type of widget this will be. Default is "text". |
| Name | Yes | Name="WidgetName" | Identifies the widget |
| Src | Yes | Src="URL" | The URL of a bitmapped image to display. |
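
As a sketch, an image map widget takes just one tag. The widget name and the image file map.gif are hypothetical.

```html
Click on the part of the map you are interested in:
<br>
<input type="image" name="Map" src="map.gif">
```

Browsers generally send the click position as two separate values named after the widget, here Map.x and Map.y, but it is worth checking what your own form handling program actually receives.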

Submit/Reset buttons
The "submit" and "reset" types let you create one of the two standard buttons. Clicking on a Submit button, like clicking on an image map, will send all form data to the Web server. Clicking on a reset button resets all widgets in the form to their default values. These buttons are the only widgets that don’t need to have names. By default, they will be labeled Submit and Reset; you can specify the button text by supplying a Value parameter.

Syntax of the SUBMIT and RESET types
| Attribute | Required? | Format | Meaning |
|---|---|---|---|
| Type | Yes | Type="Submit" or Type="Reset" | Determines what type of widget this will be. Default is "text". |
| Name | No | Name="WidgetName" | The buttons never return their values, so a name will never be used. |
| Value | No | Value="WidgetValue" | The button text. Default is Submit or Reset, respectively. |
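
For example, this sketch replaces the default button labels with invented text of its own. Like any widget, the buttons belong inside a FORM tag pair.

```html
<input type="submit" value="Send comments">
<input type="reset" value="Start over">
```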

Hidden fields
A "hidden" type creates an invisible widget. This does not appear on the screen, but its name and value are included in the form’s contents when the user presses the Submit button or clicks on an image map. This might be used to identify the form to a program that processes several different forms.

Syntax of the HIDDEN type
| Attribute | Required? | Format | Meaning |
|---|---|---|---|
| Type | Yes | Type="Hidden" | Determines what type of widget this will be. Default is "text". |
| Name | Yes | Name="WidgetName" | Identifies the widget. |
| Value | Yes | Value="WidgetValue" | Whatever constant data you might want to include with the form. |
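
A sketch of a hidden widget that tags a form for a shared form handling program might look like this. Both the name FormID and the value are hypothetical; use whatever your program expects.

```html
<input type="hidden" name="FormID" value="comments-v1">
```

The receiving program would then see FormID=comments-v1 in the form data, alongside whatever the user typed into the visible widgets.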

The TextArea tag pair

The TextArea element is similar to a multi-line text input widget. The primary difference is that you always use a TEXTAREA tag pair and put any default text between the <TextArea> and </TextArea> tags. As with <pre> blocks, any line breaks in the source file are honored: This lets you include line breaks in the default text. The ability to have a long, multi-line default text is the only functional difference between a TextArea and a multi-line input widget. Of course, the syntax is rather different - and, at least in the current version of Windows Mosaic, TextAreas work, while multi-line input widgets do not.

Syntax of the <TextArea> tag
| Attribute | Required? | Format | Meaning |
|---|---|---|---|
| Name | Yes | Name="WidgetName" | Identifies the widget. |
| Rows | No | Rows=Rows | TextArea height, in characters. |
| Cols | No | Cols=Cols | TextArea width, in characters. Default is 20. |
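
Here is a sketch of a TextArea with a two-line default text. The widget name and the dimensions are arbitrary choices for illustration.

```html
<textarea name="Comments" rows=4 cols=40>
Type your comments here.
You can use more than one line.
</textarea>
```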

The select tag pair

The SELECT tag pair allows you to present your users with a set of N choices. This is not unlike a set of check boxes, except that it takes less room on-screen.

Just as you can use check boxes for 0 to N selections, or radio buttons for 0 or 1 selections, you can specify the cardinality of selection behavior. Normally, select widgets act like a set of radio buttons: your users can only select zero or one of the options. However, if you specify the MULTIPLE option, the select widget will act like a set of check boxes: your users may select any or all of the options.

Syntax of the <select> tag
| Attribute | Required? | Format | Meaning |
|---|---|---|---|
| Name | Yes | Name="WidgetName" | Identifies the widget. |
| Size | No | Size=Rows | This is the widget height, in character rows. If the size is 1, you get a pull-down list. If the size is greater than 1, you get a scrolling list. Default is 1. |
| Multiple | No | Multiple | Allows more than one option to be selected. |

Within the SELECT tag pair are a series of <option> statements, followed by the option text. These are similar to <li> list items, except that <option> text may not include any HTML markup. The option tag may include an optional selected attribute; more than one option may be selected if and only if the select tag includes the multiple option.

For example,

Which Web browsers do you use?
<select name="Web Browsers" multiple>
<option>Mosaic
<option>Lynx
<option>WinWeb
<option>Cello
</select>

The CGI

The CGI, or Common Gateway Interface, defines how a form handling program on a Web server should act. This includes the name1=value1&name2=value2 format of the form data vector, as well as how these programs interact with remote Web clients. A CGI program can be any sort of executable code, but on Unix servers, the most common executable seems to be a Perl script.

Security issues

You should be aware that it’s always possible for people to intercept forms data bound for your Web server. This means that until forms with encrypted EncTypes are widely supported, forms data cannot be considered 100% reliable - or 100% confidential.

The problem is that anyone who loads your form can read the HTML source to see where the forms data goes. If that data includes any tempting information like a credit card number, a thief may be tempted to watch traffic to your server for credit card numbers to steal. Since it can be relatively easy to intercept TCP/IP packets, this is a problem that you shouldn’t ignore!

Basically, if you want to do on-line sales, DON’T use a plaintext form to ask for a credit card number. Instead, use a service that may let customers create accounts over the Web but will only accept credit card numbers and expiration dates via a voice phone call or through snail (physical letter) mail. When your customers want to place an order, they don’t run the risk of having their credit card number stolen; they would only have to supply a name and address to let the order taking system look up their credit card number.

Tips

Use any editor you like

WWW browsers ignore ASCII carriage returns and line feeds, except in <pre> and <textarea> blocks, so it doesn’t really matter whether your editor uses a DOS-style CR-LF or a Unix-style LF for an end-of-line marker. For that matter, the browser doesn’t care at all if you have any line breaks outside of <pre> and <textarea> blocks: You can use a word processor like Word for Windows that only puts a line break between paragraphs.

On the other hand, if you upload your HTML files to a Unix system for distribution, you may find it hard to avoid using vi to make minor changes. vi is old-fashioned, to say the least, and not only does it have a maximum line length, it doesn’t word wrap or scroll horizontally: It wraps long lines to the next screen line, right in the middle of words. If you’re going to have to use vi, it will be best if your source is word-wrapped to fit on an 80-column screen.

Previewing

Until the day when we have fully WYSIWYG Web editors, we’re going to have to live with an edit-preview-edit cycle.

Mosaic

If you can run Mosaic on your personal computer (or X terminal), you can use the Load Local File feature to preview your HTML. When you’re ready to test a document, save it to disk, load Mosaic, and select Load Local File from the File menu. Odds are, you’ll find a few mistakes!

If you have enough memory to run both Mosaic and your editor at the same time, just switch back to your editor, correct your mistakes, and switch back to Mosaic. Then either select Reload from the Navigate menu or just click on the reload button on the tool bar. This will show your corrections without having to reselect or retype the file in the Load Local File dialog box.

lynx

If you have a character mode connection to the Unix system you use to distribute your Web pages, you can use lynx to test your Web pages. While obviously you should test your pages before you upload them, it can be hard to avoid making small changes to the uploaded documents, and it’s a good idea to test them as soon as possible after changing them. (After all, if your pages are popular, someone might see any mistakes as soon as they’re saved.) The lynx version of your document won’t look much like the Mosaic version, but lynx does provide a good way to test both your links and the alt text in your <img> tags.

Keep It Simple

The fact that you have five lists, three block types, and twelve character attributes to work with doesn’t mean you should use them all in each document! Since computer monitors display less text than a sheet of paper, it’s even easier to go overboard with layout on your Web pages than in print. Use character attributes sparingly: If every other sentence seems to need bold face or strong emphasis to emphasize key points, perhaps you need to work on your wording. Use lists and images and rules when they add to your page, not just for the sake of creating a "fancy document" that uses "sophisticated features".

Your readers will not see what you do

Perhaps the hardest thing for many of us to learn about writing Web pages is that What You See Is Not What You Give! We’ve gotten used to desktop publishing, where we have total control over our document’s appearance. It’s hard to remember that, on the Web, you often can’t do much more than hint at what the page should look like.

Character mode browsers

For example, most of us equate the Web with Mosaic, but the Web predates Mosaic, and a significant fraction of Web users still use character mode browsers like Lynx. With a character mode browser, everything is monospaced, and the only "character attribute" available is "highlight" so that bold and italic look just the same as anchors.

Different screen/window sizes

Nor can you think of the majority of users who do use Mosaic or some other graphical browser as a homogeneous mass. Some people still have 640 by 480 (or even 640 by 350) screens, while others do not routinely "maximize" their browser window. The inline image that takes up a small piece of the window on your 1280 by 1024 screen may not even completely fit on a smaller screen. Similarly, the document that looks wonderful with tiny little inline image bullets and rules will look pretty terrible when it’s filled with the (larger) placeholder graphic.

Different font face/size choices

Also, your readers will not be using the same fonts for plaintext and headers as you are. Some will not have a very visible difference between, say, a second and a third level header - don’t write documents that depend on there being obvious differences between various header levels.

Many browsers do not support all of HTML

Perhaps most importantly, many browsers do not support all of HTML. For example, as of the summer of 1994, Mac Mosaic does not support forms, while Windows Mosaic does not support compact description lists, has bugs with many block types, only supports a handful of character attributes, and so on. While your browser may support a given HTML element, you have no real assurance that your readers’ browsers will. (You can’t really assume that everyone will always have the latest update, even of free software.)

Naturally, this raises the question of whether you should even use the "unsupported" features. The answer depends on the expected lifetime and readership of your document. If you’re writing an announcement of a picnic next month for a group that all uses Windows Mosaic, it would be silly to use a compact description list or some other feature that will just be ignored. On the other hand, if you expect your document to last indefinitely and to be read by a wide variety of people, then it’s probably best to use currently unsupported features - where they’re appropriate! - but to be careful that the document is legible without them. For example, when I post snippets of code, I use the <code> or <tt> attributes, even though my browser ignores them. Someday, it won’t.

Connect speeds vary widely

Some people have wonderful network connections that can deliver megabytes per second, while others limp along with dialup SLIP connections that can’t even keep a 14.4 modem busy full time. The pretty, image-filled page that comes up as fast as the CPU can decode the GIF files on the well-connected machine can take several minutes to show up on the poorly-connected machine. If the latter machine has been set to not display inline images as a result, the screen will not look at all like what it does with inline images on.

Avoid large images: Use a "thumbnail" image, linked to a full-size image, and let your reader decide if they really want to see the full picture.
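
A thumbnail link can be sketched like this. The file names and the sizes quoted in the text are made up; the point is that the small image is displayed inline while the large one is only fetched on request.

```html
<a href="sunset-full.gif"><img src="sunset-small.gif" alt="[Sunset thumbnail]"></a>
Click on the picture for the full-size version (240K).
```

Telling your readers how big the full-size file is lets them judge the download time for themselves.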

Similarly, take advantage of hypertext. Don’t put up a long essay as a single, long and detailed document; write what amounts to an outline or overview, and let your readers click on the sections that seem interesting to them.

Colors of inline images can be semi-random

The current crop of Windows browsers does not "realize a palette" before displaying an inline image. Windows therefore maps the image colors to the closest colors in the current palette. This can result in weird effects like some shades of gray in a black and white image getting colored, or people’s faces being rendered in lurid red and yellow.

Sometimes this doesn’t really matter: If the inline image is just a line drawing, any colors it has are pretty arbitrary. On the other hand, most photographs look pretty weird with the wrong palette: you should link even small photos to themselves, so that your reader can look at your photos with the viewer of her choice.
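
Linking a photo to itself just means using the same file in both the anchor and the inline image, as in this sketch. The file name and alt text are hypothetical.

```html
<a href="portrait.gif"><img src="portrait.gif" alt="[Portrait]"></a>
```

Clicking on the inline image then hands the same file to whatever external viewer the reader has configured, and that viewer can display it with the right palette.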

Good luck - and have fun!

There’s no doubt that it’s easier to use some nice WYSIWYG desktop publishing software than it is to write HTML - but, really, after just a little practice, HTML becomes easy to write. (After a little more practice, it even begins to seem almost readable.) But electronic publishing is not just more environmentally benign than paper publishing and physical distribution, it also offers a flexibility and immediacy that paper can’t touch. One way or another, electronic publishing is the future - I’d rather see that future belong to something like the Worldwide Web of thousands or millions of individuals with something to say than to the large corporations with their "500 channels" of recycled crap. Wouldn’t you?


This document originally appeared in The Mosaic & Web Explorer

Copyright © 1994,1995 Jon Shemitz - jon@midnightbeach.com - Written August '94, HTML markup 12-Jun-95