The WWW runs on a large and extremely heterogeneous mass of computers. There’s no common operating system. There’s no common word size. There’s no consensus on whether the least significant bit of a number comes before or after the most significant bit. Sometimes, WWW pages even get passed around via 7-bit links.
One implication of all this heterogeneity is that 7-bit ASCII is about the only data representation that can be understood by every machine in the Web. Two Windows machines running identical software could pass their internal binary data representation of Web pages back and forth over a network, but they’re going to have to do some translation before they can talk to a machine with different word sizes or byte ordering - or perhaps even just a different program version. While they could have specialized code to do translations to and from the internal formats of every machine or program that they might need to talk to, it’s easier and more reliable to create a sort of programming language that allows them to describe the Web page.
A Web editor can then use this language to describe a document, without having to know anything about the internal data structures of the Web viewers that will display it. A Web server can store documents from a variety of machines without having to know anything about their architecture - nor need it know anything about the machine that requests a document. A Web viewer can display documents that were generated on a Windows machine just as easily as it can display documents that were generated on a Mac or a Unix workstation.
As you probably already know, the particular language used by the WWW is called HTML, which stands for hypertext markup language. The hypertext part means that a Web page can contain references to other Web pages or to various net resources like gophers and ftp sites. The markup part comes from the days when book and magazine editors made special marks on their authors’ manuscripts to tell the typesetters how to format the text. This process was called markup, and the term was adopted when people started inserting formatting instructions into their computer files.
HTML is modeled on SGML, the Standard Generalized Markup Language. The important difference between SGML and older types of markup languages is not what the markup looks like but what it says. Originally, markup tags said things like ‘make this bold’ or ‘center this’ or ‘use this font’: they were concerned with how the text should look. If you wanted to change the way chapter headings looked, you had to go through and change each one. In SGML, on the other hand, the markup would say ‘this is a chapter heading’. If you change the appearance of chapter headings, the appearance of all the chapter headings changes, without any possibility of error.
More generally, SGML markup contains meta-information, or information about the text. This information doesn’t have to be presentation commands. As in HTML and the WWW, it can be hypertext links. It can say things like ‘what follows is an author’s name’ or ‘here is a quote from this source’. It can contain revision history or it can say ‘here is something only for those who want all the details’.
Obviously, in the end this information does affect the presentation. The important point is that how the meta-information affects the presentation is up to the presenting software - and the user. A division has been made between content and presentation, and presentation has been made to depend on content.
This division between content and presentation should be quite familiar to anyone who’s used a word processor (like Microsoft Word) with style sheets. Of course, where a style sheet or a document file format is often specific to a particular program running under a particular operating system, SGML is a way to encode meta-information in ASCII so that it can be used by different programs on different systems.
However, SGML is not a markup language per se; rather, SGML is a format that markup languages can follow. In programming terms, SGML is an object class, while particular “SGML compliant” markup languages, like HTML, are instances of the class. In practice, this means very little more than that HTML tags look like SGML tags, and that an HTML document has both a “head”, which contains information about the document as a whole, like its title and its author, and a “body”, which contains the actual text.
You certainly don’t need to know all this about the nature of SGML to write Web pages, but as you wander the Web, you’re bound to run into references to SGML from time to time. Now you know what it is.
As you might imagine, it’s not too hard to mark up your text, but if you’re like most people, all the tags in brackets get in the way of the text. Proofreading a heavily marked-up text is very difficult. Remember, though, that the markup is meant to be read by a computer, not by you - and that there really isn’t any alternative to mixing formatting instructions in with the text. The only sort of format description that can be easily passed between different types of computers is an ASCII description.
And of course they are awkward and painful, but to those of us to whom terminal-based text editors and early markup languages represented freedom from both typos and retyping second and third drafts, there’s something nostalgic about markup languages. Don’t get me wrong - I love how easy my GUI word processor makes complex layouts, and would never go back to runoff or anything like it - but I still get a bit of a kick out of how the ’90’s reality of heterogeneous networks has made ’70’s word processing technology new again.
Current tools force you to learn HTML
However, as I’m writing this in the summer of 1994, it really doesn’t much matter whether you like, tolerate, or
despise HTML: If you want to be a Web-spinner, you have to learn HTML. No one has yet written a true ‘Web
processor’, a WYSIWYG word processor that ‘just happens’ to read and write HTML files. The current exponential
growth of the Web is naturally spawning a lot of Web writing tools, so I’m sure we’ll see good tools someday, but for
now we have to use word processors, programming editors, or ill-conceived "HTML editors" that display the tags, not
the effects.
More substantially, it is easier to start writing HTML with an HTML editor than with a basic text editor because the HTML editor typically offers some sort of menu of tags. This can be helpful, because the HTML tag set is hard to learn, having all the consistency and predictability of BASIC. Some tags, like <b> and <dl>, are just initials. Others, like <pre> or <img>, are arbitrarily abbreviated. There are also some, like <blockquote> and <address>, which are full words or phrases. (And, of course, obviously a bulleted list is an "unordered list", or <ul>.)
The final advantage of an HTML editor is that when it inserts tags for you, it inserts both the beginning and the end tag, thus greatly reducing the chance that your whole document will end up in the <h1> (first level header) style, or that a bold word will become three bold paragraphs.
Unfortunately, there are also two disadvantages. The first is that converting a file from your word processor’s format to HTML can be slow. Even if conversion is relatively fast, inserting an extra step between editing and previewing slows you down and makes you less productive. The second, more substantial, disadvantage is that HTML formats that don’t correspond directly to the word processor’s formats are either not supported or only supported in an awkward and roundabout way. For example, inserting a hypertext link with an RTF to HTML converter requires you to use a special format for the ‘pointer’ and a different special format for the anchor.
At least in principle, this is a "portable" solution. You should be able to use an RTF to HTML translator on just about any system that has a word processor that can generate RTF files, because the translator is a "batch" operation which requires little or no localization to compile for just about any platform. Or, you could take an RTF file from your own machine and convert it to HTML on the system that you will use to distribute your Web pages. However, RTF has been said to stand for Redefine The Format, and portability may be more theoretical than actual: You may find that any particular translator will only work with one particular version of one particular word processor.
Another disadvantage is that using a translator turns a simple ‘reality test’ into a three-step operation: Save the document; translate it; load it into a Web viewer like Mosaic. (It’s even worse if you have to upload the file to another system before you can translate it.) While this may be less trouble than using a non-WYSIWYG editor that forces you to decode HTML tags in your head, it’s certainly more of an annoyance and productivity decreaser than a word processor add-in that can save documents directly to HTML.
To some extent you can avoid this problem by only using cu_html for Web pages that don’t need any of the features that it doesn’t fully support, but it’s hard to avoid it entirely. If you develop your pages on a Windows system but upload them to a Unix system for distribution, you’ll find yourself tempted to make trivial changes directly to the Unix files because it’s so much easier than changing the cu_html "source", compiling it to HTML, then uploading the new file to your distribution system and replacing the old file. Of course, once you do this, your W4W document no longer matches your Web document, and if you don’t make changes to both documents, you risk losing that "trivial change" the next time you make a more substantial change to the source.
Perhaps, too, a new version will show images and hyperlinks more like the way Mosaic will actually show them. Instead of showing an image, or the standard inline image ‘place holder’ bitmap, HoTMetaL just shows the image file name in square brackets [src: like this]. Similarly, instead of showing a hypertext anchor in blue, it shows the link data in square brackets [href: like this] and doesn’t do anything special to indicate the anchor’s length. (What’s with the "src" and "href" you ask? That’s HTML showing through.)
If so, you might find as I did that it can only read perhaps 10% of the HTML files that Mosaic can display with no problem at all. The authors claim that it conforms to the nascent HTML 2.0 standards, but it seems to me to reject perfectly fine files. In any event, as far as I’m concerned, a Web editor that can’t read anything that a Web browser can display is useless. It’s one thing to correct "bad" HTML, but it should be able to read it.
The bottom line here is you’ll have to learn to read and write HTML if you want to be a Web spinner. Future tools will greatly diminish this necessity, but I suspect that you’ll always have to have at least some familiarity with HTML if you want to publish anything more than the very most basic documents on the Web.
HTML Basics
All HTML files consist of a mixture of text to be displayed and HTML tags, which describe how to display the text.
Normally, extra whitespace (spaces, tabs, and line breaks) is ignored, and text is displayed with a single space
between each word, no matter how much whitespace separates them in the HTML source file. Text is always
wrapped to fit within the reader’s browser’s window in the reader’s choice of fonts: line breaks in the HTML source
are treated just like any other whitespace, and a paragraph break must be explicitly marked with the HTML
<p> tag.
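To make the whitespace rule concrete, here is a small sketch; the sentences are invented for illustration. A browser should display the two paragraphs below identically, since it collapses the extra spaces and line breaks in the first one, and only the <p> tag starts a new paragraph.

```html
This   sentence    is spread
over several

source lines.
<p>
This sentence is spread over several source lines.
```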
Tags are always set off from the surrounding text by angle brackets, or the less-than and greater-than signs. Most tags come in "begin" and "end" pairs: e.g., <i>e.g.</i>. That is, the end tag looks just like the begin tag, except that it has a slash between the opening bracket and the tag name. Just to keep things legible, I’ll speak of tag pairs, instead of always writing out the begin and end form. That is, I’ll say "an ADDRESS tag pair" instead of "an <address></address> pair".
There are a few tags which appear by themselves, not paired with an end tag. Since most tags do come in pairs, I’ll take particular care to point out the exceptions as they come up.
HTML is case insensitive: <HTML> means exactly the same thing as <html> or even <hTmL>. However, many Web servers are running on Unix systems, which are case sensitive. This will never affect HTML interpretation, but will affect your hyperlinks: My.gif is not the same file as my.gif or MY.GIF.
Some begin tags can take parameters, which come between the tag name and the closing bracket like this: <dl compact>. Some tags, like character attributes and most paragraph types, are never parameterized. Others, like description lists, have optional parameters that will alter their appearance, if your reader’s browser supports that option. Still others, like anchors and images, absolutely require certain parameters (like where the hyperlink should go to, or what image file to display) and can also take other, optional parameters.
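As a rough illustration of the three cases, the fragment below shows a tag with no parameters, a tag with an optional parameter, and two tags with required parameters. The file names Logo.gif and Another.htm are placeholders, and the <!-- ... --> lines are HTML comments, which browsers do not display.

```html
<!-- No parameters: a character attribute -->
<b>bold text</b>

<!-- Optional parameter: a compact description list -->
<dl compact>
<dt>Tag <dd>Description
</dl>

<!-- Required parameters: src for images, href for anchors -->
<img src=Logo.gif alt="Company logo">
<a href=Another.htm>click here</a>
```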
The standard structure of an HTML document
All HTML documents have a certain standard structure. Currently, there is no great incentive to follow this structure:
Mosaic will treat any file that ends in .htm as an HTML file, even if it contains no HTML tags; Mosaic will not
interpret a file that does not end in .htm as HTML, even if it’s totally compliant and follows the rules to the last
detail. However, Mosaic is neither a finished product nor the only possible Web browser: Future browsers may
interpret any compliant file as an HTML file, regardless of its name, or may just "work better" with compliant
documents than with non-compliant documents.
It’s even possible that future browsers will totally refuse to interpret non-compliant documents as HTML: While the majority of existing Web pages are non-compliant in one way or another, if the current exponential growth of the Web continues, all currently existing Web pages will be just a small fraction of all Web pages at some point in the relatively near future. That is, if everyone starts writing compliant documents, all that is lost in switching to a browser that can’t handle non-compliant documents will be a few of the oldest documents. While I personally suspect that market forces will be unkind to browsers that don’t handle non-compliant documents at least as well as current ones do, it’s still a good idea not to knowingly write non-compliant documents.
In particular, it is at least possible that there may someday be a reason to have two or more SGML content types in one document. For example, documents might contain the same text marked up in HTML and in "HTML2". If this ever happens, browsers probably will ignore any text that’s outside of an SGML content type tag pair.
While you should not place display text anywhere outside the body section, this too is currently optional, as Mosaic will format and display any text that’s not in a tag, whether it’s in the body of the document or not. Also, while you can get away with not using the HEAD tag pair, it’s strongly recommended: using it allows software like Web-crawling robots to look at only the information that belongs in the header, without having to retrieve and/or parse the whole document.
Paragraphs are separated either by an explicit paragraph break command, <p>, or by paragraph style commands. The paragraph style determines both the font used for the paragraph and any special indenting. Paragraph styles include several levels of section headers, five types of lists, three different "block formats", and the normal, or default paragraph style. Any text outside of an explicit paragraph style command will be rendered in the normal style.
<html>
<head>
<title>The document title</title>
</head>
<body>
Text and markup
<address> Author and version info </address>
</body>
</html>
The title should appear in the head section, marked off with a TITLE tag pair. For example, <title>Lime-Jerked Chicken</title>. Mosaic actually has such an easy-going parser that the title can appear anywhere in the document, even after the </html>, but future browsers might not be quite so clever and accommodating.
The base tag contains the current document’s URL, or Uniform Resource Locator; browsers can use it to find "local URL’s". See the section All About URL’s for more information about the base statement.
The isindex tag tells browsers that this document is an index document, which means that the server can support keyword searches based on the document’s URL. Searches are passed back to the Web server by concatenating a question mark and one or more keywords to the document URL and then requesting this extended URL. This is very similar to one of the ways that forms data is returned, and you can see the section on Form action and method attributes for somewhat more information.
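A minimal sketch of an index document's header follows. The title is invented, and the server-side search program that actually answers the query is assumed to exist; nothing in the markup itself provides it.

```html
<head>
<title>Recipe index</title>
<isindex>
</head>
```

If this document lived at the hypothetical URL http://www.imaginary.org/recipes.htm, a search for the keywords lime and chicken would typically be requested as http://www.imaginary.org/recipes.htm?lime+chicken, with the keywords conventionally joined by plus signs.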
There are other header elements, like <nextid> and <link>, that are included in HTML for the benefit of editing and cataloging software. They have no visible effect; browsers simply ignore them, and so does this book.
For example, to keep
PC Techniques Bookstream
7721 East Gray Road, Suite 204
Scottsdale, Arizona 85260-6912
from coming out as
PC Techniques Bookstream 7721 East Gray Road, Suite 204 Scottsdale, Arizona 85260-6912
you would write
PC Techniques Bookstream<br>
7721 East Gray Road, Suite 204<br>
Scottsdale, Arizona 85260-6912<br>
The <br> command is one of the few empty HTML tags; it never has a closing </br>. While the empty header elements make assertions about the document as a whole, empty body elements can be thought of as special characters. That is, a <br> can be thought of as a special "end of line" character.
The </p> tag is actually optional, and most documents simply put a <p> at the beginning or end of each paragraph, or by itself on a line between two paragraphs.
Logical attributes are different. In keeping with the SGML concept of describing the content, not the formatting, logical attributes let you describe what sort of emphasis you want to put on a word or phrase, but leave the actual appearance up to the reader, or her browser. That is, where a <b>bold</b> word will always appear in bold type, an <em>emphasized</em> word may be italicized, underlined, bolded, or colored, as the reader prefers.
Web style guides suggest that you should use logical attributes whenever you can, but there’s a slight problem: some current browsers only support some physical attributes, and few or no logical attributes. Since Web browsers simply ignore any HTML tag that they don’t "understand", being "nice" and deferring to your readers’ sensibilities by using logical attributes runs the risk that they will not see any emphasis at all! The compromise I have adopted is to start using logical attributes as the current version of Windows Mosaic supports them; since X-Mosaic seems to always be a step or two ahead of Windows Mosaic, this means that I’m only using the logical attributes that most people can see.
List of physical attributes

Attribute | Tag | Sample | Effect |
---|---|---|---|
Bold | <b> | Some <b>bold</b> text. | Some bold text. |
Italic | <i> | Some <i>italicized</i> text. | Some italicized text. |
Underline | <u> | Some <u>underlined</u> text. | Some underlined text. |
TTY | <tt> | Some <tt>monospaced (tty)</tt> text. | Some monospaced (tty) text. |
Nesting of attributes is allowed, though the results will vary from browser to browser. Some can, for example, show bold italic text, while others will only show the innermost attribute. (That is, <b><i>bold italic</i></b> may show as bold italic.) If you do use nested attributes, be sure to place the end tags in reverse order of the start tags; that is, don’t ever write something like <b><i>bold italic</b></i>! While this will work on some Web browsers, it may cause problems with others.
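The nesting rule can be sketched as follows; the <!-- ... --> lines are HTML comments, which browsers do not display.

```html
<!-- Correct: the inner tag pair closes before the outer one -->
Some <b><i>bold italic</i></b> text.

<!-- Incorrect: the end tags are in the same order as the start tags -->
Some <b><i>bold italic</b></i> text.
```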
As of the time I’m writing this, Windows Mosaic supports the bold and italic attributes, but not the underline or tty attributes.
List of logical attributes

Attribute | Tag | Use and/or interpretation | Typical rendering |
---|---|---|---|
Citation | <cite> | Titles of books and films | Italic |
Code | <code> | Source code fragments | Monospaced |
Definition | <dfn> | A word being defined | Italic |
Emphasis | <em> | Emphasize a word or phrase | Italic |
Keyboard | <kbd> | Something the user should type, word-for-word | Bold monospaced |
Sample | <samp> | Computer status messages | Monospaced |
Strong | <strong> | Strong emphasis | Bold |
Variable | <var> | A description of something the user should type, like <filename> | Italic |
An important point to bear in mind here is that even if current browsers arbitrarily decide that <em> will show as italic and <kbd> as Courier, future browsers will probably defer more to their users’ wishes. That is, you shouldn’t conclude that citations, definitions, and variables all look alike so you should just ignore them and use italic: Future browsers will probably let their users select fonts for each logical attribute, just as they now can for each paragraph style.
As of the time I’m writing this, Windows Mosaic supports only the <em> and <strong> attributes.
Since headers are paragraph types, you mark a phrase as a header by putting it within a header tag pair. You do not include the section’s text within the header tag pair: Header tags are not outlining commands.
Although many documents include the title (or an extended version of it) at the top of the document body as a first level heading, how you use headings is entirely up to you. There is no requirement that you use a <h1> before you use a <h2>, or that a <h4> follow only a <h3> or another <h4>. (There is, however, at least a chance that skipping header levels will confuse non-commercial format conversion software, such as might convert HTML to Postscript. It’s up to you to decide whether to worry about this possibility.)
Please remember, though, that you can assume nothing about a header’s actual appearance, not even that any header will be larger than plaintext.
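As a short sketch of header usage, the fragment below marks only the header phrases themselves; the section text sits outside the tag pairs. The section names are invented for illustration.

```html
<h1>Lime-Jerked Chicken</h1>
An introductory paragraph goes here, outside the header tag pair.

<h2>Ingredients</h2>
The section text follows the header; it is not enclosed by it.
```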
Lists
It’s a bit of an understatement to say that HTML has extensive support for lists. In fact, HTML has five
different list types, which certainly meshes with all my stereotypes about Switzerland, where the WWW
began.
All five list elements can be thought of as a sort of paragraph type, as the list is enclosed within a list tag pair. The first four list types share a common syntax, and differ only in how they format their list elements; the "description" list is unique in that each list element has two parts, a tag and a description of the tag.
All five list types display some sort of element tag - whether a number, a bullet, or a few words - on the left margin, and then the actual list elements appear indented. List elements do not have to fit on a single line or consist of a single paragraph: they may contain <p> and <br> tags.
Lists can be nested, but the appearance of a nested list depends on the browser. For example, some browsers use different bullets for inner lists than for outer lists, and some browsers do not indent nested lists. However, Mosaic and Lynx, which are probably the most common graphical and text mode browsers, do indent nested lists: the tags of a nested list line up with the elements of the outer list, and the elements of the nested list are further indented. For example,
* This is the first element of a bulleted list.
    * This is the first element of a nested list
    * This is the second element of the nested list
* This is the third element of the main list.

In the four list types with ‘simple’ list elements, the list item tag, <li>, is used to mark the start of each list element. Like the line break tag, <br>, there is no corresponding </li> tag; unlike the <br> tag, the <li> tag always appears at the start of a list element, not at the end.
Thus, all simple lists look something like
<ListType>
<li> There isn’t really any ListType list, however the OL, UL, DIR, and MENU lists all follow this format.

<li> Since whitespace is ignored, you can keep your source legible by putting blank lines between your list elements. Sometimes, I like to put the &lt;li&gt; tags on their own lines, too.

<li> (If I hadn’t used the ampersand quotes in the previous list element, the "&lt;li&gt;" would have been interpreted as the start of a new list element.)
</ListType>
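For a concrete instance of this pattern, here is a sketch of a bulleted list with a numbered list nested inside its second element. The item text is invented, and the indentation of the inner list is only there to keep the source legible.

```html
<ul>
<li> First element of the outer, bulleted list.
<li> Second element, which contains a nested numbered list:
  <ol>
  <li> First element of the nested list.
  <li> Second element of the nested list.
  </ol>
<li> Third element of the outer list.
</ul>
```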
All sarcasm aside, the tag for an unordered (bulleted) list is <ul>. While bulleted lists can be nested, you should keep in mind that the list nesting may not be visible: Some browsers indent nested lists; some don’t. Some use multiple bullet types; others don’t.
In practice, I’m not sure if I’ve ever seen these lists in use, and their implementation is still spotty: Current versions of Mosaic do not create multiple columns for a <dir> list, and while they let you choose a directory list font and a menu list font, they do not actually use these fonts.
The description list looks a lot like any other list, except that instead of a bullet or a number, the list tag consists of your text. Description lists are intended to be used for something like a glossary, where a short tag is followed by an indented definition, but the format is fairly flexible. For example, a long tag will wrap, just like any other paragraph, although it should not contain line or paragraph breaks. (Mosaic will indent any <dt> text after a line or paragraph break, as if it were the <dd> text.) Further, you needn’t actually supply any tag text: <dt><dd> will produce an indented paragraph.
If you’d like a tighter look, you can ask for a <dl compact>. If the tags are very short, some browsers will start the descriptions on the same line as the tags:
Tag 1  Description 1
Tag 2  Description 2.
However, most browsers do not support the compact attribute, and will simply ignore it: for example, with current versions of Windows Mosaic, a <dl compact> will always look like a <dl>, even if the tags are very short.
Tag 1
    Description 1
Tag 2
    Description 2
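In source form, a description list might be sketched like this: each <dt> marks a tag and each <dd> marks its description. The glossary entries are only illustrative, and the second list asks for the compact form, which a browser is free to ignore.

```html
<dl>
<dt>HTML
<dd>Hypertext markup language, the language used to describe Web pages.
<dt>URL
<dd>Uniform Resource Locator, which says where a resource can be found.
</dl>

<dl compact>
<dt>b <dd>Bold
<dt>i <dd>Italic
</dl>
```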
Every <img> tag must have a src= parameter. This specifies a URL, or uniform resource locator, which points to a .gif or .xbm bitmap file. See the section All About URL’s, below, for more information; for now, I will simply note that when the bitmap file is in the same directory as the HTML document, the filename is an adequate URL. For example, <img src=MySmilingFace.gif> would insert a picture of my smiling face.
Of course, some people turn off inline images because they have a slow connection to the Web. When they do so, all your images, no matter what size, will be replaced with a standard graphic. This isn’t so bad if the picture is essentially ancillary to the text, but if you’ve used small inline images as "bullets" in a list or as section dividers, the substitution of the placeholder graphic will usually make your page look rather strange. Some people avoid using graphics as structural elements for this reason; others simply don’t worry about people with slow connections; still others include a note at the top of the page saying that all the images on the page are small, and inviting people with inline images off to turn them on and reload the page.
It’s also important to remember that some people use text-only browsers, like Lynx, to navigate the WWW. If you include a short description of your image with the alt= parameter, text-only browsers can show something in place of your graphic. For example, <img src=MySmilingFace.gif alt="A picture of the author">.
Since the alt parameter has spaces in it, we have to put it within quotes. In general, you can put any parameter value in quotes, but only have to do so if it includes spaces. If your parameter value includes a < or a >, you’ll have to use the ‘escape’ mechanism described below, in the Special Characters section.
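The quoting rule can be sketched with the same image tag written three ways; the <!-- ... --> lines are HTML comments, which browsers do not display.

```html
<!-- No spaces in the value, so quotes are optional -->
<img src=MySmilingFace.gif>

<!-- The alt text contains spaces, so it must be quoted -->
<img src=MySmilingFace.gif alt="A picture of the author">

<!-- Quoting every value is always allowed -->
<img src="MySmilingFace.gif" alt="A picture of the author">
```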
Summary of <img> parameters

Parameter | Required? | Values |
---|---|---|
SRC | Yes | URL |
ALT | No | A text string |
ALIGN | No | TOP, MIDDLE, or BOTTOM |
ISMAP | No | None |
Many people use small inline images for decoration and separation, instead of rules. While using images this way lets you customize your pages’ appearance, it also makes them take longer to transfer - and it makes them look horrible with inline images off.
Hypermedia links
Text and decorations make up a single page. The ability to add links to other Web pages or to entirely different sorts
of documents is what makes the Web a hypermedia system. The special sort of highlight that your reader
clicks on to traverse a hypermedia link is called an anchor, and all links are created with the anchor tag, <a>.
Each link has two parts: The visible part, or anchor, which the user clicks on, and the invisible part, which tells the browser where to go. The anchor is the text in between the <a> and </a> tags of the A tag pair, while the actual link data appears in the <a> tag.
Just as the <img> tag had a src= parameter which specified an image file, so does the <a> tag have an href= parameter which specifies the hypermedia reference. Thus, "<a href=SomeFile.Type> click here</a>" is a link to "SomeFile.Type", with the visible anchor "click here".
Browsers will generally use the linked document’s filename extension to decide how to display the linked document. For example, .htm files will be interpreted and displayed as HTML, whether they come from an http server, an ftp server, or a gopher site. Conversely, a link can be to any sort of file, not just HTML files: You use the same <a href=FileName.Type> tag to plant a link to another Web page as to a large bitmap, a sound file, or a movie.
There is also the implementation issue that (current versions of) Windows Mosaic do not "realize a palette" before displaying an image. Presumably this isn’t a problem on a 24-bit display, but most people have 256-color displays, and Windows maps the colors in inline images to the closest colors in the current palette. This can make for some funny looking images! If, however, the image is linked to itself, the reader can simply click on the image to load it into her favorite viewer, which will probably handle the colors much better.
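Linking an image to itself might look like the sketch below: the inline image is the visible anchor, and the href points at the same bitmap file, so clicking the picture hands the file to the reader's external viewer. The file name is the placeholder used earlier in this section.

```html
<a href=MySmilingFace.gif><img src=MySmilingFace.gif alt="A picture of the author"></a>
```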
Many anchors in an image
The <img> tag’s optional IsMap parameter allows you to turn rectangular regions of a bitmap image into
clickable anchors. Clicking on these parts of the image will activate an appropriate URL, and there is also a default
URL for when the user clicks on an area outside of any specially declared rectangle. While forms let you do this a bit
more flexibly, the IsMap approach a) doesn't require any custom programming, just a simple text file that defines the
rectangles and their URL's, and b) may work even on browsers that do not support forms. See
http://wintermute.ncsc.uiuc.edu:8080/map-tutorial/image-maps.htm to learn
how to do this.
Doing this takes two anchor tags: One that defines a name for a location, and one that points to that name. These two tags can be in the same document or in different documents, but you always need to have the two tags: You can’t say something like ‘this link is to Act III, Scene 1 of Hamlet’.
Names do not have to be defined before they are used: It’s actually fairly common for lengthy documents to have a table of contents with links to names defined later in the document. It’s also worth noting that while tag and parameter names are not case sensitive, anchor names are case sensitive: <a href=#anchorname> will not take you to the AnchorName example.
Summary of <a> syntax

To | Use |
---|---|
Link to another document | <a href="URL">highlighted anchor text</a> |
Name an anchor | <a name="Anchor Name">normal text</a> |
Link to a named anchor in this document | <a href="#Anchor Name">highlighted anchor text</a> |
Link to a named anchor in another document | <a href="URL#Anchor Name">highlighted anchor text</a> |
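Putting the rows of this table together, a table-of-contents link and its target could be sketched as follows. The anchor name Lists and the file name Tutorial.htm are hypothetical, and the <!-- ... --> lines are HTML comments, which browsers do not display.

```html
<!-- Near the top of the document: a table of contents entry -->
<a href="#Lists">Lists</a>

<!-- Later in the same document: the named location -->
<h2><a name="Lists">Lists</a></h2>

<!-- From another document -->
<a href="Tutorial.htm#Lists">the section on lists</a>
```

Because anchor names are case sensitive, a link written as href="#lists" would not find the name defined here.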
Just as a complete DOS file name starts with a drive letter followed by a colon, so a full URL starts with a resource type - HTTP, FTP, GOPHER, &c - followed by a colon. If the name doesn’t have a colon in it, it’s assumed to be a local reference, which is a filename on the same file system as the current document. Thus, <a href=Another.htm> refers to the file "Another.htm" in the same directory as the current file, while <a href=/html/File.htm> refers to the file "File.htm" in the top-level directory "html". One thing to note here is that a URL always uses "/", the Unix-style forward slash, as a directory separator even when the files are on a Windows machine which normally uses "\", the DOS-style backwards slash.
Local URL’s can be very convenient when you have several HTML files with links to each other, or when you have a large number of inline images. If you ever have to move them all to another directory, or to another machine, you don’t have to change all the URL’s.
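The difference between local and full URL's can be sketched as follows. The image file name is a placeholder, and www.imaginary.org is an imaginary machine.

```html
<!-- Hypothetical file names, for illustration only. -->
<!-- Relative to the directory of the current document: -->
<a href="Another.htm">another page in this directory</a>
<img src="logo.gif" alt="logo">

<!-- Relative to the top of the same server: -->
<a href="/html/File.htm">a page in the top-level html directory</a>

<!-- A full URL, which has to be edited if the files ever move: -->
<a href="http://www.imaginary.org/html/File.htm">the same page, spelled out</a>
```

Only the last link names a machine, so only the last link would need to change if the whole set of files moved to another host.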
The base statement is like the <img> statement in that it’s a so-called empty tag, without a concluding </base> tag that encloses some text. It requires a href parameter - e.g. <base href=http://www.imaginary.org/index.htm> - which should contain the URL of the document itself. When a browser that supports the base statement encounters a URL that doesn’t contain a protocol and path, it will look for it relative to the base URL, instead of relative to wherever it actually loaded the document from.
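A minimal sketch of a document that uses the base statement might look like this. The URL is the imaginary one from the example above, and placing the tag in the document head is the usual convention rather than something stated here.

```html
<html>
<head>
<title>Index</title>
<!-- The URL is the imaginary example from the text, not a real address. -->
<base href="http://www.imaginary.org/index.htm">
</head>
<body>
<!-- With the base above, a supporting browser should resolve this link to
     http://www.imaginary.org/Another.htm, wherever this copy was loaded from. -->
<a href="Another.htm">another page</a>
</body>
</html>
```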
Broadly speaking, the resource-type://machine-name/resource-name URL form is used with centralized resources, where there’s a single server that supplies the document to the rest of the net, using a particular protocol. Thus, "http://www.another.org/Complex.htm" means ‘use the Hypertext Transfer Protocol to get the file Complex.htm from the main WWW directory on the machine www.another.org’, while "ftp://foo.bar.net/pub/www/editors/README" means ‘use the File Transfer Protocol to get the file /pub/www/editors/README from the machine foo.bar.net’.
Conversely, many resource types are distributed. We don’t all get our news or mail from the same central server, but from the nearest one of many news and mail servers. URL’s for distributed resources use the simpler form resource-type:resource-name. For example, "news:comp.infosystems.www.providers" refers to the Usenet newsgroup comp.infosystems.www.providers, which is a good place to look for further information about writing HTML.
A Partial Table Of URL Resource Types

Resource | Interpretation | Format |
---|---|---|
HTTP | Hypertext Transfer Protocol | http://machine-name/file-name |
FTP | File Transfer Protocol | ftp://machine-name/file-name |
GOPHER | Gopher | gopher://machine-name/file-name |
NEWS | Internet News | news:group-name |
TELNET | Logon to a remote system | telnet://machine-name |
MAILTO | Normal Internet email | mailto:user-name@machine-name |
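To show how the formats in the table look inside anchors, here is a short sketch of a list of links, one per resource type. The http, ftp and news URL's are the examples used in the text; the gopher, telnet and mailto addresses are invented placeholders.

```html
<!-- gopher.imaginary.org, library.imaginary.org and the mail address are
     hypothetical; they are not real services. -->
<ul>
<li><a href="http://www.another.org/Complex.htm">a Web page</a>
<li><a href="ftp://foo.bar.net/pub/www/editors/README">a file on an ftp site</a>
<li><a href="gopher://gopher.imaginary.org/">a Gopher menu</a>
<li><a href="news:comp.infosystems.www.providers">a Usenet newsgroup</a>
<li><a href="telnet://library.imaginary.org">a remote login</a>
<li><a href="mailto:webmaster@imaginary.org">send email</a>
</ul>
```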
For example, if you’re putting together your own pages of Neat Things To Click On, you can (with some versions of Mosaic) copy the current URL out of a text box in the header pane, next to the NCSA logo. Similarly, the "menu editor" has a way to copy the current URL to an editable field, where you can copy the URL to the Clipboard and then paste it to some other application.
In the character-mode world, if you type "=" at any Gopher menu, you will get some technical information about that menu which includes a URL that you can put into your HTML files.
Naturally, this causes a problem if you want to add HTML markup to an existing document to make it easier on the eyes or to make it easier to jump to specific sections! If the document is relatively static - like Alice In Wonderland or the US Constitution - there’s probably no alternative to making a copy and marking it up. However, with something that changes often, like a FAQ [Frequently Asked Questions] file, probably the best thing to do is to try to come up with some automatic way to, say, build a clickable table of contents.
Security concerns ensure that anonymous ftp users have no access to the regular file system, so ftp has to have its own, self-contained file system without any links out to the regular file system. You can, however, have links into the ftp file system from the regular file system. That is, you can’t establish a link from a ftp directory to a .public_html directory, but you can create a link from a html directory to a ftp directory.
For example, in my html directory, I did ln -s ~ftp/pub/user/jon ftp, which created a ftp ‘directory’ in
my html directory. I can thus create URL’s that look like "ftp/README" instead of
"ftp://deeptht.armory.com/pub/user/jon/README". The shorter URL is not only easier to read, but is also easier for
the host to process.
Special characters
Since < and > have special meanings in SGML, there has to be an escape mechanism that lets you include them
in your text without causing syntax errors. Similarly, since URL’s with embedded spaces need to be quoted, there
has to be an escape mechanism to include a quote character in a URL without closing the quotes. Finally, while the
default character set for the Web is ISO Latin-1, which includes European language characters like é and ß in the
range from 128 to 255, it’s not uncommon to pass around snippets of HTML in 7-bit email, or to edit them on dumb
terminals, so the escape mechanism also has to include a way to specify high-bit characters using only 7-bit
characters.
The symbolic form is much easier to read, but its use is restricted to the four low-bit characters with special meaning in SGML and to the European-language characters. To use the other symbols in the ISO Latin-1 character set, like ® and the various currency symbols, you have to use the numeric form. The symbolic escape is like the numeric escape, except there’s no #. For example, to insert é, you would use &eacute;: that is, &, the character name, and a closing semicolon (the numeric form would be &#233;). You should be aware that symbol names are case sensitive: &Eacute; is É, not é, while &EAcute; is no character at all, and will show up as &EAcute;!
There is a complete table of all characters with HTML names, as well as all the other high-bit characters in the ISO Latin-1 character set, in Appendix B, HTML Special Characters.
Since HTTP, the WWW’s document transfer protocol, is always 8-bit, you only need to use the escape mechanism for the high-bit characters if you plan to quote your document in email, or if you are editing it on an ASCII terminal. If you can use the high-bit characters directly, you will only have to use the & escape mechanism on the relatively few occasions when you have to quote <, >, ", or &.
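The escapes described in this section can be sketched in a few lines of markup. The sentences are invented for the example; the character codes are the standard ISO Latin-1 values.

```html
<!-- A sketch of the symbolic and numeric escapes; the text itself is made up. -->
<p>In HTML, a tag starts with &lt; and ends with &gt;, and an escape starts with &amp;.
<p>She said &quot;caf&eacute;&quot;, which can also be written caf&#233;.
<p>Symbols without names need the numeric form: &#174; is the registered trademark sign.
```

A browser should display the first line with literal <, > and & characters, and both spellings of the word in the second line should look identical.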
Preformatted and other special paragraph types
HTML has three special "block" formats. Any plaintext within them is supposed to appear in a distinctive font. While
they can be used simply as paragraph styles, they can also enclose headers and lists. It’s worth noting, though, that
current browsers don’t really do a very good job of living up to the spec, here, and using these block formats for
anything but paragraph styles will probably not do what it should.
Preformatted is a block format, and any HTML markup will be processed, so you can have anchors as well as bold or italic monospaced text within your preformatted block. Headers and lists in preformatted blocks tend to confuse current browsers, though.
The initial <pre> tag can have an optional width= parameter. Browsers will not trim lines to this length; the intent is to allow the browser to select a monospaced font that will allow the maximum line length to fit in the browser window.
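Here is a small sketch of a preformatted block that mixes markup with monospaced text. The width value, the prices and the file name gadget.htm are assumptions made up for the example.

```html
<!-- The width value and the file name are illustrative assumptions. -->
<pre width="40">
Item          Price
<b>Widget</b>        $1.50
<a href="gadget.htm">Gadget</a>        $2.75
</pre>
```

The tags themselves take up no room on the screen, so the two columns should line up in the browser even though they look ragged in the source.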
Many people put a horizontal rule, <hr>, between the body of the document and the address block. If you include a link to your home page or to a page that lets the reader send mail to you, you don’t ‘have’ to include a lot of information on each individual page.
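A minimal sketch of such a page ending follows, assuming the address block is written with the <address> tag. The home page URL and the mail address are placeholders.

```html
<!-- The home page URL and mail address are placeholders, not real ones. -->
<hr>
<address>
<a href="http://www.imaginary.org/index.htm">Home page</a> -
<a href="mailto:webmaster@imaginary.org">webmaster@imaginary.org</a>
</address>
```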
Forms
Everything we have seen so far corresponds to traditional publishing: You create a hypermedia document, and others
read it. You are a producer, your readers are consumers. Your readers can only see what you have put out for
them.
With HTML forms, though, that begins to change. You can create a form that lets your readers search a database using any criteria they like. Or you can create a form that lets them critique your Web pages, or your new software. Or - and this is what excites business people - you can use forms to sell things over the net.
It’s pretty easy to create forms. However, to actually use them you’ll need a program that runs on your Web server to process the information that the user’s client sends back to you. For simple things like a "comments page", you can probably use an existing program; for anything more complex, you’ll probably need a custom program. While I will briefly describe the way forms data looks to the receiving program, any discussion of forms programming is quite beyond the scope of this book.
As of the summer of 1994, forms are still a relatively new feature that does not work with Mac Mosaic, only with Windows and X-Windows Mosaic, but this condition will probably not last long.
Within your form, you can use any other HTML elements, including headers, images, rules, and lists. This gives you a fair amount of control over your forms’ appearance, but you should always remember that the user’s screen size and font choices will strongly affect the actual appearance of your form.
While you can have more than one form on a page, you cannot nest one form within another.
Form action and method attributes
Nothing gets sent to your Web server until the user presses the Submit button or clicks on an image map. What
happens then depends on the action, method, and EncType parameters of the <form> tag.
The action parameter gives a URL to which the form’s content will be sent. This is most commonly in the cgi-bin directory of a Web server. If you do not specify an action parameter, the contents will be sent to the current document’s URL.
The method parameter tells how to send the form’s contents. There are two possibilities, here: get and post. If you do not specify a method, get will be used. Get and post both format the form’s data identically; they differ only in how they pass the form’s data to the program that uses that data.
Get and post both send the form’s contents as a single long text vector consisting of a list of WidgetName=WidgetValue pairs, each separated from its successor by an ampersand. For example, before any escaping is applied, "Name=Jon Shemitz&Address=jon@midnightbeach.com". (Any & or = sign in a widget name or value is escaped before it is sent, as are spaces: in the URL-style encoding that browsers use, a space becomes + and other special characters become % followed by two hex digits. Any ‘bare &’ and any = sign can therefore be taken as a separator.) You will not necessarily get a name and a value for every widget in the form: While empty text is explicitly sent as a WidgetName= with an empty value, unselected radio buttons and check boxes don’t send even their name.
Where get and post differ is that the get method creates a "query URL" which consists of the action URL, a question mark, and the formatted form data, whereas the post method sends the formatted form data to the action URL in a special data block. The Web server parses the query URL that a get method creates and passes the form data to the form processing program as a command line parameter. This creates a limitation on form data length that the post method does not have.
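The difference between the two methods can be sketched with a one-widget form. The action URL, the program name search and the widget name Keyword are all hypothetical.

```html
<!-- The action URL and widget name are hypothetical. -->
<form action="http://www.imaginary.org/cgi-bin/search" method="get">
Keyword: <input type="text" name="Keyword">
<input type="submit" value="Search">
</form>
<!-- If the reader types mosaic and presses Search, the browser should request
     http://www.imaginary.org/cgi-bin/search?Keyword=mosaic -->

<!-- With method="post" instead, the same Keyword=mosaic text would travel
     in a data block and the URL requested would be the bare action URL. -->
```

The Submit button has no name here, so it contributes nothing to the data that is sent.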
Currently, all forms data is sent in plaintext. This creates a security problem, which will be discussed below. The optional EncType parameter offers a possible solution: Though currently this only allows you to ratify the plaintext default, in the future there will probably be values that call for an encrypted transmission. For example, in August of 1994, a Massachusetts startup demonstrated forms transmission using PGP (Pretty Good Privacy) data encryption: Presumably, this depends on a custom Mosaic client that understands a new EncType parameter, but by the time you read this there may be public standards for encrypting forms data.
Syntax of the TEXT and PASSWORD input types:

Attribute | Required? | Format | Meaning |
---|---|---|---|
Type | No | Type="Text" or Type="Password" | Determines what type of widget this will be. Default is "text". |
Name | Yes | Name="WidgetName" | Identifies the widget. |
Value | No | Value="Default text" | Lets you supply a default value. Cannot contain html commands. |
Size | No | Size=Cols | Width (in characters) of a single line text area. Default is 20. |
Size | No | Size=Cols,Rows | Width and height (in characters) of a multi-line text area. |
MaxLength | No | MaxLength=Chars | Longest value a single line text area can return. Default is unlimited. |
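As a sketch of how the attributes in the table combine, here is a small form with a single-line text widget, a password widget, and a multi-line text widget. The action URL and all widget names are invented.

```html
<!-- Widget names, sizes and the action URL are made up for this sketch. -->
<form action="/cgi-bin/login" method="post">
<p>Name: <input type="text" name="UserName" value="guest" size="30" maxlength="40">
<p>Password: <input type="password" name="UserPassword" size="8">
<!-- No type is given here, so the default of "text" applies. -->
<p>Comments: <input name="Comments" size="40,4">
<p><input type="submit">
</form>
```

The Cols,Rows value is quoted because it contains a comma.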
Radio buttons are grouped by giving them identical names. Only one (or none) of the group can be checked at any one time; clicking a radio button will turn off whichever button in the name group was already on.
Check boxes and radio buttons return their value if and only if they are checked. An unchecked widget is completely silent.
Syntax of the CHECKBOX and RADIO types

Attribute | Required? | Format | Meaning |
---|---|---|---|
Type | Yes | Type="CheckBox" or Type="Radio" | Determines what type of widget this will be. Default is "text". |
Name | Yes | Name="WidgetName" | A unique identifier for a checkbox; a group identifier for radio buttons. |
Value | Yes | Value="WidgetValue" | The value is sent iff the widget is checked. |
Checked | No | Checked | If this attribute is present, the widget starts out checked. |
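A short sketch may make the naming rules clearer: each check box gets its own name, while the radio buttons share one. The action URL, names and values are invented for the example.

```html
<!-- All names, values and the action URL are hypothetical. -->
<form action="/cgi-bin/survey" method="post">
<p>Platforms you use:
<input type="checkbox" name="UsesWindows" value="Yes" checked> Windows
<input type="checkbox" name="UsesMac" value="Yes"> Mac
<input type="checkbox" name="UsesUnix" value="Yes"> Unix
<p>Connection:
<input type="radio" name="Connection" value="Modem" checked> Modem
<input type="radio" name="Connection" value="Network"> Network
<p><input type="submit">
</form>
<!-- Submitted unchanged, this should send UsesWindows=Yes&Connection=Modem;
     the unchecked widgets send nothing at all. -->
```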
Clicking on an image map, like clicking on a Submit button, will send all form data to the Web server.
Syntax of the IMAGE type

Attribute | Required? | Format | Meaning |
---|---|---|---|
Type | Yes | Type="Image" | Determines what type of widget this will be. Default is "text". |
Name | Yes | Name="WidgetName" | Identifies the widget. |
Src | Yes | Src="URL" | The URL of a bitmapped image to display. |
Syntax of the SUBMIT and RESET types

Attribute | Required? | Format | Meaning |
---|---|---|---|
Type | Yes | Type="Submit" or Type="Reset" | Determines what type of widget this will be. Default is "text". |
Name | No | Name="WidgetName" | The buttons never return their values, so a name will never be used. |
Value | No | Value="WidgetValue" | The button text. Default is Submit or Reset, respectively. |
Syntax of the HIDDEN type

Attribute | Required? | Format | Meaning |
---|---|---|---|
Type | Yes | Type="Hidden" | Determines what type of widget this will be. Default is "text". |
Name | Yes | Name="WidgetName" | Identifies the widget. |
Value | Yes | Value="WidgetValue" | Whatever constant data you might want to include with the form. |
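The image, submit, reset and hidden types can be seen together in one small sketch. The action URL, the widget names and the file send.gif are assumptions, not part of any real service.

```html
<!-- The action URL, widget names and image file are hypothetical. -->
<form action="/cgi-bin/order" method="post">
<!-- Constant data that travels with the form but is never displayed. -->
<input type="hidden" name="FormVersion" value="1.0">
<p>Quantity: <input type="text" name="Quantity" size="4">
<p><input type="submit" value="Send order"> <input type="reset" value="Start over">
<p>Or click the picture to send:
<input type="image" name="SendPicture" src="send.gif">
</form>
```

Either the Send order button or a click on the picture should send the form's data; the Start over button only restores the defaults.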
Syntax of the <TextArea> tag

Attribute | Required? | Format | Meaning |
---|---|---|---|
Name | Yes | Name="WidgetName" | Identifies the widget. |
Rows | No | Rows=Rows | TextArea height, in characters. |
Cols | No | Cols=Cols | TextArea width, in characters. Default is 20. |
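A minimal sketch of a comments form built around a TextArea follows. The action URL and widget name are invented, and the text between the tags is normally used as the initial contents of the widget.

```html
<!-- The action URL and widget name are hypothetical. -->
<form action="/cgi-bin/comments" method="post">
<p>Your comments:<br>
<textarea name="Comments" rows="6" cols="60">
Type your comments here.
</textarea>
<p><input type="submit" value="Send comments">
</form>
```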
Just as you can use check boxes for 0 to N selections, or radio buttons for 0 or 1 selections, you can specify how many options a select widget lets your users choose. Normally, select widgets act like a set of radio buttons: your users can only select zero or one of the options. However, if you specify the MULTIPLE option, the select widget will act like a set of check boxes: your users may select any or all of the options.
Syntax of the <select> tag

Attribute | Required? | Format | Meaning |
---|---|---|---|
Name | Yes | Name="WidgetName" | Identifies the widget. |
Size | No | Size=Rows | This is the widget height, in character rows. If the size is 1, you get a pull-down list. If the size is greater than 1, you get a scrolling list. Default is 1. |
Multiple | No | Multiple | Allows more than one option to be selected. |
Within the SELECT tag pair is a series of <option> statements, each followed by the option text. These are similar to <li> list items, except that <option> text may not include any HTML markup. The option tag may include an optional selected attribute; more than one option may be selected if and only if the select tag includes the multiple option.
For example,
Which Web browsers do you use?
<select name="Web Browsers" multiple>
<option>Mosaic
<option>Lynx
<option>WinWeb
<option>Cello
</select>
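For contrast, here is a sketch of a single-choice select widget that is shown as a scrolling list with one option preselected. The widget name and option text are invented.

```html
<!-- Widget name and option text are invented for this sketch. -->
Which platform do you use most?
<select name="Platform" size="3">
<option selected>Windows
<option>Macintosh
<option>Unix
</select>
```

Without the multiple attribute, only one option should carry the selected attribute.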
The problem is that anyone who loads your form can read the HTML source to see where the forms data goes. If that data includes any tempting information like a credit card number, a thief may be tempted to watch traffic to your server for credit card numbers to steal. Since it can be relatively easy to intercept TCP/IP packets, this is a problem that you shouldn’t ignore!
Basically, if you want to do on-line sales, DON’T use a plaintext form to ask for a credit card number. Instead, use a service that may let customers create accounts over the Web but will only accept credit card numbers and expiration dates via a voice phone call or through snail (physical letter) mail. When your customers want to place an order, they don’t run the risk of having their credit card number stolen; they would only have to supply a name and address to let the order taking system look up their credit card number.
On the other hand, if you upload your HTML files to a Unix system for distribution, you may find it hard to avoid using vi to make minor changes. vi is old-fashioned, to say the least, and not only does it have a maximum line length, it doesn’t word wrap or scroll horizontally: It wraps long lines to the next screen line, right in the middle of words. If you’re going to have to use vi, it will be best if your source is word-wrapped to fit on an 80-column screen.
If you have enough memory to run both Mosaic and your editor at the same time, just switch back to your editor, correct your mistakes, and switch back to Mosaic. Then either select Reload from the Navigate menu or just click on the reload button on the tool bar. This will show your corrections without having to reselect or retype the file in the Load Local File dialog box.
Naturally, this raises the question of whether you should even use the "unsupported" features. The answer depends on the expected lifetime and readership of your document. If you’re writing an announcement of a picnic next month for a group whose members all use Windows Mosaic, it would be silly to use a compact description list or some other feature that will just be ignored. On the other hand, if you expect your document to last indefinitely and to be read by a wide variety of people, then it’s probably best to use currently unsupported features - where they’re appropriate! - but to be careful that the document is legible without them. For example, when I post snippets of code, I use the <code> or <tt> tags, even though my browser ignores them. Someday, it won’t.
Avoid large images: Use a "thumbnail" image, linked to a full-size image, and let your reader decide if they really want to see the full picture.
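A thumbnail link can be sketched in a couple of lines. The file names and the size quoted in the caption are placeholders.

```html
<!-- File names and the quoted size are placeholders. -->
<a href="sunset.gif"><img src="sunset-small.gif" alt="Sunset (thumbnail)"></a>
Sunset over the bay (full-size image, about 200K)
```

Only the small image is loaded with the page; the full-size file is fetched only if the reader clicks on the thumbnail.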
Similarly, take advantage of hypertext. Don’t put up a long essay as a single, long and detailed document; write what amounts to an outline or overview, and let your readers click on the sections that seem interesting to them.
Sometimes this doesn’t really matter: If the inline image is just a line drawing, any colors it has are pretty arbitrary. On the other hand, most photographs look pretty weird with the wrong palette: you should link even small photos to themselves, so that your reader can look at your photos with the viewer of her choice.
Good luck - and have fun!
There’s no doubt that it’s easier to use some nice WYSIWYG desktop publishing software than it is to write HTML -
but, really, after just a little practice, HTML becomes easy to write. (After a little more practice, it even begins to
seem almost readable.) But electronic publishing is not just more environmentally benign than paper publishing and
physical distribution, it also offers a flexibility and immediacy that paper can’t touch. One way or another, electronic
publishing is the future - I’d rather see that future belong to something like the Worldwide Web of thousands or
millions of individuals with something to say than to the large corporations with their "500 channels" of recycled crap.
Wouldn’t you?
This document originally appeared in The Mosaic & Web Explorer
Copyright © 1994,1995 Jon Shemitz - jon@midnightbeach.com - Written August '94, HTML markup 12-Jun-95