XHTML

HTML as an XML Application

HTML (3.2, 4.0) is an SGML application
XML is the standard for extensible markup
HTML has to be reformulated in XML
XHTML 1.0: W3C Proposed Recommendation, August 1999
HTML 4.01 contains the necessary adjustments
adopted as a W3C Recommendation in January 2000

Why XHTML?

XHTML documents conform to XML.
They can be processed with XML tools.
XHTML documents can be served as text/html and used by HTML 4.0 browsers.
XHTML documents can also be served as text/xml or application/xml (with suitable style sheets).
XHTML documents can be used with the DOM or the XML DOM, i.e. with (Java)Scripts and applets.
XHTML documents from different authors (systems, environments) will interoperate better than HTML documents.
Because XHTML is an XML application, new markup elements can be added easily.
XHTML is no longer limited to browsers. Many other user agents (mobile phones, speech output, etc.) will be able to handle it (best effort content transformation).
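
The point about XML tools can be illustrated with a generic XML parser. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the sample document, the function names and the namespace URI (taken from the final XHTML 1.0 Recommendation) are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the final XHTML 1.0 Recommendation (assumption for this sketch).
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical sample document, kept small for illustration.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Browser title</title></head>
<body>
<h1>Document title</h1>
<p>A paragraph <br />on two lines.</p>
</body>
</html>"""


def title(xhtml_text):
    """Return the text of the title element, or None if there is none."""
    root = ET.fromstring(xhtml_text)
    node = root.find("{%s}head/{%s}title" % (XHTML_NS, XHTML_NS))
    return node.text if node is not None else None


def headings(xhtml_text):
    """Return the text of all h1 elements, found with a plain XML parser."""
    root = ET.fromstring(xhtml_text)
    return [h.text for h in root.iter("{%s}h1" % XHTML_NS)]


if __name__ == "__main__":
    print(title(SAMPLE))
    print(headings(SAMPLE))
```

No HTML-specific parser is involved: the document is handled like any other XML document.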

Requirements for XHTML-Conforming Documents

They must be XML documents that are valid according to an XHTML DTD.
The root element must be <html>.
The root element must declare a valid XHTML namespace, which must be a valid XML namespace.
An XML DOCTYPE declaration must be present before the root element.
The Internet media type (MIME type) may be text/html, text/xml or application/xml.
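
Some of these conditions can be checked mechanically. The following is a rough sketch, assuming Python's standard xml.dom.minidom and the namespace URI of the final XHTML 1.0 Recommendation; the function name and the sample strings are made up for this illustration. It does not validate against the DTD, because the standard library contains no validating parser.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Namespace of the final XHTML 1.0 Recommendation (assumption for this sketch).
XHTML_NS = "http://www.w3.org/1999/xhtml"

GOOD = """<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>t</title></head><body><p>text</p></body>
</html>"""

BAD = "<body><p>text</p></body>"


def conformance_problems(text):
    """Return a list of violated conditions (an empty list means none found)."""
    try:
        doc = minidom.parseString(text)
    except ExpatError as exc:
        return ["not well-formed: %s" % exc]
    problems = []
    root = doc.documentElement
    # Strip a possible namespace prefix to get the local element name.
    if root.tagName.split(":")[-1] != "html":
        problems.append("root element is <%s>, not <html>" % root.tagName)
    if root.namespaceURI != XHTML_NS:
        problems.append("root element is not in the XHTML namespace")
    if doc.doctype is None:
        problems.append("no DOCTYPE declaration before the root element")
    elif "XHTML" not in (doc.doctype.publicId or ""):
        problems.append("DOCTYPE does not name an XHTML DTD")
    return problems


if __name__ == "__main__":
    print(conformance_problems(GOOD))
    for problem in conformance_problems(BAD):
        print(problem)
```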

Example

<?xml version="1.0"?>
<!DOCTYPE html 
    PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Browser title</title>
</head>
<body>
<h1>Document title</h1>
<p>
A paragraph <br />
on two lines.
</p>
<math xmlns="http://www.w3.org/1998/Math/MathML">
   ... text in MathML ...
</math>
</body>
</html>

Behavior of XML User Agents (Browsers)

XML UAs must check documents for well-formedness.
Validating UAs must also validate the document against all referenced DTDs.
For unrecognized elements, the content must be rendered (as in HTML, unlike XML).
Unrecognized attributes must be ignored (as in HTML, unlike XML).
For unrecognized attribute values, the default value must be used (as in HTML, unlike XML).
Unrecognized entities must be rendered as character strings (&xyz;) (as in HTML, unlike XML).
Characters that are recognized but cannot be rendered must be displayed so that it is obvious that normal rendering has not taken place (unlike HTML, unlike XML).
Whitespace directly after a start tag and immediately before an end tag must be ignored (unless XML specifies otherwise).

Differences from HTML 4.0

XHTML documents must be well-formed; in particular, elements must be properly nested.

<p>Paragraph <em>emphasis</em></p>

instead of
<p>Paragraph <em>emphasis</p></em>
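
The nesting rule can be checked with any XML parser. A minimal sketch, assuming Python's standard xml.etree.ElementTree; the function name is chosen for this illustration:

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if the fragment parses as XML, False otherwise."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(is_well_formed("<p>Paragraph <em>emphasis</em></p>"))  # True
    print(is_well_formed("<p>Paragraph <em>emphasis</p></em>"))  # False
```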

Element and attribute names must be written in lowercase.
```
<li> instead of <LI>
```
End tags must always be present (unless the element is declared as EMPTY in the DTD).
```
<p>Paragraph</p><p>another paragraph</p>
instead of
<p>Paragraph<p>another paragraph
```
For empty elements without an end tag, the start tag must end with "/>".
```
<br /> instead of <br>
```
Attribute values must be enclosed in quotes, even numeric values.
```
<img ... width="300" />
instead of
<img ... width=300 />
```

Attribute values must always be written out (no attribute minimization).

<dl compact="compact" >
instead of
<dl compact >

In attribute values, runs of whitespace are collapsed to a single blank, and leading and trailing whitespace is stripped.
```
alt="   Description     of   an image     "

becomes
alt="Description of an image"
```
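
The whitespace rule can be written down in a few lines. This is only a sketch of the rule as stated above, not the code of any particular user agent; the function name is an assumption.

```python
import re

# XML whitespace characters: space, tab, carriage return, line feed.
_XML_WHITESPACE = re.compile(r"[ \t\r\n]+")


def normalize_attribute_value(value):
    """Collapse whitespace runs to one blank and strip both ends."""
    return _XML_WHITESPACE.sub(" ", value).strip(" ")


if __name__ == "__main__":
    print(repr(normalize_attribute_value("   Description     of   an image     ")))
```
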
Script text must be marked as CDATA if it contains < or &.
```
<script>
 <![CDATA[ 
 ... script content
 ]]>
</script>
```
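
The effect of the CDATA section can be observed with an XML parser. A minimal sketch, assuming Python's standard xml.dom.minidom; the sample page, the JavaScript function check() and the Python function names are hypothetical.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical page whose script contains both "<" and "&".
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>t</title>
<script type="text/javascript">
<![CDATA[
if (a < b && b < c) { check(); }
]]>
</script>
</head><body><p>x</p></body></html>"""

# The same page with the CDATA markers removed.
UNPROTECTED = PAGE.replace("<![CDATA[", "").replace("]]>", "")


def script_text(xhtml_text):
    """Return the character data of the first script element."""
    doc = minidom.parseString(xhtml_text)
    script = doc.getElementsByTagName("script")[0]
    return "".join(node.data for node in script.childNodes)


def parses(xhtml_text):
    """Return True if the text is well-formed XML."""
    try:
        minidom.parseString(xhtml_text)
        return True
    except ExpatError:
        return False


if __name__ == "__main__":
    print(script_text(PAGE).strip())
    print(parses(UNPROTECTED))  # False
```
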
SGML exclusions cannot be expressed in an XML DTD and are therefore only stated in prose,
e.g. an a element must not contain another a element.
Where HTML uses the name attribute as an identifier, XHTML requires the XML id attribute; name may be kept alongside it.
```
<a name="section1" id="section1" ... >
```

This file is available in XHTML and as XML.

Tips and Hints

Processing instructions and character encodings are not recognized by all UAs:
<?... ?>, UTF-8, UTF-16.
Use a blank before />.
Use <br /> instead of <br></br>.
Use <p> </p> instead of <p />.
Use external scripts if "<", "&" or "]]>" occur in them.
Do not use line breaks or multiple spaces in attribute values.
Use lang and xml:lang together as attributes.
Use name="xyz" and id="xyz" together as attributes to identify elements.
Use <?xml ... encoding="iso-8859-1"?> and <meta http-equiv="Content-type" ... charset="iso-8859-1" /> together to declare the character encoding.
Some UAs have problems with boolean attributes.
Problem with DOMs:
the HTML 4.0 DOM uses uppercase names,
the XHTML 1.0 DOM uses lowercase names,
the XML 1.0 DOM preserves upper and lower case.
Problem with "&" in attribute values, where it must be written as "&amp;", e.g.
href=".../script.pl?n1=w1&amp;n2=w2" instead of href=".../script.pl?n1=w1&n2=w2"
Problems with style sheets (CSS):
upper and lower case of element names.
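
The ampersand tip is easy to get wrong when links are generated by a program. A minimal sketch, assuming Python's standard xml.sax.saxutils helpers; the function name and the /cgi-bin/ path are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def link(href, text):
    """Build an a element with a correctly escaped href attribute."""
    # quoteattr escapes "&" and "<" and adds the surrounding quotes.
    return "<a href=%s>%s</a>" % (quoteattr(href), escape(text))


if __name__ == "__main__":
    markup = link("/cgi-bin/script.pl?n1=w1&n2=w2", "run script")
    print(markup)
    # An XML parser gives back the original URL with the plain "&".
    print(ET.fromstring(markup).get("href"))
```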

Outlook

Modularization of XHTML, i.e. tailoring it to specific UAs
Formalizing how subsets and extensions are built.
Document profiles
Status as of May 2003
- XHTML Basic, December 2000
- XHTML 1.1 - Module-based XHTML, May 2001
- XHTML 2.0, Working Draft, January 2003

HTML Tidy

tool for finding errors in HTML
also offers error correction
can remove the cruft left behind by HTML editors
can convert HTML to XHTML
official tool of the W3C
recognizes, checks and corrects the document type
available for almost all platforms
support for XML, ASP, PHP

Examples of How HTML Tidy Works

Example of bad HTML (bad.html) and the result after processing with HTML Tidy (good.html).

Error Messages from the Example

> tidy  exam/bad.html >exam/good.html

Tidy (vers 19th October 1999) Parsing "exam/bad.html"
line 3 column 1 - Warning: inserting missing 'title' element
line 5 column 2 - Warning: replacing unexpected <h2> by </h1>
line 5 column 37 - Warning: discarding unexpected </h3>
line 7 column 42 - Warning: replacing unexpected </i> by </b>
line 8 column 15 - Warning: replacing unexpected </b> by </i>
line 10 column 45 - Warning: missing </i> before </h2>
line 12 column 4 - Warning: inserting implicit <i>
line 14 column 2 - Warning: missing </i> before <p>
line 14 column 4 - Warning: inserting implicit <i>
line 14 column 43 - Warning: discarding unexpected <a>
line 16 column 2 - Warning: missing </a> before <li>
line 16 column 2 - Warning: missing </i> before <li>
line 16 column 2 - Warning: inserting implicit <ul>
line 24 column 1 - Warning: unknown attribute "tidy"
line 30 column 1 - Warning: <img> lacks "alt" attribute

"exam/bad.html" appears to be HTML proprietary
15 warnings/errors were found!

The alt attribute should be used to give a short description
of an image; longer descriptions should be given with the
longdesc attribute which takes a URL linked to the description.
These measures are needed for people using non-graphical browsers.

For further advice on how to make your pages accessible
see "http://www.w3.org/WAI/GL". You may also want to try
"http://www.cast.org/bobby/" which is a free Web-based
service for checking URLs for accessibility.

HTML & CSS specifications are available from http://www.w3.org/
To learn more about Tidy see http://www.w3.org/People/Raggett/tidy/
Please send bug reports to Dave Raggett care of <html-tidy@w3.org>
Lobby your company to join W3C, see http://www.w3.org/Consortium

Invoking and Using HTML Tidy

> tidy  [[options] files]*

tidy: file1 file2 ...
Utility to clean up & pretty print html files
see http://www.w3.org/People/Raggett/tidy/

options for tidy released on 19th October 1999
  -config <file>  set options from config file
  -indent or -i   indent element content
  -omit   or -o   omit optional endtags
  -wrap 72        wrap text at column 72 (default is 68)
  -upper  or -u   force tags to upper case (default is lower)
  -clean  or -c   replace font, nobr & center tags by CSS
  -raw            leave chars > 128 unchanged upon output
  -ascii          use ASCII for output, Latin-1 for input
  -latin1         use Latin-1 for both input and output
  -iso2022        use ISO2022 for both input and output
  -utf8           use UTF-8 for both input and output
  -mac            use the Apple MacRoman character set
  -numeric or -n  output numeric rather than named entities
  -modify or -m   to modify original files
  -errors or -e   only show errors
  -quiet or -q    suppress nonessential output
  -f <file>       write errors to <file>
  -xml            use this when input is wellformed xml
  -asxml          to convert html to wellformed xml
  -slides         to burst into slides on h2 elements
  -help   or -h   list command line options
Input/Output default to stdin/stdout respectively
Single letter options apart from -f may be combined
as in:  tidy -f errs.txt -imu foo.html
For further info on HTML see http://www.w3.org/MarkUp
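
When Tidy is called from a program, the command line can be assembled from the options listed above. The following is only a sketch: it uses four of the flags from the help text (-f, -indent, -asxml, -wrap), the function name and its parameters are made up here, and it builds the argument list without running Tidy.

```python
import shlex


def tidy_command(files, indent=False, asxml=False, wrap=None, errors_file=None):
    """Build an argument list for the tidy command line shown above."""
    cmd = ["tidy"]
    if errors_file is not None:
        cmd += ["-f", errors_file]   # -f <file>: write errors to <file>
    if indent:
        cmd.append("-indent")        # indent element content
    if asxml:
        cmd.append("-asxml")         # convert html to wellformed xml
    if wrap is not None:
        cmd += ["-wrap", str(wrap)]  # wrap text at the given column
    cmd += list(files)
    return cmd


if __name__ == "__main__":
    cmd = tidy_command(["foo.html"], indent=True, asxml=True, errors_file="errs.txt")
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

If Tidy is installed, the list can be handed to subprocess.run; as the help text says, the cleaned markup then goes to stdout.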

Some Important Options

markup: yes, no: Generate the corrected markup.
wrap: number: Wrap lines at the given column. 0 = disabled.
input-xml: yes, no: Read the input as XML.
output-xml: yes, no: Write XML.
output-xhtml: yes, no: Write XHTML.
doctype: omit, auto, strict, loose or <fpi>: Sets the DOCTYPE in the output.
char-encoding: raw, ascii, latin1, utf8 or iso2022: Sets the character encoding of the output.
fix-backslash: yes, no: Converts "\" in URLs to "/".
word-2000: yes, no: Tries to remove the cruft produced by Word 2000.
clean: yes, no: Tries to replace superfluous presentational markup with style rules (CSS) or structural markup.
logical-emphasis: yes, no: Replaces i with em and b with strong; implies clean.
enclose-text: yes, no: Wraps text at body level in paragraphs. Important for style sheets to work.
split: yes, no: Splits the file at h2 elements into individual "slides".
new-empty-tags: tag1, tag2, tag3
new-inline-tags: tag1, tag2, tag3
new-blocklevel-tags: tag1, tag2, tag3
new-pre-tags: tag1, tag2, tag3: Define new tags of the respective kind.

Example of a Config File

/* HTML Tidy configuration file */
markup: yes
wrap: 0
doctype: strict
break-before-br: yes
logical-emphasis: yes
enclose-text: yes
/* eof */
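
The "name: value" layout of such a file is simple enough to read with a few lines of code, for example to check which options a project uses. This is a sketch under assumptions: it treats lines starting with "/*", "//" or "#" as comments and does not reproduce Tidy's own parser; the function name is made up here.

```python
SAMPLE_CONFIG = """/* HTML Tidy configuration file */
markup: yes
wrap: 0
doctype: strict
break-before-br: yes
logical-emphasis: yes
enclose-text: yes
/* eof */
"""


def parse_tidy_config(text):
    """Return the options of a Tidy-style config text as a dict of strings."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines (comment syntax is an assumption).
        if not line or line.startswith(("/*", "//", "#")):
            continue
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a "name: value" line
        options[name.strip()] = value.strip()
    return options


if __name__ == "__main__":
    for name, value in parse_tidy_config(SAMPLE_CONFIG).items():
        print(name, "=", value)
```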

Work in Progress

validation of all attributes
improved XML support
improved character encoding support
improved ASP and PHP support
improved slide generation

XHTML Basic 1.0

Reduction of XHTML 1.0 to the elements and attributes that can also be rendered on small devices, for example:

mobile phones
televisions
PDAs
vending machines
pagers
car navigation systems
game consoles
digital book readers
smart watches

The common capabilities of these simple UAs allow the following XHTML features:

basic text (including headings, paragraphs and lists)
hyperlinks (a and link)
basic forms
basic tables
images
meta information

Elements and attributes that only work on current graphical (PC) systems are omitted:

no embedded style sheets (CSS)
no scripts (JavaScript) and no associated attributes
only simple fonts (monospaced)
no file upload and no images in forms
no nested tables
no frames

The document type is

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML Basic 1.0//EN"
    "http://www.w3.org/TR/xhtml-basic/xhtml-basic10.dtd">

Defined Modules

Structure Module*: body, head, html, title
Text Module*: abbr, acronym, address, blockquote, br, cite, code, dfn, div, em, h1, h2, h3, h4, h5, h6, kbd, p, pre, q, samp, span, strong, var
Hypertext Module*: a
List Module*: dl, dt, dd, ol, ul, li
Basic Forms Module: form, input, label, select, option, textarea
Basic Tables Module: caption, table, td, th, tr
Image Module: img
Object Module: object, param
Metainformation Module: meta
Link Module: link
Base Module: base

(*) = these modules are the minimum that must be supported for XHTML Basic 1.0.

XHTML Basic 1.0 conforming example
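
The module list can be turned into a simple check for elements that fall outside XHTML Basic. The following is a sketch, assuming Python's standard xml.etree.ElementTree and the namespace URI of the final XHTML 1.0 Recommendation; the function name and the sample page are made up here. It only compares element names and does not replace validation against the DTD.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Element names copied from the module list above.
XHTML_BASIC_ELEMENTS = set("""
body head html title
abbr acronym address blockquote br cite code dfn div em
h1 h2 h3 h4 h5 h6 kbd p pre q samp span strong var
a
dl dt dd ol ul li
form input label select option textarea
caption table td th tr
img
object param
meta
link
base
""".split())

# Hypothetical page that uses two elements outside XHTML Basic.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>t</title><script type="text/javascript"></script></head>
<body><p>Some <b>bold</b> text</p></body></html>"""


def elements_outside_basic(xhtml_text):
    """Return the sorted names of elements not listed for XHTML Basic."""
    prefix = "{%s}" % XHTML_NS
    outside = set()
    for element in ET.fromstring(xhtml_text).iter():
        tag = element.tag
        # Drop the XHTML namespace; foreign namespaces stay visible.
        name = tag[len(prefix):] if tag.startswith(prefix) else tag
        if name not in XHTML_BASIC_ELEMENTS:
            outside.add(name)
    return sorted(outside)


if __name__ == "__main__":
    print(elements_outside_basic(PAGE))  # ['b', 'script']
```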

XHTML 1.1 - Module-based XHTML

Reorganization of XHTML 1.0 Strict by means of modules. The document type is

<!DOCTYPE
 html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
 "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

Defined Modules

Structure Module*: body, head, html, title
Text Module*: abbr, acronym, address, blockquote, br, cite, code, dfn, div, em, h1, h2, h3, h4, h5, h6, kbd, p, pre, q, samp, span, strong, var
Hypertext Module*: a
List Module*: dl, dt, dd, ol, ul, li
Object Module: object, param
Presentation Module: b, big, hr, i, small, sub, sup, tt
Edit Module: del, ins
Bidirectional Text Module: bdo
Forms Module: button, fieldset, form, input, label, legend, select, optgroup, option, textarea
Table Module: caption, col, colgroup, table, tbody, td, tfoot, th, thead, tr
Image Module: img
Client-side Image Map Module: area, map
Server-side Image Map Module: ismap attribute of img
Intrinsic Events Module: event attributes
Metainformation Module: meta
Scripting Module: noscript, script
Stylesheet Module: style element
Style Attribute Module (deprecated): style attribute
Link Module: link
Base Module: base
Ruby Annotation Module: ruby, rbc, rtc, rb, rt, rp

Further changes compared to XHTML 1.0 Strict: lang is replaced by xml:lang, and name is replaced by id.

XHTML 1.1 conforming example
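
The two attribute changes can be applied mechanically when migrating documents. The following is a sketch, assuming Python's standard xml.etree.ElementTree; the function names are made up here, and the restriction of the name change to the a and map elements reflects how XHTML 1.1 describes it, since name keeps its meaning on elements such as input and meta.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# ElementTree spells xml:lang with the full XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def local_name(tag):
    """Strip a "{namespace}" prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def to_xhtml11_attributes(root):
    """Replace lang by xml:lang, and name by id on a and map elements."""
    for element in root.iter():
        if "lang" in element.attrib:
            lang = element.attrib.pop("lang")
            element.attrib.setdefault(XML_LANG, lang)
        if local_name(element.tag) in ("a", "map") and "name" in element.attrib:
            name = element.attrib.pop("name")
            element.attrib.setdefault("id", name)
    return root


if __name__ == "__main__":
    root = ET.fromstring(
        '<html xmlns="http://www.w3.org/1999/xhtml" lang="en"><body>'
        '<p><a name="section1">Section 1</a></p></body></html>'
    )
    to_xhtml11_attributes(root)
    anchor = root.find(".//{%s}a" % XHTML_NS)
    print(root.get(XML_LANG), anchor.get("id"))
```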
