What is XML?

XML (eXtensible Markup Language) is a simplified dialect of SGML designed to describe hierarchical data structures on the World Wide Web. It was developed by the W3C working group in 1996; the currently accepted recommendation is the second version of XML 1.0. XML is undoubtedly one of the most promising technologies of the WWW, which explains the interest paid to it by both corporations-developers and the general public.

Before proceeding to describe it, it seems appropriate to discuss the reasons for its emergence and subsequent rapid development. To do this, let’s try to look at the problems of the WWW that must be solved by means of the new generation of Web technologies.

XML is an attempt to solve these problems by creating a simple markup language that describes arbitrary structured data. To be more precise, it is a meta-language in which specialized languages are written that describe data of a certain structure.

Such languages are called XML vocabularies. Unlike HTML, XML does not contain any instructions on how the data described in an XML document should be displayed. The way data is displayed for different devices is specified by the XSL style sheet language, which plays a similar role for XML as CSS does for HTML.

Another fundamental difference between it and HTML is that XML can contain any tags that the creators of an XML vocabulary deem necessary.

Here is a list of just a few specialized XML-based languages that are currently in various stages of development by W3C working groups: MathML is a language for mathematical formulas; SMIL is a language for integrating and synchronizing multimedia; SVG is a language for two-dimensional vector graphics; RDF is a language for meta-descriptions of resources; XHTML is a reformulation of HTML in terms of XML.

The process of processing an XML document is as follows. Its text is analyzed by a special program called an XML processor. An XML processor does not know anything about the semantics of the data in the document; it only performs parsing of the document text and checks its correctness in terms of XML rules. If the document is well-formed, the XML processor passes the parsing results to the application program that performs meaningful processing; if the document is incorrectly formatted, i.e. contains syntax errors, the XML processor must inform the user about them. HTML does not express the content of documents.

The HTML language was created to describe the structure of documents (titles, headings, lists, paragraphs, etc.) and, to some extent, the rules for their display (bold, italic, etc.). It is in no way intended to describe the meaning of the documents written on it, and in many cases it is the data that makes up the body of the document, whether it is a stock exchange report or a scientific publication.

That’s why there was a need for a language to describe data, and data organized in hierarchical structures. HTML is cumbersome and inflexible. In recent years, HTML has turned into an accumulation of tags that often duplicate each other and do not make the text of the document clear.

If you add to this the non-standard HTML extensions that all browser developers are guilty of, then creating small, complex HTML documents becomes a serious task. On the other hand, a set of tags that has been fixed once and for all is often not flexible enough to express the content we need.

The concept of a Web browser is too limited. With the advent of Java applets, scripting languages, and ActiveX elements, Web browsers are no longer simple “visitors” to HTML documents; today they are more like programs that run specific applications.

Nevertheless, the concept of a browser imposes unnecessary restrictions on the user; in many cases, we need Web-oriented programs, i.e. programs that can read specialized information from Web sites and present it to us in a familiar form, such as spreadsheets.

Suppose I need all the texts of books available on the Web. Trying to search by author’s name will result in a list of all links with that name, including memoirs about Dovlatova, reviews of her books, etc. It would be much more convenient to use a special tag to indicate what I am looking for. It is impossible to find interrelated resources. Let’s assume now that I did find several stories by Dovlatov that clearly constitute a single collection. It’s good if they contain links to the table of contents, but often they don’t. Therefore, you need a way to indicate that this group of pages constitutes a single resource and should be handled accordingly.