Handling Special Characters and Encodings

One of PHP's most glaring weaknesses is its poor support for character sets other than ASCII. As long as a string contains only ASCII text, no problems occur. However, the standard PHP string functions operate on bytes rather than characters, so they offer no simple and transparent way to properly handle multi-byte encodings such as Big5, which is used for Chinese text.
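As a small illustration of the problem, the sketch below applies two standard string functions to a UTF-8 string. The sample word and variable names are only illustrative, and the accented character is written as explicit bytes so the result does not depend on how the file itself is saved.

```php
<?php
// "café" in UTF-8: the "é" occupies two bytes (0xC3 0xA9).
$word = "caf\xC3\xA9";

// strlen() counts bytes, not characters, so it reports 5 for a 4-character word.
$byteLength = strlen($word);

// substr() also works on bytes: asking for 4 "characters" cuts the "é" in half
// and leaves a stray 0xC3 byte at the end of the result.
$truncated = substr($word, 0, 4);
```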

This is especially a problem for XML parsers like DOMIT!, since the XML specification requires them to be Unicode compliant.

One way of resolving this is to use the mbstring extension. Support for it will be integrated with DOMIT! in the near future, but the one disadvantage of such an approach is that mbstring is not part of the base PHP distribution and must be installed separately, which may not be possible with some web hosting providers.
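The following sketch shows what the mbstring functions offer and how a script might cope when the extension is missing. It is not part of DOMIT!; the variable names are illustrative, and the fallback branch simply reverts to the byte-oriented functions.

```php
<?php
// "café" in UTF-8, with the "é" written as explicit bytes.
$word = "caf\xC3\xA9";

if (function_exists('mb_strlen')) {
    // mbstring is available: these calls count and slice by character.
    $charLength = mb_strlen($word, 'UTF-8');
    $firstFour  = mb_substr($word, 0, 4, 'UTF-8');
} else {
    // mbstring is not installed on this host: fall back to byte-oriented
    // functions, which give wrong results for multi-byte text.
    $charLength = strlen($word);
    $firstFour  = substr($word, 0, 4);
}
```

Checking with function_exists keeps the script from failing outright on hosts where the extension cannot be installed.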

The loadXML_utf8, parseXML_utf8, and saveXML_utf8 methods were originally intended to get around some of these problems, but they have not proved to be very effective and are currently not recommended for use.

A partial solution to the problem of multi-byte encodings was introduced in DOMIT! 0.95: the appendEntityTranslationTable method.

The appendEntityTranslationTable method allows you to encode non-ASCII characters as XML entities in your XML document, and to provide a translation table at parse time that converts these entities into their equivalent characters. It assumes that all non-ASCII characters in the document being loaded have already been converted into entities.
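One way to prepare such a document is sketched below. The helper latin1ToNumericEntities is hypothetical and not part of DOMIT!. It assumes the source text is ISO-8859-1, where each byte value equals the character's code point, and it emits numeric character references; each reference used would presumably still need an entry in the translation table.

```php
<?php
// Hypothetical helper, not a DOMIT! function.
// Assumes $text is ISO-8859-1, so a byte value is also the code point.
function latin1ToNumericEntities($text) {
    $result = '';
    $length = strlen($text);
    for ($i = 0; $i < $length; $i++) {
        $code = ord($text[$i]);
        // Bytes above 127 are non-ASCII: write them as &#NNN; references.
        $result .= ($code > 127) ? '&#' . $code . ';' : $text[$i];
    }
    return $result;
}

// 0xE7 is "ç" in ISO-8859-1.
$encoded = latin1ToNumericEntities("fran\xE7ais");
```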

For example, the following code provides an (abridged) entity translation table for accented French characters:

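$xmldoc =& new DOMIT_Document();
// Pair each accented character with its entity (assumed orientation: character => entity)
$xmldoc->appendEntityTranslationTable(array('é' => '&eacute;'));
// The second argument selects the SAXY parser
$xmldoc->loadXML('francais.xml', true);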

When the document is saved using the saveXML method, all characters in the translation table are automatically converted back into entities.
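The round trip can be pictured with plain PHP string functions. This is only a sketch of the idea, not DOMIT!'s actual implementation; the table orientation (character => entity) and the sample text are assumptions.

```php
<?php
// Assumed orientation: character => entity. Characters are UTF-8 bytes.
$table = array(
    "\xC3\xA9" => '&eacute;', // é
    "\xC3\xA8" => '&egrave;', // è
);

// Text as it would appear in the stored XML file.
$stored = 'caf&eacute; cr&egrave;me';

// At parse time: flip the table so entities map back to characters.
$parsed = strtr($stored, array_flip($table));

// At save time: apply the table as-is to restore the entities.
$saved = strtr($parsed, $table);
```

Because the same table drives both directions, the saved text matches the stored text as long as every non-ASCII character has an entry.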


Documentation generated by ClassyDoc, using the DOMIT! and SAXY parsers.
Please visit Engage Interactive to download free copies.