|
One of PHP's most glaring weaknesses is its poor support for character sets other than ASCII.
As long as a string is ASCII text, no problems occur. However, the standard PHP string functions
offer no simple and transparent way to properly handle multi-byte encodings such as Big5,
which is used for Chinese text.
This is especially a problem for XML parsers like DOMIT!, since the XML specification
requires them to be Unicode-compliant.
One way of resolving this is to use the mbstring extension.
It will be integrated with DOMIT! in the near future,
but the one disadvantage of this approach is that mbstring must be installed separately from the base
PHP distribution, which may not be possible with some web hosting providers.
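To illustrate the underlying problem, here is a minimal sketch comparing the byte-oriented strlen function with mb_strlen from the mbstring extension. The sample string and the extension_loaded check are illustrative only and are not part of DOMIT!.

```php
<?php
// "café" in UTF-8: the final "é" occupies two bytes (0xC3 0xA9).
$text = "caf\xC3\xA9";

// strlen() counts bytes, so the accented character is counted twice.
$byteLength = strlen($text); // 5

// mb_strlen() counts characters, but is only available if mbstring is installed.
if (extension_loaded('mbstring')) {
    $charLength = mb_strlen($text, 'UTF-8'); // 4
} else {
    $charLength = null; // mbstring is not available on this host
}
```

Checking for the extension first lets a script fall back gracefully on hosts where mbstring cannot be installed.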
The loadXML_utf8, parseXML_utf8,
and saveXML_utf8 methods were originally intended to get around some of these
problems, but they have not proved to be very effective and are currently not recommended for use.
A partial solution to the problem of multi-byte encodings was introduced in DOMIT! 0.95:
the appendEntityTranslationTable method.
This method allows you to encode non-ASCII characters as XML entities in your XML document,
and to provide a translation table at parse time that converts these entities into
their equivalent characters. It assumes that all non-ASCII characters in the document being loaded have already been converted into entities.
For example, the following code provides an (abridged) entity translation table for accented French characters:
$xmldoc =& new DOMIT_Document();
$xmldoc->appendEntityTranslationTable(array('é' => '&eacute;'));
$xmldoc->loadXML('francais.xml', true);
When the document is saved using the saveXML method, all characters in the translation table are automatically
converted back into entities.
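The round trip described above can be sketched in plain PHP, independently of DOMIT!. This only illustrates the idea of a translation table and is not DOMIT!'s actual implementation; the table contents, the variable names, and the use of strtr and array_flip are assumptions made for the example.

```php
<?php
// Hypothetical table mapping each character (as UTF-8 bytes) to its entity.
$charToEntity = array(
    "\xC3\xA9" => '&eacute;', // é
    "\xC3\xA8" => '&egrave;', // è
);

// Inverse table, used when reading: entity => character.
$entityToChar = array_flip($charToEntity);

// Text as it would be stored in the XML source: ASCII only.
$stored = 'caf&eacute; cr&egrave;me';

// "Parse time": replace each entity with the character it stands for.
$decoded = strtr($stored, $entityToChar);

// "Save time": convert the characters back into entities.
$encoded = strtr($decoded, $charToEntity);
```

Because the stored document contains only ASCII, it survives byte-oriented string functions unchanged; the non-ASCII characters exist only in the decoded, in-memory text.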
|