
Character Encoding

It's important to understand how the characters we write in our PHP scripts are dealt with by PHP itself, and by the browser.

Writing the Locales

PHP doesn't care much about character encoding: it reads the strings in our scripts as plain bytes and passes them to the browser unchanged. Normally it is up to the browser to interpret the characters we send. That is why the character set we use does matter at times, and why we must watch out for pitfalls if we're writing, say, a Greek translation.

Making the Browser Understand the Language

The browser acts on a number of triggers when trying to figure out document encoding:

  • The browser will use the character encoding that the server specified in the HTTP charset parameter (as part of a Content-Type field)

  • If such a field isn't present, it will look for a <meta> declaration with http-equiv set to Content-Type and a charset value in its content attribute

  • If that fails too, it might try to guess the encoding, but don't rely on this
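The lookup order in this list can be sketched in a few lines. This is only a rough model of what a browser does, not its real algorithm; the function name declaredCharset() and both regular expressions are assumptions made for the illustration:

```php
<?php
// Hypothetical helper modelling the lookup order described above.
// $content_type is the HTTP Content-Type value (null if the header is
// absent); $html is the beginning of the document.
function declaredCharset($content_type, $html)
{
    // 1. The charset parameter of the HTTP Content-Type header wins.
    if ($content_type !== null
        && preg_match('/charset\s*=\s*"?([^\s;"]+)/i', $content_type, $m)) {
        return $m[1];
    }
    // 2. Otherwise look for a charset inside a <meta> declaration.
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([^\s"\'>;]+)/i', $html, $m)) {
        return $m[1];
    }
    // 3. Nothing declared: a real browser would start guessing here.
    return null;
}

$page = '<head><meta http-equiv="Content-Type" '
      . 'content="text/html; charset=ISO-8859-1"></head>';

// The HTTP header takes precedence over the <meta> declaration.
echo declaredCharset('text/html; charset=ISO-8859-2', $page), "\n";
// Without a charset in the header, the <meta> declaration is used.
echo declaredCharset('text/html', $page), "\n";
```

Returning null in the last step makes the "don't rely on guessing" case explicit for the caller.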

Either way, we have to consider telling the browser about the character encoding very early on in the script. HTTP headers have to be sent before any other output can be sent to the browser. Likewise, we'll have to add the <meta> declaration in the <head> element of the web page.

The following web page first adds the necessary HTTP header and then also uses a <meta> declaration in the <head> tag to convince the browser that we want to use ISO-8859-2 for the Czech text on this particular page:

    <?php
    header('Content-Type: text/html; charset=ISO-8859-2');
    ?>
    <html>
      <head>
        <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-2">
        <title>Test Czech Encoding</title>
      </head>
      <body>

         <h1>Bohu¾el!</h1>

         <p>Pro místo 'Aalborg, Denmark' nejsou k disposici ¾ádná data</p>

      </body>
    </html>

The English translation of the text is "Sorry! There's no data for ‘Aalborg, Denmark’." and is taken from a script that produces weather reports.

Some parts of the listing look very strange: instead of a z topped by a wedge (ž), there is a three-quarter sign (¾). This stresses the point that it is the browser, and not the editor or PHP, that has to understand the characters.
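The mix-up can be reproduced with a short sketch. It assumes the iconv extension is available, and the helper name decodeByte() is made up for this illustration. The same byte, 0xBE, is decoded once as ISO-8859-2 and once as ISO-8859-1:

```php
<?php
// Hypothetical helper: interpret a single byte in the given 8-bit
// character set and return the resulting character as UTF-8.
function decodeByte($byte, $charset)
{
    return iconv($charset, 'UTF-8', $byte);
}

$byte = "\xBE"; // the byte stored for the z with a wedge in the Czech text

echo decodeByte($byte, 'ISO-8859-2'), "\n"; // z with a wedge, U+017E
echo decodeByte($byte, 'ISO-8859-1'), "\n"; // three-quarter sign, U+00BE
```

The byte itself never changes; only the character set used to read it does.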

We can see the same effect on any web page written in a language that needs a different character set. When the browser switches to another character set, the letters a-z stay the same, but the special characters may change.

Reacting to Browsers using PHP

Fortunately, we don't have to guess which language our visitors prefer. Normally the browser volunteers this information, starting with its first request to the server, in the Accept-Language header.

Do not assume the browser will complain about a German document just because it doesn't list German as one of its preferred languages. The header is just a hint for the web server.

Note 

A site that uses this header extensively is http://www.debian.org/. If you have your browser set up correctly, then the pages should automatically appear in your native language (provided that the page has been translated into your particular language, of course).

In PHP, the Accept-Language header is available as $_SERVER['HTTP_ACCEPT_LANGUAGE']. Older installations with register_globals enabled also expose it as the global variable $HTTP_ACCEPT_LANGUAGE. This is a string that lists the accepted languages, normally with the highest-priority language first. By feeding this string to explode(), we get an array to work with instead.
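Browsers often attach quality values to the entries, as in "da, en-gb;q=0.8, en;q=0.7", so a plain explode() can leave fragments such as "en;q=0.7" in the array. The following is a minimal sketch of a more tolerant parser. The function name parseAcceptLanguage() and the sample header value are assumptions for the illustration, and the parsing is deliberately simplified:

```php
<?php
// Hypothetical helper: turn an Accept-Language value into a list of
// lower-case language tags, highest preference first.
function parseAcceptLanguage($header)
{
    $weights = array();
    foreach (explode(',', $header) as $position => $item) {
        $parts = explode(';', trim($item));
        $tag = strtolower(trim($parts[0]));
        if ($tag === '') {
            continue; // skip empty entries, e.g. from an empty header
        }
        $q = 1.0; // a tag without a q parameter counts as q=1
        for ($i = 1; $i < count($parts); $i++) {
            $param = trim($parts[$i]);
            if (strpos($param, 'q=') === 0) {
                $q = (float) substr($param, 2);
            }
        }
        $weights[] = array($tag, $q, $position);
    }
    // Sort by weight, descending; ties keep the order of the header.
    usort($weights, function ($a, $b) {
        if ($a[1] == $b[1]) {
            return $a[2] - $b[2];
        }
        return ($a[1] > $b[1]) ? -1 : 1;
    });
    $tags = array();
    foreach ($weights as $entry) {
        $tags[] = $entry[0];
    }
    return $tags;
}

print_r(parseAcceptLanguage('da, en-gb;q=0.8, en;q=0.7'));
```

The result is an array of bare language tags, which can be compared against a list of available languages with in_array().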

Here's a function that takes an array of available languages, and then returns the best of those, with regard to $HTTP_ACCEPT_LANGUAGE. The array of available languages should be ordered, so that the first one is the default, in case $HTTP_ACCEPT_LANGUAGE doesn't match any of the available languages:

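    function getBestLanguage($avail_lang)
    {
        // $_SERVER replaces the old $HTTP_ACCEPT_LANGUAGE global, which
        // depended on register_globals (removed in PHP 5.4).
        $header = isset($_SERVER['HTTP_ACCEPT_LANGUAGE'])
            ? $_SERVER['HTTP_ACCEPT_LANGUAGE'] : '';
        $accept_lang = explode(', ', $header);

        // each() was removed in PHP 8; foreach visits the same elements.
        foreach ($accept_lang as $lang) {
            if (in_array($lang, $avail_lang)) {
                return $lang;
            }
        }
        // No match: fall back to the first (default) available language.
        return reset($avail_lang);
    }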

By using this function, it's easy to show our page in the right language the first time the user sees it. We should, of course, make sure that the user gets a chance to change the language if desired. It's rather annoying when all web pages are suddenly displayed in Czech, and you don't know how to get back to English.
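One way to give the user that choice is to let an explicit request win over the browser's hint. The sketch below assumes a query parameter named lang and a helper called chooseLanguage(); both names are hypothetical and not part of the application in this chapter:

```php
<?php
// Hypothetical helper: an explicit user choice wins over the browser
// hint, and anything we have no translation for falls back to the
// first (default) available language.
function chooseLanguage($requested, $browser_best, $avail_lang)
{
    if ($requested !== null && in_array($requested, $avail_lang, true)) {
        return $requested;
    }
    if (in_array($browser_best, $avail_lang, true)) {
        return $browser_best;
    }
    return reset($avail_lang);
}

$avail_lang = array('en', 'da', 'pl');

// 'lang' is an assumed parameter name, as in page.php?lang=en
$requested = isset($_GET['lang']) ? $_GET['lang'] : null;

// 'da' stands in for the language derived from Accept-Language.
echo chooseLanguage($requested, 'da', $avail_lang), "\n";
```

Checking the requested value against the list of available languages also keeps arbitrary user input out of the rest of the script.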

We now have a way to determine the user's preferred language. If our site supports that language, then the locale should provide us with information about the character set the page should display.

We can extend our little file system browser by applying what we've just learned:

    <?php
    class App
    {
        var $output;
        var $avail_lang;

The constructor initializes the array of available languages, and chooses the best one:

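        // Named __construct() because PHP 8 no longer treats a method
        // named after the class as the constructor.
        function __construct()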
        {
            $this->avail_lang = array('en', 'da', 'pl');
            $this->setLanguage($this->getBestLanguage());
        }

Based on the array of available languages ($avail_lang), and the browser's Accept-Language header, this function will return the language code for the best language:

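        function getBestLanguage()
        {
            // $_SERVER replaces the old $HTTP_ACCEPT_LANGUAGE global, which
            // depended on register_globals (removed in PHP 5.4).
            $header = isset($_SERVER['HTTP_ACCEPT_LANGUAGE'])
                ? $_SERVER['HTTP_ACCEPT_LANGUAGE'] : '';
            $accept_lang = explode(', ', $header);

            // each() was removed in PHP 8; foreach visits the same elements.
            foreach ($accept_lang as $lang) {
                if (in_array($lang, $this->avail_lang)) {
                    return $lang;
                }
            }
            // No match: fall back to the first (default) language.
            return reset($this->avail_lang);
        }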

setLanguage() switches to another language by constructing a new output object:

        function setLanguage($new_language = '')
        {
            switch ($new_language) {
            case 'en':
                $this->output = new English_Output();
                break;

            case 'da':
                $this->output = new Danish_Output();
                break;

            case 'pl':
                $this->output = new Polish_Output();
                break;

            default:
                $this->setLanguage($this->getBestLanguage());
                break;
            }
        }
    }

Here's our base class for all output classes:

    class Basic_Output
    {
        var $strings;

        function _($string)
        {
            if (isset($this->strings[$string])) {
                return $this->strings[$string];
            } else {
                return $string;
            }
        }
        function gettext($string)
        {
            return $this->_($string);
        }

A generic outNumFiles() function that we've seen before:

        function outNumFiles($count)
        {
            if ($count == 0) {
                return $this->gettext("No files.");
            } elseif ($count == 1) {
                return $this->gettext("1 file.");
            } else {
                return sprintf($this->gettext("%s files."), $count);
            }
        }

getCharset() returns the character set for the translation:

        function getCharset()
        {
            return $this->strings['charset'];
        }
    }

    class English_Output extends Basic_Output
    {

The constructor initializes the $strings array with the correct character set. The other strings are in Basic_Output:

        // __construct() rather than English_Output(), so that the
        // constructor also runs under PHP 8.
        function __construct()
        {
             $this->strings = array(
                 'charset' => 'ISO-8859-1');
        }
    }

    class Danish_Output extends Basic_Output
    {
        function __construct()
        {
            $this->strings = array(
                'No files.' => 'Ingen filer.',
                '1 file.' => '1 fil.',
                '%s files.' => '%s filer.',
                'charset' => 'ISO-8859-1');
        }
    }
    class Polish_Output extends Basic_Output
    {
        function __construct()
        {
            $this->strings = array(
                'charset' => 'ISO-8859-2');
        }

        function outNumFiles($count)
        {
            // Polish uses "pliki" for counts ending in 2-4, except for
            // those ending in 12-14 (so 112 takes "plików", 122 "pliki").
            $last_digit = $count % 10;
            $last_two = $count % 100;

            if ($count == 0) {
                return "Nie ma plików.";
            } elseif ($count == 1) {
                return "1 plik.";
            } elseif ($last_digit >= 2 && $last_digit <= 4
                      && ($last_two < 12 || $last_two > 14)) {
                return "$count pliki.";
            } else {
                return "$count plików.";
            }
        }
    }

Finally we create a new App object and send the appropriate header, before constructing the HTML:

    $obj = new App();

    header('Content-Type: text/html; charset=' . $obj->output->getCharset());
    ?>

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
                    "http://www.w3.org/TR/REC-html40/loose.dtd">
    <html>
      <head>
        <meta http-equiv="Content-Type" content="text/html; charset=<?php
        echo($obj->output->getCharset()); ?>">
        <title>My App</title>
      </head>
      <body>

      <?php
      echo("<p>" . $obj->output->outNumFiles(7) . "</p>\n");
      ?>
      </body>
    </html>
