Globalization and Locales

So far, we have output numbers with the echo or print functions and let those functions choose how to format each variable. However, this default output can appear raw or unprocessed. It takes more than a glance to understand how big the number 286196812 is (roughly the population of the United States in the early 2000s). Similarly, seeing the current date and time written as 1106175862 (the output of the time function in PHP) is a little distressing.

It would be much better to format these numbers in the same way we would expect to see them in a book or newspaper. Therefore, the American population would be formatted as 286,196,812; and the current date would be formatted as 2005-Jan-19 15:01:58. A complication to this format is that users in France expect to see large numbers written as 1 234 567,89; while users in Italy expect to see them as 1.234.567,89. We would like a system to handle all of these possibilities and give us the chance to change them in our code.
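As a quick preview, PHP's built-in number_format and gmdate functions can already produce these formats when we supply the separators by hand. The following is only a sketch: the separators are hard-coded, which is exactly the work that locales are meant to take off our hands, and the timestamp is rendered in UTC rather than in the server's local time.

```php
<?php

// Separators are hard-coded for illustration only; the point of
// locales is to avoid writing them into our code like this.
$population = 286196812;
$amount     = 1234567.89;

// American style: comma for thousands (number_format's default).
$us = number_format($population);            // 286,196,812

// French style: space for thousands, comma for decimals.
$fr = number_format($amount, 2, ',', ' ');   // 1 234 567,89

// Italian style: period for thousands, comma for decimals.
$it = number_format($amount, 2, ',', '.');   // 1.234.567,89

// A Unix timestamp rendered in UTC as year-month-day and time.
$when = gmdate('Y-M-d H:i:s', 1106175862);

echo $us, "\n", $fr, "\n", $it, "\n", $when, "\n";
```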

This chapter mostly concerns itself with the topic of globalization: the art of making our application able to handle input and users from different countries and cultures. The art of making our application run in multiple languages is called localization. Before we can look at the available functions for formatting numeric data and write our own, we must spend some time examining the concept of locales and learn how our operating system understands them.

Locales and Their Properties

One of the primary concepts we need to understand for this formatting discussion is the locale. All computers and operating systems operate with a basic concept of their location that helps them determine how they display information to the user. Once they are told that the user wishes to see things as they are seen in Italy, the computer knows to show numbers as 1.234,56; times as 20:52:16; dates as dd/mm/yy; and monetary values as €1.234,56. On the other hand, American users will want to see 1,234.56; 8:52:16 PM; mm/dd/yy; and $1,234.56 respectively.

One problem we have when writing web applications is that the server can run in a different locale than our client. If our server runs an American English version of the operating system, it defaults to processing all information in that locale. However, users browsing from Hong Kong hope to see information presented in their way.

It should be noted that many web applications do not bother to deal with these issues; therefore, international users are probably used to seeing information presented in American English. However, in the interest of writing the highest-quality application, we will do our best to be fully globalized.

Therefore, we must solve two problems when writing our web applications: determining which locale the user is visiting from, and setting the locale of our application to match.

Learning the User's Locale

We cannot always determine which locale settings the user is browsing our web application with. However, there are a couple of key clues we can use to make an informed decision:

  • We can look at the Accept-Language: header if it is sent with the HTTP request. This is available in the $_SERVER superglobal array and is identified by the key HTTP_ACCEPT_LANGUAGE.

  • If we are truly enterprising, we can look at the IP address from which the user is visiting (through $_SERVER['REMOTE_ADDR']) and then determine which Internet Service Provider has this IP address and in what country the ISP is located. While this is a reasonably advanced subject that is beyond the scope of this book, you should be aware that it is used in web applications.

The content of the Accept-Language: header is often in this format:

en-us,en;q=0.5

This basically says that the browser prefers U.S. English output (en-us) overall, and other versions of English otherwise (en). The q=0.5 is a quality factor; the value 0.5 indicates that we are only half as keen on any English (en) as we are on en-us. A language without a quality factor (such as the preceding en-us) is assumed to have a value of 1. Different language entries are separated by commas, and the quality factor is always separated from the language entry with a semicolon.

To parse this, we first need to split the header into its language entries and then extract the quality factors. The first step is a call to the explode function. This function is not multi-byte safe, but that is acceptable here because HTTP headers are transmitted in the ISO-8859-1 character set:

$langs = explode(',', $_SERVER['HTTP_ACCEPT_LANGUAGE']);

To extract the quality factors, we need to split the strings around any semicolon boundaries. We will write a function to create an array of arrays that each contains a language code and a quality factor:

<?php

function generate_languages()
{
  // split apart language entries
  $rawlangs = explode(',', $_SERVER['HTTP_ACCEPT_LANGUAGE']);

  // initialize output array
  $langs = array();

  // for each entry, see if there's a q-factor
  foreach ($rawlangs as $rawlang)
  {
    $parts = explode(';', $rawlang);
    if (count($parts) == 1)
      $qual = 1;                        // no q-factor
    else
    {
      $qual = explode('=', $parts[1]);
      if (count($qual) == 2)
        $qual = (float)$qual[1];        // q-factor
      else
        $qual = 1;                      // ill-formed q-f
    }

    // create an array for this entry
    $langs[] = array('lang' => trim($parts[0]), 'q' => $qual);
  }

  // sort the entries
  usort($langs, 'compare_quality');
  return $langs;
}

// this function sorts by q-factors, putting highest first.
function compare_quality($in_a, $in_b)
{
  // quality is at key 'q'
  if ($in_a['q'] > $in_b['q'])
    return -1;
  else if ($in_a['q'] < $in_b['q'])
    return 1;
  else
    return 0;
}

?>

We can parse the preceding Accept-Language: header value with the generate_languages function and obtain the following output:

Array (
   [0] => Array ( 
       [lang] => en-us 
       [q] => 1
   )
   [1] => Array (
       [lang] => en 
       [q] => 0.5
   )
) 
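Once the entries are sorted, a typical next step is to pick the first one that the application actually supports. The sketch below assumes a hypothetical helper named pick_supported_language and a hand-written list of supported codes; neither is part of PHP. It takes the parsed array as a parameter so that it can be tried without an HTTP request.

```php
<?php

// Hypothetical helper: return the first preferred language that the
// application supports, or $default when nothing matches.
function pick_supported_language(array $langs, array $supported, $default)
{
  foreach ($langs as $entry)
  {
    $code = strtolower($entry['lang']);

    // exact match first (for example 'en-us')
    if (in_array($code, $supported, true))
      return $code;

    // then fall back to the primary language (for example 'en')
    $parts = explode('-', $code);
    if (in_array($parts[0], $supported, true))
      return $parts[0];
  }

  return $default;
}

// Same shape as the output shown above, already sorted by q-factor.
$langs = array(
  array('lang' => 'en-us', 'q' => 1),
  array('lang' => 'en',    'q' => 0.5),
);

echo pick_supported_language($langs, array('fr', 'en'), 'fr'), "\n";  // en
```

Falling back from en-us to en means a site that only offers generic English still serves this visitor in English instead of the default language.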

Unfortunately, learning the user's locale is only half the battle. The second half is telling PHP to use a given locale, and how we do that depends on the operating system on which PHP is running.

Setting the Locale of the Current Page (Unix)

Unix systems vary in how they support locales, but a common scheme seen in some flavors of Linux and FreeBSD is to store locale information in files under /usr/share/locale, where each language has its own subdirectory. However, there is still some variation. Many Linux versions (including SuSE and Red Hat) place the language information in directories of the form en_US/ or de/, while FreeBSD places it in directories such as en_US.ISO8859-1/ or fr_FR.ISO8859-15/.

You can change the locale in the current page with the setlocale function. This function takes two parameters. The first specifies which features the locale is to be set for, while the second indicates the locale to be used (and is both operating-system specific and case-sensitive).

The first parameter will be one of the following values:

  • LC_COLLATE This sets the character collation to that of the given locale.

  • LC_CTYPE This sets the character classification for functions, such as strtoupper.

  • LC_MONETARY This sets information on how money is to be formatted.

  • LC_NUMERIC This is for decimal and thousands separators.

  • LC_TIME This sets date and time formatting information.

  • LC_ALL This sets the information for all of our preceding types to the given locale.
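Each of these values is a predefined PHP constant. As a small illustration, passing the string '0' as the locale asks setlocale to report the current setting for a category without changing it. The names printed depend entirely on the system, so no particular output is assumed here.

```php
<?php

// Passing '0' as the locale asks setlocale to report the current
// setting for a category without changing anything.
$categories = array(
  'LC_COLLATE'  => LC_COLLATE,
  'LC_CTYPE'    => LC_CTYPE,
  'LC_MONETARY' => LC_MONETARY,
  'LC_NUMERIC'  => LC_NUMERIC,
  'LC_TIME'     => LC_TIME,
);

$current = array();
foreach ($categories as $name => $category)
{
  $current[$name] = setlocale($category, '0');
  echo $name, ' => ', $current[$name], "\n";
}
```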

The setlocale function returns the name of the locale set on success, or FALSE if it fails. Our best bet for using this function is to try a series of attempts and default to leaving it unchanged when we cannot find a given locale:

<?php

// Linux version
function set_to_user_locale()
{
  $langs = generate_languages();

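  foreach ($langs as $lang)
  {
    // work on the language code itself, not the whole entry
    $name = $lang['lang'];

    // the browser sends major-sublang (e.g. en-us); Linux expects
    // major_SUBLANG (e.g. en_US), so swap the separator and
    // uppercase the sublang
    if (strlen($name) > 2)
    {
      $name = substr($name, 0, 2) . '_'
              . strtoupper(substr($name, 3, 2));
    }

    // try to set the locale.
    if (setlocale(LC_ALL, $name) !== FALSE)
      break;   // it worked!
  }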
}

?>

Unfortunately, web application authors on FreeBSD have to do some extra work to make sure the character set associated with the locale name works properly.
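One way to cope with this variation is to build several candidate names for each language and let setlocale pick the first one the system recognizes, since setlocale also accepts an array of names. The sketch below uses a hypothetical helper named locale_candidates. The character-set suffixes it appends are assumptions, so check which names your own system provides.

```php
<?php

// Hypothetical helper: turn a browser tag such as 'fr-fr' into a list
// of locale names to try, most specific first. The character-set
// suffixes are assumptions, not a definitive list.
function locale_candidates($tag)
{
  $parts = explode('-', strtolower(trim($tag)));
  $lang  = $parts[0];

  // a bare language code such as 'en' has no sublanguage to add
  if (count($parts) < 2)
    return array($lang);

  $base = $lang . '_' . strtoupper($parts[1]);
  return array($base . '.UTF-8', $base . '.ISO8859-1', $base, $lang);
}

// setlocale accepts an array and uses the first name that works,
// returning that name, or FALSE if none of them is available.
$result = setlocale(LC_ALL, locale_candidates('fr-fr'));
echo ($result === false) ? "locale left unchanged\n" : "using $result\n";
```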

Setting the Locale of the Current Page (Windows)

Servers that run Microsoft Windows operating systems also use the setlocale function to set the locale of the current page; however, they have the added complication that Windows does not use the same locale names that the browser sends. On these systems, we have no choice but to map the browser's language codes to the strings that Windows uses.

Windows' language strings are largely based on the English name of the language, though most have three-letter short forms that can also be used. Table 21-1 shows the more common values that you will encounter. You can see a list of these values by going to http://www.msdn.com and searching for "Language Strings."

Table 21-1. Language Strings in Microsoft Windows

Language   Sub-Language              Windows Language String
---------  ------------------------  -----------------------------
Chinese    Chinese                   "chinese"
Chinese    Chinese (simplified)      "chinese-simplified" or "chs"
Czech      Czech                     "csy" or "czech"
English    English (default)         "english"
English    English (United Kingdom)  "eng", "english-uk", or "uk"
English    English (United States)   "american", "american english", "american-english", "english-american", "english-us", "english-usa", "enu", "us", or "usa"
French     French (default)          "fra" or "french"
French     French (Canadian)         "frc" or "french-canadian"
German     German (default)          "deu" or "german"
German     German (Austrian)         "dea" or "german-austrian"
Icelandic  Icelandic                 "icelandic" or "isl"
Italian    Italian (default)         "ita" or "italian"
Japanese   Japanese                  "japanese" or "jpn"
Russian    Russian (default)         "rus" or "russian"
Slovak     Slovak                    "sky" or "slovak"
Spanish    Spanish (default)         "esp" or "spanish"
Spanish    Spanish (Mexican)         "esm" or "spanish-mexican"
Turkish    Turkish                   "trk" or "turkish"


For our web applications, we have to map between the language codes that the browser sends and the language strings available in Windows:

<?php

// Windows version
function set_to_user_locale()
{
  // each entry maps a set of browser codes to one Windows string
  static $langmappings = array(
    array('codes' => array('en', 'en-us', 'en_us'),
          'locale' => 'english'),
    array('codes' => array('en-gb', 'en_gb'),
          'locale' => 'english-uk'),
    array('codes' => array('fr', 'fr-fr', 'fr_fr'),
          'locale' => 'french'),
    array('codes' => array('fr_ca', 'fr-ca'),
          'locale' => 'french-canadian'),
    array('codes' => array('de', 'de-de', 'de_de'),
          'locale' => 'german'),
    // browsers send 'ja' (the language code) for Japanese, not 'jp'
    array('codes' => array('ja', 'ja-jp', 'ja_jp'),
          'locale' => 'japanese'),
    array('codes' => array('es', 'es-es', 'es_es'),
          'locale' => 'spanish'),
    // etc. -- we have skipped many for space.
  );

  // get the languages the browser wants.
  $user_langs = generate_languages();

  // start with the most likely first
  foreach ($user_langs as $user_lang)
  {
    // look through our array of mappings ...
    foreach ($langmappings as $mapping)
    {
      // ... for a code that matches what the user wants
      foreach ($mapping['codes'] as $code)
      {
        if ($code == strtolower($user_lang['lang']))
        {
          setlocale(LC_ALL, $mapping['locale']);
          return;
        }
      }
    }
  }

  // didn't find compatible locale.  just leave it
}

?>

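Unfortunately, these functions are inefficient: each call parses the header again and searches the list of mappings before it sets the locale. We should only call them on pages that actually need locale-aware output.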
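One way to trim the lookup cost is to replace the nested search with a flat array keyed by language code, so that each candidate is a single array access. This is a minimal sketch with a hypothetical helper named windows_locale_for and only a handful of entries; it is not a complete mapping.

```php
<?php

// Hypothetical flat table: browser code => Windows language string.
// Only a few entries are shown; underscores are normalized to
// hyphens before the lookup so each code is listed once.
function windows_locale_for($code)
{
  static $map = array(
    'en'    => 'english',
    'en-us' => 'english',
    'en-gb' => 'english-uk',
    'fr'    => 'french',
    'fr-fr' => 'french',
    'fr-ca' => 'french-canadian',
    'de'    => 'german',
    'de-de' => 'german',
  );

  $key = str_replace('_', '-', strtolower(trim($code)));
  return isset($map[$key]) ? $map[$key] : null;
}

echo windows_locale_for('fr_CA'), "\n";   // french-canadian
```

A null return tells the caller to move on to the user's next preferred language or to leave the locale unchanged.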

Learning About the Current Locale

When you wish to manually do numeric formatting or you wish to learn more about the locale in which your page is operating, you can use the localeconv function in PHP to retrieve an array of information that is pertinent to the formatting of numbers for the current locale.

The array returned will contain the keys shown in Table 21-2. The trick to using localeconv is to first call setlocale with an appropriate locale name. The reason is that these functions reside in separate libraries in the operating system and do not get initialized until we begin to use them.

Table 21-2. Array Keys and Example Values from the localeconv Function

Array Element      Example  Description
-----------------  -------  ----------------------------------------------------------------
decimal_point      "."      The character to use as a decimal point
thousands_sep      ","      The character to use as a thousands separator
int_curr_symbol    "USD"    The international currency symbol for this locale
currency_symbol    "$"      The currency symbol for this locale
mon_decimal_point  "."      The character to use as a decimal point in monetary values
mon_thousands_sep  ","      The character to use as a thousands separator in monetary values
positive_sign      "+"      The sign to use for positive numbers
negative_sign      "-"      The sign to use for negative numbers
int_frac_digits    "2"      The number of fraction digits to show for this locale
p_cs_precedes      1        Controls whether the currency symbol appears before the number (1) or after the number (0) in positive numbers
n_cs_precedes      1        Controls whether the currency symbol appears before the number (1) or after the number (0) in negative numbers
p_sep_by_space     1        Controls whether there is a space between the currency symbol and a positive value
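
As an illustration of how these keys might be used, the sketch below feeds decimal_point and thousands_sep from localeconv into number_format. The helper name format_for_current_locale is hypothetical. The 'C' locale is used only because it exists on every system; in a real page this would be the locale chosen for the user.

```php
<?php

// The 'C' locale exists on every system, so it is used here; in a
// real page this would be the locale chosen for the user.
setlocale(LC_ALL, 'C');

// Hypothetical helper: format a number with the current locale's
// decimal point and thousands separator.
function format_for_current_locale($number, $decimals)
{
  $conv = localeconv();
  return number_format($number, $decimals,
                       $conv['decimal_point'], $conv['thousands_sep']);
}

echo format_for_current_locale(1234567.891, 2), "\n";
```

Because the separators are read at call time, the same helper produces different output as soon as setlocale selects a different locale.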
