Configuring PHP for Unicode

We will now look at how to make PHP work with the Unicode characters we use most often.

Installation and Configuration of mbstring and mbregex

As we mention later in Appendix A, "Installation/Configuration," you will need to configure PHP as you compile or install it to enable multi-byte string and regular expression functions. For fresh builds of PHP that you compile yourself, you will want to make sure you pass the options

--enable-mbstring --enable-mbregex 

when you run the configuration program. For PHP installations on machines running Microsoft Windows (where you will often not compile PHP5 yourself), you enable mbstring functionality by editing the php.ini file, which is typically in the Windows root directory (C:\Windows) or in the directory into which PHP was installed. Make sure that the following entry is uncommented by verifying that there is no semicolon (;) at the beginning:

extension=php_mbstring.dll

You will also need to make sure that PHP can find the mbstring extension dynamic link library (DLL) named in that entry (php_mbstring.dll). Do this by pointing the extension_dir configuration option in the same php.ini file at the directory that contains the DLL:

extension_dir = "c:\php\ext\"
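
After editing php.ini, it can help to confirm that PHP actually picked up the extension. The following is a minimal sketch of such a check; the function name describe_mbstring_support is a hypothetical helper invented for this illustration, and php_ini_loaded_file assumes PHP 5.2.4 or later.

```php
<?php
// Hypothetical helper (not part of PHP): summarizes whether mbstring is usable.
function describe_mbstring_support()
{
    if (!extension_loaded('mbstring')) {
        return 'mbstring is NOT loaded; check the extension and extension_dir entries in php.ini';
    }

    // mb_ereg only exists when the multi-byte regular expression
    // functions were compiled into this build.
    $regex = function_exists('mb_ereg') ? 'available' : 'not available';

    return 'mbstring is loaded; multi-byte regular expression functions are ' . $regex;
}

echo describe_mbstring_support(), "\n";

// Show which php.ini file PHP actually read, so you edit the right one.
$iniFile = php_ini_loaded_file();
echo 'php.ini in use: ', ($iniFile === false ? '(none found)' : $iniFile), "\n";
```

Running this from the command line and again through the web server is worthwhile, because the two can read different php.ini files.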

Once the extension is enabled, we turn to configuring it, which works the same way under Unix and Windows versions of PHP5. We do this by setting a number of options in the [mbstring] section of php.ini, as shown in Table 6-1.

Table 6-1. php.ini Configuration Settings for Multi-Byte Support

Setting Name                    Value Used   Description

mbstring.language               Neutral      Tells the mbstring code not to prefer any language in its internal workings.

mbstring.internal_encoding      UTF-8        The internal encoding that mbstring uses for the strings it works with; in this case, UTF-8.

mbstring.encoding_translation   On           Instructs PHP to convert incoming form and HTTP request data from the encoding named in mbstring.http_input to the internal encoding before we begin to work with it.

mbstring.http_input             UTF-8        We want HTTP input data handled as UTF-8 for us to use.

mbstring.http_output            UTF-8        We want all of our output to be in UTF-8.

mbstring.substitute_character   ?            When mbstring tries to convert a string for us but cannot find an equivalent character, it replaces that character with a question mark (?) in the output string.

mbstring.func_overload          7            Instructs mbstring to replace a large group of functions that are not multi-byte safe with versions that are. The replacement is seamless to the programmer.
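
To see which values from Table 6-1 are actually in effect, you can ask PHP to list the mbstring settings at run time. This is a minimal sketch rather than a definitive tool: current_mbstring_settings is a hypothetical helper name, and passing false as the second argument to ini_get_all assumes PHP 5.3 or later. Settings that a given PHP version does not support simply will not appear in the list.

```php
<?php
// Hypothetical helper (not part of PHP): returns the mbstring settings
// currently in effect as name => value pairs, sorted by name.
function current_mbstring_settings()
{
    if (!extension_loaded('mbstring')) {
        return array(); // nothing to report without the extension
    }

    // false requests plain name => value pairs instead of detail arrays.
    $settings = ini_get_all('mbstring', false);
    ksort($settings);

    return $settings;
}

$settings = current_mbstring_settings();
if (count($settings) === 0) {
    echo "mbstring is not loaded, so no settings are available.\n";
}
foreach ($settings as $name => $value) {
    // var_export makes empty strings and NULL values visible.
    printf("%-35s %s\n", $name, var_export($value, true));
}
```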

Function Overloading

One of the ways mbstring is made even more useful is through its ability to overload a group of functions that are not normally multi-byte safe and replace them with implementations that are safe. There are three groups of functions available for overloading:

  • The mail function, which PHP programmers can use to send an e-mail message.

  • A subset of the string functions, covering the ones you will use most often: strlen, strpos, strrpos, substr, strtolower, strtoupper, and substr_count. We will discuss all of these functions in the next section.

  • A subset of the regular expression functions, notably ereg, eregi, ereg_replace, eregi_replace, and split. We will learn more about these functions in Chapter 22, "Data Validation with Regular Expressions."

The three groups of functions are represented by the binary values 1, 2, and 4 respectively; the setting value of 7 we are using for mbstring.func_overload in php.ini is a bitwise OR of these three values.
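
The arithmetic behind the value 7 can be sketched in a few lines. The constant names and the overloaded_groups function below are hypothetical, chosen only for this illustration; the bit values 1, 2, and 4 are the ones described above.

```php
<?php
// Hypothetical names for the three overload groups and their bit values.
define('OVERLOAD_GROUP_MAIL', 1);
define('OVERLOAD_GROUP_STRING', 2);
define('OVERLOAD_GROUP_REGEX', 4);

// Hypothetical helper: lists the groups switched on by a func_overload value.
function overloaded_groups($setting)
{
    $names = array(
        OVERLOAD_GROUP_MAIL   => 'mail',
        OVERLOAD_GROUP_STRING => 'string',
        OVERLOAD_GROUP_REGEX  => 'regex',
    );

    $enabled = array();
    foreach ($names as $bit => $name) {
        // A group is on when its bit is set in the setting value.
        if (($setting & $bit) !== 0) {
            $enabled[] = $name;
        }
    }

    return $enabled;
}

$all = OVERLOAD_GROUP_MAIL | OVERLOAD_GROUP_STRING | OVERLOAD_GROUP_REGEX;

echo $all, "\n";                                    // prints 7
echo implode(', ', overloaded_groups($all)), "\n";  // prints mail, string, regex
echo implode(', ', overloaded_groups(6)), "\n";     // prints string, regex
```

A value such as 6 would therefore overload the string and regular expression functions while leaving mail untouched.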

If you do not wish to use function overloading, every function in the overloaded groups also has an explicitly multi-byte version. In most cases its name is that of the non-multi-byte counterpart with mb_ prefixed (mb_strpos, mb_strlen, mb_eregi, and so on); the multi-byte version of mail is named mb_send_mail.
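
The difference between counting bytes and counting characters is easiest to see with a short string that contains one multi-byte character. The sketch below calls the mb_ functions directly and passes the encoding explicitly, so it does not depend on any php.ini setting; byte_and_character_counts is a hypothetical helper name, and the sample word is an assumption chosen for illustration.

```php
<?php
// Hypothetical helper: reports the length of a UTF-8 string in bytes and
// in characters. The '8bit' encoding makes mb_strlen count raw bytes.
function byte_and_character_counts($text)
{
    return array(
        'bytes'      => mb_strlen($text, '8bit'),
        'characters' => mb_strlen($text, 'UTF-8'),
    );
}

// "héllo" written with an explicit UTF-8 byte sequence for é (0xC3 0xA9),
// so the example does not depend on how this source file is saved.
$word = "h\xC3\xA9llo";

$counts = byte_and_character_counts($word);
echo $counts['bytes'], ' bytes, ', $counts['characters'], " characters\n"; // prints 6 bytes, 5 characters

echo mb_strtoupper($word, 'UTF-8'), "\n";      // prints HÉLLO
echo mb_substr($word, 1, 3, 'UTF-8'), "\n";    // prints éll
echo mb_strpos($word, 'l', 0, 'UTF-8'), "\n";  // prints 2
```

A byte-oriented strlen would report 6 for this word, which is why the byte-based functions are unsafe for slicing or measuring multi-byte text.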
