Operating on Strings

PHP offers a number of functions for manipulating and processing strings, which we will now discuss in greater detail. As we mentioned earlier, many of these functions are overloaded when the mbstring module is turned on, so we will list both names available for them: the overloaded name and the original mbstring name of the function.

Getting Information

PHP has a few functions that tell us about strings, including their length and the character set in which they are encoded.

strlen (mb_strlen)

The function you use to get the length of a string is the strlen function (also available as mb_strlen). This function simply takes the string whose character count you want and returns the count.

<?php

  // prints 39
  echo strlen("Is this a dagger I see before me?<br/>\n");

  // prints 9 when strlen is overloaded by mbstring (a sample
  // nine-character Japanese string, 27 bytes in UTF-8)
  echo strlen('こんにちは世界です');

?>

The mb_strlen function takes an optional second parameter that names the character set in which the string is encoded, as follows:

<?php

  // prints 9 (a sample nine-character Japanese string)
  echo mb_strlen('こんにちは世界です', 'utf-8');

?>

Since all our strings in code are going to be in the same format in which we saved the file (UTF-8), this second parameter is of little use for the strings we entered in code. However, if we load a file from disk or data from a database that is not in UTF-8 format, this parameter lets us measure it correctly.
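As a rough sketch of that last case, the byte strings below spell the same three-character Japanese word in UTF-8 and in Shift-JIS. The variable names are ours, and the bytes are written as escapes so that the result does not depend on how the script file itself is saved.

```php
<?php

  // the word "日本語" (three characters) in two encodings
  $utf8_bytes = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";  // UTF-8, 9 bytes
  $sjis_bytes = "\x93\xFA\x96\x7B\x8C\xEA";              // Shift-JIS, 6 bytes

  // naming the encoding lets mb_strlen count characters, not bytes
  $utf8_chars = mb_strlen($utf8_bytes, 'UTF-8');   // 3
  $sjis_chars = mb_strlen($sjis_bytes, 'SJIS');    // 3
```

Both calls report three characters even though the two strings differ in length by three bytes.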

THE strlen FUNCTION AND BINARY DATA

Since strings in PHP can contain binary data, they are commonly used to return chunks of binary information from functions. It is then normal to use the strlen function on these binary strings to determine their size. The strlen function does not count the number of characters, but merely learns from PHP how many bytes are being used to store the data. (In ISO-8859-1, one byte equals one character.)

This contrasts with the mb_strlen function, which actually counts the number of characters in the appropriate character set to correctly handle multi-byte character sets, such as UTF-8, S-JIS, and others.

Now, our problem is that we previously told PHP never to use the native strlen implementation, but instead use the mb_strlen function, even when people type strlen in their code. The mb_strlen function will not return the correct character count from binary data a majority of the time (since binary data is bound to contain some values that look like multi-byte characters). We have effectively prevented ourselves from finding out the length of binary strings commonly returned by many functions in PHP (such as functions to read and write files on the hard disk)!

While we could turn off function overloading, this would seem suboptimal. A better solution exists in the optional second parameter to the mb_strlen function, which lets us specify the character set in which the data is stored. If we give this parameter the value '8bit', mb_strlen will return the total number of bytes in the string, which is what we want in this specific scenario. We could thus write a function as follows:

function binary_size($in_buffer)
{
  return mb_strlen($in_buffer, '8bit');
}

We would then be able to safely learn the size of any binary data we are given.
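A minimal usage sketch follows. It repeats binary_size so that it runs on its own, and it uses random_bytes (PHP 7 and later) purely as a stand-in for binary data returned by some other function.

```php
<?php

  function binary_size($in_buffer)
  {
    return mb_strlen($in_buffer, '8bit');
  }

  // stand-in for binary data: 16 arbitrary bytes
  $buffer = random_bytes(16);
  $size = binary_size($buffer);                    // 16

  // the two bytes C3 96 are the single character "Ö" in UTF-8
  $as_chars = mb_strlen("\xC3\x96", 'UTF-8');      // 1
  $as_bytes = binary_size("\xC3\x96");             // 2
```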


mb_detect_encoding

If we are given a string or load some data from a file and are uncertain as to which character set the string was encoded with, we can attempt to learn this by calling the mb_detect_encoding function. This function analyzes the given string and makes its best guess as to which character encoding the string uses. It cannot always guarantee 100 percent accuracy in determining differences between similar character sets, such as ISO-8859-1 and ASCII, but it can be very helpful for determining which encoding a given piece of Japanese or Russian text uses.

The function returns a string with the name of the determined character set:

<?php

  $mystery_str = get_string_from_file();

  echo mb_detect_encoding($mystery_str);

?>
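The function also accepts a list of candidate encodings and a strict flag, which helps when we already know roughly what to expect. The sketch below wraps this in a helper of our own, guess_encoding; the candidate list is only an assumption for illustration and should be replaced with whatever encodings an application actually receives.

```php
<?php

  // hypothetical helper: returns the name of the first candidate
  // encoding that fits, or FALSE when none of them does
  function guess_encoding($data, array $candidates)
  {
    // strict mode (third argument) rejects byte sequences that are
    // invalid in a candidate encoding
    return mb_detect_encoding($data, $candidates, true);
  }

  $candidates = array('ASCII', 'UTF-8');

  $japanese = guess_encoding("\xE6\x97\xA5", $candidates);  // 'UTF-8'
  $unknown  = guess_encoding("\xE9", $candidates);          // FALSE
```

The second call returns FALSE because a lone E9 byte is neither ASCII nor a complete UTF-8 sequence, so checking the result with === FALSE is worthwhile before using it.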

Cleaning Up Strings

PHP provides a number of functions we can use to clean up strings and process them for use in other places. Some of these functions are not safe for multi-byte strings but are so common that we will cover them here.

trim

The trim function takes a string argument and removes whitespace from the beginning and the end of the string. Whitespace is defined as any of the following characters: a space (" ", ASCII 32), a tab ("\t", ASCII 9), either of the carriage return/newline characters ("\r" and "\n", ASCII 13 and 10), a null character ("\0", ASCII 0), or a vertical tab character (ASCII 11), the last of which is rarely seen today.

<?php

  $str = "   \t\tOoops. Too much junk     \r\n \t   ";
  $str = trim($str);
  echo "Trimmed: \"" . $str . "\"<br/>";

  $str2 = ' Operating on Strings.   ';
  $str2 = trim($str2);
  echo "Trimmed: \"" . $str2 . "\"<br/>";

  //
  // optional use - you can tell it what chars to strip
  //
  $enthus = '???? I AM VERY ENTHUSIASTIC !!!!!!!!!!!!!!';
  $calm = trim($enthus, '!?');
  echo "Trimmed: \"" . $calm . "\"<br/>";

?>

The output from the script will be:

Trimmed: "Ooops. Too much junk"
Trimmed: "Operating on Strings."
Trimmed: " I AM VERY ENTHUSIASTIC "

Please note that the trim function is not multi-byte enabled. This means that it will not be aware of whitespace characters beyond those listed previously, in particular the double-wide space characters seen in many East Asian character sets. For those characters, and others about which the trim function might not know, you can pass in the set of characters to remove in the second parameter.

The other concern with trim is that it might accidentally strip out a byte at the end of a string that is actually part of a multi-byte character. The default set of characters it seeks to remove is such that this is not a concern (they are key codes that will not overlap with trailing bytes in most character sets, including UTF-8), but we will be careful with this function on multi-byte strings to be safe.
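To make that concern concrete, here is a sketch that assumes UTF-8 data. The helper trim_wide is our own invention, not a PHP function; it uses a regular expression with the /u modifier so that the double-wide space (U+3000) is matched as a whole character. The last line shows what can go wrong when a multi-byte character is handed to trim in its second parameter, which trim treats as a list of individual bytes.

```php
<?php

  // hypothetical helper: strips ASCII whitespace and the double-wide
  // space U+3000 from both ends of a UTF-8 string
  function trim_wide($str)
  {
    return preg_replace('/^[\s\x{3000}]+|[\s\x{3000}]+$/u', '', $str);
  }

  $wide  = "\u{3000}";                          // PHP 7+ escape syntax
  $clean = trim_wide($wide . " あ " . $wide);   // "あ"

  // U+3000 is E3 80 80 in UTF-8 and "あ" is E3 81 82, so trim also
  // removes the leading E3 byte of "あ" and destroys the character
  $broken = trim($wide . "あ", $wide);          // "\x81\x82"
```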

ltrim and rtrim

The ltrim and rtrim functions (the latter of which is also known as chop) are similar to the trim function, except that they operate only on the beginning or the end of the string, respectively. Both remove, by default, the same set of characters that the trim function removes, and both accept the optional parameter that specifies the set of characters to remove.

While neither of these functions is multi-byte enabled, each will end up being pretty safe to use for the same reasons as trim. Still, we will avoid them in favor of multi-byte aware functions whenever possible.
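A few typical one-sided uses, sketched with ASCII-only sample strings of our own:

```php
<?php

  // strip the line ending from a line of input
  $line = rtrim("last line of input\r\n");

  // strip leading zeros by naming the character to remove
  $digits = ltrim('000042', '0');

  // only the right-hand side changes; the indent is kept
  $indented = rtrim("    keep the indent   ");
```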

Searching and Comparing

There will be times when we wish to find things within strings. A number of functions will help us here.

strpos (mb_strpos) and strrpos (mb_strrpos)

The strpos function takes two string arguments, one to search and one to find, and returns the zero-based integer index at which the second argument was found within the first. If the second argument is not found within the first, FALSE is returned.

While prior versions only allowed the second argument to contain a single character, PHP5 searches for an entire string if specified.

<?php
  $str = 'When shall we three meet again?';

  // search for an individual character
  $idx = strpos($str, 'w');

  // search for a substring
  $idx2 = strpos($str, 'aga');

  // skip the first 10 chars, then start looking
  $idx3 = strpos($str, 'n', 10);

  echo "\$idx: $idx  \$idx2: $idx2  \$idx3: $idx3 <br/>\n";

  // sample Japanese text; the character sought is the sixth one
  $mbstr = 'これは日本語のテキストです';
  $mbidx = mb_strpos($mbstr, '語');
  echo "\$mbidx: $mbidx<br/>\n";

?>

The strpos function accepts an optional third argument: the offset, or the index at which to start looking. The script produces the following output:

$idx: 11 $idx2: 25 $idx3: 29
$mbidx: 5

If you specify the offset parameter as a negative number (supported as of PHP 7.1), the strpos function starts searching that number of characters from the end of the string. Earlier versions reject a negative offset with a warning.

Like all of the mbstring functions, the mb_strpos function accepts an optional final (fourth, in this case) parameter that specifies the character set in which the string arguments are encoded.

Please note that when inspecting the results of the various flavors of strpos, 0 is returned when the desired character or substring occurs at the beginning of the string, and FALSE is returned if the desired character or substring does not occur. If you write your code as follows:

<?php
   $res = strpos($haystack, $needle);
   if ($res == FALSE)
   {
     echo "Couldn't find it!";
   }
   else
   {
     echo "Found it at index: $res";
   }

?>

you get the message "Couldn't find it!" if the sought-after character(s) occurred at the beginning of the string. Recall from our introduction to operators and type conversions in Chapter 2 that 0 and FALSE are considered equivalent for the simple equality operator == (two equal signs). To distinguish between them, you need to use the identity operator === (three equal signs), which makes sure the two operands are of the same type. Thus, the previous check should be written as

if ($res === FALSE)

The strrpos and mb_strrpos functions are just like their r-less cousins, except they look for the last instance of the given character or substring.
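The points above can be sketched together on an ASCII path of our own choosing. The helper has_substring is hypothetical; it simply wraps the identity check so that a match at index 0 is not mistaken for a miss.

```php
<?php

  // hypothetical helper: TRUE when $needle occurs anywhere in
  // $haystack, including at index 0
  function has_substring($haystack, $needle)
  {
    return strpos($haystack, $needle) !== FALSE;
  }

  $path  = '/var/www/site/index.php';

  $first = strpos($path, '/');             // 0: == FALSE, but not === FALSE
  $last  = strrpos($path, '/');            // 13
  $found = has_substring($path, '/var');   // TRUE, although the index is 0
  $none  = strpos($path, '\\');            // FALSE
```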

strcmp and strncmp

The strcmp function compares two strings byte by byte (and is therefore safe for comparing strings with unprintable binary characters), and is case sensitive (since lower- and uppercase characters have different key codes). It returns one of three kinds of value:

  • A negative number (always -1 as of PHP 8.2), indicating that the first string is "less than" the second string.

  • 0, indicating that the strings are equal.

  • A positive number (always 1 as of PHP 8.2), indicating that the first string is "greater than" the second string.

Since the strcmp function compares byte values, the designations "greater than" and "less than" are of limited use. Because of the way character tables work, uppercase letters have lower numbers than lowercase ones, meaning the letter "Z" is considered less than the letter "a."
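A short sketch of that ordering, using sample words of our own. It tests only the sign of the result, which is a reasonable habit given that the exact value depends on the PHP version.

```php
<?php

  // 'Z' is byte 90 and 'a' is byte 97, so 'Z' sorts first
  $z_before_a = strcmp('Z', 'a') < 0;      // TRUE

  // sorting with strcmp therefore puts every capitalized word
  // ahead of every lowercase one
  $names = array('apple', 'Zebra', 'banana', 'Apple');
  usort($names, 'strcmp');
  // $names is now: Apple, Zebra, apple, banana
```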

The strncmp function operates like the strcmp function, except it only compares the first n characters (more precisely, bytes) of the two strings. You specify n in the third parameter you pass to the function.

<?php

  $stra = 'Cats are fuzzy';
  $strb = 'cats are fuzzy';
  $strc = 'bats are fuzzy';
  $resa = strcmp($stra, $strb);
  $resb = strcmp($strb, $strc);
  echo "\$resa: $resa   \$resb: $resb<br/>\n";

  // sample Japanese text (each character is three bytes in UTF-8)
  $mbstr = '日本語のテキスト';
  $mbstr2 = '日本語のテキスト';
  $mbres = strcmp($mbstr, $mbstr2);
  echo "\$mbres:  $mbres<br/>\n";

  // compare first three characters only
  $strx = 'Moo, said the cow.';
  $stry = 'Moocows are where milk comes from.';
  $resz = strncmp($strx, $stry, 3);
  echo "\$resz:  $resz<br/>\n";

  // compare first two characters (but not really!)
  // '旦' (E6 97 A6) and '日' (E6 97 A5) share their first two bytes
  $mbstr3 = '旦那';
  $mbres2 = strncmp($mbstr, $mbstr3, 2);
  echo "\$mbres2:  $mbres2<br/>\n";

?>

The output of this script is as follows:

$resa: -1 $resb: 1
$mbres: 0
$resz: 0
$mbres2: 0

We see that the strcmp function works as expected on both regular ASCII strings and even multi-byte strings; even though the characters are multiple bytes, the function can still go along and see if all of the bytes are the same or not. Similarly, we see that the strncmp function correctly identifies two different strings as at least having the first three characters the same.

Where the code is unusual is in its comparison of the multi-byte string $mbstr with the multi-byte character sequence $mbstr3. We want to know if the first two characters are the same or not, and it returns 0, indicating that they are! What happened? Since strcmp and strncmp are not multi-byte aware, the third parameter to strncmp is actually the number of 8-bit bytes that are to be compared. The character sequence in $mbstr3 is made up of six bytes, but we only told the function to compare the first two (which are the same).

We cannot call strlen on $mbstr3 to get the correct byte count for this third parameter since it is overloaded as mb_strlen and correctly returns 2. However, we can call mb_strlen and tell it that the incoming sequence is "8bit", which would cause it to return the desired byte count 6.

<?php

  // compare first two characters (a sample two-character,
  // six-byte Japanese string)
  $mbstr3 = '旦那';
  $bytes_in_mbstr3 = mb_strlen($mbstr3, '8bit');
  $mbres2 = strncmp($mbstr, $mbstr3, $bytes_in_mbstr3);

?>

The comparison now correctly returns a non-zero value for the first two characters.
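Another way to approach the same problem, sketched here under the assumption that both strings are UTF-8, is to cut each string down to its first n characters with mb_substr and compare the pieces. The helper compare_first_chars is our own and not part of PHP.

```php
<?php

  // hypothetical helper: like strncmp, but $n counts characters
  function compare_first_chars($a, $b, $n, $encoding = 'UTF-8')
  {
    return strcmp(mb_substr($a, 0, $n, $encoding),
                  mb_substr($b, 0, $n, $encoding));
  }

  // the first characters differ, although their first two bytes match
  $by_bytes = strncmp('日本語', '旦那', 2);               // 0
  $by_chars = compare_first_chars('日本語', '旦那', 1);   // non-zero

  // the first two characters really are the same here
  $same = compare_first_chars('日本語', '日本人', 2);     // 0
```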

strcasecmp and strncasecmp

These two functions are very similar to strcmp and strncmp, except they ignore case for ASCII characters. Like their cousins that do not ignore case, they are not multi-byte enabled and should be used with caution. For UTF-8 strings, none of the bytes in a multi-byte character maps to an ASCII character, meaning that these functions generate correct results as long as only ASCII letters need their case ignored. Nonetheless, we will try to be cautious in our usage of these.
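A common ASCII-only use is checking for a prefix without regard to case. The helper below, has_prefix_ci, is a hypothetical sketch; it assumes the prefix is plain ASCII, so its character count and byte count are the same whether or not strlen is overloaded.

```php
<?php

  // hypothetical helper: does $str start with $prefix, ignoring
  // the case of ASCII letters?
  function has_prefix_ci($str, $prefix)
  {
    return strncasecmp($str, $prefix, strlen($prefix)) === 0;
  }

  $is_type = has_prefix_ci('CONTENT-TYPE: text/html', 'content-type:');
  $is_len  = has_prefix_ci('Content-Type: text/html', 'content-length:');
  // $is_type is TRUE, $is_len is FALSE
```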

strnatcmp and strnatcasecmp

One of the problems when looking at strings is that based on ASCII values, the string picture10.gif is less than picture9.gif since "1" is a lower key code than "9." For those cases when you want to compare strings and have numbers within those strings be treated more "naturally," you can use the strnatcmp and strnatcasecmp functions, which do comparisons similar to the strcmp and strcasecmp functions, but with special number processing.

Thus, strnatcmp would indicate that picture10.gif was in fact greater than picture9.gif. Similarly, strnatcasecmp would indicate that Picture10.gif was greater than pictURE9.gif. Neither of these functions is multi-byte enabled, so they should be used only when you are sure of what character set you are using.
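The difference is easiest to see when sorting. This sketch sorts the same list of sample file names twice, once by byte value and once naturally without regard to case.

```php
<?php

  $files = array('picture10.gif', 'picture9.gif', 'Picture2.gif');

  // byte order: the capital 'P' sorts first, and "10" sorts before "9"
  $byte_order = $files;
  usort($byte_order, 'strcmp');
  // Picture2.gif, picture10.gif, picture9.gif

  // natural order, ignoring case: 2, then 9, then 10
  $natural = $files;
  usort($natural, 'strnatcasecmp');
  // Picture2.gif, picture9.gif, picture10.gif
```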

Extraction

PHP provides a number of functions to find and extract parts of strings.

substr (mb_substr)

For situations when you want to extract a portion of a string, the substr function and its mb_substr cousin take a string and a starting index as parameters and then return the contents of the given string from said starting index until the end of the string. You can also pass in a third parameter to indicate the number of characters after the starting index to include in the extracted string for cases when you do not want everything until the end.

<?php

  // start is +1 to get the first character AFTER the '
  $str = "User Name is 'Bubba The Giant'.";
  $start = strpos($str, "'") + 1;
  $end = strrpos($str, "'");

  // don't forget: last parameter is number of characters
  $user_name = substr($str, $start, $end - $start);

  echo "User Name: $user_name<br/>\n";

  // this says the user's name is akira (あきら).
  // note that the quotes in this string are multi-byte
  // quote chars: 「 and 」
  // (the mb_ names are spelled out so that the result does not
  // depend on function overloading being turned on)
  $mbstr = 'ユーザー名は「あきら」です。';
  $start = mb_strpos($mbstr, '「') + 1;
  $end = mb_strrpos($mbstr, '」');
  $mbuser_name = mb_substr($mbstr, $start, $end - $start);

  echo "MBCS User Name: $mbuser_name<br/>\n";

?>

The output of this script is

User Name: Bubba The Giant
MBCS User Name: あきら
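The difference between the two functions shows up as soon as the start index or length falls inside a multi-byte character. The following sketch assumes UTF-8 and assumes that substr is not overloaded, so that it counts bytes.

```php
<?php

  // "日本語" written as code points: 3 characters, 9 bytes in UTF-8
  $text = "\u{65E5}\u{672C}\u{8A9E}";

  $first_char = mb_substr($text, 0, 1, 'UTF-8');      // "日"
  $last_two   = mb_substr($text, -2, null, 'UTF-8');  // "本語"

  // the byte-oriented function returns one third of a character
  $first_byte = substr($text, 0, 1);                  // "\xE6"
```

A negative start index counts back from the end of the string, in characters for mb_substr and in bytes for substr.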

Case Manipulation

Many languages of the world distinguish between upper- and lowercase letters, which can prove tricky for situations in which we are not particularly interested in case. We would like to know if "Johan" is the same as "JOHAN" and have the string functions recognize that 'h' and 'H' are the same in this case.

strtoupper (mb_strtoupper) and strtolower (mb_strtolower)

These functions are useful for converting the case of given strings. While the non-multi-byte enabled versions will only convert the 26 basic Latin letters, the multi-byte enabled versions are much more sophisticated: they use the Unicode property by which certain characters are marked as 'alphabetic.' In the following code, we can see that they are able to convert between "Ö" and "ö."

<?php

  $str = "I AM YELLING LOUDLY";
  echo strtolower($str);  echo "<br/>\n";

  $str = "i am quiet as a mouse";
  echo strtoupper($str);  echo "<br/>\n";

  // the mb_ names are spelled out below so that the output does
  // not depend on function overloading being turned on
  $str = 'I live in ÅRÑÖÜ.';
  echo mb_strtolower($str);  echo "<br/>\n";
  $str = 'I live in årñöü.';
  echo mb_strtoupper($str);  echo "<br/>\n";
  // sample Japanese text, which has no upper- or lowercase forms
  $str = 'I live in 東京.';
  echo mb_strtolower($str);  echo "<br/>\n";
  echo mb_strtoupper($str);  echo "<br/>\n";

?>

This script produces the following output:

i am yelling loudly
I AM QUIET AS A MOUSE
i live in årñöü.
I LIVE IN ÅRÑÖÜ.
i live in 東京.
I LIVE IN 東京.
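Returning to the "Johan" and "JOHAN" question from the start of this section, one simple approach is to lowercase both strings and compare the results. The helper same_ignoring_case is a sketch of our own that assumes UTF-8 input; it is not a built-in function.

```php
<?php

  // hypothetical helper: are two strings equal once case is ignored?
  function same_ignoring_case($a, $b, $encoding = 'UTF-8')
  {
    return mb_strtolower($a, $encoding) === mb_strtolower($b, $encoding);
  }

  $johan = same_ignoring_case('Johan', 'JOHAN');               // TRUE

  // "JÖRG" and "jörg", with the umlauts written as code points
  $jorg  = same_ignoring_case("J\u{00D6}RG", "j\u{00F6}rg");   // TRUE

  $other = same_ignoring_case('Johan', 'Johann');              // FALSE
```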

Character Encoding Conversions

A very powerful extension included with PHP5 is known as iconv; it provides a number of functions that are character set-aware, as the mbstring functions are. While many of the functions in this extension duplicate much of the functionality of the mbstring functions (using a different implementation), there is one of particular interest: the iconv function.

This function takes a string and converts it from one specified character set to another. Therefore, if we were given a text file with Japanese text in Shift-JIS format and we wanted to convert it into UTF-8, we could simply use the following line of code:

$utf8 = iconv('SJIS', 'UTF-8', $sjis_string);

When an external entity (such as the operating system) provides us with data in character sets that we are not using, we can use this function to make sure they are correctly converted before we begin working with them.
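A slightly more defensive sketch of such a conversion follows. The wrapper to_utf8 is our own; it relies on the fact that iconv returns FALSE when the conversion fails, and the sample input is the word "café" in ISO-8859-1.

```php
<?php

  // hypothetical wrapper: convert $data to UTF-8 or fail loudly
  function to_utf8($data, $from_charset)
  {
    $converted = iconv($from_charset, 'UTF-8', $data);
    if ($converted === FALSE)
    {
      throw new RuntimeException("Cannot convert from $from_charset");
    }
    return $converted;
  }

  // in ISO-8859-1 the letter "é" is the single byte E9;
  // in UTF-8 it becomes the two bytes C3 A9
  $latin1 = "caf\xE9";
  $utf8   = to_utf8($latin1, 'ISO-8859-1');   // "caf\xC3\xA9"
```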

