Caching Reused Data Between Requests

People often ask how to achieve object persistence across requests. The idea is to be able to create an object in a request, have that request complete, and then reference that object in the next request. Many Java systems use this sort of object persistence to implement shopping carts, user sessions, database connection persistence, or any other functionality that must last for the life of a Web server process or the length of a user's session on a Web site. This is a popular strategy for Java programmers and (to a lesser extent) mod_perl developers.

Both Java and mod_perl embed a persistent runtime into Apache. In this runtime, scripts and pages are parsed and compiled the first time they are encountered; after that, the compiled copy is simply executed again on each request. You can think of it as starting up the runtime once and then executing a page the way you might execute a function call in a loop (just calling the compiled copy). As we will discuss in Chapter 20, "PHP and Zend Engine Internals," PHP does not implement this sort of strategy. PHP keeps a persistent interpreter, but it completely tears down the context at request shutdown.

This means that any variable you create in a page, such as the following, is destroyed at the end of the request, along with the entire symbol table:

<?php $string = 'hello world'; ?>

So how do you get around this? How do you carry an object over from one request to another? Chapter 10, "Data Component Caching," addresses this question for large pieces of data. In this section we are focused on smaller pieces: intermediate data or individual objects. How do you cache those between requests? The short answer is that you generally don't want to.

Actually, that's not completely true; you can use the serialize() function to package up an arbitrary data structure (object, array, what have you), store it, and then retrieve and unserialize it later. There are a few hurdles, however, that in general make this undesirable on a small scale:

For objects that are relatively low cost to build, instantiation is cheaper than unserialization.
If there are numerous instances of an object (as happens with the Word objects or an object describing an individual Web site user), the cache can quickly fill up, and you need to implement a mechanism for aging out serialized objects.
As noted in previous chapters, cache synchronization and poisoning across distributed systems are difficult problems.
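
The serialize()/unserialize() round trip described above can be sketched as follows. This is only a minimal illustration, not the Text_Word implementation: the SimpleWord class and the temporary-file store are assumptions made for the example.

```php
<?php
// Hypothetical stand-in for a small, cheap-to-build object.
class SimpleWord {
    public $word;
    public $numSyllables = 0;

    public function __construct($word) {
        $this->word = $word;
    }
}

// Package the object up and write it to a store (here, a temporary file).
$original = new SimpleWord('hello');
$original->numSyllables = 2;
$path = tempnam(sys_get_temp_dir(), 'word');
file_put_contents($path, serialize($original));

// Later (for example, in another request): read it back and rebuild it.
$restored = unserialize(file_get_contents($path));
unlink($path);

var_dump($restored instanceof SimpleWord);  // bool(true)
var_dump($restored == $original);           // bool(true): same class, same property values
```

Note that the restored object is a copy rebuilt from the stored string, so for an object this small the rebuild does as much work as simply calling the constructor again.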

As always, you are brought back to a tradeoff: You can avoid the cost of instantiating certain high-cost objects at the expense of maintaining a caching system. If you are careless, it is very easy to cache too aggressively and thus hurt the cacheability of more significant data structures or to cache too passively and not recoup the manageability costs of maintaining the cache infrastructure.

So, how could you cache an individual object between requests? Well, you can use the serialize() function to convert it to a storable format and then store it in a shared memory segment, database, or file cache. To implement this in the Word class, you can add store() and retrieve() methods to it. In this example, you can back it with a MySQL-based cache, accessed through the connection abstraction layer you built in Chapter 2, "Object-Oriented Programming Through Design Patterns":

require_once 'DB.inc';

class Text_Word {
  // Previous class definitions
  // ...

  // Serialize this object and write it to the ObjectCache table.
  function store() {
    $data = serialize($this);
    $db = new DB_Mysql_TestDB;
    $query = "REPLACE INTO ObjectCache (objecttype, keyname, data, modified)
              VALUES('Word', :1, :2, now())";
    $db->prepare($query)->execute($this->word, $data);
  }

  // Return the cached Text_Word for $name, or a new one on a cache miss.
  // Static because the factory calls it as Text_Word::retrieve($name).
  static function retrieve($name) {
    $db = new DB_Mysql_TestDB;
    $query = "SELECT data FROM ObjectCache
              WHERE objecttype = 'Word' AND keyname = :1";
    $row = $db->prepare($query)->execute($name)->fetch_assoc();
    if($row) {
      return unserialize($row['data']);
    }
    else {
      return new Text_Word($name);
    }
  }
}

Escaping Query Data

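The DB abstraction layer you developed in Chapter 2 handles escaping data for you. If you are not using an abstraction layer here, you need to escape the output of serialize() yourself before placing it in the query, for example with mysql_real_escape_string(). (On current PHP versions, where the original mysql extension no longer exists, mysqli_real_escape_string() or bound parameters serve the same purpose.)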
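
As an illustration of letting the database layer handle the escaping, here is a minimal sketch that uses bound parameters. It assumes PDO with the SQLite driver as a stand-in for the MySQL cache, and the column types in the CREATE TABLE statement are guesses; only the table and column names come from the example above.

```php
<?php
// Assumption: PDO with an in-memory SQLite database stands in for MySQL.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec("CREATE TABLE ObjectCache (
             objecttype TEXT, keyname TEXT, data TEXT, modified TEXT,
             PRIMARY KEY (objecttype, keyname))");

// Serialized data is full of quotes and semicolons. Bound parameters pass
// it to the database as data, so no manual escaping is needed.
$data = serialize(array('word' => "it's", 'numSyllables' => 1));
$insert = $db->prepare("REPLACE INTO ObjectCache (objecttype, keyname, data, modified)
                        VALUES ('Word', :keyname, :data, datetime('now'))");
$insert->execute(array(':keyname' => "it's", ':data' => $data));

// Read the row back and rebuild the original structure.
$select = $db->prepare("SELECT data FROM ObjectCache
                        WHERE objecttype = 'Word' AND keyname = :keyname");
$select->execute(array(':keyname' => "it's"));
$row = $select->fetch(PDO::FETCH_ASSOC);
$restored = unserialize($row['data']);

var_dump($restored['word'] === "it's");  // bool(true)
```

The key name deliberately contains a single quote; interpolating it directly into the SQL string would break the query, which is exactly the problem escaping is meant to prevent.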

To use the new Text_Word caching implementation, you need to decide when to store the object. Because the goal is to save computational effort, you can update ObjectCache in the numSyllables method after you perform all your calculations there:

function numSyllables() {
  if($this->_numSyllables) {
    return $this->_numSyllables;
  }
  $scratch = $this->mungeWord($this->word);
  $fragments = preg_split("/[^aeiouy]+/", $scratch);
  if(!$fragments[0]) {
    array_shift($fragments);
  }
  if(!$fragments[count($fragments) - 1]) {
    array_pop($fragments);
  }
  $this->_numSyllables += $this->countSpecialSyllables($scratch);
  if(count($fragments)) {
    $this->_numSyllables += count($fragments);
  }
  else {
    $this->_numSyllables = 1;
  }
  // store the object before returning the result
  $this->store();
  return $this->_numSyllables;
}

To retrieve elements from the cache, you can modify the factory to search the MySQL cache if it fails its internal cache:

class CachingFactory {
  static $objects;
  // Check the per-request cache first, then fall back to the MySQL cache.
  static function Word($name) {
    if(!isset(self::$objects['Word'][$name])) {
      self::$objects['Word'][$name] = Text_Word::retrieve($name);
    }
    return self::$objects['Word'][$name];
  }
}
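
The two-tier lookup can be tried without a database. In the following minimal sketch, the FakeStore and TwoTierFactory classes are hypothetical stand-ins: FakeStore plays the role of the MySQL-backed retrieve() method and simply counts how often it is consulted.

```php
<?php
// Hypothetical stand-in for the persistent cache; a real one would query MySQL.
class FakeStore {
    public static $rows = array();
    public static $lookups = 0;

    public static function retrieve($name) {
        self::$lookups++;
        if (isset(self::$rows[$name])) {
            return unserialize(self::$rows[$name]);
        }
        return (object) array('word' => $name);  // cache miss: build a new object
    }
}

class TwoTierFactory {
    private static $objects = array();

    public static function Word($name) {
        // First tier: per-request array. Second tier: the persistent store.
        if (!isset(self::$objects['Word'][$name])) {
            self::$objects['Word'][$name] = FakeStore::retrieve($name);
        }
        return self::$objects['Word'][$name];
    }
}

$a = TwoTierFactory::Word('the');
$b = TwoTierFactory::Word('the');
var_dump($a === $b);            // bool(true): same instance within the request
var_dump(FakeStore::$lookups);  // int(1): the store was consulted only once
```

Within a single request the second tier is hit at most once per word; every new request starts with an empty first tier and pays the full lookup cost again.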

Again, the amount of machinery that goes into maintaining this caching process is quite large. In addition to the modifications you've made so far, you also need a cache maintenance infrastructure to purge entries from the cache when it gets full. And it will get full relatively quickly. If you look at a sample row in the cache, you see that the serialization for a Word object is rather large:

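mysql> select data from ObjectCache where keyname = 'the';
+---------------------------------------------------------------+
| data                                                          |
+---------------------------------------------------------------+
| O:4:"word":2:{s:4:"word";s:3:"the";s:13:"_numSyllables";i:0;} |
+---------------------------------------------------------------+
1 row in set (0.01 sec)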

That amounts to 61 bytes of data, much of which is class structure. In PHP 4 this is even worse because static class variables are not supported, and each serialization can include the syllable exception arrays as well. Serializations by their very nature tend to be wordy, often making them overkill.
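
To see how much of a serialization is structure rather than data, you can compare the length of the serialized string with the length of the values it carries. The following is a minimal sketch; the Word class here is a hypothetical two-property class that mirrors the sample row, not the full Text_Word class.

```php
<?php
// Hypothetical minimal class mirroring the two properties in the sample row.
class Word {
    public $word = 'the';
    public $_numSyllables = 0;
}

$serialized = serialize(new Word());
$payload = strlen('the') + strlen('0');  // bytes taken by the property values

echo $serialized, "\n";
// Prints "61 bytes serialized, 4 bytes of values".
printf("%d bytes serialized, %d bytes of values\n", strlen($serialized), $payload);
```

Everything beyond those few bytes of values is class name, property names, type tags, and length prefixes.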

It is difficult to achieve any substantial performance benefit by using this sort of interprocess caching. For example, for the Text_Word class, all this caching infrastructure brings no discernible speedup. In contrast, the object-caching factory technique gave me (on my test system) roughly a factor-of-eight speedup on Text_Word object re-declarations within a request.

In general, I would avoid the strategy of trying to cache intermediate data between requests. Instead, if you identify a bottleneck in a specific function, search first for a more global solution. Interprocess sharing of small data is worthwhile only for particularly complex objects and data structures that consume significant resources to build. It is difficult to overcome the cost of interprocess communication on such a small scale.
