
In-Memory Caching

Having resources shared between threads or across process invocations will probably seem natural to programmers coming from Java or mod_perl. In PHP, all user data structures are destroyed at request shutdown. This means that with the exception of resources (such as persistent database connections), any objects you create will not be available in subsequent requests.

Although in many ways this lack of cross-request persistence is lamentable, it has the effect of making PHP an incredibly sandboxed language, in the sense that nothing done in one request can affect the interpreter's behavior in a subsequent request (I play in my sandbox, you play in yours). One of the downsides of the persistent state in something like mod_perl is that it is possible to irrevocably trash your interpreter for future requests or to have improperly initialized variables take on unexpected values. In PHP, this type of problem is close to impossible. User scripts always enter a pristine interpreter.

Flat-File Caches

A flat-file cache uses a flat, or unstructured, file to store user data. Data is written to the file by the caching process, and then the file (usually the entire file) is sourced when the cache is requested. A simple example is a strategy for caching the news items on a page. To start off, you can structure such a page by using includes to separate page components.
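
A minimal sketch of that structure follows (the file names index.php, header.inc, news.inc, and footer.inc are illustrative, not from any particular application):

<?php
// index.php -- a page assembled from discrete components.
// Each component is a separate candidate for caching.
include("header.inc");   // banner and navigation
include("news.inc");     // the expensive part: builds the news items
include("footer.inc");   // static page footer
?>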

File-based caches are particularly useful in applications that simply use include() on the cache file or otherwise directly use it as a file. Although it is certainly possible to store individual variables or objects in a file-based cache, that is not where this technique excels.

Cache Size Maintenance

With a single file per cache item, you risk not only consuming a large amount of disk space but creating a large number of files. Many filesystems (including ext2 and ext3 in Linux) perform very poorly when a large number of files accumulate in a directory. If a file-based cache is going to be large, you should look at creating a multitiered caching structure to keep the number of files in a single directory manageable. This technique is often utilized by mail servers for managing large spools, and it is easily adapted to many caching situations.
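
As a sketch, you can fan cache files out across subdirectories keyed on the leading characters of a hash of the cache key (the two-level layout, the /cache root, and the function name cache_path() are all illustrative choices):

<?php
// Map a cache key to a two-level directory tree so that no single
// directory accumulates an unbounded number of files.
function cache_path($key, $root = "/cache") {
    $hash = sha1($key);
    // e.g. a hash starting "3a7b..." lands in /cache/3/a/
    $dir = "$root/{$hash[0]}/{$hash[1]}";
    if (!is_dir($dir)) {
        mkdir($dir, 0755, true);  // recursive mkdir (PHP 5+)
    }
    return "$dir/$hash";
}
?>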

Don't let preconceptions that a cache must be small constrain your design choices. Although small caches are in general faster to access than large ones, as long as the cached version (including its maintenance overhead) is faster than the uncached version, it is worth considering. Later in this chapter we will look at an example in which a multigigabyte file-based cache can make sense and provide significant performance gains. Without interprocess communication, it is difficult to implement a least recently used (LRU) cache-removal policy (because we have no statistics on the rate at which the files are being accessed). Choices for removal policies include the following:

  • LRU: You can use the access time (atime, in the structure returned by stat()) to find and remove the least recently used cache files. Systems administrators often disable access-time updates to reduce the number of disk writes in a read-intensive application (and thus improve disk performance). If this is the case, an LRU based on file atime will not work. Further, reading through the cache directory structure and calling stat() on all the files becomes increasingly slow as the number of cache files and the cache usage grow.

  • First in, first out (FIFO): To implement a FIFO caching policy, you can use the modification time (mtime in the stat() array) to order files by the time they were last updated. This suffers from the same stat()-related slowness as the LRU policy. (A sketch of a FIFO pruner appears after this list.)

  • Ad hoc: Although it might seem overly simplistic, in many cases simply removing the entire cache, or entire portions of the cache, can be an easy and effective way of handling cache maintenance. This is especially true for large caches, where maintenance occurs infrequently and a search of the entire cache would be extremely expensive. This is probably the most common method of cache removal.
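
Here is a minimal sketch of the FIFO pruner referenced above; the *.cache naming convention, the flat directory layout, and the $max_bytes budget are assumptions for illustration:

<?php
// Prune a cache directory FIFO-style: delete the oldest-modified
// files first until the total size fits within $max_bytes.
function by_mtime($a, $b) { return $a['mtime'] - $b['mtime']; }

function prune_cache_fifo($dir, $max_bytes) {
    $files = array();
    $total = 0;
    foreach ((array) glob("$dir/*.cache") as $file) {
        $st = stat($file);
        $files[] = array('mtime' => $st['mtime'],
                         'size'  => $st['size'],
                         'file'  => $file);
        $total += $st['size'];
    }
    usort($files, 'by_mtime');          // oldest first
    foreach ($files as $f) {
        if ($total <= $max_bytes) {
            break;
        }
        unlink($f['file']);
        $total -= $f['size'];
    }
}
?>

Swapping mtime for atime in the comparison function turns this into the LRU policy from the first bullet, with the caveats about disabled atime updates noted there.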

In general, when implementing caches, you usually have specialized knowledge about your data that you can exploit to manage the cache more effectively. Unfortunately, this means that there is no single best way to manage every cache.

Cache Concurrency and Coherency

While files can be read by multiple processes simultaneously without any risk, writing to files while they are being read is extremely dangerous. To understand what the dangers are and how to avoid them, you need to understand how filesystems work.

A filesystem is a tree that consists of branch nodes (directories) and leaf nodes (files). When you open a file by using fopen("/path/to/file.php", $mode), the operating system searches for the path you pass to it. It starts in the root directory, opening the directory and inspecting the contents. A directory is a table that consists of a list of names of files and directories, as well as inodes associated with each. The inode associated with the filename directly corresponds to the physical disk location of the file. This is an important nuance: The filename does not directly translate to the location; the filename is mapped to an inode that in turn corresponds to the storage. When you open a file, you are returned a file pointer. The operating system associates this structure with the file's inode so that it knows where to find the file on disk. Again, note the nuance: The file pointer returned to you by fopen() has information about the file inode you are opening, not the filename.
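
You can observe this mapping from PHP itself; fileinode() reports the inode a name currently points to (a small demonstration, reusing the path from above):

<?php
// The name maps to an inode; the inode locates the storage.
echo fileinode("/path/to/file.php"), "\n";
// If the file is later replaced by unlinking and recreating it (or by
// renaming another file over it), the same name will report a new inode.
?>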

If you only read and write to the file, a cache that ignores this nuance will behave as you expect: as a single buffer for that file. This is dangerous because if you write to a file while simultaneously reading from it (say, in a different process), it is possible to read in data that is partially the old file content and partially the new content that was just written. As you can imagine, this causes the data that you read in to be inconsistent and corrupt.

Here is an example of what you would like to do to cache an entire page:

<?php
if (file_exists("first.cache")) {
  include("first.cache");
  return;
}
else {
  // Open the cache file with 'w' mode, truncating it for writing,
  // and capture the generated page in an output buffer.
  $cachefp = fopen("first.cache", "w");
  ob_start();
}
?>
<HTML>
<BODY>
<!-- Cacheable for a day -->
Today is <?= strftime("%A, %B %e %Y") ?>
</BODY>
</HTML>
<?php
if ($cachefp) {
  // Write the buffered page to the cache file, then send it to the client.
  $file = ob_get_contents();
  fwrite($cachefp, $file);
  fclose($cachefp);
  ob_end_flush();
}
?>

The problem with this is illustrated in Figure 10.1. You can see that by reading and writing simultaneously in different processes, you risk reading corrupted data.

Figure 10.1. A race condition in unprotected cache accesses.

You have two ways to solve this problem: You can use file locks or file swaps.

Using file locks is a simple but powerful way to control access to files. File locks are either mandatory or advisory. Mandatory file locks are actually enforced in the operating system kernel and prevent read() and write() calls to the locked file from occurring. Mandatory locks aren't defined in the POSIX standards, nor are they part of the standard BSD file-locking semantics; their implementation varies among the systems that support them. Mandatory locks are also rarely, if ever, necessary. Because you are implementing all the processes that will interact with the cache files, you can ensure that they all behave politely.

Advisory locks tend to come in two flavors:

  • flock: flock dates from BSD version 4.2, and it provides shared (read) and exclusive (write) locks on entire files.

  • fcntl: fcntl is part of the POSIX standard, and it provides shared and exclusive locks on sections of a file (that is, you can lock particular byte ranges, which is particularly helpful for managing database files or other applications in which you might want multiple processes to concurrently modify different parts of a file).

A key feature of both advisory locking methods is that they release any locks held by a process when it exits. This means that if a process holding a lock suffers an unexpected failure (for example, the Web server process that is running incurs a segmentation fault), the lock held by that process is released, preventing a deadlock from occurring.

PHP opts for whole-file locking with its flock() function. Ironically, on most systems, this is actually implemented internally by using fcntl. Here is the caching example reworked to use file locking:

<?php
$file = $_SERVER['PHP_SELF'];
$cachefile = "$file.cache";
$cachefp = false;

// Open in append mode so the file is created if missing but never truncated.
$lockfp = @fopen($cachefile, "a");
if (filesize($cachefile) && flock($lockfp, LOCK_SH | LOCK_NB)) {
  // The cache exists and we hold a shared lock: serve it directly.
  readfile($cachefile);
  flock($lockfp, LOCK_UN);
  exit;
}
else if (flock($lockfp, LOCK_EX | LOCK_NB)) {
  // We are the writer: regenerate the page and capture it.
  $cachefp = fopen($cachefile, "w");
  ob_start();
}
?>
<HTML>
<BODY>
<!-- Cacheable for a day -->
Today is <?= strftime("%A, %B %e %Y") ?>
</BODY>
</HTML>
<?php
if ($cachefp) {
  $file = ob_get_contents();
  fwrite($cachefp, $file);
  fclose($cachefp);
  // Downgrade the exclusive lock to a shared one now that the write is done.
  flock($lockfp, LOCK_SH | LOCK_NB);
  ob_end_flush();
}
fclose($lockfp);
?>

This example is a bit convoluted, but let's look at what is happening.

First, you open the cache file in append (a) mode and acquire a nonblocking shared lock on it. Nonblocking (option LOCK_NB) means that the operation will return immediately if the lock cannot be taken. If you did not specify this option, the script would simply pause at that point until the lock became available. Shared (LOCK_SH) means that you are willing to share the lock with other processes that also have the LOCK_SH lock. In contrast, an exclusive lock (LOCK_EX) means that no other locks, exclusive or shared, can be held simultaneously. Usually you use shared locks to provide access for readers because it is okay if multiple processes read the file at the same time. You use exclusive locks for writing because (unless extensive precautions are taken) it is unsafe for multiple processes to write to a file at once or for a process to read from a file while another is writing to it.
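
As an aside, flock() takes an optional third argument that is set when a nonblocking attempt fails because another process holds the lock (a minimal sketch, separate from the example above):

<?php
$fp = fopen("first.cache", "a");
if (!flock($fp, LOCK_EX | LOCK_NB, $wouldblock)) {
    if ($wouldblock) {
        // Another process holds the lock; generate the page
        // dynamically instead of waiting for it.
    }
}
?>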

If the cache file has nonzero length and the lock succeeds, you know the cache file exists, so you call readfile() to output its contents. You could also use include() on the file, but that would cause any literal PHP code in the cache file to be executed (readfile() just dumps the file to the output buffer). Depending on what you are trying to do, this might or might not be desirable. You should play it safe here and call readfile().

If you fail this check, you acquire an exclusive lock on the file. You can use a nonblocking operation in case someone has beaten you to this point. If you acquire the lock, you can open the cache file for writing and start output buffering.

When you complete the request, you write the buffer to the cache file. If you somehow missed both the shared reader lock and the exclusive writer lock, you simply generate the page and quit.

Advisory file locks work well, but there are a few reasons to consider not using them:

  • If your files reside on NFS (the Unix Network File System), flock is not guaranteed to work at all.

  • Certain operating systems (Windows, for example) implement flock() on a process level, so multithreaded applications might not correctly lock between threads. (This is mainly a concern with the ISAPI Server Abstraction API (SAPI), the PHP SAPI for Microsoft's IIS Web server.)

  • By acquiring a nonblocking lock, you cause any request to the page while the cache file is being written to do a full dynamic generation of the page. If the generation is expensive, a spike occurs in resource usage whenever the cache is refreshed. Acquiring a blocking lock can reduce the system load during regeneration, but it causes all pages to hang while the page is being generated.

  • Writing directly to the cache file can result in partial cache files being created if an unforeseen event occurs (for example, if the process performing the write crashes or times out). Partial files are still served (the reader process has no way of knowing whether an unlocked cache file is complete), rendering the page corrupted.

  • On paper, advisory locks are guaranteed to be released when the process holding them exits. In practice, many operating systems have had bugs that, under certain rare circumstances, could cause locks not to be released on process death. Many of the PHP SAPIs (including mod_php, the traditional way of running PHP on Apache) are not single-request execution architectures. This means that if you leave a lock lying around at request shutdown, the lock will continue to exist until the process running that script exits, which may be hours or days later. This could result in an interminable deadlock. I've never experienced one of these bugs personally; your mileage may vary.

File swaps work by taking advantage of a nuance mentioned earlier in this chapter. When you use unlink() on a file, what really happens is that the filename-to-inode mapping is removed. The filename no longer exists, but the storage associated with it remains unchanged (for the moment), and it still has the same inode associated with it. In fact, the operating system does not reallocate that space until all open file handles on that inode are closed. This means that any processes that are reading from that file while it is unlinked are not interrupted; they simply continue to read from the old file data. When the last of the processes holding an open descriptor on that inode closes, the space allocated for that inode is released back for reuse.

After the file is removed, you can reopen a new file with the same name. Even though the name is identical, the operating system does not connect this new file with the old inode, and it allocates new storage for the file. Thus you have all the elements necessary to preserve integrity while updating the file.
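
A standalone demonstration of this behavior on a Unix-like filesystem (/tmp/swapdemo is an arbitrary name chosen for illustration):

<?php
$path = "/tmp/swapdemo";

$fp = fopen($path, "w");
fwrite($fp, "old contents\n");
fclose($fp);

$reader = fopen($path, "r");   // holds the original inode open
unlink($path);                 // the name is gone; the storage is not

$fp = fopen($path, "w");       // same name, brand-new inode
fwrite($fp, "new contents\n");
fclose($fp);

echo fgets($reader);           // still prints "old contents"
fclose($reader);               // now the old storage is released
?>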

Converting the locking example to a swapping implementation is simple:

<?php
$cachefile = "{$_SERVER['PHP_SELF']}.cache";
if (file_exists($cachefile)) {
  // The cache file is only ever renamed into place, so if it
  // exists it is complete and safe to include.
  include($cachefile);
  return;
}
else {
  // Write to a private temporary file; the pid makes the name unique.
  $cachefile_tmp = $cachefile.".".getmypid();
  $cachefp = fopen($cachefile_tmp, "w");
  ob_start();
}
?>
<HTML>
<BODY>
<!-- Cacheable for a day -->
Today is <?= strftime("%A, %B %e %Y") ?>
</BODY>
</HTML>
<?php
if ($cachefp) {
  $file = ob_get_contents();
  fwrite($cachefp, $file);
  fclose($cachefp);
  // Atomically swap the finished file into place.
  rename($cachefile_tmp, $cachefile);
  ob_end_flush();
}
?>

Because you are never writing directly to the cache file, you know that if it exists, it must be complete, so you can unconditionally include it in that case. If the cache file does not exist, you need to create it yourself. You open a temporary file that has the process ID of the process appended to its name:

$cachefile_tmp = $cachefile.".".getmypid();

Only one process can have a given process ID at any one time, so you are guaranteed that the filename is unique. (If you are doing this over NFS or on another networked filesystem, you have to take some additional steps. You'll learn more on that later in this chapter.) You open your private temporary file and turn output buffering on. Then you generate the entire page, write the contents of the output buffer to the temporary cache file, and rename the temporary cache file as the "true" cache file. If more than one process does this simultaneously, the last one wins, which is fine in this case.

You should always make sure that your temporary cache file is on the same filesystem as the ultimate cache target. The rename() function performs atomic moves when the source and destination are on the same filesystem, meaning that the operation is instantaneous. No copy occurs; the directory entry for the destination is simply updated with the inode of the source file. This results in rename() being a single operation in the kernel. In contrast, when you use rename() on a file across filesystems, the system must actually copy the entire file from one location to the other. You've already seen why copying cache files is a dangerous business.
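
A simple way to guarantee this is to create the temporary file in the cache file's own directory, for example with tempnam() (a sketch; $content stands in for the buffered page):

<?php
// Create the temp file next to its final destination so that
// rename() is an atomic, same-filesystem move, never a copy.
$tmp = tempnam(dirname($cachefile), "tmpcache");
$fp = fopen($tmp, "w");
fwrite($fp, $content);
fclose($fp);
rename($tmp, $cachefile);
?>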

These are the benefits of using this methodology:

  • The code is much shorter and incurs fewer system calls (and is thus generally faster).

  • Because you never modify the true cache file directly, you eliminate the possibility of writing a partial or corrupted cache file.

  • It works on network filesystems (with a bit of finessing).

The major drawback of this method is that you still have resource usage peaks while the cache file is being rewritten. (If the cache file is missing, everyone requesting it dynamically generates content until someone has created a fresh cached copy.) There are some tricks for getting around this, though, and we will examine them later in this chapter.

