Integrating Caching into Application Code

Now that you have a whole toolbox of caching techniques, you need to integrate them into your application. As with a real-world toolbox, it's often up to the programmer to choose the right tool. Use a nail or use a screw? Circular saw or hand saw? File-based cache or DBM-based cache? Sometimes the answer is clear, but often it's just a matter of choice.

With so many different caching strategies available, the best way to select the appropriate one is through benchmarking the different alternatives. This section takes a real-world approach by considering some practical examples and then trying to build a solution that makes sense for each of them.

A number of the following examples use the file-swapping method described earlier in this chapter, in the section "Flat-File Caches." The code there is pretty ad hoc, and you need to wrap it into a Cache_File class (to complement the Cache_DBM class) to make your life easier:

<?php
class Cache_File {
  protected $filename;
  protected $tempfilename;
  protected $expiration;
  protected $fp;
  public function __construct($filename, $expiration=false) {
    $this->filename = $filename;
    $this->tempfilename = "$filename.".getmypid();
    $this->expiration = $expiration;
  }
  public function put($buffer) {
    if(($this->fp = fopen($this->tempfilename, "w")) == false) {
      return false;
    }
    fwrite($this->fp, $buffer);
    fclose($this->fp);
    rename($this->tempfilename, $this->filename);
    return true;
  }
  public function get() {
    if($this->expiration) {
      $stat = @stat($this->filename);
      if($stat && $stat[9]) {
        $modified = $stat[9];   // index 9 of stat() is the modification time
        if(time() > $modified + $this->expiration) {
          unlink($this->filename);
          return false;
        }
      }
    }
    return @file_get_contents($this->filename);
  }
  public function remove() {
    @unlink($this->filename);
  }
}
?>

Cache_File is similar to Cache_DBM. You have a constructor to which you pass the name of the cache file and an optional expiration. You have a get() method that performs expiration validation (if an expiration time is set) and returns the contents of the cache file. The put() method takes a buffer of information and writes it to a temporary cache file; then it swaps that temporary file in for the final file. The remove() method destroys the cache file.
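The swap that put() performs can be tried in isolation. The following is a minimal sketch, not part of Cache_File: the function name atomic_write() and the demo path under the system temp directory are assumptions made for illustration. It writes to a per-process temporary file and then renames it over the target, so a reader sees either the old copy or the new one, never a half-written file (this relies on both paths being on the same filesystem).

```php
<?php
// Hypothetical helper: write $buffer to $filename using the write-then-rename swap.
function atomic_write($filename, $buffer) {
  // Per-process temp name, as in Cache_File, so concurrent writers do not collide.
  $tempfilename = $filename . '.' . getmypid();
  if (file_put_contents($tempfilename, $buffer) === false) {
    return false;
  }
  // rename() puts the finished file in place in a single step.
  if (!rename($tempfilename, $filename)) {
    @unlink($tempfilename);
    return false;
  }
  return true;
}

// Usage: the path below is an example location only.
$file = sys_get_temp_dir() . '/atomic_write_demo.cache';
atomic_write($file, "first version");
atomic_write($file, "second version");
$contents = file_get_contents($file);
unlink($file);
```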

Often you use this type of cache to store the contents of a page from an output buffer, so you can add two convenience methods, begin() and end(), in lieu of put() to capture output to the cache:

public function begin() {
  if(($this->fp = fopen($this->tempfilename, "w")) == false) {
      return false;
  }
  ob_start();
}
public function end() {
  $buffer = ob_get_contents();
  ob_end_flush();
  if(strlen($buffer)) {
    fwrite($this->fp, $buffer);
    fclose($this->fp);
    rename($this->tempfilename, $this->filename);
    return true;
  }
  else {
    fclose($this->fp);
    unlink($this->tempfilename);
    return false;
  }
}

To use these functions to cache output, you call begin() before the output and end() at the end:

<?php
  require_once 'Cache/File.inc';
  $cache = new Cache_File("/data/cache/files/016/index.cache");
  if($text = $cache->get()) {
    print $text;
  }
  else {
    $cache->begin();
?>
<?php
  // do page generation here
?>
<?php
    $cache->end();
  }
?>
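What begin() and end() do with the output buffer can be seen in a few lines. This is a sketch under assumptions, not the class itself: capture_and_send() is a hypothetical helper, and the closure stands in for the page-generation code. The output is still sent to the client, and a copy is returned so that it can be written to the cache.

```php
<?php
// Hypothetical helper mirroring begin()/end(): run $generate, pass its output
// through to the client, and also return it so it can be cached.
function capture_and_send(callable $generate) {
  ob_start();                    // begin(): start buffering output
  $generate();                   // page generation prints as usual
  $buffer = ob_get_contents();   // end(): copy what was printed ...
  ob_end_flush();                // ... and still send it to the client
  return $buffer;
}

$captured = capture_and_send(function () {
  echo "<p>Hello from the page body</p>";
});
```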

Caching Home Pages

This section explores how you might apply caching techniques to a Web site that allows users to register open-source projects and create personal pages for them (think pear.php.net or www.freshmeat.net). This site gets a lot of traffic, so you would like to use caching techniques to speed the page loads and take the strain off the database.

This design requirement is very common; the Web representation of items within a store, entries within a Web log, sites with member personal pages, and online details for financial stocks all often require a similar templatization. For example, my company allows for all its employees to create their own templatized home pages as part of the company site. To keep things consistent, each employee is allowed certain customizable data (a personal message and resume) that is combined with other predetermined personal information (fixed biographic data) and nonpersonalized information (the company header, footer, and navigation bars).

You need to start with a basic project page. Each project has some basic information about it, like this:

class Project {
  // attributes of the project
  public $name;
  public $projectid;
  public $short_description;
  public $author;
  public $long_description;
  public $file_url;

The class constructor takes an optional name. If a name is provided, the constructor attempts to load the details for that project. If the constructor fails to find a project by that name, it raises an exception. Here it is:

  public function __construct($name=false) {
    if($name) {
      $this->_fetch($name);
    }
  }

And here is the rest of Project:

  protected function _fetch($name) {
    $dbh = new DB_Mysql_Test;
    $cur = $dbh->prepare("
      SELECT
        *
      FROM
        projects
      WHERE
        name = :1");
    $cur->execute($name);
    $row = $cur->fetch_assoc();
    if($row) {
      $this->name = $name;
      $this->short_description = $row['short_description'];
      $this->author = $row['author'];
      $this->long_description = $row['long_description'];
      $this->file_url = $row['file_url'];
    }
    else {
      throw new Exception;
    }
  }
}

You can use a store() method for saving any changes to a project back to the database:

  public function store() {
    $dbh = new DB_Mysql_Test();
    $cur = $dbh->prepare("
      UPDATE
        projects
      SET
        short_description = :1,
        author = :2,
        long_description = :3,
        file_url = :4
      WHERE
        name = :5");
    $cur->execute($this->short_description,
                $this->author,
                $this->long_description,
                $this->file_url,
                $this->name);
  }
}

Because you are writing out cache files, you need to know where to put them. You can create a place for them by using the global configuration variable $CACHEBASE, which specifies the top-level directory into which you will place all your cache files.

Alternatively, you could create a global singleton Config class that will contain all your configuration parameters. In Project, you add a static method get_cachefile() to generate the path to the cache file for a specific project:

public static function get_cachefile($name) {
  global $CACHEBASE;
  return "$CACHEBASE/projects/$name.cache";
}
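The singleton Config alternative mentioned above might look something like the following. Only the class name comes from the text; the instance(), get(), and set() methods, the 'cachebase' key, and the project_cachefile() function are assumptions for illustration.

```php
<?php
// Hypothetical Config singleton holding configuration parameters.
class Config {
  private static $instance = null;
  private $params = array();

  // Private constructor: instances are only created through instance().
  private function __construct() {}

  public static function instance() {
    if (self::$instance === null) {
      self::$instance = new Config();
    }
    return self::$instance;
  }
  public function set($key, $value) {
    $this->params[$key] = $value;
  }
  public function get($key, $default = null) {
    return array_key_exists($key, $this->params) ? $this->params[$key] : $default;
  }
}

// Same path layout as get_cachefile(), but without the global variable.
function project_cachefile($name) {
  $base = Config::instance()->get('cachebase');
  return "$base/projects/$name.cache";
}

Config::instance()->set('cachebase', '/data/cache');
$path = project_cachefile('ProjectFoo');
```

The trade-off is mostly one of taste: the singleton keeps configuration out of the global namespace, while $CACHEBASE is shorter to write.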

The project page itself is a template in which you fit the project details. This way you have a consistent look and feel across the site. You pass the project name into the page as a GET parameter (the URL will look like http://www.example.com/project.php?name=ProjectFoo) and then assemble the page:

<?php
  require 'Project.inc';
  try {
    $name = $_GET['name'];
    if(!$name) {
      throw new Exception();
    }
    $project = new Project($name);
  }
  catch (Exception $e) {
    // If I fail for any reason, I will send people here
    header("Location: /index.php");
    return;
  }
?>

<html>
<title><?= $project->name ?></title>
<body>
<!-- boilerplate text -->
<table>
  <tr>
    <td>Author:</td><td><?= $project->author ?>
  </tr>
  <tr>
    <td>Summary:</td><td><?= $project->short_description ?>
  </tr>
  <tr>
    <td>Availability:</td>
    <td><a href="<?= $project->file_url ?>">click here</a></td>
  </tr>
  <tr>
    <td><?= $project->long_description ?></td>
  </tr>
</table>
</body>
</html>

You also need a page where authors can edit their pages:

<?php
  require_once 'Project.inc';
  $name = $_REQUEST['name'];
  $project = new Project($name);
  if(array_key_exists("posted", $_POST)) {
    $project->author = $_POST['author'];
    $project->short_description = $_POST['short_description'];
    $project->file_url = $_POST['file_url'];
    $project->long_description = $_POST['long_description'];
    $project->store();
  }
?>
<html>
<title>Project Page Editor for <?= $project->name ?> </title>
<body>
<form name="editproject" method="POST">
<input type ="hidden" name="name" value="<?= $name ?>">
<table>
  <tr>
    <td>Author:</td>
    <td><input type="text" name=author value="<?= $project->author ?>" ></td>
  </tr>
  <tr>
    <td>Summary:</td>
    <td>
    <input type="text"
           name=short_description
           value="<?= $project->short_description ?>">
    </td>
  </tr>
  <tr>
    <td>Availability:</td>
    <td><input type="text" name=file_url value="<?= $project->file_url?>"></td>
  </tr>
  <tr>
    <td colspan=2>
      <TEXTAREA name="long_description" rows="20" cols="80"><?= $project->long_description ?></TEXTAREA>
    </td>
  </tr>
</table>
<input type=submit name=posted value="Edit content">
</form>
</body>
</html>

The first caching implementation is a direct application of the class Cache_File you developed earlier:

<?php
  require_once 'Cache/File.inc';
  require_once 'Project.inc';
  try {
    $name = $_GET['name'];
    if(!$name) {
      throw new Exception();
    }
    $cache = new Cache_File(Project::get_cachefile($name));
    if($text = $cache->get()) {
      print $text;
      return;
    }
    $project = new Project($name);
  }
  catch (Exception $e) {
    // if I fail, I should go here
    header("Location: /index.php");
    return;
  }
  $cache->begin();
?>

<html>
<title><?= $project->name ?></title>
<body>
<!-- boilerplate text -->
<table>
  <tr>
    <td>Author:</td><td><?= $project->author ?>
  </tr>
  <tr>
    <td>Summary:</td><td><?= $project->short_description ?>
  </tr>
  <tr>
    <td>Availability:</td>
    <td><a href="<?= $project->file_url ?>">click here</a></td>
  </tr>
  <tr>
    <td><?= $project->long_description ?></td>
  </tr>
</table>
</body>
</html>
<?php
  $cache->end();
?>

To this point, you've provided no expiration logic, so the cached copy will never get updated, which is not really what you want. You could add an expiration time to the page, causing it to auto-renew after a certain period of time, but that is not an optimal solution. It does not directly address your needs. The cached data for a project will in fact remain forever valid until someone changes it. What you would like to have happen is for it to remain valid until one of two things happens:

  • The page template needs to be changed

  • An author updates the project data

The first case can be handled manually. If you need to update the templates, you can change the template code in project.php and remove all the cache files. Then, when a new request comes in, the page will be recached with the correct template.

The second case you can handle by implementing cache-on-write in the editing page. An author can change the page text only by going through the edit page. When the changes are submitted, you can simply unlink the cache file. Then the next request for that project will cause the cache to be generated. The changes to the edit page are minimal; only three lines are added:

<?php
  require_once 'Cache/File.inc';
  require_once 'Project.inc';
  $name = $_REQUEST['name'];
  $project = new Project($name);
  if(array_key_exists("posted", $_POST)) {
    $project->author = $_POST['author'];
    $project->short_description = $_POST['short_description'];
    $project->file_url = $_POST['file_url'];
    $project->long_description = $_POST['long_description'];
    $project->store();

    // remove our cache file
    $cache = new Cache_File(Project::get_cachefile($name));
    $cache->remove();
  }
?>

When you remove the cache file, the next user request to the page will fail the cache hit on project.php and cause a recache. This can result in a momentary peak in resource utilization as the cache files are regenerated. In fact, as discussed earlier in this section, concurrent requests for the page will all generate dynamic copies in parallel until one finishes and caches a copy.

If the project pages are heavily accessed, you might prefer to proactively cache the page. You would do this by recaching it instead of unlinking it on the edit page. Then there is no worry of contention. One drawback of the proactive method is that it works poorly if you have to regenerate a large number of cache files. Proactively recaching 100,000 cache files may take minutes or hours, whereas a simple unlink of the cache backing is much faster. The proactive caching method is effective for pages that have a high cache hit rate. It is often not worthwhile if the cache hit rate is low, if there is limited storage for cache files, or if a large number of cache files need to be invalidated simultaneously.

Recaching all your pages can be expensive, so you could alternatively take a pessimistic approach to regeneration and simply remove the cache file. The next time the page is requested, the cache request will fail, and the cache will be regenerated with current data. For applications where you have thousands or hundreds of thousands of cached pages, the pessimistic approach allows cache generation to be spread over a longer period of time and allows for "fast" invalidation of elements of the cache.
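The two invalidation strategies can be put side by side. This is a rough sketch, not code from the Project or Cache_File classes: invalidate_cache(), recache(), the $generate callback, and the demo path are all hypothetical names chosen for illustration.

```php
<?php
// Pessimistic invalidation: drop the file and let the next request rebuild it.
function invalidate_cache($cachefile) {
  return @unlink($cachefile);
}

// Proactive invalidation: rebuild now and swap the new copy into place.
// $generate is a hypothetical callback that returns the page as a string.
function recache($cachefile, callable $generate) {
  $tempfile = $cachefile . '.' . getmypid();
  if (file_put_contents($tempfile, $generate()) === false) {
    return false;
  }
  return rename($tempfile, $cachefile);
}

$file = sys_get_temp_dir() . '/invalidate_demo.cache';
file_put_contents($file, 'old page');

recache($file, function () { return 'new page'; });
$after_recache = file_get_contents($file);   // a cached copy is always present

invalidate_cache($file);
$after_invalidate = file_exists($file);      // next request must regenerate
```

With recache(), the cost of generation is paid at edit time; with invalidate_cache(), it is deferred to whichever request arrives next.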

There are two drawbacks to the general approach so far, one mainly cosmetic and the other mainly technical:

  • The URL http://example.com/project.php?name=myproject is less appealing than http://example.com/project/myproject.html. This is not entirely a cosmetic issue.

  • You still have to run the PHP interpreter to display the cached page. In fact, not only do you need to start the interpreter to parse and execute project.php, you also must then open and read the cache file. When the page is cached, it is entirely static, so hopefully you can avoid that overhead as well.

You could simply write the cache file out like this:

/www/htdocs/projects/myproject.html

This way, it could be accessed directly by name from the Web; but if you do this, you lose the ability to have transparent regeneration. Indeed, if you remove the cache file, any requests for it will return a "404 Object Not Found" response. This is not a problem if the page is only changed from the user edit page (because that now does cache-on-write); but if you ever need to update all the pages at once, you will be in deep trouble.

Using Apache's mod_rewrite for Smarter Caching

If you are running PHP with Apache, you can use the very versatile mod_rewrite so that you can cache completely static HTML files while still maintaining transparent regeneration.

If you run Apache and have not looked at mod_rewrite before, put down this book and go read about it. Links are provided at the end of the chapter. mod_rewrite is very, very cool.

mod_rewrite is a URL-rewriting engine that hooks into Apache and allows rule-based rewriting of URLs. It supports a large range of features, including the following:

  • Internal redirects, which change the URL served back to the client completely internally to Apache (and completely transparently)

  • External redirects

  • Proxy requests (in conjunction with mod_proxy)

It would be easy to write an entire book on the ways mod_rewrite can be used. Unfortunately, we have little time for it here, so this section explores its configuration only enough to address your specific problem.

You want to be able to write the project.php cache files as full HTML files inside the document root to the path /www/htdocs/projects/ProjectFoo.html. Then people can access the ProjectFoo home page simply by going to the URL http://www.example.com/projects/ProjectFoo.html. Writing the cache file to that location is easy; you simply need to modify Project::get_cachefile() as follows:

public static function get_cachefile($name) {
  $cachedir = "/www/htdocs/projects";
  return "$cachedir/$name.html";
}

The problem, as noted earlier, is what to do if this file is not there. mod_rewrite provides the answer. You can set up a mod_rewrite rule that says "if the cache file does not exist, redirect me to a page that will generate the cache and return the contents." Sound simple? It is.

First you write the mod_rewrite rule:

<Directory /www/htdocs/projects>
RewriteEngine On
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^(.*)\.html$ /generate_project.php?name=$1 [L]
</Directory>

Because we've written all the cache files in the projects directory, you can turn on the rewriting engine there by using RewriteEngine On. Note that <Directory> takes a filesystem path, not a URL path. Then you use the RewriteCond rule to set the condition for the rewrite:

%{REQUEST_FILENAME} !-f

Inside a <Directory> block, %{REQUEST_FILENAME} is the full filesystem path that the request maps to, so this means that if that path is not a regular file, the condition succeeds. So if /www/htdocs/projects/ProjectFoo.html does not exist, you move on to the rewrite:

RewriteRule ^(.*)\.html$ /generate_project.php?name=$1 [L]

In per-directory context, the directory prefix is stripped before matching, so this tries to match the remainder of the request URI (ProjectFoo.html for the request /projects/ProjectFoo.html) against the following regular expression:

^(.*)\.html$

This stores the match in the parentheses as $1 (in this case, ProjectFoo). The dot is escaped so that it matches only a literal period. If this match succeeds, an internal redirect (which is completely transparent to the end client) is created, transforming the URI to be served into /generate_project.php?name=$1 (in this case, /generate_project.php?name=ProjectFoo).
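If you want to experiment with the logic before touching the server configuration, the decision the rule makes can be emulated in PHP. This only illustrates the three steps (file test, capture, substitution) and is not how Apache implements them; rewrite_target() and the docroot argument are hypothetical, and because PHP sees the full URI here, the pattern includes the /projects/ prefix.

```php
<?php
// Hypothetical emulation of the rewrite decision, for illustration only.
function rewrite_target($uri, $docroot) {
  // Condition: leave the request alone if the static file already exists.
  if (is_file($docroot . $uri)) {
    return $uri;
  }
  // Rule: capture the project name and substitute it into the generator URL.
  if (preg_match('#^/projects/(.*)\.html$#', $uri, $m)) {
    return '/generate_project.php?name=' . $m[1];
  }
  return $uri;
}

// With no cached file on disk, the request is routed to the generator.
$target = rewrite_target('/projects/ProjectFoo.html', '/nonexistent-docroot');
```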

All that is left now is generate_project.php. Fortunately, this is almost identical to the original project.php page, but it should unconditionally cache the output of the page. Here's how it looks:

<?php
  require 'Cache/File.inc';
  require 'Project.inc';
  try {
    $name = $_GET['name'];
    if(!$name) {
      throw new Exception;
    }
    $project = new Project($name);
  }
  catch (Exception $e) {
    // if I fail, I should go here
    header("Location: /index.php");
    return;
  }
  $cache = new Cache_File(Project::get_cachefile($name));
  $cache->begin();
?>

<html>
<title><?= $project->name ?></title>
<body>
<!-- boilerplate text -->
<table>
  <tr>
    <td>Author:</td><td><?= $project->author ?>
  </tr>
  <tr>
    <td>Summary:</td><td><?= $project->short_description ?>
  </tr>
  <tr>
    <td>Availability:</td>
    <td><a href="<?= $project->file_url ?>">click here</a></td>
  </tr>
  <tr>
    <td><?= $project->long_description ?></td>
  </tr>
</table>
</body>
</html>
<?php
  $cache->end();
?>

An alternative to using mod_rewrite is to use Apache's built-in support for custom error pages via the ErrorDocument directive. To set this up, you replace your rewrite rules in your httpd.conf with this directive:

ErrorDocument 404 /generate_project.php

This tells Apache that whenever a 404 error is generated (for example, when a requested document does not exist), it should internally redirect the user to /generate_project.php. This is designed to allow a Web master to return custom error pages when a document isn't found. An alternative use, though, is to replace the functionality that the rewrite rules provided.

After you add the ErrorDocument directive to your httpd.conf file, the top block of generate_project.php needs to be changed to use $_SERVER['REQUEST_URI'] instead of having $name passed in as a $_GET[] parameter. Your generate_project.php now looks like this:

<?php
  require 'Cache/File.inc';
  require 'Project.inc';
  try {
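    // REQUEST_URI holds the original path, such as /projects/ProjectFoo.html,
    // so strip the directory and the .html suffix to recover the project name
    $name = basename(parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH), '.html');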
    if(!$name) {
      throw new Exception;
    }
    $project = new Project($name);
  }
  catch (Exception $e) {
    // if I fail, I should go here
    header("Location: /index.php");
    return;
  }
  $cache = new Cache_File(Project::get_cachefile($name));
  $cache->begin();
?>

Otherwise, the behavior is just as it would be with the mod_rewrite rule.

Using ErrorDocument handlers for generating static content on-the-fly is very useful if you do not have control over your server and cannot ensure that it has mod_rewrite available. Assuming that I control my own server, I prefer to use mod_rewrite. mod_rewrite is an extremely flexible tool, which means it is easy to apply more complex logic for cache regeneration if needed.

In addition, because the ErrorDocument handler is called, the page it generates is returned with a 404 status code. Normally a "valid" page is returned with a 200 status code, meaning the page is okay. Most browsers handle this discrepancy without any problem, but some tools do not like getting a 404 status code back for content that is valid. You can overcome this by manually setting the status line with a header() call, like this:

header("{$_SERVER['SERVER_PROTOCOL']} 200 OK");

Caching Part of a Page

Often you cannot cache an entire page but would like to be able to cache components of it. An example is the personalized navigation bar discussed earlier in this chapter, in the section "Cookie-Based Caching." In that case, you used a cookie to store the user's navigation preferences and then rendered them as follows:

<?php
$userid = $_COOKIE['MEMBERID'];
$user = new User($userid);
if(!$user->name) {
  header("Location: /login.php");
  exit;
}
$navigation = $user->get_interests();
?>
<table>
  <tr>
    <td>
      <table>
        <tr><td>
        <?= $user->name ?>'s Home
        </td></tr>
        <?php for($i=1; $i<=3; $i++) { ?>
        <tr><td>
        <!-- navigation row position <?= $i ?> -->
        <?= generate_navigation_element($navigation[$i]) ?>
        </td></tr>
        <?php } ?>
      </table>
    </td>
    <td>
      <!-- page body (static content identical for all users) -->
    </td>
  </tr>
</table>

Now you want to cache the output of generate_navigation_element(). Caching the results of small page components is simple. First, you need to write generate_navigation_element(). Recall the values of $navigation, which holds topic-subtopic pairs such as sports-football, weather-21046, project-Foobar, and news-global. You can implement generate_navigation_element() as a dispatcher that calls out to an appropriate content-generation function based on the topic passed, as follows:

<?php
function generate_navigation_element($tag) {
  list($topic, $subtopic) = explode('-', $tag, 2);
  if(function_exists("generate_navigation_$topic")) {
    return call_user_func("generate_navigation_$topic", $subtopic);
  }
  else {
    return 'unknown';
  }
}
?>
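To see the dispatch in action, here is a small self-contained sketch. The dispatcher is repeated so that the example runs on its own; generate_navigation_news() is a hypothetical stub, not a function from this chapter.

```php
<?php
// Dispatcher as described above, repeated so the example is self-contained.
function generate_navigation_element($tag) {
  list($topic, $subtopic) = explode('-', $tag, 2);
  if (function_exists("generate_navigation_$topic")) {
    return call_user_func("generate_navigation_$topic", $subtopic);
  }
  return 'unknown';
}

// Hypothetical stub generator for the "news" topic.
function generate_navigation_news($subtopic) {
  return "Headlines: $subtopic";
}

$known   = generate_navigation_element('news-global');    // handled by the stub
$unknown = generate_navigation_element('horoscope-leo');  // no generator defined
```

Adding a new topic then requires only defining another generate_navigation_<topic>() function; the dispatcher itself does not change.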

A generation function for a project summary looks like this:

<?php
require_once 'Project.inc';
function generate_navigation_project($name) {
  try {
    if(!$name) {
      throw new Exception();
    }
    $project = new Project($name);
  }
  catch (Exception $e){
    return 'unknown project';
  }
  ?>
<table>
  <tr>
    <td>Author:</td><td><?= $project->author ?>
  </tr>
  <tr>
    <td>Summary:</td><td><?= $project->short_description ?>
  </tr>
  <tr>
    <td>Availability:</td>
    <td><a href="<?= $project->file_url ?>">click here</a></td>
  </tr>
  <tr>
    <td><?= $project->long_description ?></td>
  </tr>
</table>
  <?php
}
?>

This looks almost exactly like your first attempt for caching the entire project page, and in fact you can use the same caching strategy you applied there. The only change you should make is to alter the get_cachefile function in order to avoid colliding with cache files from the full page:

<?php
require_once 'Project.inc';
function generate_navigation_project($name) {
  try {
    if(!$name) {
      throw new Exception;
    }
    $cache = new Cache_File(Project::get_cachefile_nav($name));
    if($text = $cache->get()) {
      print $text;
      return;
    }
    $project = new Project($name);
    $cache->begin();
  }
  catch (Exception $e){
    return 'unknown project';
  }
?>
<table>
  <tr>
    <td>Author:</td><td><?= $project->author ?>
  </tr>
  <tr>
    <td>Summary:</td><td><?= $project->short_description ?>
  </tr>
  <tr>
    <td>Availability:</td>
    <td><a href="<?= $project->file_url ?>">click here</a></td>
  </tr>
  <tr>
    <td><?= $project->long_description ?></td>
  </tr>
</table>
<?php
    $cache->end();
}
?>

And in Project.inc you add this:

public static function get_cachefile_nav($name) {
  global $CACHEBASE;
  return "$CACHEBASE/projects/nav/$name.cache";
}

It's as simple as that!

Implementing a Query Cache

Now you need to tackle the weather element of the navigation bar you've been working with. You can use the Simple Object Access Protocol (SOAP) interface at xmethods.net to retrieve real-time weather statistics by ZIP code. Don't worry if you have not seen SOAP requests in PHP before; we'll discuss them in depth in Chapter 16, "RPC: Interacting with Remote Services." generate_navigation_weather() creates a Weather object for the specified ZIP code and then invokes some SOAP magic to return the temperature in that location:

<?php
include_once 'SOAP/Client.php';
class Weather {
  public $temp;
  public $zipcode;
  private $wsdl;
  private $soapclient;

  public function __construct($zipcode) {
    $this->zipcode = $zipcode;
    $this->_get_temp($zipcode);
  }

  private function _get_temp($zipcode) {
    if(!$this->soapclient) {
      $query = "http://www.xmethods.net/sd/2001/TemperatureService.wsdl";
      $wsdl = new SOAP_WSDL($query);
      $this->soapclient = $wsdl->getProxy();
    }
    $this->temp = $this->soapclient->getTemp($zipcode);
  }
}

function generate_navigation_weather($zip) {
  $weather = new Weather($zip);
?>
The current temp in <?= $weather->zipcode ?>
is <?= $weather->temp ?> degrees Fahrenheit
<?php
}

RPCs of any kind tend to be slow, so you would like to cache the weather report for a while before invoking the call again. You could simply apply the techniques used in Project and cache the output of generate_navigation_weather() in a flat file. That method would work fine, but it would create a separate tiny file for every ZIP code.

An alternative is to use a DBM cache and store a record for each ZIP code. To insert the logic to use the Cache_DBM class that you implemented earlier in this chapter requires only a few lines in _get_temp:

private function _get_temp($zipcode) {
  $dbm = new Cache_DBM(Weather::get_cachefile(), 3600);
  if($temp = $dbm->get($zipcode)) {
    $this->temp = $temp;
    return;
  }
  else {
    if(!$this->soapclient) {
      $url = "http://www.xmethods.net/sd/2001/TemperatureService.wsdl";
      $wsdl = new SOAP_WSDL($url);
      $this->soapclient = $wsdl->getProxy();
    }
    $this->temp = $this->soapclient->getTemp($zipcode);
    $dbm->put($zipcode, $this->temp);
  }
}

public static function get_cachefile() {
  global $CACHEBASE;
  return "$CACHEBASE/Weather.dbm";
}

Now when you construct a Weather object, you first look in the DBM file to see whether you have a valid cached temperature value. You initialize the wrapper with an expiration time of 3,600 seconds (1 hour) to ensure that the temperature data does not get too old. Then you perform the standard logic "if it's cached, return it; if not, generate it, cache it, and return it."
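The "if it's cached, return it; if not, generate it, cache it, and return it" logic is independent of the storage behind it. The following sketch shows the pattern with an in-memory array standing in for Cache_DBM; cached_fetch(), the array layout, and the $lookup closure (which stands in for the SOAP call) are assumptions made for illustration.

```php
<?php
// Hypothetical in-memory stand-in for Cache_DBM, used only to show the pattern.
// $cache maps key => array('value' => ..., 'stored' => timestamp).
function cached_fetch(array &$cache, $key, $ttl, callable $fetch) {
  $now = time();
  if (isset($cache[$key]) && $now - $cache[$key]['stored'] < $ttl) {
    return $cache[$key]['value'];          // cached and still fresh
  }
  $value = $fetch($key);                   // slow path, such as the SOAP call
  $cache[$key] = array('value' => $value, 'stored' => $now);
  return $value;
}

$calls = 0;
$lookup = function ($zipcode) use (&$calls) {
  $calls++;
  return 0;   // a temperature of 0 is a legitimate value to cache
};

$cache = array();
$first  = cached_fetch($cache, '21046', 3600, $lookup);
$second = cached_fetch($cache, '21046', 3600, $lookup);
```

Note that the sketch tests for the presence of an entry rather than the truthiness of the value. A check of the form if($temp = $dbm->get($zipcode)) would treat a cached temperature of 0 as a miss, assuming get() returns the stored value directly.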

