
DBM-Based Caching

A frequently overlooked storage medium is DBM files. Often relegated to the role of "poor man's database," DBM files are extremely fast and are designed to provide high-speed, concurrent read/write access to local data. DBM file caches excel over flat-file caches in that they are designed to store multiple pieces of data in a single file (whereas flat-file caches work best with a single piece of data per file), and they are designed to support concurrent updates (whereas with a flat-file cache you have to build the concurrency handling yourself).

Using DBM files is a good solution when you need to store specific data as key/value pairs (for example, a database query cache). In contrast with the other methods described in this chapter, DBM files work as a key/value cache out-of-the-box.

In PHP the dba (DBM database abstraction) extension provides a universal interface to a multitude of DBM libraries, including the following:

  • dbm: The original Berkeley DB file driver

  • ndbm: Once a cutting-edge replacement for dbm, now largely abandoned

  • gdbm: The GNU dbm replacement

  • Sleepycat DB versions 2-4: Not IBM's DB2, but an evolution of dbm brought about by the nice folks at Berkeley

  • cdb: A constant database library (nonupdatable) by djb of Qmail fame
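
Regardless of which back end you choose, the dba interface is the same. Here is a minimal sketch of storing and fetching a single key/value pair; it assumes the gdbm handler is compiled into your PHP build, and the file path and key are just examples:

<?php
// open (or create) a DBM file for read/write access with the gdbm handler
$dbm = dba_open("/tmp/example.dbm", "c", "gdbm");

// store a value under a key, replacing any existing entry
dba_replace("greeting", "Hello, world", $dbm);

// fetch it back; dba_fetch() returns false if the key does not exist
if(($value = dba_fetch("greeting", $dbm)) !== false) {
  print $value;
}

dba_close($dbm);
?>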

Licenses

Along with the feature set differences between these libraries, there are license differences as well. The original dbm and ndbm are BSD licensed, gdbm is licensed under the GNU General Public License (GPL), and the Sleepycat libraries have an even more restrictive GPL-style license.

License differences may not mean much to you if you are developing as a hobby, but if you are building a commercial software application, you need to be certain you understand the ramifications of the licensing on the software you use. For example, if you link against a library under the GPL, you need to make the source code of your application available to anyone you sell the application to. If you link against Sleepycat's DB4 dbm in a commercial application, you need to purchase a license from Sleepycat.


You might use a DBM file to cache some data. Say you are writing a reporting interface to track promotional offers. Each offer has a unique ID, and you have written this function:

int showConversion(int promotionID)

which finds the number of distinct users who have signed up for a given promotion. On the back end, showConversion() might look like this:

function showConversion($promotionID) {
  $db = new DB_MySQL_Test;
  // count the distinct users who signed up for this promotion
  $row = $db->execute("SELECT count(distinct(userid)) cnt
                       FROM promotions
                       WHERE promotionid = $promotionID")->fetch_assoc();
  return $row['cnt'];
}

This query is not blindingly fast, especially with the marketing folks reloading it constantly, so you would like to apply some caching.

To add caching straight to the function, you just need to open a DBM file and preferentially fetch the result from there if it exists:

function showConversion($promotionID) {
  // open (or create) the cache DBM file with a persistent handle
  $gdbm = dba_popen("promotionCounter.dbm", "c", "gdbm");
  // return the cached count if one exists
  if(($count = dba_fetch($promotionID, $gdbm)) !== false) {
    return $count;
  }
  $db = new DB_MySQL_Test;
  $row = $db->execute("SELECT count(distinct(userid)) cnt
                       FROM promotions
                       WHERE promotionid = $promotionID")->fetch_assoc();
  // cache the freshly computed count for future requests
  dba_replace($promotionID, $row['cnt'], $gdbm);
  return $row['cnt'];
}

Cache Concurrency and Coherency

A nice feature of DBM files is that concurrency support is built into them. The exact locking method is internal to the specific back end being used (or at least is not exposed to the user from PHP), but safe concurrent access is guaranteed.
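
For contrast, here is a rough sketch of the bookkeeping a flat-file cache forces on you for the same safety; it is not taken from the code in this chapter, and the cache file path and the $count variable (the value being written) are just examples:

<?php
// with a flat file, the application must serialize writers itself
$fp = fopen("/tmp/promotionCounter.cache", "a");
if($fp && flock($fp, LOCK_EX)) {   // take an exclusive lock for the update
  ftruncate($fp, 0);               // discard the old value
  fwrite($fp, $count);             // write the new one
  fflush($fp);
  flock($fp, LOCK_UN);             // release the lock
}
fclose($fp);
?>

Readers would similarly need to take a shared lock before trusting the file's contents. With dba, none of this is necessary; dba_fetch() and dba_replace() are safe to call concurrently as they stand.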

Cache Invalidation and Management

If you are an astute reader, you probably noticed the serious flaw in the caching scheme discussed earlier in this chapter, in the section "DBM-Based Caching." You have no method to invalidate the cache! The counts that you've cached will never update. While this certainly makes the results return quickly, it also renders the result useless. A good caching system strives to make its impact transparent, or at least barely noticeable.

Unlike the flat-file implementations discussed earlier in this chapter, the difficulty here is not how to update the files; the dba_replace and dba_insert functions take care of all the work for you. The issue is how to know that you should update them at all. DBM files do not carry modification times on individual rows, so how do you know whether the value available is from one second ago or one week ago?

Probably the cleverest approach I have seen to this problem is the probabilistic approach. You look at the frequency at which the data is requested and figure out how many requests you get on average before you should invalidate the cache. For example, if you receive 10 requests per second to the page where the data is displayed and you would like to cache the data for 5 minutes, you should flush the data according to the following formula:

5 minutes x (60 seconds/minute) x (10 requests/second) = 3,000 requests

Sharing a global access count between all processes is impractical: it would require storing access statistics for every row in the DBM file. That is not only complicated but slow as well, because it means writing to the DBM file (to record the access) on every read call. Instead, you can take the probabilistic approach: rather than updating exactly on the 3,000th request, you assign a 1/3,000 probability that you will update on any given request, and probabilistically you end up with the same number of refreshes over a long period of time.
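
Multiplying it out shows why the expected behavior is the same:

3,000 requests x (1/3,000 probability of refresh per request) = 1 expected refresh per 3,000 requests, or roughly one every 5 minutes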

Here is a reimplementation of showConversion() that implements probabilistic removal:

function showConversion($promotionID) {
  $gdbm = dba_popen("promotionCounter.dbm", "c", "gdbm");
  // in 2,999 out of 3,000 cases we try the cached value first;
  // the remaining 1-in-3,000 case skips the lookup and refreshes it
  if(rand(1, 3000) > 1) {
    if(($count = dba_fetch($promotionID, $gdbm)) !== false) {
      return $count;
    }
  }
  $db = new DB_MySQL_Test;
  $row = $db->execute("SELECT count(distinct(userid)) cnt
                       FROM promotions
                       WHERE promotionid = $promotionID")->fetch_assoc();
  dba_replace($promotionID, $row['cnt'], $gdbm);
  return $row['cnt'];
}

The beauty of this method is its simplicity. You cache only the data you are really interested in, and you let mathematics handle all the tough choices. The downside of this method is that it requires you to really know the access frequency of an application; making poor choices for the removal probability can result in values staying in the cache much longer than they should. This is especially true if there are lulls in traffic, which break the mathematical model. Still, it is an interesting example of thinking out-of-the-box, and it may be a valid choice if the access patterns for your data are particularly stable or as an enhancement to a deterministic process.

To implement expiration in the cache, you can wrap all the calls to it with a class that adds modification times to all the entries and performs internal expiration:

<?php
class Cache_DBM {
  private $dbm;
  private $expiration;
  function __construct($filename, $expiration=3600) {
    $this->dbm = dba_popen($filename, "c", "ndbm");
    $this->expiration = $expiration;
  }
  function put($name, $tostore) {
    // wrap the data with its insertion time so get() can expire it
    $storageobj = array('object' => $tostore, 'time' => time());
    dba_replace($name, serialize($storageobj), $this->dbm);
  }
  function get($name) {
    $stored = dba_fetch($name, $this->dbm);
    if($stored === false) {
      return false;
    }
    $getobj = unserialize($stored);
    if(time() - $getobj['time'] < $this->expiration) {
      return $getobj['object'];
    }
    else {
      // stale entry: remove it and report a miss
      dba_delete($name, $this->dbm);
      return false;
    }
  }
  function delete($name) {
    return dba_delete($name, $this->dbm);
  }
}
?>

You would use this class by constructing a new cache object:

<?php
  require_once 'Cache/DBM.inc';
  $cache = new Cache_DBM("/path/to/cachedb");
?>

This cache object calls dba_popen to open the cache DBM file (and to create it if it does not exist). The cache object also sets the expiration time to the default of 3,600 seconds (1 hour). If you wanted a different time, say 1 day, you could specify the expiration as well:

$cache = new Cache_DBM("/path/to/cachedb", 86400);

Cache storage and lookups are performed by a keyname, which you need to provide. For example, to store and then reinstantiate a Foo object, you would use this:

$foo = new Foo();
//store it
$cache->put('foo', $foo);

In the library, this creates an array that contains $foo as well as the current time and serializes it. This serialization is then stored in the cache DBM with the key foo. You have to serialize the object because a DBM file can only store strings. (Actually, it can store arbitrary contiguous binary structures, but PHP sees them as strings.) If there is existing data under the key foo, it is replaced. Some DBM drivers (DB4, for example) can support multiple data values for a given key, but PHP does not yet support this.

To get a previously stored value, you use the get() method to look up the data by key:

$obj = $cache->get('foo');

get() is a bit complicated internally. To get back a stored object, it must first be looked up by key. Then it is deserialized into its container, and the insert time is compared against the expiration time specified in the cache constructor to see whether it is stale. If it passes the expiration check, the object is returned to the user; otherwise, it is deleted from the cache and false is returned.

When using this class in the real world, you perform a get() first to see whether a valid copy of the data is in the cache, and if it is not, you use put():

<?php
class Foo {
  public function id() {
    return "I am a Foo";
  }
}

require_once 'Cache/DBM.inc';
$dbm = new Cache_DBM("/data/cache/files/016/generic");
if($obj = $dbm->get("foo")) {
  // Cache Hit, $obj is what we were looking for
  print $obj->id();
}
else {
  // Cache Miss, create a new $obj and insert it into the cache
  $obj = new Foo();
  $dbm->put("foo", $obj);
  print $obj->id();
}
// ... use $obj however we want
?>

The following are some things to note about the wrapper class:

  • Any sort of data (for example, an object, array, or string) can be handled automatically, with the exception of resources; resources cannot be effectively shared between processes anyway.

  • You can perform a put() to recache an object at any time. This is useful if you take an action that you know invalidates the cached value.

  • Keynames are not autodetermined, so you must know that foo refers to the Foo object you are interested in. This works well enough for singletons (where this naming scheme makes perfect sense), but for anything more complicated, a naming convention needs to be devised (see the sketch after this list).
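
As one possible convention (the cacheKey() helper below is purely illustrative and is not part of Cache_DBM), you might derive keys from a function name and its arguments:

<?php
// build a cache key like "showConversion:a1b2c3..." from a function
// name and its arguments; md5(serialize(...)) keeps the key short and
// independent of argument types
function cacheKey($function, $args) {
  return $function . ':' . md5(serialize($args));
}

// usage: cache the conversion count for promotion 17
$cache = new Cache_DBM("/path/to/cachedb");
$key = cacheKey('showConversion', array(17));
if(($count = $cache->get($key)) === false) {
  $count = showConversion(17);
  $cache->put($key, $count);
}
?>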

With the exception of cdb, DBM implementations dynamically extend their backing storage to handle new data. This means that if left to its own devices, a DBM cache will function as long as the filesystem it is on has free space. The DBM library does not track access statistics, so without wrapping the library to provide this functionality, you can't do intelligent cache management.

One idiosyncrasy of DBM files is that they do not shrink. Space is reused inside the file, but the actual size of the file can only grow, never shrink. If the cache sees a lot of activity (such as frequent inserts and significant turnover of information), some form of cache maintenance is necessary. As with file-based caches, for many applications the lowest-maintenance solution is simply to remove and re-create the DBM file periodically.

If you do not want to take measures that draconian, you can add a garbage-collection method to Cache_DBM:

function garbageCollection() {
  $keys = array();
  // first pass: collect every key in the cache
  $cursor = dba_firstkey($this->dbm);
  while($cursor !== false) {
    $keys[] = $cursor;
    $cursor = dba_nextkey($this->dbm);
  }
  // second pass: get() deletes any entry that has expired
  foreach($keys as $key) {
    $this->get($key);
  }
}

You use a cursor to walk through the keys of the cache, store them, and then call get() on each key in succession. As shown earlier in this section, get() removes an entry if it is expired, and you simply ignore its return value if it is not. This method may seem longer than necessary; putting the call to get() inside the first while loop would make the code more readable and eliminate an entire loop.

Unfortunately, most DBM implementations do not correctly handle keys being removed from under them while looping through the keys. Therefore, you need to implement this two-step process to ensure that you visit all the entries in the cache.

Garbage collection such as this is not cheap, and it should not be done more frequently than is needed. I have seen implementations where the garbage collector was called at the end of every page request, to ensure that the cache was kept tight. This can quickly become a serious bottleneck in the system. A much better solution is to run the garbage collector as part of a scheduled job from cron. This keeps the impact negligible.
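
As a rough sketch of how that might look (the script name, paths, and schedule below are just examples, and it assumes the garbageCollection() method above has been added to Cache_DBM), a small command-line script plus a crontab entry is all that is needed:

<?php
// cache_gc.php -- run from cron, not from a web request
require_once 'Cache/DBM.inc';

$cache = new Cache_DBM("/data/cache/files/016/generic");
$cache->garbageCollection();
?>

# crontab entry: expire stale cache entries once an hour, at 15 past
15 * * * * php /usr/local/bin/cache_gc.php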

