Caching in a Distributed Environment

Using caching techniques to increase performance is one of the central themes of this book. Caching, in one form or another, is the basis for almost all successful performance improvement techniques, but unfortunately, a number of the techniques we have developed, especially content caching and other interprocess caching techniques, break down when we move them straight to a clustered environment.

Consider a situation in which you have two machines, Server A and Server B, both of which are serving up cached personal pages. Requests come in for Joe Random's personal page, and it is cached on Server A and Server B (see Figure 15.4).

Figure 15.4. Requests being cached across multiple machines.

Now Joe comes in and updates his personal page. His update request happens on Server A, so his page gets regenerated there (see Figure 15.5).

Figure 15.5. A single cache write leaving the cache inconsistent.

This is all that the caching mechanisms we have developed so far will provide. The cached copy of Joe's page was poisoned on the machine where the update occurred (Server A), but Server B still has a stale copy and no way of knowing that the copy is stale, as shown in Figure 15.6. The data is now inconsistent, and you have not yet developed a way to deal with that.

Figure 15.6. Stale cache data resulting in inconsistent cluster behavior.

Cached session data suffers from a similar problem. Joe Random visits your online marketplace and places items in a shopping cart. If that cart is implemented using the session extension with local file storage, then each time Joe hits a different server, he will get a completely different version of his cart, as shown in Figure 15.7.

Figure 15.7. Inconsistent cached session data breaking shopping carts.

Given that you do not want to have to tie a user's session to a particular machine (for the reasons outlined previously), there are two basic approaches to tackle these problems:

  • Use a centralized caching service.

  • Implement consistency controls over a decentralized service.

Centralized Caches

One of the easiest and most common techniques for guaranteeing cache consistency is to use a centralized cache solution. If all participants use the same set of cache files, most of the worries regarding distributed caching disappear (basically because the caching is no longer completely distributed; just the machines performing it are).

Network file shares are an ideal tool for implementing a centralized file cache. On Unix systems the standard tool for doing this is NFS. NFS is a good choice for this application for two main reasons:

  • NFS servers and client software are bundled with essentially every modern Unix system.

  • Newer Unix systems supply reliable file-locking mechanisms over NFS, meaning that the cache libraries can be used without change, as the sketch following this list illustrates.
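
To make this concrete, here is a minimal sketch (my illustration, not code from this chapter) of the flock()-based exclusive write that a file cache performs internally. On a system with working NFS locking, the same code runs unchanged whether /cache is a local filesystem or an NFS mount; $newContents stands in for the regenerated page content:

// Acquire an exclusive lock, replace the cached copy, release.
// Over NFS this relies on the operating system supporting flock()
// semantics on NFS-mounted files.
$fp = fopen('/cache/www.foo.com/joe_random.html', 'c');
if ($fp && flock($fp, LOCK_EX)) {
    ftruncate($fp, 0);           // discard the stale copy
    fwrite($fp, $newContents);   // write the regenerated page
    fflush($fp);
    flock($fp, LOCK_UN);
}
if ($fp) {
    fclose($fp);
}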

The real beauty of using NFS is that from a user level, it appears no different from any other filesystem, so it provides a very easy path for growing a cache implementation from a single machine to a cluster of machines.

If you have a server that utilizes /cache/www.foo.com as its cache directory, using the Cache_File module developed in Chapter 10, "Data Component Caching," you can extend this caching architecture seamlessly by creating an exportable directory /shares/cache/www.foo.com on your NFS server and then mounting it on any interested machine as follows:

# /etc/fstab
# device                              mount point         type  options     dump  pass
nfs-server:/shares/cache/www.foo.com  /cache/www.foo.com  nfs   rw,noatime  0     0

Then you can mount it with this:

# mount -a

These are the drawbacks of using NFS for this type of task:

  • It requires an NFS server. In most setups, this is a dedicated NFS server.

  • The NFS server is a single point of failure. A number of vendors sell enterprise-quality NFS server appliances. You can also rather easily build a highly available NFS server setup.

  • The NFS server is often a performance bottleneck. The centralized server must sustain the disk input/output (I/O) load for every Web server's cache interaction and must transfer that over the network. This can cause both disk and network throughput bottlenecks. A few recommendations can reduce these issues:

    • Mount your shares by using the noatime option. This turns off file metadata updates when a file is accessed for reads.

    • Monitor your network traffic closely and use trunked Ethernet/Gigabit Ethernet if your bandwidth grows past 75Mbps.

    • Take your most senior systems administrator out for a beer and ask her to tune the NFS layer. Every operating system has its quirks in relation to NFS, so this sort of tuning is very difficult. My favorite quote in regard to this is the following note from the 4.4BSD man pages regarding NFS mounts:

      Due to the way that Sun RPC is implemented on top of UDP (unreliable
      datagram) transport, tuning such mounts is really a black art that can
      only be expected to have limited success.
      

Another option for centralized caching is using an RDBMS. This might seem completely antithetical to one of our original intentions for caching (to reduce the load on the database), but that isn't necessarily the case. Our goal throughout all this is to eliminate or reduce expensive code, and database queries are often expensive. Often is not always, however, so we can still cache effectively if we make the results of expensive database queries available through inexpensive queries.
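
As a minimal sketch (my illustration, not code from this chapter), a query-result cache can live in a table keyed by an identifier, so that a hit costs only a cheap primary-key lookup. The cache table schema and the use of PDO here are assumptions:

// Hypothetical schema:
//   CREATE TABLE cache (id VARCHAR(64) PRIMARY KEY, data BLOB, mtime INT);
// cache_get() is an inexpensive primary-key SELECT; only on a miss
// do you run the expensive query and store its result with cache_set().
function cache_get(PDO $dbh, $id, $expiration)
{
    $stmt = $dbh->prepare('SELECT data FROM cache WHERE id = ? AND mtime > ?');
    $stmt->execute(array($id, time() - $expiration));
    $row = $stmt->fetch(PDO::FETCH_NUM);
    return $row ? unserialize($row[0]) : false;
}

function cache_set(PDO $dbh, $id, $data)
{
    // REPLACE INTO is MySQL-specific; other databases need an upsert.
    $stmt = $dbh->prepare('REPLACE INTO cache (id, data, mtime) VALUES (?, ?, ?)');
    $stmt->execute(array($id, serialize($data), time()));
}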

Fully Decentralized Caches Using Spread

A more ideal solution than using centralized caches is to have cache reads be completely independent of any central service and to have writes coordinate in a distributed fashion to invalidate all cache copies across the cluster.

To achieve this, you can use Spread, a group communication toolkit designed at the Johns Hopkins University Center for Networking and Distributed Systems. Spread provides an extremely efficient means of multicast communication between services in a cluster, with robust ordering and reliability semantics. Spread is not a distributed application in itself; it is a toolkit (a messaging bus) that allows the construction of distributed applications.

The basic architecture plan is shown in Figure 15.8. Cache files will be written in a nonversioned fashion locally on every machine. When an update to the cached data occurs, the updating application will send a message to the cache Spread group. On every machine, there is a daemon listening to that group. When a cache invalidation request comes in, the daemon will perform the cache invalidation on that local machine.

Figure 15.8. A simple Spread ring.

This methodology works well as long as there are no network partitions. A network partition event occurs whenever a machine joins or leaves the ring. Say, for example, that a machine crashes and is rebooted. During the time it was down, cache entries may have been updated. It is possible, although complicated, to build a system using Spread whereby changes could be reconciled on network rejoin. Fortunately for you, the nature of most cached information is that it is temporary and not terribly painful to re-create. You can use this assumption and simply destroy the cache on a Web server whenever the cache maintenance daemon is restarted. This measure, although draconian, allows you to easily prevent usage of stale data.
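
A minimal sketch of that startup step (my addition, assuming the flat /cache/ directory layout used throughout this chapter):

// Wipe the local cache when the cache maintenance daemon starts,
// so that entries that went stale while this machine was out of
// the ring can never be served.
function destroy_local_cache($dir = '/cache/')
{
    foreach (glob($dir . '*') as $file) {
        if (is_file($file)) {
            unlink($file);
        }
    }
}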

To implement this strategy, you need to install some tools. To start with, you need to download and install the Spread toolkit from www.spread.org. Next, you need to install the Spread wrapper from PEAR:

# pear install spread

The Spread wrapper library is written in C, so you need all the PHP development tools installed to compile it (these are installed when you build from source). So that you can avoid having to write your own protocol, you can use XML-RPC to encapsulate your purge requests. This might seem like overkill, but XML-RPC is actually an ideal choice: It is much lighter-weight than a protocol such as SOAP, yet it still provides a relatively extensible and "canned" format, which ensures that you can easily add clients in other languages if needed (for example, a standalone GUI to survey and purge cache files).
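
For reference, the serialized purge request that travels over Spread is a standard XML-RPC methodCall. For a hypothetical cache file named joe_random.html, it looks roughly like this:

<?xml version="1.0"?>
<methodCall>
<methodName>purgeCacheEntry</methodName>
<params>
<param><value><string>joe_random.html</string></value></param>
</params>
</methodCall>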

To start, you need to install an XML-RPC library. The PEAR XML-RPC library works well and can be installed with the PEAR installer, as follows:

# pear install XML_RPC

After you have installed all your tools, you need a client. You can augment the Cache_File class with a method that allows for purging data:

require_once 'XML/RPC.php';

class Cache_File_Spread extends Cache_File {
  private $spread;

Spread works by having clients attach to a network of servers, usually a single server per machine. If the daemon is running on the local machine, you can simply specify the port that it is running on, and a connection will be made over a Unix domain socket. The default Spread port is 4803:

private $spreadName  = '4803';

Spread clients join groups on which they send and receive messages. If you are not joined to a group, you will not see any of the messages for it (although you can send messages to a group you are not joined to). Group names are arbitrary, and a group will be automatically created when the first client joins it. You can call your group xmlrpc:

private $spreadGroup = 'xmlrpc';

private $cachedir = '/cache/';
public function __construct($filename, $expiration=false)
{
  parent::__construct($filename, $expiration);

You create a new Spread object in order to have the connection made for you automatically:

 $this->spread = new Spread($this->spreadName);
}

Here's the method that does your work. You create an XML-RPC message and then send it to the xmlrpc group with the multicast method:

  function purge()
  {
    // We don't need to perform this unlink ourselves;
    // our local Spread daemon will take care of it:
    // unlink("$this->cachedir/$this->filename");
    // PEAR XML_RPC expects parameters wrapped in XML_RPC_Value objects.
    $params = array(new XML_RPC_Value($this->filename, 'string'));
    $client = new XML_RPC_Message("purgeCacheEntry", $params);
    $this->spread->multicast($this->spreadGroup, $client->serialize());
  }
}

Now, whenever you need to poison a cache file, you simply use this:

$cache->purge();

You also need an RPC server to receive these messages and process them:

require_once 'XML/RPC/Server.php';
$CACHEBASE = '/cache/';
$serverName = '4803';
$groupName  = 'xmlrpc';

The function that performs the cache file removal is quite simple. You decode the file to be purged and then unlink it. Prefixing the path with the cache directory is a half-hearted attempt at security. A more robust solution would be to chroot the daemon into the cache directory at startup. Because you're using this purely internally, you can let this slide for now. Here is a simple cache removal function:

function purgeCacheEntry($message) {
  global $CACHEBASE;
  $val = $message->params[0];
  $filename = $val->getval();
  unlink("$CACHEBASE/$filename");
}
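
A minimal sketch of the more robust chroot approach mentioned above (my addition, not the original code; PHP's chroot() requires root privileges and a CLI-style SAPI):

// Confine the daemon to the cache directory before serving any
// purge requests, so that a malicious filename cannot escape it.
if (!chroot($CACHEBASE)) {
    die("unable to chroot to $CACHEBASE\n");
}
chdir('/');
$CACHEBASE = '/'; // all cache paths now resolve inside the jail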

Now you need to do some XML-RPC setup, setting the dispatch array so that your server object knows what functions it should call:

$dispatches = array( 'purgeCacheEntry' =>
                        array('function' => 'purgeCacheEntry'));
$server = new XML_RPC_Server($dispatches, 0);

Now you get to the heart of your server. You connect to your local Spread daemon, join the xmlrpc group, and wait for messages. Whenever you receive a message, you call the server's parseRequest method on it, which in turn calls the appropriate function (in this case, purgeCacheEntry):

$spread = new Spread($serverName);
$spread->join($groupName);
while(1) {
  $message = $spread->receive();
  $server->parseRequest($message->message);
}

