Automating Tasks

When used from the command line, PHP can help automate tasks under Windows and Linux. One common task is to parse the day's web logs, which are stored in NCSA Common Log Format (the default log format for Apache), and then e-mail a summary to the web site administrator. For this we need a simple script that reads the log file, counts how often each status code occurs, and e-mails the totals to the administrator.

We then need to set up the script to run at midnight every night.

NCSA Common Log File Format

In the NCSA format there is one entry per line, each entry being a request made to the server. The best way to explain the log file is to look at an example:

    10.0.0.1 - - [10/Apr/2001:14:22:57 +0100] "GET /icons/blank.gif HTTP/1.1" 404 287
    10.0.0.1 - - [10/Apr/2001:14:24:37 +0100] "GET /error.php HTTP/1.1" 500 289
    10.0.0.5 - - [10/Apr/2001:14:25:00 +0100] "GET /private/index.php HTTP/1.1" 401 291
    10.0.0.5 - jm [10/Apr/2001:14:25:32 +0100] "GET /private/index.php HTTP/1.1" 200 770
    10.0.0.2 - - [10/Apr/2001:14:26:58 +0100] "GET /links.php HTTP/1.1" 200 99

Here we have various requests to a local web server that have been taken from my own logs. Each line has the following form:

    ClientIP - UserName [Date:Time TimeZone] "Method URI HTTPVersion" StatusCode BytesSent

The first line of the extract tells us that the client with the IP address 10.0.0.1 requested a resource from our web server at 14:22:57 on 10 April 2001 in the time zone GMT + 1 hour (British Summer Time). The client asked for the resource /icons/blank.gif via the GET method using HTTP version 1.1. The server could not find this file and sent a 404 status code, generating 287 bytes of outgoing traffic. The log also shows that the same client requested the page /error.php, which caused the server to return a 500 status code (Internal Server Error) and generated 289 bytes.

A different user then requested a protected page, which generated a 401 status code asking the user to authenticate, which they did, authenticating as the user jm. This is shown in the next line of the log file. The final request is just a simple request by a client on the machine with an IP of 10.0.0.2 asking for the page /links.php. The server dealt with this request and returned an OK status of 200.

Now that we understand the general format of the logs, we need to extract information from them programmatically. This can be done by first removing any unwanted formatting characters and then tokenizing the string. Luckily, PHP provides some functions for us to do this easily. The following function is passed a line of the log, extracts the information and returns it in an associative array.

The first line of the function removes any unwanted formatting characters such as ], [ and " from the line passed to the function. This makes it easier to tokenize and deal with later in the function, as we no longer need to be concerned with characters that do not hold any information:

    function tokenizeLine($line)
    {
        $line = preg_replace("/(\[|\]|\")/", "", $line);

We then tokenize the string using the PHP function strtok, which returns all of the characters up to the next space. A while loop keeps requesting tokens until strtok returns false, which signals that none are left. Comparing against false explicitly matters, because a token of "0" would otherwise end the loop early:

        $token_array = array();
        $token = strtok($line, " ");
        while ($token !== false) {
            $token_array[] = $token;
            $token = strtok(" ");
        }
        $return_array['IP'] = $token_array[0];

The line above stores the first token, the client's IP address. The lines below check whether the user provided a username. The log records a hyphen when no username was supplied, so any other value is assigned to the return array; otherwise the entry is not set:

        if ($token_array[2] !== "-") {
            $return_array['UserName'] = $token_array[2];
        }

It would be nice to split date and time apart, as at the moment they are both held in the $token_array[3] variable. The regular expression below does this. It matches everything up to the first ":", which is the date. It then matches the digits and colons that follow, which make up the time. It stores these in $date_array[1] and $date_array[2] respectively. We then assign these to their correct place in the $return_array variable:

        preg_match("/([\/a-zA-Z0-9]+)[\:]([0-9:]+)/",
                    $token_array[3],$date_array);
        $return_array['Date'] = $date_array[1];
        $return_array['Time'] = $date_array[2];

The lines below place all the remaining data into the return array and then return the contents of $return_array from the function:

        $return_array['TimeZone'] = $token_array[4];
        $return_array['RequestMethod'] = $token_array[5];
        $return_array['Resource'] = $token_array[6];
        $return_array['HTTPVersion'] = $token_array[7];
        $return_array['StatusCode'] = $token_array[8];
        $return_array['BytesSent'] = $token_array[9];
        return $return_array;
    }
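As an alternative to stripping characters and tokenizing, the whole line can be matched against one regular expression. The following is only a sketch of that swapped-in technique, not the chapter's method: the function name parseLogLine is hypothetical, the pattern assumes well-formed entries like the samples above, and the type declarations assume PHP 7.1 or later.

```php
<?php
// Hypothetical helper (not part of the chapter's script): parse one
// Common Log Format line with a single regular expression.
// Returns an associative array, or null if the line does not match.
function parseLogLine(string $line): ?array
{
    $pattern = '/^(\S+) (\S+) (\S+) \[([^:]+):(\d{2}:\d{2}:\d{2}) ([+-]\d{4})\] '
             . '"(\S+) (\S+) (\S+)" (\d{3}) (\d+|-)$/';

    if (!preg_match($pattern, trim($line), $m)) {
        return null; // not in the expected format
    }

    return [
        'IP'            => $m[1],
        // A hyphen in the log means that no username was supplied
        'UserName'      => $m[3] === '-' ? null : $m[3],
        'Date'          => $m[4],
        'Time'          => $m[5],
        'TimeZone'      => $m[6],
        'RequestMethod' => $m[7],
        'Resource'      => $m[8],
        'HTTPVersion'   => $m[9],
        'StatusCode'    => $m[10],
        'BytesSent'     => $m[11],
    ];
}
```

Unlike the tokenizing version, this sketch rejects malformed lines outright instead of returning partially filled arrays, which may or may not be what you want for a nightly summary.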

The Log Analyzer Script

We now need to write the rest of the script. Our example will simply record the number of each type of status code and e-mail this to the administrator. If you use this script, you might want to include other statistics as well:

    <?php
    set_time_limit(0); // Force this script to run without a time limit

    /* -- Variables used in the script -- */
    $logfile = "./access.log";
    $admin_email = "admin@localhost";

    function tokenizeLine($line)
    {
        $line = preg_replace("/(\[|\]|\")/", "", $line);
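        // Split the line on spaces; compare against false so that a
        // token of "0" does not end the loop early
        $token_array = array();
        $token = strtok($line, " ");
        while ($token !== false) {
            $token_array[] = $token;
            $token = strtok(" ");
        }
        $return_array['IP'] = $token_array[0];

        // A hyphen means that no username was supplied
        if ($token_array[2] !== "-") {
            $return_array['UserName'] = $token_array[2];
        }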

        preg_match("/([\/a-zA-Z0-9]+)[\:]([0-9:]+)/",
                    $token_array[3],$date_array);
        $return_array['Date'] = $date_array[1];
        $return_array['Time'] = $date_array[2];
        $return_array['TimeZone'] = $token_array[4];
        $return_array['RequestMethod'] = $token_array[5];
        $return_array['Resource'] = $token_array[6];
        $return_array['HTTPVersion'] = $token_array[7];
        $return_array['StatusCode'] = $token_array[8];
        $return_array['BytesSent'] = $token_array[9];
        return $return_array;
    }

The script reads each line of the log file into an array called $file_contents:

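    // Drop trailing newlines and blank lines so they do not end up in the tokens
    $file_contents = file($logfile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);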

It then loops through the array passing each line to the tokenizeLine() function and then increments a counter for each status code:

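    $status_code = array();
    foreach ($file_contents as $line) {
        $info_array = tokenizeLine($line);
        $code = $info_array['StatusCode'];
        // Count the code, starting from 1 the first time it is seen
        $status_code[$code] = isset($status_code[$code]) ? $status_code[$code] + 1 : 1;
    }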

The script then creates the text for the e-mail, and finally sends it to the administrator:

    $email = "Summary of codes for today's logs\n\nCode\tCount\n";

    foreach ($status_code as $code => $count) {
        $email .= "$code:\t$count\n";
    }

    mail($admin_email, "Summary of weblogs", $email);
    ?>
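Because file() loads the whole log into memory, a very large log may be better read one line at a time. The sketch below shows one way this could be done, under some assumptions: the function name countStatusCodes is hypothetical, the status code is found with a small regular expression instead of tokenizeLine(), and the null coalescing operator requires PHP 7 or later.

```php
<?php
// Hypothetical helper: tally status codes while reading the log line by line,
// so that the whole file never has to fit in memory.
function countStatusCodes(string $path): array
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("cannot open $path");
    }

    $counts = [];
    while (($line = fgets($handle)) !== false) {
        // The status code is the three-digit field right after the quoted request
        if (preg_match('/" (\d{3}) /', $line, $m)) {
            $counts[$m[1]] = ($counts[$m[1]] ?? 0) + 1;
        }
    }
    fclose($handle);

    ksort($counts); // list the codes in ascending order
    return $counts;
}
```

The returned array can be fed to the same foreach loop that builds the e-mail text in the script above.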

For ease of use, we want to automate the running of this script so that it runs every night at midnight. To do this under Linux we use cron, and under Windows NT/2000 we use the AT command. I will deal with cron first.

cron

cron is a way of automating tasks under Linux/UNIX. The cron daemon looks at each user's crontab file every minute and checks to see if it needs to perform any actions. The systems administrator (or any user for that matter) can use it to automate tasks such as running the script above.

A crontab entry has six fields. The first five are used to specify the time that an action should take place; the last specifies the command that should be run by the cron daemon. Each of the time fields specifies a value that must be matched to the current system time for the command to run. They can also contain the wild card * to run the command at any time where all the other fields match.

The first field specifies the number of minutes past the hour (0-59) that the command should be run. The second field specifies the hour of the day the command should be run (0-23), the third field specifies the day of the month (1-31), the fourth the month of the year (1-12), and the fifth the day of the week (0-6, where 0 is Sunday). You may find more detailed information about the crontab command at http://hoth.stsci.edu/man/man1/crontab.html.

We need our script to run every day at midnight. If the script is called mail_stats.php and lives in the directory /home/jmoore/, the following crontab entry would accomplish this. If you need to edit your crontab file you may do so by typing crontab -e at your terminal:

    0 0 * * * /usr/local/bin/php -q /home/jmoore/mail_stats.php

This tells the cron daemon that when the hour and minute both equal 0 on any day of any month it should run the command /usr/local/bin/php -q /home/jmoore/mail_stats.php.
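To make the matching rule concrete, the following sketch checks a time against the five time fields of a crontab entry. It is a simplified illustration, not how the cron daemon is implemented: the function name cronFieldsMatch is hypothetical, and it understands only plain numbers and the * wild card described above, not the ranges, lists, or step values that real crontab files also allow.

```php
<?php
// Hypothetical helper: does $when match the five time fields of a crontab
// entry? Only plain numbers and "*" are supported in this sketch.
function cronFieldsMatch(string $spec, DateTimeInterface $when): bool
{
    $fields = preg_split('/\s+/', trim($spec));
    if (count($fields) !== 5) {
        throw new InvalidArgumentException('expected five time fields');
    }

    // Current values, in crontab order
    $current = [
        (int) $when->format('i'), // minute (0-59)
        (int) $when->format('G'), // hour (0-23)
        (int) $when->format('j'), // day of the month (1-31)
        (int) $when->format('n'), // month (1-12)
        (int) $when->format('w'), // day of the week (0-6, 0 is Sunday)
    ];

    foreach ($fields as $i => $field) {
        // "*" matches anything; a number must equal the current value
        if ($field !== '*' && (int) $field !== $current[$i]) {
            return false;
        }
    }
    return true;
}
```

With the entry above, cronFieldsMatch('0 0 * * *', ...) is true only when both the hour and the minute of the given time are 0.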

Important 

Under the Linux/UNIX shell you can make the script itself executable by setting the correct permissions on it (normally this is achieved by running the command chmod a+x mail_stats.php, but check your system's documentation if you are unsure how to do this). You can then add the command the shell should use to execute the script as the first line of the script, for example #!/usr/local/bin/php -q. This tells the shell to execute the script with the executable found at /usr/local/bin/php with the flag -q.

AT

The AT command is the Windows NT equivalent of the Linux cron. It allows you to automate commands under Windows 2000, XP, and NT. To do this you need to make an entry in the AT list. This is done at the command line, the command taking the form:

    at [\\computername] time [/interactive] [/every:date[,...] | /next:date[,...]] command

We want to run the command php -q c:\mail_stats.php (assuming the mail_stats.php script is on the root of your C: drive) every day at midnight. This is achieved with the following command:

    AT 00:00 /every:M,T,W,Th,F,S,Su php -q c:\mail_stats.php

This would execute the command php -q c:\mail_stats.php at midnight every Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, and Sunday.

Windows Scheduled Task Manager

There is no crontab/AT mechanism for Windows 95/98/ME. Instead we have the Scheduled Task Manager. The wizard can be accessed using Start | Programs | Accessories | System Tools | Scheduled Tasks.

First, you need to create a batch file called, for example, C:\CHECKLOG.BAT. We will use this batch file to execute the script. It requires only one line. Presuming that the directory containing the PHP executable is in your path variable, add the following line to it:

    php -q c:\mail_stats.php

After saving the batch file, go to the Scheduled Task Manager and add the new task by double-clicking on the Add Scheduled Task icon. Click the Next button on the wizard, then specify the batch file we have just created by clicking on the Browse button and browsing to it. You can assign a name to the task (for example, CheckLogTask) and specify the time and frequency at which it should run.

There is, however, a fairly major limitation when using the Scheduled Task Manager: each task may have only one setting for the time. To work around this, you can create multiple tasks that run the same batch file and give them different names (CheckLogTask1, CheckLogTask2, CheckLogTask3, and so on).

Accepting Command Line Arguments

The above script does its job well, but what happens if you have lots of different access logs for different virtual hosts, and each log needs to be sent to a different administrator? One option is to have a copy of the script for each log and change the variables for each of the different logs and administrators. A far more efficient way to do it would be to accept command line arguments so that you could call the script with mail_stats.php <logfile> <administrators_email>.

PHP allows you to pass command line arguments to your script like this. When this is done PHP sets two variables: $argc, which contains the number of command line arguments passed to the script, and $argv[], which is an array of the actual arguments.

$argv[0] is always the name of the PHP script you are executing (mail_stats.php in this case). $argv[1] contains the first command line argument, $argv[2] the second, and $argv[n] the nth. Because the script name is counted, $argc is 3 when two arguments are supplied. To allow our script to accept arguments via the command line we need to check that the correct number of arguments has been passed and then assign the arguments to the correct variables. The following code does this:

    if ($argc != 3) {
        // Print the usage message on its own line and signal failure to the shell
        echo("usage: mail_stats.php logfile administrators_email\n");
        exit(1);
    }

    $logfile = $argv[1];
    $admin_email = $argv[2];

First we check the number of arguments. If this is incorrect we print a usage message and quit the script; otherwise we assign the values to the variables that are used later on in the script. If you are not the only person who is going to use the script, you might want to check $argv[1] and $argv[2] for validity; this check has been left out here to keep the example clear. To use the argument handling, put the above lines into the mail_stats.php script in place of the following:

    /* -- Variables used in the script -- */
    $logfile = "./access.log";
    $admin_email = "admin@localhost";
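The validity check mentioned above might look something like the following sketch. It is only one possible approach: the function name validateArguments is hypothetical, it treats a log file as valid if it exists and is readable, and it relies on PHP's built-in filter_var() with FILTER_VALIDATE_EMAIL to judge the address.

```php
<?php
// Hypothetical helper: check the command line arguments before using them.
// Returns a list of error messages; an empty list means the arguments are fine.
function validateArguments(array $argv): array
{
    // The script name plus two arguments makes three entries
    if (count($argv) !== 3) {
        return ['usage: mail_stats.php logfile administrators_email'];
    }

    $errors = [];
    if (!is_file($argv[1]) || !is_readable($argv[1])) {
        $errors[] = "cannot read log file: {$argv[1]}";
    }
    if (filter_var($argv[2], FILTER_VALIDATE_EMAIL) === false) {
        $errors[] = "not a valid e-mail address: {$argv[2]}";
    }
    return $errors;
}
```

A script could call validateArguments($argv), print any messages it returns, and exit before touching the log. Note that FILTER_VALIDATE_EMAIL does not accept dotless domain names, so an address such as admin@localhost would be rejected; if you deliver mail locally you would need a looser check.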
