
Understanding How Databases and Queries Work


An RDBMS is a system for organizing data into tables. The tables are composed of rows, and the rows have a specific format. SQL (originally Structured Query Language; now a name without any specific meaning) provides syntax for searching the database to extract data that meets particular criteria. RDBMSs are relational because you can define relationships between fields in different tables, allowing data to be broken into logically separate tables and reassembled as needed, using relational operators.

The tables managed by the system are stored in disk-based data files. Depending on the RDBMS you use, there may be a one-to-one, many-to-one, or one-to-many relationship between tables and their underlying files.

The rows stored in the tables are in no particular order, so without any additional infrastructure, searching for an item in a table would involve looking through every row in the table to see whether it matches the query criteria. This is known as a full table scan and, as you can imagine, is very slow as tables grow in size.

To make queries more efficient, RDBMSs implement indexes. An index is, as the name implies, a structure to help look up data in a table by a particular field. An index is basically a special table, organized by key, that points to the exact position for rows of that key. The exact data structure used for indexes varies from RDBMS to RDBMS. (Indeed, many allow you to choose the particular type of index from a set of supported algorithms.)

Figure 12.1 shows a sample database lookup on a B-tree-style index. Note that after doing an efficient search for the key in the index, you can jump to the exact position of the matching row.

Figure 12.1. A B-tree index lookup.


A database table usually has a primary key. For our purposes, a primary key is an index on a set of one or more columns. The columns in the index must have the following properties: The columns cannot contain null, and the combination of values in the columns must be unique for each row in the table. Primary keys are a natural unique index, meaning that any key in the index will match only a single row.

Note

Some database systems allow for special table types that store their data in index order. An example is Oracle's Index-Organized Table (IOT) type.

Some database systems also support indexes based on an arbitrary function applied to a field or combination of fields. These are called function-based indexes.
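The syntax for a function-based index varies by system; MySQL of this vintage does not support them at all. The following sketch uses PostgreSQL/Oracle-style expression-index syntax, with a hypothetical case-insensitive lookup on the users table defined later in this chapter:

```sql
-- Function-based (expression) index: PostgreSQL/Oracle-style syntax.
-- The index stores LOWER(lastname), so equality tests on that
-- expression can use it instead of scanning the table.
CREATE INDEX users_lastname_lower ON users (LOWER(lastname));

-- A query can use the index when it applies the same function:
SELECT username FROM users WHERE LOWER(lastname) = 'smith';
```

The key point is that the indexed expression in the query must match the expression the index was built on; a plain `WHERE lastname = ...` would not use this index.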


Whenever possible, frequently run queries should take advantage of indexes because indexes greatly improve access times. If a query is not frequently run, adding indexes specifically to support it may actually reduce the performance of the database, because indexes take CPU and disk time to create and maintain. This is especially true for tables that are updated frequently.

This means that you should check commonly run queries to make sure they have all the indexes they need to run efficiently, and you should either change the query or the index if needed. A method for checking this is shown later in this chapter, in the section "Query Introspection with EXPLAIN."

Note

Except where otherwise noted, this chapter continues to write examples against MySQL. Most RDBMSs deviate slightly from the SQL92 specification, so check your system's documentation to learn its correct syntax.


You can access data from multiple tables by joining them on a common field. When you join tables, it is especially critical to use indexes. For example, say you have a table called users:

CREATE TABLE users (
  userid int(11) NOT NULL,
  username varchar(30) default NULL,
  password varchar(10) default NULL,
  firstname varchar(30) default NULL,
  lastname varchar(30) default NULL,
  salutation varchar(30) default NULL,
  countrycode char(2) NOT NULL default 'us'
);

and a table called countries:

CREATE TABLE countries (
  countrycode char(2) default NULL,
  name varchar(60) default NULL,
  capital varchar(60) default NULL
);

Now consider the following query, which selects the username and country name for an individual user by user ID:

SELECT username, name
FROM users, countries
WHERE userid = 1
AND users.countrycode = countries.countrycode;

If you have no indexes, you must do a full table scan of the cross product of both tables to complete the query. This means that if users has 100,000 rows and countries contains 239 rows, 23,900,000 joined rows must be examined to return the result set. Clearly this is a bad procedure.

To make this lookup more efficient, you need to add indexes to the tables. A first start is to add primary keys to both tables. For users, userid is a natural choice, and for countries the two-letter International Organization for Standardization (ISO) code will do. Assuming that the field that you want to make the primary key is unique, you can use the following after table creation:

mysql> alter table users add primary key (userid);

Or, during creation, you can use the following:

CREATE TABLE countries (
  countrycode char(2) NOT NULL default 'us',
  name varchar(60) default NULL,
  capital varchar(60) default NULL,
  PRIMARY KEY  (countrycode)
);

Now when you do a lookup, you first perform an indexed lookup on the users table to find the row with the matching user ID. Then you take that user's countrycode and perform an indexed lookup in countries. Only one row per table needs to be inspected. This is a considerable improvement over inspecting 23.9 million rows.
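The optimizer performs this decomposition internally, but the two indexed single-row lookups it does are roughly equivalent to running the following by hand (illustrative only; the countrycode value 'us' is assumed to be what step 1 returned):

```sql
-- Step 1: indexed lookup in users by its primary key
SELECT username, countrycode FROM users WHERE userid = 1;

-- Step 2: indexed lookup in countries by its primary key,
-- using the countrycode value returned by step 1
SELECT name FROM countries WHERE countrycode = 'us';
```

Each step touches a single row via a unique index, which is why the join is so cheap once both primary keys exist.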

Query Introspection with EXPLAIN

Determining the query path in the previous example was done simply with logical deduction. The problem with using logic to determine the cost of queries is that you and the database are not equally smart. Sometimes the query optimizer in the database makes bad choices. Sometimes people make bad choices. Because the database will be performing the query, its opinion on how the query will be run is the one that counts the most. Manual inspection is also time-consuming and difficult, especially as queries become complex.

Fortunately, most RDBMSs provide the EXPLAIN SQL syntax for query execution path introspection. EXPLAIN asks the query optimizer to generate an execution plan for the query. The exact results of this vary from RDBMS to RDBMS, but in general EXPLAIN returns the order in which the tables will be joined, any indexes that will be used, and an approximate cost of each part of the query (the number of rows in the tables being queried and so on).

Let's look at a real-world example. On a site I used to work on, there was a visits table that tracked the number of visits a user had made and the time of the last visit. The table looked like this:

CREATE TABLE visits (
  userid int not null,
  last_visit timestamp,
  count int not null default 0,
  primary key(userid)
);

The normal access pattern for this table was to find the visit count and last visit for the user on login (so that a welcome message such as "You last logged in on..." can be displayed). Using EXPLAIN to inspect that query shows the following:

mysql> explain select * from visits where userid = 119963;
+-------------+-------+---------------+---------+---------+-------+------+
| table       | type  | possible_keys | key     | key_len | ref   | rows |
+-------------+-------+---------------+---------+---------+-------+------+
| visits      | const | PRIMARY       | PRIMARY |       4 | const |   1  |
+-------------+-------+---------------+---------+---------+-------+------+
1 row in set (0.00 sec)

This shows the table being accessed (visits), the type of join being performed (const, because it is a single-table query and no join is happening), the list of possible keys that could be used (only PRIMARY on the table is eligible), the key it has picked from that list, the length of that key, and the number of rows it thinks it will examine to get the result. This is an efficient query because it is keyed off the primary key of visits.

As this application evolves, say that I decide that I would like to use this information to find the number of people who have logged in in the past 24 hours. I'd do this with the following query:

SELECT count(*) FROM visits WHERE last_visit > NOW() - 86400;
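Note in passing that NOW() - 86400 relies on MySQL's implicit conversion of NOW() to a number, and subtracting 86,400 in that numeric form is not the same as subtracting 24 hours across all date boundaries. MySQL's interval arithmetic states the intent directly:

```sql
-- Equivalent intent, using MySQL's INTERVAL arithmetic instead of
-- implicit numeric conversion of NOW():
SELECT COUNT(*) FROM visits WHERE last_visit > NOW() - INTERVAL 1 DAY;
```

The examples that follow use the original form of the query.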

EXPLAIN for this query generates the following:

mysql> explain select count(*) from visits where last_visit > now() - 86400;
+--------+------+---------------+------+---------+--------+------------+
| table  | type | possible_keys | key  | key_len | rows   | Extra      |
+--------+------+---------------+------+---------+--------+------------+
| visits | ALL  | NULL          | NULL |    NULL | 511517 | where used |
+--------+------+---------------+------+---------+--------+------------+
1 row in set (0.00 sec)

Notice here that the query has no keys that it can use to complete the query, so it must do a complete scan of the table, examining all 511,517 rows and comparing them against the WHERE clause. I could improve this performance somewhat by adding an index on visits. When I do this, I get the following results:

mysql> create index visits_lv on visits(last_visit);
Query OK, 511517 rows affected (10.30 sec)
Records: 511517  Duplicates: 0  Warnings: 0

mysql> explain select count(*) from visits where last_visit > now() - 86400;
+--------+-------+--------------+-----------+--------+-------------------------+
| table  | type  | possible_keys| key       | rows   | Extra                   |
+--------+-------+--------------+-----------+--------+-------------------------+
| visits | range | visits_lv    | visits_lv | 274257 | where used; Using index |
+--------+-------+--------------+-----------+--------+-------------------------+
1 row in set (0.01 sec)

The new index is successfully used, but it is of limited effectiveness (because, apparently, a large number of users log in every day). A more efficient solution for this particular problem might be to add a per-day counter table and update it on a user's first visit of the day (which can be confirmed from the user's entry in visits):

CREATE TABLE visit_summary (
  day date,
  count int,
  primary key(day)
);
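Maintaining such a summary table can be sketched with MySQL's INSERT ... ON DUPLICATE KEY UPDATE syntax (available in MySQL 4.1 and later); detecting that this is the user's first visit of the day, via the user's row in visits, is assumed to happen in application code:

```sql
-- On a user's first visit of the day, bump the per-day counter.
-- If no row exists for today yet, one is created with count = 1;
-- otherwise the existing row's count is incremented.
INSERT INTO visit_summary (day, count)
VALUES (CURDATE(), 1)
ON DUPLICATE KEY UPDATE count = count + 1;
```

With this in place, the "logins in the past day" question becomes a single-row primary-key lookup against visit_summary instead of a range scan over half a million rows.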

Finding Queries to Profile

One of the hardest parts of tuning a large application is finding the particular code sections that need to be tuned. Tuning databases is no different: With hundreds or thousands of queries in a system, it is critical that you be able to focus your effort on the most critical bottlenecks.

Every RDBMS has its own techniques for finding problem queries. In MySQL the easiest way to spot trouble queries is with the slow query log. The slow query log is enabled with a triumvirate of settings in the MySQL configuration file. Basic slow-query logging is enabled with this:

log-slow-queries = /var/lib/mysql/slow-log

If no location is specified, the slow query log will be written to the root of the data directory as server-name-slow.log. To set a threshold for how long a query must take (in seconds) to be considered slow, you use this setting:

set-variable    = long_query_time=5 (MySQL 3.x)

or

long_query_time=5 (MySQL 4+)

Finally, if you would also like MySQL to automatically log any query that does not use an index, you can set this:

log-long-format (MySQL 3,4.0)

or

log-queries-not-using-indexes (MySQL 4.1+)
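Put together in the [mysqld] section of the MySQL configuration file, a MySQL 4.1-style setup might look like this (the path and threshold are examples):

```ini
[mysqld]
log-slow-queries = /var/lib/mysql/slow-log
long_query_time = 5
log-queries-not-using-indexes
```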

Then, whenever a query takes longer than the long_query_time setting or fails to use an index, you get a log entry like this:

select UNIX_TIMESTAMP(NOW())-UNIX_TIMESTAMP(MAX(last_visit)) FROM visits;
# User@Host: user[user] @ db.omniti.net [10.0.0.1]
# Query_time: 6  Lock_time: 0  Rows_sent: 1  Rows_examined: 511517

This tells you what query was run, how many seconds it took to complete, how many rows it returned, and how many rows it had to inspect to complete its task.

The slow query log is the first place I start when tuning a new client's site. I usually start by setting long_query_time to 10 seconds, fix or replace every query that shows up, and then drop the amount of time and repeat the cycle. The goal for any production Web site should be to be able to set long_query_time to 1 second and have the log be completely free of queries (this assumes that you have no data-mining queries running against your production database; ignore those if you do).

The mysqldumpslow tool is very handy for reading the slow query log. It allows you to summarize and sort the results in the slow query log for easier analysis.

Queries are grouped into entries that display the number of times the query was placed in the slow query log, the total time spent executing the group of queries, and so on.

Here's an example:

Count: 4  Time=0.25s (1s)  Lock=0.00s (0s)  Rows=3.5 (14), root[root]@localhost
  SELECT * FROM users LIMIT N
Count: 5  Time=0.20s (1s)  Lock=0.00s (0s)  Rows=5.0 (25), root[root]@localhost
  SELECT * FROM users

The tool accepts options to control how the queries are sorted and reported. You can run mysqldumpslow --help for more information on the options.
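For example, to show the query groups with the largest total execution time (the exact option set may vary by version; check mysqldumpslow --help on your system):

```shell
# Sort query groups by time (-s t) and show the top 10 (-t 10)
mysqldumpslow -s t -t 10 /var/lib/mysql/slow-log
```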

Logging of non-indexed queries can also be enlightening, but I tend not to leave it on. For queries running on very small tables (a couple hundred rows), it is often just as fast, if not faster, for the RDBMS to avoid the index as to use it. Turning on log-long-format is a good idea when you come into a new environment (or when you need to do a periodic audit of all the SQL running in an application), but you do not want these queries polluting your logs all the time.

