Relational Databases

Technically speaking, a database is a collection of data organized for rapid search and retrieval. Databases come in many shapes and sizes:

A hierarchical database is one in which data is stored in a tree-like structure, such as a computer's file system.
A relational database management system (RDBMS) is one where data is organized into tables (also known as "entities"). Tables consist of fields (also known as "columns" or "attributes") and records (also known as "rows" or "instances" or "tuples").
An object-oriented database is one in which data is stored in its natural structure, in the form of objects. Object-oriented databases offer a great deal of flexibility when using object-oriented programming. While a few object-oriented databases have emerged in recent years, they have yet to achieve the performance and widespread usage of relational databases, which are the main focus of this chapter. This is mainly due to the fact that relational algebra (the basis for relational database theory) is much more powerful and correct than any object query language can be.
A hybrid DBMS is one that offers features of both object-oriented and relational databases.

While all of these different types of databases can be used with PHP, it is most common to use relational databases for web applications due to their speed, stability, and maturity. An application which implements an RDBMS almost always uses more than one table. The data in a table usually has a logical relationship with the data in other tables, hence the term relational database.

Let's now look into an example database that draws inspiration from an online library. The database will store information on the current books in the library, the books' authors, and the series to which each book belongs. Each book must belong to one and only one series, and may have multiple authors, and each author may have written multiple books. We also store the price, the quantity, and the number of books that are "booked" (reserved) for each book in the database.

We shall continue by demonstrating how this database will be organized into separate tables. The relationships between the tables, and the fields within the tables, will also be illustrated.

The table containing data describing book details would have a logical relationship with the table containing data about the different series of books like this:

Indices

An index is a sorted list that speeds up data matching, and it is the most powerful tool for improving database performance. If computers were magically fast, we would have no need for indices, but they're not, and complex queries can sometimes take a very long time to process. Therefore, it is sometimes necessary to optimize a database to handle queries more quickly.

The index is a sorted list of the field's values. In addition to the values, it contains pointers to that value's location in the table. In MySQL, and many other database systems, indices are stored in their own files, separate from the table data.

A unique index is one in which each value in the index must be unique. Unique indices can be searched even more quickly than those that are not unique, but they are only applicable if the nature of the data in the field(s) being indexed is such that uniqueness makes sense.

For example, in the Periodic Table of elements, the element's symbol must always be unique; therefore, a unique index may be placed on the field "Symbol". The element's valence is not unique, so if an index were needed on the field "Valence", you may not use a unique index. For more information about indices, including unique indices, consult the documentation for your DBMS.

Keys

A primary key is an example of a unique index. It is used to uniquely identify a single record in a table. No two records in a table can have the same value for their primary key. This is generally necessary in a relational database because it allows for the retrieval and manipulation of data in a logical, consistent, and unambiguous way. Primary keys are used to enforce this requirement.

Important

Primary keys may consist of one or more fields in the table.

In the above example, the ISB N field is the primary key of the details table. If the book has an ISB N of 12323461003730, we can accurately match that ISB N, with the rest of the data pertaining to the book, since there can be only one 1861003730 in the details table. If a flaw in the application allowed us to enter a second record in the table with the ISB N 18634534730, the referential integrity of the database woud be violated, since the primary key value no longer refers to just a single record.

Since each record has a unique identifier, we can now demonstrate the relationship between the details table data and the series table data, by referring to series_ID in series:

In the series table, the primary key is series_ID. Note that the series_ID field also appears in the details table. In the details table, the series_ID field is a foreign key. It refers to the primary key of the foreign table, series, to establish a relationship between the records of this table and the foreign table. Like primary keys, foreign keys may consist of more than one field:

There are two possible ways of determining a primary key:

Natural or Logical Key
The preferred way to allocate a primary key is to find something inherent in the data which uniquely identifies the record. In the details table, each book already has an ISB N, and each ISB N is unique, so it can be used as the unique identifier (the natural primary key).
Surrogate Key
When a logical key cannot be found, the alternative is to introduce a surrogate key. A surrogate key is an additional field whose only purpose is to provide a unique primary key. It is usually an integer field. In the series table, the series_ID integer field was added to guarantee unique records. It is therefore a surrogate primary key. Surrogate keys are often a good idea, even when dealing with a unique identifier, such as ISB N numbers. They can help make tables easier to manipulate because they don't correlate directly to any real data in the row. In other words, they help abstract the unique identifier from the actual data.

Surrogate keys are very common in database applications, partly because there are many instances when there is no derivable logic key. For instance, in a table of customer data, a numeric customer_ID field is normally necessary, since the data collected about customers, like their name and address, are not likely to contain uniquely identifying attributes.

Normalization

Let's look at the table books:

The following data might be stored in the books table:

ISB N	book_title	auth_name
1-861005-15-6	Beginning Databases with PostgreSQL	Richard Stones
1-861005-15-6	Beginning Databases with PostgreSQL	Neil Matthew
1-861005-15-6	Beginning Databases with PostgreSQL	Jon Parise

There is a wasteful redundancy found in the books table – for each occurrence of an author, the book's ISB N and title are needlessly repeated. This is because our database schema is not fully normalized. Schema refers to the basic definition of the data – the organization of the fields and tables. Normalization is the process through which redundancy is removed from a database's structure by organizational means.

To solve the problem of repeating values, let's break this one table into two separate tables – one for authors and one for book titles:

The surrogate key, auth_id, is added to the author table for two reasons. First, it allows expansion of the author table in the future, perhaps by splitting the auth_name field into separate auth_firstname and auth_lastname fields. Second, it is usually best to use numeric values for primary keys because it's easy to keep them unique, and they are generally faster for searching and comparison operations. For example, we might have two authors in our database named "Tom Jones", because people's names aren't guaranteed to be unique. Lastly, a numeric value is almost always shorter than a character-based value, which makes a difference when storing the value in a foreign table multiple times.

ISB N is the logical primary key for the title table.

We still need to establish a relationship between authors and book titles. For that, let's introduce a third index or look-up table:

The primary key of authortitle is composed of two separate fields: auth_ID and ISB N. Also, auth_ID and ISB N are foreign keys into each of the other two tables.

The author table stores information about the authors. The title tabel stores information about the book titles. The authortitle table relates the data from the author table with that of the title table, connecting authors with titles, showing which authors wrote what books. For a table to be a lookup table, it must always have at least two foreign keys (referring to at least two different foreign tables).

Our author and title tables have a many-to-many relationship. Each author may have many books. Likewise, each book may be written by multiple authors. As we have seen above, a many-to-many relationship is implemented by use of a look-up table. Another example of a many-to-many relationship is that of orders and products. A customer's order can typically consist of multiple products, and it is hoped that each product will appear on multiple orders. A look-up table that unites an order_ID with a product_ID (and probably a quantity) makes it work. Thus, by introducing this additional table, we transform a many-to-many relationship into two easily manageable one-to-many relationships.

In our example, there might be multiple records in authortitle that relate the author "Jon Parise" in the author table with multiple books defined in the title table. It can therefore be said that author and authortitle have a one-to-many relationship, since one instance of author can have many instances of authortitle. Likewise, title and authortitle have a one-to-many relationship.

Sometimes, two tables may have a one-to-one relationship, meaning that for each record in the first table, there can be only one related record in the second table. Technically, this can mean that the fields of the two tables could simply be combined into one table, but, in some circumstances developers sometimes choose to create one-to-one relationships for organizational purposes. Another possible reason, again, is speed. Perhaps the data in the first table require frequent access while the data in the second table require only occasional access.

Purist database theorists often insist that a database schema should always be completely normalized, but normalization has its pros and cons:

Pros
Normalization results in less redundancy, which can mean less disk space and fewer chances for error. In addition, if the description ever needs to be changed, it only exists in one place.
Cons
One drawback to normalization is a slightly slower data access. Suppose we would like to print a report consisting of users' full names and their preferences, along with the preference descriptions. In our example, before normalization, we would have had to query only two tables for this information. After normalization, we need to reference three tables for the same information. Usually this performance difference is very negligible, due to the high performance of most RDBMSs. But in some very complex situations, say those involving twelve or fifteen tables, it may be beneficial to de-normalize the schema a bit. Also, working with many tables makes SQL queries more complex, so de-normalization may simplify the queries, as well.

Database optimization expertise is attained through study of database design theory, and analyzing the problem at hand. It is usually best to normalize as much as possible. However, it is best not to go for normalization if the data being modeled will not be redundant too often. Once we become more familiar with the nuances of schematization, we may see opportunities for higher performance in some situations, but the decision to de-normalize should not be made lightly. Normalization is generally considered desirable; in fact, many consider it mandatory.