Приглашаем посетить

Добычин (dobychin.lit-info.ru)

Organizing Your Data

Once we have decided which data will be put into our databases, we have to decide how to organize them into tables (relations). Any table we design needs to perform two fundamental tasks:

Store the data.
Distinguish between entities underlying our data.

Primary Keys

The second requirement deserves further explanation. When we say "entities underlying our data," we are referring to the actual item being represented. Consider a customer database. The underlying entities are the physical customers, not their names or addresses. This distinction is critical because we might have two or more customers with the name Viktor Müller, or two customers living at the address 123 Happy Stream Ave.

We will distinguish between individual entities in our tables by using keys. Keys are unique identifiers for our row data that can never change for a particular entity. For our customer example above, we could theoretically combine the name and address to create the key:

Viktor Müller living at 123 Happy Lane, Toronto, Ontario

However, this does not prepare us for the eventuality that two people with the same name happen to live at the same address. (It is not uncommon for fathers and sons to have the same name.)

We can avoid this problem if we choose another way of specifying a key for our table data. Since most modern databases are extremely effective at managing integer values and keys, we will most often choose this type for our keys. This primary key will simply be a unique, non-empty integer that will identify any given row (see Figure 9-2).

Figure 9-2. Primary keys in database tables.

Organizing Your Data

PRIMARY KEYS AS BUSINESS DATA

A debate rages in the database community about a common practice with which a number of database designers take exception to: combining primary keys (a means of uniquely identifying a given row within a table) and business data (defined to be data that is part of the set of information managed by the web application). In this practice, application authors designate a piece of business data, such as the account number or user ID, as the primary key in a given table.

Those who argue against this practice offer two reasons. First, you are giving away information about your database structure instead of some data attribute on an entity. While it is not always easy to think how people might do this, once you start giving away implementation details, you start to increase the risk that somebody with less-than-noble intentions can exploit them.

Second, by using the primary key as business data, we are tying our implementation to our data. If we ever wish to reorganize our data in the future, we might have created a big headache for ourselves. Consider a blogging engine that assigns to each message a message_id field. If we were to use that as our primary key and then discover a year later that our message database has grown so large that we need to upgrade to a new type of database table or even new database software, we might find ourselves in trouble with these message ID values changing suddenly!

Thus, these database designers will instead use a surrogate key, which is a unique, non-empty integer field with a name such as id or pkid (primary key id) to serve as the primary key, and then have the other identifier for the data, such as a user's account number or a message ID, as regular data.

However, other application developers argue that this practice is unnecessary. The supposed security holes have never materialized, and the added complexity of having to create and manage your own account number or user ID instead of having the database simply take care of it for you is not worthwhile. The supposed hassles of changing implementation never actually come about, and the application is burdened with extra logic.

Like many debates in the computer industry, there are bits of truth to either side of the argument, and thus we cannot dismiss either outright. However, in the interest of keeping our code useful for learning purposes, we will largely allow our primary keys to be based on business data throughout this book. For cases when that could be a problem, we will make note of it and do things differently.

Choosing Data Types

When you choose which pieces of information you are going to include for any particular entity, you must decide what data type to assign them in the database.

Numeric Data

Storing numeric information in the database is typically easy since most modern engines offer a variety of features and are well-optimized for these types of data. You will want to be sure that the type you choose has an appropriate resolution and maximum/minimum values. Most database implementations support a number of integer and floating-point data types that are similar to those seen in programming languages.

Integer data types usually come in 16-bit, 32-bit, and 64-bit flavors. Some servers support unsigned integers, while others will support 8-bit integers. When choosing one of these types for your data column, think about the possible range of values for that column and choose appropriately. Using a 16-bit integer for bank account numbers means that you would have a limit of around 32,000 bank accounts (64,000 if you were using a server that supported unsigned integers). Similarly, if you are representing a field with the number of bedrooms in a house, choosing a 64-bit integer (which takes up 8 bytes of storage and has a maximum value of nine billion-billion) would probably be overkill unless you work with very large homes.

Databases allow for approximate floating-point numbers, too. Unlike PHP, which only supports a double-precision floating-point type, most database servers support both single and double precision. Single-precision floating-point numbers are like their double precision cousins except that they are less precise and take up less space. However, just like floating-point numbers in PHP, they are approximate values, and thus should be avoided for critical decimal calculations. For instance, while it may not seem too big a deal to see a penny lost here and there in calculations, customers conducting thousands of transactions an hour are likely to become irate.

If a more precise representation of decimal numbers is required, there are additional types to support thistypically called DECIMAL or NUMERIC. When declaring columns of this type, you can specify the precision desired and the number of decimal digits. (Some implementations allow a precision as high as one thousand.) Calculations on these data types are not approximate, so your data is safe, but the implementation of these types is slow and inefficient. (Many database servers will store these as strings so as to ensure the safety of the data.) Therefore, they should be used only when necessary.

Currency

As we briefly touched upon, using floating-point numbers for currency is an unwise idea due to their approximate nature. Integers are acceptable when you are guaranteed to be working with only whole numbers in currency, but for most cases, the DECIMAL or NUMERIC data types are preferred.

Given that most servers let you specify a precision and number of decimal points for these latter types, many programmers are tempted to choose some reasonable precision and then a value of 2 for the number of decimal places. (Dollars, euros, pounds sterling, and even rupees are subdivisible into 100.) You should be quite careful heremany prices commonly in use are specified in fractions of the smallest units.

Italian purchasers of gasoline (petrol) often pay prices such as 1.124 per liter, cell phone customers in Japan might find themselves charged 0.9¥ (yen) per packet of data they send, and the local hardware store might sell nails for 0.2¢ each. Giving yourself some extra flexibility in how prices are stored does not cost much more in space or efficiency and gives your application the opportunity to grow.

Finally, application authors who store currency values in their databases are highly encouraged to store the type of currency used (even if 99.9 percent of the data will be in one currency, that 0.1 percent that is not will cause all sorts of headaches otherwise) along with their values.

Dates and Times

Most database servers and the ANSI SQL standard provide the ability to distinguish between a date, a time, or a combination of the two. Various implementations of database servers allow for a higher degree of functionality (including extra capabilities for time zones) and the ability to control whether values are automatically set when a row is created or updated.

When choosing to assign these types to a column, you should think about whether the date or a more precise time is required. If only a date is needed, using a full date-time data type would result in wasted space for the time data that was ignored. While some database servers provide support for storing time zone information along with a date-time data value, we often find it useful in our web applications to store these values in the same time zone and worry about where the client is from within our PHP code later.

Character/String Data

While most database servers support a remarkable selection of character and string types, not all are supported by the ANSI SQL standard; we only make use of a few of them. In general, there is support for fixed-length strings (padded with spaces), short variable-length strings, and long-variable strings.

For the first two data types you specify, the maximum length of the string desired when declaring the table column. While the maximum permissible length of this string varies from server to server, 255 is a common value.

The difference between fixed-length and variable-length columns is when the storage space is allocated. For fixed-width columns, the space is always allocated immediately, and any unused slots in the string are filled in with spaces. On the other hand, only as many slots as there are characters in the data are used in the database for variable-width fields.

Both of these field types refuse to accept data that is greater than the permissible value specified at creation time. However, given that we will be doing a lot of validation of data from within PHP code and do not want to waste space, we will almost always use the variable length type in our databases.

When the short-fixed or variable-length strings are insufficient, there is the option for a longer text field. The maximum length for this text data varies from server to server, but it is typically in the range of gigabytes; people worried about hitting that upper limit might want to think seriously about how and where they are storing data. The only complication with this data type is that some servers might only use a subset of bytes at the beginning of the string for performing sorts and comparisonsfor instance, some servers might only use the first 1024 characters when sorting data of that type.

For all string data, character set support is critical. The good news is that all database servers that we will use or mention in this book support declaring string columns in a number of character sets, including Unicode (UTF-8). The bad news is that there appears to be no standard way of doing this, and all database servers we will use have a different way of doing so.

Most servers distinguish between which character sets are supported and which sorting orders (or collations) are used. In addition to needing to know how string data should be stored in the database, the server needs to know how to sort the data upon retrieval. Collations tell the server how it is done and whether to consider characters such as à, â, and ä as equivalent or as having a specific order in a given locale. A majority of the collations available on the major servers we will use are case-insensitive.

Binary Data

For fields where you wish to represent non-printable data (such as encoded data) or a file (such as a user's photo), most database servers support the ability to place arbitrary-length binary data in the database. Unfortunately, they are inconsistent: Both Oracle and MySQL support the BLOB data type, into which arbitrary length binary data can be placed, whereas Microsoft SQL Server has the image data type with a size limit of 2GB (and is not restricted to image files but restricted to binary data). PostgreSQL has the bytea data type that supports arbitrary-length binary data, but with the requirement that certain binary values be escaped or prepared for storage.

We will address each database server's specific implementations when we encounter binary columns throughout this book.

Other Data

There are some other possible types of data, such as binary flags (Booleans, yes/no type data, and so on), groups of binary flags (also known as bit fields), and time intervals. We will use these types of date less often, but all are fully supported by the ANSI SQL standard.

In general, we will resist the urge to use binary flags (also known as bits) and bit fields (collections of bits) since we find that the choices are rarely clear. Even for things as seemingly straightforward as a user's gender, we might find possible values such as "male," "female," "other," or "not yet known ."

Organizing Your Data into Tables

While we have gone through our tables and thought about keys and data types, we have not explained how we are going to lay them out in the database. In formal database speak, the process by which you go about structuring data in a relational database to optimize both efficiency and consistency is known as database normalization. Unfortunately, the process is also normally presented in an extremely complicated and mathematically intense manner that makes it inaccessible to most programmers.

In a nutshell, normalization is a multi-step process by which we try to organize and reorganize our data in phases, each called a normalization form, to achieve an optimal layout. While a full treatment of the topic is beyond the scope of this book, we will present you with the key features here to help you design your databases.

For our examples in the next few subsections, we will assume we are working with an online message board system where users can write messages in various forums and other users can reply to those messages. After sitting down and thinking about it, we decided that the following is the set of data we would like to represent in a messages table:

Field
Example Value
Description
message_id
4593492
This is a unique identifier for this message. We will use this field as the primary key for the table.
author_name
Jacob Smith
This gives us the name of the article's author.
author_email
"Jacob@jacobmsmith.com"
This is the e-mail address of the article's author.
date_posted
"2005-06-12 16:23:44.18"
This identifies when (both date and time) this message was posted.
title
"I like to eat chocolate!!"
This is the title of the message.
body
"Chocolate is the awes0m3 TEXT d000dz!!!11!one!!"
This is the body of the message in a field.
forum_name
"Favorite Activities"
This is the forum in which the message was posted.
replies
long text list of replies, including poster, date, and contents.
This is a long TEXT field with replies to this message as the contents, separated by some special sequence. These replies have the reply poster's name and e-mail address and the title, date, and contents of the reply.

Eliminate Non-Scalar Data from Tables

Non-scalar data, otherwise known as vector data or repeating groups, is data for which more than one value ends up being represented per row in our table.

For instance, in our message board database, we see that the replies field is actually a repeating series of reply entities. The general rule for dealing with repeating groups is to put the items in their own table and have them refer to the table from which they were excised. For example, we might choose to create a new table for all replies in the system, with a reply_id, author_name, author_email, date_posted, title, body, and message_id to which this reply is in reference. This new table replaces the old replies field in our Messages table, and the new table refers to its parent table via the new message_id field. Many of the changes we will discuss for the table of messages also apply to the table of replies (see Figure 9-3).

Figure 9-3. Elimination of non-scalar data from our table.

Organizing Your Data

Eliminate Data Not Directly Related to the Key

In our message table, we have a field called author_email. This field does not contain data that is relevant to our message since the author_name uniquely identifies the author of the message. What we are really saying here is that we want to have a list of users and their e-mail address (and perhaps other information) and just refer to an entry from that table in our message table. Thus, we create a new table of users of our messaging system, and in that table include their name and e-mail address. We also have a primary key assigning to each of them a user_id. Our table system is now starting to look like that shown in Figure 9-4.

Figure 9-4. Elimination of unrelated data from our tables.

Organizing Your Data

Use Foreign Keys for Unrelated Data

Having performed the previous manipulations, we now see that our message table only contains fields that are directly related to the message itself. However, two of those fields, author_name and forum, are merely strings with the name of the author and forum to which the message was posted.

This leaves us open to all sorts of trouble, such as spelling mistakes or users or forums changing names, being removed, and so on. It's easy to imagine two messages with author_name field values of "Jonathan Smith" and "Johnathan Smith." We would be forced to ask whether these were the same author with a spelling mistake or two different users. We would be in even more trouble if we had two users (or forums) with the same name.

To eliminate these problems, we will make sure that all data fields not related to the message are keys from other tables, or foreign keys. Thus, we will have a table listing all of the available forums in the system and refer to that table in our message table. Also, instead of using an author_name field, we will refer to the user_id field in our newly created Users table.

This leaves us with tables as in Figure 9-5.

Figure 9-5. Our final database tables.

Organizing Your Data

Too Much Normalization?

For exceptionally large or complicated databases, there is the possibility for normalization to leave us with a large number of tables. In rare and extreme cases, some relational database management systems begin to have problems managing the data. In those cases, some people begin to perform denormalization, where they undo certain parts of the normalization they have just performed to "help" with database efficiency.

We are not exaggerating when we say exceptionally large databasesnone of the databases we design or describe in this book is going to be so big that denormalization is going to be a necessity. We should always let the database engine do all the optimizations it wants, given that it is likely to be far better at it than us.

Indexes for Faster Searching

Even though we created a primary key to search quickly and efficiently for items in our tables, we often find ourselves wanting to search for other items in our row data. For example, if we were to create a table of our web application's users, it is likely that we would want to search for a user by name. When a user tried to log in, he would give us a name and a password, and we would have to find this record in our table.

If there are fields other than the primary key field in which we expect to frequently search for values, we can speed up this operation by creating an index. An index is an additional data structure that the database server associates with the table to help speed up searches for specific values on that column. You can specify an index on a single column, such as user_name, or on multiple columns, such as last_name, first_name. The DBMS constantly maintains and updates the index as you add, update, or remove records. They are designed so that the database server can find the results sought without having to look through all the rows in the table. In the absence of such an index, the database server has to search through all of the data to find any matching records.

However, indexes do not come free. They can take up a decent amount of space on the server's hard disk and slow down data insertion and deletion operations since the indexes need to be updated along with the table's row data. Thus, database designers rarely create an index for every column in their tables. Instead, they look at their applications and how they are searching through the data and create indexes on the columns that are being used most frequently.

Most database servers automatically create an index for columns marked as primary keys. Most of these software packages limit the number of indexes that can be created on a single table, though the number is at least 16.

Table of Contents