Database patterns

From ProgClub

This page documents the various types of database design patterns supported by the PHPBOM library and used in our applications.

Introduction

Here we document the type of tables we have in our databases and include some notes about how they are accessed and used.

Our schemas are acyclic. Tables can be self-referential, otherwise their foreign keys must point to a previously defined table. If you find yourself with a cyclic situation you can fix it by splitting your two tables into three tables.

We have four broad types of data, being metadata (in about tables), core mutable application data (in entity and history tables), log data (in log tables), and then everything else.

Tables can only depend on tables of patterns with a higher precedence, i.e. patterns that come earlier in the canonical order. So, for example, entity data cannot refer to log data, because 'log' tables come after 'entity' tables. When declaring your schema you should declare your tables in the canonical order; within patterns of the same type, alphabetical or temporal order is usually best.

Many of our table patterns implement a parent pattern, being the elemental data pattern.

Notes

Naming conventions

Tables are named as 't_{pattern}_{name}'.

Columns are named as 'a_{name}'. Note that column names must be globally unique, not simply unique per table. The exception to this rule is if you have a foreign key without a more specific name, in that case the foreign key can have the same name as its referent.

The 't_{pattern}_' and 'a_' prefixes are a form of Hungarian notation.

Note: to complicate things somewhat the PHPBOM software provides support for a table prefix to use for your schemas so you can squeeze an application into a shared database if need be. The table names thus become '{prefix}t_{pattern}_{name}', because who doesn't love complexity?

So in PHPBOM parlance there are three types of table name:

Type              Name                          Note
full table name   '{prefix}t_{pattern}_{name}'  $prefix can be empty!
table name        't_{pattern}_{name}'          used within the software to identify a table
short table name  '{name}'                      enforced as globally unique

Note that short table names must be globally unique across all applicable schemas defined for your application (an application can have multiple schemas, typically two, the PHPBOM schema and the application schema).
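The three name forms can be illustrated with a small helper. This is a Python sketch; the function names are hypothetical, not part of PHPBOM:

```python
# Illustration of the three table name forms; function names are hypothetical.

def full_table_name(prefix: str, pattern: str, name: str) -> str:
    """'{prefix}t_{pattern}_{name}' -- the prefix may be empty."""
    return f"{prefix}t_{pattern}_{name}"

def table_name(pattern: str, name: str) -> str:
    """'t_{pattern}_{name}' -- identifies a table within the software."""
    return full_table_name("", pattern, name)

# '{name}' alone is the short table name, which must be globally unique.
assert full_table_name("myapp_", "entity", "app_user") == "myapp_t_entity_app_user"
assert table_name("entity", "app_user") == "t_entity_app_user"
```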

Naming foreign keys

Ideally foreign keys (including foreign keys which are also primary keys) will have the same name as the primary key column they refer to.

When there are multiple foreign keys to the same primary key, or when we have a self-referential table, then using the same name as the referent is not possible. In this case you should use the whole name being referred to and place a qualifier after, e.g.: two foreign keys to a_token_id on the same table:

  • a_token_id_prev
  • a_token_id_next

Naming for entity/history tables implemented by PHPBOM

In general applications and PHPBOM share namespaces, particularly for elemental data. An exception to this is made with entity tables and their corresponding history counterparts. For your application entity/history tables it is recommended that you prefix the name with your app code, to distinguish tables your application has defined from tables PHPBOM has defined.

To try and keep out of your way PHPBOM uses standard prefixes for entity/history tables. These prefixes (at the present time) are:

Prefix Note
app_ for application data
web_ for web related data
loc_ for geographic data

Canonical order

Tables should be declared in the following order. Tables with a lower ordinal cannot reference tables with a higher ordinal.

Ordinal Pattern
11 about
21 lookup
23 mflags
31 ident
41 particle
43 pstring
51 textdata
52 upload
53 zblob
55 ztext
56 zjson
61 structure
71 entity
73 history
81 log
91 view
99 other

a_interaction_id

We have a standard ident table called t_ident_interaction which contains the serial number for an application interaction (e.g. an HTTP request) in its a_interaction_id column. When most elemental data (though not lookup data) is registered, we log the a_interaction_id of the interaction that did the registration. Using the a_interaction_id and the t_ident_interaction table we can deduce the time of the registration along with other audit information associated with the interaction.

Foreign keys

We don't declare foreign key constraints, so they are not enforced by the database. This is potentially controversial, but it has some benefits.

If you have a foreign key which you want to be 'null' you can set it to zero. This means that all foreign keys in our schema are effectively nullable. Note that for lookup tables the zero value is modelled in the lookup table but not as a valid value in the application (except for flags).

Generally it's best if you left join to referenced tables in case the relevant data is missing.

Banning foreign key constraints simplifies access, improves performance, and reduces storage overhead. It also helps in some audit situations: log data can refer to subsequently removed records without itself needing to be deleted along with the relevant primary key data.

You can leave orphan data in the database if you want. You can run a garbage collection process if orphan data bothers you, or just clean up after yourself as you go along.

Personally I like leaving the orphan data in place because it simplifies (and enables) "undelete". If you restore the parent record it automatically has all of its children restored (because they never went away). The orphan children don't create a problem by remaining latent because all access is via joins from a parent record, in which case orphan data are irrelevant. It's trivial to detect orphan data if you find yourself in a situation where you need to know.

I guess I should stress that just because a foreign key *can* be zero doesn't mean that it *should* be zero. One place where I find it expedient to use zero values is when I'm working on a new feature that is not yet complete. Such work in progress can use zero values for foreign keys while the code that calculates what they should really be hasn't been written yet. The general rule is that if you find a foreign key with a zero value there is something wrong and you should investigate and resolve. Ordinarily you should never have zero foreign keys (except with flags).

Null values

I almost never allow a field to be null. For foreign keys I use the zero value to represent null. For other fields I typically use an empty string or zero value. If I really do need to model something that can be "null" then I implement a table for it and a missing row models the "null" value.

Banning nulls simplifies applications, reduces storage costs, and improves performance.

Note that most elemental data will define an 'empty' (or 'zero') record for ID = 1 (but not upload tables, as they don't have meaningful 'empty' values). The enumeration data defines the 'empty' record with ID = 0.

Shredding

If you really do need to completely remove some data you can "shred" it. Shredding is dangerous, but effective. When shredding, the "no deletions" rule for elemental data can be broken. That's not a problem because you will also delete all data that referred to the shredded elemental data. After shredding, to really finish the job, you need to drop the caches on your application servers. Note that orphan data (such as in history tables) can also optionally be removed during a shred (otherwise it should preferably be updated with valid data). Depending on the sensitivity of the data being eradicated you might need to take into account the existence of database backups (or application server backups) which may still contain references to the shredded values. Shredding is revisionist and can rewrite history, so you might like to leave an audit trail if you do support shredding.

With PHPBOM we have considered the possibility of shredding and specified it, but it is not yet implemented and is a low-priority use case for us at the moment.

Table types

Metadata

All database and application metadata are stored in about tables and accessed via a raw connection. Your application can use the metadata to figure out if the database is in an operable state or not, viz the installed schema revisions and/or online status.

Application data

Your core mutable application data should mostly be stored in entity tables. Each entity table has a corresponding history table which logs changes made to data in entity tables.

Data in entity and history tables are accessed via a trn connection.

Log data

Log tables are for writing logs. Logs are ordinarily insert-only (no updates or deletes). If your log table refers to a mutable entity table you might like to copy the values of salient fields at the time the log is made, as the 'entity' table might be modified after the logs have been written.

Elemental data

Everything that's not metadata, application data or log data implements the elemental pattern.

Elemental data is accessed via a raw connection.

The elemental data pattern is basically where we have a small thing, like, say, a DBT_UINT8, that points to a big thing, like, say, a DBT_HASH_BIN. The defining characteristic of elemental data is that it is insert-only (updates and deletions are not allowed), which lends it to being managed on a 'raw' connection using low transaction isolation and automatic commits. Further, because inserts cannot be rolled back, we can cache all elemental mappings: once an application server knows the short key for a corresponding long key (or the long key for a corresponding short key) it can cache that fact and in future does not need to request the values from the database again.
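The caching property described above can be sketched as a small bidirectional map. This is an illustrative Python sketch; the class and function names are not part of PHPBOM:

```python
class ElementalCache:
    """Bidirectional cache for immutable short-key <-> long-key mappings.

    Because elemental rows are insert-only and never change, a mapping
    learned once can be kept for the life of the application server.
    """
    def __init__(self, lookup_fn):
        self._lookup_fn = lookup_fn   # hits the database on cache miss
        self._by_id = {}
        self._by_value = {}

    def id_for(self, value):
        if value not in self._by_value:
            new_id = self._lookup_fn(value)  # registers or fetches the row
            self._by_value[value] = new_id
            self._by_id[new_id] = value
        return self._by_value[value]

    def value_for(self, id_):
        return self._by_id.get(id_)

# Stand-in for a database round trip, so we can count queries.
calls = []
def fake_db_lookup(value):
    calls.append(value)
    return len(calls)  # pretend auto-increment id

cache = ElementalCache(fake_db_lookup)
assert cache.id_for("alpha") == 1
assert cache.id_for("alpha") == 1      # served from cache, no second query
assert cache.value_for(1) == "alpha"
assert len(calls) == 1
```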

Note that most (but not all) elemental tables log the a_interaction_id for the interaction in which they were registered. Items only get registered once and cannot ordinarily be changed after they have been registered. Even though elemental data is immutable, insert-only data, we still put created_on and updated_on timestamps on our elemental tables. Ideally created_on and updated_on will always have the same value; if not, that might alert you to something having gone wrong (because elemental records should never be changed).

There are two broad types of elemental data depending on how values are identified: value elements and hashed elements. For value elements identity is determined by a value comparison. For hashed elements identity is determined by hashing.

Note that for value elements the values are usually case-sensitive. You can break this rule but only if you know what you are doing!
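For hashed elements, the hash stored in a DBT_HASH_BIN column is SHA-512/224 (see the database types section). A minimal Python sketch of computing such identities, assuming an OpenSSL build that exposes sha512_224 (most modern builds do):

```python
import hashlib

def hash_bin(value: bytes) -> bytes:
    """28-byte SHA-512/224 digest, as stored in a DBT_HASH_BIN column.

    Assumes the underlying OpenSSL exposes the sha512_224 algorithm.
    """
    return hashlib.new("sha512_224", value).digest()

def hash_hex(value: bytes) -> str:
    """56-character hex form, as stored in a DBT_HASH_HEX column."""
    return hash_bin(value).hex()

digest = hash_bin(b"Hello, world")
assert len(digest) == 28
assert len(hash_hex(b"Hello, world")) == 56
# Hashing is case-sensitive: different case, different identity.
assert hash_bin(b"hello, world") != digest
```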

The table patterns which implement the elemental pattern are:

pattern    identity  small thing  big things                                                note
lookup     value     enum, flags  char, code, name, etc.                                    a_interaction_id not logged; lookup data is read-only and packaged with the schema
ident      value     id           creation date, potentially other data                     a_interaction_id logged
particle   value     id           some larger integer or (usually case-sensitive) string    a_interaction_id logged
pstring    value     id           a case-sensitive string (as with particles) plus a case-insensitive copy for use in indexes and queries    a_interaction_id logged; be careful with case-sensitivity
textdata   hash      id           a string value and a DBT_HASH_BIN                         a_interaction_id logged; the DBT_HASH_BIN ensures case-sensitivity
upload     hash      id           a DBT_HASH_BIN and other details collected with the upload, such as file name    a_interaction_id logged; data logged during upload should not be altered at a later date, use an application entity if that's what you need
zblob      hash      id           a blob and a DBT_HASH_BIN                                 a_interaction_id logged
ztext      hash      id           a blob and a DBT_HASH_BIN                                 a_interaction_id logged
zjson      hash      id           a blob and a DBT_HASH_BIN                                 a_interaction_id logged
structure  hash      id           a string value, a DBT_HASH_BIN, and relevant components   a_interaction_id logged; the DBT_HASH_BIN ensures case-sensitivity

Patterns

about

The about tables contain database metadata and are accessed via the 'raw' connection.

An about table is named 't_about_{name}'.

lookup

The lookup tables contain hard-coded reference data and are accessed via the 'raw' connection. Ordinarily lookup data is defined with the schema and not modified thereafter.

A lookup table is named 't_lookup_{name}'.

Depending on whether it is an enumeration or flags table, a lookup table must contain either an 'a_{name}_enum' column or an 'a_{name}_flags' column to identify a row. The 'a_{name}_{enum,flags}' column is usually a DBT_UINT8.

The 'a_{name}_{enum,flags}' column must be the primary key.

enumeration

An enumeration table is a special type of lookup table which contains enumerated values (starting at 1) which indicate a state and associated data. For enumerations the zero value is usually invalid in the application, but it is still modelled in the lookup data as an instance of the Null Object Pattern.

If an optional 'a_{name}_char' column exists it must be of type DBT_ASCII_CHAR_BIN. The PHPBOM library will use lowercase char codes and applications can use uppercase char codes to distinguish themselves.

Unlike flags tables you can extend a PHPBOM enumeration with your own values, but if you do you should code from the max enum down as PHPBOM will code from the min enum up. Use an uppercase value in relevant 'char', 'code', etc. columns to distinguish your application defined values from PHPBOM defined values.

flags

A flags table is a special type of lookup table which associates up to 256 flags values with up to eight boolean columns. Unlike with enumeration tables a flags table can legitimately have a zero primary key (indicating no boolean values are set).

A flags table is named 't_mflags_{name}_flags'.

A flags table must contain a 'a_{name}_flags' enum column which identifies a row. The 'a_{name}_flags' column would typically be a DBT_UINT8. If more flags are needed they are probably better modelled with a new flags table than by extending the flags column to DBT_UINT16, although that is possible.

The 'a_{name}_flags' column must be the primary key.

Other columns, typically up to eight of them, are DBT_BOOL columns which model the associated flags as boolean values. Programmers can join to the flags lookup table to get the associated boolean column values for use in reports or queries.

Boolean columns are named 'a_{is,has,was,can}_{noun}_{adjective}' and in our PHP code there is generally a constant declared with the schema having the relevant power of 2:

define( 'FLAG_{IS,HAS,WAS,CAN}_{NOUN}_{ADJECTIVE}', pow( 2, x ) );

For the current full list of supported indicatives (is, has, etc) see the is_bool_name() function definition.
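The power-of-two scheme above can be illustrated in Python. The specific flag and column names here are hypothetical, not taken from PHPBOM:

```python
# Hypothetical flag constants mirroring the PHP define() pattern above.
FLAG_IS_USER_ACTIVE  = 2 ** 0   # 1
FLAG_HAS_USER_AVATAR = 2 ** 1   # 2
FLAG_CAN_USER_LOGIN  = 2 ** 2   # 4

def flags_to_bools(flags: int) -> dict:
    """Decompose a DBT_UINT8 flags value into its boolean columns."""
    return {
        "a_is_user_active":  bool(flags & FLAG_IS_USER_ACTIVE),
        "a_has_user_avatar": bool(flags & FLAG_HAS_USER_AVATAR),
        "a_can_user_login":  bool(flags & FLAG_CAN_USER_LOGIN),
    }

# Zero is a legitimate flags value: no booleans set.
assert flags_to_bools(0) == {
    "a_is_user_active": False, "a_has_user_avatar": False, "a_can_user_login": False,
}
# 5 = binary 101: first and third flags set.
assert flags_to_bools(5) == {
    "a_is_user_active": True, "a_has_user_avatar": False, "a_can_user_login": True,
}
```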

Generally it is very ill-advised to extend a PHPBOM flags lookup table. If you need extra flags defining your own flags table is probably your best option.

ident

The ident tables allocate unique identifiers (serial numbers) and are accessed via the 'raw' connection.

An ident table is named 't_ident_{name}'.

An ident table has a 'a_{name}_id' column which is usually a DBT_ID24 or DBT_ID32.

Typically an ident table will collect a timestamp for its creation date, and will usually log an a_interaction_id.

particle

The particle tables contain particles and are accessed via the 'raw' connection. A particle table maps a wide column to a narrow column and vice versa.

A particle table is named 't_particle_{name}'.

A particle table must contain a 'a_{name}_id' column which contains the narrow ID for the particle. The 'a_{name}_id' column is usually a DBT_ID24 or DBT_ID32 and is the primary key.

A particle table must contain a 'a_{name}' column which is the wide value. There must be a unique index on the 'a_{name}' column. When particle 'a_{name}' columns are strings they should be case-sensitive. If you want a case-insensitive string use a pstring table instead.

pstring

The pstring tables contain strings and are accessed via the 'raw' connection. Pstrings are case-sensitive strings (like particles), with an associated case-insensitive string (which can be used for indexes and/or queries).

A pstring table is named 't_pstring_{name}'.

A pstring table must contain a 'a_{name}_id' column which contains the ID for the pstring. The 'a_{name}_id' column is usually a DBT_ID24 or DBT_ID32 and is the primary key.

A pstring table must contain a 'a_{name}' column which contains the case-sensitive string value. The 'a_{name}' column is usually a DBT_ASCII_BIN or DBT_UTF8_BIN. There must be a unique index on the 'a_{name}' column.

A pstring table must contain a 'a_{name}_ci' column which contains the case-insensitive string value. The 'a_{name}_ci' column is usually a DBT_ASCII_CI or DBT_UTF8_CI column.

The only real difference between a pstring table and a particle table is that a pstring table always models strings (particles can model numbers and other scalar values) and pstring tables include a case-insensitive version of the value which can be used in indexes or queries.

textdata

The textdata tables contain strings and are accessed via the 'raw' connection. A textdata table maps a wide column to a narrow column. You would usually use a textdata table if a particle or pstring table is too small.

A textdata table is named 't_textdata_{name}'.

A textdata table must contain a 'a_{name}_id' column which contains the narrow ID for the string. The 'a_{name}_id' column is usually a DBT_ID24 or DBT_ID32.

A textdata table must contain a 'a_{name}' column which is the wide string value. The 'a_{name}' column should be case-insensitive. Case-sensitivity is ensured by the 'a_{name}_hash_bin' column, so the 'a_{name}' column can use a case-insensitive collation.

A textdata table must contain a 'a_{name}_hash_bin' column which contains the DBT_HASH_BIN binary hash of the string value.

upload

The upload tables contain data about uploaded files and are accessed via the 'raw' connection.

An upload table is named 't_upload_{name}'.

An upload table has 'a_{name}_id', 'a_{name}_hash_bin', 'a_{name}_file_name', 'a_{name}_file_type' and 'a_file_type_enum' columns. Usually the actual file will be stored in the file-system using the 'a_{name}_hash_hex' value for the file name.

zblob

The zblob tables contain compressed blobs and are accessed via the 'raw' connection.

A zblob table is named 't_zblob_{name}'.

A zblob table must contain a 'a_{name}_id' column which contains the narrow ID for the compressed blob. The 'a_{name}_id' column is usually a DBT_ID24 or DBT_ID32.

A zblob table must contain a 'a_{name}' column which is the compressed binary data. The 'a_{name}' column is usually a DBT_MLOB.

A zblob table must contain a 'a_{name}_hash_bin' column which contains the DBT_HASH_BIN binary hash of the compressed blob.

See the inflate.php program if you need to work with data in this format.
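To illustrate the zblob layout (a compressed payload plus a hash of the compressed bytes), here is a Python sketch. The column names are generic placeholders, the hash assumes the SHA-512/224 used by DBT_HASH_BIN, and the real tooling is inflate.php:

```python
import hashlib
import zlib

def make_zblob(data: bytes) -> dict:
    """Sketch of a zblob row: compressed payload plus hash of the compressed bytes.

    Column names here are illustrative placeholders, not PHPBOM's actual names.
    """
    compressed = zlib.compress(data)
    return {
        "a_blob": compressed,
        "a_blob_hash_bin": hashlib.new("sha512_224", compressed).digest(),
    }

def read_zblob(row: dict) -> bytes:
    """Decompress the stored payload."""
    return zlib.decompress(row["a_blob"])

row = make_zblob(b"x" * 1000)
assert read_zblob(row) == b"x" * 1000
assert len(row["a_blob"]) < 1000          # compression paid off for this input
assert len(row["a_blob_hash_bin"]) == 28  # DBT_HASH_BIN is 28 bytes
```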

ztext

The ztext tables contain compressed text data and are accessed via the 'raw' connection.

A ztext table is named 't_ztext_{name}'.

A ztext table must contain a 'a_{name}_id' column which contains the narrow ID for the compressed data. The 'a_{name}_id' column is usually a DBT_ID24 or DBT_ID32.

A ztext table must contain a 'a_{name}' column which is the compressed UTF-8 encoded textual data. The 'a_{name}' column is usually a DBT_MLOB.

A ztext table must contain a 'a_{name}_hash_bin' column which contains the DBT_HASH_BIN binary hash of the compressed data.

See the inflate.php program if you need to work with data in this format.

zjson

The zjson tables contain compressed JSON and are accessed via the 'raw' connection.

A zjson table is named 't_zjson_{name}'.

A zjson table must contain a 'a_{name}_id' column which contains the narrow ID for the compressed data. The 'a_{name}_id' column is usually a DBT_ID24 or DBT_ID32.

A zjson table must contain a 'a_{name}' column which is the compressed binary data. The 'a_{name}' column is usually a DBT_MLOB.

A zjson table must contain a 'a_{name}_hash_bin' column which contains the DBT_HASH_BIN binary hash of the compressed data.

See the inflate.php program if you need to work with data in this format.

structure

The structure tables contain structures and are accessed via the 'raw' connection. A structure table maps a wide column to a narrow column and constituent component columns.

A structure table is named 't_structure_{name}'.

A structure table must contain a 'a_{name}_id' column which contains the narrow ID for the structure. The 'a_{name}_id' column is usually a DBT_ID24 or DBT_ID32.

A structure table must contain a 'a_{name}' column which is the wide string value. The 'a_{name}' column should be case-insensitive. Case-sensitivity is ensured by the 'a_{name}_hash_bin' column, so the 'a_{name}' column can use a case-insensitive collation.

A structure table must contain a 'a_{name}_hash_bin' column which contains the DBT_HASH_BIN binary hash of the wide string value.

Other component columns will exist. For example for a 't_structure_email_address' there will be 'a_email_address_mailbox' and 'a_email_address_host' columns.

entity

The entity tables contain mutable application data and are accessed via the 'trn' connection. All entity tables have a corresponding history table.

An entity table is named 't_entity_{name}'.

An entity table has a 'a_{name}_id' column which is usually a DBT_ID24 or DBT_ID32.

An entity table has a 'a_{name}_version' column which is usually a reference to t_ident_interaction.a_interaction_id.

An entity table will have other columns too.

Typically entity tables use a prefix that indicates their subtype, e.g. 'app_', 'cfg_', 'web_', 'loc_', etc. Usually the subtype is omitted from the table's column names. E.g. t_entity_app_user.a_user_id. If you're worried about future namespace collisions picking an application prefix for your table and column names is probably the way to go. (2020-03-28 jj5 - THINK: do we want a registry for application prefixes to avoid collisions?)

history

The history tables contain logs of changes to mutable application data stored in entity tables and are accessed via the 'trn' connection. All history tables have a corresponding entity table.

A history table is named 't_history_{name}'.

A history table has a 'a_{name}_history_id' primary key which is usually a DBT_ID32.

A history table has a 'a_interaction_id' column, which is a reference to the t_ident_interaction.a_interaction_id in which the entity was modified.

A history table has a 'a_crud_enum' column which is a reference to the t_lookup_crud.a_crud_enum that indicates the CRUD operation which affected the entity.

A history table then contains a copy of all of the columns from its corresponding entity table.

log

Log tables contain log data and are accessed via the 'raw' connection.

A log table is named 't_log_{name}'.

There are no rules for the content of log tables, but you should usually (but not always) try to copy data into logs by value rather than by reference, because referenced data might change after the logs were written.

other

If we're dealing with a table which can't be categorised as any of the above patterns we call it other data, and there are no restrictions on its format. The connection to use for 'other' tables is unspecified.

Database connections

In an application there are usually two separate connections to the database, being the 'raw' connection and the 'trn' connection. The database pattern indicates which connection to use when operating on any given type of table. Generally the 'raw' connection is used for everything except entity and history tables, which use the 'trn' connection.

raw connection

The 'raw' connection uses the READ UNCOMMITTED transaction isolation level and (usually) auto-commits.

Note: as a performance improvement you can begin a transaction at the end of an interaction when the only thing left to do is write some logs. Beginning a transaction will improve the performance of the logging, and you can commit the transaction when you're done.

trn connection

The 'trn' connection uses the SERIALIZABLE transaction isolation level. Programmers must manually begin and commit (or rollback) their transactions. The PHPBOM library provides support for nested transactions.
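Nested transaction support is commonly implemented with a depth counter, so that only the outermost begin/commit pair actually touches the database. This sketch illustrates that general approach; it is not necessarily how PHPBOM implements it:

```python
class TrnConnection:
    """Sketch of nested-transaction support via a depth counter.

    Only the outermost begin()/commit() issue real BEGIN/COMMIT statements;
    inner pairs just adjust the depth. (One common approach, not a claim
    about PHPBOM's actual implementation.)
    """
    def __init__(self, execute):
        self._execute = execute   # callable that sends SQL to the server
        self._depth = 0

    def begin(self):
        if self._depth == 0:
            self._execute("BEGIN")
        self._depth += 1

    def commit(self):
        assert self._depth > 0, "commit without begin"
        self._depth -= 1
        if self._depth == 0:
            self._execute("COMMIT")

statements = []
conn = TrnConnection(statements.append)
conn.begin()
conn.begin()     # nested: no second BEGIN sent
conn.commit()    # inner commit: nothing sent yet
conn.commit()    # outermost commit: COMMIT sent
assert statements == ["BEGIN", "COMMIT"]
```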

Database types

There are many more supported database types which can be used for column definitions, but some key types are listed here.

Note: we model identity columns as integers, not unsigned integers. Usually identity columns have values starting at 1 (zero is unused) and incrementing from there. But sometimes it can be useful to multiply identity values (particularly when they are foreign keys) by minus one (-1) to make them negative while you're processing some values; that's why we use signed ints and not unsigned ints for identity columns.

DBT_ID8

A DBT_ID8 is an auto generated integer from 1 to 127.

DBT_ID16

A DBT_ID16 is an auto generated integer from 1 to 32,767.

DBT_ID24

A DBT_ID24 is an auto generated integer from 1 to 8,388,607.

DBT_ID32

A DBT_ID32 is an auto generated integer from 1 to 2,147,483,647.

DBT_ID64

A DBT_ID64 is an auto generated integer from 1 to 9,223,372,036,854,775,807.

DBT_UINT8

A DBT_UINT8 is an integer from 0 to 255.

DBT_UINT16

A DBT_UINT16 is an integer from 0 to 65,535.

DBT_UINT24

A DBT_UINT24 is an integer from 0 to 16,777,215.

DBT_UINT32

A DBT_UINT32 is an integer from 0 to 4,294,967,295.

DBT_UINT64

A DBT_UINT64 is an integer from 0 to 18,446,744,073,709,551,615 (2^64 - 1).

DBT_ASCII_CHAR_BIN

A DBT_ASCII_CHAR_BIN is a single case-sensitive ASCII character.

DBT_HASH_BIN

A DBT_HASH_BIN is a case-sensitive 28-byte binary string containing a SHA512/224 hash value.

DBT_HASH_HEX

A DBT_HASH_HEX is a case-insensitive 56-character hexadecimal string containing a SHA512/224 hash value.

DBT_TOKEN

A DBT_TOKEN is a case-sensitive 32-character alphanumeric ASCII string.

DBT_CREATED_ON

A DBT_CREATED_ON is the timestamp of when the record was created.

DBT_UPDATED_ON

A DBT_UPDATED_ON is the timestamp of when the record was last updated.

Data formats

Field data can be in one of two formats: application format or database format. By and large the database format and the application format are the same: strings are strings, ints are ints, and floats are floats. However there are some differences, as summarised in the following chart. Note that in PHPBOM conversion between formats is done automatically for application data by the active record implementation.

Database type Application format Database format
DBT_BOOL boolean int
DBT_DATETIME, DBT_DATETIME_UTC DateTime string
DBT_TIMEZONE DateTimeZone string
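As a rough illustration of the kind of conversion the active record performs, here is a Python sketch of the mappings in the chart above. The storage format string is an assumption for the example, not the format PHPBOM actually uses:

```python
from datetime import datetime

# Assumed storage format for the example; not necessarily PHPBOM's format.
DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"

def bool_to_db(value: bool) -> int:
    """DBT_BOOL: application boolean -> database int."""
    return 1 if value else 0

def bool_from_db(value: int) -> bool:
    """DBT_BOOL: database int -> application boolean."""
    return bool(value)

def datetime_to_db(value: datetime) -> str:
    """DBT_DATETIME: application datetime -> database string."""
    return value.strftime(DATETIME_FORMAT)

def datetime_from_db(value: str) -> datetime:
    """DBT_DATETIME: database string -> application datetime."""
    return datetime.strptime(value, DATETIME_FORMAT)

assert bool_to_db(True) == 1 and bool_from_db(0) is False
stamp = datetime(2020, 3, 28, 12, 0, 0)
assert datetime_from_db(datetime_to_db(stamp)) == stamp  # round trip
```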

Column size

Integer size

I get stuck on this all the time. My general disposition is to use only as much as you need. Then if it turns out that you need more, provision more. I guess I take the view that limits are good/helpful things in making sure situations don't get out of hand. So if I only need about 100 values I will use a byte.

Here's some info about integer sizes.

Bits Signed max Unsigned max
8 127 255
16 32,767 65,535
24 8,388,607 16,777,215
32 2,147,483,647 4,294,967,295
64 9,223,372,036,854,775,807 18,446,744,073,709,551,615
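These maxima follow directly from the bit width; a quick way to recompute them:

```python
def signed_max(bits: int) -> int:
    """Largest value of a signed two's-complement integer of the given width."""
    return 2 ** (bits - 1) - 1

def unsigned_max(bits: int) -> int:
    """Largest value of an unsigned integer of the given width."""
    return 2 ** bits - 1

for bits in (8, 16, 24, 32, 64):
    print(bits, f"{signed_max(bits):,}", f"{unsigned_max(bits):,}")

assert signed_max(24) == 8_388_607
assert unsigned_max(32) == 4_294_967_295
```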

String length

If you have a fixed length field use a fixed length. Examples include hashes and tokens.

Otherwise pick the first option that will work:

ASCII UTF-8 Note
255 63 length fits in one byte
767 190 maximum length MySQL can index
65000 16250 uses most of the row space supported by MySQL (65,535 bytes), but not all of it

If you can't find an appropriate size above use MEDIUMTEXT if 16 MiB is enough or LONGTEXT for up to 4 GiB.

For help visualising string lengths see my field/column length examples.

Glossary

ACID

ACID:

  • Atomicity
  • Consistency
  • Isolation
  • Durability

Concurrency control

Concurrency control

DCL

Data Control Language

DDL

Data Definition Language

DML

Data Manipulation Language

DRI

Declarative Referential Integrity

Luhn algorithm

The Luhn algorithm can be used to generate a check digit for an integer value.
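A textbook implementation of the Luhn check digit, for reference (this is the standard algorithm, not code from PHPBOM):

```python
def luhn_check_digit(number: int) -> int:
    """Compute the Luhn check digit for the given integer."""
    digits = [int(d) for d in str(number)]
    total = 0
    # Walk right to left; double every second digit, summing digit-wise
    # (subtracting 9 is equivalent to summing the digits of the product).
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return (10 - total % 10) % 10

def luhn_valid(number_with_check: int) -> bool:
    """Validate an integer whose last digit is a Luhn check digit."""
    n, check = divmod(number_with_check, 10)
    return luhn_check_digit(n) == check

# Classic worked example: 7992739871 takes check digit 3.
assert luhn_check_digit(7992739871) == 3
assert luhn_valid(79927398713)
assert not luhn_valid(79927398714)
```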

SQL

Structured Query Language (AKA: Scarcely Qualifies as a Language)

Transaction phenomena

These brief notes about potentially problematic transaction phenomena are taken from Joe Celko's SQL for Smarties: Advanced SQL Programming.

P0 Dirty write
P1 Dirty read
P2 Nonrepeatable read
P3 Phantom
P4 Lost update

Dirty write

T1 modifies a field. T2 further modifies that field before T1 commits. If T1 or T2 rolls back it is unclear what the correct field value is.

Dirty read

T1 modifies a row. T2 then reads that row before T1 commits. If T1 performs a rollback, T2 will have read a row that was never committed.

Nonrepeatable read

T1 reads a row. T2 then modifies or deletes that row. If T1 then attempts to reread the row it may receive the modified value or discover the row has been deleted.

Phantom

T1 reads the set of rows N that satisfies some condition. T2 then executes statements that generate one or more rows that satisfy the condition used by T1. If T1 then repeats the initial read with the same condition it obtains a different collection of rows.

Lost update

T1 reads a field, then T2 updates that field, then T1 (based on earlier data) updates the field. The update from T2 is lost.
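The lost-update interleaving can be simulated in a few lines; here plain variables stand in for the shared field and the two transactions:

```python
# Simulating the lost-update phenomenon described above.
balance = 100                 # shared field

t1_read = balance             # T1 reads 100
t2_read = balance             # T2 reads 100
balance = t2_read + 50        # T2 commits +50 -> 150
balance = t1_read + 10        # T1 commits +10 based on its stale read -> 110

# T2's update is lost: a serial execution would have produced 160.
assert balance == 110
```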