Database patterns

From ProgClub
Revision as of 01:15, 3 May 2020 by John (talk | contribs) (→‎ACID)
Jump to: navigation, search

This page documents the various types of database design patterns supported by the PHPBOM library and used in our applications.

Introduction

Here we document the type of tables we have in our databases and include some notes about how they are accessed and used.

Our schemas are acyclic. Tables can be self-referential, otherwise their foreign keys must point to a previously defined table. If you find yourself with a cyclic situation you can fix it by splitting your two tables into three tables.

We have four broad types of data, being metadata (in about tables), core mutable application data (in data and hist tables), log data (in log tables), and then everything else.

Tables of the various patterns can only depend on other types of tables with a higher precedence. So for example application data cannot refer to log data because 'log' tables come after 'data' tables. When you are declaring your schema you should declare your tables in the canonical order and within patterns of the same type usually alphabetical or temporal order is best when possible.

Many of our table patterns implement a parent pattern, being the elemental data pattern.

Notes

Naming conventions

Tables are named as 't_{pattern}_{name}'.

Columns are named as 'a_{name}'. Note that column names must be globally unique, not simply unique per table. The exception to this rule is if you have a foreign key without a more specific name, in that case the foreign key can have the same name as its referent.

The 't_{pattern}_' and 'a_' prefixes are a form of Hungarian notation.

Note: to complicate things somewhat the PHPBOM software provides support for a table prefix to use for your schemas so you can squeeze an application into a shared database if need be. The table names thus become '{prefix}t_{pattern}_{name}', because who doesn't love complexity?

So in PHPBOM parlance there are three types of table name:

Type Name Note
full table name '{prefix}t_{pattern}_{name}' $prefix can be empty!
table name 't_{pattern}_{name}' used within the software to identify a table
short table name '{name}' used to enforce as globally unique

Note that short table names must be globally unique across all applicable schemas defined for your application (an application can have multiple schemas, typically two, the PHPBOM schema and the application schema).

Naming foreign keys

Ideally foreign keys (including foreign keys which are also primary keys) will have the same name as the primary key column they refer to.

When there are multiple foreign keys to the same primary key, or when we have a self-referential table, then using the same name as the referent is not possible. In this case you should use the whole name being referred to and place a qualifier after, e.g.: two foreign keys to a_token_id on the same table:

  • a_token_id_prev
  • a_token_id_next

Naming for data/hist tables implemented by PHPBOM

In general applications and PHPBOM share namespaces, particularly for elemental data. An exception to this is made with data tables and their corresponding hist counterparts. It is recommended for your application data/hist tables that you prefix your app code to distinguish tables your application has defined from tables PHPBOM has defined.

To try and keep out of your way PHPBOM uses standard prefixes for data/hist tables. These prefixes (at the present time) are:

Prefix Note
app_ for application data
web_ for web related data
loc_ for geographic data

Canonical order

Tables should be declared in the following order. Tables with a lower ordinal cannot reference tables with a higher ordinal.

Ordinal Pattern
1 about
2 lookup
3 ident
4 particle
5 name
6 string
7 jzon
8 zlob
9 upload
10 structure
11 data
12 hist
13 log
14 other

a_interaction_id

We have a standard ident table called t_ident_interaction which contains the serial number for an application interaction (e.g. a HTTP request) in its a_interaction_id column. When most elemental data is registered (not lookup data) we log the a_interaction_id of the interaction that did the registration. Using the a_interaction_id and the t_ident_interaction table we can deduce the time of the registration along with other audit information associated with the interaction.

Foreign keys

We don't configure foreign keys, thus they are not enforced. This is potentially controversial, but it has some benefits.

If you have a foreign key which you want to be 'null' you can set it to zero. This means that all foreign keys in our schema are effectively nullable. Note that for lookup tables the zero value is modelled in the lookup table but not as a valid value in the application (except for flags).

Generally it's best if you left join to referenced tables in case the relevant data is missing.

Banning foreign keys simplifies access, improves performance, and reduces storage overhead. It also improves some audit situations wherein log data can refer to subsequently removed records without needing to be deleted with the relevant primary key data.

You can leave orphan data in the database if you want. You can run a garbage collection process if orphan data bothers you, or just clean up after yourself as you go along.

Personally I like leaving the orphan data in place because it simplifies (and enables) "undelete". If you restore the parent record it automatically has all of its children restored (because they never went away). The orphan children don't create a problem by remaining latent because all access is via joins from a parent record, in which case orphan data are irrelevant. It's trivial to detect orphan data if you find yourself in a situation where you need to know.

I guess I should stress that just because a foreign key *can* be zero doesn't mean that it *should* be zero. One place where I find it expedient to use zero values is when I'm working on a new feature that is not yet complete. Such work in progress can use zero values for foreign keys while the code that calculates what they should really be hasn't been written yet. The general rule is that if you find a foreign key with a zero value there is something wrong and you should investigate and resolve. Ordinarily you should never have zero foreign keys (except with flags).

Null values

I almost never allow a field to be null. For foreign keys I use the zero value to represent null. For other fields I typically use an empty string or zero value. If I really do need to model something that can be "null" then I implement a table for it and a missing row models the "null" value.

Banning nulls simplifies applications, reduces storage costs, and improves performance.

Note that most elemental data will define an 'empty' (or 'zero') record for ID = 1 (but not upload tables, as they don't have meaningful 'empty' values). The enumeration data defines the 'empty' record with ID = 0.

Shredding

If you really do need to completely remove some data you can "shred" it. Shredding is dangerous, but effective. When shredding the "no deletions" for elemental data rule can be broken. That's not a problem because you will also delete all data that referred to shredded elemental data. After shredding to really finish the job you need to drop the caches on your application servers. Note that orphan data (such as in hist tables) can also optionally be removed (otherwise it should preferably be updated with valid data) during a shred. Depending on the sensitivity of the data being eradicated you might need to take into account the existence of database backups (or application server backups) which may still contain references to the shredded values. Shredding is revisionist and can rewrite history, so you might like to leave an audit trail if you do support shredding.

Presently with PHPBOM we have considered the possibility of shredding and defined it, but it is not yet implemented and is a low-priority usecase for us at the moment.

Table types

Metadata

All database and application metadata are stored in about tables and accessed via a raw connection. Your application can use the metadata to figure out if the database is in an operable state or not, viz the installed schema revisions and/or online status.

Application data

Your core mutable application data should mostly be stored in data tables. Each data table has a corresponding hist table which logs changes made to data in data tables.

Data in data and hist tables are accessed via a trn connection.

Log data

For writing logs to log tables. Logs are ordinarily insert-only (no updates or deletes). If your log table refers to a mutable data table you might like to copy the value of salient fields at the time the log was made, as the 'data' table might be modified after the logs have been written.

Elemental data

Everything that's not metadata, application data or log data implements the elemental pattern.

Elemental data is accessed via a raw connection.

The elemental data pattern is basically where we have a small thing, like say an DBT_UINT8, that points to a big thing, like say a DBT_HASH_BIN. The characteristics of elemental data are that they are insert-only (updates and deletions are not allowed), and this property lends them to being manageable on a raw connection which uses low transaction isolation and automatic commits. Further, because inserts cannot be rolled back, we can cache all elemental mappings, so once an application server knows the short key for a corresponding long key (or the long key for a corresponding short key) it can cache that fact and in future does not need to request the values from the database again.

Note that most (but not all) elemental tables log the a_interaction_id for the interaction in which they were registered. Items only get registered once and cannot ordinarily be changed after they have been registered.

The table patterns which implement the elemental pattern are:

pattern small thing big things a_interaction_id note
lookup enum, flags char, code, name, etc. not logged lookup data is read-only and packaged with schema
ident id creation date, potentially other data logged
particle id some larger integer or usually a case-sensitive string logged
name id some string value (must be case-sensitive, as with particles), a case-insensitive value is also usually logged (for use in indexes and queries) logged be careful with case-sensitivity
string id a string value and a DBT_HASH_BIN logged the DBT_HASH_BIN unsures case-sensitivity
jzon id a blob and a DBT_HASH_BIN logged
zlob id a blob and a DBT_HASH_BIN logged
upload id a DBT_HASH_BIN and other details collected with the upload such as file name logged data logged during upload should not be altered at a later date, use application data if that's what you need
structure id a string value, a DBT_HASH_BIN, and relevant components logged the DBT_HASH_BIN unsures case-sensitivity

Patterns

about

The about tables contain database metadata and are accessed via the 'raw' connection.

An about table is named 't_about_{name}'.

lookup

The lookup tables contain hard-coded reference data and are accessed via the 'raw' connection. Ordinarily lookup data is defined with the schema and not modified thereafter.

A lookup table is named 't_lookup_{name}'.

Depending on whether it is an enumeration or flags table a lookup table must contain either a 'a_{name}_enum' column or an 'a_{name}_flags' column to identify a row. The 'a_{name}_{enum,flags}' column is usually be a DBT_UINT8.

The 'a_{name}_{enum,flags}' column must be the primary key.

enumeration

An enumeration table is a special type of lookup table which contains enumerated values (starting at 1) which indicate a state and associated data. For enumerations the zero value is usually an invalid value but it is also usually modelled in the lookup data as an instance of the Null Object Pattern.

If an optional 'a_{name}_char' column exists it must be of type DBT_ASCII_CHAR_BIN. The PHPBOM library will use lowercase char codes and applications can use uppercase char codes to distinguish themselves.

Unlike flags tables you can extend a PHPBOM enumeration with your own values, but if you do you should code from the max enum down as PHPBOM will code from the min enum up. Use an uppercase value in relevant 'char', 'code', etc. columns to distinguish your application defined values from PHPBOM defined values.

flags

A flags table is a special type of lookup table which associates up to 256 flags enum values with up to eight boolean columns. Unlike with enumeration tables a flags table can legitimately have a zero primary key (indicating no boolean values are set).

A flags table is named 't_lookup_{name}_flags'.

A flags table must contain a 'a_{name}_flags' enum column which identifies a row. The 'a_{namne}_flags' column would typically be a DBT_UINT8. If more flags are needed they would probably be best modelled with a new flags table than by extending the flags column to DBT_UINT16, although that is possible.

The 'a_{name}_flags' must be the primary key.

Other columns, typically up to eight of them, are DBT_BOOL columns which model the associated flags as boolean values. Programmers can join to the flags lookup table to get the associated boolean column values for use in reports or queries.

Boolean columns are named 'a_{is,has,was,can}_{noun}_{adjective}' and in our PHP code there is generally a constant declared with the schema having the relevant power of 2:

define( 'FLAG_{IS,HAS,WAS,CAN}_{NOUN}_{ADJECTIVE}', pow( 2, x ) );

For the current full list of supported indicatives (is, has, etc) see the is_bool_name() function definition.

Generally it is very ill-advised to extend a PHPBOM flags lookup table. If you need extra flags defining your own flags table is probably your best option.

ident

The ident tables allocate unique identifiers (serial numbers) and are accessed via the 'raw' connection.

An ident table is named 't_ident_{name}'.

An ident table has a 'a_{name}_id' column which is usually a DBT_ID24 or DBT_ID32.

Typically an ident table will collect a timestamp for its creation date, and will usually log an a_interaction_id.

particle

The particle tables contain particles and are accessed via the 'raw' connection. A particle table maps a wide column to a narrow column and vice versa.

A particle table is named 't_particle_{name}'.

A particle table must contain a 'a_{name}_id' column which contains the narrow ID for the particle. The 'a_{name}_id' column is usually a DBT_ID24 or DBT_ID32 and is the primary key.

A particle table must contain a 'a_{name}' column which is the wide value. There must be a unique index on the 'a_{name}' column. When particle 'a_{name}' columns are strings they should be case-sensitive. If you want a case-insensitive string use a name table instead.

name

The name tables contain names and are accessed via the 'raw' connection. Names are case-sensitive strings (like particles), potentially (although not necessarily) with an associated case-insensitive string (to be used for indexes and/or queries).

A name table is named 't_name_{name}'.

A name table must contain a 'a_{name}_id' column which contains the ID for the name. The 'a_{name}_id' column is usually a DBT_ID24 or DBT_ID32 and is the primary key.

A name table must contain a 'a_{name}' column which contains the name. The 'a_{name}' column is usually a DBT_ASCII_BIN or DBT_UTF8_BIN. There must be a unique index on the 'a_{name}' column.

A name table may contain a 'a_{name}_ci' column which contains the case-insensitive name. The 'a_{name}_ci' column is usually a DBT_ASCII_CI or DBT_UTF8_CI column.

The only real difference between a name table and a particle table is that a name table always models strings (particles can model numbers and other scalar values) and name tables may include a case-insensitive version of the value which can be used in indexes or queries.

string

The string tables contain strings and are accessed via the 'raw' connection. A string table maps a wide column to a narrow column. You would usually use a string table if a particle or name table is too small.

A string table is named 't_string_{name}'.

A string table must contain a 'a_{name}_id' column which contains the narrow ID for the string. The 'a_{name}_id' column is usually a DBT_ID24 or DBT_ID32.

A string table must contain a 'a_{name}' column which is the wide string value. The 'a_{name}' column should be case-insensitive. Case-sensitivity is ensured by the 'a_{name}_hash_bin' column, so the 'a_{name}' column can use a case-insensitive collation.

A string table must contain a 'a_{name}_hash_bin' column which contains the DBT_HASH_BIN binary hash of the string value.

jzon

The jzon tables contain compressed JSON and are accessed via the 'raw' connection.

A jzon table is named 't_jzon_{name}'.

A jzon table must contain a 'a_{name}_id' column which contains the narrow ID for the compressed data. The 'a_{name}_id' column is usually a DBT_ID24 or DBT_ID32.

A jzon table must contain a 'a_{name}' column which is the compressed binary data. The 'a_{name}' column is usually a DBT_MLOB.

A jzon table must contain a 'a_{name}_hash_bin' column which contains the DBT_HASH_BIN binary hash of the compressed data.

See the inflate.php program if you need to work with data in this format.

zlob

The zlob tables contain compressed blobs and are accessed via the 'raw' connection.

A zlob table is named 't_zlob_{name}'.

A zlob table must contain a 'a_{name}_id' column which contains the narrow ID for the compressed blob. The 'a_{name}_id' column is usually a DBT_ID24 or DBT_ID32.

A zlob table must contain a 'a_{name}' column which is the compressed binary data. The 'a_{name}' column is usually a DBT_MLOB.

A zlob table must contain a 'a_{name}_hash_bin' column which contains the DBT_HASH_BIN binary hash of the compressed blob.

See the inflate.php program if you need to work with data in this format.

upload

The upload tables contain data about uploaded files and are accessed via the 'raw' connection.

A upload table is named 't_upload_{name}'.

An upload table has 'a_{name}_id', 'a_{name}_hash_bin', 'a_{name}_file_name', 'a_{name}_file_type' and 'a_file_type_enum' columns. Usually the actual file will be stored in the file-system using the 'a_{name}_hash_hex' value for the file name.

structure

The structure tables contain structures and are accessed via the 'raw' connection. A structure table maps a wide column to a narrow column and constituent component columns.

A structure table is named 't_structure_{name}'.

A structure table must contain a 'a_{name}_id' column which contains the narrow ID for the structure. The 'a_{name}_id' column is usually a DBT_ID24 or DBT_ID32.

A structure table must contain a 'a_{name}' column which is the wide string value. The 'a_{name}' column should be case-insensitive. Case-sensitivity is ensured by the 'a_{name}_hash_bin' column, so the 'a_{name}' column can use a case-insensitive collation.

A structure table must contain a 'a_{name}_hash_bin' column which contains the DBT_HASH_BIN binary hash of the wide string value.

Other component columns will exist. For example for a 't_structure_email_address' there will be 'a_email_address_mailbox' and 'a_email_address_host' columns.

data

The data tables contain mutable application data and are accessed via the 'trn' connection. All data tables have a corresponding hist table.

A data table is named 't_data_{name}'.

A data table has a 'a_{name}_id' column which is is usually a DBT_ID24 or DBT_ID32.

A data table has a 'a_{name}_version' column which is usually a reference to t_ident_interaction.a_interaction_id.

A data table will have other columns too.

Typically data tables use a prefix that indicates their subtype, e.g. 'app_', 'cfg_', 'web_', 'loc_', etc. Usually the subtype is omitted from the table's column names. E.g. t_data_app_user.a_user_id. If you're worried about future namespace collisions picking an application prefix for your table and column names is probably the way to go. (2020-03-28 jj5 - THINK: do we want a registry for application prefixes to avoid collisions?)

hist

The hist tables contain logs of changes to mutable application data stored in data tables and are accessed via the 'trn' connection. All hist tables have a corresponding data table.

A hist table is named 't_hist_{name}'.

A hist table has a 'a_{name}_hist_id' primary key which is is usually a DBT_ID32.

A hist table has a 'a_interaction_id' column, which is a reference to the t_ident_interaction.a_interaction_id in which the data was modified.

A hist table has a 'a_crud_enum' column which is a reference to the t_lookup_crud.a_crud_enum that indicates the CRUD operation which affected the data.

A hist table then contains a copy of all of the columns from its corresponding data table.

log

A log table contains log data and are accessed via the 'raw' connection.

A log table is named 't_log_{name}'.

There are no rules for the content of log tables, but you should usually (but not always) try to copy data into logs by value rather than by reference, because referenced data might change after the logs were written.

other

If we're dealing with tables which can't be categorised into one or more of the above patterns we call it other data and there are no restrictions on its format. The connection to use for 'other' tables is unspecified.

Database connections

In an application there are usually two separate connections to the database, being the 'raw' connection and the 'trn' connection. The database pattern indicates which connection to use when operating on any given type of table. Generally the 'log' connection is used for everything except for data and hist tables which use the 'trn' connection.

raw connection

The 'raw' connection uses the READ UNCOMMITTED transaction isolation level and (usually) auto-commits.

Note: as a performance improvement you can begin a transaction at the end of an interaction when the only thing left to do is write some logs. Beginning a transaction will improve the performance of the logging, and you can commit the transaction when you're done.

trn connection

The 'trn' connection uses the SERIALIZABLE transaction isolation level. Programmers must manually begin and commit (or rollback) their transactions. The PHPBOM library provides support for nested transactions.

Database types

There are many more supported database types which can be used for column definitions, but some key types are listed here.

DBT_ID8

A DBT_ID8 is an auto generated integer from 1 to 255.

DBT_ID16

A DBT_ID16 is an auto generated integer from 1 to 65,535.

DBT_ID24

A DBT_ID24 is an auto generated integer from 1 to 16,777,215.

DBT_ID32

A DBT_ID32 is an auto generated integer from 1 to 4,294,967,295.

DBT_ID64

A DBT_ID64 is an auto generated integer from 1 to 264-1.

DBT_UINT8

A DBT_UINT8 is an integer from 0 to 255.

DBT_UINT16

A DBT_UINT16 is an integer from 1 to 65,535.

DBT_UINT24

A DBT_UINT24 is an integer from 1 to 16,777,215.

DBT_UINT32

A DBT_UINT32 is an integer from 1 to 4,294,967,295.

DBT_UINT64

A DBT_UINT64 is an integer from 1 to 264-1.

DBT_ASCII_CHAR_BIN

A DBT_ASCII_CHAR_BIN is a single case-sensitive ASCII character.

DBT_HASH_BIN

A DBT_HASH_BIN is a case-sensitive 28-byte binary string containing a SHA512/224 hash value.

DBT_HASH_HEX

A DBT_HASH_HEX is a case-insensitive 56-character hexadecimal string containing a SHA512/224 hash value.

DBT_TOKEN

A DBT_TOKEN is a case-sensitive 32-character alphanumeric ASCII string.

DBT_CREATED_ON

A DBT_CREATED_ON is the timestamp of when the record was created.

DBT_UPDATED_ON

A DBT_UPDATED_ON is the timestamp of when the record was last updated.

Data formats

Field data can be in one of two formats: application format or database format. In the large the database format and the application format are the same, strings are strings and ints are ints and floats are floats. However there are some differences as summarised in the following chart. Note that in PHPBOM conversion between formats is done automatically for application data by the active record implementation.

Database type Application format Database format
DBT_BOOL boolean int
DBT_DATETIME, DBT_DATETIME_UTC DateTime string
DBT_TIMEZONE DateTimeZone string

Column size

Integer size

I get stuck on this all the time. My general disposition is to use only as much as you need. Then if it turns out that you need more, provision more. I guess I take the view that limits are good/helpful things in making sure situations don't get out of hand. So if I only need about 100 values I will use a byte.

Here's some info about integer sizes.

Bits Signed max Unsigned max
8 127 255
16 32,767 65,535
24 8,388,607 16,777,215
32 2,147,483,647 4,294,967,295
64 9,223,372,036,854,775,808 18,446,744,073,709,551,616

String length

If you have a fixed length field use a fixed length. Examples includes hashes and tokens.

Otherwise pick the first option that will work:

ASCII UTF-8 Note
255 63 length fits in one byte
767 190 maximum length MySQL can index
65000 16250 uses most of the row space supported by MySQL (65,535 bytes), but not all of it

If you can't find an appropriate size above use MEDIUMTEXT if 16 MiB is enough or LONGTEXT for up to 4 GiB.

For help visualising string lengths see my field/column length examples.

Glossary

ACID

See: ACID:

  • Atomicity
  • Consistency
  • Isolation
  • Durablity

DCL

Data Control Language

DDL

Data Definition Language

DML

Data Manipulation Language

DRI

Declarative Referential Integrity

SQL

Structured Query Language (aka: Scarcely Qualifies as a Language)