Difference between revisions of "Database patterns"

From ProgClub
Jump to: navigation, search
Line 7: Line 7:
 
Our schemas are acyclic. Tables can be self-referential, otherwise their foreign keys must point to a previously defined table. If you find yourself with a cyclic situation you can fix it by splitting your two tables into three tables.
 
Our schemas are acyclic. Tables can be self-referential, otherwise their foreign keys must point to a previously defined table. If you find yourself with a cyclic situation you can fix it by splitting your two tables into three tables.
  
We have four broad [[#Table types|types of data]], being [[#Metadata|metadata]], core mutable [[#Application data|application data]] (in [[#data|data]] and [[#hist|hist]] tables), [[#Log data|log data]], and then [[#Elemental data|everything else]].
+
We have four broad [[#Table types|types of data]], being [[#Metadata|metadata]] (in [[#about|about]] tables), core mutable [[#Application data|application data]] (in [[#data|data]] and [[#hist|hist]] tables), [[#Log data|log data]] (in [[#log|log]] tables), and then [[#Elemental data|everything else]].
  
 
Tables of the various [[#Patterns|patterns]] can only depend on other types of tables with a higher precedence. So for example application [[#data|data]] cannot refer to [[#log|log]] data because 'log' tables come after 'data' tables. When you are declaring your schema you should declare your tables in the [[#Canonical order|canonical order]].
 
Tables of the various [[#Patterns|patterns]] can only depend on other types of tables with a higher precedence. So for example application [[#data|data]] cannot refer to [[#log|log]] data because 'log' tables come after 'data' tables. When you are declaring your schema you should declare your tables in the [[#Canonical order|canonical order]].

Revision as of 14:28, 22 March 2020

This page documents the various types of database design patterns supported by the PHPBOM library and used in our applications.

Introduction

Here we document the type of tables we have in our databases and include some notes about how they are accessed and used.

Our schemas are acyclic. Tables can be self-referential, otherwise their foreign keys must point to a previously defined table. If you find yourself with a cyclic situation you can fix it by splitting your two tables into three tables.

We have four broad types of data, being metadata (in about tables), core mutable application data (in data and hist tables), log data (in log tables), and then everything else.

Tables of the various patterns can only depend on other types of tables with a higher precedence. So for example application data cannot refer to log data because 'log' tables come after 'data' tables. When you are declaring your schema you should declare your tables in the canonical order.

Many of our table patterns implement a parent pattern, being the elemental data pattern.

Notes

Naming conventions

Tables are named as 't_{pattern}_{name}'.

Columns are named as 'a_{name}'. Note that column names must be globally unique, not simply unique per table. The exception to this rule is if you have a foreign key without a more specific name, in that case the foreign key can have the same name as its referent.

The 't_{pattern}_' and 'a_' prefixes are a form of Hungarian notation.

Note: to complicate things somewhat the PHPBOM software provides support for a table prefix to use for your schemas so you can squeeze an application into a shared database if need be. The table names thus become '{prefix}t_{pattern}_{name}', because who doesn't love complexity?

So in PHPBOM parlance there are three types of table name:

Type Name Note
full table name '{prefix}t_{pattern}_{name}' $prefix can be empty!
table name 't_{pattern}_{name}' used within the software to identify a table
short table name '{name}' used to enforce as globally unique

Note that short table names must be globally unique across all applicable schemas defined for your application (an application can have multiple schemas, typically two, the PHPBOM schema and the application schema).

Canonical order

Tables should be declared in the following order. Tables with a lower ordinal cannot reference tables with a higher ordinal.

Ordinal Pattern
1 about
2 lookup
3 ident
4 particle
5 token
6 name
7 string
8 jzon
9 zlob
10 upload
11 structure
12 data
13 hist
14 log
15 other

a_interaction_id

We have a standard ident table called t_ident_interaction which contains the serial number for an application interaction (e.g. a HTTP request) in its a_interaction_id column. When most elemental data is registered (not lookup data) we log the a_interaction_id of the interaction that did the registration. Using the a_interaction_id and the t_ident_interaction table we can deduce the time of the registration along with other audit information associated with the interaction.

Foreign keys

We don't configure foreign keys, thus they are not enforced. This is potentially controversial, but it has some benefits.

If you have a foreign key which you want to be 'null' you can set it to zero. This means that all foreign keys in our schema are effectively nullable. Note that for lookup tables the zero value is modelled in the lookup table but not as a valid value in the application, foreign keys to lookup data should never be zero (actually no foreign keys should really ever be zero, but this is not enforced by the database).

Generally it's best if you left join to referenced tables in case the relevant data is missing.

Banning foreign keys simplifies access, improves performance, and reduces storage overhead. It also improves some audit situations wherein log data can refer to subsequently removed records without needing to be deleted with the relevant primary key data.

You can leave orphan data in the database if you want. You can run a garbage collection process if orphan data bothers you, or just clean up after yourself as you go along.

Personally I like leaving the orphan data in place because it simplifies (and enables) "undelete". If you restore the parent record it automatically has all of its children restored (because they never went away). The orphan children don't create a problem by remaining latent because all access is via joins from a parent record, in which case orphan data are irrelevant. It's trivial to detect orphan data if you find yourself in a situation where you need to know.

I guess I should stress that just because a foreign key *can* be zero doesn't mean that it *should* be zero. One place where I find it expedient to use zero values is when I'm working on a new feature that is not yet complete. Such work in progress can use zero values for foreign keys while the code that calculates what they should really be hasn't been written yet. The general rule is that if you find a foreign key with a zero value there is something wrong and you should investigate and resolve. Ordinarily you should never have zero foreign keys.

Null values

I almost never allow a field to be null. For foreign keys I use the zero value to represent null. For other fields I typically use an empty string or zero value. If I really do need to model something that can be "null" then I implement a table for it and a missing row models the "null" value.

Banning nulls simplifies applications, reduces storage costs, and improves performance.

Note that most elemental data will define an 'empty' (or 'zero') record for ID = 1 (but not token or upload tables, as they don't have meaningful 'empty' values). The lookup data defines the 'empty' record with ID = 0.

Shredding

If you really do need to completely remove some data you can "shred" it. Shredding is dangerous, but effective. When shredding the "no deletions" for elemental data rule can be broken. That's not a problem because you will also delete all data that referred to shredded elemental data. After shredding to really finish the job you need to drop the caches on your application servers. Note that orphan data (such as in hist tables) can also optionally be removed (otherwise it should preferably be updated with valid data) during a shred. Depending on the sensitivity of the data being eradicated you might need to take into account the existence of database backups (or application server backups) which may still contain references to the shredded values. Shredding is revisionist and can rewrite history, so you might like to leave an audit trail if you do support shredding.

Table types

Metadata

All database and application metadata are stored in about tables and accessed via a raw connection. Your application can use the metadata to figure out if the database is in an operable state or not, viz the installed schema revisions and/or online status.

Application data

Your core mutable application data should mostly be stored in data tables. Each data table has a corresponding hist table which logs changes made to data in data tables.

Data in data and hist tables are accessed via a trn connection.

Log data

For writing logs to log tables. Logs are ordinarily insert-only (no updates or deletes). If your log table refers to a mutable data table you might like to copy the value of salient fields at the time the log was made, as the 'data' table might be modified after the logs have been written.

Elemental data

Everything that's not metadata, application data or log data implements the elemental pattern.

Elemental data is accessed via a raw connection.

The elemental data pattern is basically where we have a small thing, like say an DBT_UINT8, that points to a big thing, like say a DBT_HASH_BIN. The characteristics of elemental data are that they are insert-only (updates and deletions are not allowed), and this property lends them to being manageable on a raw connection which uses low transaction isolation and automatic commits. Further, because inserts cannot be rolled back, we can cache all elemental mappings, so once an application server knows the short key for a corresponding long key (or the long key for a corresponding short key) it can cache that fact and in future does not need to request the values from the database again.

Note that most (but not all) elemental tables log the a_interaction_id for the interaction in which they were registered. Items only get registered once and cannot ordinarily be changed after they have been registered.

The table patterns which implement the elemental pattern are:

pattern small thing big things a_interaction_id note
lookup enum char, code, name, etc. not logged lookup data is read-only and packaged with schema
ident id creation date, potentially other data logged
particle id some larger integer or usually a case-sensitive string logged
token id DBT_TOKEN logged
name id some string value (maybe case-sensitive or not), a lowercase value is also logged logged be careful with case-sensitivity
string id a string value and a DBT_HASH_BIN logged the DBT_HASH_BIN unsures case-sensitivity
jzon id a blob and a DBT_HASH_BIN logged
zlob id a blob and a DBT_HASH_BIN logged
upload id a DBT_HASH_BIN and other details collected with the upload such as file name logged data logged during upload should not be altered at a later date, use application data if that's what you need
structure id a string value, a DBT_HASH_BIN, and relevant components logged the DBT_HASH_BIN unsures case-sensitivity

Patterns

about

The about tables contain database metadata and are accessed via the 'raw' connection.

An about table is named 't_about_{name}'.

lookup

The lookup tables contain hard-coded reference data and are accessed via the 'raw' connection.

A lookup table is named 't_lookup_{name}'.

A lookup table must contain a 'a_{name}_enum' column which identifies a row. The 'a_{name}_enum' column is usually be a DBT_UINT8.

The 'a_{name}_enum' must be the primary key.

If an optional 'a_{name}_char' column exists it must be of type DBT_ASCII_CHAR_BIN. The PHPBOM library will use lowercase char codes and applications can use uppercase char codes.

If you're extending a PHPBOM lookup table you should code from the max enum down as PHPBOM will code from the min enum up.

ident

The ident tables allocate unique identifiers (serial numbers) and are accessed via the 'raw' connection.

An ident table is named 't_ident_{name}'.

An ident table has a 'a_{name}_id' column which is usually a DBT_ID24 or DBT_ID32.

Typically an ident table will collect a timestamp for its creation date, and will usually log an a_interaction_id.

particle

The particle tables contain particles and are accessed via the 'raw' connection. A particle table maps a wide column to a narrow column.

A particle table is named 't_particle_{name}'.

A particle table must contain a 'a_{name}_id' column which contains the narrow ID for the particle. The 'a_{name}_id' column is usually a DBT_ID24 or DBT_ID32.

A particle table must contain a 'a_{name}' column which is the wide value. When particle 'a_{name}' columns are strings they should be case-sensitive. If you want a case-insensitive string use a name table instead.

token

The token tables contain tokens and are accessed via the 'raw' connection.

A token table is named 't_token_{name}'.

A token table must contain a 'a_{name}_id' column which contains the ID for the token. The 'a_{name}_id' column is usually a DBT_ID24 or DBT_ID32.

The 'a_{name}_id' column must be the primary key.

A token table must contain a 'a_{name}' column which contains the token. The 'a_{name}' column is usually a DBT_TOKEN.

There should be a unique index on the 'a_{name}' column.

name

The name tables contain names and are accessed via the 'raw' connection. Names are strings, potentially (although not necessarily) case-insensitive strings.

A name table is named 't_name_{name}'.

A name table must contain a 'a_{name}_id' column which contains the ID for the name. The 'a_{name}_id' column is usually a DBT_ID24 or DBT_ID32.

A name table must contain a 'a_{name}' column which contains the name. The 'a_{name}' column is usually a DBT_ASCII_CI or DBT_UTF8_CI but it could be DBT_ASCII_BIN or DBT_UTF8_BIN depending on your requirements.

A name table must contain a 'a_{name}_lower' column which contains the lowercase name. The 'a_{name}_lower' column is usually a DBT_ASCII_BIN or DBT_UTF8_BIN. Ordinarily there will be a unique index on the 'a_{name}_lower' column, but you can make this a non-unique index depending upon your requirements.

string

The string tables contain strings and are accessed via the 'raw' connection. A string table maps a wide column to a narrow column. You would usually use a string table if a particle table is too small.

A string table is named 't_string_{name}'.

A string table must contain a 'a_{name}_id' column which contains the narrow ID for the string. The 'a_{name}_id' column is usually a DBT_ID24 or DBT_ID32.

A string table must contain a 'a_{name}' column which is the wide string value. The 'a_{name}' column should be case-insensitive. Case-sensitivity is ensured by the 'a_{name}_hash_bin' column, so the 'a_{name}' column can use a case-insensitive collation.

A string table must contain a 'a_{name}_hash_bin' column which contains the DBT_HASH_BIN binary hash of the string value.

jzon

The jzon tables contain compressed JSON and are accessed via the 'raw' connection.

A jzon table is named 't_jzon_{name}'.

A jzon table must contain a 'a_{name}_id' column which contains the narrow ID for the compressed data. The 'a_{name}_id' column is usually a DBT_ID24 or DBT_ID32.

A jzon table must contain a 'a_{name}' column which is the compressed binary data. The 'a_{name}' column is usually a DBT_MLOB.

A jzon table must contain a 'a_{name}_hash_bin' column which contains the DBT_HASH_BIN binary hash of the compressed data.

See the inflate.php program if you need to work with data in this format.

zlob

The zlob tables contain compressed blobs and are accessed via the 'raw' connection.

A zlob table is named 't_zlob_{name}'.

A zlob table must contain a 'a_{name}_id' column which contains the narrow ID for the compressed blob. The 'a_{name}_id' column is usually a DBT_ID24 or DBT_ID32.

A zlob table must contain a 'a_{name}' column which is the compressed binary data. The 'a_{name}' column is usually a DBT_MLOB.

A zlob table must contain a 'a_{name}_hash_bin' column which contains the DBT_HASH_BIN binary hash of the compressed blob.

See the inflate.php program if you need to work with data in this format.

upload

The upload tables contain data about uploaded files and are accessed via the 'raw' connection.

A upload table is named 't_upload_{name}'.

An upload table has 'a_{name}_id', 'a_{name}_hash_bin', 'a_{name}_file_name', 'a_{name}_file_type' and 'a_file_type_enum' columns. Usually the actual file will be stored in the file-system using the 'a_{name}_hash_hex' value for the file name.

structure

The structure tables contain structures and are accessed via the 'raw' connection. A structure table maps a wide column to a narrow column and constituent component columns.

A structure table is named 't_structure_{name}'.

A structure table must contain a 'a_{name}_id' column which contains the narrow ID for the structure. The 'a_{name}_id' column is usually a DBT_ID24 or DBT_ID32.

A structure table must contain a 'a_{name}' column which is the wide string value. The 'a_{name}' column should be case-insensitive. Case-sensitivity is ensured by the 'a_{name}_hash_bin' column, so the 'a_{name}' column can use a case-insensitive collation.

A structure table must contain a 'a_{name}_hash_bin' column which contains the DBT_HASH_BIN binary hash of the wide string value.

Other component columns will exist. For example for a 't_structure_email_address' there will be 'a_email_address_mailbox' and 'a_email_address_host' columns.

data

The data tables contain mutable application data and are accessed via the 'trn' connection. All data tables have a corresponding hist table.

A data table is named 't_data_{name}'.

A data table has a 'a_{name}_id' column which is is usually a DBT_ID24 or DBT_ID32.

A data table has a 'a_{name}_version' column which is usually a reference to t_ident_interaction.a_interaction_id.

A data table will have other columns too.

hist

The hist tables contain logs of changes to mutable application data stored in data tables and are accessed via the 'trn' connection. All hist tables have a corresponding data table.

A hist table is named 't_hist_{name}'.

A hist table has a 'a_{name}_hist_id' primary key which is is usually a DBT_ID32.

A hist table has a 'a_interaction_id' column, which is a reference to the t_ident_interaction.a_interaction_id in which the data was modified.

A hist table has a 'a_crud_enum' column which is a reference to the t_lookup_crud.a_crud_enum that indicates the CRUD operation which affected the data.

A hist table then contains a copy of all of the columns from its corresponding data table.

log

A log table contains log data and are accessed via the 'raw' connection.

A log table is named 't_log_{name}'.

There are no rules for the content of log tables, but you should usually (but not always) try to copy data into logs by value rather than by reference, because referenced data might change after the logs were written.

other

If we're dealing with tables which can't be categorised into one or more of the above patterns we call it other data and there are no restrictions on its format. The connection to use for 'other' tables is unspecified.

Database connections

In an application there are usually two separate connections to the database, being the 'raw' connection and the 'trn' connection. The database pattern indicates which connection to use when operating on any given type of table. Generally the 'log' connection is used for everything except for data and hist tables which use the 'trn' connection.

raw connection

The 'raw' connection uses the READ UNCOMMITTED transaction isolation level and (usually) auto-commits.

Note: as a performance improvement you can begin a transaction at the end of an interaction when the only thing left to do is write some logs. Beginning a transaction will improve the performance of the logging, and you can commit the transaction when you're done.

trn connection

The 'trn' connection uses the SERIALIZABLE transaction isolation level. Programmers must manually begin and commit (or rollback) their transactions. The PHPBOM library provides support for nested transactions.

Database types

There are many more supported database types which can be used for column definitions, but some key types are listed here.

DBT_ID8

A DBT_ID8 is an auto generated integer from 1 to 255.

DBT_ID16

A DBT_ID16 is an auto generated integer from 1 to 65,535.

DBT_ID24

A DBT_ID24 is an auto generated integer from 1 to 16,777,215.

DBT_ID32

A DBT_ID32 is an auto generated integer from 1 to 4,294,967,295.

DBT_ID64

A DBT_ID64 is an auto generated integer from 1 to 264-1.

DBT_UINT8

A DBT_UINT8 is an integer from 0 to 255.

DBT_UINT16

A DBT_UINT16 is an integer from 1 to 65,535.

DBT_UINT24

A DBT_UINT24 is an integer from 1 to 16,777,215.

DBT_UINT32

A DBT_UINT32 is an integer from 1 to 4,294,967,295.

DBT_UINT64

A DBT_UINT64 is an integer from 1 to 264-1.

DBT_ASCII_CHAR_BIN

A DBT_ASCII_CHAR_BIN is a single case-sensitive ASCII character.

DBT_HASH_BIN

A DBT_HASH_BIN is a case-sensitive 28-byte binary string containing a SHA512/224 hash value.

DBT_HASH_HEX

A DBT_HASH_HEX is a case-insensitive 56-character hexadecimal string containing a SHA512/224 hash value.

DBT_TOKEN

A DBT_TOKEN is a case-sensitive 32-character alphanumeric ASCII string.

DBT_CREATED_ON

A DBT_CREATED_ON is the timestamp of when the record was created.

DBT_UPDATED_ON

A DBT_UPDATED_ON is the timestamp of when the record was last updated.