Database patterns

This page documents the various types of database design patterns supported by the [[PHPBOM]] library and used in our applications.

= Introduction =

Here we document the types of tables we have in our databases and include some notes about how they are accessed and used.

Our schemas are acyclic. Tables can be self-referential; otherwise their foreign keys must point to a previously defined table. If you find yourself with a cyclic situation you can fix it by splitting your two tables into three tables.

We have four broad [[#Table types|types of data]], being [[#Metadata|metadata]] (in [[#about|about]] tables), core mutable [[#Application data|application data]] (in [[#entity|entity]] and [[#history|history]] tables), [[#Log data|log data]] (in [[#log|log]] tables), and then [[#Elemental data|everything else]].

Tables of the various [[#Patterns|patterns]] can only depend on other types of tables with a higher precedence. So for example application [[#entity|entity]] data cannot refer to [[#log|log]] data because 'log' tables come after 'entity' tables. When you are declaring your schema you should declare your tables in the [[#Canonical order|canonical order]]; within patterns of the same type, alphabetical or temporal order is usually best.

Many of our table patterns implement a parent pattern, being the [[#Elemental data|elemental data]] pattern.

= Notes =

== Naming conventions ==

Tables are named as 't_{pattern}_{name}'.

Columns are named as 'a_{name}'. Note that column names must be globally unique, not simply unique per table. The exception to this rule is if you have a foreign key without a more specific name; in that case the foreign key can have the same name as its referent.

The 't_{pattern}_' and 'a_' prefixes are a form of Hungarian notation.

Note: to complicate things somewhat, the [[PHPBOM]] software provides support for a table prefix for your schemas so you can squeeze an application into a shared database if need be. The table names thus become '{prefix}t_{pattern}_{name}', because who doesn't love complexity?

So in [[PHPBOM]] parlance there are three types of table name:

{|class="wikitable"
! Type
! Name
! Note
|-
| full table name
| '{prefix}t_{pattern}_{name}'
| $prefix can be empty!
|-
| table name
| 't_{pattern}_{name}'
| used within the software to identify a table
|-
| short table name
| '{name}'
| enforced as globally unique
|}

Note that short table names must be globally unique across all applicable schemas defined for your application (an application can have multiple schemas, typically two: the [[PHPBOM]] schema and the application schema).

=== Naming foreign keys ===

Ideally foreign keys (including foreign keys which are also primary keys) will have the same name as the primary key column they refer to.

When there are multiple foreign keys to the same primary key, or when we have a self-referential table, then using the same name as the referent is not possible. In this case you should use the whole name being referred to and place a qualifier after it, e.g. two foreign keys to a_token_id on the same table:

* a_token_id_prev
* a_token_id_next

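For illustration, a minimal sketch of the self-referential case in MySQL DDL (hypothetical table and column names; note that per [[#Foreign keys|Foreign keys]] we don't declare FOREIGN KEY constraints):

 CREATE TABLE t_entity_app_token (
   a_token_id      INT NOT NULL AUTO_INCREMENT, -- the referent primary key
   a_token_id_prev INT NOT NULL, -- qualified foreign key to a_token_id, zero for 'null'
   a_token_id_next INT NOT NULL, -- qualified foreign key to a_token_id, zero for 'null'
   PRIMARY KEY ( a_token_id )
 );
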
=== Naming for entity/history tables implemented by PHPBOM ===

In general applications and [[PHPBOM]] share namespaces, particularly for [[#Elemental data|elemental data]]. An exception to this is made for [[#entity|entity]] tables and their corresponding [[#history|history]] counterparts. For your application entity/history tables it is recommended that you use your app code as a prefix to distinguish tables your application has defined from tables PHPBOM has defined.

To try to keep out of your way [[PHPBOM]] uses standard prefixes for entity/history tables. These prefixes (at the present time) are:

{|class="wikitable sortable"
! Prefix !! Note
|-
| app_ || for application data
|-
| web_ || for web related data
|-
| loc_ || for geographic data
|}

  
 
== Canonical order ==

Tables should be declared in the following order. Tables with a lower ordinal cannot reference tables with a higher ordinal.

{|class="wikitable"
! Ordinal !! Pattern
|-
| 11 || [[#about|about]]
|-
| 21 || [[#lookup|lookup]]
|-
| 23 || [[#mflags|mflags]]
|-
| 31 || [[#ident|ident]]
|-
| 41 || [[#particle|particle]]
|-
| 43 || [[#pstring|pstring]]
|-
| 51 || [[#textdata|textdata]]
|-
| 52 || [[#upload|upload]]
|-
| 53 || [[#zblob|zblob]]
|-
| 55 || [[#ztext|ztext]]
|-
| 56 || [[#zjson|zjson]]
|-
| 61 || [[#structure|structure]]
|-
| 71 || [[#entity|entity]]
|-
| 73 || [[#history|history]]
|-
| 81 || [[#log|log]]
|-
| 91 || [[#view|view]]
|-
| 99 || [[#other|other]]
|}

  
== a_interaction_id ==

We have a standard ident table called t_ident_interaction which contains the serial number for an application interaction (e.g. a HTTP request) in its a_interaction_id column. When most elemental data is registered (not lookup data) we log the a_interaction_id of the interaction that did the registration. Using the a_interaction_id and the t_ident_interaction table we can deduce the time of the registration along with other audit information associated with the interaction.

== Foreign keys ==

  
We don't configure foreign keys, thus they are not enforced. This is potentially controversial, but it has some benefits.

If you have a foreign key which you want to be 'null' you can set it to zero. This means that all foreign keys in our schema are effectively nullable. Note that for [[#lookup|lookup]] tables the zero value is modelled in the lookup table but not as a valid value in the application (except for [[#flags|flags]]).

Generally it's best if you left join to referenced tables in case the relevant data is missing.

Banning foreign key constraints simplifies access, improves performance, and reduces storage overhead. It also improves some audit situations wherein log data can refer to subsequently removed records without needing to be deleted along with the relevant primary key data.

  
 
You can leave orphan data in the database if you want. You can run a garbage collection process if orphan data bothers you, or just clean up after yourself as you go along.

Personally I like leaving the orphan data in place because it simplifies (and enables) "undelete". If you restore the parent record it automatically has all of its children restored (because they never went away). The orphan children don't create a problem by remaining latent because all access is via joins from a parent record, in which case orphan data are irrelevant. It's trivial to detect orphan data if you find yourself in a situation where you need to know.

I guess I should stress that just because a foreign key *can* be zero doesn't mean that it *should* be zero. One place where I find it expedient to use zero values is when I'm working on a new feature that is not yet complete. Such work in progress can use zero values for foreign keys while the code that calculates what they should really be hasn't been written yet. The general rule is that if you find a foreign key with a zero value there is something wrong and you should investigate and resolve it. Ordinarily you should never have zero foreign keys (except with [[#flags|flags]]).

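As a sketch of the left-join convention (hypothetical names, using the [[#entity|entity]] pattern described below):

 -- A zero a_team_id is our 'null'; the LEFT JOIN keeps the user row even
 -- when there is no referenced team (zero foreign key or orphaned data).
 SELECT u.a_user_id,
        t.a_team_name
   FROM t_entity_app_user AS u
   LEFT JOIN t_entity_app_team AS t
     ON t.a_team_id = u.a_team_id
  WHERE u.a_user_id = 123;
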
  
 
== Null values ==

I almost never allow a field to be null. For foreign keys I use the zero value to represent null. For other fields I typically use an empty string or zero value. If I really do need to model something that can be "null" then I implement a table for it and a missing row models the "null" value.

Banning nulls simplifies applications, reduces storage costs, and improves performance.

Note that most [[#Elemental data|elemental data]] will define an 'empty' (or 'zero') record for ID = 1 (but not [[#upload|upload]] tables, as they don't have meaningful 'empty' values). The [[#enumeration|enumeration]] data defines the 'empty' record with ID = 0.

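For example, a "nullable" attribute can be modelled with a separate table where a missing row is the "null" (hypothetical names):

 CREATE TABLE t_entity_app_user_nickname (
   a_user_id  INT NOT NULL,          -- foreign key to t_entity_app_user.a_user_id
   a_nickname VARCHAR(255) NOT NULL, -- no row here means the user has no nickname
   PRIMARY KEY ( a_user_id )
 );
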
  
 
== Shredding ==

If you really do need to completely remove some data you can "shred" it. Shredding is dangerous, but effective. When shredding, the "no deletions" rule for [[#Elemental data|elemental data]] can be broken. That's not a problem because you will also delete all data that referred to the shredded elemental data. After shredding, to really finish the job you need to drop the caches on your application servers. Note that orphan data (such as in [[#history|history]] tables) can also optionally be removed during a shred (otherwise it should preferably be updated with valid data). Depending on the sensitivity of the data being eradicated you might need to take into account the existence of database backups (or application server backups) which may still contain references to the shredded values. Shredding is revisionist and can rewrite history, so you might like to leave an audit trail if you do support shredding.

Presently with [[PHPBOM]] we have considered the possibility of shredding and defined it, but it is not yet implemented and is a low-priority use case for us at the moment.

  
 
= Table types =

== Metadata ==

All database and application metadata are stored in [[#about|about]] tables and accessed via a [[#raw connection|raw connection]]. Your application can use the metadata to figure out whether the database is in an operable state, viz. the installed schema revisions and/or online status.

  
 
== Application data ==

Your core mutable application data should mostly be stored in [[#entity|entity]] tables. Each [[#entity|entity]] table has a corresponding [[#history|history]] table which logs changes made to data in [[#entity|entity]] tables.

Data in [[#entity|entity]] and [[#history|history]] tables are accessed via a [[#trn connection|trn connection]].

== Log data ==

Log data is written to [[#log|log]] tables. Logs are ordinarily insert-only (no updates or deletes). If your log table refers to a mutable [[#entity|entity]] table you might like to copy the values of salient fields at the time the log was made, as the 'entity' table might be modified after the logs have been written.

  
 
== Elemental data ==

Everything that's not [[#Metadata|metadata]], [[#Application data|application data]] or [[#Log data|log data]] implements the elemental pattern.

Elemental data is accessed via a [[#raw connection|raw connection]].

The elemental data pattern is basically where we have a small thing, like say a [[#DBT_UINT8|DBT_UINT8]], that points to a big thing, like say a [[#DBT_HASH_BIN|DBT_HASH_BIN]]. The characteristics of elemental data are that they are insert-only (updates and deletions are not allowed), and this property lends them to being manageable on a [[#raw connection|raw connection]] which uses low transaction isolation and automatic commits. Further, because inserts cannot be rolled back, we can cache all elemental mappings, so once an application server knows the short key for a corresponding long key (or the long key for a corresponding short key) it can cache that fact and in future does not need to request the values from the database again.

  
Note that most (but not all) elemental tables log the [[#a_interaction_id|a_interaction_id]] for the interaction in which they were registered. Items only get registered once and cannot ordinarily be changed after they have been registered. Even though elemental data is immutable and insert-only we still put the created_on and updated_on timestamps on our elemental tables. Ideally created_on and updated_on will always have the same value; if not, that might alert you to something having gone wrong (because elemental records should never be changed).

There are two broad types of elemental data depending on how values are identified: value elements and hashed elements. For value elements identity is determined by a value comparison. For hashed elements identity is determined by hashing.

Note that for value elements the values are usually case-sensitive. You can break this rule but only if you know what you are doing!

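A sketch of the register-once idiom for a hashed element follows (hypothetical names; SHA-256 is an assumption for the [[#DBT_HASH_BIN|DBT_HASH_BIN]] hash, and a unique key on the hash column is assumed):

 -- Register (or find) a hashed element on the auto-committing raw connection;
 -- INSERT IGNORE leaves an existing row as-is, so this is safe to repeat.
 INSERT IGNORE INTO t_textdata_comment ( a_comment, a_comment_hash_bin )
 VALUES ( 'Hello, world!', UNHEX( SHA2( 'Hello, world!', 256 ) ) );
 
 -- The mapping is now stable and can be cached by the application server:
 SELECT a_comment_id
   FROM t_textdata_comment
  WHERE a_comment_hash_bin = UNHEX( SHA2( 'Hello, world!', 256 ) );
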
  
 
The table patterns which implement the elemental pattern are:

{|class="wikitable"
! pattern
! identity
! small thing
! big things
! a_interaction_id
! note
|-
| [[#lookup|lookup]]
| value
| enum, flags
| char, code, name, etc.
| not logged
| lookup data is read-only and packaged with schema
|-
| [[#ident|ident]]
| value
| id
| creation date, potentially other data
| logged
|
|-
| [[#particle|particle]]
| value
| id
| some larger integer or usually a case-sensitive string
| logged
|
|-
| [[#pstring|pstring]]
| value
| id
| some string value (must be case-sensitive, as with [[#particle|particles]]), a case-insensitive value is also usually logged (for use in indexes and queries)
| logged
| be careful with case-sensitivity
|-
| [[#textdata|textdata]]
| hash
| id
| a string value and a [[#DBT_HASH_BIN|DBT_HASH_BIN]]
| logged
| the [[#DBT_HASH_BIN|DBT_HASH_BIN]] ensures case-sensitivity
|-
| [[#upload|upload]]
| hash
| id
| a [[#DBT_HASH_BIN|DBT_HASH_BIN]] and other details collected with the upload such as file name
| logged
| data logged during upload should not be altered at a later date, use application [[#entity|entity]] data if that's what you need
|-
| [[#zblob|zblob]]
| hash
| id
| a blob and a [[#DBT_HASH_BIN|DBT_HASH_BIN]]
| logged
|
|-
| [[#ztext|ztext]]
| hash
| id
| a blob and a [[#DBT_HASH_BIN|DBT_HASH_BIN]]
| logged
|
|-
| [[#zjson|zjson]]
| hash
| id
| a blob and a [[#DBT_HASH_BIN|DBT_HASH_BIN]]
| logged
|
|-
| [[#structure|structure]]
| hash
| id
| a string value, a [[#DBT_HASH_BIN|DBT_HASH_BIN]], and relevant components
| logged
| the [[#DBT_HASH_BIN|DBT_HASH_BIN]] ensures case-sensitivity
|}

= Patterns =

== about ==

The [[#about|about]] tables contain database metadata and are accessed via the '[[#raw connection|raw]]' connection.

An about table is named 't_about_{name}'.

 
== lookup ==

The [[#lookup|lookup]] tables contain hard-coded reference data and are accessed via the '[[#raw connection|raw]]' connection. Ordinarily lookup data is defined with the schema and not modified thereafter.

A lookup table is named 't_lookup_{name}'.

Depending on whether it is an [[#enumeration|enumeration]] or [[#flags|flags]] table, a lookup table must contain either an 'a_{name}_enum' column or an 'a_{name}_flags' column to identify a row. The 'a_{name}_{enum,flags}' column is usually a [[#DBT_UINT8|DBT_UINT8]].

The 'a_{name}_{enum,flags}' column must be the primary key.

  
=== enumeration ===

An [[#enumeration|enumeration]] table is a special type of [[#lookup|lookup]] table which contains enumerated values (starting at 1) which indicate a state and associated data. For enumerations the zero value is usually an invalid value, but it is also usually modelled in the lookup data as an instance of the Null Object Pattern.

If an optional 'a_{name}_char' column exists it must be of type [[#DBT_ASCII_CHAR_BIN|DBT_ASCII_CHAR_BIN]]. The [[PHPBOM]] library will use lowercase char codes and applications can use uppercase char codes to distinguish themselves.

Unlike [[#flags|flags]] tables, you can extend a [[PHPBOM]] enumeration with your own values, but if you do you should code from the max enum down as PHPBOM will code from the min enum up. Use an uppercase value in the relevant 'char', 'code', etc. columns to distinguish your application defined values from [[PHPBOM]] defined values.

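A minimal sketch of an enumeration lookup table and its seed data (hypothetical names and values):

 CREATE TABLE t_lookup_user_status (
   a_user_status_enum TINYINT UNSIGNED NOT NULL, -- DBT_UINT8
   a_user_status_char CHAR(1) CHARACTER SET ascii COLLATE ascii_bin NOT NULL,
   a_user_status_name VARCHAR(255) NOT NULL,
   PRIMARY KEY ( a_user_status_enum )
 );
 
 INSERT INTO t_lookup_user_status VALUES
   ( 0, 'z', 'none' ),     -- the Null Object record, not valid in the application
   ( 1, 'a', 'active' ),
   ( 2, 'd', 'disabled' );
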
=== flags ===

A [[#flags|flags]] table is a special type of [[#lookup|lookup]] table which associates up to 256 flags enum values with up to eight boolean columns. Unlike with [[#enumeration|enumeration]] tables, a flags table can legitimately have a zero primary key (indicating no boolean values are set).

A flags table is named 't_mflags_{name}_flags'.

A flags table must contain a 'a_{name}_flags' enum column which identifies a row. The 'a_{name}_flags' column would typically be a [[#DBT_UINT8|DBT_UINT8]]. If more flags are needed they would probably be better modelled with a new flags table than by extending the flags column to [[#DBT_UINT16|DBT_UINT16]], although that is possible.

The 'a_{name}_flags' column must be the primary key.

Other columns, typically up to eight of them, are [[#DBT_BOOL|DBT_BOOL]] columns which model the associated flags as boolean values. Programmers can join to the flags lookup table to get the associated boolean column values for use in reports or queries.

Boolean columns are named 'a_{is,has,was,can}_{noun}_{adjective}' and in our PHP code there is generally a constant declared with the schema having the relevant power of 2:

 define( 'FLAG_{IS,HAS,WAS,CAN}_{NOUN}_{ADJECTIVE}', pow( 2, x ) );

For the current full list of supported indicatives (is, has, etc.) see the [https://www.progclub.org/pcrepo/phpbom/branches/0.9/src/code/1-basic/2-general.php?revision=8169#l421 is_bool_name()] function definition.

Generally it is very ill-advised to extend a [[PHPBOM]] flags lookup table. If you need extra flags, defining your own flags table is probably your best option.

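A minimal sketch of a flags table with two flags (hypothetical names; a real table might model up to eight):

 CREATE TABLE t_mflags_user_flags (
   a_user_flags      TINYINT UNSIGNED NOT NULL, -- DBT_UINT8, the bitfield
   a_is_user_admin   TINYINT(1) NOT NULL,       -- DBT_BOOL, bit value 1
   a_is_user_blocked TINYINT(1) NOT NULL,       -- DBT_BOOL, bit value 2
   PRIMARY KEY ( a_user_flags )
 );
 
 -- One row per combination; the zero row legitimately means "no flags set".
 INSERT INTO t_mflags_user_flags VALUES
   ( 0, 0, 0 ), ( 1, 1, 0 ), ( 2, 0, 1 ), ( 3, 1, 1 );

Application code can then test bits in the flags value while reports and queries join to the boolean columns.
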
  
 
== ident ==

The [[#ident|ident]] tables contain identifiers and are accessed via the '[[#raw connection|raw]]' connection. An ident table provides an ID along with its creation date and potentially other data (see for example t_ident_interaction).

An ident table is named 't_ident_{name}'.

== particle ==

  
The [[#particle|particle]] tables contain particles and are accessed via the '[[#raw connection|raw]]' connection. A particle table maps a wide column to a narrow column and vice versa.

A particle table is named 't_particle_{name}'.

A particle table must contain a 'a_{name}_id' column which contains the narrow ID for the particle. The 'a_{name}_id' column is usually a [[#DBT_ID24|DBT_ID24]] or [[#DBT_ID32|DBT_ID32]] and is the primary key.

A particle table must contain a 'a_{name}' column which is the wide value. There must be a unique index on the 'a_{name}' column. When particle 'a_{name}' columns are strings they should be case-sensitive. If you want a case-insensitive string use a [[#pstring|pstring]] table instead.

== pstring ==

The [[#pstring|pstring]] tables contain strings and are accessed via the '[[#raw connection|raw]]' connection. Pstrings are case-sensitive strings (like [[#particle|particles]]), with an associated case-insensitive string (which can be used for indexes and/or queries).

A pstring table is named 't_pstring_{name}'.

A pstring table must contain a 'a_{name}_id' column which contains the ID for the pstring. The 'a_{name}_id' column is usually a [[#DBT_ID24|DBT_ID24]] or [[#DBT_ID32|DBT_ID32]] and is the primary key.

A pstring table must contain a 'a_{name}' column which contains the case-sensitive string value. The 'a_{name}' column is usually a [[#DBT_ASCII_BIN|DBT_ASCII_BIN]] or [[#DBT_UTF8_BIN|DBT_UTF8_BIN]]. There must be a unique index on the 'a_{name}' column.

A pstring table must contain a 'a_{name}_ci' column which contains the case-insensitive string value. The 'a_{name}_ci' column is usually a [[#DBT_ASCII_CI|DBT_ASCII_CI]] or [[#DBT_UTF8_CI|DBT_UTF8_CI]] column.

The only real difference between a [[#pstring|pstring]] table and a [[#particle|particle]] table is that a pstring table always models strings (particles can model numbers and other scalar values) and pstring tables include a case-insensitive version of the value which can be used in indexes or queries.

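A minimal pstring sketch (hypothetical names, ASCII flavour):

 CREATE TABLE t_pstring_tag (
   a_tag_id MEDIUMINT NOT NULL AUTO_INCREMENT, -- DBT_ID24-sized, primary key
   a_tag    VARCHAR(255) CHARACTER SET ascii COLLATE ascii_bin NOT NULL,
   a_tag_ci VARCHAR(255) CHARACTER SET ascii COLLATE ascii_general_ci NOT NULL,
   PRIMARY KEY ( a_tag_id ),
   UNIQUE KEY ( a_tag ) -- identity is by case-sensitive value comparison
 );
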
== textdata ==

The [[#textdata|textdata]] tables contain strings and are accessed via the '[[#raw connection|raw]]' connection. A textdata table maps a wide column to a narrow column. You would usually use a textdata table if a [[#particle|particle]] or [[#pstring|pstring]] table is too small.

A textdata table is named 't_textdata_{name}'.

A textdata table must contain a 'a_{name}_id' column which contains the narrow ID for the string. The 'a_{name}_id' column is usually a [[#DBT_ID24|DBT_ID24]] or [[#DBT_ID32|DBT_ID32]].

A textdata table must contain a 'a_{name}' column which is the wide string value. The 'a_{name}' column should be case-insensitive. Case-sensitivity is ensured by the 'a_{name}_hash_bin' column, so the 'a_{name}' column can use a case-insensitive collation.

A textdata table must contain a 'a_{name}_hash_bin' column which contains the [[#DBT_HASH_BIN|DBT_HASH_BIN]] binary hash of the string value.

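A minimal textdata sketch (hypothetical names; BINARY(32) assumes a 256-bit hash for [[#DBT_HASH_BIN|DBT_HASH_BIN]]):

 CREATE TABLE t_textdata_comment (
   a_comment_id       INT NOT NULL AUTO_INCREMENT, -- DBT_ID32-sized, narrow
   a_comment          TEXT NOT NULL,               -- wide value, CI collation is fine
   a_comment_hash_bin BINARY(32) NOT NULL,         -- DBT_HASH_BIN
   PRIMARY KEY ( a_comment_id ),
   UNIQUE KEY ( a_comment_hash_bin )               -- identity is by hash
 );
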
== upload ==

The [[#upload|upload]] tables contain data about uploaded files and are accessed via the '[[#raw connection|raw]]' connection.

An upload table is named 't_upload_{name}'.

An upload table has 'a_{name}_id', 'a_{name}_hash_bin', 'a_{name}_file_name', 'a_{name}_file_type' and 'a_file_type_enum' columns. Usually the actual file will be stored in the file-system using the 'a_{name}_hash_hex' value for the file name.

== zblob ==

The [[#zblob|zblob]] tables contain compressed blobs and are accessed via the '[[#raw connection|raw]]' connection.

A zblob table is named 't_zblob_{name}'.

A zblob table must contain a 'a_{name}_id' column which contains the narrow ID for the compressed blob. The 'a_{name}_id' column is usually a [[#DBT_ID24|DBT_ID24]] or [[#DBT_ID32|DBT_ID32]].

A zblob table must contain a 'a_{name}' column which is the compressed binary data. The 'a_{name}' column is usually a [[#DBT_MLOB|DBT_MLOB]].

A zblob table must contain a 'a_{name}_hash_bin' column which contains the [[#DBT_HASH_BIN|DBT_HASH_BIN]] binary hash of the compressed blob.

See the [https://svn.jj5.net/jjrepo/jj5-bin/branches/0.3/inflate.php inflate.php] program if you need to work with data in this format.

== ztext ==

The [[#ztext|ztext]] tables contain compressed text data and are accessed via the '[[#raw connection|raw]]' connection.

A ztext table is named 't_ztext_{name}'.

A ztext table must contain a 'a_{name}_id' column which contains the narrow ID for the compressed data. The 'a_{name}_id' column is usually a [[#DBT_ID24|DBT_ID24]] or [[#DBT_ID32|DBT_ID32]].

A ztext table must contain a 'a_{name}' column which is the compressed UTF-8 encoded textual data. The 'a_{name}' column is usually a [[#DBT_MLOB|DBT_MLOB]].

A ztext table must contain a 'a_{name}_hash_bin' column which contains the [[#DBT_HASH_BIN|DBT_HASH_BIN]] binary hash of the compressed data.

See the [https://svn.jj5.net/jjrepo/jj5-bin/branches/0.3/inflate.php inflate.php] program if you need to work with data in this format.

== zjson ==

The [[#zjson|zjson]] tables contain compressed JSON and are accessed via the '[[#raw connection|raw]]' connection.

A zjson table is named 't_zjson_{name}'.

A zjson table must contain a 'a_{name}_id' column which contains the narrow ID for the compressed data. The 'a_{name}_id' column is usually a [[#DBT_ID24|DBT_ID24]] or [[#DBT_ID32|DBT_ID32]].

A zjson table must contain a 'a_{name}' column which is the compressed binary data. The 'a_{name}' column is usually a [[#DBT_MLOB|DBT_MLOB]].

A zjson table must contain a 'a_{name}_hash_bin' column which contains the [[#DBT_HASH_BIN|DBT_HASH_BIN]] binary hash of the compressed data.

See the [https://svn.jj5.net/jjrepo/jj5-bin/branches/0.3/inflate.php inflate.php] program if you need to work with data in this format.

  
 
== structure ==

The [[#structure|structure]] tables contain structured values along with their relevant components and are accessed via the '[[#raw connection|raw]]' connection.

A structure table is named 't_structure_{name}'.

As with a [[#textdata|textdata]] table, there are 'a_{name}_id', 'a_{name}' and 'a_{name}_hash_bin' columns.

Other component columns will exist. For example for a 't_structure_email_address' there will be 'a_email_address_mailbox' and 'a_email_address_host' columns.

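A minimal sketch of the email address example (hypothetical column types; BINARY(32) assumes a 256-bit hash):

 CREATE TABLE t_structure_email_address (
   a_email_address_id       INT NOT NULL AUTO_INCREMENT,
   a_email_address          VARCHAR(255) NOT NULL, -- the whole value
   a_email_address_hash_bin BINARY(32) NOT NULL,   -- DBT_HASH_BIN
   a_email_address_mailbox  VARCHAR(255) NOT NULL, -- component
   a_email_address_host     VARCHAR(255) NOT NULL, -- component
   PRIMARY KEY ( a_email_address_id ),
   UNIQUE KEY ( a_email_address_hash_bin )
 );
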
  
== entity ==

The [[#entity|entity]] tables contain mutable application data and are accessed via the '[[#trn connection|trn]]' connection. All entity tables have a corresponding [[#history|history]] table.

An entity table is named 't_entity_{name}'.

An entity table has a 'a_{name}_id' column which is usually a [[#DBT_ID24|DBT_ID24]] or [[#DBT_ID32|DBT_ID32]].

An entity table has a 'a_{name}_version' column which is usually a reference to t_ident_interaction.a_interaction_id.

An entity table will have other columns too.

Typically entity tables use a prefix that indicates their subtype, e.g. 'app_', 'cfg_', 'web_', 'loc_', etc. Usually the subtype is omitted from the table's column names, e.g. t_entity_app_user.a_user_id. If you're worried about future namespace collisions, picking an application prefix for your table and column names is probably the way to go. (2020-03-28 jj5 - THINK: do we want a registry for application prefixes to avoid collisions?)

== history ==

The [[#history|history]] tables contain logs of changes to mutable application data stored in [[#entity|entity]] tables and are accessed via the '[[#trn connection|trn]]' connection. All history tables have a corresponding [[#entity|entity]] table.

A history table is named 't_history_{name}'.

A history table has a 'a_{name}_history_id' primary key which is usually a [[#DBT_ID32|DBT_ID32]].

A history table has a 'a_interaction_id' column, which is a reference to the t_ident_interaction.a_interaction_id in which the [[#entity|entity]] was modified.

A history table has a 'a_crud_enum' column which is a reference to the t_lookup_crud.a_crud_enum that indicates the CRUD operation which affected the [[#entity|entity]].

A history table then contains a copy of all of the columns from its corresponding [[#entity|entity]] table.

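For illustration, a minimal entity/history pair under the conventions above (hypothetical names; assuming DBT_ID32 maps to a signed INT):

 CREATE TABLE t_entity_app_user (
   a_user_id      INT NOT NULL AUTO_INCREMENT, -- DBT_ID32
   a_user_version INT NOT NULL,                -- ref to t_ident_interaction.a_interaction_id
   a_user_email   VARCHAR(255) NOT NULL,
   PRIMARY KEY ( a_user_id )
 );
 
 CREATE TABLE t_history_app_user (
   a_user_history_id INT NOT NULL AUTO_INCREMENT, -- DBT_ID32
   a_interaction_id  INT NOT NULL,                -- interaction that made the change
   a_crud_enum       TINYINT UNSIGNED NOT NULL,   -- ref to t_lookup_crud.a_crud_enum
   a_user_id         INT NOT NULL,                -- copy of the entity columns follows
   a_user_version    INT NOT NULL,
   a_user_email      VARCHAR(255) NOT NULL,
   PRIMARY KEY ( a_user_history_id )
 );
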
  
 
== log ==

The [[#log|log]] tables are for writing [[#Log data|log data]] and are accessed via the '[[#raw connection|raw]]' connection.

A log table is named 't_log_{name}'.

 
= Database connections =

In an application there are usually two separate connections to the database, being the '[[#raw connection|raw]]' connection and the '[[#trn connection|trn]]' connection. The [[#Patterns|database pattern]] indicates which connection to use when operating on any given type of table. Generally the 'raw' connection is used for everything except for [[#entity|entity]] and [[#history|history]] tables, which use the 'trn' connection.

  
 
== raw connection ==

The '[[#raw connection|raw]]' connection uses low transaction isolation and automatic commits, and is used for everything except [[#entity|entity]] and [[#history|history]] tables.

== trn connection ==

The '[[#trn connection|trn]]' connection is used for [[#entity|entity]] and [[#history|history]] tables.

= Database types =

There are many more supported database types which can be used for column definitions, but some key types are listed here.

Note: we model identity columns as integers, not unsigned integers. Usually identity columns have values starting at 1 (zero is unused) and incrementing from there. But sometimes it can be useful to multiply identity values (particularly when they are foreign keys) by minus one (-1) to make them negative while you're processing some values; that's why we use signed ints and not unsigned ints for identity columns.

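As a sketch of what this looks like in MySQL terms (assuming, for illustration only, that [[#DBT_ID24|DBT_ID24]] maps to a signed MEDIUMINT; hypothetical table):

 CREATE TABLE t_ident_widget (
   a_widget_id MEDIUMINT NOT NULL AUTO_INCREMENT, -- DBT_ID24: signed, 1 to 8,388,607
   PRIMARY KEY ( a_widget_id )
 );
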
  
 
== DBT_ID8 ==

A [[#DBT_ID8|DBT_ID8]] is an auto generated integer from 1 to 127.

  
 
== DBT_ID16 ==

A [[#DBT_ID16|DBT_ID16]] is an auto generated integer from 1 to 32,767.

  
 
== DBT_ID24 ==

A [[#DBT_ID24|DBT_ID24]] is an auto generated integer from 1 to 8,388,607.

  
 
== DBT_ID32 ==

A [[#DBT_ID32|DBT_ID32]] is an auto generated integer from 1 to 2,147,483,647.

  
 
== DBT_ID64 ==

A [[#DBT_ID64|DBT_ID64]] is an auto generated integer from 1 to 9,223,372,036,854,775,807.

  
 
== DBT_UINT8 ==

A [[#DBT_UINT8|DBT_UINT8]] is an unsigned integer from 0 to 255.

== DBT_UPDATED_ON ==

A [[#DBT_UPDATED_ON|DBT_UPDATED_ON]] is the timestamp of when the record was last updated.

= Data formats =

Field data can be in one of two formats: application format or database format. By and large the database format and the application format are the same: strings are strings, ints are ints, and floats are floats. However there are some differences, as summarised in the following chart. Note that in [[PHPBOM]] conversion between formats is done automatically for application data by the active record implementation.

{|class="wikitable"
! Database type
! Application format
! Database format
|-
| [[#DBT_BOOL|DBT_BOOL]]
| boolean
| int
|-
| [[#DBT_DATETIME|DBT_DATETIME]], [[#DBT_DATETIME_UTC|DBT_DATETIME_UTC]]
| DateTime
| string
|-
| [[#DBT_TIMEZONE|DBT_TIMEZONE]]
| DateTimeZone
| string
|}

= Column size =

== Integer size ==

I get stuck on this all the time. My general disposition is to use only as much as you need. Then if it turns out that you need more, provision more. I guess I take the view that limits are good/helpful things in making sure situations don't get out of hand. So if I only need about 100 values I will use a byte.

Here's some info about integer sizes.

{|class="wikitable right"
! Bits !! Signed max !! Unsigned max
|-
| 8 || 127 || 255
|-
| 16 || 32,767 || 65,535
|-
| 24 || 8,388,607 || 16,777,215
|-
| 32 || 2,147,483,647 || 4,294,967,295
|-
| 64 || 9,223,372,036,854,775,807 || 18,446,744,073,709,551,615
|}

== String length ==

If you have a fixed length field use a fixed length. Examples include hashes and tokens.

Otherwise pick the first option that will work:

{|class="wikitable sortable"
! ASCII
! UTF-8
! Note
|-
|class="right"| 255
|class="right"| 63
| length fits in one byte
|-
|class="right"| 767
|class="right"| 190
| maximum length MySQL can index
|-
|class="right"| 65000
|class="right"| 16250
| uses most of the row space supported by MySQL ([https://dev.mysql.com/doc/refman/8.0/en/column-count-limit.html 65,535 bytes]), but not all of it
|}

If you can't find an appropriate size above use MEDIUMTEXT if 16 MiB is enough or LONGTEXT for up to 4 GiB.

For help visualising string lengths see my [https://www.progclub.net/~jj5/test/php/field-length.php field/column length examples].

= Glossary =

== ACID ==

[https://en.wikipedia.org/wiki/ACID ACID]:

* Atomicity
* Consistency
* Isolation
* Durability

== Concurrency control ==

[https://en.wikipedia.org/wiki/Concurrency_control Concurrency control]

== DCL ==

[https://en.wikipedia.org/wiki/Data_control_language Data Control Language]

== DDL ==

[https://en.wikipedia.org/wiki/Data_definition_language Data Definition Language]

== Deadlock ==

See: http://catb.org/jargon/html/D/deadlock.html

== DML ==

[https://en.wikipedia.org/wiki/Data_manipulation_language Data Manipulation Language]

== DRI ==

[https://en.wikipedia.org/wiki/Referential_integrity#Declarative_referential_integrity Declarative Referential Integrity]

== Livelock ==

See: http://catb.org/jargon/html/L/livelock.html

== Luhn algorithm ==

The [https://en.wikipedia.org/wiki/Luhn_algorithm Luhn algorithm] can be used to generate a check digit for an integer value.

== SQL ==

[https://en.wikipedia.org/wiki/SQL Structured Query Language] ([https://smile.amazon.com/gp/product/0128007613 AKA]: Scarcely Qualifies as a Language)

== Transaction phenomena ==

These brief notes about potentially problematic transaction phenomena are taken from [https://smile.amazon.com/gp/product/0128007613 Joe Celko's SQL for Smarties: Advanced SQL Programming].

{|class="wikitable"
 +
| [[#Dirty write|P0]]
 +
| Dirty write
 +
|-
 +
| [[#Dirty read|P1]]
 +
| Dirty read
 +
|-
 +
| [[#Nonrepeatable read|P2]]
 +
| Nonrepeatable read
 +
|-
 +
| [[#Phantom|P4]]
 +
| Phantom
 +
|-
 +
| [[#Lost update|P5]]
 +
| Lost update
 +
|}
 +
 +
=== Dirty write ===

T1 modifies a field. T2 further modifies that field before T1 commits. If T1 or T2 roll back it is unclear what the correct field value is.

=== Dirty read ===

T1 modifies a row. T2 then reads that row before T1 commits. If T1 performs a rollback T2 will have read a row that was never committed.

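For example, a dirty read can be observed in MySQL with two sessions at the READ UNCOMMITTED isolation level (hypothetical table):

 -- Session 1:
 START TRANSACTION;
 UPDATE t_entity_app_user SET a_user_email = 'new@example.com' WHERE a_user_id = 1;
 
 -- Session 2:
 SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
 SELECT a_user_email FROM t_entity_app_user WHERE a_user_id = 1; -- sees 'new@example.com'
 
 -- Session 1:
 ROLLBACK; -- session 2 read a value that was never committed
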
=== Nonrepeatable read ===

T1 reads a row. T2 then modifies or deletes that row. If T1 then attempts to reread the row it may receive the modified value or discover the row has been deleted.

=== Phantom ===

T1 reads the set of rows N that satisfies some condition. T2 then executes statements that generate one or more rows that satisfy the condition used by T1. If T1 then repeats the initial read with the same condition it obtains a different collection of rows.

=== Lost update ===

T1 reads a field, then T2 updates that field, then T1 (based on earlier data) updates the field. The update from T2 is lost.


Latest revision as of 20:41, 14 October 2020

This page documents the various types of database design patterns supported by the PHPBOM library and used in our applications.

Introduction

Here we document the type of tables we have in our databases and include some notes about how they are accessed and used.

Our schemas are acyclic. Tables can be self-referential, otherwise their foreign keys must point to a previously defined table. If you find yourself with a cyclic situation you can fix it by splitting your two tables into three tables.

We have four broad types of data, being metadata (in about tables), core mutable application data (in entity and history tables), log data (in log tables), and then everything else.

Tables of the various patterns can only depend on other types of tables with a higher precedence. So for example application entity data cannot refer to log data because 'log' tables come after 'entity' tables. When you are declaring your schema you should declare your tables in the canonical order and within patterns of the same type usually alphabetical or temporal order is best when possible.

Many of our table patterns implement a parent pattern, being the elemental data pattern.

Notes

Naming conventions

Tables are named as 't_{pattern}_{name}'.

Columns are named as 'a_{name}'. Note that column names must be globally unique, not simply unique per table. The exception to this rule is if you have a foreign key without a more specific name, in that case the foreign key can have the same name as its referent.

The 't_{pattern}_' and 'a_' prefixes are a form of Hungarian notation.

Note: to complicate things somewhat the PHPBOM software provides support for a table prefix to use for your schemas so you can squeeze an application into a shared database if need be. The table names thus become '{prefix}t_{pattern}_{name}', because who doesn't love complexity?

So in PHPBOM parlance there are three types of table name:

Type Name Note
full table name '{prefix}t_{pattern}_{name}' $prefix can be empty!
table name 't_{pattern}_{name}' used within the software to identify a table
short table name '{name}' used to enforce as globally unique

Note that short table names must be globally unique across all applicable schemas defined for your application (an application can have multiple schemas, typically two, the PHPBOM schema and the application schema).

Naming foreign keys

Ideally foreign keys (including foreign keys which are also primary keys) will have the same name as the primary key column they refer to.

When there are multiple foreign keys to the same primary key, or when we have a self-referential table, then using the same name as the referent is not possible. In this case you should use the whole name being referred to and place a qualifier after, e.g.: two foreign keys to a_token_id on the same table:

  • a_token_id_prev
  • a_token_id_next

Naming for entity/history tables implemented by PHPBOM

In general applications and PHPBOM share namespaces, particularly for elemental data. An exception to this is made with entity tables and their corresponding history counterparts. It is recommended for your application entity/history tables that you prefix your app code to distinguish tables your application has defined from tables PHPBOM has defined.

To try and keep out of your way PHPBOM uses standard prefixes for entity/history tables. These prefixes (at the present time) are:

Prefix Note
app_ for application data
web_ for web related data
loc_ for geographic data

Canonical order

Tables should be declared in the following order. Tables with a lower ordinal cannot reference tables with a higher ordinal.

Ordinal Pattern
11 about
21 lookup
23 mflags
31 ident
41 particle
43 pstring
51 textdata
52 upload
53 zblob
55 ztext
56 zjson
61 structure
71 entity
73 history
81 log
91 view
99 other

a_interaction_id

We have a standard ident table called t_ident_interaction which contains the serial number for an application interaction (e.g. a HTTP request) in its a_interaction_id column. When most elemental data is registered (not lookup data) we log the a_interaction_id of the interaction that did the registration. Using the a_interaction_id and the t_ident_interaction table we can deduce the time of the registration along with other audit information associated with the interaction.

Foreign keys

We don't configure foreign keys, thus they are not enforced. This is potentially controversial, but it has some benefits.

If you have a foreign key which you want to be 'null' you can set it to zero. This means that all foreign keys in our schema are effectively nullable. Note that for lookup tables the zero value is modelled in the lookup table but not as a valid value in the application (except for flags).

Generally it's best if you left join to referenced tables in case the relevant data is missing.

Banning foreign keys simplifies access, improves performance, and reduces storage overhead. It also improves some audit situations wherein log data can refer to subsequently removed records without needing to be deleted with the relevant primary key data.

You can leave orphan data in the database if you want. You can run a garbage collection process if orphan data bothers you, or just clean up after yourself as you go along.

Personally I like leaving the orphan data in place because it simplifies (and enables) "undelete". If you restore the parent record it automatically has all of its children restored (because they never went away). The orphan children don't create a problem by remaining latent because all access is via joins from a parent record, in which case orphan data are irrelevant. It's trivial to detect orphan data if you find yourself in a situation where you need to know.

I guess I should stress that just because a foreign key *can* be zero doesn't mean that it *should* be zero. One place where I find it expedient to use zero values is when I'm working on a new feature that is not yet complete. Such work in progress can use zero values for foreign keys while the code that calculates what they should really be hasn't been written yet. The general rule is that if you find a foreign key with a zero value there is something wrong and you should investigate and resolve. Ordinarily you should never have zero foreign keys (except with flags).

Null values

I almost never allow a field to be null. For foreign keys I use the zero value to represent null. For other fields I typically use an empty string or zero value. If I really do need to model something that can be "null" then I implement a table for it and a missing row models the "null" value.

Banning nulls simplifies applications, reduces storage costs, and improves performance.

Note that most elemental data will define an 'empty' (or 'zero') record for ID = 1 (but not upload tables, as they don't have meaningful 'empty' values). The enumeration data defines the 'empty' record with ID = 0.

Shredding

If you really do need to completely remove some data you can "shred" it. Shredding is dangerous, but effective. When shredding the "no deletions" for elemental data rule can be broken. That's not a problem because you will also delete all data that referred to shredded elemental data. After shredding to really finish the job you need to drop the caches on your application servers. Note that orphan data (such as in history tables) can also optionally be removed (otherwise it should preferably be updated with valid data) during a shred. Depending on the sensitivity of the data being eradicated you might need to take into account the existence of database backups (or application server backups) which may still contain references to the shredded values. Shredding is revisionist and can rewrite history, so you might like to leave an audit trail if you do support shredding.

Presently with PHPBOM we have considered the possibility of shredding and defined it, but it is not yet implemented and is a low-priority usecase for us at the moment.

Table types

Metadata

All database and application metadata are stored in about tables and accessed via a raw connection. Your application can use the metadata to figure out if the database is in an operable state or not, viz the installed schema revisions and/or online status.

Application data

Your core mutable application data should mostly be stored in entity tables. Each entity table has a corresponding history table which logs changes made to data in entity tables.

Data in entity and history tables are accessed via a trn connection.

Log data

For writing logs to log tables. Logs are ordinarily insert-only (no updates or deletes). If your log table refers to a mutable entity table you might like to copy the value of salient fields at the time the log was made, as the 'entity' table might be modified after the logs have been written.

Elemental data

Everything that's not metadata, application data or log data implements the elemental pattern.

Elemental data is accessed via a raw connection.

The elemental data pattern is basically where we have a small thing, like say an DBT_UINT8, that points to a big thing, like say a DBT_HASH_BIN. The characteristics of elemental data are that they are insert-only (updates and deletions are not allowed), and this property lends them to being manageable on a raw connection which uses low transaction isolation and automatic commits. Further, because inserts cannot be rolled back, we can cache all elemental mappings, so once an application server knows the short key for a corresponding long key (or the long key for a corresponding short key) it can cache that fact and in future does not need to request the values from the database again.

Note that most (but not all) elemental tables log the a_interaction_id for the interaction in which they were registered. Items only get registered once and cannot ordinarily be changed after they have been registered. Even though elemental data is immutable and insert-only data we still put the created_on and updated_on timestamps on our elemental tables, ideally created_on and updated_on will always have the same value, and if not that might alert you to something having gone wrong (because elemental records should not ever be changed).

There are two broad types of elemental data depending on how values are identified: value elements and hashed elements. For value elements identity is determined by a value comparison. For hashed elements identity is determined by hashing.

Note that for value elements the values are usually case-sensitive. You can break this rule but only if you know what you are doing!

The table patterns which implement the elemental pattern are:

{|class="wikitable"
! pattern
! identity
! small thing
! big things
! a_interaction_id
! note
|-
| lookup || value || enum, flags || char, code, name, etc. || not logged || lookup data is read-only and packaged with the schema
|-
| ident || value || id || creation date, potentially other data || logged ||
|-
| particle || value || id || some larger integer or, usually, a case-sensitive string || logged ||
|-
| pstring || value || id || a case-sensitive string value (as with particles) along with a case-insensitive copy (for use in indexes and queries) || logged || be careful with case-sensitivity
|-
| textdata || hash || id || a string value and a DBT_HASH_BIN || logged || the DBT_HASH_BIN ensures case-sensitivity
|-
| upload || hash || id || a DBT_HASH_BIN and other details collected with the upload such as the file name || logged || data logged during an upload should not be altered at a later date; use an application entity table if that's what you need
|-
| zblob || hash || id || a blob and a DBT_HASH_BIN || logged ||
|-
| ztext || hash || id || a blob and a DBT_HASH_BIN || logged ||
|-
| zjson || hash || id || a blob and a DBT_HASH_BIN || logged ||
|-
| structure || hash || id || a string value, a DBT_HASH_BIN, and relevant components || logged || the DBT_HASH_BIN ensures case-sensitivity
|}

Patterns

about

The about tables contain database metadata and are accessed via the 'raw' connection.

An about table is named 't_about_{name}'.
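
As a sketch only (the columns here are illustrative assumptions, not the PHPBOM definitions), an about table recording the installed schema revision might look like:

 CREATE TABLE t_about_database (
   -- hypothetical metadata: installed schema revision and online status:
   a_schema_name VARCHAR(64) NOT NULL,
   a_schema_revision INT NOT NULL,
   a_is_online TINYINT(1) NOT NULL DEFAULT 1,
   PRIMARY KEY ( a_schema_name )
 ) ENGINE=InnoDB;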

lookup

The lookup tables contain hard-coded reference data and are accessed via the 'raw' connection. Ordinarily lookup data is defined with the schema and not modified thereafter.

A lookup table is named 't_lookup_{name}'.

Depending on whether it is an enumeration or flags table a lookup table must contain either an 'a_{name}_enum' column or an 'a_{name}_flags' column to identify a row. The 'a_{name}_{enum,flags}' column is usually a DBT_UINT8.

The 'a_{name}_{enum,flags}' column must be the primary key.

enumeration

An enumeration table is a special type of lookup table which contains enumerated values (starting at 1) which indicate a state and associated data. For enumerations the zero value is usually an invalid value but it is also usually modelled in the lookup data as an instance of the Null Object Pattern.

If an optional 'a_{name}_char' column exists it must be of type DBT_ASCII_CHAR_BIN. The PHPBOM library will use lowercase char codes and applications can use uppercase char codes to distinguish themselves.

Unlike flags tables you can extend a PHPBOM enumeration with your own values, but if you do you should code from the max enum down as PHPBOM will code from the min enum up. Use an uppercase value in relevant 'char', 'code', etc. columns to distinguish your application defined values from PHPBOM defined values.
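
A minimal sketch of an enumeration table, assuming DBT_UINT8 maps to TINYINT UNSIGNED and DBT_ASCII_CHAR_BIN to a binary-collated CHAR(1); the table and its values are hypothetical:

 CREATE TABLE t_lookup_status (
   a_status_enum TINYINT UNSIGNED NOT NULL,  -- DBT_UINT8 (assumed mapping)
   a_status_char CHAR(1) CHARACTER SET ascii COLLATE ascii_bin NOT NULL,
   a_status_name VARCHAR(64) NOT NULL,
   PRIMARY KEY ( a_status_enum )
 ) ENGINE=InnoDB;
 
 -- row 0 models the Null Object Pattern; real values start at 1.
 -- uppercase char codes mark these as application-defined values:
 INSERT INTO t_lookup_status VALUES
   ( 0, 'X', 'invalid' ),
   ( 1, 'A', 'active' ),
   ( 2, 'D', 'disabled' );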

flags

A flags table is a special type of lookup table which associates up to 256 flags enum values with up to eight boolean columns. Unlike with enumeration tables a flags table can legitimately have a zero primary key (indicating no boolean values are set).

A flags table is named 't_lookup_{name}_flags'.

A flags table must contain an 'a_{name}_flags' enum column which identifies a row. The 'a_{name}_flags' column would typically be a DBT_UINT8. If more flags are needed they are probably better modelled with a new flags table than by extending the flags column to DBT_UINT16, although that is possible.

The 'a_{name}_flags' column must be the primary key.

Other columns, typically up to eight of them, are DBT_BOOL columns which model the associated flags as boolean values. Programmers can join to the flags lookup table to get the associated boolean column values for use in reports or queries.

Boolean columns are named 'a_{is,has,was,can}_{noun}_{adjective}' and in our PHP code there is generally a constant declared with the schema having the relevant power of 2:

define( 'FLAG_{IS,HAS,WAS,CAN}_{NOUN}_{ADJECTIVE}', pow( 2, x ) );

For the current full list of supported indicatives (is, has, etc.) see the is_bool_name() function definition.

Generally it is very ill-advised to extend a PHPBOM flags lookup table. If you need extra flags defining your own flags table is probably your best option.
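
As an illustrative sketch (the table, flag names and type mappings are my assumptions), a flags table with two boolean columns might look like:

 CREATE TABLE t_lookup_access_flags (
   a_access_flags TINYINT UNSIGNED NOT NULL,  -- DBT_UINT8 (assumed mapping)
   a_is_user_active TINYINT(1) NOT NULL,      -- FLAG_IS_USER_ACTIVE = 1
   a_is_user_admin TINYINT(1) NOT NULL,       -- FLAG_IS_USER_ADMIN = 2
   PRIMARY KEY ( a_access_flags )
 ) ENGINE=InnoDB;
 
 -- one row per combination of flags; zero legitimately means "no flags set":
 INSERT INTO t_lookup_access_flags VALUES
   ( 0, 0, 0 ), ( 1, 1, 0 ), ( 2, 0, 1 ), ( 3, 1, 1 );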

ident

The ident tables allocate unique identifiers (serial numbers) and are accessed via the 'raw' connection.

An ident table is named 't_ident_{name}'.

An ident table has a 'a_{name}_id' column which is usually a DBT_ID24 or DBT_ID32.

Typically an ident table will collect a timestamp for its creation date, and will usually log an a_interaction_id.
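
A sketch of an ident table, using the t_ident_interaction table referred to elsewhere in this document (the column layout is my assumption, with DBT_ID32 mapped to a signed auto-increment INT):

 CREATE TABLE t_ident_interaction (
   a_interaction_id INT NOT NULL AUTO_INCREMENT,  -- DBT_ID32 (assumed mapping)
   a_created_on DATETIME NOT NULL,                -- DBT_CREATED_ON
   PRIMARY KEY ( a_interaction_id )
 ) ENGINE=InnoDB;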

particle

The particle tables contain particles and are accessed via the 'raw' connection. A particle table maps a wide column to a narrow column and vice versa.

A particle table is named 't_particle_{name}'.

A particle table must contain a 'a_{name}_id' column which contains the narrow ID for the particle. The 'a_{name}_id' column is usually a DBT_ID24 or DBT_ID32 and is the primary key.

A particle table must contain a 'a_{name}' column which is the wide value. There must be a unique index on the 'a_{name}' column. When particle 'a_{name}' columns are strings they should be case-sensitive. If you want a case-insensitive string use a pstring table instead.
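
A sketch of a hypothetical email particle table, assuming DBT_ID32 maps to INT and using a binary (case-sensitive) collation for the wide value:

 CREATE TABLE t_particle_email (
   a_email_id INT NOT NULL AUTO_INCREMENT,  -- DBT_ID32 (assumed mapping)
   a_email VARCHAR(255) CHARACTER SET ascii COLLATE ascii_bin NOT NULL,
   a_interaction_id INT NOT NULL,
   PRIMARY KEY ( a_email_id ),
   UNIQUE KEY ( a_email )
 ) ENGINE=InnoDB;

Usage is get-or-create: look the value up via the unique 'a_{name}' index, and insert it if it isn't there yet; because the mapping never changes it can be cached indefinitely.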

pstring

The pstring tables contain strings and are accessed via the 'raw' connection. Pstrings are case-sensitive strings (like particles), with an associated case-insensitive string (which can be used for indexes and/or queries).

A pstring table is named 't_pstring_{name}'.

A pstring table must contain a 'a_{name}_id' column which contains the ID for the pstring. The 'a_{name}_id' column is usually a DBT_ID24 or DBT_ID32 and is the primary key.

A pstring table must contain a 'a_{name}' column which contains the case-sensitive string value. The 'a_{name}' column is usually a DBT_ASCII_BIN or DBT_UTF8_BIN. There must be a unique index on the 'a_{name}' column.

A pstring table must contain a 'a_{name}_ci' column which contains the case-insensitive string value. The 'a_{name}_ci' column is usually a DBT_ASCII_CI or DBT_UTF8_CI column.

The only real difference between a pstring table and a particle table is that a pstring table always models strings (particles can model numbers and other scalar values) and pstring tables include a case-insensitive version of the value which can be used in indexes or queries.
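
A sketch of a hypothetical pstring table, showing the case-sensitive value alongside its case-insensitive companion (the collations are assumptions for MySQL):

 CREATE TABLE t_pstring_username (
   a_username_id INT NOT NULL AUTO_INCREMENT,  -- DBT_ID32 (assumed mapping)
   -- case-sensitive canonical value:
   a_username VARCHAR(190) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,
   -- case-insensitive companion for indexes and queries:
   a_username_ci VARCHAR(190) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL,
   PRIMARY KEY ( a_username_id ),
   UNIQUE KEY ( a_username ),
   KEY ( a_username_ci )
 ) ENGINE=InnoDB;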

textdata

The textdata tables contain strings and are accessed via the 'raw' connection. A textdata table maps a wide column to a narrow column. You would usually use a textdata table if a particle or pstring table is too small.

A textdata table is named 't_textdata_{name}'.

A textdata table must contain a 'a_{name}_id' column which contains the narrow ID for the string. The 'a_{name}_id' column is usually a DBT_ID24 or DBT_ID32.

A textdata table must contain a 'a_{name}' column which is the wide string value. The 'a_{name}' column should be case-insensitive. Case-sensitivity is ensured by the 'a_{name}_hash_bin' column, so the 'a_{name}' column can use a case-insensitive collation.

A textdata table must contain a 'a_{name}_hash_bin' column which contains the DBT_HASH_BIN binary hash of the string value.
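
A sketch of a hypothetical textdata table, assuming DBT_HASH_BIN maps to BINARY(28); note that the unique index goes on the hash, not on the (potentially very long) string itself:

 CREATE TABLE t_textdata_comment (
   a_comment_id INT NOT NULL AUTO_INCREMENT,  -- DBT_ID32 (assumed mapping)
   -- a case-insensitive collation is fine here because identity is by hash:
   a_comment TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL,
   -- hash of the string value; this is what ensures case-sensitivity:
   a_comment_hash_bin BINARY(28) NOT NULL,
   PRIMARY KEY ( a_comment_id ),
   UNIQUE KEY ( a_comment_hash_bin )
 ) ENGINE=InnoDB;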

upload

The upload tables contain data about uploaded files and are accessed via the 'raw' connection.

An upload table is named 't_upload_{name}'.

An upload table has 'a_{name}_id', 'a_{name}_hash_bin', 'a_{name}_file_name', 'a_{name}_file_type' and 'a_file_type_enum' columns. Usually the actual file will be stored in the file-system using the 'a_{name}_hash_hex' value for the file name.
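
A sketch of a hypothetical upload table with the columns listed above (the types are assumed mappings):

 CREATE TABLE t_upload_attachment (
   a_attachment_id INT NOT NULL AUTO_INCREMENT,  -- DBT_ID32 (assumed mapping)
   a_attachment_hash_bin BINARY(28) NOT NULL,    -- DBT_HASH_BIN
   a_attachment_file_name VARCHAR(255) NOT NULL,
   a_attachment_file_type VARCHAR(255) NOT NULL,
   a_file_type_enum TINYINT UNSIGNED NOT NULL,   -- ref to a file type lookup
   PRIMARY KEY ( a_attachment_id ),
   UNIQUE KEY ( a_attachment_hash_bin )
 ) ENGINE=InnoDB;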

zblob

The zblob tables contain compressed blobs and are accessed via the 'raw' connection.

A zblob table is named 't_zblob_{name}'.

A zblob table must contain a 'a_{name}_id' column which contains the narrow ID for the compressed blob. The 'a_{name}_id' column is usually a DBT_ID24 or DBT_ID32.

A zblob table must contain a 'a_{name}' column which is the compressed binary data. The 'a_{name}' column is usually a DBT_MLOB.

A zblob table must contain a 'a_{name}_hash_bin' column which contains the DBT_HASH_BIN binary hash of the compressed blob.

See the inflate.php program if you need to work with data in this format.
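
A sketch of a hypothetical zblob table (the types are assumed mappings); ztext and zjson tables have the same shape, differing only in what the compressed payload contains:

 CREATE TABLE t_zblob_image (
   a_image_id INT NOT NULL AUTO_INCREMENT,  -- DBT_ID32 (assumed mapping)
   a_image MEDIUMBLOB NOT NULL,             -- compressed payload (DBT_MLOB)
   a_image_hash_bin BINARY(28) NOT NULL,    -- hash of the compressed blob
   PRIMARY KEY ( a_image_id ),
   UNIQUE KEY ( a_image_hash_bin )
 ) ENGINE=InnoDB;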

ztext

The ztext tables contain compressed text data and are accessed via the 'raw' connection.

A ztext table is named 't_ztext_{name}'.

A ztext table must contain a 'a_{name}_id' column which contains the narrow ID for the compressed data. The 'a_{name}_id' column is usually a DBT_ID24 or DBT_ID32.

A ztext table must contain a 'a_{name}' column which is the compressed UTF-8 encoded textual data. The 'a_{name}' column is usually a DBT_MLOB.

A ztext table must contain a 'a_{name}_hash_bin' column which contains the DBT_HASH_BIN binary hash of the compressed data.

See the inflate.php program if you need to work with data in this format.

zjson

The zjson tables contain compressed JSON and are accessed via the 'raw' connection.

A zjson table is named 't_zjson_{name}'.

A zjson table must contain a 'a_{name}_id' column which contains the narrow ID for the compressed data. The 'a_{name}_id' column is usually a DBT_ID24 or DBT_ID32.

A zjson table must contain a 'a_{name}' column which is the compressed binary data. The 'a_{name}' column is usually a DBT_MLOB.

A zjson table must contain a 'a_{name}_hash_bin' column which contains the DBT_HASH_BIN binary hash of the compressed data.

See the inflate.php program if you need to work with data in this format.

structure

The structure tables contain structures and are accessed via the 'raw' connection. A structure table maps a wide column to a narrow column and constituent component columns.

A structure table is named 't_structure_{name}'.

A structure table must contain a 'a_{name}_id' column which contains the narrow ID for the structure. The 'a_{name}_id' column is usually a DBT_ID24 or DBT_ID32.

A structure table must contain a 'a_{name}' column which is the wide string value. The 'a_{name}' column should be case-insensitive. Case-sensitivity is ensured by the 'a_{name}_hash_bin' column, so the 'a_{name}' column can use a case-insensitive collation.

A structure table must contain a 'a_{name}_hash_bin' column which contains the DBT_HASH_BIN binary hash of the wide string value.

Other component columns will exist. For example for a 't_structure_email_address' there will be 'a_email_address_mailbox' and 'a_email_address_host' columns.
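
A sketch of the t_structure_email_address example just mentioned (the component columns come from the text above; the types are assumed mappings):

 CREATE TABLE t_structure_email_address (
   a_email_address_id INT NOT NULL AUTO_INCREMENT,  -- DBT_ID32 (assumed mapping)
   a_email_address TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL,
   a_email_address_hash_bin BINARY(28) NOT NULL,    -- ensures case-sensitivity
   -- constituent component columns:
   a_email_address_mailbox VARCHAR(190) NOT NULL,
   a_email_address_host VARCHAR(190) NOT NULL,
   PRIMARY KEY ( a_email_address_id ),
   UNIQUE KEY ( a_email_address_hash_bin )
 ) ENGINE=InnoDB;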

entity

The entity tables contain mutable application data and are accessed via the 'trn' connection. All entity tables have a corresponding history table.

An entity table is named 't_entity_{name}'.

An entity table has an 'a_{name}_id' column which is usually a DBT_ID24 or DBT_ID32.

An entity table has an 'a_{name}_version' column which is usually a reference to t_ident_interaction.a_interaction_id.

An entity table will have other columns too.

Typically entity tables use a prefix that indicates their subtype, e.g. 'app_', 'cfg_', 'web_', 'loc_', etc. Usually the subtype is omitted from the table's column names. E.g. t_entity_app_user.a_user_id. If you're worried about future namespace collisions picking an application prefix for your table and column names is probably the way to go. (2020-03-28 jj5 - THINK: do we want a registry for application prefixes to avoid collisions?)

history

The history tables contain logs of changes to mutable application data stored in entity tables and are accessed via the 'trn' connection. All history tables have a corresponding entity table.

A history table is named 't_history_{name}'.

A history table has an 'a_{name}_history_id' primary key which is usually a DBT_ID32.

A history table has an 'a_interaction_id' column, which is a reference to t_ident_interaction.a_interaction_id, identifying the interaction in which the entity was modified.

A history table has an 'a_crud_enum' column, which is a reference to t_lookup_crud.a_crud_enum, indicating the CRUD operation which affected the entity.

A history table then contains a copy of all of the columns from its corresponding entity table.
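
Pulling the entity and history patterns together, here is a sketch using the t_entity_app_user example mentioned above (the 'a_user_name' column and all type mappings are my assumptions):

 CREATE TABLE t_entity_app_user (
   a_user_id INT NOT NULL AUTO_INCREMENT,  -- DBT_ID32 (assumed mapping)
   a_user_version INT NOT NULL,            -- ref to t_ident_interaction.a_interaction_id
   a_user_name VARCHAR(190) NOT NULL,      -- hypothetical application column
   PRIMARY KEY ( a_user_id )
 ) ENGINE=InnoDB;
 
 CREATE TABLE t_history_app_user (
   a_user_history_id INT NOT NULL AUTO_INCREMENT,
   a_interaction_id INT NOT NULL,          -- interaction in which the entity changed
   a_crud_enum TINYINT UNSIGNED NOT NULL,  -- ref to t_lookup_crud.a_crud_enum
   -- a copy of all of the entity columns:
   a_user_id INT NOT NULL,
   a_user_version INT NOT NULL,
   a_user_name VARCHAR(190) NOT NULL,
   PRIMARY KEY ( a_user_history_id )
 ) ENGINE=InnoDB;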

log

The log tables contain log data and are accessed via the 'raw' connection.

A log table is named 't_log_{name}'.

There are no hard rules for the content of log tables, but you should usually copy data into logs by value rather than by reference, because referenced data might change after the logs have been written.
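
A sketch of a hypothetical login log table which copies the user name by value rather than only referencing the mutable entity row:

 CREATE TABLE t_log_login (
   a_login_log_id INT NOT NULL AUTO_INCREMENT,
   a_user_id INT NOT NULL,             -- reference, kept for convenience
   a_user_name VARCHAR(190) NOT NULL,  -- copied by value: the entity row may change later
   a_created_on DATETIME NOT NULL,
   PRIMARY KEY ( a_login_log_id )
 ) ENGINE=InnoDB;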

other

If we're dealing with tables which can't be categorised into any of the above patterns we call them 'other' data, and there are no restrictions on their format. The connection to use for 'other' tables is unspecified.

Database connections

In an application there are usually two separate connections to the database, being the 'raw' connection and the 'trn' connection. The database pattern indicates which connection to use when operating on any given type of table. Generally the 'raw' connection is used for everything except entity and history tables, which use the 'trn' connection.

raw connection

The 'raw' connection uses the READ UNCOMMITTED transaction isolation level and (usually) auto-commits.

Note: as a performance improvement you can begin a transaction at the end of an interaction when the only thing left to do is write some logs. Beginning a transaction will improve the performance of the logging, and you can commit the transaction when you're done.
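
A sketch of the session setup and the trailing-log optimisation described above, reusing the hypothetical t_log_login table from the log pattern section:

 -- raw connection setup (per session):
 SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
 
 -- at the end of an interaction, batch the remaining log writes:
 START TRANSACTION;
 INSERT INTO t_log_login ( a_user_id, a_user_name, a_created_on )
   VALUES ( 1, 'alice', NOW() );
 -- ...more log inserts...
 COMMIT;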

trn connection

The 'trn' connection uses the SERIALIZABLE transaction isolation level. Programmers must manually begin and commit (or rollback) their transactions. The PHPBOM library provides support for nested transactions.

Database types

There are many more supported database types which can be used for column definitions, but some key types are listed here.

Note: we model identity columns as integers, not unsigned integers. Usually identity columns have values starting at 1 (zero is unused) and incrementing from there. But sometimes it can be useful to multiply identity values (particularly when they are foreign keys) by minus one (-1) to make them negative while you're processing some values; that's why we use signed ints and not unsigned ints for identity columns.

DBT_ID8

A DBT_ID8 is an auto generated integer from 1 to 127.

DBT_ID16

A DBT_ID16 is an auto generated integer from 1 to 32,767.

DBT_ID24

A DBT_ID24 is an auto generated integer from 1 to 8,388,607.

DBT_ID32

A DBT_ID32 is an auto generated integer from 1 to 2,147,483,647.

DBT_ID64

A DBT_ID64 is an auto generated integer from 1 to 9,223,372,036,854,775,807.

DBT_UINT8

A DBT_UINT8 is an integer from 0 to 255.

DBT_UINT16

A DBT_UINT16 is an integer from 0 to 65,535.

DBT_UINT24

A DBT_UINT24 is an integer from 0 to 16,777,215.

DBT_UINT32

A DBT_UINT32 is an integer from 0 to 4,294,967,295.

DBT_UINT64

A DBT_UINT64 is an integer from 0 to 18,446,744,073,709,551,615 (2^64 - 1).

DBT_ASCII_CHAR_BIN

A DBT_ASCII_CHAR_BIN is a single case-sensitive ASCII character.

DBT_HASH_BIN

A DBT_HASH_BIN is a case-sensitive 28-byte binary string containing a SHA512/224 hash value.

DBT_HASH_HEX

A DBT_HASH_HEX is a case-insensitive 56-character hexadecimal string containing a SHA512/224 hash value.

DBT_TOKEN

A DBT_TOKEN is a case-sensitive 32-character alphanumeric ASCII string.

DBT_CREATED_ON

A DBT_CREATED_ON is the timestamp of when the record was created.

DBT_UPDATED_ON

A DBT_UPDATED_ON is the timestamp of when the record was last updated.
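
The DBT_* types are PHPBOM abstractions; the MySQL column types below are plausible mappings I'm assuming for illustration, not the PHPBOM definitions:

 CREATE TABLE t_example_types (
   a_example_id INT NOT NULL AUTO_INCREMENT,                                -- DBT_ID32
   a_example_uint8 TINYINT UNSIGNED NOT NULL,                               -- DBT_UINT8
   a_example_char CHAR(1) CHARACTER SET ascii COLLATE ascii_bin NOT NULL,   -- DBT_ASCII_CHAR_BIN
   a_example_hash_bin BINARY(28) NOT NULL,                                  -- DBT_HASH_BIN
   a_example_hash_hex CHAR(56) CHARACTER SET ascii NOT NULL,                -- DBT_HASH_HEX
   a_example_token CHAR(32) CHARACTER SET ascii COLLATE ascii_bin NOT NULL, -- DBT_TOKEN
   a_created_on DATETIME NOT NULL,                                          -- DBT_CREATED_ON
   a_updated_on DATETIME NOT NULL,                                          -- DBT_UPDATED_ON
   PRIMARY KEY ( a_example_id )
 ) ENGINE=InnoDB;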

Data formats

Field data can be in one of two formats: application format or database format. By and large the database format and the application format are the same: strings are strings, ints are ints, and floats are floats. However there are some differences, as summarised in the following chart. Note that in PHPBOM conversion between formats is done automatically for application data by the active record implementation.

{|class="wikitable"
! Database type
! Application format
! Database format
|-
| DBT_BOOL || boolean || int
|-
| DBT_DATETIME, DBT_DATETIME_UTC || DateTime || string
|-
| DBT_TIMEZONE || DateTimeZone || string
|}

Column size

Integer size

I get stuck on this all the time. My general disposition is to use only as much as you need; if it turns out that you need more, provision more. I take the view that limits are good and helpful things for making sure situations don't get out of hand. So if I only need about 100 values I will use a byte.

Here's some info about integer sizes.

{|class="wikitable"
! Bits
! Signed max
! Unsigned max
|-
| 8 || 127 || 255
|-
| 16 || 32,767 || 65,535
|-
| 24 || 8,388,607 || 16,777,215
|-
| 32 || 2,147,483,647 || 4,294,967,295
|-
| 64 || 9,223,372,036,854,775,807 || 18,446,744,073,709,551,615
|}

String length

If you have a fixed length field use a fixed length type. Examples include hashes and tokens.

Otherwise pick the first option that will work:

{|class="wikitable"
! ASCII
! UTF-8
! Note
|-
| 255 || 63 || length fits in one byte
|-
| 767 || 190 || maximum length MySQL can index
|-
| 65,000 || 16,250 || uses most of the row space supported by MySQL (65,535 bytes), but not all of it
|}

If you can't find an appropriate size above use MEDIUMTEXT if 16 MiB is enough or LONGTEXT for up to 4 GiB.

For help visualising string lengths see my field/column length examples.
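
For example (an illustrative sketch, assuming MySQL with the ascii and utf8mb4 character sets):

 CREATE TABLE t_example_strings (
   -- length prefix fits in one byte:
   a_example_code VARCHAR(255) CHARACTER SET ascii NOT NULL,
   -- indexable UTF-8, within MySQL's 767-byte index limit (190 * 4 = 760 bytes):
   a_example_title VARCHAR(190) CHARACTER SET utf8mb4 NOT NULL,
   KEY ( a_example_title )
 ) ENGINE=InnoDB;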

Glossary

ACID

ACID:

  • Atomicity
  • Consistency
  • Isolation
  • Durability

Concurrency control

See: https://en.wikipedia.org/wiki/Concurrency_control

DCL

Data Control Language

DDL

Data Definition Language

Deadlock

See: http://catb.org/jargon/html/D/deadlock.html

DML

Data Manipulation Language

DRI

Declarative Referential Integrity

Livelock

See: http://catb.org/jargon/html/L/livelock.html

Luhn algorithm

The Luhn algorithm can be used to generate a check digit for an integer value.

SQL

Structured Query Language (AKA: Scarcely Qualifies as a Language)

Transaction phenomena

These brief notes about potentially problematic transaction phenomena are taken from Joe Celko's SQL for Smarties: Advanced SQL Programming.

P0 Dirty write
P1 Dirty read
P2 Nonrepeatable read
P3 Phantom
P4 Lost update

Dirty write

T1 modifies a field. T2 further modifies that field before T1 commits. If T1 or T2 rolls back it is unclear what the correct field value should be.

Dirty read

T1 modifies a row. T2 then reads that row before T1 commits. If T1 performs a rollback, T2 will have read a row that was never committed.

Nonrepeatable read

T1 reads a row. T2 then modifies or deletes that row. If T1 then attempts to reread the row it may receive the modified value or discover the row has been deleted.

Phantom

T1 reads the set of rows N that satisfies some condition. T2 then executes statements that generate one or more rows that satisfy the condition used by T1. If T1 then repeats the initial read with the same condition it obtains a different collection of rows.

Lost update

T1 reads a field, then T2 updates that field, then T1 (based on its earlier read) updates the field. The update from T2 is lost.
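
A sketch of how a lost update plays out, using a hypothetical 't_entity_app_account' table; SELECT ... FOR UPDATE, or the SERIALIZABLE isolation level used on the 'trn' connection, prevents it:

 -- T1 reads the current balance (say it reads 100):
 SELECT a_balance FROM t_entity_app_account WHERE a_account_id = 1;
 -- T2 then updates the balance to 150 and commits:
 UPDATE t_entity_app_account SET a_balance = 150 WHERE a_account_id = 1;
 -- T1 now writes back a value computed from its stale read (100 + 25):
 UPDATE t_entity_app_account SET a_balance = 125 WHERE a_account_id = 1;
 -- T2's update has been lost: the balance is 125, not 175.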