SQL

Relational database management system (RDBMS)

stored on disk to be persisted
usually use B+ tree

Why SQL

If the data to be stored is structured
If ACID properties are required
If the size of the data is relatively small and can fit on a node)

ACID

Atomicity
- each transaction is all or nothing
- transaction fail if it did not complete
Consistency
- Any transaction will bring the database from one valid state to another
- constraints set by the database, such as foreign key constraints
Isolation
- Executing transactions concurrently has the same results as if the transactions were executed serially
Durability
- Once a transaction has been committed, it will remain so

Methods of scale

Replication

Potential loss of data if the master fail before replication
If write is frequent, the read replicas can get bogged down by replaying writes, and can't do much of reads
Increased slaves/nodes -> more replications -> increased latency
Hardware cost and complexity

Master-slave replication

The master serves read and write
- If master goes offline, the system will become read-only mode util a slaved is promoted to a new master
Slaves only serves read
- Slaves and replicate to additional slave in tree-like fashion

Disadvantages

Additional logic required to promote slave to master
Higher stale data, mainly used if user are from vastly different geo locations

Master-master replication

All masters servers both read and writes
- If either goes done, system remains to operate

Disadvantages

Need load balance or change on application logics to know where to write
Most master-master systems are either violate ACID or increased write latency due to synchronisation
Need to resolve conflict when writing on multiple nodes

Federation

Or know as functional partitioning, splits the database by functions
- Rather a monolithic database, you could have three: forums, users, and products

Advantages

Less read and write request to each -> less replication lag
Improve cache hit, since larger proportion of data can be stored in memory
Can write in parallel to further efficiency

Disadvantages

Not efficient if schema requires huge functions or tables
Update logic to know where to read and write
Joining from two databases is more complex
Hardware complexity

Sharding

Distribute data across difference databases such that each database can only manage a subset of the data
Usually use user's last name or geographic locations
MySQL and Postgres do no support sharding natively

Advantages

Less read and write request to each -> less replication lag
Improve cache hit, since larger proportion of data can be stored in memory
Can write in parallel to further efficiency

Disadvantages

Update logic to know where to read and write, result in complex SQL queries
Joining from multiple shards is more complex
Data distribution can become lopsided in a shard. For example, a set of power users on a shard could result in increased load to that shard compared to others.
- Rebalancing adds additional complexity, can use consistent hashing to reduce the complexity
Hardware complexity

Denormalisation

Redundant copies of the data are written in multiple tables to avoid expensive joins.

Disadvantage(s):

Data is duplicated.
Constraints can help redundant copies of information stay in sync, which increases complexity of the database design.
A denormalised database under heavy write load might perform worse than its normalised counterpart.

SQL tuning

Important to benchmark and profile to simulate and uncover bottlenecks

Benchmark - Simulate high-load situations
Profile - tracking performance issues

Optimisations

Tighten up the schema

Avoid BLOB storage
Use CHAR instead of VARCHAR for fixed length strings

Use good indices

Columns that you are querying (SELECT, GROUP BY, ORDER BY, JOIN) could be faster with indices.
Indices are usually represented as self-balancing B-tree that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic time.

Avoid expensive joins

Tune the query cache