Databases

SQL

Relational database management system (RDBMS)

stored on disk to be persisted
usually use B+ tree

Why SQL

If the data to be stored is structured
If ACID properties are required
If the size of the data is relatively small and can fit on a node)

ACID

Atomicity
- each transaction is all or nothing
- transaction fail if it did not complete
Consistency
- Any transaction will bring the database from one valid state to another
- constraints set by the database, such as foreign key constraints
Isolation
- Executing transactions concurrently has the same results as if the transactions were executed serially
Durability
- Once a transaction has been committed, it will remain so

Methods of scale

Replication

Potential loss of data if the master fail before replication
If write is frequent, the read replicas can get bogged down by replaying writes, and can't do much of reads
Increased slaves/nodes -> more replications -> increased latency
Hardware cost and complexity

Master-slave replication

The master serves read and write
- If master goes offline, the system will become read-only mode util a slaved is promoted to a new master
Slaves only serves read
- Slaves and replicate to additional slave in tree-like fashion

Disadvantages

Additional logic required to promote slave to master
Higher stale data, mainly used if user are from vastly different geo locations

Master-master replication

All masters servers both read and writes
- If either goes done, system remains to operate

Disadvantages

Need load balance or change on application logics to know where to write
Most master-master systems are either violate ACID or increased write latency due to synchronisation
Need to resolve conflict when writing on multiple nodes

Federation

Or know as functional partitioning, splits the database by functions
- Rather a monolithic database, you could have three: forums, users, and products

Advantages

Less read and write request to each -> less replication lag
Improve cache hit, since larger proportion of data can be stored in memory
Can write in parallel to further efficiency

Disadvantages

Not efficient if schema requires huge functions or tables
Update logic to know where to read and write
Joining from two databases is more complex
Hardware complexity

Sharding

Distribute data across difference databases such that each database can only manage a subset of the data
Usually use user's last name or geographic locations
MySQL and Postgres do no support sharding natively

Advantages

Less read and write request to each -> less replication lag
Improve cache hit, since larger proportion of data can be stored in memory
Can write in parallel to further efficiency

Disadvantages

Update logic to know where to read and write, result in complex SQL queries
Joining from multiple shards is more complex
Data distribution can become lopsided in a shard. For example, a set of power users on a shard could result in increased load to that shard compared to others.
- Rebalancing adds additional complexity, can use consistent hashing to reduce the complexity
Hardware complexity

Denormalisation

Redundant copies of the data are written in multiple tables to avoid expensive joins.

Disadvantage(s):

Data is duplicated.
Constraints can help redundant copies of information stay in sync, which increases complexity of the database design.
A denormalised database under heavy write load might perform worse than its normalised counterpart.

SQL tuning

Important to benchmark and profile to simulate and uncover bottlenecks

Benchmark - Simulate high-load situations
Profile - tracking performance issues

Optimisations

Tighten up the schema

Avoid BLOB storage
Use CHAR instead of VARCHAR for fixed length strings

Use good indices

Columns that you are querying (SELECT, GROUP BY, ORDER BY, JOIN) could be faster with indices.
Indices are usually represented as self-balancing B-tree that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic time.

Avoid expensive joins

Tune the query cache

NoSQL

Why NoSQL

If the data to be stored is unstructured
If there's a need to serialise and deserialise data
If the size of the data to be stored is large
Favour eventual consistency
limited by the scale

BASE

Basically available - the system guarantees availability.
Soft state - the state of the system may change over time, even without input.
Eventual consistency - the system will become consistent over a period of time, given that the system doesn't receive input during that period.

Key-Value store

Redis, DynamoDB
- For rapidly changing data
Just like an hash-map
- flat in nature
- used for session oriented applications
Maintain keys in lexicographic order, allowing efficient retrieval of key ranges.

Document Store

MongoDB
- For high flexibility, and occasionally changing data
such in JSON, XML, binary formate
- key points to a hierarchical structure
- unstructured catalog data
Provides APIs or a query language to query based on internal structure of the document itself

Wide Column Store

- Cassandra, Bigtable
- high availability and high scalability

for large volume of write
don't need to update and read a lot

Graph database

network structure such as social media networks
- social applications
Most can only access with REST APIs

CleanShot 2024-09-25 at 17.08.28.png

SQL or NoSQL

Reasons for SQL:

Structured data
Strict schema
Relational data
Need for complex joins
Transactions
Clear patterns for scaling
More established: developers, community, code, tools, etc
Lookups by index are very fast

Reasons for NoSQL:

Semi-structured data
Dynamic or flexible schema
Non-relational data
No need for complex joins
Store many TB (or PB) of data
Very data intensive workload
Very high throughput for IOPS

Sample data well-suited for NoSQL:

Rapid ingest of clickstream and log data
Leaderboard or scoring data
Temporary data, such as a shopping cart
Frequently accessed ('hot') tables
Metadata/lookup tables

Replication

Sequencer