I Foundations of Data Systems

Chapter 2 Data Models and Query Languages

Data models are important. They affect how the software is written and how we think about the problem we are solving.

Most applications are built by layering one data model on top of another: each layer hides the complexity of the layers below it by providing a clean data model.

Document models

Pros:

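works well for one-to-many relationships (tree-structured data that is usually loaded all at once)
schema flexibility. Document databases are often called “schemaless”, but “schema-on-read” is more accurate: the structure is implicit and only interpreted when the data is read, much like dynamic (runtime) type checking. Schema-on-write, the relational approach, resembles static (compile-time) type checking. There is no schema to migrate, and the data model is close to the objects used in application code
better performance due to locality: a whole document is stored as one continuous string, so it can be fetched in one query. The flip side is that reading a small part still loads the whole document, and an update usually rewrites the whole document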

Cons:

limited support for referring to nested items. You cannot refer directly to a nested item within a document; instead you need to say something like “the second item in the list of positions for user 251” (much like an access path in the hierarchical model)
poor support for joins, and therefore for many-to-one and many-to-many relationships

Use cases:

The data in your application has a document-like structure (a tree of one-to-many relationships)
Records have mostly no relationships to each other, or only one-to-many relationships

Relational models

Data is organized into relations (tables in SQL).
Each relation is an unordered collection of tuples (rows in SQL).

Pros:

better support for joins
works well with one-to-many, many-to-one and simple many-to-many relationships
Strong transaction support (ACID)
Strong query optimizer

Document/relational models

A hybrid of the relational and document models is a good route for databases to take in the future.

Graph models

For highly interconnected data, the document model is awkward, the relational model is acceptable, and graph models are the most natural. Graphs are good for evolvability: as you add features to your application, a graph can easily be extended to accommodate changes in your application’s data structures

Property Graphs

In the property graph model, each vertex consists of the following (vertices in the same graph can represent very different kinds of objects):

A unique identifier
A set of outgoing edges
A set of incoming edges
A collection of properties (key-value pairs)

Each edge consists of:

A unique identifier
The vertex at which the edge starts (the tail vertex)
The vertex at which the edge ends (the head vertex)
A label to describe the kind of relationship between the two vertices
A collection of properties (key-value pairs)

Example: Cypher

CREATE
  (NAmerica:Location {name:'North America', type:'continent'}), // vertex
  (USA:Location      {name:'United States', type:'country'  }), // vertex
  (Idaho:Location    {name:'Idaho',         type:'state'    }), // vertex
  (Lucy:Person       {name:'Lucy' }),                           // vertex
  (Idaho) -[:WITHIN]->  (USA)  -[:WITHIN]-> (NAmerica),         // edge
  (Lucy)  -[:BORN_IN]-> (Idaho)                                 // edge

// Find the names of people who were born in the United States and now live in Europe
MATCH
  (person) -[:BORN_IN]->  () -[:WITHIN*0..]-> (us:Location {name:'United States'}),
  (person) -[:LIVES_IN]-> () -[:WITHIN*0..]-> (eu:Location {name:'Europe'})
RETURN person.name
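
The vertex and edge structure listed above, and the `-[:WITHIN*0..]->` step of the query ("follow zero or more WITHIN edges"), can be sketched in plain Python. This is only an illustration: the class names, `connect` and `follow` are made up here and are not part of Cypher or any graph database API.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)
class Vertex:
    id: int
    properties: dict = field(default_factory=dict)
    out_edges: list = field(default_factory=list)   # edges that start at this vertex
    in_edges: list = field(default_factory=list)    # edges that end at this vertex

@dataclass(eq=False)
class Edge:
    id: int
    tail: Vertex      # vertex at which the edge starts
    head: Vertex      # vertex at which the edge ends
    label: str
    properties: dict = field(default_factory=dict)

def connect(edge_id, tail, label, head):
    """Create an edge and register it on both of its vertices."""
    edge = Edge(edge_id, tail, head, label)
    tail.out_edges.append(edge)
    head.in_edges.append(edge)
    return edge

def follow(start, label):
    """Vertices reachable from start via zero or more outgoing `label` edges."""
    found, stack = {}, [start]
    while stack:
        vertex = stack.pop()
        if vertex.id not in found:
            found[vertex.id] = vertex
            stack.extend(e.head for e in vertex.out_edges if e.label == label)
    return list(found.values())

namerica = Vertex(1, {"name": "North America", "type": "continent"})
usa = Vertex(2, {"name": "United States", "type": "country"})
idaho = Vertex(3, {"name": "Idaho", "type": "state"})
lucy = Vertex(4, {"name": "Lucy"})
connect(1, idaho, "WITHIN", usa)
connect(2, usa, "WITHIN", namerica)
connect(3, lucy, "BORN_IN", idaho)

birthplace = lucy.out_edges[0].head
born_within = [v.properties["name"] for v in follow(birthplace, "WITHIN")]
```

Lucy matches the "born in the United States" half of the query because `born_within` contains 'United States'. The declarative Cypher version leaves the choice of traversal strategy to the query optimizer.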

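Example: Triple-stores. All information is stored as three-part statements: (subject, predicate, object). This is another way to represent a graph.
The subject of a triple is equivalent to a vertex in a graph. The object is either a value of a primitive datatype such as a string or a number (the predicate and object then act as the key and value of a property on the subject vertex), or another vertex in the graph (the predicate is then an edge, with the subject as the tail vertex and the object as the head vertex)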

Query Languages for Data

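Query languages are either declarative or imperative. SQL is an example of a declarative language: you specify the pattern of the data you want, not how to retrieve it.

The advantages of declarative languages:

results have no guaranteed order, so the database is free to rearrange data internally
more concise and easier to work with
automatic optimizations (the implementation is hidden behind the query API -> the engine can be improved without changing any queries)
lend themselves to parallel execution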
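
A small side-by-side sketch of the two styles, using Python's built-in `sqlite3` module for the declarative half. The `animals` data and both function names are made up for illustration.

```python
import sqlite3

animals = [("shark", "Sharks"), ("lion", "Felidae"), ("hammerhead", "Sharks")]

def sharks_imperative(rows):
    """Imperative: spell out, step by step, how to walk the data."""
    sharks = []
    for name, family in rows:
        if family == "Sharks":
            sharks.append(name)
    return sharks

def sharks_declarative(rows):
    """Declarative: state the condition; the engine decides how to execute it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE animals (name TEXT, family TEXT)")
    conn.executemany("INSERT INTO animals VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT name FROM animals WHERE family = 'Sharks' ORDER BY name"
    ).fetchall()
    conn.close()
    return [name for (name,) in result]
```

The SQL version says nothing about loops or scan order, so the database could add an index or run the scan in parallel without the query changing. Without the ORDER BY clause, the order of its result would not be guaranteed.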

Chapter 3 Storage and Retrieval

Different storage engines are optimized for OLTP and for OLAP

Differences between OLTP and OLAP:

Property	OLTP (transaction processing)	OLAP (analytics)
Main read pattern	Small number of records per query, fetched by key	Aggregate over a large number of records
Main write pattern	Random-access, low-latency writes	Bulk import (ETL) or event stream
Primarily used by	End users/customers	Internal analysts
What data represents	Latest state of data	History of events over time
Dataset size	GB to TB	TB to PB

A data warehouse is a separate database that analysts can query
1. Advantages:
  1. analysts can query it without affecting OLTP operations
  2. it can be optimized for analytic access patterns
2. Where does the data come from? Data is extracted from OLTP databases, transformed into an analysis-friendly schema, cleaned up, and then loaded into the data warehouse. This process of getting data into the warehouse is known as Extract–Transform–Load (ETL)
3. vs OLTP databases: both might offer the same SQL query interface, but the internals of the systems can look quite different because they are optimized for different query patterns

OLTP storage engines

Can be divided into two categories: log-structured and update-in-place

Log-structured

Bitcask

Many databases internally use a log, which is an append-only data file. With a bare log, a write is O(1) because it only appends to the end of the file, but a lookup is O(n) because it has to scan the whole file.
To speed up lookups, we need an index. An index is an additional structure derived from the primary data; it has to be updated on every write, so it speeds up reads but slows down writes.

The simplest index is for key-value data. For example, Bitcask keeps a hash map in memory in which every key maps to the byte offset in the data file on disk where its latest value is stored. When a new key-value pair is appended, the hash map is updated to point at the new record. A lookup uses the hash map to find the byte offset, seeks to it and reads the value. Bitcask is well suited to situations where the value for each key is updated frequently and all the keys fit in memory.
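
A minimal sketch of the idea, not Bitcask's actual file format: an `io.BytesIO` buffer stands in for the data file on disk, and the length-prefixed record layout and the class name `HashIndexedLog` are assumptions made for this example.

```python
import io
import struct

class HashIndexedLog:
    """Append-only log plus an in-memory hash map of key -> byte offset."""

    def __init__(self):
        self.log = io.BytesIO()   # stands in for the data file on disk
        self.index = {}           # key -> offset of the latest record for that key

    def put(self, key: str, value: str) -> None:
        k, v = key.encode(), value.encode()
        offset = self.log.seek(0, io.SEEK_END)      # always append at the end
        # Record layout (assumed): key length, value length, key bytes, value bytes
        self.log.write(struct.pack(">II", len(k), len(v)) + k + v)
        self.index[key] = offset                    # point the index at the new record

    def get(self, key: str):
        offset = self.index.get(key)
        if offset is None:
            return None
        self.log.seek(offset)
        key_len, value_len = struct.unpack(">II", self.log.read(8))
        self.log.seek(key_len, io.SEEK_CUR)         # skip over the key bytes
        return self.log.read(value_len).decode()
```

Overwriting a key never touches the old record. It appends a new one and moves the index entry, which is why the log keeps growing until it is compacted.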

How do we avoid eventually running out of disk space when appending to a file? A good solution is to break the log into segments of a certain size (by closing a segment file when it reaches a certain size, and making subsequent writes to a new segment file). We can then perform compaction on these segments. Compaction means throwing away duplicate keys in the log, and keeping only the most recent update for each key. Compaction can be done in a background thread. After compaction, old segment files can simply be deleted. In this case, lookup will be: check the most recent segment’s hash map; if the key is not present we check the second-most-recent segment, and so on.
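
Compaction, merging and the newest-first lookup can be sketched with segments modelled as Python lists of (key, value) records in write order. The function names and the sample data are illustrative only.

```python
def compact(segment):
    """Keep only the most recent value for each key within one segment."""
    latest = {}
    for key, value in segment:      # later records overwrite earlier ones
        latest[key] = value
    return list(latest.items())

def merge_segments(segments):
    """Merge segments (ordered oldest to newest) into one compacted segment."""
    merged = {}
    for segment in segments:        # newer segments overwrite older ones
        merged.update(compact(segment))
    return list(merged.items())

def lookup(key, segments):
    """Check the most recent segment first, then progressively older ones."""
    for segment in reversed(segments):
        for k, v in reversed(segment):
            if k == key:
                return v
    return None

old_segment = [("mew", 1078), ("purr", 2103), ("purr", 2104), ("mew", 1079)]
new_segment = [("purr", 2105), ("yawn", 511)]
```

A real engine would consult each segment's hash map instead of scanning the records, and would write the merged output to a new file before deleting the old segment files.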

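Issues to consider when implementing a log-structured database:
1. File format: CSV is not the best format for a log. It’s faster and simpler to use a binary format that first encodes the length of a string in bytes, followed by the raw string.
2. Deleting records: append a special deletion record (a tombstone) to the data file. When segments are merged, the tombstone tells the merging process to discard any previous values for the deleted key.
3. Crash recovery: the in-memory hash maps are lost on restart. They can be restored by reading each segment file from beginning to end, but that is slow for large segments. Bitcask speeds up recovery by storing a snapshot of each segment’s hash map on disk.
4. Partially written records: the database may crash halfway through appending a record. Checksums allow such corrupted parts of the log to be detected and ignored.
5. Concurrency control: writes are appended in strictly sequential order, so a common choice is a single writer thread. Segments are append-only and otherwise immutable, so they can be read concurrently by multiple threads.
Advantages of an append-only log over updating the file in place:
1. Appending and segment merging are sequential writes, which are generally much faster than random writes
2. Concurrency and crash recovery are much simpler if segment files are append-only or immutable
3. Merging old segments avoids the problem of data files getting fragmented over time
Limitations of the hash table index:
1. Range queries are not efficient: every key in the range has to be looked up individually
2. The hash table must fit in memory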

SSTables

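An SSTable (Sorted String Table) is a segment file in which the sequence of key-value pairs is sorted by key, and each key appears only once per segment

SSTables have several big advantages over log segments with hash indexes:
1. Merging segments is simple and efficient, even if the files are bigger than the available memory. It works like the merge step of mergesort; when the same key appears in several segments, keep the value from the most recent segment
2. To find a key, you no longer need an index of all the keys in memory. A sparse index with one key for every few kilobytes of segment file is enough: jump to the closest indexed key before the one you want and scan from there
3. Records can be grouped into blocks and compressed before being written to disk -> saves disk space and reduces I/O bandwidth use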
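The mergesort-like merge can be sketched with `heapq.merge` from the standard library, which lazily merges already-sorted inputs. Segments are modelled as sorted lists of (key, value) pairs, newest segment first; `merge_sstables` and `_tag` are names invented for this sketch.

```python
import heapq

def _tag(segment, age):
    """Attach the segment's age (0 = newest) so equal keys sort newest first."""
    for key, value in segment:
        yield key, age, value

def merge_sstables(segments):
    """Merge sorted segments, given newest first, into one sorted segment."""
    streams = [_tag(segment, age) for age, segment in enumerate(segments)]
    merged, last_key = [], object()
    for key, _age, value in heapq.merge(*streams):
        if key != last_key:          # the first record seen for a key is the newest
            merged.append((key, value))
            last_key = key
    return merged

newest = [("handbag", 8786), ("handful", 40308)]
older = [("handbag", 1), ("handcuffs", 2248), ("handiwork", 16912)]
```

Because every input is read sequentially and only one record per input is held at a time, the same approach works for files that are larger than memory.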
How do SSTables work?
1. When a write comes in, add it to an in-memory balanced tree data structure (for example a red-black tree), called a memtable
2. When the memtable gets bigger than some threshold, write it out to disk as an SSTable file. The new SSTable file becomes the most recent segment of the database.
3. In order to serve a read request, first try to find the key in the memtable, then in the most recent on-disk segment, then in the next-older segment, etc.
4. From time to time, run a merging and compaction process in the background to combine segment files and to discard overwritten or deleted values.
Problems:
1. If the database crashes, the most recent writes (which are in the memtable but not yet written out to disk) are lost. To avoid that problem, keep a separate log on disk to which every write is immediately appended. The log is only used to restore the memtable after a crash. Whenever the memtable is written out to an SSTable, the corresponding log can be discarded.
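The write path, read path and crash recovery above can be sketched as a toy in-memory model. Assumptions made for brevity: a dict that is sorted at flush time stands in for the balanced tree, Python lists stand in for the on-disk log and SSTable files, and the class name `TinyLSM` is invented.

```python
import bisect

class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}        # stand-in for the in-memory balanced tree
        self.wal = []             # stand-in for the on-disk log of recent writes
        self.segments = []        # SSTables, oldest first; each a sorted list of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.wal.append((key, value))     # append to the log before anything else
        self.memtable[key] = value
        if len(self.memtable) > self.memtable_limit:
            self.flush()

    def flush(self):
        """Write the memtable out as the newest sorted segment."""
        self.segments.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.wal = []                     # flushed data no longer needs the log

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):       # newest segment first
            i = bisect.bisect_left(segment, (key,))   # binary search on the key
            if i < len(segment) and segment[i][0] == key:
                return segment[i][1]
        return None

    def recover(self):
        """After a crash, rebuild the memtable by replaying the log."""
        self.memtable = dict(self.wal)
```

Background merging of segments is left out here.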
Optimizations:
1. Looking up a key that does not exist is slow, because the memtable and every segment have to be checked. A Bloom filter can tell you cheaply that a key does not appear in the database
2. Choose a strategy for the order and timing of how SSTables are compacted and merged:
  1. size-tiered compaction: newer and smaller SSTables are successively merged into older and larger SSTables
  2. leveled compaction: the key range is split up into smaller SSTables and older data is moved into separate “levels”, which lets compaction proceed more incrementally and use less disk space
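
A Bloom filter can be sketched in a few lines. The sizes (1024 bits, 3 hash functions) and the use of SHA-256 to derive bit positions are arbitrary choices for this example, not what any particular storage engine does.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0                      # a Python int used as a bit array

    def _positions(self, key: str):
        """Derive num_hashes bit positions from the key."""
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        """False means definitely absent; True means possibly present."""
        return all((self.bits >> pos) & 1 for pos in self._positions(key))
```

A read would consult the filter first and skip the segment lookups whenever `might_contain` returns False. False positives are possible, false negatives are not.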

LSM tree

Storage engines based on this principle of merging and compacting sorted files (SSTables) are often called LSM storage engines, after the Log-Structured Merge-Tree (LSM-tree)

Update-in-place

B-Trees

B-trees break the database down into fixed-size blocks or pages, traditionally 4 KB in size, and read or write one page at a time. Each page can be identified using an address or location, which allows one page to refer to another: similar to a pointer, but on disk instead of in memory.

How do B-trees work?
1. Search for a key: start at the root page and follow the page references down to the leaf page that contains the key
2. Update the value of an existing key: search for the leaf page containing the key, change the value in that page, and write the page back to disk
3. Add a new key: find the page whose range encompasses the new key and add it to that page. If there isn’t enough free space in the page, split it into two half-full pages and update the parent page to account for the new subdivision of key ranges. =>
  1. This algorithm ensures that the tree remains balanced: a B-tree with n keys always has a depth of O(log n). Most databases fit into a B-tree that is three or four levels deep.
  2. It makes writes more complicated. If you split a page, you need to write the two pages that were split, and also overwrite their parent page to update the references to the two child pages.
4. Crash resilience: B-tree implementations commonly keep a write-ahead log (WAL), an append-only file to which every modification must be written before it is applied to the pages of the tree itself. After a crash, the log is used to restore the B-tree to a consistent state
5. Concurrency: when multiple threads access the B-tree at the same time, protect the tree’s data structures with latches (lightweight locks).
Optimizations:
1. Instead of overwriting pages and maintaining a WAL, use a copy-on-write scheme: write a modified page to a new location, then create new versions of the parent pages that point at the new location.
2. Pack more keys into a page by abbreviating them => the tree has a higher branching factor, and thus fewer levels
3. Try to lay out leaf pages in sequential order on disk => more efficient for queries that scan a large part of the key range
4. Add pointers from each leaf page to its sibling pages on the left and right => allows scanning keys in order without jumping back to parent pages
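
A rough back-of-the-envelope calculation shows why three or four levels are usually enough. The figures are assumptions chosen for illustration (4 KB pages, a branching factor of 500), and the formula ignores page headers and partially filled pages.

```python
def btree_capacity_bytes(branching_factor: int, levels: int, page_size: int) -> int:
    """Approximate capacity: branching_factor ** levels page references times the page size."""
    return branching_factor ** levels * page_size

capacity = btree_capacity_bytes(branching_factor=500, levels=4, page_size=4096)
capacity_tb = capacity / 10**12   # decimal terabytes
```

Here `capacity_tb` comes out at 256.0, so under these assumptions a four-level tree can address on the order of hundreds of terabytes.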

Comparing B-Trees and LSM-Trees

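Property	LSM-Trees	B-Trees
Overall	Typically faster for writes (sequential appends)	Typically faster for reads (each key lives in exactly one place; a lookup is O(log n) page reads)
Write overhead	Sequential writes; SSTables compress well and leave less fragmentation, so files on disk are smaller. Data is still rewritten several times by compaction	Write amplification: every change is written at least twice, once to the WAL and once to the tree page; a whole page is written even if only a few bytes changed (a particular concern on SSDs, which wear out after a limited number of overwrites)
Compaction overhead	Compaction competes with incoming writes for disk bandwidth => it can throttle writes and slow down response times at high percentiles; if compaction cannot keep up, unmerged segments pile up	No background compaction, so response times are more predictable
Transactions	May hold multiple copies of the same key in different segments, which makes locking on key ranges harder	Each key exists in exactly one place, so locks on ranges of keys can be attached directly to the tree => attractive for strong transactional semantics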

OLAP storage engines

Column-Oriented Storage

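The idea behind column-oriented storage is: don’t store all the values from one row together, but store all the values from each column together instead. A query then only reads and parses the columns it actually uses.

Advantages:
1. Reduce the volume of data that needs to be loaded from disk
  1. Compression
    - Observation: the data in one column is often repetitive (i.e. the number of distinct values in a column is small compared to the number of rows)
    - Improvement: use bitmap encoding (one bitmap per distinct value, with one bit per row), optionally run-length encoded. This shrinks the stored data without hurting queries, which can be answered with bitwise AND/OR on the bitmaps => further reduces the volume of data that needs to be loaded from disk
  2. Sort order in column storage
    - How: sort the rows by a chosen column (every column must be reordered the same way, so that the kth entry of each column still belongs to the same row) and use the sort order as an indexing mechanism
    - Improvement: sorting groups equal values together => long runs that compress well (e.g. instead of storing 12, 12, …, 12, store “12, repeated 10000 times”)
2. Make efficient use of CPU cycles: compressed column data can be processed in tight loops that fit in the L1 cache and use SIMD instructions (vectorized processing)
Downsides:
1. Writes are harder. An update-in-place insert into the middle of a sorted table would have to rewrite all the column files. An alternative is to use an LSM-tree: writes first go to an in-memory store and are later merged with the column files on disk in bulk. Queries then need to examine both the column data on disk and the recent writes in memory.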
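
Bitmap encoding, run-length encoding and a bitwise query can be sketched as follows. Bitmaps are plain Python lists of 0/1 for readability (a real column store would pack the bits), and the function names and sample column are made up.

```python
from itertools import groupby

def bitmap_encode(column):
    """One bitmap per distinct value; bit i is 1 if row i holds that value."""
    bitmaps = {}
    for row, value in enumerate(column):
        bitmaps.setdefault(value, [0] * len(column))[row] = 1
    return bitmaps

def run_length_encode(values):
    """Collapse runs of equal values into (value, run length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]

def rows_matching_any(bitmaps, wanted, num_rows):
    """WHERE col IN (...) becomes a bitwise OR of the matching bitmaps."""
    result = [0] * num_rows
    for value in wanted:
        for row, bit in enumerate(bitmaps.get(value, [])):
            result[row] |= bit
    return result

product_sk = [69, 69, 69, 74, 31, 31, 69]
bitmaps = bitmap_encode(product_sk)
```

Sorting the column first makes the runs longer: `run_length_encode(sorted(product_sk))` needs only three pairs for the seven rows.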

Chapter 4 Encoding and Evolution

Formats for Encoding Data

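Encoding (also known as serialization or marshalling): translating data from its in-memory representation (objects, structs, lists, etc.) into a self-contained sequence of bytes. Decoding (parsing, deserialization, unmarshalling) is the reverse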

Language-Specific Formats

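Language	Format
Java	java.io.Serializable, Kryo (third-party)
Ruby	Marshal
Python	pickle

Problems:
1. the encoding is tied to one programming language; reading the data in another language is very difficult
2. security problems: decoding needs to be able to instantiate arbitrary classes, so an attacker who can get the application to decode arbitrary bytes may be able to execute arbitrary code
3. forward and backward compatibility are often neglected
4. efficiency (CPU time and encoded size) is often an afterthought. Java’s built-in serialization is notorious for its bad performance
Conclusion: it’s generally a bad idea to use your language’s built-in encoding for anything other than very transient purposes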

JSON, XML and CSV

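Textual and somewhat human-readable

Problems:
1. verbose: XML especially
2. ambiguity around the encoding of numbers:
  1. XML and CSV cannot distinguish between a number and a string that happens to consist of digits (except by referring to an external schema)
  2. JSON distinguishes strings and numbers, but it cannot distinguish integers and floating-point numbers, and it doesn’t specify a precision
3. JSON and XML support Unicode strings but not binary strings (sequences of bytes without a character encoding); a common workaround is to Base64-encode binary data
4. schema support is optional for XML and JSON. Applications that don’t use a schema have to hardcode how to interpret the data
5. CSV does not have any schema
Conclusion: despite these flaws, JSON, XML and CSV are good enough for many purposes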

Binary encoding

Conclusion: binary encodings are more compact and faster to parse than textual formats, which matters once the dataset gets large

Thrift and Protocol Buffers (Two binary encoding formats)

IDL

// IDL of Thrift
struct Person {
  1: required string       userName,
  2: optional i64          favoriteNumber,
  3: optional list<string> interests
}

// IDL of Protocol Buffers
message Person {
    required string user_name       = 1;
    optional int64  favorite_number = 2;
    repeated string interests       = 3;
}

How they encode
1. Thrift has two different binary encoding formats: BinaryProtocol and CompactProtocol.
  1. BinaryProtocol: each field carries a type annotation to indicate its type and a field tag (the numbers 1, 2, 3 from the schema definition) instead of the field name.
  2. CompactProtocol: packs the field type and tag number into a single byte, and uses variable-length integers: a number takes as many bytes as it needs (7 bits of data per byte, with the top bit indicating whether more bytes follow) instead of a full eight bytes.
2. Protocol Buffers encodes data similarly to Thrift’s CompactProtocol
3. Note: marking a field as required or optional makes no difference to how the field is encoded. required only enables a runtime check that fails if the field is not set
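The variable-length integer encoding can be sketched as below. This covers only the integer part (no field tags or types) and is not either library's actual API; the function names are invented. The zigzag step is included on the understanding that Thrift's CompactProtocol maps signed integers to unsigned ones before varint-encoding them, while a Protocol Buffers int64 is varint-encoded directly.

```python
def encode_varint(n: int) -> bytes:
    """Unsigned varint: 7 data bits per byte, top bit set on all but the last byte."""
    if n < 0:
        raise ValueError("varints encode non-negative integers")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)          # last byte
            return bytes(out)

def decode_varint(data: bytes) -> int:
    """Inverse of encode_varint; stops at the first byte without the top bit."""
    result = 0
    for i, byte in enumerate(data):
        result |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:
            break
    return result

def zigzag(n: int) -> int:
    """Map a signed 64-bit integer to unsigned so small negatives stay small."""
    return (n << 1) ^ (n >> 63)

plain = encode_varint(1337)            # b'\xb9\n', i.e. bytes b9 0a
zigzagged = encode_varint(zigzag(1337))  # bytes f2 14
```

Either way, the number 1337 takes two bytes instead of the eight bytes of a fixed-width 64-bit integer.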
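How they evolve
1. Can change a field’s name, since encoded data never refers to field names
2. Cannot change a field’s tag, since that would make all existing encoded data invalid; each field needs a unique tag number
3. Can remove an optional field (never a required one); its tag number must never be reused
4. Can add a new field with a new tag number, but it cannot be required: new code reading old data would fail the check. It must be optional or have a default value
5. Can change a field’s data type, but there is a risk that values will lose precision or get truncated (e.g. old code reading a 64-bit value into a 32-bit variable)
6. Protocol Buffers has no list or array datatype; it has a repeated marker for fields instead, so it is okay to change an optional (single-valued) field into a repeated (multi-valued) field. Thrift has a dedicated list datatype, which does not allow this evolution from single-valued to multi-valued, but does support nested lists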

Avro (another binary encoding format)

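Avro differs from Thrift and Protocol Buffers. Why was Avro developed? It started as a subproject of Hadoop, because Thrift was not a good fit for Hadoop’s use cases.

Avro has two schema languages: Avro IDL, intended for human editing, and a JSON-based one that is more easily machine-readable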

// IDL of Avro
record Person {
    string               userName;
    union { null, long } favoriteNumber = null;
    array<string>        interests;
}

How it encodes
- Different from Thrift/Protocol Buffers: the encoded data contains nothing to identify fields or their datatypes; it is just the values concatenated together. To parse the data, go through the fields in the order that they appear in the schema and use the schema to tell the datatype of each field. The reader therefore needs to know the exact schema the data was written with (the writer’s schema), otherwise the data is decoded incorrectly.
How it evolves
1. The writer’s schema and the reader’s schema don’t have to be the same, only compatible. Avro resolves the differences between them by translating data from the writer’s schema into the reader’s schema:
  1. fields are matched up by field name, so their order may differ
  2. a field that appears in the writer’s schema but not in the reader’s schema is ignored
  3. a field that appears in the reader’s schema but not in the writer’s schema is filled in with the default value declared in the reader’s schema
2. Schema evolution rules
  1. forward compatibility means that you can have a new version of the schema as writer and an old version of the schema as reader.
  2. backward compatibility means that you can have a new version of the schema as reader and an old version as writer.
  3. to keep both, only add or remove fields that have a default value
3. Evolution:
  1. renaming a field: the reader’s schema can declare aliases for field names, so a rename is backward compatible but not forward compatible (old code does not know the new field name)
4. Dynamically generated schemas: Avro schemas contain no tag numbers, so if the database schema changes, you can simply generate a new Avro schema from the updated database schema and export data in it. With Thrift or Protocol Buffers, the field tags would have to be assigned by hand: every time the database schema changes, an administrator has to update the mapping from column names to field tags manually
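
The resolution rules can be sketched as a small function. This is not the Avro library's API: schemas are reduced to field names and defaults, type conversion and aliases are left out, and `resolve`, `NO_DEFAULT` and the `photoURL` field are names invented for the example.

```python
NO_DEFAULT = object()   # marks a reader field that declares no default

def resolve(values, writer_fields, reader_fields):
    """Translate a decoded record from the writer's schema into the reader's schema.

    values        -- decoded values, in the order of the writer's fields
    writer_fields -- list of field names in the writer's schema
    reader_fields -- dict of field name -> default value (or NO_DEFAULT)
    """
    written = dict(zip(writer_fields, values))    # match fields up by name
    record = {}
    for name, default in reader_fields.items():
        if name in written:
            record[name] = written[name]
        elif default is not NO_DEFAULT:
            record[name] = default                # reader-only field: use its default
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return record                                 # writer-only fields are ignored

writer_fields = ["userName", "favoriteNumber", "interests"]
reader_fields = {"userName": NO_DEFAULT, "interests": NO_DEFAULT, "photoURL": None}
record = resolve(["Martin", 1337, ["hacking"]], writer_fields, reader_fields)
```

The reader's record drops `favoriteNumber`, which only the writer knows, and fills `photoURL` with its default. A reader field that has no default and is missing from the written data cannot be resolved, which is why added or removed fields need defaults.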

Merits of Schemas

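schema-based binary formats can be much more compact than the various “binary JSON” variants, since they omit field names from the encoded data
the schema is a valuable form of documentation that is guaranteed to be up to date
keeping a database of schemas lets you check forward and backward compatibility of schema changes before anything is deployed
for statically typed programming languages, code can be generated from the schema, which enables type checking at compile time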

Modes of Dataflow

II Distributed Data

Chapter 5 Replication

Why replicate data:
1. keep data geographically close to your users -> reduce latency
2. allow the system to continue working even if some of its parts have failed (for example because of a network interruption) -> increase availability
3. scale out the number of machines that can serve read queries -> increase read throughput
What makes replication difficult: handling changes to the replicated data

Single-leader

How it works
1. One replica is designated the leader; the other replicas are followers
2. Clients send every write to the leader. The leader first writes the new data to its local storage, then sends the data change to all of its followers as part of a replication log or change stream. Each follower takes the log and updates its local copy, applying all writes in the same order as the leader.
3. When a client reads, it can query either the leader or any of the followers

Synchronous VS Asynchronous Replication

Synchronous	Asynchronous
The follower is guaranteed to have an up-to-date copy of the data that is consistent with the leader	If the leader fails and is not recoverable, any writes not yet replicated to followers are lost
All writes block if a synchronous follower doesn’t respond	The leader can continue processing writes even if all of its followers have fallen behind

In practice, enabling synchronous replication usually means that one follower is synchronous and the others are asynchronous (semi-synchronous), which guarantees an up-to-date copy on at least two nodes. Often, replication is configured to be completely asynchronous.

Set up new followers
- The process:
  1. Take a consistent snapshot of the leader’s database at some point in time, if possible without taking a lock on the entire database
  2. Copy the snapshot to the new follower node
  3. The follower connects to the leader and requests all the data changes that have happened since the snapshot was taken (the snapshot must be associated with an exact position in the leader’s replication log)
  4. When the follower has processed the backlog of data changes since the snapshot, it has caught up
Recover a failed follower (catch-up recovery)
- The follower knows from its local log the last transaction it processed before the fault. It connects to the leader, requests all the data changes that occurred while it was disconnected, applies them, and thereby catches up with the leader
Handle leader failure (failover)
1. Determine that the leader has failed. Most systems use a timeout: a node that doesn’t respond for some period of time (say, 30 seconds) is assumed to be dead
2. Choose a new leader, either through an election or by appointment by a previously elected controller node. The best candidate is usually the replica with the most up-to-date data changes
3. Reconfigure the system to use the new leader:
  1. clients send their write requests to the new leader
  2. if the old leader comes back, the system needs to make it step down and become a follower
Failover is fraught with things that can go wrong
1. With asynchronous replication, the new leader may not have received all the writes from the old leader before it failed -> the old leader’s unreplicated writes are usually discarded -> this violates clients’ durability expectations
2. Two nodes may both believe that they are the leader (split brain)
3. Choosing the right timeout before the leader is declared dead is hard: too short -> unnecessary failovers; too long -> longer time to recover

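Implementation of Replication Logs
1. Statement-based (used in MySQL before version 5.1): the leader logs every write statement it executes (INSERT/UPDATE/DELETE) and sends it to its followers
  - Problems:
    1. nondeterministic functions such as NOW() or RAND() generate a different value on each replica
    2. statements that use an autoincrementing column, or that depend on existing data, must be executed in exactly the same order on each replica, or they have a different effect
    3. statements with side effects (triggers, stored procedures, user-defined functions) may have different side effects on each replica
2. Write-ahead log (WAL) shipping
  - Problem:
    the log describes the data at a very low level (which bytes were changed in which disk blocks), so replication is closely coupled to the storage engine. If the storage format changes between versions, it is typically not possible to run different versions of the database software on the leader and the followers.
  - How to upgrade?
    1. if the replication protocol tolerates a version mismatch: upgrade the followers first, then perform a failover to make one of the upgraded nodes the new leader (zero downtime)
    2. otherwise, as is often the case with WAL shipping: upgrade all nodes during a downtime
3. Logical (row-based) log
  - How it works:
    1. uses a log format that is decoupled from the storage engine internals
    2. contains the info for each row: For an inserted row, the log contains the new values of all columns. For a deleted row, the log contains enough information to uniquely identify the row that was deleted. For an updated row, the log contains enough information to uniquely identify the updated row, and the new values of all columns
  - Pros:
    1. since the log is decoupled from the storage engine, it can more easily be kept backward compatible, allowing leader and followers to run different software versions or even different storage engines
    2. the log is easier for external applications to parse (change data capture)
4. Trigger-based
  - A trigger lets you register custom application code that is automatically executed when a data change occurs in a database system. The trigger has the opportunity to log this change into a separate table, from which it can be read by an external process.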

Multi-leader

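Why multi-leader?
- Problems of single leader: 1) all writes go through one node, which limits write throughput 2) if you cannot connect to the leader, you cannot write. To solve this: allow more than one node to accept writes. Each leader that processes a write must forward that data change to all the other nodes.
- Pros:
  1. tolerance of datacenter outages
  2. tolerance of network problems between datacenters
  3. low latency: writes are processed in the local datacenter
- Cons:
  1. the same data may be modified concurrently in two places, so write conflicts must be resolved
  2. autoincrementing keys, triggers and integrity constraints can be problematic
Use Cases for Multi-Leader Replication
1. Multi-datacenter operation
2. Clients with offline operation: each device has a local database that acts as a leader
3. Collaborative editing
Handling Write Conflicts
1. Make the conflict detection synchronous: wait for the write to be replicated to all replicas before telling the user it succeeded (with asynchronous detection it is too late to ask the user to resolve the conflict). But this loses the main advantage of multi-leader replication: each leader accepting writes independently
2. Avoid conflicts by routing all writes for a particular record to the same leader
3. Resolve the conflict in a convergent way, so that all replicas arrive at the same final value:
  1. give each write a unique ID (e.g. a timestamp) and pick the write with the highest ID as the winner (with a timestamp this is last write wins) -> prone to data loss
  2. give each replica a unique ID and let writes from the replica with the highest ID win -> prone to data loss
  3. merge the values together somehow, e.g. order them alphabetically and concatenate them
  4. record the conflict in an explicit data structure that preserves all information, and write application code that resolves it later
4. Write custom conflict resolution logic in application code, executed on write or on read
Multi-Leader Replication Topologies
1. Topologies
  1. Circular topology
  2. Star topology
  3. All-to-all topology
  - In circular and star topologies, a single failed node can interrupt the flow of replication messages between other nodes.
2. Each leader sends its writes to other leaders. To avoid infinite replication loops, each write is tagged with the identifiers of all the nodes it has passed through, and a node ignores an incoming change that is tagged with its own identifier
Problems of Multi-Leader
1. In all-to-all topologies, writes may arrive at some replicas in the wrong order, e.g. an update arrives before the insert it depends on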

Leaderless (Dynamo-style databases)

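How it works
1. Write: the client sends the write to all n replicas in parallel and considers it successful once w of them have acknowledged it
2. Read: the client sends the read to several nodes in parallel and uses version numbers to decide which of the responses is newest
3. Sync data among nodes:
  1. read repair: when a read detects a stale response (by comparing version numbers), the client writes the newer value back to the stale replica
  2. anti-entropy: a background process constantly looks for differences in data between replicas and copies missing data from one to another (it does not copy writes in any particular order, and there may be a significant delay)
Quorums for reading and writing
1. How it works:
  - n replicas; every write must be confirmed by w nodes to be considered successful; every read must query at least r nodes.
  - As long as w + r > n, at least one of the r nodes read must have seen the latest write, so we expect to get an up-to-date value
  - In practice, n is usually odd (typically 3 or 5) and w = r = (n + 1) / 2
  - Or set w = n and small r for read-heavy workloads; small w and large r for write-heavy workloads
  - Or choose w + r <= n for lower latency and higher availability, at the cost of more likely reading stale values
2. Limitations/corner cases (stale values are possible even with w + r > n):
  1. with a sloppy quorum, the w writes may end up on different nodes than the r reads, so there is no guaranteed overlap
  2. concurrent writes: it is not clear which one happened first
  3. a read concurrent with a write: the write may be reflected on only some of the replicas, so it is undetermined whether the read returns the old or the new value
  4. a write that succeeded on some replicas but failed on others (fewer than w overall) is reported as failed, but is not rolled back on the replicas where it succeeded
  5. if a node carrying a new value fails and is restored from a replica carrying an old value, the number of replicas storing the new value may fall below w
  6. edge cases caused by unlucky timing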
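The quorum condition can be checked by brute force: enumerate every possible set of w nodes that acknowledged a write and every set of r nodes that a read contacts, and test whether they always share a node. The function names are made up for this sketch.

```python
from itertools import combinations

def always_overlap(n: int, w: int, r: int) -> bool:
    """True if every set of w write nodes shares a node with every set of r read nodes."""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )

def majority(n: int) -> int:
    """Smallest q such that q + q > n."""
    return n // 2 + 1
```

For n = 3, w = r = 2 always overlaps. For n = 4, w = r = 2 does not, because two disjoint pairs exist; a majority of 3 is needed there.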
Sloppy quorum and Hinted Handoff
- Sloppy quorum: writes and reads still require w and r successful responses, but those may include nodes that are not among the designated n “home” nodes for a value. (E.g. in a large cluster, a write for a particular record should go to node A, which is one of its home nodes, but node A cannot be reached at the moment. Node B is not a home node for the record, but it can accept the write and store the value temporarily)
- Hinted handoff: Once the network interruption is fixed, any writes that one node temporarily accepted on behalf of another node are sent to the appropriate “home” nodes
- Pros: increases write availability. Cons: there is no guarantee that a read of r nodes will see the latest value until the hinted handoff has completed.

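Detecting Concurrent Writes
- Simple solutions:
  1. Last write wins (LWW): attach a timestamp to each write, keep the write with the biggest timestamp and discard the others. It achieves eventual convergence at the cost of durability: concurrent writes that were reported as successful are silently dropped
  2. Keep all concurrent values (siblings) and have the application merge them
- Version number approach:
  1. Definition of concurrent: two writes are concurrent if neither is causally dependent on the other (neither client knew about the other’s write). The server keeps a version number per key and increments it on every write to that key; a write that is based on an older version number than the latest is concurrent with the writes made since
  2. How it works:
    1. client read: the server returns all values that have not been overwritten, along with the latest version number
    2. client write: the client sends the version number from its prior read, together with a new value that merges all the values it received in that read
    3. server handles write: it overwrites all values with that version number or below (they have been merged into the new value), and keeps all values with a higher version number (they are concurrent with the incoming write)
  3. Extending to version vectors: with multiple replicas, a single version number per key is not enough; keep a version number per replica as well as per key. The collection of version numbers from all replicas (the version vector) tells each replica which values to overwrite and which to keep as siblings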
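The single-replica version number scheme can be sketched for one key. The class name `VersionedKey`, the `based_on` parameter and the shopping-cart values are illustrative; merging siblings is left to the client.

```python
class VersionedKey:
    """Server-side state for one key on a single replica."""

    def __init__(self):
        self.version = 0
        self.siblings = []     # (version, value) pairs not yet overwritten

    def read(self):
        """Return the latest version number and all current sibling values."""
        return self.version, [value for _, value in self.siblings]

    def write(self, value, based_on=0):
        """based_on is the version number from the client's prior read (0 if none)."""
        self.version += 1
        # Values the client had already seen are overwritten;
        # values with a higher version are concurrent and are kept as siblings.
        self.siblings = [(v, val) for v, val in self.siblings if v > based_on]
        self.siblings.append((self.version, value))
        return self.version

cart = VersionedKey()
v1 = cart.write(["milk"])                      # client 1, no prior read
v2 = cart.write(["eggs"])                      # client 2, no prior read: concurrent
v3 = cart.write(["milk", "flour"], based_on=v1)  # client 1 only knows version 1
```

After these three writes the server holds two siblings, ["eggs"] and ["milk", "flour"]. A client that reads both, merges them and writes back with `based_on=3` collapses them into a single value.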
Problems of Leaderless
1. It is hard to monitor whether the databases are returning up-to-date results. (With a single leader, replication lag can be measured by subtracting a follower’s position in the replication log from the leader’s position; with leaderless replication there is no fixed order in which writes are applied, which makes lag hard to measure)

Problems with Replication Lag

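Problem: with asynchronous followers, the same query may return different results on the leader and on a follower that has fallen behind

Read-after-write consistency (users should always see the updates they submitted themselves), several solutions:
1. read anything the user may have modified from the leader
2. the client remembers the timestamp of its most recent write and sends it with each read; the replica serving the read must be up to date at least until that timestamp
Monotonic reads: after users have seen the data at one point in time, they shouldn’t later see the data from some earlier point in time. One solution is to make each user always read from the same replica.
Consistent prefix reads: users should see the data in a state that makes causal sense, i.e. writes appear in the order in which they happened. Violations are a particular problem in partitioned (sharded) databases, where partitions operate independently and there is no global ordering of writes. One solution is to write causally related data to the same partition.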

Chapter 6 Partitioning

Why partition? A very large dataset or a very high query throughput cannot be stored and processed on a single node
Partition strategies:
1. Partition randomly: distributes data evenly, but a read has to query all nodes because there is no way of knowing where a record is
2. Partition by key range: assign a continuous range of keys to each partition. Within each partition, keep keys in sorted order.
  - Pro: range scans are easy
  - Con: certain access patterns lead to hot spots, e.g. if the key is a timestamp, all writes for the current time period go to the same partition
3. Partition by hash of key: assign a range of hashes to each partition
  - Pro: distributes keys fairly; partition boundaries can be evenly spaced
  - Con: range queries are not efficient, because adjacent keys are scattered across all partitions
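The two main strategies can be sketched as lookup functions. The boundary keys, the use of MD5 and the function names are assumptions for illustration; for simplicity the hash version takes the hash modulo a fixed number of partitions rather than assigning hash ranges.

```python
import bisect
import hashlib

def range_partition(key: str, boundaries: list) -> int:
    """Partition i holds the keys from boundaries[i - 1] up to (excluding) boundaries[i]."""
    return bisect.bisect_right(boundaries, key)

def hash_partition(key: str, num_partitions: int) -> int:
    """Use a hash of the key to pick one of a fixed number of partitions."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

boundaries = ["g", "n", "t"]        # four partitions: a-f, g-m, n-s, t-z
year_boundaries = ["2023", "2024", "2025"]
```

With `year_boundaries`, the adjacent keys "2024-01-01" and "2024-01-02" land in the same range partition. That makes range scans cheap and also creates a hot spot when all writes carry the current date. Hashing breaks up the hot spot but scatters adjacent keys.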
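Rebalancing:
- Requirements of rebalancing:
  1. afterwards, the load (data storage, read and write requests) is shared fairly between the nodes
  2. no downtime: the database keeps accepting reads and writes while rebalancing is happening
  3. no more data than necessary is moved between nodes (this rules out hash mod N: when the number of nodes N changes, most keys have to be moved)
- Strategies:
  1. Fixed number of partitions: create many more partitions than nodes and move entire partitions between nodes
    - Pro: simple to implement and operate
    - Con: hard to choose a good number if the dataset size is highly variable. If partitions are too large, rebalancing and recovery from node failures become expensive; if they are too small, they incur too much overhead.
  2. Dynamic partitioning: split a partition that exceeds a configured size into two partitions (like splitting a page in a B-tree), and merge partitions that shrink
    - Pro: the number of partitions adapts to the total data volume
    - Con: an empty database starts off with a single partition, since there is no a priori information about where to draw the partition boundaries -> all writes hit a single node until the first split. Mitigation: pre-splitting
  3. Partition proportionally to nodes: keep a fixed number of partitions per node
    - Pro: data stays evenly distributed and the size of each partition stays fairly stable
    - Con: new partition boundaries are picked randomly, which requires hash-based partitioning
- Manual or automatic? It is good to have a human in the loop, since rebalancing is expensive and error-prone (e.g. a node that is slow because of rebalancing may be mistaken for dead)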
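How much data moves when a fourth node joins a three-node cluster can be compared for hash mod N and for a fixed number of partitions. To keep the numbers deterministic, the integers 0..1199 stand in for key hashes; the function names and the partition-stealing policy in `add_node` are illustrative assumptions.

```python
from collections import Counter

def moved_mod_n(hashes, old_nodes, new_nodes):
    """Keys whose node changes when the node is chosen as hash mod N."""
    return sum(1 for h in hashes if h % old_nodes != h % new_nodes)

def add_node(assignment, new_node):
    """assignment[p] is the node owning partition p; the new node steals a fair share."""
    assignment = list(assignment)
    fair_share = len(assignment) // (len(set(assignment)) + 1)
    for _ in range(fair_share):
        counts = Counter(assignment)
        donor = max(counts, key=counts.get)       # node that currently owns the most
        assignment[assignment.index(donor)] = new_node
    return assignment

def moved_fixed(hashes, old_assignment, new_assignment):
    """Keys whose node changes when whole partitions are reassigned."""
    p = len(old_assignment)
    return sum(1 for h in hashes if old_assignment[h % p] != new_assignment[h % p])

hashes = range(1200)
old = [p % 3 for p in range(12)]      # 12 partitions spread over nodes 0, 1, 2
new = add_node(old, new_node=3)
```

With hash mod N, 900 of the 1200 keys change nodes. With 12 fixed partitions, only the 3 partitions handed to the new node move, which is 300 keys, and the key-to-partition mapping never changes.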
Partitioning and Secondary Indexes
- document-based partitioning (local index): each partition maintains its own secondary index, covering only the documents in that partition. A write only touches the partition that holds the document. Problem: a read has to query all partitions and combine the results (scatter/gather), which is prone to tail latency amplification
- term-based partitioning (global index): build one index that covers data in all partitions, and partition the index itself by term
  - Pro: reads are more efficient: a client only needs to make a request to the partition containing the term that it wants
  - Con: writes are slower and more complicated, because a write to a single document may now affect multiple partitions of the index
Routing queries to the appropriate partition
1. Allow clients to contact any node. If that node happens to own the partition, it handles the request; otherwise, it forwards the request to the appropriate node
2. Send all requests from clients to a routing tier first, which forwards the request to the appropriate node
3. Require that clients be aware of the partitioning and the assignment of partitions to nodes
- ZooKeeper is often used as a coordination service for this, e.g. together with option 2:
  1. ZooKeeper keeps track of the cluster metadata (the mapping of partitions to nodes)
  2. each node registers itself in ZooKeeper
  3. ZooKeeper notifies the routing tier whenever the assignment changes

Author: hyangjudy

Link: https://hyangjudy.github.io/2020/12/19/ddia/
