Get Started. It's Free
or sign up with your email address
Cassandra by Mind Map: Cassandra

1. DHT

1.1. coordinator knows the key placement

1.2. No need for finger tables

2. Data Placement Strategies

2.1. Simple Strategy

2.1.1. Random Partitioner chord-like hash based partitioning

2.1.2. Byte Ordered Partitioner easier for range queries no hashing of keys involved

2.2. Network Topology Strategy

2.2.1. Rack fault tolerance

2.2.2. Ensure fully replicated

3. Snitches

3.1. Simple Snitch

3.1.1. Unaware of Topology (Rack)

3.2. Rack Inferring

3.2.1. Assume topology of network by octet of server ip

3.2.2. x.<DC>.<rack>.<node>

3.3. Property File Snitch

3.3.1. mapping in the configuration file

3.4. EC2 Snitch

3.4.1. EC2 region = DC

3.4.2. AZ = rack

4. Writes

4.1. Client sends write to one coordinator node in the cluster

4.1.1. per-key

4.1.2. per-query

4.1.3. per-client

4.2. coordinator uses partitioner to send query to all replica nodes responsible for the key

4.3. Hinted Handoff

4.3.1. makes sure write request is always available even if replicas are down

4.3.2. if one replica is down coordinator writes to all other replicas coordinator keeps write locally until the downed replica comes back

4.3.3. if all replicas are down coordinator buffers writes for a few hours

4.4. Per-DC coordinator

4.4.1. different from the query coordinator

4.4.2. elected to coordinate with other DCs

4.4.3. ZooKeeper (Paxos)

4.5. Writes At A Replica Node

4.5.1. 1. Log it in disk commit log (for failure recovery)

4.5.2. 2. Make changes to appropriate memtables Memtable in-memory representation of multiple key-value pairs cache that can be searched by key write-back cache as opposed to write-through maintains the latest KV for the server

4.5.3. 3. Later, when memtable is full or old, flush to disk Data File: An SSTable SSTable (sorted string) Index file: An SSTable of (key, position in data sstable) pairs Bloom filter (for efficient searches) Bloom Filter

4.6. Compaction

4.6.1. merge SSTables over time

4.6.2. Run periodically and locally on each server

4.7. Deletes

4.7.1. Add a Tombstone to the log

4.7.2. Eventually gets compacted

5. Reads

5.1. Send read to replicas that have responded quickest in the past

5.2. When X replicas respond, returns the latest timestamped value from among those X

5.3. read repair

5.3.1. checks consistency in the background

5.3.2. if two values are different, issue read repair

5.3.3. eventually brings all replicas up to date (eventual consistency)

5.4. if compaction isn't run frequently enough

5.4.1. A row may be split across multple SSTable

5.4.2. read needs to touch mutli SSTable

5.4.3. reads become slower than writes

6. Membership

6.1. Any server could be the coordinator

6.2. every server needs to maintain a list of all other servers

6.3. gossip style

6.4. Suspicion

6.4.1. adaptively set timeout based on underlying network and failure behaviour

6.4.2. accural detector failure detector outputs a value (Phi) representing suspicion

6.4.3. Apps set an appropriate threshold

6.4.4. PHI calculation for a member Inter-arrival times for gossip messages PHI(t) = -log(CDF or Probability(t_now-t_last))/log10 Determines the detection timeout, but takes into account historical inter-arrival time variation for gossiped heartbeats

6.4.5. In practice, PHI=5=>10-15 sec detection time