Distributed systems part 1

I started Martin Kleppmann's course on distributed systems, and I already know it's going to take me ages to finish it. These are my notes for the first three lectures.

The good thing about this course is that it explains something very complex in an accessible way. The slides and the explanations are very clear, and it covers a lot of material. I am definitely learning a lot!

The bad thing for me personally is that the course is strictly theoretical. I would have liked it to expand on real-world examples of where to use one solution or another. It would also be nice to get a very basic hello-world working example, in any programming language, that you could run on your laptop and another computer, for example your friend's laptop or a Raspberry Pi, and see a distributed system in action.

I did end up building distributed systems of my own, to let the knowledge sink in. You can read here about the distributed ping-pong, or here about the distributed chat.

Reasons to go distributed

If you have two nodes, you have a distributed system (for example, two phones sending messages to each other).

  • Reliability: if one computer is down, you have another
  • Performance: nodes in different locations serve locally faster
  • Scaling: you can spread huge tasks across many computers (CERN grid)

Distributed computing is more complex though. Dealing with faults is what makes it fundamentally different, and often harder, compared to programming a single computer.

RPC: Remote Procedure Call. "Location transparency": the system hides where a resource is located.
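As a minimal illustration (my own example, not from the course), Python's standard library ships a toy RPC system in the xmlrpc module. The proxy object makes a remote call read exactly like a local one:

```python
# server.py: expose a local function so remote clients can call it.
from xmlrpc.server import SimpleXMLRPCServer

def add(a, b):
    return a + b

server = SimpleXMLRPCServer(("localhost", 8000))
server.register_function(add, "add")
server.serve_forever()
```

```python
# client.py: the call runs on the server, but it reads like a local call.
# That's location transparency.
import xmlrpc.client

proxy = xmlrpc.client.ServerProxy("http://localhost:8000")
print(proxy.add(2, 3))  # 5
```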

Failure

  • Two generals problem (messages are unreliable): we need backup plans (for example, if the payment fails, retry; if the goods were not sent, refund)
  • Byzantine generals problem (nodes are unreliable): to tolerate f faulty nodes we need 3f + 1 nodes in total, i.e. fewer than a third of the nodes may be faulty (for example, 4 nodes can tolerate 1 Byzantine node). Cryptography can make the problem a bit easier. Real distributed systems do often involve complex trust relationships.

System model

  • Network: can be reliable, fair-loss or arbitrary. An arbitrary network turns into a fair-loss one with TLS, and a fair-loss network turns into a reliable one with retry + de-duplication (TCP plus some other mechanism); see the sketch after this list.
  • Nodes: fail-stop, fail-recovery or fail-arbitrary (Byzantine). Unlike the network assumptions, a weaker node model can't be converted into a stronger one, and the algorithms that tolerate one kind of failure are very different from the others.
  • Network and nodes (timing): synchronous, partially synchronous, and asynchronous (the most robust).
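As a toy illustration of the fair-loss to reliable step (my own sketch, not from the course; the message format and function names are made up), the sender retries until it gets an acknowledgement and the receiver de-duplicates by message id:

```python
import uuid

class ReliableReceiver:
    def __init__(self):
        self.seen = set()  # ids of messages already delivered

    def on_receive(self, msg_id, payload):
        # De-duplicate: retries may bring the same message several times.
        if msg_id not in self.seen:
            self.seen.add(msg_id)
            self.deliver(payload)  # hand the message to the application
        return "ack"  # always re-ack, so the sender stops retrying

    def deliver(self, payload):
        print("delivered:", payload)

def send_reliably(send_once, payload, max_retries=100):
    # Retry over a fair-loss link until the receiver acknowledges.
    msg_id = str(uuid.uuid4())  # unique id so duplicates can be detected
    for _ in range(max_retries):
        if send_once(msg_id, payload) == "ack":  # None if the packet was lost
            return
    raise TimeoutError("no ack received")
```

This is essentially what TCP already does for you within a single connection: sequence numbers instead of random ids, retransmissions, and acknowledgements.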

Availability

[Slide by Martin Kleppmann: availability is the percentage of time a service is working.]

  • SLO (service-level objective): percentage of requests that need to return a correct response within a specified timeout, as measured by a certain client over a certain period of time.
  • SLA (service-level agreement): contract that specifies some SLO, as well as the consequences if the SLO is not met.

In a partially synchronous system, a perfect failure detector does not exist.
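In practice we settle for timeouts. Here's a minimal sketch of a timeout-based failure detector (my own, with made-up constants; it assumes every node sends a heartbeat roughly once per second):

```python
import time

HEARTBEAT_INTERVAL = 1.0          # assumed heartbeat period
TIMEOUT = 3 * HEARTBEAT_INTERVAL  # suspect a node after 3 missed heartbeats

class FailureDetector:
    def __init__(self):
        self.last_heartbeat = {}  # node id -> monotonic timestamp

    def on_heartbeat(self, node_id):
        # Record when we last heard from this node.
        self.last_heartbeat[node_id] = time.monotonic()

    def suspects(self, node_id):
        # Suspect the node if we haven't heard from it within the timeout.
        # This can be wrong: a slow network looks just like a crashed node,
        # which is exactly why the detector can't be perfect.
        last = self.last_heartbeat.get(node_id)
        return last is None or time.monotonic() - last > TIMEOUT
```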

Clocks

  • Physical clocks: measure elapsed seconds (quartz clocks, caesium atomic clocks, on which TAI is based, GPS clocks, astronomical time)
  • Logical clocks: measure events

Physical clocks

UTC (Coordinated Universal Time) is based on atomic time, with corrections from astronomical time (a day is not exactly 24 * 3600 seconds). Time zones and daylight saving time are offsets from UTC.

Leap seconds may be added or subtracted at the end of June or December, when needed. They make measurements of elapsed time incorrect if a second is inserted or removed while measuring. Unix timestamps (seconds since the epoch, as defined by POSIX) ignore leap seconds. ISO 8601 uses the format datetime plus offset: YYYY-MM-DDThh:mm:ss+00:00

Smearing the leap second: rather than inserting or removing a second between 23:59:59 and 00:00:00, the extra second is spread out by running the clock slightly slower or faster for a while.
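Google's public NTP servers, for example, smear the leap second linearly over a 24-hour window. A toy version of the idea (my own sketch; the function and window size are made up for illustration):

```python
SMEAR_WINDOW = 24 * 3600  # spread the extra second over 24 hours

def smeared_time(t, smear_start):
    """Absorb one inserted leap second gradually instead of as a jump.

    t and smear_start are Unix timestamps. During the window the clock
    runs slightly slow, ending up exactly one second behind at the end.
    """
    if t < smear_start:
        return t
    progress = min((t - smear_start) / SMEAR_WINDOW, 1.0)
    return t - progress
```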

Monotonic time: a clock that starts at an arbitrary point (typically when the computer was turned on) and always moves forward. Monotonic timestamps are not comparable across nodes, so they can't be used as absolute time. But precisely because they are immune to clock adjustments, monotonic clocks are what you should use to measure elapsed time.
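In Python, for example, that's the difference between time.time() and time.monotonic():

```python
import time

start_wall = time.time()       # wall-clock time: can jump if NTP steps it
start_mono = time.monotonic()  # monotonic time: only ever moves forward

time.sleep(1.5)                # stand-in for some actual work

# Correct elapsed time, even if NTP stepped the clock in the meantime:
print("elapsed:", time.monotonic() - start_mono)
# This one could be wrong (even negative!) if the clock was stepped:
print("wall-clock delta:", time.time() - start_wall)
```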

Computers use quartz clocks, which get out of phase, but we can sync with a server. Protocols: NTP (Network Time Protocol) and PTP (Precision Time Protocol).
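The core of NTP's skew estimate is four timestamps: the client sends a request at t1 (client clock), the server receives it at t2 and replies at t3 (server clock), and the client receives the response at t4 (client clock). Assuming the network delay is roughly symmetric in both directions (the function is my own wrapper around the standard formula):

```python
def estimate_skew(t1, t2, t3, t4):
    # Round-trip time spent on the network (excludes server processing).
    round_trip = (t4 - t1) - (t3 - t2)
    # Estimated offset of the server clock relative to the client clock,
    # assuming the request and the response took equally long to travel.
    theta = ((t2 - t1) + (t3 - t4)) / 2
    return theta, round_trip
```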

  • If the skew is small, the client gradually speeds up or slows down its clock (slewing).
  • If slewing would take too long, the client sets the clock to the server's time in one jump (stepping).
  • If the skew is huge, panic mode triggers and the clock is not adjusted.

Just because a node is running NTP doesn't mean its clock is correct: it could be stuck in the panic state, for example.

Logical clocks

Logical clocks are used for ordering events in distributed systems. We could send a physical timestamp with each message, but NTP clock adjustments may still put them in the wrong order. So instead, we work with event ids.

  • happens-before relation: each node has only a single thread of execution, so for any two execution steps on one node, it is clear which one happened first. We can include the node id and a sequence number to make event ids unique. This is a partial order, as events can be concurrent (they did not necessarily happen at the same time, but they are independent). It's a way of reasoning about causality.

Lamport clocks: provide a total order (ties between equal timestamps are broken by node id). Each node has its own counter t, set to zero at start. Whenever an event happens on the node, it increments t by one; when sending a message, it attaches the incremented t. When a node receives (t', m), it sets its own t = max(t, t') + 1 and delivers the message to the application.
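A minimal sketch of that algorithm (my own code, not from the course slides):

```python
class LamportClock:
    def __init__(self):
        self.t = 0

    def local_event(self):
        # Any event on this node increments the counter.
        self.t += 1
        return self.t

    def send(self, message):
        # Sending is an event too; attach the timestamp to the message.
        self.t += 1
        return (self.t, message)

    def receive(self, t_prime, message):
        # Merge the sender's timestamp, then deliver to the application.
        self.t = max(self.t, t_prime) + 1
        return message
```

To turn this into a total order, compare (timestamp, node id) pairs, so that two events with the same timestamp are still ordered.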

Vector clocks: provide a partial order that matches the happens-before relation exactly. They let us determine whether two events are concurrent or one happened before the other; Lamport clocks can't tell you this. Similar to Lamport clocks, but every node keeps a vector with one entry per node, and on receive the maximum is taken element-wise. From any vector we can reconstruct the set of events that happened in its causal past.
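A sketch with the element-wise max and the comparison that Lamport clocks can't do (my own code; node_index identifies this node's slot in the vector):

```python
class VectorClock:
    def __init__(self, node_index, num_nodes):
        self.i = node_index
        self.v = [0] * num_nodes

    def local_event(self):
        self.v[self.i] += 1

    def send(self, message):
        self.v[self.i] += 1
        return (list(self.v), message)  # attach a copy of the vector

    def receive(self, v_prime, message):
        # Element-wise maximum, then count the receive as our own event.
        self.v = [max(a, b) for a, b in zip(self.v, v_prime)]
        self.v[self.i] += 1
        return message

def happened_before(v, w):
    # v -> w iff v <= w element-wise and v != w
    return all(a <= b for a, b in zip(v, w)) and v != w

def concurrent(v, w):
    # Neither happened before the other: the events are independent.
    return not happened_before(v, w) and not happened_before(w, v)
```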

Broadcast protocols

Also known as multicast protocols (point-to-point is unicast). Broadcast protocols are built on top of point-to-point links, and every node should receive and deliver the messages from every node. Note the distinction between receiving a message from the network and delivering it to the application: delivery may be delayed (held back) if messages need to be delivered in order.

Types of ordering:

  • FIFO: messages sent by one node must be delivered in the order they were sent. For messages sent by different nodes, the order doesn't matter.
  • Causal: causally related messages must be delivered in causal order. Concurrent messages, in any order.
  • Total: all nodes agree on the order in which messages are delivered; every node delivers in the same order.

Broadcast algorithms

We need to ensure that the messages are delivered by every node, and that they do so in the right order.

Delivered to every node:

  • Eager reliable broadcast: the first time a node receives a message, it re-sends it to all other nodes. This is expensive: O(n²) messages per broadcast, where n is the number of nodes. Reliable links (retry + de-duplicate) alone may not be enough, because a node may crash before it has relayed the message to everyone.
  • Gossip protocols: the first time a node receives a message, it forwards it to a small number of randomly chosen nodes (see the sketch below). With the right parameters, gossip is very resilient to message loss and node crashes while also remaining efficient.
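The core of a gossip protocol fits in a few lines. A sketch (my own; "sending" here is just a direct method call, a real implementation would go over the network), with the fanout as the tunable parameter:

```python
import random

FANOUT = 3  # how many random peers to forward each message to

class GossipNode:
    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.peers = peers  # list of other GossipNode instances
        self.seen = set()   # message ids we have already handled

    def on_receive(self, msg_id, payload):
        if msg_id in self.seen:
            return          # already handled: the epidemic stops here
        self.seen.add(msg_id)
        self.deliver(payload)
        # Forward to a few random peers instead of all of them.
        for peer in random.sample(self.peers, min(FANOUT, len(self.peers))):
            peer.on_receive(msg_id, payload)

    def deliver(self, payload):
        print(f"node {self.node_id} delivered: {payload}")
```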

Delivered in order:

  • FIFO broadcast: we attach a per-sender sequence number to every broadcast message, and hold back messages that arrive out of order (see the sketch after this list).
  • Causal broadcast: we attach a vector of integers to every broadcast message. This is sometimes called a "vector clock algorithm", but the vector elements count the number of messages from each sender that have been delivered, rather than the number of events.
  • Total order broadcast: or "FIFO-total order" broadcast. Trickier, and the simple approaches are not fault-tolerant: "single leader" (a single point of failure) or "Lamport timestamps" (leaderless, but the crash of one node stops all the others from delivering).
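Here's a sketch of FIFO broadcast with a hold-back buffer (my own code; send_to_all stands in for whatever reliable dissemination mechanism runs underneath):

```python
class FifoBroadcastNode:
    def __init__(self, node_id, num_nodes):
        self.node_id = node_id
        self.send_seq = 0                     # sequence number for my messages
        self.next_expected = [0] * num_nodes  # per-sender delivery cursor
        self.buffer = {}                      # (sender, seq) -> held-back message

    def broadcast(self, send_to_all, message):
        # Attach (sender, sequence number) before handing to the network.
        send_to_all(self.node_id, self.send_seq, message)
        self.send_seq += 1

    def on_receive(self, sender, seq, message):
        self.buffer[(sender, seq)] = message
        # Deliver every buffered message from this sender that is now in order.
        while (sender, self.next_expected[sender]) in self.buffer:
            msg = self.buffer.pop((sender, self.next_expected[sender]))
            self.deliver(sender, msg)
            self.next_expected[sender] += 1

    def deliver(self, sender, message):
        print(f"delivered from node {sender}: {message}")
```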

Applied example: multiplayer network game

Client/server model: each player/client sends a list of actions to the server, the server receives the actions from all clients and sends them back to all clients, and the clients update their state.

Problem: network lag of 200+ ms. Playing on a LAN or with few players helps.

Solution:

  • Speculation on the client: the client runs a local copy of the game logic that speculates what the next game state will be and shows that to the player, then corrects it with what comes from the server. This gives weak consistency: each client's game is slightly different from the server's game, and from the other clients' games too.
  • Lag compensation on the server: the server speculates too, reconstructing the state based on a guess of where the players were at time t - lag.

Another problem: ghost players ("I was shot through a wall!"), because the server's perspective was different from the client's perspective.

Solution: make the clients also send their actions directly to the other clients. Clients then have half the lag between them (messages don't have to go through the server), but this needs more bandwidth. Also, some players could cheat by sending different moves to different clients.

Now we need to reconcile the client state with the server state. When we receive a keystroke, we need to know whether it is fresh and when it happened in time. To check freshness, we could include all keystrokes from all players in every server snapshot, and each client checks against the previous snapshots to find the new ones; but this makes snapshots big and vulnerable to packet loss. We could tag keystrokes with physical time, but time sync is an illusion and not accurate enough for games. It's better to use a logical clock: we give each keystroke an id.
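A sketch of those keystroke ids (my own code; it assumes per-client FIFO channels such as TCP, so sequence numbers from one client arrive in order):

```python
import itertools

class GameClient:
    def __init__(self, client_id):
        self.client_id = client_id
        self.counter = itertools.count()  # logical clock: just counts keystrokes

    def keystroke(self, key):
        # Tag each keystroke with (client id, sequence number).
        return (self.client_id, next(self.counter), key)

class GameServer:
    def __init__(self):
        self.applied = {}  # client id -> highest sequence number applied

    def on_keystroke(self, client_id, seq, key):
        # A keystroke is fresh iff its sequence number is new for that client.
        if seq <= self.applied.get(client_id, -1):
            return  # duplicate or stale: ignore
        self.applied[client_id] = seq
        print(f"apply {key} from client {client_id}")
```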

Interesting reads on synchronizing several players through the network:
