December 2, 2011

Paxos/Multi-paxos Algorithm

A couple of posts ago, I talked about the distributed systems programming class I was taking here in my last semester at Berkeley. Our projects are really cool - we've done everything from a quorum KVS to a distributed lock manager, and for our final project, my group chose to implement Multi-paxos. I'll start by explaining general Paxos first.

Basically, Paxos is an protocol (or technically, a "family" of protocols, says Wikipedia) that solves the problem of getting a group of nodes to agree, or reach consensus, on a single value.

Why is this important?

If there's a stable master, there's hardly any use for this, because the master can just decide and propose values itself and have the other nodes always follow it. However, what if leader election fails (and it sometimes does), and there are two nodes proposing values?

The Paxos algorithm ensures that only one value is chosen out of all of the proposed values. Where this algorithm comes in handy is when you'd like your distributed system to decide on a sequence of values, or in other words, a consistent ordering over a set of values. As Ayende Rahien describes, if these values are events, then you know that all of the machines in the cluster have the same state or subset of the state.


The Paxos algorithm is divided into two phases, the prepare phase, and the accept phase.

Disclaimer: the text in this section is partially paraphrased from Leslie Lamport's excellent Paxos Made Simple paper.

Prepare Phase

In the prepare phase, the proposer selects a proposal number n (which is greater than any n it has previously sent, and distinct from any n that any other proposers' n values, I'll discuss this more later) and sends a PREPARE request to a majority of the acceptors.

On the acceptor side, if it ever receives a PREPARE request with a proposal number n greater than that of any other proposal number n that it has previously received, it responds with a "promise" to not accept any proposals with a lower numbered n in the future and the value of the highest-numbered proposal that it has already accepted (if any).

Starting to sound confusing? I had to read this over several times before I actually understood it, so maybe the pseudocode below will help.

# Proposer
for acceptor in acceptors
  send :prepare_req, next_n()

# Acceptor
if (req.n > highest_proposed_n)
  highest_proposed_n = req.n
  reply :prepare_resp, {
    :n => highest_acc.n,
    :value => highest_acc.value

Accept Phase

After a proposer has received a response to its PREPARE requests from a majority of the acceptors, it then sends an ACCEPT to those acceptors with a value v, where v is the value of the highest numbered proposal among the responses, or any value if the responses reported no other proposals.

By contrast, when an acceptor receives an ACCEPT request, it always accepts it unless it has already promised not to in the prepare phase.

# Proposer
for acceptor in acceptors
  send :accept_req, responses.argmax { |i| i.n }.value }

# Acceptor
if (req.n >= highest_proposed_n)
  highest_acc = {:n => req.n, :value => req.value}
  reply :accept_resp


All this talk about Paxos. How does Multi-paxos fit in?

I mentioned in the introduction that one of the main useful applications of the Paxos application is having the group of participants decide on a sequence of numbers. Since one round of Paxos results in a decision of one value, the naive way to go about finding a sequence of numbers would be to run Paxos many times, right?

One optimization that can be made in this case, assuming a single stable leader, is to skip the prepare phase. If we assume that the leadership will remain "sticky", there is no need to continue sending out proposal numbers - the first proposal number sent out will never be "overridden" since there is only one leader.

Thus, we only need to do the prepare phase once. In subsequent rounds of Paxos, we can just send the ACCEPT messages, with n as the proposal number used in the original PREPARE request and an additional parameter that indicates the sequence number (the current round we're in). We don't have to worry about the worst case where leadership isn't stable (or distinct), because the algorithm will degrade gracefully into the general Paxos algorithm (both prepare and accept phases for each round). Cool!

Other Considerations

Persistent Storage

Machines must keep certain things in persistent storage to be able to recover from failures. In particular, acceptors need to be able to remember which PREPARE requests they have promised to follow, and which ACCEPT requests they have responded to in order to make decisions about which proposals to promise to and to pass necessary information back to the proposes.

Distinct Proposal Numbers n

In order for the algorithm to work, each proposer must propose monotonically increasing proposal numbers n that are distinct from any other proposers' numbers n.

To achieve this, we can assign disjoint sets of numbers for each proposer and have them only choose numbers from their own set (ex. assign each node a unique prime number, they choose multiples of that prime number).

Or, if we're assuming static membership of participants, assign each node a unique number i between 0 and k, where k is the total number of participants, and n = i + (k * round_number).


tl;dr Paxos is for getting a group of machines to agree on a value (or more usefully, provide a consistent ordering on a sequence of values). Multi-paxos is an optimization on using Paxos for consecutive rounds, where you can skip one of the phases if you assume a stable leader.

Hope this post was interesting and/or informative! Similar to the eventual consistency post, I don't claim to be an expert in this area, so I'd love to hear any comments or corrections.