January 3, 2012

2012 CS Resolutions

Matt Might's "What every computer science major should know" article is fantastic. I regret not reading it at the beginning of the holiday break, but I guess I can use this as a sort of "2012 CS Resolutions" list since everyone seems to be making New Year's resolutions right about now.

  1. Master LaTeX

    In school, I never really used LaTeX for typing up assignments until last semester, for our project design docs. I actually don't know how feasible this will be now that I'm in the "real world", but I'll take any chance I can get to write up documents in pretty LaTeX!

  2. Be better at UNIX/Sysadmin-ing

    Having traditionally been a frontend person, I don't have a crazy amount of experience working with the command line or doing sysadmin-y things. Sure, I know how to do some elementary things and I'm a vim fanatic (by the way, vim > emacs), but I'd like to be able to count myself as an expert at writing shell scripts or configuring a web server.

  3. Learn a couple of new languages

    Not for the sake of "adding new languages to my resume", which is absolutely useless, but for exploring and mastering new programming paradigms. A few that have sparked my interest and that I've been meaning to pick up: Scala, Haskell, Erlang.

  4. "Understand a computer from the transistors up"

    Being an EECS major has exposed me to all of the "levels" of a computer, but I never put much effort into piecing them together, and much of my EE knowledge is long forgotten.

  5. Get some hands-on experience with Operating Systems

    The semester I took it, Berkeley dramatically changed their traditional Operating Systems class into a hybrid (read: less in-depth/useful) systems class that covered minimal multi-programming and memory/scheduler topics but didn't actually give us any practical experience with how operating systems work. In particular, I'd love to learn more about file systems and possibly tinker around with the Linux kernel.

  6. Catch up on reading systems papers

    I have a giant list of systems papers (courtesy of my friend Sunil and the graduate cloud computing course at Berkeley) that I only semi-went through during winter break. In addition, Stripe has super cute academic paper reading lunches, which I'm really excited about! I'm sure this will be an easy resolution to keep up with. (:

After reading the article, I'm not super satisfied with what I learned in my 3.5 years as a Berkeley EECS major. I'm looking forward to filling in the gaps now that I've graduated!

December 2, 2011

Paxos/Multi-paxos Algorithm

A couple of posts ago, I talked about the distributed systems programming class I was taking here in my last semester at Berkeley. Our projects are really cool - we've done everything from a quorum KVS to a distributed lock manager, and for our final project, my group chose to implement Multi-paxos. I'll start by explaining general Paxos first.

Basically, Paxos is a protocol (or technically, a "family" of protocols, says Wikipedia) that solves the problem of getting a group of nodes to agree, or reach consensus, on a single value.

Why is this important?

If there's a stable master, there's hardly any use for this, because the master can just decide and propose values itself and have the other nodes always follow it. However, what if leader election fails (and it sometimes does), and there are two nodes proposing values?

The Paxos algorithm ensures that only one value is chosen out of all of the proposed values. Where this algorithm really comes in handy is when you'd like your distributed system to decide on a sequence of values, or in other words, a consistent ordering over a set of values. As Ayende Rahien describes, if these values are events, then you know that all of the machines in the cluster have the same state, or some subset of it.

Algorithm

The Paxos algorithm is divided into two phases: the prepare phase and the accept phase.

Disclaimer: the text in this section is partially paraphrased from Leslie Lamport's excellent Paxos Made Simple paper.

Prepare Phase

In the prepare phase, the proposer selects a proposal number n (which is greater than any n it has previously sent, and distinct from any other proposer's n values; I'll discuss this more later) and sends a PREPARE request to a majority of the acceptors.

On the acceptor side, if it receives a PREPARE request with a proposal number n greater than that of any proposal it has previously received, it responds with a "promise" not to accept any lower-numbered proposals in the future, along with the value of the highest-numbered proposal it has already accepted (if any).

Starting to sound confusing? I had to read this over several times before I actually understood it, so maybe the pseudocode below will help.

# Proposer
n = next_n()  # strictly higher than any n this proposer has used before
for acceptor in acceptors
  send :prepare_req, {:n => n}

# Acceptor
if (req.n > highest_proposed_n)
  highest_proposed_n = req.n  # promise: never accept anything numbered below req.n
  # highest_acc is the highest-numbered proposal accepted so far (empty if none yet)
  reply :prepare_resp, {
    :n => highest_acc.n,
    :value => highest_acc.value
  }

Accept Phase

After a proposer has received responses to its PREPARE request from a majority of the acceptors, it then sends an ACCEPT request (with the same proposal number n) to those acceptors with a value v, where v is the value of the highest-numbered proposal among the responses, or any value of its choosing if the responses reported no other proposals.

On the acceptor side, when it receives an ACCEPT request, it always accepts it unless it has already promised not to in the prepare phase (i.e. it has since promised a higher-numbered proposal).

# Proposer
# v = value of the highest-numbered proposal reported in the responses,
# or our own value if no acceptor reported a previously accepted proposal
v = responses.max_by { |r| r.n }.value
for acceptor in acceptors
  send :accept_req, {:n => n, :value => v}

# Acceptor
if (req.n >= highest_proposed_n)  # we haven't promised a higher-numbered proposal
  highest_acc = {:n => req.n, :value => req.value}
  reply :accept_resp

Multi-paxos

All this talk about Paxos. How does Multi-paxos fit in?

I mentioned in the introduction that one of the main useful applications of Paxos is having a group of participants decide on a sequence of values. Since one round of Paxos results in a decision on a single value, the naive way to go about deciding on a sequence of values would be to run Paxos many times, right?

One optimization that can be made in this case, assuming a single stable leader, is to skip the prepare phase. If we assume that the leadership will remain "sticky", there is no need to continue sending out proposal numbers - the first proposal number sent out will never be "overridden" since there is only one leader.

Thus, we only need to do the prepare phase once. In subsequent rounds of Paxos, we can just send the ACCEPT messages, with n as the proposal number used in the original PREPARE request and an additional parameter that indicates the sequence number (the current round we're in). We don't have to worry about the worst case where leadership isn't stable (or there's more than one self-proclaimed leader), because the algorithm will degrade gracefully into the general Paxos algorithm (both prepare and accept phases for each round). Cool!
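To make that a bit more concrete, here's a rough sketch of what the stable leader's fast path could look like. This is Javascript rather than the pseudocode above, and the message format, the send() helper, and the acceptor names are all made up for illustration:

// Hypothetical sketch: a stable leader reuses the proposal number n from its
// one PREPARE round and only sends ACCEPT messages, one per round ("slot").
var leader = {
  n: 42,                      // proposal number from the original PREPARE
  seq: 0,                     // sequence number: which round we're deciding
  acceptors: ["A", "B", "C"]  // made-up acceptor names
};

// send() stands in for whatever RPC/messaging layer you actually have.
function send(acceptor, message) {
  console.log("to " + acceptor + ": " + JSON.stringify(message));
}

function proposeNextValue(value) {
  leader.acceptors.forEach(function (acceptor) {
    send(acceptor, {
      type: "accept_req",
      n: leader.n,     // same n as the original PREPARE
      seq: leader.seq, // the extra parameter indicating the current round
      value: value
    });
  });
  leader.seq += 1;
}

proposeNextValue("set x = 1");
proposeNextValue("set x = 2");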

Other Considerations

Persistent Storage

Machines must keep certain things in persistent storage to be able to recover from failures. In particular, acceptors need to remember which PREPARE requests they have promised to follow and which ACCEPT requests they have responded to, so that after a restart they can still decide which proposals to promise and pass the necessary information back to the proposers.
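As a rough illustration (all of the names here are made up, and writeToDisk() stands in for whatever durable storage you actually use), the state an acceptor has to keep across restarts is pretty small:

// Hypothetical sketch of acceptor state that must survive a crash.
var acceptorState = {
  highestPromisedN: 0,   // highest n we've promised in a PREPARE response
  highestAccepted: null  // {n: ..., value: ...} of the last proposal we accepted
};

function writeToDisk(state) {
  // e.g. fsync a small record somewhere; omitted here
}

function onPrepare(req) {
  if (req.n > acceptorState.highestPromisedN) {
    acceptorState.highestPromisedN = req.n;
    writeToDisk(acceptorState);  // persist the promise *before* replying
    return { type: "prepare_resp", accepted: acceptorState.highestAccepted };
  }
}

function onAccept(req) {
  if (req.n >= acceptorState.highestPromisedN) {
    acceptorState.highestAccepted = { n: req.n, value: req.value };
    writeToDisk(acceptorState);  // again, persist before replying
    return { type: "accept_resp" };
  }
}

console.log(onPrepare({ n: 7 })); // { type: 'prepare_resp', accepted: null }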

Distinct Proposal Numbers n

In order for the algorithm to work, each proposer must propose monotonically increasing proposal numbers n that are distinct from any other proposers' numbers n.

To achieve this, we can assign disjoint sets of numbers to each proposer and have them only choose numbers from their own set (e.g. assign each node a distinct prime number and have it use only powers of that prime).

Or, if we're assuming static membership of participants, assign each node a unique number i between 0 and k - 1, where k is the total number of participants, and have it use n = i + (k * round_number).
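For example, here's the second scheme written out as a tiny helper (the function names are made up, and starting round numbers at 0 is an arbitrary choice):

// n = i + (k * round_number): unique across proposers, increasing per proposer.
function makeProposalNumberGenerator(nodeId, totalNodes) {
  var round = 0;
  return function nextN() {
    var n = nodeId + (totalNodes * round);
    round += 1;
    return n;
  };
}

// With k = 3 nodes, node 0 generates 0, 3, 6, ..., node 1 generates 1, 4, 7, ...
var nextN = makeProposalNumberGenerator(1, 3);
console.log(nextN(), nextN(), nextN()); // 1 4 7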

Conclusion

tl;dr Paxos is for getting a group of machines to agree on a value (or more usefully, provide a consistent ordering on a sequence of values). Multi-paxos is an optimization on using Paxos for consecutive rounds, where you can skip one of the phases if you assume a stable leader.

Hope this post was interesting and/or informative! Similar to the eventual consistency post, I don't claim to be an expert in this area, so I'd love to hear any comments or corrections.

October 29, 2011

Using Javascript Timeouts for Performance

Wait, what? Doesn't setting a timeout in Javascript delay a piece of code from being run for X amount of time? How could that possibly help with performance?

Imagine writing a Javascript app that is doing some large, computationally expensive task like adding five thousand elements to the DOM:

var list = document.getElementsByTagName("ul")[0];
for (var i = 0; i < 5000; i++) {
  var newElement = document.createElement("li");
  list.appendChild(newElement);
}

Since Javascript is single-threaded, executing Javascript will suspend any user interface interaction and/or updates to the page - some browsers like Chrome will even try to kill the script if it's been running for too long (have you ever seen that sad face "Oops" page?).

With the use of timers, we can split up the code into different segments (ex. 10 loops doing 500 DOM additions each):

var curr = 0;
var max = 5000;
var step = 500; // elements added per timer tick
var list = document.getElementsByTagName("ul")[0];

setTimeout(function addChunk() {
  for (var end = curr + step; curr < end; curr++) {
    var newElement = document.createElement("li");
    list.appendChild(newElement);
  }
  if (curr < max) { setTimeout(addChunk, 0); }
}, 0);

Setting the timer to 0 tells the browser to fire off the timers as quickly as possible (in most browsers, the minimum timer delay is around 10-15ms). However, since the code is broken up into smaller pieces, the browser gets a chance to do things like render updates to the page and respond to user events in between - basically, the page stays responsive (this is good!).

October 25, 2011

Hiree

Disclaimer: These screenshots in no way represent my actual interest/interview status with companies. I have randomly selected companies to demo the smiley face moods, so the sad/happy faces do not reflect my actual sentiments towards those companies.

You know how companies have these online systems in place to keep track of all their job candidates?

In the last semester of my senior year, I felt so overwhelmed with all of the companies that I was interviewing with that I decided to create the reverse - an online system for me, the job candidate, to keep track of all the companies. Side note: I thought I was solving a greater need, until a couple of my friends told me, "Amber, nobody else has these problems".

Basically, you can create buckets and companies to go into those buckets, annotating each company with your current "mood" in happy/sad faces (IMHO, this is the most awesome feature). By default, I have "Interested", "Waiting", "Interviewing", and "Received Offer" buckets but everything is renamable and editable so you can feel free to use this application for anything you need (e.g. graduate school application tracking).

I'm currently in the process of adding new features—the ability to add notes for each company (deadlines, locations, etc) and backend storage (the app currently uses local storage, don't clear your browser data) are at the top of my list, but feel free to contact me if you'd like to see anything else.

This project was also an awesome opportunity for me to teach myself Backbone.js, which I've been meaning to learn for a while. I will blog about my experiences sometime and/or open source the code!

October 17, 2011

Eventual Consistency

I mentioned in my previous post that I was taking a Programming in the Cloud class on distributed systems and programming this semester, and one of the perks of this class is that we occasionally have speakers from the industry come in to talk about various topics.

Last week, Doug Terry from Microsoft Research came in to speak about eventual consistency, which I found super, super interesting. I don't claim to be an expert on the topic now, but here are my notes (more or less from memory). First, the two opposite sides of the spectrum:

Strong vs. Eventual

Strong Consistency - guaranteed consistency across replicas

Strong consistency guarantees, well, strong consistency. Any read always returns up-to-date data, although this may result in lower availability since you're limiting the machines that you're able to read from to keep this guarantee.

Eventual Consistency - all replicas will "eventually" become consistent.

Although this weaker form of consistency has better availability because you can read from/write to any machine, you can only hope that the data will be fresh (for some applications, this is enough).

And everything in between

This isn't an exhaustive list of consistency models, but a couple of other models in the middle of the spectrum that Doug went over that day:

Read My Writes - guaranteed to be able to read updates you write

In applications where you're doing some sort of read-then-update operation (like x += 1), it's very important to be able to read your own writes. The example Doug gave during lecture was a baseball analogy, where a scorekeeper keeps track of the current score of the game with operations like homeScore += 1 for every run scored by the home team. It's a different story if there are multiple scorekeepers, where you might need some causal consistency guarantees, but a single scorekeeper must at minimum be able to read his own writes.

Afterwards in class, we had a discussion about the various ways that this consistency model could be implemented. There needs to be some sort of versioning system that lets a client compare against a machine to determine whether its writes are stored there or not. Or, as an optimization, the client could store its writes locally and always be able to "read its own writes" by taking the union of its own write set and whatever write set the machine it reads from has.
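A toy version of that write-set idea could look something like the sketch below. Everything here is made up for illustration (writes are just {id, key, value} records, and replicas are plain arrays), so don't read too much into it:

// Toy sketch of "keep your own write set" for Read My Writes.
var client = { writeSet: [] };

function clientWrite(replica, write) {
  replica.writes.push(write);   // the write lands on some replica...
  client.writeSet.push(write);  // ...and we also remember it locally
}

function clientRead(replica, key) {
  // Union the replica's writes with our own write set, so our own writes
  // are always visible even if this replica hasn't seen them yet.
  var all = replica.writes.concat(client.writeSet);
  var latest = null;
  all.forEach(function (w) {
    if (w.key === key && (!latest || w.id > latest.id)) { latest = w; }
  });
  return latest && latest.value;
}

var replicaA = { writes: [] };  // the replica our write happens to land on
var replicaB = { writes: [] };  // a replica that hasn't seen it yet

clientWrite(replicaA, { id: 1, key: "homeScore", value: 1 });
console.log(clientRead(replicaB, "homeScore")); // 1, even though replicaB is stale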

Consistent Prefix - updates returned are some prefix of all the updates, with no gaps

In case the title didn't make sense, a "consistent prefix" means that some prefix of all of the writes (assuming the writes have a global order) is applied, such that what you read is consistent ("up to date" up until the last write in the prefix, with no gaps).

For example, if the writes are A, B, C, D, E, F, you will be able to read some consecutive set of these writes, starting with A. Note that the empty set is also trivially a consistent prefix.

Why is this useful? Following the baseball analogy from before, if you are recording score writes of A: 1, B: 2, A: 3, B: 6, A: 5, B: 7 and you don't have a consistent prefix guarantee, you could be reading from a database that only has the third and last writes applied, giving you A: 3 and B: 7 when, in reality, that combination of scores never existed.
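If it helps, here's a toy check for this property, assuming we know the global order of write ids and which of them a given replica has applied (both of those assumptions are doing a lot of work here):

// Toy check: is the set of writes a replica has applied a consistent prefix
// (i.e. no gaps) of the global write order?
function isConsistentPrefix(globalOrder, appliedIds) {
  var applied = {};
  appliedIds.forEach(function (id) { applied[id] = true; });
  var seenGap = false;
  for (var i = 0; i < globalOrder.length; i++) {
    if (applied[globalOrder[i]]) {
      if (seenGap) { return false; }  // a write applied after a missing one
    } else {
      seenGap = true;
    }
  }
  return true;
}

console.log(isConsistentPrefix(["A1", "B2", "A3", "B6"], ["A1", "B2"])); // true
console.log(isConsistentPrefix(["A1", "B2", "A3", "B6"], ["A3", "B6"])); // false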

Monotonic Reads - for each client, guaranteed to read the same set of updates as the last read, or a more recent one

Monotonic reads guarantee that you'll never "go back in time" in terms of seeing updates - you'll always see a set of updates that is at least as new as the updates you saw the last time you read from the system. Implementing this requires a client to keep some kind of version marking when it performed its last read, and to search for a machine that has at least those writes. Like Read My Writes, though, we can apply the same optimization of keeping a read set on the client (and performing a union whenever reading from the system).

Since we're only concerned with reading data that is at least as fresh as the last time we read it, we can also introduce a notion of "stickiness" or "affinity", where a client stays with the last machine it read from (until it goes down, at least) instead of re-searching for another machine that is more up-to-date. Assuming a single machine never loses any updates it has applied, staying with the same machine is trivially monotonic.
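Here's a rough sketch of the version-tracking idea. To keep it short, a single counter stands in for a real version vector or read set, and the replica structures are made up:

// Toy sketch of monotonic reads: the client remembers the highest write id it
// has ever seen and only reads from replicas that are at least that far along.
var reader = { highestSeen: 0 };

function monotonicRead(replicas, key) {
  for (var i = 0; i < replicas.length; i++) {
    var r = replicas[i];
    if (r.highestApplied >= reader.highestSeen) {
      reader.highestSeen = r.highestApplied;  // never goes backwards
      return r.data[key];
    }
  }
  return undefined;  // no sufficiently up-to-date replica reachable
}

var scoreReplicas = [
  { highestApplied: 3, data: { homeScore: 2 } },
  { highestApplied: 5, data: { homeScore: 3 } }
];
console.log(monotonicRead(scoreReplicas, "homeScore")); // 2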

Bounded Staleness

Last one! Hopefully you find this fascinating (like I do), and not boring. The idea of staleness allows consistency to be defined in terms of how old a replica's data can be compared to the most up-to-date data. In some applications, it's okay for the data to be a little stale - for example, weather reports can be a couple of hours old. Strong consistency is a special case of Bounded Staleness where the allowed staleness is zero.
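And a tiny sketch of the idea, with a made-up lastUpdatedMs field on each replica and the staleness bound expressed in milliseconds:

// Toy sketch of bounded staleness: only read from a replica whose data is no
// more than maxStalenessMs behind now. A bound of 0 is basically the strong
// consistency special case mentioned above.
function boundedStalenessRead(replicas, key, maxStalenessMs) {
  var now = Date.now();
  for (var i = 0; i < replicas.length; i++) {
    if (now - replicas[i].lastUpdatedMs <= maxStalenessMs) {
      return replicas[i].data[key];
    }
  }
  return undefined;  // nothing fresh enough
}

var weatherReplicas = [
  { lastUpdatedMs: Date.now() - 60 * 60 * 1000, data: { temperature: 58 } }
];
// Weather that's an hour old is fine if we allow up to two hours of staleness.
console.log(boundedStalenessRead(weatherReplicas, "temperature", 2 * 60 * 60 * 1000)); // 58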

Conclusion

Wow, this blog post took longer than I thought it would to write. I'm really interested in these topics (and databases/distributed systems in general), so I'm pretty excited for what the class has in store for us for the rest of the semester!

Again, I don't claim to be an expert in this field (and most of these notes are from memory), so feel free to point out any mistakes/incorrect assumptions or start a discussion in the comments. Thanks for reading! (: