December 21, 2012

Private methods in Ruby

(This blew my mind the other day.)

Consider the following code:

class Hello
  def public_hello
    self.private_hello
  end

  private
  def private_hello
    puts "Hello!"
  end
end

You'd expect this to work fine: private_hello is a private method, but it's being called from within the class.

Nope.

>> hello = Hello.new
=> #<Hello:0x10d0cc200>
>> hello.public_hello
NoMethodError: private method `private_hello' called for #<Hello:0x10d0cc200>
  from (irb):3:in `public_hello'
  from (irb):13

I spent an embarrassingly long time trying to figure out what was wrong ("Do I just not understand how private methods work?!"), and confused one of my coworkers in the process.

However, it turned out to be old news. One post puts the issue pretty succinctly:

private methods can never be called with an explicit receiver, even if the receiver is self

So, the problem with self.private_hello is that the private method is being called on an explicit receiver, even though the receiver is technically the same object—you'd need to call private_hello by itself instead.
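
For example, dropping the explicit receiver makes the same class behave as expected:

class Hello
  def public_hello
    # No explicit receiver here, so Ruby is happy to dispatch to the private method
    private_hello
  end

  private
  def private_hello
    puts "Hello!"
  end
end

>> Hello.new.public_hello
Hello!
=> nil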

Having learned access control modifiers in Java first, I thought this was really bizarre. I guess I need to learn Ruby a little better! (:

October 31, 2012

Discount Code Cards

I love career fairs. In college, I loved going to career fairs to get swag—EECS majors are definitely ridiculously spoiled when it comes to getting free things like shirts and food (I even got a poker set once!).

Anyway, while planning our trip to the UC Berkeley Startup Fair this year, I wanted Stripe to stand out, so I thought pretty hard about what my favorite things to get were.

Number one was shirts, probably. We already had that covered. But I also remembered happily going to the Dropbox booth every semester to get those nifty free space cards: discount codes that you can apply to your account to get 5-10 GB of space at a time.

What if we made free Stripe processing cards? Since we'd be going to a career fair where all of the students were software engineers, and more specifically, very hacker/startup-minded, it made even more sense as a marketing effort.

Implementation

Most of the discount code implementation was already in place from our invite system, so all I really needed to do was generate an extra couple of hundred codes for use at the career fair. The hard part, however, was trying to figure out how to print them.
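
(For what it's worth, minting a batch of codes really is just a few lines. The sketch below shows the general idea, though it is not our actual invite system and the code format is made up:)

require 'securerandom'

# Mint a couple hundred unique codes; SecureRandom.hex(4) gives 8 hex
# characters, so collisions are unlikely, but de-dupe just in case
codes = Array.new(200) { "STRIPE-#{SecureRandom.hex(4).upcase}" }.uniq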

Traditional business card printing services won't let you print out cards with unique codes on them unless you go through some kind of custom order. I didn't have too much time to spare, so John told me about this nifty hack he used with Moo.com (I later found the same solution on Quora).

Moo lets you do this cool thing where you can print a variable number of designs per order—for example, if you want to have a different photograph on the back of each business card. We found that if you upload 100 different "designs" for 100 different cards, each card would be unique.

Their Text-o-matic tool lets you create up to 100 "designs" that are text only.

I took one look at the page and realized I could easily use a script to automatically generate all of the cards (sigh of relief when I discovered the page was not written in Flash):

// Create an array of the discount codes you want to use here
var texts = ['CODE1', 'CODE2', 'CODE3'];

// For each code, fill in the text field and click through the buttons on
// the page to create a design (jQuery is already loaded by Moo's page)
jQuery.each(texts, function(i, text) {
  jQuery('#txtFrontText').val(text);
  jQuery('div#divMakeCard input').click();

  // Set color/size
  jQuery('#a_000000').click();
  jQuery('#aReverseColours').click();
  jQuery('#aZoomOut').click();
  jQuery('#aZoomOut').click();

  jQuery('#btnSaveCard').click();
});

Really, really simple but pretty hacky (just run this in the developer console on the web page). After this, you'll just go into their general template wizard and upload the design for the other side of the card to finish them up.

Improvements

Of course, there are several drawbacks to this approach:

  • You can't really design the "variable" side of the card. You either need to upload the images yourself, or use Text-o-matic, which has a super limited set of options (you can't even add a new line or style separate parts of the text).
  • You can only print 100 cards per order. Since the max number of designs you can upload is 100, if you try to double the card count to 200, you'll end up with every unique card twice.

So what can you do?

Probably the only legitimate way to do this is to generate the to-print PDF yourself, either using some kind of software (like InDesign) or hacking something together in PostScript. I'll be sure to write a follow-up post if I end up doing either!

September 9, 2012

Goodbye, Wordpress!

As you may or may not have noticed, this blog has recently undergone a pretty drastic change. Aside from pure design upgrades, I've moved off Wordpress (good riddance!) and joined the geek bandwagon by writing my blog with Jekyll. Here's my obligatory "I moved to Jekyll" blog post!

Why Jekyll?

  • Wordpress is bloated. I've never even used a third of the features—I wanted something super simple and lightweight that I could gradually add onto (if I wanted to).
  • Custom styling. Wordpress, I've found, is a nightmare to try to style. There are layers upon layers of templates, and I really hate trying to navigate through PHP code. Jekyll, by contrast, is a lot cleaner and easier to deal with.
  • Markdown/Vim > Wordpress editor. I've gotten to the point where I basically need to use Vim for everything (code, obviously, but also notes, to-do lists, etc). Markdown is great, and I like how all of my blog posts are now under version control.
  • Static pages. Blogs are almost pure static content (with the exception of comments, search, etc), so I figured that it would be simpler (and faster) to just have an entirely static site. This makes deploys super easy as well.

My setup

I'm not going to go into depth about how I got Jekyll set up, because there are more than enough of those guides out there. Here are some of the main components that I used:

All of the templates and styles are handwritten. Being able to customize all aspects of the blog is definitely one of the major perks of using Jekyll!

March 18, 2012

Elasticsearch Space-savers

After setting up ElasticSearch, you'll be faced with the task of optimizing your index configuration for speed and for size. There are millions of documents in our index, and, for performance, it’s important that all of it be kept in memory. As a result, index size is pretty important. We spent a while tweaking ElasticSearch to minimize its index size, and it turns out that there’s a fair amount of low-hanging fruit.

_source

For example, ElasticSearch stores the original data along with the indexed data for each document and returns the full document with each search result. We just wanted ElasticSearch queries to assemble a set of document IDs to be fetched from the database, so this overhead is unnecessary. Cutting this data shrank our search index four-fold.

:_source => { :enabled => false }

_all

In addition, ElasticSearch stores an aggregated _all field on each document, which contains the analyzed output from all of the other fields in the document. This doesn’t add any new information, and its purpose is just to simplify the query interface.

We don’t need to be able to query all fields (for example, we only use user IDs to partition the index); setting a flag to exclude these from the _all field and preventing them from being analyzed saved us another 4-5GB for ~2 million documents.

'id' => { :type => 'string', :include_in_all => false }

Multi-field types

Other problems were a little trickier. How do we minimize the size of the email address index, given that we’d like to be able to perform both prefix and substring searches on addresses?

Email addresses may be very long and overflow our ngram tokenizer (which maxes out at fifteen characters), so we decided to construct a multi-field tiered index for certain fields to accommodate all of the search use cases. We don't know if this is a best practice, but it seems to work pretty well for us.
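
For the curious, here is roughly how these tweaks fit together in a Tire-style mapping. The index, type, field, and analyzer names below are illustrative rather than our exact configuration, and the ngram/edge-ngram analyzers themselves would be defined under the index settings:

Tire.index 'search' do
  create :mappings => {
    :customer => {
      # Don't keep the original document around; we only need IDs back
      :_source => { :enabled => false },
      :properties => {
        # Only used to partition the index, so keep it out of _all
        'id'    => { :type => 'string', :include_in_all => false },
        # One field indexed several ways: untouched for exact matches,
        # edge ngrams for prefix searches, ngrams for substring searches
        'email' => {
          :type   => 'multi_field',
          :fields => {
            'email'     => { :type => 'string', :index => 'not_analyzed' },
            'prefix'    => { :type => 'string', :analyzer => 'edge_ngram_analyzer' },
            'substring' => { :type => 'string', :analyzer => 'ngram_analyzer' }
          }
        }
      }
    }
  }
end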

March 15, 2012

Real-time Search with MongoDB and Elasticsearch

Something I worked on a couple of weeks ago at Stripe was overhauling the entire search infrastructure. If you’ve ever used the search feature in manage, it may have seemed sluggish or may have even timed out on you - this is mostly because our search consisted of running a query directly over our database (which, as you can imagine, is very slow for full-text search).

We finally decided to make the switch to a dedicated full-text search engine and chose ElasticSearch, which is a distributed RESTful search engine built on top of Apache Lucene. I won’t go into the details, but you can read about ElasticSearch on the official website.

Setting up ElasticSearch

To set up ElasticSearch, I followed a third-party tutorial - it’s pretty straightforward, although you should be sure to swap in the current version number.

Hooking up MongoDB

Assuming you’re already using MongoDB for the data that you want to index[1], you’re probably wondering how to hook it up to index documents in real-time - this was one of the major hurdles we faced. ElasticSearch has a built-in feature called Rivers, which are essentially plugins for specific services that constantly stream in new updates for indexing.

Unfortunately, there’s no MongoDB River (probably due to the lack of built-in database triggers), so I did some research and realized that I could use the MongoDB oplog to continually capture updates to our main databases.

The oplog is a special collection held by MongoDB for replication purposes - I was able to make use of a tailable cursor (which performs similarly to the tail -f command) for shipping new updates from MongoDB to our search server (we’re using Tire as our Elasticsearch Ruby DSL).

require 'mongo'   # MongoDB Ruby driver: ReplSetConnection, Cursor, Constants, etc.
include Mongo

def initialize
  @host = ENV['MONGO_RUBY_DRIVER_HOST'] || 'localhost'
  @port = ENV['MONGO_RUBY_DRIVER_PORT'] || Connection::DEFAULT_PORT

  @db_whitelist = %w(payments customers invoices)

  @stream = Queue.new
  @max_bulk = 10000

  # Or, just set up a connection the normal way if you’re not
  # using a repl set
  @connection = ReplSetConnection.new(['database1', 27017],
                                      ['database2', 27017],
                                      ['database3', 27017],
                                      :read => :secondary)

  local = @connection.db('local')

  # Change to oplog.$main if running mongod as a master instance
  oplog = local.collection('oplog.rs')

  @tail = Cursor.new(oplog,
                     :tailable => true,
                     :order => [['$natural', 1]])

  while true
    tailer
    indexer

    # In production, we keep this constantly running in a busy
    # loop; we constantly have new things to index
    # sleep 5
  end
end

The code here basically tails the oplog and reads off new operations, pushing them onto @stream, until there are none left or we’ve reached the max number of entries that we want to batch at a time:

def tailer
  # While there are new entries
  while @tail.has_next?
    _next = @tail.next

    # We only want to index specific collections
    match = /some namespace regex/.match(_next["ns"])
    if match && @db_whitelist.include?(match[1])
      @stream << [match[1], _next]
    end

    # Upon hitting the maximum bulk size we want to send over
    break if @stream.length >= @max_bulk
  end
end

Afterwards, we pull rows out of @stream, transform them into hashes that we actually want to index into Elasticsearch, and bulk index them via Tire. Rinse and repeat.

def indexer
  bulk_index = []
  while !@stream.empty?
    _next = @stream.pop
    item = _next[1]
    type = _next[0]

    # Our TransformRow.transform_for method just returns a hash
  # to index. Ex. {:type => "customer", :name => "Amber"}
    new_row = TransformRow.transform_for(type, item['o'])

    # Add keywords if they exist on the database object
    bulk_index << new_row
  end

  # Bulk index the data via Tire
  if !bulk_index.empty?
    Tire.index "search" do
      import bulk_index
    end
  end
end

Of course, this is a very basic synchronous example - you could imagine making this into a more flexible asynchronous producer-consumer model.
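
For instance, since @stream is already a thread-safe Queue, a minimal threaded sketch might look like this (emphatically not our production code; the backoff is arbitrary):

producer = Thread.new do
  # Keep pulling new oplog entries onto @stream
  loop { tailer }
end

consumer = Thread.new do
  # Keep draining @stream into the search index, backing off briefly when
  # there is nothing to do so we don't spin
  loop do
    indexer
    sleep 1 if @stream.empty?
  end
end

[producer, consumer].each(&:join)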

In addition, this sort of naive first pass doesn’t save any state, so if the driver crashes due to a timeout on the database side, or if the search server is down, the restarted driver will start again from the very beginning. To combat this, I simply wrote a timestamp to /tmp to keep track of the last record indexed, and queried the collection starting at that timestamp the next time around.

(Update: I recently changed this to write to permanent storage instead; although not necessary, it's simple enough to do and is a better idea in general.)

@tail = oplog.find({
  'ts' => {'$gte'=> BSON::Timestamp.new(@last_ts, 0) }
})
@tail.add_option(Constants::OP_QUERY_TAILABLE)
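
The checkpoint itself is just a read and a write of that timestamp - something along these lines, where the path and helper names are made up for illustration:

CHECKPOINT_PATH = '/tmp/oplog_tailer.ts'

# Persist the seconds component of the last BSON::Timestamp we indexed
def save_checkpoint(ts)
  File.open(CHECKPOINT_PATH, 'w') { |f| f.write(ts.seconds.to_s) }
end

# On startup, read it back; fall back to 0 (the start of the oplog) if
# there is no checkpoint yet
def load_checkpoint
  File.exist?(CHECKPOINT_PATH) ? File.read(CHECKPOINT_PATH).to_i : 0
end

@last_ts = load_checkpoint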

But wait, there’s more! Scanning the entire collection on most restarts can be pretty expensive (the oplog doesn’t allow any indexes, by the way), especially if your oplog is large like ours, at 2GB.

You can take advantage of a special flag meant for exactly this kind of use, called OPLOG_REPLAY. The flag exists for replication purposes - it starts from the bottom of the log and searches upward through the last 100MB or so, assuming that you just restarted (which is usually the case if the tailer crashes and restarts immediately). If the timestamp can’t be found there, it’ll continue by traversing extents in the most efficient manner possible.

@tail.add_option(Constants::OP_QUERY_OPLOG_REPLAY)

Running this tailer as a daemontools service, we were able to attain nearly real-time search latency. It takes around 3-5 seconds from the time a record is created or updated to when it gets indexed (and is searchable) in ElasticSearch - pretty cool!

Footnotes

  1. It’s possible to use ElasticSearch as a data store itself (it’ll store the document data as well as the index data), but I’m not convinced this is a good idea in practice.