Reflections On My OPW Internship

My last patch to Parsoid got merged a couple of weeks ago, so now seems like a good time to write a final post to summarize what I learned. I picked up plenty of technical skills during my internship, but more importantly, I learned how to get up to speed on a large codebase and how to work with a team.

Tips for New Open Source Contributors

  • Don’t try to understand the whole thing. Just learn enough to make the contribution that you want to make, at least at first. You will get to know the codebase organically as you continue to make contributions.
  • Keep your patchsets small. I spent the majority of my internship working on the same patch; it took 7 weeks and 31 patchsets before it got merged. This was mostly because the patch spanned 23 different files. In general, I believe that new contributors should either stay away from projects that affect a huge number of files, or split up their work so that each patchset covers just 2-3 files at a time.
  • Having a mentor really helps. Even if you are operating on your own and not through a program like OPW, it’s good to establish a relationship with someone who can help guide you through the codebase. I spent about six hours a week asking my mentor all kinds of questions.
  • Peripheral participation can be productive. My team often got into long IRC debates about my project, most of which I didn’t understand. But I would save the IRC transcript and ask my mentor to translate it for me later. My coworkers were able to converse naturally without feeling held back, while I benefited from understanding all of their different viewpoints.
  • Figure out who the decision-makers are. Whose approval will you need before your patch gets merged in? On the Parsoid team, for example, it took me some time to realize that the team lead was the final authority, and that no patch would be merged without his approval. Once I knew that, I tried harder to seek his opinion ahead of time, so that I wouldn’t end up submitting a patch and be forced to revise it later.
  • You are ultimately responsible for your code. You know your code better than anyone else will, especially as you continue to make contributions. Don’t expect other contributors to have the same level of understanding, or to always spot your mistakes, even if they have more experience than you do.
  • Use IRC. In the beginning, I thought it would be weird if I had to correspond on IRC with people I’d never met in real life, and I asked if I could have Google Hangouts with my mentors once a week. After the first Hangout, I realized that everyone just felt more comfortable on IRC.

What I Learned (Technically)

From a technical perspective, I mainly learned better ways of structuring my code that are probably applicable to any language or framework. I learned the most from code review and from IRC discussions about how to implement my project. Examples of things people taught me were:

  • To have explicit returns on all paths of if / else statements, for code clarity
  • To avoid “side effects” where a function does more than what it says
  • To build something once in the constructor, instead of building it over and over again (to speed up performance)
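To make these three habits concrete, here is a small sketch. LogTypeMatcher is a made-up name for illustration, not something from Parsoid; it builds its regular expression once in the constructor, returns explicitly on every path, and does nothing beyond what its method name promises.

```javascript
// Hypothetical example: LogTypeMatcher is not part of Parsoid.
function LogTypeMatcher(pattern) {
  // Build the regular expression once, in the constructor,
  // instead of rebuilding it on every call to matches().
  this._regex = new RegExp(pattern);
}

// Every path has an explicit return, and the function only
// answers the question in its name (no logging, no mutation).
LogTypeMatcher.prototype.matches = function (logType) {
  if (typeof logType !== 'string') {
    return false;
  } else {
    return this._regex.test(logType);
  }
};

var matcher = new LogTypeMatcher('^(error|warning)$');
console.log(matcher.matches('error'));   // true
console.log(matcher.matches('trace'));   // false
```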

I learned some neat Javascript tricks, like async.parallel and Object.assign, and improved my Javascript coding style (the good old === vs ==, for example). I also learned a little bit about Node and some of the Node packages that Parsoid relies on (in my case, mainly es6-shim and async).

I also became more familiar with version control and code review systems. I learned how to use Gerrit, picked up a few new git commands (like rebase -i), and learned how to write well-structured and informative commit messages.

Finally, by working on Parsoid I began to develop an understanding of how parsers work; how they break up text into indivisible chunks called tokens, transform the tokens, and then reassemble them on the other side using a tree structure.

Javascript Tricks: async.parallel

The async Node module contains tools for organizing asynchronous code to make it easy to understand the underlying control flow.

async.parallel(tasks, [callback])

async.parallel is used for running several tasks “in parallel”, and then executing a callback function once all the tasks are finished. It does not truly run the tasks in parallel because Node runs on a single event loop. Instead, it starts the tasks one after the other, in the order given, without waiting for one task to finish before starting the next. This is ideal when none of these tasks depend on one another. Its arguments are an array of tasks and an (optional) callback to be performed once the tasks have finished. The final callback is passed an error (if any task reported one) and an array containing the results that the tasks passed to their own callbacks.
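To show the control flow without pulling in the module, here is a simplified stand-in that I've called parallelSketch. It is not the async library's implementation, and the two timer-based tasks are made up; it only illustrates that every task is started up front and that the final callback fires once, after the last task has called back.

```javascript
// A simplified stand-in for async.parallel, written only to show the
// control flow. It is NOT the async library's implementation.
function parallelSketch(tasks, done) {
  var results = new Array(tasks.length);
  var remaining = tasks.length;
  var failed = false;

  if (remaining === 0) {
    return done(null, results);
  }

  tasks.forEach(function (task, index) {
    // Every task is started right away, in order, without waiting
    // for the previous one to call back.
    task(function (err, value) {
      if (failed) { return; }
      if (err) {
        failed = true;
        return done(err);
      }
      results[index] = value;
      remaining -= 1;
      if (remaining === 0) {
        done(null, results);
      }
    });
  });
}

// Two hypothetical tasks; the slower one is listed first.
var tasks = [
  function (cb) { setTimeout(function () { cb(null, 'slow'); }, 20); },
  function (cb) { setTimeout(function () { cb(null, 'fast'); }, 5); }
];

parallelSketch(tasks, function (err, results) {
  // Results keep the order of the tasks array, not completion order.
  console.log(err, results); // null [ 'slow', 'fast' ]
});
```

Note that the second task finishes first, yet its result still lands in the second slot of the results array.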

In the Parsoid logger, I use async.parallel to log data to all of the applicable backends. For instance, I might have two backends: one that prints an error message to the console, and another that sends an HTTP response containing the error message. After I’ve logged my error message to the console and sent an HTTP response, I run a callback that shuts down my process if my logType is fatal.

// After getting all the relevant backends with _getApplicableBackends
// I start each backend in turn using async.parallel.
// Finally, I shut everything down if the logType is fatal.
Logger.prototype._routeToBackends = function(logData, cb) {
  var applicableBackends = this._getApplicableBackends(logData);
  async.parallel(applicableBackends, function() {
    if (/^fatal$/.test(logData.logType)) {
      process.exit(1);
    }
    cb();
  });
};

You certainly don’t need async.parallel to write JavaScript code with this kind of control flow. My first solution looked like the example below: I executed the backends one after the other, using a callback to increment a counter every time a backend function completed. Once the counter’s value was equal to the number of applicable backends, I checked to see if the logType was fatal and then shut down the process if necessary.

I think that using async.parallel makes for cleaner and more legible code, though.

// Without `async.parallel`. I keep track of the number of backends
// completed using `numFinished`. When `numFinished` is equal
// to the number of backends, I check to see whether the logType
// is fatal (this is equivalent to running the final callback in
// async.parallel).
var numApplicableBackends = applicableBackends.length;
var numFinished = 0;
var fatalCallback = function () {
  // Count this backend as finished before checking whether all are done.
  numFinished++;
  if (numFinished === numApplicableBackends && /^(fatal)(\/|$)/.test(logType)) {
    process.exit(1);
  }
};

applicableBackends.forEach(function(backend) {
  backend(logData, fatalCallback);
});

Javascript Tricks: Maps

Maps and Sets are part of the ECMAScript 6 proposal (Harmony). While they haven’t officially been implemented yet, you can start experimenting with them by using Paul Miller’s es6-shim module. Maps are much like Javascript objects in that they are collections of key-value pairs, but they have a few features that may make them more useful than regular Objects. The Mozilla guide lists the following advantages of Maps over regular Objects:

  • An Object has a prototype, so there are default keys in the map. However, this can be bypassed using map = Object.create(null).
  • The keys of an Object are Strings, where they can be any value for a Map.
  • You can get the size of a Map easily while you have to manually keep track of size for an Object.

In Parsoid, I used a Map to map logTypes (“error”, “warning”, or /error|warning/) to Arrays of logging backends (functions that would print to a console, write to a file, send an HTTP response, etc.)

Getting Started

Maps are easy to work with. Open up a terminal, create a Map with var map = new Map(), and try the following (but make sure to require es6-shim or an equivalent module first):

  • Add new key-value pairs with map.set(key, value).
  • Retrieve a value for a given key using map.get(key).
  • Determine whether a Map contains a given key with map.has(key).
  • Get the size of a Map with map.size.
  • Delete a key from a Map with map.delete(key).
  • Clear all keys from a Map using map.clear().
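Here is a short session that exercises each of these operations. It assumes a runtime with a native Map (or es6-shim already loaded), and the backend names are placeholder strings rather than real Parsoid backends.

```javascript
// Assumes a runtime with a native Map; on older runtimes,
// require('es6-shim') first, as described above.
var backends = new Map();

backends.set('error', ['console']);      // add key-value pairs
backends.set('warning', ['file']);

console.log(backends.get('error'));      // [ 'console' ]
console.log(backends.has('trace'));      // false
console.log(backends.size);              // 2

backends.delete('warning');              // remove one key
console.log(backends.size);              // 1

backends.clear();                        // remove every key
console.log(backends.size);              // 0
```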

Beware of Non-Identical Keys

The keys to a Map can be any type of object, which seems like an improvement over using regular strings. Unfortunately, when using get to retrieve a value for a particular key, you must pass in the very same object (the same reference) that you used as the original key. In Parsoid, I initially used regular expressions as logTypes (keys) corresponding to Arrays of backends (values). This made it impossible to retrieve the backends later, since every regular expression literal creates a distinct object, even when it has the same source string as another.

var backendArray = [logToFirstBackend];
this._backends.set(/error|warning/, backendArray);
this._backends.get(/error|warning/); //undefined

Instead of storing regular expressions as keys, I had to obtain their source strings and save those instead. Unless you are able to store references to all the objects you are using as keys, you’ll have to do the same. From this perspective, Maps don’t have much of an advantage over regular objects.
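The difference between looking up by reference and looking up by source string can be seen in a few lines. The variable names and the placeholder backend value below are made up for this sketch.

```javascript
var backendsByPattern = new Map();
var errorOrWarning = /error|warning/;

backendsByPattern.set(errorOrWarning, ['console']);

// A second literal with the same source is a different object.
console.log(backendsByPattern.get(/error|warning/)); // undefined
// The original reference finds the value.
console.log(backendsByPattern.get(errorOrWarning));  // [ 'console' ]

// Strings are compared by value, so the source string works as a key.
backendsByPattern.set(errorOrWarning.source, ['console']);
console.log(backendsByPattern.get('error|warning')); // [ 'console' ]
```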

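var logTypeString;
if (logType instanceof RegExp) {
  logTypeString = logType.source;
} else {
  // Anchor plain strings so that "error" matches only "error".
  // No surrounding slashes: new RegExp() takes the pattern source alone.
  logTypeString = "^" + logType + "$";
}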

this._backends.set(logTypeString, backendArray);

Iterating Over Maps

On the other hand, the convenient iteration method forEach is a good reason to use Maps. Like the forEach method for Arrays, forEach allows you to apply a callback function to every key/value pair in a Map. The arguments to the callback are the current value, the current key, and the Map itself. When you use a regular Object as a Map, that Object inherits default properties from Object.prototype that you’ll want to skip with a hasOwnProperty check; alternatively, as suggested in the Mozilla guide, you can create an Object with a prototype of null. Using a Map saves you the headache of fiddling with Objects, because the only key / value pairs in a Map are those that you deliberately set yourself.

In Parsoid, I use forEach when figuring out which backends to log a message to. If the current logType matches any keys (saved logTypes) in my Map savedBackends, then I take the relevant backend functions from the matching values (arrays of backend functions) and push them onto an applicableBackends array. For example, if my logType is “error” and savedBackends contains the keys "error", "error|fatal", and "warning", then the applicable backends are the elements of the Arrays returned by savedBackends.get("error") and savedBackends.get("error|fatal").

// Iterate over all of the saved backends.
savedBackends.forEach(function(backendArray, logTypeString) {
  // Convert the saved string back into a regular expression
  // and test the passed-in logType.
  if (new RegExp(logTypeString).test(logType)) {
    backendArray.forEach(function(backend) {
      // Push each backend from the matching backendArray
      // onto my list of backends.
      applicableBackends.push(backend);
    });
  }
});
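For anyone who wants to run this lookup outside of Parsoid, here is a self-contained sketch. The three backend functions are empty placeholders, getApplicableBackends is a free function rather than the logger's real method, and the anchored keys are my assumption about how plain-string logTypes get stored.

```javascript
// Hypothetical stand-ins for the logger's state; the names mirror the
// snippet above, but the backends here are placeholders.
function logToConsole() {}
function logToFile() {}
function logToHttp() {}

var savedBackends = new Map();
savedBackends.set('^error$', [logToConsole]);
savedBackends.set('error|fatal', [logToHttp]);
savedBackends.set('^warning$', [logToFile]);

function getApplicableBackends(logType) {
  var applicableBackends = [];
  savedBackends.forEach(function (backendArray, logTypeString) {
    // Rebuild the regular expression from its saved source string.
    if (new RegExp(logTypeString).test(logType)) {
      backendArray.forEach(function (backend) {
        applicableBackends.push(backend);
      });
    }
  });
  return applicableBackends;
}

console.log(getApplicableBackends('error').length);   // 2
console.log(getApplicableBackends('warning').length); // 1
```

A Map iterates in insertion order, so the backends come back in the order in which their keys were saved.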

Javascript Tricks: Object.assign

Object.assign() is a new ECMAScript 6 function that can be used to merge together two objects. (If you want to try ES6, I suggest checking out the es6-shim module!)

Basic Use Case: Merging Together Two Objects

Object.assign(target, source) copies over all of the own properties of source into target. In the example below, we create an object duckEgg with a prototype egg and assign its properties to the object omelet. Thus, omelet now has the property fat from duckEgg, in addition to its initial own property carbs. However, it doesn’t have access to the property protein, which is defined only on duckEgg's prototype.

// Merging two objects together with Object.assign.     
> require('es6-shim');
{}
> var egg = {"protein": 4};
> var duckEgg = Object.create(egg);
> duckEgg.fat = 3;
3
> var omelet = {"carbs": 3};
> Object.assign(omelet, duckEgg);
{ carbs: 3, fat: 3 }
> omelet.protein
undefined

In Parsoid: Combining Fields from Multiple Objects

You can also use a combination of Object.assign and Array.prototype.reduce to merge together multiple objects. In my Parsoid logger, I can use this approach to combine logged objects with different custom fields into a single object. So far, I’ve mainly used Errors and strings for logging data rather than objects with specific fields, but you can imagine using different objects for different types of information and merging them together at the end.

// Calling env.log.
env.log("error", obj1, obj2, obj3);

// Within the log() function; combining logged objects into one
// loggedObjects is the Array [obj1, obj2, obj3].
loggedObjects = loggedObjects.reduce(function(prev, object) {
  return Object.assign(prev, object);
}, {});
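Here is a runnable version of the same idea with three made-up objects standing in for obj1, obj2, and obj3. The field names are invented for illustration; it assumes a runtime with Object.assign (or es6-shim loaded).

```javascript
// Hypothetical logged objects; the field names are made up.
var loggedObjects = [
  { msg: 'parse failed' },
  { page: 'Main_Page' },
  { durationMs: 42 }
];

// Start from a fresh {} so that none of the inputs is modified.
var merged = loggedObjects.reduce(function (prev, object) {
  return Object.assign(prev, object);
}, {});

console.log(merged);
// { msg: 'parse failed', page: 'Main_Page', durationMs: 42 }
console.log(loggedObjects[0]);
// { msg: 'parse failed' }
```

Passing {} as the initial value matters: without it, reduce would use the first logged object as the target and modify it in place.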

A Quick Caveat

If both the target and source objects have a property with the same name, Object.assign overwrites the target object’s property with the source object’s. You can easily lose information unless you make sure that the two objects have no overlapping properties. In the example below, the nutrition and taste variables both share the property calories. When the two are merged together, the resulting object only has the calories property from taste.

> var nutrition = {"calories": 5};
> var taste = {"savory": true, "calories": 100};
> Object.assign(nutrition, taste);
{ calories: 100, savory: true }
> nutrition
{ calories: 100, savory: true }
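One way to soften this, sketched below, is to merge into a fresh object so that the original target survives, and to list the overlapping property names before merging. This is only a suggestion on my part, not something the Parsoid logger does.

```javascript
var nutrition = { calories: 5 };
var taste = { savory: true, calories: 100 };

// Merge into a new object so that `nutrition` keeps its own value.
var combined = Object.assign({}, nutrition, taste);
console.log(combined);   // { calories: 100, savory: true }
console.log(nutrition);  // { calories: 5 }

// List the properties that would be overwritten before merging.
var overlapping = Object.keys(taste).filter(function (key) {
  return Object.prototype.hasOwnProperty.call(nutrition, key);
});
console.log(overlapping); // [ 'calories' ]
```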

Javascript Tricks: Array.prototype.slice.call(arguments)

In this round of posts, I’ll blog a little bit about the Javascript tricks I picked up while working on Parsoid, using some of my logger code to illustrate.

1. Array.prototype.slice.call(arguments)

Ideal for manipulating an arbitrary number of arguments that have been passed into a function. This code copies all or some of the arguments into an array, which can then be handed off to nested functions. For example, we might want to pass an object with an arbitrary number of properties to my logging / tracing function in order to describe an error or to provide tracing information. The logger then hands the object off to a data-processing function that constructs logging messages based on the object’s properties.

// A few sample use cases of the logging function. 
// We pass all but the first argument (the logType) to a nested data-processing function.
env.log("trace/request", "completed parsing of", prefix, ":", target, "in",
             env.performance.duration, "ms");
env.log("error", new Error());
env.log("error", token);

The first argument to env.log is the logType (the type of log output that we’re generating), while the remaining arguments are data that’s used to construct a log message. The arguments can be anything from an error to an object to a bunch of strings. In my implementation of log, logType is the only named parameter. I want to separate the remaining arguments from logType, funneling them into a logObject variable.

// How Array.prototype.slice.call is used in the logger
Logger.prototype.log = function (logType) {
  var self = this;
  var logObject = Array.prototype.slice.call(arguments, 1);
  var logData = new LD(this.env, logType, logObject);
  // ... (the rest of the function is omitted here)
};

arguments is a special Array-like object that Javascript makes available inside a function, and it lets us access all the arguments passed to that function. So if I call env.log("error", token); then arguments[0] is "error", while arguments[1] is token. It seems like an Array because it has a length and you can index into it, but it isn’t one; it lacks Array methods like pop, shift, and slice. If arguments were an Array, I could just set logObject to arguments.slice(1). But it isn’t, so that’s where Array.prototype.slice.call comes to the rescue.

// Copies an Array-like object into a new Array.
// Beginning and ending indices are optional.
newObject = Array.prototype.slice.call(oldObject, [beginningIndex, [endingIndex]]);

slice takes an Array and returns a new Array containing all or a subset of an existing Array. Its arguments are the beginning and ending indices of the copy. Even though slice is a method that’s only defined on Arrays, call allows us to use slice on Array-like objects. call redefines the this value in slice from an Array to the Arguments object. The first argument to call is the new this value. The remaining arguments to call are passed in as the regular arguments to slice. So you can use Array.prototype.slice on an Array-like object to get back a copy, starting (or ending) at specific indices.

In this case, we’re copying everything from arguments, except for the first argument, and putting it into an array named logObject. Although arguments isn’t an Array, slice can still handle it because it has the properties that slice is looking for (such as length and numeric indices).
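To try this outside of Parsoid, here is a stripped-down function that only mirrors the shape of the logger's log function. The function and the returned object are made up for this sketch.

```javascript
// Hypothetical function; it only mirrors the shape of the logger's log().
function log(logType) {
  // Copy everything after the first argument into a real Array.
  var logObject = Array.prototype.slice.call(arguments, 1);
  return {
    logType: logType,
    isArray: Array.isArray(logObject),
    logObject: logObject
  };
}

var result = log('trace/request', 'completed parsing in', 12, 'ms');
console.log(result.isArray);    // true
console.log(result.logObject);  // [ 'completed parsing in', 12, 'ms' ]

// With only the logType, the copy is an empty Array.
console.log(log('error').logObject); // []
```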

The Never-Ending Patch

I’ve spent the past seven weeks on the same error logging patch. Being stuck on a patch is a new sort of purgatory; I’ll spend several days working on the next patchset, only to be sent back to the beginning when my team members discover a new error, ask for a new feature, or suggest different implementations.

Scoping is probably the biggest reason why the patch has dragged on for so long. The patch replaces every error and warning log in Parsoid with my logging function, which means that it’s used in a large number of files (23 at last count). In the beginning, this made for very slow going, since I wanted to test every call site to make sure that I was referencing the logging function properly and that it generated the desired output. Besides this, the potential for error increases along with the number of lines of code. As time goes on, my patchset gets larger and harder to review, and it’s easy for me and my reviewers to overlook important details.

Another reason is that I’m not very familiar with some of the underlying technologies. Not only is Parsoid a somewhat complicated project, but it relies on frameworks that I’m not very familiar with: Node.js (server-side Javascript), Connect (a middleware framework for Node), and Express (a web development framework for Node). Whenever we run into a framework-related issue on Parsoid, I spend a day reading about the framework instead of writing code. I like to take the time to completely understand the problems with the current patchset before making any changes…which often results in too much rabbit-holing, and not enough coding.

A good example of this was an infinite error-logging bug that the team discovered on February 11th. It crashed the Parsoid servers by filling up the disk with identical error logs. The Parsoid web server uses Connect, which comes with its own default error handler. The web server also had its own error handler, which set HTTP headers and sent an HTTP response with information about an error. If we called the custom error handler but set HTTP headers again afterwards, we ended up with a “Can’t set headers after they are sent” error that would go to Connect’s default error handler. The default error handler would try to set headers again, resulting in another “Can’t set headers” error, sending Parsoid into an infinite error recursion tailspin.
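One general way to avoid this class of error, sketched here with a fake response object, is to check whether the headers have already gone out before touching them again. This is not the fix that went into Parsoid; sendErrorResponse and makeFakeResponse are hypothetical names, and the only real API assumed is the headersSent property that Node's http.ServerResponse exposes.

```javascript
// Hypothetical error handler; `res` only needs the parts of Node's
// http.ServerResponse used here (headersSent, setHeader, end).
function sendErrorResponse(res, err) {
  if (res.headersSent) {
    // Headers already went out; setting them again would throw and
    // feed another error back into the error handler.
    return false;
  }
  res.statusCode = 500;
  res.setHeader('Content-Type', 'text/plain');
  res.end(String(err && err.message));
  return true;
}

// A fake response, just enough to run the sketch without a server.
function makeFakeResponse() {
  return {
    headersSent: false,
    headers: {},
    setHeader: function (name, value) { this.headers[name] = value; },
    end: function (body) { this.body = body; this.headersSent = true; }
  };
}

var res = makeFakeResponse();
console.log(sendErrorResponse(res, new Error('boom'))); // true
console.log(sendErrorResponse(res, new Error('boom'))); // false
```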

It took me a couple of days of reading about Connect and talking to my mentor to even understand what had caused the error recursion in the main branch of Parsoid, and another several days to process my mentor’s suggestions for how to structure my logging function to avoid error recursion. Ten days passed before I felt confident enough about the restructured code to submit my next patch.

I’m now on my 22nd iteration of the patch and feeling (delusionally?) hopeful that the next patchset will be the last. If I were to do it all over again, I’d have kept my patch smaller and more tightly scoped; since it’s too late for that (we’re down to revising the same two files each time), I’ve devised a coping strategy to speed up the feedback cycle. I’ve been sending my mentor gists for specific files, instead of waiting for his input until I’ve submitted a patchset. I’d be curious to hear whether other people have suggestions for dealing with never-ending patches.

Coding by Consensus

My latest contribution to Parsoid was a generic logging and tracing function. It took me four weeks, twelve patchsets, and three different approaches before the patch was merged in.

Initially, I wrote a single function and put it in our Util module. Next, per my mentor’s suggestion, I expanded the function into a Logger class that could be customized with a different configuration for every class and file using it. The Logger class included a #log function as well as wrapper functions (#trace, #dump) that called the basic #log function with certain parameters. Our team lead disagreed with the Logger implementation, though, saying that it was too complicated to have separate loggers in each file. Based on his feedback, I moved the logging function to an “environment” object that’s accessible throughout most of the codebase. I also got rid of the wrapper functions, moving everything into a single logging function that prints different output depending on a logType parameter.

In the process of revising my patch, I learned a lot about Wikimedia’s culture. Whether it’s formal or informal, my team essentially operates by consensus. We can spend hours in friendly debates over questions of style and implementation (like the best way to write a logging function). And there’s always room for further discussion, even after the code’s already been merged. Because of the need for consensus, it takes longer to produce a final version of a patch.

I’ve learned a lot because of the consensus-based approach. Now I’ve implemented the logger three different ways (and understand the associated pros and cons), as opposed to having written it once and being done with it. I picked up some new concepts from my team’s debates on implementation, such as the difference between subclasses and subtypes. And I got used to the process of revising my code to accommodate feedback from many different perspectives.

I’m curious about how code review works in other teams. An inclusive, consensus-based approach has been helpful for my learning, but perhaps it would seem inefficient to some organizations.

Meeting My Mentor

Two weeks ago, I used my OPW internship travel stipend to visit my mentor Subbu Sastry in Minnesota. He spent two days helping me on my latest patch, explaining Parsoid and Wikimedia, and feeding me delicious South Indian food.

While visiting Subbu wasn’t strictly necessary (he’s extremely responsive on IRC, code reviews, and email), it was still very helpful to see him in person. Here are a few of the ways in which I benefited from the visit:

  • Understanding historical context. Documentation and wikis can give you a good sense of the current state of a project, but not its past or its future. Subbu helped me understand how Parsoid evolved out of MediaWiki’s original PHP parser, how it interacts with the Visual Editor project, and what the goals for Parsoid are going forward. (Some of this is also covered in a fairly lengthy and slightly outdated blog post.)
  • Visual learning. Parsoid’s process for converting wikitext tokens into an HTML DOM tree confused me until Subbu drew me a diagram showing the pipeline of transformations. I have a much better mental model of Parsoid as a result. You can’t readily send drawings back and forth over IRC or explain them very well over email; it’s really best when someone draws a diagram in real time and narrates as they go along.
  • Accidental learning. I learned a lot about Wikimedia’s internal tools and infrastructure just by looking over Subbu’s shoulder. For example, he showed me Zuul, a tool for running tests and other jobs on patches submitted through Wikimedia’s code review system Gerrit.

Most of all, meeting in person gave Subbu a good sense of who I am as a person and as a programmer. Even though we’re back to interacting on IRC, he can more readily detect when I’m making progress, or when I’m desperately confused and need to chat.

To other OPW interns: I definitely recommend seeing your mentor in person, assuming that the $500 travel stipend is sufficient to cover it. (I wish this stipend were higher for people living in other countries!)

Hacker School Month 3 Retrospective

Hacker School ended three weeks ago, a fact that I find both poignant and inescapable. In some ways, I feel like I didn’t make the most of it; I didn’t “finish” a major project while I was there. On the other hand, I didn’t go to Hacker School to learn more about web development. I went because I wanted to learn new languages and paradigms, to explore computer science topics like algorithms and data structures, and to collaborate with curious and talented programmers. From that perspective, I think that I spent my time well.

Throughout Hacker School, I made significant progress on my e-flirting web app, Datebot, but never completed and deployed it. I also worked through the first 1.5 chapters of SICP, learning about functional programming in the process. In addition, I paired extensively with other Rubyists on their projects, went to lots of seminars by Hacker School residents and facilitators, and engaged in lots of accidental learning. Finally, I started interning on an open source project, which is something I wouldn’t have dreamed of doing before Hacker School.

In the last month of Hacker School, I continued working on my projects, but also made time for the fun, sparkly, enlightening things that make Hacker School so wonderful and distracting. Here’s a roundup:

  • Datebot reorganization: I revised my database schema, added tests and validations, wrote some helpful Rake tasks, and began converting overly-powerful helper methods into modules. Now that I’m nearly done refactoring, the final step will be to finish the Google Calendar integration and actually schedule dates with crushes on behalf of the users.
  • Botastic: I paired with Will Chapin on this clever Zulip chatbot, which responds to messages with fun semi-relevant facts from Wikipedia. We refactored his code into short, three-line methods and experimented with functional programming techniques like pipelines. We also rewrote Botastic so that it could respond to any type of sentence, instead of only sentences in a specific format, by using a part-of-speech tagger.
  • Parsoid: I began interning on December 10, two weeks before Hacker School ended, so I had even less time for my Hacker School projects. On the other hand, working on Parsoid at Hacker School meant that I could get help from facilitators (especially maryrosecook, who’s a Javascript wizard), learn about Wikimedia’s organizational structure and developer tools from Sumana, and collaborate with Be Birchall, who’s both a Hacker School alumna and a fellow intern on Parsoid.
  • Markov Fun: I attended an amazing seminar by the lovely Alex Rudnick on using n-grams to generate sentences given a specific corpus. The demo code he used was all in Python, so I ported his code to a Ruby gem.
  • Functional programming techniques in Python and Ruby: maryrosecook gave a great practical introduction to functional programming using Python. I followed along in Ruby, and was surprised at how many functional techniques I take for granted (e.g., map, reduce, filter).
  • r0ml Lefkowitz’s talk on APL: Not only is APL a fascinating language (everything is a matrix! no for loops!), but it was great to hear about what programming was like Back In The Day (drum memory! teletypewriters!)
  • Korhal: I was deeply intrigued by Travis Thieman's Clojure-based Starcraft AI. I didn't know anything about Clojure or Starcraft, so I didn't feel qualified to contribute to it, but it was still thrilling to see him explain how to implement a zerg rush.

Making Ruby Gems with Bundler

Last week, I made my first Ruby gem, markovfun, which generates sentences using a technique that Alex Rudnick taught us at Hacker School. (There’s a Python version here for those who are interested.) Making a gem is very easy, especially when you’re using Bundler. The entire process takes about half an hour from start to finish, which is longer than it took me to write this blog post. I recommend it! You can use your gem locally, just as a way to keep your code uncluttered, or you can share it with the world by pushing it to Rubygems. As of this writing, my gem’s been downloaded 304 times, which means (hopefully) that I’ve helped hundreds of people have fun with Markov chains.

Read on to learn how to make and use a Ruby gem! If this takes you longer than 30 minutes, I’d like to hear about it.

Building Your Gem

  1. Run bundle gem my_gem from the command line. This will create a folder my_gem that contains a Gemfile, gemspec, Rakefile, and the lib folder. Inside the lib folder, you’ll see a file called my_gem.rb, which you’ll update with your gem’s methods. There’s also the folder lib/my_gem, which contains version.rb, a file that specifies the gem’s current version.

  2. Update my_gem.gemspec. Bundler has already pre-filled out this file with your name and email address (by integrating with Git, I imagine). Provide a gem description and summary; the gem won’t build until you do this. In addition, specify any gems that you need while developing at the bottom of the file with spec.add_development_dependency. (Gems that your code needs at runtime go in spec.add_dependency instead.)

    lib = File.expand_path('../lib', __FILE__)
    $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
    require 'my_gem/version'
    require 'pry'
    
    Gem::Specification.new do |spec|
      spec.name          = "my_gem"
      spec.version       = MyGem::VERSION
      spec.authors       = ["Maria Pacana"]
      spec.email         = ["maria.pacana-rubygems@gmail.com"]
      spec.description   = %q{Best gem ever}
      spec.summary       = %q{Highly recommended}
      spec.homepage      = ""
      spec.license       = "MIT"
    
      spec.files         = `git ls-files`.split($/)
    
      spec.add_development_dependency "bundler", "~> 1.3"
      spec.add_development_dependency "rake"
      spec.add_development_dependency "pry"
    end
    
  3. Run bundle to install any gems that your gem relies on.

  4. Update my_gem.rb with the code that you want to share.

    If everything fits in one file, go to lib/my_gem.rb, where Bundler’s automatically created the module MyGem. Add your methods to this module.

    require "my_gem/version"
    
    module MyGem
      def self.happy_new_year
        puts "Happy New Year!"
      end
    end
    

    If your code involves classes or modules that are spread out across multiple files, you can place them in lib/my_gem/ and then require them from lib/my_gem.rb.

  5. git commit, if you haven’t done so already. This is necessary because the gemspec that Bundler generated uses git ls-files to figure out which files belong in your gem. (git ls-files lists the files in Git’s index, or staging area.)

  6. Build your gem using gem build my_gem.gemspec.

Using Your Gem

You can either push your gem up to RubyGems or continue to test it out locally.

  1. Pushing to RubyGems

    Create a RubyGems account, if you haven’t done so already. Next, push your gem up to RubyGems with gem push my_gem-0.0.1.gem.

  2. Using your gem locally.

    Use rake install to make your gem available throughout your system. You can also use bundle exec pry if you want to test your gem out in a REPL.