-iom-

Split (Ruby MRI) Part 1.

posted: 21 Nov 2015

As I alluded to in my last post, I wanted to do an academic exercise that would strengthen my understanding of multiple languages. It may be a bit ambitious to say that I will be able to apply this same exercise to every flavor out there, but I will try my best to cover languages from opposite ends of the spectrum and to write about my experience and any takeaways from the exercise.

Why do this?

Because I like to tinker and level up. I also don’t feel comfortable just saying that something works without knowing why it works or doesn’t work.

I’ve decided to start with Ruby 2.2.3 (MRI) because it’s the language that I have the most experience with.

This exercise is, of course, for illumination and understanding. So if someone in the community notices any incorrect assumption or assertion, by all means guide me back to the straight and narrow.

Exercise

Duplicate Linux’s existing split command, which splits a file into multiple files based on certain criteria. For the sake of this exercise I’m going to focus on splitting files by line count. And for a little extra credit, it would be great if I could beat the performance of the original command.
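To make the goal concrete, here is a minimal sketch of splitting by line count in Ruby. It is not the csvspltr source; the method name `split_by_lines`, its keyword arguments, and the default prefix are hypothetical and chosen only for illustration. The "aa", "ab", ... suffixes mimic the naming that split uses by default.

```ruby
# Hypothetical helper for illustration only (not taken from csvspltr).
# Writes every `lines_per_file` lines of `path` into a new file named
# prefix + "aa", prefix + "ab", ... and returns the list of files written.
def split_by_lines(path, lines_per_file: 1000, prefix: "x")
  suffix = "aa"
  out = nil
  written = []

  # File.foreach streams one line at a time, so the whole input is never
  # held in memory.
  File.foreach(path).with_index do |line, index|
    if (index % lines_per_file).zero?
      out.close if out                # finish the previous chunk
      name = "#{prefix}#{suffix}"
      out = File.open(name, "w")
      written << name
      suffix = suffix.succ            # "aa" -> "ab" -> ... -> "az" -> "ba"
    end
    out.write(line)
  end

  written
ensure
  # Close the last chunk, even if an error interrupted the loop.
  out.close if out && !out.closed?
end
```

Calling `split_by_lines("data.csv", lines_per_file: 2)` on a five-line file would be expected to produce three files, the last one holding the single leftover line. Unlike the real command, this sketch does not stop when the two-letter suffixes run out.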

Enter csvspltr

https://github.com/dsalazar32/csvspltr/blob/master/ruby/csvspltr

I was able to successfully reproduce the behavior of Linux’s split command, but I was not too pleased with the outcome when I compared the performance of my implementation against the native one, as you will notice below.

The following is the result of splitting a large CSV (Sample Datasets from STAR Experiment).

Ruby

Ruby with 1 thread

Ubuntu native split

What you will immediately notice is Ruby being completely smoked by the OS-native implementation. What’s even more telling is that threading did absolutely nothing to help the performance of the split operation. One could even say that it performed worse than just running on the main thread. Why did this attempt at parallelism fail?

Oh hai GIL!

So Ruby MRI comes with a catch: the Global Interpreter Lock (GIL). It’s the mechanism that adds thread safety to MRI.

“So if you have 8 threads busily working on a 8-core machine, only one thread and one core will be busy at any given time. The GIL exists to protect Ruby internals from race conditions that could corrupt data. There are caveats and optimizations, but this is the gist.”

The quoted snippet is from an article appropriately named “Nobody Understands the GIL”, by Jesse Storimer. It’s a good read, and I suggest that anyone trying to understand the why read it. But in a nutshell, the GIL keeps Ruby threads from executing Ruby code in parallel. I added some Ruby profiling flags to investigate this, and my findings are telling. If you look at split-prof.txt you will notice that there is no wait time at any point of the execution, but if you look at split-prof-threaded.txt you will see interrupts, manifesting themselves as wait times peppered throughout the lifetime of the threads. Conclusion: the forced, nondeterministic mutex is expensive.
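A small experiment shows the effect in isolation. This is a sketch, not part of csvspltr: the method names `count_to`, `run_sequential`, and `run_threaded`, the job count, and the loop size are all made up for illustration. It runs the same CPU-bound work once on the main thread and once spread across threads, and times both.

```ruby
require "benchmark"

# CPU-bound busy work in pure Ruby. A thread running this needs the GIL
# for the whole loop, so other Ruby threads cannot make progress in parallel.
def count_to(n)
  total = 0
  n.times { |i| total += i }
  total
end

# Run the jobs one after another on the calling thread.
def run_sequential(jobs, n)
  Array.new(jobs) { count_to(n) }
end

# Run each job in its own thread and collect the results.
def run_threaded(jobs, n)
  Array.new(jobs) { Thread.new { count_to(n) } }.map(&:value)
end

if __FILE__ == $PROGRAM_NAME
  n = 5_000_000 # arbitrary size, assumed large enough to measure
  sequential = Benchmark.realtime { run_sequential(4, n) }
  threaded   = Benchmark.realtime { run_threaded(4, n) }
  puts format("sequential: %.2fs, threaded: %.2fs", sequential, threaded)
end
```

On MRI I would expect the two timings to land close together, with the threaded run sometimes slightly slower because of the thread switching. The exact numbers depend on the machine and the Ruby version, so treat them as a rough indication rather than a benchmark.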

Well, thanks for stopping by. If you want to give the code this article is profiling a once-over, or even squeeze out some more performance improvements, go ahead and fork csvspltr.

Suggested reads:

And for my next trick I will try this exercise with the following language…

Golang. Bye bye GIL.