Steve Eichert

Hacker, Entrepreneur, Father

Don’t Be the ‘Idea Guy’

Over the years I’ve run into “Idea Guys”. You know the type: they’ve come up with a brilliant idea that nobody in the world but them could have come up with. They’ve done all the hard work; now it’s just a matter of pounding out the code to make it a reality and become rich.

Typically “Idea Guys” are so mesmerized by their own brilliance that they’re not able to see that it’s all about execution. Many of the great companies of today weren’t the ones that had a brilliant idea no one else had thought of; they were the folks who recognized a good idea and executed like crazy to turn it into a success.

Please don’t be the “Idea Guy”. And if you’re looking for a business partner or co-founder, make sure they’re not one either. Instead, look for someone who has great ideas…or good enough ideas, but who prides themselves on executing against those ideas until they find their way to success.

Consumption Kills

Earlier this week I talked about my need to create. I’ve found that consumption is my biggest barrier to creating.

In today’s world we have endless distractions. Twitter, Facebook, Email, RSS, IM, HackerNews, and on and on. I could fill my entire day by opening up a browser tab to each of the above and refreshing…over…and over…and over. I’d find lots of interesting articles, videos, and updates from smart people. All of which would distract me from creating.

Over the last year I’ve noticed that my level of consumption has increased. I’d like to blame Twitter, or perhaps the people I follow for creating so much interesting content, but there’s really no one to blame but myself. I’ve let myself lose discipline and focus at the times when it’s most important. While waiting for a console to start up, a script to run, or a page to load, I’m drawn to my email, Twitter, Facebook, and countless other sites that do nothing but distract from what I should be focused on. All of a sudden, rather than thinking about the problem I’m working on, I’m thinking about some article, or email, or other unimportant thing that I’ve come across.

In my prior life I worked in an environment that helped prevent distractions during times of focus through pair programming. Having someone sitting next to you is a solid way to prevent distractions. The key to preventing distraction is discipline. A pair forces that discipline through peer pressure. However, discipline doesn’t need to come via a peer, leading a life of discipline can be a choice you make for yourself.

As I reflect on the past year I clearly see that I’ve been undisciplined and unfocused, and as a result haven’t produced or created to the degree that I’ve come to expect from myself.

My renewed commitment to creating starts with a renewed commitment to discipline.

Creating

Over the course of the last three years I’ve learned quite a bit about myself. Three years ago I left the comfy confines of a corporate job to go out on my own. I was lucky in that I teamed up with my pops, who had already built a client base for his consulting business; those clients would be immediate clients for consulting work of my own and, as the theory went, ideal candidates for several of the software products we were looking to create. A post about the ups and downs, all the lessons learned, and the madness that has ensued is for another day. Today I’m focused on what I’ve found to be the most important aspect of my overall happiness: creating.

With any new company there are huge resource constraints that you must work within. I’ve felt this in a big way. I’ve had a ton of ideas for things I’ve wanted to build, but all too often I’ve had to restrain myself from diving in so that we could continue to make enough money to sustain the business. The tradeoff between going big with your ideas and continuing to muck your way through the work that pays the bills is a difficult one. As a developer, I like to create. I particularly like to create new things that I’m interested in and that I think will be loved by those they’re aimed to serve. Too often I’ve found myself putting off creating. I have lots of excuses.

  • We need to pay the bills
  • A client is waiting for that all important deliverable
  • blah
  • blah, blah, blah

They all suck.

Over the last year, I’ve felt the ill effects of not creating. My motivation evaporated. I didn’t want to go to work. I stopped thinking of new ideas. Why bother when they weren’t being made real?

No matter what I do, I’m still going to have things I need to do that will prevent me from creating. The good news for me is that I have the ability to put aside all of those things and go and create. I’m starting small: picking a few features that I’ve wanted to add to one of our software products and getting them implemented this week. My goal is to continue that tradition every day by finding at least one small thing to add, improve, or tweak for the better.

I’m getting back into the habit of creating. This blog post is proof.

Using Hadoop Streaming for XML Processing

In a few previous posts I talked about a project we’re working on that involves analyzing a lot of XML documents from pubmed. We’re currently not using Hadoop to parse the raw XML; however, due to the large number of documents in pubmed and the time it takes to do the parsing, we’ve been discussing options that would let us scale the processing out to multiple machines. Since we’re already using Hadoop for analysis, I decided to poke around a bit to see if we could figure out a way to use Hadoop for the parsing of the 617+ XML documents.

After some digging I came across the Hadoop Streaming documentation, which says the following: “You can use the record reader StreamXmlRecordReader to process XML documents….Anything found between BEGIN_STRING and END_STRING would be treated as one record for map tasks.”

After a few tries I wasn’t having much success, so I continued to look for alternate options. I came across Paul Ingles’ post on Processing XML with Hadoop, which pointed me to the XmlInputFormat class in Mahout. I believe that in order to use the XmlInputFormat class from Mahout I would either need to recompile Hadoop with that class included or use a jar file for my jobs that includes that class. Since we’re writing our mappers and reducers in Ruby, I didn’t have a jar to add the class to.

In hopes that I was being stupid with the StreamXmlRecordReader, I decided to return to it and attempt to get it working. After configuring it I saw some positive things in the console as I ran my job. It did in fact look like Hadoop was breaking apart my XML documents into the appropriate chunks (using the start and end tags I specified in my config).

hadoop jar hadoop-0.20.2-streaming.jar \
   -input medline10n0515.xml \
   -output out \
   -mapper xml-mapper.rb \
   -inputreader "StreamXmlRecordReader,begin=<MedlineCitation,end=</MedlineCitation>" \
   -jobconf mapred.reduce.tasks=0

The next thing to figure out was how I should be retrieving the entire XML contents from within my mapper. With Hadoop Streaming the input is streamed in via STDIN so I attempted building up the XML myself using some mega-smart “parse” logic!

#!/usr/bin/env ruby

# Hadoop Streaming hands each record to the mapper on STDIN; we accumulate
# lines until we've seen a complete <MedlineCitation> element, then hand the
# whole chunk to convert_to_json (defined elsewhere).
xml = nil
STDIN.each_line do |line|
  line.strip!

  if line.include?("<MedlineCitation")
    xml = line                # start of a new citation record
  elsif xml
    xml += line               # keep appending while inside a record
  end

  if xml && line.include?("</MedlineCitation>")
    puts convert_to_json(xml) # emit the completed record as JSON
    xml = nil
  end
end

As you can see, I look for the start and end tags relevant to my XML, and once I have a complete document I pass the XML to the convert_to_json method. There’s definitely quite a bit of cleanup that can be done, as well as edge cases that aren’t handled (nested tags that match the root tag), but we’ve at least coerced Hadoop into doing what we want. Next up is seeing how well it works when run against the entire dataset.
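The convert_to_json method is where the real work happens and isn’t shown above. A rough sketch of what it might look like, assuming Nokogiri for the parsing and picking a handful of fields purely for illustration (the element names come from the MEDLINE XML and may need adjusting):

require 'nokogiri'
require 'json'

# Hypothetical sketch of convert_to_json: parse a single <MedlineCitation>
# element and emit the interesting bits as a one-line JSON record.
def convert_to_json(xml)
  doc = Nokogiri::XML(xml)
  {
    :pmid     => doc.at_xpath("//PMID").text,
    :title    => doc.at_xpath("//ArticleTitle").text,
    :keywords => doc.xpath("//Keyword").map { |k| k.text },
    :mesh     => doc.xpath("//MeshHeading/DescriptorName").map { |m| m.text }
  }.to_json
end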

Attempts at Analyzing 19 Million Documents Using MongoDB Map/reduce

Over the course of the last couple weeks we’ve been developing a system to help analyze the 19+ million documents in the pubmed database. In my previous post I shared details about the process that we’ve been using to bring down the ~617 zipped XML documents that contain the articles and import them into MongoDB. Today I’m going to share a few more details about our attempts at analyzing the pubmed database using the Map/Reduce capabilities MongoDB offers.

After completing the download, unzip, parse, and load steps required to get the pubmed articles into our MongoDB instance we set out to use the map/reduce capabilities in MongoDB to do analysis and aggregation. Our initial work has focused on the keywords and MESH headings within pubmed articles, as well as on the relationships between authors within pubmed. Our end goal is to have a profile for every author who has published an article in pubmed with details about what keywords and MESH headings appear most within the articles they publish, as well as who they commonly co-author articles with.

In order to build this profile we set out to write a map/reduce job to count the number of articles written by each author by keyword. Our job writes the results of the map/reduce job to a named collection.

require 'mongo'
include Mongo   # the old Ruby driver exposes Connection under the Mongo module

connection = Connection.new config["mongodb"]["host"]
db         = connection.db(config["mongodb"]["db"])
collection = db.collection("articles")

map = ...      # JavaScript map function (elided)
reduce = ...   # JavaScript reduce function (elided)

# Write the reduced results out to the "keywordstats" collection.
result = collection.map_reduce map, reduce,
                               :verbose => true,
                               :out => "keywordstats"

The “keywordstats” map/reduce job resulted in over a half million documents being inserted into the keywordstats collection.

# keywordstats example document
{
  _id: author_name,
  keywords: {
    keyword1: 310,
    keyword2: 21,
    keyword3: 22
  }
}
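For reference, the map and reduce functions elided above look roughly like the following. This is an illustrative sketch rather than our exact code, and it assumes each article document carries “authors” and “keywords” arrays; the functions are passed to the driver as JavaScript strings.

# Illustrative sketch only: JavaScript map/reduce passed to map_reduce as
# strings. Assumes each article has "authors" and "keywords" arrays.
map = <<-JS
  function() {
    var keywords = this.keywords || [];
    (this.authors || []).forEach(function(author) {
      var counts = {};
      keywords.forEach(function(k) { counts[k] = 1; });
      emit(author, counts);
    });
  }
JS

reduce = <<-JS
  function(author, values) {
    var merged = {};
    values.forEach(function(counts) {
      for (var k in counts) { merged[k] = (merged[k] || 0) + counts[k]; }
    });
    return merged;
  }
JS

Note that the reduce function returns a value with the same shape the map emits, which, as discussed below, is a constraint MongoDB imposes.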

Running the keyword map/reduce analysis took approximately 30 minutes and didn’t cause us to think twice about our use of MongoDB map/reduce for our analysis. Next we moved on to analyzing MESH headings. Since MESH headings are pubmed’s official way of categorizing articles, there are a lot more articles with MESH headings, and thus a lot more crunching for MongoDB to do. The map/reduce jobs for the MESH headings were almost exactly the same as those for keywords; however, the processing took much longer due to the larger number of articles with MESH headings assigned. When all was said and done MongoDB was able to process our map/reduce jobs for MESH headings, but it took over 15 hours to complete (Note: we didn’t do any optimization work, so it’s likely this could be trimmed).

The large increase in time required to analyze the MESH headings made us start to think about what other options we might consider. However, we pressed on to our final analysis: author/co-author relationships. Our goal with the author/co-author analysis is to be able to see who authors are co-authoring with most. Additionally, we want to be able to create a network graph of all the authors within pubmed so that we can do social network analysis on the graph. In order to create the network we need to be able to figure out who has written with whom so we can create an edge between the relevant author nodes.

Since every article within pubmed has an author, and often multiple authors, we expected this bit of analysis to be the most taxing on MongoDB. Pretty soon after kicking off our author/co-author jobs we ran into problems. Due to the large number of author/co-author relationships, and the fact that a single author may co-author papers with many other authors, we were unable to get our job to run without hitting MongoDB’s document size limit.

We evaluated other map/reduce strategies that would reduce the document size, however, the limitations that MongoDB places on the mappers and reducers prevented us from implementing those alternate strategies. To be more specific, MongoDB requires the mapper and reducer to emit the same structure. From the map phase we were emitting:

  author, {coauthor1: 1} #emit for each author/co-author "pair"

And in our reduce phase we were consolidating all the co-author counts into a single hash to end up with:

{
  _id: author_name,
  value: {
    coauthor1: 31,
    coauthor2: 211,
    coauthor3: 122
  }
}

We found that some authors had so many papers, and thus so many co-authors, that we were blowing past the size limitations MongoDB places on documents. An alternate strategy we considered was changing our reduce stage to output a single author/co-author relationship with a count, rather than our initial approach, which reduced to an author with a hash containing all the co-authors and their counts. However, since we can only reduce to a single output, we would need to change our mapper to emit the author/co-author pair as the key. Our initial attempts with this approach weren’t working well, which prompted us to take another step back and consider alternate approaches.
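For the curious, that composite-key variant amounts to something like the following sketch (again with assumed field names, and not our actual code): each author/co-author pair becomes the key, so the reduce simply sums counts.

# Illustrative sketch of the composite-key variant: emit one record per
# author/co-author pair and sum the counts in reduce.
map = <<-JS
  function() {
    var authors = this.authors || [];
    authors.forEach(function(author) {
      authors.forEach(function(coauthor) {
        if (author != coauthor) { emit({author: author, coauthor: coauthor}, 1); }
      });
    });
  }
JS

reduce = <<-JS
  function(key, values) {
    return Array.sum(values);
  }
JS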

Given our needs and the amount of custom analysis we want to do against this (and other largish datasets) we decided to spend some time investigating Hadoop and Amazon Elastic Map Reduce. Our initial experiences have been very positive, and have us feeling much more confident that the technology choice (Hadoop) won’t prevent us from exploring different types of analysis.

We still feel that Mongo will be a great place to persist the output of all of our Map/Reduce steps, however, we don’t feel that it’s well suited to the type of analysis that we want to do. With Hadoop we can scale our processing quite easily, we have tremendous flexibility in what we do in both the map and reduce stages, and most importantly to us we’re using a tool that is designed specifically for the problem we’re trying to “solve”. Mongo is a nice schema free document database with some map/reduce capabilities, however, what we need for our analysis stage is a complete map/reduce framework. We’ll still be using Mongo, we’ll just be using it for what it’s good at and Hadoop for the rest.

Large Scale Data Processing With MongoDB Map/Reduce (Part 1: Background)

Over the course of the last week I’ve been working with a member of our team to develop a prototype data processing “engine” for analyzing articles within the pubmed database. The pubmed database consists of approximately 19 million articles that can be downloaded as approximately 617 zipped XML documents.

Our initial work has focused on downloading the complete dataset, pulling out the bits that we have interest in, and importing them into MongoDB. For our initial analysis we’re focusing on a subset of the details available for each article. In the future we’ll likely expand our analysis to include more details.

We started by downloading the 617 zipped XML documents from pubmed. Once downloaded we unzipped each file, parsed out the bits that we’re interested in and saved the details in a JSON file optimized for importing into MongoDB. [1]

Once all the XML files were processed and the details were saved out to a JSON file, we used the mongoimport utility to import the JSON files into MongoDB.
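As a rough illustration, the import step boiled down to looping over the generated JSON files and handing each one to mongoimport; the database and collection names and the json/ directory below are hypothetical.

# Hypothetical sketch: import each per-file JSON dump with mongoimport.
Dir.glob("json/*.json").sort.each do |file|
  system("mongoimport", "--db", "pubmed", "--collection", "articles", "--file", file)
end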

The above process was run over the course of a couple of days. The most time consuming part was the parsing of the XML files. We wrote Resque workers to handle the above so that the work could be distributed to multiple nodes running on EC2; however, I ended up running things locally so that I could test the process. Given that the pubmed database doesn’t change that often, and that we’ll rarely need to re-process the entire dataset, having it run on a single machine over the course of a couple of days will likely suffice.

After importing all the articles into MongoDB we had a pretty large MongoDB database consisting of ~18 million “documents”. With the articles loaded into MongoDB, we moved onto the next step…analyzing all 18 million documents.


[1] MongoDB likes a single JSON record on each line.
[2] This is my first blog post in ages, I need to get back into it slowly, oh so very slowly! :-)

Network Visualization on the Web

Over the course of the last couple of months I’ve been doing quite a bit of investigation and experimentation with existing network visualization libraries. There are a number of libraries available: some open source, some built specifically for the web, others meant for a desktop environment, some in Java, others in Flash, and round and round we go.

I’ve also talked to quite a few people with specific expertise in technologies for doing network visualization, ranging from Flash to JavaScript to Silverlight to Java. My conclusion thus far is that large scale network visualization (300+ nodes) is hard. Once you cross the 100 node mark, you begin to have serious problems laying out the network in a way that is usable by the user of the system the visualization sits within. Drop on top of that the desire to make the visualization interactive (zoom, click, drag, etc.), as well as the desire to have the visualization software figure out the best layout for the network itself, and you have a pretty difficult problem to solve.

I’m currently doing some prototypes myself using Silverlight. I don’t love the idea of using Silverlight since I doubt its penetration is as great as some have proclaimed, but the advantages it offers are hard to look past. As a long time .NET/C# developer I’m very comfortable with the development tools used to build Silverlight applications, as well as the language used to build them, C#. Silverlight appears to offer some pretty decent performance, and I suspect it will get better as the VM improves. The major disadvantage of Silverlight, though I don’t know how valid this is, is its lack of an existing user base. Since it’s relatively new, and not many sites use it, I suspect the installed base of Silverlight is much smaller than something like Flash.

The other piece of software that I’ve been spending a bit of time with is Graphviz. Graphviz is good at creating network visualizations, and supports a number of different layout algorithms. Unfortunately the output isn’t always great, and it most certainly isn’t very interactive. What I’m experimenting with is using Graphviz to pre-compute the network layout, and then feeding that positional information into the Silverlight visualization, as sketched below. The primary advantage is that the Silverlight app won’t have to figure out the initial layout, yet it can still handle all the nice visualization and interactivity that’s desired. The question still remains: is Silverlight up to the challenge? Or is Flash, Processing, or a pure Java applet more appropriate/capable? Only time will tell….
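A minimal sketch of that pre-computation step, assuming Graphviz’s dot is on the PATH and using its plain-text output format to pull out node coordinates (the file names here are made up):

require 'json'

# Hypothetical sketch: let Graphviz lay out the graph, then capture each
# node's x/y position so the Silverlight app can skip the layout step.
layout = `dot -Tplain authors.dot`   # plain output: one "node name x y w h ..." line per node

positions = []
layout.each_line do |line|
  next unless line.start_with?("node ")
  _, name, x, y = line.split
  positions << { :id => name, :x => x.to_f, :y => y.to_f }
end

File.open("layout.json", "w") { |f| f.write(positions.to_json) }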

Moving Gems From One Version of Ruby Enterprise Edition to Another

As mentioned in my previous post, I recently built a small internal micro app with Merb. As part of deploying that app I needed/wanted to update to the latest version of Ruby Enterprise Edition (REE) and Passenger on my slice. One of the issues I ran into while trying to update the REE version is that none of my old gems were installed in my fresh new version of REE. There may be a better way to accomplish this task, but the approach I ended up using was to modify this capistrano file (http://github.com/jtimberman/ree-capfile/tree/master) to install the gems from the old version of REE in the new version.
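The core of that approach boils down to something like the following sketch: list the gems the old REE knows about and install each one into the new REE. The paths below are hypothetical and will differ per install.

# Hypothetical sketch: copy the old REE's gem set into the new REE.
old_gem = "/opt/ruby-enterprise-1.8.6-20090610/bin/gem"   # made-up old REE path
new_gem = "/opt/ruby-enterprise-1.8.7-2010.01/bin/gem"    # made-up new REE path

`#{old_gem} list`.each_line do |line|
  next unless line =~ /^(\S+) \(/   # gem list lines look like "rake (0.8.7)"
  system("#{new_gem} install #{$1} --no-rdoc --no-ri")
end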