Customisable Cloud Hadoop: Automating MapR on EC2 using Brooklyn

Cloudsoft were recently asked to help automate the rollout of a MapR Hadoop cluster running in a public cloud, using our open-source application management and cloud portability tools.

The result is something we think could be useful to more people, so we’re happy to release it as open source. This post describes the high-level objectives and use case. A more technical description of the solution will follow.

Cloud Hadoop: Background

For those not familiar with Hadoop, it is a tool for implementing the MapReduce computing technique, where big data problems or complex computing tasks are broken down into smaller pieces and shared out to many machines running in parallel. Put simply, it allows the creation of a supercomputer resource, without the supercomputer.
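To make the idea concrete, here is a minimal single-machine sketch of the MapReduce pattern in Python, counting words. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) are illustrative assumptions, and a thread pool stands in for the many machines a real cluster would use.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Group the emitted values by key across all chunks."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine the values for each key into a single total."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks, workers=4):
    # Each chunk is mapped independently, so the work can be shared out
    # to several workers (threads here, machines in a real cluster).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    return reduce_phase(shuffle(mapped))
```

Calling word_count(["big data big", "data problems"]) splits the job into two chunks, maps them in parallel, and merges the partial results into one count per word.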

Traditionally, Hadoop has run on ‘bare metal’ machines on premises, but public cloud Hadoop is becoming increasingly popular. Enterprises can access their ‘own supercomputer’ on IaaS or more efficient MaaS (Metal as a Service) clouds, taking advantage of massive cloud elasticity without upfront investment.

The Problem

However, setting up MapReduce clusters is complicated and time-consuming. Roll-out and management require a lot of skill and care, and small mistakes can take a long time to resolve. Doing this in a cloud adds yet another level of complexity. Combined, these issues prevent reliable, repeatable reuse, adding cost and reducing flexibility.

Key Solution Components:  Brooklyn and jclouds

Our solution uses the open-source tools jclouds and Brooklyn to reliably automate the roll-out of MapR M3. For those not familiar with these tools:

  • MapR M3 is a distribution of Hadoop Map-Reduce with a number of built-in practical features that make processing big data easier.
  • Brooklyn is a flexible multi-cloud application control plane, which simplifies deployment and management of cloud applications.
  • jclouds is a cloud abstraction layer, which allows applications to run in multiple clouds without code changes.


The design requirements for deployment were to allow:

  • fixed IP (bare metal) machines and MaaS clouds
  • other custom hardware (especially the location and configuration of disks)
  • control over startup sequence for reliability (e.g. ensuring mapr-warden isn’t started until zookeeper has settled)
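The startup-sequence requirement can be sketched as a simple readiness gate. This is an illustration in Python under stated assumptions, not the Brooklyn implementation: wait_until and start_in_order are hypothetical helpers, and zookeeper_settled and start_warden are placeholder callables that a real deployment would replace with actual health checks and service commands.

```python
import time


def wait_until(condition, timeout=60.0, interval=1.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll condition() until it returns True or the timeout expires.

    Returns True if the condition was met, False on timeout.
    """
    deadline = clock() + timeout
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def start_in_order(zookeeper_settled, start_warden,
                   timeout=60.0, interval=1.0, sleep=time.sleep):
    """Start the warden only once zookeeper reports that it has settled.

    Both arguments are hypothetical callables supplied by the caller.
    """
    if not wait_until(zookeeper_settled, timeout, interval, sleep):
        raise TimeoutError("zookeeper did not settle; mapr-warden not started")
    start_warden()
```

Failing loudly on timeout, rather than starting the warden anyway, keeps a half-ready cluster from being mistaken for a healthy one.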

Post-deployment, the requirement was to manage and change the running cluster automatically. This included:

  • adding nodes (resizing)
  • changing roles
  • monitoring for events and driving responses (such as autoscale on high load)

Before describing our solution, it is worth calling out Apache Whirr, a library for running cloud services, with excellent support for Apache Hadoop and Cloudera’s Hadoop distribution.  Whirr tackles some of the issues above, with support planned for others (Whirr JIRA 214, 221, 252).

In our case, however, we were asked to help with MapR M3 and M5.  MapR stands out from other Hadoop distributions in that it automates much of the complex multi-machine configuration that is normally required. Whirr implements some of this logic for other Hadoop distributions, but much of this is redundant in a MapR world.

There are MapR-Whirr routines available (developed by MapR), which benefit from Whirr’s multi-cloud support via jclouds and the phased install/configure bootstrap, but these routines mask a lot of the power-configuration that MapR users find essential for optimizing their deployments.

Our Solution

We have chosen to use Brooklyn as the starting point for our integration.  Primarily this was because Brooklyn’s native integration with both Whirr and jclouds gave us the ability to re-use some of the best parts of these tools:

  • Hadoop-specific client-side setup from Whirr (e.g. SOCKS proxy, and config file generation)
  • multi-cloud support in jclouds (including the latest version of jclouds, which supports fixed IP through BYON (bring your own nodes), and MaaS in development)

A secondary consideration was that Brooklyn’s “application descriptors” (recipes) can be written as Java or Groovy code. This gives end-users a natural place to add customization, including support for specific hardware, or Brooklyn policies (code which monitors autonomic sensors and drives activity through effectors).

In our Brooklyn-MapR solution we define three entities, corresponding to the node types used in standard MapR M3 deployments.  jclouds is used to provision the required machines in EC2 (or alternatively, in a large number of other target environments).  Brooklyn then assembles the machines as required, installing the necessary software and starting processes at the appropriate time.

The user enters a single command, and gets a MapR cluster. The real strength, however, is the ability to customize the Brooklyn application descriptor for real-world use cases:

  • Configurations can easily be customized, e.g. for specific hardware
  • Autonomic management exposes sensors for monitoring
  • Policies allow sensors to be linked to actions, such as waiting for license approval (a manual step in M3) before starting the warden on non-master nodes, or defining the scale-out and scale-back logic (e.g. application-specific rebalancing)

Post-deployment changes to the MapR cluster are supported. For example, a “resize” effector is exposed for creating new TaskTracker/FileServer worker nodes. Additional effectors for changing other node types, or other custom policies such as responding to load, could be added depending on what particular organizations require.
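As a rough illustration of how a policy can link a sensor to the “resize” effector, here is a small Python sketch. It does not use the Brooklyn API: WorkerCluster, AutoscalePolicy, on_sensor and the thresholds are all hypothetical names and values chosen for the example.

```python
class WorkerCluster:
    """Hypothetical stand-in for a group of worker nodes."""

    def __init__(self, size):
        self.size = size

    def resize(self, new_size):
        """Effector: change the number of worker nodes."""
        self.size = new_size
        return self.size


class AutoscalePolicy:
    """Hypothetical policy: react to a load sensor by calling resize."""

    def __init__(self, cluster, high=0.8, low=0.2, min_size=1, max_size=10):
        self.cluster = cluster
        self.high = high          # scale out above this load per node
        self.low = low            # scale back below this load per node
        self.min_size = min_size
        self.max_size = max_size

    def on_sensor(self, load_per_node):
        """Handle one sensor reading and return the resulting cluster size."""
        size = self.cluster.size
        if load_per_node > self.high and size < self.max_size:
            return self.cluster.resize(size + 1)
        if load_per_node < self.low and size > self.min_size:
            return self.cluster.resize(size - 1)
        return size
```

With a three-node cluster and max_size=4, a reading of 0.9 adds one node, and a second high reading leaves the size unchanged because the upper bound has been reached. Application-specific logic, such as rebalancing after a resize, would belong in the same policy.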

For More Information

Our code is available on GitHub, with a detailed technical post following shortly.

Brooklyn and jclouds are two components in Cloudsoft’s Multi-cloud Application Management Platform, both released under the Apache License v2.

If you have an interesting multi-cloud challenge or are interested in simplified cloud application management including the roll out of Hadoop, please get in touch.