Big Data – It’s a zoo out there

Even the data analytics darling of today could be extinct tomorrow – how to tame the beast that is Big Data with the right IT skills and tools.

Hadoop became a rock star in 2014, emerging into mainstream IT from relative obscurity and being recognized by analysts in formal market analyses. But just as important as Hadoop itself is the plethora of other tools in the ecosystem, most of them also fostered by the influential Apache Software Foundation.

The revolution in data analytics we see today just would not have happened without the confluence of open source software and very cheap processing power, whether that’s cloud or commodity servers in-house. Those two forces were like the finger of God in the software world, kicking off the equivalent of a Cambrian explosion of engineering creations.

My illustration below gives a brief overview of some of the major parts of the Hadoop ecosystem, but there are actually many others; this was all I could fit easily on one PowerPoint slide for a recent talk I gave on the subject.

[Illustration: The Hadoop ecosystem]

Large animal pictures

The peculiar, perhaps Indian-sounding name Hadoop was taken from a toy elephant belonging to the creator’s son, hence also the logo. Following this theme, Mahout is an Indian term for an elephant keeper, the person who leads and maintains control over the elephant. And Ambari is the name of a special sort of howdah, the seat or throne that elephants can carry on their backs in India.

It’s this complexity, with its implication of arcane knowledge known only to insiders, that is still holding many companies back from being successful. Yes, we have yet another IT skills shortage, and we will be fighting over the best talent in this area for a while yet, probably until more tools emerge that either bring order to the chaos or entirely remove the need for the lower-level knowledge.

With all this complexity it really is a zoo out there, hence the need for Apache ZooKeeper, a coordination service that allows you to track the configuration data for all these components and ensure that you maintain the connections between them as you move systems and components around or move new projects into production.

Natural selection or genetic engineering?

Great diversity is always indicative of creative change – the evolutionary forces are certainly at work here. Many new species and varieties appear constantly and certainly some of the creations we see today will be extinct tomorrow. Preserved perhaps, stuffed and inactive in a museum of software, but no longer a part of the living zoology.

We are already seeing a decline in the use of the original MapReduce process and growing use of SQL layers to process Hadoop data. Even Hadoop itself, today’s data analytics darling, could be extinct tomorrow, displaced in the dominant gene pool by Pachyderm, software that is related to Hadoop only by the allusion in its name. Pachyderm is an exciting new startup that uses Docker containers to store the data and is built on CoreOS for the processing infrastructure.
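To make the shift from MapReduce to SQL layers a little more concrete, here is a minimal sketch in plain Python. It is not Hadoop code: the sample records and function names are hypothetical, and the standard-library sqlite3 module merely stands in for a SQL-on-Hadoop layer. It computes the same per-page totals twice, once as explicit map, shuffle and reduce steps and once as a single declarative query.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (page, visits) log records.
RECORDS = [("home", 3), ("pricing", 1), ("home", 2), ("docs", 4), ("pricing", 5)]


def map_phase(records):
    """Emit one (key, value) pair per input record."""
    for page, visits in records:
        yield page, visits


def shuffle_phase(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in groups.items()}


def totals_mapreduce(records):
    """MapReduce style: the programmer spells out every stage."""
    return reduce_phase(shuffle_phase(map_phase(records)))


def totals_sql(records):
    """SQL style: one declarative query; an in-memory SQLite database is the stand-in engine."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE visits (page TEXT, visits INTEGER)")
        conn.executemany("INSERT INTO visits VALUES (?, ?)", records)
        rows = conn.execute(
            "SELECT page, SUM(visits) FROM visits GROUP BY page"
        ).fetchall()
    finally:
        conn.close()
    return dict(rows)


if __name__ == "__main__":
    print(totals_mapreduce(RECORDS))
    print(totals_sql(RECORDS))
```

Both functions return the same totals. The difference is how much plumbing the analyst has to write and understand, which is one plausible reason SQL layers are gaining ground.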

Perhaps saying “It’s a zoo out there” is an understatement; really it is more like an actual jungle, where only the fittest will survive this initial bloom of new life. Hearing this, the timid may well decide that they don’t want to come outside to play; they will stay indoors with their RDBMS and traditional data warehouses. I suspect the Dodo did that!

If you want to avoid becoming a fossil yourself, you cannot hang back; now is the time for IT to learn this stuff and for lines of business (LOBs) to start demanding access to it via their IT departments, or to simply bypass IT and start playing with it on the cloud somewhere.

IT as zookeeper or ringmaster

So how does IT tame this jungle, circus or whatever metaphor you like best for this wild ride? How can it manage this diversity, giving the LOBs the tools that will drive a competitive edge for the company while containing costs at the same time?

CA Technologies has the answers and will be talking about them at the Gartner BI and Analytics event in March. Be there or risk becoming a stony artifact of your former self!

How are you taming big data within your organization? Leave me a comment below.


David Hodgson is no longer working at CA Technologies. In his most recent role he…
