Hadoop is an open-source data management platform. It allows for easy processing and storage of really big datasets. It is an open-source initiative, meaning the software is free and written by many different companies.
Hadoop consists of a number of applications, but two are key and form the foundation of the platform; a storage system called the Hadoop Distributed File System (HDFS) and the process and an analysis framework called MapReduce.
HDFS is based on a distributed file system originally created by Google called the Google File system (GFS). The first 15 minutes of this Cloudera video provide a good explanation of the history and basic structure of that file system and its evolution into HDFS.
The analysis framework MapReduce is used to query and process data stored on the HDFS. This video describes the basic principles on which MapReduce works.
Built on top of HDFS and MapReduce are a whole bunch of other tools. These tools let you schedule jobs and manage resources (YARN), run SQL queries (HIVE), provide indexing for searching (SOLR), improve upon the processing techniques of MapReduce (IMPALA and SPARK) or even provide a simpler framework for writing Hadoop programs (PIG). There are others.
The open-source initiative on which HDFS, Mapreduce and the other tools are available is called the Apache Hadoop project. The Apache Software Foundation is the volunteer body that decides the direction of development and manage what tools will be developed and by whom. Individual companies that are member of the Apache Hadoop project propose new applications and then develop those openly for all. For example Hortonworks built YARN and put the code up on Apache and it is free for anyone, including competitors.
So what would you use Hadoop for?
Here are a few examples I came across. Imagine trying to store customer usage data from a pool of ATM’s being used across the country at a large bank. Or collecting information on driving and usage patterns of a connected car fleet. Or storing machine data from a large manufacturing operation. Or an oil and gas firm collecting real time minute by minute drilling, seismic or production data.
Any application where the data set is large, analysis of the data requires that it be stored, and where storage would be unmanageable in traditional database structures is data that would be conducive to Hadoop. Cloudera provides a number of use cases in this paper.
Hadoop is particularly useful for unstructured data sets. An unstructured data set is where the data doesn’t follow a particular table or columnar style. Social media data, mobile data, internet of things data coming straight from the device would all be typical examples of unstructured data. Conversely think of an Excel workbook, where data is laid out in a particular column by column metholology as an example of a structured data set.
Hadoop uses a methodology called schema on read that makes it particularly adept for unstructured data. Schema on read means that data that is read into HDFS does not need to have any particular structure. Instead the schema can be created at the time that the data is accessed and analyzed, at which time it can take on a form most suitable for the analytics being performed.
One feature of the HDFS storage system is that it doesn’t care what format the data is in. You can add data from an Oracle database, from an SQL database, or from an IoT application stream. HDFS doesn’t care, all the data can reside together. When you have a large database of hybrid data, usually somewhere on the cloud, its referred to as a data lake.
Hadoop implementation began as an on-premise extension of traditional databases. It was a response to the expanding amount of data being gathered and the unstructured nature of some of that data, which had caused requirements to surpass the capabilities of applications like Teradyne appliances.
Today, with the advent of the public cloud some customers are choosing to push data and workloads out into the cloud; to AWS or Azure for example. So far the work has been “ephemeral workloads”, which means that data is pushed out for a particular job, maybe to run analytics or reporting and then it is turned off. But as time goes on the move will likely be towards more data residing permanently on the cloud.
This paper by Accenture describes the deployment options and prices them out against one another. The figure below, taken from the paper, illustrates the range of implementation options.
While a company can implement Hadoop on its own, because of the complicated nature of its implementation (and from what I have read the bugginess of the code) they are more likely to contract support services from a company of experts to help with the integration.
This article lists some of the biggest players in Hadoop integration. Cloudera and MapR are both private start-ups, while Hortonworks is a public company. Each is involved with the development of the open-source applications as members of Apache.
Because Hadoop is open-source they are also limited with what product they can sell. Some, like MapR and Cloudera, have proprietary applications that work with Hadoop. Hortonworks, on the other hand, only distributes open-source software and generates revenue strictly through its implementation, support and maintenance services of the Hadoop infrastructure that they implement.
Below is the Hortonworks data platform. It consists of HDFS, a number of data access tools including MapReduce, and then supporting tools that let you manage, govern and support that data. Some of these tools (like YARN) were developed by Hortonworks, while others were developed by other Apache members (like MapR and Cloudera).
Why Invest in this Space?
There are a few trends that are converging that I am looking for ways to capitalize on.
First, the need to analyze large volumes of unstructured data, from social media, from mobile, and soon from Internet of Things devices, is going to continue to grow. Hadoop is the best technology for storing that data. In their Deutsche Bank Technology Conference presentation, Hortonworks said that there are already now 10x as many hadoop nodes as nodes on Teradata.
Second, as the cloud gains more acceptance as a trusted depository for proprietary data I think more of that data is going to find its way to Hadoop databases that exist on the cloud. So moving data to the cloud, and analyzing and managing that data on the cloud, is going to grow in importance.
Third, much of the data coming from unstructured sources is going to be best utilized if it can be analyzed soon after it is received. This sort of data analysis is referred to as “data in motion”, as opposed to “data at rest”, which is data that sits and accumulates over time. The traditional Hadoop systems are designed to store data at rest. However there are other tools, such as the Apache Spark and Kafka projects, that are tweaking Hadoop (in the case of Spark) or building a data in motion platform that can run parallel to Hadoop (in the case of Kafka) to handle data in motion.
Both types of data are going to need to work together. Historical data at rest gives context and provides learning while streaming data delivers timely insights. This very short video from Hortonworks has a good graphic that illustrates how data in motion can be ingested, analyzed and then given persistent storage in Hadoop.
I’ve been looking for ways that I can take advantage of these trends. The two companies that I have found so far are Hortonworks and Attunity.
I took a small position in Hortonworks after the Goldman downgrade of the stock. I added to the position as it rose after printing a solid third quarter but have kept the position relatively small as I still have some trouble with their open-source model.
The company is growing at a 40%+ top line rate. Even after the recent move to $9 it is not expensive relative to peers growing at a similar rate. The company has an enterprise value of around $400 million and given that it generated a little under $50 million in revenue in the third quarter it is trading at less than 2x forward revenue.
The Hortonworks business model is to sell subscriptions for the integration and maintenance of their Hortonworks Data Platform (HDP) and open-source product. The revenue model is recurring and prices their services on a “per node” basis, which means that as companies scale out their Hadoop infrastructure Hortonworks takes an proportionate piece of the pie.
In addition to it Hadoop platform, Hortonworks acquired a company called Onyara in 2015 that expanded their suite of data analysis and management tools to include data in motion. From this acquisition they have developed a second platform called Hortonworks Data Flow (HDF). The HDF platform is part of another Apache project called Nifi and provides
“an infrastructure to acquire data and store it, taking into account data that has to be processed quickly to have value (low latency), discarding data after it has reached its useful limit, and d provide decision making when the data coming in is coming in faster than speed of storage.”
There is a good youtube video here (it’s a little long but the first 20 minutes is worth watching) that describes how HDF works.
HDF leverages the trend towards analyzing data as close to the point of origin as possible. The vision with HDF was expressed at the Pacific Crest conference as follows:
Now [customers] want to have the ability to manage their data through the entire life cycle. From the point of origination, while its in motion, until it comes to rest and they want to be able to drive that entire life cycle. It fundamentally changes how they architect their data strategy going forward and the kind of applications and engagements they can have with their customers. As they’ve realized this in the last year its changed everything about their thinking about how they are driving their data architecture going forward starting with bringing the data under management for data in motion, landing it for data at rest and consolidating all the other transactional data. So it’s a very big mind shift that’s happening.
Hortonworks has said that they expect HDF to make up one-third of revenue next year. That is significant growth from what is currently a small base. I think one of the most interesting aspects of Hortonworks is the growth it potentially can generate from HDF that does not seem to be adequately reflected in the stock price.
There are decent Seeking Alpha articles on Hortonworks here and here but it’s the comment sections that are particularly useful. Hortonworks also attends a lot of conferences, and their presentations at Deutsche Bank, RBC , and the previously mentioned Pacific Crest are all worth listening to.
Hortonworks reminds me a bit of Apigee. It’s a recent IPO, a fast growing company, and where some of the analyst community have lost faith in their ability to continue that growth. Therefore you have a multiple that is out of sync with the growth rate, and where a stabilization or acceleration of the growth rate (something we saw in the third quarter) should mean the stock gets re-rated to at least its prior level.
The other way I am playing the evolution of big data lakes of unstructured data is with Attunity, which provides tools for data transfer and visibility.
Attunity has 3 main products. The main revenue driver, Attunity Replicate, facilitates the transfer of data across databases, data warehouses and Hadoop platforms. A second product, called Attunity Visibility, provides insights into your database by monitoring data usage, identify which databases/tables/columns are being used frequently and identifying who is using the data. A third product, Attunity Compose, automates many of the aspects of designing, building and managing your data warehouse.
Hadoop data lakes fit into Attunity’s product strength because they require large scale, heterogeneous, real time integration. One of the benefits of Replicate is its ability to transfer data from a wide variety of data sources. The company had this to say about the Hadoop opportunity on their first quarter call:
…the Hadoop environment creates huge opportunities for us to be more competitive and serve the markets much better. Customers are asking us basically to automate more and more the activities that are happening with Hadoop. So you really provided an end-to-end automation process and that’s one of the focuses we have.
Attunity has a bunch of case studies on their website that illustrate how Replicate and Visibility are used – most involve transferring data from a main hub (ie. A legal case file database that cannot be queried directly) or from operational databases (ie. Oil sands plant databases or retail location databases) for consolidation, to offload or to run workloads offsite.
There is also a very good SeekingAlpha article here that gives a revenue breakdown between products that is very useful. I would recommend making a copy of the article as I don’t know how long it will remain in front of the pay wall.
I’m less excited about Attunity than Hortonworks. Attunity faces a lot of competition in the extract-transfer-load market, they compete against Informatica, Oracle’s GoldenGate and SAP. Gartner recently named them a “challenger” in the magic quadrant (here is the report ). That means that they are not yet considered a leader in the field. In particular the report said:
While awareness of Attunity is starting to grow in this market, there remains a lack of recognition by buyers seeking data integration tooling as their enterprise standard.
Attunity has been growing Replicate revenues at around 25% but their legacy business has been shrinking and the Visibility product is not selling well so far. Compose remains a small portion of revenues.
So it’s a bit of a show-me story. I like the idea enough to take a starter position, but I would want to see some signs of accelerating adoption by large enterprises before adding. I would add at a higher price if I see that, because the opportunity with it, as with Hortonworks, is large.