Introduction to Continuuity Reactor¶
The Challenge of Big Data Applications¶
The amount of data being generated by businesses and consumers is compounding exponentially. Applications are becoming increasingly complex and data-intensive as developers try to extract value from this enormous trove of information. These applications, known as Big Data applications, need to scale with the unpredictable volume and velocity of incoming data, even at hundreds of petabytes or exabytes, without requiring the developer to re-architect the deployment infrastructure. Building Big Data applications is challenging on many fronts.
Steep learning curve¶
As an application developer building a Big Data application, you are primarily concerned with four areas:
- Data collection framework
- Data processing framework
- Data storage framework
- Data serving framework
There are many technology frameworks from which to choose in each of these four areas; data storage alone runs the gamut from open-source NoSQL projects to proprietary relational databases and can require you to learn CAP theorem concepts and understand distributed systems principles. Evaluating the pros and cons of each of these frameworks, learning to use them effectively, making them work with disparate use cases from realtime to batch processing, and operating them in production is a daunting task.
No integrated framework, numerous integration points¶
As an application developer, one of the main challenges of building a Big Data application is that you have to focus not only on the application layer of code but also on the infrastructure layer. As highlighted above, you first make choices about the underlying technology frameworks, then spend time integrating the different pieces of technology together, all before you even start building your application. Each of the technology frameworks comes with its own APIs, making it harder to integrate them quickly.
Lack of development tools¶
Big Data application development involves dealing with technology frameworks in a distributed system environment, and there is no development framework that makes it easy to develop, test and debug these types of applications. Debugging is especially difficult in a distributed environment. Sometimes you have no choice but to scan through hundreds of lines of log files on multiple systems to debug your application.
No monitoring solutions¶
Once your application is ready for production, you’ll need to monitor and manage it. Operability of each of the technology frameworks presents its own set of challenges. A lack of proper tools makes application operations a full-time job.
Continuuity Reactor Overview¶
Under the covers, Continuuity Reactor™ is a Java-based middleware solution that abstracts the complexities and integrates the components of the Hadoop ecosystem (YARN, MapReduce, HBase, ZooKeeper, etc.). Simply stated, Reactor behaves like a modern-day application server, distributed and scalable, sitting on top of a Hadoop distribution (such as CDH, HDP, or Apache). It provides a programming framework and scalable runtime environment that allows any Java developer to build Big Data applications without having to understand all of the details of Hadoop.
Without a Big Data middleware layer, a developer has to piece together multiple open source frameworks and runtimes to assemble a complete Big Data infrastructure stack. Reactor provides an integrated platform that makes it easy to create all the elements of Big Data applications: collecting, processing, storing, and querying data. Data can be collected and stored in both structured and unstructured forms, processed in real-time and in batch, and results can be made available for retrieval, visualization, and further analysis.
Continuuity Reactor aims to reduce the time it takes to create and implement applications by hiding the complexity of these distributed technologies with a set of powerful yet simple APIs. You don’t need to be an expert on scalable, highly-available system architectures, nor do you need to worry about the low-level Hadoop and HBase APIs.
Full Development Lifecycle Support¶
Reactor supports developers through the entire application development lifecycle: development, debugging, testing, continuous integration and production. Using familiar development tools like Eclipse and IntelliJ, you can build, test and debug your application right on your laptop with a Local Reactor. Utilize the application unit test framework for continuous integration. Deploy it to a development cloud (Sandbox Reactor) or production cloud (Enterprise Reactor) with a push of a button.
Easy Application Operations¶
Once your Big Data application is in production, Continuuity Reactor is designed specifically to monitor your applications and scale with your data processing needs: increase capacity with a click of a button without taking your application offline. Use the Reactor dashboard or REST APIs to monitor and manage the lifecycle and scale of your application.
Now, let’s talk about the components within Reactor. Continuuity Reactor provides four basic abstractions:
- Streams for real-time data collection from any external system;
- Processors for performing elastically scalable, real-time stream or batch processing;
- DataSets for storing data in simple and scalable ways without worrying about details of the storage schema; and
- Procedures for exposing data to external systems through stored queries.
These are grouped into Applications for configuring and packaging.
Applications are built in Java using the Continuuity Core APIs. Once an application is deployed and running, you can easily interact with it from virtually any external system by accessing the Streams, DataSets, and Procedures using the Java APIs, REST or other network protocols.
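The way these four abstractions fit together can be sketched in plain Java, with no Reactor APIs involved: a queue stands in for a Stream, a consumer loop for a Processor, a map for a DataSet, and a lookup method for a Procedure. Every class and method name below is illustrative only; none of them are part of the Continuuity Core APIs.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Conceptual analogue of the four Reactor abstractions.
// Names are illustrative only, not Continuuity API classes.
public class PipelineSketch {
    private final Queue<String> stream = new ArrayDeque<>();      // Stream: ordered event intake
    private final Map<String, Integer> dataSet = new HashMap<>(); // DataSet: simple key/value store

    // Stream: external systems push raw events in.
    public void ingest(String event) {
        stream.add(event);
    }

    // Processor: drain the stream and aggregate counts into the DataSet.
    public void process() {
        String event;
        while ((event = stream.poll()) != null) {
            dataSet.merge(event, 1, Integer::sum);
        }
    }

    // Procedure: a stored query exposing results to external callers.
    public int countOf(String key) {
        return dataSet.getOrDefault(key, 0);
    }

    public static void main(String[] args) {
        PipelineSketch app = new PipelineSketch();
        app.ingest("login");
        app.ingest("login");
        app.ingest("error");
        app.process();
        System.out.println(app.countOf("login")); // prints 2
    }
}
```

In the real platform each of these pieces is distributed and durable; the sketch only shows the direction of data flow from collection through processing and storage to query.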
In the next section, we will compare three application architectures and their pros and cons. This will give you a good understanding of the benefit of architecting Big Data applications using Continuuity Reactor.
Architecture Comparison: Building a Big Data Application¶
Consider the problem of building a real-time log analytics application that takes access logs from Apache™ web servers and computes simple analyses on the logs, such as throughput per second, error rates, and the top referral sites.
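The per-event computations involved are straightforward; the architectural question is how to run them at scale. As a reference point for what the application computes, here is a minimal, self-contained Java sketch of two of the analyses over Apache combined-log-format lines. The class and method names are our own for illustration.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch of the analyses described above, over Apache
// combined-log-format lines. Not Reactor code; plain JDK only.
public class AccessLogStats {
    // Captures the HTTP status code (group 1) and the referrer (group 2).
    private static final Pattern LINE = Pattern.compile(
        "\\S+ \\S+ \\S+ \\[[^\\]]+\\] \"[^\"]*\" (\\d{3}) \\S+ \"([^\"]*)\" \"[^\"]*\"");

    // Fraction of parseable requests whose status code is 4xx or 5xx.
    public static double errorRate(List<String> lines) {
        int errors = 0, total = 0;
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (!m.matches()) continue;
            total++;
            char c = m.group(1).charAt(0);
            if (c == '4' || c == '5') errors++;
        }
        return total == 0 ? 0.0 : (double) errors / total;
    }

    // Referrer that appears most often across the given lines.
    public static String topReferrer(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (m.matches()) counts.merge(m.group(2), 1, Integer::sum);
        }
        return counts.entrySet().stream()
            .max(Comparator.comparingInt(Map.Entry::getValue))
            .map(Map.Entry::getKey)
            .orElse("");
    }
}
```

The three architectures compared below all perform computations of roughly this shape; they differ in how the log lines are collected, where this logic runs, and how the results are stored and served.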
Traditional Database Log Analysis Framework¶
A traditional architecture involves a log collector (Custom ETL) that gathers logs from different application servers or sources and then writes them to a database. A reporting framework (OLAP/Reporting Engine) then acts as the processing layer, aggregating the log signals into meaningful statistics and information.
This is a good example of an application architecture that cannot scale with unpredictable volume and velocity of data. The custom ETL (extract, transform, load) framework extracts data with a log collector, transforms the logs with simple filtering and normalization, and loads the events into the database.
The disadvantages of this approach include:
- Complexity of the application increases when processing large volumes of data
- The architecture will not be horizontally scalable
- Producing results in realtime at high-volume rates is challenging
Apache Hadoop®-based Log Analysis Framework¶
To achieve horizontal scalability, the database architecture of the preceding design has evolved to include scalable log collection, processing and storage layers.
One of the most commonly used architectural patterns consists of custom ETL and log aggregators using MapReduce, a realtime stream processor such as Storm as the data processing layer, Apache HDFS/HBase™ as the storage layer for results, and a custom reporting engine that reads the computed results and creates visualizations for a web browser. This is just a summary of the many components required to implement this solution. (Don’t worry if you are not familiar with these technology frameworks.)
The disadvantages of this approach include:
- Steep learning curve
- Difficulty integrating the different systems
- Lack of development tools
- Operational complexity of the composite software stack
- No single unified architecture
Continuuity Reactor Log Analysis Framework¶
Designing Big Data applications using Continuuity Reactor™ provides a clear separation between infrastructure components and application code.
Reactor functions as a middle-tier application platform, exposing simple, high-level abstractions to perform data collection, processing, storage and query. Logs are collected by Streams, while Flows do basic aggregation and realtime analysis. Advanced, off-line aggregation is performed by MapReduce jobs and Workflow components. Procedures provide stored queries. The application can now be scaled independently of the underlying infrastructure.
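The kind of realtime aggregation a Flow performs, such as the throughput-per-second metric from our example, can be illustrated with a plain-Java, flowlet-style counter that buckets events into tumbling one-second windows. This is a conceptual sketch under our own names, not the Continuuity Flow API.

```java
import java.util.HashMap;
import java.util.Map;

// Flowlet-style sketch: count events per epoch second to derive throughput.
// Plain JDK only; none of these names come from the Continuuity APIs.
public class ThroughputCounter {
    private final Map<Long, Long> perSecond = new HashMap<>();

    // Invoked once per incoming log event, as a flow processing node would be.
    public void onEvent(long epochMillis) {
        perSecond.merge(epochMillis / 1000, 1L, Long::sum);
    }

    // Number of requests observed during the given epoch second.
    public long requestsAt(long epochSecond) {
        return perSecond.getOrDefault(epochSecond, 0L);
    }

    public static void main(String[] args) {
        ThroughputCounter counter = new ThroughputCounter();
        counter.onEvent(1000);  // falls in second 1
        counter.onEvent(1500);  // falls in second 1
        counter.onEvent(2100);  // falls in second 2
        System.out.println(counter.requestsAt(1)); // prints 2
    }
}
```

In Reactor, the state held by `perSecond` would live in a DataSet rather than an in-process map, which is what lets the processing layer scale out and recover without losing counts.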
The advantages of this approach include:
- A single unified architecture to perform data collection, processing, storage and query, with interoperability designed into the framework.
- Horizontal scalability is derived from the underlying Apache Hadoop layer, while the Continuuity Reactor APIs reduce the application complexity and development time.