What is Big Data Architecture?
This post provides an overview of the fundamentals of Big Data architecture. We start with the NIST Big Data Reference Architecture (NBDRA) and then cover the basics of distributed storage and processing. The post ends with an overview of the Hadoop open source software framework.
Anyone studying Big Data should have a basic understanding of how Big Data environments are designed and operated in enterprises, and how data flows through the different layers of an organization. Knowing the fundamentals of Big Data architecture helps system engineers, data scientists, software developers, data architects, and senior decision makers see how Big Data components fit together, and develop or source Big Data solutions.
The NIST Big Data Reference Architecture
To benefit from the potential of Big Data, it is necessary to have the technology in place to analyse huge quantities of data. Since Big Data is an evolution from ‘traditional’ data analysis, Big Data technologies should fit within the existing enterprise IT environment. For this reason, it is useful to have a common structure that explains how Big Data complements and differs from existing analytics, Business Intelligence, databases and systems. This common structure is called a reference architecture.
A reference architecture is a document or set of documents that a project manager or other interested party can consult for best practices. Within the context of IT, a reference architecture can be used to select the best delivery method for particular technologies. It documents such things as hardware, software, processes, specifications and configurations, as well as logical components and interrelationships. In summary, a reference architecture can be thought of as a resource that documents the learning experiences gained through past projects.
The objective of a reference architecture is to create an open standard that every organization can use for its benefit. The National Institute of Standards and Technology (NIST), one of the leading organizations in the development of standards, has developed such a reference architecture: the NIST Big Data Reference Architecture.
The benefits of using an ‘open’ Big Data reference architecture include:
- It provides a common language for the various stakeholders;
- It encourages adherence to common standards, specifications, and patterns;
- It provides consistent methods for implementation of technology to solve similar problem sets;
- It illustrates and improves understanding of the various Big Data components, processes, and systems, in the context of a vendor- and technology-agnostic Big Data conceptual model;
- It facilitates analysis of candidate standards for interoperability, portability, reusability, and extendibility.
The NIST Big Data Reference Architecture is a vendor-neutral approach and can be used by any organization that aims to develop a Big Data architecture. The architecture is shown in Figure 1 and represents a Big Data system composed of five logical functional components or roles connected by interoperability interfaces (i.e., services). Two fabrics envelop the components, representing the interwoven nature of management and of security and privacy with all five of the components. In the next few paragraphs, each component will be discussed in further detail, along with some examples.
The NIST Big Data Reference Architecture is organised around five major roles and multiple sub-roles aligned along two axes representing the two Big Data value chains: the Information Value (horizontal axis) and the Information Technology (IT; vertical axis). Along the Information Value axis, value is created through data collection, integration, analysis, and application of the results. Along the IT axis, value is created by providing networking, infrastructure, platforms, application tools, and other IT services for hosting and operating Big Data systems in support of the required data applications. At the intersection of both axes is the Big Data Application Provider role, indicating that data analytics and its implementation provide the value to Big Data stakeholders in both value chains.
The five main roles of the NIST Big Data Reference Architecture, shown in Figure 1, represent the logical components or roles of every Big Data environment and are present in every enterprise:
- System Orchestrator;
- Data Provider;
- Big Data Application Provider;
- Big Data Framework Provider;
- Data Consumer.
The two dimensions shown in Figure 1 encompassing the five main roles are:
- Management;
- Security & Privacy.
These dimensions provide services and functionality to the five main roles in the areas specific to Big Data and are crucial to any Big Data solution.
System Orchestrator
System Orchestration is the automated arrangement, coordination, and management of computer systems, middleware, and services. Orchestration ensures that the different applications, data and infrastructure components of Big Data environments all work together. To accomplish this, the System Orchestrator makes use of workflows, automation and change management processes.
A much-cited comparison to explain system orchestration, and the origin of its name, is the management of a music orchestra. An orchestra consists of a collection of different musical instruments that can all play at different pitches and at different tempos. The task of the conductor is to ensure that all elements of the orchestra play together in sync. System orchestration is very similar in that regard. A Big Data IT environment consists of a collection of many different applications, data and infrastructure components. The System Orchestrator (like the conductor) ensures that all these components work together in sync.
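The idea of a workflow that keeps components in sync can be sketched in a few lines. The following is a minimal illustration, not a description of any particular orchestration product: the step names (`ingest`, `validate`, and so on) and the `run_workflow` helper are hypothetical, and the only library used is Python's standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each step maps to the set of steps it depends on.
workflow = {
    "ingest": set(),
    "validate": {"ingest"},
    "store": {"validate"},
    "index": {"store"},
    "analyze": {"store"},
    "report": {"analyze", "index"},
}

def run_workflow(steps, actions):
    """Run every step exactly once, only after all of its dependencies have run."""
    executed = []
    for name in TopologicalSorter(steps).static_order():
        actions[name]()  # call the illustrative action registered for this step
        executed.append(name)
    return executed

# Illustrative actions: each one just records that it ran.
log = []
actions = {name: (lambda n=name: log.append(f"ran {n}")) for name in workflow}

order = run_workflow(workflow, actions)
print(order)
```

The orchestrator here only decides the order; each component keeps its own logic. Real orchestration tools add scheduling, retries and change management on top of this basic dependency ordering.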
Data Provider
The Data Provider role introduces new data or information feeds into the Big Data system for discovery, access, and transformation by the Big Data system. The data can originate from different sources, such as human-generated data (social media), sensor data (RFID tags) or third-party systems (bank transactions).
One of the key characteristics of Big Data is its variety, meaning that data can come in different formats from different sources. Input data can come in the form of text files, images, audio, weblogs, etc. Sources can include internal enterprise systems (ERP, CRM, Finance) or external systems (purchased data, social feeds). Consequently, data from different sources may have different security and privacy considerations.
As depicted in Figure 1, data is transferred between the Data Provider and the Big Data Application Provider. This data transfer typically happens in three phases: initiation, data transfer and termination. The initiation phase is started by either of the two parties and often includes some level of authentication. The data transfer phase pushes the data towards the Big Data Application Provider. The termination phase checks whether the data transfer has been successful and logs the data exchange.
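The three phases can be made concrete with a small sketch. Everything in it is an assumption chosen for illustration: the token-based authentication, the in-memory session dictionary and the SHA-256 checksum used to confirm success are not prescribed by the reference architecture, and the function names are hypothetical.

```python
import hashlib

def checksum(records):
    """Hash all records so sender and receiver can compare what was transferred."""
    digest = hashlib.sha256()
    for record in records:
        digest.update(record.encode("utf-8"))
    return digest.hexdigest()

def initiate(token, trusted_tokens):
    """Initiation: authenticate the provider before any data is sent."""
    if token not in trusted_tokens:
        raise PermissionError("provider not authenticated")
    return {"received": [], "log": []}

def transfer(session, records):
    """Data transfer: push records towards the application provider."""
    for record in records:
        session["received"].append(record)

def terminate(session, expected_checksum):
    """Termination: check that the transfer succeeded and log the exchange."""
    ok = checksum(session["received"]) == expected_checksum
    session["log"].append({"records": len(session["received"]), "success": ok})
    return ok

# Illustrative data: two made-up transaction records.
records = ["txn-001,42.50", "txn-002,17.99"]
session = initiate("provider-key", trusted_tokens={"provider-key"})
transfer(session, records)
success = terminate(session, checksum(records))
print(success, session["log"])
```

Keeping the phases as separate functions mirrors the description: a failed authentication stops the exchange before any data moves, and the log entry is written whether or not the final check passes.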
Big Data Application Provider
The Big Data Application Provider is the architecture component that contains the business logic and functionality that is necessary to transform the data into the desired results. The common objective of this component is to extract value from the input data, and it includes the following activities, as defined in the NIST reference architecture:
- Collection;
- Preparation;
- Analytics;
- Visualization;
- Access.
The extent and types of applications (i.e., software programs) that are used in this component of the reference architecture vary greatly and are based on the nature and business of the enterprise. For financial enterprises, applications can include fraud detection software, credit score applications or authentication software. In production companies, the Big Data Application Provider components can be inventory management, supply chain optimisation or route optimisation software.
Big Data Framework Provider
The Big Data Framework Provider has the resources and services that can be used by the Big Data Application Provider, and provides the core infrastructure of the Big Data Architecture. In this component, the data is stored and processed based on designs that are optimized for Big Data environments.
The Big Data Framework Provider can be further sub-divided into the following sub-roles:
- Infrastructure: networking, computing and storage
- Platforms: data organization and distribution
- Processing: computing and analytics
Most Big Data environments use distributed storage and processing, together with the Hadoop open source software framework, to implement these sub-roles of the Big Data Framework Provider.
The infrastructure layer concerns itself with the networking, computing and storage needed to ensure that large and diverse formats of data can be stored and transferred in a cost-efficient, secure and scalable way. At its core, Big Data storage has three key requirements: it must handle massive quantities of data, it must keep scaling with the growth of the organization, and it must provide the input/output operations per second (IOPS) necessary to deliver data to applications. IOPS is a measure of storage performance that counts how many read and write operations a storage system completes per second. It is related to, but not the same as, the data transfer rate (throughput), which also depends on the size of each operation.
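A short calculation shows how IOPS relates to the amount of data actually delivered. The sketch below uses the common approximation that throughput equals IOPS multiplied by the size of each operation; the function names and the example figures (10,000 IOPS, 4 KB and 64 KB blocks) are illustrative assumptions, not measurements.

```python
def throughput_mb_per_s(iops, block_size_kb):
    """Approximate throughput in MB/s as IOPS x block size (1 MB = 1024 KB)."""
    return iops * block_size_kb / 1024

def required_iops(target_mb_per_s, block_size_kb):
    """Invert the same approximation: IOPS needed to reach a target throughput."""
    return target_mb_per_s * 1024 / block_size_kb

# The same IOPS figure delivers very different throughput at different block sizes.
print(throughput_mb_per_s(10_000, 4))   # 39.0625
print(throughput_mb_per_s(10_000, 64))  # 625.0
print(required_iops(625, 64))           # 10000.0
```

This is why an IOPS figure is only meaningful alongside the block size and the workload (for example, random versus sequential access) it was measured with.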
The platform layer is the collection of functions that facilitates high performance processing of data. The platform includes the capabilities to integrate, manage and apply processing jobs to the data. In Big Data environments, this effectively means that the platform needs to facilitate and organize distributed processing on distributed storage solutions. One of the most widely used platform infrastructures for Big Data solutions is the Hadoop open source framework. Hadoop provides such a successful platform infrastructure because it unifies storage (distributed storage) and processing (distributed processing) in one environment.
The processing layer of the Big Data Framework Provider delivers the functionality to query the data. Through this layer, commands are executed that perform runtime operations on the data sets. Frequently, this will be through the execution of an algorithm that runs a processing job. In this layer, the actual analysis takes place. It facilitates the ‘crunching of the numbers’ in order to achieve the desired results and value of Big Data.
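As an illustration of what such a processing job can look like, the sketch below counts words in the MapReduce style that Hadoop popularised. It is a single-process toy under stated assumptions: the two `partitions` stand in for data blocks that a real cluster would hold on different machines, and the function names are hypothetical rather than part of any Hadoop API.

```python
from collections import defaultdict
from itertools import chain

def map_phase(partition):
    """Map: emit a (word, 1) pair for every word in one partition of the data."""
    return [(word.lower(), 1) for line in partition for word in line.split()]

def shuffle(mapped_pairs):
    """Shuffle: group all emitted values by key, regardless of source partition."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}

# Two partitions stand in for blocks stored on different nodes.
partitions = [
    ["big data needs storage", "storage scales out"],
    ["processing runs near the data"],
]

mapped = chain.from_iterable(map_phase(p) for p in partitions)
counts = reduce_phase(shuffle(mapped))
print(counts["data"], counts["storage"])  # 2 2
```

Because each call to `map_phase` looks at only one partition, the map step can run where the data is stored, and only the much smaller intermediate pairs need to move between machines.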
Data Consumer
Similar to the Data Provider, the role of Data Consumer within the Big Data Reference Architecture can be an actual end user or another system. In many ways, this role is the mirror image of the Data Provider. The activities associated with the Data Consumer role include the following:
- Search and Retrieve;
- Analyze Locally;
- Data to Use for Their Own Processes.
The Data Consumer uses the interfaces or services provided by the Big Data Application Provider to get access to the information of interest. These interfaces can include data reporting, data retrieval and data rendering.