Project Description

A Short History of Big Data

The term ‘Big Data’ has been in use since the early 1990s. Although it is not known exactly who first used the term, most people credit John R. Mashey (who at the time worked at Silicon Graphics) with making it popular.[i] Big Data is now a well-established knowledge domain, both in academia and in industry.

To understand how Big Data became so popular, it is important to place it in its historical perspective. From a knowledge domain perspective, Big Data combines the very mature domain of statistics with the relatively young domain of computer science. As such, it builds upon the collective knowledge of mathematics, statistics and data analysis techniques in general.

Ever since the early beginnings of civilization, people have tried to use ‘data’ towards better decision making, or to gain a competitive (or military) advantage. This quest can even be dated back to the ancient Egyptians and the Roman Empire. The famous Library of Alexandria, which was established around 300 B.C., can be considered as a first attempt by the ancient Egyptians to capture all ‘data’ within the empire. It is estimated that the library consisted of 40,000 to 400,000 scrolls (which would be the equivalent of around 100,000 books).[ii] Even the ancient leaders of the world realized that combining different data sources could result in an advantage over other competing empires.

Other well-documented examples of early data analysis come from the Roman Empire. The ancient Roman military used very detailed statistical analysis to ‘predict’ at which border an enemy insurgency was most likely. Based on these analyses, they were able to deploy their armies in the most efficient way possible. It is not a stretch to consider these calculations one of the earliest forms of ‘predictive’ data analysis. And again, these analysis techniques provided the Roman military with an advantage over other armies.

In order to understand the world of Big Data, it is therefore important to realize that most techniques that are used today (from predictive algorithms to classification techniques) were developed centuries ago, and that Big Data continues to build on the work of some of the greatest minds in history. The key aspect that has changed, of course, is the availability and accessibility of massive quantities of data. Whereas up until the 1950s most data analysis was done manually and on paper, we now have the technology and capability to analyse terabytes of data in a split second.

Especially since the beginning of the 21st century, the volume and speed with which data is generated have changed beyond human comprehension. The total amount of data in the world was 4.4 zettabytes in 2013. That is set to rise steeply to 44 zettabytes by 2020.[iii] To put that in perspective, 44 zettabytes is equivalent to 44 trillion gigabytes. Even with the most advanced technologies today, it is impossible to analyse all this data. The need to process these increasingly large (and unstructured) data sets is how traditional data analysis transformed into ‘Big Data’ in the last decade.
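The conversion quoted above can be checked with a few lines of arithmetic. The sketch below assumes decimal (SI) units, in which one zettabyte is 10^21 bytes and one gigabyte is 10^9 bytes; the helper name is illustrative and not part of any library.

```python
# Decimal (SI) storage units expressed in bytes.
# Assumption: the figures in the text use SI prefixes, not binary ones (ZiB, GiB).
BYTES_PER_GIGABYTE = 10**9
BYTES_PER_ZETTABYTE = 10**21


def zettabytes_to_gigabytes(zettabytes):
    """Convert a whole number of zettabytes to gigabytes (hypothetical helper)."""
    return zettabytes * (BYTES_PER_ZETTABYTE // BYTES_PER_GIGABYTE)


gigabytes_2020 = zettabytes_to_gigabytes(44)
one_trillion = 10**12

# 44 zettabytes expressed in trillions of gigabytes.
print(gigabytes_2020 // one_trillion, "trillion gigabytes")
```

Under these assumptions, one zettabyte is one trillion gigabytes, so the forecast of 44 zettabytes corresponds to 44 trillion gigabytes.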

Figure: Data and the volume of data in Perspective (source: MyNASAData)

The evolution of Big Data can roughly be subdivided into three main phases.[iv] Each phase was driven by technological advancements and has its own characteristics and capabilities. In order to understand the context of Big Data today, it is important to understand how each of these phases contributed to the modern meaning of Big Data.

Big Data Phase 1 – Structured Content

Data analysis, data analytics and Big Data originate from the longstanding domain of database management. They rely heavily on the storage, extraction and optimization techniques that are common for data stored in Relational Database Management Systems (RDBMS). The techniques that are used in these systems, such as Structured Query Language (SQL) and the extraction, transformation and loading (ETL) of data, started to professionalize in the 1970s.
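To make the two techniques named above more concrete, the following is a minimal sketch of an ETL step followed by an SQL query, using Python's built-in sqlite3 module as a stand-in for a full RDBMS. The sample data, table name and column names are invented for illustration only.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; the customers and amounts are made up.
RAW_CSV = """customer,amount
alice, 10.50
bob,3.25
alice,4.50
"""


def extract(text):
    """Extract: read rows from the raw CSV text."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim whitespace and convert amounts to integer cents."""
    return [
        (row["customer"].strip(), round(float(row["amount"]) * 100))
        for row in rows
    ]


def load(conn, records):
    """Load: write the cleaned records into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (customer TEXT, amount_cents INTEGER)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))

# A structured query over the loaded data: total spend per customer.
totals = conn.execute(
    "SELECT customer, SUM(amount_cents) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

Keeping extraction, transformation and loading in separate functions mirrors the way ETL pipelines are usually described; a production system would add validation and error handling that this sketch leaves out.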

Database management and data warehousing systems are still fundamental components of modern-day Big Data solutions. The ability to quickly store and retrieve data from databases, or to find information in large data sets, is still a core requirement for the analysis of Big Data. Relational database management technology and other data processing technologies that were developed during this phase are still strongly embedded in the Big Data solutions from leading IT vendors, such as Microsoft, Google and Amazon. A number of core technologies and characteristics of this first phase in the evolution of Big Data are outlined in figure 3.

Big Data Phase 2 – Web Based Unstructured Content

From the early 2000s, the internet and corresponding web applications started to generate tremendous amounts of data. In addition to the data that these web applications stored in relational databases, IP-specific search and interaction logs became a source of web-based unstructured data. These unstructured data sources provided organizations with a new form of knowledge: insights into the needs and behaviours of internet users. With the expansion of web traffic and online stores, companies such as Yahoo, Amazon and eBay started to study customer behaviour by analysing click-rates, IP-specific location data and search logs, opening a whole new world of possibilities.
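As a rough illustration of this kind of log analysis, the sketch below parses a handful of web server log lines and counts requests per IP address and search terms per query. The log lines, the simplified log layout and the `q` query parameter are assumptions made for the example; real logs vary by server and configuration.

```python
import re
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Hypothetical access-log lines in a simplified Common Log Format style.
LOG_LINES = [
    '203.0.113.5 - - [10/Oct/2005:13:55:36 +0000] "GET /search?q=laptops HTTP/1.1" 200 2326',
    '203.0.113.5 - - [10/Oct/2005:13:56:01 +0000] "GET /product/42 HTTP/1.1" 200 5120',
    '198.51.100.7 - - [10/Oct/2005:13:57:12 +0000] "GET /search?q=phones HTTP/1.1" 200 1874',
    'malformed line that the parser should skip',
]

LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+$'
)


def parse(lines):
    """Yield one dict per well-formed log line; skip lines that do not match."""
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:
            yield match.groupdict()


records = list(parse(LOG_LINES))

# Interaction volume per visitor, keyed by IP address.
requests_per_ip = Counter(record["ip"] for record in records)

# Search terms taken from the assumed "q" query parameter of /search requests.
search_terms = Counter(
    parse_qs(urlsplit(record["path"]).query).get("q", [""])[0]
    for record in records
    if record["path"].startswith("/search")
)

print(requests_per_ip.most_common())
print(search_terms.most_common())
```

Malformed lines are skipped rather than raising an error, since raw logs of this kind rarely follow a single format consistently.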

From a technical point of view, HTTP-based web traffic introduced a massive increase in semi-structured and unstructured data (further discussed in chapter 1.6). Besides the standard structured data types, organizations now needed to find new approaches and storage solutions to deal with these new data types in order to analyse them effectively. The arrival and growth of social media data greatly increased the need for tools, technologies and analytics techniques that were able to extract meaningful information from this unstructured data. New techniques, such as network analysis, web mining and spatio-temporal analysis, were specifically developed to analyse these large quantities of web-based unstructured data effectively.

Big Data Phase 3 – Mobile and Sensor-based Content

The third and current phase in the evolution of Big Data is driven by the rapid adoption of mobile technology and devices, and the data they generate. The number of mobile devices and tablets surpassed the number of laptops and PCs for the first time in 2011.[v] In 2020, there are an estimated 10 billion devices that are connected to the internet. And all of these devices generate data every single second of the day.

Mobile devices not only make it possible to analyse behavioural data (such as clicks and search queries), but also provide the opportunity to store and analyse location-based GPS data. Through these mobile devices and tablets, it is possible to track movement and to analyse physical behaviour and even health-related data (for example the number of steps you take per day). And because these devices are connected to the internet almost continuously, the data they generate provides a real-time and unprecedented picture of people’s behaviour.

Simultaneously, the rise of sensor-based internet-enabled devices is increasing the creation of data to even greater volumes. In what has become known as the ‘Internet of Things’ (IoT), millions of new TVs, thermostats, wearables and even refrigerators are connected to the internet every single day, providing massive additional data sets. Since this development is not expected to stop anytime soon, it could be stated that the race to extract meaningful and valuable information out of these new data sources has only just begun. A summary of the evolution of Big Data and its key characteristics per phase is outlined in figure 3.

Three Phases in the Evolution of Big Data

Figure: The Three major Phases in the evolution of Big Data

[i] Mashey, J.R., 1997, October. Big Data and the next wave of InfraStress. In Computer Science Division Seminar, University of California, Berkeley.

[ii] Wiegand, W.A. and Donald Jr, G., 2015. Encyclopedia of library history (Vol. 503). Routledge.

[iii] Gantz, John, and David Reinsel. “The digital universe in 2020: Big data, bigger digital shadows, and biggest growth in the far east.” IDC iView: IDC Analyze the future 2007.2012 (2012): 1-16.

[iv] Chen, H., Chiang, R.H. and Storey, V.C., 2012. Business intelligence and analytics: From big data to big impact. MIS quarterly, pp.1165-1188.

[v] Economist. 2011. “Beyond the PC,” Special Report on Personal Technology, October 8, (http://www.economist.com/node/21531109)