03.07.2019

Big Data Technologies

The phrase Big Data has been gaining traction for almost two decades now. Initially discussed only among innovators and early adopters, it has since become part of the everyday vocabulary in all kinds of businesses. With this popularity, however, came many misconceptions about what Big Data actually is, what comprises its technology ecosystem, and what sorts of scenarios its techniques can be applied to. We'll try to briefly answer those questions in this article.

What Big Data is

It all started in the late 1990s, when the founders of Google - Sergey Brin and Larry Page - were graduate students at Stanford working on a side project: a search engine that was supposed to be far better than anything else available at the time.

They managed to achieve just that thanks to a novel algorithm for ranking search results, called PageRank, which was inspired by how the relevance of research papers in academia is established to this day: by the number of citations.
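To make the ranking idea concrete, here is a minimal sketch of PageRank's power iteration in Python. The four-page link graph and the 0.85 damping factor are illustrative assumptions; this is the textbook form of the algorithm, not Google's production implementation.

```python
# Minimal PageRank sketch: a page's rank is distributed evenly among
# the pages it links to, then recombined with a damping factor.
links = {  # hypothetical four-page web: page -> pages it links to
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

damping = 0.85  # damping factor from the original PageRank paper
ranks = {page: 1.0 / len(links) for page in links}

for _ in range(50):  # power iteration until the ranks stabilize
    new_ranks = {page: (1 - damping) / len(links) for page in links}
    for page, outgoing in links.items():
        share = ranks[page] / len(outgoing)
        for target in outgoing:
            new_ranks[target] += damping * share
    ranks = new_ranks

print(ranks)  # "c" ends up highest: it has the most incoming links
```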


Yet if not for several more clever ideas and technical decisions they made while actually implementing their product, we wouldn't have Google as we know it today. These ideas later became the cornerstones of Big Data.

The basic idea, underlying the ones that followed, was to use a large number of commodity computers rather than a few huge, very expensive servers. These computers, forming a so-called cluster, communicate with each other and coordinate their work to efficiently perform computations over large volumes of data.

Speaking of large volumes of data, another concept the founders of Google came up with was that of a distributed file system, designed to deal with very large files by splitting them into blocks and replicating each block across machines in the cluster. This was a necessity: they had to keep up with the rapidly growing size of the World Wide Web.
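A minimal sketch of the core idea: files are split into fixed-size blocks, and each block is stored on several machines, so losing one machine loses no data. The 64 MB block size and threefold replication mirror the defaults published for GFS and HDFS, while the round-robin placement below is a deliberate simplification of the real placement policies.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the classic GFS/HDFS block size
REPLICAS = 3                   # each block is stored on three machines

datanodes = ["node-1", "node-2", "node-3", "node-4", "node-5"]

def place_blocks(file_size: int) -> list[list[str]]:
    """Assign every block of a file to REPLICAS distinct machines (round-robin)."""
    num_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    placements = []
    for block in range(num_blocks):
        start = block % len(datanodes)
        replicas = [datanodes[(start + i) % len(datanodes)] for i in range(REPLICAS)]
        placements.append(replicas)
    return placements

# A 200 MB file becomes four blocks (the last one partially filled),
# each stored on three of the five nodes.
for i, nodes in enumerate(place_blocks(200 * 1024 * 1024)):
    print(f"block {i}: {nodes}")
```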

With distributed, scalable, and reliable data storage in place, the last piece of the puzzle was a computation model and a framework that would let developers express their algorithms without having to deal with the complexity of the infrastructure, its scalability, and inevitable hardware failures. That model was named MapReduce.
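The canonical illustration of the model is counting words: a map function emits a (word, 1) pair for every word, and a reduce function sums the counts for each word. Below is a minimal sketch that simulates the two phases, plus the shuffle between them, in plain Python on a single machine - in a real cluster, the framework runs the same two functions across many nodes and handles failures transparently.

```python
from collections import defaultdict

def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in the input."""
    for word in document.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    """Reduce: sum all counts emitted for the same word."""
    return word, sum(counts)

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle: group intermediate pairs by key, as the framework would.
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        grouped[word].append(count)

results = dict(reduce_phase(w, c) for w, c in grouped.items())
print(results)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```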

Shortly after these ideas were described in research papers, open-source projects aiming to implement them were born at Yahoo, and they are now actively developed under the umbrella of the Apache Software Foundation. Collectively known as Apache Hadoop, these projects became the de facto synonym of Big Data.

What we have described so far happened around 15 years ago, and that is a long time, especially when it comes to technology and innovation. Since then we have observed what is quite common in the programming world whenever an interesting, promising, and potentially breakthrough idea comes to life: programmers start building on the shoulders of giants, introducing abstractions that further enrich the technology, simplify it, or make it more expressive.

Hence, there has been an explosion of libraries, frameworks, and platforms which together comprise the rich ecosystem of Big Data: Pig, Hive, Impala, HBase, Tez, or Spark - to name just a few. The sheer number of different tools can be overwhelming at first, but all of them are being developed to address a handful of broad use cases, which we would like to briefly describe next.
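To illustrate how far these abstractions go, the word count that once required a full MapReduce program shrinks to a few lines in Spark's Python API. A sketch, assuming a local Spark installation; the input path data.txt is a placeholder:

```python
from pyspark.sql import SparkSession

# Assumes a local Spark installation; "data.txt" is a placeholder path.
spark = SparkSession.builder.appName("wordcount").getOrCreate()

counts = (
    spark.sparkContext.textFile("data.txt")
    .flatMap(lambda line: line.split())  # map: one record per word
    .map(lambda word: (word, 1))         # emit (word, 1) pairs
    .reduceByKey(lambda a, b: a + b)     # reduce: sum counts per word
)

print(counts.collect())
spark.stop()
```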


What Big Data is for


The first large area of use cases is generally called Advanced Analytics. Plenty of techniques are utilized here, like Machine Learning, Artificial Intelligence, Predictions, Recommendations, Clustering, Text Mining, Anomaly/Fraud Detection, Sentiment Analysis, etc.
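As a taste of one such technique, here is a minimal anomaly detection sketch based on a simple z-score rule: flag a reading that lies more than three standard deviations from a historical baseline. The transaction amounts are made-up illustrative data, and real fraud detection systems use far more sophisticated models.

```python
import statistics

# Made-up baseline of normal transaction amounts (illustrative only).
baseline = [12.0, 15.5, 11.2, 14.8, 13.1, 12.9, 15.0, 13.7]
mean = statistics.mean(baseline)
stdev = statistics.stdev(baseline)

def is_anomaly(amount: float, threshold: float = 3.0) -> bool:
    """Flag readings more than `threshold` standard deviations from the baseline mean."""
    return abs(amount - mean) > threshold * stdev

for amount in [14.2, 980.0]:
    print(amount, "->", "anomaly" if is_anomaly(amount) else "ok")
```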

The second area we could jointly name Enterprise Data Lakes. These use cases are all about gathering data from multiple, varied sources within large organizations into a single place - usually a platform built on top of Apache Hadoop. Different parts of the business can then run their analytics and business intelligence reports by correlating data from those numerous sources.
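A minimal sketch of that correlation step, using Spark's DataFrame API over data already landed in the lake. The paths, table layouts, and column names (customer_id, region, amount) are all hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-report").getOrCreate()

# Hypothetical datasets landed in the lake from two different source systems.
customers = spark.read.parquet("/lake/crm/customers")    # customer_id, region, ...
invoices = spark.read.parquet("/lake/billing/invoices")  # customer_id, amount, ...

# Correlate the two sources: total invoiced amount per region.
report = (
    customers.join(invoices, on="customer_id")
    .groupBy("region")
    .sum("amount")
)

report.show()
spark.stop()
```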


This set of use cases is oftentimes a direct equivalent of traditional analytics built on Data Warehouses, but with advantages like scalability, cheaper hardware and software, and the potential to also do advanced analytics.

The last range of use cases is somewhat orthogonal to the ones already described, but it also comes with some original applications of its own. We're talking here about Real-Time Analytics, also called Fast Data or Streaming Data. Much of the current hype centers here, because modern businesses want to make informed decisions as fast as possible - that is, the moment the information becomes available.

Another reason for so much interest in Streaming Data is the Internet of Things (IoT). With more and more devices and sensors connected to the Internet, there is a growing need for tools that can handle and reason about large volumes of data arriving at very high speeds.
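To give a flavor of such tools, here is a minimal Spark Structured Streaming sketch that counts events per one-minute window as they arrive. It uses Spark's built-in rate source, which generates synthetic timestamped rows, so it runs without external infrastructure; a real pipeline would read from something like Kafka instead.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

# The built-in "rate" source emits synthetic (timestamp, value) rows.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Count events per one-minute window, updated continuously as data arrives.
counts = events.groupBy(window(events.timestamp, "1 minute")).count()

query = (
    counts.writeStream
    .outputMode("complete")
    .format("console")
    .start()
)
query.awaitTermination()
```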

Technology in this space has seen a lot of innovation in recent years, and more and more tools are becoming mature enough for even large enterprises to start evaluating them. For these reasons, we can only expect the adoption of products like Kafka Streams, Spark Structured Streaming, or Apache Flink to keep growing.