BIG DATA CASE STUDY - “Smart City data ingestion with Apache Spark”

Case Study overview

We will outline and explain in detail the implementation of a framework built on top of Spark to enable agile and iterative data discovery between legacy systems and new data sources generated by IoT devices.

The internet of things (IoT) is bringing new challenges for data practitioners. It is a network of connected objects (the “things”) that can communicate and exchange information. The concept is not new, but there is a significant shift in the range of objects that could be part of this network: very soon, it could be nearly every object on the planet. A few applications are already emerging around the internet of things, such as smart cities, smart agriculture, domotics and home automation, and smart industrial control.

The data for our case study include:

• Call Detail Records (CDRs) generated by the Telecom Italia cellular network over the Province of Trento.

• Measurements of temperature, precipitation, and wind speed/direction taken at 36 weather stations.

• Amount of current flowing through the electrical grid of the Trentino province at specific instants.

Framework overview

The primary focus of the framework is not to collect data from the devices in real time, although it can be extended to support that capability thanks to Spark.

The ultimate goal of the framework is to speed up the development effort needed to ingest new data sources and to reduce time to market for data consumers. It avoids writing a new script for every new data source and enables a team of data engineers to collaborate easily on a project using the same core engine.

Step 2: The purpose of this function is to use the Spark SQL engine to execute a predefined SQL template that contains the transformations to apply to the data. The requirements for the data to be extracted change often, and we need to avoid rewriting the entire script every time we extract a new piece of information for the data consumers. To support this, we rely on two concepts:

  • Configuration files that contain the information about the data we want to compute and where we want to store the results in Hive
  • Template files that contain the SQL queries to be executed to apply the transformation/extraction
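
To make the split between configuration and template more concrete, here is a minimal sketch in plain Python. It is not the framework's actual code: the configuration keys, the table names, the template text and the `render_query` helper are all hypothetical, chosen only to illustrate how a configuration file could fill in a SQL template. In a real job, the rendered string would then be handed to the Spark SQL engine (for example through `spark.sql(query)` in PySpark), which is left out here so the example runs without a cluster.

```python
import json
from string import Template

# Hypothetical configuration: what to compute and where to store it in Hive.
CONFIG_JSON = """
{
  "source_table": "raw.weather_measurements",
  "target_table": "curated.station_temperature",
  "columns": ["station_id", "temperature"],
  "filter": "temperature IS NOT NULL"
}
"""

# Hypothetical SQL template; the $placeholders are filled from the configuration.
SQL_TEMPLATE = Template(
    "INSERT OVERWRITE TABLE $target_table\n"
    "SELECT $columns\n"
    "FROM $source_table\n"
    "WHERE $filter"
)


def render_query(config_text: str, template: Template) -> str:
    """Build the final SQL string from a JSON configuration and a template."""
    config = json.loads(config_text)
    # Turn the list of column names into a comma-separated SELECT list.
    params = dict(config, columns=", ".join(config["columns"]))
    # substitute() raises KeyError if the template needs a key the config lacks.
    return template.substitute(params)


query = render_query(CONFIG_JSON, SQL_TEMPLATE)
print(query)
```

With this arrangement, extracting a different piece of information means editing the configuration or adding a template, while the rendering code stays untouched.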

Summary & Benefits

The framework aims to avoid writing a new script for every new data source. Spark was identified as a good candidate because it can execute various workloads and supports multiple types of computation.

The framework has been designed with flexibility in mind. The current two core modules are independent and can be replaced or customized without impacting the whole solution.
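
One way to picture this independence is sketched below in plain Python. It assumes, purely for illustration, that one module loads records and the other transforms them; the function names, the sample station identifiers and the values are hypothetical and do not come from the framework itself.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
Loader = Callable[[], List[Record]]
Transformer = Callable[[List[Record]], List[Record]]


def run_pipeline(load: Loader, transform: Transformer) -> List[Record]:
    """Core engine: it only knows that one module loads and one transforms."""
    return transform(load())


def load_sample() -> List[Record]:
    # Hypothetical loader returning made-up weather readings in Celsius.
    return [
        {"station_id": "T0001", "temperature": 12.5},
        {"station_id": "T0002", "temperature": None},
    ]


def drop_missing(records: List[Record]) -> List[Record]:
    # Transformation module: discard readings without a temperature.
    return [r for r in records if r["temperature"] is not None]


def to_fahrenheit(records: List[Record]) -> List[Record]:
    # Alternative transformation module: also convert Celsius to Fahrenheit.
    return [
        dict(r, temperature=r["temperature"] * 9 / 5 + 32)
        for r in drop_missing(records)
    ]


# Swapping the transformation module requires no change to run_pipeline.
print(run_pipeline(load_sample, drop_missing))
print(run_pipeline(load_sample, to_fahrenheit))
```

Because the engine depends only on the shape of each module, either side can be replaced or customized without touching the other.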

In addition, it enables a team of data engineers to collaborate easily on a project using the same core engine. Each member can focus on writing templates and updating configuration files, and everyone benefits from improvements to the core modules.

Many companies are looking to enrich their conventional data warehouse or data platform with new sources of information, which usually come in a variety of formats. With the rise of the internet of things, this concern will become more and more important as companies seek to ingest and derive value from these new data sets. They need to do it in an agile way:

  • Acquire the data
  • Test concepts
  • Validate and industrialize

A framework is well suited to this way of working. The key benefits are:

  • Agility & time to market – gives the ability to quickly acquire new data sources.
  • Re-usability – data engineers can focus on data integration without rewriting core functionality for every new project or proof of concept.
  • Accessibility – core ETL transformations are written in SQL, which makes the framework accessible to most data engineers.
  • Flexibility – each module of the framework can be enhanced independently, and everyone benefits from these improvements.
This entry was posted on May 23, 2020 by Admin
