
Several industries, including oil and gas, manufacturing, mining, and healthcare, deploy technicians and machines at remote locations. These technicians working on the ground need real-time data and guidance on multiple aspects of their work: machine operating schedules, tool wear and tear, weather forecasts, and so on. A robust strategy that empowers them with this critical data is imperative for business success. It helps them mitigate operational risks and challenges in their day-to-day activities, and it further improves the business's overall risk preparedness and performance.

The Business Case/Solution

For the heavy industries mentioned above, where machines and personnel work in remote locations, a centralized management system that reduces risk and lowers the total cost of operations is invaluable. The data generated by technicians and machines needs to be analyzed in real time to yield meaningful insights into operational patterns.

The solution falls under the popular umbrella of IoT (Internet of Things), as it involves machine sensors, cloud connectivity, Big Data, analytics, and so on.

The broad steps involved, in line with common IoT terminology, are listed below:

  • Collection of data generated by sensors at remote locations - Data Ingestion (or Harvesting)
  • Transforming and storing Data - Data Storage (or Persisting)
  • Meaningful Insights into the Data - Data Analytics (or Deriving value)
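The three stages above can be pictured with a minimal, self-contained Python sketch. The function names, sensor names, and CSV-style input format here are illustrative assumptions, not part of any specific platform:

```python
# Toy end-to-end sketch: ingest raw sensor lines, persist them,
# then derive a simple per-sensor insight (the average reading).

def ingest(raw_lines):
    """Data ingestion: parse raw "sensor,value" lines into records."""
    return [{"sensor": s, "value": float(v)}
            for s, v in (line.split(",") for line in raw_lines)]

def persist(records, store):
    """Data storage: append each record to a per-sensor log."""
    for r in records:
        store.setdefault(r["sensor"], []).append(r["value"])
    return store

def derive_value(store):
    """Data analytics: compute a per-sensor average."""
    return {s: sum(vs) / len(vs) for s, vs in store.items()}

store = persist(ingest(["pump1,10.0", "pump1,14.0", "pump2,7.5"]), {})
averages = derive_value(store)
```

In the real architecture described below, each stage maps to a dedicated system (Kafka for ingestion, HBase/Hive for storage, SQL queries for analytics) rather than in-process functions.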

Technical Solution

The technical solution is based on the open source Hadoop platform and its related technologies. The steps are listed below.

  1. Source data: The data arrives in a variety of formats, coming from human inputs, embedded machine sensors, or commit logs. Geospatial data can comply with the KML format, an international standard of the Open Geospatial Consortium for expressing geographic (latitude and longitude) information.
  2. The sensor data is gathered and stored in the cloud (Amazon AWS, Microsoft Azure, etc.).
  3. Apache Kafka - Producers: Java applications (producers) create messages from the input streams and publish them to a Kafka broker for further consumption.
  4. Apache Kafka - Broker: The broker receives the messages, in their various formats, published by the producers and stores them durably in topics; ZooKeeper coordinates the broker cluster.
  5. Apache Kafka - Consumers: Consumers subscribe to topics and read the messages from the broker, making the data available to downstream applications.
  6. Apache Storm can capture these events produced by Kafka and process them for real-time analysis.
  7. A Storm spout reads the data from the Kafka consumer and passes it to a bolt.
  8. A Storm bolt further processes the data and passes it to another bolt or persists it to storage.
  9. Storm Topology: Real-time computation on Storm is organized as a topology, a graph of computation. Each node in a topology contains processing logic, and the links between nodes indicate how data is passed around.
  10. The processed data is ingested in real time into HBase and Hive.
  11. HBase provides near real-time, random read and write access to tables (or, more accurately, 'maps') storing billions of rows and millions of columns. Once this rapidly and continuously growing IoT dataset is stored, we can perform swift lookups for analytics regardless of the data size.
  12. Hive is used for analytics on the processed data, providing a SQL engine that supports both batch and interactive query patterns.
  13. YARN allows multiple data processing engines such as interactive SQL, real-time streaming, data science and batch processing to handle data stored in a single platform. This provides a seamless integration of operational and analytical systems and a foundation on which the enterprise can build.
  14. The requisite queries and analysis can be run on the Hive or HBase databases as per the business needs.
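Step 1 mentions KML for geospatial data. As a minimal illustration using only Python's standard library, the sketch below extracts latitude/longitude pairs from a KML placemark (the sample placemark is invented; note that KML stores coordinates in longitude,latitude order):

```python
import xml.etree.ElementTree as ET

# KML 2.2 namespace, required to address the tags.
KML_NS = "{http://www.opengis.net/kml/2.2}"

# An invented sample placemark for illustration.
sample = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Placemark>
    <name>Rig 7</name>
    <Point><coordinates>-95.3698,29.7604,0</coordinates></Point>
  </Placemark>
</kml>"""

def placemark_positions(kml_text):
    """Return (name, latitude, longitude) for each Placemark."""
    root = ET.fromstring(kml_text)
    out = []
    for pm in root.iter(KML_NS + "Placemark"):
        name = pm.findtext(KML_NS + "name")
        coords = pm.findtext(KML_NS + "Point/" + KML_NS + "coordinates").strip()
        # KML order is lon,lat[,altitude]; swap to the conventional lat,lon.
        lon, lat = map(float, coords.split(",")[:2])
        out.append((name, lat, lon))
    return out

positions = placemark_positions(sample)
```

A real pipeline would run such parsing inside the producer applications before publishing the extracted positions to Kafka.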
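Steps 3-5 describe Kafka's publish/subscribe flow. The sketch below is not the real Kafka client but a tiny in-memory stand-in that illustrates the same pattern: producers append messages to a topic log on a broker, and consumers read from an offset onward:

```python
from collections import defaultdict

class MiniBroker:
    """Toy stand-in for a Kafka broker: each topic is an append-only log."""

    def __init__(self):
        self.topics = defaultdict(list)

    def publish(self, topic, message):
        """Producer side: append a message, return its offset."""
        self.topics[topic].append(message)
        return len(self.topics[topic]) - 1

    def consume(self, topic, offset=0):
        """Consumer side: read all messages from a given offset."""
        return self.topics[topic][offset:]

broker = MiniBroker()
# Producer: sensor readings become messages on a topic.
for reading in [{"rig": 7, "temp_c": 81.5}, {"rig": 7, "temp_c": 83.0}]:
    broker.publish("sensor-temps", reading)
# Consumer: read everything published so far.
msgs = broker.consume("sensor-temps")
```

In the real system, the broker persists these logs across a cluster coordinated by ZooKeeper, and consumers track their own offsets so they can replay or resume.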
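Steps 6-9 describe Storm's spout-and-bolt topology. The chain of generator functions below sketches that dataflow shape in plain Python (the threshold, field names, and "rig,temperature" input format are illustrative assumptions, not Storm's API):

```python
def spout(lines):
    """Spout: emit one tuple per raw "rig,temperature" sensor line."""
    for line in lines:
        rig, temp = line.split(",")
        yield {"rig": rig, "temp_c": float(temp)}

def alert_bolt(tuples, threshold=85.0):
    """Bolt: flag readings above a temperature threshold, then emit."""
    for t in tuples:
        t["alert"] = t["temp_c"] > threshold
        yield t

def persist_bolt(tuples, store):
    """Terminal bolt: persist each tuple to a store (a dict here)."""
    for t in tuples:
        store.setdefault(t["rig"], []).append(t)

# Wiring the topology: spout -> alert_bolt -> persist_bolt.
store = {}
persist_bolt(alert_bolt(spout(["7,81.5", "9,90.2"])), store)
```

A real Storm topology expresses the same graph with Spout and Bolt classes and runs each node in parallel across the cluster, but the data movement is the same: tuples flow along the links between nodes.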
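The swift lookups claimed in step 11 depend on row-key design, since HBase stores rows sorted by key. A minimal sketch of one common pattern, assuming a hypothetical device-id-plus-reversed-timestamp key so the newest reading for a device sorts first (an ordinary dict stands in for the table):

```python
# Large constant from which timestamps are subtracted, so newer
# readings produce smaller (earlier-sorting) key suffixes.
MAX_TS = 10**13

def row_key(device_id, ts_millis):
    """Compose an HBase-style row key: device id + reversed timestamp."""
    return f"{device_id}#{MAX_TS - ts_millis:013d}"

# A plain dict stands in for the HBase table here.
table = {}
for dev, ts, temp in [("rig7", 1000, 81.5), ("rig7", 2000, 83.0)]:
    table[row_key(dev, ts)] = {"temp_c": temp}

# A prefix scan in key order yields the newest reading first.
latest_key = sorted(k for k in table if k.startswith("rig7#"))[0]
latest = table[latest_key]
```

With this layout, "latest reading per device" becomes a short prefix scan rather than a full-table filter, which is what keeps lookups fast as the IoT dataset grows.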


Addressing the business problem outlined here with a Hadoop-based open source technology platform proves invaluable for organizations. The architecture can be extended to other industries, such as transport, healthcare, mining, oil exploration, and farming. Data collected at discrete, remote locations can be ported to a central data management platform, yielding insights into holistic parameters and providing real-time solutions that enhance business value and mitigate risk.

About the author

Sricharan Vadapalli

Practice Director, Data, Analytics and DevOps

Vadapalli, or “Sri,” as friends call him, helps clients harness the power of data with the latest and greatest analytic technologies. With a background in IT consulting and career guidance, Sri knows how to bring clients from where they are to where they need to be. Always developing new skills and knowledge, Sri hopes to leave a legacy by teaching others to achieve their goals. An author of literature on Big Data and DevOps, and a yoga and meditation instructor, Sri finds joy in public speaking and mentoring peers in his community.
