Real-time Data Streaming using Apache Spark on Fully Configured Hadoop Cluster


Kashi Sai Prasad, S Pasupathy



Apache Spark, Big Data, Flume, Hadoop, MapReduce, Twitter data ingestion


Data plays a major role in today's Internet world. Analyzing historical data has become easier with the advancement of analytical tools, but gathering data from social networking websites remains a great challenge for today's data scientists. Much research has been conducted on gathering streaming data (data generated every second). The Hadoop ecosystem provides a component called Apache Flume to ingest data into HDFS for processing with MapReduce. Flume has its own benefits, which have made many analyses of social networking data easier, but it requires in-depth knowledge of configuration files and administration. Our work proposes a framework for real-time streaming of Twitter data. Apache Spark, which improves on Hadoop MapReduce in processing speed, provides much more insight than Apache Flume. Spark is an in-memory distributed computing engine designed to increase processing speed over MapReduce, and it is considered one of the most advanced ecosystem components for batch and near-real-time processing. In this paper we explain in detail data ingestion using Apache Spark and Scala IDE. In our work the data is ingested directly from the Twitter website using the tokens and access keys provided, as explained in chapters 3 and 4. Our GUI also lets a user post tweets directly without visiting the Twitter website. We also provide an option to categorize the tweets of specific persons using '#' tags. The data thus obtained can be used for statistical analysis and report generation.
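The '#'-tag categorization mentioned above can be illustrated with a minimal sketch in plain Scala. This is not the paper's implementation: the `Tweet` case class, the `HashtagCategorizer` object, and the sample data are hypothetical names introduced only for illustration. The sketch also works on an in-memory collection rather than on a live Spark stream.

```scala
// Hypothetical tweet record; the field names are assumptions for illustration.
final case class Tweet(user: String, text: String)

object HashtagCategorizer {
  // Matches '#' followed by one or more ASCII word characters
  // (letters, digits, underscore).
  private val HashtagPattern = "#(\\w+)".r

  /** Extracts the distinct hashtags in a tweet's text, lowercased, without '#'. */
  def hashtags(text: String): Set[String] =
    HashtagPattern.findAllMatchIn(text).map(_.group(1).toLowerCase).toSet

  /** Groups tweets by hashtag; a tweet with several hashtags appears under each. */
  def categorize(tweets: Seq[Tweet]): Map[String, Seq[Tweet]] =
    tweets
      .flatMap(t => hashtags(t.text).map(tag => tag -> t))
      .groupBy(_._1)
      .map { case (tag, pairs) => tag -> pairs.map(_._2) }
}

object Demo {
  def main(args: Array[String]): Unit = {
    // Made-up sample tweets, not data from the paper.
    val sample = Seq(
      Tweet("alice", "Learning #Spark and #Scala"),
      Tweet("bob", "Trying #spark streaming"),
      Tweet("carol", "No tags here")
    )
    val byTag = HashtagCategorizer.categorize(sample)
    byTag.foreach { case (tag, ts) =>
      println(s"#$tag: ${ts.map(_.user).mkString(", ")}")
    }
  }
}
```

Because tags are lowercased, `#Spark` and `#spark` fall into the same category. Tweets without any hashtag are left out of the result.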



Author(s): Kashi Sai Prasad, Dr. S Pasupathy