Project Background

In the near future, Big Data will touch every business and every person on the planet. MIT Technology Review has reported that less than 0.5% of all data collected is ever analyzed and used, so its potential is enormous. We are Frog-B-Data, and our senior capstone project is a Big Data research project in which we set up and compare three environments: stand-alone Java, Apache Hadoop, and Apache Spark. Hadoop and Spark each run as three-node clusters. We manage application dependencies with Apache Maven, develop in the Eclipse IDE, and use the Mahout and Spark MLlib machine-learning libraries. Hadoop has long been the go-to framework for Big Data applications, but it is gradually being displaced by Spark, which continues to gain popularity. We perform four comparison tests: word count, matrix multiplication, recommendation using co-occurrence matrix or collaborative filtering algorithms, and K-means clustering. Unstructured data files ranging from a few megabytes to 10 gigabytes are used for the comparative studies. Based on the results, we built our own recommender system in both Spark and Hadoop.
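To give a feel for the kind of test program being compared, here is a minimal sketch of the word count test written against the Spark Java API (assuming Spark 2.x, where flatMap expects an Iterator). The class name and HDFS paths are placeholders for illustration, not the project's actual source.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class WordCountSketch {
    public static void main(String[] args) {
        // Application name only; the master URL is supplied by spark-submit on the cluster.
        SparkConf conf = new SparkConf().setAppName("WordCountSketch");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read an unstructured text file from HDFS (placeholder path).
        JavaRDD<String> lines = sc.textFile("hdfs:///data/sample.txt");

        // Split lines into words, map each word to (word, 1), and sum the counts per word.
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);

        // Write the (word, count) pairs back to HDFS and shut down (placeholder path).
        counts.saveAsTextFile("hdfs:///output/wordcount");
        sc.stop();
    }
}
```

The same computation is expressed quite differently in Hadoop MapReduce (see the sketch under Objective below), which is part of what the timing comparisons measure.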

One billion people logged on to Facebook on August 24, 2015. That would not have been possible without techniques for handling data at this scale. Managing user data is central to tech giants like Facebook, and as the volume of data grows every second, it becomes harder to process it at high speed without compromising security. Frog-B-Data compares three widely used data-processing tools, Weka, Hadoop, and Spark, and runs a range of tests to measure their capability and speed in processing data. We analyze the results of these tests, choose the best of the three, and arrive at a definite answer to the question: can Spark really replace Hadoop?

Data now streams from everywhere in our daily lives: phones, credit cards, computers, tablets, sensor-equipped buildings, cars, buses, trains, and so on. We have heard many people say, “There is a Big Data revolution.” What does that mean? It is not the quantity of data that is revolutionary. The Big Data revolution is that we can now do something with the data. The revolution lies in improved statistical and computational methods that can be used to make our lives easier, healthier, and more comfortable.

Familiar, everyday uses of Big Data include the “recommendation engines” used by Netflix, Amazon, credit card companies, and tech giants like Facebook. In the public realm, there are all kinds of applications: allocating police resources by predicting where and when crimes are most likely to occur; finding associations between air quality and health; or using genomic analysis to speed the breeding of drought-resistant crops such as rice. Yet this is only a small fraction of what can be done and what is being done. The potential for doing good is nowhere greater than in public health and medicine, where people are dying every day simply because data is not shared properly.

Nowadays, it is not just about mining data and analyzing results; it is about using data smartly. The purpose of smart data is to filter out the noise from Big Data and keep the valuable data that solves business problems. There is no formula for converting Big Data into smart data, but if we understand the clues in the questions surrounding the data and analyze it qualitatively, we can use it smartly.

Objective

All of us are part of the data revolution, and the amount of data in motion grows every second. More data crosses the internet every second than was stored in the entire internet just 20 years ago. Frog-B-Data, a 2015-16 Big Data project, is a research-based project that tests the performance of data mining algorithms in three environments: Weka, a traditional data mining tool, and two true big data environments, Hadoop MapReduce and the more recently introduced Apache Spark, both of which provide a processing model for analyzing big data. We carry out a range of tests to validate the feasibility of preparing large, unstructured data files and processing them in each environment, as sketched below.
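To make the contrast with Spark concrete, here is a minimal sketch of the same word count test expressed in Hadoop's MapReduce processing model, using the standard org.apache.hadoop.mapreduce Java API. Class names are illustrative, and the input and output paths are taken from the command line rather than from the project's actual configuration.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapReduceWordCountSketch {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count sketch");
        job.setJarByClass(MapReduceWordCountSketch.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);   // combine locally to cut shuffle traffic
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Where the Spark sketch keeps intermediate data in memory as RDD transformations, the MapReduce version writes between the map and reduce phases through the distributed shuffle; that architectural difference is one of the things our timing tests are designed to expose.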