The day has come, after playing with hadoop distributions around a year and two trainings; I feel ready to write an introduction post about Big Data, Hadoop and ecosystem projects.
1. What is Big Data?
Big Data is not Hadoop, Hadoop is just an implementation of Big Data concept. Big Data is a young concept in data analysis, ETL, data warehousing and data discovery fields; or data science for short. Every year, every day, every minute we create data, and every time we create more than we’ve created before. And we, data workers, process this data to get valuable pieces to our company, client, industry to increase income.
Here, Big Data comes in our life. Big Data has 5 V’s (in some sources you may find only 3 V’s) which are : Volume, Velocity, Variety, Verification (or Validity) and last but the most important to me Value.
5V for Big Data
Volume : Data we need to process and analyze every day is getting bigger day by day. So we need a new approach to data processing, this is where big data comes in.
Velocity : With the use of mobile devices, social media and internet we can create more data in some time, before we could do. For example, before social media if we generate just X MB data on the internet in a day, now we generate more than X GB of data in a day, may be in a couple hours. So we need to catch and process data really fast to catch-up with its speed. (There is also CEP or Fast Data concepts you may interest in this specific topic.)
Variety : I will blame social media again but with the increase of social media usage and internet usage now we generate unstructured and various types of data, like shares, likes, status updates, retweets, vines, videos, texts, gifs, other images. And to create value to our company we have to process this data. They are also coming from variable sources, network systems, internet, forms on corporate website etc.
Verification : Of course, we need the right data to get right results.
Value : This is it! It’s the reason why we process so much data in so short time. We need to get valuable pieces of data, we try to extract VALUE from the data. It is like searching for diamond in a mine.
2. What is Hadoop?
Hadoop is an open source project which implements Big data. It’s a distributed system to store and process data with commodity hardware. It does not require big powerful servers, instead you can create a cluster with desktop computers of quad-core processors with 2GB of RAM. (less than most of the modern laptops, almost all smartphones have 1 or 2 GB of RAM nowadays.)
Hadoop first started with Google’s white paper publishing on its Big Table and MapReduce architecture. Some cool guys tried to implement and develop these features in open source manner. Then Yahoo, Facebook, Google and Apache Foundation supported them. Now Hadoop is an open source Apache project.
3. Hadoop Distributions
You can install and use Hadoop through Linux’s repositories. But there is also start-ups who bundled Hadoop with some other open-source ecosystem projects for Hadoop and with their own tools as well.
Cloudera is one of these start-ups who has own bundle and it provides a VM also for quick start.
Hortonworks is the one other start-up, who has own bundle and own VM to getting started.
Also there is bigger solutions to Hadoop like Oracle’s Big Data Appliance, Teradata’s solution, IBM and HP also have their own enterprise solutions.
4. Hadoop Ecosystem Projects
Pig : Pig is one of the data analytic tools we can use with Hadoop. It has its own language to code which is a scripting language and called Pig Latin. It is really close to English so you can code like you are writing an English essay.
Hive : Hive is another way to run data analytics. It has a SQL like language, so it is usually preferred by developers who already have SQL knowledge.
Impala : Impala is the rival of Hive, it also has a SQL like language and it is much more faster than Hive, because it does not convert code to MapReduce, instead it runs on HDFS directly. (I will tell more about MapReduce and HDFS next time.)
Oozie : Oozie is a scheduling and job management tool. Where you can define flows as XML files, and it runs jobs as defined in this XML.
Sqoop : Sqoop is a tool to load data to Hadoop from a RDBMS or vice versa. It crates MapReduce jobs to load data and runs them automatically.
Flume : Flume is a listener basically. User defines an input channel and Flume polls it repeatedly, for example user defines a log file as a channel and Flume polls in every five minutes to get latest logs from this file.
Ambari : Ambari is administration (provisioning, managing and monitoring) console for the Hadoop cluster.
That’s all for introduction post, soon I will be writing about Hadoop Internals: HDFS and MapReduce. Please do not hesitate to leave comments or ask question in comments section. And hopefully in february I will be building a mini Hadoop cluster at home which will be topic for another blog post.
Thanks for reading.