Big data is a term that is used very often but is yet to be defined properly. How much data can be classified as big data? Does big data mean only big in size, or big in complexity, or both? If hard disks can be clubbed together to store very large amounts of data, then why has the whole world become so obsessed with big data? Is it something different from a traditional DBMS?
These are some of the questions that may come to the mind of a person who is starting his journey with big data. I can recall one of my friends, Amrit, who worked with a company that was also selling computers in the year 1999. I asked him for quotes and he told me about a computer with a 2GB hard disk. He was very excited, and in his excitement he declared that I would not be able to fill that hard disk in the next few years. In his words, "It's a really big hard disk which can hold big data." Today we can only laugh at it. Today even my car keys have a 64GB flash drive companion.
Big is a relative term and it is quite subjective, although nowadays, when data runs into several hundred GB, we start calling it big data. My first encounter with big data was with Apache logs on a server that was hosting more than a few thousand websites. Due to restrictions on the number of open file pointers on the server, I tweaked the Apache configuration to store all logs in a single file, with the virtual host information as the first column of each record.
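For the curious, that tweak boils down to prefixing the log format with the virtual host name and pointing every vhost at one shared log. The directives below are standard Apache (LogFormat, CustomLog, and the %v specifier for the virtual host name); the exact format string, nickname, and file path are only illustrative, not my original configuration.

    # Prefix every record with the canonical virtual host name (%v) and
    # send all vhosts' traffic to one shared log file.
    LogFormat "%v %h %l %u %t \"%r\" %>s %b" vhost_common
    CustomLog /var/log/apache/combined_access.log vhost_common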
We were supposed to process those logs for the awstats log analyzer. The server was already under heavy load, so we transferred the log files to another server and ran our processing routines there. After processing the data, we put the generated awstats files back on the originating server so users could see their site stats without going to another server. For us, this was big data.
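To give an idea of the kind of processing routine involved, here is a minimal sketch in Python of the splitting step: it reads the combined log, peels off the leading virtual host column, and writes each record into a per-site file that a tool like awstats could then consume. The paths, file names, and the one-open-handle-per-vhost approach are assumptions for illustration; only the overall idea, the vhost name as the first column of a single shared log, comes from the setup described above.

    # Split a combined Apache log (virtual host as the first column) into
    # per-site log files, one per vhost, for separate awstats runs.
    # All paths and names here are illustrative.
    import os

    COMBINED_LOG = "/var/log/apache/combined_access.log"
    OUTPUT_DIR = "/tmp/per_vhost_logs"

    os.makedirs(OUTPUT_DIR, exist_ok=True)
    handles = {}  # one open file per vhost

    try:
        with open(COMBINED_LOG) as src:
            for line in src:
                vhost, sep, rest = line.partition(" ")
                if not sep or not rest.strip():
                    continue  # skip malformed or empty records
                if vhost not in handles:
                    handles[vhost] = open(os.path.join(OUTPUT_DIR, vhost + ".log"), "a")
                handles[vhost].write(rest)
    finally:
        for fh in handles.values():
            fh.close()

Keeping one handle open per vhost would, of course, bump into the same open file pointer restriction on the original server; on a separate machine dedicated to the processing, that trade-off is much easier to live with.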
I can identify the following things due to which it was big for us:
1. A single file was needed for multiple loggers, as we were not able to write to multiple files at once due to restrictions on the system.
2. As the server was overloaded with a high number of requests, we were not able to process our large file, which required a lot of memory as well as processor time, on that same server.
So data can be classified as big data if one machine is not able to create, hold, or process it for the purpose you want to achieve.
If your machine is not sufficient to process or hold your data, the first thing that comes to your mind will be a hardware upgrade. There is a catch in this option: there are limits to how far hardware can be upgraded, and a time will come when further upgrades are not possible.
So you need multiple systems to act as one. Whenever you find yourself in that situation, you have big data to handle.