If someone were to define ‘Big Data’ as a dataset of some fixed number of bytes, the term would become obsolete within months. Technology is advancing so rapidly that datasets are, in turn, growing ever larger. We would have to start using comparative adjectives such as ‘Bigger Data’ and ‘Larger Data’, which would defeat the purpose.
In the modern technological age, we are no longer constrained to taking samples, because we can often collect all the data needed on an entire population. The size of the data alone is therefore not enough to define ‘Big Data’. In 2001, Doug Laney (a former analyst at Gartner) helped outline some key characteristics that move us towards a better understanding of what ‘Big Data’ actually is: the Big Vs.
Volume
How much data is being collected? How much data is being stored? These questions help gauge the volume of data. A 2012 study by the University of Oxford and IBM found that over half of the 1,144 data professionals surveyed judged datasets of between one terabyte and one petabyte to be ‘big’. More interestingly, about a third of the respondents said they frankly didn’t know.
As a general rule, datasets so large that they cannot be collected, stored and analysed using traditional computing methods can be considered to meet the volume criterion.
Variety
Variety is a measure of how diverse the data is. This can refer to one of two things: the source of the data or the structure of the data. The former is usually considered when collecting data. Take, for example, polling data: it can be collected on the ground directly after voters cast their votes, or it can be gathered from online surveys. Although the format differs, i.e. raw paper forms versus records stored in online databases, the data represents the same thing.
Secondly, variety can be derived from structure. Data can generally be categorised into three ‘buckets’: structured, semi-structured and unstructured. Taking the polling example again, data collected from online surveys will follow a consistent format, so it would be considered semi-structured (there may still be free text that needs further processing before it can be classed as structured). Data collected on the ground with pen and paper, however, may not follow such a standard format: people could write in full sentences or simply dictate their answers out loud. This data would be considered unstructured.
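To make the distinction a little more concrete, here is a minimal Python sketch showing the same (hypothetical) polling response in each of the three buckets. The field names and values are made up purely for illustration.

```python
# A minimal sketch (hypothetical field names and values) showing one polling
# response in each of the three 'buckets' described above.
import csv
import io
import json

# Structured: fixed columns, every row follows the same schema.
structured_csv = "voter_id,constituency,choice\n1042,North East,Candidate A\n"
rows = list(csv.DictReader(io.StringIO(structured_csv)))

# Semi-structured: consistent keys, but free-text fields still need processing.
semi_structured = json.loads(
    '{"voter_id": 1042, "choice": "Candidate A", "comment": "Liked their housing policy"}'
)

# Unstructured: raw notes taken with pen and paper, no predictable format.
unstructured = "Spoke to a voter outside the polling station; they leaned towards A because of housing."

print(rows[0]["choice"], semi_structured["choice"])
```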
Velocity
The speed at which data needs to be collected, processed and analysed defines the final key characteristic of big data. Velocity and volume go hand in hand: the faster data is generated, the more of it there is. Likewise, the faster data is collected and processed, the more there is to analyse, allowing actionable decisions to be made. High-speed data processing is all around us, from the image processing in your smartphone camera to the videos streaming on your laptop. This is usually referred to as stream processing and is beneficial when data needs to be interpreted in real time.
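As a rough illustration of the idea, the Python sketch below handles each record as soon as it ‘arrives’, keeping a running result up to date rather than waiting for a full dataset. The sensor_stream generator and its readings are hypothetical stand-ins for a real-time feed.

```python
# A minimal sketch of stream processing: each reading is handled the moment
# it arrives. The sensor_stream generator simulates a continuous data feed.
import random
import time

def sensor_stream(n=5):
    """Yield one reading at a time, as if data were arriving continuously."""
    for _ in range(n):
        yield random.uniform(0.0, 100.0)
        time.sleep(0.1)  # pretend the next reading hasn't arrived yet

running_total, count = 0.0, 0
for reading in sensor_stream():
    # Update the result immediately so decisions can be made in real time.
    running_total += reading
    count += 1
    print(f"reading={reading:.1f} running_mean={running_total / count:.1f}")
```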
On the other side of the coin is batch processing, which is usually a slower form of data flow but allows much larger volumes of data to be processed at once. Financial data is a good example: there is no need for it to flow continuously and quickly, because it is more important that it arrives correctly. That is why most financial institutions process their data in chunks, or ‘batches’, which makes the data more reliable and the processing more cost-effective.
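By contrast, a batch job collects records first and processes them together. The sketch below runs over some made-up transaction amounts in fixed-size chunks; the amounts and the batch size are purely illustrative.

```python
# A minimal sketch of batch processing: transactions are collected first and
# then processed together in fixed-size chunks (values are hypothetical).
transactions = [120.50, -40.00, 310.25, 89.99, -15.75, 250.00, -60.10, 42.42]
BATCH_SIZE = 3

for start in range(0, len(transactions), BATCH_SIZE):
    batch = transactions[start:start + BATCH_SIZE]
    # The whole batch is summarised in one go, trading immediacy for
    # completeness and a lower per-record cost.
    total = sum(batch)
    print(f"batch {start // BATCH_SIZE + 1}: {len(batch)} records, net total {total:.2f}")
```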
Over the years, further Vs have been proposed, including Veracity, which refers to the reliability of the data, and Visualisation, which refers to presenting the data in an interpretable manner. Although these characteristics certainly add to the definition of big data, they also show that the term can be used in many ways. Data is, in fact, just a universal tool that allows us to better understand ourselves and the world we live in.
