Data Science refers to an emerging area of work concerned with the collection, preparation, analysis, visualisation, management and preservation of large collections of information.
As seen in the previous article (Data Science for beginners!), Data Science is the ability to take data, understand it, extract value from it, visualise it and communicate it. There is a deep connection between science and the huge quantity of data that can be collected nowadays.
“In the last few hundred years, Science has accepted theoretical models as a valid method of inquiry”
In the last 50 years, high-speed computation has allowed an entirely new method of scientific inquiry. It is now possible to simulate with a computer phenomena that otherwise couldn’t be observed directly or reproduced in laboratories; it is also possible to set the initial conditions and run the simulation to get a result. As many scientists have stated, “Science is in the midst of a generational shift from a data-poor enterprise to a data-rich enterprise”.
In the last 10 years, the acquisition of massive data sets from new kinds of instruments (such as sensors) or from simulations has contributed to the development of a new type of science, called e-Science. This is really about massive and complex data, large enough to require automated analysis with the same tools used in data science (databases, SQL systems and machine learning techniques).
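As a minimal sketch of what such automated analysis might look like (the sensor names, readings and threshold below are illustrative assumptions, not from the article), the data is put into a database and queried programmatically rather than inspected by hand, here using Python's standard-library sqlite3 module:

```python
import sqlite3

# Hypothetical sensor readings: (sensor_id, measured value).
readings = [("s1", 20.1), ("s1", 20.4), ("s2", 19.8), ("s2", 35.0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)", readings)

# Automated analysis: an SQL query aggregates every sensor's readings,
# and a simple rule flags sensors whose mean exceeds a chosen threshold.
rows = conn.execute(
    "SELECT sensor, AVG(value) FROM readings GROUP BY sensor ORDER BY sensor"
).fetchall()
anomalies = [sensor for sensor, avg in rows if avg > 25.0]
print(anomalies)
```

The point is not the threshold rule itself but the workflow: once data volumes are too large to look at record by record, queries and models replace manual inspection.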
If Science is about asking questions (“Query the World with data acquisition activities coupled to a specific hypothesis”), e-Science is about “downloading the World with huge data sets acquired in support of many hypotheses”. This new kind of science is driven by data more than by computation, thanks to the massive volumes of data coming from many fields.
Business, too, is beginning to look a lot like science: companies acquire data aggressively, hire data scientists and make empirical decisions.
Thanks to advances in technology, the cost of data acquisition has dropped and, today, the new bottleneck is the cost of finding, integrating and analysing data. It is even possible to say that data analysis has replaced data acquisition as the bottleneck to new discoveries. The need for automated extraction of knowledge from massive volumes of data suggests that there is simply too much data to look at; but it is not just a matter of volume. We call these volumes of data Big Data.
Big Data is a buzzword, meaning a massive volume of data that is so large it is difficult to process using traditional software techniques.
Every day we collect new forms of data, regardless of their immediate usage: Big Data comes mostly from customer behaviour and pervasive sensors, usually in an unstructured form.
“The information hidden within Big Data is now a key theme in all the sciences – arguably the key scientific theme of our times”
Large-scale data is not just bigger; it is also different. The phenomenon of Big Data is often described using four Vs:
- Volume (number of rows/objects/bytes): the size of the data; refers to the vast amounts of data generated every second.
- Velocity (number of rows/bytes per unit time): the speed at which new data is generated and the speed at which data moves around. Technology now allows us to analyse data while it is being generated, without ever putting it into databases.
- Variety (number of columns/dimensions/sources): the diversity of formats, quality and structures; refers to the different types of data we can now use.
- Veracity: the trustworthiness of the data. With many forms of data, quality and accuracy are less controllable.
We can add a fifth V for Value, because it is important that businesses weigh the costs and benefits before any attempt to collect Big Data.
“The coming of the Big Data Era is a chance for everyone in the technology world to decide into which camp they fall, as this era will bring the biggest opportunity for companies and individuals in technology since the dawn of the Internet”
Organizations today face more and more Big Data challenges: they have access to an abundance of information, but they don’t know how to get value out of it because most of the data is in an unstructured format. In fact, today, 80% of the world’s information is unstructured, and this format is growing at 15 times the rate of structured information. Organizations that don’t know how to manage this data are overwhelmed by it; but the opportunity exists, with the right technology platform.
Volume, Velocity and Variety define what kind of data is “Big” and what is not. Dealing effectively with Big Data requires performing analytics against the volume and variety of data while it is still in motion, not just after it is at rest. To sum up, today the Velocity characteristic of Big Data is a key discriminant for working in data science.
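A tiny sketch of what analysing data “in motion” can mean in practice (the values and the statistic chosen here are illustrative assumptions): a streaming computation keeps only a running summary and yields an up-to-date answer after every record, so nothing ever needs to be stored in a database.

```python
def running_mean(stream):
    """Yield the mean of all values seen so far, one result per record."""
    total, count = 0.0, 0
    for value in stream:
        total += value
        count += 1
        yield total / count  # answer is current after every arrival

# Simulated stream of values arriving one at a time.
stream = iter([10.0, 20.0, 30.0])
means = list(running_mean(stream))
print(means)  # [10.0, 15.0, 20.0]
```

The design point is that the state kept (here, a sum and a count) is constant-sized regardless of how many records flow past, which is what makes analysis at high Velocity feasible.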