What is Data Science?
The term Data Science is often mentioned in the same breath as Big Data (remember the 4 V’s: Volume, Variety, Velocity and Veracity?), but Data Science is not limited to this context. Data Science is the art of converting raw data into useful information that can be used to draw conclusions and make decisions.
To achieve this, a data scientist needs a deep understanding of techniques and methodologies from different disciplines in order to unlock the full potential hidden in the data.
Competences of a Data Scientist
First of all, math and statistics provide the foundation of the models and methods used to analyse and interpret data. The challenges from a statistical perspective include the quality of the data, the characteristics of the sample, the validity of generalisation, the confidence in the obtained results and the balance between human judgement and computation. Statistical rigour is thus necessary to justify the inferential leap from data to knowledge. In the era of Big Data, statistics also plays a special role in handling the veracity aspect.
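To make the idea of "confidence in the obtained results" a little more concrete, here is a minimal sketch that computes a confidence interval for a sample mean. It assumes a normal approximation, which is only one of several possible approaches (for small samples a t-distribution would be more appropriate). The function name mean_confidence_interval and the sample values are invented purely for illustration.

```python
import math
import statistics


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean.

    Illustrative helper (not from any library); assumes the sample is
    an independent random draw from the population of interest.
    """
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = statistics.fmean(sample)
    # Standard error of the mean, based on the sample standard deviation.
    sem = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided critical value of the standard normal distribution.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * sem, mean + z * sem


# Hypothetical measurements, made up for this example.
sample = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]
low, high = mean_confidence_interval(sample)
print(f"mean = {statistics.fmean(sample):.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

The width of the interval shrinks as the sample grows and widens as the requested confidence level rises, which is one simple way to quantify how far a conclusion can be generalised beyond the data at hand.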
Computer Science brings theory to action. In order to make the analysis of large datasets possible, it is necessary to develop algorithms that are not only effective but also efficient, and that are tailored to the types of data and the problem one wants to solve. Thus it is often not sufficient to know frameworks and tools like Hadoop, Spark and SPSS; it is crucial to be able to extend them to fit the task. Without computer science, data science is hardly possible, and it is unimaginable when it comes to Big Data. The problems of Volume and Velocity cannot be tackled without a deep algorithmic understanding.
Last but not least, domain knowledge is indispensable. It is necessary in order to understand the data and the (business) processes that are meant to benefit from the analysis. Since it is unreasonable to expect deep analytical skills and domain knowledge for a specific task to be combined in one person, a data scientist needs to have good communication skills. This is crucial for two reasons: first, to understand the “language” of the domain expert (i.e., the owner of the data), and second, to communicate the results of the analysis. An analysis is meaningless unless it convinces someone to take action. On top of that, a data scientist needs to be creative in order to adapt analytical methodologies to new problem settings in different environments.
Putting all this knowledge together, the typical data science process involves several steps, as shown in the following figure:
The Data Science Process (stepping backwards is possible and necessary)