Although Data Science can be an all-encompassing umbrella term for statistics, machine learning, and programming/computer science, I consider Data Science to be the part of the process where one obtains the data and begins to identify features and covariates. Certainly, the border between feature selection and machine learning model selection and refinement is ambiguous, but much of what I consider Data Science has been achieved by the time that intersection is reached. Obtaining data is often very challenging. Kaggle competitions omit this data engineering step because the data set is provided. Real-life data is not as easy to obtain: the data scientist must determine not only what variables to collect and how to collect them, but also how much time and money to spend acquiring different data.

Once the data set is obtained, the next step is to examine the type of data at our disposal. Not only is the data often a mixture of categorical, ordinal, and interval variables, but it also often has problems such as noise, sparseness, and questionable veracity. Knowing when and how to transform the data, and what ramifications a transformation may have, is the data scientist’s job. The decisions made in these steps are critically important, and they often matter much more than tuning a machine learning model’s hyperparameters much further downstream in the process.
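As a rough illustration of that first examination step, here is a minimal sketch in plain Python that counts missing and distinct values per column and makes a crude numeric-versus-categorical guess. The function name `summarize_columns`, the choice to treat `None` and empty strings as missing, and the toy records are all assumptions made up for this example; they are not taken from any of the analyses listed below, and in practice a library such as pandas would normally handle this.

```python
# Values treated as missing; this is an assumption of the sketch, not a standard.
MISSING = {None, ""}


def summarize_columns(rows):
    """Summarize each column of a list of dict records.

    For every column, report how many values are missing, how many
    distinct non-missing values there are, and a rough kind:
    "numeric" if every non-missing value is an int or float,
    otherwise "categorical".
    """
    columns = sorted({key for row in rows for key in row})
    summary = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v not in MISSING]
        is_numeric = bool(present) and all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in present
        )
        summary[col] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "kind": "numeric" if is_numeric else "categorical",
        }
    return summary


# Invented toy records, purely for illustration.
rows = [
    {"age": 22.0, "ticket_class": 3, "port": "S"},
    {"age": None, "ticket_class": 1, "port": "C"},
    {"age": 35.0, "ticket_class": 1, "port": ""},
    {"age": 35.0, "ticket_class": 3, "port": "S"},
]

summary = summarize_columns(rows)
for name, stats in summary.items():
    print(name, stats)
```

Note that the sketch labels `ticket_class` as numeric simply because it is stored as integers, even though a variable like that is arguably ordinal. Catching that kind of mismatch, and deciding how to encode the variable, is exactly the judgment call described above.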

The following data science problems could be classified under machine learning. However, I am listing them here because, in these cases, the performance of the models and predictions depends far more on feature engineering and feature selection than on selecting the optimal activation function or gradient descent technique.

Data Science Analyses

Kaggle’s Titanic Survival Prediction – This is an introductory data science competition hosted by Kaggle. This is my kernel, also hosted on Kaggle. The script was written in R and R Markdown.
Iris Dataset – Ronald Fisher’s classic dataset, in which he introduced Linear Discriminant Analysis for classification. This script was written in Python in a Jupyter notebook.
HoopSci – This is a new project of mine, a website and blog devoted to college basketball analytics. This site is my current major focus.