Data Science Process
The various phases of a data science process are explained below:
This is the first phase of the data science process, and it involves asking the right questions. When you start any data science project, you need to establish the basic requirements, priorities, and project budget. In this phase, we gather all the requirements of the project, such as the number of people, technology, time, data, and an end goal. It also involves gathering data from all the identified internal and external sources. The data can be logs from web servers, data gathered from social media, data from online repositories such as US Census datasets, or data streamed from online sources using APIs.
In this phase, you develop goals and a plan for achieving them. If the right questions have been asked in this phase, it becomes easy to narrow down the correct data sources.
A major challenge here is understanding where the data comes from and whether it is up to date. This makes it important to keep track of data sources throughout the project life cycle, as data may need to be re-acquired to test other hypotheses, run other experiments, and reach conclusions.
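One way to keep track of where data came from and whether a re-acquired copy has changed is to store a small provenance record next to each dataset. The following is a minimal sketch, not a prescribed method: the function names, the record fields, and the sample payloads are all assumptions made for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone


def provenance_record(source_name, raw_bytes, retrieved_at=None):
    """Describe one acquired dataset: its source, retrieval time, and checksum."""
    if retrieved_at is None:
        retrieved_at = datetime.now(timezone.utc)
    return {
        "source": source_name,
        "retrieved_at": retrieved_at.isoformat(),
        "sha256": hashlib.sha256(raw_bytes).hexdigest(),
        "size_bytes": len(raw_bytes),
    }


def has_changed(old_record, new_record):
    """Return True when a re-acquired copy differs from the recorded one."""
    return old_record["sha256"] != new_record["sha256"]


# Hypothetical payloads standing in for two downloads of the same source.
first = provenance_record("web_server_logs", b"GET /index.html 200\n")
second = provenance_record(
    "web_server_logs", b"GET /index.html 200\nGET /about 404\n"
)

print(json.dumps(first, indent=2))
print("changed since last acquisition:", has_changed(first, second))
```

Comparing checksums rather than timestamps means the check still works when a source is re-downloaded without having changed.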
Data preparation, also known as data munging or data wrangling, involves tasks such as data cleaning, data reduction, data integration, and data transformation. Inconsistencies such as missing values, blank columns, and incorrect data formats need to be cleaned. The cleaner your data, the better your predictions.
Data gathered in the previous phase might not give a clear analytical picture or reveal patterns. Therefore, to understand this data, it needs to be structured and cleaned. Data can be obtained from different sources, but for analysis it needs to be combined into a single dataset. This is also called structuring the data. After reformatting, the data can be converted to JSON, CSV, or any other format that makes it easy to load into a data science tool.
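The cleaning and combining steps described here can be sketched with the Python standard library alone. The two inputs below (a CSV export and a JSON feed), their column names, and the choice to represent a missing or malformed age as `None` are assumptions for illustration; a real project would more likely use a dedicated data-frame library.

```python
import csv
import io
import json

# Hypothetical raw inputs describing the same users from two sources.
CSV_SOURCE = """user_id,age,city
1,34,Austin
2,,Boston
3,abc,Chicago
"""
JSON_SOURCE = '[{"user_id": 1, "visits": 12}, {"user_id": 3, "visits": 5}]'


def parse_age(text):
    """Return the age as an int, or None when the field is blank or malformed."""
    try:
        return int(text)
    except ValueError:
        return None


def load_users(csv_text):
    """Read the CSV source and normalise types, keyed by user_id."""
    users = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        user_id = int(row["user_id"])
        users[user_id] = {
            "user_id": user_id,
            "age": parse_age(row["age"]),
            "city": row["city"].strip(),
        }
    return users


def combine(users, visits_json):
    """Attach visit counts from the JSON source; unmatched users get 0."""
    visits = {item["user_id"]: item["visits"] for item in json.loads(visits_json)}
    return [
        dict(user, visits=visits.get(user_id, 0))
        for user_id, user in sorted(users.items())
    ]


records = combine(load_users(CSV_SOURCE), JSON_SOURCE)
print(json.dumps(records, indent=2))
```

The combined records are plain dictionaries, so they can be written back out as JSON or CSV for whichever tool is used next.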
In this phase, you need to determine the methods and techniques for identifying relationships between input variables. Model planning is performed by applying exploratory data analysis (EDA) using different statistical formulas and visualization tools. SQL Analysis Services, R, SAS/ACCESS, and Python are some of the common tools used for this purpose.
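As a small illustration of the statistical side of EDA, the sketch below computes summary statistics for two variables and the Pearson correlation between them. The variable names and the numbers are made up for the example, and Pearson correlation is only one of many possible measures of a relationship between variables.

```python
import math
import statistics

# Hypothetical observations: advertising spend and the resulting sales.
ad_spend = [10.0, 20.0, 30.0, 40.0, 50.0]
sales = [25.0, 41.0, 62.0, 79.0, 103.0]


def summarize(values):
    """Basic descriptive statistics for one variable."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


print("ad_spend:", summarize(ad_spend))
print("sales:", summarize(sales))
print("correlation:", round(pearson(ad_spend, sales), 4))
```

A coefficient close to 1 or -1 suggests a strong linear relationship worth modeling; a value near 0 suggests the variable may contribute little to a linear model.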
This phase includes choosing the appropriate type of model, depending on whether the problem is a classification problem, a regression problem, or a clustering problem. After choosing the type of model, we need to choose carefully the algorithms that implement it. We also need to tune the hyperparameters of each model to achieve the desired performance. We need to ensure that there is a correct balance between performance on the training data and generalizability. We do not want a model that fits the training data well but performs poorly on new data.
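Hyperparameter tuning can be illustrated by trying several candidate values and keeping the one that scores best on data held out from training. The sketch below uses a one-dimensional k-nearest-neighbours classifier purely as a stand-in; the algorithm, the toy data, and the candidate values of `k` are assumptions, not something the process itself prescribes.

```python
from collections import Counter

# Hypothetical 1-D data as (feature, label); (2.4, "b") is deliberate noise.
train = [(1.0, "a"), (1.5, "a"), (2.0, "a"), (2.4, "b"), (3.0, "a"),
         (6.0, "b"), (6.5, "b"), (7.0, "b"), (7.5, "b"), (8.0, "b")]
validation = [(1.2, "a"), (2.5, "a"), (2.8, "a"), (6.2, "b"), (7.8, "b")]


def knn_predict(train_points, x, k):
    """Predict the majority label among the k training points nearest to x."""
    nearest = sorted(train_points, key=lambda point: abs(point[0] - x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train_points, eval_points, k):
    """Fraction of eval_points whose label is predicted correctly."""
    correct = sum(
        knn_predict(train_points, x, k) == label for x, label in eval_points
    )
    return correct / len(eval_points)


candidate_ks = [1, 3, 5]
scores = {k: accuracy(train, validation, k) for k in candidate_ks}
# On ties, max() keeps the first (smallest) candidate.
best_k = max(candidate_ks, key=lambda k: scores[k])

print("validation accuracy per k:", scores)
print("selected k:", best_k)
```

With `k = 1` the classifier follows the noisy training point and misclassifies its neighbour, which is the overfitting the text warns about; a larger `k` smooths that out. The selected value should still be confirmed on a separate test set that played no part in the tuning.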
The actual model building process starts in this phase. Here, the data scientist creates datasets for training and testing purposes. Techniques such as association, classification, and clustering are applied to the training dataset to build the model. The model, once prepared, is tested against the testing dataset. Commonly used model building tools include SAS Enterprise Miner, WEKA, SPSS Modeler, and MATLAB.
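The split-train-test workflow described above can be sketched in a few lines. The example assumes a nearest-centroid classifier as a deliberately simple model, a 75/25 split, and synthetic two-class data; none of these choices come from the tools listed, and all function names are hypothetical.

```python
import math
import random

# Hypothetical, well-separated two-class data as ((x, y), label).
data = [((1.0 + 0.1 * i, 1.0 + 0.2 * i), "low") for i in range(10)]
data += [((8.0 + 0.1 * i, 8.0 - 0.2 * i), "high") for i in range(10)]


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_centroids(rows):
    """Build the model: the mean point of each class in the training rows."""
    sums = {}
    for (x, y), label in rows:
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}


def predict(centroids, point):
    """Assign the label whose centroid is closest to the point."""
    return min(
        centroids,
        key=lambda label: math.hypot(
            point[0] - centroids[label][0], point[1] - centroids[label][1]
        ),
    )


train_rows, test_rows = train_test_split(data)
centroids = fit_centroids(train_rows)
hits = sum(predict(centroids, point) == label for point, label in test_rows)
test_accuracy = hits / len(test_rows)

print("train size:", len(train_rows), "test size:", len(test_rows))
print("test accuracy:", test_accuracy)
```

Fixing the random seed keeps the split reproducible, which matters when the same experiment has to be rerun later in the project.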
Here the model is evaluated to check whether it is ready to be deployed. The model is tested on unseen data and evaluated against a carefully chosen set of evaluation metrics. We also need to ensure that the model conforms to reality. If the evaluation does not give a satisfactory result, we must repeat the complete modeling process until we achieve the desired level on those metrics.
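To make "a set of evaluation metrics" and "the desired level" concrete, the sketch below computes accuracy, precision, recall, and F1 for a binary classifier and compares them with target thresholds. The labels, the predictions, and the target values are invented for illustration; which metrics and thresholds matter depends on the project.

```python
def confusion_counts(actual, predicted, positive):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def evaluate(actual, predicted, positive):
    """Return accuracy, precision, recall, and F1 as a dictionary."""
    tp, fp, fn, tn = confusion_counts(actual, predicted, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    return {
        "accuracy": (tp + tn) / len(actual),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / total if total else 0.0,
    }


def meets_targets(metrics, targets):
    """True when every targeted metric reaches its minimum value."""
    return all(metrics[name] >= minimum for name, minimum in targets.items())


# Hypothetical labels for unseen data and the model's predictions on them.
actual = ["spam", "spam", "ham", "ham", "spam", "ham", "ham", "spam"]
predicted = ["spam", "ham", "ham", "spam", "spam", "ham", "ham", "spam"]

metrics = evaluate(actual, predicted, positive="spam")
targets = {"precision": 0.9, "recall": 0.7}  # assumed project thresholds

print(metrics)
print("ready to deploy:", meets_targets(metrics, targets))
```

When the check fails, as it does for the precision target here, the process loops back to model planning and building rather than moving on to deployment.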
In this phase, final reports, briefings, code, and technical documents are delivered. Running the project on a small scale first gives you a clear overview of the complete project's performance and other components before full deployment. After thorough testing, the model is deployed into a real-time production environment.
In this phase, we check whether we have reached the goal set in the initial phase. After that, the key findings and results are communicated to all stakeholders and the business team. This also helps you decide whether the project is a success or a failure, based on the results the model produces.