Role of a Data Scientist in a Machine Learning Project

What is the role of a data scientist in a Machine Learning project? What steps or stages of work does a data scientist need to perform for an ML project to succeed?

What we are going to cover

  • Scoping and formulation
  • Data aspects
  • Technique selection and model development
  • Reporting and explanation
  • Implementation
  • Model maintenance

Scoping and formulation

This is the first and (probably) the most important step in the entire process. Proper scoping of the problem not only helps in the subsequent steps, but also sets the right expectations with the stakeholders. Formulating the problem correctly, in turn, gives a clear direction and keeps the work aimed at a goal with clear business impact. Let us look at the questions that help in scoping and formulating a problem.

  • What question(s) do the stakeholders want to answer?
  • What is the product and how does it operate?
  • What data points are available to address the question?
  • What is the intended use of the model? Will it be an online (real-time) or an offline (batch scoring) model?
  • What kind of solution is required?
  • How will the solution be developed? What will be the role of machine learning in the process?
  • What data is captured and available for the exercise?
  • What kind of machine learning algorithm (supervised or unsupervised) do we need to apply to generate the insights? If supervised ML, then what will the problem look like: classification or regression?
  • How should the dependent/outcome variable be defined?
  • What will the final solution look like?
  • How will the score be generated and consumed by the system?
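
To make the outcome-variable question concrete, here is a minimal sketch in Python. The churn scenario, the 90-day window, and the names `label_churn`, `purchases` and `snapshot` are all assumptions made for illustration; a real project would agree on its own definition with the stakeholders.

```python
from datetime import date, timedelta


def label_churn(purchase_dates, snapshot, window_days=90):
    """Return 1 (churned) if no purchase falls in the window after the snapshot, else 0."""
    window_end = snapshot + timedelta(days=window_days)
    # A customer counts as active if at least one purchase lies in (snapshot, window_end].
    active = any(snapshot < d <= window_end for d in purchase_dates)
    return 0 if active else 1


# Hypothetical purchase history per customer.
purchases = {
    "cust_a": [date(2023, 1, 5), date(2023, 4, 20)],
    "cust_b": [date(2023, 2, 11)],
}
snapshot = date(2023, 3, 31)

labels = {cust: label_churn(dates, snapshot) for cust, dates in purchases.items()}
print(labels)
```

Writing the definition as code forces the ambiguous parts (the snapshot date, the length of the window, what counts as activity) to be stated explicitly before any modeling starts.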

Data aspects

Data is the most important component in building a successful Machine Learning model. Feeding the correct data to the algorithm is critical; otherwise it will be garbage in, garbage out. Hence, it is the responsibility of the Data Scientist to ensure that the data is of high quality, that all required data is available, and that each and every attribute present in the data is understood. More details on why understanding every attribute in the data is important will be provided in the subsequent section “Reporting and explanation”.

  • Data from which time period should be considered?
  • Was there a paradigm shift in the business? In such scenarios, only data from the period that represents the current scenario should be considered.
  • What should be the split between in-time data and out-of-time data?
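
The time-period questions above can be sketched as a simple date-based split. This is only an illustration: the function name `split_by_time`, the field name `date`, and all the dates are hypothetical. Here, in-time data means the development window used for training and testing, and out-of-time data means a later period held back to check that the model still performs on more recent records.

```python
from datetime import date


def split_by_time(rows, dev_start, dev_end, oot_end):
    """Split rows into an in-time development window and a later out-of-time window."""
    in_time = [r for r in rows if dev_start <= r["date"] <= dev_end]
    out_of_time = [r for r in rows if dev_end < r["date"] <= oot_end]
    return in_time, out_of_time


rows = [
    {"id": 1, "date": date(2021, 6, 1)},   # before the assumed business change, excluded
    {"id": 2, "date": date(2022, 2, 15)},
    {"id": 3, "date": date(2022, 9, 30)},
    {"id": 4, "date": date(2023, 1, 10)},
]

in_time, out_of_time = split_by_time(
    rows,
    dev_start=date(2022, 1, 1),   # start after the assumed paradigm shift
    dev_end=date(2022, 12, 31),
    oot_end=date(2023, 3, 31),
)
print([r["id"] for r in in_time], [r["id"] for r in out_of_time])
```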

Technique selection and model development

Next in the stack comes selecting the proper modeling technique. Based on the formulation above, now is the time to choose the technique that fits the problem. While selecting the technique, a few important questions should be answered:

  • Will the technique help in answering the question?
  • What will be the complexity of the technique?
  • What will be the resource requirement?
  • Will the results obtained from the model justify the cost of implementing and running it?
  • What will be the estimated scoring time?
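
The scoring-time question can often be answered with a quick measurement before committing to a technique. Below is a rough sketch using only the Python standard library. `toy_score` is a hypothetical stand-in for a real model's prediction call, and `estimate_scoring_time` is an illustrative helper, not part of any library.

```python
import statistics
import time


def estimate_scoring_time(score_fn, records, repeats=5):
    """Return the median wall-clock seconds per record over several timed runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        for record in records:
            score_fn(record)
        elapsed = time.perf_counter() - start
        timings.append(elapsed / len(records))
    return statistics.median(timings)


def toy_score(record):
    # Stand-in for a real model's predict call (hypothetical).
    return 0.3 * record["age"] + 0.7 * record["income"]


records = [{"age": a, "income": a * 1000.0} for a in range(20, 70)]
per_record = estimate_scoring_time(toy_score, records)
print(f"median seconds per record: {per_record:.2e}")
```

For an online model the per-record figure is what matters; for a batch model, multiplying it by the expected batch size gives a first estimate of the run time.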

Reporting and explanation

After finishing model development and validation, the next step is to report the performance of the model. While reporting, make sure all metrics are computed on new, unseen validation data, and not on the training data or the test data that was used for validation during the development process.
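
As a small illustration, the sketch below computes a few common classification metrics on a hold-out set. The labels and predictions are made up, and accuracy, precision and recall are just one possible choice of metrics; the point is only that the inputs come from data the model never saw during development.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical labels and predictions for an unseen hold-out set.
holdout_true = [1, 0, 1, 1, 0, 0, 1, 0]
holdout_pred = [1, 0, 0, 1, 0, 1, 1, 0]

report = classification_metrics(holdout_true, holdout_pred)
print(report)
```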


Implementation

It is the job of the Data Scientist to develop the implementation code, replicating the data extraction, pre-processing, feature generation and finally score generation by the model. In complex systems, the Data Scientist should work hand-in-hand with the engineering team to ensure the implementation is done exactly the way it should be.
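
One way to check that an implementation matches development is a parity test: score the same records through both paths and compare the results. The sketch below assumes the scores are already available as dictionaries keyed by record id. The names `find_score_mismatches`, `dev_scores` and `prod_scores`, and the tolerance, are illustrative.

```python
import math


def find_score_mismatches(dev_scores, prod_scores, tol=1e-6):
    """Return ids whose production score is missing or differs from development beyond tol."""
    mismatches = []
    for record_id, dev_score in dev_scores.items():
        prod_score = prod_scores.get(record_id)
        if prod_score is None or not math.isclose(
            dev_score, prod_score, rel_tol=0.0, abs_tol=tol
        ):
            mismatches.append(record_id)
    return sorted(mismatches)


# Hypothetical scores for the same three records from both pipelines.
dev_scores = {"r1": 0.812, "r2": 0.104, "r3": 0.557}
prod_scores = {"r1": 0.812, "r2": 0.140, "r3": 0.557}

mismatched_ids = find_score_mismatches(dev_scores, prod_scores)
print(mismatched_ids)
```

Any id in the result points at a step (extraction, pre-processing or feature generation) that was not replicated exactly and is worth tracing with the engineering team.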

Model maintenance

So the model development and implementation are done, huh! That should be the end of the project. Wait, no! The role of a Data Scientist is not yet done. A proper charter needs to be built, and the required mechanisms should be developed, so that the model gets updated or re-trained with minimum intervention.
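
A retraining mechanism usually needs a trigger. One common choice, swapped in here purely as an example and not prescribed by this article, is the Population Stability Index (PSI), which compares the score distribution at development time with a recent one. The bin edges, the sample scores and the 0.2 threshold below are all assumptions to be tuned per project.

```python
import bisect
import math


def population_stability_index(expected, actual, bin_edges):
    """PSI between two samples using shared bins: sum((a - e) * ln(a / e))."""
    n_bins = len(bin_edges) - 1

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            # Locate the bin; clamp so values at or beyond the edges land in the end bins.
            idx = min(max(bisect.bisect_right(bin_edges, v) - 1, 0), n_bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) and division by zero for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e_shares = bin_shares(expected)
    a_shares = bin_shares(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(e_shares, a_shares))


bin_edges = [0.0, 0.25, 0.5, 0.75, 1.0]
dev_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]      # hypothetical development scores
recent_scores = [0.2, 0.4, 0.6, 0.7, 0.8, 0.8, 0.9, 0.9]   # hypothetical recent scores

psi = population_stability_index(dev_scores, recent_scores, bin_edges)
needs_retrain = psi > 0.2  # illustrative threshold, not a fixed rule
print(f"PSI = {psi:.3f}, retrain: {needs_retrain}")
```

Running a check like this on a schedule lets the model flag its own need for re-training instead of relying on someone noticing degraded results.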


The role of a data scientist is not just developing a Machine Learning model; it is much more than that. A good data scientist needs to handle scoping, formulation, data extraction, data quality checks, model development, performance reporting, implementation and maintenance of the model.


