What is a Data Science Life Cycle?
A data science life cycle is an iterative set of steps you take to deliver a data science project or analysis. Because every project and every team is different, every specific data science life cycle is different. However, most data science projects tend to flow through the same general set of steps.
We’ll illustrate how a hypothetical project progresses through a typical data science life cycle framework.
A General Data Science Life Cycle
Some data science life cycles narrowly focus on just the data, modeling, and assessment steps. Others are more comprehensive and start with business understanding and end with deployment.
The one we’ll walk through goes a step further and also covers operations. It also emphasizes agility more than other life cycles.
This life cycle has five steps:
- Problem Definition
- Data Investigation and Cleaning
- Minimal Viable Model
- Deployment and Enhancements
- Data Science Ops
These data science steps are not strictly linear. You start with step one and then proceed to step two. From there, however, you should flow among the steps as necessary, including back to earlier ones.
Several small iterative steps are better than a few larger comprehensive phases.
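To make the "not linear" point concrete, here is a minimal sketch in Python of how a project might move among the five steps. The names `Step`, `allowed_next`, and `is_valid_path` are hypothetical, and the rule that any step may follow any other after step two is our simplifying assumption, not a formal part of the life cycle.

```python
from enum import Enum


class Step(Enum):
    PROBLEM_DEFINITION = 1
    DATA_INVESTIGATION_AND_CLEANING = 2
    MINIMAL_VIABLE_MODEL = 3
    DEPLOYMENT_AND_ENHANCEMENTS = 4
    DATA_SCIENCE_OPS = 5


def allowed_next(history):
    """Return the set of steps a project may move to, given the steps visited so far."""
    if not history:
        # Every project starts with step one.
        return {Step.PROBLEM_DEFINITION}
    if history == [Step.PROBLEM_DEFINITION]:
        # Step two always follows the first pass through step one.
        return {Step.DATA_INVESTIGATION_AND_CLEANING}
    # Assumption: after that, any step other than the current one is reachable.
    return set(Step) - {history[-1]}


def is_valid_path(path):
    """Check whether a sequence of steps respects the rules in allowed_next."""
    history = []
    for step in path:
        if step not in allowed_next(history):
            return False
        history.append(step)
    return True


# A hypothetical project that loops back to data cleaning after a first model.
example_path = [
    Step.PROBLEM_DEFINITION,
    Step.DATA_INVESTIGATION_AND_CLEANING,
    Step.MINIMAL_VIABLE_MODEL,
    Step.DATA_INVESTIGATION_AND_CLEANING,
    Step.MINIMAL_VIABLE_MODEL,
    Step.DEPLOYMENT_AND_ENHANCEMENTS,
    Step.DATA_SCIENCE_OPS,
    Step.PROBLEM_DEFINITION,
]
print(is_valid_path(example_path))
```

The example path revisits data cleaning after the first model and returns to problem definition once the model is in operation, which is the kind of small iteration this life cycle encourages.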
What are other popular Data Science Life Cycles?
The generic life cycle above is one of dozens (hundreds?) you can find online. We’ll explore some of the more popular ones.
Data Mining Life Cycles
These three classic data mining processes are often grouped under the general umbrella of data science life cycles. All of them date from the 1990s, and they tend to be narrower in scope. Specifically, the KDD Process and SEMMA focus on the data problem and not the business problem. Only CRISP-DM has a deployment phase. None of them has an operations phase.
- Knowledge Discovery in Databases (KDD) Process: This is the general process of discovering knowledge in data through data mining, that is, the extraction of patterns and information from large datasets using machine learning, statistics, and database systems.
- SEMMA: SAS developed Sample, Explore, Modify, Model, and Assess (SEMMA) to help guide users through tools in SAS Enterprise Miner for data mining problems.
- CRISP-DM: The CRoss-Industry Standard Process for Data Mining is the most popular methodology for data science and advanced analytics projects. It has six steps: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment. It has a broader focus than SEMMA and the KDD Process but likewise lacks the operational aspects of a data science product life cycle.
Modern Data Science Life Cycles
The life cycles below are more modern approaches that are specific to data science. Like the data mining processes, OSEMN focuses on the core data problem. Most of the others, especially Domino’s, tend to focus on the fuller solution.
- OSEMN: Standing for Obtain, Scrub, Explore, Model, and iNterpret, OSEMN is a five-phase life cycle. Go to this Towards Data Science post to learn more.
- Microsoft TDSP: The Team Data Science Process combines many modern agile practices with a life cycle similar to CRISP-DM. It has five steps: Business Understanding, Data Acquisition and Understanding, Modeling, Deployment, and Customer Acceptance.
- Domino Data Lab Life Cycle: This life cycle is perhaps the most similar to the generic life cycle above, in part because it includes a final operations stage. Its six steps are: Ideation, Data Acquisition and Exploration, Research and Development, Validation, Delivery, and Monitoring.
- Lesser-Known Life Cycles: Jeff found several interesting but lesser-known life cycles described in various blog posts. See his post on Data Science Workflows to learn more.
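One way to compare these life cycles is to write their phases down as data and ask which ones reach delivery or operations. The sketch below is only illustrative: the `LIFE_CYCLES` dictionary and the `covers` helper are hypothetical names, the keyword match on phase names is a crude heuristic of our own, and the KDD Process is left out because its phases are not listed here.

```python
# Phase lists as described in this article.
LIFE_CYCLES = {
    "SEMMA": ["Sample", "Explore", "Modify", "Model", "Assess"],
    "CRISP-DM": [
        "Business Understanding", "Data Understanding", "Data Preparation",
        "Modeling", "Evaluation", "Deployment",
    ],
    "OSEMN": ["Obtain", "Scrub", "Explore", "Model", "iNterpret"],
    "Microsoft TDSP": [
        "Business Understanding", "Data Acquisition and Understanding",
        "Modeling", "Deployment", "Customer Acceptance",
    ],
    "Domino Data Lab": [
        "Ideation", "Data Acquisition and Exploration",
        "Research and Development", "Validation", "Delivery", "Monitoring",
    ],
}


def covers(name, keywords):
    """Return True if any phase name of the life cycle contains one of the keywords.

    This is a rough, case-insensitive substring check, not a formal classification.
    """
    phases = [phase.lower() for phase in LIFE_CYCLES[name]]
    return any(keyword in phase for phase in phases for keyword in keywords)


# Assumption: "deploy" or "deliver" in a phase name signals a delivery phase,
# and "monitor" signals an operations phase.
with_delivery = [name for name in LIFE_CYCLES if covers(name, ("deploy", "deliver"))]
with_operations = [name for name in LIFE_CYCLES if covers(name, ("monitor",))]

print("Delivery phase:", with_delivery)
print("Operations phase:", with_operations)
```

Under these assumptions, the check agrees with the descriptions above: SEMMA and OSEMN stop at the data problem, while only the Domino life cycle names an operations-style phase.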
There are numerous data science life cycles to choose from. Most communicate the same basic steps necessary to deliver a data science project but often have a distinct angle.
The life cycle presented here stresses the need for agility and covers the broader data science product life cycle.
Regardless of the life cycle you use, combine it with a collaboration process so that your team members can coordinate effectively with one another and with stakeholders.
Good luck. This journey is a challenge. But it can be fun. Have a blast in your next data science project!