You are looking for a data science job and don’t have much experience. Having a few public portfolio projects can help you stand out in an interview. However, it’s very important that these projects show you know what the job requires.

In 2023 it’s becoming more and more important to know about data engineering, not just machine learning. This means you need to know how to collect the data, define the problem you are going to solve, deploy your solution and communicate it. Seems like a lot to learn, right?

My recommendation is to focus on having at least one end-to-end project in your portfolio. Here is a step-by-step approach to defining your project:

1. Look for a realistic dataset

The idea here is to look for a real-world dataset. For example, you should avoid the Titanic, Iris or MNIST datasets, as these have been overused.

Look for a dataset from an industry similar to the one you are applying to. Interested in applying to a retail company? Look for a dataset related to that industry.

Unless you are applying for an AI job, my recommendation is to find an interesting tabular dataset to work with. Old Kaggle competitions are generally a good place to start. Try this one if you are up for a challenge!

2. Focus on what matters

What you want to show with an end-to-end project is that you understand the whole process of going from raw data to a finished data product. This “data product” could be a data visualization of insights, something interesting you found in the data, or a deployed model.

Nobody will care about the exact value of the error metric you choose as long as it seems reasonable. I’d be very suspicious if your AUC is 0.99, for example… there is probably a bug!

3. Key components

An awesome data science project should have the following components:

  • Real-world dataset: If you scraped it yourself and the process involved some data cleansing, that’s a plus.
  • Use a database: Even if you used a Kaggle dataset, which generally comes as a CSV file, you can insert it into a database (a minimal sketch of this step follows this list).
  • Training pipeline: This is where you do the typical machine learning work of feature engineering, model training and evaluation. You can do this in a notebook, but it’s much better if you clean up the code and write a script with the training logic (see the training sketch after this list).
  • Inference pipeline: Here you would write a script that takes “new data” as input, generates predictions and stores them in a database (see the inference sketch after this list).
  • Deploy as an API: Develop an API with the flask or flask-restful packages where, given a new observation passed in as JSON, you return the model’s prediction (see the API sketch after this list). This is not extremely complex to build, yet most people don’t know how to do it. It’s a great plus!
  • Describe your logic: Assume you have to present your project to a senior data scientist who will review the methodology. Include a notebook explaining the hows and whys of your choices. This involves explaining why you used a given evaluation metric, which cut-off you chose (if it’s a binary classification model) and why you decided to use XGBoost instead of a logistic regression model, for example.
  • Communicate results: Assume you have to present your model to an executive. Create some slides for a non-technical audience. This is the place to include some nice graphs or an interactive web application.
  • Write a blog: Create a Medium blog and share your project. Include technical details and some data visualization results.
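
For the database step, here is a minimal sketch of how you could load a CSV into a local SQLite database with pandas. The file path, database name and table name are placeholders for whatever your own dataset uses.

```python
import sqlite3

import pandas as pd

# Load the raw CSV (path and table name are placeholders for your own dataset)
df = pd.read_csv("data/raw/transactions.csv")

# Write it into a local SQLite database so later steps can query it instead of the CSV
conn = sqlite3.connect("data/project.db")
df.to_sql("transactions", conn, if_exists="replace", index=False)

# Quick check that the data is queryable
print(pd.read_sql("SELECT COUNT(*) AS n FROM transactions", conn))
conn.close()
```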
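For the training pipeline, a script along these lines is enough to start with. It assumes the data already sits in the SQLite table from the previous sketch and that there is a binary target column (here called "churned", purely as a placeholder); swap in your own features, target and model settings.

```python
import sqlite3

import joblib
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

TARGET = "churned"  # placeholder target column, replace with your own label


def load_data(db_path="data/project.db"):
    """Read the training table from the project database."""
    conn = sqlite3.connect(db_path)
    df = pd.read_sql("SELECT * FROM transactions", conn)
    conn.close()
    return df


def train():
    df = load_data()
    X = df.drop(columns=[TARGET])
    y = df[TARGET]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    model = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
    model.fit(X_train, y_train)

    # A sanity-check metric, not the point of the project
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"Test AUC: {auc:.3f}")

    joblib.dump(model, "model.joblib")


if __name__ == "__main__":
    train()
```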
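The inference pipeline can then reuse the saved model: read the “new data” from the database, score it and write the predictions back. Again, the table names are placeholders.

```python
import sqlite3

import joblib
import pandas as pd


def run_inference(db_path="data/project.db", model_path="model.joblib"):
    """Score 'new data' and store the predictions back in the database."""
    model = joblib.load(model_path)

    conn = sqlite3.connect(db_path)
    # "new_observations" is a placeholder table with the same features used in training
    new_data = pd.read_sql("SELECT * FROM new_observations", conn)

    new_data["prediction"] = model.predict_proba(new_data)[:, 1]
    new_data.to_sql("predictions", conn, if_exists="append", index=False)
    conn.close()


if __name__ == "__main__":
    run_inference()
```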
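And for the API, a minimal flask sketch could look like this: it loads the trained model and returns a prediction for a single observation passed in as JSON. The route name and example feature names are assumptions, not a fixed convention.

```python
import joblib
import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.joblib")  # model saved by the training pipeline


@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON object with one key per feature, e.g. {"age": 34, "plan": "basic"}
    observation = request.get_json()
    df = pd.DataFrame([observation])
    probability = float(model.predict_proba(df)[:, 1][0])
    return jsonify({"prediction": probability})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

You can test it locally with something like curl -X POST -H "Content-Type: application/json" -d '{"age": 34, "plan": "basic"}' http://localhost:5000/predict.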

4. Take time to do it well

This type of end-to-end project can take a reasonable amount of time. It can take a junior data scientist around 3 months to build a project with this level of detail and quality.

The upside of learning these skills is massive. It will help you immensely in interviews, and you will also improve as a data scientist.

Good luck!
