  • There are multiple ways to detect outliers. In this post I’ll cover two simple methods based on statistics and a third method that takes into account the time series nature of the data. Let’s plot the empirical distribution using a histogram. For this type of univariate plot I generally use the pandas .plot() method. […]

  • Here is an example of the simplest typical approach to filtering out missing values from a pandas DataFrame. One small detail: in pandas, missing values (NAs) are represented as NaN (not a number). Here are some potential use cases. You need to make sure you are not removing “good” data. Filter out all missing […]

  • In this post I want to explain how to read a CSV file and benchmark multiple methods available in pandas and Polars. Let’s use the flights dataset, which is ~500MB, so it’s fairly realistic for a benchmark. It’s still on the small end of the spectrum but it will be useful to compare results. 1. […]

  • Before trying to read the file, open a few rows in a text editor. In this example, I can see the lines are separated by a newline character. This means the file is not really a proper JSON file. However, we can still read it in pandas. This file format is generally called JSONL; this […]

  • There are multiple ways to do this in pandas; let’s take a look at an example. Each has its merits depending on the use case. It’s one of those cases where you need to know all of them. Let’s first read some data. I’ll be using the diamonds dataset. Method 1: This is a great […]

  • There are at least 4 methods to do this in pandas, but only 2 are simple and efficient. The best solution involves sorting the pandas DataFrame and then selecting the top-n results. On this concrete dataset, I might want to answer the question: Which are the top-5 countries with the highest mortality each […]

  • There are multiple ways you can iterate or loop over rows in pandas. However, this is often a “code smell” that implies you are not very experienced with pandas. First, let me cover how to iterate over rows. Let’s assume we have this pandas DataFrame. First of all, let me cover how to iterate […]

  • I’ve developed a video series where I teach pandas, data analysis and data visualization while working on real-world datasets. I think learning to code by solving an actual problem is a lot more useful than doing a tutorial on a particular tool. Each video covers a different dataset; this is the first video of the […]

  • In this video I cover how you can do an exploratory data analysis with Python using pandas and matplotlib. The idea is to use an online retailer’s e-commerce dataset for the analysis. I think this is a realistic dataset; you can encounter something similar in a data science job. Here is the video with the […]
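
The JSONL excerpt above notes that a file with one JSON object per line is not a proper JSON file but can still be read in pandas. The following is a minimal sketch of that idea, assuming `pandas.read_json` with `lines=True`; the excerpt is truncated, so the call the post actually uses may differ. The sample records are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical sample data: one JSON object per line (JSONL),
# rather than a single JSON document.
raw = io.StringIO(
    '{"id": 1, "city": "Lima"}\n'
    '{"id": 2, "city": "Oslo"}\n'
    '{"id": 3, "city": "Kyoto"}\n'
)

# lines=True tells pandas to parse each line as a separate JSON record.
df = pd.read_json(raw, lines=True)
print(df)
```

With a real file, a path can be passed in place of the in-memory buffer.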