  • In this post I’m going to cover how you can run a local LLM (large language model) in the style of ChatGPT on your own computer. We will be using a quantized version of the Vicuña model. What does this all mean? Well, Facebook initially released a “foundational model” called LLaMA. Using this “base […]

  • There are multiple ways to detect outliers. In this post I’ll cover two simple methods based on statistics and a third method that takes into account the time series nature of the data. Let’s plot the empirical distribution using a histogram. For this type of univariate plot I generally use the pandas .plot() method. […]

  • Here is an example of the simplest typical approach to filtering out missing values from a pandas DataFrame. One small detail: in pandas, missing values (NAs) are represented as NaN (not a number). Here are some potential use cases. You need to make sure you are not removing “good” data. Filter out all missing […]

  • If you want to fit a linear model to get insights and you care more about the coefficients than about making predictions, you will love this trick. The statsmodels formula interface is basically a copy of functionality available in the R programming language. Let’s give it a try with some data. Linear Model with […]

  • In this post I want to explain how to read a CSV file and benchmark the multiple methods available in pandas and Polars. Let’s use the flights dataset, which is ~500MB, so it’s fairly realistic for a benchmark. It’s still on the small end of the spectrum, but it will be useful for comparing results. 1. […]

  • Before trying to read the file, open a few rows in a text editor. In this example, I can see the lines are separated by a newline character. This means the file is not really a proper JSON file. However, we can still read it in pandas. This file format is generally called JSONL, this […]

  • There are multiple ways to do this in pandas; let’s take a look at an example. Each has its merits depending on the use case. It’s one of those cases where you need to know all of them. Let’s first read some data. I’ll be using the diamonds dataset. Method 1: This is a great […]

  • There are at least 4 methods to do this in pandas, but only 2 are simple and efficient. The best solution involves sorting the pandas DataFrame and then selecting the top-n results. On this concrete dataset, I might want to answer the question: Which are the top-5 countries with the highest mortality each […]

  • There are multiple ways you can iterate or loop over rows in pandas. However, this is often a “code smell” that implies you are not very experienced with pandas. First, let me cover how to iterate over rows. Let’s assume we have this pandas DataFrame. First of all, let me cover how to iterate […]
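
The JSONL excerpt above notes that a file with one JSON object per line is not really a proper JSON file. Here is a minimal sketch of that difference, using only Python's standard library rather than the pandas reader the post itself relies on. The sample records and the `parse_jsonl` helper are made up for illustration.

```python
import json

# Hypothetical sample: two records, one JSON object per line (JSONL).
raw = '{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n'


def parse_jsonl(text):
    """Parse newline-delimited JSON into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Parsing the whole text as a single JSON document fails,
# because a second object follows the first one.
try:
    json.loads(raw)
    whole_file_ok = True
except json.JSONDecodeError:
    whole_file_ok = False

# Parsing line by line works.
records = parse_jsonl(raw)
```

In pandas, the same format can be read directly with `pd.read_json(path, lines=True)`.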