Blog

  • Home / Python / How to rewrite your SQL queries in Pandas, and more

How to rewrite your SQL queries in Pandas, and more

  • July 16, 2021

Some of us are familiar with data manipulation in SQL but not in Python, we tend to switch frequently between SQL and Python in a project, result in reducing our efficiency and productivity. In fact, we can achieve a similar result of SQL in Python using Pandas.

Getting Started

As usual, we would need to install pandas package if we do not have it.

conda install pandas

We will be using the famous titanic dataset from Kaggle in this session.

After installing the package and downloading the data, we need to import them in our Python environment.

sql-with-python.py – Medium


We will use pandas dataframe to store the data and use various pandas function to manipulate the dataframe.

SELECT, DISTINCT, COUNT, LIMIT

Let’s start with simple SQL queries which we use it frequently.

sql-with-python1.md – Medium


titanic_df[“age”].unique() will return an array of unique values here, so we would need to use len() to get the count of unique values.

SELECT, WHERE, OR, AND, IN (SELECT with conditions)

You should know how to explore the dataframe in simple way after the first part. Now let’s try with some conditions (which is WHERE clause in SQL).

sql-with-python2.md – Medium


If we only want to select specific columns from the dataframe, we can select using another pair of square bracket.

isin() works exactly the same as IN in SQL. To use NOT IN, we would need to use the negation(~)in Python to achieve the same result.

GROUP BY, ORDER BY, COUNT

GROUP BY and ORDER BY are also the popular SQL that we use to explore data. Let’s try this in Python now.

sql-with-python3.md – Medium


If we want to sort the COUNT only, we can just pass the boolean into the sort_values function. If we are going to sort multiple columns, then we have to pass an array of boolean to the sort_values function.

sum() function will give us all aggregate numeric sum columns in the dataframe, we need to specify the column name using square bracket if we just need a specific column.

MIN, MAX, MEAN, MEDIAN

Lastly, let’s try with some common statistical functions which are important in data exploratory.

sql-with-python4.md – Medium


Since SQL doesn’t have a median function, I will use BigQuery APPROX_QUANTILES to get the median of age.

The pandas aggregation function .agg() also supports other function like sum.

Now you have learnt how to rewrite your SQL queries in Python with pandas. I hope this article is useful to you.

-Metacoder