Vlad Filippov

full-stack software developer / open source hacker

Working with Python Pandas

Pandas, the Python data analysis library, has very good documentation, including both the API reference and the user guide. I like how descriptive it is, and the clean design of the docs site makes it very readable. However, I find that it still takes a bit of time to get started with Pandas, even if you have years of Python programming experience.

To get started, install the module. This may be an obvious step, but here it is anyway:

# in a terminal, install the module
pip3 install pandas

# in a new or existing .py file, import the module and make sure it works
import pandas as pd

Pandas can load data from many sources, including CSV files with pd.read_csv() and SQL databases with pd.read_sql(). In this article, we create a DataFrame “inline” to focus on the rest of the Pandas API. We can create an example frame like this:

data = pd.DataFrame({
    "name":["Tim","Roger","Bob"],
    "town":["Toronto","Ottawa","London"],
    "balance":[50000,52000,None],
})

Printing this out with print(data) will produce a nice table for you:

    name     town  balance
0    Tim  Toronto  50000.0
1  Roger   Ottawa  52000.0
2    Bob   London      NaN
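
If your data lives in a CSV file instead, the same frame can be loaded with read_csv(). Here is a minimal sketch; the CSV text and the names csv_text and loaded are made up for illustration, and io.StringIO stands in for a real file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """name,town,balance
Tim,Toronto,50000
Roger,Ottawa,52000
Bob,London,
"""

# read_csv accepts a file path or any file-like object.
loaded = pd.read_csv(io.StringIO(csv_text))
print(loaded)
```

The empty balance field for Bob is read as a missing value, just like the None in the inline example.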

Now that we have our sample data all set up, I want to go over some of the common and crucial Pandas operations that you will use again and again. These should also help you understand how the library works and what it is capable of.

Finding items by index

# Set up an "index" on name. Now we can use ".loc" to find rows by name
data.set_index('name', inplace=True)

# Look up item by name "Tim" using the index
print(data.loc["Tim", :])

# Look up items by ".iloc", the integer position of the row (0 is the first row).
print(data.iloc[0, :])
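
To make the difference between the two lookups a bit more concrete, here is a small sketch that pulls out single cells and a subset of rows. The frame is rebuilt under the made-up name people so the snippet runs on its own.

```python
import pandas as pd

# Same sample data as above, rebuilt so this snippet is self-contained.
people = pd.DataFrame({
    "name": ["Tim", "Roger", "Bob"],
    "town": ["Toronto", "Ottawa", "London"],
    "balance": [50000, 52000, None],
}).set_index("name")

# Label-based: one cell, then a list of row labels and column labels.
tim_town = people.loc["Tim", "town"]
subset = people.loc[["Tim", "Bob"], ["town"]]

# Position-based: last row, first column.
last_town = people.iloc[-1, 0]

print(tim_town, last_town)
print(subset)
```

Note that set_index() without inplace=True returns a new frame, which makes it easy to chain.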

Rewriting column names

# The following renames the columns in our data frame.
# "name" is now the index, so the remaining columns are "town" and "balance".
cols = list(data.columns)
cols[0] = 'Location'
cols[1] = 'Balance'
data.columns = cols
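
If you would rather not depend on column positions, rename() with a mapping is an alternative. This is a sketch on a rebuilt copy of the sample data; people and renamed are names I picked for illustration.

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["Tim", "Roger", "Bob"],
    "town": ["Toronto", "Ottawa", "London"],
    "balance": [50000, 52000, None],
}).set_index("name")

# Map old column names to new ones; columns not listed are left alone.
renamed = people.rename(columns={"town": "Location", "balance": "Balance"})

print(list(renamed.columns))
```

rename() returns a new frame here, so people keeps its original column names.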

Get the number of rows and columns

# returns a (rows, columns) tuple describing the size of the data
data.shape

Ranges

# Looking up first and last rows
data.head(1)
data.tail(1)

# Using slice syntax to skip the first row; other slice ranges work here as well
data[1:]
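
One detail that is easy to trip over: position-based slices exclude the end, while label-based slices with .loc include it. A short sketch on a rebuilt copy of the sample data (the variable names are mine):

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["Tim", "Roger", "Bob"],
    "town": ["Toronto", "Ottawa", "London"],
    "balance": [50000, 52000, None],
}).set_index("name")

# Positional slice: everything after the first row.
rest = people[1:]

# Positional slice with .iloc: rows 0 and 1, the end position is excluded.
first_two = people.iloc[0:2]

# Label slice with .loc: both "Tim" and "Roger" are included.
tim_to_roger = people.loc["Tim":"Roger"]

print(len(rest), list(first_two.index), list(tim_to_roger.index))
```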

Data types

# query the data type
data.dtypes

# Location     object
# Balance     float64
# dtype: object
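
The balance column shows up as float64 even though we typed whole numbers, because the missing value is stored as NaN, which is a float. If you want integers with a missing value, the nullable "Int64" type is one option. A minimal sketch, with balances and nullable as illustrative names:

```python
import pandas as pd

# The None forces pandas to fall back to float64 with NaN.
balances = pd.Series([50000, 52000, None])
print(balances.dtype)

# Convert to the nullable integer type; the gap becomes <NA>.
nullable = balances.astype("Int64")
print(nullable.dtype)
print(nullable.isna().sum())
```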

Query the data

# look up the data based on columns
data[['Location', 'Balance']]

# Get all the positive balances
print(data[data['Balance'] > 0])

# You can also use the query syntax for this.
# Backticks are only required for column names that are not valid Python identifiers.
print(data.query('`Balance` > 0'))
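
query() can also refer to regular Python variables by prefixing them with @. Here is a sketch on a rebuilt frame; accounts and threshold are names I made up for the example.

```python
import pandas as pd

accounts = pd.DataFrame(
    {
        "Location": ["Toronto", "Ottawa", "London"],
        "Balance": [50000, 52000, None],
    },
    index=pd.Index(["Tim", "Roger", "Bob"], name="name"),
)

# Hypothetical cutoff, kept in a normal Python variable.
threshold = 51000

# "@threshold" pulls the value from the surrounding Python scope.
above = accounts.query("Balance > @threshold")
print(above)
```

Rows with a missing balance never match a comparison like this, so Bob is left out.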

Conditional Lookup

# an AND statement
print(data[ (data['Balance'] > 0) & (data['Location'] == 'Toronto') ])

# an OR statement
print(data[ (data['Balance'] > 0) | (data['Location'] == 'London') ])
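
A few more building blocks tend to come up with conditions: isna() for missing values, isin() for matching against a list, and ~ for negation. The sketch below uses a rebuilt frame under the illustrative name accounts.

```python
import pandas as pd

accounts = pd.DataFrame(
    {
        "Location": ["Toronto", "Ottawa", "London"],
        "Balance": [50000, 52000, None],
    },
    index=pd.Index(["Tim", "Roger", "Bob"], name="name"),
)

# Rows where the balance is missing.
missing = accounts[accounts["Balance"].isna()]

# Rows whose location is one of several values.
in_cities = accounts[accounts["Location"].isin(["Toronto", "London"])]

# Negate a condition with "~"; the parentheses are required.
not_toronto = accounts[~(accounts["Location"] == "Toronto")]

print(list(missing.index), list(in_cities.index), list(not_toronto.index))
```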

Sorting

# returns a sorted copy with the largest balance first; "data" itself is unchanged
data.sort_values(by='Balance', ascending=False)

For more advanced sorting methods, check out this article: Pandas Sort: Your Guide to Sorting Data in Python.
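
One sorting detail worth knowing with our sample data is where the missing balance ends up. By default it goes last, and na_position lets you change that. A small sketch on a rebuilt frame (accounts is an illustrative name):

```python
import pandas as pd

accounts = pd.DataFrame(
    {
        "Location": ["Toronto", "Ottawa", "London"],
        "Balance": [50000, 52000, None],
    },
    index=pd.Index(["Tim", "Roger", "Bob"], name="name"),
)

# Descending sort; rows with a missing balance are placed last by default.
ordered = accounts.sort_values(by="Balance", ascending=False)

# Ascending sort with missing balances moved to the top.
nan_first = accounts.sort_values(by="Balance", na_position="first")

print(list(ordered.index))
print(list(nan_first.index))
```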

Save to CSV file

# Saves the data into a file
data.to_csv('some.csv')
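
To check what actually gets written, you can call to_csv() without a path, which returns the CSV as a string, and then read it back. This is a sketch with made-up names (accounts, csv_text, restored), and it uses io.StringIO in place of a real file.

```python
import io

import pandas as pd

accounts = pd.DataFrame(
    {
        "Location": ["Toronto", "Ottawa", "London"],
        "Balance": [50000, 52000, None],
    },
    index=pd.Index(["Tim", "Roger", "Bob"], name="name"),
)

# With no path given, to_csv returns the CSV text instead of writing a file.
csv_text = accounts.to_csv()
print(csv_text)

# Read it back, telling pandas which column holds the index.
restored = pd.read_csv(io.StringIO(csv_text), index_col="name")
print(restored)
```

The index is written as the first column by default; pass index=False to to_csv() if you do not want it in the file.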

As recommended by the official Pandas website, the Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython book can also be very useful if you are looking to learn more about Python data analysis. I hope these syntax examples saved you some time learning the Pandas library :).

© Vlad Filippov