Vlad Filippov

full-stack software developer / open source hacker

Working with Python Pandas

Pandas, the Python data analysis library, has very good documentation, including both an API reference and a user guide. I like how descriptive it is, and the clean design of the docs site makes it very readable. However, I find it still takes a bit of time to get started with Pandas, even with years and years of Python programming experience.

This article was updated in March 2026 to cover the latest changes in the Pandas 3.0 release. There are some significant improvements worth knowing about, including better performance with Arrow-backed dtypes and the new Copy-on-Write default behaviour, which eliminates those annoying SettingWithCopyWarning messages.
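To see what Copy-on-Write means in practice, here is a quick sketch (with hypothetical data) of the chained-assignment pattern that used to trigger the warning, and the single `.loc` assignment that works reliably instead:

```python
import pandas as pd

df = pd.DataFrame({"town": ["Toronto", "Ottawa"], "balance": [50000.0, 52000.0]})

# Chained assignment like df[df["balance"] > 51000]["balance"] = 0 used to
# raise SettingWithCopyWarning and sometimes silently do nothing. Under
# Copy-on-Write it never modifies df, so the reliable pattern is a single
# .loc call that selects rows and the column in one step:
df.loc[df["balance"] > 51000, "balance"] = 0

print(df["balance"].tolist())  # [50000.0, 0.0]
```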

To get started, install the module. This may be an obvious step, but anyway:

# Install with the optional performance dependencies
pip install "pandas[performance]"
# or the basic version
pip install pandas

# in a new or existing .py file, import the module and make sure it works
import pandas as pd

Pandas can load data from many sources, including CSV files via .read_csv() and SQL databases via .read_sql(). In this article, we create a DataFrame “inline” to focus on the rest of the Pandas API. We can create an example frame like this:

data = pd.DataFrame({
    "name": ["Tim", "Roger", "Bob", "Sarah", "Mike"],
    "town": ["Toronto", "Ottawa", "London", "Montreal", "Vancouver"],
    "balance": [50000, 52000, None, 65000, 48000],
    "account_type": ["Savings", "Checking", "Savings", "Investment", "Checking"],
})

Printing this out with print(data) will produce a nice table for you:

    name       town   balance account_type
0    Tim    Toronto   50000.0      Savings
1  Roger     Ottawa   52000.0     Checking
2    Bob     London       NaN      Savings
3  Sarah   Montreal   65000.0   Investment
4   Mike  Vancouver   48000.0     Checking

Now that we have our sample data all set up, I want to go over some of the common and crucial Pandas operations that will help you on your way to becoming a Pandas expert. These should also help you understand how the library works and what it is capable of. I’ve also included some modern patterns that have become popular as of 2026.

Finding items by index

# Set up an "index" on name. Now we can use ".loc" to find things
data.set_index('name', inplace=True)

# Look up item by name "Tim" using the index
print(data.loc["Tim", :])

# Look up rows by position with ".iloc" (integer location)
print(data.iloc[0, :])

# Modern approach: using query() for readable filtering
toronto_accounts = data.query('town == "Toronto"')
print(toronto_accounts)
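.loc is not limited to a single label; it also accepts a list of labels, which is handy for pulling several records at once. A small self-contained sketch using the same kind of name index:

```python
import pandas as pd

data = pd.DataFrame({
    "name": ["Tim", "Roger", "Bob"],
    "town": ["Toronto", "Ottawa", "London"],
    "balance": [50000, 52000, None],
}).set_index("name")

# Passing a list of index labels to .loc returns all matching rows,
# in the order the labels are given
subset = data.loc[["Tim", "Bob"]]
print(subset)
```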

Rewriting column names

# The following replaces the column labels positionally.
# After the set_index above, the remaining columns are town, balance, account_type.
cols = list(data.columns)
cols[0] = 'Location'
cols[1] = 'Balance'
cols[2] = 'AccountType'
data.columns = cols

# Equivalent rename() approach, which maps old names to new ones explicitly
# (use one of the two approaches, not both)
data = data.rename(columns={
    'town': 'Location',
    'balance': 'Balance',
    'account_type': 'AccountType'
})

Get the number of rows and columns

# returns a (rows, columns) tuple
data.shape

# Get more detailed info about your DataFrame
data.info()

Ranges

# Looking up first and last rows
data.head(1)
data.tail(1)

# Plain list-style slicing also works; other range options work here as well
data[1:]

# Get specific rows by position
data.iloc[1:3]  # rows 1 and 2
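One subtlety worth knowing: when you slice by label with .loc, both endpoints are included, unlike Python's usual half-open slices. A small sketch (rebuilding a minimal frame inline):

```python
import pandas as pd

data = pd.DataFrame({
    "name": ["Tim", "Roger", "Bob", "Sarah", "Mike"],
    "Balance": [50000, 52000, None, 65000, 48000],
}).set_index("name")

# Label-based slicing with .loc includes BOTH endpoints,
# so "Sarah" is part of the result here
middle = data.loc["Roger":"Sarah"]
print(list(middle.index))  # ['Roger', 'Bob', 'Sarah']
```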

Data types

# query the data type
data.dtypes

# Location     object
# Balance     float64
# AccountType  object
# dtype: object

# Modern approach: optimize data types for better performance
data['Balance'] = pd.to_numeric(data['Balance'], errors='coerce')
data['AccountType'] = data['AccountType'].astype('category')  # saves memory
data['Location'] = data['Location'].astype('string')  # dedicated string dtype (the default in pandas 3.0)
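The category conversion is not just cosmetic; for a column with a few repeated values the memory savings are real. A sketch that measures the difference on some made-up town data:

```python
import pandas as pd

# A column with only two distinct values, repeated many times
towns = pd.Series(["Toronto", "Ottawa", "Toronto", "Toronto", "Ottawa"] * 1000)

# deep=True counts the actual string storage, not just pointers
as_object_bytes = towns.memory_usage(deep=True)
as_category_bytes = towns.astype("category").memory_usage(deep=True)

# Categories store each distinct string once plus small integer codes,
# so the categorical version uses far less memory
print(as_object_bytes, as_category_bytes)
```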

Query the data

# look up the data based on columns
data[['Location', 'Balance']]

# Get all the positive balances (rows where Balance is NaN are excluded)
print(data[data['Balance'] > 0])

# You can also use the query syntax for this (recommended for readability)
print(data.query('Balance > 0'))

# Query with string columns
toronto_accounts = data.query('Location == "Toronto"')
high_balance = data.query('Balance > 50000')
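query() can also reference local Python variables with the @ prefix, which keeps thresholds out of hard-coded strings. A minimal sketch:

```python
import pandas as pd

data = pd.DataFrame({
    "Location": ["Toronto", "Ottawa", "Montreal"],
    "Balance": [50000, 52000, 65000],
})

# The @ prefix pulls in a variable from the surrounding Python scope
threshold = 51000
rich = data.query("Balance > @threshold")
print(rich)
```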

Conditional Lookup

# an AND statement
print(data[(data['Balance'] > 50000) & (data['Location'] == 'Toronto')])

# an OR statement  
print(data[(data['Balance'] > 55000) | (data['Location'] == 'Vancouver')])

# Modern query syntax (more readable)
print(data.query('Balance > 50000 and Location == "Toronto"'))
print(data.query('Balance > 55000 or Location == "Vancouver"'))
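When you are matching against several values, .isin() is usually cleaner than chaining OR conditions. A sketch on a minimal version of our frame:

```python
import pandas as pd

data = pd.DataFrame({
    "Location": ["Toronto", "Ottawa", "London", "Montreal", "Vancouver"],
    "Balance": [50000, 52000, None, 65000, 48000],
})

# .isin() tests membership against a list of values in one shot
eastern = data[data["Location"].isin(["Toronto", "Ottawa", "Montreal"])]
print(eastern)

# The same filter expressed with query syntax
eastern_q = data.query('Location in ["Toronto", "Ottawa", "Montreal"]')
```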

Sorting

# sorts the data
data.sort_values(by='Balance', ascending=False)

# Multiple column sorting
data.sort_values(['Location', 'Balance'], ascending=[True, False])
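For "top N" questions you can skip sorting the whole frame; nlargest() and nsmallest() pick out just the extreme rows. A quick sketch:

```python
import pandas as pd

data = pd.DataFrame({
    "Location": ["Toronto", "Ottawa", "Montreal", "Vancouver"],
    "Balance": [50000, 52000, 65000, 48000],
})

# The two highest balances, without a full sort_values() pass
top_two = data.nlargest(2, "Balance")
print(top_two["Location"].tolist())  # ['Montreal', 'Ottawa']
```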

For more advanced sorting methods, check out this article: Pandas Sort: Your Guide to Sorting Data in Python.

Method Chaining

One of the most powerful patterns in modern Pandas code is method chaining, which lets you write more readable data pipelines:

# Instead of multiple assignment statements, chain operations together
result = (data
    .query('Balance > 40000')  # Filter accounts with balance > 40k
    .assign(
        BalanceCategory=lambda x: pd.cut(
            x['Balance'],
            bins=[0, 50000, 60000, float('inf')],
            labels=['Standard', 'Premium', 'Elite'],
        )
    )
    .groupby(['Location', 'BalanceCategory'], observed=True)
    .agg({'Balance': ['count', 'mean']})
    .round(2)
)

Working with Missing Data

# Check for missing values
print(data.isna().sum())

# Fill missing values. Assign the result back: chained inplace updates
# like data['Balance'].fillna(..., inplace=True) don't work under Copy-on-Write
data['Balance'] = data['Balance'].fillna(data['Balance'].median())

# Or use method chaining approach
clean_data = (data
    .assign(Balance=lambda x: x['Balance'].fillna(x['Balance'].median()))
    .dropna()  # remove any remaining missing values
)
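fillna() also accepts a dict mapping column names to fill values, so each column can get its own sensible default. A minimal sketch:

```python
import pandas as pd

data = pd.DataFrame({
    "Balance": [50000.0, None, 65000.0],
    "AccountType": ["Savings", None, "Investment"],
})

# A dict lets each column use a different fill value in one call
filled = data.fillna({"Balance": 0.0, "AccountType": "Unknown"})
print(filled)
```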

Grouping and Aggregation

# Group by location and get summary stats
location_summary = data.groupby('Location').agg({
    'Balance': ['count', 'mean', 'sum'],
    'AccountType': lambda x: x.mode().iloc[0] if not x.empty else 'Unknown'
}).round(2)

# Custom aggregation functions
def balance_range(series):
    return series.max() - series.min()

custom_stats = data.groupby('Location').agg(
    avg_balance=('Balance', 'mean'),
    total_accounts=('Balance', 'count'),
    balance_spread=('Balance', balance_range)
)
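A related tool is transform(), which broadcasts a per-group result back to the shape of the original frame; that makes per-row comparisons against group statistics a one-liner. A sketch with made-up balances:

```python
import pandas as pd

data = pd.DataFrame({
    "Location": ["Toronto", "Toronto", "Ottawa", "Ottawa"],
    "Balance": [50000.0, 60000.0, 40000.0, 44000.0],
})

# transform('mean') returns one value PER ROW (the row's group mean),
# unlike agg, which returns one value per group
data["location_avg"] = data.groupby("Location")["Balance"].transform("mean")
data["above_avg"] = data["Balance"] > data["location_avg"]
print(data)
```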

Performance Tips for 2026

# Use proper data types for better performance
optimized_data = pd.DataFrame({
    "name": pd.array(["Tim", "Roger", "Bob"], dtype="string"),  # use "string[pyarrow]" for Arrow-backed strings
    "town": pd.Categorical(["Toronto", "Ottawa", "London"]),  # Categories for repeated values
    "balance": pd.array([50000, 52000, 48000], dtype="Int64"),  # Nullable integers
})

# For large datasets, eval() can speed up complex column arithmetic
# (it uses the numexpr engine when that's installed)
import numpy as np

large_data = pd.DataFrame({
    'revenue': np.random.randint(1000, 10000, 100000),
    'costs': np.random.randint(500, 8000, 100000),
    'tax_rate': 0.13  # Ontario HST rate
})

# Fast calculation using eval
large_data = large_data.eval('profit = revenue - costs')
large_data = large_data.eval('after_tax_profit = profit * (1 - tax_rate)')

Save to CSV file

# Saves the data into a file
data.to_csv('canadian_accounts.csv', index=False)

# Modern alternatives for better performance
data.to_parquet('canadian_accounts.parquet')  # Much faster for large files

# Excel with multiple sheets
with pd.ExcelWriter('bank_report.xlsx') as writer:
    data.to_excel(writer, sheet_name='Account_Data', index=False)
    location_summary.to_excel(writer, sheet_name='Location_Summary')

Reading Data with Modern Options

# Read CSV with optimized settings
df = pd.read_csv('canadian_accounts.csv',
                 dtype_backend='pyarrow')  # Arrow backend for performance

# Parse date columns at read time when the file has them, e.g.:
# df = pd.read_csv('canadian_accounts.csv', parse_dates=['date_column'])

# Read Parquet (recommended for large datasets)
df = pd.read_parquet('canadian_accounts.parquet')
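If a CSV is too big to load comfortably, the chunksize parameter turns read_csv() into an iterator of smaller DataFrames, so you can process the file piece by piece. A sketch using an in-memory buffer in place of a real file:

```python
import io
import pandas as pd

# Stand-in for a large file on disk; a real path works the same way
csv_text = "name,balance\nTim,50000\nRoger,52000\nSarah,65000\n"

# chunksize=2 yields DataFrames of at most 2 rows each,
# so only one chunk is in memory at a time
total = 0.0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total += chunk["balance"].sum()

print(total)  # 167000.0
```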

Error Handling and Debugging

# Debug your data pipelines with helper functions
def debug_dataframe(df, step_name=""):
    # Printing only the shape keeps this working for Series results too
    print(f"{step_name}: shape={df.shape}")
    return df

# Use in method chains; .pipe() passes the object through unchanged
result = (data
    .pipe(debug_dataframe, "Start")
    .query('Balance > 45000')
    .pipe(debug_dataframe, "After filtering")
    .groupby('Location')['Balance'].mean()
    .pipe(debug_dataframe, "Final result")
)

As recommended by the official Pandas website, the Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython book can also be very useful if you are looking to learn more about Python data analysis. I hope these syntax examples saved you some time learning the Pandas library :).

The major improvements in Pandas 3.0 make it significantly faster and more memory-efficient than previous versions, especially when working with string data and large datasets. The Copy-on-Write behaviour also means you don’t have to worry about those confusing warning messages anymore.

© Vlad Filippov