pandas Data Analysis Assistant

A comprehensive AI assistant for working with pandas, the powerful Python data analysis toolkit. This skill helps you leverage pandas' fast, flexible data structures for relational and labeled data manipulation.

What This Skill Does

This assistant provides expert guidance on using pandas for:

Data loading from various formats (CSV, Excel, SQL, HDF5)

DataFrame and Series manipulation

Missing data handling

Data alignment and merging

Group-by operations and aggregations

Time series analysis

Data reshaping and pivoting

Data cleaning and transformation

Instructions

When a user asks for help with pandas or data analysis tasks, follow these steps:

1. Understand the Data Context

Ask about the data source (file format, database, API)

Identify the data structure (tabular, time series, hierarchical)

Determine the analysis goals (cleaning, transformation, aggregation, visualization)

2. Verify pandas Installation

Check if pandas is installed and recommend installation if needed:

```python
# Check pandas version
import pandas as pd

print(pd.__version__)
```

If not installed, provide installation command:

```bash
pip install pandas

# or
conda install -c conda-forge pandas
```

3. Data Loading Guidance

Recommend appropriate pandas I/O functions based on data source:

**CSV/Text files**: `pd.read_csv()`, `pd.read_table()`

**Excel**: `pd.read_excel()`

**SQL databases**: `pd.read_sql()`, `pd.read_sql_query()`

**JSON**: `pd.read_json()`

**HDF5**: `pd.read_hdf()`

**Clipboard**: `pd.read_clipboard()`

Provide complete code examples with common parameters.
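
As an illustration of the common parameters, the sketch below reads a small semicolon-delimited CSV. The data, the column names, and the `"unknown"` missing-value marker are made up for the example, and `io.StringIO` stands in for a real file path.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """order_id;order_date;region;amount
1;2024-01-05;north;120.50
2;2024-01-06;south;unknown
3;2024-01-09;north;87.25
"""

orders = pd.read_csv(
    io.StringIO(csv_text),         # a path such as "orders.csv" works the same way
    sep=";",                       # field delimiter (the default is ",")
    index_col="order_id",          # use this column as the row index
    parse_dates=["order_date"],    # convert this column to datetime64
    dtype={"region": "category"},  # compact dtype for repeated strings
    na_values=["unknown"],         # extra strings to treat as missing
)

print(orders.dtypes)
print(orders)
```

The other readers listed above (`pd.read_excel()`, `pd.read_json()`, and so on) share several of these parameters, but not all of them, so check each function's signature before reusing them.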

4. Data Exploration

Guide users through initial data exploration:

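```python
# Basic info
df.head()      # first five rows
df.info()      # column dtypes, non-null counts, memory usage
df.describe()  # summary statistics for numeric columns
df.shape       # (rows, columns)
df.columns
df.dtypes

# Missing data (isnull() and isna() are aliases; one call is enough)
df.isna().sum()
```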

5. Common Operations

Provide guidance on pandas core functionality:

**Selection and Indexing**

Label-based: `.loc[]`

Position-based: `.iloc[]`

Boolean indexing: `df[df['column'] > value]`

Multi-indexing for hierarchical data
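
The four selection styles above can be compared side by side on a toy frame. This is a minimal sketch; the city data and column names are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "city": ["Oslo", "Lima", "Pune", "Kyiv"],
        "temp_c": [4.0, 19.5, 31.0, 8.5],
        "rainy": [True, False, False, True],
        "region": ["north", "south", "south", "north"],
    },
    index=["a", "b", "c", "d"],
)

# Label-based: rows "b" through "c" (both ends included), one column
by_label = df.loc["b":"c", "temp_c"]

# Position-based: first two rows and first two columns (end excluded)
by_position = df.iloc[:2, :2]

# Boolean indexing: combine conditions with & / | / ~ and parentheses
warm_and_dry = df[(df["temp_c"] > 10) & ~df["rainy"]]

# MultiIndex: select every row under one outer-level value
multi = df.set_index(["region", "city"]).sort_index()
north = multi.loc["north"]

print(by_label, by_position, warm_and_dry, north, sep="\n\n")
```

Note that `.loc` slices include the end label while `.iloc` slices exclude the end position, which is a frequent source of off-by-one surprises.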

**Data Manipulation**

Adding/removing columns

Sorting: `.sort_values()`, `.sort_index()`

Filtering rows

Applying functions: `.apply()`, `.map()`, `.applymap()` (deprecated since pandas 2.1 in favor of `DataFrame.map()`)
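
A short sketch of these manipulation steps on a made-up price list follows. The frame and its column names are assumptions for the example only.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "item": ["pen", "ink", "pad"],
        "price": [1.5, 7.0, 3.0],
        "qty": [10, 2, 4],
    }
)

# Add a column with a vectorized expression, then remove one
df["total"] = df["price"] * df["qty"]
df = df.drop(columns=["qty"])

# Sort by a column, largest first
df = df.sort_values("total", ascending=False)

# Filter rows with a boolean mask
cheap = df[df["price"] < 5]

# Series.map: element-wise function on a single column
df["item_upper"] = df["item"].map(str.upper)

# DataFrame.apply with axis=1: one Python call per row (slow on large frames)
df["label"] = df.apply(lambda row: f"{row['item']}: {row['total']:.2f}", axis=1)

print(df)
```

Row-wise `.apply()` is convenient but runs at Python speed, so prefer a vectorized expression like the `total` column whenever one exists.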

**Aggregation and Grouping**

Group-by operations: `.groupby()`

Aggregation functions: `.sum()`, `.mean()`, `.count()`, `.agg()`

Pivot tables: `.pivot_table()`
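
One way to combine these is named aggregation for a flat summary plus a pivot table for a cross-tabulated view. The sketch below uses an invented `sales` frame; the output column names `total`, `average`, and `orders` are arbitrary choices.

```python
import pandas as pd

sales = pd.DataFrame(
    {
        "region": ["east", "east", "west", "west", "west"],
        "product": ["a", "b", "a", "a", "b"],
        "revenue": [100, 150, 200, 50, 300],
    }
)

# Named aggregation: output_name=(input_column, function)
summary = sales.groupby("region").agg(
    total=("revenue", "sum"),
    average=("revenue", "mean"),
    orders=("revenue", "count"),
)

# Pivot table: regions as rows, products as columns, summed revenue in cells
pivot = sales.pivot_table(
    index="region",
    columns="product",
    values="revenue",
    aggfunc="sum",
    fill_value=0,
)

print(summary)
print(pivot)
```

Named aggregation avoids the two-level column index that a dict of lists passed to `.agg()` produces, which tends to make downstream selection simpler.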

**Merging and Joining**

Concatenation: `pd.concat()`

Merging: `pd.merge()`, `.merge()`

Joining: `.join()`
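
The three combining tools differ mainly in what they align on: `merge` on key columns, `concat` on position along an axis, and `join` on the index. A minimal sketch with hypothetical `customers` and `orders` tables:

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ada", "Bo", "Cy"]})
orders = pd.DataFrame(
    {
        "order_id": [10, 11, 12],
        "customer_id": [1, 1, 4],
        "amount": [25.0, 40.0, 15.0],
    }
)

# Left merge keeps every customer; indicator=True adds a "_merge" column
# showing whether each row matched ("both") or not ("left_only")
merged = customers.merge(orders, on="customer_id", how="left", indicator=True)

# validate= raises MergeError if the key relationship is not what you expect
checked = pd.merge(customers, orders, on="customer_id", validate="one_to_many")

# concat stacks frames that share the same columns
more_orders = pd.DataFrame({"order_id": [13], "customer_id": [2], "amount": [9.0]})
all_orders = pd.concat([orders, more_orders], ignore_index=True)

# join aligns on the index
totals = orders.groupby("customer_id")["amount"].sum().rename("total_spent")
with_totals = customers.set_index("customer_id").join(totals)

print(merged)
print(with_totals)
```

Checking the row count before and after a merge, or using `validate=`, is a cheap guard against accidental row duplication from non-unique keys.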

**Missing Data**

Detection: `.isnull()`, `.isna()`

Removal: `.dropna()`

Filling: `.fillna()`, `.interpolate()`
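
Which of these to use depends on the data; the sketch below simply shows what each call returns on a small invented frame with gaps in a numeric and a text column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "sensor": [1.0, np.nan, 3.0, np.nan, 5.0],
        "label": ["ok", None, "ok", "bad", None],
    }
)

# Detection: count missing values per column
missing_per_column = df.isna().sum()

# Removal: drop rows with any missing value, or only check chosen columns
dropped_any = df.dropna()
dropped_subset = df.dropna(subset=["sensor"])

# Filling: a different fill value per column
filled = df.fillna({"sensor": df["sensor"].mean(), "label": "unknown"})

# Interpolation: linear by default, suited to ordered numeric data
interpolated = df["sensor"].interpolate()

print(missing_per_column)
print(filled)
print(interpolated)
```

Each call returns a new object and leaves `df` untouched, so assign the result back if you want to keep it.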

**Time Series**

Date parsing: `pd.to_datetime()`

Resampling: `.resample()`

Rolling windows: `.rolling()`

Date ranges: `pd.date_range()`
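
These four functions are often used together. The following sketch builds a synthetic daily series (the values are just a counter) and uses the month-start alias `"MS"`, chosen here because it behaves the same across recent pandas versions.

```python
import pandas as pd

# Date parsing: strings to datetimes
stamps = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"])

# Date range: 60 consecutive days starting 2024-01-01
idx = pd.date_range("2024-01-01", periods=60, freq="D")
ts = pd.Series(range(60), index=idx, dtype="float64")

# Resampling: aggregate daily values into month-start bins
monthly_mean = ts.resample("MS").mean()

# Rolling window: mean over the current and previous 6 observations
rolling = ts.rolling(window=7).mean()

print(monthly_mean)
print(rolling.head(8))
```

The first six rolling values are `NaN` because a full window of seven observations is not yet available; pass `min_periods=1` if partial windows are acceptable.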

6. Performance Optimization

Suggest optimizations when working with large datasets:

Use appropriate dtypes (especially categorical for repeated strings)

Leverage vectorized operations instead of loops

Use `.query()` for complex filtering

Consider `chunksize` parameter for reading large files

Use `.eval()` for efficient expression evaluation
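
A few of these suggestions can be sketched on synthetic data. The frame, the threshold values, and the in-memory CSV below are all assumptions for illustration; actual savings depend on the dataset.

```python
import io

import numpy as np
import pandas as pd

n = 10_000
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "state": rng.choice(["CA", "NY", "TX"], size=n),
        "value": np.arange(n, dtype="float64"),
    }
)

# Categorical dtype: store each distinct string once plus small integer codes
before = df["state"].memory_usage(deep=True)
df["state"] = df["state"].astype("category")
after = df["state"].memory_usage(deep=True)
print(f"state column: {before} bytes -> {after} bytes")

# Vectorized arithmetic instead of a Python loop over rows
df["scaled"] = df["value"] * 2

# query(): readable filtering; @name refers to a local Python variable
low, cap = 5000, 15000
subset = df.query("value > @low and scaled < @cap")

# chunksize: process a large file piece by piece instead of loading it whole
csv_text = "x\n" + "\n".join(str(i) for i in range(10))
total = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk["x"].sum()
print(total)
```

Measure before and after with `memory_usage(deep=True)` or a timer, since the benefit of each technique varies with column cardinality and frame size.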

7. Best Practices

Emphasize pandas best practices:

Avoid loops; use vectorized operations

Use method chaining for readable code

Use `copy=False` cautiously: the result may share memory with its input, so modifying one can silently change the other

Use `.copy()` explicitly when needed

Handle missing data appropriately for the use case

Use consistent naming conventions for columns
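
Several of these practices fit into one method chain. The sketch below cleans an invented `raw` frame without loops and without mutating the input; the column names and the snake_case convention are assumptions for the example.

```python
import pandas as pd

raw = pd.DataFrame(
    {
        "Order Date": ["2024-01-01", "2024-01-01", "2024-01-02", None],
        "Unit Price": [2.0, 2.0, 5.0, 1.0],
        "Qty": [3, 3, 1, 4],
    }
)

clean = (
    raw
    # consistent column names: lowercase with underscores
    .rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
    # drop rows that have no order date
    .dropna(subset=["order_date"])
    # remove exact duplicate rows
    .drop_duplicates()
    # derive columns with vectorized expressions
    .assign(
        order_date=lambda d: pd.to_datetime(d["order_date"]),
        total=lambda d: d["unit_price"] * d["qty"],
    )
    .reset_index(drop=True)
)

print(clean)
```

Because every step returns a new object, `raw` stays unchanged and any intermediate stage can be inspected by truncating the chain.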

8. Error Handling

Help debug common pandas errors:

`KeyError`: Column or index not found

`ValueError`: Shape mismatch in operations

`TypeError`: Incompatible data types

Memory errors with large datasets

Encoding issues when reading files
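
The sketch below reproduces three of these errors on purpose and shows one possible fix for each. The frame, the stray-whitespace column name, and the Latin-1 bytes are contrived for the example.

```python
import io

import pandas as pd

df = pd.DataFrame({" Price ": ["10", "12", "n/a"], "qty": [1, 2, 3]})

# KeyError: the column name has stray spaces and a capital letter
try:
    df["price"]
except KeyError:
    print("available columns:", df.columns.tolist())
df.columns = df.columns.str.strip().str.lower()

# Incompatible types: numbers stored as text; unparseable entries become NaN
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# ValueError: assigning 2 values to a 3-row frame
try:
    df["bad"] = [1, 2]
except ValueError as exc:
    message = str(exc)
    print("ValueError:", message)

# Encoding: these bytes are Latin-1, so the default UTF-8 decoding would
# fail with UnicodeDecodeError; passing the right encoding reads them
latin1_bytes = "name\ncafé\n".encode("latin-1")
names = pd.read_csv(io.BytesIO(latin1_bytes), encoding="latin-1")
print(names)
```

Printing `df.columns.tolist()` and `df.dtypes` is usually the fastest first step when a `KeyError` or `TypeError` appears.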

9. Export and Saving

Guide users on saving results:

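```python
# CSV
df.to_csv('output.csv', index=False)

# Excel (needs an engine such as openpyxl)
df.to_excel('output.xlsx', sheet_name='Sheet1')

# SQL (`connection` is an existing SQLAlchemy engine or sqlite3 connection)
df.to_sql('table_name', connection, if_exists='replace')

# HDF5 (needs the PyTables package, installed as `tables`)
df.to_hdf('output.h5', key='df', mode='w')
```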

Example Usage

**Example 1: Data Cleaning**

```python
import pandas as pd

# Load data
df = pd.read_csv('data.csv')

# Handle missing values (assign the result back; calling fillna with
# inplace=True on a selected column is deprecated chained assignment)
df['column'] = df['column'].fillna(df['column'].mean())

# Remove duplicates
df = df.drop_duplicates()

# Convert data types
df['date'] = pd.to_datetime(df['date'])
```

**Example 2: Group-by Analysis**

```python
# Group by category and calculate statistics
grouped = df.groupby('category').agg({
    'sales': ['sum', 'mean', 'count'],
    'profit': 'sum'
})
```

**Example 3: Time Series Analysis**

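```python
# Set datetime index (assumes 'date' was already converted with pd.to_datetime)
df = df.set_index('date')

# Resample to monthly frequency
# ('ME' means month end in pandas 2.2+; older versions use 'M')
monthly = df.resample('ME').mean()

# Calculate a rolling average over 7 rows
df['rolling_avg'] = df['value'].rolling(window=7).mean()
```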

Important Notes

Always check pandas version compatibility (v3.0.0 introduces breaking changes)

pandas is built on NumPy; understanding NumPy arrays helps with pandas

For large datasets (>1GB), consider alternatives like Dask or Polars

Call `.copy()` on a filtered or sliced subset before modifying it to avoid `SettingWithCopyWarning`

Time zone handling requires `tzdata` package on Windows

Documentation: https://pandas.pydata.org/docs/

Dependencies

Required:

NumPy (arrays and mathematical functions)

python-dateutil (datetime extensions)

tzdata (Windows/Emscripten only)

Optional but recommended:

matplotlib or seaborn (visualization)

openpyxl (`.xlsx` files) or xlrd (legacy `.xls` files) for Excel support

sqlalchemy (SQL database support)

tables (HDF5 support)

When to Use This Skill

Invoke this skill when users need help with:

Loading and parsing data files

Data cleaning and preprocessing

Exploratory data analysis

Statistical computations

Time series operations

Data transformation and reshaping

Merging multiple data sources

Exporting analysis results
