Enforce PEP 8, Ruff formatting, comprehensive type hints, Google-style docstrings, and data science best practices for Python development with pandas, numpy, and plotly.
A comprehensive skill for generating clean, well-documented Python code following industry best practices, with a focus on data science workflows using pandas, numpy, and plotly.
This skill enforces strict code quality standards for Python development. Generated code must follow these rules:
1. **Adhere to PEP 8 and Ruff standards**
- Use descriptive variable names without unnecessary abbreviations
- Prefer early returns to reduce nesting complexity
- Format code using standard Python conventions
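Example of an early return with descriptive names (a minimal sketch; the discount logic is illustrative):
```python
def apply_member_discount(order_total: float, is_member: bool) -> float:
    """Return the order total after applying any membership discount."""
    # Early return keeps the main calculation out of a nested else block.
    if not is_member:
        return order_total
    return order_total * 0.9  # illustrative 10% member discount
```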
2. **Generate complete, working code**
- Never include TODOs or placeholders
- All code must be production-ready and fully functional
- Include all necessary imports at the top of each code block
3. **Prioritize readability**
- Use vectorized operations (list comprehensions, pandas methods) when available
- Break complex operations into logical, modular functions or classes
- Add inline comments explaining the "why" behind key operations
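Example of preferring a vectorized pandas operation over an explicit loop (a sketch; the 'price' column and tax rate are illustrative):
```python
import pandas as pd


def add_price_with_tax(df: pd.DataFrame, tax_rate: float = 0.07) -> pd.DataFrame:
    """Add a 'price_with_tax' column derived from the 'price' column."""
    df = df.copy()
    # Vectorized arithmetic is clearer and faster than iterating row by row.
    df['price_with_tax'] = df['price'] * (1 + tax_rate)
    return df
```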
4. **Always include Python type hints**
- Add type hints to every function parameter and return value
- Use appropriate types from `typing` module when needed
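Example of a fully annotated signature using the `typing` module (a sketch; the function and its parameters are illustrative):
```python
from typing import Optional

import pandas as pd


def filter_above_threshold(
    df: pd.DataFrame,
    column: str,
    threshold: float,
    limit: Optional[int] = None,
) -> pd.DataFrame:
    """Return rows where `column` exceeds `threshold`, optionally capped at `limit` rows."""
    result = df[df[column] > threshold]
    return result if limit is None else result.head(limit)
```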
5. **Generate Google-style docstrings for all functions and classes**
Required sections:
- Brief one-sentence description
- `Args:` section with parameter types and descriptions
- `Returns:` section with return type and explanation
- `Example:` section showing usage (especially for data manipulation)
Template:
```python
def function_name(param1: Type, param2: Type) -> ReturnType:
    """
    Brief one-sentence description.

    Args:
        param1 (Type): Description of parameter 1.
        param2 (Type): Description of parameter 2.

    Returns:
        ReturnType: Description of the returned value.

    Example:
        >>> function_name(example_param1, example_param2)
        expected_output
    """
    # Function implementation
```
6. **For DataFrame operations, document expected structure**
- Include column names and data types
- Specify assumptions about missing values
- Example structure in docstring or comments
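Example of documenting the expected DataFrame structure inside the docstring (a sketch; the column names and missing-value policy are illustrative):
```python
import pandas as pd


def summarize_sales_by_region(df: pd.DataFrame) -> pd.Series:
    """
    Summarize total sales per region.

    Expected DataFrame structure:
        - 'region' (str): Region name; assumed to have no missing values.
        - 'sales' (float): Sale amount; missing values are treated as zero.

    Args:
        df (pd.DataFrame): Sales records with the columns described above.

    Returns:
        pd.Series: Total sales indexed by region.
    """
    return df.fillna({'sales': 0}).groupby('region')['sales'].sum()
```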
7. **Default to data science libraries**
- Use `pandas`, `numpy`, and `plotly` as primary libraries
- Standard imports:
```python
import pandas as pd
import numpy as np
import plotly.express as px
```
8. **Include error handling and logging**
- Add try-except blocks for file operations (e.g., `pd.read_csv`)
- Log important operations, especially in statistical/ML tasks
- Handle edge cases gracefully
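Example of a try-except block combined with logging (a sketch using the standard `logging` module; the helper name is illustrative):
```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def read_csv_logged(filepath: str) -> pd.DataFrame:
    """Read a CSV file, logging the outcome of the operation."""
    try:
        df = pd.read_csv(filepath)
    except FileNotFoundError:
        logger.error("File not found: %s", filepath)
        raise
    logger.info("Loaded %d rows from %s", len(df), filepath)
    return df
```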
9. **Generate modular, logical code structure**
- Break down complex operations into smaller functions
- Each function should have a single, clear responsibility
- Ensure functions are reusable and well-encapsulated
10. **When refactoring**
- Maintain full functionality of original code
- Improve readability without sacrificing performance
- Ensure all tests still pass
11. **Assume data science context**
- Default to DataFrame and array operations
- Use pandas for CSV operations with proper error checks
- Include data validation where appropriate
12. **For machine learning code**
- Assume PyTorch workflow when applicable
- Include device settings: `torch.device('cuda' if torch.cuda.is_available() else 'cpu')`
- Add model evaluation and metrics tracking
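Example of the device setup plus a simple evaluation loop (a minimal sketch assuming a classification model that returns logits and a `DataLoader` yielding `(inputs, labels)` batches):
```python
import torch
from torch.utils.data import DataLoader


def evaluate_accuracy(model: torch.nn.Module, loader: DataLoader) -> float:
    """Compute classification accuracy of `model` over the batches in `loader`."""
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)
            predictions = model(inputs).argmax(dim=1)
            correct += (predictions == labels).sum().item()
            total += labels.size(0)
    return correct / total
```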
13. **Multi-file project support**
- Coordinate code across modules using context
- Maintain consistent style and documentation across files
- Use relative imports appropriately
14. **Documentation export compatibility**
- Ensure docstrings are copy-paste ready for Google Docs
- Include comments explaining integration processes when relevant
15. **Data cleaning and analysis functions**
- Document assumptions (e.g., missing value handling strategy)
- Include data validation checks
- Provide clear error messages for invalid inputs
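Example of a cleaning function with validation and a clear error message (a sketch; the 'email' column and drop policy are illustrative):
```python
import pandas as pd


def clean_email_column(df: pd.DataFrame) -> pd.DataFrame:
    """
    Normalize the 'email' column to lowercase and drop rows with missing emails.

    Assumes missing emails cannot be recovered and are dropped rather than imputed.
    """
    if 'email' not in df.columns:
        raise ValueError("Expected an 'email' column but it is missing from the DataFrame.")
    df = df.copy()
    df['email'] = df['email'].str.lower()
    return df.dropna(subset=['email'])
```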
16. **Statistical operations**
- Use vectorized pandas/numpy operations
- Avoid explicit loops when vectorization is possible
- Include appropriate statistical tests and confidence intervals
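Example of a vectorized mean with a confidence interval (a sketch using a normal approximation; for small samples a t-distribution from `scipy.stats` would be more appropriate):
```python
from typing import Tuple

import numpy as np


def mean_confidence_interval(values: np.ndarray, z_score: float = 1.96) -> Tuple[float, float, float]:
    """Return the mean and its lower and upper confidence bounds."""
    mean = float(values.mean())
    # Standard error of the mean; ddof=1 uses the sample standard deviation.
    standard_error = values.std(ddof=1) / np.sqrt(len(values))
    return mean, mean - z_score * standard_error, mean + z_score * standard_error
```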
17. **Visualization code**
- Use plotly for interactive visualizations
- Include clear axis labels and titles
- Make plots accessible with proper color schemes
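Example of an interactive plot with explicit labels, a title, and a colorblind-friendly palette (a sketch; the column names are illustrative):
```python
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go


def plot_monthly_revenue(df: pd.DataFrame) -> go.Figure:
    """Create an interactive bar chart of revenue per month."""
    return px.bar(
        df,
        x='month',
        y='revenue',
        title='Monthly Revenue',
        labels={'month': 'Month', 'revenue': 'Revenue (USD)'},
        color_discrete_sequence=px.colors.qualitative.Safe,  # colorblind-friendly palette
    )
```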
```python
import pandas as pd


def load_customer_data(filepath: str, encoding: str = 'utf-8') -> pd.DataFrame:
    """
    Load customer data from a CSV file with error handling.

    Args:
        filepath (str): Path to the CSV file.
        encoding (str): File encoding. Defaults to 'utf-8'.

    Returns:
        pd.DataFrame: DataFrame with columns ['customer_id', 'name', 'email', 'signup_date'].

    Raises:
        FileNotFoundError: If the specified file does not exist.
        pd.errors.EmptyDataError: If the file is empty.
        ValueError: If required columns are missing.

    Example:
        >>> df = load_customer_data('data/customers.csv')
        >>> df.head()
    """
    try:
        df = pd.read_csv(filepath, encoding=encoding)
        required_columns = ['customer_id', 'name', 'email', 'signup_date']
        if not all(col in df.columns for col in required_columns):
            raise ValueError(f"Missing required columns. Expected: {required_columns}")
        return df
    except FileNotFoundError:
        raise FileNotFoundError(f"File not found: {filepath}")
    except pd.errors.EmptyDataError:
        raise pd.errors.EmptyDataError(f"Empty file: {filepath}")
```
```python
import pandas as pd
from typing import Dict


def calculate_customer_metrics(df: pd.DataFrame) -> Dict[str, float]:
    """
    Calculate key customer engagement metrics from transaction data.

    Assumes df contains 'customer_id', 'purchase_amount', and 'purchase_date' columns.
    Missing values in 'purchase_amount' are treated as zero.

    Args:
        df (pd.DataFrame): Transaction data with customer purchases.

    Returns:
        Dict[str, float]: Dictionary containing:
            - 'avg_purchase': Average purchase amount
            - 'total_revenue': Total revenue
            - 'unique_customers': Number of unique customers

    Example:
        >>> metrics = calculate_customer_metrics(transaction_df)
        >>> print(f"Average purchase: ${metrics['avg_purchase']:.2f}")
    """
    df = df.copy()
    df['purchase_amount'] = df['purchase_amount'].fillna(0)
    metrics = {
        'avg_purchase': df['purchase_amount'].mean(),
        'total_revenue': df['purchase_amount'].sum(),
        'unique_customers': df['customer_id'].nunique(),
    }
    return metrics
```