Fast DataFrame operations with Polars library for data manipulation, transformation, and analysis using lazy/eager execution
Expert skill for working with Polars, the blazingly fast DataFrame library written in Rust. Use this for high-performance data manipulation, transformation, and analytical queries with datasets of any size.
Helps you leverage Polars' powerful query engine for DataFrame operations, including reading and writing data, transformations, lazy queries, expressions, joins, and reshaping.
Install Polars:
```bash
pip install polars
```
For optional features:
```bash
pip install "polars[all]"       # all optional dependencies
pip install "polars[pyarrow]"   # Apache Arrow interop (used by to_pandas)
pip install "polars[timezone]"  # timezone support on Windows
pip install "polars[excel]"     # Excel read/write support
```
Start by importing Polars and checking the installation:
```python
import polars as pl
pl.show_versions()
```
Read data from various sources:
```python
df = pl.read_csv("data.csv")
df = pl.read_parquet("data.parquet")
df = pl.read_json("data.json")
lazy_df = pl.scan_csv("large_data.csv")
lazy_df = pl.scan_parquet("large_data.parquet")
```
Perform common transformations:
```python
df.select(["column1", "column2"])
df.filter(pl.col("age") > 30)
df.with_columns([
    (pl.col("price") * 1.1).alias("price_with_tax"),
    pl.col("name").str.to_uppercase().alias("name_upper")
])
df.sort("column_name", descending=True)
df.group_by("category").agg([
    pl.col("sales").sum().alias("total_sales"),
    pl.col("quantity").mean().alias("avg_quantity")
])
```
Use lazy evaluation for automatic query optimization:
```python
lazy_df = pl.scan_csv("data.csv")
result = (
    lazy_df
    .filter(pl.col("year") >= 2020)
    .group_by("region")
    .agg(pl.col("revenue").sum())
    .sort("revenue", descending=True)
    .head(10)
)
final_df = result.collect()
# For larger-than-RAM data, run the same query on the streaming engine
# (newer Polars versions prefer result.collect(engine="streaming"))
final_df = result.collect(streaming=True)
```
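The same pattern works on in-memory data via `pl.LazyFrame`, and `explain()` lets you inspect the optimized plan before executing. A runnable sketch with made-up data:

```python
import polars as pl

lf = pl.LazyFrame({
    "year": [2019, 2021, 2022],
    "region": ["EU", "EU", "US"],
    "revenue": [1.0, 2.0, 3.0],
})

query = (
    lf.filter(pl.col("year") >= 2020)
      .group_by("region")
      .agg(pl.col("revenue").sum())
      .sort("revenue", descending=True)
)

# Show the optimized query plan without running it
print(query.explain())

# Nothing executes until collect()
result = query.collect()
print(result)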
Leverage Polars' powerful expression system:
```python
df.select([
    # Conditional logic
    pl.when(pl.col("age") < 18)
      .then(pl.lit("minor"))
      .otherwise(pl.lit("adult"))
      .alias("age_group"),
    # String operations
    pl.col("email").str.extract(r"@(.+)$", 1).alias("domain"),
    # Date operations
    pl.col("date").dt.year().alias("year"),
    pl.col("date").dt.month().alias("month"),
    # Window functions
    pl.col("sales").rank().over("region").alias("sales_rank"),
    # Rolling operations
    pl.col("value").rolling_mean(window_size=7).alias("7day_avg")
])
```
Combine DataFrames:
```python
result = df1.join(df2, on="id", how="inner")
result = df1.join(df2, left_on="user_id", right_on="id", how="left")
combined = pl.concat([df1, df2], how="vertical")    # stack rows (schemas must match)
combined = pl.concat([df1, df2], how="horizontal")  # add columns side by side
```
Export results:
```python
df.write_parquet("output.parquet")
df.write_csv("output.csv")
df.write_json("output.json")
pandas_df = df.to_pandas()
```
Optimize your Polars queries:
```python
# Prefer lazy scans so filters and column selections are pushed into the file read
lazy_df = pl.scan_parquet("data.parquet")

# Stream execution for larger-than-RAM data
result = lazy_df.collect(streaming=True)

# zstd gives well-compressed Parquet output
df.write_parquet("data.parquet", compression="zstd")

# Filter early, and select only the columns you need
lazy_df = lazy_df.filter(pl.col("date") > "2023-01-01")
lazy_df = lazy_df.select(["col1", "col2"])
```
Clean and standardize columns (note: `str.strip` was renamed to `str.strip_chars`; the fills and clips are split into two `with_columns` calls so no column name is produced twice):
```python
df = (
    df.with_columns(pl.col("*").fill_null(strategy="forward"))  # fill nulls forward
      .with_columns(
          pl.col("numeric_col").clip(0, 100),    # clamp to [0, 100]
          pl.col("text_col").str.strip_chars(),  # trim whitespace
      )
)
```
Compute time-series features (sort by time first so differences and rolling windows are well-ordered):
```python
df.sort("timestamp").select([
    pl.col("timestamp"),
    pl.col("value").diff().alias("change"),
    pl.col("value").pct_change().alias("pct_change"),
    pl.col("value").rolling_mean(window_size=7).alias("7day_ma")
])
```
Reshape between wide and long formats (Polars 1.0 renamed `pivot`'s `columns` argument to `on`, and `melt` to `unpivot`):
```python
pivoted = df.pivot(values="sales", index="date", on="product")
melted = df.unpivot(index=["id", "date"], on=["sales", "profit"])
```
Polars is a great fit for:
✅ Large datasets (>1GB)
✅ Performance-critical pipelines
✅ Complex aggregations and joins
✅ Larger-than-RAM data processing
✅ New projects with no pandas dependency
⚠️ Consider pandas if you need extensive ecosystem compatibility or have legacy code