Guide for using the Dask library to parallelize data analysis tasks with NumPy, Pandas, and task scheduling
This skill helps you leverage Dask, a flexible parallel computing library for analytics, to scale your Python data analysis workflows across multiple cores or distributed clusters. It guides you through using Dask to:
- process large tabular datasets with Dask DataFrames
- run chunked NumPy-style computations with Dask Arrays
- parallelize custom Python functions with `dask.delayed`
- choose a scheduler and optimize workloads
When a user asks to use Dask for parallel computing, data processing, or scaling analytics workloads, follow these steps:
Check if Dask is installed:
```bash
pip show dask
```
If not installed, install it:
```bash
pip install dask
```
For distributed computing or specific backends:
```bash
pip install "dask[complete]"
pip install "dask[dataframe]" # for dataframe support
pip install "dask[distributed]" # for distributed scheduler
```
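To double-check from Python that the import works, a quick sanity check:
```python
import dask

# Confirm Dask imports cleanly and report the installed version
print(dask.__version__)
```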
Ask the user what they want to parallelize: large tabular data, numerical array computations, or custom Python functions. Based on the use case, provide the appropriate code:
**For large CSV/Parquet files (DataFrame):**
```python
import dask.dataframe as dd

# Lazily read the CSV in partitions; nothing is loaded into memory yet
df = dd.read_csv('large_file.csv')

# Familiar Pandas-style API, still lazy
result = df.groupby('column').mean()

# .compute() runs the task graph and returns a regular Pandas object
output = result.compute()
```
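If partitions come out too large or dtypes are inferred incorrectly, `dd.read_csv` accepts `blocksize` and `dtype` arguments. A small sketch (the file name and column are placeholders):
```python
import dask.dataframe as dd

# blocksize sets roughly how many bytes of the file go into each partition
df = dd.read_csv('large_file.csv', blocksize='64MB', dtype={'column': 'float64'})
print(df.npartitions)  # how many partitions Dask created
```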
**For array operations:**
```python
import dask.array as da

# 10000x10000 array split into 100 chunks of 1000x1000 each
x = da.random.random((10000, 10000), chunks=(1000, 1000))
y = x + x.T          # elementwise add with the transpose (lazy)
z = y.mean(axis=0)   # column means (still lazy)
result = z.compute() # execute the task graph in parallel
```
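Dask Arrays can also wrap data you already have in NumPy; a minimal sketch using `da.from_array`:
```python
import numpy as np
import dask.array as da

arr = np.arange(1_000_000).reshape(1000, 1000)

# Wrap the existing NumPy array, splitting it into 250x250 chunks
x = da.from_array(arr, chunks=(250, 250))
print(x.sum().compute())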
**For custom functions:**
```python
import dask
from dask import delayed

@delayed
def process_file(filename):
    # Your processing logic; calling a @delayed function is lazy
    result = ...
    return result

# Build one lazy task per file, then execute them all in parallel
results = [process_file(f) for f in file_list]
output = dask.compute(*results)
```
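A self-contained sketch of how delayed tasks compose (`inc` and `total` are made-up functions for illustration):
```python
import dask
from dask import delayed

@delayed
def inc(x):
    return x + 1

@delayed
def total(values):
    return sum(values)

# Dask traverses the list, wiring each inc task into total's inputs
tasks = [inc(i) for i in range(10)]
print(total(tasks).compute())  # 55
```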
Help the user choose an appropriate scheduler:
```python
import dask

# Threaded scheduler (the default for arrays/DataFrames): best when the
# work is in NumPy/Pandas code that releases the GIL
dask.config.set(scheduler='threads')

# Process scheduler: better for pure-Python code that holds the GIL
dask.config.set(scheduler='processes')

# Distributed scheduler: diagnostics dashboard, scales beyond one machine
from dask.distributed import Client
client = Client()  # starts a local cluster by default
```
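`dask.config.set` also works as a context manager, which is handy for scoping a scheduler choice to a single computation — for example the single-threaded `'synchronous'` scheduler when debugging:
```python
import dask
import dask.array as da

x = da.random.random((4000, 4000), chunks=(1000, 1000))

# Everything inside this block runs in the main thread, so ordinary
# debuggers and tracebacks work as expected
with dask.config.set(scheduler='synchronous'):
    print(x.mean().compute())
```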
Provide optimization tips based on the workload:
```python
# Render the task graph to a file before running (requires graphviz)
result.visualize(filename='task-graph.png')

# persist() computes once and keeps the result in memory, so both
# aggregations below reuse it instead of recomputing the dropna
df_clean = df.dropna().persist()
result1 = df_clean.groupby('A').sum().compute()
result2 = df_clean.groupby('B').mean().compute()
```
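Another common tip: heavy filtering can leave many near-empty partitions, which adds scheduling overhead. A sketch with a placeholder path and a hypothetical `value` column:
```python
import dask.dataframe as dd

df = dd.read_csv('data/*.csv')     # placeholder path
df_filtered = df[df['value'] > 0]  # a selective filter leaves sparse partitions
df_filtered = df_filtered.repartition(npartitions=10)  # consolidate them
```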
**Reading multiple files:**
```python
df = dd.read_csv('data/*.csv')  # glob pattern loads all matching files as one DataFrame
```
**Writing results:**
```python
df.to_parquet('output/', compression='snappy')  # writes one Parquet file per partition
```
**Converting from Pandas:**
```python
dask_df = dd.from_pandas(pandas_df, npartitions=4)  # split the frame into 4 partitions
```
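When picking `npartitions`, a common rule of thumb from the Dask documentation is to keep partitions around 100 MB: too many tiny partitions add scheduler overhead, while oversized ones risk exhausting worker memory.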
When the user says "I need to process 50GB of CSV files and compute statistics":
1. Install Dask with DataFrame support
2. Read the files with `dd.read_csv('data/*.csv')`
3. Perform aggregations with the familiar Pandas API
4. Call `.compute()` to execute
5. Suggest the distributed scheduler if a single machine is too slow
6. Show how to save results to Parquet for faster subsequent reads (a sketch of the whole flow follows below)
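A minimal end-to-end sketch of that workflow, assuming the CSVs live under `data/` and share a schema; `category` and `value` are placeholder column names:
```python
import dask.dataframe as dd
from dask.distributed import Client

client = Client()  # local cluster; the dashboard link it prints tracks progress

# Lazily read every CSV, aggregate, then execute
df = dd.read_csv('data/*.csv')
stats = df.groupby('category')['value'].agg(['mean', 'std', 'count']).compute()
print(stats)

# Persist the dataset as Parquet so later runs skip CSV parsing entirely
df.to_parquet('output/', compression='snappy')
```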