Generates and manages metadata in Jupyter notebooks using LLMs (OpenAI/Gemini) and MongoDB, with CSV sampling, JSON parsing, and robust error handling.
A specialized skill for building notebook-centric metadata generation workflows using LLMs and MongoDB.
This skill guides GitHub Copilot to implement metadata generation pipelines in Jupyter notebooks. It handles CSV data sampling, LLM prompt construction (OpenAI/Google Gemini), JSON extraction from LLM responses, and MongoDB upload — all following production-grade error handling and security practices.
You are working in a notebook-centric workspace for generating and managing metadata using LLMs and MongoDB. All main logic is implemented in Jupyter notebooks with reusable functions.
**Example:**
```python
import pandas as pd

def load_and_sample_csv(filepath, max_rows=20):
    """Load CSV and sample up to max_rows for LLM input."""
    try:
        df = pd.read_csv(filepath)
        if df.empty:
            print(f"Warning: {filepath} is empty")
            return ""
        sample = df.head(max_rows)
        return sample.to_json(orient='records', force_ascii=False)
    except Exception as e:
        print(f"Error loading {filepath}: {e}")
        return ""
```
Support both **OpenAI** and **Google Gemini** with consistent prompting patterns.
**OpenAI Configuration:**
**Google Gemini Configuration:**
**JSON Extraction:**
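LLM replies often wrap the JSON in prose or markdown code fences, so extraction should handle both. A small helper sketch (the function name and fence handling are illustrative, not part of the skill):

```python
import json
import re

def extract_json(text):
    """Return the first JSON object found in an LLM reply, or None.

    Models sometimes wrap JSON in ```json fences; try those first,
    then fall back to a greedy brace match over the raw text.
    """
    fenced = re.search(r'```(?:json)?\s*(\{.*?\})\s*```', text, re.DOTALL)
    candidate = fenced.group(1) if fenced else None
    if candidate is None:
        match = re.search(r'\{.*\}', text, re.DOTALL)
        candidate = match.group(0) if match else None
    if candidate is None:
        return None
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        # Braces were found but the content was not valid JSON
        return None
```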
**Example:**
```python
import re
import json
from langchain_openai import ChatOpenAI
from langchain.schema import SystemMessage, HumanMessage

def generate_metadata_openai(data_json, api_key):
    """Generate metadata using OpenAI."""
    llm = ChatOpenAI(
        model="gpt-4o",
        temperature=0.2,
        max_tokens=2000,
        api_key=api_key
    )
    system_prompt = "Generate metadata in JSON format with snake_case keys. Output ONLY valid JSON, no explanations."
    human_prompt = f"Generate metadata for this data:\n{data_json}"
    try:
        response = llm.invoke([
            SystemMessage(content=system_prompt),
            HumanMessage(content=human_prompt)
        ])
        # Extract JSON from response
        match = re.search(r'\{.*\}', response.content, re.DOTALL)
        if not match:
            print(f"No JSON found in response: {response.content}")
            return None
        json_str = match.group(0)
        return json.loads(json_str)
    except Exception as e:
        print(f"LLM error: {e}")
        return None
```
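A Gemini counterpart can follow the same prompting pattern. The sketch below assumes the `langchain-google-genai` package; the model name and parameters are illustrative, not prescribed by this skill:

```python
import re
import json

GEMINI_SYSTEM_PROMPT = (
    "Generate metadata in JSON format with snake_case keys. "
    "Output ONLY valid JSON, no explanations."
)

def generate_metadata_gemini(data_json, api_key):
    """Generate metadata using Google Gemini (mirrors the OpenAI helper)."""
    # Imported lazily so notebooks without the package can still run this cell.
    from langchain_google_genai import ChatGoogleGenerativeAI
    from langchain.schema import SystemMessage, HumanMessage
    llm = ChatGoogleGenerativeAI(
        model="gemini-1.5-pro",   # illustrative model name
        temperature=0.2,
        google_api_key=api_key,
    )
    try:
        response = llm.invoke([
            SystemMessage(content=GEMINI_SYSTEM_PROMPT),
            HumanMessage(content=f"Generate metadata for this data:\n{data_json}"),
        ])
        # Same JSON extraction as the OpenAI helper
        match = re.search(r'\{.*\}', response.content, re.DOTALL)
        if not match:
            print(f"No JSON found in response: {response.content}")
            return None
        return json.loads(match.group(0))
    except Exception as e:
        print(f"LLM error: {e}")
        return None
```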
Handle errors gracefully with clear user feedback:
**Example:**
```python
def safe_llm_call(func, *args, **kwargs):
    """Wrapper for LLM calls with error handling."""
    try:
        return func(*args, **kwargs)
    except Exception as e:
        error_type = type(e).__name__
        if "quota" in str(e).lower():
            print(f"API quota exceeded: {e}")
        elif "rate" in str(e).lower():
            print(f"Rate limit hit: {e}")
        elif "token" in str(e).lower():
            print(f"Token limit exceeded: {e}")
        elif "auth" in str(e).lower() or "credential" in str(e).lower():
            print(f"Authentication error: {e}")
        else:
            print(f"{error_type}: {e}")
        return None
```
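The wrapper above only reports failures. For transient rate-limit or quota errors, a common extension is to retry with exponential backoff before giving up. A sketch (function name, delays, and keyword matching are illustrative):

```python
import time

def retry_with_backoff(func, *args, max_retries=3, base_delay=1.0, **kwargs):
    """Call func, retrying transient failures with exponential backoff.

    Only errors that look rate-limit related are retried; anything else
    is reported once and returns None, mirroring safe_llm_call.
    """
    for attempt in range(max_retries):
        try:
            return func(*args, **kwargs)
        except Exception as e:
            transient = any(word in str(e).lower()
                            for word in ("rate", "quota", "timeout"))
            if not transient or attempt == max_retries - 1:
                print(f"{type(e).__name__}: {e}")
                return None
            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, ...
            print(f"Transient error, retrying in {delay:.2f}s: {e}")
            time.sleep(delay)
    return None
```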
**CRITICAL:** Never hardcode secrets.
**Example:**
```python
import os
from dotenv import load_dotenv
load_dotenv()
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
GEMINI_API_KEY = os.getenv('GEMINI_API_KEY')
MONGODB_URI = os.getenv('MONGODB_URI')
if not OPENAI_API_KEY:
    print("Warning: OPENAI_API_KEY not found in .env")
```
**Example:**
```python
from pymongo import MongoClient

def upload_to_mongodb(metadata_dict, mongodb_uri):
    """Upload metadata to MongoDB."""
    client = None
    try:
        client = MongoClient(mongodb_uri)
        db = client['airspace']
        collection = db['metadata_full']
        result = collection.insert_one(metadata_dict)
        print(f"✓ Inserted document ID: {result.inserted_id}")
        return result.inserted_id
    except Exception as e:
        print(f"MongoDB error: {e}")
        return None
    finally:
        # Guard against MongoClient() itself failing, which would
        # otherwise leave `client` unbound here
        if client is not None:
            client.close()
```
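Before uploading, a quick sanity check on the metadata dict can catch malformed LLM output early. The validation rules below are illustrative assumptions (non-empty dict, snake_case string keys, optional required keys), not requirements stated by this skill:

```python
import re

def validate_metadata(metadata_dict, required_keys=()):
    """Return a list of problems with an LLM-generated metadata dict.

    Illustrative checks: the dict must be non-empty, keys must be
    snake_case strings, and any required keys must be present.
    An empty list means the dict looks safe to upload.
    """
    if not isinstance(metadata_dict, dict) or not metadata_dict:
        return ["metadata must be a non-empty dict"]
    problems = []
    snake_case = re.compile(r'^[a-z][a-z0-9_]*$')
    for key in metadata_dict:
        if not isinstance(key, str) or not snake_case.match(key):
            problems.append(f"key is not snake_case: {key!r}")
    for key in required_keys:
        if key not in metadata_dict:
            problems.append(f"missing required key: {key!r}")
    return problems
```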
Structure notebooks with clear, reusable cells:
1. **Setup cell:** Imports and environment loading
2. **Function definitions:** Reusable functions for data loading, LLM calls, MongoDB upload
3. **Execution cells:** Main workflow with markdown documentation
4. **Markdown cells:** Explain assumptions, workflow steps, and results
**Example structure:**
```markdown
[import cell]
[data loading functions]
[LLM functions]
[MongoDB functions]
[execution cell with markdown explanation]
[execution cell with markdown explanation]
[execution cell with markdown explanation]
```
**Example:**
```python
def build_prompt_with_context(data_json, additional_info=""):
    """Build LLM prompt with optional additional context."""
    base_prompt = f"Generate metadata for this data:\n{data_json}"
    if additional_info:
        return f"{base_prompt}\n\nAdditional context:\n{additional_info}"
    return base_prompt
```
```python
load_dotenv()
data_json = load_and_sample_csv('data/input.csv', max_rows=20)
metadata = generate_metadata_openai(data_json, os.getenv('OPENAI_API_KEY'))
if metadata:
    upload_to_mongodb(metadata, os.getenv('MONGODB_URI'))
```
```python
with open('context.txt', 'r') as f:
    additional_info = f.read()

# generate_metadata_openai wraps its argument in its own instruction line,
# so append the context to the data rather than passing a pre-built prompt.
data_with_context = f"{data_json}\n\nAdditional context:\n{additional_info}"
metadata = generate_metadata_openai(data_with_context, os.getenv('OPENAI_API_KEY'))
```