Generates and manages metadata in Jupyter notebooks using LLMs (OpenAI/Gemini) and MongoDB, with CSV sampling, JSON parsing, and robust error handling.
A specialized skill for building notebook-centric metadata generation workflows using LLMs and MongoDB.
This skill guides GitHub Copilot to implement metadata generation pipelines in Jupyter notebooks. It handles CSV data sampling, LLM prompt construction (OpenAI/Google Gemini), JSON extraction from LLM responses, and MongoDB upload — all following production-grade error handling and security practices.
You are working in a notebook-centric workspace for generating and managing metadata using LLMs and MongoDB. All main logic is implemented in Jupyter notebooks with reusable functions.
**Example:**
```python
import pandas as pd

def load_and_sample_csv(filepath, max_rows=20):
    """Load CSV and sample up to max_rows for LLM input."""
    try:
        df = pd.read_csv(filepath)
        if df.empty:
            print(f"Warning: {filepath} is empty")
            return ""
        sample = df.head(max_rows)
        return sample.to_json(orient='records', force_ascii=False)
    except Exception as e:
        print(f"Error loading {filepath}: {e}")
        return ""
```
Support both **OpenAI** and **Google Gemini** with consistent prompting patterns.
**OpenAI Configuration:**
**Google Gemini Configuration:**
**JSON Extraction:**
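LLM replies often wrap the JSON in prose or markdown code fences, so extraction should handle both. A small helper sketch (the function name and fence handling are illustrative, not part of the skill):

```python
import json
import re

def extract_json(text):
    """Return the first JSON object found in an LLM reply, or None.

    Models sometimes wrap JSON in ```json fences; try those first,
    then fall back to a greedy brace match over the raw text.
    """
    fenced = re.search(r'```(?:json)?\s*(\{.*?\})\s*```', text, re.DOTALL)
    candidate = fenced.group(1) if fenced else None
    if candidate is None:
        match = re.search(r'\{.*\}', text, re.DOTALL)
        candidate = match.group(0) if match else None
    if candidate is None:
        return None
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        # Braces were found but the content was not valid JSON
        return None
```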
**Example:**
```python
import re
import json
from langchain_openai import ChatOpenAI
from langchain.schema import SystemMessage, HumanMessage

def generate_metadata_openai(data_json, api_key):
    """Generate metadata using OpenAI."""
    llm = ChatOpenAI(
        model="gpt-4o",
        temperature=0.2,
        max_tokens=2000,
        api_key=api_key
    )
    system_prompt = "Generate metadata in JSON format with snake_case keys. Output ONLY valid JSON, no explanations."
    human_prompt = f"Generate metadata for this data:\n{data_json}"
    try:
        response = llm.invoke([
            SystemMessage(content=system_prompt),
            HumanMessage(content=human_prompt)
        ])
        # Extract JSON from response
        match = re.search(r'\{.*\}', response.content, re.DOTALL)
        if not match:
            print(f"No JSON found in response: {response.content}")
            return None
        json_str = match.group(0)
        return json.loads(json_str)
    except Exception as e:
        print(f"LLM error: {e}")
        return None
```
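A Gemini counterpart can follow the same prompting pattern. The sketch below assumes the `langchain-google-genai` package; the model name and parameters are illustrative, not prescribed by this skill:

```python
import re
import json

GEMINI_SYSTEM_PROMPT = (
    "Generate metadata in JSON format with snake_case keys. "
    "Output ONLY valid JSON, no explanations."
)

def generate_metadata_gemini(data_json, api_key):
    """Generate metadata using Google Gemini (mirrors the OpenAI helper)."""
    # Imported lazily so notebooks without the package can still run this cell.
    from langchain_google_genai import ChatGoogleGenerativeAI
    from langchain.schema import SystemMessage, HumanMessage
    llm = ChatGoogleGenerativeAI(
        model="gemini-1.5-pro",   # illustrative model name
        temperature=0.2,
        google_api_key=api_key,
    )
    try:
        response = llm.invoke([
            SystemMessage(content=GEMINI_SYSTEM_PROMPT),
            HumanMessage(content=f"Generate metadata for this data:\n{data_json}"),
        ])
        # Same JSON extraction as the OpenAI helper
        match = re.search(r'\{.*\}', response.content, re.DOTALL)
        if not match:
            print(f"No JSON found in response: {response.content}")
            return None
        return json.loads(match.group(0))
    except Exception as e:
        print(f"LLM error: {e}")
        return None
```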
Handle errors gracefully with clear user feedback:
**Example:**
```python
def safe_llm_call(func, *args, **kwargs):
    """Wrapper for LLM calls with error handling."""
    try:
        return func(*args, **kwargs)
    except Exception as e:
        error_type = type(e).__name__
        if "quota" in str(e).lower():
            print(f"API quota exceeded: {e}")
        elif "rate" in str(e).lower():
            print(f"Rate limit hit: {e}")
        elif "token" in str(e).lower():
            print(f"Token limit exceeded: {e}")
        elif "auth" in str(e).lower() or "credential" in str(e).lower():
            print(f"Authentication error: {e}")
        else:
            print(f"{error_type}: {e}")
        return None
```
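The wrapper above only reports failures. For transient rate-limit or quota errors, a common extension is to retry with exponential backoff before giving up. A sketch (function name, delays, and keyword matching are illustrative):

```python
import time

def retry_with_backoff(func, *args, max_retries=3, base_delay=1.0, **kwargs):
    """Call func, retrying transient failures with exponential backoff.

    Only errors that look rate-limit related are retried; anything else
    is reported once and returns None, mirroring safe_llm_call.
    """
    for attempt in range(max_retries):
        try:
            return func(*args, **kwargs)
        except Exception as e:
            transient = any(word in str(e).lower()
                            for word in ("rate", "quota", "timeout"))
            if not transient or attempt == max_retries - 1:
                print(f"{type(e).__name__}: {e}")
                return None
            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, ...
            print(f"Transient error, retrying in {delay:.2f}s: {e}")
            time.sleep(delay)
    return None
```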
**CRITICAL:** Never hardcode secrets.
**Example:**
```python
import os
from dotenv import load_dotenv
load_dotenv()
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
GEMINI_API_KEY = os.getenv('GEMINI_API_KEY')
MONGODB_URI = os.getenv('MONGODB_URI')
if not OPENAI_API_KEY:
    print("Warning: OPENAI_API_KEY not found in .env")
```
**Example:**
```python
from pymongo import MongoClient

def upload_to_mongodb(metadata_dict, mongodb_uri):
    """Upload metadata to MongoDB."""
    client = None
    try:
        client = MongoClient(mongodb_uri)
        db = client['airspace']
        collection = db['metadata_full']
        result = collection.insert_one(metadata_dict)
        print(f"✓ Inserted document ID: {result.inserted_id}")
        return result.inserted_id
    except Exception as e:
        print(f"MongoDB error: {e}")
        return None
    finally:
        # Guard against MongoClient() itself failing, which would
        # otherwise leave `client` unbound here
        if client is not None:
            client.close()
```
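Before uploading, a quick sanity check on the metadata dict can catch malformed LLM output early. The validation rules below are illustrative assumptions (non-empty dict, snake_case string keys, optional required keys), not requirements stated by this skill:

```python
import re

def validate_metadata(metadata_dict, required_keys=()):
    """Return a list of problems with an LLM-generated metadata dict.

    Illustrative checks: the dict must be non-empty, keys must be
    snake_case strings, and any required keys must be present.
    An empty list means the dict looks safe to upload.
    """
    if not isinstance(metadata_dict, dict) or not metadata_dict:
        return ["metadata must be a non-empty dict"]
    problems = []
    snake_case = re.compile(r'^[a-z][a-z0-9_]*$')
    for key in metadata_dict:
        if not isinstance(key, str) or not snake_case.match(key):
            problems.append(f"key is not snake_case: {key!r}")
    for key in required_keys:
        if key not in metadata_dict:
            problems.append(f"missing required key: {key!r}")
    return problems
```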
Structure notebooks with clear, reusable cells:
1. **Setup cell:** Imports and environment loading
2. **Function definitions:** Reusable functions for data loading, LLM calls, MongoDB upload
3. **Execution cells:** Main workflow with markdown documentation
4. **Markdown cells:** Explain assumptions, workflow steps, and results
**Example structure:**
```markdown
[import cell]
[data loading functions]
[LLM functions]
[MongoDB functions]
[execution cell with markdown explanation]
[execution cell with markdown explanation]
[execution cell with markdown explanation]
```
**Example:**
```python
def build_prompt_with_context(data_json, additional_info=""):
    """Build LLM prompt with optional additional context."""
    base_prompt = f"Generate metadata for this data:\n{data_json}"
    if additional_info:
        return f"{base_prompt}\n\nAdditional context:\n{additional_info}"
    return base_prompt
```
```python
load_dotenv()
data_json = load_and_sample_csv('data/input.csv', max_rows=20)
metadata = generate_metadata_openai(data_json, os.getenv('OPENAI_API_KEY'))
if metadata:
    upload_to_mongodb(metadata, os.getenv('MONGODB_URI'))
```
```python
with open('context.txt', 'r') as f:
    additional_info = f.read()

# generate_metadata_openai wraps its argument in its own instruction line,
# so append the context to the data rather than passing a pre-built prompt.
data_with_context = f"{data_json}\n\nAdditional context:\n{additional_info}"
metadata = generate_metadata_openai(data_with_context, os.getenv('OPENAI_API_KEY'))
```