Production-ready tool for generating high-quality Q&A datasets optimized for Mistral 7B fine-tuning. It specializes in educational content with cultural authenticity, context grounding, and systematic quality validation, and was originally built to provide Indian university guidance to Bangladeshi students.
Generate high-quality Q&A datasets for LLM fine-tuning with systematic quality validation and error analysis.
SetForge is a research-based dataset generator that produces 15K-20K high-quality question-answer pairs for fine-tuning language models. It emphasizes quality over quantity through context grounding, cultural authenticity, and systematic error analysis.
When implementing or using this skill, follow these research-backed quality principles:
1. **Error Analysis First**: Systematically examine failures to identify highest-ROI improvements
2. **Quality Over Quantity**: Clean, relevant data outperforms large, noisy datasets
3. **Context Grounding**: All answers must be extractable from source material (≥60% extractive content)
4. **Cultural Authenticity**: Domain-specific relevance for target audience (≥70% relevance score)
5. **Consistent Formatting**: Proper instruction format is critical for model performance
The generation pipeline:

```
Source Files → Context Chunking → Q&A Generation → Quality Validation → Error Analysis → JSONL Dataset
```
The tool is organized into five components (a sketch of how they fit together follows this list):

1. **main_generator.py**: Single, clean Q&A generator with quality focus
2. **quality_checker.py**: Error analysis and quality validation
3. **cli.py**: Simple commands (generate, validate, analyze)
4. **config.yaml**: Configuration with quality thresholds
5. **utils.py**: Cultural context and text processing utilities
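Before diving into each component, here is a condensed sketch of how the pieces fit together. Every function it calls is defined or sketched in the sections below; the orchestration itself is illustrative, not SetForge's actual main loop:

```python
async def generate_dataset(source_text: str, source_file: str,
                           output_path: str, threshold: float = 0.7):
    """Illustrative end-to-end flow: chunk -> generate -> validate -> write JSONL."""
    accepted, failed = [], []
    for context in create_quality_chunks(source_text):
        question = await generate_question_from_context(context)
        answer = await generate_answer_from_context(context, question)
        quality = validate_qa_quality(question, answer, context)
        pair = QAPair(question=question, answer=answer, context=context,
                      source_file=source_file, quality=quality)
        (accepted if quality.overall_score >= threshold else failed).append(pair)
    write_jsonl(accepted, output_path)
    # Feed rejected pairs into error analysis to guide the next iteration
    return analyze_quality_failures(failed)
```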
Create a `config.yaml` with:
```yaml
api_url: "https://inference.do-ai.run/v1/chat/completions"
model: "llama3-8b-instruct"
max_cost_usd: 200.0
cost_per_token: 0.0000002
quality_thresholds:
  min_extractive_score: 0.6   # Anti-hallucination
  min_cultural_score: 0.7    # Domain relevance
  min_detail_score: 0.6      # Specific information
  min_overall_score: 0.7     # Combined quality
```
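A minimal sketch of loading and sanity-checking this file, assuming PyYAML is available; `load_config` is illustrative and not part of SetForge's documented API:

```python
import yaml  # pip install pyyaml

def load_config(path: str = "config.yaml") -> dict:
    """Load generation settings and fail fast if quality thresholds are missing."""
    with open(path, "r", encoding="utf-8") as f:
        config = yaml.safe_load(f)
    required = {"min_extractive_score", "min_cultural_score",
                "min_detail_score", "min_overall_score"}
    missing = required - set(config.get("quality_thresholds", {}))
    if missing:
        raise ValueError(f"config.yaml is missing quality thresholds: {missing}")
    return config
```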
Create meaningful context chunks that preserve semantic integrity:
```python
from typing import List

def create_quality_chunks(text: str, max_words: int = 1000) -> List[str]:
    """Create 500-1000 word chunks that preserve context integrity."""
    chunks = []
    current_chunk = ""
    for paragraph in text.split('\n\n'):
        combined = f"{current_chunk}\n\n{paragraph}" if current_chunk else paragraph
        # Compare word counts (not characters) to honor the 500-1000 word target
        if len(combined.split()) > max_words and current_chunk:
            # Flush at a paragraph boundary and start a new chunk
            chunks.append(current_chunk.strip())
            current_chunk = paragraph
        else:
            current_chunk = combined
    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks
```
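A hypothetical usage example, assuming plain-text source files live under `data/` (the directory the CLI examples below read from):

```python
from pathlib import Path

# Chunk every text file under the source directory
all_chunks = []
for source_file in Path("data").glob("**/*.txt"):
    all_chunks.extend(create_quality_chunks(source_file.read_text(encoding="utf-8")))
print(f"Created {len(all_chunks)} context chunks")
```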
Use an anti-hallucination prompting pattern for both question and answer generation:
```python
async def generate_question_from_context(context: str) -> str:
    prompt = f"""
    Based on this context, generate a realistic question that can be answered
    from the provided information.

    Context: {context}

    Generate a specific, actionable question.
    """
    response = await call_api(prompt)
    return extract_question(response)


async def generate_answer_from_context(context: str, question: str) -> str:
    prompt = f"""
    Answer this question using ONLY information from the provided context.
    If the context doesn't contain enough information, say so clearly.

    Question: {question}
    Context: {context}

    Provide a detailed, accurate answer based on the context.
    """
    response = await call_api(prompt)
    return extract_answer(response)
```
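`call_api` is used above but never shown. A minimal sketch assuming an OpenAI-compatible chat-completions endpoint (consistent with the `api_url` and `model` in `config.yaml`) and an `API_KEY` environment variable; adapt it to your provider's actual authentication scheme:

```python
import os
import aiohttp

API_URL = "https://inference.do-ai.run/v1/chat/completions"  # api_url from config.yaml
MODEL = "llama3-8b-instruct"                                 # model from config.yaml

async def call_api(prompt: str) -> str:
    """Send one chat-completion request and return the model's text reply."""
    headers = {"Authorization": f"Bearer {os.environ['API_KEY']}"}
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    async with aiohttp.ClientSession() as session:
        async with session.post(API_URL, json=payload, headers=headers) as resp:
            resp.raise_for_status()
            data = await resp.json()
    return data["choices"][0]["message"]["content"]
```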
Validate each Q&A pair against multiple quality dimensions:
```python
def validate_qa_quality(question: str, answer: str, context: str) -> QualityMetrics:
    """Comprehensive quality validation across multiple dimensions."""
    # Extractive score (anti-hallucination)
    extractive_score = calculate_extractive_score(answer, context)

    # Cultural authenticity
    cultural_score = validate_domain_focus(question, answer)

    # Specific details
    detail_score = validate_specific_details(answer)

    # Overall quality, weighted toward context grounding
    overall_score = (extractive_score * 0.4 +
                     cultural_score * 0.3 +
                     detail_score * 0.3)

    return QualityMetrics(
        extractive_score=extractive_score,
        cultural_score=cultural_score,
        detail_score=detail_score,
        overall_score=overall_score,
    )
```
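`QualityMetrics`, `QAPair`, and `calculate_extractive_score` are referenced throughout but not defined above. A minimal sketch, with the extractive score approximated as token overlap between answer and context; the real scorer may be more sophisticated:

```python
from dataclasses import dataclass

@dataclass
class QualityMetrics:
    extractive_score: float
    cultural_score: float
    detail_score: float
    overall_score: float

@dataclass
class QAPair:
    question: str
    answer: str
    context: str
    source_file: str
    quality: QualityMetrics

def calculate_extractive_score(answer: str, context: str) -> float:
    """Fraction of answer tokens that also appear in the context (0.0-1.0).

    A crude proxy for the >=60% extractive-content requirement: answers whose
    words are mostly absent from the source context are likely hallucinated.
    """
    answer_tokens = answer.lower().split()
    if not answer_tokens:
        return 0.0
    context_tokens = set(context.lower().split())
    return sum(t in context_tokens for t in answer_tokens) / len(answer_tokens)
```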
Systematically categorize and analyze quality failures:
```python
def analyze_quality_failures(failed_pairs: List[QAPair]) -> ErrorAnalysis:
    """Analyze quality failures to identify improvement opportunities."""
    error_categories = {
        'low_extractive': [],
        'poor_cultural_focus': [],
        'missing_details': [],
        'hallucination': [],
        'irrelevant_content': [],
    }
    for pair in failed_pairs:
        if pair.quality.extractive_score < 0.6:
            error_categories['low_extractive'].append(pair)
        elif pair.quality.cultural_score < 0.7:
            error_categories['poor_cultural_focus'].append(pair)
        # ...categorize the remaining failure modes the same way
    return ErrorAnalysis(error_categories)
```
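A short usage sketch, assuming `ErrorAnalysis` exposes the category dict as a `categories` attribute (its real structure isn't specified here):

```python
# Rank failure categories by frequency to find the highest-ROI fix first
analysis = analyze_quality_failures(failed_pairs)
for category, pairs in sorted(analysis.categories.items(),
                              key=lambda item: len(item[1]), reverse=True):
    print(f"{category}: {len(pairs)} failures")
```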
Generate properly formatted training data (example for Mistral 7B):
```python
def format_for_mistral(qa_pair: QAPair) -> dict:
    """Format a Q&A pair for Mistral 7B instruction fine-tuning."""
    return {
        "instruction": f"<s>[INST] {qa_pair.question} [/INST]",
        "input": qa_pair.context,
        "output": qa_pair.answer,
        "context_source": qa_pair.source_file,
        "quality_score": qa_pair.quality.overall_score,
    }
```
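The pipeline ends in a JSONL file, one JSON object per line. A minimal serialization sketch (`write_jsonl` is illustrative):

```python
import json

def write_jsonl(qa_pairs, output_path: str) -> None:
    """Serialize formatted Q&A pairs as one JSON object per line."""
    with open(output_path, "w", encoding="utf-8") as f:
        for pair in qa_pairs:
            f.write(json.dumps(format_for_mistral(pair), ensure_ascii=False) + "\n")
```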
Start with a small pilot run, analyze its failures, then regenerate and validate:

```bash
# 1. Generate a small pilot dataset
python cli.py generate data/ output/test_100.jsonl --target 100 --budget 5
# 2. Analyze failures to find the highest-ROI improvements
python cli.py analyze-errors output/test_100.jsonl --output error_analysis.json
# 3. Regenerate after fixes, then validate against the quality threshold
python cli.py generate data/ output/test_100_v2.jsonl --target 100 --budget 5
python cli.py validate output/test_100_v2.jsonl --threshold 0.7
```
To audit an existing dataset:

```bash
# Detailed per-pair quality report
python cli.py validate dataset.jsonl --detailed-report quality_report.json
# Categorized error analysis
python cli.py analyze-errors dataset.jsonl --output error_analysis.json
# Domain/cultural authenticity check
python cli.py validate-cultural dataset.jsonl --output cultural_report.json
```
For a full production run:

```bash
# Full-scale generation with a quality gate, monitored live
python cli.py generate data/ output/dataset_15k.jsonl \
    --target 15000 --budget 200 --quality-threshold 0.7
python cli.py monitor-quality output/dataset_15k.jsonl --live
```
Monitor these metrics during generation:
```python
# Running aggregates maintained by the generator (variable names illustrative)
logger.info(f"""
Quality Generation Status:
  Pairs generated:   {pairs_generated}/{target_pairs}
  Quality pass rate: {pass_rate:.1%}
  Avg overall score: {avg_overall:.2f} (threshold 0.70)
  Cost so far:       ${cost_usd:.2f} / ${max_cost_usd:.2f}
""")
```
Use the exact API configuration for your provider:

- ❌ WRONG: Generic or incorrect API endpoints
- ✅ CORRECT: The exact provider-specified endpoint (the `api_url` in `config.yaml`)
Never relax the quality gates:

- ❌ DANGEROUS: Skipping extractive validation
- ✅ MANDATORY: All answers must be grounded in the source context
- ❌ RISKY: Ignoring domain authenticity
- ✅ REQUIRED: Domain-specific validation
- ❌ POOR: Generic, broad answers
- ✅ ESSENTIAL: Specific, detailed information
Treat error analysis as part of the workflow, not an afterthought:

- ❌ FAILURE: Generating without error analysis
- ✅ SUCCESS: Systematic error analysis and improvement
- ❌ WASTE: Ignoring quality failures
- ✅ VALUE: Learning from failures to improve quality