# Firecrawl Web Scraping & Data Extraction
Turn entire websites into LLM-ready markdown or structured data. This skill uses Firecrawl to scrape, crawl, map, search, and extract information from websites with AI-powered precision.
## What This Skill Does
This skill enables you to:
- **Scrape** individual URLs and get content in markdown, structured data, or HTML
- **Crawl** entire websites and extract all pages
- **Map** websites to discover all available URLs
- **Search** the web and optionally scrape results
- **Extract** structured data from pages using AI
- **Monitor** scraping jobs and manage batch operations

## Instructions
When a user requests web scraping, data extraction, or website analysis, follow these steps:
### 1. Understand the Request
First, determine what type of operation is needed:
- **Single page scraping**: Use the **Scrape** operation
- **Multiple specific URLs**: Use the **Batch Scrape** operation
- **Entire website**: Use the **Crawl** operation
- **Discover URLs**: Use the **Map** operation
- **Find pages via search**: Use the **Search** operation
- **Structured data extraction**: Use the **Extract Data** operation

### 2. Choose the Right Operation
**For scraping a single URL:**
- Operation: **Scrape**
- Best for: Getting content from one specific page
- Returns: Markdown, HTML, screenshot, or AI-extracted structured data
- Example: "Scrape https://example.com/blog/post-1"

**For mapping a website's structure:**
- Operation: **Map**
- Best for: Discovering all URLs on a website
- Returns: List of all URLs found
- Example: "Map all URLs on example.com"

**For crawling an entire site:**
- Operation: **Crawl**
- Best for: Scraping all pages of a website
- Returns: Content from all discovered pages
- Options: Set depth limits, URL patterns, exclusions
- Example: "Crawl example.com/blog and get all articles"

**For searching the web:**
- Operation: **Search**
- Best for: Finding relevant pages via a search engine
- Returns: Search results with optional page content
- Example: "Search for 'climate change reports 2024' and scrape top 5 results"

**For extracting structured data:**
- Operation: **Extract Data**
- Best for: Getting specific fields from pages using AI
- Returns: Structured JSON data based on your schema
- Example: "Extract product name, price, and description from these e-commerce pages"

**For batch operations:**
- Operation: **Batch Scrape** (start job), then **Batch Scrape Status** (check results)
- Best for: Scraping many URLs efficiently
- Example: "Scrape these 50 product URLs"

### 3. Configure Parameters
**Common parameters across operations:**
- `url` or `urls`: Target URL(s) to process
- `formats`: Choose output format (markdown, html, screenshot, extract)
- `onlyMainContent`: Set to true to exclude headers/footers/nav
- `waitFor`: Milliseconds to wait for JavaScript to load
- `timeout`: Maximum time to wait for page load

**For the Crawl operation:**
- `limit`: Maximum number of pages to crawl (default: 10000)
- `maxDepth`: How many levels deep to crawl
- `allowBackwardLinks`: Whether to follow links to parent pages
- `allowExternalLinks`: Whether to follow links to external domains
- `includePaths`: Array of URL patterns to include
- `excludePaths`: Array of URL patterns to exclude

**For the Extract operation:**
- `prompt`: Natural language description of what to extract
- `schema`: JSON schema defining the structure you want

Example schema:

```json
{
  "type": "object",
  "properties": {
    "company_name": {"type": "string"},
    "location": {"type": "string"},
    "description": {"type": "string"}
  }
}
```
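The prompt-plus-schema pair can be assembled into a request body before the job is started. A minimal sketch; the field names (`urls`, `prompt`, `schema`) are assumptions based on Firecrawl's v1 REST API:

```python
import json

# Hypothetical helper: builds the body for an Extract Data request.
# The field names are assumptions about the v1 API, not confirmed here.
def build_extract_body(urls, prompt, schema):
    return {"urls": urls, "prompt": prompt, "schema": schema}

schema = {
    "type": "object",
    "properties": {
        "company_name": {"type": "string"},
        "location": {"type": "string"},
        "description": {"type": "string"},
    },
}

body = build_extract_body(
    urls=["https://example.com/about"],
    prompt="Extract the company profile from the page",
    schema=schema,
)
print(json.dumps(body, indent=2))
```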
### 4. Execute and Handle Results
**For immediate operations (Scrape, Map, Search):**
- Execute the operation directly
- Parse the returned data
- Present results to the user in a clear format

**For async operations (Crawl, Batch Scrape, Extract):**
1. Start the job and get the job ID
2. Use status operations to check progress:
- `Get Crawl Status` for crawl jobs
- `Batch Scrape Status` for batch jobs
- `Get Extract Status` for extraction jobs
3. Poll status until completed
4. Retrieve and present final results
5. If errors occur, use error operations to diagnose:
- `Get Crawl Errors`
- `Batch Scrape Errors`
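The start-then-poll loop above can be sketched generically. Here `fetch_status` stands in for whichever status call applies (crawl, batch, or extract), and the `"completed"`/`"failed"` status values are assumptions about the response shape:

```python
import time

def poll_until_done(fetch_status, interval=2.0, max_attempts=60):
    """Poll an async Firecrawl job until it finishes or the budget runs out."""
    for _ in range(max_attempts):
        job = fetch_status()  # e.g. a GET on the crawl/batch/extract status endpoint
        if job.get("status") == "completed":
            return job
        if job.get("status") == "failed":
            raise RuntimeError(f"Job failed: {job.get('error', 'unknown error')}")
        time.sleep(interval)
    raise TimeoutError("Job did not complete within the polling budget")
```

With a real client, `fetch_status` would wrap the `Get Crawl Status` (or equivalent) call for the job ID returned when the job was started.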
### 5. Monitor Usage
Check API usage to stay within limits:
- **Team Token Usage**: Check remaining tokens
- **Team Credit Usage**: Check remaining credits
- **Historical Usage**: Review past usage patterns
- **Team Queue Status**: Check the current job queue

### 6. Error Handling
If operations fail:
1. Check the error message for details
2. Verify the URL is accessible
3. Adjust the `timeout` or `waitFor` parameters for slow sites
4. Use error retrieval operations to get detailed failure info
5. For crawls hitting limits, reduce scope with `includePaths`/`excludePaths`
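Steps 2–3 can be combined into a simple retry helper: re-run the scrape with a longer timeout on each attempt, backing off between tries. `scrape_fn` is a hypothetical callable wrapping the Scrape operation:

```python
import time

def scrape_with_retry(scrape_fn, retries=3, base_timeout_ms=30000, delay=1.0):
    """Retry a scrape, growing the timeout linearly and backing off between tries."""
    last_error = None
    for attempt in range(retries):
        try:
            return scrape_fn(timeout=base_timeout_ms * (attempt + 1))
        except Exception as exc:  # in practice, catch the client's specific error type
            last_error = exc
            time.sleep(delay * 2 ** attempt)
    raise last_error
```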
## Examples
**Example 1: Scrape a blog post**
```
User: "Get the content from https://example.com/blog/ai-trends-2024"
Agent: Uses Scrape operation with format=markdown, onlyMainContent=true
Result: Clean markdown of the article content
```
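In code, that request might look like the following sketch. The endpoint path (`https://api.firecrawl.dev/v1/scrape`), Bearer-token header, and response shape are assumptions based on Firecrawl's v1 REST API, and the helper names are illustrative:

```python
import requests

API_URL = "https://api.firecrawl.dev/v1/scrape"  # assumed v1 endpoint

def build_scrape_body(url):
    # Mirrors the agent's choices above: markdown output, main content only.
    return {"url": url, "formats": ["markdown"], "onlyMainContent": True}

def scrape(url, api_key):
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        json=build_scrape_body(url),
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["data"]["markdown"]  # assumed response shape
```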
**Example 2: Crawl a documentation site**
```
User: "Crawl docs.example.com and get all pages under /api/"
Agent: Uses Crawl with includePaths=["/api/*"], limit=100, format=markdown
Result: All API documentation pages in markdown
```
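A sketch of the corresponding request body, using the crawl parameters listed earlier; the nested `scrapeOptions` shape is an assumption about the v1 API:

```python
def build_crawl_body(url, include_paths=None, limit=100):
    # limit and includePaths follow the crawl parameters described above.
    body = {
        "url": url,
        "limit": limit,
        "scrapeOptions": {"formats": ["markdown"], "onlyMainContent": True},
    }
    if include_paths:
        body["includePaths"] = include_paths
    return body

body = build_crawl_body("https://docs.example.com", include_paths=["/api/*"])
```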
**Example 3: Extract structured data**
```
User: "Get company name, industry, and description from these 10 company websites"
Agent: Uses Extract Data with schema defining those fields
Result: JSON array with structured company information
```
**Example 4: Search and scrape**
```
User: "Find recent articles about renewable energy and get their content"
Agent: Uses Search with query="renewable energy 2024", scrapeOptions enabled
Result: Search results with full article content
```
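The same request expressed as a body for an assumed v1 search endpoint, with `scrapeOptions` enabling content retrieval for each hit:

```python
def build_search_body(query, limit=5, scrape=True):
    body = {"query": query, "limit": limit}
    if scrape:
        # Ask the API to also fetch page content for each result
        # (the scrapeOptions shape is an assumption).
        body["scrapeOptions"] = {"formats": ["markdown"]}
    return body

body = build_search_body("renewable energy 2024", limit=5)
```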
**Example 5: Map a website**
```
User: "What pages exist on example.com?"
Agent: Uses Map operation on example.com
Result: Complete list of URLs found on the site
```
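Because Map returns only URLs, post-processing is usually a simple filter. A sketch assuming the response carries the discovered URLs under a `links` key:

```python
def filter_links(map_response, substring):
    """Keep only the discovered URLs containing the given substring."""
    return [u for u in map_response.get("links", []) if substring in u]

sample = {
    "links": [
        "https://example.com/",
        "https://example.com/blog/post-1",
        "https://example.com/pricing",
    ]
}
print(filter_links(sample, "/blog/"))  # → ['https://example.com/blog/post-1']
```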
## Important Notes
- **API Key Required**: User must provide a Firecrawl API key from firecrawl.dev
- **Rate Limits**: Respect the user's plan limits (check with usage operations)
- **Async Operations**: Crawl, Batch Scrape, and Extract are async; poll for results
- **Timeouts**: Default is 30 seconds; increase for slow-loading sites
- **Formats**: Choose markdown for LLM processing, extract for structured data
- **Privacy**: Only scrape publicly accessible content; respect robots.txt
- **Cost**: Each operation consumes credits/tokens; monitor usage

## Constraints
- Cannot scrape content behind authentication (unless cookies/headers are provided)
- Cannot bypass CAPTCHAs or anti-bot measures automatically
- Respects robots.txt by default
- JavaScript rendering adds latency; use the `waitFor` parameter carefully
- Large crawls may take significant time; set appropriate limits