# Firecrawl Web Scraping & Data Extraction
Turn entire websites into LLM-ready markdown or structured data. This skill uses Firecrawl to scrape, crawl, map, search, and extract information from websites with AI-powered precision.
## What This Skill Does
This skill enables you to:
- **Scrape** individual URLs and get content in markdown, structured data, or HTML
- **Crawl** entire websites and extract all pages
- **Map** websites to discover all available URLs
- **Search** the web and optionally scrape results
- **Extract** structured data from pages using AI
- **Monitor** scraping jobs and manage batch operations

## Instructions
When a user requests web scraping, data extraction, or website analysis, follow these steps:
### 1. Understand the Request
First, determine what type of operation is needed:
- **Single page scraping**: Use the **Scrape** operation
- **Multiple specific URLs**: Use the **Batch Scrape** operation
- **Entire website**: Use the **Crawl** operation
- **Discover URLs**: Use the **Map** operation
- **Find pages via search**: Use the **Search** operation
- **Structured data extraction**: Use the **Extract Data** operation

### 2. Choose the Right Operation
**For scraping a single URL:**
- Operation: **Scrape**
- Best for: Getting content from one specific page
- Returns: Markdown, HTML, screenshot, or AI-extracted structured data
- Example: "Scrape https://example.com/blog/post-1"

**For mapping a website's structure:**
- Operation: **Map**
- Best for: Discovering all URLs on a website
- Returns: List of all URLs found
- Example: "Map all URLs on example.com"

**For crawling an entire site:**
- Operation: **Crawl**
- Best for: Scraping all pages of a website
- Returns: Content from all discovered pages
- Options: Set depth limits, URL patterns, exclusions
- Example: "Crawl example.com/blog and get all articles"

**For searching the web:**
- Operation: **Search**
- Best for: Finding relevant pages via a search engine
- Returns: Search results with optional page content
- Example: "Search for 'climate change reports 2024' and scrape top 5 results"

**For extracting structured data:**
- Operation: **Extract Data**
- Best for: Getting specific fields from pages using AI
- Returns: Structured JSON data based on your schema
- Example: "Extract product name, price, and description from these e-commerce pages"

**For batch operations:**
- Operation: **Batch Scrape** (start job), then **Batch Scrape Status** (check results)
- Best for: Scraping many URLs efficiently
- Example: "Scrape these 50 product URLs"

### 3. Configure Parameters
**Common parameters across operations:**
- `url` or `urls`: Target URL(s) to process
- `formats`: Choose output format (markdown, html, screenshot, extract)
- `onlyMainContent`: Set to true to exclude headers/footers/nav
- `waitFor`: Milliseconds to wait for JavaScript to load
- `timeout`: Maximum time to wait for page load

**For the Crawl operation:**
- `limit`: Maximum number of pages to crawl (default: 10000)
- `maxDepth`: How many levels deep to crawl
- `allowBackwardLinks`: Whether to follow links to parent pages
- `allowExternalLinks`: Whether to follow links to external domains
- `includePaths`: Array of URL patterns to include
- `excludePaths`: Array of URL patterns to exclude

**For the Extract operation:**
- `prompt`: Natural language description of what to extract
- `schema`: JSON schema defining the structure you want

Example schema:

```json
{
  "type": "object",
  "properties": {
    "company_name": {"type": "string"},
    "location": {"type": "string"},
    "description": {"type": "string"}
  }
}
```
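The prompt-plus-schema pair can be assembled into a request body before the job is started. A minimal sketch; the field names (`urls`, `prompt`, `schema`) are assumptions based on Firecrawl's v1 REST API:

```python
import json

# Hypothetical helper: builds the body for an Extract Data request.
# The field names are assumptions about the v1 API, not confirmed here.
def build_extract_body(urls, prompt, schema):
    return {"urls": urls, "prompt": prompt, "schema": schema}

schema = {
    "type": "object",
    "properties": {
        "company_name": {"type": "string"},
        "location": {"type": "string"},
        "description": {"type": "string"},
    },
}

body = build_extract_body(
    urls=["https://example.com/about"],
    prompt="Extract the company profile from the page",
    schema=schema,
)
print(json.dumps(body, indent=2))
```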
### 4. Execute and Handle Results
**For immediate operations (Scrape, Map, Search):**
- Execute the operation directly
- Parse the returned data
- Present results to the user in a clear format

**For async operations (Crawl, Batch Scrape, Extract):**
1. Start the job and get the job ID
2. Use status operations to check progress:
- `Get Crawl Status` for crawl jobs
- `Batch Scrape Status` for batch jobs
- `Get Extract Status` for extraction jobs
3. Poll status until completed
4. Retrieve and present final results
5. If errors occur, use error operations to diagnose:
- `Get Crawl Errors`
- `Batch Scrape Errors`
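The start-then-poll loop above can be sketched generically. Here `fetch_status` stands in for whichever status call applies (crawl, batch, or extract), and the `"completed"`/`"failed"` status values are assumptions about the response shape:

```python
import time

def poll_until_done(fetch_status, interval=2.0, max_attempts=60):
    """Poll an async Firecrawl job until it finishes or the budget runs out."""
    for _ in range(max_attempts):
        job = fetch_status()  # e.g. a GET on the crawl/batch/extract status endpoint
        if job.get("status") == "completed":
            return job
        if job.get("status") == "failed":
            raise RuntimeError(f"Job failed: {job.get('error', 'unknown error')}")
        time.sleep(interval)
    raise TimeoutError("Job did not complete within the polling budget")
```

With a real client, `fetch_status` would wrap the `Get Crawl Status` (or equivalent) call for the job ID returned when the job was started.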
### 5. Monitor Usage
Check API usage to stay within limits:
- **Team Token Usage**: Check remaining tokens
- **Team Credit Usage**: Check remaining credits
- **Historical Usage**: Review past usage patterns
- **Team Queue Status**: Check the current job queue

### 6. Error Handling
If operations fail:
1. Check the error message for details
2. Verify the URL is accessible
3. Adjust the `timeout` or `waitFor` parameters for slow sites
4. Use error retrieval operations to get detailed failure info
5. For crawls hitting limits, reduce scope with `includePaths`/`excludePaths`
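Steps 2–3 can be combined into a simple retry helper: re-run the scrape with a longer timeout on each attempt, backing off between tries. `scrape_fn` is a hypothetical callable wrapping the Scrape operation:

```python
import time

def scrape_with_retry(scrape_fn, retries=3, base_timeout_ms=30000, delay=1.0):
    """Retry a scrape, growing the timeout linearly and backing off between tries."""
    last_error = None
    for attempt in range(retries):
        try:
            return scrape_fn(timeout=base_timeout_ms * (attempt + 1))
        except Exception as exc:  # in practice, catch the client's specific error type
            last_error = exc
            time.sleep(delay * 2 ** attempt)
    raise last_error
```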
## Examples
**Example 1: Scrape a blog post**
```
User: "Get the content from https://example.com/blog/ai-trends-2024"
Agent: Uses Scrape operation with format=markdown, onlyMainContent=true
Result: Clean markdown of the article content
```
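In code, that request might look like the following sketch. The endpoint path (`https://api.firecrawl.dev/v1/scrape`), Bearer-token header, and response shape are assumptions based on Firecrawl's v1 REST API, and the helper names are illustrative:

```python
import requests

API_URL = "https://api.firecrawl.dev/v1/scrape"  # assumed v1 endpoint

def build_scrape_body(url):
    # Mirrors the agent's choices above: markdown output, main content only.
    return {"url": url, "formats": ["markdown"], "onlyMainContent": True}

def scrape(url, api_key):
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        json=build_scrape_body(url),
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["data"]["markdown"]  # assumed response shape
```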
**Example 2: Crawl a documentation site**
```
User: "Crawl docs.example.com and get all pages under /api/"
Agent: Uses Crawl with includePaths=["/api/*"], limit=100, format=markdown
Result: All API documentation pages in markdown
```
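A sketch of the corresponding request body, using the crawl parameters listed earlier; the nested `scrapeOptions` shape is an assumption about the v1 API:

```python
def build_crawl_body(url, include_paths=None, limit=100):
    # limit and includePaths follow the crawl parameters described above.
    body = {
        "url": url,
        "limit": limit,
        "scrapeOptions": {"formats": ["markdown"], "onlyMainContent": True},
    }
    if include_paths:
        body["includePaths"] = include_paths
    return body

body = build_crawl_body("https://docs.example.com", include_paths=["/api/*"])
```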
**Example 3: Extract structured data**
```
User: "Get company name, industry, and description from these 10 company websites"
Agent: Uses Extract Data with schema defining those fields
Result: JSON array with structured company information
```
**Example 4: Search and scrape**
```
User: "Find recent articles about renewable energy and get their content"
Agent: Uses Search with query="renewable energy 2024", scrapeOptions enabled
Result: Search results with full article content
```
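The same request expressed as a body for an assumed v1 search endpoint, with `scrapeOptions` enabling content retrieval for each hit:

```python
def build_search_body(query, limit=5, scrape=True):
    body = {"query": query, "limit": limit}
    if scrape:
        # Ask the API to also fetch page content for each result
        # (the scrapeOptions shape is an assumption).
        body["scrapeOptions"] = {"formats": ["markdown"]}
    return body

body = build_search_body("renewable energy 2024", limit=5)
```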
**Example 5: Map a website**
```
User: "What pages exist on example.com?"
Agent: Uses Map operation on example.com
Result: Complete list of URLs found on the site
```
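Because Map returns only URLs, post-processing is usually a simple filter. A sketch assuming the response carries the discovered URLs under a `links` key:

```python
def filter_links(map_response, substring):
    """Keep only the discovered URLs containing the given substring."""
    return [u for u in map_response.get("links", []) if substring in u]

sample = {
    "links": [
        "https://example.com/",
        "https://example.com/blog/post-1",
        "https://example.com/pricing",
    ]
}
print(filter_links(sample, "/blog/"))  # → ['https://example.com/blog/post-1']
```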
## Important Notes
- **API Key Required**: User must provide a Firecrawl API key from firecrawl.dev
- **Rate Limits**: Respect the user's plan limits (check with usage operations)
- **Async Operations**: Crawl, Batch Scrape, and Extract are async; poll for results
- **Timeouts**: Default is 30 seconds; increase for slow-loading sites
- **Formats**: Choose markdown for LLM processing, extract for structured data
- **Privacy**: Only scrape publicly accessible content; respect robots.txt
- **Cost**: Each operation consumes credits/tokens; monitor usage

## Constraints
- Cannot scrape content behind authentication (unless cookies/headers are provided)
- Cannot bypass CAPTCHAs or anti-bot measures automatically
- Respects robots.txt by default
- JavaScript rendering adds latency; use the `waitFor` parameter carefully
- Large crawls may take significant time; set appropriate limits