
Crava: Automating web scraping with AI

June 18, 2025
In this project, I focused on automating web scraping with LLM APIs. The idea is to let the LLM generate CSS selectors for the actual scraper (Puppeteer, in this case); those selectors are then cached so repeat runs can skip the LLM call entirely.
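That caching step can be sketched like this (a minimal in-memory version for illustration; `askLlm` is a hypothetical stand-in for the Gemini call, and Crava's real cache layer may be persisted differently):

```typescript
// Minimal sketch of selector caching, keyed by hostname + requested fields.
// A real setup would likely persist this to disk between runs.
type SelectorMap = Record<string, string>;

const selectorCache = new Map<string, SelectorMap>();

function cacheKey(url: string, keys: string[]): string {
  return `${new URL(url).hostname}::${keys.join(",")}`;
}

async function getSelectors(
  url: string,
  keys: string[],
  askLlm: (url: string, keys: string[]) => Promise<SelectorMap>
): Promise<SelectorMap> {
  const key = cacheKey(url, keys);
  const cached = selectorCache.get(key);
  if (cached) return cached; // cache hit: skip the LLM round trip

  const selectors = await askLlm(url, keys); // cache miss: ask the LLM once
  selectorCache.set(key, selectors);
  return selectors;
}
```

Keying on hostname (rather than the full URL) is what makes the cache pay off: every product page on the same site shares one set of selectors.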
  • AI-Powered Extraction: Automatically generates CSS selectors using Google Gemini AI
  • Stealth Scraping: Uses Puppeteer with stealth plugins to avoid bot detection
  • JSON Output: Clean, structured JSON data output
  • Smart Retry Logic: Built-in retry mechanism with exponential backoff
  • Extensible LLM Support: Ready for OpenAI, Anthropic, and other AI providers
  • TypeScript: Full TypeScript support with comprehensive type definitions
  • CLI Interface: Use via command line or programmatically
  • Global Installation: Available as crava command or npx crava
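The retry logic from the list above can be sketched as a small helper (illustrative only, not Crava's internal code):

```typescript
// Retry an async operation with exponential backoff.
// The delay doubles each attempt: base, 2x base, 4x base, ...
async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries: number,
  baseDelayMs = 500
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === maxRetries) break; // out of retries, give up
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```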
Under the hood, the stack looks like this:
  • Gemini API Client: For SDK communication with the Gemini API.
  • React and Next.js: For building the front-end landing page.
  • Puppeteer: For managing web scraping.
  • GitHub Actions: For automating the pipeline and syncing design changes to the repository.
Curious how it works? Here's the flow, step by step:
  1. Page Loading: Crava uses Puppeteer with stealth plugins to load the target webpage, avoiding bot detection
  2. AI Analysis: The page HTML is cleaned and sent to AI (Gemini) to analyze content structure and generate extraction selectors
  3. Smart Extraction: Generated selectors are used to extract structured data, with fallback strategies for dynamic content
  4. Data Processing: Extracted data is cleaned, validated, and formatted as structured JSON
  5. Output: Results can be displayed in console or saved to JSON/CSV files
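Step 2 depends on trimming the raw HTML before it reaches the model, since scripts and styles burn tokens without adding structural signal. A rough sketch of that cleanup (my own minimal take; Crava's actual cleaning and the character budget here are assumptions):

```typescript
// Strip the parts of a page that add tokens but no structural signal
// before sending the HTML to the LLM for selector generation.
function cleanHtmlForLlm(html: string, maxChars = 20000): string {
  const cleaned = html
    .replace(/<script[\s\S]*?<\/script>/gi, "") // drop inline JS
    .replace(/<style[\s\S]*?<\/style>/gi, "") // drop inline CSS
    .replace(/<!--[\s\S]*?-->/g, "") // drop comments
    .replace(/\s+/g, " "); // collapse whitespace
  // Truncate so the prompt stays within the model's context budget
  return cleaned.slice(0, maxChars).trim();
}
```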
A few tips for writing configs that extract cleanly:
TypeScript
// Use specific, descriptive field names
const goodConfig = {
    keys: ["Product Title", "Sale Price", "Customer Rating", "Stock Status"],
};

// Add context with custom prompts
const betterConfig = {
    keys: ["Product Title", "Sale Price"],
    customPrompt: "Extract only products that are currently on sale",
};

// Handle dynamic content
const robustConfig = {
    keys: ["Article Title", "Author"],
    timeout: 45000, // Longer timeout for slow sites
    maxRetries: 5, // More retries for unreliable sites
};
While Crava offers powerful AI-driven web scraping capabilities, it has several important limitations to consider:
  • Requires an active AI provider API key and an internet connection to operate
  • Performance is directly tied to both page complexity and AI response times
  • Despite stealth mode, some websites may still detect and block automated scraping attempts
  • Heavily JavaScript-dependent sites often require extended timeout values to function properly
  • AI providers impose rate limits that can impact high-volume scraping operations
  • Extraction accuracy depends heavily on the clarity and structure of the target page's content
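The rate-limit caveat can at least be softened with a small client-side throttle. A sliding-window sketch (the clock is injectable to make it testable; the actual limits depend on your provider's quota, so the numbers here are placeholders):

```typescript
// Sliding-window rate limiter: allow at most `limit` calls per `windowMs`.
class RateLimiter {
  private timestamps: number[] = [];

  constructor(
    private limit: number,
    private windowMs: number,
    private now: () => number = Date.now
  ) {}

  // Returns true if a call is allowed right now (and records it).
  tryAcquire(): boolean {
    const t = this.now();
    // Drop timestamps that have fallen out of the window
    this.timestamps = this.timestamps.filter((ts) => t - ts < this.windowMs);
    if (this.timestamps.length >= this.limit) return false;
    this.timestamps.push(t);
    return true;
  }
}
```

When `tryAcquire()` returns false, the caller can sleep and try again instead of burning a request against the provider's quota.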
TypeScript
// For better performance on similar pages
const config = {
    keys: ["Title", "Price"],
    llm: {
        provider: "gemini",
        apiKey: process.env.GEMINI_API_KEY,
        temperature: 0.1, // Lower temperature = more consistent results
    },
    timeout: 20000, // Shorter timeout for fast sites
    maxRetries: 2, // Fewer retries for reliable sites
};

// For complex or slow sites
const robustConfig = {
    keys: ["Article Title", "Full Content", "Author"],
    llm: {
        provider: "gemini",
        apiKey: process.env.GEMINI_API_KEY,
        temperature: 0.3,
    },
    timeout: 60000, // Longer timeout
    maxRetries: 5, // More retries
    customPrompt:
        "Wait for all content to load. Focus on main article content.",
};