
Crava: Automating web scraping with AI

June 18, 2025
In this project, I focused on automating web scraping with LLM APIs. The idea is to let the LLM generate CSS selectors for the actual scraper (Puppeteer, in this case); those selectors are then cached so repeat runs can skip the LLM call entirely.
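That caching step can be sketched like this (a minimal in-memory version for illustration; `askLlm` is a hypothetical stand-in for the Gemini call, and Crava's real cache layer may be persisted differently):

```typescript
// Minimal sketch of selector caching, keyed by hostname + requested fields.
// A real setup would likely persist this to disk between runs.
type SelectorMap = Record<string, string>;

const selectorCache = new Map<string, SelectorMap>();

function cacheKey(url: string, keys: string[]): string {
  return `${new URL(url).hostname}::${keys.join(",")}`;
}

async function getSelectors(
  url: string,
  keys: string[],
  askLlm: (url: string, keys: string[]) => Promise<SelectorMap>
): Promise<SelectorMap> {
  const key = cacheKey(url, keys);
  const cached = selectorCache.get(key);
  if (cached) return cached; // cache hit: skip the LLM round trip

  const selectors = await askLlm(url, keys); // cache miss: ask the LLM once
  selectorCache.set(key, selectors);
  return selectors;
}
```

Keying on hostname (rather than the full URL) is what makes the cache pay off: every product page on the same site shares one set of selectors.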
  • AI-Powered Extraction: Automatically generates CSS selectors using Google Gemini AI
  • Stealth Scraping: Uses Puppeteer with stealth plugins to avoid bot detection
  • JSON Output: Clean, structured JSON data output
  • Smart Retry Logic: Built-in retry mechanism with exponential backoff
  • Extensible LLM Support: Ready for OpenAI, Anthropic, and other AI providers
  • TypeScript: Full TypeScript support with comprehensive type definitions
  • CLI Interface: Use via command line or programmatically
  • Global Installation: Available as crava command or npx crava
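The retry logic from the list above can be sketched as a small helper (illustrative only, not Crava's internal code):

```typescript
// Retry an async operation with exponential backoff.
// The delay doubles each attempt: base, 2x base, 4x base, ...
async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries: number,
  baseDelayMs = 500
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === maxRetries) break; // out of retries, give up
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```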
Under the hood, the stack looks like this:
  • Gemini API Client: For SDK communication with the Gemini API.
  • React and Next.js: For building the front-end landing page.
  • Puppeteer: For managing web scraping.
  • GitHub Actions: For automating the pipeline and syncing design changes to the repository.
Curious how it works? Here's the flow, step by step:
  1. Page Loading: Crava uses Puppeteer with stealth plugins to load the target webpage, avoiding bot detection
  2. AI Analysis: The page HTML is cleaned and sent to AI (Gemini) to analyze content structure and generate extraction selectors
  3. Smart Extraction: Generated selectors are used to extract structured data, with fallback strategies for dynamic content
  4. Data Processing: Extracted data is cleaned, validated, and formatted as structured JSON
  5. Output: Results can be displayed in console or saved to JSON/CSV files
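Step 2 depends on trimming the raw HTML before it reaches the model, since scripts and styles burn tokens without adding structural signal. A rough sketch of that cleanup (my own minimal take; Crava's actual cleaning and the character budget here are assumptions):

```typescript
// Strip the parts of a page that add tokens but no structural signal
// before sending the HTML to the LLM for selector generation.
function cleanHtmlForLlm(html: string, maxChars = 20000): string {
  const cleaned = html
    .replace(/<script[\s\S]*?<\/script>/gi, "") // drop inline JS
    .replace(/<style[\s\S]*?<\/style>/gi, "") // drop inline CSS
    .replace(/<!--[\s\S]*?-->/g, "") // drop comments
    .replace(/\s+/g, " "); // collapse whitespace
  // Truncate so the prompt stays within the model's context budget
  return cleaned.slice(0, maxChars).trim();
}
```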
A few tips for writing configs that extract cleanly:
TypeScript
// Use specific, descriptive field names
const goodConfig = {
    keys: ["Product Title", "Sale Price", "Customer Rating", "Stock Status"],
};

// Add context with custom prompts
const betterConfig = {
    keys: ["Product Title", "Sale Price"],
    customPrompt: "Extract only products that are currently on sale",
};

// Handle dynamic content
const robustConfig = {
    keys: ["Article Title", "Author"],
    timeout: 45000, // Longer timeout for slow sites
    maxRetries: 5, // More retries for unreliable sites
};
While Crava offers powerful AI-driven web scraping capabilities, it has several important limitations to consider:
  • Requires an active AI provider API key and an internet connection to operate
  • Performance is directly tied to both page complexity and AI response times
  • Despite stealth mode, some websites may still detect and block automated scraping attempts
  • Heavily JavaScript-dependent sites often require extended timeout values to function properly
  • AI providers impose rate limits that can impact high-volume scraping operations
  • Extraction accuracy depends heavily on the clarity and structure of the target page's content
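The rate-limit caveat can at least be softened with a small client-side throttle. A sliding-window sketch (the clock is injectable to make it testable; the actual limits depend on your provider's quota, so the numbers here are placeholders):

```typescript
// Sliding-window rate limiter: allow at most `limit` calls per `windowMs`.
class RateLimiter {
  private timestamps: number[] = [];

  constructor(
    private limit: number,
    private windowMs: number,
    private now: () => number = Date.now
  ) {}

  // Returns true if a call is allowed right now (and records it).
  tryAcquire(): boolean {
    const t = this.now();
    // Drop timestamps that have fallen out of the window
    this.timestamps = this.timestamps.filter((ts) => t - ts < this.windowMs);
    if (this.timestamps.length >= this.limit) return false;
    this.timestamps.push(t);
    return true;
  }
}
```

When `tryAcquire()` returns false, the caller can sleep and try again instead of burning a request against the provider's quota.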
TypeScript
// For better performance on similar pages
const config = {
    keys: ["Title", "Price"],
    llm: {
        provider: "gemini",
        apiKey: process.env.GEMINI_API_KEY,
        temperature: 0.1, // Lower temperature = more consistent results
    },
    timeout: 20000, // Shorter timeout for fast sites
    maxRetries: 2, // Fewer retries for reliable sites
};

// For complex or slow sites
const robustConfig = {
    keys: ["Article Title", "Full Content", "Author"],
    llm: {
        provider: "gemini",
        apiKey: process.env.GEMINI_API_KEY,
        temperature: 0.3,
    },
    timeout: 60000, // Longer timeout
    maxRetries: 5, // More retries
    customPrompt:
        "Wait for all content to load. Focus on main article content.",
};