
Overview
🚀 Features
- AI-Powered Extraction: Automatically generates CSS selectors using Google Gemini AI
- Stealth Scraping: Uses Puppeteer with stealth plugins to avoid bot detection
- JSON Output: Clean, structured JSON data output
- Smart Retry Logic: Built-in retry mechanism with exponential backoff
- Extensible LLM Support: Ready for OpenAI, Anthropic, and other AI providers
- TypeScript: Full TypeScript support with comprehensive type definitions
- CLI Interface: Use via command line or programmatically
- Global Installation: Available as the crava command or via npx crava
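The retry behavior listed above can be pictured with a minimal sketch of exponential backoff. This is not Crava's internal implementation: `backoffDelayMs`, `retryWithBackoff`, and the default delays (500 ms base, 8000 ms cap) are hypothetical names and values chosen for illustration.

```typescript
// Hypothetical sketch of retry with exponential backoff; not Crava's internals.

/** Delay before retry number `attempt` (0-based): baseMs * 2^attempt, capped at maxMs. */
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 8000): number {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

/** Runs `task` up to maxRetries + 1 times, waiting longer after each failure. */
async function retryWithBackoff<T>(
  task: () => Promise<T>,
  maxRetries = 3,
  baseMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      // Only wait if another attempt is still allowed.
      if (attempt < maxRetries) {
        await sleep(backoffDelayMs(attempt, baseMs));
      }
    }
  }
  throw lastError;
}
```

Doubling the wait after each failure gives a slow or rate-limited site time to recover, and the cap keeps the total wait bounded.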
⚡ Technologies Used
- Gemini API Client: SDK for communicating with the Gemini API.
- React and Next.js: For building the front-end landing page.
- Puppeteer: For managing web scraping.
- GitHub Actions: For automating the pipeline and syncing design changes to the repository.
🔧 How It Works (In Detail)
- Page Loading: Crava uses Puppeteer with stealth plugins to load the target webpage, avoiding bot detection
- AI Analysis: The page HTML is cleaned and sent to the AI model (Gemini), which analyzes the content structure and generates extraction selectors
- Smart Extraction: Generated selectors are used to extract structured data, with fallback strategies for dynamic content
- Data Processing: Extracted data is cleaned, validated, and formatted as structured JSON
- Output: Results can be displayed in the console or saved to JSON/CSV files
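The last two steps above (Data Processing and Output) can be sketched with plain functions. This is an illustrative sketch only, not Crava's code: `cleanRecords`, `toCsv`, and the sample rows are hypothetical, and the cleaning rules (collapse whitespace, drop fully empty rows) are assumptions.

```typescript
// Hypothetical sketch of the "Data Processing" and "Output" steps; not Crava's code.

type RawRecord = Record<string, string | null | undefined>;
type CleanRecord = Record<string, string>;

/** Collapses whitespace, replaces missing fields with "", and drops fully empty rows. */
function cleanRecords(rows: RawRecord[], keys: string[]): CleanRecord[] {
  return rows
    .map((row) => {
      const clean: CleanRecord = {};
      for (const key of keys) {
        clean[key] = (row[key] ?? "").replace(/\s+/g, " ").trim();
      }
      return clean;
    })
    .filter((row) => keys.some((key) => row[key] !== ""));
}

/** Serializes records as CSV, quoting fields that contain commas, quotes, or newlines. */
function toCsv(rows: CleanRecord[], keys: string[]): string {
  const quote = (value: string): string =>
    /[",\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
  const lines = [keys, ...rows.map((row) => keys.map((key) => row[key]))];
  return lines.map((fields) => fields.map(quote).join(",")).join("\n");
}

const sampleKeys = ["Product Title", "Sale Price"];
const cleaned = cleanRecords(
  [
    { "Product Title": "  Desk   Lamp, Black ", "Sale Price": "$19.99" },
    { "Product Title": null, "Sale Price": "   " }, // dropped: every field is empty
  ],
  sampleKeys,
);

console.log(JSON.stringify(cleaned, null, 2)); // JSON output
console.log(toCsv(cleaned, sampleKeys)); // CSV output
```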
Best Practices and How to Integrate
TypeScript
// Use specific, descriptive field names
const goodConfig = {
  keys: ["Product Title", "Sale Price", "Customer Rating", "Stock Status"],
};

// Add context with custom prompts
const betterConfig = {
  keys: ["Product Title", "Sale Price"],
  customPrompt: "Extract only products that are currently on sale",
};

// Handle dynamic content
const robustConfig = {
  keys: ["Article Title", "Author"],
  timeout: 45000, // Longer timeout for slow sites
  maxRetries: 5, // More retries for unreliable sites
};
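A config like the ones above can be checked before a scrape starts. The sketch below is hedged: `ExtractionConfig` is a hypothetical shape inferred only from the fields used in this README, not Crava's published types, and `checkConfig` and its rules are illustrative. It also assumes `timeout` is in milliseconds, which the values above suggest but do not confirm.

```typescript
// Hypothetical shape inferred from the fields used in this README; not Crava's published types.
interface ExtractionConfig {
  keys: string[];
  customPrompt?: string;
  timeout?: number; // assumed to be milliseconds
  maxRetries?: number;
}

/** Returns a list of problems; an empty list means the config passed these checks. */
function checkConfig(config: ExtractionConfig): string[] {
  const problems: string[] = [];
  if (config.keys.length === 0) {
    problems.push("keys must not be empty");
  }
  const seen = new Set<string>();
  for (const key of config.keys) {
    const normalized = key.trim().toLowerCase();
    if (normalized === "") {
      problems.push("keys must not contain blank names");
    } else if (seen.has(normalized)) {
      problems.push(`duplicate key: ${key}`);
    }
    seen.add(normalized);
  }
  if (config.timeout !== undefined && config.timeout <= 0) {
    problems.push("timeout must be positive");
  }
  if (
    config.maxRetries !== undefined &&
    (!Number.isInteger(config.maxRetries) || config.maxRetries < 0)
  ) {
    problems.push("maxRetries must be a non-negative integer");
  }
  return problems;
}

console.log(checkConfig({ keys: ["Article Title", "Author"], timeout: 45000, maxRetries: 5 }));
console.log(checkConfig({ keys: ["Title", "title"], timeout: 0 }));
```

Collecting every problem in a list, instead of throwing on the first one, lets a caller report all config mistakes at once.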
Limitations & Considerations
🚀 Performance Tips
TypeScript
// For better performance on similar pages
const config = {
  keys: ["Title", "Price"],
  llm: {
    provider: "gemini",
    apiKey: process.env.GEMINI_API_KEY,
    temperature: 0.1, // Lower temperature = more consistent results
  },
  timeout: 20000, // Shorter timeout for fast sites
  maxRetries: 2, // Fewer retries for reliable sites
};

// For complex or slow sites
const robustConfig = {
  keys: ["Article Title", "Full Content", "Author"],
  llm: {
    provider: "gemini",
    apiKey: process.env.GEMINI_API_KEY,
    temperature: 0.3,
  },
  timeout: 60000, // Longer timeout
  maxRetries: 5, // More retries
  customPrompt:
    "Wait for all content to load. Focus on main article content.",
};
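Both configs read the key from process.env.GEMINI_API_KEY, which is undefined when the variable is not set. A small guard can surface that early with a clear message. `requireEnv` is a hypothetical helper for illustration, not part of Crava. It takes the environment as a parameter so it can be exercised without Node.js globals.

```typescript
// Hypothetical helper; `requireEnv` is not part of Crava.
type Env = Record<string, string | undefined>;

/** Returns the trimmed value of `name`, or throws if it is missing or blank. */
function requireEnv(name: string, env: Env): string {
  const value = env[name]?.trim();
  if (!value) {
    throw new Error(`Missing environment variable ${name}`);
  }
  return value;
}

// Usage in Node.js: const apiKey = requireEnv("GEMINI_API_KEY", process.env);
```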