## AI Master Prompt: Web Scraper SaaS
### PROJECT OVERVIEW:
Create a single-page React application (SPA) for a SaaS platform that allows users to crawl entire websites with a single API call. The application should provide an intuitive interface for managing crawl jobs, viewing results, and integrating with the Cloudflare Browser Rendering API. The core value proposition is to simplify and automate web data extraction, making it accessible for training AI models (like RAG pipelines), research, content monitoring, and general data scraping needs. The platform must respect `robots.txt` and AI Crawl Control by default, ensuring ethical and compliant web crawling.
### TECH STACK:
- **Frontend Framework:** React (v18+ with Hooks)
- **Styling:** Tailwind CSS (v3+) for rapid UI development and utility-first styling.
- **State Management:** Zustand or React Context API for efficient global and local state management.
- **Routing:** React Router DOM for navigation within the SPA.
- **API Client:** Axios for making HTTP requests to the backend/Cloudflare API.
- **UI Components:** Potentially a lightweight component library like Headless UI or Radix UI for accessible and customizable components.
- **Build Tool:** Vite for fast development and optimized builds.
- **Form Handling:** React Hook Form for efficient form validation and submission.
- **Icons:** Heroicons or similar for clean SVG icons.
### CORE FEATURES:
1. **User Authentication (Simplified for MVP):**
* **Flow:** Users can input their Cloudflare API Token directly into a settings or profile section. No complex auth system is needed for MVP; the focus is on using the provided token.
* **Details:** A dedicated input field for the Cloudflare API Token and Account ID. Store these securely in local storage or application state (with clear warnings about security). An option to clear/reset the token.
2. **Crawl Job Initiation:**
* **Flow:** User enters a starting URL for the website they want to crawl into a prominent input field. They select the desired output format (HTML, Markdown, JSON). Upon submission, a POST request is sent to the backend (which proxies to Cloudflare API). The application receives a `job_id` and displays a success message, guiding the user to the 'My Crawls' section.
* **Details:** A form with an input for the URL, a dropdown/radio buttons for output format, and a 'Start Crawl' button. Input validation for the URL format. Clear feedback on submission (e.g., 'Crawl initiated successfully. Your Job ID is...').
3. **Crawl Job Monitoring ('My Crawls'):**
* **Flow:** A dedicated page or section lists all initiated crawl jobs. Each job entry shows the Job ID, submitted URL, status (e.g., Pending, Processing, Completed, Failed), submission timestamp, and estimated completion time (if available). Users can click on a job to view its status and, once completed, download the results in the chosen format. A refresh button or automatic polling mechanism updates job statuses.
* **Details:** Table or list view of crawl jobs. Columns: Job ID, URL, Status, Date. Each row is expandable or links to a detail view. Statuses should be visually distinct (e.g., using color-coded badges). A 'Download' button appears when the job is 'Completed'. Ability to manually refresh the status list.
4. **Result Download:**
* **Flow:** Once a crawl job is completed, a 'Download' button becomes active. Clicking this button triggers a download of the processed data (HTML, Markdown, or JSON) corresponding to the crawl job. The data could be served directly from a temporary backend storage or proxied from Cloudflare's output URLs if feasible.
* **Details:** The download mechanism should handle different file types correctly (e.g., setting appropriate Content-Type headers). For JSON, it should be pretty-printed. For HTML/Markdown, it should be the raw content.
5. **Settings/Profile Management:**
* **Flow:** A simple section where users can input and save their Cloudflare API Token and Account ID. This information is crucial for making API calls.
* **Details:** Input fields for 'Cloudflare API Token' and 'Account ID'. A 'Save' button. Clear indication if the token is currently set or not. Security warnings about handling API tokens.
### UI/UX DESIGN:
- **Layout:** Single Page Application (SPA) with a clean, modern sidebar navigation. The main content area will display the active section (e.g., 'New Crawl', 'My Crawls', 'Settings').
- **Color Palette:** Primary: A deep blue or dark gray (#1A202C). Secondary: A vibrant accent color like teal or a bright blue (#4299E1 or #38B2AC) for buttons, links, and active states. Neutral: Shades of gray for text, borders, and backgrounds (#EDF2F7, #A0AEC0).
- **Typography:** Sans-serif fonts like Inter or Lato for readability. Clear hierarchy using font weights and sizes.
- **Component Hierarchy:** A main `App` component, routing handled by `React Router`. Components like `Sidebar`, `Header`, `CrawlForm`, `CrawlJobList`, `CrawlJobItem`, `SettingsForm`, `StatusBadge`, `DownloadButton`. Use composition for building complex components.
- **Responsive Design:** Mobile-first approach. The layout should adapt gracefully to various screen sizes. Sidebar might collapse into a hamburger menu on smaller screens. Ensure all forms and tables are usable on mobile.
- **Visual Feedback:** Use subtle animations for loading states, hover effects on buttons and links, and transitions between page states. Loading spinners or skeletons should be used when data is being fetched or processed.
### COMPONENT BREAKDOWN:
- **`App.js`:** Main application component. Sets up routing (React Router), manages global layout (Sidebar, Header, Main Content Area).
- Props: None
- Responsibilities: Application shell, routing setup.
- **`Sidebar.js`:** Navigation menu.
- Props: `activeLink` (string)
- Responsibilities: Display navigation links (New Crawl, My Crawls, Settings), highlight active link.
- **`Header.js`:** Top bar, could contain app logo and user info/settings shortcut.
- Props: None
- Responsibilities: Branding, application header.
- **`NewCrawlPage.js`:** Container for the crawl initiation form.
- Props: None
- Responsibilities: Orchestrates `CrawlForm` and submission logic.
- **`CrawlForm.js`:** Form for submitting a new crawl job.
- Props: `onSubmit` (function), `isLoading` (boolean)
- Responsibilities: Renders URL input, format selection, submit button. Handles form input and validation. Calls `onSubmit` prop on valid submission.
- **`MyCrawlsPage.js`:** Container for displaying the list of crawl jobs.
- Props: None
- Responsibilities: Fetches crawl job data, orchestrates `CrawlJobList`.
- **`CrawlJobList.js`:** Renders a list or table of crawl jobs.
- Props: `jobs` (array)
- Responsibilities: Iterates over job data and renders `CrawlJobItem` for each. Handles display logic (e.g., table vs. list).
- **`CrawlJobItem.js`:** Displays a single crawl job's details and actions.
- Props: `job` (object: { id, url, status, date })
- Responsibilities: Renders individual job details. Shows status badge, download button (conditionally).
- **`SettingsPage.js`:** Container for settings management.
- Props: None
- Responsibilities: Manages API token input and saving.
- **`SettingsForm.js`:** Form for Cloudflare API Token and Account ID.
- Props: `onSave` (function), `initialToken` (string), `initialAccountId` (string)
- Responsibilities: Renders input fields for token/account ID. Handles saving via `onSave` prop.
- **`StatusBadge.js`:** Visual indicator for job status.
- Props: `status` (string)
- Responsibilities: Displays status text with appropriate styling (e.g., color coding).
- **`DownloadButton.js`:** Button to download crawl results.
- Props: `jobId` (string), `format` (string), `isDisabled` (boolean)
- Responsibilities: Triggers download process. Disabled if job is not completed.
### DATA MODEL:
- **`AppSettings` State:**
```javascript
{
apiToken: 'your_cloudflare_api_token' | null,
accountId: 'your_cloudflare_account_id' | null
}
```
* Stored in `localStorage`.
- **`CrawlJob` Object Structure:**
```javascript
{
id: 'unique_job_id_from_api',
url: 'https://example.com',
status: 'PENDING' | 'PROCESSING' | 'COMPLETED' | 'FAILED',
format: 'html' | 'markdown' | 'json',
submittedAt: '2023-10-27T10:00:00Z',
completedAt: '2023-10-27T10:05:00Z' | null,
resultUrl: 'url_to_download_or_api_endpoint' | null
}
```
- **`My Crawls` State (`jobs`):** An array of `CrawlJob` objects. Fetched from a mock API or a simple backend endpoint in a real scenario.
```javascript
const jobs = [
{ id: 'job_123', url: 'https://blog.example.com', status: 'COMPLETED', format: 'json', submittedAt: '...', completedAt: '...', resultUrl: '/results/job_123.json' },
{ id: 'job_456', url: 'https://docs.example.com', status: 'PROCESSING', format: 'html', submittedAt: '...', completedAt: null, resultUrl: null },
// ... more jobs
];
```
### ANIMATIONS & INTERACTIONS:
- **Hover Effects:** Subtle background color change or slight upward shadow lift on buttons, links, and clickable list items.
- **Transitions:** Smooth transitions for sidebar collapse/expand, route changes (if using a library that supports it), and status badge color changes.
- **Loading States:** Use skeleton loaders or spinners within the `CrawlJobList` while data is being fetched. Disable the 'Start Crawl' button and show a spinner during submission. Indicate processing status clearly on `CrawlJobItem`.
- **Micro-interactions:** Slight animation on form submission confirmation. Visual feedback when API token is successfully saved.
### EDGE CASES:
- **No API Token Set:** Prevent initiating crawls. Display a clear message prompting the user to go to Settings and enter their credentials.
- **Invalid URL:** Client-side validation on the `CrawlForm` to check for basic URL structure (e.g., starts with `http://` or `https://`). Show clear error messages near the input field.
- **API Errors:** Gracefully handle errors from the Cloudflare API (e.g., invalid token, rate limiting, server errors). Display user-friendly error messages, potentially including the `job_id` for reference. Update job status to 'FAILED' with an error description if possible.
- **Empty State (`My Crawls`):** When no crawl jobs have been initiated, display a helpful message and a button or link to initiate a new crawl.
- **Large Data Sets:** For JSON output, consider how large JSON files will be handled during download (e.g., streaming). For this MVP, assume manageable file sizes.
- **Accessibility (a11y):** Ensure all interactive elements are focusable and have appropriate ARIA attributes. Use semantic HTML. Ensure sufficient color contrast.
### SAMPLE DATA:
1. **Mock Crawl Job (Completed JSON):**
```json
{
"id": "job_abc123",
"url": "https://quotes.toscrape.com/",
"status": "COMPLETED",
"format": "json",
"submittedAt": "2024-03-10T14:30:00Z",
"completedAt": "2024-03-10T14:32:15Z",
"resultUrl": "/api/results/job_abc123.json"
}
```
2. **Mock Crawl Job (Processing HTML):**
```json
{
"id": "job_def456",
"url": "https://blog.cloudflare.com/",
"status": "PROCESSING",
"format": "html",
"submittedAt": "2024-03-10T15:00:00Z",
"completedAt": null,
"resultUrl": null
}
```
3. **Mock Crawl Job (Failed):**
```json
{
"id": "job_ghi789",
"url": "https://nonexistent-site.xyz/",
"status": "FAILED",
"format": "markdown",
"submittedAt": "2024-03-10T15:10:00Z",
"completedAt": null,
"resultUrl": null,
"errorMessage": "Failed to resolve DNS for nonexistent-site.xyz"
}
```
4. **Mock API Response for `/api/results/job_abc123.json`:**
```json
[
{
"url": "https://quotes.toscrape.com/",
"htmlContent": "<!DOCTYPE html>...</html>",
"markdownContent": "# Welcome to Scrape Quotes\n...",
"structuredJsonContent": {
"title": "Quotes to Scrape",
"quotes": [
{ "text": "...", "author": "..." },
{ "text": "...", "author": "..." }
]
}
},
{
"url": "https://quotes.toscrape.com/page/1/",
"htmlContent": "...",
"markdownContent": "...",
"structuredJsonContent": { ... }
}
]
```
*(Note: The actual Cloudflare API returns results per page. This mock aggregates them conceptually for download, or the download button could trigger fetching individual page results.)*
5. **Mock Settings State:**
```javascript
{
apiToken: 'sk_live_abcdef1234567890',
accountId: '9876543210abcdef'
}
```
### DEPLOYMENT NOTES:
- **Build Command:** Use Vite's build command (`npm run build` or `yarn build`).
- **Environment Variables:** Use `.env` files (e.g., `.env.local`, `.env.production`) for managing Cloudflare API base URL and any other necessary configuration. Vite supports importing env variables prefixed with `VITE_`.
- **Proxying API Requests:** For development, Vite's `server.proxy` option can be used to proxy requests to the Cloudflare API, avoiding CORS issues. In production, a lightweight backend server (Node.js/Express, Python/Flask) might be needed to securely handle API calls and manage job states, or use Cloudflare Workers.
- **State Persistence:** `AppSettings` should be persisted in `localStorage` for a better user experience between sessions.
- **Performance:** Optimize bundle size using code-splitting if the app grows. Lazy-load components. Optimize image loading if any are used in the UI.
- **HTTPS:** Ensure the production deployment is served over HTTPS.