## AI Master Prompt: GPU Bloom - MegaTrain-based LLM Training Platform MVP
**PROJECT OVERVIEW:**
GPU Bloom is a SaaS platform for training large (100B+ parameter) Large Language Models (LLMs) without a GPU cluster. It addresses prohibitive hardware cost and limited access by implementing a memory-centric training system inspired by the MegaTrain paper: parameters and optimizer state live in host memory (CPU RAM), and a single high-end GPU acts as a transient compute engine, so models can be trained at full precision on one GPU. Users upload a model, configure training parameters, monitor progress in real time, and download the trained model. The core value proposition is efficient, cost-effective, high-precision LLM training for researchers, developers, and small teams who cannot own or rent clusters of expensive GPUs.
**TECH STACK:**
- **Frontend:** React (Next.js 14+ with App Router), Tailwind CSS, shadcn/ui for components
- **Backend:** Next.js Route Handlers (the App Router form of API routes) on the Node.js runtime
- **ORM:** Drizzle ORM with PostgreSQL (or a compatible SQL database like SQLite for local dev/simpler deployment)
- **Authentication:** NextAuth.js (using credentials provider and potentially OAuth for Google/GitHub)
- **State Management:** React Context API and Zustand for global state, component-local state for forms/UI elements.
- **Real-time Updates:** Server-Sent Events (SSE) or WebSockets for live monitoring dashboard.
- **Deployment:** Vercel (recommended for Next.js) or a similar platform supporting Node.js runtimes.
- **Other:** Zod for schema validation, React Hook Form for forms, charting library (e.g., Chart.js or Recharts) for monitoring graphs.
**DATABASE SCHEMA:**
1. **`users`**
* `id` (UUID, PK)
* `name` (VARCHAR)
* `email` (VARCHAR, UNIQUE)
* `emailVerified` (TIMESTAMP)
* `image` (VARCHAR, nullable)
* `createdAt` (TIMESTAMP)
* `updatedAt` (TIMESTAMP)
2. **`accounts`** (for NextAuth.js)
* `id` (BIGSERIAL, PK)
* `userId` (UUID, FK to users.id)
* `type` (TEXT)
* `provider` (TEXT)
* `providerAccountId` (TEXT)
* `refresh_token` (TEXT, nullable)
* `access_token` (TEXT, nullable)
* `expires_at` (BIGINT, nullable)
* `token_type` (TEXT, nullable)
* `scope` (TEXT, nullable)
* `id_token` (TEXT, nullable)
* `session_state` (TEXT, nullable)
3. **`sessions`** (for NextAuth.js)
* `sessionToken` (VARCHAR, PK)
* `userId` (UUID, FK to users.id)
* `expires` (TIMESTAMP)
4. **`verificationTokens`** (for NextAuth.js)
* `identifier` (TEXT)
* `token` (TEXT)
* `expires` (TIMESTAMP)
5. **`projects`**
* `id` (UUID, PK)
* `userId` (UUID, FK to users.id)
* `name` (VARCHAR, NOT NULL)
* `description` (TEXT, nullable)
* `modelArchitecture` (VARCHAR, e.g., 'transformer', 'custom')
* `modelSize` (INTEGER, in billions of parameters, e.g., 120 for a 120B-parameter model)
* `trainingDataPath` (VARCHAR, reference to cloud storage or internal path)
* `createdAt` (TIMESTAMP)
* `updatedAt` (TIMESTAMP)
6. **`trainingJobs`**
* `id` (UUID, PK)
* `projectId` (UUID, FK to projects.id)
* `userId` (UUID, FK to users.id)
* `status` (VARCHAR, e.g., 'queued', 'running', 'completed', 'failed')
* `config` (JSONB, store training hyperparameters like learning rate, batch size, epochs etc.)
* `gpuType` (VARCHAR, e.g., 'H200')
* `hostMemory` (BIGINT, in GB)
* `startTime` (TIMESTAMP, nullable)
* `endTime` (TIMESTAMP, nullable)
* `errorMessage` (TEXT, nullable)
* `createdAt` (TIMESTAMP)
* `updatedAt` (TIMESTAMP)
7. **`trainingMetrics`** (Real-time data stream)
* `id` (BIGSERIAL, PK)
* `trainingJobId` (UUID, FK to trainingJobs.id)
* `timestamp` (TIMESTAMP)
* `loss` (FLOAT)
* `learningRate` (FLOAT)
* `gpuUtilization` (FLOAT, percent, 0-100)
* `cpuUtilization` (FLOAT, percent, 0-100)
* `memoryUsage` (BIGINT, in MB)
* `throughput` (FLOAT, e.g., samples/sec)
* `epoch` (INTEGER, nullable)
* `step` (BIGINT, nullable)
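The two job-related tables can be mirrored in application code as plain types. The following is a minimal sketch, not the Drizzle schema itself: field names follow the tables above, while the TypeScript types (ISO 8601 strings for timestamps, `number` for BIGINT columns) and the `isJobStatus` helper are assumptions made for illustration.

```typescript
// Hypothetical application-level mirror of `trainingJobs` and `trainingMetrics`.
// Column names come from the schema above; the TS types are assumptions.
const JOB_STATUSES = ["queued", "running", "completed", "failed"] as const;
type JobStatus = (typeof JOB_STATUSES)[number];

interface TrainingJobRow {
  id: string; // UUID
  projectId: string; // FK to projects.id
  userId: string; // FK to users.id
  status: JobStatus;
  config: Record<string, unknown>; // JSONB hyperparameters
  gpuType: string; // e.g. "H200"
  hostMemory: number; // GB
  startTime: string | null; // ISO 8601
  endTime: string | null; // ISO 8601
  errorMessage: string | null;
  createdAt: string;
  updatedAt: string;
}

interface TrainingMetricRow {
  id: number;
  trainingJobId: string; // FK to trainingJobs.id
  timestamp: string; // ISO 8601
  loss: number;
  learningRate: number;
  gpuUtilization: number; // percent
  cpuUtilization: number; // percent
  memoryUsage: number; // MB
  throughput: number; // e.g. samples/sec
  epoch: number | null;
  step: number | null;
}

// Type guard for values read from the VARCHAR `status` column,
// which the database itself does not restrict to the four statuses.
function isJobStatus(value: unknown): value is JobStatus {
  return typeof value === "string" && JOB_STATUSES.some((s) => s === value);
}
```

Because `status` is stored as VARCHAR, a guard like this (or a database CHECK constraint) is one way to keep unexpected values out of the UI.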
**CORE FEATURES & USER FLOW:**
1. **Authentication & User Management:**
* **Flow:** User lands on the homepage. Clicks 'Sign Up' or 'Login'. Redirected to Auth page. Options: Email/Password or OAuth (Google/GitHub). Upon successful login, user is redirected to their Dashboard. User profile page allows viewing/editing basic info and managing subscription (future).
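* **Details:** Use NextAuth.js. Implement email verification for credential sign-up. Store passwords only as salted hashes (e.g., bcrypt or Argon2). NextAuth.js supports the credentials provider only with the JWT session strategy, so configure `session.strategy: "jwt"`; the `sessions` table is then unused unless the app later moves to OAuth-only database sessions. Protect all routes except the landing page and auth pages.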
2. **Project Creation:**
* **Flow:** Logged-in user navigates to 'Projects' tab. Clicks 'New Project'. A modal or form appears asking for Project Name, Description, Model Architecture (dropdown: Transformer, Custom), Model Size (input: e.g., 120), Training Data Path (input/upload link). Click 'Create Project'. Project is saved to DB and user is redirected to the project details page.
* **Details:** Use `react-hook-form` and `zod` for validation. Store project metadata in `projects` table. Training data path will initially be a placeholder; actual data handling is complex for MVP.
3. **Training Job Configuration & Launch:**
* **Flow:** User views a specific project. Clicks 'Start New Training Job'. A detailed form appears: Training Configuration (JSON input or structured fields for learning rate, batch size, epochs, optimizer choice etc.), GPU Type (dropdown: H200), Host Memory (input, e.g., 1536 GB). User clicks 'Start Training'. A new entry is created in `trainingJobs` table with status 'queued'. A backend process (or separate worker service) picks up 'queued' jobs and initiates the MegaTrain simulation/process.
* **Details:** The actual MegaTrain execution will be simulated or integrated via an API call to a specialized backend service in MVP. The backend service will manage GPU allocation (simulated in MVP by assigning a job ID to a conceptual GPU). User config is stored in `trainingJobs.config` (JSONB).
4. **Real-time Training Monitoring:**
* **Flow:** User navigates to the 'Training Jobs' list or a specific job's detail page. If a job is 'running', a live dashboard appears. The dashboard loads existing rows from the `trainingMetrics` table once, then receives new metrics over SSE/WebSockets. Graphs display Loss, Learning Rate, GPU/CPU Utilization, and Memory Usage over time. Key stats (current epoch, step, throughput) are displayed prominently.
* **Details:** Backend process responsible for the training job will push metrics to the `trainingMetrics` table periodically. Frontend client subscribes to these updates via SSE or WebSockets, updating the UI dynamically without full page reloads.
5. **Model Download / Management:**
* **Flow:** Once a training job status is 'completed', the user sees a 'Download Model' button on the job details page. Clicking this initiates a download of the trained model weights (simulated as a downloadable file or a link to cloud storage). UI also shows completed jobs and their final metrics.
* **Details:** For MVP, this could be a placeholder button or download a small dummy file. Actual large model file transfer is complex; focus on the UI/UX indication of completion and readiness for download.
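The job lifecycle described in features 3 to 5 (a worker picks up 'queued' jobs, runs them, and marks them 'completed' or 'failed') can be sketched as a small in-memory state machine. This is an illustration only: the four statuses come from the spec, but the transition table, `transitionJob`, and `claimNextQueuedJob` are hypothetical names, and a real worker would claim rows in PostgreSQL rather than in an array.

```typescript
// Statuses are from the spec; which transitions are legal is an assumption.
type QueueStatus = "queued" | "running" | "completed" | "failed";

const ALLOWED_TRANSITIONS: Record<QueueStatus, QueueStatus[]> = {
  queued: ["running", "failed"],
  running: ["completed", "failed"],
  completed: [],
  failed: [],
};

interface QueueJob {
  id: string;
  status: QueueStatus;
  createdAt: string; // ISO 8601
  startTime: string | null;
  endTime: string | null;
  errorMessage: string | null;
}

// Returns an updated copy of the job, or throws on an illegal transition.
function transitionJob(
  job: QueueJob,
  next: QueueStatus,
  now: Date,
  errorMessage?: string
): QueueJob {
  if (ALLOWED_TRANSITIONS[job.status].indexOf(next) === -1) {
    throw new Error(`Illegal transition ${job.status} -> ${next}`);
  }
  const stamp = now.toISOString();
  const finished = next === "completed" || next === "failed";
  return {
    ...job,
    status: next,
    startTime: next === "running" ? stamp : job.startTime,
    endTime: finished ? stamp : job.endTime,
    errorMessage: next === "failed" ? (errorMessage ?? "Unknown error") : job.errorMessage,
  };
}

// Picks the oldest queued job (by createdAt) and marks it running.
function claimNextQueuedJob(jobs: QueueJob[], now: Date): QueueJob | null {
  const queued = jobs
    .filter((j) => j.status === "queued")
    .sort((a, b) => Date.parse(a.createdAt) - Date.parse(b.createdAt));
  const first = queued[0];
  return first ? transitionJob(first, "running", now) : null;
}
```

Keeping the legal transitions in one table makes it harder for the simulated worker and the API to disagree about, for example, whether a 'completed' job can be restarted.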
**API & DATA FETCHING:**
- **Authentication API:** Handled by NextAuth.js (e.g., `POST /api/auth/...`)
- **Project API (`/api/projects`)**
* `GET /`: Fetch all projects for the logged-in user.
* `POST /`: Create a new project.
* `GET /[projectId]`: Fetch details for a specific project.
* `PUT /[projectId]`: Update a project.
* `DELETE /[projectId]`: Delete a project.
- **Training Job API (`/api/training-jobs`)**
* `GET /?projectId=[projectId]`: Fetch all training jobs for a project.
* `POST /`: Start a new training job (takes `projectId`, training config, GPU type, and host memory; creates the job with status 'queued').
* `GET /[jobId]`: Fetch details for a specific training job (including status).
* `GET /[jobId]/metrics`: Fetch historical metrics for a job (for initial load before SSE starts).
- **Metrics Streaming (`/api/training-jobs/[jobId]/stream`)**
* Use Server-Sent Events (SSE) to stream new metrics from `trainingMetrics` table.
- **Data Fetching:** Utilize Next.js `fetch` API within Server Components for initial data loads where possible. Use client-side fetching (e.g., `useEffect` with `fetch` or libraries like SWR/React Query) for dynamic data, form submissions, and real-time updates. Data validation using Zod on both client and server.
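For the metrics stream, each new `trainingMetrics` row has to be serialized in the SSE wire format: optional `event:` and `id:` fields, a `data:` field, and a blank line ending the message. A minimal sketch of that serialization follows; the `formatSseMessage` helper and the `metric` event name are assumptions, not part of the spec.

```typescript
// Hypothetical helper for the /api/training-jobs/[jobId]/stream endpoint.
interface SseMessage {
  event?: string; // event name the client listens for
  id?: string; // lets the browser resume via the Last-Event-ID header
  data: Record<string, unknown>;
}

// Serializes one message in the text/event-stream format.
function formatSseMessage(msg: SseMessage): string {
  const lines: string[] = [];
  if (msg.event) lines.push(`event: ${msg.event}`);
  if (msg.id) lines.push(`id: ${msg.id}`);
  // Compact JSON.stringify output contains no raw newlines,
  // so a single data line is enough.
  lines.push(`data: ${JSON.stringify(msg.data)}`);
  // A blank line terminates the message.
  return lines.join("\n") + "\n\n";
}
```

The route would send these strings with a `Content-Type: text/event-stream` header, and the browser can consume them with the standard `EventSource` API, for example `source.addEventListener("metric", handler)`.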
**COMPONENT BREAKDOWN (Next.js App Router Structure):**
- **`app/`**
* **`layout.tsx`**: Root layout (`<html>` and `<body>` tags, `metadata` export, global providers, global Tailwind CSS import).
* **`page.tsx`**: Landing Page (Marketing content, value proposition, CTA to sign up/login).
* **`auth/page.tsx`**: Authentication page (Login/Sign Up form).
* **`(app)/`**: Authenticated routes group.
* **`layout.tsx`**: Main app layout (Sidebar, Header, Content area).
* **`dashboard/page.tsx`**: User Dashboard (Overview of projects, recent jobs, quick stats).
* Components: `ProjectCard`, `RecentJobsTable`, `StatSummary`.
* **`projects/page.tsx`**: List of all user projects.
* Components: `ProjectList`, `CreateProjectButton`, `ProjectListItem`.
* **`projects/[projectId]/page.tsx`**: Project Detail Page (Project info, list of training jobs for this project).
* Components: `ProjectDetailsCard`, `TrainingJobsTable`, `StartTrainingJobButton` (opens modal).
* **`projects/[projectId]/jobs/[jobId]/page.tsx`**: Training Job Detail Page (Real-time monitoring dashboard, job configuration, results).
* Components: `JobStatusBadge`, `TrainingConfigView`, `LiveMetricsDashboard` (uses `MetricsChart` and `StatDisplay`), `ModelDownloadButton`.
* **`settings/page.tsx`**: User Settings (Profile info, API keys - future).
* **`api/`**: API Routes (NextAuth, project CRUD, job CRUD, metrics stream).
- **`components/`**
* **`ui/`**: Re-usable UI components from `shadcn/ui` (Button, Input, Card, Table, Dialog, Form, etc.).
* **`shared/`**: Custom shared components.
* `Sidebar.tsx`: Navigation menu.
* `Header.tsx`: Top navigation/user menu.
* `Footer.tsx`: Page footer.
* `LoadingSpinner.tsx`: Loading indicator.
* `ErrorBoundary.tsx`: For error handling.
* `MetricsChart.tsx`: Re-usable chart component.
* `StatDisplay.tsx`: For displaying single metrics.
* `ProjectForm.tsx`: Modal/Form for creating/editing projects.
* `TrainingJobForm.tsx`: Modal/Form for configuring training jobs.
**UI/UX DESIGN & VISUAL IDENTITY:**
- **Design Style:** Modern, clean, professional, with a subtle tech-forward feel. Focus on clarity and data visualization.
- **Color Palette:**
* Primary: Deep Blue (`#1E3A8A` - Blue 900)
* Secondary: Teal (`#0694A2` - custom teal, not a default Tailwind shade)
* Accent/Call to Action: Bright Purple (`#7C3AED` - Violet 600)
* Background: Dark Gray (`#1F2937` - Gray 800)
* Surface/Card Background: Slightly lighter Dark Gray (`#2D3748` - custom shade between Gray 800 and Gray 700)
* Text (Primary): Light Gray (`#E5E7EB` - Gray 200)
* Text (Secondary): Medium Gray (`#9CA3AF` - Gray 400)
* Success: Green (`#10B981` - Emerald 500)
* Error: Red (`#EF4444` - Red 500)
- **Typography:** Sans-serif. Use Inter or similar modern font. Clear hierarchy using font weights and sizes.
- **Layout:** Sidebar navigation on the left for authenticated app. Main content area takes up the rest of the space. Use a consistent grid system (e.g., 12-column). Cards for summaries and details. Clean forms with clear labels and validation feedback.
- **Responsiveness:** Mobile-first approach. Sidebar collapses into a hamburger menu on smaller screens. Content reflows to fit screen width. Tables should be responsive (e.g., horizontal scroll or column hiding).
**ANIMATIONS:**
- **Page Transitions:** Subtle fade-in/out using CSS transitions or a library like `Framer Motion` (optional for MVP). Next.js has no built-in page transition API.
- **Hover Effects:** Slight scale-up or background color change on interactive elements (buttons, links, cards).
- **Loading States:** Use `shadcn/ui` skeleton loaders or spinners (`LoadingSpinner.tsx`) while data is being fetched. Add shimmering effect to skeleton loaders.
- **Micro-interactions:** Smooth transitions for expanding/collapsing sections, form submission feedback.
- **Chart Animations:** Smooth animations for data updates and initial chart rendering.
**EDGE CASES & VALIDATIONS:**
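- **Authentication:** Redirect unauthenticated users to `/auth`. Protect every route in the `(app)` route group; the group name does not appear in URLs, so this covers `/dashboard`, `/projects`, `/settings`, and their children. Handle expired sessions gracefully.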
- **Empty States:** Display user-friendly messages and clear CTAs when projects list, job list, or monitoring data is empty (e.g., "No projects created yet. Click 'New Project' to start.").
- **Form Validation:** Use Zod for robust schema validation on all user inputs (project names, training parameters, etc.). Provide clear, inline error messages.
- **API Errors:** Implement centralized error handling. Display user-friendly error messages for API failures (e.g., "Failed to start training job. Please try again."). Log detailed errors on the server.
- **Data Integrity:** Use database transactions where appropriate (e.g., creating a project and its initial job). Ensure foreign key constraints are used.
- **Long-Running Operations:** Clearly indicate the status of training jobs ('queued', 'running', 'completed', 'failed'). Use SSE/WebSockets to ensure the monitoring dashboard reflects the latest status.
- **Resource Limits:** (Future) Implement checks for subscription limits (GPU hours, storage).
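The form-validation rule above can be illustrated with a dependency-free stand-in for the Zod schema. The sketch below checks the four hyperparameters used in this document's sample configs and returns one inline message per invalid field. The specific limits (positive learning rate, integer batch size and epochs of at least 1) and the name `validateTrainingConfig` are assumptions, not requirements from the spec.

```typescript
// Fields match the sample `config` objects in this document.
interface TrainingConfigInput {
  learningRate: number;
  batchSize: number;
  epochs: number;
  optimizer: string;
}

// One message per invalid field, suitable for inline form errors.
type FieldErrors = Partial<Record<keyof TrainingConfigInput, string>>;

function isPositiveInteger(value: unknown): boolean {
  return typeof value === "number" && isFinite(value) && Math.floor(value) === value && value >= 1;
}

// Hand-written equivalent of what a Zod schema would enforce (limits are assumed).
function validateTrainingConfig(input: Record<string, unknown>): FieldErrors {
  const errors: FieldErrors = {};
  const lr = input["learningRate"];
  if (typeof lr !== "number" || !isFinite(lr) || lr <= 0) {
    errors.learningRate = "Learning rate must be a positive number.";
  }
  if (!isPositiveInteger(input["batchSize"])) {
    errors.batchSize = "Batch size must be an integer of at least 1.";
  }
  if (!isPositiveInteger(input["epochs"])) {
    errors.epochs = "Epochs must be an integer of at least 1.";
  }
  const optimizer = input["optimizer"];
  if (typeof optimizer !== "string" || optimizer.trim() === "") {
    errors.optimizer = "Choose an optimizer.";
  }
  return errors;
}
```

An empty result means the config can be stored in `trainingJobs.config`; running the same function on the client and in the API route keeps the two sets of checks from drifting apart.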
**SAMPLE DATA (for Mocking/Initial State):**
IDs below are shortened placeholders for readability; the schema stores `id` columns as UUIDs.
1. **User:**
```json
{
"id": "usr_12345abc",
"name": "Alice Smith",
"email": "alice.smith@example.com"
}
```
2. **Project:**
```json
{
"id": "proj_abcdef12",
"userId": "usr_12345abc",
"name": "QuantumLeap-130B",
"description": "Experimenting with Quantum NLP concepts on a 130B parameter model.",
"modelArchitecture": "transformer",
"modelSize": 130,
"trainingDataPath": "s3://my-bucket/datasets/quantum-corpus/",
"createdAt": "2023-10-26T10:00:00Z"
}
```
3. **Training Job (Queued):**
```json
{
"id": "job_qwert123",
"projectId": "proj_abcdef12",
"userId": "usr_12345abc",
"status": "queued",
"config": {"learningRate": 0.0001, "batchSize": 1, "epochs": 3, "optimizer": "AdamW"},
"gpuType": "H200",
"hostMemory": 1536,
"createdAt": "2023-10-26T11:00:00Z"
}
```
4. **Training Job (Running):**
```json
{
"id": "job_asdfg456",
"projectId": "proj_abcdef12",
"userId": "usr_12345abc",
"status": "running",
"config": {"learningRate": 0.0001, "batchSize": 1, "epochs": 3, "optimizer": "AdamW"},
"gpuType": "H200",
"hostMemory": 1536,
"startTime": "2023-10-26T11:05:00Z",
"createdAt": "2023-10-26T11:00:00Z"
}
```
5. **Training Job (Completed):**
```json
{
"id": "job_zxcvb789",
"projectId": "proj_abcdef12",
"userId": "usr_12345abc",
"status": "completed",
"config": {"learningRate": 0.0001, "batchSize": 1, "epochs": 3, "optimizer": "AdamW"},
"gpuType": "H200",
"hostMemory": 1536,
"startTime": "2023-10-26T12:00:00Z",
"endTime": "2023-10-26T14:00:00Z",
"createdAt": "2023-10-26T11:55:00Z"
}
```
6. **Training Metric (Sample for running job):**
```json
{
"trainingJobId": "job_asdfg456",
"timestamp": "2023-10-26T11:05:15Z",
"loss": 1.2345,
"learningRate": 0.000098,
"gpuUtilization": 95.5,
"cpuUtilization": 60.0,
"memoryUsage": 75000,
"throughput": 15.2,
"epoch": 1,
"step": 500
}
```
7. **Training Metric (Another sample):**
```json
{
"trainingJobId": "job_asdfg456",
"timestamp": "2023-10-26T11:05:30Z",
"loss": 1.1980,
"learningRate": 0.000097,
"gpuUtilization": 96.1,
"cpuUtilization": 61.2,
"memoryUsage": 75200,
"throughput": 15.5,
"epoch": 1,
"step": 510
}
```
8. **Training Job (Failed):**
```json
{
"id": "job_fail101",
"projectId": "proj_abcdef12",
"userId": "usr_12345abc",
"status": "failed",
"config": {"learningRate": 0.0001, "batchSize": 1, "epochs": 3, "optimizer": "AdamW"},
"gpuType": "H200",
"hostMemory": 1536,
"startTime": "2023-10-26T15:00:00Z",
"endTime": "2023-10-26T15:02:00Z",
"errorMessage": "CUDA out of memory error during optimizer step.",
"createdAt": "2023-10-26T14:59:00Z"
}
```
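The sample records above are enough to derive the summary values a job detail page might show. The sketch below computes a job's duration from `startTime` and `endTime`, and the change in loss across a series of metric samples. Both helpers, `jobDurationSeconds` and `lossChange`, are hypothetical and not named in the spec.

```typescript
// Duration in seconds, or null while the job has not started or finished.
function jobDurationSeconds(startTime: string | null, endTime: string | null): number | null {
  if (!startTime || !endTime) return null;
  return (Date.parse(endTime) - Date.parse(startTime)) / 1000;
}

interface LossSample {
  step: number;
  loss: number;
}

// Loss at the highest step minus loss at the lowest step; negative means improvement.
function lossChange(samples: LossSample[]): number | null {
  if (samples.length < 2) return null;
  const ordered = samples.slice().sort((a, b) => a.step - b.step);
  return ordered[ordered.length - 1].loss - ordered[0].loss;
}

// Values taken from the completed job and the two metric samples above.
const completedSeconds = jobDurationSeconds("2023-10-26T12:00:00Z", "2023-10-26T14:00:00Z");
const sampleLossChange = lossChange([
  { step: 500, loss: 1.2345 },
  { step: 510, loss: 1.198 },
]);
console.log(completedSeconds, sampleLossChange);
```

With the sample data, the completed job ran for 7200 seconds (two hours) and the loss fell by roughly 0.0365 between steps 500 and 510.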