## AI Master Prompt: Local LLM Inference Platform (MVP)
**PROJECT OVERVIEW:**
Develop a cutting-edge SaaS/Desktop application that empowers AI enthusiasts, researchers, and developers to run large language models (LLMs) directly on their local machines (laptops/desktops), even with limited RAM. Inspired by breakthroughs like Flash-MoE, which demonstrated running a 397B parameter model on a laptop, this application will provide an optimized, high-performance inference engine. The core value proposition is democratizing access to powerful AI models by eliminating the need for expensive cloud GPUs and complex setup. Users will be able to select, configure (e.g., quantization, expert selection), and run various LLMs efficiently, monitoring their performance in real-time. The focus is on a streamlined, native experience using efficient C/Metal or similar low-level APIs, avoiding heavy Python dependencies for the core inference pipeline.
**Target Audience:** AI Researchers, ML Engineers, Students, AI Hobbyists, Developers looking to experiment with LLMs locally, and startups seeking cost-effective AI solutions. Particularly those interested in low-level optimizations and native performance.
**Value Proposition:** Run powerful LLMs locally, save on cloud costs, achieve high inference speeds, and gain deep insights into model performance without complex setups.
**TECH STACK:**
* **Frontend:** React (Next.js App Router)
* **UI Library:** shadcn/ui (for accessible, customizable components)
* **Styling:** Tailwind CSS
* **State Management:** Zustand or React Context API for global state, local component state for UI elements.
* **Backend/API (for user management, model catalog, etc.):** Next.js API Routes or a separate lightweight backend (e.g., Node.js/Express if needed for more complex ops, but ideally keep within Next.js).
* **Database:** PostgreSQL with Drizzle ORM (for type-safe SQL queries).
* **Authentication:** NextAuth.js (for robust authentication - email/password, OAuth).
* **Inference Engine Integration:** The core inference engine will likely be a compiled C/C++ or Rust binary, exposed via native bindings or a local IPC/HTTP server that the Next.js app communicates with. For a pure frontend MVP, we might simulate this with WebAssembly or mock responses, but the prompt should guide towards native integration for the full vision.
* **Charting:** Recharts or Chart.js for performance monitoring.
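To make the global-state choice concrete, here is a sketch of what the inference-related slice of state might look like, written as plain TypeScript (in Zustand this shape would live inside the store, with `appendOutput` becoming a `set` call inside an action). All field names are illustrative assumptions, not a fixed contract:

```typescript
// Sketch of the global inference state a Zustand store (or Context
// provider) might hold. Field names are illustrative assumptions.
export type InferenceStatus = "idle" | "running" | "completed" | "failed";

export interface InferenceState {
  selectedModelId: string | null;
  status: InferenceStatus;
  output: string;          // generated text streamed so far
  tokensPerSecond: number; // latest throughput reading
}

export const initialState: InferenceState = {
  selectedModelId: null,
  status: "idle",
  output: "",
  tokensPerSecond: 0,
};

// Pure update helper: append a streamed chunk without mutating the
// previous state object (in Zustand, this would be an action).
export function appendOutput(state: InferenceState, chunk: string): InferenceState {
  return { ...state, output: state.output + chunk, status: "running" };
}
```

Keeping the update logic as a pure function makes it trivially unit-testable regardless of whether Zustand or Context ends up being used.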
**DATABASE SCHEMA (PostgreSQL with Drizzle ORM):**
1. **`users` table:**
* `id` (UUID, primary key)
* `name` (TEXT)
* `email` (TEXT, unique, not null)
* `emailVerified` (TIMESTAMPTZ)
* `image` (TEXT)
* `createdAt` (TIMESTAMPTZ, default now())
* `updatedAt` (TIMESTAMPTZ, default now())
2. **`accounts` table (for NextAuth.js):**
* `id` (BIGSERIAL, primary key)
* `userId` (UUID, foreign key to `users.id`)
* `type` (TEXT)
* `provider` (TEXT)
* `providerAccountId` (TEXT)
* `refresh_token` (TEXT)
* `access_token` (TEXT)
* `expires_at` (BIGINT)
* `token_type` (TEXT)
* `scope` (TEXT)
* `id_token` (TEXT)
* `session_state` (TEXT)
3. **`sessions` table (for NextAuth.js):**
* `id` (BIGSERIAL, primary key)
* `sessionToken` (TEXT, unique, not null)
* `userId` (UUID, foreign key to `users.id`)
* `expires` (TIMESTAMPTZ, not null)
4. **`verificationTokens` table (for NextAuth.js):**
* `identifier` (TEXT, not null)
* `token` (TEXT, not null)
* `expires` (TIMESTAMPTZ, not null)
5. **`models` table:**
* `id` (UUID, primary key)
* `name` (TEXT, not null)
* `description` (TEXT)
* `filePath` (TEXT, path to model files on user's system or accessible location)
* `sizeGB` (FLOAT)
* `defaultQuantization` (TEXT, e.g., '4-bit', '2-bit')
* `defaultExperts` (INTEGER, K value for MoE)
* `parameterCount` (BIGINT)
* `sourceUrl` (TEXT, optional URL for download/info)
* `createdAt` (TIMESTAMPTZ, default now())
6. **`user_models` table (linking users to their downloaded/added models):**
* `id` (UUID, primary key)
* `userId` (UUID, foreign key to `users.id`)
* `modelId` (UUID, foreign key to `models.id`)
* `localPath` (TEXT, specific path if different from default)
* `addedAt` (TIMESTAMPTZ, default now())
7. **`inference_sessions` table (to track running inferences):**
* `id` (UUID, primary key)
* `userId` (UUID, foreign key to `users.id`)
* `modelId` (UUID, foreign key to `models.id`)
* `quantization` (TEXT)
* `expertsActivated` (INTEGER)
* `startTime` (TIMESTAMPTZ, default now())
* `endTime` (TIMESTAMPTZ)
* `status` (TEXT, e.g., 'running', 'completed', 'failed')
* `tokensPerSecond` (FLOAT)
* `ramUsageMB` (INTEGER)
* `gpuUsagePercent` (FLOAT)
* `cpuUsagePercent` (FLOAT)
* `prompt` (TEXT)
* `completion` (TEXT)
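The tables above can be sketched in Drizzle ORM as follows. This is a minimal excerpt covering only `users` and `models`; column names mirror the schema above, while the exact constraint and default options are illustrative, not prescriptive:

```typescript
// Minimal Drizzle ORM sketch of the `users` and `models` tables.
// Column names follow the schema above; options are illustrative.
import {
  pgTable,
  uuid,
  text,
  timestamp,
  real,
  integer,
  bigint,
} from "drizzle-orm/pg-core";

export const users = pgTable("users", {
  id: uuid("id").primaryKey().defaultRandom(),
  name: text("name"),
  email: text("email").notNull().unique(),
  emailVerified: timestamp("emailVerified", { withTimezone: true }),
  image: text("image"),
  createdAt: timestamp("createdAt", { withTimezone: true }).defaultNow(),
  updatedAt: timestamp("updatedAt", { withTimezone: true }).defaultNow(),
});

export const models = pgTable("models", {
  id: uuid("id").primaryKey().defaultRandom(),
  name: text("name").notNull(),
  description: text("description"),
  filePath: text("filePath"),
  sizeGB: real("sizeGB"),
  defaultQuantization: text("defaultQuantization"),
  defaultExperts: integer("defaultExperts"),
  parameterCount: bigint("parameterCount", { mode: "number" }),
  sourceUrl: text("sourceUrl"),
  createdAt: timestamp("createdAt", { withTimezone: true }).defaultNow(),
});
```

The remaining tables (`accounts`, `sessions`, `user_models`, `inference_sessions`) follow the same pattern, with `uuid(...).references(() => users.id)` for the foreign keys.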
**CORE FEATURES & USER FLOW:**
1. **Authentication Flow:**
* User visits the landing page.
* Options: Sign Up / Log In.
* Uses NextAuth.js for email/password and potentially OAuth (Google).
* Upon successful login, redirects to the main dashboard.
* Protected routes ensure only logged-in users can access core features.
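The authentication flow above translates into a NextAuth.js configuration roughly like the following. This is a hedged sketch: the page route (`/login`) and the empty `authorize` body are assumptions, and wiring NextAuth to the Drizzle tables (via an adapter) is left out for brevity:

```typescript
// Hypothetical NextAuth.js configuration for email/password plus
// Google OAuth. Route names and the authorize() body are assumptions;
// the Drizzle adapter wiring for the `accounts`/`sessions` tables is
// omitted for brevity.
import NextAuth from "next-auth";
import GoogleProvider from "next-auth/providers/google";
import CredentialsProvider from "next-auth/providers/credentials";

export const authOptions = {
  providers: [
    GoogleProvider({
      clientId: process.env.GOOGLE_CLIENT_ID!,
      clientSecret: process.env.GOOGLE_CLIENT_SECRET!,
    }),
    CredentialsProvider({
      name: "Email and Password",
      credentials: {
        email: { label: "Email", type: "email" },
        password: { label: "Password", type: "password" },
      },
      async authorize(credentials) {
        // Look the user up in the `users` table and verify the
        // password hash here; return null to reject the login.
        return null;
      },
    }),
  ],
  pages: { signIn: "/login" }, // assumed route for the login page
};

export default NextAuth(authOptions);
```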
2. **Model Management:**
* **User Flow:** Dashboard -> "My Models" -> "Add Model" / "Browse Models".
* **Browse Models:** Displays a list of available LLMs from the `models` table. Each entry shows name, size, parameter count, default quantization. Filter/search functionality.
* **Add Model:** User can manually add a model by providing its name, parameter count, and importantly, the *local path* to the model files. A download URL can also be provided.
* **Download (Future/Advanced):** For MVP, assume user downloads models manually and provides the path. Future versions could integrate direct downloads.
* **My Models:** Lists models added/downloaded by the user (`user_models` table). User can select a model to run.
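Because the MVP relies on user-supplied local paths rather than managed downloads, the "Add Model" flow should sanity-check the path before writing to `user_models`. A minimal sketch, where the function name and the accepted extension list are illustrative assumptions:

```typescript
// Hypothetical helper for the "Add Model" flow: sanity-check a
// user-supplied local model path before inserting into `user_models`.
import { existsSync, statSync } from "node:fs";
import { extname } from "node:path";

// Extensions we might accept for local model files (illustrative list).
const MODEL_EXTENSIONS = new Set([".gguf", ".safetensors", ".bin"]);

export function validateModelPath(localPath: string): { ok: boolean; reason?: string } {
  if (!existsSync(localPath)) {
    return { ok: false, reason: "file not found" };
  }
  if (!statSync(localPath).isFile()) {
    return { ok: false, reason: "path is not a file" };
  }
  if (!MODEL_EXTENSIONS.has(extname(localPath).toLowerCase())) {
    return { ok: false, reason: "unrecognized model file extension" };
  }
  return { ok: true };
}
```

Returning a reason string (rather than throwing) lets the "Add Model" form surface a friendly validation message inline.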
3. **Inference Configuration & Execution:**
* **User Flow:** Dashboard -> Select Model -> "Configure & Run" -> Input Prompt -> "Run Inference".
* **Configuration Screen:** Displays the selected model's details. User can adjust:
* Quantization (dropdown: e.g., '2-bit', '4-bit', '8-bit', 'FP16', depending on what the model supports).
* Experts Activated (K value, for MoE models only, e.g., 2, 4).
* Other potential parameters like temperature, top-p (for generation quality).
* **Prompt Input:** A textarea for the user to enter their prompt.
* **Run Button:** Triggers the inference process.
* **Backend Interaction:** The Next.js frontend sends a request to a dedicated API route (e.g., `/api/inference/run`). This API route (or a separate backend service/binary) interacts with the local inference engine.
* **Inference Engine:** The C/Metal engine (or its wrapper) loads the specified model, applies the selected quantization and expert settings, processes the prompt, and returns the generated output and performance metrics.
* **Real-time Updates:** The frontend polls the status or receives updates (e.g., via WebSockets if implemented) and displays the output and metrics as they become available.
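At the boundary between the API route and the engine, the configuration screen's state has to be translated into an invocation of the local binary. A pure helper for that translation keeps the spawn logic testable; note that every flag name below is an assumption, since the real engine's CLI is not specified here:

```typescript
// Hypothetical translation of the configuration screen into CLI flags
// for the local inference binary. All flag names are assumptions; the
// real engine's interface may differ.
export interface InferenceConfig {
  modelPath: string;
  quantization: string; // e.g. "4-bit"
  experts?: number;     // K value, MoE models only
  temperature?: number;
  topP?: number;
}

export function buildEngineArgs(cfg: InferenceConfig, prompt: string): string[] {
  const args = ["--model", cfg.modelPath, "--quantization", cfg.quantization];
  if (cfg.experts !== undefined) args.push("--experts", String(cfg.experts));
  if (cfg.temperature !== undefined) args.push("--temperature", String(cfg.temperature));
  if (cfg.topP !== undefined) args.push("--top-p", String(cfg.topP));
  args.push("--prompt", prompt);
  return args;
}
```

The `/api/inference/run` route would then `spawn` the engine binary with these arguments and stream its stdout back to the client, updating the `inference_sessions` row on exit.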
4. **Performance Monitoring Dashboard:**
* **User Flow:** Dashboard -> "Monitor" Tab.
* **Display:** Shows key metrics for the *currently running* or *last completed* inference session (from `inference_sessions` table).
* Tokens/Second (current/average).
* RAM Usage (MB).
* CPU Usage (%).
* GPU Usage (%).
* Model Name, Quantization, Experts Used.
* **Visualizations:** Uses charts (Recharts) to show token/s over time, resource usage spikes.
* **History:** A table view of past inference sessions.
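The tokens/second figure on the dashboard can be derived from timestamped cumulative token counts reported by the engine. A minimal sliding-window sketch (the sample shape and window size are assumptions):

```typescript
// Minimal sketch of the tokens/second metric for the monitoring
// dashboard: given timestamped cumulative token counts, compute the
// average rate over a sliding window. Sample shape is an assumption.
export interface TokenSample {
  timestampMs: number; // when the sample was taken
  totalTokens: number; // cumulative tokens generated so far
}

export function tokensPerSecond(samples: TokenSample[], windowMs = 5000): number {
  if (samples.length < 2) return 0;
  const latest = samples[samples.length - 1];
  // Find the oldest sample still inside the window.
  let start = samples[0];
  for (const s of samples) {
    if (latest.timestampMs - s.timestampMs <= windowMs) {
      start = s;
      break;
    }
  }
  const elapsedMs = latest.timestampMs - start.timestampMs;
  if (elapsedMs <= 0) return 0;
  return ((latest.totalTokens - start.totalTokens) / elapsedMs) * 1000;
}
```

Feeding the per-window values into a Recharts line series gives the token/s-over-time chart described above.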
5. **Output Display & Tool Calling:**
* **User Flow:** During/after inference, the generated text is displayed in a dedicated output area.
* **Tool Calling:** If the model supports tool calling and the output includes a structured tool call (e.g., JSON), it should be parsed and displayed clearly. A placeholder function/UI element can simulate executing the tool for the MVP.
* **Example:** If the output contains a structured JSON tool call, parse it and render the tool name and its arguments in a dedicated UI element, with a stub "Execute" action for the MVP.
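Detecting such a call in free-form model output can be sketched as below. The `tool`/`arguments` key names are an assumption (tool-call formats vary by model), and the brace-matching walk is deliberately naive; it does not handle braces inside JSON string values:

```typescript
// Hypothetical parser for the MVP's tool-calling display: scan model
// output for a JSON object with "tool" and "arguments" keys. The key
// names are assumptions; tool-call formats vary by model.
export interface ToolCall {
  tool: string;
  arguments: Record<string, unknown>;
}

export function extractToolCall(output: string): ToolCall | null {
  const start = output.indexOf("{");
  if (start === -1) return null;
  // Walk forward to find the matching closing brace (naive: does not
  // account for braces inside JSON string values).
  let depth = 0;
  for (let i = start; i < output.length; i++) {
    if (output[i] === "{") depth++;
    else if (output[i] === "}" && --depth === 0) {
      try {
        const parsed = JSON.parse(output.slice(start, i + 1));
        if (
          typeof parsed.tool === "string" &&
          typeof parsed.arguments === "object" &&
          parsed.arguments !== null
        ) {
          return { tool: parsed.tool, arguments: parsed.arguments };
        }
      } catch {
        // Not valid JSON; report no tool call.
      }
      return null;
    }
  }
  return null;
}
```

The UI can then show the parsed call in a card and wire its "Execute" button to a placeholder handler for the MVP.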