PROJECT OVERVIEW:
Develop a privacy-focused, fully functional macOS desktop application named 'Voice Command Hub' using Next.js. The application will use 100% local, on-device speech-to-text models to convert users' voice commands into text. The core value proposition is letting users control their macOS applications, automate tasks, and interact with other AI agents securely, with no data ever leaving their computer. It aims to be a robust, open-source alternative to cloud-based voice services, prioritizing user privacy and local processing power. The MVP focuses on core speech-to-text functionality, basic command execution for macOS apps, and customizable command mapping.
TECH STACK:
- Frontend Framework: React with Next.js (App Router)
- Styling: Tailwind CSS
- UI Components: shadcn/ui
- State Management: Zustand or Jotai for global state, React Context API for local component state.
- Authentication: Not required for MVP as it's a local desktop app, but will include mechanisms to manage local application settings and potentially user profiles if future cloud sync features are added.
- Local Speech-to-Text: Use platform-specific APIs or open-source libraries that run locally on macOS (e.g., Apple's Speech framework or whisper.cpp). Since a Next.js renderer may not have direct JS access to these, the more realistic approach is a small native bridge in the desktop shell's main process.
- Database: Drizzle ORM for managing local application settings and command configurations. SQLite will be used as the local database engine.
- Build/Packaging: Electron or Tauri to package the Next.js app as a desktop application (macOS first, with cross-platform potential later).
- Other: lucide-react for icons.
DATABASE SCHEMA (SQLite via Drizzle ORM):
1. `users` (For future extensibility, not strictly needed for MVP but good practice):
- `id` (UUID, primary key)
- `email` (TEXT, unique)
- `hashedPassword` (TEXT)
- `createdAt` (TIMESTAMP)
- `updatedAt` (TIMESTAMP)
2. `settings`:
- `id` (INTEGER, primary key, auto-increment)
- `micDeviceId` (TEXT, NULLABLE) // Stores the selected microphone device ID
- `outputLang` (TEXT, default: 'en-US') // Default language for speech recognition
- `autoStart` (BOOLEAN, default: false) // Whether to start on system login
- `createdAt` (TIMESTAMP)
- `updatedAt` (TIMESTAMP)
3. `commands`:
- `id` (UUID, primary key)
- `triggerPhrase` (TEXT, NOT NULL, unique) // The voice phrase that triggers the command (e.g., 'open terminal')
- `actionType` (TEXT, NOT NULL) // 'openApp', 'runScript', 'pasteText', 'agentInteraction'
- `actionValue` (TEXT, NOT NULL) // e.g., '/Applications/iTerm.app', 'osascript ./myScript.applescript', 'This is some text to paste', '{ "agent": "aiAssistant", "prompt": "Summarize this document." }'
- `isEnabled` (BOOLEAN, default: true)
- `createdAt` (TIMESTAMP)
- `updatedAt` (TIMESTAMP)
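The schema above could be declared in Drizzle roughly as follows. This is a sketch, not a final migration: the snake_case column names, text-based timestamps, and `CURRENT_TIMESTAMP` defaults are assumptions (SQLite has no native BOOLEAN or TIMESTAMP types, so Drizzle's `integer(..., { mode: "boolean" })` and text columns stand in for them).

```typescript
// Hypothetical lib/schemas.ts sketch; column/table names mirror the spec,
// but defaults and timestamp handling are assumptions.
import { sqliteTable, text, integer } from "drizzle-orm/sqlite-core";
import { sql } from "drizzle-orm";

export const settings = sqliteTable("settings", {
  id: integer("id").primaryKey({ autoIncrement: true }),
  micDeviceId: text("mic_device_id"), // nullable: no mic selected yet
  outputLang: text("output_lang").notNull().default("en-US"),
  autoStart: integer("auto_start", { mode: "boolean" }).notNull().default(false),
  createdAt: text("created_at").notNull().default(sql`CURRENT_TIMESTAMP`),
  updatedAt: text("updated_at").notNull().default(sql`CURRENT_TIMESTAMP`),
});

export const commands = sqliteTable("commands", {
  id: text("id").primaryKey(), // UUID generated in app code
  triggerPhrase: text("trigger_phrase").notNull().unique(),
  actionType: text("action_type").notNull(), // 'openApp' | 'runScript' | 'pasteText' | 'agentInteraction'
  actionValue: text("action_value").notNull(),
  isEnabled: integer("is_enabled", { mode: "boolean" }).notNull().default(true),
  createdAt: text("created_at").notNull().default(sql`CURRENT_TIMESTAMP`),
  updatedAt: text("updated_at").notNull().default(sql`CURRENT_TIMESTAMP`),
});
```

The `users` table is omitted here since it is not needed for the MVP.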
CORE FEATURES & USER FLOW:
1. **Local Speech-to-Text Conversion**:
* **Flow**: User clicks a 'Start Listening' button or a global hotkey is pressed (configured in settings). The app starts capturing audio from the selected microphone.
* The audio stream is processed by a local speech recognition model.
* The recognized text is displayed in a transcript area in real-time or near real-time.
* When the user stops speaking (or after a short pause), the recognized text for that utterance is finalized.
* **Edge Case**: Handling background noise, multiple speakers, or no speech detected. Displaying an 'inactive' or 'listening' status.
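The listening/finalization flow above can be modeled as a small engine-agnostic state machine. The sketch below assumes the STT engine emits partial results plus a periodic tick; the 800 ms pause threshold and the event shapes are illustrative assumptions, not a real engine's API.

```typescript
// Pure reducer for the listening state. An utterance is finalized once the
// pause since the last partial result exceeds PAUSE_MS (assumed threshold).
type RecState =
  | { status: "idle" }
  | { status: "listening"; transcript: string; lastSpeechMs: number };

type RecEvent =
  | { kind: "start"; nowMs: number }
  | { kind: "partial"; text: string; nowMs: number }
  | { kind: "tick"; nowMs: number } // periodic silence check
  | { kind: "stop" };

const PAUSE_MS = 800; // finalize after 800 ms of silence (tunable assumption)

function reduce(
  state: RecState,
  event: RecEvent
): { state: RecState; finalized?: string } {
  switch (event.kind) {
    case "start":
      return { state: { status: "listening", transcript: "", lastSpeechMs: event.nowMs } };
    case "partial":
      if (state.status !== "listening") return { state };
      return { state: { status: "listening", transcript: event.text, lastSpeechMs: event.nowMs } };
    case "tick":
      if (
        state.status === "listening" &&
        state.transcript.length > 0 &&
        event.nowMs - state.lastSpeechMs >= PAUSE_MS
      ) {
        // Utterance finalized; clear the transcript but keep listening.
        return {
          state: { status: "listening", transcript: "", lastSpeechMs: event.nowMs },
          finalized: state.transcript,
        };
      }
      return { state };
    case "stop":
      return { state: { status: "idle" } };
  }
}
```

Keeping this pure makes the 'listening'/'idle' status indicator and the no-speech edge case straightforward to unit-test, independent of whichever engine ultimately feeds the events.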
2. **Command Triggering & Execution**:
* **Flow**: As the user speaks, the application continuously checks the recognized text against the `triggerPhrase` in the `commands` table.
* If a `triggerPhrase` matches a significant portion of the recognized text (e.g., starts with the phrase), the application triggers the associated `actionType` and `actionValue`.
* **Action Types & Execution**:
* `openApp`: Uses macOS's `open` command or AppleScript to launch the application specified in `actionValue`.
* `runScript`: Executes a local shell script specified in `actionValue`.
* `pasteText`: Uses system event simulation to paste the text from `actionValue` into the currently focused application.
* `agentInteraction` (Future/Advanced): Sends the command content or a structured JSON payload to a local agent or a designated API endpoint.
* **Edge Case**: Command not found, invalid `actionValue` (e.g., app path incorrect), permissions issues for running scripts.
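The prefix-matching rule above can be sketched as a pure function: normalize the transcript, then pick the enabled command whose trigger is the longest matching prefix (so 'open terminal' beats a hypothetical bare 'open'). The normalization rules are an assumption; action execution itself stays in the main process.

```typescript
// Matching sketch: longest enabled triggerPhrase prefix wins; the remainder
// of the transcript is returned for parameterized triggers.
interface Command {
  triggerPhrase: string;
  actionType: "openApp" | "runScript" | "pasteText" | "agentInteraction";
  actionValue: string;
  isEnabled: boolean;
}

function normalize(text: string): string {
  // Lowercase, strip punctuation, collapse whitespace (assumed rules).
  return text.toLowerCase().replace(/[^\p{L}\p{N}\s]/gu, "").replace(/\s+/g, " ").trim();
}

function matchCommand(
  transcript: string,
  commands: Command[]
): { command: Command; remainder: string } | null {
  const spoken = normalize(transcript);
  let best: { command: Command; remainder: string } | null = null;
  for (const command of commands) {
    if (!command.isEnabled) continue; // disabled commands never fire
    const trigger = normalize(command.triggerPhrase);
    if (spoken === trigger || spoken.startsWith(trigger + " ")) {
      if (!best || trigger.length > normalize(best.command.triggerPhrase).length) {
        best = { command, remainder: spoken.slice(trigger.length).trim() };
      }
    }
  }
  return best; // null covers the "command not found" edge case
}
```

Returning `null` rather than throwing keeps the "command not found" edge case a normal, non-error outcome for the UI to surface.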
3. **Command Management UI**:
* **Flow**: A dedicated page/section allows users to view, add, edit, and delete commands.
* **Add/Edit Form**: Fields for `triggerPhrase` (text input), `actionType` (dropdown: Open App, Run Script, Paste Text), `actionValue` (dynamic input based on `actionType`), `isEnabled` (toggle switch).
* **View**: A table or list displaying existing commands with their trigger, action, and status. Includes options to enable/disable commands.
* **Validation**: `triggerPhrase` must not be empty and should ideally be unique. `actionValue` must be valid based on `actionType`.
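The validation rules above could take the shape below. Only "non-empty" and "unique" come from the spec; the per-`actionType` checks (`.app` suffix for Open App, valid JSON for agent payloads) are illustrative assumptions.

```typescript
// Validation sketch for the add/edit form; returns human-readable errors.
type ActionType = "openApp" | "runScript" | "pasteText" | "agentInteraction";

function validateCommand(input: {
  triggerPhrase: string;
  actionType: ActionType;
  actionValue: string;
  existingTriggers: string[]; // triggers already in the commands table
}): string[] {
  const errors: string[] = [];
  const trigger = input.triggerPhrase.trim();
  if (trigger.length === 0) errors.push("Trigger phrase must not be empty.");
  if (input.existingTriggers.some((t) => t.toLowerCase() === trigger.toLowerCase()))
    errors.push("Trigger phrase must be unique.");
  const value = input.actionValue.trim();
  if (value.length === 0) errors.push("Action value must not be empty.");
  else if (input.actionType === "openApp" && !value.endsWith(".app"))
    errors.push("Open App expects a path to a .app bundle."); // assumed rule
  else if (input.actionType === "agentInteraction") {
    try { JSON.parse(value); } catch { errors.push("Agent payload must be valid JSON."); }
  }
  return errors; // empty array means the command is valid
}
```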
4. **Settings Management**:
* **Flow**: A settings page allows users to configure the application.
* **Options**: Select microphone device, set default recognition language, toggle 'auto-start on login', configure global hotkey for listening.
* **Persistence**: Settings are saved to the local SQLite database using the `settings` table.
* **Edge Case**: No microphones available, invalid language code.
API & DATA FETCHING:
- **Local Data Access**: All data (settings, commands) is accessed via Drizzle ORM interacting with the local SQLite database. No external API calls are needed for core MVP functionality.
- **IPC (Inter-Process Communication)**: If using Electron/Tauri, communication between the main process (handling native functions like speech recognition, file access, system commands) and the renderer process (React UI) will be crucial. API routes are not applicable in the traditional Next.js sense for the desktop app's core logic.
- **Data Flow**:
* UI components read data (settings, commands list) directly from the local DB via ORM functions.
* When a command is triggered, the main process handler executes the action based on data fetched from the DB.
* Settings changes in the UI trigger ORM functions to update the `settings` table.
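One way to keep the IPC boundary honest is a shared TypeScript contract that both sides import. The channel names and payload shapes below are assumptions for illustration; in Electron each entry would back an `ipcMain.handle` / `ipcRenderer.invoke` pair, and in Tauri a `#[tauri::command]` handler and `invoke` call.

```typescript
// Hypothetical lib/ipc-contract.ts shared by main and renderer processes.
export interface CommandRow {
  id: string;
  triggerPhrase: string;
  actionType: "openApp" | "runScript" | "pasteText" | "agentInteraction";
  actionValue: string;
  isEnabled: boolean;
}

export interface SettingsRow {
  micDeviceId: string | null;
  outputLang: string;
  autoStart: boolean;
}

// Channel names and arg/result shapes are assumptions, not a fixed API.
export interface IpcContract {
  "commands:list": { args: void; result: CommandRow[] };
  "commands:save": { args: CommandRow; result: void };
  "commands:delete": { args: { id: string }; result: void };
  "settings:get": { args: void; result: SettingsRow };
  "settings:update": { args: Partial<SettingsRow>; result: void };
  "speech:start": { args: void; result: void };
  "speech:stop": { args: void; result: void };
}
```

Because the renderer never touches SQLite or the filesystem directly, this contract is the full surface area the UI depends on.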
UI/UX DESIGN & VISUAL IDENTITY:
- **Design Style**: Minimalist Clean with a hint of futuristic tech. Focus on clarity, ease of use, and subtle visual feedback.
- **Color Palette**:
* Primary (Dark Mode): `#1a1a2e` (Background), `#0f3460` (Secondary Background/Card), `#e94560` (Accent/Action), `#00a8cc` (Accent/Info), `#ffffff` (Primary Text)
* Primary (Light Mode): `#ffffff` (Background), `#f0f4f8` (Secondary Background/Card), `#e94560` (Accent/Action), `#00a8cc` (Accent/Info), `#1a1a2e` (Primary Text)
* Accent Gradients (Subtle): `#00a8cc` to `#e94560` for buttons or highlights.
- **Typography**: A clean, sans-serif font family like Inter or Poppins for readability. Use varying weights for hierarchy.
- **Layout**:
* Sidebar Navigation (App Router layout group) for Dashboard, Commands, Settings.
* Main content area displaying information relevant to the selected section.
* A persistent status indicator (e.g., a subtle microphone icon in the tray or header) showing if the app is listening, idle, or processing.
- **Responsive Rules**: While primarily a desktop app, ensure UI elements adapt gracefully to different window sizes.
- **Icons**: Use a consistent icon set like lucide-react.
COMPONENT BREAKDOWN (Next.js App Router Structure):
- `app/`
- `(dashboard)/` (Layout Group for authenticated/main app views)
- `layout.tsx`: Main app layout with sidebar navigation.
- `page.tsx`: Dashboard overview (e.g., current status, recent commands, quick settings).
- `commands/`
- `page.tsx`: Displays the list of commands (using a DataTable component).
- `new/page.tsx` or `edit/[id]/page.tsx`: Form page for adding/editing commands.
- `settings/`
- `page.tsx`: Settings form (microphone selection, language, hotkeys, auto-start).
- `layout.tsx`: Root layout (defines html, body, providers).
- `page.tsx`: (Optional) Initial landing/welcome page if not packaged as a pure app on first run.
- `components/ui/`
- `Button.tsx`: Interactive buttons (shadcn/ui).
- `Input.tsx`: Text inputs (shadcn/ui).
- `Label.tsx`: Form labels (shadcn/ui).
- `Select.tsx`: Dropdown selections (shadcn/ui).
- `Switch.tsx`: Toggle switches (shadcn/ui).
- `DataTable.tsx`: Component to display commands list, with sorting/filtering.
- `Card.tsx`, `CardHeader.tsx`, `CardContent.tsx`: For structuring content sections.
- `AlertDialog.tsx`: For confirmations (e.g., delete command).
- `Tooltip.tsx`: For explaining icons or actions.
- `Progress.tsx`: For indicating processing/listening state.
- `CommandInput.tsx`: Custom input for the trigger phrase, possibly with suggestions.
- `AgentActionConfig.tsx`: Dynamic form component for agent interaction settings.
- `lib/`
- `db.ts`: Drizzle ORM configuration and database connection (SQLite).
- `schemas.ts`: Drizzle schema definitions for tables.
- `speech.ts`: Functions for initializing and interacting with the local speech recognition engine.
- `commands.ts`: Functions for CRUD operations on commands.
- `settings.ts`: Functions for CRUD operations on settings.
- `system.ts`: Helper functions for interacting with macOS (e.g., opening apps, running scripts via AppleScript or shell).
- `hotkeys.ts`: Logic for managing global hotkeys.
- `hooks/`
- `useSpeechRecognition.ts`: Custom hook to manage the state and logic of speech recognition.
- `useCommands.ts`: Hook for fetching and managing commands.
- `useSettings.ts`: Hook for fetching and managing settings.
ANIMATIONS:
- **Transitions**: Use Tailwind CSS's transition utilities for smooth changes on hover, focus, and element appearance/disappearance (e.g., `transition-all duration-300 ease-in-out`).
- **Hover Effects**: Subtle background color changes or scaling on interactive elements like buttons and list items.
- **Loading States**: Use `Progress` components or spinners (from lucide-react or shadcn/ui) with opacity transitions when performing actions like saving settings or processing audio.
- **Microphone Status**: Animate the microphone icon in the status indicator to show listening (pulsing), processing (spinning), or idle (static) states.
EDGE CASES:
- **No Microphone Available**: Display a clear message in settings and disable listening functionality.
- **Permissions**: Guide the user on how to grant microphone and potentially accessibility permissions required for controlling other apps.
- **Initial State**: When no commands are set up, display a helpful message and a prominent 'Add Command' button.
- **Authentication Flow**: Not applicable for local MVP, but plan for potential future cloud sync requiring auth.
- **Error Handling**: Gracefully handle errors during speech recognition, command execution (e.g., invalid paths, permissions), and display user-friendly messages.
- **Validation**: Implement robust validation on both the UI side and the main-process side (there is no traditional server here) for all user inputs, especially command triggers and action values.
- **Hotkeys Conflict**: Detect and warn the user if a chosen global hotkey might conflict with existing system or application hotkeys.
- **CPU/Memory Usage**: Monitor and optimize local model performance to avoid excessive resource consumption.
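For the hotkey-conflict edge case, true OS-level conflict detection has to come from the packaging layer (e.g., Electron's `globalShortcut.register` returns `false` when an accelerator is already taken). Before that, the UI can at least reject malformed hotkey strings. The sketch below loosely follows Electron's accelerator syntax; the modifier and key lists are assumptions, not the full grammar.

```typescript
// Pure well-formedness check for a hotkey string such as
// "CommandOrControl+Shift+Space". OS-level conflicts are detected
// separately by the packaging layer at registration time.
const MODIFIERS = new Set([
  "Command", "Cmd", "Control", "Ctrl", "CommandOrControl",
  "CmdOrCtrl", "Alt", "Option", "Shift", "Super",
]);

function isValidAccelerator(accelerator: string): boolean {
  const parts = accelerator.split("+").map((p) => p.trim());
  if (parts.length < 2) return false; // require at least one modifier + key
  const key = parts[parts.length - 1];
  const modifiers = parts.slice(0, -1);
  if (!modifiers.every((m) => MODIFIERS.has(m))) return false;
  // Final part must be a non-modifier key: single char, F-key, or named key.
  return (
    !MODIFIERS.has(key) &&
    (/^[A-Za-z0-9]$/.test(key) ||
      /^F([1-9]|1\d|2[0-4])$/.test(key) ||
      ["Space", "Tab", "Escape", "Enter", "Up", "Down", "Left", "Right"].includes(key))
  );
}
```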
SAMPLE DATA:
1. **Settings**: `{ micDeviceId: '1234:Apple Inc.:Apple Microphone:XYZ', outputLang: 'en-US', autoStart: false }`
2. **Command 1 (Open App)**: `{ id: 'uuid-1', triggerPhrase: 'open terminal', actionType: 'openApp', actionValue: '/System/Applications/Utilities/Terminal.app', isEnabled: true }`
3. **Command 2 (Paste Text)**: `{ id: 'uuid-2', triggerPhrase: 'paste my signature', actionType: 'pasteText', actionValue: 'Best regards,\nJohn Doe', isEnabled: true }`
4. **Command 3 (Run Script)**: `{ id: 'uuid-3', triggerPhrase: 'run build script', actionType: 'runScript', actionValue: '~/scripts/build_project.sh', isEnabled: false }`
5. **Command 4 (Complex Trigger)**: `{ id: 'uuid-4', triggerPhrase: 'create new file named', actionType: 'runScript', actionValue: 'touch ${filePath}', isEnabled: true }` (Requires advanced parsing to extract 'filePath' from voice)
6. **Command 5 (Agent Interaction - future MVP)**: `{ id: 'uuid-5', triggerPhrase: 'summarize this', actionType: 'agentInteraction', actionValue: '{ "agent": "localAIAssistant", "task": "summarize" }', isEnabled: true }`
7. **Empty Commands List**: `[]`
8. **Default Settings**: `{ micDeviceId: null, outputLang: 'en-US', autoStart: false }`
9. **In-progress Speech**: `{ transcript: 'Open the...', isFinal: false }`
10. **Finalized Utterance**: `{ transcript: 'Open the terminal window', isFinal: true }`
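The "advanced parsing" noted for sample command 4 could start as simply as taking the spoken remainder after the trigger phrase and substituting it into the `${filePath}` placeholder. The naive slugification below is an assumption for illustration; real dictation would need smarter handling of spoken punctuation like 'dot'.

```typescript
// Sketch for parameterized triggers like sample command 4
// ('create new file named' -> 'touch ${filePath}').
function extractParameter(transcript: string, triggerPhrase: string): string | null {
  const spoken = transcript.toLowerCase().trim();
  const trigger = triggerPhrase.toLowerCase().trim();
  if (!spoken.startsWith(trigger + " ")) return null; // no parameter spoken
  return spoken.slice(trigger.length).trim();
}

function buildScriptCommand(actionValue: string, spokenRemainder: string): string {
  // "meeting notes" -> "meeting-notes" (naive slug; illustration only)
  const filePath = spokenRemainder.replace(/\s+/g, "-");
  return actionValue.replace("${filePath}", filePath);
}
```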