### Vision

Modern AI interaction is constrained by outdated interfaces. We type into text boxes, switch between apps, copy-paste responses, and lose context between conversations. This creates artificial friction between human thought and AI capability.

Extended Cognition reimagines this relationship. Instead of adapting our thinking to the constraints of keyboards and screens, we speak naturally, capturing thoughts as they emerge. The system maintains context across conversations, recalls relevant information from your personal knowledge base, and responds conversationally.

This isn't just about convenience. When AI interaction becomes as fluid as thought itself, it fundamentally changes what's possible:

- **Stream-of-consciousness brainstorming** without breaking flow to type
- **Ambient intelligence** that's always available but never intrusive
- **Personalized responses** informed by your accumulated context and preferences
- **Multi-threaded conversations** that can branch and merge naturally
- **Hands-free operation** enabling AI assistance during physical tasks

The goal is cognitive augmentation: making AI feel less like using a computer and more like having enhanced thinking capabilities. By removing interface friction, we unlock the true potential of human-AI collaboration.

### Description

**Extended Cognition** is a voice-first AI interface that enables seamless, hands-free interaction with AI systems through natural speech and voice commands. The system captures audio, transcribes it, processes it through a local LLM server with RAG-based memory injection, and speaks responses back, making AI interaction as natural as thinking out loud.
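The capture → transcribe → prompt-with-memory → LLM → speak loop described above can be sketched as an async pipeline. This is a minimal hypothetical skeleton with every stage stubbed out; none of these function names or return values come from the project itself:

```python
import asyncio

# Hypothetical sketch of one Extended Cognition round trip.
# Each stage is a stub standing in for a real STT/LLM/TTS backend.

async def transcribe(audio: bytes) -> str:
    # Placeholder for speech-to-text (~80ms budget in the architecture diagram).
    return "what did I say about project deadlines"

async def build_prompt(query: str, memories: list[str]) -> str:
    # Prompt Builder: inject RAG results ahead of the user's query (~30ms).
    context = "\n".join(f"- {m}" for m in memories)
    return f"Relevant notes:\n{context}\n\nUser: {query}"

async def ask_llm(prompt: str) -> str:
    # Placeholder for the local LLM call (~400ms budget).
    return "You noted the deadline moved to Friday."

async def synthesize(text: str) -> bytes:
    # Placeholder for TTS (~200ms); returns an audio payload for the client.
    return text.encode("utf-8")

async def round_trip(audio: bytes, memories: list[str]) -> tuple[str, bytes]:
    query = await transcribe(audio)
    prompt = await build_prompt(query, memories)
    reply = await ask_llm(prompt)
    return reply, await synthesize(reply)

reply, audio_out = asyncio.run(
    round_trip(b"...", ["deadline moved to Friday"])
)
```

In a real deployment each stub would be an awaitable network or model call, which is what makes the per-stage latency budgets in the architecture section meaningful.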
By eliminating the friction of typing and app-switching while providing intuitive voice control, it transforms AI from a tool you use into an extension of your cognitive process.

### Architecture

```
┌────────────────────────────┐             ┌──────────────────────────────────────────┐
│  Mobile App (iOS/Android)  │             │  Local Server (Total Response: ~800ms)   │
├────────────────────────────┤             ├──────────────────────────────────────────┤
│                            │             │  Core Pipeline                           │
│  Voice Capture  ~30ms ─────┼──async─────►│    Audio Archive     ~50ms               │
│  Local STT      ~80ms ─────┼──sync──────►│    Prompt Builder    ~30ms               │
│  Command Parser            │             │      ▼                                   │
│                            │             │    LLM Interface     ~400ms              │
│                            │             │      ▼                                   │
│                            │             │    Output Processor  ~50ms               │
│                            │             │      ▼                                   │
│  Text Display  ◄───txt─────┼─────────────│    Response Handler  ~20ms               │
│                            │             │      ▼                                   │
│  Audio Player  ◄───mp3─────┼─────────────│    TTS Engine        ~200ms              │
│                            │             ├──────────────────────────────────────────┤
│                            │             │  Support Services                        │
│                            │             │    Context Manager   ~10ms               │
│                            │             │    RAG Engine        ~100ms              │
│                            │             │    Correction Logger & Trainer           │
└────────────────────────────┘             └──────────────────────────────────────────┘
```

#### Core Modules:

1. **Voice Capture** - Continuous audio recording on the mobile device
2. **Audio Stream** - Raw audio transmission to the server via WebSocket
3. **Speech-to-Text** - On-device transcription; the raw audio is also sent to the server for archival and verification
4. **Prompt Builder** - Constructs prompts with conversation context and RAG results
5. **LLM Interface** - Manages connections to local/remote LLMs
6. **Output Processor** - Structures LLM responses for voice/display
7. **Response Handler** - Routes processed output to the appropriate channels
8. **TTS Engine** - Server-side voice synthesis (returns an audio stream)
9. **Audio Player** - Client-side playback of TTS audio
10. **Context Manager** - Maintains conversation state and history
11. **RAG Engine** - Retrieves relevant information from the knowledge base
12. **Voice Command Parser** - Distinguishes control commands from queries

#### Design Decision: Server-Side Audio Processing

- **Audio flow**: Device → Server (text + audio) → Server (audio response) → Device
- **Benefits**: Consistent processing, audio archival, model verification, lighter mobile app
- **Trade-off**: Requires a reliable connection and adds slight latency

### Features

#### Commands (programmed)

- start/stop a task
- log data ([[stool data]], hfat)

##### Agentic (inferred and tracked)

- make or recommend versioned changes to core project documents that are reviewable (reviewed nightly initially)
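As a sanity check on the ~800ms total in the architecture diagram, the per-stage budgets can be summed. Treating capture, STT, prompt building, the LLM call, output processing, response handling, and TTS as the serial critical path (with Audio Archive, Context Manager, and RAG running off it or overlapping it) is an assumption of this reading, not something the diagram states explicitly:

```python
# Serial path from speech in to audio out, using the stage budgets
# from the architecture diagram (Audio Archive is off the critical path).
stages = {
    "voice_capture": 30,
    "local_stt": 80,
    "prompt_builder": 30,
    "llm_interface": 400,
    "output_processor": 50,
    "response_handler": 20,
    "tts_engine": 200,
}
total_ms = sum(stages.values())
print(total_ms)  # 810, close to the diagram's ~800ms target
```

The LLM call dominates the budget, so optimization effort (smaller models, streaming generation) pays off there first.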
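To illustrate how the Voice Command Parser might separate programmed commands (start/stop a task, log data) from free-form queries bound for the LLM, here is a hypothetical grammar; the patterns and dispatch shape are assumptions for the sketch, not the project's actual implementation:

```python
import re

# Hypothetical command grammar: programmed commands are matched first,
# and anything unmatched falls through as a free-form query for the LLM.
COMMANDS = [
    (re.compile(r"^(start|stop)\s+(?:a\s+|the\s+)?task\b(.*)", re.I), "task"),
    (re.compile(r"^log\s+(.+)", re.I), "log"),
]

def parse(utterance: str) -> dict:
    text = utterance.strip()
    for pattern, kind in COMMANDS:
        m = pattern.match(text)
        if m:
            return {"type": kind, "args": [g.strip() for g in m.groups()]}
    return {"type": "query", "text": text}

parse("start a task")           # programmed command
parse("log stool data entry")   # programmed command with payload
parse("what's on my calendar")  # free-form query for the LLM
```

A regex table keeps the programmed commands auditable; the "Agentic (inferred)" features would instead be detected downstream by the LLM and logged for nightly review.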