- **User speaks for 30 seconds to 2 minutes** per semantic segment
- Each user segment contains approximately **300 words** (based on a 180 words/minute speech rate)
- LLM responds with approximately **100 words** per segment (30 seconds at TTS rate)
- **Cognitive continuity**: the AI maintains access to all previous thoughts and reactions throughout the conversation
- This creates a continuous chain of thought rather than isolated responses

#### Key Assumptions

- **Speech rate**: 180-200 words per minute (faster than average)
- **Token-to-word ratio**: 1 token ≈ 0.75 words
- **RAG context per query**: 1,000 words of retrieved information
- **Segment pair size**: 400 words (300 user + 100 LLM)
- **TTS playback**: 200 words per minute (30 seconds for the LLM response)
- RAG context is pulled fresh for each segment, not accumulated in history

#### Desired System Requirements

- **Target capacity**: at least 20 semantic segments
- **Required context**: 20 segments × 400 words + 1,000 RAG words = **9,000 words minimum**
- **Needed context window**: ~12,000 tokens minimum (9,000 words ÷ 0.75 words/token), before any safety buffer
- **Conversation length**: 10-40 minutes of continuous dialogue with full cognitive continuity

#### Suitable Open Source Models

- **Mixtral 8x7B**: 32,768 tokens, good performance/RAM ratio
- **LLaMA 3.1 70B**: 128,000 tokens, but requires 40-45 GB RAM (Q4)
- **Yi 34B**: 200,000 tokens, longest context available
- **Qwen 2.5 models**: 32,768 tokens, efficient inference
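
The requirements above follow directly from the stated assumptions. The sketch below reproduces the arithmetic in Python, with every constant copied from the assumptions list, and inverts it to estimate how many segment pairs each model's window could hold; the function names are chosen here purely for illustration:

```python
# Context-budget sketch. Every constant is taken from the assumptions
# above; nothing here is measured, and the names are illustrative.

WORDS_PER_TOKEN = 0.75       # 1 token ≈ 0.75 words
USER_WORDS = 300             # per user segment (30 s to 2 min of speech)
LLM_WORDS = 100              # per LLM response (~30 s of TTS at 200 wpm)
PAIR_WORDS = USER_WORDS + LLM_WORDS  # 400-word segment pair
RAG_WORDS = 1_000            # retrieved fresh for each query

def required_tokens(segments: int) -> int:
    """Tokens needed to hold `segments` pairs plus one RAG payload."""
    return round((segments * PAIR_WORDS + RAG_WORDS) / WORDS_PER_TOKEN)

def max_segments(window_tokens: int) -> int:
    """Segment pairs that fit into a given context window."""
    return int((window_tokens * WORDS_PER_TOKEN - RAG_WORDS) // PAIR_WORDS)

print(required_tokens(20))    # 12000 -> the ~12,000-token minimum above
print(max_segments(32_768))   # 58    -> Mixtral 8x7B / Qwen 2.5 window
print(max_segments(128_000))  # 237   -> LLaMA 3.1 70B window
```

In practice the window must also hold the system prompt and leave headroom for generation, so usable segment counts will sit somewhat below these upper bounds.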
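
Separately, to make the cognitive-continuity and fresh-RAG assumptions concrete: each new turn re-sends the full chain of prior segment pairs, while the retrieved block is swapped out rather than appended. A minimal assembly sketch follows; `build_prompt` and `retrieve` are hypothetical names standing in for whatever orchestration layer and retrieval backend are actually used:

```python
# Prompt assembly per segment, under the section's assumptions: the
# turn history accumulates (cognitive continuity) while the RAG block
# is replaced fresh each turn instead of accumulating. `retrieve` is
# a placeholder for the actual retrieval backend.

from typing import Callable

def build_prompt(history: list[tuple[str, str]],
                 user_segment: str,
                 retrieve: Callable[[str], str]) -> str:
    parts = [f"[Retrieved context]\n{retrieve(user_segment)}"]  # fresh, not appended
    for user, assistant in history:       # the full chain of prior thought
        parts.append(f"User: {user}")
        parts.append(f"Assistant: {assistant}")
    parts.append(f"User: {user_segment}")
    return "\n\n".join(parts)
```

Because the retrieved block is replaced rather than appended each turn, only the 400-word segment pairs accumulate, which is what keeps 20 segments within the ~12,000-token budget.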