- **User speaks for 30 seconds to 2 minutes** per semantic segment
- Each user segment contains approximately **300 words** (based on 180 words/minute speech rate)
- LLM responds with approximately **100 words** per segment (30 seconds at TTS rate)
- **Cognitive continuity**: AI maintains access to all previous thoughts and reactions throughout the conversation
- Creates a continuous chain of thought rather than isolated responses
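A minimal sketch of what this accumulation could look like in code (the `SegmentPair` and `Conversation` names are illustrative, not from an actual implementation): every prior segment pair stays in the prompt, while the retrieved context is supplied fresh per segment.

```python
from dataclasses import dataclass, field

@dataclass
class SegmentPair:
    """One semantic segment: a user utterance plus the LLM's spoken reply."""
    user_text: str  # ~300 words
    llm_text: str   # ~100 words

@dataclass
class Conversation:
    """Keeps every prior segment pair so each new response is generated
    with access to the full chain of earlier thoughts and reactions."""
    history: list[SegmentPair] = field(default_factory=list)

    def build_prompt(self, new_user_text: str, rag_context: str) -> str:
        # RAG context is retrieved fresh per segment; history accumulates.
        turns = "\n".join(
            f"User: {p.user_text}\nAssistant: {p.llm_text}" for p in self.history
        )
        return f"{rag_context}\n\n{turns}\nUser: {new_user_text}\nAssistant:"

    def record(self, user_text: str, llm_text: str) -> None:
        self.history.append(SegmentPair(user_text, llm_text))
```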
#### Key Assumptions
- **Speech rate**: 180-200 words per minute (faster than average)
- **Token-to-word ratio**: 1 token ≈ 0.75 words
- **RAG context per query**: 1,000 words of retrieved information
- **Segment pair size**: 400 words (300 user + 100 LLM)
- **TTS playback**: 200 words per minute (30 seconds for LLM response)
- Context is pulled fresh for each segment, not accumulated in history
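To make the arithmetic concrete, here is a small sketch encoding these assumptions as constants (all values taken from the list above; the helper name is illustrative):

```python
# Constants taken directly from the assumptions above.
SPEECH_WPM = 180        # speech rate, lower bound of the 180-200 range
TTS_WPM = 200           # TTS playback rate
WORDS_PER_TOKEN = 0.75  # 1 token ~ 0.75 words
USER_WORDS = 300        # words per user segment
LLM_WORDS = 100         # words per LLM reply
PAIR_WORDS = USER_WORDS + LLM_WORDS  # 400 words per segment pair
RAG_WORDS = 1_000       # retrieved words per query (pulled fresh each segment)

def words_to_tokens(words: int) -> int:
    """Approximate token count for a given word count."""
    return round(words / WORDS_PER_TOKEN)

print(words_to_tokens(PAIR_WORDS))  # ~533 tokens per segment pair
print(words_to_tokens(RAG_WORDS))   # ~1333 tokens per fresh RAG block
```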
#### Desired System Requirements
- **Target capacity**: At least 20 semantic segments
- **Required context**: 20 segment pairs × 400 words + 1,000 RAG words = **9,000 words minimum**
- **Needed context window**: ~12,000 tokens (9,000 words ÷ 0.75 words/token), plus headroom for the system prompt
- **Conversation length**: roughly 20-50 minutes of continuous dialogue with full cognitive continuity (20 segment pairs at 1-2.5 minutes each)
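The budget check in code, under the same assumptions (a sketch; the function name is illustrative):

```python
def required_tokens(num_pairs: int, pair_words: int = 400,
                    rag_words: int = 1_000,
                    words_per_token: float = 0.75) -> int:
    """Context needed: all accumulated segment pairs plus one fresh RAG block."""
    total_words = num_pairs * pair_words + rag_words
    return round(total_words / words_per_token)

print(required_tokens(20))  # 9,000 words -> 12,000 tokens
```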
#### Suitable Open Source Models
- **Mixtral 8x7B**: 32,768 tokens; good quality per GB of RAM thanks to its mixture-of-experts design (~13B active parameters per token)
- **LLaMA 3.1 70B**: 128,000 tokens, but requires 40-45GB RAM (Q4)
- **Yi 34B**: 200,000 tokens, the longest context window among these options
- **Qwen 2.5 models**: 32,768 tokens, efficient inference
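Given the ~12,000-token requirement, a quick filter over the context windows listed above (a sketch using only the figures from this list, not benchmark data):

```python
# Context windows as listed above (tokens).
MODELS = {
    "Mixtral 8x7B": 32_768,
    "LLaMA 3.1 70B": 128_000,
    "Yi 34B (200K)": 200_000,
    "Qwen 2.5": 32_768,
}

NEEDED = 12_000        # tokens, from the requirements above
TOKENS_PER_PAIR = 533  # ~400 words / 0.75 words per token

for name, ctx in MODELS.items():
    if ctx >= NEEDED:
        extra_pairs = (ctx - NEEDED) // TOKENS_PER_PAIR
        print(f"{name}: {ctx:,} tokens, room for ~{extra_pairs} extra segment pairs")
```

All four candidates clear the 12K bar comfortably, so the deciding factors become RAM footprint and inference speed rather than raw context length.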