Fine-Tuning Research: Local Model Expansion
Context
TARX’s local model runs on port 11435. As TARX expands local model use cases — agentic tool calling, multi-step reasoning, Workbench chat — fine-tuning becomes important for reliability and identity.
Open Questions to Research
1. Tool calling reliability
How does out-of-the-box function-calling accuracy on complex multi-step tasks compare with fine-tuned versions? At what tool complexity does the base model start to fail?
Benchmark: tarx_recall → tarx_remember → tarx_save_conversation chain.
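The benchmark chain can be scored with a simple order-checker. This is a sketch; the tool names come from the chain above, but the log format (a list of tool names in call order) is an assumption about what the eval harness records.

```python
# Expected chain from the benchmark above.
EXPECTED_CHAIN = ["tarx_recall", "tarx_remember", "tarx_save_conversation"]

def chain_score(observed_calls: list[str]) -> float:
    """Fraction of the expected chain completed in order before the
    first deviation. 1.0 means the model ran the full chain correctly."""
    completed = 0
    for expected, actual in zip(EXPECTED_CHAIN, observed_calls):
        if actual != expected:
            break
        completed += 1
    return completed / len(EXPECTED_CHAIN)
```

A partial score (e.g. 1/3 when the model calls tarx_recall but then skips tarx_remember) makes it easier to see *where* in the chain the base model fails, not just that it failed.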
2. TARX identity consistency
Current fix: system prompt injection. Fine-tune v2 (330 examples planned) — what training data format works best for identity/persona fine-tuning vs instruction fine-tuning?
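One candidate answer to the format question is chat-message JSONL (system/user/assistant turns), a common shape for persona fine-tuning. The example below is hypothetical: the schema TARX's trainer actually expects, and the sample wording, are assumptions to be validated.

```python
import json

# Hypothetical identity example in chat-message format. Whether v2 should
# include the system prompt in training data (vs. relying on injection at
# inference time) is part of the open question above.
example = {
    "messages": [
        {"role": "system", "content": "You are TARX, a local-first assistant."},
        {"role": "user", "content": "What model are you running on?"},
        {"role": "assistant", "content": "I'm TARX, running locally on your machine."},
    ]
}

# One JSON object per line, appended to the v2 training set.
with open("identity_v2.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```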
3. Context window use
The local model has a 32K context window. With tool definitions (~2K tokens) + Space context + conversation history, the effective user budget is ~25K tokens. Does fine-tuning help the model better prioritize context, or is this a prompting problem?
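The budget arithmetic is worth pinning down. The 32K window and ~2K tool-definition overhead come from this doc; the Space-context and history sizes below are illustrative assumptions.

```python
# Back-of-envelope context budgeting for the local model.
CONTEXT_WINDOW = 32_768   # 32K window (from this doc)
TOOL_DEFS = 2_000         # ~2K tokens of tool definitions (from this doc)

def user_budget(space_context: int, history: int) -> int:
    """Tokens left for the user's actual prompt after fixed overhead."""
    return CONTEXT_WINDOW - TOOL_DEFS - space_context - history

# Illustrative: ~3K of Space context plus ~2.5K of history leaves ~25K.
print(user_budget(3_000, 2_500))
```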
4. Quantization trade-offs
Current: Q4_K_M quantization (~5GB, 18 tok/s on Apple Silicon).
Fine-tuning requires full precision or QLoRA. Post-fine-tune quantization path: full → QLoRA adapter → merge → re-quantize to Q4_K_M.
Test: does re-quantized fine-tuned model outperform the base Q4_K_M?
5. Training data sources
- tarx_save_conversation outputs (real session summaries)
- Tool call logs from Workbench (once live)
- Existing 330 identity examples
- Synthetic tool-use trajectories for agentic scenarios
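The logged sources above eventually need to become training trajectories. A hypothetical converter, assuming each log row carries a tool name, arguments, and result (field names would need to match the real log table):

```python
def rows_to_trajectory(user_msg: str, rows: list[dict]) -> dict:
    """Turn a user message plus its logged tool-call rows into one
    chat-format training example. Row schema is an assumption."""
    messages = [{"role": "user", "content": user_msg}]
    for row in rows:
        messages.append({
            "role": "assistant",
            "tool_call": {"name": row["tool"], "arguments": row["args"]},
        })
        messages.append({"role": "tool", "content": row["result"]})
    return {"messages": messages}
```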
Recommended Next Step
Before writing any training data: instrument the Workbench chat route to
log every tool_call and tool_result to SQLite. 2-4 weeks of real usage
data is more valuable than synthetic data. Let the product generate its
own training set.
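The instrumentation itself is small. A minimal sketch using Python's stdlib sqlite3; the table and column names are assumptions, and the helper would be called from the Workbench chat route for every tool_call/tool_result pair:

```python
import json
import sqlite3
import time

def init_log(path: str = "tool_log.db") -> sqlite3.Connection:
    """Open (or create) the tool-call log database."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS tool_log (
               ts REAL,
               session_id TEXT,
               tool TEXT,
               arguments TEXT,
               result TEXT
           )"""
    )
    return conn

def log_tool_call(conn, session_id, tool, arguments, result):
    """Record one tool_call/tool_result pair, JSON-encoding the payloads."""
    conn.execute(
        "INSERT INTO tool_log VALUES (?, ?, ?, ?, ?)",
        (time.time(), session_id, tool, json.dumps(arguments), json.dumps(result)),
    )
    conn.commit()
```

Storing arguments and results as JSON text keeps the schema stable while tool signatures evolve during weeks 2-4 of logging.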
Timeline
| Phase | Timing | Action |
|---|---|---|
| Ship | Now | Workbench + tool calling (WS1) |
| Log | Weeks 2-4 | Collect real tool call logs |
| Eval | Month 2 | Evaluate base model tool accuracy on logged scenarios |
| Tune | Month 2+ | If accuracy <85% on multi-step chains, begin fine-tune v2 |
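The gate in the Tune row is a one-line check. The 85% threshold comes from the table above; the input format (one pass/fail per logged multi-step chain) is an assumption:

```python
def should_fine_tune(chain_results: list[bool], threshold: float = 0.85) -> bool:
    """True if multi-step chain accuracy falls below the gate,
    i.e. fine-tune v2 should begin."""
    accuracy = sum(chain_results) / len(chain_results)
    return accuracy < threshold
```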