Fine-Tuning Research: Local Model Expansion
Context
TARX’s local model runs on port 11435. As TARX expands local model use cases — agentic tool calling, multi-step reasoning, Workbench chat — fine-tuning becomes important for reliability and identity.
Open Questions to Research
1. Tool calling reliability
How does out-of-the-box function-calling accuracy on complex multi-step tasks compare with fine-tuned versions? At what tool complexity does the base model start to fail?
Benchmark: tarx_recall → tarx_remember → tarx_save_conversation chain.
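The benchmark chain can be scored with a simple order-checker. This is a sketch; the tool names come from the chain above, but the log format (a list of tool names in call order) is an assumption about what the eval harness records.

```python
# Expected chain from the benchmark above.
EXPECTED_CHAIN = ["tarx_recall", "tarx_remember", "tarx_save_conversation"]

def chain_score(observed_calls: list[str]) -> float:
    """Fraction of the expected chain completed in order before the
    first deviation. 1.0 means the model ran the full chain correctly."""
    completed = 0
    for expected, actual in zip(EXPECTED_CHAIN, observed_calls):
        if actual != expected:
            break
        completed += 1
    return completed / len(EXPECTED_CHAIN)
```

A partial score (e.g. 1/3 when the model calls tarx_recall but then skips tarx_remember) makes it easier to see *where* in the chain the base model fails, not just that it failed.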
2. TARX identity consistency
Current fix: system prompt injection. Fine-tune v2 (330 examples planned) — what training data format works best for identity/persona fine-tuning vs instruction fine-tuning?
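One candidate answer to the format question is chat-message JSONL (system/user/assistant turns), a common shape for persona fine-tuning. The example below is hypothetical: the schema TARX's trainer actually expects, and the sample wording, are assumptions to be validated.

```python
import json

# Hypothetical identity example in chat-message format. Whether v2 should
# include the system prompt in training data (vs. relying on injection at
# inference time) is part of the open question above.
example = {
    "messages": [
        {"role": "system", "content": "You are TARX, a local-first assistant."},
        {"role": "user", "content": "What model are you running on?"},
        {"role": "assistant", "content": "I'm TARX, running locally on your machine."},
    ]
}

# One JSON object per line, appended to the v2 training set.
with open("identity_v2.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```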
3. Context window use
The local model has a 32K context window. With tool definitions (~2K tokens) + Space context + conversation history, the effective user budget is ~25K tokens. Does fine-tuning help the model better prioritize context, or is this a prompting problem?
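The budget arithmetic is worth pinning down. The 32K window and ~2K tool-definition overhead come from this doc; the Space-context and history sizes below are illustrative assumptions.

```python
# Back-of-envelope context budgeting for the local model.
CONTEXT_WINDOW = 32_768   # 32K window (from this doc)
TOOL_DEFS = 2_000         # ~2K tokens of tool definitions (from this doc)

def user_budget(space_context: int, history: int) -> int:
    """Tokens left for the user's actual prompt after fixed overhead."""
    return CONTEXT_WINDOW - TOOL_DEFS - space_context - history

# Illustrative: ~3K of Space context plus ~2.5K of history leaves ~25K.
print(user_budget(3_000, 2_500))
```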
4. Quantization trade-offs
Current: Q4_K_M quantization (~5GB, 18 tok/s on Apple Silicon).
Fine-tuning requires full precision or QLoRA. Post-fine-tune quantization path: full → QLoRA adapter → merge → re-quantize to Q4_K_M.
Test: does re-quantized fine-tuned model outperform the base Q4_K_M?
5. Training data sources
- tarx_save_conversation outputs (real session summaries)
- Tool call logs from Workbench (once live)
- Existing 330 identity examples
- Synthetic tool-use trajectories for agentic scenarios
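The logged sources above eventually need to become training trajectories. A hypothetical converter, assuming each log row carries a tool name, arguments, and result (field names would need to match the real log table):

```python
def rows_to_trajectory(user_msg: str, rows: list[dict]) -> dict:
    """Turn a user message plus its logged tool-call rows into one
    chat-format training example. Row schema is an assumption."""
    messages = [{"role": "user", "content": user_msg}]
    for row in rows:
        messages.append({
            "role": "assistant",
            "tool_call": {"name": row["tool"], "arguments": row["args"]},
        })
        messages.append({"role": "tool", "content": row["result"]})
    return {"messages": messages}
```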
Recommended Next Step
Before writing any training data: instrument the Workbench chat route to
log every tool_call and tool_result to SQLite. 2-4 weeks of real usage
data is more valuable than synthetic data. Let the product generate its
own training set.
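The instrumentation itself is small. A minimal sketch using Python's stdlib sqlite3; the table and column names are assumptions, and the helper would be called from the Workbench chat route for every tool_call/tool_result pair:

```python
import json
import sqlite3
import time

def init_log(path: str = "tool_log.db") -> sqlite3.Connection:
    """Open (or create) the tool-call log database."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS tool_log (
               ts REAL,
               session_id TEXT,
               tool TEXT,
               arguments TEXT,
               result TEXT
           )"""
    )
    return conn

def log_tool_call(conn, session_id, tool, arguments, result):
    """Record one tool_call/tool_result pair, JSON-encoding the payloads."""
    conn.execute(
        "INSERT INTO tool_log VALUES (?, ?, ?, ?, ?)",
        (time.time(), session_id, tool, json.dumps(arguments), json.dumps(result)),
    )
    conn.commit()
```

Storing arguments and results as JSON text keeps the schema stable while tool signatures evolve during weeks 2-4 of logging.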
Timeline
| Phase | Timing | Action |
|---|---|---|
| Ship | Now | Workbench + tool calling (WS1) |
| Log | Weeks 2-4 | Collect real tool call logs |
| Eval | Month 2 | Evaluate base model tool accuracy on logged scenarios |
| Tune | Month 2+ | If accuracy <85% on multi-step chains, begin fine-tune v2 |
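The gate in the Tune row is a one-line check. The 85% threshold comes from the table above; the input format (one pass/fail per logged multi-step chain) is an assumption:

```python
def should_fine_tune(chain_results: list[bool], threshold: float = 0.85) -> bool:
    """True if multi-step chain accuracy falls below the gate,
    i.e. fine-tune v2 should begin."""
    accuracy = sum(chain_results) / len(chain_results)
    return accuracy < threshold
```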