
Fine-Tuning Research: Local Model Expansion

Context

TARX’s local model runs on port 11435. As TARX expands local model use cases — agentic tool calling, multi-step reasoning, Workbench chat — fine-tuning becomes important for reliability and identity.

Open Questions to Research

1. Tool calling reliability

How does out-of-the-box function-calling accuracy on complex multi-step tasks compare with fine-tuned versions? At what tool-chain complexity does the base model start to fail?

Benchmark: the tarx_recall → tarx_remember → tarx_save_conversation chain.
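A minimal way to score this benchmark is to compare the tool calls a run actually made against the expected chain. The sketch below is illustrative, not TARX's actual harness; it assumes tool calls are available as an ordered list of names.

```python
# Sketch: score a multi-step tool-calling run against the expected chain.
# Expected order for the benchmark: tarx_recall -> tarx_remember -> tarx_save_conversation.
EXPECTED_CHAIN = ["tarx_recall", "tarx_remember", "tarx_save_conversation"]

def chain_score(tool_calls: list[str], expected: list[str] = EXPECTED_CHAIN) -> float:
    """Fraction of the expected chain completed in order (unrelated calls are ignored)."""
    idx = 0
    for name in tool_calls:
        if idx < len(expected) and name == expected[idx]:
            idx += 1
    return idx / len(expected)

# Interleaving an unrelated call still completes the chain:
assert chain_score(["tarx_recall", "web_search", "tarx_remember", "tarx_save_conversation"]) == 1.0
# Skipping tarx_remember breaks the chain after the first step:
assert chain_score(["tarx_recall", "tarx_save_conversation"]) == 1 / 3
```

Averaging `chain_score` over logged scenarios gives a single accuracy number to track across base and fine-tuned versions.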

2. TARX identity consistency

Current fix: system prompt injection. Fine-tune v2 is planned (330 examples): what training data format works best for identity/persona fine-tuning vs instruction fine-tuning?
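One common format for persona fine-tuning is chat-style "messages" records, one JSON object per JSONL line. The sketch below shows that shape; the system/user/assistant strings are illustrative placeholders, and the exact schema depends on whichever fine-tuning toolchain is used, not on anything in these notes.

```python
import json

# Sketch: one identity/persona training example in the common chat "messages"
# format. The persona text here is a placeholder, not the real v2 data.
def identity_example(user_text: str, assistant_text: str) -> str:
    record = {
        "messages": [
            {"role": "system", "content": "You are TARX, a local-first assistant."},
            {"role": "user", "content": user_text},
            {"role": "assistant", "content": assistant_text},
        ]
    }
    return json.dumps(record)  # one object per line -> JSONL

line = identity_example("Who are you?", "I'm TARX. I run locally on your machine.")
assert json.loads(line)["messages"][0]["role"] == "system"
```

The open question is whether identity examples like this should be mixed with instruction-style examples or trained as a separate pass.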

3. Context window use

The local model has a 32K context window. After subtracting tool definitions (~2K tokens), Space context, and conversation history, the effective user budget is ~25K. Does fine-tuning help the model better prioritize context, or is this a prompting problem?
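The budget arithmetic above can be made explicit. Only the 32K window and the ~2K tool-definition overhead come from these notes; the Space-context and history figures in the example are placeholders.

```python
# Sketch: token budget arithmetic for the 32K context window.
CONTEXT_WINDOW = 32_000
TOOL_DEFS = 2_000  # approximate overhead from the notes

def user_budget(space_context: int, history: int) -> int:
    """Tokens left for the user after fixed overheads."""
    return CONTEXT_WINDOW - TOOL_DEFS - space_context - history

# e.g. 3K of Space context + 2K of history leaves 25K for the user:
assert user_budget(space_context=3_000, history=2_000) == 25_000
```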

4. Quantization trade-offs

Current: Q4_K_M quantization (~5GB, 18 tok/s on Apple Silicon).

Fine-tuning requires full precision or QLoRA. Post-fine-tune quantization path: full → QLoRA adapter → merge → re-quantize to Q4_K_M.

Test: does re-quantized fine-tuned model outperform the base Q4_K_M?
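One way to run this test is a paired eval: score both models on the same logged scenarios and compare. The data shape and function names below are assumptions for illustration, not an existing TARX harness.

```python
# Sketch: paired accuracy comparison between the base Q4_K_M model and the
# re-quantized fine-tuned model. Each entry is (case_id, base_ok, tuned_ok);
# the example data is illustrative.
def accuracy(results: list[bool]) -> float:
    return sum(results) / len(results)

def compare(paired: list[tuple[str, bool, bool]]) -> dict:
    base = accuracy([b for _, b, _ in paired])
    tuned = accuracy([t for _, _, t in paired])
    return {"base": base, "tuned": tuned, "delta": tuned - base}

report = compare([
    ("chain-01", True, True),
    ("chain-02", False, True),
    ("chain-03", False, True),
    ("chain-04", True, False),
])
assert report["base"] == 0.5 and report["tuned"] == 0.75
```

Running both models on identical cases keeps the comparison fair; the per-case pairing also makes regressions (cases the base got right but the tuned model lost, like chain-04 above) easy to spot.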

5. Training data sources

  • tarx_save_conversation outputs (real session summaries)
  • Tool call logs from Workbench (once live)
  • Existing 330 identity examples
  • Synthetic tool-use trajectories for agentic scenarios

Before writing any training data: instrument the Workbench chat route to log every tool_call and tool_result to SQLite. 2-4 weeks of real usage data is more valuable than synthetic data. Let the product generate its own training set.
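Instrumenting the chat route can be as small as one table and one insert per event. The schema and function names below are a sketch with assumed names, not the Workbench code.

```python
import json
import sqlite3
import time

# Sketch: log every tool_call / tool_result event to SQLite.
# Table and column names are assumptions for illustration.
def init_db(path: str = "tool_logs.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS tool_events (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            ts REAL NOT NULL,
            session_id TEXT NOT NULL,
            kind TEXT NOT NULL CHECK (kind IN ('tool_call', 'tool_result')),
            tool TEXT NOT NULL,
            payload TEXT NOT NULL  -- JSON-encoded arguments or result
        )
    """)
    return conn

def log_event(conn: sqlite3.Connection, session_id: str,
              kind: str, tool: str, payload: dict) -> None:
    conn.execute(
        "INSERT INTO tool_events (ts, session_id, kind, tool, payload) "
        "VALUES (?, ?, ?, ?, ?)",
        (time.time(), session_id, kind, tool, json.dumps(payload)),
    )
    conn.commit()

conn = init_db(":memory:")
log_event(conn, "s1", "tool_call", "tarx_recall", {"query": "last session"})
log_event(conn, "s1", "tool_result", "tarx_recall", {"hits": 3})
assert conn.execute("SELECT COUNT(*) FROM tool_events").fetchone()[0] == 2
```

Logged this way, each session's events replay in order, which is exactly the shape needed to turn real usage into training trajectories later.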

Timeline

| Phase | Timing | Action |
| --- | --- | --- |
| Ship | Now | Workbench + tool calling (WS1) |
| Log | Weeks 2-4 | Collect real tool call logs |
| Eval | Month 2 | Evaluate base model tool accuracy on logged scenarios |
| Tune | Month 2+ | If accuracy <85% on multi-step chains, begin fine-tune v2 |