MASTER CLASS · HERMES · INTERMEDIATE

Hermes 4: the open-source agent brain.

Nous Research's Hermes models are tuned specifically for tool use and agent loops. Run them locally or via a hosted API. Run them locally and there are no rate limits and no per-token billing: you pay only for your own hardware or GPU time.

▸ When you're done

Hermes 4 running on your Mac (via Ollama) or on a cloud GPU (via Together/Replicate). A working agent loop with tool calls behind an OpenAI-compatible interface, so existing OpenAI SDK code works after changing the base URL and model name.

16 min walkthrough
2 tools · all free tier
Copy-paste ready · no theory
◢ The build · 5 steps · 16 min

Follow these in order. Don't skip.

Step 01 / 05

Pick your runtime — local vs hosted

  • LOCAL via Ollama: runs on an M-series Mac (16GB+ RAM for an 8B model, 64GB+ for 70B). Zero API cost. Good for privacy.
  • HOSTED via Together AI — pay per token, no GPU needed. Best price/perf for agents that don't need to be local.
  • HOSTED via Replicate — same as Together, slightly higher latency, easier dashboards.
  • DON'T run Hermes 405B locally. Even quantized to 4 bits the weights alone are roughly 200GB, more than a single 80GB H100 can hold. Use the API.
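A rough way to sanity-check the RAM figures above: weight memory is parameter count times bytes per weight, plus headroom for the KV cache and the runtime. The sketch below assumes 4-bit quantization and a 20% overhead factor. Both numbers are illustrative assumptions, not measurements, and `estimate_memory_gb` is a hypothetical helper.

```python
def estimate_memory_gb(params_billion: float, bits_per_weight: int = 4,
                       overhead: float = 1.2) -> float:
    """Rule-of-thumb memory estimate in GB (1 GB = 1e9 bytes).

    overhead=1.2 is an assumed allowance for KV cache and runtime buffers.
    """
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9


for size in (8, 70, 405):
    print(f"{size}B @ 4-bit: ~{estimate_memory_gb(size):.0f} GB")
```

Under these assumptions an 8B model needs about 5GB, a 70B model about 42GB, and a 405B model well over 200GB, which is why the largest size belongs on a hosted API.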
Step 02 / 05

Run Hermes locally with Ollama

Terminal
# Install Ollama (macOS, Homebrew)
brew install ollama
ollama serve &

# Pull a Hermes model: pick the size that fits your RAM.
# The tags below are Hermes 2 / Hermes 3 builds. Tags change between releases,
# so check the Ollama library and swap in a Hermes 4 tag if one is listed.
ollama pull nous-hermes2:34b    # 64GB+ RAM
ollama pull nous-hermes2:10.7b  # 32GB RAM
ollama pull hermes3:8b          # 16GB RAM

# Quick smoke test
ollama run hermes3:8b "What is the capital of France?"
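To confirm from code that Ollama is serving and that a model is actually pulled, you can query its native `GET /api/tags` endpoint, which lists local models. This is a minimal sketch using only the standard library. The helper names `model_names` and `list_local_models` are hypothetical, and the default URL assumes Ollama's standard port 11434.

```python
import json
import urllib.request


def model_names(tags_payload: dict) -> list:
    """Extract model names from the JSON returned by Ollama's GET /api/tags."""
    return [m["name"] for m in tags_payload.get("models", [])]


def list_local_models(base_url: str = "http://localhost:11434") -> list:
    """Ask the local Ollama server which models have been pulled."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return model_names(json.load(resp))


# Usage (needs `ollama serve` running):
#   print(list_local_models())
```

If the call raises a connection error, the server is not running. If the list is empty, nothing has been pulled yet.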
Step 03 / 05

Use Hermes via the OpenAI-compatible endpoint

Ollama exposes /v1/chat/completions in OpenAI format. Any code that uses the OpenAI SDK works against Hermes with one URL change.

agent/hermes_local.py
from openai import OpenAI

# Point the OpenAI SDK at your local Ollama
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Ollama doesn't check this, but the SDK requires a string
)

resp = client.chat.completions.create(
    model="hermes3:8b",  # use the tag you pulled in Step 02
    messages=[
        {"role": "system", "content": "You are an agent that uses tools. Always think step by step."},
        {"role": "user", "content": "What's 17 * 23?"},
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "calculator",
            "description": "Run an arithmetic expression and return the result",
            "parameters": {
                "type": "object",
                "properties": {"expression": {"type": "string"}},
                "required": ["expression"],
            },
        },
    }],
)

# If the model chose to call the tool, the request is in message.tool_calls
print(resp.choices[0].message)
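The script above only prints the model's tool request. Nothing runs the calculator or returns the result to the model. A minimal sketch of the missing loop follows, assuming the standard OpenAI chat-completions tool-call shape (`message.tool_calls`, then a `role: "tool"` reply). The names `calculator`, `TOOLS`, and `run_agent` are hypothetical, and the calculator supports only basic arithmetic by design.

```python
import ast
import json
import operator

# Whitelisted arithmetic operators for the calculator tool (no eval()).
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.USub: operator.neg,
}


def calculator(expression: str) -> str:
    """Evaluate +, -, *, / on numeric literals by walking the parsed AST."""
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.operand))
        raise ValueError(f"unsupported expression: {expression!r}")
    return str(ev(ast.parse(expression, mode="eval").body))


TOOLS = {"calculator": calculator}  # tool name -> Python callable


def run_agent(client, model, messages, tools, max_turns=5):
    """Call the model, run any tools it requests, repeat until it answers in text."""
    messages = list(messages)  # don't mutate the caller's list
    for _ in range(max_turns):
        resp = client.chat.completions.create(model=model, messages=messages, tools=tools)
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content
        messages.append(msg)  # keep the assistant's tool request in the history
        for call in msg.tool_calls:
            try:
                args = json.loads(call.function.arguments)
                result = TOOLS[call.function.name](**args)
            except Exception as exc:  # report failures back to the model
                result = f"error: {exc}"
            messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
    raise RuntimeError("agent did not finish within max_turns")
```

To use it, pass the `client`, message list, and `tools` list from the script above along with whichever model tag you pulled, for example `run_agent(client, model_tag, messages, tools)`. The `max_turns` cap is a guard against a model that keeps requesting tools without ever answering.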
Step 04 / 05

Call hosted Hermes (Together AI) instead

agent/hermes_hosted.py
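import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key=os.environ["TOGETHER_API_KEY"],  # set this to your Together API key
)

resp = client.chat.completions.create(
    # Check Together's model list for the current Hermes model ID
    model="NousResearch/Hermes-3-Llama-3.1-70B",
    messages=[
        {"role": "system", "content": "You are a tool-using research agent."},
        {"role": "user", "content": "Search the web for the price of ETH and tell me."},
    ],
    # No `tools` list is passed here, so the model cannot actually search the web.
    # Add a search tool definition (as in Step 03) before trusting the answer.
)
print(resp.choices[0].message.content)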
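Because both runtimes speak the same protocol, the only differences are base URL, API key, and model name. One way to keep a single code path is a small settings lookup, sketched below. The endpoints come from the steps above. The environment variable names `HERMES_RUNTIME` and `TOGETHER_API_KEY`, the helper `client_settings`, and the local model tag are assumptions to adapt to your setup.

```python
import os

# Endpoints are from this guide; model IDs and env var names are assumptions.
RUNTIMES = {
    "local": {
        "base_url": "http://localhost:11434/v1",
        "api_key_env": None,  # Ollama ignores the key
        "model": "hermes3:8b",
    },
    "hosted": {
        "base_url": "https://api.together.xyz/v1",
        "api_key_env": "TOGETHER_API_KEY",
        "model": "NousResearch/Hermes-3-Llama-3.1-70B",
    },
}


def client_settings(runtime: str = "") -> dict:
    """Return base_url, api_key and model for 'local' or 'hosted'.

    Falls back to the HERMES_RUNTIME env var, then to 'local'.
    """
    runtime = runtime or os.environ.get("HERMES_RUNTIME", "local")
    cfg = RUNTIMES[runtime]
    key_env = cfg["api_key_env"]
    api_key = os.environ[key_env] if key_env else "ollama"
    return {"base_url": cfg["base_url"], "api_key": api_key, "model": cfg["model"]}


# Usage:
#   s = client_settings("hosted")
#   client = OpenAI(base_url=s["base_url"], api_key=s["api_key"])
#   client.chat.completions.create(model=s["model"], messages=[...])
```

Keeping the API key in an environment variable also avoids committing it to source control.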
Step 05 / 05

When Hermes wins, when it loses

  • WINS: tool-call reliability is excellent — Hermes was fine-tuned specifically for agent loops
  • WINS: cost. 70B at Together AI is ~$0.88/M tokens vs Sonnet at $3/M input, about 3.4× cheaper on input tokens at those prices.
  • WINS: privacy — local runs never leave your machine. Compliance-friendly.
  • LOSES: long-context reasoning — Sonnet/GPT-5 still outperform on 50k+ token tasks.
  • LOSES: code review nuance — Claude wins for senior-engineer-level feedback.
  • Use Hermes for: routing, classification, tool-calling agents, RAG synthesis. Use Claude/GPT-5 for: hard reasoning, code review, long-doc analysis.
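The cost claim above is easy to check with arithmetic. The sketch below uses only the two prices quoted in this guide. The monthly token volume is a hypothetical workload, and provider prices change, so treat the result as an illustration rather than a quote.

```python
# Prices quoted in this guide, in USD per million tokens.
HERMES_70B_PER_M = 0.88    # Together AI
SONNET_INPUT_PER_M = 3.00  # input tokens only


def cost_usd(tokens: int, price_per_million: float) -> float:
    """Cost of processing `tokens` at a per-million-token price."""
    return tokens / 1_000_000 * price_per_million


monthly_tokens = 50_000_000  # hypothetical workload
hermes = cost_usd(monthly_tokens, HERMES_70B_PER_M)
sonnet = cost_usd(monthly_tokens, SONNET_INPUT_PER_M)
print(f"Hermes 70B: ${hermes:.2f}  Sonnet input: ${sonnet:.2f}  "
      f"ratio: {sonnet / hermes:.1f}x")
```

At 50M tokens a month this works out to $44 against $150, a ratio of about 3.4×. The comparison covers input pricing only.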
Ship-it checklist
5 CHECKS
  • Ollama installed and serving
  • At least one Hermes model pulled (size matched to your RAM)
  • OpenAI-SDK code calling Hermes locally (the same code works against the hosted endpoint after changing the base URL, key, and model name)
  • Optional: a Together AI key for the 70B+ models you can't run locally
  • You compared one task's output across Hermes vs Claude vs GPT-5 — you know what each is good at