API Documentation
Summarization controls and endpoints for compressing long-running conversations into durable context artifacts with lower token overhead.
Long conversations can exceed context window limits and increase costs. Mnexium's Summarization feature automatically compresses older messages into concise summaries while preserving recent messages verbatim.
When enabled, Mnexium generates rolling summaries of your conversation history. Summaries are cached and reused across requests, so you only pay for summarization once per conversation segment.
Use the `summarize` parameter in your `mnx` object to enable automatic summarization. Choose a preset mode based on your cost/fidelity tradeoff:
| Mode | Start At | Keep Recent | Summary Target | Best For |
|---|---|---|---|---|
| off | — | All | — | Maximum fidelity (default) |
| light | 70K tokens | 25 msgs | ~1,800 tokens | Safe compression |
| balanced | 55K tokens | 15 msgs | ~1,100 tokens | Best cost/performance |
| aggressive | 35K tokens | 8 msgs | ~700 tokens | Cheapest possible |
```json
{
  "model": "gpt-4o-mini",
  "messages": [{ "role": "user", "content": "..." }],
  "mnx": {
    "subject_id": "user_123",
    "chat_id": "550e8400-e29b-41d4-a716-446655440000",
    "summarize": "balanced"
  }
}
```

For fine-grained control, pass a `summarize_config` object instead of a preset:

```json
{
  "model": "gpt-4o-mini",
  "messages": [{ "role": "user", "content": "..." }],
  "mnx": {
    "subject_id": "user_123",
    "chat_id": "550e8400-e29b-41d4-a716-446655440000",
    "summarize_config": {
      "start_at_tokens": 40000,
      "chunk_size": 15000,
      "keep_recent_messages": 10,
      "summary_target": 800
    }
  }
}
```

- `start_at_tokens` — Token threshold that triggers summarization. History below this is sent verbatim.
- `chunk_size` — How many tokens to summarize at a time when history exceeds the threshold.
- `keep_recent_messages` — Always keep this many recent messages verbatim (never summarized).
- `summary_target` — Target token count for each generated summary.

- When a chat request comes in, Mnexium counts tokens in the conversation history using tiktoken.
- If history exceeds `start_at_tokens`, older messages are summarized.
- The summary is generated using `gpt-4o-mini` and cached in the database.
- Future requests reuse the cached summary until new messages push past the threshold again.
- The final context sent to the LLM is: `[Summary] + [Recent Messages] + [New Message]`
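The threshold check in the steps above can be sketched as follows. This is a minimal illustration, not Mnexium's actual implementation: the `Message`, `count_tokens`, and `plan_context` names are hypothetical, and the word-count tokenizer stands in for the real BPE counting done with tiktoken.

```python
from dataclasses import dataclass

@dataclass
class Message:
    role: str
    content: str

def count_tokens(messages):
    # Stand-in tokenizer: word count. In production, tokens are
    # counted with a real BPE tokenizer such as tiktoken.
    return sum(len(m.content.split()) for m in messages)

def plan_context(history, start_at_tokens=55_000, keep_recent_messages=15):
    """Split history into (to_summarize, verbatim).

    Defaults mirror the 'balanced' preset from the table above.
    """
    if count_tokens(history) <= start_at_tokens:
        # Under threshold: the whole history is sent verbatim.
        return [], list(history)
    # Over threshold: summarize everything except the most recent turns.
    return list(history[:-keep_recent_messages]), list(history[-keep_recent_messages:])
```

The older slice would then be fed to the summarization model (in chunks of `chunk_size` tokens), while the recent slice is forwarded untouched.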
Mnexium uses a rolling summary by default: we maintain a single condensed memory block for older messages and inject that plus the most recent turns into the model.
This is the most token-efficient strategy and is recommended for almost all workloads.
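Assembling the final context from a rolling summary might look like the sketch below. The `build_context` helper and the system-message injection strategy are illustrative assumptions; the source only specifies the ordering `[Summary] + [Recent Messages] + [New Message]`.

```python
def build_context(summary, recent_messages, new_message):
    """Assemble the prompt: [Summary] + [Recent Messages] + [New Message]."""
    context = []
    if summary:
        # Hypothetical choice: inject the cached rolling summary
        # as a leading system message.
        context.append({
            "role": "system",
            "content": f"Summary of earlier conversation: {summary}",
        })
    context.extend(recent_messages)  # recent turns, kept verbatim
    context.append(new_message)      # the incoming user message
    return context
```

Because the summary is cached, only the single condensed block plus the recent turns count against the context window on each request.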
For specialized use cases that need more detailed historical context in the prompt (at a higher token cost), a granular summaries mode is planned for a future release; it will keep multiple smaller summary blocks instead of a single rolling one.