API Documentation
Summarization controls and endpoints for compressing long-running conversations into durable context artifacts with lower token overhead.
Long conversations can exceed context window limits and increase costs. Mnexium's Summarization feature automatically compresses older messages into concise summaries while preserving recent messages verbatim.
When enabled, Mnexium generates rolling summaries of your conversation history. Summaries are cached and reused across requests, so you only pay for summarization once per conversation segment.
Use the `summarize` parameter in your `mnx` object to enable automatic summarization. Choose a preset mode based on your cost/fidelity tradeoff:
| Mode | Start At | Keep Recent | Summary Target | Best For |
|---|---|---|---|---|
| off | — | All | — | Maximum fidelity (default) |
| light | 70K tokens | 25 msgs | ~1,800 tokens | Safe compression |
| balanced | 55K tokens | 15 msgs | ~1,100 tokens | Best cost/performance |
| aggressive | 35K tokens | 8 msgs | ~700 tokens | Cheapest possible |
```json
{
  "model": "gpt-4o-mini",
  "messages": [{ "role": "user", "content": "..." }],
  "mnx": {
    "subject_id": "user_123",
    "chat_id": "550e8400-e29b-41d4-a716-446655440000",
    "summarize": "balanced"
  }
}
```

For fine-grained control, pass a `summarize_config` object instead of a preset:

```json
{
  "model": "gpt-4o-mini",
  "messages": [{ "role": "user", "content": "..." }],
  "mnx": {
    "subject_id": "user_123",
    "chat_id": "550e8400-e29b-41d4-a716-446655440000",
    "summarize_config": {
      "start_at_tokens": 40000,
      "chunk_size": 15000,
      "keep_recent_messages": 10,
      "summary_target": 800
    }
  }
}
```

- `start_at_tokens` — Token threshold that triggers summarization. History below this is sent verbatim.
- `chunk_size` — How many tokens to summarize at a time when history exceeds the threshold.
- `keep_recent_messages` — Always keep this many recent messages verbatim (never summarized).
- `summary_target` — Target token count for each generated summary.

- When a chat request comes in, Mnexium counts tokens in the conversation history using tiktoken.
- If history exceeds `start_at_tokens`, older messages are summarized.
- The summary is generated using `gpt-4o-mini` and cached in the database.
- Future requests reuse the cached summary until new messages push past the threshold again.
- The final context sent to the LLM is: `[Summary] + [Recent Messages] + [New Message]`
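The threshold check in the steps above can be sketched as follows. This is a minimal illustration, not Mnexium's actual implementation: the `Message`, `count_tokens`, and `plan_context` names are hypothetical, and the word-count tokenizer stands in for the real BPE counting done with tiktoken.

```python
from dataclasses import dataclass

@dataclass
class Message:
    role: str
    content: str

def count_tokens(messages):
    # Stand-in tokenizer: word count. In production, tokens are
    # counted with a real BPE tokenizer such as tiktoken.
    return sum(len(m.content.split()) for m in messages)

def plan_context(history, start_at_tokens=55_000, keep_recent_messages=15):
    """Split history into (to_summarize, verbatim).

    Defaults mirror the 'balanced' preset from the table above.
    """
    if count_tokens(history) <= start_at_tokens:
        # Under threshold: the whole history is sent verbatim.
        return [], list(history)
    # Over threshold: summarize everything except the most recent turns.
    return list(history[:-keep_recent_messages]), list(history[-keep_recent_messages:])
```

The older slice would then be fed to the summarization model (in chunks of `chunk_size` tokens), while the recent slice is forwarded untouched.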
Mnexium uses a rolling summary by default: we maintain a single condensed memory block for older messages and inject that plus the most recent turns into the model.
This is the most token-efficient strategy and is recommended for almost all workloads.
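Assembling the final context from a rolling summary might look like the sketch below. The `build_context` helper and the system-message injection strategy are illustrative assumptions; the source only specifies the ordering `[Summary] + [Recent Messages] + [New Message]`.

```python
def build_context(summary, recent_messages, new_message):
    """Assemble the prompt: [Summary] + [Recent Messages] + [New Message]."""
    context = []
    if summary:
        # Hypothetical choice: inject the cached rolling summary
        # as a leading system message.
        context.append({
            "role": "system",
            "content": f"Summary of earlier conversation: {summary}",
        })
    context.extend(recent_messages)  # recent turns, kept verbatim
    context.append(new_message)      # the incoming user message
    return context
```

Because the summary is cached, only the single condensed block plus the recent turns count against the context window on each request.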
For specialized use cases that need more detailed historical context in the prompt (at a higher token cost), a granular summaries mode is planned for a future release; it will keep multiple smaller summary blocks instead of a single rolling one.