
GLM-4.6 Supports Reasoning and Interleaved Thinking


Enabling Reasoning in Claude Code with GLM-4.6

Starting from version 4.5, GLM has supported Claude Code. I've been following its progress closely, and many users have reported that reasoning could not be enabled within Claude Code. Recently, thanks to sponsorship from Zhipu, I decided to investigate this issue in depth. According to the official documentation, the /chat/completions endpoint has reasoning enabled by default, but the model itself decides whether to think:

thinking.type enum<string> default:enabled

Whether to enable the chain of thought (when enabled, GLM-4.6, GLM-4.5 and others will automatically decide whether to think, while GLM-4.5V will always think). Default: enabled

Available options: enabled, disabled
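For reference, an explicit request against /chat/completions might look roughly like the sketch below. The base URL and API-key handling are my own assumptions (adjust them to your deployment); the thinking field follows the parameter quoted above.

// Sketch: explicitly setting the thinking parameter on a GLM /chat/completions call.
// The base URL is an assumption (Zhipu's OpenAI-compatible endpoint); adjust as needed.
const response = await fetch(
  "https://open.bigmodel.cn/api/paas/v4/chat/completions",
  {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.ZHIPU_API_KEY}`,
    },
    body: JSON.stringify({
      model: "glm-4.6",
      messages: [{ role: "user", content: "Prove that sqrt(2) is irrational." }],
      thinking: { type: "enabled" }, // or "disabled" to turn reasoning off
    }),
  }
);
console.log(await response.json());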

However, Claude Code's heavy system prompt interferes with GLM's internal decision about whether to think, so the model rarely does. We therefore need to explicitly guide the model to believe reasoning is required. Since claude-code-router functions as a proxy, the only feasible approach is to modify prompts or parameters in transit.

Initially, I tried completely removing Claude Code's system prompt — and indeed, the model started reasoning — but that broke Claude Code's workflow. So instead, I used prompt injection to clearly instruct the model to think step by step.

// transformer.ts
import { UnifiedChatRequest } from "../types/llm";
import { Transformer } from "../types/transformer";

// Prompt injected at every reachable point: it tells the model to always reason
// explicitly and to wrap its thinking in <reasoning_content> tags.
const REASONING_PROMPT =
  "You are an expert reasoning model.\nAlways think step by step before answering. Even if the problem seems simple, always write down your reasoning process explicitly.\nNever skip your chain of thought.\nUse the following output format:\n<reasoning_content>(Write your full detailed thinking here.)</reasoning_content>\n\nWrite your final conclusion here.";

export class ForceReasoningTransformer implements Transformer {
  name = "forcereasoning";

  async transformRequestIn(
    request: UnifiedChatRequest
  ): Promise<UnifiedChatRequest> {
    // Append the instruction to the system message, if one exists.
    const systemMessage = request.messages.find(
      (item) => item.role === "system"
    );
    if (Array.isArray(systemMessage?.content)) {
      systemMessage.content.push({
        type: "text",
        text: REASONING_PROMPT,
      });
    }
    // Repeat it on the latest user message so it isn't buried in a long context.
    const lastMessage = request.messages[request.messages.length - 1];
    if (lastMessage.role === "user" && Array.isArray(lastMessage.content)) {
      lastMessage.content.push({
        type: "text",
        text: REASONING_PROMPT,
      });
    }
    // After a tool result, append a fresh user message carrying the instruction.
    if (lastMessage.role === "tool") {
      request.messages.push({
        role: "user",
        content: [
          {
            type: "text",
            text: REASONING_PROMPT,
          },
        ],
      });
    }
    return request;
  }
}

Why use <reasoning_content> instead of the <think> tag? Two reasons:

  1. Using the <think> tag doesn't effectively trigger reasoning — likely because the model was trained on data where <think> had special behavior.

  2. If we use <think>, the reasoning output is split into a separate field, which directly relates to the chain-of-thought feedback problem discussed below.

Chain-of-Thought Feedback

Recently, Minimax released Minimax-m2, along with an article explaining interleaved thinking. While the idea isn't entirely new, it's a good opportunity to analyze it.

Why do we need interleaved thinking? Minimax's article mentions that the Chat Completion API does not support passing reasoning content between requests. We know ChatGPT was the first to support reasoning, but OpenAI initially didn't expose the chain of thought to users, so the Chat Completion API never needed to carry it. Even the dedicated CoT field was first introduced by DeepSeek.

Do we really need explicit CoT fields? What happens if we don't have them? Will it affect reasoning? By inspecting sglang's source code, we can see that reasoning content is naturally emitted in messages with specific markers. If we don't split it out, the next-round conversation will naturally include it. Thus, the only reason we need interleaved thinking is because we separated the reasoning content from the normal messages.

With fewer than 40 lines of code above, I implemented a simple exploration of enabling reasoning and chain-of-thought feedback for GLM-4.5/4.6. (It's only simple because I haven't implemented parsing logic yet — you could easily modify the transformer to split reasoning output on response and merge it back on request, improving Claude Code's frontend display compatibility.)
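As a rough illustration of that idea, the parsing could look like the sketch below. splitReasoning and mergeReasoning are hypothetical helpers, not part of claude-code-router, and a real implementation would also need to handle streaming chunks.

// Hypothetical response-side helper: pull <reasoning_content> out of the model's text
// so it can be surfaced as a separate thinking block.
const REASONING_RE = /<reasoning_content>([\s\S]*?)<\/reasoning_content>/;

function splitReasoning(text: string): { reasoning?: string; answer: string } {
  const match = text.match(REASONING_RE);
  if (!match) return { answer: text };
  return {
    reasoning: match[1].trim(),
    answer: text.replace(REASONING_RE, "").trim(),
  };
}

// Hypothetical request-side helper: when replaying history, fold the stored reasoning
// back into the assistant message so the model sees its own previous chain of thought.
function mergeReasoning(answer: string, reasoning?: string): string {
  if (!reasoning) return answer;
  return `<reasoning_content>${reasoning}</reasoning_content>\n\n${answer}`;
}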

If you have better ideas, feel free to reach out — I'd love to discuss further.

Maybe We Can Do More with the Router


Since the release of claude-code-router, I've received a lot of user feedback, and quite a few issues are still open. Most of them are related to support for different providers and to the weak tool usage of the DeepSeek models.

Originally, I created this project for personal use, mainly to access Claude Code at a lower cost. So multi-provider support wasn't part of the initial design. But during troubleshooting, I discovered that even though most providers claim to be compatible with the OpenAI-style /chat/completions interface, there are many subtle differences. For example:

  1. When Gemini's tool parameter type is string, the format field only supports date and date-time, and there's no tool call ID.

  2. OpenRouter requires cache_control for caching.

  3. The official DeepSeek API has a max_output of 8192, but Volcano Engine's limit is even higher.

Aside from these, smaller providers often have quirks in their parameter handling. So I decided to create a new project, musistudio/llms, to deal with these compatibility issues. It uses the OpenAI format as a base and introduces a generic Transformer interface for transforming both requests and responses.
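In rough terms, a Transformer can be thought of as a pair of optional hooks around a unified request/response shape. The sketch below is inferred from the methods used in this post; it is not the exact interface shipped in musistudio/llms.

// Sketch of a Transformer-style contract, based on the examples in this post.
interface UnifiedChatRequest {
  messages: Array<{ role: string; content: unknown }>;
  tools?: unknown[];
  tool_choice?: string;
  [key: string]: unknown;
}

interface Transformer {
  name: string;
  // Adjust or convert a request on its way in (provider-specific -> unified OpenAI-style shape).
  transformRequestIn?(
    request: UnifiedChatRequest
  ): UnifiedChatRequest | Promise<UnifiedChatRequest>;
  // Adjust or convert a response on its way out, before it is returned to the client.
  transformResponseOut?(response: Response): Response | Promise<Response>;
}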

Once a Transformer is implemented for each provider, it becomes possible to mix-and-match requests between them. For example, I implemented bidirectional conversion between Anthropic and OpenAI formats in AnthropicTransformer, which listens to the /v1/messages endpoint. Similarly, GeminiTransformer handles Gemini <-> OpenAI format conversions and listens to /v1beta/models/:modelAndAction.

When both requests and responses are transformed into a common format, they can interoperate seamlessly:

AnthropicRequest -> AnthropicTransformer -> OpenAIRequest -> GeminiTransformer -> GeminiRequest -> GeminiServer
GeminiResponse -> GeminiTransformer -> OpenAIResponse -> AnthropicTransformer -> AnthropicResponse

Using a middleware layer to smooth out differences may introduce some performance overhead, but the main goal here is to enable claude-code-router to support multiple providers.

As for the issue of DeepSeek's lackluster tool usage — I found that it stems from poor instruction adherence in long conversations. Initially, the model actively calls tools, but after several rounds, it starts responding with plain text instead. My first workaround was injecting a system prompt to remind the model to use tools proactively. But in long contexts, the model tends to forget this instruction.

After reading the DeepSeek documentation, I noticed it supports the tool_choice parameter, which can be set to "required" to force the model to use at least one tool. I tested this by enabling the parameter, and it significantly improved the model's tool usage. We can remove the setting when it's no longer necessary. With the help of the Transformer interface in musistudio/llms, we can modify the request before it's sent and adjust the response after it's received.

Inspired by the Plan Mode in Claude Code, I implemented a similar Tool Mode for DeepSeek:

// tooluse.ts
import { UnifiedChatRequest } from "../types/llm";
import { Transformer } from "../types/transformer";

export class TooluseTransformer implements Transformer {
  name = "tooluse";

  transformRequestIn(request: UnifiedChatRequest): UnifiedChatRequest {
    if (request.tools?.length) {
      // Remind the model that tool mode is active and tools should be used proactively.
      request.messages.push({
        role: "system",
        content: `<system-reminder>Tool mode is active. The user expects you to proactively execute the most suitable tool to help complete the task.
Before invoking a tool, you must carefully evaluate whether it matches the current task. If no available tool is appropriate for the task, you MUST call the \`ExitTool\` to exit tool mode — this is the only valid way to terminate tool mode.
Always prioritize completing the user's task effectively and efficiently by using tools whenever appropriate.</system-reminder>`,
      });
      // Force the model to call at least one tool on every turn.
      request.tool_choice = "required";
      // ExitTool is the escape hatch: calling it ends tool mode and returns plain text.
      request.tools.unshift({
        type: "function",
        function: {
          name: "ExitTool",
          description: `Use this tool when you are in tool mode and have completed the task. This is the only valid way to exit tool mode.
IMPORTANT: Before using this tool, ensure that none of the available tools are applicable to the current task. You must evaluate all available options — only if no suitable tool can help you complete the task should you use ExitTool to terminate tool mode.
Examples:
1. Task: "Use a tool to summarize this document" — Do not use ExitTool if a summarization tool is available.
2. Task: "What's the weather today?" — If no tool is available to answer, use ExitTool after reasoning that none can fulfill the task.`,
          parameters: {
            type: "object",
            properties: {
              response: {
                type: "string",
                description:
                  "Your response will be forwarded to the user exactly as returned — the tool will not modify or post-process it in any way.",
              },
            },
            required: ["response"],
          },
        },
      });
    }
    return request;
  }

  async transformResponseOut(response: Response): Promise<Response> {
    if (response.headers.get("Content-Type")?.includes("application/json")) {
      const jsonResponse = await response.json();
      const toolCall = jsonResponse?.choices?.[0]?.message?.tool_calls?.[0];
      // If the model called ExitTool, unwrap its response argument into ordinary content.
      if (toolCall?.function?.name === "ExitTool") {
        const toolArguments = JSON.parse(toolCall.function.arguments || "{}");
        jsonResponse.choices[0].message.content = toolArguments.response || "";
        delete jsonResponse.choices[0].message.tool_calls;
      }

      // Handle non-streaming response if needed
      return new Response(JSON.stringify(jsonResponse), {
        status: response.status,
        statusText: response.statusText,
        headers: response.headers,
      });
    } else if (response.headers.get("Content-Type")?.includes("stream")) {
      // ...
    }
    return response;
  }
}

This transformer ensures the model calls at least one tool. If no tools are appropriate or the task is finished, it can exit using ExitTool. Since this relies on the tool_choice parameter, it only works with models that support it.

In practice, this approach noticeably improves tool usage for DeepSeek. The tradeoff is that sometimes the model may invoke irrelevant or unnecessary tools, which could increase latency and token usage.

This update is just a small experiment — adding an "agent" to the router. Maybe there are more interesting things we can explore from here.

Project Motivation and Principles


As early as the day after Claude Code was released (2025-02-25), I began and completed a reverse-engineering attempt on the project. At that time, using Claude Code required registering for an Anthropic account, joining a waitlist, and waiting for approval. However, due to well-known reasons, Anthropic blocks users from mainland China, making it impossible for me to use the service through normal means. Based on the known information, I discovered the following:

  1. Claude Code is installed via npm, so it's very likely developed with Node.js.
  2. Node.js offers various debugging methods: simple console.log usage, launching with --inspect to hook into Chrome DevTools, or even debugging obfuscated code using d8.

My goal was to use Claude Code without an Anthropic account. I didn't need the full source code—just a way to intercept the requests Claude Code makes to Anthropic's models and reroute them to my own endpoint. So I started the reverse-engineering process:

  1. First, install Claude Code:

npm install -g @anthropic-ai/claude-code

  2. After installation, the project is located at ~/.nvm/versions/node/v20.10.0/lib/node_modules/@anthropic-ai/claude-code (this may vary depending on your Node version manager and version).

  3. Open the package.json to analyze the entry point:

{
  "name": "@anthropic-ai/claude-code",
  "version": "1.0.24",
  "main": "sdk.mjs",
  "types": "sdk.d.ts",
  "bin": {
    "claude": "cli.js"
  },
  "engines": {
    "node": ">=18.0.0"
  },
  "type": "module",
  "author": "Boris Cherny <boris@anthropic.com>",
  "license": "SEE LICENSE IN README.md",
  "description": "Use Claude, Anthropic's AI assistant, right from your terminal. Claude can understand your codebase, edit files, run terminal commands, and handle entire workflows for you.",
  "homepage": "https://github.com/anthropics/claude-code",
  "bugs": {
    "url": "https://github.com/anthropics/claude-code/issues"
  },
  "scripts": {
    "prepare": "node -e \"if (!process.env.AUTHORIZED) { console.error('ERROR: Direct publishing is not allowed.\\nPlease use the publish-external.sh script to publish this package.'); process.exit(1); }\"",
    "preinstall": "node scripts/preinstall.js"
  },
  "dependencies": {},
  "optionalDependencies": {
    "@img/sharp-darwin-arm64": "^0.33.5",
    "@img/sharp-darwin-x64": "^0.33.5",
    "@img/sharp-linux-arm": "^0.33.5",
    "@img/sharp-linux-arm64": "^0.33.5",
    "@img/sharp-linux-x64": "^0.33.5",
    "@img/sharp-win32-x64": "^0.33.5"
  }
}

The key entry is "claude": "cli.js". Opening cli.js, you'll see the code is minified and obfuscated. But using WebStorm's Format File feature, you can reformat it for better readability.

Now you can begin understanding Claude Code's internal logic and prompt structure by reading the code. To dig deeper, you can insert console.log statements or launch in debug mode with Chrome DevTools using:

NODE_OPTIONS="--inspect-brk=9229" claude

This command starts Claude Code in debug mode and opens port 9229. Visit chrome://inspect/ in Chrome and click inspect to begin debugging.

By searching for the keyword api.anthropic.com, you can easily locate where Claude Code makes its API calls. From the surrounding code, it's clear that baseURL can be overridden with the ANTHROPIC_BASE_URL environment variable, and apiKey and authToken can be configured similarly.

So far, we've discovered some key information:

  1. Environment variables can override Claude Code's baseURL and apiKey.

  2. Claude Code adheres to the Anthropic API specification.

Therefore, we need:

  1. A service that translates between the Anthropic API format used by Claude Code and the OpenAI-compatible format, converting requests and responses in both directions.

  2. Environment variables set before launching Claude Code so that its requests are redirected to this service.

Thus, claude-code-router was born. This project uses Express.js to implement the /v1/messages endpoint. It leverages middlewares to transform request/response formats and supports request rewriting (useful for prompt tuning per model).
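A heavily simplified sketch of that idea follows. It is not the project's actual code: the converter functions are placeholder stubs, and the port and environment variable names for the upstream provider are arbitrary.

import express from "express";

// Hypothetical converters; the real project implements these in its middlewares/transformers.
// They are identity stubs here so the sketch runs; real field mapping goes inside.
function toOpenAIRequest(anthropicBody: any): any {
  return anthropicBody; // placeholder: map Anthropic fields to OpenAI fields
}
function toAnthropicResponse(openaiBody: any): any {
  return openaiBody; // placeholder: map OpenAI fields back to Anthropic fields
}

const app = express();
app.use(express.json());

// Claude Code sends Anthropic-format requests here once ANTHROPIC_BASE_URL points at this server.
app.post("/v1/messages", async (req, res) => {
  const openaiRequest = toOpenAIRequest(req.body);
  const upstream = await fetch(`${process.env.OPENAI_BASE_URL}/chat/completions`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify(openaiRequest),
  });
  const openaiResponse = await upstream.json();
  res.json(toAnthropicResponse(openaiResponse));
});

app.listen(3456);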

Back in February, the full DeepSeek model series had poor support for Function Calling, so I initially used qwen-max. It worked well—but without KV cache support, it consumed a large number of tokens and couldn't provide the native Claude Code experience.

So I experimented with a Router-based mode using a lightweight model to dispatch tasks. The architecture included four roles: router, tool, think, and coder. Each request passed through a free lightweight model that would decide whether the task involved reasoning, coding, or tool usage. Reasoning and coding tasks looped until a tool was invoked to apply changes. However, the lightweight model lacked the capability to route tasks accurately, and architectural issues prevented it from effectively driving Claude Code.

Everything changed at the end of May, when the official Claude Code was launched and the DeepSeek-R1 model (released 2025-05-28) added Function Call support. I redesigned the system. With the help of AI pair programming, I fixed earlier request/response transformation issues—especially the handling of models that return JSON instead of Function Call outputs.

This time, I used the DeepSeek-V3 model. It performed better than expected: supporting most tool calls, handling task decomposition and stepwise planning, and—most importantly—costing less than one-tenth the price of Claude 3.5 Sonnet.

The official Claude Code organizes agents differently from the beta version, so I restructured my Router mode to include four roles: the default model, background, think, and longContext.

  • The default model handles general tasks and acts as a fallback.

  • The background model manages lightweight background tasks. According to Anthropic, Claude Haiku 3.5 is often used here, so I routed this to a local ollama service.

  • The think model is responsible for reasoning and Plan Mode tasks. I use DeepSeek-R1 here; since its reasoning effort can't be controlled, Think and UltraThink behave identically.

  • The longContext model handles long-context scenarios. The router uses tiktoken to calculate token lengths in real time, and if the context exceeds 32K, it switches to this model to compensate for DeepSeek's long-context limitations.
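A rough sketch of that long-context check is shown below. The config shape and the helper are illustrative assumptions of mine; the real router's logic is more involved.

import { get_encoding } from "tiktoken";

// Hypothetical shape of the routing table; the real config schema may differ.
interface RouterConfig {
  default: string;
  background: string;
  think: string;
  longContext: string;
}

const enc = get_encoding("cl100k_base");

// Count tokens across the conversation and fall back to the long-context model past 32K.
function pickModel(
  messages: Array<{ content: string }>,
  router: RouterConfig
): string {
  const tokens = messages.reduce(
    (sum, message) => sum + enc.encode(message.content).length,
    0
  );
  return tokens > 32_000 ? router.longContext : router.default;
}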

This describes the evolution and reasoning behind the project. By cleverly overriding environment variables, we can forward and modify requests without altering Claude Code's source—allowing us to benefit from official updates while using our own models and custom prompts.

This project offers a practical approach to running Claude Code under Anthropic's regional restrictions, balancing cost, performance, and customizability. That said, the official Max Plan still offers the best experience if available.