How a model file you downloaded can execute code on your AI server before you run a single prompt
A model file is supposed to be data. Weights, parameters, configuration. You download it, you load it, your AI system starts serving requests. Nothing about that workflow sounds dangerous. But CVE-2026-5760, disclosed on April 20, 2026, reveals that a GGUF model file can contain a hidden instruction set that executes arbitrary operating system commands on your server the moment any API request arrives. The attacker does not need your password. They do not need to touch your network. They just need you to download their file.
The mechanism is a template engine called Jinja2, embedded inside SGLang's reranking service. SGLang reads a metadata field from the model file and passes it directly to Jinja2 for rendering. Jinja2 is a full programming environment. Without a sandbox, it can traverse Python's object hierarchy and call any system command. The attacker puts their payload inside the model file's metadata. You load the model. The trap is set. The next API call pulls the trigger.
What makes this alarming beyond the immediate vulnerability is the pattern it represents. The same root cause, an unsandboxed Jinja2 renderer processing attacker-controlled model metadata, has now been independently discovered in three major AI inference frameworks: llama-cpp-python in 2024, vLLM in 2025, and SGLang in 2026. This is not a one-off mistake. The AI ecosystem has built a supply chain where model files are treated as inert data, but they contain fields powerful enough to run code. As of the disclosure date, SGLang's maintainers had not responded to CERT/CC's coordination attempts and no patch exists.
Narrative · 6 min read
The Context
SGLang is an open-source framework for running large language models (LLMs) in production. Think of it as the engine that sits between your application and the AI model, handling the high-volume, high-speed serving of requests. It is one of the most widely deployed tools of its kind: over 26,100 GitHub stars, more than 5,500 forks, deployed on over 400,000 GPUs worldwide, and used in production by xAI, LinkedIn, NVIDIA, AMD, Intel, Oracle Cloud, Google Cloud, Microsoft Azure, and AWS, among others. It generates trillions of tokens daily.
The specific feature at risk is SGLang's reranking endpoint, /v1/rerank. Reranking is a technique used in AI search and retrieval systems: you retrieve a set of candidate results, then run a second model to re-score and reorder them by relevance. It is a common pattern in enterprise AI applications. The model format involved is GGUF, a binary file format widely used to distribute and share AI models, particularly on public repositories like Hugging Face.
The Attack, Phase by Phase
Phase 1: Weaponizing the Model File
The attacker starts with a standard GGUF model file and modifies one metadata field: tokenizer.chat_template. This field is meant to store a formatting template, a pattern for how conversations should be structured before being sent to the model. The attacker replaces that innocent template with two things: a Jinja2 payload designed to execute operating system commands, and a specific trigger phrase required to activate the vulnerable code path.
The trigger phrase is: "The answer can only be 'yes' or 'no'." SGLang uses this phrase internally to detect when a Qwen3-style reranker model is loaded, which routes the request through the vulnerable rendering function. Without this phrase, the payload sits dormant. With it, the trap is armed.
The attacker uploads this file to a public model repository like Hugging Face, where it looks identical to any other GGUF model file.
Phase 2: Delivery via Trusted Supply Chain
A system administrator or an automated deployment pipeline downloads the poisoned model. This is the normal workflow for loading a new model into SGLang. There is no prompt, no warning, no verification step. The model loads. The malicious tokenizer.chat_template is now resident in the running SGLang process.
The attacker has no further interaction with the target. They do not need credentials. They do not need network access. The payload is already inside.
Phase 3: Trigger and Code Execution
When any API request reaches the /v1/rerank endpoint, SGLang reads the tokenizer.chat_template from the loaded model and passes it to the Jinja2 rendering function in serving_rerank.py. The trigger phrase activates the Qwen3 detection logic. The server-side template injection (SSTI) payload executes.
The payload uses a known Jinja2 escape technique to break out of the template context and call Python's operating system interface directly. Because SGLang's Jinja2 environment has no sandbox, there is nothing to stop it. The payload runs with the full privileges of the SGLang service process.
Phase 4: Post-Exploitation
With arbitrary code execution on the server, the attacker's options are broad. They can exfiltrate sensitive data processed by the inference server, including model inputs and outputs that may contain proprietary or personal information. They can install persistent backdoors that survive restarts. They can move laterally to other systems on the same network. Or they can destroy the service entirely.
SGLang deployments typically run with elevated privileges and handle sensitive workloads. The blast radius of a successful exploit reflects that.
What Made This Possible
- Model files were never treated as untrusted input. The GGUF format includes metadata fields that are rendered by a full programming environment. No verification, sanitization, or sandboxing was applied to that metadata before rendering. The assumption was that model files are data, not code.
- The safe alternative was available and not used. Jinja2 ships with a class called ImmutableSandboxedEnvironment, specifically designed to prevent template payloads from accessing dangerous Python functions. SGLang's _get_jinja_env() function used the standard, unrestricted jinja2.Environment() instead. One import, one class name, different outcome.
- This mistake has been made before, repeatedly. The identical root cause was discovered in llama-cpp-python in May 2024 (CVE-2024-34359, CVSS 9.7) and in vLLM in October 2025 (CVE-2025-61620). Three independent teams building AI inference infrastructure made the same insecure default choice. That is a pattern, not a coincidence.
The AI ecosystem built a supply chain where downloading a model file is treated with less scrutiny than downloading a software package, even though model files can now execute code.
What Should Have Stopped This
None of the defenses below depends on the model file being trustworthy. That is the unifying principle: every effective control assumes the artifact is hostile and limits what it can do regardless.
- Sandboxed template rendering. Replacing jinja2.Environment() with ImmutableSandboxedEnvironment in _get_jinja_env() would have blocked the payload from accessing Python's OS interface. This is the fix CERT/CC recommended. It requires changing one line of code.
- Cryptographic model verification. Verifying a model file's cryptographic signature before loading it would prevent a tampered or malicious file from reaching the renderer. Hugging Face supports model card hashes; automated pipelines should enforce them.
- Network isolation for inference servers. If the /v1/rerank endpoint is not reachable from untrusted networks, the attack's trigger cannot be pulled by an external caller. Inference servers should not be internet-facing without explicit justification.
- Least-privilege service accounts. Running the SGLang process with the minimum permissions needed limits what an attacker can do after code execution. If the process cannot write to sensitive directories or reach other network segments, the post-exploitation options shrink.
The Takeaway
The AI model supply chain has the same trust problem that the software supply chain had a decade ago, before the industry learned to treat third-party packages as potential attack vectors. Model files downloaded from public repositories are not inert. They contain metadata fields rendered by powerful template engines, and those engines can execute code if they are not sandboxed. The practice of downloading GGUF files without verification has become an unauthenticated remote code execution vector at industrial scale.
This is the same class of failure as the Axios supply chain attack: a trusted artifact from a public repository carries a payload that executes inside the victim's environment. The delivery mechanism is different (a model file instead of an npm package), but the trust collapse is identical. The ecosystem assumed the artifact was safe because it came from a familiar place.
Pattern to remember: Any system that renders attacker-controlled content through a Turing-complete engine without a sandbox is a remote code execution vulnerability waiting to be discovered.
What changed: Model files have joined executable code as a delivery mechanism for remote compromise, meaning the act of downloading a model now carries the same risk profile as running an installer from an unverified source.
Technical Deep Dive · 3 min
The Technical Mechanism
The vulnerability is located in python/sglang/srt/entrypoints/openai/serving_rerank.py, lines 128-132. The function _get_jinja_env() instantiates a jinja2.Environment() with autoescape=False and no sandbox class specified. This returns a fully permissive Jinja2 environment with access to Python's global object hierarchy.
When a request arrives at /v1/rerank, SGLang's _render_jinja_chat_template() function reads the tokenizer.chat_template field from the loaded model and passes it to this environment for rendering. The Qwen3 reranker detection logic is triggered by the presence of the string 'The answer can only be "yes" or "no"' in the template, which routes execution through the vulnerable rendering path.
The SSTI payload uses the lipsum.__globals__["os"].popen() escape chain, a well-documented template-injection technique that traverses Python's object graph from a globally accessible Jinja2 variable (lipsum) to the os module, then calls popen() to execute arbitrary shell commands. Because no sandbox is present, no attribute access restrictions block this traversal.
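The traversal relies on ordinary Python behavior: every function object carries a reference to the global namespace of the module that defined it. A minimal sketch of that idea in plain Python, with no Jinja2 involved. Here placeholder_text is a hypothetical stand-in for a template global such as lipsum, and the example only checks identity rather than calling anything on the module it reaches.

```python
import os  # the defining module happens to import os, as many utility modules do


def placeholder_text() -> str:
    """Hypothetical stand-in for a helper exposed to templates (like lipsum)."""
    return "lorem ipsum"


# Every function object exposes the global namespace of its defining module.
module_globals = placeholder_text.__globals__

# Anything that module imported is reachable by name from that dictionary.
reached = module_globals["os"]

# The object reached this way is the real os module, not a copy.
same_module = reached is os
print(same_module)
```

A template engine that lets expressions read underscore-prefixed attributes exposes this same path to whoever controls the template text.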
The fix is a one-line change: replace jinja2.Environment() with jinja2.sandbox.ImmutableSandboxedEnvironment(), which blocks access to dangerous attributes and prevents the object graph traversal that makes the exploit possible.
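The difference between the two environment classes can be seen with a harmless probe. The sketch below is an illustration under stated assumptions, not SGLang's code: PROBE and render_probe are hypothetical names, and the probe follows the same lipsum.__globals__ path as the exploit chain but only reads os.sep instead of calling popen().

```python
import os  # used only to check what the unsandboxed render returned

from jinja2 import Environment
from jinja2.exceptions import SecurityError
from jinja2.sandbox import ImmutableSandboxedEnvironment

# Harmless probe: walks to the os module, then reads the path separator.
PROBE = '{{ lipsum.__globals__["os"].sep }}'


def render_probe(env: Environment) -> str:
    """Render PROBE in env; return the output, or 'blocked' on SecurityError."""
    try:
        return env.from_string(PROBE).render()
    except SecurityError:
        return "blocked"


# Standard environment: the attribute walk succeeds and reaches os.
unsafe_result = render_probe(Environment())

# Sandboxed environment: access to the underscore attribute is refused.
safe_result = render_probe(ImmutableSandboxedEnvironment())

print(unsafe_result == os.sep, safe_result)
```

The sandbox treats attributes that start with an underscore as unsafe, so the walk stops at __globals__ before it can reach any module.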
The vulnerability is classified under:
- CWE-1336: Improper Neutralization of Special Elements Used in a Template Engine
- CWE-94: Improper Control of Generation of Code (Code Injection)
CVE and Advisories
- CVE-2026-5760: CVSS 9.8 (AV:N/AC:L/PR:N/UI:R/S:U/C:H/I:H/A:H). Affects SGLang v0.5.9. No patch as of April 20, 2026.
- CERT/CC Advisory VU#915947: Published April 20, 2026. Authored by Christopher Cullen. No vendor response received during coordination.
- Related: CVE-2024-34359 ("Llama Drama"), CVSS 9.7, same vulnerability class in llama-cpp-python. Disclosed May 2024.
- Related: CVE-2025-61620, CVSS 6.5, Jinja2 template abuse in vLLM's OpenAI-compatible server. Patched in vLLM 0.11.0, October 2025.
MITRE ATT&CK Mapping
| Technique ID | ATT&CK name | How it appeared |
|---|---|---|
| T1195.001 | Supply Chain Compromise: Compromise Software Dependencies and Development Tools | Attacker distributes a malicious GGUF model file through a public repository, compromising the victim's environment at the model-loading stage. |
| T1059.006 | Command and Scripting Interpreter: Python | The SSTI payload uses Python's object hierarchy via Jinja2 to invoke os.popen() and execute arbitrary shell commands. |
| T1190 | Exploit Public-Facing Application | The /v1/rerank endpoint is the trigger surface. Any request to this endpoint while the malicious model is loaded causes code execution. |
| T1548 | Abuse Elevation Control Mechanism | Code executes with the full privileges of the SGLang service process, which in many deployments runs with elevated or root-equivalent permissions. |
| T1041 | Exfiltration Over C2 Channel | Post-exploitation, the attacker can exfiltrate inference data, model inputs and outputs, or host credentials via the established code execution channel. |
Indicators of Compromise
Detection is difficult because the malicious content is embedded in a binary model file and the exploit is triggered by normal API traffic. No network-level indicators distinguish a malicious /v1/rerank request from a legitimate one.
Potential detection approaches:
Model File Integrity
Compare SHA-256 hashes of loaded GGUF files against known-good values from the original source. Any mismatch warrants investigation.
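A minimal sketch of that check, assuming the known-good hash has been recorded out of band. The function names sha256_of_file and matches_pinned_hash are hypothetical, and the file is streamed in chunks because GGUF files are often many gigabytes.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pinned_hash(path: Path, expected_hex: str) -> bool:
    """Return True only if the file's digest equals the pinned value."""
    actual = sha256_of_file(path)
    # Normalize case and whitespace of the pinned value before comparing.
    return hmac.compare_digest(actual, expected_hex.strip().lower())
```

A deployment pipeline could call matches_pinned_hash() before handing the path to the inference server and refuse to start on a mismatch. The check only proves the file is the one that was pinned, so the pinned hash itself must come from a source that has been reviewed.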
Template Content Inspection
Audit the tokenizer.chat_template field of all loaded GGUF models for Jinja2 expressions containing __globals__, __class__, __mro__, popen, subprocess, or os.system.
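One way to run that audit without loading the model is to read the metadata block directly. The sketch below assumes the GGUF version 2 or 3 layout in little-endian byte order (magic, version, tensor count, key/value count, then typed key/value pairs). All function names are hypothetical, and the substring list is taken from the guidance above.

```python
import struct
from typing import BinaryIO, List, Optional

# GGUF scalar type ids mapped to little-endian struct formats.
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9
_MAX_STRING = 1 << 26  # refuse absurd lengths; the file is untrusted input

# Substrings named in the audit guidance above.
SUSPICIOUS = ("__globals__", "__class__", "__mro__",
              "popen", "subprocess", "os.system")


def _unpack(f: BinaryIO, fmt: str):
    """Read exactly one fixed-size value described by a struct format."""
    size = struct.calcsize(fmt)
    data = f.read(size)
    if len(data) != size:
        raise ValueError("truncated GGUF file")
    return struct.unpack(fmt, data)[0]


def _read_string(f: BinaryIO) -> str:
    """Read a length-prefixed UTF-8 string (uint64 length, then bytes)."""
    length = _unpack(f, "<Q")
    if length > _MAX_STRING:
        raise ValueError("implausible string length in GGUF metadata")
    data = f.read(length)
    if len(data) != length:
        raise ValueError("truncated GGUF file")
    return data.decode("utf-8", errors="replace")


def _read_value(f: BinaryIO, vtype: int):
    """Read one metadata value of the given GGUF type id."""
    if vtype in _SCALARS:
        return _unpack(f, _SCALARS[vtype])
    if vtype == _STRING:
        return _read_string(f)
    if vtype == _ARRAY:
        elem_type = _unpack(f, "<I")
        count = _unpack(f, "<Q")
        return [_read_value(f, elem_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")


def read_chat_template(f: BinaryIO) -> Optional[str]:
    """Return tokenizer.chat_template from GGUF metadata, or None if absent."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    if _unpack(f, "<I") not in (2, 3):
        raise ValueError("unsupported GGUF version")
    _unpack(f, "<Q")  # tensor count, not needed here
    for _ in range(_unpack(f, "<Q")):  # metadata key/value count
        key = _read_string(f)
        value = _read_value(f, _unpack(f, "<I"))
        if key == "tokenizer.chat_template":
            return value if isinstance(value, str) else None
    return None


def suspicious_tokens(template: str) -> List[str]:
    """Return the flagged substrings that occur in the template text."""
    return [token for token in SUSPICIOUS if token in template]
```

Opening a downloaded file in binary mode and passing the handle to read_chat_template(), then the result to suspicious_tokens(), gives a quick triage signal. Substring matching is coarse: a template can assemble attribute names from string fragments, so a hit is a reason to investigate and an empty result is not proof that the file is safe.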
Process Behavior
Monitor the SGLang service process for unexpected child process spawning, outbound network connections to unknown hosts, or file writes outside expected directories.
Log Anomalies
Unexpected errors or unusual output from the /v1/rerank endpoint may indicate payload execution, though a well-crafted payload may suppress visible errors.
A public proof-of-concept exploit is available at github.com/Stuub/SGLang-0.5.9-RCE, which accepts arbitrary shell commands and generates a ready-to-deploy malicious GGUF file. The existence of this proof-of-concept significantly lowers the barrier to exploitation.
Attribution
This is a vulnerability disclosure, not a tracked intrusion campaign. The vulnerability was discovered and responsibly reported to CERT/CC by security researcher Stuart Beck (GitHub: Stuub), who also authored the public proof-of-concept. CERT/CC coordinated the disclosure under sponsorship from CISA. No threat actor group, nation-state, or criminal organization has been attributed to active exploitation of this vulnerability. No evidence of in-the-wild exploitation has been reported as of the disclosure date.
Primary Sources
- 01. VU#915947 - SGLang is vulnerable to remote code execution when rendering chat templates from a model file
CERT Coordination Center (CERT/CC) · April 20, 2026
- 02. SGLang-0.5.9-RCE: Proof of Concept exploitation of CVE-2026-5760
Stuart Beck (Stuub), independent security researcher · April 20, 2026
- 03. SGLang CVE-2026-5760 (CVSS 9.8) Enables RCE via Malicious GGUF Model Files
The Hacker News · April 20, 2026
- 04. Malicious GGUF Models Could Trigger Remote Code Execution on SGLang Servers
GBHackers · April 20, 2026
- 05. Hackers Could Weaponize GGUF Models to Achieve RCE on SGLang Inference Servers
CyberPress · April 20, 2026
- 06. SGLang GitHub Repository (sgl-project/sglang)
LMSYS / SGLang Project · April 2026