On May 5, 2026, cybersecurity researchers at Cyera disclosed a critical vulnerability in Ollama that allows unauthenticated attackers to leak the entire process memory of exposed servers. The flaw, tracked as CVE-2026-7482 and dubbed "Bleeding Llama," affects approximately 300,000 publicly accessible Ollama instances worldwide and requires no credentials to exploit—just three API calls. For healthcare organizations that adopted Ollama specifically to avoid sending protected health information to cloud-based AI services, this represents a catastrophic failure of the privacy model that justified local deployment in the first place.
What Ollama Is and Why Healthcare Adopted It
Ollama is an open-source framework that enables organizations to run large language models locally on their own hardware instead of relying on cloud services like OpenAI, Anthropic, or Google. With over 170,000 GitHub stars, 100 million Docker Hub downloads, and widespread enterprise adoption, Ollama has become the de facto standard for local AI inference. The value proposition is straightforward: keep sensitive data on-premise, avoid per-token API costs, maintain full control over model versions, and eliminate dependency on external providers.

Healthcare organizations embraced Ollama for data sovereignty. Running models locally means patient data never leaves the organization's network, avoiding the compliance complexity of Business Associate Agreements, data residency requirements, and third-party access to protected health information. A hospital using Ollama for clinical documentation, discharge summary generation, or diagnostic coding assistance can process patient records without transmitting PHI to cloud APIs. This aligned perfectly with HIPAA Security Rule requirements and CMS data protection expectations.
The irony is brutal: the tool adopted to prevent cloud data leakage is now leaking local data to anyone on the internet.
The Technical Mechanics of Bleeding Llama
The vulnerability exists in Ollama's GGUF model loader, the component responsible for processing model files in GPT-Generated Unified Format. GGUF files package model weights, metadata, and tokenizer information for local inference. Ollama accepts GGUF files through its /api/create endpoint, which allows users to build custom models by uploading files or pulling them from the Ollama registry.

The flaw is a heap out-of-bounds read triggered during model quantization. When Ollama processes a GGUF file, it reads tensor metadata that declares the tensor's shape (dimensions), offset (where the tensor data starts in the file), and size (how much data to read). The vulnerability stems from insufficient validation: Ollama does not verify that the declared tensor dimensions match the actual file size. An attacker can craft a GGUF file with an inflated tensor shape—claiming, for example, that a tensor contains millions of elements when the file actually contains only a few kilobytes.
During quantization, Ollama uses Go's unsafe package to perform low-level memory operations that bypass the language's memory safety guarantees. The WriteTo() function in fs/ggml/gguf.go and server/quantization.go reads tensor data based on the attacker-supplied dimensions, not the actual file boundaries. This triggers an out-of-bounds heap read that captures whatever happens to be in adjacent memory—environment variables, API keys, user prompts, system prompts, and conversation data from concurrent users.
The leaked memory is not discarded. Instead, it gets preserved in the newly created model artifact. The attacker then uses the /api/push endpoint to upload this artifact to an attacker-controlled registry, exfiltrating the heap memory contents. The entire attack requires three HTTP requests: one to upload the crafted GGUF file, one to trigger model creation, and one to push the resulting artifact to an external server. No authentication is required at any step.
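The actual loader code is more involved, but the vulnerable pattern reduces to trusting a declared size over a measured one. A minimal Go sketch of that pattern and of the class of bounds check the patch adds (the types and names here are illustrative, not Ollama's source):

```go
package main

import (
	"errors"
	"fmt"
)

// tensorInfo mirrors the kind of metadata a GGUF header declares: the
// loader is told where a tensor starts and how many bytes it spans.
type tensorInfo struct {
	Name   string
	Offset uint64 // where the tensor data begins in the file
	Size   uint64 // declared by the file, i.e. attacker-controlled
}

// vulnerableRead trusts the declared size. With safe Go slices this
// would panic at runtime; the real loader's use of the unsafe package
// is what turns an oversized Size into a silent read of adjacent heap.
func vulnerableRead(file []byte, t tensorInfo) []byte {
	return file[t.Offset : t.Offset+t.Size] // no bounds validation
}

// patchedRead shows the class of check the fix adds: declared offset
// and size must fit inside the actual file.
func patchedRead(file []byte, t tensorInfo) ([]byte, error) {
	end := t.Offset + t.Size
	if end < t.Offset || end > uint64(len(file)) { // overflow + bounds
		return nil, errors.New("tensor metadata exceeds file bounds")
	}
	return file[t.Offset:end], nil
}

func main() {
	file := make([]byte, 1024) // a few kilobytes on disk...
	// ...declaring a megabyte-sized tensor, as a crafted GGUF would.
	crafted := tensorInfo{Name: "blk.0.attn_q", Offset: 0, Size: 1 << 20}
	if _, err := patchedRead(file, crafted); err != nil {
		fmt.Println("rejected:", err) // inflated shape caught before any read
	}
}
```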
What Gets Leaked and Why It Matters
Heap memory in a running Ollama process contains everything the application has touched during its lifetime. This includes:

Environment variables: API keys for cloud services, database credentials, service account tokens, and configuration secrets. Organizations often inject sensitive credentials into container environments via environment variables, and Ollama's heap memory exposes all of them.
User prompts and conversations: Every query sent to Ollama, including patient names, diagnoses, medication lists, lab results, and clinical notes. If a healthcare organization is using Ollama to generate discharge summaries or assist with clinical coding, the prompts contain verbatim PHI.
System prompts: Instructions that define how the model behaves, which often include proprietary business logic, clinical decision rules, or internal policies. These are intellectual property, and their exposure enables competitors to reverse-engineer the organization's AI workflows.
Tool outputs: When Ollama is integrated with tools like Claude Code, the outputs from those tools flow through Ollama's memory. This can include source code, database queries, infrastructure configurations, and anything else the tools generate.
The scope of exposure is not limited to a single user or session. Because heap memory is shared across the process, an attacker exploiting Bleeding Llama can leak data from concurrent users. If five clinicians are simultaneously using an Ollama instance to generate documentation, the attacker's memory dump may contain prompts and outputs from all five, multiplying the PHI exposure.
The Attack Surface Is Larger Than You Think
Cyera's research identified approximately 300,000 Ollama servers exposed on the public internet. This is not a misconfiguration outlier; it is a consequence of Ollama's design and documentation. Ollama binds to localhost (127.0.0.1) by default, meaning it only accepts connections from the local machine. However, the documentation instructs users who want to access Ollama from other machines to set OLLAMA_HOST=0.0.0.0, which tells Ollama to listen on all network interfaces.

The documentation does not warn that this configuration exposes Ollama to the entire network, including the internet if the host has a public IP or port forwarding enabled. Organizations deploying Ollama in cloud environments (AWS EC2, Azure VMs, Google Cloud Compute) often configure it to listen on 0.0.0.0 so that application servers can reach the inference endpoint. If the security group or firewall rules are not properly scoped, this results in a publicly accessible Ollama instance.
Ollama does not provide authentication by default. The REST API accepts any connection from any source without requiring credentials. This design assumes the service is running on a trusted network or localhost. When OLLAMA_HOST is set to 0.0.0.0 and the instance is internet-facing, any attacker can interact with the full API surface—uploading models, creating new model instances, pushing artifacts, and executing any other Ollama command.
The combination of unauthenticated access and the Bleeding Llama vulnerability means that an attacker can:
1. Scan the internet for exposed Ollama instances (port 11434 by default)
2. Identify live instances by querying the /api/tags endpoint
3. Upload a crafted GGUF file via /api/create
4. Trigger the memory leak and exfiltrate the resulting model via /api/push
This sequence takes minutes. Shodan, Censys, and similar internet scanning platforms make it trivial to enumerate exposed Ollama instances, and the attack itself is simple enough to automate at scale.
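Defenders can run the first two of those steps as an audit of their own estate. A minimal Go sketch that probes a host inventory against the unauthenticated /api/tags endpoint (the host list is a placeholder; probe only systems you are authorized to test):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Hosts you operate or are authorized to test; placeholders here.
	hosts := []string{"10.0.1.15", "10.0.2.40"}
	client := &http.Client{Timeout: 3 * time.Second}

	for _, h := range hosts {
		// /api/tags lists installed models and requires no credentials,
		// which makes it a reliable liveness probe for exposed instances.
		resp, err := client.Get(fmt.Sprintf("http://%s:11434/api/tags", h))
		if err != nil {
			fmt.Printf("%s: no Ollama answer (%v)\n", h, err)
			continue
		}
		resp.Body.Close()
		if resp.StatusCode == http.StatusOK {
			fmt.Printf("%s: EXPOSED - unauthenticated Ollama API\n", h)
		} else {
			fmt.Printf("%s: HTTP %d (auth or proxy in front?)\n", h, resp.StatusCode)
		}
	}
}
```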
Healthcare-Specific Implications
For healthcare organizations, Bleeding Llama represents a worst-case scenario: the mitigation deployed to protect patient privacy is actively undermining it. The vulnerability creates several compliance and operational risks:

HIPAA Breach Notification
If an attacker exploits Bleeding Llama to exfiltrate prompts containing PHI, the healthcare organization must determine whether the exposure constitutes a breach requiring notification under HIPAA. The analysis hinges on whether the data was encrypted in a manner that renders it unusable, and heap memory dumps are plaintext. If the leaked data includes patient names, diagnoses, treatment details, or other identifiers, the organization has 60 days to notify affected individuals, HHS, and potentially the media if more than 500 people are impacted.

The complication is that heap memory is a mixed bag. Not every byte in the dump is PHI, and parsing the exfiltrated data to identify which patients are affected is non-trivial. Heap dumps contain fragmented text, binary data, and garbage. Reconstructing coherent patient records from this noise is difficult, but regulators will not accept "we couldn't figure out who was exposed" as a defense. The organization needs to demonstrate a good-faith effort to identify affected individuals.
Business Associate Agreements and Third-Party Risk
Many healthcare organizations use Ollama through third-party integrations or vendor-provided platforms. If the Ollama instance is operated by a Business Associate (a vendor providing services on behalf of the covered entity), the vendor's failure to patch the vulnerability or properly secure the instance may constitute a breach of the BAA. Covered entities are required to obtain satisfactory assurances that BAs will safeguard PHI, and allowing an unauthenticated memory leak vulnerability to persist on an internet-facing server is not satisfactory.

Covered entities should review BAA terms to determine whether the vendor is contractually obligated to patch known vulnerabilities within a specific timeframe, and whether the vendor has liability for breaches resulting from unpatched vulnerabilities. If the BAA is silent on these points, the covered entity may have limited recourse.
Incident Response and Forensics Challenges
Determining whether an Ollama instance was exploited is difficult. The attack leaves minimal forensic traces. The attacker uploads a file, triggers a model creation job, and pushes the result to an external server. Standard web server logs will record HTTP requests to /api/create and /api/push, but these are legitimate API endpoints used by normal operations. Distinguishing malicious exploitation from legitimate use requires inspecting the content of the uploaded GGUF files and the pushed model artifacts, which may no longer be accessible if the attacker deleted them after exfiltration.

Network traffic analysis can help. The /api/push endpoint sends data to an external registry, and outbound connections to unexpected domains may indicate exfiltration. However, if the attacker uses a domain that appears legitimate (a public Ollama registry, a Docker Hub account, a GitHub repository), the traffic may not trigger alerts.
The absence of evidence of exploitation is not evidence of absence. Organizations with internet-exposed Ollama instances should assume compromise and act accordingly: threat hunting for indicators, reviewing access logs for anomalous /api/create and /api/push activity, and performing a breach risk assessment even if no definitive proof of exploitation exists.
Patching Is Not Sufficient
Ollama version 0.17.1, released on May 2, 2026, patches CVE-2026-7482 by adding validation to ensure tensor metadata does not exceed file boundaries. Organizations running Ollama should upgrade immediately. However, patching alone does not address the broader security architecture issues that allowed this vulnerability to be broadly exploitable:

No Authentication by Default
Even with Bleeding Llama patched, Ollama's default configuration accepts unauthenticated connections. Any future vulnerability in the API surface—another memory leak, a code execution flaw, a denial-of-service bug—will be remotely exploitable without credentials. Healthcare organizations must implement authentication in front of Ollama. This can be done via the options below (a proxy sketch follows the list):

Reverse proxy with authentication: Deploy Nginx, Caddy, or Traefik in front of Ollama and require HTTP Basic Auth, API key validation, or OAuth tokens. The proxy handles authentication, and Ollama never sees unauthenticated requests.
VPN or network isolation: Ensure Ollama is accessible only from trusted networks via VPN, and never expose it directly to the internet. If remote access is required, use a VPN that enforces device compliance and multi-factor authentication.
Firewall rules: Restrict inbound connections to specific IP ranges or security groups. If Ollama only needs to serve requests from application servers within a VPC, configure the firewall to deny all external traffic.
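As a concrete sketch of the first option, here is a minimal Go reverse proxy that rejects any request lacking a shared API key before it reaches Ollama. Nginx or Caddy achieve the same declaratively; the X-Api-Key header name and PROXY_API_KEY variable are illustrative choices, not a standard:

```go
package main

import (
	"crypto/subtle"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"os"
)

func main() {
	// Ollama stays bound to 127.0.0.1; only this proxy is network-reachable.
	upstream, err := url.Parse("http://127.0.0.1:11434")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(upstream)

	apiKey := os.Getenv("PROXY_API_KEY") // inject from a secret manager in practice
	if apiKey == "" {
		log.Fatal("PROXY_API_KEY must be set")
	}

	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Constant-time comparison avoids leaking the key through timing.
		got := r.Header.Get("X-Api-Key")
		if subtle.ConstantTimeCompare([]byte(got), []byte(apiKey)) != 1 {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		proxy.ServeHTTP(w, r)
	})

	// TLS termination omitted for brevity; use ListenAndServeTLS in production.
	log.Fatal(http.ListenAndServe(":8443", handler))
}
```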
Network Exposure Without Warnings
Ollama's documentation should explicitly warn that setting OLLAMA_HOST=0.0.0.0 exposes the service to all network interfaces, and that this configuration requires additional security controls. The default behavior (binding to localhost) is secure, but the documented pattern for enabling remote access bypasses all network-level protections unless the administrator explicitly adds them. This is a documentation and design issue that needs upstream fixes beyond individual deployments.

Monitoring and Logging Gaps
Most Ollama deployments do not have comprehensive logging enabled. The default configuration logs to stdout, which may not be persisted or forwarded to a SIEM. Organizations should configure structured logging with fields for:

- Source IP address
- API endpoint called
- File names and sizes for /api/create and /api/push
- Model names and registries for push operations
- Errors and exceptions
Forward logs to a centralized system where they can be correlated with network traffic, firewall logs, and threat intelligence feeds. Alert on anomalous patterns: unusually large /api/push operations, connections from unexpected geographies, or repeated /api/create calls from a single source.
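A proxy layer like the one sketched earlier is the natural place to emit these fields. A minimal example using Go's standard log/slog package (the field names are suggestions, not an Ollama or SIEM convention):

```go
package main

import (
	"log"
	"log/slog"
	"net/http"
	"os"
)

// auditLog wraps any handler (for example, the reverse proxy sketched
// above) and records fields a SIEM can correlate with network and
// firewall logs. File and model names travel in the JSON request body;
// logging them requires buffering the body, omitted here for brevity.
func auditLog(next http.Handler) http.Handler {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		logger.Info("ollama_api_request",
			"source_ip", r.RemoteAddr,
			"method", r.Method,
			"endpoint", r.URL.Path,
			"content_length", r.ContentLength, // large /api/create uploads stand out here
			"user_agent", r.UserAgent(),
		)
		next.ServeHTTP(w, r)
	})
}

func main() {
	// Demo wiring: log every request, then answer 200.
	h := auditLog(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	log.Fatal(http.ListenAndServe(":8080", h))
}
```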
The Two Additional Windows Vulnerabilities
While Bleeding Llama affects all platforms, Ollama for Windows has two additional vulnerabilities that remain unpatched as of May 10, 2026. These flaws, disclosed by researchers at Striga on January 27, 2026, enable persistent code execution when chained together:

CVE-2026-42248 (CVSS 7.7): Missing signature verification. Ollama for Windows does not verify the signature of update binaries before installation, unlike the macOS version which performs cryptographic validation. This allows an attacker who controls the update response to supply an arbitrary executable that Ollama will install without integrity checks.
CVE-2026-42249 (CVSS 7.7): Path traversal in the updater. The Windows updater constructs the local path for the installer's staging directory directly from HTTP response headers without sanitizing input. An attacker can inject path traversal sequences (e.g., ../../) to write the update binary to arbitrary locations, including the Windows Startup folder.
The attack chain works as follows: override the OLLAMA_UPDATE_URL environment variable to point to an attacker-controlled server, wait for Ollama's automatic update check, serve a malicious installer with HTTP headers containing a path traversal payload, and cause the installer to be written to the Startup folder. On the next login, Windows executes the attacker's payload, achieving persistent code execution.
This attack requires control over the update response. Because the update check uses plain HTTP by default, that control can be achieved via DNS poisoning, on-path (man-in-the-middle) tampering with the unencrypted update traffic, or exploiting trust relationships if the attacker has already compromised a system that can influence DNS or proxy settings.
Ollama has not released patches for CVE-2026-42248 or CVE-2026-42249. CERT Polska, which coordinated disclosure, recommends that Windows users disable automatic updates and remove the Ollama shortcut from the Startup folder (%APPDATA%\Microsoft\Windows\Start Menu\Programs\Startup) to prevent silent execution on login. This is a partial mitigation that blocks the persistence mechanism but does not address the underlying signature and path traversal vulnerabilities.
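The path traversal half of the eventual fix is a well-understood containment check: canonicalize the attacker-influenced name and verify the result stays inside the staging directory. A hedged Go sketch of the pattern (illustrative, not the Ollama updater's code):

```go
package main

import (
	"errors"
	"fmt"
	"path/filepath"
	"strings"
)

// safeJoin anchors an attacker-influenced name under baseDir:
// Clean("/"+name) strips ../ sequences, and the prefix check is a
// second line of defense in case canonicalization ever misbehaves.
func safeJoin(baseDir, name string) (string, error) {
	joined := filepath.Join(baseDir, filepath.Clean("/"+name))
	abs, err := filepath.Abs(joined)
	if err != nil {
		return "", err
	}
	base, err := filepath.Abs(baseDir)
	if err != nil {
		return "", err
	}
	if !strings.HasPrefix(abs, base+string(filepath.Separator)) {
		return "", errors.New("path escapes staging directory")
	}
	return abs, nil
}

func main() {
	// A header-supplied name carrying a traversal payload, as in the advisory.
	p, err := safeJoin("/tmp/ollama-staging", "../../Users/victim/Startup/evil.exe")
	fmt.Println(p, err) // resolves inside the staging directory; traversal neutralized
}
```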
What Healthcare Security Teams Should Do Now
Organizations running Ollama need to take immediate action:

Audit All Ollama Deployments
Identify every Ollama instance in the environment—on developer workstations, application servers, research environments, cloud VMs, and containerized deployments. For each instance (a version-check sketch follows this list):

- Verify the version. If < 0.17.1, upgrade immediately.
- Check the OLLAMA_HOST configuration. If set to 0.0.0.0 or a non-localhost IP, determine whether the instance is internet-accessible.
- Review firewall rules and security groups. Ensure Ollama is not exposed to the public internet unless absolutely necessary and protected by authentication.
- Examine network logs for inbound connections to port 11434 from unexpected sources.
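The version check can be scripted against Ollama's /api/version endpoint, which returns a small JSON document with a version field. A minimal Go sketch (the host inventory is a placeholder):

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Inventory of instances to audit; placeholder addresses.
	hosts := []string{"10.0.1.15:11434", "ml-server.internal:11434"}
	client := &http.Client{Timeout: 3 * time.Second}

	for _, h := range hosts {
		resp, err := client.Get(fmt.Sprintf("http://%s/api/version", h))
		if err != nil {
			fmt.Printf("%s: unreachable (%v)\n", h, err)
			continue
		}
		var v struct {
			Version string `json:"version"`
		}
		if err := json.NewDecoder(resp.Body).Decode(&v); err != nil {
			fmt.Printf("%s: unexpected response (%v)\n", h, err)
		} else {
			// Anything below the patched release is an immediate upgrade.
			fmt.Printf("%s: version %s (CVE-2026-7482 patched in 0.17.1)\n", h, v.Version)
		}
		resp.Body.Close()
	}
}
```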
Implement Authentication
Do not rely on Ollama's default unauthenticated configuration. Deploy a reverse proxy or API gateway in front of every Ollama instance that accepts remote connections, and enforce authentication. This is not optional even after patching; future vulnerabilities will be remotely exploitable without authentication.

Perform Breach Risk Assessment
For any Ollama instance that was internet-accessible prior to patching, perform a breach risk assessment (a log-triage sketch follows this list):

- Search access logs for /api/create and /api/push requests from external IPs.
- Investigate outbound connections to unfamiliar registries or domains.
- Analyze heap memory contents (if accessible via crash dumps or forensic snapshots) for PHI.
- If exploitation is confirmed or cannot be ruled out, engage legal and compliance teams to determine HIPAA notification obligations.
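The first two checks can be partially mechanized. A minimal Go sketch that flags /api/create and /api/push requests from non-private source addresses, assuming a simple space-delimited log format (hypothetical; adapt the parsing to your proxy's actual format):

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"os"
	"strings"
)

func main() {
	// Assumed format, one request per line: "<source-ip> <method> <path>".
	// Real proxy logs differ; adjust the field positions accordingly.
	f, err := os.Open("ollama-access.log")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) < 3 {
			continue
		}
		ip := net.ParseIP(fields[0])
		path := fields[2]
		suspicious := strings.HasPrefix(path, "/api/create") ||
			strings.HasPrefix(path, "/api/push")
		// External (non-RFC1918, non-loopback) sources hitting model
		// creation or push endpoints match the exploitation signature.
		if suspicious && ip != nil && !ip.IsPrivate() && !ip.IsLoopback() {
			fmt.Println("REVIEW:", scanner.Text())
		}
	}
}
```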
Disable Auto-Updates on Windows Pending Patches
For Windows deployments, disable automatic updates until Ollama releases patches for CVE-2026-42248 and CVE-2026-42249. Remove the Startup folder shortcut to prevent auto-launch on login. Monitor the Ollama GitHub repository for security releases.

Review Third-Party and Vendor Deployments
If Ollama is deployed by a Business Associate or vendor, verify that they have patched the vulnerability, implemented authentication, and secured network exposure. Request evidence of compliance (version numbers, configuration screenshots, firewall rules) and include this in ongoing vendor risk assessments.

The Broader Lesson: Local AI Is Not Inherently Secure
Bleeding Llama exposes a dangerous misconception: that local AI deployments are inherently more secure than cloud services. Healthcare organizations adopted Ollama to avoid sending PHI to Anthropic, OpenAI, or Google, assuming that keeping data on-premise eliminates third-party risk. This is only true if the local infrastructure is properly secured, and the Ollama deployment pattern (unauthenticated API, network exposure encouraged by documentation, no signature verification on Windows updates) demonstrates that "local" does not mean "safe."

Cloud AI providers implement authentication, network segmentation, signature verification, and incident response capabilities as baseline features. Local deployments shift these responsibilities to the customer, and if the customer does not implement them—either due to lack of expertise, documentation gaps, or operational shortcuts—the local deployment becomes less secure than the cloud alternative it was meant to replace.
Healthcare security teams evaluating local AI options need to apply the same rigor to local infrastructure that they would to cloud providers: authentication requirements, network isolation, patch management, logging and monitoring, and incident response procedures. Ollama is a powerful tool, but it is not secure by default, and the assumption that running it locally protects patient data is now demonstrably false.
The fix for Bleeding Llama is available, but the fix for the broader security architecture issues—unauthenticated APIs, network exposure without warnings, and insufficient default logging—requires systemic changes to how Ollama is documented, configured, and deployed. Until those changes happen, healthcare organizations using Ollama need to treat it as high-risk infrastructure requiring defense-in-depth controls.
This is entry #39 in the AI Security series. For related coverage, see Securing Local AI Infrastructure: Lessons from Ollama and AnythingLLM Deployments.
Key Links
- Original disclosure: Bleeding Llama: Critical Unauthenticated Memory Leak in Ollama
- CVE details: CVE-2026-7482
- Ollama security patch: Ollama v0.17.1 Release Notes
- Windows update vulnerabilities: Ollama Windows Auto-Update RCE
- CERT Polska advisory: CVE-2026-42248 and CVE-2026-42249
- SecurityWeek coverage: Critical Bug Could Expose 300,000 Ollama Deployments