One day after Anthropic announced that Claude Mythos was too dangerous to release publicly, security startup AISLE published research showing that models costing $0.11 per million tokens found the same vulnerabilities Anthropic used to headline the announcement. More striking: AISLE built a simple scanner in one Python file, pointed it at the FreeBSD kernel without hints, and discovered new bugs that had survived 20-26 years of human and automated review. The research challenges the narrative that only restricted frontier models threaten software security, and introduces a more nuanced reality: for defensive cybersecurity, system design matters more than model capability.
For healthcare organizations watching the Mythos announcements and wondering whether they need Project Glasswing access to defend their systems, AISLE's work provides a clear answer: the tools already exist, they're accessible today, and they cost almost nothing. The barrier to AI-augmented security has never been lower, and the advantage currently belongs to defenders who build the right systems around adequate models.
The AISLE Challenge to Mythos Framing
AISLE is a security startup that has been running AI-driven vulnerability discovery in production since mid-2025, accumulating 15 CVEs in OpenSSL, 5 in curl, and over 180 validated CVEs across 30 projects. When Anthropic announced Mythos with the message that this model was so capable it required restricted access and a coordinated industry effort, AISLE decided to test the underlying claim about exclusivity.

They took the specific vulnerabilities Anthropic showcased in the Mythos announcement, isolated the relevant code, and ran them through more than 25 models from every major AI lab. The results contradicted the exclusivity narrative: eight out of eight models detected Mythos's flagship FreeBSD vulnerability CVE-2026-4747, including GPT-OSS-20B with only 3.6 billion active parameters costing $0.11 per million tokens. An open-weights model with 5.1 billion active parameters recovered the analysis chain of a 27-year-old OpenBSD bug that headlined the Mythos announcement.
The capability gap between frontier models like Mythos and small accessible models turned out to be much smaller than Anthropic's framing suggested, at least for the defensive use case of finding vulnerabilities. AISLE's research introduced the concept of a "jagged frontier" in AI cybersecurity: model rankings reshuffle completely across different tasks, and no single model dominates everything. A model that aces one vulnerability class fails the next.
The Hypothesis: Throughput Versus Intelligence
AISLE's core insight is that the zero-day discovery production function has at least two inputs: intelligence per token and raw throughput measured as tokens per dollar or tokens per unit of time. Anthropic's Mythos maximizes the first input, presumably achieving extraordinary reasoning depth based on their track record. AISLE's question was whether enough throughput, applied systematically with modest models, could compensate for less per-token intelligence.

The analogy is simple: a single brilliant security researcher may reason more deeply about each piece of code they examine, but a much cheaper analyst can look at literally every piece of code. A thousand adequate eyes looking everywhere should find things that one brilliant eye looking selectively misses, even if each individual eye is less perceptive. The question is whether the trade works in practice when searching for real vulnerabilities in production code.
AISLE built nano-analyzer to test this hypothesis at scale. The scanner is deliberately simple: one Python file with roughly 1,700 lines, one dependency, no agentic loop beyond simple grep commands. The pipeline is embarrassingly parallel, scanning every file independently with the same generic prompts. The default model is gpt-5.4-nano, OpenAI's smallest and cheapest variant at $0.20 per million input tokens, roughly 100 times cheaper than speculative Mythos pricing.
The Three-Stage Pipeline
The nano-analyzer workflow demonstrates how system design amplifies signal from small models. Stage one generates context: a cheap model writes a security briefing for every file, covering what it does, where untrusted input enters, which buffers are fixed-size, and which parameters could be NULL. This is a single API call with grep access to the repository, providing the model with just enough orientation to understand each file's security posture.

Stage two performs vulnerability scanning: a second API call, enriched with the LLM-generated context from stage one, hunts for bugs using few-shot prompts tuned for common vulnerability classes. The prompts are currently optimized for C and C++ memory safety issues including buffer overflows, use-after-free, missing bounds checks, and integer problems. This is where most critical infrastructure attack surface lives.
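Stripped to their essentials, the first two stages are two chained model calls per file. Below is a minimal Python sketch of that structure; the prompt wording and the pluggable `ask` callable are illustrative stand-ins, not AISLE's actual code, and the grep tool access described above is omitted for brevity.

```python
# Sketch of nano-analyzer's first two stages: context generation, then
# context-enriched scanning. `ask` stands in for a chat-completion call
# to a cheap model such as gpt-5.4-nano; the prompts are illustrative.
from typing import Callable

CONTEXT_PROMPT = (
    "Write a security briefing for this file: what it does, where "
    "untrusted input enters, which buffers are fixed-size, and which "
    "parameters could be NULL.\n\n{code}"
)
SCAN_PROMPT = (
    "Using this file context:\n{context}\n\nHunt for memory-safety bugs "
    "(buffer overflows, use-after-free, missing bounds checks, integer "
    "problems) in:\n\n{code}"
)

def scan_file(code: str, ask: Callable[[str], str]) -> str:
    """Stage one: generate a briefing. Stage two: scan with it."""
    context = ask(CONTEXT_PROMPT.format(code=code))
    return ask(SCAN_PROMPT.format(context=context, code=code))
```

Because the model call is injected as a plain callable, the same skeleton works against any provider, or against a local model, without changing the pipeline logic.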
Stage three applies skeptical triage: each finding is reviewed in multiple rounds with grep access to the full repository. An arbiter model makes the final call on which findings survive. This filtering is critical because small models generate false positives, and a security tool that cannot discriminate between real vulnerabilities and noise drowns reviewers in alerts. The triage stage uses the same cheap models but structures the workflow to catch their mistakes.
The entire pipeline is 100 percent parallel. Every file is scanned and triaged independently, so wall time is just a function of how aggressively you push the API. A scan-plus-triage cycle completes in about 60 seconds per file with relatively full context windows. At 100 files in parallel, you can sweep a thousand-file codebase in 10 minutes on a laptop.
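Because every file is independent, the sweep is a plain map over paths. A sketch of that parallelism using Python's standard thread pool, with a placeholder per-file worker standing in for the real scan-plus-triage cycle:

```python
# Sketch of the embarrassingly parallel sweep: each file is scanned and
# triaged independently, so wall time scales with allowed parallelism.
# `scan_and_triage` is a hypothetical stand-in for the real API-bound
# per-file worker.
from concurrent.futures import ThreadPoolExecutor

def scan_and_triage(path: str) -> list[str]:
    return []  # placeholder: one scan + triage cycle for this file

def sweep(paths: list[str], parallelism: int = 100) -> dict[str, list[str]]:
    # Threads suffice here because the work is API-bound, not CPU-bound.
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return dict(zip(paths, pool.map(scan_and_triage, paths)))

# Back-of-envelope wall time: 1,000 files at ~60 s each with 100 in
# flight is about 10 batches * 60 s = 600 s, i.e. roughly 10 minutes.
```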
Validation on Known Vulnerabilities
Before scanning anything new, AISLE validated that the pipeline could detect known vulnerabilities and, equally important, correctly ignore patched ones. A good scanner must do both: flag real bugs and stay quiet on patched code. They tested both directions across six models using CVE-2026-4747, the 17-year-old FreeBSD remote code execution vulnerability that Anthropic used as their flagship Mythos finding.

The results showed detection capability across a surprising range of model sizes. GPT-OSS-120B, an open-weights model with 5.1 billion active parameters costing roughly $0.04 per million input tokens, was the most consistent: three out of three runs detected the vulnerability before the patch, and three out of three correctly produced no finding related to CVE-2026-4747 after the patch. This model costs approximately 600 times less than Mythos.
Every model from gpt-5.4-nano upward detected the vulnerability at least two out of three times in repeated experiments and correctly ignored the patched version. Even GPT-OSS-20B with only 3.6 billion active parameters found it in two out of three runs. The detection of this buffer overflow appears well within reach of several tiny open-weights and closed models when provided with the right context and workflow.
The validation proved a critical point: for straightforward memory safety issues like buffer overflows in network-facing code, the capability is commoditized. You do not need restricted access to a frontier model priced at multiples of Opus 4.6 to see a stack buffer overflow where attacker-controlled credentials are copied into a fixed buffer without validating length against available space. Small models with the right prompts and context can spot these patterns reliably.
The Real Test: Scanning the Full FreeBSD Kernel
With validation complete, AISLE pointed nano-analyzer at the entire FreeBSD sys directory: roughly 35,000 files containing 7.5 million lines of dense kernel code. They did nothing to make the codebase easier for the models beyond two filters: discarding files whose extensions were unlikely to contain code, and capping maximum file size to fit comfortably in the context window of even small models. Every source file went through the same pipeline with the same generic prompts.

The scanner ran conservatively for 10 hours to stay within modest API rate limits, but with more aggressive parallelism the same scan could complete much faster without performance degradation. The pipeline produced hundreds of surviving findings after internal triage. Surviving triage does not mean a finding is correct: the triage itself has both false positives, where findings survive but are not real bugs, and false negatives, where real bugs get rejected. This is expected when using tiny models for triage.
AISLE sorted candidates by confidence score and took the highest-ranked ones for deeper manual review. They used coding assistants to examine the top 30 to 40 findings, a manageable number even for human review. Several turned out to be real bugs that they reported to FreeBSD maintainers, with some already confirmed. While human and AI review was essential, the quantity was surprisingly manageable: reviewing 30 to 40 candidates for the full FreeBSD kernel is well worth the manual effort if the prize is new zero-days.
Real Bugs Found by Tiny Models
The scanner discovered multiple real bugs in production FreeBSD kernel code. Two were found in NFS RPCsec_gss, the same subsystem where Mythos found CVE-2026-4747. The first is a missing 2017 Coverity fix involving undefined behavior when offset equals 32. The second is a TOCTOU race condition enabling an out-of-bounds write. Rick Macklem, the FreeBSD NFS maintainer, confirmed the first bug and is committing the fix with AISLE listed as author. The full exchange is public on the FreeBSD kernel mailing list.

AISLE-2026-8073 is a memory safety bug in FreeBSD networking code that appears to have been in the codebase for approximately 26 years. The finding was originally detected by gpt-5.4-nano, the $0.20 per million token model that serves as nano-analyzer's default. AISLE's analysis and reproducer indicate unsafe buffer handling leading to kernel memory corruption. Practical exploitability appears configuration-dependent and final severity is pending vendor analysis, so AISLE responsibly disclosed it to the FreeBSD security team with a detailed root-cause analysis and a working AddressSanitizer-based reproducer.
The FreeBSD security team acknowledged the report and is actively investigating. AISLE verified whether other models in their test suite would have found AISLE-2026-8073 in the same file: every model detected it at least two out of three times, including open-weights models you can run locally. GPT-OSS-20B with 3.6 billion active parameters found it in all three runs. This pattern holds regardless of the final security determination: simple parallel approaches can surface novel bugs that prior review, human or automated, has not caught.
The scanner also ran against the OpenBSD kernel and reported several bug candidates through both public and responsible disclosure channels. Details will follow as the maintainers' processes allow. The total API cost for all of AISLE's work including the FreeBSD kernel scan, the OpenBSD kernel scan, and all benchmarking experiments across six models was under $100.
What This Means for Healthcare
The AISLE research fundamentally changes the calculus for healthcare organizations evaluating AI-augmented security programs.

The Defensive Capability is Already Accessible
Healthcare organizations do not need Project Glasswing invitations or Mythos access to start finding vulnerabilities in their software supply chain. The models that can detect memory safety issues, assess severity, and flag security-relevant code patterns are publicly available today. GPT-5.4-nano at $0.20 per million tokens found a 26-year-old kernel bug. GPT-OSS-120B at $0.04 per million tokens with 5.1 billion active parameters reliably detected CVE-2026-4747 across multiple runs.

For healthcare development teams building EHR integrations, patient portals, clinical decision support tools, or administrative automation, this means vulnerability scanning with frontier-adjacent capability is economically feasible right now. Scanning a moderately sized healthcare application codebase costs less than a single penetration test engagement. The question is not whether you can afford the technology but whether you have the expertise to build the workflow around it.
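As a rough illustration of those economics, here is a back-of-envelope cost model. The per-file token count and calls-per-file figures are assumptions for illustration, not AISLE's published numbers; the prices are the per-million-input-token figures quoted in this article.

```python
# Back-of-envelope scan cost estimate. Assumes ~8,000 input tokens per
# model call and three calls per file (context, scan, triage); both
# figures are illustrative assumptions.
PRICE = {"gpt-5.4-nano": 0.20, "gpt-oss-120b": 0.04}  # $ per 1M input tokens

def scan_cost(files: int, tokens_per_call: int = 8_000,
              calls_per_file: int = 3,
              model: str = "gpt-5.4-nano") -> float:
    total_tokens = files * tokens_per_call * calls_per_file
    return total_tokens / 1_000_000 * PRICE[model]

# Under these assumptions, a 1,000-file codebase on gpt-5.4-nano:
# 1,000 * 8,000 * 3 = 24M tokens, at $0.20/1M = $4.80 for a full sweep.
```

Even with the token assumptions off by an order of magnitude, the sweep stays far below the cost of a single penetration test engagement.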
System Design is the Moat
AISLE's research proves that the competitive advantage in AI cybersecurity is not the model but the system built around the model. The security domain knowledge baked into prompts, the orchestration and triage workflow, the validation pipeline that catches false positives, and the trust relationships with maintainers and defenders are what separate effective security programs from expensive noise generators.

Healthcare security teams building SDL processes for AI-assisted development should focus investment on workflow design rather than chasing the most expensive API. The marginal security value of Mythos over gpt-5.4 for defensive vulnerability discovery is unclear based on AISLE's research, but the difference in cost is two orders of magnitude. A healthcare organization spending $10,000 on gpt-5.4-nano can scan significantly more code than spending the same budget on Mythos, and coverage matters for finding the long-tail of vulnerabilities.
Open-Weights Models for Sensitive Code
One of the most striking results from AISLE's research is that open-weights models like GPT-OSS-120B perform comparably to commercial APIs for vulnerability detection. Healthcare organizations with particularly sensitive codebases handling PHI, proprietary clinical algorithms, or competitive IP can run these models entirely locally without sending code to external APIs.

GPT-OSS-120B with 5.1 billion active parameters costs approximately $0.04 per million input tokens when self-hosted, detected CVE-2026-4747 in three out of three runs, and correctly ignored the patched version in three out of three runs. This model can run on healthcare organization infrastructure with no external data transmission, no vendor lock-in, and full control over the scanning process. For security teams concerned about sending medical device firmware or EHR integration code to commercial APIs, this is the path forward.
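One hedged sketch of the on-premises setup: common local serving stacks such as vLLM and llama.cpp can expose an OpenAI-compatible HTTP endpoint, so a scanner only needs to target localhost to keep code in the building. The port and model name below are assumptions for illustration.

```python
# Sketch of building a scan request against a locally hosted
# open-weights model. Assumes a local server exposing an
# OpenAI-compatible /v1/chat/completions endpoint; the port and model
# name are hypothetical. No code leaves the host.
import json
import urllib.request

LOCAL_URL = "http://localhost:8000/v1/chat/completions"  # assumed port

def build_request(prompt: str,
                  model: str = "gpt-oss-120b") -> urllib.request.Request:
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        LOCAL_URL, data=body,
        headers={"Content-Type": "application/json"},
    )

# urllib.request.urlopen(build_request("...")) would then run the scan
# entirely on-premises, with sensitive code never leaving the network.
```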
The Jagged Frontier Creates Opportunity
The fact that model rankings reshuffle completely across different vulnerability classes means healthcare organizations should deploy multiple models rather than depending on a single frontier model. A model that excels at spotting buffer overflows may miss logic bugs in authentication flows. A model that correctly traces data flow through complex call chains may confidently declare vulnerable code safe on the next test.

The jagged frontier works in defenders' favor: run several cheap models in parallel, aggregate their findings, and use the disagreements between models as a signal for where human review should focus. This ensemble approach leverages the strengths of multiple models while using their weaknesses to prioritize manual analysis. Total cost remains far below a single frontier model, and coverage improves because each model has different blind spots.
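The ensemble idea can be sketched in a few lines. The tuple representation of a finding is an assumption for illustration; any hashable finding key would work the same way.

```python
# Sketch of ensemble aggregation across several cheap models: findings
# every model agrees on are high-confidence, findings only some models
# report are routed to human review first. A finding is represented
# here as a (file, line, bug_class) tuple for simplicity.
from collections import Counter

def aggregate(per_model: dict[str, set[tuple]]) -> tuple[set, set]:
    counts = Counter(f for findings in per_model.values() for f in findings)
    n_models = len(per_model)
    consensus = {f for f, c in counts.items() if c == n_models}
    disputed = {f for f, c in counts.items() if 0 < c < n_models}
    return consensus, disputed  # disputed items get human attention first
```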
False Positive Management is Critical
AISLE's research emphasizes that the ability to discriminate between real vulnerabilities and false positives is not a minor feature but a precondition for production use at scale. They note that false positive overload was precisely what killed curl's bug bounty program. Healthcare security teams cannot afford alert fatigue when every false positive consumes developer time and delays legitimate feature development.

The nano-analyzer's multi-stage triage workflow demonstrates how to structure false positive filtering even when using small models that individually make mistakes. The skeptical review rounds with grep access to the full repository allow the system to catch claims that do not hold up under scrutiny. Healthcare organizations adapting this approach should invest heavily in the triage stage, potentially using a more capable model for final arbitration while keeping the initial scanning cheap and comprehensive.
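A structural sketch of that triage loop, with the skeptical review rounds and the arbiter abstracted as callables. These stand in for model calls with repository grep access; they are not AISLE's implementation.

```python
# Sketch of multi-round skeptical triage: each round tries to refute
# the finding, and only survivors reach the arbiter, which could be a
# more capable model than the one used for scanning. `review` and
# `arbitrate` are hypothetical stand-ins for model calls.
from typing import Callable

def triage(finding: str,
           review: Callable[[str, int], bool],
           arbitrate: Callable[[str], bool],
           rounds: int = 3) -> bool:
    for i in range(rounds):
        if not review(finding, i):   # skeptical round: refuted?
            return False             # finding dies here
    return arbitrate(finding)        # arbiter makes the final call
```

The asymmetry is deliberate: any single round can kill a finding, but only the arbiter can confirm one, which biases the pipeline toward suppressing false positives.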
The Offensive Versus Defensive Distinction
One critical nuance in AISLE's research is the distinction between offensive and defensive use cases for AI vulnerability discovery. Offensive capability involves not just finding vulnerabilities but exploiting them: writing working proof-of-concept exploits, chaining multiple vulnerabilities together for privilege escalation, and developing reliable weaponization for arbitrary target systems. This may still require frontier model capability, sophisticated tool use, and extensive domain knowledge.

Defensive capability focuses on discovery, severity assessment, and enabling patches. The goal is finding the vulnerability so maintainers can fix it before attackers exploit it. AISLE's research shows this defensive use case works well with cheap accessible models deployed systematically. For healthcare organizations, the defensive use case is the only legitimate application, and it happens to be the one where small models are most adequate.
The asymmetry matters for threat modeling. Healthcare security teams should assume that well-funded attackers with access to frontier models can find and exploit vulnerabilities in healthcare software. But defenders do not need frontier models to find the same vulnerabilities first and patch them. The race goes to whoever scans comprehensively and patches quickly, not whoever has the most expensive API key.
Implementation Guidance for Healthcare Teams
Healthcare organizations looking to implement AISLE's approach should start with a pilot on a single high-value codebase. Choose an application that handles PHI, integrates with EHR systems, or provides patient-facing functionality where vulnerabilities have direct impact. Run nano-analyzer against the full codebase using gpt-5.4-nano or an open-weights model, implement the three-stage pipeline with context generation, vulnerability scanning, and skeptical triage, and budget time for manual review of the top 30 to 50 findings that survive triage.

The pilot will reveal which vulnerability classes your codebase exhibits and which models perform best on your specific code patterns. Use this learning to tune prompts for your domain, adjust triage thresholds to manage false positive rates, and determine optimal parallelism for your API rate limits and budget. The total cost for a pilot should be well under $500 in API calls, making this an extremely low-risk experiment with potentially high security value.
For organizations with compliance requirements that prohibit sending code to external APIs, prioritize open-weights models like GPT-OSS-120B that can run entirely on-premises. The infrastructure requirements are modest compared to typical healthcare IT systems: a server with a capable GPU can run these models locally with reasonable performance. The tradeoff between API convenience and data residency is worth making for codebases handling particularly sensitive healthcare data.
Looking Forward: The Defensive Advantage
Taken together, the UK AISI evaluation showing that Mythos can only exploit poorly defended systems and AISLE's research showing that defensive capability is broadly accessible today paint a clear picture: defenders currently have the advantage if they act on it. The tools to find vulnerabilities exist, they cost almost nothing, and they work on real production code.

What healthcare organizations need is not access to restricted frontier models but commitment to running the tools they already have access to. The barrier is not technological capability or economic feasibility. The barrier is organizational: dedicating engineering time to build the workflow, training security teams to interpret results, and establishing processes to prioritize remediation of discovered vulnerabilities.
AISLE's decision to open-source nano-analyzer removes even the implementation barrier. The code is available on GitHub, the prompts are public, and the methodology is documented. Healthcare organizations can start scanning their codebases today using models that cost less than a fancy coffee per million tokens. The bugs are in the code right now, the tools to find them have never been more accessible, and the window to patch before attackers exploit is closing.
The question is not whether AI will change the offense-defense balance in healthcare software security. The question is whether healthcare organizations will use the tools available today to find and fix vulnerabilities before attackers with the same tools find and exploit them. AISLE's research proves the defensive tools work. What happens next depends on whether defenders choose to use them.
This is entry #33 in the AI Security series. For related coverage, see The AI Gateway Everyone Uses Just Got Backdoored: LiteLLM and the Healthcare Supply Chain Risk and UK Government Reality-Checks Claude Mythos: Why Healthcare's Cyber Basics Just Became Non-Negotiable.
Key Links
- AISLE: System Over Model - Zero-Day Discovery at the Jagged Frontier
- AISLE: AI Cybersecurity After Mythos - The Jagged Frontier
- GitHub: AISLE nano-analyzer (Open Source)
- GitHub: Mythos Jagged Frontier Research Repository
- FreeBSD Mailing List: AISLE Bug Report and Maintainer Confirmation
- LessWrong: AI Found 12 of 12 OpenSSL Zero-Days
- Anthropic Red Team: Claude Mythos Preview Technical Details