DESTRUCTURED - Critical Vulnerability in Unstructured.io (CVE-2025–64712)

TL;DR

We discovered a critical vulnerability (CVE-2025–64712, CVSS 9.8) in Unstructured.io - an ETL product used by 87% of Fortune 1000 companies (Amazon, Google, Bank of America, ..)

The vulnerability allows a threat actor to achieve arbitrary file write and potentially remote code execution on the machine that runs the unstructured library.

From Chaos to Intelligence

“Unstructured data” refers to information that lacks a predefined model or structure, making it difficult to store and analyze in traditional relational databases - like PDFs, emails, Word documents, slide decks, and images.

Unstructured data makes up about 80–90% of all business data, but AI systems can’t naturally understand it.
To make this data useful, it must go through a transformation process.

First, data is extracted from the source using different conversion techniques, depending on the format -
Optical Character Recognition (OCR) for PDFs, speech-to-text for audio, and more.
Then, the output is broken into smaller pieces (chunks) and converted into a format AI can understand.

These pieces get stored in a searchable system (vector database), so when you ask an AI assistant a question, it can quickly find the most relevant information from your documents and use that to give you an accurate answer.


Unstructured.io

Unstructured.io is a company that helps businesses turn messy, unstructured data into clean, structured information that AI systems can understand and use.

Unstructured.io offers a hybrid approach with both open-source and managed products. Their core offering is an open-source library that provides components for ingesting and pre-processing documents.
They also offer several managed services, SaaS API with either shared cloud hosting, dedicated instances, or VPC deployment.

In addition, they launched an Enterprise Platform that provides a no-code interface where you can connect data sources, process documents automatically, and deliver structured outputs to your chosen destination.

The platform can ingest documents from sources like S3, Google Drive, OneDrive, and Salesforce, then deliver processed data to a chosen destination (from a long list of supported services).


The Vulnerability

Summary
The vulnerability we found is a classic path traversal bug that leads to arbitrary file write. In simple words, it allows a threat actor to write a file of any type, with any content, to anywhere on the file system of the machine running the unstructured library.

The impact of these types of vulnerabilities is critical, as arbitrary file write almost always leads to code execution. For example, an attacker could inject malicious code into startup scripts (like
init.d), create cron jobs for persistent access, overwrite SSH’s authorized_keys file to establish backdoor access and gain full control over the machine, or - depending on the web server and programming language (PHP, JSP, etc.) - upload webshells for remote control.

File types support
As mentioned earlier, the unstructured library serves as an ETL pipeline for AI innovation in enterprises. It takes a file as input, extracts the text content, splits it into chunks, passes those chunks through an embedding model, and stores them in a vector database.

The library supports a wide range of input file types.

One of the file types the unstructured library supports is the .msg file type. These are Microsoft Outlook email message files that store individual emails, including the message body, attachments, sender/recipient info, and metadata.

.msg In A Bottle (Technical Details)
The function in the unstructured library that in charge of processing a Microsoft’s Outlook email messages (
.msg files) is the partition_msg function.

The partition_msg function calls to MsgPartitioner.iter_message_elements - a function that in charge of partitioning the email into its different elements

As you can see, the function first extracts text from the message body elements (_iter_message_body_elements) - like the email content itself, sender, and recipients. Depending on the email format, the unstructured library calls the corresponding function (partition_html for HTML format, partition_text for plain text).

Next, for email messages that contain attachments, the function tries to extracts text from each attachment by calling AttachmentPartitioner.iter_elements.

Let’s break this function down. First, it stores the attachment in a temp directory so it can process it later.

The second part handles text extraction from the attachment. Since the function doesn’t know the attachment type in advance, it uses unstructured.partition.auto.partition() function to figure it out dynamically.

This function determines the file type based on the extension, content-type, or headers within the file alongside with extracting the text itself from the attachment.

So, can you guess where the vulnerability is?
Here’s a hint - when the function stores the attachment in a temporary directory, it uses the following code

The function creates a temporary file path and writes the attachment’s content directly to it. The issue lies in how that path is constructed.

As you can see, the function simply concatenates the temporary directory path (for example,
/tmp/) with self._attachment_file_name.

But what exactly is the value of self._attachment_file_name?

As you can see, according to the comment in the top of the function - the value of self._attachment_file_name is "The original name of the attached file.".
What does that mean? It means that if the original name of the attachment is something like ../../root/.ssh/authorized_keys, the temporary path will become - /tmp/../../root/.ssh/authorized_keys.

As a result, when the write action is invoked, we'll overwrite the authorized_keys file with the content of our attachment (which we fully control), ultimately achieving arbitrary file write.


A Supply Chain Nightmare

Unstructured is one of the most widely deployed libraries in the AI ecosystem, with over 4 million monthly downloads.
Major open-source applications like OpenWebUI use it as their default ETL pipeline, and 87% of Fortune 1000 companies rely on it - whether in internal environments or production systems.

But the exposure doesn’t stop there. Popular frameworks like LlamaIndex and LangChain act as wrappers around unstructured, exposing API functions that ultimately call unstructured under the hood.

This creates a sprawling dependency chain that’s nearly impossible to track.

A vulnerability in unstructured doesn’t just affect direct users -  it ripples through countless downstream libraries and applications (e.g., unstructured → LangChain → library_x → …)

It’s difficult to accurately measure the full blast radius of this vulnerability due to the complexity of these nested dependencies.
This is a supply chain issue that likely affects far more systems than we can easily identify

For reference, the unstructured library is used directly (according to GitHub) in ~10K files, whereas langchain_community.document_loaders is used in ~100K file

Why Should You Care?

The unstructured library is widely deployed across enterprises, both as a managed service and as a direct dependency in countless applications.

It’s become the de facto standard for document parsing and text extraction - you’ll find it referenced in Azure, AWS, and GCP documentation, tutorials, and sample architectures.

If this vulnerability exists in your enterprise application, an attacker could achieve complete takeover of the instance running the library.

The impact could be severe: exfiltration of sensitive customer data, theft of API keys and credentials, lateral movement across your infrastructure and much more.


Call To Action

Make sure to update unstructured library to version 0.18.18 or newer in all of your applications and workloads


Responsible Disclosure Timeline

  • October 29, 2025: reported the vulnerability to Unstructured.io.
  • November 6, 2025: Unstructured.io acknowledged the report.
  • November 7, 2025: Unstructured.io published patched version (0.18.18).
  • December 1, 2025: The researcher requested an update regarding the publication status of the vulnerability report.
  • December 1, 2025: Unstructured.io team confirms fix deployed to SaaS; CVE issued but not published, awaiting on-prem enterprise customers updates before public disclosure.
  • December 29, 2025: The researcher requested an update regarding the publication status of the vulnerability report.
  • December 29, 2025: Unstructured.io team responds that CVE publication is delayed pending completion of on-prem upgrades by large enterprise customers.
  • February 1, 2026: Researcher notifies team of intent to publish on Feb 8, 2026, citing 95-day timeline and concern for non-enterprise users still at risk.
  • February 3, 2026: Unstructured.io published CVE details of the vulnerability: CVE-2025–64712.
  • February 12, 2026: Cyera published this blog post.
Download Report

DESTRUCTURED - Critical Vulnerability in Unstructured.io (CVE-2025–64712)

TL;DR

We discovered a critical vulnerability (CVE-2025–64712, CVSS 9.8) in Unstructured.io - an ETL product used by 87% of Fortune 1000 companies (Amazon, Google, Bank of America, ..)

The vulnerability allows a threat actor to achieve arbitrary file write and potentially remote code execution on the machine that runs the unstructured library.

From Chaos to Intelligence

“Unstructured data” refers to information that lacks a predefined model or structure, making it difficult to store and analyze in traditional relational databases - like PDFs, emails, Word documents, slide decks, and images.

Unstructured data makes up about 80–90% of all business data, but AI systems can’t naturally understand it.
To make this data useful, it must go through a transformation process.

First, data is extracted from the source using different conversion techniques, depending on the format -
Optical Character Recognition (OCR) for PDFs, speech-to-text for audio, and more.
Then, the output is broken into smaller pieces (chunks) and converted into a format AI can understand.

These pieces get stored in a searchable system (vector database), so when you ask an AI assistant a question, it can quickly find the most relevant information from your documents and use that to give you an accurate answer.


Unstructured.io

Unstructured.io is a company that helps businesses turn messy, unstructured data into clean, structured information that AI systems can understand and use.

Unstructured.io offers a hybrid approach with both open-source and managed products. Their core offering is an open-source library that provides components for ingesting and pre-processing documents.
They also offer several managed services, SaaS API with either shared cloud hosting, dedicated instances, or VPC deployment.

In addition, they launched an Enterprise Platform that provides a no-code interface where you can connect data sources, process documents automatically, and deliver structured outputs to your chosen destination.

The platform can ingest documents from sources like S3, Google Drive, OneDrive, and Salesforce, then deliver processed data to a chosen destination (from a long list of supported services).


The Vulnerability

Summary
The vulnerability we found is a classic path traversal bug that leads to arbitrary file write. In simple words, it allows a threat actor to write a file of any type, with any content, to anywhere on the file system of the machine running the unstructured library.

The impact of these types of vulnerabilities is critical, as arbitrary file write almost always leads to code execution. For example, an attacker could inject malicious code into startup scripts (like
init.d), create cron jobs for persistent access, overwrite SSH’s authorized_keys file to establish backdoor access and gain full control over the machine, or - depending on the web server and programming language (PHP, JSP, etc.) - upload webshells for remote control.

File types support
As mentioned earlier, the unstructured library serves as an ETL pipeline for AI innovation in enterprises. It takes a file as input, extracts the text content, splits it into chunks, passes those chunks through an embedding model, and stores them in a vector database.

The library supports a wide range of input file types.

One of the file types the unstructured library supports is the .msg file type. These are Microsoft Outlook email message files that store individual emails, including the message body, attachments, sender/recipient info, and metadata.

.msg In A Bottle (Technical Details)
The function in the unstructured library that in charge of processing a Microsoft’s Outlook email messages (
.msg files) is the partition_msg function.

The partition_msg function calls to MsgPartitioner.iter_message_elements - a function that in charge of partitioning the email into its different elements

As you can see, the function first extracts text from the message body elements (_iter_message_body_elements) - like the email content itself, sender, and recipients. Depending on the email format, the unstructured library calls the corresponding function (partition_html for HTML format, partition_text for plain text).

Next, for email messages that contain attachments, the function tries to extracts text from each attachment by calling AttachmentPartitioner.iter_elements.

Let’s break this function down. First, it stores the attachment in a temp directory so it can process it later.

The second part handles text extraction from the attachment. Since the function doesn’t know the attachment type in advance, it uses unstructured.partition.auto.partition() function to figure it out dynamically.

This function determines the file type based on the extension, content-type, or headers within the file alongside with extracting the text itself from the attachment.

So, can you guess where the vulnerability is?
Here’s a hint - when the function stores the attachment in a temporary directory, it uses the following code

The function creates a temporary file path and writes the attachment’s content directly to it. The issue lies in how that path is constructed.

As you can see, the function simply concatenates the temporary directory path (for example,
/tmp/) with self._attachment_file_name.

But what exactly is the value of self._attachment_file_name?

As you can see, according to the comment in the top of the function - the value of self._attachment_file_name is "The original name of the attached file.".
What does that mean? It means that if the original name of the attachment is something like ../../root/.ssh/authorized_keys, the temporary path will become - /tmp/../../root/.ssh/authorized_keys.

As a result, when the write action is invoked, we'll overwrite the authorized_keys file with the content of our attachment (which we fully control), ultimately achieving arbitrary file write.


A Supply Chain Nightmare

Unstructured is one of the most widely deployed libraries in the AI ecosystem, with over 4 million monthly downloads.
Major open-source applications like OpenWebUI use it as their default ETL pipeline, and 87% of Fortune 1000 companies rely on it - whether in internal environments or production systems.

But the exposure doesn’t stop there. Popular frameworks like LlamaIndex and LangChain act as wrappers around unstructured, exposing API functions that ultimately call unstructured under the hood.

This creates a sprawling dependency chain that’s nearly impossible to track.

A vulnerability in unstructured doesn’t just affect direct users -  it ripples through countless downstream libraries and applications (e.g., unstructured → LangChain → library_x → …)

It’s difficult to accurately measure the full blast radius of this vulnerability due to the complexity of these nested dependencies.
This is a supply chain issue that likely affects far more systems than we can easily identify

For reference, the unstructured library is used directly (according to GitHub) in ~10K files, whereas langchain_community.document_loaders is used in ~100K file

Why Should You Care?

The unstructured library is widely deployed across enterprises, both as a managed service and as a direct dependency in countless applications.

It’s become the de facto standard for document parsing and text extraction - you’ll find it referenced in Azure, AWS, and GCP documentation, tutorials, and sample architectures.

If this vulnerability exists in your enterprise application, an attacker could achieve complete takeover of the instance running the library.

The impact could be severe: exfiltration of sensitive customer data, theft of API keys and credentials, lateral movement across your infrastructure and much more.


Call To Action

Make sure to update unstructured library to version 0.18.18 or newer in all of your applications and workloads


Responsible Disclosure Timeline

  • October 29, 2025: reported the vulnerability to Unstructured.io.
  • November 6, 2025: Unstructured.io acknowledged the report.
  • November 7, 2025: Unstructured.io published patched version (0.18.18).
  • December 1, 2025: The researcher requested an update regarding the publication status of the vulnerability report.
  • December 1, 2025: Unstructured.io team confirms fix deployed to SaaS; CVE issued but not published, awaiting on-prem enterprise customers updates before public disclosure.
  • December 29, 2025: The researcher requested an update regarding the publication status of the vulnerability report.
  • December 29, 2025: Unstructured.io team responds that CVE publication is delayed pending completion of on-prem upgrades by large enterprise customers.
  • February 1, 2026: Researcher notifies team of intent to publish on Feb 8, 2026, citing 95-day timeline and concern for non-enterprise users still at risk.
  • February 3, 2026: Unstructured.io published CVE details of the vulnerability: CVE-2025–64712.
  • February 12, 2026: Cyera published this blog post.
Download Report

Experience Cyera

To protect your dataverse, you first need to discover what’s in it. Let us help.

Get a demo  →
Decorative