Proofpoint releases innovative detections for threat hunting: PDF Object Hashing
Key findings
Proofpoint created a new open-source tool, called “PDF Object Hashing”, for building threat detection rules based on unique characteristics of PDFs.
This technique can help with identifying related documents and enable attribution when threat actors rely on PDFs for malware or credential phishing payloads.
Proofpoint uses this tool internally to help track multiple threat actors.
The tool is now available on GitHub.
Overview
The PDF format is widely used by threat actors to kickstart malicious activity. In email campaigns, Proofpoint researchers observe PDFs distributed in many ways. For example, threat actors often distribute PDFs that contain URLs leading to malware or credential phishing; PDFs with QR codes leading to malicious web pages; or PDFs with fake banking details or invoices to enable business email compromise (BEC) activity.
Figure 1. Example PDF lures used by threat actors impersonating various brands.
Due to the complex nature of the PDF format and the many ways threat actors use it to their advantage, detecting malicious PDF files can range from straightforward to nearly impossible. Proofpoint researchers have identified notable campaigns leveraging PDFs and have created a new tool called PDF Object Hashing designed to track and detect the unique characteristics of PDFs used by threat actors. The tool supports attribution by identifying PDFs that are likely associated with specific threat actors, even when attack chains or delivery methods change.
PDF Object Hashing
The PDF format is complex, which can cause issues when creating new detection signatures. One challenge detection engineers face with the PDF format is that, for compatibility reasons, the PDF specification permits multiple ways to represent a PDF that appears identical when viewed. This gives threat actors a multitude of options to introduce random variations in their malicious PDFs, making it difficult for threat detection engineers to write pattern-matching signatures that address all variations. Examples of the options for variation include the following:
Six different valid whitespace characters
Cross-reference tables (think table of contents) can be stored in plaintext, or compressed and stored in a separate format
Parameter values for an object can be embedded in that object, or referenced in another object
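To illustrate the whitespace variation listed above, here is a minimal sketch in Python. The three byte strings are made-up spellings of the same dictionary entry, not samples from any campaign, and the tolerant pattern is only one possible way to absorb the variation. The six whitespace characters are those defined by the PDF specification (NUL, tab, line feed, form feed, carriage return, space).

```python
import re

# Three hypothetical byte-level spellings of the same dictionary entry.
variants = [
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<<\r\n/Type\t/Catalog\n/Pages 2 0 R\n>>",
    b"<</Type/Catalog/Pages 2 0 R>>",
]

# A literal byte pattern only matches the spelling it was written for.
naive_hits = [b"/Type /Catalog" in v for v in variants]

# A tolerant pattern allows any run (including none) of the six PDF
# whitespace characters between the key and its value.
PDF_WS = rb"[\x00\t\n\x0c\r ]*"
tolerant = re.compile(rb"/Type" + PDF_WS + rb"/Catalog")
tolerant_hits = [bool(tolerant.search(v)) for v in variants]

print(naive_hits)
print(tolerant_hits)
```

Even this small case needs a character class rather than a fixed string, and whitespace is only one of the sources of variation listed above.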
Additionally, some objects are present in the document as clear text and others are compressed in “stream objects.” A stream object is a (typically compressed) object within the PDF that the PDF viewer can still access. This means a domain that a security practitioner is trying to alert on might not be visible unless these compressed streams are inspected. While most detection engineers recognize that elements like URIs or lure images can change frequently, the PDF format includes numerous additional format-specific hurdles that must be considered when analyzing a file.
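The effect of compressed streams on string matching can be sketched as follows. This is a simplified illustration, not how any particular parser works. The object, the placeholder URL, and the `inflate_streams` helper are all hypothetical, and the helper assumes a direct `/Length` value and a single `/FlateDecode` filter, which real documents do not guarantee.

```python
import re
import zlib

# Hypothetical annotation dictionary; the URL is a placeholder.
body = b"<< /Type /Annot /Subtype /Link /A << /S /URI /URI (https://example.com/login) >> >>"

# Wrap the body in a stream object the way a writer using FlateDecode might.
compressed = zlib.compress(body)
pdf_object = (
    b"7 0 obj\n<< /Length " + str(len(compressed)).encode("ascii")
    + b" /Filter /FlateDecode >>\nstream\n"
    + compressed
    + b"\nendstream\nendobj\n"
)

# Matches a stream dictionary with a direct /Length and a FlateDecode filter.
STREAM_RE = re.compile(
    rb"/Length\s+(\d+)[^>]*/Filter\s*/FlateDecode[^>]*>>\s*stream\r?\n"
)

def inflate_streams(data: bytes) -> list[bytes]:
    """Return the inflated body of each FlateDecode stream found in data."""
    bodies = []
    for match in STREAM_RE.finditer(data):
        start = match.end()
        length = int(match.group(1))
        bodies.append(zlib.decompress(data[start:start + length]))
    return bodies

# Searching the raw bytes versus searching the inflated stream bodies.
raw_hit = b"example.com" in pdf_object
inflated_hit = any(b"example.com" in s for s in inflate_streams(pdf_object))
print(raw_hit, inflated_hit)
```

A signature that only scans the raw file bytes would miss the domain here, while one that inflates streams first would find it.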
A specific challenge in defending against PDF threats occurs when the file is encrypted. When a PDF file is encrypted, the overall structure of the document remains visible, but the details or parameters of the individual objects are obscured. The following screenshots demonstrate how objects such as URI strings are hidden when the PDF is encrypted but are visible when it is not.
Figures 2 and 3. Examples of a standard URI object (obj 5) and an encrypted URI object (obj 10).
Proofpoint researchers created PDF Object Hashing detections to combat the challenges presented by the PDF format. Instead of relying on more fragile or temporary detections such as file hashes, URLs, lure images, and metadata values, we can focus on the structure of the document. While more robust detections exist using techniques like dhash to compare image similarity, PDF Object Hashing applies to the overall structure of the document, allowing us to ignore specific lure images. By examining the types of objects and the order in which they appear, while ignoring their specific parameters and details, we can create a “skeleton” or “template” representing the PDF document’s overall structure. These object types are then hashed to create a “fingerprint” of the PDF. Doing so allows us to search across a wide range of PDF files to detect and identify other files that potentially match the “fingerprint”.
The process starts by parsing the document, following the locations of all the objects that are in use, and then parsing out a “type” for each object. Below are just some of the types we extract:
Pages
Catalog
XObject/Image
Annots/Link
Page
Metadata/XML
Producer
Font/Type1
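As a rough illustration of extracting a type per object, the sketch below scans a small hand-written PDF fragment and builds a label from each object's `/Type` and `/Subtype` names. This is an assumption-laden simplification. The released tool follows the locations of the objects in use, whereas this sketch scans linearly, ignores object streams, and uses a hypothetical `object_types` helper with a labeling scheme of our own choosing.

```python
import re

# Indirect objects: "<num> <gen> obj ... endobj".
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)
TYPE_RE = re.compile(rb"/Type\s*/([A-Za-z0-9]+)")
SUBTYPE_RE = re.compile(rb"/Subtype\s*/([A-Za-z0-9]+)")

def object_types(data: bytes) -> list[str]:
    """Return one 'Type' or 'Type/Subtype' label per object, in file order."""
    labels = []
    for match in OBJ_RE.finditer(data):
        body = match.group(3)
        found = (TYPE_RE.search(body), SUBTYPE_RE.search(body))
        parts = [m.group(1).decode("ascii") for m in found if m]
        labels.append("/".join(parts) if parts else "Unknown")
    return labels

# Hand-written fragment for illustration only (not a complete, valid PDF).
sample = b"""%PDF-1.7
1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj
2 0 obj
<< /Type /Pages /Kids [3 0 R] /Count 1 >>
endobj
3 0 obj
<< /Type /Page /Parent 2 0 R /Annots [4 0 R] >>
endobj
4 0 obj
<< /Type /Annot /Subtype /Link /A << /S /URI /URI (https://example.com) >> >>
endobj
5 0 obj
<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>
endobj
"""

labels = object_types(sample)
print(labels)
```

The URI inside object 4 never enters the result; only the kind of object and its position in the sequence are kept.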
We then concatenate the object types in order and hash that value, creating what we’ve called the PDF Object Hash. This works similarly to how imphash works for PE files. We can then cluster on these hashes to help identify variations and image lure updates that may have taken place with a particular document. This is useful for identifying documents where an image lure was updated or a URI was changed but the overall document is still similar, which could indicate a builder or process unique to a threat group.
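The concatenate-then-hash step and the clustering it enables can be sketched as below. The delimiter, the choice of MD5 (picked here only because imphash uses it), the helper names, and the sample file names and type lists are all assumptions for illustration; the released tool may make different choices.

```python
import hashlib
from collections import defaultdict

def pdf_object_hash(types: list[str]) -> str:
    """Hash an ordered list of object type labels into one fingerprint."""
    # Assumption: join with "|" and hash with MD5, mirroring imphash's approach.
    skeleton = "|".join(types)
    return hashlib.md5(skeleton.encode("utf-8")).hexdigest()

def cluster_by_hash(samples: dict[str, list[str]]) -> dict[str, list[str]]:
    """Group sample names by the fingerprint of their object type sequence."""
    clusters = defaultdict(list)
    for name, types in samples.items():
        clusters[pdf_object_hash(types)].append(name)
    return dict(clusters)

# Hypothetical inputs: two lures sharing a structure, one unrelated document.
samples = {
    "lure_a.pdf": ["Catalog", "Pages", "Page", "XObject/Image", "Annots/Link"],
    "lure_b.pdf": ["Catalog", "Pages", "Page", "XObject/Image", "Annots/Link"],
    "other.pdf": ["Catalog", "Pages", "Page", "Font/Type1"],
}

clusters = cluster_by_hash(samples)
for fingerprint, names in clusters.items():
    print(fingerprint, names)
```

Because only type labels and their order feed the hash, swapping the lure image or the URI in `lure_b.pdf` would leave it in the same cluster as `lure_a.pdf`, while reordering or adding objects would move it to a new one.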
Figure 4. Overlap with PDF Object Hash (green) and then the below PDFs (yellow).
Figure 5. Two distinct lures that both contain the same types of objects.
PDF Object Hashing can be a reliable way of generating signals which can be used with other detection logic to help create more robust rules and to cluster PDF files into groups for more focused analysis. Proofpoint researchers have used the tool internally to identify documents and related activity with high confidence, improving attribution in many cases.
Campaign examples
To illustrate how PDF Object Hashing can help with threat hunting and analysis, we can look at two interesting threat actors.
The threat cluster known as UAC-0050 targets Ukraine and frequently distributes encrypted PDFs delivering malware. In their campaigns, messages contain PDF files with URLs leading to NetSupport RAT. The URL typically downloads a compressed JavaScript file which, if executed, installs the NetSupport RAT payload.
Figure 6. Example PDF impersonating OneDrive. (SHA256: ee03ad7c8f1e25ad157ab3cd9b0d6109b30867572e7e13298a3ce2072ae13e5).
Because these malicious PDF files are encrypted, many cybersecurity tools and other PDF parsing systems are unable to extract the embedded content, including the URI, the lure image, and parameters associated with the content of the document. Regardless of encryption, PDFs retain an internal document structure (e.g., a hierarchy of objects and attributes) that can be parsed to reveal how those objects are organized and related within the file. Using PDF Object Hashing, Proofpoint developed a unique signature for these PDFs without needing to decrypt or analyze the specific contents of their internal objects. This approach allowed for the rapid identification of other PDF documents that potentially share the same structure, while also allowing us to condemn and prevent payload delivery.
Another actor currently employing PDFs and tracked using PDF Object Hashing is UNK_ArmyDrive. Tracked by Proofpoint since May 2025, this actor is believed to operate out of India and has a history of using PDFs as part of its attack chain. While Proofpoint has traditional detection coverage of this group, we have also augmented that coverage with PDF Object Hashing. Doing so provides additional signals from the static characteristics of the group's documents, which we can use to find samples that might otherwise be missed if the group were to pivot away from existing lures.
Figure 7. Example UNK_ArmyDrive PDFs impersonating the Bangladesh Ministry of Defense (08367ec03ede1d69aa51de1e55caf3a75e6568aa76790c39b39a00d1b71c9084).
The open source project for PDF Object Hashing can be found in the Proofpoint Emerging Threats public GitHub: https://github.com/EmergingThreats/pdf_object_hashing