The Black Box of .NET Headline Animator

Showing posts with label linq to xml. Show all posts
Showing posts with label linq to xml. Show all posts

Sunday, January 12, 2025

What is LINQ to Xml Atomization?

Atomization is the process that the LINQ to Xml runtime uses to store and reuse string instances. Atomization means that if two XName objects have the same local name, and they're in the same namespace, they share the same instance. Likewise, if two XNamespace objects have the same namespace URI, they share the same instance. This optimizes memory usage and improves performance when comparing for equality of:
  • One XName instance to a different XName instance
  • One XNamespace instance to a different XNamespace instance
because the underlying intermediate language only has to compare references instead of string comparisons which would take longer. Being able to take advantage of this requires that both XName and XNamespace implement the equality and inequality operators - which of course they do. This concept is particularly relevant when working with XElement and XAttribute instances, where the same names might appear multiple times. XName and XNamespace also implement the Implicit Operator that converts strings to XName or XNamespace instances. This allows for automatically passing atomized instances as parameters to LINQ to Xml methods which then have better performance because of atomization. For example, this code implicitly passes an XName (with a value of "x") to the Descendants method:
var root = new XElement("root",
  new XElement("x", "1"),
  new XElement("y",
    new XElement("x", "1"),
    new XElement("x", "3")
  )
);

foreach (var e in root.Descendants("x").Where(e => e.Value == "1"))
{
  ...
}
Atomization is similar to string interning in .NET, where identical string literals are stored only once in memory. When LINQ to Xml processes XML data, it often encounters repeated element and attribute names. Instead of creating a new string object for each occurrence, it reuses existing string instances. It does this by caching the Name property in XName and XNamespace in an internal static XHashtable<WeakReference<XNamespace>>. You can find the source code here:

Practical Implications

So how does atomization affect my application?
  1. Element and Attribute Names: When you create elements or attributes using LINQ to Xml, the names are atomized. For example, if you create multiple elements with the same name, LINQ to Xml will store the name only once.
  2. Namespace Handling: Atomization also applies to XML namespaces. When you define namespaces in your XML, LINQ to XML ensures that each unique namespace URI is stored only once.
  3. Value Atomization: While element and attribute names are atomized, the values are not automatically atomized. However, if you're feeling adventurous and you frequently use the same values, you might consider implementing your own caching mechanism to achieve similar benefits. Now before you go off and write your own caching mechanism to cache values, consider that the .NET team has done a lot of work ensuring the caching of names is both thread-safe and performant. I've never found that I needed to do this though the largest xml documents I've had to work with are in the tens of MB's. If you're using much larger xml documents of hundreds of MB's or even 1 GB+ in size, then you may find this worthwhile.

Pre-Atomization

You might be thinking "..this is great, I can't do anything to improve performance here!". But there is. Unfortunately, even though you effectively can pass atomized instances to LINQ to Xml methods, there is a small cost. This is because the Implicit Operator has to be invoked. You can refer to the Atomization Benchmarks at the end of this post to get the details on some benchmarking I did. In a nutshell, the results show that pre-atomizing is just over 2x faster. That being said, we are talking about a couple hundred nanoseconds in the context of my test for an element with 3 child alements, each having an attribute and string content. The benefit of pre-atomization becomes much more evident with very large XML documents. Here is the xml used for test data in the benchmarks:
<aw:Root xmlns:aw="http://www.adventure-works.com">
  <aw:Data ID="1">4,100,000</aw:Data>
  <aw:Data ID="2">3,700,000</aw:Data>
  <aw:Data ID="3">1,150,000</aw:Data>
</aw:Root>
The full code for the benchmarks can be found at this gist: Linq to Xml - XName Atomization Benchmark.cs

Atomization and ReferenceEquals

Let's take a look at XName as an example. There are two ways to directly create XName instances:
  1. The XName.Get(String) or XName.Get(String, String) methods. See here
  2. The XNamespace.Addition(XNamespace, String) Operator method. See here
They are also created indirectly by LINQ to Xml when you create XDocument, XElement, and XAttribute instances. Here is some code to demonstrate that there is a single atomized instance referred to regardless of how the instance was created.
// Note that these are not explicitly declared as const
string x = "element";
string y = "element";
var stringsHaveSameReference = object.ReferenceEquals(x, y);
Console.WriteLine(
  $"string 'x' has same reference as string 'y' (expect true): {stringsHaveSameReference}");

// Create new XName instances (indirectly) thru XElement ctor
// using a string as the name
var xNameViaGet1 = XName.Get(localName, namespaceUri);
var xNameViaGet2 = XName.Get(localName, namespaceUri);

// Check if XName instances are the same
namesHaveSameReference = object.ReferenceEquals(xNameViaGet1, xNameViaGet2);
Console.WriteLine($"xNameViaGet1 is same reference as xNameViaGet2 (expect true): {namesHaveSameReference}");

// Create XName instances via XNamespace.Addition Operator
XNamespace ns = namespaceUri;
XName xNameViaNSAddition1 = ns + localName;
XName xNameViaNSAddition2 = ns + localName;

// Check if XName instances are the same
namesHaveSameReference = object.ReferenceEquals(xNameViaNSAddition1, xNameViaNSAddition2);
Console.WriteLine(
  $"xNameViaNSAddition1 is same reference as xNameViaNSAddition2 (expect true): {namesHaveSameReference}");

// Create XElement and XAttribute using XName instances
XElement ele = new XElement(xNameViaGet1, "value1");
XAttribute attr = new XAttribute(xNameViaGet2, "value2");

// Check if XName instances in XElement and XAttribute are the same
namesHaveSameReference = object.ReferenceEquals(ele.Name, attr.Name);
Console.WriteLine(
  $"ele.Name is same reference as attr.Name (expect true): {namesHaveSameReference}");

// Compare XName references that were created differently
namesHaveSameReference = object.ReferenceEquals(xNameViaGet1, xNameViaNSAddition1);
Console.WriteLine(
  $"xNameViaGet1 is same reference as xNameViaNSAddition1 (expect true): {namesHaveSameReference}");
namesHaveSameReference = object.ReferenceEquals(xNameViaGet1, ele.Name);
Console.WriteLine(
  $"xNameViaGet1 is same reference as ele.Name (expect true): {namesHaveSameReference}");

// Create 2 XElement instances with the same name of 'root' and same value
XElement eleViaCtor = new XElement("root", "value");
XElement eleViaParse = XElement.Parse("<root>value</root>");

// Note that the 2 XElements DO NOT have the same reference
bool xelementsHaveSameReference = object.ReferenceEquals(eleViaCtor, eleViaParse);
Console.WriteLine(
  $"eleViaCtor and eleViaParse refer to same instance: {xelementsHaveSameReference}");

// However, their respective XName properties DO have the same reference
namesHaveSameReference = object.ReferenceEquals(eleViaCtor.Name, eleViaParse.Name);
Console.WriteLine($"eleViaCtor.Name and eleViaParse.Name refer to same instance: {namesHaveSameReference}");

Atomization Benchmarks

These tests compared creating XElement and XAttribute instances by either passing a string or an XName instance to the constructors. Note that the memory allocations were identicial since they are all referring to the same XName instance.
Method                            | Mean     | Error   | StdDev  | Ratio | Gen0   | Allocated | Alloc Ratio |
----------------------------------|---------:|--------:|--------:|------:|-------:|----------:|------------:|
XNode_Construction_Passing_Strings| 355.9 ns | 3.69 ns | 3.45 ns |  1.00 | 0.0391 |     656 B |        1.00 |
XNode_Construction_Passing_XName  | 136.0 ns | 1.95 ns | 1.82 ns |  0.38 | 0.0391 |     656 B |        1.00 |
The full code for the benchmarks can be found at this gist: Linq to Xml - XName Atomization Benchmark.cs
Share

Saturday, January 11, 2025

How to Control Namespace Prefixes with LINQ to Xml

LINQ to Xml is a great improvement over the XmlDocument DOM approach that uses types that are part of the System.Xml namespace. LINQ to Xml uses types that are part of the System.Xml.Linq namespace. In my experience, it shines for a large majority of the use cases for working with XML by offering a simpler method of accessing:
  • Elements (XElement)
  • Attributes (XAttribute)
  • Nodes (XNode)
  • Comments (XComment)
  • Text (XText)
  • declarations (XDeclaration)
  • Namespaces (XNamespace)
  • Processing Instructions (XProcessingInstruction)
However, this post isn't intended to be an intro to LINQ to Xml. You can read more about it here. The remaining use cases that need more complex and advanced XML processing are better handled by utilizing the XmlDocument DOM. This is because it has been around longer (since .NET Framework 1.1) whereas LINQ to Xml was added in .NET Framework 3.5. As a result the XmlDocument DOM is richer in features. I'd liken it to working with C++ instead of C# - you can do a lot more with it and it's been around a lot longer, but it is more cumbersome to use than C#. Now getting back to the matter at hand... When working with namespace prefixes using LINQ to Xml, it will handle the serialization of the prefixes for you. This is handled both by the XmlWriter as well as calling XDocument.ToString() or XElement.ToString(). This is helpful in most cases as it abstracts all of the details and just does it for you. It's not helpful when you need to control the prefixes that are serialized. Some examples of when you might need to control prefix serialization are:
  • comparing documents for semantic equivalence
  • normalizing an xml document
  • or canonicalizing an xml document

Even LINQ to Xml's XNode.DeepEquals() method doesn't consider elements or attributes with different prefixes across XElement or XDocument instances to be equivalent. How disappointing. 

Here is an example of how LINQ to Xml takes care of prefix serialization for you:
const string xml =  @"
<x:root xmlns:x='http://ns1' xmlns:y='http://ns2'>
    <child a='1' x:b='2' y:c='3'/>
</x:root>";

XElement original = XElement.Parse(xml);

// Output the XML
Console.WriteLine(original.ToString());
The output is:
<x:root xmlns:x="http://ns1" xmlns:y="http://ns2">
    <child a="1" x:b="2" y:c="3" />
</x:root>
Now create an XElement based on the original. This illustrates that the namespace prefixes specified in the original XElement are not maintained in the copy:
XElement copy =
  new XElement(original.Name,
    original.Elements()
      .Select(e => new XElement(e.Name, e.Attributes(), e.Elements(), e.Value)));

Console.WriteLine("\nNew XElement:");
Console.WriteLine(copy.ToString());
The output is:
New XElement:
<root xmlns="http://ns1">
  <child a="1" p2:b="2" p3:c="3" xmlns:p3="http://ns2" xmlns:p2="http://ns1" xmlns="" />
</root>

What happened to my prefixes of x and y? Where did the prefixes p2 and p3 come from? Also, note how there are some additional namespace declarations that aren't in the original XElement. We didn't touch anything related to namespaces when we created the copy; we just selected the elements and attributes from the original. You can't even use XNode.DeepEquals() to compare them for equivalence since the attributes are now different as well as the element names. I'll leave that as an exercise for the reader.

Controlling the prefixes yourself, although not very intuitive, is actually really easy. You might consider trying to change the Name property on the xmlns namespace declaration attribute to get the prefix rewritten. However, unlike the XElement.Name property which is writable, the XAttribute.Name property is readonly.

One caveat here: IF you happen to have superfluous (duplicate) namespace prefixes pointing to the same namespace, you are out of luck. I've written a lengthy description about this on Stack Overflow answering this question: c# linq xml how to use two prefixes for same namespace.

Would you believe me if I told you that LINQ is actually going to help you rewrite the prefixes? In fact, it doesn't just help you, it rewrites all of the prefixes for you! I bet you didn't see that coming! This is because modifications to the xml made at runtime cause LINQ to Xml to keep its tree updated so that it is in a consistent state. Imagine if you did made a programmatic change that caused the xml tree to be out of whack per se and the LINQ to Xml runtime didn't make updates for you to keep things in sync. Well, welcome to debugging hell.

So, all we have to do are a couple of really simple things and LINQ to Xml handles the rewriting for us. There are two approaches: you can either create a new element or modify the existing one. I've done some performance profiling and found that for larger documents above 1MB it is more efficient to create new elements and attributes rather than modifying existing ones. Some of this has to do with all of the allocations that are made when LINQ to Xml invokes the event handlers on the XObject class: see XObject Class. These are fired, thus allocating memory, regardless of whether you have event handlers wired to them or not. I'll show you both patterns so you can decide what works for you. The steps to rewrite vary depending on which approach you decide upon.

Let's use the following xml as an example with the assumption that we want to rewrite these prefixes for both namespaces that are declared and we want to rewrite prefix a to aNew and prefix b to bNew:
XNamespace origANamespace = "http://www.a.com";
XNamespace origBNamespace = "http://www.b.com";
const string originalAPrefix = "a";
const string originalBPrefix = "b";

const string xml = @"
<a:foo xmlns:a='http://www.a.com' xmlns:b='http://www.b.com'>
  <b:bar/>
  <b:bar/>
  <b:bar/>
  <a:bar b:att1='val'/>
</a:foo>
""";

Creating a new XElement

There are three steps.
Step 1: Create a new XNamespace and point it to the namespace whose prefix should be rewritten
// Define new namespaces
XNamespace newANamespace = origANamespace;
XNamespace newBNamespace = origBNamespace;
Step 2: Create new namespace declaration attributes that use the XNamespaces you created in Step 1
XAttribute newANamespaceXmlnsAttr =
  new XAttribute(XNamespace.Xmlns + "aNew", newANamespace);
XAttribute newBNamespaceXmlnsAttr =
  new XAttribute(XNamespace.Xmlns + "bNew", newBNamespace);
Step 3: Create a XElement that you will use this new namespace and contains all of the elements, attributes, and children of the original XElement
// Create a new XElement with the new namespace  
XElement newElement = 
  new XElement(newANamespace + originalElement.Name.LocalName,
    newANamespaceXmlnsAttr,
    newBNamespaceXmlnsAttr,
    originalElement
      .Elements()
      .Select(e => 
        new XElement(
          e.Name,
          e.Attributes(),
          e.Elements(),
          e.Value)));

Modifying an existing XElement

There are four steps with the last step being optional depending if you want to remove the old namespace prefix declaration. THe first two steps are identical to the creating a new XElement approach.
Step 1: Create a new XNamespace that has the new prefix you want and point it to the namespace whose prefix should be rewritten
// Define new namespaces
XNamespace newANamespace = origANamespace;
XNamespace newBNamespace = origBNamespace;
Step 2: Create new namespace declaration attributes that use the XNamespaces you created in Step 1
XAttribute newANamespaceXmlnsAttr =
  new XAttribute(XNamespace.Xmlns + "aNew", newANamespace);
XAttribute newBNamespaceXmlnsAttr =
  new XAttribute(XNamespace.Xmlns + "bNew", newBNamespace);
Step 3: Modify the original XElement by adding the new namespace declaration attributes
originalElement.Add(
  newANamespaceXmlnsAttr,
  newBNamespaceXmlnsAttr);
Step 4 (optional): Remove the original namespace declaration that has now been rewritten
// now remove the original 'a' and 'b' namespace declarations
originalElement
  .Attribute(XNamespace.Xmlns + originalAPrefix)?
  .Remove();
originalElement
  .Attribute(XNamespace.Xmlns + originalBPrefix)?
  .Remove();
I hope you found this helpful. Drop a comment and let me know.

Share