The Black Box of .NET

Showing posts with label performance. Show all posts

January 12, 2025

What is LINQ to Xml Atomization?

Atomization is the process that the LINQ to Xml runtime uses to store and reuse string instances. Atomization means that if two XName objects have the same local name, and they're in the same namespace, they share the same instance. Likewise, if two XNamespace objects have the same namespace URI, they share the same instance. This optimizes memory usage and improves performance when comparing for equality of:

One XName instance to a different XName instance
One XNamespace instance to a different XNamespace instance

because the underlying intermediate language only has to compare references instead of string comparisons which would take longer. Being able to take advantage of this requires that both XName and XNamespace implement the equality and inequality operators - which of course they do. This concept is particularly relevant when working with XElement and XAttribute instances, where the same names might appear multiple times. XName and XNamespace also implement the Implicit Operator that converts strings to XName or XNamespace instances. This allows for automatically passing atomized instances as parameters to LINQ to Xml methods which then have better performance because of atomization. For example, this code implicitly passes an XName (with a value of "x") to the Descendants method:

var root = new XElement("root",
  new XElement("x", "1"),
  new XElement("y",
    new XElement("x", "1"),
    new XElement("x", "3")
  )
);

foreach (var e in root.Descendants("x").Where(e => e.Value == "1"))
{
  ...
}

Atomization is similar to string interning in .NET, where identical string literals are stored only once in memory. When LINQ to Xml processes XML data, it often encounters repeated element and attribute names. Instead of creating a new string object for each occurrence, it reuses existing string instances. It does this by caching the Name property in XName and XNamespace in an internal static XHashtable<WeakReference<XNamespace>>. You can find the source code here:

Practical Implications

So how does atomization affect my application?

Element and Attribute Names: When you create elements or attributes using LINQ to Xml, the names are atomized. For example, if you create multiple elements with the same name, LINQ to Xml will store the name only once.
Namespace Handling: Atomization also applies to XML namespaces. When you define namespaces in your XML, LINQ to XML ensures that each unique namespace URI is stored only once.
Value Atomization: While element and attribute names are atomized, the values are not automatically atomized. However, if you're feeling adventurous and you frequently use the same values, you might consider implementing your own caching mechanism to achieve similar benefits. Now before you go off and write your own caching mechanism to cache values, consider that the .NET team has done a lot of work ensuring the caching of names is both thread-safe and performant. I've never found that I needed to do this though the largest xml documents I've had to work with are in the tens of MB's. If you're using much larger xml documents of hundreds of MB's or even 1 GB+ in size, then you may find this worthwhile.

Pre-Atomization

You might be thinking "..this is great, I can't do anything to improve performance here!". But there is. Unfortunately, even though you effectively can pass atomized instances to LINQ to Xml methods, there is a small cost. This is because the Implicit Operator has to be invoked. You can refer to the Atomization Benchmarks at the end of this post to get the details on some benchmarking I did. In a nutshell, the results show that pre-atomizing is just over 2x faster. That being said, we are talking about a couple hundred nanoseconds in the context of my test for an element with 3 child alements, each having an attribute and string content. The benefit of pre-atomization becomes much more evident with very large XML documents. Here is the xml used for test data in the benchmarks:

<aw:Root xmlns:aw="http://www.adventure-works.com">
  <aw:Data ID="1">4,100,000</aw:Data>
  <aw:Data ID="2">3,700,000</aw:Data>
  <aw:Data ID="3">1,150,000</aw:Data>
</aw:Root>

The full code for the benchmarks can be found at this gist: Linq to Xml - XName Atomization Benchmark.cs

Atomization and ReferenceEquals

Let's take a look at XName as an example. There are two ways to directly create XName instances:

The XName.Get(String) or XName.Get(String, String) methods. See here
The XNamespace.Addition(XNamespace, String) Operator method. See here

They are also created indirectly by LINQ to Xml when you create XDocument, XElement, and XAttribute instances. Here is some code to demonstrate that there is a single atomized instance referred to regardless of how the instance was created.

// Note that these are not explicitly declared as const
string x = "element";
string y = "element";
var stringsHaveSameReference = object.ReferenceEquals(x, y);
Console.WriteLine(
  $"string 'x' has same reference as string 'y' (expect true): {stringsHaveSameReference}");

// Create new XName instances (indirectly) thru XElement ctor
// using a string as the name
var xNameViaGet1 = XName.Get(localName, namespaceUri);
var xNameViaGet2 = XName.Get(localName, namespaceUri);

// Check if XName instances are the same
namesHaveSameReference = object.ReferenceEquals(xNameViaGet1, xNameViaGet2);
Console.WriteLine($"xNameViaGet1 is same reference as xNameViaGet2 (expect true): {namesHaveSameReference}");

// Create XName instances via XNamespace.Addition Operator
XNamespace ns = namespaceUri;
XName xNameViaNSAddition1 = ns + localName;
XName xNameViaNSAddition2 = ns + localName;

// Check if XName instances are the same
namesHaveSameReference = object.ReferenceEquals(xNameViaNSAddition1, xNameViaNSAddition2);
Console.WriteLine(
  $"xNameViaNSAddition1 is same reference as xNameViaNSAddition2 (expect true): {namesHaveSameReference}");

// Create XElement and XAttribute using XName instances
XElement ele = new XElement(xNameViaGet1, "value1");
XAttribute attr = new XAttribute(xNameViaGet2, "value2");

// Check if XName instances in XElement and XAttribute are the same
namesHaveSameReference = object.ReferenceEquals(ele.Name, attr.Name);
Console.WriteLine(
  $"ele.Name is same reference as attr.Name (expect true): {namesHaveSameReference}");

// Compare XName references that were created differently
namesHaveSameReference = object.ReferenceEquals(xNameViaGet1, xNameViaNSAddition1);
Console.WriteLine(
  $"xNameViaGet1 is same reference as xNameViaNSAddition1 (expect true): {namesHaveSameReference}");
namesHaveSameReference = object.ReferenceEquals(xNameViaGet1, ele.Name);
Console.WriteLine(
  $"xNameViaGet1 is same reference as ele.Name (expect true): {namesHaveSameReference}");

// Create 2 XElement instances with the same name of 'root' and same value
XElement eleViaCtor = new XElement("root", "value");
XElement eleViaParse = XElement.Parse("<root>value</root>");

// Note that the 2 XElements DO NOT have the same reference
bool xelementsHaveSameReference = object.ReferenceEquals(eleViaCtor, eleViaParse);
Console.WriteLine(
  $"eleViaCtor and eleViaParse refer to same instance: {xelementsHaveSameReference}");

// However, their respective XName properties DO have the same reference
namesHaveSameReference = object.ReferenceEquals(eleViaCtor.Name, eleViaParse.Name);
Console.WriteLine($"eleViaCtor.Name and eleViaParse.Name refer to same instance: {namesHaveSameReference}");

Atomization Benchmarks

These tests compared creating XElement and XAttribute instances by either passing a string or an XName instance to the constructors. Note that the memory allocations were identicial since they are all referring to the same XName instance.

Method                            | Mean     | Error   | StdDev  | Ratio | Gen0   | Allocated | Alloc Ratio |
----------------------------------|---------:|--------:|--------:|------:|-------:|----------:|------------:|
XNode_Construction_Passing_Strings| 355.9 ns | 3.69 ns | 3.45 ns |  1.00 | 0.0391 |     656 B |        1.00 |
XNode_Construction_Passing_XName  | 136.0 ns | 1.95 ns | 1.82 ns |  0.38 | 0.0391 |     656 B |        1.00 |

The full code for the benchmarks can be found at this gist: Linq to Xml - XName Atomization Benchmark.cs

May 18, 2012

Why you should use ReadOnlyCollection<T>

Many people believe that you should be using List<T> to return collections from a public API. However, there are several reasons not to do so; instead return a ReadOnlyCollection<T>. The are several benefits of using a ReadOnlyCollection<T>:

The *collection itself* cannot be modified - i.e. you cannot add, remove, or replace any item in the collection. So, it keeps the collection itself intact. Note that a ReadOnlyCollection<T> does not mean that the elements within the collection are "readonly"; they can still be modified. To protect the items in it, you’d probably have to have private “setters” on any properties for the reference object in the collection – which is generally recommended. However, items that are declared as [DataMember] must have both public getters and setters. So if you are in that case, you can’t do the private setter, but you can still use the ReadOnlyCollection<T>. Using this as a Design paradigm can prevent subtle bugs from popping up. One case that comes to mind is the case of a HashSet<T>. It is recommended to avoid having mutable properties on class or struct types that are going to be used in any collection that utilizes the types hash code to refer to an object of that type - e.g. HashSet<T>,
Performance: The List<T> class is essentially a wrapper around a strongly-typed array. Since arrays are immutable (cannot be re-sized), any modifications made to a List<T> must create a complete copy of the List<T> and add the new item. Note that in the case of value types, a copy of the value type is created whereas, in the case of reference types, a copy of the reference to the object is created. The performance impact for reference types is negligible. The initial capacity for a List<T> is 4 unless specified thru the overloaded constructor - see here. https://learn.microsoft.com/en-us/dotnet/api/system.runtime.interopservices.collectionsmarshal.getvaluereforadddefault Obviously, this can be very bad for memory and performance. Don’t forget that not only the items directly within the List<T> itself are copied – but every value and object with the entire object graph for that reference type. This could easily contain other references types, collections, etc. This is why it is recommended to create/initialize a List<T> instance with the overloaded constructor which takes an int denoting the size of the List<T> to be created when possible. This can often be done since you are typically iterating on a "known" size value at runtime. For example, creating a List<T> of objects from a "Repository or DataService" method may iterate/read from a IDataReader object which has a property of RecordsAffected. If you are going to be putting an item in the List<T> based on the number of times the IDataReader is read: e.g. while(reader.Read()) you can easily create the list like so:

if (reader != null && reader.RecordsAffected > 0)
{
    // initialize the list with a pre-defined size
    List<Foo> someList = new List<Foo>(reader.RecordsAffected);
    while (reader.Read())
    {
         someList.Add(...);
         ...
    }
}

Just as a side note, every time that a List<T> needs to grow in size (its new size would exceed its current Capactity, the size is *doubled*. So, if you happen to add just one more item to a List<T> that wasn’t constructed with a pre-determined size, and the List<T> expands to accommodate the new item, there will be a whole lot of unused space at the end of the List<T> even though there aren’t any items in it – bad for memory and performance.