Sunday, March 25, 2012

Optimizing C# String Performance


Strings in C# are highly optimized but also potentially very wasteful. They give programmers a safe, fast way to handle character data. However, there are a few tricks you need to know about strings and memory if you want to write efficient code. Without this information, you could easily write code that squanders both memory and clock cycles.

Sharing Memory

To understand C# strings, you need to understand the answer to one fairly simple question. Suppose you have two string variables called MyString1 and MyString2. How can you get them both to point at the same place in memory? The goal here is not just to have two strings that contain the same value, but to have two string variables that reference a single block of memory that contains a string.
It turns out that the answer to this question is very simple and intuitive. The reasons behind the answer, however, are less obvious. Understanding those reasons will give you the power to write code that is fast and efficient.
This post emerged from a thread on the C# forum. As often happens, I learned something in the course of the discussion. I've attempted to repackage that information and present it here in this post. The post begins with a look at Strings and StringBuilders, but the focus quickly switches to an exploration of how the String class handles memory.

Strings vs. StringBuilders

C# Strings are immutable. This means you can't modify an existing string. If you try to change it with the concatenation operator, or with the Replace, Insert, PadLeft, PadRight, or Substring methods, then you end up with an entirely new string and the original is left untouched. As a result, the operations you perform on a String frequently cause a new allocation of memory.
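As a minimal sketch of what immutability means in practice, the following program calls Replace on a string and then checks that the original still holds its old value and that the result is a separate object. The class and method names (ImmutabilityDemo, ReplaceLeavesOriginalAlone) are made up for this illustration.

```csharp
using System;

public static class ImmutabilityDemo
{
    // Returns true when Replace produced a separate string object
    // and the original string kept its value.
    public static bool ReplaceLeavesOriginalAlone()
    {
        string original = "foo data";
        string changed = original.Replace("foo", "bar");

        return original == "foo data"
            && changed == "bar data"
            && !object.ReferenceEquals(original, changed);
    }

    public static void Main()
    {
        Console.WriteLine(ReplaceLeavesOriginalAlone()); // True
    }
}
```

The variable original never changes; Replace hands back a second string, and it is up to the caller to keep or discard it.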
Allocations of memory are costly in terms of both memory and performance. As a result, there are times when you don't want to use the String class.
Developers who want to work with a single string and take it through an arbitrary number of changes in a loop can use the StringBuilder class. The StringBuilder class has many of the same methods as the String class. You can, however, change the contents of a StringBuilder without allocating a new object for every change, because it edits an internal buffer that grows only when it runs out of room. This means that in certain situations the StringBuilder class will be much faster than the String class. In other situations, however, the opposite will be true.
What's a developer to do? The String class is highly optimized and very efficient in most cases. However, if you need to modify a string then the String class tends to be a bit wasteful of resources. How concerned should developers be about this problem? How often should they abandon the String class and use StringBuilder? The answer, as it turns out, is "not very often."
You should only use the StringBuilder class if you need to modify a single string many times in a loop, or in a relatively small section of code. To fully understand why this is the case, you need to understand just how smart the String class can be when it comes to handling memory in typical programming scenarios.
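Here is a rough sketch of the loop scenario described above, written both ways. The names BuilderDemo, JoinWithConcat, and JoinWithBuilder are invented for this example, and no timings are claimed; the point is only that the two approaches produce the same text while allocating very differently.

```csharp
using System;
using System.Text;

public static class BuilderDemo
{
    // Repeated concatenation: every += creates a brand new string
    // and copies the previous contents into it.
    public static string JoinWithConcat(int count)
    {
        string result = "";
        for (int i = 0; i < count; i++)
        {
            result += i + ",";
        }
        return result;
    }

    // StringBuilder appends into its internal buffer; the final
    // string is created once, in ToString().
    public static string JoinWithBuilder(int count)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < count; i++)
        {
            sb.Append(i).Append(',');
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(JoinWithConcat(5));                       // 0,1,2,3,4,
        Console.WriteLine(JoinWithConcat(5) == JoinWithBuilder(5)); // True
    }
}
```

For five iterations the difference is irrelevant. It only starts to matter when the loop runs many times, which is the one case where StringBuilder is worth reaching for.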

What Makes a C# String Sharp?

The big win for Strings is the tricks they perform to limit unnecessary memory allocations. Look at this code:
   1:  using System;
   2:  using System.Collections.Generic;
   3:  using System.Text;
   4:   
   5:  namespace CSharpConsoleApplication3
   6:  {
   7:      class Program
   8:      {
   9:          static void Main(string[] args)
  10:          {
  11:              String foo = "foo data";
  12:              String bar = foo;
  13:              Console.WriteLine(ReferenceEquals(foo, bar));
  14:              Console.WriteLine(foo.Equals(bar));
  15:              foo = "a";
  16:              Console.WriteLine(foo.Equals(bar));
  17:              Console.WriteLine(ReferenceEquals(foo, bar));
  18:              String goober1 = "foo";
  19:              String goober2 = "foo";
  20:              Console.WriteLine(ReferenceEquals(goober1, goober2));
  21:          }
  22:      }
  23:  }
The goal of getting two string variables to reference the same memory is achieved in lines 11 - 12. In this case, both foo and bar point at the same place in memory. To check, call the ReferenceEquals method. (The == operator will not do here: for strings it compares values, not references.) In this code, the call to ReferenceEquals prints True in line 13. We can also call the Equals method (line 14) of the String class to see that the two strings are equal in that they both have the same value. That is, they both point at the eight characters that spell "foo data".
Now change the value of foo, as we do in line 15. A C/C++ programmer might then expect that both foo and bar would still reference the same memory, and hence both have the value "a". This is not the case. Lines 16 and 17 both print False. The assignment did not alter the string that foo referenced; it pointed foo at a different string, "a", and left bar referencing "foo data". Intuitively, this is what we would expect. It's only our "deeper understanding" of computer languages that makes us see this as odd.
The final twist in this saga is that line 20 also prints True. Here we have assigned two separate, but identical, string literals to two different variables. Our expectation is that these two variables should not point at the same place in memory. But line 20 shows that they do reference the same block of memory.
The .NET runtime maintains something called an "intern table." This is a table that holds a single reference to each unique literal string in the program, plus any strings that have been interned explicitly. When string literals are loaded, as in lines 18 and 19, the intern table is checked. If your string is already in there, then both variables will point at the same block of memory maintained by the intern table. The string is not duplicated. Note that this applies to literals: a string built at run time, for example by concatenating variables, is not added to the table unless you ask for it with String.Intern. Again, sharing literals is intuitively what we want, but our understanding of computers makes us think that this is not what will happen. C# tries to conform to what we would intuitively expect to happen, not to what we think a computer is likely to do.
Some of the details of the intern table are discussed in this reference to the String.Intern method.
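To make the difference between a literal and a string built at run time concrete, here is a small sketch. It assumes the default behavior of a typical .NET runtime, where literals are interned when they are loaded; the names InternDemo and BuildFoo are made up for this example.

```csharp
using System;

public static class InternDemo
{
    // Builds "foo" at run time from a char array, so the result is
    // a separate object rather than the interned literal.
    public static string BuildFoo()
    {
        return new string(new[] { 'f', 'o', 'o' });
    }

    public static void Main()
    {
        string literal = "foo";
        string built = BuildFoo();

        // == compares the characters, so the two strings are equal.
        Console.WriteLine(literal == built);                          // True
        // ReferenceEquals compares the objects, and these are two objects.
        Console.WriteLine(object.ReferenceEquals(literal, built));    // False

        // String.Intern looks the value up in the intern table and
        // returns the reference stored there.
        string interned = string.Intern(built);
        Console.WriteLine(object.ReferenceEquals(literal, interned)); // True
    }
}
```

This is also a reminder of why ReferenceEquals, rather than ==, is the right tool when the question is whether two variables share memory.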

Summary

This post explains a little bit about how C# handles memory allocations for the String class. Knowing this information is helpful if you want to write optimized code. It is also interesting information that intrigues us in part because it explains one small corner of the great wonder that is the C# language.
How important is it that one understands this information? That depends. For some people, it will be information they use every day. For others, it is just background noise. Writing safe, error-free code is my most important task. Once that is accomplished, I like to find time to work on optimization issues like those outlined here.
