Once again we consider some of the lesser-known classes and keywords of C#. Today we will be looking at two set implementations in the System.Collections.Generic namespace: HashSet<T> and SortedSet<T>. Even though most people think of sets as purely mathematical constructs, these classes are very useful and can make your application more performant when used appropriately.
For more of the "Little Wonders" posts, see the index here.
A Background From Math
In mathematical terms, a set is an unordered collection of unique items. In other words, the set {2,3,5} is identical to the set {3,5,2}, and the set {2,2,4,1} would be invalid because it contains a duplicate item (2). You can also perform set arithmetic on sets, such as the operations below (sketched in code after the list):
Intersections:
The intersection of two sets is the collection of elements common to both.
Example: The intersection of {1,2,5} and {2,4,9} is the set {2}.
Unions:
The union of two sets is the collection of unique items present in either or both set.
Example: The union of {1,2,5} and {2,4,9} is {1,2,4,5,9}.
Differences:
The difference of two sets is the removal of all items from the first set that are common between the sets.
Example: The difference of {1,2,5} and {2,4,9} is {1,5}.
Supersets:
One set is a superset of a second set if it contains all elements that are in the second set.
Example: The set {1,2,5} is a superset of {1,5}.
Subsets:
One set is a subset of a second set if all of its elements are contained in the second set.
Example: The set {1,5} is a subset of {1,2,5}.
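As a quick illustration, here is a minimal sketch of how these operations look on HashSet<T>, using the example sets above (note that IntersectWith, UnionWith, and ExceptWith modify the set in place):

using System;
using System.Collections.Generic;

public static class SetMathExamples
{
    public static void Main()
    {
        // intersection: {1,2,5} and {2,4,9} -> {2}
        var intersection = new HashSet<int> { 1, 2, 5 };
        intersection.IntersectWith(new[] { 2, 4, 9 });
        Console.WriteLine(string.Join(",", intersection));    // 2

        // union: {1,2,5} and {2,4,9} -> {1,2,4,5,9}
        var union = new HashSet<int> { 1, 2, 5 };
        union.UnionWith(new[] { 2, 4, 9 });
        Console.WriteLine(string.Join(",", union));            // 1,2,5,4,9 (no guaranteed order)

        // difference: {1,2,5} less {2,4,9} -> {1,5}
        var difference = new HashSet<int> { 1, 2, 5 };
        difference.ExceptWith(new[] { 2, 4, 9 });
        Console.WriteLine(string.Join(",", difference));       // 1,5

        // superset and subset checks
        Console.WriteLine(new HashSet<int> { 1, 2, 5 }.IsSupersetOf(new[] { 1, 5 }));    // True
        Console.WriteLine(new HashSet<int> { 1, 5 }.IsSubsetOf(new[] { 1, 2, 5 }));      // True
    }
}

The same methods are defined on the ISet<T> interface, so they work identically on SortedSet<T>.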
If We’re Not Doing Math, Why Do We Care?
Now, you may be thinking: why bother with the set classes in C# if you have no need for mathematical set manipulation? The answer is simple: they are extremely efficient ways to determine membership in a collection.
For example, let’s say you are designing an order system that tracks the price of a particular equity and triggers an order once it reaches a certain point. Now, since there are tens of thousands of equities on the markets, you don’t want to track market data for every ticker, as that would waste time and processing power on symbols you don’t have orders for. Thus, we want to subscribe to the market data for an order’s stock symbol only if it is a symbol we are not already subscribed to.
Every time a new order comes in, we will check the list of subscriptions to see if the new order’s stock symbol is in that list. If it is, great, we already have that market data feed! If not, then and only then should we subscribe to the feed for that symbol.
So far so good: we have a collection of symbols, and we want to see if a symbol is present in that collection and, if not, add it. This really is the essence of set processing, but for the sake of comparison, let’s say you use a List<T> instead:
// class that handles our order processing service
public sealed class OrderProcessor
{
    // contains the list of all symbols we are currently subscribed to
    private readonly List<string> _subscriptions = new List<string>();

    ...
}
Now whenever you are adding a new order, it would look something like:
public PlaceOrderResponse PlaceOrder(Order newOrder)
{
    // do some validation, of course...

    // check to see if already subscribed, if not add a subscription
    if (!_subscriptions.Contains(newOrder.Symbol))
    {
        // add the symbol to the list
        _subscriptions.Add(newOrder.Symbol);

        // do whatever magic is needed to start a subscription for the symbol
    }

    // place the order logic!
}
What’s wrong with this? In short: performance! Finding an item inside a List<T> is a linear, O(n), operation, which is not a very performant way to check whether an item exists in a collection.
(I used to teach algorithms and data structures in my spare time at a local university, and when the topic of big-O notation came up you could immediately see eyes glossing over, as if it were pure, useless theory that would never apply in the real world. I did, and still do, believe it is something worth understanding well to make the best choices in computer science.)
Let’s think about what that means: a linear operation is one whose running time tends to grow in proportion to the number of items. Put crudely, if you double the collection size, you can expect the operation to take roughly twice as long.
Linear operations tend to be bad for performance because they mean that to perform some operation on a collection, you must potentially “visit” every item in it. Consider finding an item in a List<T>: to see if the list has an item, you must potentially check every element before you either find it or determine it is not there.
Now, we could of course sort our list and then perform a binary search on it, but sorting typically has linearithmic complexity, O(n log n), and may require temporary storage, so re-sorting after each add would likely cost more time than it saves.
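For completeness, here is a minimal sketch of that approach, using a hypothetical SubscribeIfNew helper and assuming _subscriptions is kept sorted at all times rather than re-sorted: List<T>.BinarySearch returns the item’s index when it is found, and the bitwise complement of the correct insertion point when it is not.

    // assumes _subscriptions is kept sorted at all times (a sketch, not a drop-in replacement)
    private readonly List<string> _subscriptions = new List<string>();

    private bool SubscribeIfNew(string symbol)
    {
        // O(log n) lookup in a sorted list
        int index = _subscriptions.BinarySearch(symbol);
        if (index >= 0)
        {
            return false;   // already subscribed
        }

        // ~index is the insertion point that keeps the list sorted,
        // but Insert() is still O(n) because later items must shift down
        _subscriptions.Insert(~index, symbol);
        return true;
    }

Even with the O(log n) lookup, every insert still pays an O(n) shifting cost, which is exactly the kind of overhead a hash-based set avoids.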
As an alternative, we could use a SortedList<TKey, TValue>, which keeps its items sorted on every Add(), but this has a similar cost to shift the items, and it also requires both a key and a value even though, in our case, the key is the value.
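To make that key/value redundancy concrete, here is a sketch of the same hypothetical helper written against a SortedList<string, string>:

    // a sketch of the SortedList approach; note the key and the value are the same string
    private readonly SortedList<string, string> _subscriptionList = new SortedList<string, string>();

    private bool SubscribeIfNew(string symbol)
    {
        // ContainsKey() is O(log n) because the keys are kept sorted
        if (_subscriptionList.ContainsKey(symbol))
        {
            return false;
        }

        // the symbol is stored twice, once as key and once as value,
        // and the insert is still O(n) because the underlying arrays must shift
        _subscriptionList.Add(symbol, symbol);
        return true;
    }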
This is why sets tend to be the best choice for this type of processing: they don’t need separate keys and values (so they save space), and they typically don’t care about ordering (so they tend to be extremely performant).
The .NET BCL (Base Class Library) has had HashSet<T> since .NET 3.5, but at that time it did not implement the ISet<T> interface. As of .NET 4.0, HashSet<T> implements ISet<T>, and a new set, SortedSet<T>, was added that gives you a set with ordering.
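To close the loop on the earlier example, here is a minimal sketch of the subscription check from the hypothetical OrderProcessor above, rewritten around a HashSet<T>. HashSet<T>.Add() returns false when the item is already present, so the separate Contains()-then-Add() pair collapses into a single average-case O(1) call:

    // contains the set of all symbols we are currently subscribed to
    private readonly HashSet<string> _subscriptions = new HashSet<string>();

    ...

    // inside PlaceOrder(): Add() returns true only if the symbol was not already in the set,
    // and the lookup is O(1) on average instead of O(n)
    if (_subscriptions.Add(newOrder.Symbol))
    {
        // do whatever magic is needed to start a subscription for the symbol
    }

If you also need the symbols kept in sorted order (say, for display), SortedSet<T> offers the same Add() semantics with O(log n) operations instead.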