Sunday, May 29, 2011

C#: IEnumerable get Duplicates (Select Non-Distinct with LINQ)

Saw this page: Raucous Random Ramblings: Select Non-Distinct with Linq!

[Updated 2012-06-12]

Not too shabby.. FWIW: the GetDuplicates() from the raucous random ramblings post returns all items it considers are duplicates.

This is different than my version, which only returns items after they are found in the list. I consider my version better because: if I existed and someone made a clone of me (i.e. 'duplicated'), then I would not consider myself contained in the set of the duplicates. :) But I can understand someone using both functions for different purposes.)

Here is my version for getting duplicates from an enumerable list.
Better? You try both and then let me know what your results are..


/// <summary>
///   Returns duplicate items found in the <see cref = "sequence" />.
/// </summary>
public static IEnumerable<T> Duplicates<T>( this IEnumerable<T> sequence ) {
if ( null == sequence ) {
throw new ArgumentNullException( "sequence" );
}
var set = new HashSet<T>();
return sequence.Where( item => !set.Add( item: item ) );
}

In the testing I slimmmed the two foreach down to "return results.SelectMany( @group => @group )" per Resharper's suggestion.

Examples for the set of
    {1,2,3,4,5,6,7,8,9,6,4,2,2,4,6}
  • Duplicates() returns {6,4,2,4,4,6}
  • GetDuplicates() returns {2,2,2,4,4,4,6,6,6}
Speed tests!
I threw together a function timer with a massive UInt64[1048576] to check for duplicates.
Letting the tests run for a few minutes (over the same set of data)...
  • Duplicates() processes at 3338 units every millisecond.
  • GetDuplicates() processes at 1060 units every millisecond.
Remember: the test results are going to vary under different conditions, of course.
Ignoring the duplicate duplicates that both functions return with a .Distinct() did not make a noticeable timing impact.
Increasing the count of duplicates in the test data increased the speed of both function.
I do like the result set my function returns, and it does perform 3X faster than the raucous post..

1 comment :

waterfoul said...

I believe this results in a worst case of O(n^2) and a best case of O(nlog(n)) which is worse than the previous post which was O(2n) (or O(n) as constants are irrelevant) which is much faster