Remove Duplicates from a List
Thanks to Brian Kuhn (refer title of this post) for the following piece of code to to make a List contain unique items only.
On a side thought, would this chew up lot of cpu/memory? Feel free to post your comments.
public List removeDuplicates(List items) {
Set set = new LinkedHashSet();
set.addAll(items);
return new ArrayList(set);
}
[Resource-Type: Source Code; Category: Java/J2SE; XRating: 4]








9 Comments:
I don't think this would use up any more memory than you are already using. The items in the List are already created, and you are not duplicating them, you are just will have new references to the objects in the set. Of course this will use CPU to do the equals comparison. Look at the source for addAll() and see if you can improve upon it, but I would hope the code is already pretty efficient.
Rich C
3:36 PM
Simpler is
return new ArrayList(new LinkedHashSet(items));
As for performance, i have the feeling it wouldn't be a lot slower than any hand optimised implementation.
Perhaps you could do a benchmark and post some results.
6:24 PM
It looks to me like it would be fairly efficient. Insertion would be O(n), recreating the arraylist would be O(n), memory usage would be a bit over having another copy of the arraylist.
1:17 AM
Specifying the initial capacity might help.
Set set = new LinkedHashSet(items.size());
You're never going to have more elements than this, and Java can grab the memory all at once, rather than whenever the set's capacity is reached.
2:52 AM
At first glance it appears efficient. But a LinkedHashSet is built on a LinkedHashMap, and that creates one map entry (an object) for every item added to the map/set.
Clearly, there is some overhead in creating all these extra objects and gcing them. Whether its significant to you depends on whether you are memory/cpu constrained.
1:26 PM
Hi! Sorry, to comment it here, i could not find any other way ( Perhaps i'm lame :-) I just wanted to say that is linked you out :-) Bye.
7:41 AM
This is final for the same duplicate items (e.g. String objects etc..)
If we have item as ref type and duplication criteria in buried in the reffield.someValue, what could be the solution?
Any idea?
Thanks,
Nirav
6:38 AM
Here's a similar trick for removing duplicate items in place:
public void removeDuplicates(Collection items){
HashSet found = new HashSet();
Iterator it = items.iterator();
while (it.hasNext(){
if (!found.add(it.next()) it.remove();
}
}
This has the advantage that you can filter based on uniqueness of any property you like - just change what you put into the HashSet. It also has somewhat better space and time efficiency.
7:25 PM
If you are looking to use a minimum of processing power but have a lot of extra RAM, this would be a good algorithm to use (as opposed to just creating another list and adding each element to it after searching through the whole (second) list to make sure that it's not a duplicate). However, if you are low on RAM, this is not a good solution for the following reasons:
1. A LinkedHashSet uses all the memory of a HashSet PLUS a doubly linked list.
2. The HashSet "part" of the LinkedHashSet is backed on a LinkedHashMap, creating an unnecessary "value" mapping for each element.
3. The creation of the ArrayList at the end uses another chunk of memory (This is added to the memory usage of the set, since the set cannot be garbage-collected until the list has finished adding all of its elements (and the scope has ended)).
Considering solely the doubly-linked-list portion and the unnecessary "value" mappings, the LinkedHashSet uses 4x the amount of memory as an ArrayList. However, since it is also a HashSet, it has additional bucket references, binary tree pointers within the buckets, etc., this probably raises the memory usage to >5x that of an ArrayList. The "peak" memory usage would be at the very end, when there is 1. the original collection, 2. the LinkedHashSet, 3. the ArrayList used to hold the unique elements, for a total of >7x the memory of the original collection by itself. So, memory-wise, it's not that efficient and could lead to problems for large amounts of data, but processing-time-wise, at least according to its Big-Oh notation, it is pretty efficient.
Maciver's solution removes the doubly-linked-list part of the memory usage problem. Although for the his method to operate under the same contract as the first method presented that creates a new list holding all the unique elements, a duplicate list would need to be created, using up a bit more memory, this does remove an amount of element references equal to the size of the original collection that were used as references in the linked-list portion of the LinkedHashSet in the original method.
3:50 AM
Post a Comment
<< Home