Saturday 21 October 2017

String Deduplication feature of Java 8



Since a String in Java (as in other languages) consumes a lot of memory because each character takes two bytes, Java 8 has introduced a new feature called String Deduplication. It takes advantage of the fact that the char arrays are internal to strings and final, so the JVM can mess around with them.



I have read this example (https://blog.codecentric.de/en/2014/08/string-deduplication-new-feature-java-8-update-20-2/) so far, but since I am not a pro Java coder, I am having a hard time grasping the concept.



Here is what it says:





Various strategies for String Duplication have been considered, but the one implemented now follows the following approach: Whenever the garbage collector visits String objects it takes note of the char arrays. It takes their hash value and stores it alongside with a weak reference to the array. As soon as it finds another String which has the same hash code it compares them char by char. If they match as well, one String will be modified and point to the char array of the second String. The first char array then is no longer referenced anymore and can be garbage collected.




This whole process of course brings some overhead, but is controlled by tight limits. For example if a string is not found to have duplicates for a while it will be no longer checked.




My first question:



There is still a lack of resources on this topic, since it was only recently added in Java 8 update 20. Could anyone here share some practical examples of how it helps reduce the memory consumed by Strings in Java?



Edit:



The above link says:






As soon as it finds another String which has the same hash code it compares them char by char




My second question:



If the hash codes of two Strings are the same, then the Strings are already the same, so why compare them char by char once it is found that the two Strings have the same hash code?



Answer





Imagine you have a phone book, which contains people, which have a String firstName and a String lastName. And it happens that in your phone book, 100,000 people have the same firstName = "John".



Because you get the data from a database or a file, those strings are not interned, so your JVM memory contains the char array {'J', 'o', 'h', 'n'} 100 thousand times, one per John string. Each of these arrays takes, say, 20 bytes of memory, so those 100k Johns take up 2 MB of memory.
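To make the duplication concrete, here is a minimal sketch. The class and method names are made up for illustration, and the 20 bytes per array is the same rough figure used above, not a measured size. It builds the strings with `new String(char[])`, which is roughly what a parser or database driver ends up doing, and counts how many separate objects result.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

class DuplicateJohns {
    // Builds `count` strings equal to "John". Each call to new String(char[])
    // copies the characters, so every element is a separate String object.
    static List<String> loadFirstNames(int count) {
        char[] raw = {'J', 'o', 'h', 'n'};
        List<String> names = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            names.add(new String(raw));
        }
        return names;
    }

    // Counts distinct String objects by identity (==), not by equals().
    static int distinctObjects(List<String> names) {
        Map<String, Boolean> seen = new IdentityHashMap<>();
        for (String s : names) {
            seen.put(s, Boolean.TRUE);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        List<String> names = loadFirstNames(100_000);
        System.out.println("all equal to \"John\": " + names.stream().allMatch("John"::equals));
        System.out.println("distinct objects: " + distinctObjects(names));
        // Illustrative figure only: 20 bytes per backing array.
        System.out.println("approx. bytes in arrays: " + 100_000L * 20);
    }
}
```

Every string is equal to "John", yet the list holds 100,000 distinct objects, each with its own copy of the same four characters.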



With deduplication, the JVM will realise that "John" is duplicated many times and make all those John strings point to the same underlying char array, decreasing the memory usage from 2 MB to 20 bytes.
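The real work happens inside the garbage collector and cannot be observed from plain Java code, so the following is only a toy model of the approach quoted in the question, not the JVM's implementation. `Str` is a hypothetical stand-in for `java.lang.String`, and `DedupModel.visit` plays the role of the collector visiting an object.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DedupModel {
    // Hypothetical stand-in for java.lang.String: an object wrapping a char array.
    static final class Str {
        char[] value;
        Str(String s) { value = s.toCharArray(); }
    }

    // Hash of the array contents -> weak references to arrays seen so far.
    private final Map<Integer, List<WeakReference<char[]>>> table = new HashMap<>();

    // Called for each Str the imaginary collector visits.
    void visit(Str s) {
        int hash = Arrays.hashCode(s.value);
        List<WeakReference<char[]>> bucket =
                table.computeIfAbsent(hash, h -> new ArrayList<>());
        for (WeakReference<char[]> ref : bucket) {
            char[] known = ref.get();
            // An equal hash is only a hint; confirm char by char before sharing.
            if (known != null && Arrays.equals(known, s.value)) {
                s.value = known; // the previous array is now unreachable
                return;
            }
        }
        bucket.add(new WeakReference<>(s.value));
    }

    public static void main(String[] args) {
        DedupModel gc = new DedupModel();
        Str a = new Str("John");
        Str b = new Str("John");
        System.out.println("shared before: " + (a.value == b.value)); // false
        gc.visit(a);
        gc.visit(b);
        System.out.println("shared after: " + (a.value == b.value)); // true
    }
}
```

On an actual Java 8u20+ JVM the feature is off by default and only works with the G1 collector; it is enabled with `-XX:+UseG1GC -XX:+UseStringDeduplication`.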



You can find a more detailed explanation in the JEP (JEP 192: String Deduplication in G1). In particular:




Many large-scale Java applications are currently bottlenecked on memory. Measurements have shown that roughly 25% of the Java heap live data set in these types of applications is consumed by String objects. Further, roughly half of those String objects are duplicates, where duplicates means string1.equals(string2) is true. Having duplicate String objects on the heap is, essentially, just a waste of memory.




[...]




The actual expected benefit ends up at around 10% heap reduction. Note that this number is a calculated average based on a wide range of applications. The heap reduction for a specific application could vary significantly both up and down.
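As for the second question: an equal hash code does not mean the strings are equal. A hash code is a 32-bit int, so different strings can collide, which is why a char-by-char comparison is still needed before two arrays are merged. The sketch below shows this with `String.hashCode()`; the hash the collector uses internally is an implementation detail, but the same reasoning applies. The class name is made up for illustration.

```java
class HashCollision {
    static boolean sameHash(String a, String b) {
        return a.hashCode() == b.hashCode();
    }

    public static void main(String[] args) {
        String a = "Aa";
        String b = "BB";
        // 'A'*31 + 'a' = 65*31 + 97 = 2112 and 'B'*31 + 'B' = 66*31 + 66 = 2112
        System.out.println("same hash: " + sameHash(a, b)); // true
        System.out.println("equal:     " + a.equals(b));    // false
    }
}
```

If the JVM merged arrays on the hash alone, "Aa" could silently turn into "BB".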


