Folks, I’m ticked. I’ve recently lost several hours to correcting some third-party Java code that did a terrible job of managing object allocation, and therefore used a lot of memory in a very short period of time. Apparently there are some programmers out there who aren’t aware that memory allocation remains a concern even in a garbage-collecting runtime like the JVM. While Java may not require you to think about every byte allocated and deallocated, sparing a couple of moments to think about memory can yield tremendous performance benefits.
In this case, string handling caused my woes. One of the Java community’s truisms runs like this:
Using the “+” operator to concatenate strings is inefficient. Use StringBuffer and .append() your strings instead.
Great! So instead of
String message = "The value " + someObject.getValue() + " is out of range";
we get
String message = new StringBuffer("The value ").append(someObject.getValue()).append(" is out of range").toString();
Some programmers would feel pretty good about that, but hardly anybody knows why. In JDK 1.4 and prior, the Javadoc for java.lang.StringBuffer contained this helpful little note:
String buffers are used by the compiler to implement the binary string concatenation operator +. For example, the code:
x = "a" + 4 + "c"
is compiled to the equivalent of:
x = new StringBuffer().append("a").append(4).append("c").toString()
For every statement that includes string concatenation, the runtime allocates a new StringBuffer, calls the appropriate number of append()s, and then calls the toString() method! The JDK 1.5 Javadoc for StringBuffer no longer contains this insight; it must have been lost in the StringBuffer/StringBuilder reorganization. If any Sun Javadoctors are reading, put it back. Please.
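To make the Javadoc note concrete, here is a minimal sketch that builds the same string both ways. The class and method names are hypothetical, made up for illustration. It only shows that the two forms produce the same result; it does not prove what any particular compiler emits. Newer compilers use StringBuilder or other strategies rather than StringBuffer, so treat the explicit chain as the JDK 1.4-era expansion described in the quote.

```java
// Hypothetical demo class; the names are illustrative, not from any library.
class ConcatEquivalence {

    // Concatenation with the + operator, as in the Javadoc example.
    static String viaOperator() {
        int n = 4; // a variable, so the expression is not a compile-time constant
        return "a" + n + "c";
    }

    // The explicit expansion quoted from the JDK 1.4 Javadoc.
    static String viaStringBuffer() {
        int n = 4;
        return new StringBuffer().append("a").append(n).append("c").toString();
    }

    public static void main(String[] args) {
        System.out.println(viaOperator());     // prints a4c
        System.out.println(viaStringBuffer()); // prints a4c
    }
}
```

Either way you pay for one buffer and one toString() per statement, which is harmless for a single message and painful inside a loop.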
The default constructor gives the StringBuffer an initial capacity of 16 characters. If a String argument is supplied, the starting capacity is the length of the argument plus 16 characters. That means the string-addition version will need to allocate a new character array a bit sooner than the explicit call to StringBuffer, but otherwise they’re the same. Also, as the guys over at 0xCAFEBABE point out, the runtime is not smart enough to reuse StringBuffer instances, so if the string concatenation happens in a loop, you’ll throw off all kinds of StringBuffers for garbage collection. Not to mention the fact that you probably don’t need a String until after you exit the loop, so calling toString() at every iteration is a total waste.
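A small sketch of both points: the starting capacities, and concatenation in a loop versus a single reused buffer. The class and method names are hypothetical, and the sketch makes no attempt to measure anything. It only shows the two shapes of code side by side.

```java
// Hypothetical demo class; names are made up for illustration.
class BufferCapacityDemo {

    // Capacity of a buffer created with the default constructor.
    static int defaultCapacity() {
        return new StringBuffer().capacity();
    }

    // Capacity of a buffer seeded with a String: its length plus 16.
    static int seededCapacity(String seed) {
        return new StringBuffer(seed).capacity();
    }

    // The wasteful shape: every += builds a temporary buffer and a new String.
    static String joinWithPlus(String[] parts) {
        String result = "";
        for (int i = 0; i < parts.length; i++) {
            result += parts[i];
        }
        return result;
    }

    // The cheaper shape: one buffer for the whole loop, one toString() at the end.
    static String joinWithBuffer(String[] parts) {
        StringBuffer buf = new StringBuffer();
        for (int i = 0; i < parts.length; i++) {
            buf.append(parts[i]);
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        System.out.println(defaultCapacity());            // prints 16
        System.out.println(seededCapacity("The value ")); // prints 26
        String[] parts = { "a", "b", "c" };
        System.out.println(joinWithPlus(parts).equals(joinWithBuffer(parts))); // prints true
    }
}
```

Both join methods return the same String. The difference is how much garbage they leave behind on the way there.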
That describes the situation I ran into. I was using a flat-file JDBC driver to reload a handful of reference tables from a set of text files, each of which contained hundreds of thousands of records. The task took far longer than I expected, and when I turned the memory profiler on I saw that the JVM memory allocation had grown by hundreds of megabytes, which was a couple orders of magnitude larger than the files themselves.
After a little quality time with a memory profiler I found this pattern in the String[] parseLine(String line) method:
String value = "";
Vector values = new Vector();
while (currentPos < line.length()) {
    char currentChar = line.charAt(currentPos);
    if ([inside a quoted string]) {
        // each += allocates a temporary StringBuffer and a brand new String
        value += currentChar;
        currentPos++;
    } else if ([done with quoted string]) {
        values.add(value);
        value = "";
    }
    [...]
}
...which manages to violate all of the rules of good StringBuffer usage in just a few lines of code.
I changed that to
// sized to the whole line, so the buffer never has to grow
StringBuffer value = new StringBuffer(line.length());
Vector values = new Vector();
while (currentPos < line.length()) {
    char currentChar = line.charAt(currentPos);
    if ([inside a quoted string]) {
        value.append(currentChar);
        currentPos++;
    } else if ([done with quoted string]) {
        values.add(value.toString());
        // empty the buffer so it can be reused for the next field
        value.replace(0, value.length(), "");
    }
    [...]
}
The changed code allocates a single StringBuffer per line to hold the work in progress and reuses that object for every field, instead of allocating a new one for each character appended. Making the StringBuffer's initial size equal to the line length gives us more room than we will need, but is still far more efficient than creating a new buffer for each field in each line. This technique dramatically reduced the memory required to reload the reference values and cut the run time by an order of magnitude by avoiding pauses for garbage collection on the somewhat memory-limited host.
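For anyone who wants to try the buffer-reuse pattern, here is a self-contained sketch. It is not the driver's actual code. The class name, the comma delimiter, and the simple double-quote rule (no escaped quotes) are all my assumptions, standing in for the parts of the original method that are elided above. It uses setLength(0) to empty the buffer, which has the same effect as replacing the whole contents with an empty string.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the driver's parser; the format rules are assumptions.
class LineParser {

    // Splits a line on commas, treating text inside double quotes as one field.
    static String[] parseLine(String line) {
        // One buffer per line, sized so it never needs to grow.
        StringBuffer value = new StringBuffer(line.length());
        List<String> values = new ArrayList<String>();
        boolean inQuotes = false;

        for (int pos = 0; pos < line.length(); pos++) {
            char c = line.charAt(pos);
            if (c == '"') {
                inQuotes = !inQuotes;         // entering or leaving a quoted field
            } else if (c == ',' && !inQuotes) {
                values.add(value.toString()); // field complete: one String per field
                value.setLength(0);           // reuse the same buffer for the next field
            } else {
                value.append(c);
            }
        }
        values.add(value.toString());         // last field has no trailing comma
        return values.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] fields = parseLine("a,\"b,c\",d");
        for (int i = 0; i < fields.length; i++) {
            System.out.println(fields[i]);    // prints a, then b,c, then d
        }
    }
}
```

The only Strings created are the finished fields, which is the point of the exercise: the work in progress lives in one buffer for the whole line.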
This little exercise demonstrates that no matter how advanced the Java Enterprise APIs may get, focusing on the basics can reap big rewards. So please, all you Java programmers out there, take a minute to examine your String usage and save yourself some trouble down the road.
Interesting. How does this make you feel about the standard injunction against early optimization? Which costs more, your one episode of quality time with the profiler, or a hypothetical global effort to write all string concatenations in the most efficient way?
Ah, just noticed the key phrase “third-party Java code.” Man that sucks.
Yeah, it was a bummer, but at least I had the source. Using the “right” string manipulation technique is essentially free (it’s not really much more difficult than the lazy way) so it’s mostly a matter of education.
I think in general working with a language like Java that abstracts some of the memory management concerns is a productivity enhancer, with the caveat that the programmer still needs to understand what’s going on inside the machine. I’m not as much of a curmudgeon as Joel (on Software) Spolsky, but situations like this reveal the gap between “programmer” and “developer.”