14 Aug 2009

OpenJDK Optimisation Project

Author: Edward Nevill

This page describes the work I have done to optimise OpenJDK over the past six months. This work has been sponsored by ARM and I would like to thank ARM for their support. Thanks also go to the open source community, especially Matthias Klose and Andrew Hailey, who have helped me with this work.

Project Goals

The project goal was to develop an optimised assembler port of OpenJDK with an acceptable level of performance for running netbook-class applications on netbook-class ARM platforms. The reference platform for the development work was the Babbage board. The reference application chosen was ThinkFree Office, a pure Java office suite (see http://www.thinkfree.com). The initial target was to deliver a performance improvement of between 2.5X and 5.0X.

Babbage Board Specification
  • Single Cortex-A8 processor clocked at 800 MHz
  • 32K I-cache, 32K D-cache
  • 256K L2 cache
  • 310 ns memory system

The initial development work was done on 'Jaunty' (Ubuntu 9.04). There were two reasons for choosing to do the work on Ubuntu rather than directly on IcedTea. Firstly, I had severe difficulty trying to get IcedTea to build on ARM. Most of these problems were due to configuration issues (for example, the correct version of gcc, the correct version of fastjar etc). Using 'Jaunty' solved these problems as I could build a single Debian package which used all the correct versions. Secondly, ARM wished to provide a complete reference port of Linux. The development work for 'Jaunty' has now been migrated to 'Karmic' (Ubuntu 9.10) and work is ongoing to port it into IcedTea.

Initial Benchmarking

Initial benchmarking showed the performance of Zero to be poor. The following table shows some benchmarking results for ThinkFree Office (TFO), Embedded CaffeineMark (ECM) and EEMBC.

In the following table, Startup is the time taken from invoking TFO to reaching the main screen, where you are offered a choice of launching Write, Calc or Show (hereinafter referred to as Word, XCEL and PPT). Open Word is the time from clicking on the Write (or Word) icon until it opens a blank document and is ready to accept input. Word X 2 is the time taken to start Word a second time, having already opened a blank Word document and closed it. The purpose of this was to examine the effect of any startup delays such as static initialization and class loading / verification. Open XCEL, XCEL X 2, Open PPT and PPT X 2 are defined similarly. All TFO times are in seconds.

ECM gives the Embedded Caffeine Mark score (higher is better) for each individual test along with an 'Overall' result which is the geometric mean of the individual results. EEMBC gives the time in milliseconds for each of the individual tests along with the total in milliseconds.

TFO (seconds)   Startup  Open Word  Word X 2  Open XCEL  XCEL X 2  Open PPT  PPT X 2
                23       80         49        85         20        91        26

ECM (score)     Sieve    Loop       Logic     String     Float     Method    Overall
                436      359        557       343        379       189       359

EEMBC (ms)      Chess    Crypt      kXML      Parallel   PNG       RE        Total
                37127    40586      35154     25758      53835     53882     246342

Plainly the time taken to open a blank Word document (1 min 20 sec), a blank spreadsheet (1 min 25 sec) and a blank PowerPoint presentation (1 min 31 sec) was completely unacceptable.
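As a quick check on how the ECM 'Overall' figure is derived, the geometric mean of the six individual Zero scores can be recomputed. The following is a minimal sketch; the names `ecm_zero` and `geometric_mean` are chosen here for illustration and are not part of the benchmark.

```python
import math

# Zero baseline ECM scores taken from the table above (higher is better).
ecm_zero = {
    "Sieve": 436,
    "Loop": 359,
    "Logic": 557,
    "String": 343,
    "Float": 379,
    "Method": 189,
}


def geometric_mean(values):
    """Return the n-th root of the product of n positive values."""
    values = list(values)
    # Summing logarithms avoids overflowing the intermediate product.
    return math.exp(sum(math.log(v) for v in values) / len(values))


overall = geometric_mean(ecm_zero.values())
print(round(overall))  # prints 359, matching the reported Overall score
```

Because the geometric mean multiplies the scores together, a single weak test such as Method (189) pulls the overall result down more than it would in an arithmetic mean.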

Optimisation

An initial optimisation phase was carried out which optimised the performance of the simpler bytecodes (load, store, data operations, array operations, branches) but did not perform an optimisation on the more complex bytecodes (invoke, return, get/put static). This was released as part of Ubuntu 9.04 (Jaunty).

This delivered approximately a 2X performance improvement on benchmarks such as Embedded CaffeineMark and EEMBC but offered virtually no performance improvement on any of the ThinkFree Office benchmarks.

The reason for this is that ThinkFree Office is written in a highly object-oriented fashion and spends most of its time making method calls.

A second phase of optimisation was carried out to optimise method calls. This involved streamlining the code to remove mutually recursive calls within the C interpreter when making method calls. In addition, 'fast' paths were created for calling a Java or native method from Java, in which all recursion was removed. The effect of this optimisation work can be seen in the Embedded CaffeineMark Method score, which jumped from 189 to 1339 (an improvement of over a factor of 7).

This had a significant effect on ThinkFree Office, reducing the time taken to open a blank word document from 1 min 20 sec to 46 sec. However, this level of performance was still unacceptable, so I developed a Bytecode Interpreter Generator (BIG) to deliver further performance improvement. The basic idea of the BIG is that it optimises sequences of bytecodes rather than executing a single bytecode at a time.

The combination of these optimisations reduced the time taken to open a blank word document to 31 seconds the first time and 16 seconds on subsequent openings. This compares favourably with the Sun J2SE JIT, which takes 26 seconds to open a blank word document the first time and 8 seconds on subsequent reopening.

Bytecode Interpreter Generator

The Bytecode Interpreter Generator is a tool which generates a bytecode interpreter from a template file. The template file is a description of the bytecodes along with their associated implementations.

The Bytecode Interpreter Generator enables improved levels of Java performance by optimising frequently executed sequences of Java bytecodes. In effect it 'peepholes' sequences of frequently executed bytecodes. In the current implementation of the optimised ARM assembler this is restricted to a sequence of 4 bytecodes although there is no inherent limitation on the bytecode interpreter.
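The idea of handling a frequent bytecode sequence as a single unit can be illustrated with a toy stack machine. This is only a sketch under stated assumptions: the opcodes (`LOAD`, `ADD`, `STORE`, `HALT`, `LOAD_LOAD_ADD`) and the functions `peephole` and `run` are hypothetical and are neither real JVM bytecodes nor BIG template syntax. How the BIG itself recognises and dispatches sequences is not described on this page; the sketch rewrites the program ahead of execution purely for illustration.

```python
# Hypothetical toy opcodes for illustration; not real JVM bytecodes or BIG syntax.
LOAD, ADD, STORE, HALT = range(4)
LOAD_LOAD_ADD = 4  # fused opcode standing for the sequence LOAD a; LOAD b; ADD


def peephole(code):
    """Replace each LOAD a, LOAD b, ADD run with one LOAD_LOAD_ADD a b."""
    out, i = [], 0
    while i < len(code):
        if (i + 2 < len(code)
                and code[i][0] == LOAD
                and code[i + 1][0] == LOAD
                and code[i + 2][0] == ADD):
            out.append((LOAD_LOAD_ADD, code[i][1], code[i + 1][1]))
            i += 3
        else:
            out.append(code[i])
            i += 1
    return out


def run(code, local_vars):
    """Interpret code; return (final locals, number of dispatches)."""
    local_vars = list(local_vars)
    stack, pc, dispatches = [], 0, 0
    while True:
        instr = code[pc]
        pc += 1
        dispatches += 1
        op = instr[0]
        if op == LOAD:
            stack.append(local_vars[instr[1]])
        elif op == ADD:
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif op == STORE:
            local_vars[instr[1]] = stack.pop()
        elif op == LOAD_LOAD_ADD:
            # One dispatch and one push instead of three dispatches,
            # three pushes and two pops.
            stack.append(local_vars[instr[1]] + local_vars[instr[2]])
        elif op == HALT:
            return local_vars, dispatches
        else:
            raise ValueError(f"unknown opcode {op}")


# locals[2] = locals[0] + locals[1]
program = [(LOAD, 0), (LOAD, 1), (ADD,), (STORE, 2), (HALT,)]
plain = run(program, [3, 4, 0])
fused = run(peephole(program), [3, 4, 0])
print(plain, fused)  # ([3, 4, 7], 5) ([3, 4, 7], 3)
```

Both runs compute the same result, but the fused program needs three dispatches instead of five. The toy machine has no branches; a rewrite like this on real bytecode would also have to make sure no branch targets the middle of a fused sequence.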

The Bytecode Interpreter Generator is not specific to Java. It could equally be adapted to other bytecode interpreters such as Dalvik. Also, the Bytecode Interpreter Generator is not specific to ARM. It can easily be adapted to generate code for other processors such as x86, PPC etc.

The Bytecode Interpreter Generator is being released under GPLv2 and is available as part of the source release of the optimised ARM assembler. See the Binary and Source Release for download details.

Further technical details on the Bytecode Interpreter Generator are available from http://camswl.com/openjdk/big.html

Further Developments

The current round of optimisation on OpenJDK has reached a natural stopping point where there is little to be gained immediately and any further optimisation work will require large scale changes. I have therefore concentrated over the past few weeks on stabilising the source code for release. However, there are a number of optimisations and developments I would like to pursue in the future.

  • Currently the optimised interpreter does not support the use of a JIT. If a JIT is active (the UseCompiler flag is set) it will simply fall back to the C interpreter. There could be significant benefits from combining the optimised interpreter technology with a JIT.

    Currently, JITs tend to be over-aggressive in deciding what methods to compile. This is because they are built on top of a slow interpreter. This leads to slow startup time (26 seconds to open a blank word document the first time vs 8 seconds to open it the second time).

    With a really fast interpreter to underpin the JIT there is no need for the JIT to be so aggressive, and it can concentrate on compiling the real hotspots in the code rather than compiling almost everything.

  • The interpreter is a single-state interpreter (it operates entirely in the vtos, void top of stack, state). I would like to move this to a five-state interpreter which caches the top two stack elements in registers (assuming 32-bit container-sized elements). The BIG template definition format would be extended to allow definition of multiple states and to automatically generate state transition code. Based on past experience with multi-state Java interpreters, this could generate up to 30% further performance improvement.
  • The BIG currently generates static code suitable to be linked with the C++ Interpreter. A possible future development would be to develop the BIG template format so it can be recognised by the Template Interpreter. The Template Interpreter could then read the definition file and use it to generate an optimised interpreter for any architecture.
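To illustrate the multi-state idea from the list above, the sketch below caches the top stack element in a local variable and tracks whether the cache is occupied. It is a simplified two-state version (nothing cached, or one element cached) rather than the five-state design proposed, and the opcodes and the function `run_cached` are hypothetical names used only for this example. A generated interpreter would more likely hold a separate copy of each handler per state instead of testing a flag at run time; the flag is used here only to keep the example short.

```python
# Hypothetical toy opcodes for illustration; not real JVM bytecodes.
LOAD, ADD, STORE, HALT = range(4)


def run_cached(code, local_vars):
    """Interpret code, keeping the top stack element in `tos` when possible.

    Returns (final locals, number of accesses to the in-memory stack).
    """
    local_vars = list(local_vars)
    stack = []          # in-memory operand stack
    tos = None          # cached top-of-stack value (stands in for a register)
    cached = False      # state: False = nothing cached (vtos), True = one element cached
    mem_accesses = 0
    pc = 0
    while True:
        instr = code[pc]
        pc += 1
        op = instr[0]
        if op == LOAD:
            if cached:
                # State transition: spill the old top to memory first.
                stack.append(tos)
                mem_accesses += 1
            tos = local_vars[instr[1]]
            cached = True
        elif op == ADD:
            if not cached:
                # State transition: fill the cache from memory.
                tos = stack.pop()
                mem_accesses += 1
            tos = stack.pop() + tos   # second operand always comes from memory
            mem_accesses += 1
            cached = True
        elif op == STORE:
            if not cached:
                tos = stack.pop()
                mem_accesses += 1
            local_vars[instr[1]] = tos
            cached = False
        elif op == HALT:
            return local_vars, mem_accesses
        else:
            raise ValueError(f"unknown opcode {op}")


# locals[2] = locals[0] + locals[1]
program = [(LOAD, 0), (LOAD, 1), (ADD,), (STORE, 2), (HALT,)]
result = run_cached(program, [3, 4, 0])
print(result)  # ([3, 4, 7], 2)
```

For this program the cached version touches the in-memory stack twice, where a plain stack interpreter would perform six pushes and pops (two for the loads, three for the add, one for the store). Caching two elements, as proposed, would remove the remaining two accesses for this sequence at the cost of more states and more transition code.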

Final Benchmarking Results

The following tables show, respectively, the performance improvement for ThinkFree Office, Embedded CaffeineMark and EEMBC. In each table the first column gives the initial performance of the pure C code, the second the performance of the optimised assembler code, the third the relative performance improvement of the optimised assembler versus the C code, the fourth the performance of the Sun J2SE JIT, and the fifth the relative performance of the JIT compared with the optimised assembler.

ThinkFree Office

TFO (seconds)    Zero/C   Asm   Asm / C   J2SE JIT   JIT / ASM
Startup          23       14    1.6 X     10         2.3 X
Open Word        80       31    2.6 X     26         1.2 X
Open Word X 2    49       16    3.1 X     8          2.0 X
Open XCEL        85       30    2.8 X     23         1.3 X
Open XCEL X 2    20       7     2.9 X     4          1.8 X
Open PPT         91       31    2.9 X     25         1.2 X
Open PPT X 2     26       8     3.3 X     5          1.6 X

Embedded CaffeineMark

ECM (score)   Zero/C   Asm    Asm / C   J2SE JIT   JIT / ASM
Sieve         436      1269   2.9 X     5664       4.5 X
Loop          359      1395   3.9 X     23064      16.5 X
Logic         557      2350   4.2 X     21815      9.3 X
String        343      780    2.3 X     1624       2.1 X
Float         379      1181   3.1 X     7947       6.7 X
Method        189      1339   7.1 X     9246       6.9 X
Overall       359      1313   3.7 X     8349       6.4 X

EEMBC

EEMBC (ms)   Zero/C   Asm     Asm / C   J2SE JIT   JIT / ASM
Chess        37127    9917    3.7 X     1701       5.8 X
Crypt        40586    8972    4.5 X     1187       7.6 X
kXML         35154    7596    4.6 X     1639       4.6 X
Parallel     25758    8532    3.0 X     921        9.3 X
PNG          53835    6946    7.8 X     872        8.0 X
RE           53882    8955    6.0 X     1034       8.7 X
Total        246342   50918   4.8 X     7354       6.9 X
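The relative-performance columns are simple ratios of the measured values. As a worked example, the sketch below recomputes the two ratios on the EEMBC Total row; the names `eembc_total_ms` and `speedup` are illustrative only.

```python
# EEMBC totals in milliseconds, from the Total row above (lower is better).
eembc_total_ms = {"zero_c": 246342, "asm": 50918, "jit": 7354}


def speedup(slow_ms, fast_ms):
    """Return how many times faster the second timing is than the first."""
    return slow_ms / fast_ms


asm_over_c = speedup(eembc_total_ms["zero_c"], eembc_total_ms["asm"])
jit_over_asm = speedup(eembc_total_ms["asm"], eembc_total_ms["jit"])
print(f"{asm_over_c:.1f} X, {jit_over_asm:.1f} X")  # prints "4.8 X, 6.9 X"
```

For the ECM table the measurements are scores where higher is better, so the ratio is taken the other way round (for example, the Asm score divided by the Zero/C score).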

Conclusions

The optimised assembler interpreter offers an attractive Java solution for netbook-class devices. It delivers 70-80% of the performance of a JIT without the additional costs associated with a JIT solution, such as extra memory to cache the JIT-compiled code. Because the core of the optimised interpreter is relatively small at 62K, its working set fits entirely within the CPU cache, so it operates well on constrained devices such as netbooks.

In addition, the optimised interpreter has been released freely under GPLv2. Binaries and sources may be downloaded from the links on this page. The source is being included as part of Ubuntu 9.10 (Karmic) and work is ongoing to contribute the work to IcedTea and, at a future date, to the OpenJDK trunk.

Glossary

OpenJDK
OpenJDK is an open source release of the Java Development Kit. This has been released into the open source community by Sun. See for more information.
IcedTea
IcedTea is a development on top of OpenJDK. The main developments are to remove some existing encumbrances, to develop a pure C/C++ release which can be built on any architecture, and to ensure OpenJDK passes the JCK. For more information visit http://www.iced-tea.org
Zero
The term Zero is used to refer to the VM in IcedTea. It is called Zero because it has 'Zero' (well, almost) assembler.
 