

You are here: LinuxReference Web>OpenJDKPage (14 Aug 2009)

OpenJDK Optimisation Project

Author: Edward Nevill (ed@camswl.com)

This page describes the work I have done to optimise OpenJDK over the past six months. This work has been sponsored by ARM and I would like to thank ARM for their support. Thanks also go to the open source community, especially Matthias Klose and Andrew Hailey, who have helped me with this work.

Project Goals
Initial benchmarking
Optimisation
Bytecode Interpreter Generator
Further Developments
Final benchmarking
Conclusions
Glossary

Project Goals

The project goals were to develop an optimised assembler port of OpenJDK delivering an acceptable level of performance for running Netbook class applications on Netbook class ARM platforms. The reference platform for the development work was the Babbage board. The reference application chosen was ThinkFree Office, a pure Java office suite (see http://www.thinkfree.com). The initial goals were to deliver a performance improvement of between 2.5X and 5.0X.

Babbage Board Specification

Single Cortex-A8 processor clocked at 800 MHz
32 KB I-Cache, 32 KB D-Cache
256 KB L2 cache
310 ns memory system

The initial development work was done on 'Jaunty' (Ubuntu 9.04). There were a number of reasons for choosing to do the work on Ubuntu rather than directly on IcedTea. Firstly, I had severe difficulty trying to get IcedTea to build on ARM. Most of these problems were due to configuration issues (for example, the correct version of gcc, the correct version of fastjar etc). Using 'Jaunty' solved these problems as I could build a single Debian package which used all the correct versions. Secondly, ARM wished to provide a complete reference port of Linux. The development work for 'Jaunty' has now been migrated to 'Karmic' (Ubuntu 9.10) and work is ongoing to port it into IcedTea.

Initial Benchmarking

Initial benchmarking showed the performance of Zero to be poor. The following table shows some benchmarking results for ThinkFree Office (TFO), Embedded Caffeine Mark (ECM) and EEMBC.

In the following table Startup is the time taken from invoking TFO to reaching the main screen, where you are offered a choice of launching Write, Calc or Show (hereinafter referred to as Word, XCEL and PPT). Open Word is the time from clicking on the Write (or Word) icon until it opens a blank document and is ready to accept input. Word X 2 is the time taken to start Word a second time, having already opened a blank Word document and closed it. The purpose of this was to examine the effect of any startup delays such as static initialization and class loading / verification. Similarly for Open XCEL, XCEL X 2, Open PPT and PPT X 2. All TFO times are in seconds.

ECM gives the Embedded Caffeine Mark score (higher is better) for each individual test along with an 'Overall' result which is the geometric mean of the individual results. EEMBC gives the time in milliseconds for each of the individual tests along with the total in milliseconds.

TFO	Startup	Open Word	Word X 2	Open XCEL	XCEL X 2	Open PPT	PPT X 2
	23	80	49	85	20	91	26
ECM	Sieve	Loop	Logic	String	Float	Method	Overall
	436	359	557	343	379	189	359
EEMBC	Chess	Crypt	kXML	Parallel	PNG	RE	Total
	37127	40586	35154	25758	53835	53882	246342

Plainly the time taken to open a blank word document (1 Min 20 sec), a blank spreadsheet (1 Min 25 sec) and a blank Powerpoint (1 Min 31 sec) was completely unacceptable.
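As a check on the ECM 'Overall' figure described above, the geometric mean of the six individual Zero/C scores can be recomputed. This is only a small sketch; the function and variable names are my own and are not part of the benchmark.

```python
import math

# Zero/C Embedded Caffeine Mark scores taken from the table above.
ecm_scores = {
    "Sieve": 436, "Loop": 359, "Logic": 557,
    "String": 343, "Float": 379, "Method": 189,
}

def geometric_mean(values):
    """n-th root of the product, computed via logarithms."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))

overall = geometric_mean(ecm_scores.values())
print(round(overall))  # 359, matching the 'Overall' column
```

A geometric mean is used rather than an arithmetic mean so that one very high individual score cannot dominate the overall result.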

Optimisation

An initial optimisation phase was carried out which optimised the performance of the simpler bytecodes (load, store, data operations, array operations, branches) but did not perform an optimisation on the more complex bytecodes (invoke, return, get/put static). This was released as part of Ubuntu 9.04 (Jaunty).

This delivered approximately 2 X performance improvement on benchmarks such as Embedded CaffeineMark and EEMBC but offered virtually no performance improvement on any of the ThinkFree Office benchmarks.

The reason for this is that ThinkFree Office is written in a highly object-oriented fashion and spends most of its time making method calls.

A second phase of optimisation was carried out to optimise method calls. This involved streamlining the code to remove mutually recursive calls within the C interpreter when making method calls. In addition 'fast' paths were created for calling a Java or native method from Java, in which all recursion was removed. The effect of this optimisation work can be seen in the Embedded CaffeineMark Method score, which jumped from 189 to 1339 (over a factor of 7 improvement).
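The idea of removing recursion from method invocation can be illustrated with a toy interpreter that keeps an explicit frame stack, so that a call switches frames inside the main loop instead of re-entering the interpreter. This is only a sketch under my own assumptions: the four opcodes and the `run` function are hypothetical and are not the code used in Zero or in the optimised interpreter.

```python
# Hypothetical toy bytecodes (not JVM opcodes); each instruction is a tuple.
PUSH, ADD, CALL, RET = "push", "add", "call", "ret"

def run(methods, entry):
    """Interpret `entry` using an explicit frame stack instead of recursion."""
    stack = []                      # shared operand stack
    frames = [(methods[entry], 0)]  # each saved frame: (code, program counter)
    while frames:
        code, pc = frames.pop()
        while True:
            op, *args = code[pc]
            pc += 1
            if op == PUSH:
                stack.append(args[0])
            elif op == ADD:
                b, a = stack.pop(), stack.pop()
                stack.append(a + b)
            elif op == CALL:
                # Save the caller's resume point, then switch to the callee
                # without re-entering run().
                frames.append((code, pc))
                code, pc = methods[args[0]], 0
            elif op == RET:
                break               # resume whichever caller is on `frames`
    return stack.pop()

# A chain of 5000 nested calls: m0 calls m1, which calls m2, and so on.
depth = 5000
methods = {
    f"m{i}": [(PUSH, 1), (CALL, f"m{i + 1}"), (ADD,), (RET,)]
    for i in range(depth)
}
methods[f"m{depth}"] = [(PUSH, 0), (RET,)]
print(run(methods, "m0"))  # 5000
```

Because a call is just a frame switch inside the loop, the nesting depth of the interpreted program does not consume the host stack, and the per-call cost is a push and a pop rather than a full function entry and exit.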

This had a significant effect on ThinkFree Office, reducing the time taken to open a blank word document from 1 min 20 sec to 46 sec. However, this level of performance was still unacceptable, so I developed a Bytecode Interpreter Generator (BIG) to deliver even further levels of performance improvement. The basic idea of the BIG is that it optimises sequences of bytecodes rather than executing a single bytecode at a time.

The combination of these optimisations reduced the time taken to open a word document to 31 seconds the first time a word document is opened and 16 seconds on subsequent openings. This compares favourably with the 26 seconds taken by the Sun J2SE JIT to open a blank word document the first time and 8 seconds on subsequent reopening.

Bytecode Interpreter Generator

The Bytecode Interpreter Generator is a tool which generates a bytecode interpreter from a template file. The template file is a description of the bytecodes along with their associated implementations.

The Bytecode Interpreter Generator enables improved levels of Java performance by optimising frequently executed sequences of Java bytecodes. In effect it 'peepholes' sequences of frequently executed bytecodes. In the current implementation of the optimised ARM assembler this is restricted to a sequence of 4 bytecodes although there is no inherent limitation on the bytecode interpreter.
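The following sketch illustrates the general idea of generating an interpreter from a template and dispatching on sequences of up to 4 bytecodes. It is not the BIG itself: the template format, the opcode subset, the list of 'hot' sequences and all function names are assumptions made for illustration. The fused handlers here simply chain the single-bytecode implementations; a real fused implementation would also avoid the intermediate stack traffic.

```python
from dataclasses import dataclass, field

MAX_SEQ = 4  # the text above restricts fused sequences to 4 bytecodes

@dataclass
class VM:
    local_vars: list
    stack: list = field(default_factory=list)
    dispatches: int = 0  # number of handler dispatches performed

# Hypothetical template: opcode name -> implementation taking (vm, operand).
TEMPLATE = {
    "iload":  lambda vm, n: vm.stack.append(vm.local_vars[n]),
    "iconst": lambda vm, k: vm.stack.append(k),
    "iadd":   lambda vm, _: vm.stack.append(vm.stack.pop() + vm.stack.pop()),
    "istore": lambda vm, n: vm.local_vars.__setitem__(n, vm.stack.pop()),
}

# Sequences assumed (for this example only) to be frequently executed.
HOT_SEQUENCES = [("iload", "iconst", "iadd", "istore")]

def fuse(template, ops):
    """Build one handler that executes the given opcodes back to back."""
    handlers = [template[op] for op in ops]
    def fused(vm, operands):
        for handler, operand in zip(handlers, operands):
            handler(vm, operand)
    return fused

def generate(template, hot_sequences=()):
    """Generate a dispatch table keyed by opcode tuples."""
    table = {(op,): fuse(template, (op,)) for op in template}
    for seq in hot_sequences:
        assert len(seq) <= MAX_SEQ
        table[seq] = fuse(template, seq)
    return table

def run(table, code, vm):
    pc = 0
    while pc < len(code):
        # Try the longest sequence first, falling back to a single bytecode.
        for length in range(min(MAX_SEQ, len(code) - pc), 0, -1):
            window = code[pc:pc + length]
            key = tuple(op for op, _ in window)
            if key in table:
                table[key](vm, [arg for _, arg in window])
                vm.dispatches += 1
                pc += length
                break
        else:
            raise ValueError(f"unknown bytecode {code[pc][0]!r}")
    return vm

# x = x + 1, with x held in local variable 0.
code = [("iload", 0), ("iconst", 1), ("iadd", None), ("istore", 0)]
plain = run(generate(TEMPLATE), code, VM([41]))
fused = run(generate(TEMPLATE, HOT_SEQUENCES), code, VM([41]))
print(plain.local_vars, plain.dispatches)  # [42] 4
print(fused.local_vars, fused.dispatches)  # [42] 1
```

Both interpreters compute the same result, but the one generated with the hot sequence performs a single dispatch where the plain one performs four.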

The Bytecode Interpreter Generator is not specific to Java. It could equally be adapted to other bytecode interpreters such as Dalvik. Also, the Bytecode Interpreter Generator is not specific to ARM. It can easily be adapted to generate code for other processors such as X86, PPC etc.

The Bytecode Interpreter Generator is being released under GPLv2 and is available as part of the source release of the optimised ARM assembler. See the Binary and Source Release for download details.

Further technical details on the Bytecode Interpreter Generator are available from http://camswl.com/openjdk/big.html

Further Developments

The current round of optimisation on OpenJDK has reached a natural stopping point where there is little to be gained immediately and any further optimisation work will require large scale changes. I have therefore concentrated over the past few weeks on stabilising the source code for release. However, there are a number of optimisations and developments I would like to pursue in the future.

Currently the optimised interpreter does not support the use of a JIT. If a JIT is active (UseCompiler flag is set) it will simply fall back to the C interpreter. There could be significant benefits from combining the optimised interpreter technology with a JIT.
Currently, JITs tend to be overly aggressive in deciding what methods to compile. This is because they are built on top of a slow interpreter. This leads to slow startup time (26 seconds to open a blank word document the first time vs 8 seconds to open it the second time).
With a really fast interpreter to underpin the JIT there is no need for the JIT to be so aggressive and it can concentrate on compiling what are real hotspots in the code rather than compiling almost everything.
The interpreter is a single state interpreter (it operates entirely in the vtos, void top of stack, state). I would like to move this to a five state interpreter which caches the top two stack elements in registers (assuming 32 bit container sized elements). The BIG template definition format would be extended to allow definition of multiple states and to automatically generate state transition code. Based on past experience with multi state Java interpreters this could generate up to 30% further performance improvement.
The BIG currently generates static code suitable to be linked with the C++ Interpreter. A possible future development would be to develop the BIG template format so it can be recognised by the Template interpreter. The template interpreter could then read the definition file and use it to generate an optimised interpreter for any architecture.
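The benefit of caching the top stack elements in registers can be illustrated by counting accesses to the in-memory expression stack. The sketch below is a deliberate simplification and not the proposed design: it tracks only how many elements are cached (three states: 0, 1 or 2) rather than the five states described above, and the class and function names are hypothetical.

```python
class CachedStack:
    """Expression stack whose top `cache_size` elements live in 'registers'."""

    def __init__(self, cache_size):
        self.cache_size = cache_size
        self.regs = []    # cached elements; the last item is the top of stack
        self.mem = []     # the in-memory expression stack
        self.mem_ops = 0  # reads and writes of the in-memory stack

    def push(self, value):
        self.regs.append(value)
        if len(self.regs) > self.cache_size:
            # Cache is full: spill the oldest cached element to memory.
            self.mem.append(self.regs.pop(0))
            self.mem_ops += 1

    def pop(self):
        if self.regs:
            return self.regs.pop()   # served from a register
        self.mem_ops += 1
        return self.mem.pop()        # cache empty: read from memory

def evaluate(program, cache_size):
    s = CachedStack(cache_size)
    for op, arg in program:
        if op == "push":
            s.push(arg)
        elif op == "add":
            b, a = s.pop(), s.pop()
            s.push(a + b)
        elif op == "mul":
            b, a = s.pop(), s.pop()
            s.push(a * b)
    return s.pop(), s.mem_ops

# (1 + 2) * (3 + 4)
program = [("push", 1), ("push", 2), ("add", None),
           ("push", 3), ("push", 4), ("add", None), ("mul", None)]
print(evaluate(program, cache_size=0))  # (21, 14): vtos-style, all in memory
print(evaluate(program, cache_size=2))  # (21, 2): top two elements cached
```

In this example the cached version touches the in-memory stack only when a third live value forces a spill and when that value is later reloaded.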

Final Benchmarking Results

The following tables show, respectively, the performance improvement for ThinkFree Office, Embedded Caffeine Mark and EEMBC. The first column in each case gives the initial performance of the pure C code. The second column gives the performance of the optimised assembler code. The third column gives the relative performance improvement of the optimised assembler versus the C code. The fourth column gives the performance of the Sun J2SE JIT. The fifth column gives the relative performance of the JIT compared with the optimised assembler.

ThinkFree Office

TFO	Zero/C	Asm	Asm / C	J2SE JIT	JIT / ASM
Startup	23	14	1.6 X	10	1.4 X
Open Word	80	31	2.6 X	26	1.2 X
Open Word X 2	49	16	3.1 X	8	2.0 X
Open XCEL	85	30	2.8 X	23	1.3 X
Open XCEL X 2	20	7	2.9 X	4	1.8 X
Open PPT	91	31	2.9 X	25	1.2 X
Open PPT X 2	26	8	3.3 X	5	1.6 X

Embedded CaffeineMark

ECM	Zero/C	Asm	Asm / C	J2SE JIT	JIT / ASM
Sieve	436	1269	2.9 X	5664	4.5 X
Loop	359	1395	3.9 X	23064	16.5 X
Logic	557	2350	4.2 X	21815	9.3 X
String	343	780	2.3 X	1624	2.1 X
Float	379	1181	3.1 X	7947	6.7 X
Method	189	1339	7.1 X	9246	6.9 X
Overall	359	1313	3.7 X	8349	6.4 X

EEMBC

EEMBC	Zero/C	Asm	Asm / C	J2SE JIT	JIT / ASM
Chess	37127	9917	3.7 X	1701	5.8 X
Crypt	40586	8972	4.5 X	1187	7.6 X
kXML	35154	7596	4.6 X	1639	4.6 X
Parallel	25758	8532	3.0 X	921	9.3 X
PNG	53835	6946	7.8 X	872	8.0 X
RE	53882	8955	6.0 X	1034	8.7 X
Total	246342	50918	4.8 X	7354	6.9 X
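The ratio columns in the tables above can be recomputed from the raw figures. Note that the direction of the division depends on the benchmark: TFO and EEMBC report times (lower is better), whereas ECM reports scores (higher is better). The helper below is a small sketch with a name of my own choosing.

```python
def speedup(baseline, improved, higher_is_better):
    """Relative improvement of `improved` over `baseline`, to one decimal."""
    if higher_is_better:
        ratio = improved / baseline   # scores: a bigger number is faster
    else:
        ratio = baseline / improved   # times: a smaller number is faster
    return round(ratio, 1)

# EEMBC totals in milliseconds: Zero/C, Asm, J2SE JIT.
print(speedup(246342, 50918, higher_is_better=False))  # 4.8 (Asm / C)
print(speedup(50918, 7354, higher_is_better=False))    # 6.9 (JIT / ASM)

# ECM overall scores: Zero/C, Asm, J2SE JIT.
print(speedup(359, 1313, higher_is_better=True))       # 3.7 (Asm / C)
print(speedup(1313, 8349, higher_is_better=True))      # 6.4 (JIT / ASM)
```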

Conclusions

The optimised assembler interpreter offers a preferable Java solution for netbook class platforms. It delivers 70-80% of the performance of a JIT without the additional costs associated with a JIT solution, such as extra memory to cache the JIT code. Because the core of the optimised interpreter is relatively small at 62K, it operates well on constrained devices such as netbooks: the working set of the interpreter fits entirely within the CPU cache.

In addition the optimised interpreter has been released freely under GPLv2. Binaries and sources may be downloaded from the links on this page. The source is being included as part of Ubuntu 9.10 (Karmic) and work is ongoing to contribute the work to IcedTea and at a future date to the OpenJDK trunk.

Glossary

OpenJDK: OpenJDK is an open source release of the Java Development Kit. This has been released into the open source community by Sun. See for more information.
IcedTea: IcedTea is a development on top of OpenJDK. The main developments are to remove some existing encumbrances, to develop a pure C/C++ release which could be built on any architecture, and to ensure OpenJDK passes the JCK. For more information visit http://www.iced-tea.org
Zero: The term Zero is used to refer to the VM in IcedTea. It is called Zero because it has 'Zero' (well, almost) assembler.

Copyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.