OpenJDK HotSpot 128-bit Multiplication x86 performance improvement

2 devlogs
5h 22m 4s

Currently, on the OpenJDK JVM (used by Java, Kotlin, Scala, and others), multiplying two 64-bit numbers to get a 128-bit result takes 2 x86 instructions, even though it can be done with only 1. So I’m modifying a part of the code called the HotSpot Compiler (written in C++) so that it emits 1 instruction instead of 2, which speeds up this operation in any JVM language (it’s common in crypto/hashing). This work is tracked in the jdk-8379327-128bit-mul branch of my fork and has been submitted upstream; I’m currently iterating on it as requested by Oracle reviewers.
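
For anyone wondering what this operation looks like from the Java side, here is a minimal sketch. The class and method names are made up for illustration and are not taken from the PR. It assumes Java 18 or newer, because that is when Math.unsignedMultiplyHigh was added. The low half is an ordinary long multiply and the high half comes from a separate Math call, which is the pair of operations this project is about.

```java
public class Mul128Sketch {
    /** Low 64 bits of a * b; the bit pattern is the same for signed and unsigned. */
    public static long low(long a, long b) {
        return a * b;
    }

    /** High 64 bits of the signed 128-bit product. */
    public static long highSigned(long a, long b) {
        return Math.multiplyHigh(a, b);
    }

    /** High 64 bits of the unsigned 128-bit product (Java 18+). */
    public static long highUnsigned(long a, long b) {
        return Math.unsignedMultiplyHigh(a, b);
    }

    public static void main(String[] args) {
        long a = 0xFFFF_FFFF_FFFF_FFFFL; // -1 as signed, 2^64 - 1 as unsigned
        long b = 2L;

        // Signed: -1 * 2 = -2, so hi = -1 (sign extension) and lo = -2.
        System.out.printf("signed:   hi=%d lo=%d%n", highSigned(a, b), low(a, b));

        // Unsigned: (2^64 - 1) * 2 = 2^65 - 2, so hi = 1 and lo = 2^64 - 2.
        System.out.printf("unsigned: hi=%d lo=%s%n",
                highUnsigned(a, b), Long.toUnsignedString(low(a, b)));
    }
}
```

Both halves are computed from the same two inputs, which is why a single widening multiply can in principle produce them together.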

This project uses AI

I used GitHub Copilot Chat to help me navigate around the repo / find where some things were located, as OpenJDK is MASSIVE! :)


Michael

Two of the reviewers on the PR that adds this new feature disagreed about something: whether the Register Allocator needs an extra match rule in the architecture description file to prevent register spilling (because for multiplication the order of the operands doesn’t matter!). So I needed to test it. I designed a small test case and added extra logging, which helped me check whether the rule was needed (this took a while). It turned out that the extra match rule wasn’t needed, so I removed it in the commit I just pushed. The testing I made for this was just temporary & shouldn’t become a part of OpenJDK, so it wasn’t pushed. You can see an attached screenshot of the output of the test case I made to trace this down. Note that in the assembly output there is NOT excessive shuffling. I have highlighted the important part in red in case you don’t understand OptoAssembly. You can also see the new “mulhilo” node I made in this screenshot :D

Attachment
Michael

So this is my first devlog!

I wrote the code that lets the multiplication need only 1 x86 instruction. This involved changing HotSpot’s compiler, which is written in C++, to add 2 new node types (one for unsigned and one for signed) that represent fused high-low multiplication, then editing the DSL that’s used for arch descriptions so that the x86 one understands these new nodes, and writing a test in Java to make sure this works.

I then made a PR and sent it upstream to OpenJDK for review!

In my PR, some people who work at Oracle asked me to refactor several things, including making the testing use the standard IR framework (for checking HotSpot’s sea-of-nodes made from the bytecode) rather than the kind of hacky thing I’d spun up myself because I didn’t realise the framework existed (how silly!). After migrating to the framework, it’s all working (again) now, and you can see my new-and-improved test passing in the screenshot!
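
The IR framework test checks which nodes the compiler produces; whether the values come out right is a separate question. As a rough illustration only (this is not the test from the PR, and the class name Mul128Check is made up), a plain-Java check can recombine the high and low halves and compare them against a BigInteger reference:

```java
import java.math.BigInteger;
import java.util.Random;

public class Mul128Check {
    /** Reference: the full signed product computed with BigInteger. */
    public static BigInteger reference(long a, long b) {
        return BigInteger.valueOf(a).multiply(BigInteger.valueOf(b));
    }

    /** Recombine a signed high half and an unsigned low half into one value. */
    public static BigInteger combine(long hi, long lo) {
        BigInteger low = new BigInteger(Long.toUnsignedString(lo));
        return BigInteger.valueOf(hi).shiftLeft(64).add(low);
    }

    /** True if Math.multiplyHigh plus the plain multiply agree with the reference. */
    public static boolean matches(long a, long b) {
        return combine(Math.multiplyHigh(a, b), a * b).equals(reference(a, b));
    }

    public static void main(String[] args) {
        Random rng = new Random(42); // fixed seed so runs are reproducible
        for (int i = 0; i < 1_000_000; i++) {
            long a = rng.nextLong();
            long b = rng.nextLong();
            if (!matches(a, b)) {
                throw new AssertionError(a + " * " + b);
            }
        }
        System.out.println("all products match");
    }
}
```

A check like this only covers values; it says nothing about which instructions were emitted, which is what the IR framework test is for.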

Attachment

Comments

Michael 5 days ago

Oh sorry I forgot to link all the other commits! You can see them in the PR though :)