I don’t always write GPU code in Java but when I do I like to use Aparapi
The AMD JavaLabs team is proud to announce the open source release of our Aparapi project.
Over 18 months ago we began to investigate how Java developers might be able to take advantage of the (potentially huge) compute performance of emerging GPU devices.
At the time we were beginning to see Java bindings for OpenCL™ and CUDA (JOCL, JOpenCL and JCUDA), but most of these provided JNI wrappers around the original OpenCL or CUDA C based APIs and tended to force Java developers to do very un-Java-like things to their code. Furthermore, coding a simple data parallel code fragment using these bindings involved creating a Kernel (in a somewhat alien C99 based syntax; exposing pointers, vector types and scary memory models) and then writing a slew of Java code to initialize the device, create data buffers, compile the OpenCL code, bind arguments to the compiled code, explicitly send buffers to the device, execute the code and explicitly transfer buffers back again.
To the seasoned OpenCL developer this is all routine stuff, but to the Java developer this all seems a little overwhelming. Java developers are used to being (and quite happy to be) shielded from platform nitty gritty. We don’t use pointers, we don’t worry about destructors or tiered esoteric memory models or even about freeing memory explicitly. We expect the Java Virtual Machine to do this heavy lifting for us.
We discussed how we thought Java developers might prefer to write code for the GPU and the answers were something like:
1) We want to write all our code in Java.
2) We want to use common Java idioms and patterns.
3) We don’t want to have to worry about exotic memory models.
4) We would prefer to not have to explicitly transfer data to the GPU and back. Generally the compiler knows what data the code is accessing and whether the code is reading from or writing to it, so why can’t the runtime just handle ensuring that the data is where it needs to be, when it needs to be there?
5) We want to write code once. If it turns out that the code cannot be executed on the GPU (because, say, the runtime platform does not happen to have a GPU) we would still like the code to execute in a performant manner. We certainly don’t expect to have to write our code once for the GPU (in a device specific language) and then a second time in case it has to run using a Thread Pool.
This led us to develop Aparapi, the project that we are releasing to the open source community today.
Aparapi enables Java developers to take advantage of the compute power of GPU and APU devices by executing data parallel friendly code fragments on the GPU rather than being confined to the local CPU. It does this by converting Java bytecode to OpenCL at runtime and executing on the GPU; if for any reason Aparapi can’t execute on the GPU it will execute in a Java thread pool.
We like to think that for the appropriate workload this extends Java’s ‘Write Once Run Anywhere’ to include GPU devices.
With Aparapi we can take a sequential loop such as this (which adds corresponding elements of the inA[] and inB[] arrays and puts the result in result[]):
final float inA[] = .... // get a float array of data from somewhere
final float inB[] = .... // get a float array of data from somewhere (inA.length==inB.length)
final float result[] = new float[inA.length];
for (int i=0; i<inA.length; i++){
result[i]=inA[i]+inB[i];
}
And refactor the sequential loop to the following form:
Kernel kernel = new Kernel(){
@Override public void run(){
int i= getGlobalId();
result[i]=inA[i]+inB[i];
}
};
kernel.execute(result.length);
Here we extend a Kernel base class, overriding the run() method to express our data parallel code, and we initiate the execution of the code (over a specific range 0..result.length) using kernel.execute(result.length).
If, at runtime, Aparapi detects that the platform supports OpenCL it will attempt to convert the bytecode of the overridden run() method (and all run-reachable methods) to OpenCL and execute the code on the GPU. If it can’t then the Java code is executed using a thread pool.
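The thread-pool fallback described above can be pictured in plain Java. What follows is only a minimal sketch under stated assumptions, not Aparapi's actual implementation: the `RangeKernel` interface and the `executeInThreadPool` and `add` helpers are hypothetical names invented for this illustration, and the sketch simply splits the range 0..range-1 into one contiguous chunk per worker thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ThreadPoolFallbackSketch {

    /** Hypothetical stand-in for a kernel body: called once per global id. */
    public interface RangeKernel {
        void run(int globalId);
    }

    /** Calls kernel.run(i) for every i in [0, range) using a fixed thread pool. */
    public static void executeInThreadPool(RangeKernel kernel, int range) {
        int workers = Math.max(1, Runtime.getRuntime().availableProcessors());
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            int chunk = (range + workers - 1) / workers; // ceiling division
            List<Future<?>> pending = new ArrayList<>();
            for (int w = 0; w < workers; w++) {
                final int start = w * chunk;
                final int end = Math.min(range, start + chunk);
                if (start >= end) {
                    break; // no work left for the remaining workers
                }
                pending.add(pool.submit(() -> {
                    for (int i = start; i < end; i++) {
                        kernel.run(i);
                    }
                }));
            }
            for (Future<?> f : pending) {
                f.get(); // wait for the chunk and surface any worker failure
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("kernel body failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    /** The array-add example from the post, run through the sketch above. */
    public static float[] add(final float[] inA, final float[] inB) {
        final float[] result = new float[inA.length];
        executeInThreadPool(i -> result[i] = inA[i] + inB[i], result.length);
        return result;
    }
}
```

Each index is written by exactly one worker, so the body needs no locking, which is the same property that makes the loop a good fit for the GPU in the first place.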
Aparapi can work out which Java arrays are being accessed and can therefore ensure that these arrays are transferred to the GPU (and back) as needed.
When we presented our initial Alpha release at JavaOne 2010, one of the expected first questions was ’Will you be releasing Aparapi as an open source project?’. At that time we expressed our ‘hope’ to be able to offer Aparapi as an open source project in the future.
Well the future is here. Aparapi is now available from code.google.com/p/aparapi. AMD has contributed the Java and JNI (C++) code under a modified BSD license along with samples, examples and documentation to enable people to work with and contribute enhancements.
In the past year we have added new features to help more developers take advantage of Aparapi, including support for arrays of Objects (the alpha version only allowed parallel arrays of primitive types), plus shortcuts and performance improvements for common usage patterns.
We look forward to hearing feedback from Aparapi users and accepting contributions to the code base to improve and extend Aparapi.
Please join us at code.google.com/p/aparapi and help us make Aparapi a vibrant open source project.
Gary Frost is a Software Engineer at AMD. His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites, and references to third party trademarks, are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.
OpenCL is a trademark of Apple Inc. used with permission from Khronos.
POSTED IN: AMD Java Labs, Inside Dev Central
TAGS: Aparapi, APIs, heterogeneous computing, Java, open source, OpenCL, Parallel Programming





We’ve been doing GPU development for years (first in GPU assembly, then in GLSL, then OpenCL) and I have to say this certainly looks to be the easiest way to harness the GPU that I have yet to see. Bravo!
Nice to see tools beginning to emerge for Java. Are there plans to provide Eclipse plugins for AMD APP Profiler and gDEBugger as well? A combination of aparapi for initial algorithm design and tools that support profiling and analysis for low level optimization of the Java binding APIs would be very helpful. Our host code will be in Java/Scala and our development/deployment on MacOS/Linux.
I’d rather see NetBeans plugins, and the next developer that reads this might prefer IntelliJ IDEA plugins... IMHO, there are other things to focus on first. For example: being able to run an Aparapi Java program on any OpenCL-capable graphics card.
Of course now that Aparapi is open source we fully expect that the current platform restrictions will be relaxed. As we mention in the FAQ the reason for restricting to AMD devices was mainly to minimize the test matrix prior to release.
Please head over to the project issue list http://aparapi.googlecode.com and raise this as an issue so that the open source patch process can take its course. This is probably a single line patch.
Really nice API! I have one question: how does this relate to the brand new fork-join framework? That also lets programmers easily write multi-threaded code, albeit for multicore CPUs. Maybe some collaboration will be worthwhile?
In Fork/Join you create a thread, and during execution that thread can decide whether to create further subthreads (one or more) or to wait for data from other threads. In OpenCL and CUDA you create many threads, and you get the best performance when there is little or no communication between them.
So those two (Fork/Join and OpenCL) both use many threads, but in very different ways and for solving different problems.
In many ways fork-join is more dynamic and allows the executing code to submit extra work/tasks whilst the original forked code is still executing. This allows recursive type execution. Aparapi has a more restrictive model (via OpenCL) and requires a Kernel to run to completion (albeit across parallel data) before allowing other tasks/kernels to be launched.
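To make the contrast concrete, here is a minimal sketch of the fork-join style using the standard `java.util.concurrent` classes. The `ForkJoinSumSketch` and `SumTask` names and the threshold value are arbitrary choices made for this illustration. Note how a running task decides during execution whether to split itself into further subtasks, something a kernel that must run to completion over a fixed range does not do.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class ForkJoinSumSketch {

    /** Sums data[from..to) by recursively splitting the range in half. */
    static final class SumTask extends RecursiveTask<Long> {
        private static final int THRESHOLD = 1_000; // arbitrary cut-off for this sketch
        private final int[] data;
        private final int from;
        private final int to;

        SumTask(int[] data, int from, int to) {
            this.data = data;
            this.from = from;
            this.to = to;
        }

        @Override
        protected Long compute() {
            if (to - from <= THRESHOLD) {
                // Small enough: do the work sequentially in this task.
                long sum = 0;
                for (int i = from; i < to; i++) {
                    sum += data[i];
                }
                return sum;
            }
            // Too large: the running task itself creates two new subtasks.
            int mid = (from + to) >>> 1;
            SumTask left = new SumTask(data, from, mid);
            SumTask right = new SumTask(data, mid, to);
            left.fork();                     // schedule the left half asynchronously
            long rightSum = right.compute(); // compute the right half in this thread
            return left.join() + rightSum;   // wait for the left half and combine
        }
    }

    public static long sum(int[] data) {
        return ForkJoinPool.commonPool().invoke(new SumTask(data, 0, data.length));
    }
}
```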
I do think that the proposed Java 8 Lambda feature will be very powerful and I would like to see Aparapi use lambdas for expressing Kernel code in the future.
Excellent work! We were looking for this project. Will join the open source project.
Nice project. Just compiled a Mac OS version of Aparapi. Check out the video below for the Mandelbrot sample with and without OpenCL:
http://djjoofa.com/data/videos/mandelbrot_java.mov
Joofa
Great video, thanks for this.
Witold Bolt just submitted a patch to support Mac OS. I just applied the patch so hopefully Mac OS support is in the main line now. I don’t have access to a Mac, but would welcome an independent validation of the build.
This patch also now means that any OpenCL 1.1 runtime should work.
Just tried the new Mac patch. It builds and works fine. Thanks again for the great project.