Note: This article is under construction. This is part 1.
Feel free to ask questions or suggest anything that you’d like me to add or discuss!
My contact details can be found at the bottom of this post.
Introduction to Part 1
Building and launching an application or game for a mobile VR headset like the Oculus Quest is a challenging task. Not only should there be interesting interactions and gameplay, the application must also be built so that it holds its frame rate under all circumstances.
Personally, improving the performance of processes and software is something that keeps me excited. I have therefore decided to write about one of my small projects: porting an application from PC VR to the Oculus Quest. I will discuss what I am doing to ensure it runs with good performance on the Quest. This will most likely become a series of posts where I start from the basics and go on to discuss the alterations to the application that have a big impact on performance.
I will discuss a couple of tools that I’ve used to profile and identify bottlenecks for this specific application – Unity, the Qualcomm Snapdragon Profiler and RenderDoc.
I hope you will enjoy the post, and I encourage you to comment and share any techniques that you typically use when optimizing your application.
Disclaimer: I will not go into shader optimization or alterations of the render pipeline.
Table of contents
- The application – Be With Mars
- Establishing the baseline
- Details of the PC version
- Details of the Oculus Quest version
- Capturing performance metrics – Unity Profiler
- Capturing performance metrics – Android Debug Bridge
- Capturing performance metrics – RenderDoc
- Capturing performance metrics – Qualcomm Snapdragon Profiler
The application – Be With Mars
For this post, the application I will discuss goes by the name “Be With Mars”. The application is a work in progress and has not been released yet. I began developing it for PC (SteamVR) but have slowly been working on porting it to Oculus Quest.
The application is based around NASA’s InSight Mars Lander, which landed on Mars on the 26th of November 2018. InSight stands for Interior Exploration using Seismic Investigations, Geodesy and Heat Transport. It is a robotic lander designed to study the interior of Mars, and it carries a series of scientific instruments such as a seismometer and a heat probe. If you would like to know more about the actual lander, please visit NASA’s homepage here: https://mars.nasa.gov/insight/
Please note that the application is developed by myself and that the work is not affiliated with NASA. The 3D model that I have used can be found on the NASA 3D resources repository here: https://nasa3d.arc.nasa.gov/
The screenshot is from a sequence where the robotic arm places the seismometer onto the surface of Mars.
Establishing the baseline
In order to know what we are working with it is important to analyze the performance of the application as early as possible. Taking a first look at the PC version we can simply use the Statistics overlay in Unity as an initial indicator of what is going on from a rendering perspective:
Details of the PC version
When moving around and looking at the lander there are between 500k and 1M triangles and vertices to process, while pushing somewhere between 400 and 700 batches each frame.
There is one realtime directional light shining and casting shadows onto the lander and the terrain. There is also fog and a few simulated volumetric sand storms around the lander.
The shader is the Standard shader with an Albedo, Metallic and Normal map. Each mesh uses somewhere between 2 and 10 different materials.
The terrain is fairly complex. The area closest to the lander is stitched together using photogrammetry from pictures taken by one of its cameras.
Some of the geometry is static while the moving parts are not.
Although the application has not been optimized at all so far, the GPU of a modern desktop PC is capable of handling this amount of data. The Oculus Quest, however, with its Adreno 540 GPU, will struggle a lot under these conditions.
Details of the Oculus Quest version
Since the 3D geometry of the lander amounted to 140k vertices and 70k faces (as reported by Blender), it was clear that the model would not be easy to render on the Quest without simplification. I began reducing the number of vertices and faces, and after a few iterations I settled on a decimated model containing 60k vertices and 30k faces (as reported by Blender).
Next, I removed the terrain used in the PC build and replaced it with a simple quad, simply because the terrain is too complex.
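To put the decimation into perspective, here is a small sketch that computes the relative reduction from the Blender counts quoted above. The function name is my own and only serves as illustration.

```python
def reduction_percent(before: int, after: int) -> float:
    """Return how much smaller `after` is than `before`, in percent."""
    return (before - after) / before * 100.0


# Counts as reported by Blender for the original and the decimated model.
vertex_reduction = reduction_percent(140_000, 60_000)
face_reduction = reduction_percent(70_000, 30_000)

print(f"vertices reduced by {vertex_reduction:.1f}%")
print(f"faces reduced by {face_reduction:.1f}%")
```

Both counts drop by a little over half, so the decimated model keeps roughly 43% of the original geometry.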
I also removed the skymap so that I’d only focus on the actual lander geometry for this analysis.
Game engine and Oculus SDK
The project was built with Unity version 2019.3.6 and the Oculus Integration version 14.
For this project I decided to use the Universal Render Pipeline (URP), simply because I wanted to learn more about it. The version I used was URP version 7.3.1, released on 2020-03-11. You can find more about URP here: https://docs.unity3d.com/Packagesfirstname.lastname@example.org/manual/index.html
I enabled Single Pass Stereo rendering. This will help speed up the rendering pass.
I decided to begin the port using the Lit shader, mainly because I wanted to see how physically based rendering with one realtime light and shadows performs on the Quest.
Capturing performance metrics – Unity Profiler
To start things off I created a development build, started the application and looked straight toward the lander. After that I connected the Unity Profiler to the running instance and selected the Rendering profile module. Here are some statistics:
From this we can see that we have almost 140 draw calls that render roughly 265k triangles and 312k vertices. We can also see that there are 4 render textures in use, with 3 render texture switches.
Capturing performance metrics – Android Debug Bridge
Still looking toward the lander I captured some more raw statistics using Android Debug Bridge and the Oculus Mobile logs, by executing this command in a console window:
adb logcat -s VrApi
To learn about the details of the Oculus Mobile log, please look here:
Here is one sample from the actual performance:
FPS=72, Prd=44ms, Tear=0, Early=0, Stale=0, VSnc=1, Lat=1, Fov=0, CPU4/GPU=2/2,1651/414MHz, OC=FF, TA=0/0/0, SP=N/N/N, Mem=1804MHz, Free=1011MB, PSM=0, PLS=0, Temp=31.0C/0.0C, TW=1.96ms, App=9.14ms, GD=0.00ms, CPU&GPU=13.01ms, LCnt=1, GPU%=0.74, CPU%=0.41(W0.43), DSF=1.00
Let’s look at a few of the numbers. We have zero stale frames and zero early frames. At this point, if these numbers are steady, we can suspect that our application is just able to keep up with the refresh rate of the screen. We can see that Fixed Foveated Rendering is off (Fov=0). At the time of writing, FFR has issues with the Universal Render Pipeline in Unity and does not work properly.
As you can see, the GPU is occupied for 9.14 ms (App) while a whole frame takes 13.01 ms (CPU&GPU=13.01). Since the Quest operates at 72 Hz, each frame must be rendered within 13.89 ms. This is our next hint that there is very little headroom left in each frame.
When the GPU is running at the lowest clock level of 414 MHz (GPU=2) its utilization is 74% (GPU%=0.74). Similarly, CPU core #4 is running at 1651 MHz with a utilization of 41% (CPU%=0.41).
The Oculus Quest supports dynamic clock throttling, meaning it can adjust its CPU and GPU clock speeds based on the complexity of the scene. For this initial study I manually locked the CPU and GPU clock speeds to their lowest settings using the following commands:
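The frame budget arithmetic can be written down as a short sketch. The refresh rate and the frame time are taken from the text and the log sample; the variable and function names are my own.

```python
REFRESH_RATE_HZ = 72   # Quest display refresh rate
FRAME_TIME_MS = 13.01  # CPU&GPU value from the log sample


def frame_budget_ms(refresh_rate_hz: float) -> float:
    """Time available per frame, in milliseconds, at a given refresh rate."""
    return 1000.0 / refresh_rate_hz


budget = frame_budget_ms(REFRESH_RATE_HZ)
headroom = budget - FRAME_TIME_MS

print(f"budget   = {budget:.2f} ms")
print(f"headroom = {headroom:.2f} ms ({headroom / budget:.1%} of the frame)")
```

With these numbers the headroom comes out at a bit under 1 ms, so a small spike in scene complexity is enough to miss a frame.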
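If you want to track these values over time instead of reading them by eye, the log line can be split into fields. Below is a minimal sketch that assumes the comma-separated key=value layout of the sample above; the field set may differ between runtime versions, and the helper names are hypothetical.

```python
SAMPLE = (
    "FPS=72, Prd=44ms, Tear=0, Early=0, Stale=0, VSnc=1, Lat=1, Fov=0, "
    "CPU4/GPU=2/2,1651/414MHz, OC=FF, TA=0/0/0, SP=N/N/N, Mem=1804MHz, "
    "Free=1011MB, PSM=0, PLS=0, Temp=31.0C/0.0C, TW=1.96ms, App=9.14ms, "
    "GD=0.00ms, CPU&GPU=13.01ms, LCnt=1, GPU%=0.74, CPU%=0.41(W0.43), DSF=1.00"
)


def parse_vrapi_line(line: str) -> dict:
    """Split a VrApi stats line into a {key: raw string value} dict."""
    # Drop any logcat prefix (timestamp, tag) that precedes the first field.
    start = line.find("FPS=")
    if start >= 0:
        line = line[start:]
    fields = {}
    # Fields are separated by ", "; a bare "," can occur inside a value.
    for token in line.split(", "):
        key, sep, value = token.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def to_ms(value: str) -> float:
    """Convert a value such as '9.14ms' to a float in milliseconds."""
    return float(value[:-2]) if value.endswith("ms") else float(value)


stats = parse_vrapi_line(SAMPLE)
app_ms = to_ms(stats["App"])
frame_ms = to_ms(stats["CPU&GPU"])
gpu_util = float(stats["GPU%"])
cpu_util = float(stats["CPU%"].split("(")[0])  # ignore the "(W...)" part

print(f"App={app_ms} ms, CPU&GPU={frame_ms} ms, "
      f"GPU={gpu_util:.0%}, CPU={cpu_util:.0%}, stale={stats['Stale']}")
```

Values are kept as raw strings in the dictionary and only converted where needed, since several fields (for example CPU4/GPU and Temp) pack more than one number.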
adb shell setprop debug.oculus.adaclocks.force 0
adb shell setprop debug.oculus.cpuLevel 2
adb shell setprop debug.oculus.gpuLevel 2
By doing this I get an immediate view of how the CPU and GPU are occupied throughout a frame when running at their slowest clock speeds.
Had I instead left dynamic clock throttling enabled, then given the utilization of the CPU (41%) and the GPU (74%), I would expect the clock speed of the GPU to jump up one or two levels to give the hardware more headroom in each frame.
Let’s look at another tool.
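When you switch clock levels often, it can be handy to script the three commands. The following is a rough sketch that only wraps the exact adb invocations shown above; the function names are my own, and actually applying the commands assumes adb is on your PATH with a headset connected.

```python
import subprocess
from typing import List


def clock_lock_commands(cpu_level: int, gpu_level: int) -> List[List[str]]:
    """Build the adb invocations that pin the CPU and GPU clock levels."""
    props = [
        ("debug.oculus.adaclocks.force", 0),   # same value as in the text
        ("debug.oculus.cpuLevel", cpu_level),
        ("debug.oculus.gpuLevel", gpu_level),
    ]
    return [["adb", "shell", "setprop", name, str(value)]
            for name, value in props]


def apply_commands(commands: List[List[str]]) -> None:
    """Run each command in order and raise on the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


# Print the commands for inspection without touching the device.
for command in clock_lock_commands(cpu_level=2, gpu_level=2):
    print(" ".join(command))

# To apply them on a connected headset:
# apply_commands(clock_lock_commands(cpu_level=2, gpu_level=2))
```

Keeping the command construction separate from the execution makes it easy to check what will be sent before running anything against the device.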
Capturing performance metrics – RenderDoc
Using RenderDoc we can find some additional summary data:
Just as the Rendering module in the Unity Profiler told us, the application is busy performing 136 draw calls while rendering to a total of 4 render textures (4 RTs).
Looking in the Event Browser we can get a feeling for the rendering workflow:
Since I am using a realtime directional light we can see that rendering the shadowmap takes up about 14% of the rendering time. This happens between Event ID 285 and 756.
Most of the remaining rendering time, about 83%, is spent between Event ID 761 and 1517 rendering the actual lander geometry (the Render Opaques step).
Rendering the Shadowmap
Let’s take a look at the Shadowmap and how it is visualized in RenderDoc:
At the top you see a historical timeline of the captured events/draws. Be aware that this is not a timeline of how long each call took.
The small orange and blue triangles indicate when reading and writing to a buffer occur. For the shadowmap you can clearly see that it is being written to during the Render Main Shadowmap stage and read from during the Render Opaques stage.
On the left you can see that I’ve highlighted Event ID 751 (draw call #73). In this last step the application renders the ground plane into the shadowmap (the big square with 36 vertices). Internally, the shadowmap is a temporary buffer in memory, here called “TempBuffer1”, with dimensions of 2048×2048 pixels. Reading from and writing to temporary buffers takes quite a lot of time on mobile devices.
Multi-pass vs Single-pass stereo rendering
Below you can see a capture from the application when Multi-Pass Stereo rendering was enabled:
The application begins by rendering the shadow map (Render Main Shadowmap) from the influence of the realtime light.
Rendering the Opaque Geometry
After that you can see that the application continues and renders the actual geometry (Render Opaques). Comparing the two screenshots above you can see that when using Multi-Pass Stereo rendering we end up with two Render Opaques stages: one for the left eye and one for the right eye. With Single-Pass Stereo rendering enabled, however, the hardware can reuse work that is common to both eyes (e.g. culling and shadowmap calculations). With Single-Pass Stereo rendering the system uses one eye texture array and renders each object to the left and right eye in direct succession. With Multi-Pass Stereo rendering the system uses two individual eye textures and renders all graphics to one eye first and to the other eye afterwards.
Comparing the Single-Pass and Multi-Pass Stereo rendering builds, I see about 70 more draw calls for the latter, which obviously stresses the hardware more than Single-Pass Stereo rendering.
In my initial build I had forgotten to enable Single-Pass Stereo rendering, which meant the application was not hitting the target frame rate, and I got numerous stale and late frames. So if you can, enable Single-Pass Stereo rendering; it also saves battery on the device.
Find the 80% that really matter
Now, rather than blindly staring at the absolute numbers, I suggest you use the Duration column to get a general feeling for which parts take up most of your rendering time. Focus your optimization effort only after you’ve gathered valid data about where the clock cycles are spent.
A simple way to quickly see where most of the time is spent is to use the Performance Counter Viewer and enable the GPU Duration counter. Without going into details, for this application, in my listing (not shown here) I see that rendering the ground plane takes up 33% of the whole Render Opaques step.
The ground plane itself consists of 36 vertices while the shadowmap is a temporary buffer sized at 2048×2048. It is being rendered to more than 50% of the screen. In general, reading and writing to/between temporary buffers and the eye texture buffers can slow down your application considerably, especially if you have a lot of post-processing effects happening in the temporary buffers.
The second most expensive operation is rendering one big chunk of the undercarriage. It consumes 10% of the Render Opaques step.
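The idea of finding the few draws that dominate can be sketched in code: sort the draws by duration and keep the biggest ones until they cover a chosen share of the total. In the example data only the ground plane (33%) and the undercarriage chunk (10%) reflect my capture; every other name and value is made up purely for illustration, as is the function name.

```python
def top_contributors(durations: dict, threshold: float = 0.8) -> list:
    """Return (name, share) pairs, largest first, until `threshold` is covered."""
    total = sum(durations.values())
    ranked = sorted(durations.items(), key=lambda item: item[1], reverse=True)
    selected = []
    running = 0
    for name, duration in ranked:
        if running >= threshold * total:
            break
        selected.append((name, duration / total))
        running += duration
    return selected


# Durations in arbitrary units. Only the first two shares come from the
# capture discussed above; the remaining entries are hypothetical.
draw_durations = {
    "ground plane": 330,
    "undercarriage chunk": 100,
    "solar panel left": 90,
    "solar panel right": 90,
    "deck": 80,
    "robotic arm": 70,
    "legs": 60,
    "seismometer": 50,
    "heat probe": 40,
    "antenna": 30,
    "cables": 30,
    "small parts": 30,
}

hot_spots = top_contributors(draw_durations)
for name, share in hot_spots:
    print(f"{share:6.1%}  {name}")
```

The same function works on any exported list of draw durations, as long as you map each draw to a name you recognize.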
Let’s move on to another tool.
Capturing performance metrics – Qualcomm Snapdragon Profiler
The Qualcomm Snapdragon Profiler is a profiling tool with which you can dig even deeper into the various stages of the rendering process. One thing I like about the Snapdragon Profiler is that you can capture data about the application and the hardware in realtime as well as take trace captures.
In the capture above we can see a full frame running between the green marker (IB1 Start Markers) and the red marker (Flush Markers). It is able to do its job just inside the VSync interval of 13.89 ms.
The total time for this frame is the sum of the three Surfaces entries, namely the rendering of the shadow map texture and the two eye render textures. This adds up to 2.6 + 4.6 + 5.8 = 13.0 ms.
While the graphics are rendered to the eye buffers, you can see that the actual scan-out to the display (OVR::TimeWarp) periodically interrupts the rendering process. In this case it consumes about 1.6 ms. This means you actually have roughly 1.5 to 2 ms less time to perform your game logic and have everything rendered smoothly to the display. Spend your cycles wisely!
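As a quick sanity check, the numbers from this capture can be put side by side. This is only a sketch of the arithmetic: the surface durations and the TimeWarp cost are read off the capture, and treating the TimeWarp cost as a fixed amount to subtract from the frame budget is my simplifying assumption.

```python
REFRESH_RATE_HZ = 72
SURFACE_MS = [2.6, 4.6, 5.8]  # the three Surfaces entries in the capture
TIMEWARP_MS = 1.6             # approximate OVR::TimeWarp cost in the capture

vsync_interval = 1000.0 / REFRESH_RATE_HZ
surfaces_total = sum(SURFACE_MS)

# Assumption: reserve the TimeWarp cost up front to get a budget to aim for.
effective_budget = vsync_interval - TIMEWARP_MS

print(f"VSync interval   = {vsync_interval:.2f} ms")
print(f"surfaces total   = {surfaces_total:.1f} ms")
print(f"effective budget = {effective_budget:.2f} ms")
```

The effective budget is the number I would aim for when planning optimizations, rather than the full VSync interval.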
End of Part 1
I hope you have enjoyed this first part of the series. Feel free to comment and give me suggestions on what to discuss in relation to the application and the tools. I will begin working with part two shortly!
You can email me at peter dot thor at vicator dot se or contact me on Twitter.
Some of the planned topics for Part 2
When one of the Surfaces is selected we can see that the current setting for MSAA is reported as “4” – meaning 4xMSAA is enabled. In my initial builds I had forgotten to enable MSAA.
Maybe some text about the binning/tiling and Render/MSAA
E.g. 15 tiles for 0xMSAA vs 60 tiles for 4xMSAA
Lock GPU & CPU
To identify the first bottleneck. Where to put the focus for this app. 80/20.
Reason. Approach. Results. When is it good enough.
Revisit view on performance
The Qualcomm Snapdragon Profiler
Describe what can be found when using it.
Describe why locking GPU/CPU will help.
Describe a typical frame.
Describe gaps between each frame.
Various surfaces (e.g. for effects)
Compare it against, for example, GPU4 and see what the difference is now:
Primitives / Shading / Vertices
Describe what is going on in relation to this:
Describe what can be found when using it, overall and e.g. temporary buffers for intense gfx operations. More on understanding some of its details.
Difference between various URP. Typical problem with previous version.
Single vs Multipass.