Using Omniverse Audio2Face in Animated Film Production

NVIDIA Omniverse includes Audio2Face, an AI-powered tool that turns an audio clip into facial animation. Give it a voice track and it can animate the face of a character. It even supports emotional expressions. Very nice! But how do you use this tool as part of a pipeline to create a Pixar-like animated movie?

In this blog I describe the best path forward as I currently understand it, with links to useful videos and notes on where they fit into the big picture. I don’t describe the Audio2Face setup itself – watch the linked videos for that!

Also, please note that I am writing this knowing Omniverse 105 is about to drop any day now, probably along with a lot of new documentation and videos.

Point Cache vs Blendshape Animation

The “smarts” of Audio2Face lie in animating a face mesh to match the movements of a talking character (with emotions). There are other tools around, but I focus only on Audio2Face here. At its heart is an AI model: you feed audio in and get mesh deformations out (e.g., open the mouth wider). It does this by taking an original face mesh and moving all the vertices on the mesh up/down/left/right etc. in 3D space. Nearby points move by similar amounts to keep the skin smooth. The original face mesh movements can then be retargeted to your own character.

Most other platforms don’t use full mesh animations (they don’t animate the position of each individual vertex on your face mesh). They instead use “blendshapes” (aka Shape Keys in Blender). A blendshape is the same sort of mesh deformation, expressed as delta movements (it moves lots of vertices by a small amount) for one particular face pose. For example, you might have blendshapes for opening the mouth, raising the left side of an eyebrow, and so on. Each blendshape gets a weight between 0 and 1, allowing you to combine them. A smile might involve opening the lips a bit, raising the left and right sides of the mouth a bit more, and maybe raising the cheeks a little.
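
To make that concrete, here is a tiny sketch of how blendshape weights combine: the final vertex positions are just the rest positions plus the weighted sum of each blendshape’s deltas. I am using Python with NumPy purely for illustration; the mesh data and blendshape names are made up, not anything from Audio2Face.

```python
import numpy as np

# Illustrative only: fake rest positions for 4 vertices (x, y, z) and
# made-up per-blendshape vertex offsets ("deltas").
base = np.zeros((4, 3))
deltas = {
    "jawOpen":        np.array([[0, -0.5, 0], [0, -0.4, 0], [0, 0, 0], [0, 0, 0]]),
    "mouthSmileLeft": np.array([[0,  0.1, 0], [0,  0.0, 0], [0.1, 0.1, 0], [0, 0, 0]]),
}

def apply_blendshapes(base, deltas, weights):
    """Deform the base mesh by the weighted sum of the blendshape deltas."""
    posed = base.copy()
    for name, weight in weights.items():
        posed += weight * deltas[name]
    return posed

# A "mouth slightly open with a half smile" is just a combination of weights.
print(apply_blendshapes(base, deltas, {"jawOpen": 0.2, "mouthSmileLeft": 0.7}))
```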

(VRoid Studio and VRM files originally defined a blendshape per emotion – joy, sadness, etc. These days it is more common to create a blendshape per micro-expression so they can be combined in many different ways, including being driven from video of a real human face acting. ARKit from Apple defines 52 such movements, but there are other standards. Audio2Face by default uses a set of 46 blendshapes.)

Blendshapes mean an animation clip records weight values for around 50 blendshapes instead of positions for hundreds or thousands of vertices. They can also make an animation clip easier to apply to different characters because the blendshape names are standardized: the blendshape itself is custom per character, but its name and purpose are the same. Many other tools support blendshape animation for this reason.

Audio2Face can be used to generate a “point cache”, which animates all the vertices individually, but it can also take those points and estimate a set of blendshape weights that produce a similar result. This means an animation clip using blendshapes may be lower quality than the original full mesh animation.

So a decision point is whether to use point cache or blendshape animation. This is useful to understand before watching the Audio2Face videos, as they often show Audio2Face being used one way or the other without explaining when each is the better choice.

In my case, I started with a plan to use full mesh animation for quality, but I have since decided to move to blendshapes instead. The main reason is that it seems easier to merge different animations together. For example, I may want a character to wink while they are talking. Audio2Face does not support winks, and trying to animate a wink for each point in a mesh is hard. Using blendshapes, a wink is relatively easy (push the weight of “left eye closed” to 1). I can simply override that one blendshape weight to add winks to an Audio2Face animation clip.
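
To show what I mean by overriding one weight, here is a rough sketch using the USD Python API. It assumes the Audio2Face output ends up as a UsdSkel SkelAnimation prim; the file path, prim path, blendshape name (“eyeBlinkLeft”) and frame range are placeholders for my own setup, not something dictated by the tool.

```python
from pxr import Usd, UsdSkel

# Open the exported talking clip and find the SkelAnimation prim.
# (Paths and the blendshape name below are assumptions for illustration.)
stage = Usd.Stage.Open("talk_clip.usda")
anim = UsdSkel.Animation(stage.GetPrimAtPath("/Character/Animation"))

# Work out which column of the weights array drives the left eyelid.
shapes = list(anim.GetBlendShapesAttr().Get())
wink_index = shapes.index("eyeBlinkLeft")

# Push that one weight to 1.0 for a short range of frames, leaving the
# Audio2Face lip-sync weights on every other channel untouched.
weights_attr = anim.GetBlendShapeWeightsAttr()
for frame in range(48, 60):
    weights = list(weights_attr.Get(Usd.TimeCode(frame)))
    weights[wink_index] = 1.0
    weights_attr.Set(weights, Usd.TimeCode(frame))

stage.GetRootLayer().Save()
```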

Blendshapes can also be animated using skeletal animation, making them easier to use in combination with other skeletal animation clips (such as walking and sitting). This is not possible with point cache animation.

There is also the detail of understanding the animation data itself. For point caches, you need to bring across time-coded mesh deformations, one per mesh. The video How to Use Audio2Face in NVIDIA Omniverse Machinima shows how to do this.

It seems a bit cumbersome. With blendshapes, I believe I can bring across a single animation clip of blendshape weights. Sure, different blendshapes control different meshes, but the animation clip itself is a single file.

Blendshapes with Audio2Face

So, for me, the best videos to watch are the ones that focus on blendshapes with Audio2Face. For example, BlendShapes Part 1: Importing a Mesh in Omniverse Audio2Face focuses on adding blendshapes to a model that does not have any defined. (My original characters created outside Omniverse had blendshapes, but the GLB importer seemed to lose them. Oh well, not a big deal, because I can add them back using Audio2Face.)

The next video in the series is BlendShapes Part 2: Conversion and Weight Export in Omniverse Audio2Face. It goes into how to convert an audio clip into a blendshape animation clip, then load that time series data as a clip in the Sequencer for playback. I hope to automate some of these steps, but it is useful to understand the manual process first.
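
As an idea of the automation I have in mind, the sketch below reads a blendshape weight export (a JSON file) and bakes it into a UsdSkel SkelAnimation prim that can be played back. The JSON keys (“facsNames”, “weightMat”, “exportFps”) are my best guess at what the exporter writes, so treat them, and the file and prim paths, as assumptions to check against your own export.

```python
import json
from pxr import Usd, UsdSkel, Sdf

# Assumed export format: a list of blendshape names plus a per-frame
# matrix of weight values. Check your own JSON for the real key names.
with open("line_01_bsweights.json") as f:
    data = json.load(f)

names = data["facsNames"]        # one name per weight column
frames = data["weightMat"]       # one row of weights per frame
fps = data.get("exportFps", 30)

# Bake the weights into a SkelAnimation prim as time samples.
stage = Usd.Stage.CreateNew("line_01_anim.usda")
stage.SetTimeCodesPerSecond(fps)
stage.SetStartTimeCode(0)
stage.SetEndTimeCode(len(frames) - 1)

anim = UsdSkel.Animation.Define(stage, Sdf.Path("/Character/FaceAnim"))
anim.GetBlendShapesAttr().Set(names)
weights_attr = anim.GetBlendShapeWeightsAttr()
for frame, row in enumerate(frames):
    weights_attr.Set(row, Usd.TimeCode(frame))

stage.GetRootLayer().Save()
```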

The next video, BlendShapes Part 3: Solver Parameters and Preset Adjustment in Omniverse Audio2Face, gets into the details of settings you can change when generating animation clips, such as smoothing levels between facial poses. Useful for improving quality.

Part 4 is less useful for me personally. It talks about how to use existing blendshapes defined on your own character, rather than using Audio2Face to create them. I include it here for completeness since it is in the same video series: BlendShapes Part 4: Custom Character Conversion and Connection in Omniverse Audio2Face.

There are some videos with overlapping information. For example, BlendShape Generation in Omniverse Audio2Face is another useful video that runs through the full process of setting up blendshapes on your characters. The information overlaps with the videos above, but it is great as a second description of the blendshape creation process.

The end result is that each USD character will have a set of additional blendshapes defined for the moving parts of the face, and animation clips generated by Audio2Face will be expressed in terms of those blendshapes.

Organizing your Character Avatar Files

This brings up another brief topic – brief because I have not 100% resolved my complete workflow yet! I have a set of characters all created using the same tool (VRoid Studio in my case). This means they all have identical bone structures and naming, making it easier to share animation clips across characters without having to worry as much about retargeting. (Blendshapes created by Audio2Face will also have standardized names.) That is nice.

Do I need to add all the Audio2Face infrastructure to those characters? If you look at the “character transfer” videos, Audio2Face needs multiple helper characters (the green and grey “Mark” heads).

Thankfully, the answer appears to be “no”. I do need to rig up a stage holding my character with lots of extra mappings so Audio2Face knows how to control it, and I will use that stage to convert audio files into animation clips, but I can then take the blendshapes created for my character and copy them over to a “clean” copy of the avatar that I use in camera shot scenes. I don’t need to bring all the Audio2Face setup with me every time I want to animate the character.
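
As a sketch of what that copying could look like, the snippet below uses Sdf.CopySpec to copy BlendShape prims from the Audio2Face working file into the clean character file. The layer names, prim paths, and blendshape list are placeholders for my setup, and I have not verified this against my full characters yet.

```python
from pxr import Sdf

# Placeholder file names and prim paths; adjust to your own character layout.
src = Sdf.Layer.FindOrOpen("character_a2f_stage.usd")
dst = Sdf.Layer.FindOrOpen("character_clean.usd")

# BlendShape prims are typically authored as children of the face mesh.
blendshape_names = ["jawOpen", "eyeBlinkLeft", "eyeBlinkRight"]  # and so on
for name in blendshape_names:
    path = Sdf.Path(f"/Character/Face/{name}")
    Sdf.CopySpec(src, path, dst, path)

dst.Save()
```

Note that the mesh in the clean file also binds to its blendshapes through the UsdSkelBindingAPI (the skel:blendShapes attribute and the skel:blendShapeTargets relationship), so that hookup needs to come across as well or the copied shapes will have nothing driving them.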

So, I expect to have at least the following USD files per character:

  • The base character with all its original meshes
  • A stage with the character loaded and all the Audio2Face bindings defined
  • An extended character with a copy of just the blendshapes added

The stage is used to:

  • Create the blendshape definitions for the extended character in the last point, and
  • Convert each audio clip into an animation clip using those blendshapes.

Animation Graphs and Blending Animation Clips

I want to combine multiple animation clips. In the simplest case, I want to combine a full body stand/walk/sit animation with an Audio2Face facial animation so a character can walk and talk. In reality, I want to go further, with hand gesture overrides, head turns, eye gaze (look-at targets), separate upper and lower body animation, and more. For example, I often use a standard “sit” animation clip to control the lower body and a custom mocap recording for the upper body.

The Sequencer in Omniverse is fairly basic. It can play back animation clips at different time points, but it cannot blend between them. For example, to transition nicely from walking to standing, you either need an animation clip for the transition, or you need to “blend” between the two (weaken the strength of the walk clip while increasing the strength of the stand pose over a few frames).
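
The blending itself is simple maths. Below is a minimal sketch of the crossfade idea on made-up per-frame channel values; for real joint rotations you would slerp quaternions rather than lerp, but the weighting is the same idea.

```python
import numpy as np

# Fake data: a 120-frame walk clip with 52 animation channels, and a single
# held standing pose. Purely illustrative shapes and numbers.
walk_clip = np.random.rand(120, 52)
stand_pose = np.random.rand(52)

blend_frames = 10                 # length of the transition window
blended = walk_clip.copy()
for i in range(blend_frames):
    t = (i + 1) / blend_frames    # ramps up to 1 over the window
    frame = len(walk_clip) - blend_frames + i
    # Fade the walk out and the stand pose in over the window.
    blended[frame] = (1 - t) * walk_clip[frame] + t * stand_pose
```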

Animation Graph in Omniverse supports blending between clips, so it may be useful here. See Animation Graph Overview in Omniverse USD Composer for an example.

But I am not sure whether animation graphs are good enough to solve the problem. For example, the above video only shows a weight of 1 or 0 (no blending); the character abruptly stops walking when you let go of the keyboard key. So I am still deciding whether a “good” solution needs to bake animation clips using a third-party tool, then get the Sequencer to play a single final clip (one per character). That clip would also include the Audio2Face facial animations. I am also thinking this tool would need to take multiple lines spoken by a character (with silent sections when another character speaks) and create a single long audio file to feed into Audio2Face, so it can deal with the facial expressions for the whole camera shot.
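
For the audio stitching part, here is a sketch of what I am imagining, using pydub (my choice for the example, not something Audio2Face requires); the file names and start times are invented.

```python
from pydub import AudioSegment

# One character's lines for a shot, each with its start time (in seconds)
# within the shot. Silence is inserted wherever other characters speak.
lines = [
    ("alice_line_01.wav", 0.0),
    ("alice_line_02.wav", 6.5),
]

shot = AudioSegment.silent(duration=0)
for path, start in lines:
    gap_ms = int(start * 1000) - len(shot)   # pydub lengths are in milliseconds
    if gap_ms > 0:
        shot += AudioSegment.silent(duration=gap_ms)
    shot += AudioSegment.from_wav(path)

# Feed the combined track into Audio2Face so one animation clip covers the shot.
shot.export("alice_full_shot.wav", format="wav")
```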

Conclusions

I don’t have the complete picture worked out in my mind yet, but a key insight for me was understanding when blendshapes are the better choice, even if they are lower quality. Point caches with full mesh animation may be a better solution for live animation or cases where skeletal animation blending is not needed, but blendshapes seem to be a better choice for my situation, where I want to combine them with skeletal animation clips. I am still pondering, however, how to overcome the relatively basic Sequencer in Omniverse. Do I use Blender, Unreal Engine, Unity, etc. for sequencing and blending animations instead? I was hoping to do everything in one platform to keep things simpler.

Further Reading: Where Other Videos Fit In

There are a number of other videos out there. I point out a few here to show where they fit into the big picture.

How to Use Audio2Face in NVIDIA Omniverse Machinima focuses on point caches, not blendshapes, so I don’t think it is as useful for my personal plans.

Camila Asset Pt 6: Connect Character Setup Meshes to Drive the Full Body in Omniverse Audio2Face describes how to go from just a face model in Audio2Face to driving a full body with Audio2Face. I don’t think I need to do this, as I will export an animation clip instead (a time-coded sequence of blendshape weight values). The characters in camera shots don’t need Audio2Face available – the animation clip is sufficient.

Rig a Face Blender Addon for FREE – Nvidia Omniverse Audio2Face shows how to use Blender with the frogman setup and Audio2Face. I want to do as much as possible in Omniverse, so this is less interesting to me.

Exporting to Unity using Blendshapes within Omniverse Audio2Face uses Blender as an intermediate step to get a character out to Unity. It could be interesting as background material, but I think the videos above are more useful for me.

