AI Movie Pipeline Updates

I want to create a short series on YouTube using AI-generated video. I posted previously about my planned pipeline. As I have learnt more, some aspects have become clearer. This post updates some of those details.

Buy a Bigger Disk

I have a new desktop with a 4TB D: drive. I just filled it. Yikes! Carefully planning and managing the models you download and use may be a financially worthwhile investment of time! One of the advantages of some of the online services is that they already have the model files preloaded. (Meantime, I am off to look for an 8TB drive for my remaining free disk bay.)
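As an aside, a small script like the sketch below can show which model folders are eating the most space, which helps when deciding what to prune. Pure standard library; the folder path is just a placeholder for wherever you keep your models.

```python
import os

MODELS_DIR = r"D:\AI\models"  # placeholder path for your model store

def folder_gb(path: str) -> float:
    """Total size of all files under a folder, in decimal GB."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # skip files that disappear or are locked
    return total / 1e9

for entry in sorted(os.listdir(MODELS_DIR)):
    full = os.path.join(MODELS_DIR, entry)
    if os.path.isdir(full):
        print(f"{folder_gb(full):8.1f} GB  {entry}")
```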

Picking the Right Model Files

Model files can be huge. I started by being amazed by 2GB files. I ended up being grateful that a 30GB model file could still be loaded on my RTX 5090. But you need compatible sets of model files, and as someone new to the space, I have found this a non-trivial exercise. I ended up downloading more files than I needed, or downloading the wrong (incompatible) files. It is a challenge. My advice: spend a bit of time up front understanding which model files you will need.

Image vs Video Models

Stable Diffusion 1.5, SDXL, and Flux are image generation models. Wan 2.1 and Wan 2.2 are video generation models, which do a better job of generating consistent images across video frames. You can also use the Wan 2.1 and Wan 2.2 Text to Video models to generate a one-frame video, which is effectively an image.
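For example, here is a minimal sketch of the one-frame trick using the Hugging Face diffusers Wan 2.1 text-to-video pipeline. The model ID, resolution, and output handling are from memory and worth checking against the current diffusers documentation; the prompt is just a placeholder.

```python
import torch
from PIL import Image
from diffusers import WanPipeline

# Load the smaller Wan 2.1 T2V model; the 14B variant works the same way.
pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")

# Ask for a single frame: a one-frame "video" is just an image.
frames = pipe(
    prompt="portrait of a young detective in a rain coat, soft studio lighting",
    num_frames=1,
    height=480,
    width=832,
    output_type="np",
).frames[0]

# Assuming frames come back as floats in [0, 1], convert to an image file.
Image.fromarray((frames[0] * 255).clip(0, 255).astype("uint8")).save(
    "detective_portrait.png"
)
```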

Wan 2.1 or 2.2? fp16, fp8, or bf8?

Unfortunately, the Wan model files are big. Wan 2.1 has a 1.3B parameter model and a 14B parameter model. I personally found the output of the 1.3B parameter files not to my satisfaction, so I went with the 14B parameter model files. Okay, ego might have been an influence. (I bought this new GPU card so I am going to USE it!) The problem is that as you start adding more capabilities, the sizes get bigger. Even though I started out fitting in GPU memory, adding capabilities soon pushed me past the limit.

One trick to keep the file size under control is to use the fp8 or bf8 files instead of the fp16 versions of the model files. These use less precise floating point numbers (8 bits instead of 16), roughly halving the weight storage. The bf8 encoding uses an extra bit for the exponent, allowing it to represent a wider range of numbers, which is frequently useful for AI models.
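The back-of-the-envelope arithmetic below (my own rough numbers, ignoring text encoders, VAEs, activations, and other overheads) shows why the precision choice matters as much as the parameter count.

```python
# Rough weight-memory estimate by parameter count and precision.
def weight_gb(params_billions: float, bits_per_param: int) -> float:
    return params_billions * 1e9 * bits_per_param / 8 / 1e9  # decimal GB

for params in (1.3, 5.0, 14.0):
    for bits in (16, 8):
        print(f"{params:>4}B @ {bits}-bit: ~{weight_gb(params, bits):.1f} GB")
# 14B at 16-bit is ~28 GB of weights alone; at 8-bit it drops to ~14 GB.
```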

Wan 2.2 now has 1.3B, 5B, and 14B parameter models. The 5B improves the quality over the 1.3B model, but the 14B parameter model is better again. The 5B model is a nice size for common consumer GPU cards.

So there are more options to think about. If you want a smaller model file, do you use the new Wan 2.2 14B model (which appears to deliver better results, but from a larger file) and drop back from fp16 to fp8, or drop from 14B down to 5B? And what resolution do you want to output? Is the smaller Wan 2.1 14B model better than the Wan 2.2 5B model? There are lots of combinations to be tested, with no very clear answer.

Oh, and if you want Control Net pose control, Multi-Talk, etc., you may need different model files, which may be bigger.

I don’t have an answer for what the “best” combination is. The 5B model sounded potentially useful, being better than the 1.3B parameter model while using less memory than the 14B parameter model, but I have heard different opinions on how 5B with fp16 compares to 14B with fp8. Initial reports on the 5B model were promising, but there have also been reports of flickering. Time will tell as more people experiment.

Block Swapping

Another consideration is that the Wan models support block swapping. You can control how many “blocks” of the model are loaded into GPU memory at a time, using your computer’s main memory for the rest. This allows you to run a model that needs, say, 48GB on a 24GB GPU. Processing is slower due to the overhead of swapping data into GPU memory when needed, but it runs. Picking a smaller model where all the data fits on the GPU is faster.
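To make the idea concrete, here is a minimal PyTorch sketch of the principle (not how any particular Wan implementation does it): keep a few blocks resident on the GPU and move each remaining block in and out as it is used. The block sizes and counts are stand-ins.

```python
import torch
from torch import nn

def run_with_block_swap(blocks, hidden, device="cuda", keep_resident=2):
    """Run a stack of blocks, keeping only some resident on the GPU."""
    for block in blocks[:keep_resident]:
        block.to(device)                 # these stay on the GPU throughout
    for i, block in enumerate(blocks):
        swapped = i >= keep_resident
        if swapped:
            block.to(device)             # copy this block's weights in just-in-time
        hidden = block(hidden)           # run the block
        if swapped:
            block.to("cpu")              # release its GPU memory again
    return hidden

# Tiny demo with stand-in "blocks"; a real diffusion transformer's blocks are
# far larger, which is where the memory saving matters.
blocks = [nn.Linear(256, 256) for _ in range(8)]
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1, 256, device=device)
out = run_with_block_swap(blocks, x, device=device)
```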

LoRA Training

A LoRA is a set of adjustments applied to a base model so that it is more likely to generate output resembling what the LoRA was trained on. A common technique is to use a special trigger word to control when the LoRA changes kick in.

I use this for training a model on characters, but it can also be used for global style changes. To use a LoRA, you take the base model and then “apply” the LoRA changes on top. This means the LoRA has to be compatible with the underlying model. If two models are closely related, the same LoRA can often be applied to both; if they differ more significantly, you need to retrain the LoRA separately for each base model. Some knowledge of the ancestry of two models is useful for predicting whether a newly trained LoRA is needed.
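As an illustration, here is a minimal sketch of applying a character LoRA with Hugging Face diffusers. I am using SDXL as the example base model; the LoRA folder, filename, and the “ohwx” trigger word are placeholders for whatever your LoRA was actually trained with.

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# The LoRA must have been trained against a compatible base model.
pipe.load_lora_weights("./loras", weight_name="my_character.safetensors")

# Including the trigger word in the prompt is what "switches on" the LoRA's
# learned character.
image = pipe("a photo of ohwx standing in a forest, cinematic lighting").images[0]
image.save("character_test.png")
```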

For training a LoRA model from images, I came across the AI Toolkit by Ostris (https://github.com/orgs/ostris). I used it to train a model, and it made the whole thing a much less painful experience. Many other approaches I could not get to run on Windows, even after hours of hair pulling trying to get the right Python libraries and model files installed. This tool just worked.

I have only used it on one character so far, but I felt I had more success with fewer, very consistent images than with more images from different angles that did not look quite right. Being ruthless and rejecting borderline images seemed to improve consistency.

Image to Video

One thing I had not previously understood well was when to use T2V (text to video) and I2V (image to video) with Wan 2.1 and now Wan 2.2.

For film creation, the I2V model is very important. You create an image of the first frame of the scene you want, then use text to describe how the characters in the scene should move. This can avoid the need to use ControlNet or similar models. It also means you can first storyboard the scene with multiple camera shots using the images alone, to get a feel for pacing and flow. There is a length limit on the video clip you can create, but I normally go with short shots anyway, moving the camera from character to character as they speak and move. Another bonus is you don’t spend as much time rendering before discovering whether the shot is acceptable.
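Here is a rough sketch of that image-to-video step using the Hugging Face diffusers Wan 2.1 I2V pipeline. The model ID, resolution, frame count, and prompt are placeholders from memory; check the current diffusers docs (and adjust for Wan 2.2 once your tooling supports it).

```python
import torch
from diffusers import WanImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

pipe = WanImageToVideoPipeline.from_pretrained(
    "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # trade some speed for fitting in GPU memory

first_frame = load_image("scene_01_frame_001.png")  # the storyboard image
frames = pipe(
    image=first_frame,
    prompt="The detective lowers the letter and looks towards the door, "
           "slow camera push-in",
    height=480,
    width=832,
    num_frames=81,
    guidance_scale=5.0,
).frames[0]
export_to_video(frames, "scene_01_shot_01.mp4", fps=16)
```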

With the T2V model, you describe a scene with text. You can apply LoRAs (such as a character LoRA) and trigger them by including the trained trigger keyword in the text, but it is hard to create consistent-looking shots this way. It can however be useful for creating an image of a new location or environment.

If the starting image has a good view of a particular character, and the character is just talking, there is frequently no need to use a LoRA with I2V. This is great news for bit actors (characters that only appear in a small number of shots). LoRA training is doable, but it takes a fair bit of effort. It is much easier to just generate a good frontal image of a character and use that.

One approach to getting longer shots is to take the last frame of a previous render and use that as the first frame of the next render (I believe with some extra overlapping frames as context). This seems to work reasonably well in the few examples I have seen.
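Below is a rough sketch of the idea, reusing the I2V pipeline from the earlier sketch. It only overlaps a single frame, which is simpler than the overlapping-context tricks some workflows use; the prompts and settings are placeholders.

```python
from PIL import Image
from diffusers.utils import export_to_video, load_image

# `pipe` is the WanImageToVideoPipeline from the earlier sketch.
image = load_image("scene_01_frame_001.png")
prompts = [
    "The detective walks to the desk and picks up the letter",
    "He reads the letter, frowning, then looks up towards the door",
]

all_frames = []
for i, prompt in enumerate(prompts):
    frames = pipe(
        image=image, prompt=prompt, height=480, width=832, num_frames=81
    ).frames[0]
    # Skip the first frame of later clips so the shared frame is not duplicated.
    all_frames.extend(frames if i == 0 else frames[1:])
    # Assuming float frames in [0, 1]: convert the last frame back to a PIL
    # image so it can seed the next clip.
    image = Image.fromarray((frames[-1] * 255).clip(0, 255).astype("uint8"))

export_to_video(all_frames, "scene_01.mp4", fps=16)
```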

The important implication is that you first want to generate a good quality image for the first frame of each video clip (like a storyboard), then convert that first frame into a video clip. If you can generate good initial images, you may not need a LoRA model.

Control Net

One of the many questions I had is whether Control Net with OpenPose or DWPose is needed for videos. Do I need to worry about animating the movements of a character to control their movement through a scene? This is still an open question for me, but my current plan is to stick to short clips (e.g. 5 seconds) and see if text prompting can get me a good enough result. If I can start a scene with a good image, with the characters placed strategically, it may be enough, and I can avoid animation with Control Net. This may take some time to work out. My concern is that a movie without much physical action can become boring. Is it feasible to stitch together multiple short video clips to achieve a consistent-feeling scene? I hope so, because it would reduce the complexity by quite a bit.

My suspicion is that simple scenes will work fine with simple prompting, but more complex scenes, like a fight scene, may come out better using Control Net with a depth map and/or skeleton pose animation, as the frames will more closely follow the provided animation. So different scenes might use different techniques.

Another alternative is the Wan Fun models, which can be used to transfer motion from one video to another. Take a character, a video of another character dancing (or acting), and transfer that motion across.

Lip Sync

If I record myself talking and use a voice changer, video-to-video transfer may be an option. Otherwise, approaches like Wan Multi-Talk may also be an option. This is an area of further research for me. It also relates to the source of the voice. Some lip sync solutions can use the audio file alone, meaning it does not matter how the file was created. However, if you record audio along with video of your face, you also have the option of facial transfer from the driving video to the final result.

Image Generation

With Image to Video looking promising in the Wan models, it places greater importance on creating a good starting image with characters positioned correctly. This does not have to use Wan – it is just a static image. This means more traditional approaches like using Flux and pose transfers are feasible.
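For example, here is a minimal sketch of generating a starting frame with Flux via Hugging Face diffusers. FLUX.1-dev is a gated model, so this assumes you have accepted the licence and logged in to Hugging Face; the prompt, resolution, and settings are placeholders.

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # helps fit consumer GPU memory

image = pipe(
    "cinematic wide shot of a detective's office at night, rain on the window",
    height=768,
    width=1344,
    guidance_scale=3.5,
    num_inference_steps=30,
).images[0]
image.save("scene_01_frame_001.png")
```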

Camera Controls

One of the benefits of the Wan models is that they do a pretty good job of understanding camera movements. This video includes phrasing for different camera shots, with examples. It uses an online service, but the service was running Wan, so the prompts are the same when running locally.

Note that the above video also demonstrates creating starting images with characters using Flux. This I found very useful. By having a series of acceptable images at different angles (my LoRA image training set for a character), I can instead pick an image I prefer, put it on a transparent background, and use that to generate a static pose fairly quickly. This can also be combined with a separate image for the background environment.
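As one low-tech way to rough out a layout, the sketch below composites a character cut-out (on a transparent background) onto a separately generated environment image using Pillow. The file names, scale, and placement are placeholders; the result can then feed into whatever image model refines the final frame.

```python
from PIL import Image

background = Image.open("office_background.png").convert("RGBA")
character = Image.open("detective_cutout.png").convert("RGBA")

# Scale the character relative to the background, then paste using the
# alpha channel so the transparency is respected.
scale = 0.6
w, h = character.size
character = character.resize((int(w * scale), int(h * scale)))
x = (background.width - character.width) // 2
y = background.height - character.height
background.alpha_composite(character, dest=(x, y))

background.convert("RGB").save("scene_01_layout.png")
```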

Conclusions

A bit of a brain dump, I know, but it is looking promising.

  • Generating static images per camera shot may avoid the need for LoRAs with video models (but you still need to create initial images with consistent characters).
  • These static images are great as a storyboard. Render out the dialog, generate static images per camera shot, and you can “watch” the movie (with a bit of imagination) to check pacing.
  • Then start rendering the static frames into full video clips.

There are still some unknowns for me.

  • How to pose characters in the initial static frames?
  • Wan 2.1 vs 2.2, 5B vs 14B, fp8 vs fp16, Flux vs Wan for creating static images, and Wan 2.1 has Multi-Talk for dialog lip sync but 2.2 does not (yet). Which combination can I use on my hardware with acceptable results?
  • Can I get by without LoRAs?
  • Are the Wan I2V model text prompts enough for acting in a scene?

But the good news is that the decisions to be made are generally looking clearer. More testing is needed to try out some of the combinations. With Wan 2.2 so new, bringing what appear to be significant quality improvements plus the option of the 5B model, it feels like things might become much clearer over the next few weeks. For my project, I need lip sync. Going with Wan Multi-Talk when it becomes available for 2.2 is an option, but so is using Wan 2.1 with Multi-Talk for now and upgrading later once 2.2 support is ready and has proven itself.

