I am exploring how to assemble scenes for computer animation from text using AI. This is partly because I would like to speed up my workflow, and partly a learning exercise to understand the challenges of applying AI to real-life problems. This time I am looking at GPT to see how it can be used to tackle the problem.
To summarize the problem: how do you convert arbitrary free text into a restricted vocabulary of commands that software can process? Software cannot have new options added on the fly – it has a well-defined set of operations it supports and options it understands. The output could be JSON (which is easily machine processable), or text that follows a limited grammar that can be tokenized and parsed, much like a programming language.
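To make that concrete, here is roughly the kind of structured target I have in mind for a single shot, sketched in Python. The field names are illustrative placeholders, not a settled schema:

# Illustrative only: one possible machine-readable target for a single shot.
shot = {
    "character": "Sam",
    "location": "desk",
    "pose": "sitting",
    "face": "sad",
    "look_at": "hands",
    "speech": "Why does this happen to me?",
}

# The same information as a command line in a restricted grammar:
command = 'CHARACTER Sam LOCATION desk POSE sitting FACE sad LOOKAT hands SPEECH "Why does this happen to me?"'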
Text to Command
My first experiment was to try out the OpenAI “Text to Command” playground, which is very close to what I am after. The following is an example of what I am trying to do:
Convert this text to a programmatic command:
Example: Sam is sitting on his desk in the classroom. He looks sad, looking at his hands. It's midmorning and slightly cloudy. Sam says "Why does this happen to me?"
Output: CHARACTER Sam LOCATION desk POSE sitting FACE sad LOOKAT hands SPEECH "Why does this happen to me?"
Hank is standing at his desk. He is angry with his arms crossed. He is looking out of the window where the sun is shining brightly.
Here is what the playground returned.
Output: CHARACTER Hank LOCATION desk POSE standing FACE angry LOOKAT window SPEECH ""
Not bad! It is only a single character, but it got the idea of what I was after pretty well.
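For anyone wanting to drive the same experiment through the API rather than the playground, a rough sketch using the older (pre-1.0) OpenAI Python client is below. The model settings are my guesses rather than the playground preset's exact values:

import openai  # pip install openai (the pre-1.0 Completion-style client)

openai.api_key = "sk-..."  # your API key

prompt = '''Convert this text to a programmatic command:

Example: Sam is sitting on his desk in the classroom. He looks sad, looking at his hands. It's midmorning and slightly cloudy. Sam says "Why does this happen to me?"
Output: CHARACTER Sam LOCATION desk POSE sitting FACE sad LOOKAT hands SPEECH "Why does this happen to me?"

Hank is standing at his desk. He is angry with his arms crossed. He is looking out of the window where the sun is shining brightly.'''

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    max_tokens=100,
    temperature=0,  # keep the extraction as deterministic as possible
)
print(response.choices[0].text.strip())  # e.g. Output: CHARACTER Hank LOCATION desk ...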
So I tried to “fine tune” the model so it knows what to generate without my providing examples every time. But what should the training data be? The example and output? The full prompt and output? I could not find any examples from others, so I decided to use the example text as the prompt and the output as the completion. The OpenAI tool for checking the input file came up with recommendations:
- Don’t repeat the same prefix (“Convert this text to a programmatic command:”) for each sample.
- Don’t use lots of words with uppercase because GPT has not been trained with upper case text much.
But here I hit my first road bump. I had (incorrectly) assumed the “text-davinci-003” model used in the playground was the same as the “davinci” base model that I could fine-tune. Bzzzt! The reason I could not find training examples from others turned out to be simple: you cannot fine-tune the “text-davinci-003” model.
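For reference, the legacy fine-tuning workflow for a base model such as “davinci” takes a JSONL file of prompt/completion pairs. A minimal sketch of building one shot-to-command pair; the separator conventions are assumptions based on the prepare_data tool's recommendations:

import json

examples = [
    {
        # One training pair: shot description in, command line out.
        "prompt": 'Hank is standing at his desk. He is angry with his arms crossed. '
                  'He is looking out of the window where the sun is shining brightly.\n\n###\n\n',
        "completion": ' CHARACTER Hank LOCATION desk POSE standing FACE angry LOOKAT window SPEECH ""\n',
    },
    # ... more shot/command pairs ...
]

with open("scenes.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

# Then, with the legacy CLI (base models only):
#   openai tools fine_tunes.prepare_data -f scenes.jsonl
#   openai api fine_tunes.create -t scenes_prepared.jsonl -m davinci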
Text Explanation
So, my next attempt was to look at “text explanation” using ChatGPT alone. “Text explanation” includes capabilities such as asking questions about a piece of text, summarizing text (pulling out the highlights), classifying text (including sentiment analysis), and extracting entities. Because I am after structured data extracted from the text, question answering and entity extraction appeared to be the most relevant.
For example, with question answering I could provide the text and then ask a series of questions such as “who is in the shot?”, “where is <X> located?”, “what pose does <X> have?”, and so forth. I did not find this very useful, however, as the responses were free text. Adding extra phrases such as “in at most 2 words” made it more precise, but it was still not ideal – I had to send many questions via separate requests.
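Driven programmatically, that approach is one request per question, which is part of what makes it tedious. A rough sketch (the question list is just an illustration):

import openai

shot_text = (
    'Sam is sitting at his desk in the classroom. He is sad, looking down at his hands. '
    'Hank is standing in front of Sam with his hands on his hips, looking at Sam.'
)

questions = [
    "Who is in the shot?",
    "Where is Sam located?",
    "What pose does Hank have?",
]

for question in questions:
    # One round trip per question, with a nudge to keep the answer short.
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=f'{shot_text}\n\nAnswer in at most 2 words: {question}',
        max_tokens=20,
        temperature=0,
    )
    print(question, "->", response.choices[0].text.strip())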
So I moved on to try entity extraction. My first attempt was:
From the text below describing a shot in a screenplay, extract the following entities in the following format:
People: <comma-separated list of names of people in the scene>
Expressions: <comma-separated list of names of people in the scene (with their expressions appended in parentheses if known)>
Where: <comma-separated list of names of people in the scene (with their location, if known, described in a few words appended in parenthesis)>
Pose: <comma-separated list of names of people in the scene (with the action they are performing, if known, described in a few words appended in parenthesis)>
Speech: <comma-separated list of names of people who say something in the scene (with what they say in parenthesis)>
Camera shot: <a short description of the camera shot>
Text:
"""
Sam is sitting at his desk in the classroom. He is sad, looking down at his hands. Use a frontal closeup shot. Hank is standing in front of Sam with his hands on his hips, looking at Sam. Hank says "Are you an idiot or what?"
"""
This returned the following:
People: Sam, Hank
Expressions: Sam (sad), Hank (looking at Sam)
Where: Sam (at his desk in the classroom), Hank (standing in front of Sam)
Pose: Sam (looking down at his hands), Hank (hands on his hips, looking at Sam)
Speech: Hank ("Are you an idiot or what?")
Camera shot: frontal closeup shot.
Not bad, but it is frequently too wordy for what I want. So I tried giving it a restricted vocabulary of words to extract. (See the descriptions of Expressions and Pose below.)
From the text below describing a shot in a screenplay, extract the following entities in the following format:
People: <comma-separated list of names of people in the scene>
Expressions: <comma-separated list of names of people in the scene (with one of the expressions normal, happy, sad, crying, angry, confused, thoughtful, surprised, or worried appended in parentheses if known)>
Where: <comma-separated list of names of people in the scene (with their location, if known, described in a few words appended in parenthesis)>
Pose: <comma-separated list of names of people in the scene (with one of the poses standing, sitting, walking, running, they are performing, if known appended in parenthesis)>
Speech: <comma-separated list of names of people who say something in the scene (with what they say in parenthesis)>
Camera shot: <a short description of the camera shot>
Text:
"""
Sam is sitting at his desk in the classroom. He is sad, looking down at his hands. Use a frontal closeup shot. Hank is standing in front of Sam with his hands on his hips, looking at Sam. Hank says "Are you an idiot or what?"
"""
The response was better, closer to my needs.
People: Sam, Hank
Expressions: Sam(sad), Hank(angry)
Where: Sam(classroom desk), Hank(in front of Sam)
Pose: Sam(sitting), Hank(standing)
Speech: Hank("Are you an idiot or what?")
Camera shot: frontal closeup shot.
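A nice property of this layout is that it is straightforward to parse. A minimal sketch, assuming the reply keeps the attribute-per-line shape and the Name(value) convention shown above:

import re

response_text = '''People: Sam, Hank
Expressions: Sam(sad), Hank(angry)
Where: Sam(classroom desk), Hank(in front of Sam)
Pose: Sam(sitting), Hank(standing)
Speech: Hank("Are you an idiot or what?")
Camera shot: frontal closeup shot.'''

pair = re.compile(r'([^,(]+)\(([^)]*)\)')  # matches Name(value) entries

shot = {}
for line in response_text.splitlines():
    key, _, value = line.partition(":")
    key, value = key.strip(), value.strip()
    entries = pair.findall(value)
    if entries:
        shot[key] = {name.strip(): val.strip() for name, val in entries}
    else:
        shot[key] = value  # e.g. the People list or the camera description

print(shot["Pose"])  # {'Sam': 'sitting', 'Hank': 'standing'}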
I continued on to the next scene, wondering how much of the previous conversation would automatically flow through. This raised a problem, however: the length of the instructions eats into the memory of previous shots, since ChatGPT has a fixed-size buffer of how much context it feeds into subsequent requests.
Add the text below to the next shot in the screenplay, then extract the following entities in the following format:
People: <comma-separated list of names of people in the scene>
Emotions: <comma-separated list of names of people in the scene (with one of the emotions normal, happy, very happy, sad, crying, angry, confused, thoughtful, surprised, or worried appended in parentheses if known)>
Where: <comma-separated list of names of people in the scene (with their location, if known, described in a few words appended in parenthesis)>
Pose: <comma-separated list of names of people in the scene (with one of the poses standing, sitting, walking, running, they are performing, if known appended in parenthesis)>
Speech: <comma-separated list of names of people who say something in the scene (with what they say in parenthesis)>
Camera shot: <a short description of the camera shot>
Text:
"""
Liana walks in from the left and stands next to Hank. Hank is surprised and looks at Liana. Wide shot of all three.
"""
This returned the following:
People: Sam, Hank, Liana
Expressions: Hank(surprised)
Where: Sam(classroom desk), Hank and Liana(not specified)
Pose: Sam(sitting), Hank and Liana(standing)
Speech: None
Camera shot: Wide shot of all three.
Note the use of “and” sneaking in (“Hank and Liana(standing)”). Well, I can handle that. The camera description however needs to be tightened up.
After a few more trials, however, I noticed it started returning values that were not in the lists I provided. It might return “smiling” when I told it to use “happy”. That may be a better choice, but it is problematic because it means my code later has to deal with unexpected values.
But adding all these specific details to the prompt is annoying. There is a length limit for prompts, so adding more and more detail leaves less space for content. So my next try was to reword the instructions to be more concise.
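One way to cope is to normalize whatever comes back against the allowed vocabulary before it reaches the rest of the pipeline. A small sketch; the synonym table is just a guess at likely substitutions:

ALLOWED_EXPRESSIONS = {
    "normal", "happy", "sad", "crying", "angry",
    "confused", "thoughtful", "surprised", "worried",
}

# Map plausible out-of-vocabulary answers back to canonical terms.
SYNONYMS = {
    "smiling": "happy",
    "grinning": "happy",
    "upset": "sad",
    "shocked": "surprised",
}

def normalize_expression(value: str) -> str:
    value = value.strip().lower()
    if value in ALLOWED_EXPRESSIONS:
        return value
    return SYNONYMS.get(value, "unknown")

print(normalize_expression("Smiling"))  # happy
print(normalize_expression("gleeful"))  # unknown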
The text below is a single shot of a screenplay. For each person in the shot, list the following:
Name: <name of the person>
Facial expression: <one of neutral, smiling, grinning, sad, crying, angry, confused, thoughtful, surprised, shocked, worried, or unknown>
Location: <where the person is in as few words as possible>
Full body actions: <one of standing, walking, jogging, running, sitting>
Speech: <everything the character says>
Text:
"""
Sam is sitting at his desk in the classroom. He is sad, looking down at his hands. Use a frontal closeup shot. Hank is standing in front of Sam with his hands on his hips, looking at Sam. Hank says "Are you an idiot or what?"
"""
I did not include the camera request yet, but the output was pretty good!
Name:
* Sam
* Hank
Facial expression:
* Sam: Sad
* Hank: Unknown
Location:
* Sam: Sitting at his desk in the classroom
* Hank: Standing in front of Sam
Full body actions:
* Sam: Sitting
* Hank: Standing
Speech:
* Hank: "Are you an idiot or what?"
Another example, trying to tighten up the location description.
The text below is a single shot of a screenplay. For each person in the shot, list the following:
Name: <name of the person>
Facial expression: <one of neutral, smiling, grinning, sad, crying, angry, confused, thoughtful, surprised, shocked, worried, or unknown>
Location: <where the person is in three words or less>
Full body actions: <one of standing, walking, jogging, running, sitting>
Speech: <everything the character says>
Text:
"""
Liana smiles at Sam, taking a seat on the desk next to Sam. Sam smiles back. Liana says "Hank can overreact, can't he?". Mid shot of Liana. Sam says "Yes, sometimes!" Liana replies "Are you sure it's only sometimes?" They both laugh. Hank looks angry again.
"""
This returned:
Name: Liana
Facial expression: Smiling
Location: Desk
Full body actions: Sitting
Speech: "Hank can overreact, can't he?"
Name: Sam
Facial expression: Smiling
Location: Desk
Full body actions: Sitting
Speech: "Yes, sometimes!"
Name: Hank
Facial expression: Angry
Location: Unknown
Full body actions: Unknown
Speech: Unknown
Err, great. The location descriptions are much more useful. But instead of listing characters per attribute, it listed attributes per character. Don’t you love consistency? I still had not included camera positioning. So I added some more text to the prompt.
Also for each camera shot, include:
Position: <the position of the camera in 3 words or less>
Shot: <one of wide, mid, closeup>
The response was, you guessed it, structured differently again. (It did however pick up an additional line of dialogue.)
Shot 1:
Name: Liana
Facial expression: Smiling
Location: Desk
Full body actions: Sitting
Speech: "Hank can overreact, can't he?"
Position: Unknown
Shot: Mid
Shot 2:
Name: Sam
Facial expression: Smiling
Location: Desk
Full body actions: Sitting
Speech: "Yes, sometimes!"
Position: Unknown
Shot: Mid
Shot 3:
Name: Liana
Facial expression: Laughing
Location: Desk
Full body actions: Sitting
Speech: "Are you sure it's only sometimes?"
Position: Unknown
Shot: Mid
Shot 4:
Name: Hank
Facial expression: Angry
Location: Unknown
Full body actions: Unknown
Speech: Unknown
Position: Unknown
Shot: Mid
It decided that what I wrote as a single shot should be broken down into a series of shots! The breaks were actually in pretty sensible places, but I want that artistic control, so I plan to tweak the wording to be more precise about what I am after.
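Whichever layout the model settles on, the downstream code has to cope with it. Here is a minimal sketch of a tolerant parser for the per-character block format above, assuming the simple “Key: value” lines survive even when the grouping shifts:

def parse_blocks(text):
    # Split a reply like the ones above into one dict per character or shot.
    # Assumes each block starts with a "Shot N:" or "Name:" line and every
    # other line is a plain "Key: value" pair; anything else is skipped.
    blocks, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Name" or key.startswith("Shot "):
            if current:
                blocks.append(current)
            current = {}
            if key == "Name":
                current["Name"] = value
            continue
        current[key] = value
    if current:
        blocks.append(current)
    return blocks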
Conclusions
The purpose of this blog was to explore how AI can be used to extract directions from text. A few approaches show promise, but it is by no means a solved problem. You might not be interested in this precise problem, but understanding the capabilities of the various AI libraries available is still useful: working out which of the many possible paths to take may ultimately be the most important question to answer. For myself, the GPT models show promise, but I will need to refine my prompts further, and it is annoying to have to repeat so much instructional text per request.
I like the idea of fine-tuning a model to avoid repeating the instructions with every call. I have been told that to fine-tune a model you need to give it a minimum of 200 to 300 samples before it starts following what you tell it – not a small effort for an experiment. And it means training a base model, not the model already pre-trained for commands. So I hope to give this a go, but it might take a while to get all the text created.
Without fine-tuning, carefully wording and restructuring prompts (referred to as “prompt engineering”) can still get you a long way. I need to work on the length and precision of the instructions. It feels wasteful to keep repeating lengthy instructions with every shot description (and you pay based on the number of input tokens).
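At least the repetition can live in one place in code, even if the tokens still get sent (and paid for) on every request. A trivial sketch of a reusable prompt template, using an abbreviated version of the instructions from earlier:

INSTRUCTIONS = '''The text below is a single shot of a screenplay. For each person in the shot, list the following:
Name: <name of the person>
Facial expression: <one of neutral, smiling, grinning, sad, crying, angry, confused, thoughtful, surprised, shocked, worried, or unknown>
Location: <where the person is in three words or less>
Full body actions: <one of standing, walking, jogging, running, sitting>
Speech: <everything the character says>'''

def build_prompt(shot_text):
    # The instructions are written once; only the shot text changes per call.
    return f'{INSTRUCTIONS}\nText:\n"""\n{shot_text}\n"""'

prompt = build_prompt('Liana smiles at Sam, taking a seat on the desk next to Sam.')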
Another challenge is that I want to extract more information than the above. I want to know what characters are looking at, and I want any mentioned leg, arm, and hand positions beyond the basic poses. So the instructions are going to get even longer.
But it is interesting that with a very simple interface (text in, text out) you get access to some pretty powerful capabilities.