My thoughts on LLM Video Generation

I generated this video using Google Veo 3. I realized two things after generating it …

Prompt Engineering is still KING.

As humans, we know how the world works, so we fill in the blanks without noticing.

My prompt for this video was:

A white bread slice lying down on a table. An egg is besides it. Then its hands and legs pop out and it stands up. It picks up the egg. Walks on to the pan on a gas stove beside the table. The bread slice cracks the egg on the pan. Throws the egg shells away. Then jumps on the egg to become french toast.

This is a normal scenario for us, but for a multi-modal LLM, which needs frame-by-frame information, it is not enough to work with.

After generating this video, I used ChatGPT to expand my prompt into a detailed storyboard with frame-by-frame information on what happens in each shot. The video from that prompt was good, but it failed for a different reason. That brings me to my second realization.

Physics is hard for LLMs

LLMs still struggle with real-world physics. They are getting better, but they still have some way to go. See how, in the first video, a set of hands came in to lift the bread slice, and then another set of fingers appeared to crack open the eggs. At the end, the bread slice didn't become French toast.

Here is the second video I generated.

See how the bread slice cracks the egg, two knobs of butter get thrown, and then egg white and yolk somehow drop from the already cracked eggshells. At the end, the bread slice was left standing in the pan with the two knobs of butter. I should have prompted better.

Here is the detailed prompt for your reference.


DETAILED MULTI-MODAL VIDEO PROMPT  
────────────────────────────────────────
1 · GLOBAL SETTINGS  
• Duration ≈ 8 s (192 frames)  
• Aspect Ratio / Resolution 16∶9, HD (1280 × 720 px)  
• Visual Style Bright cartoon-realism; saturated colours; soft morning sunlight (key-light 40 ° camera-left).  
• Camera Single hand-held virtual camera, gentle dolly & pans that keep the bread character centred; f/2.8 depth-of-field.  
• Audio Light kitchen ambience bed + spot SFX listed in the storyboard.

────────────────────────────────────────
2 · CHARACTER & PROP REFERENCES  

| Asset       | Look & Texture                                                        | Size          | Notes |
|-------------|-----------------------------------------------------------------------|---------------|-------|
| Bread slice | White sandwich bread, 1.5 cm thick, browned crust, cute eyes & mouth; flexible toasted-dough arms/legs ending in white cartoon gloves & red sneakers. | 11 × 11 cm | Idle breathing motion. |
| Egg         | Large ivory hen’s egg with faint speckles; glossy.                    | 6 cm tall     | Weighty bounce. |
| Pan         | 25 cm cast-iron skillet, seasoned black interior, subtle sheen.       | —             | Centered on burner; faint steam once heated. |
| Gas stove   | Stainless cook-top, blue flame under pan (already lit).               | —             | — |
| Table       | Warm oak countertop with visible grain.                               | —             | 90 cm from stove. |

────────────────────────────────────────
3 · STORYBOARD & FRAME-BY-FRAME BREAKDOWN (8 s = 192 frames)

| Timecode (s) | Camera / Composition                                   | Action & Animation                                               | Key SFX / FX                         | Validation Checks |
|--------------|--------------------------------------------------------|------------------------------------------------------------------|--------------------------------------|-------------------|
| **0.0 – 0.5** | Static close-up; bread lying flat, egg 10 cm right.    | Subtle breathing in bread.                                       | Low room tone.                       | ✓ Both assets visible |
| **0.5 – 1.0** | Same angle, slight dolly-in.                           | Four limbs pop out with squash-stretch; eyes blink awake.        | 4 quick “pop” whooshes.              | ✓ Limbs emerge, no overlap errors |
| **1.0 – 1.3** | Medium shot; camera tilts up.                          | Bread stands, grabs egg with both hands.                         | Soft thud + cloth rustle.            | ✓ Egg securely held |
| **1.3 – 3.0** | Side tracking shot following bread.                   | Bread walks 90 cm to stove (9 steps at 0.19 s each).           | Sneaker squeaks; faint egg rattle.   | ✓ No foot-slip; egg intact |
| **3.0 – 3.5** | Over-shoulder high angle on pan.                       | Bread lifts egg overhead in anticipation.                        | Rising whoosh.                       | ✓ Pan & flame visible |
| **3.5 – 4.0** | Two-frame insert on egg striking pan rim.              | Egg cracks; whites/yolk pour in; shells split cleanly.           | Sharp crack → sizzling ramp.         | ✓ No shell bits in yolk |
| **4.0 – 4.3** | Return medium; bread flicks shells behind onto table.  | Shells arc back, land out of focus.                              | Light clink.                         | ✓ Disposal path clear |
| **4.3 – 4.8** | Low stove-level angle.                                 | Bread crouches; anticipatory squash.                             | Sizzle continues.                    | ✓ Limb IK stable |
| **4.8 – 5.3** | Slow-motion (0.5 s) mid-air front flip; camera up-tilt | Bread leaps 15 cm, flips 180 °, lands butter-side down on egg.   | “Boing” + louder sizzle on impact.   | ✓ Flip rotation accurate |
| **5.3 – 6.5** | Top-down tight shot.                                   | Egg absorbs; bread surface turns golden-brown; steam rises.      | Crackling pops.                      | ✓ Colour shift to #F6B66A |
| **6.5 – 7.0** | Pull-back hero shot, slight dolly-out.                 | Bread now French-toast, pats of butter for eyes; gives thumbs-up.| Cymbal “ta-da!”                      | ✓ Final pose held 2 frames |
| **7.0 – 8.0** | Freeze-frame hold with gentle vignette.                | Title card optional.                                             | Music sting trails off.              | ✓ All assets on screen |

────────────────────────────────────────
4 · CONTINUITY & TECHNICAL CHECKLIST  

1. **Assets present every frame:** bread/toast, pan, stove flame, table.  
2. **Lighting continuity:** single key + 20 % fill; avoid popping shadows.  
3. **Physics:** gravity 9.8 m s⁻²; egg fluid viscosity; no mesh interpenetration.  
4. **Animation rules:** limbs ≤ 130 % stretch; centre-of-mass forward during walk.  
5. **Audio sync:** SFX start within ±2 frames of visuals.  
6. **Per-frame metadata required:** camera transform, asset transforms & materials, light vectors, active particles, audio triggers.  
7. **Export:** 24 fps PNG sequence (no alpha) + WAV 48 kHz 24-bit stereo; master container ProRes 422 HQ.

────────────────────────────────────────
END OF PROMPT — ready for multi-modal LLM ingestion.
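
A prompt like this carries a lot of numbers, so it is worth sanity-checking them before spending a generation on it. Here is a minimal Python sketch that does this for the storyboard timecodes, assuming the 24 fps stated in the export line. The names `SHOTS`, `to_frame`, `is_contiguous` and `off_grid_boundaries` are my own and are not part of Veo 3 or any other tool, and it only checks the prompt's internal arithmetic, not whether a video model will actually honour it.

```python
FPS = 24  # assumption: taken from the "24 fps" export line in the prompt

# (start_s, end_s) for each storyboard row, copied from the table above
SHOTS = [
    (0.0, 0.5), (0.5, 1.0), (1.0, 1.3), (1.3, 3.0),
    (3.0, 3.5), (3.5, 4.0), (4.0, 4.3), (4.3, 4.8),
    (4.8, 5.3), (5.3, 6.5), (6.5, 7.0), (7.0, 8.0),
]


def to_frame(seconds, fps=FPS):
    """Convert a timecode in seconds to the nearest frame index."""
    return round(seconds * fps)


def is_contiguous(shots):
    """True if every shot starts exactly where the previous one ends."""
    return all(prev[1] == nxt[0] for prev, nxt in zip(shots, shots[1:]))


def off_grid_boundaries(shots, fps=FPS):
    """Shot boundaries that do not fall on a whole frame at this fps."""
    boundaries = sorted({t for shot in shots for t in shot})
    return [t for t in boundaries if abs(t * fps - round(t * fps)) > 1e-6]


total_frames = to_frame(SHOTS[-1][1]) - to_frame(SHOTS[0][0])
off_grid = off_grid_boundaries(SHOTS)

# The walk row asks for 9 steps at 0.19 s each inside the 1.3 - 3.0 s window.
walk_needed = 9 * 0.19
walk_window = 3.0 - 1.3

print("total frames:", total_frames)
print("contiguous:", is_contiguous(SHOTS))
print("boundaries not on a whole frame:", off_grid)
print("walk overrun (s):", round(walk_needed - walk_window, 3))
```

Run against this storyboard, the shots are contiguous and add up to the 192 frames the prompt claims. A few boundaries (1.3 s, 4.3 s, 4.8 s and 5.3 s) do not land on a whole frame at 24 fps, and the nine-step walk overruns its window by about 0.01 s, which is less than one frame. These are small inconsistencies, but they are the kind of detail a human reader glosses over and a model has to resolve somehow.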

What is your experience with LLM video generation? Did you try any other LLMs?

If you are new to my posts, I regularly post about AWS, EKS, Kubernetes, and cloud computing. Do follow me on LinkedIn and visit my dev.to posts. You can find all my previous blog posts on my blog.