
Creating the Holiday 2025 video - Part 1: Tooling

What happens when a software engineer with no video experience decides to make a holiday film before lunch?

If you haven’t yet seen the result, the video is here.

Introduction

Seeing as this is the first post here, I should introduce myself: I’m Ben and I’m the founder of No Bananas, a multidisciplinary dev studio based in Squamish, BC (a beautiful mountain town north of Vancouver). If you need help rolling out AI to your team or into your product - we’re here for you. We’re also writing a living book on how to cope in this new AI era.

What we’re not is a video agency, and I’m not a video creator. So, why on a rainy December morning did I think: “I’ll create a short film using AI”, and why did I think “I’ll try and do it in 4 hours”?

There’s not really a good answer to that. As a career software engineer who has been deeply involved in AI for the last few years, I felt I had a pretty good grasp of the landscape in that realm. But I wanted to explore another one, and video feels like where AI-assisted software engineering was a year ago - impressive, but not yet ubiquitous.

In this series of posts, I will attempt to capture how I went about this project. You can’t build anything without tools, so in part one we’re going to talk about the current tooling that is out there, and what I used to create the film.

TL;DR:

Just want to know what we ended up using? In short: Veo 3.1 for video generation, Google Flow to keep clips consistent across the story, Nano Banana for the base imagery, and ChatGPT (armed with Google’s prompting guide) to write the Veo prompts.

Video Models

I set out with very little knowledge of what the current video models were, and even less about their capabilities. A few quick Google searches revealed a plethora of options, but time was limited. I’d love to say I discovered some little-known model that was perfect for the job, but with so little time I only looked into two of the main players, Sora by OpenAI and Veo by Google. I figured one of those models, or maybe a combination of the two, would be good enough.

I had previously used Sora to generate a background video for a side project we’re working on at No Bananas, so I had some experience with prompting it; however, I had struggled to get good results for what I was trying to achieve. That may well have been down to my lack of prompting skills for these models, which is why I chose to use an LLM to help with that (more on that below).

As for Veo, I’ve tended to avoid Google models such as Gemini [1], so their ecosystem was more of a black box, but I signed up, sent them some money, and ran a few initial tests, essentially comparing outputs for the same prompts across the two models.

My immediate impression was that Sora is targeting shortform creators for social media: its default aspect ratio is portrait, and its app seems aimed at casual users making videos for fun rather than at professionals. Veo (and the tooling around it) seemed much more aimed at a professional market creating longer form videos. As it turned out, it was the tooling around the model, rather than the model itself, that led to me using Veo 3.1 for this project.

Context tooling

I soon discovered that when it comes to creating longer form videos, the tooling that wraps the models is almost as important as the models themselves. So what exactly is this tooling? Firstly, we have to talk about some of the limitations of the current crop of video models:

  • Generation length
  • Context window size

We’ll talk more about context in the next post, where we delve into constructing the “world” for the video, but the main constraint for somebody looking to generate anything longer than a clip of a few seconds is the generation length. The current models max out at around 10-20 seconds in practical terms. This isn’t a real problem in and of itself; after all, describing a two-minute clip in one prompt would come with its own slew of problems. The issue is getting consistency and continuity across these essentially distinct clips.

The way this is done is by passing context about what the previous clip contained, along with information about what happens in the new clip. Both Sora and Veo allow you to pass in images as context, which is currently the best way to communicate information about the film to each subsequent generation. So, assuming you have some initial clips that look and feel how you want them to, you should be able to pass frames from those clips to the model to generate the next clip, and so on.
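
To make that concrete, here’s a rough sketch of doing this chaining by hand: grab the final frame of the previous clip and feed it, along with a new prompt, into whatever generation call you’re using. The frame extraction uses OpenCV; generate_next_clip and the prompt text are hypothetical placeholders for your model API of choice (Flow, as we’ll see, handles this step for you).

    # Sketch of manual clip chaining: extract the last frame of the previous
    # clip and pass it as image context for the next generation.
    # Requires OpenCV (pip install opencv-python); generate_next_clip is a
    # hypothetical stand-in for a Sora/Veo generation call.
    import cv2


    def last_frame(clip_path: str, out_path: str) -> str:
        """Save the final frame of a video clip as an image and return its path."""
        cap = cv2.VideoCapture(clip_path)
        frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        cap.set(cv2.CAP_PROP_POS_FRAMES, frame_count - 1)
        ok, frame = cap.read()
        cap.release()
        if not ok:
            raise RuntimeError(f"Could not read final frame of {clip_path}")
        cv2.imwrite(out_path, frame)
        return out_path


    def generate_next_clip(prompt: str, context_image: str) -> str:
        """Hypothetical placeholder for a generation call that accepts a text
        prompt plus an image for continuity. Returns the new clip's path."""
        raise NotImplementedError


    # Each new generation is seeded with the last frame of the previous clip
    # to keep the look, lighting and characters consistent.
    context = last_frame("clip_01.mp4", "clip_01_last_frame.png")
    next_clip = generate_next_clip(
        "Wide shot: the explorer crests the final dune as the lab comes into view.",
        context,
    )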

There may well be third-party tooling for this, but with so little time I decided to use only what Google/OpenAI provide, which really amounts to much the same thing: a simple text interface with image-upload functionality. I assumed I’d be taking screenshots of clips to pass to the model for the next generation.

That’s when I discovered Flow, a tool designed to minimize the overhead of generating consistent videos across a whole story. It’s a Google Labs product, so it’s new, fairly rudimentary, and a bit rough around the edges, but it was the key enabler for getting the video done in a morning.

Prompting

The next challenge I knew I was going to face was prompting. At this point I’d spent years prompting LLMs to generate code and words, but probably less than an hour prompting video models. I didn’t really have time to learn, so I wondered: could I lean on an LLM to help me write prompts for Veo?

A quick bit of research uncovered this great prompting guide from Google, which I used to tell ChatGPT how to generate prompts for Veo. Overall this approach worked well, especially in the initial stages when I had less context to work with. As my clip library built up, I was able to lean more on previous clips to generate the next ones with minimal prompts. When the scene changed, e.g. from desert to lab, I could again lean on ChatGPT to help me write the more advanced prompts describing the change in location.
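
To give a feel for what that looked like in practice, here’s a toy sketch of the kind of structured prompt the guide encourages: a subject, its context, an action, plus style, camera and ambiance notes. The field names are my paraphrase of the guide and the prompt text is invented; the point is that this fill-in-the-blanks structure is exactly the sort of thing an LLM is good at producing on your behalf.

    # Illustrative only: assembling a Veo-style prompt from the kinds of
    # elements Google's prompting guide describes. Field names and example
    # text are my own, not taken from the guide.
    def build_video_prompt(subject, context, action, style, camera, ambiance):
        return " ".join([subject, context, action, style, camera, ambiance])


    prompt = build_video_prompt(
        subject="A lone explorer in a weathered red parka",
        context="crossing a vast, wind-sculpted desert at golden hour",
        action="pauses at the top of a dune and looks toward a distant glass lab.",
        style="Cinematic, shallow depth of field, warm natural light.",
        camera="Slow dolly-in from behind the explorer.",
        ambiance="Quiet except for the wind; long shadows stretch across the sand.",
    )
    print(prompt)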

Images for Video

Whether you’re using AI to start a new software project or a video project, you’ve got to provide information to the model to convey your intentions. There has been a whole heap of advancement in this area for LLMs and code generation, such as spec-driven development, but these kinds of advancements don’t seem to exist in the video realm yet (at least in my limited discovery phase) [2].

I figured that if I could generate a set of base imagery depicting the character, the locations and the general look and feel of the film, it could provide the seed for the project and, along with the storyline, form the specification for the film. This proved to be a very successful way of developing the initial look and feel, and it also proved valuable later on when the models were struggling to maintain consistency.

I used Nano Banana to generate the base imagery, mainly because it can be used directly from Google Flow. It’s also a fast model.

Here are a few tips for image generation:

  • For characters, generate a seed image followed by multiple different images, including different profiles and full-body shots showing their build and clothing, and keep tweaking them until they are right.
  • For scenes, keep the images basic and stark. You can add buildings etc. via later images, or directly in the video prompts.
  • As you build up your scenes, capture more images (they can be frames from the video) to provide to future generations.
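
All of this was done through Flow’s UI, but for anyone who’d rather script it, here’s a rough sketch of the same workflow against the Gemini API using the google-genai Python SDK: generate a character seed image, save it, then generate a variation that references it. I didn’t actually use the API for this project, and the model id and prompts below are assumptions, so treat it as illustrative rather than a recipe.

    # Hedged sketch of scripting base-image generation with the google-genai
    # SDK (pip install google-genai). The model id below is an assumption;
    # check the current docs for the right "Nano Banana" identifier.
    from google import genai
    from google.genai import types

    client = genai.Client()  # expects GOOGLE_API_KEY / GEMINI_API_KEY in the environment

    IMAGE_MODEL = "gemini-2.5-flash-image"  # assumed id for Nano Banana

    # 1. Generate a seed image of the character.
    seed = client.models.generate_content(
        model=IMAGE_MODEL,
        contents=(
            "Full-body shot of a lone explorer in a weathered red parka, "
            "stark desert background, cinematic lighting."
        ),
    )

    # 2. Save the returned image bytes so they can be reused as context.
    for part in seed.candidates[0].content.parts:
        if part.inline_data is not None:
            with open("explorer_seed.png", "wb") as f:
                f.write(part.inline_data.data)

    # 3. Ask for a variation (e.g. a side profile) conditioned on the seed
    #    image by passing it back in alongside the new instruction.
    with open("explorer_seed.png", "rb") as f:
        seed_bytes = f.read()

    profile = client.models.generate_content(
        model=IMAGE_MODEL,
        contents=[
            types.Part.from_bytes(data=seed_bytes, mime_type="image/png"),
            "Same character, same clothing: side profile, mid-stride, "
            "same stark desert background.",
        ],
    )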

In the next post we’ll delve into creating the actual story and how I used Flow to generate clips.

  1. It’s nothing personal, it’s just a bit painful to set up Gemini when you have a Google Workspace account. I ended up creating a personal Google account to save time.

  2. Who knows, perhaps we’ll build some here at No Bananas.