On-demand compute power for rendering video
You know the saying, the shoemaker's children go barefoot? That’s exactly how the topic of this post came about. I was preparing a graphics visualization for a new webinar, rendering the final product for about the sixth time because of a minor change and effectively turning my notebook into a fancy, unusable paperweight for the next thirty minutes, when it struck me. Here I am, calling myself a Cloud Architect, using my own hardware to do the heavy work and being blocked from working on my product.
You probably guessed what followed this epiphany. Let’s take the repetitive, compute-heavy tasks, shift them to the magic environment in the Cloud and focus only on our own thing.
Choosing a platform
The provider of choice for this task is Google Cloud Platform as we can use multiple GCP features to achieve a completely automated rendering pipeline. Now let’s start with some preparations.
Preparing worker image
We want to run our computation tasks in an environment that already includes everything we need to complete the task. There’s little point in spinning up and manually initializing worker nodes for each computation round.
For that, we will prepare a Compute Engine image. An image can be created from various sources, one of them being the boot disk of a Compute Engine VM instance. First, we will spin up a new VM instance. At this stage, it doesn’t really matter what machine type we choose, as we are just preparing the boot image.
For this purpose, I’ve chosen a Debian base image. You can choose whatever suits your needs. After the instance is up and running, we connect to it via the GCP SSH feature and prepare everything we need to execute our tasks.
In my case, I will be using the Synfig tool to render a 2D vector animation, pulling the source files from a git repository. Those are the tools I installed on the instance.
Now we are ready to create an image from our instance. The recommended best practice when creating an image from a running instance is to either stop the instance entirely or at least minimize disk operations during image creation, as inconsistencies might occur if the boot disk is modified while the image is being created.
After we stop our instance from the VM instances tab, we head into the Compute Engine - Images section and create a new image from there.
As the source, we select Disk; the source disk is then the boot disk of our stopped instance, usually named after our VM instance.
Now that we have our image, we can delete the VM instance.
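The console steps above can also be done from the command line. This is only a minimal sketch, assuming the gcloud CLI is installed and authenticated; the function name and the names in the usage line are hypothetical.

```shell
# Sketch: stop the preparation VM and turn its boot disk into an image.
# Arguments: <instance> <zone> <image-name>. All names are placeholders.
create_worker_image() {
  instance="$1"; zone="$2"; image="$3"
  # Stop first so the boot disk is not modified while the image is taken.
  gcloud compute instances stop "$instance" --zone "$zone" || return 1
  # Assumes the boot disk is named after the instance, which is the usual default.
  gcloud compute images create "$image" \
    --source-disk "$instance" --source-disk-zone "$zone"
}

# Usage (hypothetical names):
# create_worker_image synfig-base europe-west1-c synfig-worker-image
```

Wrapping the two commands in a function keeps the order explicit: the image is only created if the stop succeeded.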
To make things a little more convenient, we will connect our git repository to GCP Source Repositories (GCP’s integrated git hosting). This way we can use the default service account attached to the VM instance to authenticate and clone the source code on the instance (assuming we don’t want to make our repo public). Keep in mind that in order to use the Compute Engine default service account to access the repo, we need to grant it at least the source.repos.get permission, which is included, for example, in the Source Repository Reader role.
In the Source Repositories section, you can add a new repository.
There you can either create a new repository and push all your content there, or connect an external repository (e.g. on GitHub), which will then be mirrored.
Specify the name of your repo and the current GCP project. After creation, you will be shown example commands to initialize the repo or push your existing code.
If you can’t or don’t want to do this, you will additionally have to deal with SSH key configuration or client-side Google SDK initialization in order to access your source code.
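For those who prefer the command line, the repository creation and the read permission described above might look roughly like this. It is a sketch under assumptions: the function name is made up, and the service account email has to be supplied by you (for the Compute Engine default account it typically has the form PROJECT_NUMBER-compute@developer.gserviceaccount.com).

```shell
# Sketch: create a Cloud Source Repository and let a service account read it.
# Arguments: <project-id> <repo-name> <service-account-email>.
setup_source_repo() {
  project="$1"; repo="$2"; sa="$3"
  gcloud source repos create "$repo" --project "$project" || return 1
  # roles/source.reader is the Source Repository Reader role,
  # which includes the source.repos.get permission.
  gcloud projects add-iam-policy-binding "$project" \
    --member "serviceAccount:${sa}" \
    --role "roles/source.reader"
}

# Usage (the service account email is a placeholder):
# setup_source_repo kot-test-207914 synfig-kubernetes SA_EMAIL
```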
Preparing output storage
We will take advantage of the same default service account authentication to store the final product in Google Cloud Storage. For that, we create a new bucket, specifying a globally unique name and a type. As with Source Repositories access, don’t forget to grant at least storage write permission to your Compute Engine default service account if needed.
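The bucket and its write permission can be sketched from the command line as well. This assumes gsutil is installed; the function name is hypothetical and the location and service account email in the usage line are placeholders you would replace.

```shell
# Sketch: create the output bucket and allow a service account to write objects.
# Arguments: <bucket-name> <location> <service-account-email>.
create_output_bucket() {
  bucket="$1"; location="$2"; sa="$3"
  gsutil mb -l "$location" "gs://${bucket}" || return 1
  # objectCreator allows uploading new objects, but not reading or deleting.
  gsutil iam ch "serviceAccount:${sa}:objectCreator" "gs://${bucket}"
}

# Usage (location and email are placeholders):
# create_output_bucket synfig-render europe-west1 SA_EMAIL
```

Granting only object creation keeps the worker's permissions narrow, since it never needs to read the rendered output back.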
Spinning the worker node
With the image, source repository and output storage prepared, we can spin up the worker node, which will serve as the computation unit. In the Compute Engine section, create a new instance and specify the name and machine type (in my case, 2D vector rendering is very CPU-intensive, so my choice was an n1-highcpu instance with 32 cores). Then, under Boot disk, select the image we created earlier. Finally, we will use the Startup script field in the additional configuration to specify what should happen after the instance starts.
The field contains the following script:
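#!/bin/bash
# script for rendering images from source file
# NOTE: the lines marked "assumed" were missing from the published script and are a reconstruction
cat <<EOF > /opt/render.sh
#!/bin/bash
name=kubernetes   # assumed: base name of the .sifz source file
end=240           # assumed placeholder: number of the last frame to render
cores=\$(nproc)   # assumed: one background job per CPU core
step=\$(( (end + cores) / cores ))   # assumed: frames per job, rounded up
intervals=\$(seq 0 \$((\$cores-1)))
for i in \$intervals; do
  b=\$((i * step))       # assumed: first frame of this job
  e=\$((b + step - 1))   # assumed: last frame of this job
  test \$b -gt \$end && break
  test \$e -gt \$end && e=\$end
  synfig -t png --begin-time "\$b"f --end-time "\$e"f -o /opt/synfig-kubernetes-repo/output/kubernetes.png --sequence-separator "-" /opt/synfig-kubernetes-repo/\$name.sif* &
done
wait   # block until every background synfig job has finished
EOF
# clone source repo, delete older version if needed
rm -rf /opt/synfig-kubernetes-repo
gcloud source repos clone synfig-kubernetes /opt/synfig-kubernetes-repo --project=kot-test-207914
# prepare folder for output files
mkdir -p /opt/synfig-kubernetes-repo/output
# run the rendering script
chmod +x /opt/render.sh
/opt/render.sh
# upload rendered images into Google Cloud Storage, mark folder by date and time
gsutil -m cp /opt/synfig-kubernetes-repo/output/* gs://synfig-render/$(date +"%y-%m-%d.%H-%M")/
# power off the instance when not needed
poweroff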
As you can see, the startup script saves the rendering script into a file, then clones the source code and creates an output folder. It then runs the rendering script and, once it finishes, uploads the final product to Google Cloud Storage. The rendering script itself just splits the frames to render equally among all available CPU cores and runs multiple commands in the background (unfortunately, the default multi-threading option of the Synfig CLI is currently not working).
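To make the splitting idea concrete, here is a small standalone sketch of how a frame range can be divided into one chunk per worker. It is my own illustration rather than the exact logic of the rendering script; the function name and the rounding strategy are assumptions.

```shell
# Sketch: print one "begin end" frame range per worker.
# Arguments: <last-frame> <workers>. Frames are numbered from 0.
split_frames() {
  end="$1"; workers="$2"
  # Frames per worker, rounded up so that every frame is covered.
  step=$(( (end + workers) / workers ))
  i=0
  while [ "$i" -lt "$workers" ]; do
    b=$(( i * step ))
    # Stop once all frames are handed out (can happen with many workers).
    [ "$b" -gt "$end" ] && break
    e=$(( b + step - 1 ))
    # The last chunk may be shorter than the others.
    [ "$e" -gt "$end" ] && e="$end"
    echo "$b $e"
    i=$(( i + 1 ))
  done
}

# Ten frames (0 to 9) on four workers gives the ranges 0-2, 3-5, 6-8 and 9-9.
split_frames 9 4
```

Each printed pair would become the --begin-time and --end-time of one background synfig process.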
And that’s it. After the instance is created, it will automatically clone your repo, start rendering and then upload the final product. You can see in GCP storage that each render creates a new folder tagged by date and time.
Those folders contain the final product, in my case a series of rendered images.
When everything is done it will automatically power itself off to save some of those bucks.
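For reference, a worker like the one configured in the console could also be created from the command line. This is a sketch, assuming the gcloud CLI is set up and the startup script is saved in a local file; the function name, image name and file name are hypothetical. The cloud-platform scope is my assumption: VMs created with the default access scopes only get read-only access to Cloud Storage, which would block the upload step.

```shell
# Sketch: create the worker VM from the prepared image with a startup script.
# Arguments: <instance> <zone> <image-name> <startup-script-file>.
create_worker() {
  name="$1"; zone="$2"; image="$3"; script="$4"
  gcloud compute instances create "$name" \
    --zone "$zone" \
    --machine-type n1-highcpu-32 \
    --image "$image" \
    --scopes cloud-platform \
    --metadata-from-file "startup-script=${script}"
}

# Usage (image and file names are hypothetical):
# create_worker synfig-render europe-west1-c synfig-worker-image ./startup.sh
```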
If you check the integrated monitoring of our instance, you will see exactly how long the instance was running. In my case, it was around 8 minutes. At the price of the chosen 32-core highcpu instance, $1.1344 per hour, we are looking at around $0.15 per render plus some minor charges for storage and network traffic.
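The estimate is simply the hourly price prorated by the run time. A tiny sketch of that arithmetic, using awk for the floating-point math (the function name is made up):

```shell
# Sketch: cost of one render = hourly price * minutes / 60.
# Arguments: <price-per-hour> <minutes>.
render_cost() {
  awk -v price="$1" -v minutes="$2" \
    'BEGIN { printf "%.2f\n", price * minutes / 60 }'
}

# 1.1344 * 8 / 60 = 0.1513, printed as 0.15
render_cost 1.1344 8
```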
Now every time you want to render your masterpiece, you can manually restart the instance and the startup script will run again. Well... at least in a world where three additional mouse clicks are acceptable, which obviously is not the world we want to live in. So let’s create the final piece of the puzzle.
Creating an automatic trigger
As we are using Cloud Source Repositories in this example, we can create a Google Cloud Build trigger, which will react to updates in the git repository by automatically starting our worker node again.
First, we need to create a build config in our git repository. In a cloudbuild.yaml file in the root of the repository, we add the following lines:
steps:
- name: "gcr.io/cloud-builders/gcloud"
  args: ["compute", "instances", "start", "synfig-render", "--zone", "europe-west1-c"]
Here we specify the helper builder, gcloud (a Docker image that will run the command given in the args line), and the command we want to run with this helper image: starting our worker instance.
As in the previous stages, we need to grant permission to start a Compute Engine instance to the default service account that Cloud Build runs under (it looks something like ….@cloudbuild.gserviceaccount.com), for example through the Compute Admin role.
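Granting that role could be sketched as follows. The function name is hypothetical and the Cloud Build service account email has to be copied from your own project's IAM page. The role matches the one named above; a narrower role may be enough if you want to limit what builds can do.

```shell
# Sketch: allow the Cloud Build service account to manage Compute Engine.
# Arguments: <project-id> <cloud-build-service-account-email>.
allow_build_to_start_vms() {
  project="$1"; sa="$2"
  gcloud projects add-iam-policy-binding "$project" \
    --member "serviceAccount:${sa}" \
    --role "roles/compute.admin"
}

# Usage (the email is a placeholder):
# allow_build_to_start_vms kot-test-207914 CLOUD_BUILD_SA_EMAIL
```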
Now, in the Cloud Build - Triggers section, create a new trigger and select the source repository (our Cloud Source Repository).
In the trigger configuration, we specify that we want to run the Cloud Build configuration we created above. Additionally, I specified that I want to trigger this action only when the file with the vector graphic (kubernetes.sifz) is edited.
Now, whenever we push a new version of our vector file to the repository, our rendering instance will be started automatically and, within a few minutes, we will have the final product available in GCP storage.
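The same trigger might be created from the command line. Treat this as a sketch: the function name and the branch pattern in the usage line are my assumptions, and the flag names are as I understand them from the gcloud reference, so it is worth checking them with gcloud builds triggers create cloud-source-repositories --help before relying on it.

```shell
# Sketch: trigger the build only when a given file changes on a branch.
# Arguments: <repo-name> <branch-regex> <file-glob>.
create_render_trigger() {
  repo="$1"; branch="$2"; file="$3"
  gcloud builds triggers create cloud-source-repositories \
    --repo "$repo" \
    --branch-pattern "$branch" \
    --build-config cloudbuild.yaml \
    --included-files "$file"
}

# Usage (the branch pattern is an assumption):
# create_render_trigger synfig-kubernetes '^master$' kubernetes.sifz
```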
As you probably said to yourself somewhere along the way, this was quite a “quick and dirty” solution to a problem that came up. But the point is more to show the core idea of delegating the computation work to a non-blocking environment. There is a lot of room for improvement, and the final solution will depend heavily on, among other things, the provider, the rendering tool and the source/output file format. Some of the things that come to mind right away:
Some of the things throughout this journey were kinda just put together as I went. The boot image, for example, was prepared in an imperative fashion. The steps resulting in our boot image are not documented or versioned anywhere. The same goes for the startup script. In DevOps heaven, everything would be created from some kind of as-code repository.
What we had here was a CPU-heavy task. If we needed some serious GPU power, we would have to attach GPUs to our worker node.
Source files for render from GCP storage
In the case of vector graphics, the source is very storage-friendly. That’s why I could afford to use a git repository. For other source types, we could, for example, upload them to GCP Storage first and then transfer them to the worker node.
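On the worker, pulling sources from a bucket instead of git could look something like this sketch. The function name and the paths in the usage line are hypothetical, and it assumes the worker's service account can read the bucket.

```shell
# Sketch: copy render sources from a bucket path into a local folder.
# Arguments: <gs://bucket/path> <local-directory>.
fetch_sources() {
  bucket_path="$1"; dest="$2"
  mkdir -p "$dest" || return 1
  # rsync only transfers files that are missing or changed locally.
  gsutil -m rsync -r "$bucket_path" "$dest"
}

# Usage (hypothetical paths):
# fetch_sources gs://my-render-sources/project /opt/render-sources
```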
Better worker node control
Here we have spun up the worker node and depend only on the startup script and worker node restarts. There are a number of things that can go wrong with this approach. For example, the script can fail midway, in which case the instance will not power itself off. Furthermore, we can’t quickly change how many cores we want to use. A solution could be some external control of the worker nodes, such as a Google Cloud Function, which could create new (multiple) instances of variable size as well as monitor that they finished successfully.
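Short of external control, the "fails midway and never powers off" problem can at least be softened inside the startup script. The following is a sketch of one way to do it, not what the script above does; the function name is made up.

```shell
# Sketch: run a job and always call the shutdown command afterwards,
# even when the job fails.
# Arguments: <shutdown-command> <job> [job-args...].
run_then_shutdown() {
  shutdown_cmd="$1"; shift
  "$@"
  status=$?
  # Leave a trace of the failure before the machine goes away.
  [ "$status" -ne 0 ] && echo "job failed with status $status" >&2
  "$shutdown_cmd"
  return "$status"
}

# Usage inside the startup script:
# run_then_shutdown poweroff /opt/render.sh
```

This keeps the instance from running (and billing) indefinitely after a failed render, though it still does not tell anyone outside the VM that something went wrong.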