Part of my workflow for each episode of STEAM Powered is to create one-minute-long video clips for promotion on social-media.
Due to time constraints, I only did this on Facebook Creator Studio because it offered the caption auto-generation feature. The UI for this is buggy and fills my rage bar, but now that I’ve started doing captions, I don’t want to stop.
So I am now using the Google Cloud Speech-to-Text API to generate the subtitles. The API offers 60 minutes of audio processing per month for free before billing kicks in, which is well within my requirements at this time. Having a separate caption file also allows me to add captions to the videos on Twitter and Instagram.
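The core of this workflow is asking the API for per-word timestamps and folding them into SRT cues. Here is a minimal sketch of the SRT-building side, with the API call shown as a hedged, commented-out example (the bucket URI, language code, and cue length are my illustrative assumptions, not fixed requirements):

```python
# Sketch: turn word-level timestamps from Google Cloud Speech-to-Text
# into SRT text. The API call (commented at the bottom) needs the
# google-cloud-speech package and credentials; the formatting logic
# below is pure Python.

def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, e.g. 75.5 -> '00:01:15,500'."""
    millis = int(round(seconds * 1000))
    hours, rem = divmod(millis, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

def words_to_srt(words, max_words=7):
    """Group (word, start_sec, end_sec) tuples into numbered SRT cues.

    max_words per cue is an arbitrary readability choice.
    """
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        start, end = chunk[0][1], chunk[-1][2]
        text = " ".join(w for w, _, _ in chunk)
        cues.append(
            f"{len(cues) + 1}\n"
            f"{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
            f"{text}\n"
        )
    return "\n".join(cues)

# With the real API (assumptions: a mono WAV uploaded to a GCS bucket,
# English speech; check the client library docs for your audio encoding):
#
# from google.cloud import speech
# client = speech.SpeechClient()
# response = client.recognize(
#     config=speech.RecognitionConfig(
#         language_code="en-AU",
#         enable_word_time_offsets=True,  # required for per-word timestamps
#     ),
#     audio=speech.RecognitionAudio(uri="gs://my-bucket/clip.wav"),
# )
# words = [
#     (w.word, w.start_time.total_seconds(), w.end_time.total_seconds())
#     for result in response.results
#     for w in result.alternatives[0].words
# ]
# with open("clip.srt", "w") as f:
#     f.write(words_to_srt(words))
```

The grouping is deliberately naive (a fixed word count per cue); a fancier version might break cues on pauses between words instead.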
While I would love to offer captions and transcripts on full episodes in a variety of languages for accessibility, this isn’t presently possible with my availability or budget. I would, however, be incredibly appreciative if a kind patron or sponsor would like to donate towards a subscription to AssemblyAI so that this podcast and YouTube channel can reach more prospective and current #WomenInSTEM and other STEAM enthusiasts. 🙏
SRT from YouTube videos
If your videos are up on YouTube and have closed captions enabled, you can find an online or downloadable tool that will fetch the SRTs. You’ll still need to edit them for typos if you want accuracy. YMMV.
I won’t recommend the one I used because it had questionable popups, but there are several free options available.
To be Investigated
Some cool things I’ll look into later when time and resources allow.
Using Google Cloud Services Speech to Text to generate SRT files from short videos — December 04, 2020
- One minute long because that’s the maximum length of video you can upload to Instagram.
- While writing the previous annotation, I found a suggestion to split videos into 60-second segments and share them to Instagram as a slide-show. This is actually quite helpful to know, and creating the extra segmented videos just for Instagram isn’t that much extra work, but it’s still annoying to have to do.
- As it is I have to create separate assets of different dimensions because each platform has different preferences.
- I record in my dining room, and my guests record via video call wherever they are, using headsets and mics (some onboard), which can make for unpredictable audio quality. Add to that the use of technical and scientific terminology and we have a melting pot of factors that can reduce transcription accuracy. Which means I have to manually correct the transcripts. This is time-consuming.