How to Add Captions to Your Video Clips

The ScriptCut Team / June 9, 2026 / 8 min read

The reliable way to caption a video clip is to start from an accurate transcript with timing, sync it to the audio, style it for muted viewing, and export either burned-in captions or an SRT file. Never retype it by hand. Most people watch short clips with the sound off, so captions are not an accessibility afterthought; they are the script your audience actually reads.

Get them right and the clip works on autoplay. Get them wrong, with typos, bad timing, or text that runs outside the safe zone, and people scroll past.

Why captions decide whether a clip lands

A huge share of social video is watched muted, especially in feeds. If your hook lives in the audio and there are no captions, the hook does not exist for most viewers. Captions also keep people watching longer, which platforms reward. And they are the baseline for accessibility. There is no real argument against captioning a clip; the only question is how to do it without it eating your afternoon.

Step 1: Get an accurate transcript with timing

Auto-captions are a starting point, not a finish line. They mangle names, brand terms, and anything said quickly. Start from a real transcript with word-level timing so each word is tied to a moment in the audio. That timing is what makes captions snap to the voice instead of drifting. If you are starting from raw footage, how to transcribe an interview covers getting a clean, timed transcript.

Step 2: Clean the text

Fix the obvious errors, especially proper nouns and any term your audience would notice. Then decide how literal to be. For most clips you do not caption every 'um' and false start; you caption what the person meant. Removing the filler reads cleaner on screen, and the trick is the same one editors use to tighten audio; see how to remove filler words.
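
As a rough illustration of the cleanup step, here is a minimal Python sketch that drops fillers from a timed transcript. The Word class, the FILLERS set, and the strip_fillers function are all hypothetical names invented for this example, not part of any particular tool, and the filler list is an assumption you would tune per speaker.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the clip
    end: float    # seconds from the start of the clip


# Hypothetical filler list; adjust it to the speaker and the audience.
FILLERS = {"um", "uh", "er", "erm"}


def strip_fillers(words, fillers=FILLERS):
    """Drop filler words and leave every kept word's timing untouched."""
    kept = []
    for w in words:
        # Compare case-insensitively and ignore trailing punctuation.
        token = w.text.lower().strip(".,!?;:")
        if token not in fillers:
            kept.append(w)
    return kept


words = [
    Word("Um,", 0.00, 0.30),
    Word("captions", 0.35, 0.80),
    Word("uh", 0.85, 1.00),
    Word("matter.", 1.05, 1.50),
]
cleaned = strip_fillers(words)
print(" ".join(w.text for w in cleaned))  # captions matter.
```

Because the kept words carry their original timestamps, the captions stay in sync after the cut. Capitalization and multi-word fillers like 'you know' still need a manual pass.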

Step 3: Sync to the audio

Captions should appear as the words are spoken and clear shortly after. Word-level timing handles this automatically; if you are working from a flat transcript, you will be nudging timing by hand, which is the tedious part. Aim for one or two short lines on screen at a time, not a paragraph.
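
To make the sync step concrete, here is one possible way to turn word-level timing into short caption cues in Python. It is a sketch under assumptions: words arrive as (text, start, end) tuples, and the max_chars and max_gap thresholds are illustrative values, not a standard.

```python
def _close(group):
    """Collapse a run of timed words into one (text, start, end) cue."""
    text = " ".join(t for t, _, _ in group)
    return (text, group[0][1], group[-1][2])


def group_into_cues(words, max_chars=32, max_gap=0.6):
    """Group (text, start, end) word tuples into short caption cues.

    A new cue starts when adding the next word would exceed max_chars,
    or when the pause before that word is longer than max_gap seconds.
    """
    cues = []
    current = []
    for text, start, end in words:
        if current:
            line = " ".join(t for t, _, _ in current)
            too_long = len(line) + 1 + len(text) > max_chars
            long_pause = start - current[-1][2] > max_gap
            if too_long or long_pause:
                cues.append(_close(current))
                current = []
        current.append((text, start, end))
    if current:
        cues.append(_close(current))
    return cues


words = [
    ("Captions", 0.0, 0.4), ("are", 0.45, 0.6), ("the", 0.62, 0.7),
    ("script", 0.72, 1.1), ("your", 1.15, 1.3), ("audience", 1.32, 1.8),
    ("reads.", 1.85, 2.3), ("Get", 3.2, 3.4), ("them", 3.42, 3.6),
    ("right.", 3.62, 4.0),
]
cues = group_into_cues(words)
for text, start, end in cues:
    print(f"{start:.2f}-{end:.2f}  {text}")
```

Each cue starts at its first word and clears at the end of its last word, so the text tracks the voice. This version breaks purely on length and pauses; a fuller one would probably prefer to break at punctuation.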

Step 4: Style for muted autoplay

This is where good captions are made or lost.

  • Keep lines short. One or two lines, a few words each. Long lines force the eye to work and get cut off on mobile.
  • Big, high-contrast text. A bold sans-serif with a stroke or background box so it reads over any footage.
  • Stay in the safe zone. Keep captions clear of the bottom UI on TikTok, Reels, and Shorts, and away from the very edges.
  • Match the platform. Vertical clips for YouTube Shorts, Reels, and TikTok are 9:16, so your caption layout has to live in a tall, narrow frame.
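
The 'keep lines short' rule can be sketched with Python's standard textwrap module. The layout_caption function and its limits of 18 characters per line and two lines per screen are assumptions for illustration; the right numbers depend on your font size and frame.

```python
import textwrap


def layout_caption(text, max_chars_per_line=18, max_lines=2):
    """Wrap caption text into short lines, then split the lines into screens.

    Returns a list of screens; each screen is a list of at most max_lines lines.
    """
    lines = textwrap.wrap(text, width=max_chars_per_line, break_long_words=False)
    return [lines[i:i + max_lines] for i in range(0, len(lines), max_lines)]


screens = layout_caption("Captions are the script your audience actually reads")
for screen in screens:
    print(" / ".join(screen))
```

This only handles the text layout. If a caption spills onto a second screen, the timing for that screen still has to come from the word-level timestamps.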

Per Google's own help docs, Shorts are vertical and up to three minutes; design your captions for that tall frame from the start rather than reformatting later.
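
One way to think about the safe zone is as a box inside the frame. The Python sketch below computes that box for a 1080 by 1920 vertical frame. The margin fractions are placeholder assumptions, not values published by any platform, so check them against a real upload before relying on them.

```python
def safe_caption_box(width=1080, height=1920, top=0.12, bottom=0.22, side=0.08):
    """Return (left, top, right, bottom) pixel bounds of a caption-safe area.

    The margins are fractions of the frame and are illustrative assumptions.
    """
    left = round(width * side)
    right = width - left
    top_px = round(height * top)
    bottom_px = height - round(height * bottom)
    return left, top_px, right, bottom_px


def fits(box, safe):
    """True if a caption box (left, top, right, bottom) lies inside the safe area."""
    return (box[0] >= safe[0] and box[1] >= safe[1]
            and box[2] <= safe[2] and box[3] <= safe[3])


safe = safe_caption_box()
print(safe)                                  # (86, 230, 994, 1498)
print(fits((120, 1200, 960, 1400), safe))    # True
print(fits((120, 1600, 960, 1800), safe))    # False
```

The bottom margin is deliberately the largest one here, on the assumption that most of the platform UI sits low in the frame.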

Step 5: Burned-in or SRT?

Two ways to deliver, and they are not interchangeable.

  • Burned-in (open captions): the text is baked into the video pixels. Always visible, full styling control, and the safe choice for social where you cannot rely on the platform's caption toggle. The downside is they are permanent, so proof them carefully.
  • SRT (closed captions): a separate file the viewer can toggle. Best for YouTube long-form, where the platform displays them and uses them for search. Easy to edit later. Less reliable in fast feeds where the toggle is buried.

For short social clips, burn them in. For long-form YouTube, an SRT is usually enough, and you can do both.
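
For reference, an SRT file is plain text: a running cue number, a start and end timestamp in HH:MM:SS,mmm form separated by -->, the caption text, and a blank line. Here is a minimal Python sketch of writing one, assuming cues are (text, start, end) tuples with times in seconds; the function names are made up for this example.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as the HH:MM:SS,mmm timestamp used in SRT."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(cues):
    """Render (text, start, end) cues as the contents of an SRT file."""
    blocks = []
    for index, (text, start, end) in enumerate(cues, start=1):
        timing = f"{srt_timestamp(start)} --> {srt_timestamp(end)}"
        blocks.append(f"{index}\n{timing}\n{text}\n")
    # Joining with a newline leaves one blank line between cues.
    return "\n".join(blocks)


cues = [
    ("Captions are the script", 0.0, 1.3),
    ("your audience reads.", 1.32, 2.3),
]
srt_text = to_srt(cues)
print(srt_text)
```

The same file can serve both deliveries: upload it as closed captions for long-form, or hand it to a video tool that renders subtitles into the picture for the burned-in version.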

A worked example

You have a 45-second clip pulled from a podcast. You start with the timed transcript you already have, fix two misspelled names, and trim the 'you knows.' You set captions to two short lines in a bold white font with a dark stroke and position them in the upper-middle so the TikTok UI does not cover them. Then you export burned-in captions for the vertical platforms plus an SRT for the YouTube version. Ten minutes, not an hour, because you never retyped a word.

Common mistakes

  • Shipping raw auto-captions. One wrong brand name in a caption reads as careless.
  • Walls of text. Two short lines max. Long captions cover the footage and lose the eye.
  • Ignoring the safe zone. Captions hidden behind the platform UI are worse than no captions.
  • One layout for every platform. A 9:16 caption layout is not the same as a 16:9 one.
  • Captioning every filler. Caption the meaning, not every stumble.

How ScriptCut fits

If your clip came out of long content, you already did the hard part in the pre-edit: ScriptCut transcribes with word-level timecode, lets you remove fillers and trim to the moment, and exports subtitles alongside your timeline so the captions are already accurate and synced. You are styling, not retyping. Start at ScriptCut. To make the clips themselves, see how to make YouTube Shorts from a long video and repurposing a podcast into shorts.

Frequently asked questions

Should captions be burned in or a separate SRT file?

For short social clips, burn them in so they always show on muted autoplay with full styling control. For long-form YouTube, an SRT the viewer can toggle is usually enough and helps search. You can ship both.

Why not just use a platform's auto-captions?

Auto-captions are a draft. They mishear names, brand terms, and fast speech. Start from an accurate timed transcript, fix the errors, and you avoid the typo that makes a clip look careless.

How many lines of caption should be on screen?

One or two short lines, a few words each, kept inside the safe zone away from the platform UI. Walls of text cover the footage and lose the viewer's eye.

Do captions actually improve performance?

Yes. Most feed video is watched muted, so captions carry the hook, and they tend to keep viewers watching longer, which platforms reward. They are also the baseline for accessibility.