Advanced Fine-Tuning Guide

This guide provides a comprehensive reference for optimizing transcription quality in AI Subtitle Creator. Each setting influences accuracy, speed, or robustness in different ways. Understanding how they interact will help you achieve the best possible subtitles for your content.


Voice Activity Detection (VAD)

Purpose
Voice Activity Detection analyzes the audio before transcription and removes segments that do not contain speech, such as silence, music, or sound effects.

Default setting
Enabled. This is the recommended option for most use cases.

How it works
Before Whisper begins transcription, the audio is scanned to distinguish speech from non-speech. Only the detected speech segments are sent to the transcription engine, reducing unnecessary processing and minimizing hallucinated text.
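
As a rough illustration of this pre-filtering step, the sketch below merges per-frame speech decisions into the time ranges that would be passed on for transcription. It is not the app's implementation: the frame length, the is_speech flags, and the function name are all assumptions made for the example.

```python
def speech_ranges(is_speech, frame_seconds):
    """Merge consecutive speech frames into (start, end) ranges in seconds."""
    ranges = []
    start = None
    for index, flag in enumerate(is_speech):
        if flag and start is None:
            start = index  # a run of speech begins here
        elif not flag and start is not None:
            ranges.append((start * frame_seconds, index * frame_seconds))
            start = None
    if start is not None:  # the audio ended while speech was still active
        ranges.append((start * frame_seconds, len(is_speech) * frame_seconds))
    return ranges


# Hypothetical 0.5-second frames: silence, speech, speech, music, music, speech.
flags = [False, True, True, False, False, True]
kept = speech_ranges(flags, frame_seconds=0.5)
```

Only the ranges in kept would reach the transcription engine; everything else is skipped before Whisper sees it.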

Recommended use cases
VAD is strongly recommended for movies and television shows, videos with long silent sections (such as credits or establishing shots), action scenes with loud effects or music, and any content where Whisper tends to output placeholders like “[Music]” or repeated phrases.

When to disable it
VAD may be unnecessary for content with continuous speech, such as audiobooks or uninterrupted podcasts. It should also be disabled if you need to preserve the exact timing of pauses, or when working with very short clips where the added analysis overhead provides little benefit.

Impact on performance and quality
VAD adds a small preprocessing cost, typically around 10–20% additional time. However, because less audio is ultimately sent to Whisper, the overall transcription process is often faster. Most importantly, it significantly reduces hallucinations and repeated text.

Technical note
This feature relies on Apple’s Sound Analysis framework to distinguish speech from non-speech audio.


Use Previous Context

Purpose
This setting determines whether Whisper should use previously transcribed text as contextual input for subsequent audio segments.

Default setting
Disabled. This is the recommended configuration for movies and TV shows.

How it works
When enabled, Whisper conditions each new segment on the text produced earlier, improving continuity and coherence. When disabled, each segment is transcribed independently, preventing earlier mistakes from influencing later output.
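
The difference between the two modes can be sketched as a loop that either carries earlier text forward or starts each segment from an empty context. The transcribe_segment callable and the recorder below are hypothetical stand-ins for a real model call, used only to show what context each segment would receive.

```python
def transcribe_all(segments, transcribe_segment, use_previous_context):
    """Transcribe segments in order, optionally passing earlier text as context."""
    results = []
    context = ""
    for audio in segments:
        text = transcribe_segment(audio, context)
        results.append(text)
        if use_previous_context:
            # Earlier output (including any mistakes) feeds into later segments.
            context = (context + " " + text).strip()
    return results


def make_recorder():
    """Build a fake transcriber that records the context it was given."""
    seen = []

    def transcribe(audio, context):
        seen.append(context)
        return audio.upper()  # placeholder for real transcription

    return transcribe, seen


enabled_fn, enabled_seen = make_recorder()
transcribe_all(["hello", "world"], enabled_fn, use_previous_context=True)

disabled_fn, disabled_seen = make_recorder()
transcribe_all(["hello", "world"], disabled_fn, use_previous_context=False)
```

With the setting enabled, the second segment sees the first segment's text. With it disabled, every segment starts from an empty context, so an early error cannot spread.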

Why this matters
Context can be beneficial for long, uninterrupted speech with a consistent topic, such as interviews or lectures. However, in movies and other noisy or fast-changing content, it can cause error cascades. A single misheard line may be repeated or reinforced across multiple segments.

Recommended usage
Disable previous context for movies, television shows, action scenes, and content with frequent scene or topic changes. Enable it for interviews, audiobooks, podcasts, conference talks, and other forms of continuous, structured speech.

Impact
This setting does not affect processing speed. It can improve coherence in clean audio, but it increases the risk of hallucinations in noisy or complex scenes.


Temperature

Purpose
Temperature controls how conservative or flexible Whisper is when selecting words during transcription.

Default setting
0.0, which is the most accurate and deterministic option.

How it works
At a temperature of 0.0, Whisper always chooses the most likely transcription, so the same audio produces the same result every time. Higher values introduce randomness into word selection, allowing for more varied interpretations at the cost of reliability.
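
A minimal sketch of temperature-controlled selection, assuming a simplified decoder that scores a handful of candidate tokens. The logits values and the pick_token helper are illustrative only and do not reflect Whisper's internals.

```python
import math
import random


def pick_token(logits, temperature, rng=random):
    """Return a token index: argmax at temperature 0, otherwise a weighted random draw."""
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
deterministic = pick_token(logits, temperature=0.0)
sampled = pick_token(logits, temperature=1.0, rng=random.Random(0))
```

At 0.0 the highest-scoring candidate is always returned. At higher temperatures, lower-scoring candidates are picked some of the time, which is where unreliable output comes from.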

Recommended values
For subtitles, a temperature of 0.0 is recommended in almost all cases. Slight increases (to 0.1 or 0.2) may help with extremely challenging audio, but anything above that significantly increases the risk of hallucinations and should be avoided.

Impact
Temperature has no effect on processing speed. Higher values consistently reduce transcription reliability.


Filter Non-Speech Audio

Purpose
This setting filters out segments that Whisper identifies as non-speech, such as music or background noise.

Default setting
0.6, which provides a balanced level of filtering for most content.

How it works
Whisper assigns a speech probability to each segment. Segments with probabilities below the configured threshold are treated as non-speech and excluded from the final output.
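
The filtering described above can be sketched as a simple threshold check. The speech_prob field name and the sample probabilities are hypothetical; the sketch follows this guide's description rather than any particular Whisper implementation.

```python
def filter_non_speech(segments, threshold):
    """Keep only segments whose speech probability meets the threshold."""
    return [seg for seg in segments if seg["speech_prob"] >= threshold]


# Made-up segments for illustration.
segments = [
    {"text": "Get down!", "speech_prob": 0.92},
    {"text": "[Music]", "speech_prob": 0.15},
    {"text": "Stay close.", "speech_prob": 0.55},  # whispered line
]

default_kept = filter_non_speech(segments, threshold=0.6)
permissive_kept = filter_non_speech(segments, threshold=0.5)
```

In this example the default of 0.6 drops the whispered line along with the music, while 0.5 keeps the whispered line and still removes the music.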

How to tune it
Lower values are more permissive and retain more content, including noise and music. Higher values are more aggressive and remove more non-speech audio, but may also filter out quiet or unclear dialogue.

Recommended adjustments
Lower the threshold if real dialogue is being lost, especially with whispered or soft speech. Increase it if you see excessive “[Music]” tags or noise being transcribed, particularly in action movies or concert footage.

Impact
This setting affects quality but not performance, as filtering occurs after transcription.

Best practice
Use this setting together with VAD. VAD removes silence before transcription, while this filter cleans up remaining non-speech afterward.


Confidence Threshold

Purpose
This option removes words that Whisper considers low-confidence based on probability scores.

Default setting
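-1.0, which is a permissive and balanced choice. The value is a log-probability: 0.0 corresponds to complete certainty, and more negative values correspond to lower confidence.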

How it works
Each word generated by Whisper has an associated confidence score. Words below the threshold are discarded, reducing random guesses and gibberish.
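
A minimal sketch of this filter, assuming each word carries a natural-log probability (as the "logprob threshold" name used in the presets suggests). The word list and scores are invented for the example.

```python
import math


def filter_low_confidence(words, threshold):
    """Keep words whose log-probability is at or above the threshold."""
    return [word for word, logprob in words if logprob >= threshold]


# Hypothetical (word, log-probability) pairs.
words = [("the", -0.05), ("ship", -0.40), ("zorp", -2.30), ("sails", -0.90)]

kept_default = filter_low_confidence(words, threshold=-1.0)
kept_strict = filter_low_confidence(words, threshold=-0.5)

# A log-probability of -1.0 corresponds to a probability of about 37%.
default_probability = math.exp(-1.0)
```

The default removes only the clearly doubtful word. The stricter setting also removes "sails", showing how a higher threshold trades missing words for fewer wrong ones.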

Recommended usage
Keep the default value for clean, professionally recorded audio. Increase the threshold (toward 0.0) if you prefer missing words over incorrect ones, especially with poor or noisy recordings. Lower (more negative) values retain more content but may include more errors.

Impact
Raising the threshold reduces incorrect words but can leave gaps where uncertain words were removed. There is no impact on performance.


Filter Repetitive Text (Compression Ratio Threshold)

Purpose
This setting detects and removes repetitive or stuttering text patterns that commonly result from hallucinations.

Default setting
2.4, which works well for most content.

How it works
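Highly repetitive text has low entropy and compresses efficiently, so the ratio of its original size to its compressed size is unusually high. Whisper uses this compression ratio as a signal: output whose ratio exceeds the threshold is treated as a likely hallucination and discarded.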
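
To make the idea concrete, the sketch below measures a compression ratio with Python's zlib module. Whether the app computes the ratio in exactly this way is an assumption; the two sample strings are invented.

```python
import zlib


def compression_ratio(text):
    """Ratio of the raw UTF-8 size to the zlib-compressed size."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


normal = "I told you we should have taken the other road before the storm hit."
looped = "Thank you. " * 30  # typical looping hallucination

THRESHOLD = 2.4  # the default from this guide
normal_flagged = compression_ratio(normal) > THRESHOLD
looped_flagged = compression_ratio(looped) > THRESHOLD
```

The ordinary sentence stays well under the threshold, while the looping phrase compresses so well that it is flagged. Song lyrics with a repeated chorus can be flagged for the same reason, which is why raising the value helps there.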

Recommended adjustments
Lower the value if you encounter repeated phrases or stuttering output, especially with larger models known to repeat text. Increase it if legitimate repetition is being removed, such as in dialogue, poetry, or song lyrics.

Impact
More aggressive filtering improves quality by removing repetition, with no effect on processing speed.


Beam Size

Purpose
Beam size controls how many alternative transcription paths Whisper evaluates simultaneously.

Default setting
5, which offers a good balance between accuracy and speed.

How it works
A larger beam explores more possible transcriptions, increasing the chance of finding the best one. This improves accuracy but significantly increases processing time.
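
The toy example below shows why a wider beam can find a better transcription. It is a generic beam search over a hand-made probability table, not Whisper's decoder; TOY_MODEL and its values are hypothetical.

```python
import math

# Next-token probabilities keyed by the tokens chosen so far (made-up data).
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.5, "y": 0.5},
    ("B",): {"x": 0.9, "y": 0.1},
}


def beam_search(model, steps, beam_size):
    """Return the best (tokens, log-probability) pair found with the given beam size."""
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            for token, prob in model[tokens].items():
                candidates.append((tokens + (token,), score + math.log(prob)))
        # Keep only the highest-scoring partial transcriptions.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_size]
    return beams[0]


greedy_tokens, greedy_score = beam_search(TOY_MODEL, steps=2, beam_size=1)
wide_tokens, wide_score = beam_search(TOY_MODEL, steps=2, beam_size=2)
```

With a beam of 1, the search commits to "A" because it looks best at the first step and ends with a total probability of 0.30. With a beam of 2, "B" stays in play and leads to the better overall path (probability 0.36). Every extra beam multiplies the number of candidates scored at each step, which is where the added processing time comes from.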

Recommended usage
Reduce beam size for fast processing and clear audio. Increase it for difficult accents, noisy recordings, or professional work where maximum accuracy is required.

Impact
Beam size has a major impact on performance. Processing time roughly doubles for every five beams added (for example, when going from 5 to 10). It is the single most expensive setting in terms of speed.


Time Offset

Purpose
Time Offset shifts all subtitle timestamps forward or backward to correct synchronization issues.

Default setting
0.0 seconds.

When to use it
If subtitles consistently appear too early, apply a small positive offset. If they appear too late, use a negative offset. Adjust in small increments (typically 0.1 seconds) and regenerate subtitles until timing feels correct.
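
The effect of the offset can be sketched as a shift applied to every cue. The tuple layout and the clamping at zero are assumptions for the example; the app may handle cues near the start of the file differently.

```python
def shift_cues(cues, offset_seconds):
    """Shift every (start, end, text) cue by the offset, never going below 0."""
    shifted = []
    for start, end, text in cues:
        new_start = max(0.0, start + offset_seconds)
        new_end = max(new_start, end + offset_seconds)
        shifted.append((new_start, new_end, text))
    return shifted


cues = [(1.0, 2.5, "Hello."), (3.0, 4.0, "Over here!")]

later = shift_cues(cues, 0.5)     # subtitles were appearing too early
earlier = shift_cues(cues, -1.5)  # subtitles were appearing too late
```

A positive offset delays every cue by the same amount, and a negative offset moves them earlier. The spacing between cues is unchanged except where a cue would otherwise start before 0.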

Impact
This setting affects timing only and has no impact on transcription quality or performance.


Recommended Presets

For convenience, the following presets summarize effective combinations for common scenarios:

  • Movies and TV Shows (default): VAD enabled, previous context disabled, temperature 0.0, no-speech threshold (Filter Non-Speech Audio) 0.6, logprob threshold (Confidence Threshold) -1.0, compression ratio (Filter Repetitive Text) 2.4, beam size 5.
  • Interviews and podcasts: VAD disabled, previous context enabled, no-speech threshold slightly reduced.
  • Action or noisy movies: Higher no-speech and logprob thresholds, slightly lower compression ratio.
  • Poor-quality audio: Higher beam size and a stricter (higher) logprob threshold.
  • Fast test runs: Beam size reduced to 1.
  • Maximum accuracy: Beam size increased to 8–10 with stricter filtering.
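
For readers who keep their own notes or scripts, the default preset can be written down as a plain mapping, with other presets expressed as overrides. The key names below are hypothetical and do not correspond to any file or API of the app. Only values stated in this guide are used.

```python
DEFAULT_PRESET = {
    "vad_enabled": True,
    "use_previous_context": False,
    "temperature": 0.0,
    "no_speech_threshold": 0.6,
    "logprob_threshold": -1.0,
    "compression_ratio_threshold": 2.4,
    "beam_size": 5,
}


def derive_preset(base, **overrides):
    """Return a copy of the base preset with selected settings changed."""
    unknown = set(overrides) - set(base)
    if unknown:
        raise KeyError(f"unknown settings: {sorted(unknown)}")
    return {**base, **overrides}


# "Fast test runs" changes a single setting relative to the defaults.
fast_test = derive_preset(DEFAULT_PRESET, beam_size=1)
```

Expressing presets as small overrides makes it obvious which settings differ from the baseline, which fits the advice to change one setting at a time.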

Troubleshooting by Symptom

If you see excessive “[Music]” or “[Silence]” tags, enable VAD and increase the no-speech threshold (Filter Non-Speech Audio).
Repeated phrases usually indicate a need to lower the compression ratio (Filter Repetitive Text) or disable previous context.
Random or incorrect words can often be reduced by increasing the logprob threshold (Confidence Threshold) and enabling VAD.
Missing dialogue suggests overly aggressive filtering and should be addressed by lowering the no-speech and logprob thresholds.
Timing issues are almost always resolved with a small time offset adjustment.


Performance vs. Quality Overview

Most fine-tuning options affect quality without changing performance. Beam size is the notable exception and should be adjusted carefully. As a general rule, start with defaults and only change one setting at a time.

Key takeaway: Beam size is the only setting with a major impact on processing speed. VAD adds a modest preprocessing cost that is often offset by the audio it removes, and all other options primarily influence subtitle quality.


Resetting to Defaults

If adjustments lead to unexpected results, you can always restore the default configuration using:

Settings → Fine-tuning → Reset to Defaults

This returns all parameters to their recommended baseline values.