Writing Text for Video: Did Someone Say 'Autumn Aided Cap Shins'?

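Not long after the invention of the modern computer, an egregiously incorrect assumption was made: that computers would soon be up to the job of processing natural language data. People generally communicate quite well by around age 3, so this didn’t seem to be an unreasonable expectation, since computers were already known to solve problems beyond the abilities of the brightest 3-year-old. But language understanding is the ability to perceive the complex changes in sound pressure caused by another person speaking, and then to assign them symbolic interpretations based on local culture and the current context. That is very hard to teach a computer.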

With the explosive growth of streaming media in the past decade, a great deal of software effort has gone into the challenge of automatically captioning that video. Happily, some improvement has been made. The task of captioning is essentially this: Identify candidate speech sounds the speaker might be making; identify candidate words that fit the sequence of plausible sounds; choose the most probable sequence of candidate words; add appropriate punctuation; and segment the resulting text so it appears on screen in a way that can be easily and fluently read as it is spoken. Each of those tasks is difficult in its own right, and different automatic captioning tools are better at some of them than at others.
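The last step in that list, segmenting text for the screen, is the easiest one to illustrate. The following is a minimal Python sketch, not any captioning tool's actual method. The function name `segment_captions`, the 32-character line width, and the two-line limit per caption are assumptions chosen for illustration, and the sketch ignores timing and phrase boundaries, which real tools must also consider.

```python
import textwrap


def segment_captions(text, max_chars=32, lines_per_cue=2):
    """Split text into caption cues of a few short lines each.

    max_chars and lines_per_cue are illustrative defaults, not a standard
    that this sketch claims to implement.
    """
    # Greedy word wrapping: each line holds as many words as fit.
    lines = textwrap.wrap(text, width=max_chars)
    # Group consecutive lines into cues that are shown one at a time.
    return [
        "\n".join(lines[i:i + lines_per_cue])
        for i in range(0, len(lines), lines_per_cue)
    ]


if __name__ == "__main__":
    sample = (
        "Identify candidate speech sounds, choose the most probable "
        "sequence of words, add punctuation, and segment the text "
        "for easy reading."
    )
    for cue in segment_captions(sample):
        print(cue)
        print("--")
```

A greedy wrap like this keeps lines short, but it will happily break a line in the middle of a phrase, which is one reason segmentation is harder than it first appears.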

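One of those tasks that has improved recently is identifying phonemes, the vowels and consonants of speech. This is a notoriously difficult problem: Because every voice is unique, speech recognizers need to be trained to learn each user’s idiosyncrasies. Improvement has come from two directions. On the client side, most of us carry small but powerful computers that have poor keyboards but decent microphones. Mobile and desktop operating systems now have voice assistants that continually tune themselves to recognize your unique voice and the way you make sounds with it. On the server side, we have classifiers: software that decides whether input data belongs to a category of similar, previously encountered data or rather to another category. Server-side platforms can compare a speech signal against vast datasets of phoneme patterns and classify the candidate sounds more accurately than earlier systems could.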
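To make the idea of a classifier concrete, here is a minimal Python sketch of a nearest-centroid classifier, one of the simplest techniques of this kind. It is an illustration only: production recognizers use far richer features and models. The two-number feature vectors (standing in for the first two formant frequencies of a vowel, in Hz) are rough illustrative values rather than measured data, and the names `TRAINING`, `train`, and `classify` are hypothetical.

```python
import math

# Hypothetical labeled examples: vowel label -> list of (F1, F2) values in Hz.
# The numbers are illustrative, not measurements.
TRAINING = {
    "i": [(270, 2290), (300, 2200), (280, 2350)],
    "a": [(730, 1090), (700, 1150), (760, 1050)],
    "u": [(300, 870), (320, 900), (290, 850)],
}


def centroid(points):
    """Average a list of equal-length feature tuples."""
    count = len(points)
    return tuple(sum(p[i] for p in points) / count for i in range(len(points[0])))


def train(examples):
    """Summarize each category by the centroid of its examples."""
    return {label: centroid(points) for label, points in examples.items()}


def classify(features, centroids):
    """Return the label whose centroid is closest to the new input."""
    return min(centroids, key=lambda label: math.dist(features, centroids[label]))


if __name__ == "__main__":
    model = train(TRAINING)
    print(classify((310, 2250), model))
```

The new input is assigned to whichever category of previously encountered data it most resembles, which is the decision the column describes.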

Another of those tasks that has improved, and will continue to improve, is choosing the most probable sequence of words from the available candidates. This is traditionally done with a language model; in its simplest form, this is a statistical analysis of how frequently different words appear together. The words “automated” and “captions” are more likely to occur together than are “autumn aided cap shins.” That likelihood is what language models capture.
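A language model in that simplest form can be sketched in a few lines of Python. The example below counts adjacent word pairs (bigrams) in a tiny made-up corpus and uses add-one smoothing so that unseen pairs get a small nonzero probability. The corpus, the function names, and the smoothing choice are all assumptions for illustration; real captioning systems train on vastly more text and typically use more sophisticated models.

```python
import math
from collections import Counter

# Tiny made-up training corpus, for illustration only.
CORPUS = [
    "automated captions help viewers",
    "automated captions are improving",
    "accurate captions help students",
    "autumn leaves fall",
]


def train_bigrams(sentences):
    """Count single words and adjacent word pairs."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()  # <s> marks the sentence start
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def log_prob(words, unigrams, bigrams):
    """Log-probability of a word sequence under add-one smoothing."""
    vocab_size = len(unigrams)
    total, prev = 0.0, "<s>"
    for word in words:
        # A Counter returns 0 for pairs it has never seen.
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
        prev = word
    return total


def most_probable(candidates, unigrams, bigrams):
    """Pick the candidate transcript the model considers most likely."""
    return max(candidates, key=lambda c: log_prob(c.split(), unigrams, bigrams))


if __name__ == "__main__":
    uni, bi = train_bigrams(CORPUS)
    print(most_probable(["autumn aided cap shins", "automated captions"], uni, bi))
```

Because “automated” is followed by “captions” in the training text while “aided,” “cap,” and “shins” never appear at all, the model scores the sensible transcript higher.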

Captioning of educational video is particularly ripe for advancing speech recognition research. A school is a fairly closed ecosystem. We can easily identify which teacher is speaking, and we can easily have that teacher train a custom speech model to be reused whenever she appears in a video to be captioned. Teachers at large research universities are the brightest minds recruited from all over the world, so their linguistic diversity is extreme; these custom-tuned speech models are critical for accurate captioning when your speakers come from such varied linguistic backgrounds.

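Educational videos often contain technical vocabulary and jargon that standard recognizers struggle with. However, we have access to the visual aids (typically slides) that the teacher uses in the video, and those aids can be mined for contextually relevant vocabulary. This is exactly what Microsoft Garage’s Presentation Translator does. Accurate captioning of these atypical terms is critical, since the captions would be misleadingly bad otherwise.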
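The general idea of mining slides for vocabulary can be sketched as follows. This is not a description of how Presentation Translator works internally; it is a hypothetical Python illustration in which slide text is scanned for uncommon words, and those words are then preferred when the recognizer offers several candidates. The word list `COMMON_WORDS`, the sample slides, and both function names are assumptions.

```python
import re
from collections import Counter

# Hypothetical list of everyday words a standard recognizer already handles.
COMMON_WORDS = {"the", "a", "an", "and", "of", "in", "is", "to"}


def mine_vocabulary(slide_texts, common_words):
    """Count the uncommon words that appear on the slides."""
    counts = Counter()
    for text in slide_texts:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in common_words:
                counts[word] += 1
    return counts


def pick_candidate(candidates, slide_vocab):
    """Prefer the candidate seen most often on the slides.

    On a tie, max() keeps the earliest candidate, so the recognizer's
    original ranking is preserved.
    """
    return max(candidates, key=lambda word: slide_vocab[word.lower()])


if __name__ == "__main__":
    slides = ["Mitochondria and the Krebs cycle", "ATP synthesis in mitochondria"]
    vocab = mine_vocabulary(slides, COMMON_WORDS)
    print(pick_candidate(["my to con dria", "mitochondria"], vocab))
```

A real system would blend this preference with acoustic and language model scores rather than letting the slides decide outright.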

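Universities are where many of the top researchers in speech recognition work, and they are also where the need for accurate automatic captions is urgent. This is a perfect example of how the university’s triple mission (to educate, to research, and to provide public service) demands cooperative action.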

[This article appears in the June 2018 issue of Streaming Media magazine as "Autumn Aided Cap Shins."]
