Work Items

To be discussed on 2014-10-30.

*Please add your remarks by clicking edit at the top right of this page!*

Evaluation: tool to evaluate correctness of alignments/srt files.

  • 1. proportion of matching subtitles between gold standard and hypothesis
  • 2. proportion of subtitles that start and stop within Δt of the gold standard start and stop times (e.g. Δt = 100ms, 300ms, 1000ms)
  • for later (when we don't necessarily evaluate off gold standard text): some way to evaluate text-division into smaller units
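A minimal sketch of metrics 1 and 2, assuming both srt files have already been parsed into cue lists and that hypothesis cue i corresponds to gold cue i. The `Cue` class and both function names are hypothetical, not part of any existing tool; a real evaluator would need a pairing step for files with differing cue counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cue:
    """One subtitle: start/end in milliseconds plus its text (hypothetical type)."""
    start_ms: int
    end_ms: int
    text: str


def _check_paired(gold, hyp):
    # Assumption of this sketch: cues are paired one-to-one by position.
    if len(gold) != len(hyp):
        raise ValueError("this sketch assumes one hypothesis cue per gold cue")


def proportion_matching_text(gold, hyp):
    """Metric 1: fraction of cue pairs whose text is identical."""
    _check_paired(gold, hyp)
    if not gold:
        return 0.0
    return sum(g.text == h.text for g, h in zip(gold, hyp)) / len(gold)


def proportion_within(gold, hyp, delta_ms):
    """Metric 2: fraction of cue pairs whose start AND end both lie within delta_ms."""
    _check_paired(gold, hyp)
    if not gold:
        return 0.0
    hits = sum(
        abs(g.start_ms - h.start_ms) <= delta_ms
        and abs(g.end_ms - h.end_ms) <= delta_ms
        for g, h in zip(gold, hyp)
    )
    return hits / len(gold)


gold = [Cue(0, 1000, "hello"), Cue(1500, 3000, "world"), Cue(3500, 5000, "again")]
hyp = [Cue(50, 1080, "hello"), Cue(1750, 3100, "world"), Cue(4400, 5900, "again!")]

for delta in (100, 300, 1000):
    print(delta, proportion_within(gold, hyp, delta))
```

Reporting the proportion for several Δt values at once gives a coarse picture of how the timing errors are distributed.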
Evaluation metrics: Levenshtein distance, histogram of deviations from ground truth, word error rate, time taken to correct after alignment
Evaluation-driven alignment: phone- and phoneme-based scoring for more focused automatic corrections
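As an illustration of two of the metrics named above, the sketch below computes Levenshtein distance with the standard dynamic-programming recurrence and derives word error rate from it by running the same distance over word lists. The function names are hypothetical, and tokenization is assumed to be plain whitespace splitting, which is likely too naive for real transcripts.

```python
def levenshtein(ref, hyp):
    """Minimum number of insertions, deletions and substitutions turning ref into hyp.

    Works on any two sequences (strings, or lists of words).
    """
    prev = list(range(len(hyp) + 1))  # distances from the empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)


print(levenshtein("kitten", "sitting"))
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```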

Alignment: find out what Sphinx really does (and why)

  • take a very close look at SpeechAligner, LongTextAligner, AlignerGrammar, etc.
  • what's the overall algorithm, how is it divided between classes?
  • profile runtime (and possibly memory use) by the involved parts (frontend, decoding, alignment-magic)

Data-Formats: re-visit the question of how we store our data

  • focus on intermediate layers
  • include meta-data (where did videos come from, when were transcripts pulled, etc.)
  • support in existing viewers/editors/software
  • flexibility of using different inputs (raw text, pre-split text, ...)
  • allow export to USF format in karaoke mode to include timing information
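To make the discussion of intermediate layers and meta-data more concrete, here is one possible shape for an intermediate record, serialized as JSON. All field names, the `AlignmentRecord` and `WordTiming` classes, and the example URL are assumptions made for illustration; nothing here reflects a format that has been decided on.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class WordTiming:
    """Timing for a single aligned word, in milliseconds (hypothetical type)."""
    word: str
    start_ms: int
    end_ms: int


@dataclass
class AlignmentRecord:
    """Hypothetical intermediate layer: provenance meta-data plus word timings."""
    video_source: str       # where the video came from
    transcript_pulled: str  # ISO 8601 date the transcript was fetched
    input_kind: str         # e.g. "raw" or "pre-split"
    words: list = field(default_factory=list)

    def to_json(self):
        # asdict() recurses into the nested WordTiming instances.
        return json.dumps(asdict(self), indent=2)


record = AlignmentRecord(
    video_source="https://example.org/talk.mp4",  # placeholder URL
    transcript_pulled="2014-10-28",
    input_kind="raw",
    words=[WordTiming("hello", 0, 420), WordTiming("world", 480, 900)],
)
print(record.to_json())
```

Keeping word-level timings in the intermediate layer would leave both subtitle splitting and karaoke-style export open as later steps.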

Splitting subtitles:

  • analyze splitting positions (how often are sentences split into parts?, where are they split?)
  • generate splits based on different methods (syntactic, length, ngram?)
  • think about how to evaluate splitting-performance
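The simplest of the splitting methods listed above, splitting by length, could look like the sketch below. It greedily packs whole words into units of at most `max_chars` characters; the function name and the default of 42 characters are assumptions for illustration, and syntactic or n-gram based methods would need linguistic resources this sketch does not use.

```python
def split_by_length(text, max_chars=42):
    """Greedily split text into units of at most max_chars characters.

    Words are never broken; a single word longer than max_chars
    ends up alone in an over-long unit.
    """
    units = []
    current = ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            units.append(current)
            current = word
    if current:
        units.append(current)
    return units


print(split_by_length("the quick brown fox jumps over the lazy dog", max_chars=15))
```

A baseline like this also gives the other methods something to be compared against once a way to evaluate splitting performance exists.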

Subtitle display:

  • make strategic decisions about when to display subtitles earlier/later than alignment would indicate?
  • review existing literature/guidelines on this topic
  • integrate timings with video properties (such as cuts, people coming into view, ...)
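One possible way to integrate timings with cuts is to snap a subtitle boundary to the nearest cut when the two are close. The sketch below assumes cut times are already available from some shot-detection step; the function name and the 200 ms window are hypothetical, and whether snapping is desirable at all is exactly the kind of question the literature review should answer.

```python
def snap_to_cut(time_ms, cut_times_ms, window_ms=200):
    """Return the nearest cut time if it lies within window_ms, else time_ms unchanged."""
    nearest = min(cut_times_ms, key=lambda c: abs(c - time_ms), default=None)
    if nearest is not None and abs(nearest - time_ms) <= window_ms:
        return nearest
    return time_ms


cuts = [0, 4000, 9500]
print(snap_to_cut(3900, cuts))  # close to the cut at 4000
print(snap_to_cut(6000, cuts))  # no cut nearby, left as aligned
```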

Multilingual support:

  • identify all language-dependent processing steps and determine whether each factors out its linguistic resources into separate models
  • find out whether current solutions scale to a reasonable language variety (EN, DE, FR, ES?)
  • if not, find alternatives for all or specific languages

Thinking further:

  • subtitle-creation from real-time transcriptions (could it be done, with what delay, in our architecture?)
  • automatic transcription, at least for some pre-defined keywords? (and extract keywords from presentations?)
  • cross-talk: what to do when multiple speakers speak at the same time (e.g. in a movie, not a presentation)

-- TimoBaumann - 28 Oct 2014