
Planning Meeting - ASR, LLM, and TTS Development

· 4 min read
Neasa Ní Chiaráin
Assistant Professor of Speech and Language Technologies, PI ABAIR Project
Ailbhe Ní Chasaide
Adjunct Professor
Harald Berthelsen
Software Engineer
Oskar Mroz
Research Assistant
Andy Murphy
Research Fellow
Joey McInerney
Masters Student
Christoph Wendler
Research Fellow
Simon Harshman-Earley
Research Assistant
Liam Lonergan
PhD Candidate

The team met to discuss strategic priorities and development plans across ASR, LLM, and TTS workstreams for the coming period.

ASR Development

ASR Meeting Notes

Core Technical Work

  • Developing automated speech & text corpus scraper and organizer
  • Implementing language ID with diarization and code-switching capabilities (see the labelling sketch after this list)
  • Building bilingual Irish-English test sets with focus on:
    • Accent diversity
    • Source diversity (Joe Duffy, local radio stations, Gaeilge24)
    • Test augmentation with other English varieties
  • Creating Corpus na Gaeilge Beo test set with code-switching support
  • Conducting representation analysis
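
As a rough illustration of the code-switching labelling mentioned above, the sketch below tags each transcript segment as Irish, English, or code-switched. It assumes the langid library and segment-level granularity; the project's actual labelling scheme may differ.

```python
# Sketch: segment-level language labelling for a bilingual (ga/en) test set.
# Assumes the langid library; segment granularity and the 'cs' label are
# illustrative choices, not the project's actual scheme.
import langid

langid.set_languages(["ga", "en"])  # restrict candidates to Irish and English

def label_segments(segments):
    """Attach a language tag to each transcript segment.

    A segment whose words disagree on language is marked 'cs'
    (code-switched) so it can be counted separately in the test set.
    """
    labelled = []
    for text in segments:
        word_langs = {langid.classify(w)[0] for w in text.split()}
        lang = word_langs.pop() if len(word_langs) == 1 else "cs"
        labelled.append({"text": text, "lang": lang})
    return labelled

if __name__ == "__main__":
    demo = ["Tá an aimsir go maith",
            "the weather is fine",
            "Bhí mé ag caint leis an manager inné"]
    for seg in label_segments(demo):
        print(seg)
```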

Data & Infrastructure

  • Standardizing format and metadata for existing speech corpora
  • Planning live capture for additional data collection
  • Seeking access to more corpora
  • Developing Irish speech encoder (HuBERT/wav2vec) for publication
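
For the speech encoder work, the sketch below shows the kind of interface a published Irish encoder could expose, using a generic wav2vec 2.0 checkpoint from HuggingFace transformers as a stand-in; an encoder pretrained on ABAIR corpora would be dropped in under the same interface.

```python
# Sketch: extracting frame-level features from a wav2vec 2.0 encoder.
# "facebook/wav2vec2-base" is a stand-in checkpoint; an Irish encoder
# pretrained on ABAIR corpora would replace it.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

def encode(wav_path: str) -> torch.Tensor:
    """Return a (frames, hidden_size) matrix of encoder representations."""
    waveform, sr = torchaudio.load(wav_path)
    waveform = torchaudio.functional.resample(waveform, sr, 16_000)
    inputs = extractor(waveform.squeeze(0).numpy(),
                       sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.last_hidden_state.squeeze(0)
```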

Future Capabilities

  • Offline ASR capability
  • Automatic speech translation
  • Children's speech recognition - adding míle glór na n-óg (already performs very well)

API Development

  • Streaming support for concurrent users (see the sketch after this list)
  • Saving user input on website
  • Prompting users to credit ABAIR upon download
  • "Powered by ABAIR" branding
  • Testing, security, and scalability improvements
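
A minimal sketch of what the streaming endpoint could look like, assuming FastAPI WebSockets; the path, message format, and the transcribe_chunk placeholder are illustrative only and may differ from the real ABAIR API.

```python
# Sketch: a streaming recognition endpoint serving concurrent users.
# The path, message format, and transcribe_chunk() are hypothetical.
import asyncio
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

async def transcribe_chunk(audio: bytes) -> str:
    """Placeholder for the actual ASR model call."""
    await asyncio.sleep(0)   # yield to other connections
    return ""                # partial hypothesis would go here

@app.websocket("/asr/stream")
async def stream(ws: WebSocket):
    await ws.accept()
    try:
        while True:
            chunk = await ws.receive_bytes()          # raw audio from the client
            partial = await transcribe_chunk(chunk)
            await ws.send_json({"partial": partial,
                                "credit": "Powered by ABAIR"})
    except WebSocketDisconnect:
        # Each connection runs as an independent coroutine, so many
        # users can stream at once on a single worker.
        pass
```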

Evaluation & Resources

  • Comparison with commercial solutions (see the scoring sketch after this list)
  • Compute resources - LUMI (AMD GPUs only)
  • Direct application to Nvidia
  • Exploring machines to run our own LLMs (~€5,000)
  • DGX Spark consideration
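
For the comparison with commercial solutions, a scoring sketch along these lines could report word error rate per system on the same test-set manifests; jiwer and the manifest layout are assumptions made for illustration.

```python
# Sketch: word error rate comparison across systems on a shared test set.
# jiwer and the JSONL manifest layout are assumptions.
import json
from jiwer import wer

def score(manifest_path: str, hypotheses: dict[str, str]) -> float:
    """manifest: JSONL lines of {"id": ..., "text": ...}; hypotheses: id -> text."""
    refs, hyps = [], []
    with open(manifest_path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            refs.append(item["text"])
            hyps.append(hypotheses.get(item["id"], ""))
    return wer(refs, hyps)

# The same call can be repeated per system and per subset
# (Irish, English, code-switched) to build the comparison table.
```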

ASR Evaluation Notes

LLM Work

LLM Meeting Notes

Training Data

  • Investigating whether including speech transcripts in training data improves conversational chatbot performance
  • Identifying data sources:
    • Contact tuairisc.ie
    • Historical corpus of Irish (Kevin Scannell)
    • Other accessible data sources
  • Building questions & answers dataset
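
One possible record shape for the questions & answers dataset is sketched below; the field names and the example pair are illustrative only.

```python
# Sketch: one possible record format for the Irish Q&A dataset.
# Field names and the example pair are illustrative only.
import json

record = {
    "question": "Cad é príomhchathair na hÉireann?",
    "answer": "Is í Baile Átha Cliath príomhchathair na hÉireann.",
    "source": "manually written",   # or tuairisc.ie, historical corpus, ...
    "licence": "to be confirmed",
    "reviewed_by_native_speaker": False,
}

with open("qa_dataset.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```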

Methodology

  • Exploring the LIMA paper approach - less is more for alignment
  • Translation by multiple models for agreement, followed by native speaker feedback
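
For the multi-model translation step, agreement between candidates could be estimated roughly as below before items are sent for native speaker feedback; the similarity measure (difflib ratio) and the threshold are arbitrary illustrative choices.

```python
# Sketch: flag translations where candidate models disagree, so they are
# prioritised for native-speaker review. The similarity measure and the
# 0.8 threshold are arbitrary illustrative choices.
from difflib import SequenceMatcher
from itertools import combinations

def agreement(candidates: list[str]) -> float:
    """Mean pairwise similarity between candidate translations."""
    pairs = list(combinations(candidates, 2))
    if not pairs:
        return 1.0
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

def needs_review(candidates: list[str], threshold: float = 0.8) -> bool:
    return agreement(candidates) < threshold
```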

TTS Development

TTS Meeting Notes

Children's Voices

  • Processing Donegal recordings to add children's data to current average voice
  • Creating a mixture unique to each speaker

Bilingual Synthesis

  • Determining how to constrain the phoneme set
  • Deciding between combined models vs. separate models with markup
  • Text processing using LLM / LID tags
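
If the separate-models-with-markup option is chosen, the text front end might split LID-tagged input and route each span to the matching voice, roughly as sketched below; the tag format and the synthesiser callables are hypothetical.

```python
# Sketch: routing LID-tagged text to per-language synthesis models,
# one option if separate models with markup are chosen. The tag format
# and the per-language callables are hypothetical.
import re

TAGGED = re.compile(r"<(ga|en)>(.*?)</\1>", re.DOTALL)

def split_by_language(text: str):
    """Yield (lang, span) pairs from text such as
    '<ga>Dia dhuit</ga> <en>and welcome</en>'."""
    for match in TAGGED.finditer(text):
        yield match.group(1), match.group(2).strip()

def synthesise(text: str, voices: dict):
    """voices maps 'ga'/'en' to a callable returning audio for a span."""
    return [voices[lang](span) for lang, span in split_by_language(text)]
```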

Speech LLM Integration

  • Addressing challenges in fixing errors that arise
  • Resolving issues with language switching in SOTA systems

Research & Monitoring

  • Monitoring research updates for new synthesis techniques & methods
  • Addressing potential issues with Python-based models (speed & integration)

Evaluation

  • Compiling list of errors in each voice
  • Single word synthesis testing
  • Metric testing for new voices with subcategories
  • Creating comprehensive testing standards (primarily intelligibility)
  • Numeric judgment of voice quality
  • Integrating ASR for evaluation
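
One way to integrate ASR into evaluation is an intelligibility round trip: synthesise a test item, transcribe it, and score the character error rate, as in the sketch below; the TTS and ASR calls are placeholders for the ABAIR systems, and jiwer supplies the metric.

```python
# Sketch: ASR round-trip intelligibility check for a synthetic voice.
# synthesise() and recognise() stand in for the ABAIR TTS and ASR calls.
from jiwer import cer

def intelligibility(sentences, synthesise, recognise) -> float:
    """Mean CER between input text and the ASR transcript of the synthesised audio.
    Lower is better; single words can be passed in for the single-word tests."""
    scores = []
    for text in sentences:
        audio = synthesise(text)
        transcript = recognise(audio)
        scores.append(cer(text, transcript))
    return sum(scores) / len(scores)
```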

Voice Transformation

  • Controllability (emotion, etc.)
  • Transformation to children's voices for minority dialects
  • Anonymization
  • Voice repair for impaired speech

Speech Enhancement

  • Enhancement for historical voices

Transcription

  • Integrating ASR to speed up TransTool process
  • Exploring whether LTS (letter-to-sound) rules can be eliminated
  • Using LTS rules with grapheme-based synthesis to speed up dialect development

Documentation

  • Gathering metadata for each voice
  • Adding training data to config file
  • Documenting hyperparameters
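
A minimal example of what per-voice metadata might capture is sketched below; the fields and values are illustrative only, not an agreed schema.

```python
# Sketch: per-voice metadata written alongside the model config.
# The fields and values are illustrative, not the actual ABAIR schema.
import json

voice_metadata = {
    "voice_name": "example-connacht-female",            # hypothetical name
    "dialect": "Connacht",
    "training_data": ["corpus_a", "corpus_b"],           # recorded in the config, per the notes above
    "hyperparameters": {"sample_rate": 22050, "epochs": 1000, "batch_size": 32},
}

with open("voice_metadata.json", "w", encoding="utf-8") as f:
    json.dump(voice_metadata, f, ensure_ascii=False, indent=2)
```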

Paralinguistics

  • Controllability over emphasis / emotion / age for child synthesis

Webpage / API

  • Better naming for voices (moving beyond basic/AI labels)
  • Creating beta page for testing

Infrastructure

  • Separating production and testing servers

ABAIR IV Deliverables

ABAIR IV Deliverables Notes

Core Features

  • Multispeaker dialect/gender support
  • DPO ethics
  • Comprehensive evaluation

End-User Orientation

  • Bank of voices for accessibility
  • Hiberno-English support
  • Code-switching capabilities

Research Focus

  • Voice transformation
  • Anonymization
  • Emotion/expressivity

Security

  • Misuse flagging
  • Training attribution to ABAIR
  • Kudos-working system

Ethics

  • Communication with end users
  • Addressing "AI" terminology concerns with lay persons

Application Contexts

  • Public services
  • Accessibility applications
  • Broadcasting

Training & Data Protection

  • Data protection & security protocols

Additional Topics

Additional Topics Notes

  • Bilingual ASR and synthesis for Gaeilge
  • Heritage dialects and speakers
  • Speed enhancement
  • Local/offline API (model size considerations)
  • C-Pen integration
  • LLM computational overheads
  • Corpus collection strategies for different types of corpora
  • Encoder development
  • Anonymization techniques
  • Data protection for public services/voices
  • Permissions framework