Cymen

Welsh Speech Recognition Technology

This project will develop automatic transcription technology that recognises Welsh speech. Cymen will collect 500 hours of Welsh speech to create the largest ever corpus of Welsh text and voice data, recording participants from all over the ARFOR area. Cymen will then be able to evaluate how many hours of audio are needed to train the transcription system effectively.

We see this as a huge leap on a crucial path towards ensuring that the Welsh language is not left behind in the digital space. Neither the language models nor the bulk data needed to build them are available for minority languages, so we intend to create these linguistic databases in order to train AI systems to handle Welsh in the same way as they handle widely spoken languages.