Automated processing of spoken language is a fundamental component of many applications that have found their way into our everyday lives. These include dialog systems in telephone customer service as well as digital voice-based assistants that aim to make our everyday tasks easier, e.g. Apple's Siri, Microsoft's Cortana, or Amazon's Alexa. Although automated speech processing has proven itself in these systems, there are application areas in which the processing and recognition of spoken language continue to be a major challenge. One such area, which in addition to the technical challenge holds high potential for users, is automated speech processing in real as well as virtual meetings.
Video conferences and virtual meetings are forms of communication used by many business teams whose members work on joint projects from branch offices spread across several cities and countries. Speech processing in multi-participant meeting rooms is complicated: several participants often talk simultaneously, utterances overlap during rapid speaker changes, and usually only a single microphone is placed in the room.
The use of automated speech processing and enhancement, combined with new, cost-effective solutions for equipping meeting rooms with microphones for multi-channel recordings, holds immense potential to make virtual meetings more pleasant and efficient for all participants. Beyond improving the quality of the audio signal in virtual meetings, this relates in particular to automatic speaker identification as well as smart meeting guidance, which can provide automated feedback to participants in real time, for example about poor audio quality.
The SAM project aims to develop methods and applications that improve communication in virtual meetings as well as their organizational flow, in order to increase the efficiency of meetings. This includes the development of cost-effective methods for improving the audio signal in meeting rooms as well as new solutions for segmenting the audio signal. Each segment can be assigned to a speaker, which makes it possible to identify who spoke when, where, and for how long. Accurate speaker assignment is fundamental for all further steps of speech processing, such as giving meeting participants feedback about their speaking share in the form of smart meeting guidance. Since these goals are very difficult or even impossible to achieve from a single-channel recording, as is often the case in meeting rooms, the aim is to enable multi-channel recordings by interconnecting the microphones of the participants' smartphones.
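The output of such a segmentation and speaker-assignment step (often called speaker diarization) is commonly a list of time segments, each labeled with a speaker. The following minimal sketch, using hypothetical segment data and a made-up helper name, illustrates how such segments could be aggregated into per-speaker talk time, the kind of statistic a smart meeting guidance feature might report back to participants:

```python
from collections import defaultdict

def speaking_time(segments):
    """Aggregate total talk time per speaker.

    Each segment is a (start_sec, end_sec, speaker_label) tuple, as a
    "who spoke when" diarization step might produce.
    """
    totals = defaultdict(float)
    for start, end, speaker in segments:
        totals[speaker] += end - start
    return dict(totals)

# Hypothetical diarization output for a short meeting excerpt:
segments = [
    (0.0, 4.5, "A"),
    (4.5, 6.0, "B"),
    (6.0, 12.0, "A"),
    (12.0, 15.0, "C"),
]

print(speaking_time(segments))  # {'A': 10.5, 'B': 1.5, 'C': 3.0}
```

This representation deliberately ignores overlapping speech; handling overlap, as noted above, is one of the hard parts of meeting-room speech processing.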
In this project, new methods for improving audio signal quality and for segmentation with speaker assignment are being developed. The proposed approaches combine signal processing methods, which emphasize certain properties of the audio signal, with recent developments in deep learning for segmentation and speaker assignment. Complementing these methods, the project team will develop an app that realizes multi-channel recording via an ad-hoc microphone array formed by interconnecting the microphones of several smartphones.
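A classical signal-processing building block in such a pipeline is voice activity detection, which marks which parts of the recording contain speech at all before any deep-learning model is applied. The sketch below shows a deliberately simple energy-based detector on a synthetic signal; the frame length and fixed threshold are illustrative assumptions, and a real system like the one described above would use far more robust, learned methods:

```python
import numpy as np

def energy_vad(signal, sr, frame_ms=30, threshold=0.01):
    """Toy energy-based voice activity detection.

    Splits the signal into non-overlapping frames and flags frames whose
    mean-square energy exceeds a fixed threshold. Purely illustrative:
    production systems use adaptive thresholds or learned models.
    """
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energies = np.mean(frames ** 2, axis=1)
    return energies > threshold

# Synthetic example: 1 s silence, 1 s tone, 1 s silence at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
silence = np.zeros(sr)
tone = 0.5 * np.sin(2 * np.pi * 220 * t)
signal = np.concatenate([silence, tone, silence])

speech_mask = energy_vad(signal, sr)
print(speech_mask.sum(), "of", len(speech_mask), "frames flagged as active")
```

Contiguous runs of active frames would then form the segments that the subsequent speaker-assignment step labels.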
Ongoing project 11/2019 to 10/2022
The project "Language Segmentation and its Applications in Meetings (SAM)" is funded by the Federal Ministry of Education and Research (BMBF) through the programme "Research at Universities of Applied Sciences (FHprofUnt 2018)" with a duration until October 2022.