Development of Dialogue Systems for Serbian and Other South Slavic Languages is a technology project (TR-32035) of the Ministry of Science and Technological Development of the Republic of Serbia (2011-2016), aiming to establish more flexible speech communication between humans and machines. The project gathers 38 researchers from 6 scientific research institutions in Serbia, as well as 5 researchers from abroad.
The project is a continuation of previous technology projects of the Faculty of Technical Sciences financed by the Ministry: Human-machine speech communication (2008-2010) and Development of Speech Technologies in Serbian and their application in "Telekom Srbija" (2005-2007), during which a continuous speech recognition system and a high-quality text-to-speech system for Serbian and some other kindred South Slavic languages were developed. Within the ongoing project as well as the previous ones, a number of speech databases and language resources have been developed, and more than 100 scientific papers have been published at renowned international conferences and in journals.
IVG10tf100n is a database designed for research in the area of speaker identification
and verification, namely for speaker recognition on the basis of digits
spoken over the phone. Recording was repeated once a month with
approximately 100 speakers. It was carried out over the telephone network,
using a Dialogic CTI card. Samples were recorded on a hard disk in mono
PCM format, 16 bits/sample, 8000 samples/second. In each session, the caller's
name, the calling phone number, two fixed sequences and ten random sequences
of four digits were recorded. Some of the callers participated
in the recording process every month (their voices can be used for system
training), and some of them called only once (their voices can be used
for testing the probability of false identification).
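The recording format stated above (mono PCM, 16 bits/sample, 8000 samples/second) maps directly onto the parameters of a WAV header. As a minimal sketch, assuming the samples are available as headerless little-endian 16-bit PCM (the actual on-disk layout of IVG10tf100n is not described here), the following Python wraps such data in a WAV container using only the standard library. The function name and the synthetic test tone are illustrative and are not part of the database tooling.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 8000   # samples/second, as stated for IVG10tf100n
SAMPLE_WIDTH = 2     # bytes per sample (16 bits)
CHANNELS = 1         # mono


def raw_pcm_to_wav(raw_bytes, dest):
    """Wrap headerless 16-bit mono PCM (assumed little-endian) in a WAV container."""
    with wave.open(dest, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(raw_bytes)


# Synthetic stand-in for a recording: 0.5 s of a 440 Hz tone.
n = SAMPLE_RATE // 2
raw = b"".join(
    struct.pack("<h", int(10000 * math.sin(2 * math.pi * 440 * i / SAMPLE_RATE)))
    for i in range(n)
)

# Write to an in-memory buffer, then read the header back to check it.
buffer = io.BytesIO()
raw_pcm_to_wav(raw, buffer)
buffer.seek(0)

with wave.open(buffer, "rb") as wav:
    params = (wav.getnchannels(), wav.getsampwidth(), wav.getframerate(), wav.getnframes())
```

In practice `dest` would be a file path rather than an in-memory buffer; the buffer is used here only so the sketch runs without touching the file system.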
SpeechDat II is a database compliant with the SpeechDat standard. The database is of telephone quality and currently contains 500 speakers. Every speaker pronounced 50 utterances which contain names (people, cities, companies), digits, amounts, dates, isolated phonemes, application words and phrases, phonetically rich sentences, etc. The recording format was mono A-law, 8 bits/sample, 8000 samples/second. The whole database is labeled and documented in accordance with the standard. Inspection of the database involved labeling every noise and poorly pronounced phoneme, as well as positioning phoneme boundaries. It is used for training the system for phoneme-based speech recognition over the telephone line.
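Because SpeechDat II samples are stored as 8-bit A-law, they need to be expanded to linear PCM before most signal processing. The sketch below follows the standard ITU-T G.711 A-law expansion; it assumes each byte holds one A-law code word, which is an assumption about the file layout rather than something stated here. The function names are illustrative.

```python
def alaw_to_linear(code):
    """Expand one 8-bit G.711 A-law code word to a 16-bit linear PCM value."""
    code ^= 0x55                      # undo the even-bit inversion applied by the encoder
    magnitude = (code & 0x0F) << 4    # 4-bit mantissa
    segment = (code & 0x70) >> 4      # 3-bit segment (exponent)
    if segment == 0:
        magnitude += 8
    else:
        magnitude += 0x108
        magnitude <<= segment - 1
    # In A-law, a set sign bit marks a positive sample.
    return magnitude if code & 0x80 else -magnitude


def decode_alaw(data):
    """Decode a bytes object of A-law code words into a list of PCM integers."""
    return [alaw_to_linear(b) for b in data]


# Smallest and largest magnitudes of both signs.
samples = decode_alaw(bytes([0xD5, 0x55, 0xAA, 0x2A]))
```

The decoded values can then be packed as 16-bit integers and processed like any other 8000 samples/second PCM signal.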
AN_CASR is a database still in the recording phase. It is being recorded according to
criteria similar to the SpeechDat standard, but with a microphone
covering the full audible range. It currently contains 30 speakers. Every
speaker pronounced 120 sequences which contain names (people, cities,
companies), digits, amounts, dates, isolated phonemes, application words
and phrases, phonetically rich sentences, etc. The recording format was
mono PCM, 16 bits/sample, 22050 samples/second. The recorded part of
the database is fully inspected and labeled. The database should be used
together with S70W100s120 for training large vocabulary continuous ASR.
TTSlsMarica is a database in Croatian, containing two hours of text chosen to suit a TTS system which takes speech segments from a large database (see AlfaNumTTS.pdf). It was recorded in the studio in mono PCM, 16 bits/sample, 22050 samples/second format. The database has been inspected, labeled and pitch-marked. Inspection involved marking the degree of impairment for every phoneme, open/closed types of vowels, as well as places where disturbances in glottal activity occurred.
TTSlsMarija is a database which contains two hours of text chosen to suit
a TTS system which takes speech segments from a large database (see AlfaNumTTS.pdf). It was recorded in the
studio in mono PCM, 16 bits/sample, 22050 samples/second format. The
database has been inspected, labeled and pitch-marked. Inspection involved
marking the degree of impairment for every phoneme, open/closed types
of vowels, as well as places where disturbances in glottal activity
occurred. Besides using a different speaker, the speech rate was somewhat
slower in order to minimize the impairment of both vowels and consonants.
A professional speaker, unanimously selected from among 5 candidates, was engaged
for the recording (see ETRAN2003.pdf). The voice of the database was also automatically converted to a male one, enabling speech synthesis using a male voice.
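The recording formats listed for the databases determine the raw storage each one requires. As a rough sketch, the Python below computes the uncompressed data rate for each stated format and the size of two hours of studio audio. It assumes that "two hours" refers to the duration of the recorded audio, and it ignores file headers and label files.

```python
def bytes_per_second(sample_rate, bits_per_sample, channels=1):
    """Uncompressed data rate of a PCM-style stream in bytes per second."""
    return sample_rate * bits_per_sample * channels // 8


# Formats as stated in the database descriptions (all mono).
formats = {
    "telephone PCM (IVG10tf100n)": (8000, 16),
    "telephone A-law (SpeechDat II)": (8000, 8),
    "microphone/studio PCM (AN_CASR, TTS databases)": (22050, 16),
}

rates = {name: bytes_per_second(sr, bits) for name, (sr, bits) in formats.items()}

# Two hours of studio-quality audio, assuming "two hours" is audio duration.
two_hours_bytes = rates["microphone/studio PCM (AN_CASR, TTS databases)"] * 2 * 3600
two_hours_mib = two_hours_bytes / 2**20

for name, rate in rates.items():
    print(f"{name}: {rate} bytes/s")
print(f"Two hours at 22050 Hz, 16-bit mono: {two_hours_mib:.1f} MiB")
```

Under these assumptions, each two-hour TTS database amounts to roughly 300 MiB of raw audio.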
Among more than 100 scientific papers published at regional and international conferences and in scientific journals and books, the following stand out:
Relevance of the Types and the Statistical Properties of Features in the Recognition of Basic Emotions in Speech
User-Awareness and Adaptation in Conversational Agents
Automatic Prosody Generation in a Text-to-Speech System for Hebrew
Speaker Detection Using Phoneme Specific Hidden Markov Models
Comparison of Linear Discriminant Analysis Approaches in Automatic Speech Recognition
Discrimination Capability of Prosodic and Spectral Features for Emotional Speech Recognition
Influence of the Number of Principal Components used to the Automatic Speaker Recognition Accuracy
A Novel Split-and-Merge Algorithm for Hierarchical Clustering of Gaussian Mixture Models
Automatic Prosody Generation for Serbo-Croatian Speech Synthesis Based on Regression Trees
Speech Technologies for Serbian and Kindred South Slavic Languages
Applications of Speech Technologies in Western Balkan Countries
Transformation-Based Part-of-Speech Tagging for Serbian Language
Eigenvalues Driven Gaussian Selection in Continuous Speech Recognition Using HMMs with Full Covariance Matrices
Part-of-Speech Tagging Based on Combining Markov Models and Machine Learning
Energy Normalization in Automatic Speech Recognition
An Overview of the AlfaNum Text-to-Speech Synthesis System
A Review of R&D of Speech Technologies in Serbian and Their Applications in Western Balkan Countries
Computers as a Tool for Serbian-Speaking Blind Persons
Description of training procedure for AlfaNum continuous speech recognition system