AlfaNum is the only team in the former Yugoslavia that develops speech technologies for the Serbian language, and it has achieved results comparable to those achieved for some languages spoken throughout the world. Several speech technology experts have been working on these problems for a number of years at the Faculty of Engineering in Novi Sad, Serbia and Montenegro. Developing speech technologies requires knowledge from various areas, from technical and mathematical to linguistic, and such multidisciplinary problems are generally tackled by teams of up to several dozen people. The very wide range of applications of these technologies and their extreme language dependency show the true importance of the AlfaNum project, whose solutions contain original scientific contributions verified at both the regional and international level, with the most innovative solutions patented as well. This page introduces the most important software solutions created by AlfaNum, the speech databases and language resources developed and used within the project, papers published at conferences and in scientific journals, as well as several scientific and humanitarian projects we are currently involved in.

SOFTWARE

The first system for phoneme-based continuous speech recognition in the Serbian language:

  • the system is speaker independent
  • the solution is software-only and does not require any additional hardware
  • accuracy exceeds 98% on a dictionary of 2000 words (telephone quality)
  • the system gives an estimate of recognition reliability, with a list of alternative recognition results sorted by their likelihood
  • there are wildcard models useful for word-spotting
  • the system is fast and reliable, and supports modern multicore and multiprocessor platforms
  • several interfaces are supported: C++ library, ActiveX, MS SAPI, IP server, MRCP
  • the system supports distribution over multiple computers as well as load balancing, which enables its application in very demanding environments
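
The reliability estimate and the sorted list of alternatives mentioned above can be illustrated with a generic N-best sketch. This is not the AlfaNum API: the `Hypothesis` type, the `rank_hypotheses` function and the softmax-style conversion of log-likelihoods into confidences are all assumptions made purely for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str              # recognized word sequence
    log_likelihood: float  # hypothetical decoder score (higher is better)


def rank_hypotheses(hyps):
    """Sort hypotheses by likelihood and attach a normalized confidence.

    The confidence is a softmax over the log-likelihoods in the list,
    an illustrative assumption rather than the method used by AlfaNum.
    """
    if not hyps:
        return []
    ordered = sorted(hyps, key=lambda h: h.log_likelihood, reverse=True)
    best = ordered[0].log_likelihood
    # Subtract the best score before exponentiating for numerical stability.
    weights = [math.exp(h.log_likelihood - best) for h in ordered]
    total = sum(weights)
    return [(h.text, w / total) for h, w in zip(ordered, weights)]
```

An application could accept the top result when its confidence is high and otherwise offer the caller the next alternatives from the list.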

The first text-to-speech synthesis system in the Serbian language with elements of prosody incorporated:

  • the solution is software-only and does not require any additional hardware
  • a version of the PSOLA algorithm is used for synthesis
  • the system is adaptable to other synthesis algorithms (HN model, HMM)
  • the system is fast and reliable, and can handle 100 lines in real time
  • implemented prosodic elements (accentuation) significantly contribute to the intelligibility and naturalness of synthesized speech, while the naturalness of sentence intonation is achieved using techniques based on classification and regression trees (CART)
  • the system has many additional features (reading Cyrillic alphabet, numbers, words without diacritical marks, abbreviations, words of foreign origin...)
You can try out the speech synthesis system on the demo page, where you can also find the instructions for the demonstration of the speech recognition system via telephone.
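
One of the additional features listed above, reading text written in the Cyrillic alphabet, can be approached by mapping Serbian Cyrillic letters to their Latin equivalents before further processing. The sketch below shows only that letter mapping; the function name `cyrillic_to_latin` is hypothetical, and whether the AlfaNum system handles Cyrillic input this way is an assumption.

```python
# Serbian Cyrillic to Latin letter mapping (lowercase forms).
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}


def cyrillic_to_latin(text):
    """Transliterate Serbian Cyrillic letters, leaving other characters as-is."""
    out = []
    for ch in text:
        lower = ch.lower()
        if lower in CYR_TO_LAT:
            lat = CYR_TO_LAT[lower]
            # For uppercase input, capitalize only the first letter of a digraph.
            out.append(lat.capitalize() if ch != lower else lat)
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(cyrillic_to_latin("Нови Сад"))
```

Digits, punctuation and Latin text pass through unchanged, so later stages such as number expansion can operate on a single script.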

Besides this software, we have developed additional software for digital speech signal processing and speech database processing (free to download), the accompanying C++ visualization libraries, and a C compiler adapted to the MAS 35xx digital signal processor made by Micronas. During our work on speech synthesis, we have developed a tool for efficient creation of an accentuation dictionary of the Serbian language, as well as for creation of a morphologically annotated text corpus.

W150tf1000 is a database of 150 isolated words, pronounced by 600-1000 speakers and recorded over a telephone channel using a Dialogic CTI card. The recordings are stored on a hard disc in mono PCM format, 8 or 16 bits/sample, 8000 samples/second. The database contains utterances of the days of the week, months of the year, horoscope signs, geographic terms (state, city, river, mountain, sea, ocean, lake,...), words for browsing menus in IVR applications (information, account, check,...), command words (go, stop, forward, back, up, down,...) and numbers: 0-9, 10-19, 20, 30,..., 90, 100, 200,..., 900, 1000, 100000, as well as some of their variations. Besides standard documentation, the database contains accompanying text files with additional information on phoneme boundaries in some of the words. The database is stored on two CDs.

IVG10tf100n is a database designed for research in the area of speaker identification and verification, namely for speaker recognition on the basis of digits spoken over the phone. Recording was repeated once a month with approximately 100 speakers. It was carried out over the telephone network, using a Dialogic CTI card. Samples were recorded on a hard disc in mono PCM format, 16 bits/sample, 8000 samples/second. In each session, the caller's name, the calling phone number, two fixed sequences of four digits and ten random sequences of four digits were recorded. Some of the callers participated in the recording process every month (their voices can be used for system training), and some of them called only once (their voices can be used for estimating the probability of false identification).

IVG10tf100n and W150tf1000 contain pronounced telephone numbers and random number arrays that can be used for testing connected word recognition systems.

S70W100s120 is a database that was originally recorded on tape in 1983 in the anechoic chamber of the Faculty of Electrical Engineering (ETF) in Belgrade. 120 speakers were recorded, each pronouncing 70 sentences and another 100 isolated utterances (60 words, digits 0-9 and 30 phonemes). The AlfaNum team has transferred those analog recordings into digital form on CD. The recordings were digitized to mono PCM format, 16 bits/sample, 22050 samples/second. The database is segmented, documented and labeled. A compressed form of the database is stored on three CDs. The database is an ideal resource for starting scientific and research work on continuous speech recognition in the Serbian language.

SpeechDat II is a database compliant with the SpeechDat standard. The database is of telephone quality and currently contains 500 speakers. Every speaker pronounced 50 utterances, which contain names (people, cities, companies), digits, amounts, dates, isolated phonemes, application words and phrases, phonetically rich sentences, etc. The recording format was mono A-law, 8 bits/sample, 8000 samples/second. The whole database is labeled and documented in accordance with the standard. The database inspection involved labeling each noise and poorly pronounced phoneme, as well as positioning phoneme boundaries. It is used for training the system for phoneme-based speech recognition over the telephone line.

AN_CASR is a database still in the recording phase. It is being recorded under criteria similar to the SpeechDat standard, but using a microphone covering the full audible range. It currently contains 30 speakers. Every speaker pronounced 120 sequences, which contain names (people, cities, companies), digits, amounts, dates, isolated phonemes, application words and phrases, phonetically rich sentences, etc. The recording format was mono PCM, 16 bits/sample, 22050 samples/second. The recorded part of the database is fully inspected and labeled. The database should be used together with S70W100s120 for training a large-vocabulary continuous ASR system.

TTSlab2g2s is a database of diphones and disyllables in the Serbian language designed for TTS system development. It has been recorded in laboratory conditions in mono PCM format, 16 bits/sample, 44100 samples/second. Diphones have been recorded both as parts of meaningful words and as parts of meaningless words. Disyllables of the most frequent consonant groups have also been recorded. Disyllables are phonetic units that start within a vowel and end within the following vowel.
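
To make the notion of a diphone inventory concrete, the sketch below lists the diphone units needed to cover a given phoneme sequence, using the conventional definition of a diphone as the transition from one phone to the next. The function name `diphones`, the silence marker `_` and the unit naming scheme are assumptions for illustration, not the labeling used in TTSlab2g2s.

```python
def diphones(phonemes, silence="_"):
    """Return the diphone units needed to synthesize a phoneme sequence.

    The sequence is padded with a silence marker on both sides, so the
    first and last units describe the transitions from and to silence.
    """
    padded = [silence] + list(phonemes) + [silence]
    return [a + "-" + b for a, b in zip(padded, padded[1:])]


if __name__ == "__main__":
    # The word "sad" as a list of phonemes.
    print(diphones(["s", "a", "d"]))
```

Phonemes are passed as a list rather than a string so that phonemes written with two letters (such as "lj" or "nj") remain single units.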

TTSlsMarina is a database in Serbian containing two hours of text, chosen to suit a TTS system that takes speech segments from a large database (see AlfaNumTTS.pdf). It was recorded in a studio in mono PCM format, 16 bits/sample, 22050 samples/second. The database has been inspected, labeled and pitch-marked. Inspection involved marking the degree of impairment for every phoneme, open/closed types of vowels, as well as places where disturbances in glottal activity occurred.

TTSlsMarica is a database in Croatian containing two hours of text, chosen to suit a TTS system that takes speech segments from a large database (see AlfaNumTTS.pdf). It was recorded in a studio in mono PCM format, 16 bits/sample, 22050 samples/second. The database has been inspected, labeled and pitch-marked. Inspection involved marking the degree of impairment for every phoneme, open/closed types of vowels, as well as places where disturbances in glottal activity occurred.

TTSlsMarija is a database containing two hours of text, chosen to suit a TTS system that takes speech segments from a large database (see AlfaNumTTS.pdf). It was recorded in a studio in mono PCM format, 16 bits/sample, 22050 samples/second. The database has been inspected, labeled and pitch-marked. Inspection involved marking the degree of impairment for every phoneme, open/closed types of vowels, as well as places where disturbances in glottal activity occurred. Besides using another speaker, the speech rate was somewhat slower in order to minimize the impairment of both vowels and consonants. A professional speaker unanimously selected from among 5 candidates was engaged in the recording (see ETRAN2003.pdf). The voice of the database was also automatically converted to a male one, enabling speech synthesis using a male voice.

TTSlsSnezana is a database containing ten hours of text, chosen to suit a TTS system that takes speech segments from a large database (see AlfaNumTTS.pdf). It was recorded in a studio in mono PCM format, 16 bits/sample, 44100 samples/second. The database has been inspected, labeled and pitch-marked. Inspection involved marking the degree of impairment for every phoneme, open/closed types of vowels, as well as places where disturbances in glottal activity occurred. Each word was part-of-speech tagged and marked for the values of particular morphological categories as well as accentuation. A certain portion of the database (about 40%) has been marked for phrase breaks as well as sentence focus, making the database convenient for automatic prediction of prosodic features of speech as well. A professional speaker unanimously selected from among 10 candidates was engaged in the recording (see ETRAN2003.pdf).
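
The recording formats quoted for the TTSls databases translate directly into uncompressed storage requirements. The sketch below does that arithmetic, assuming that the stated "hours of text" correspond to hours of recorded audio and ignoring file headers and label files; the function name `pcm_size_bytes` is hypothetical.

```python
def pcm_size_bytes(hours, sample_rate, bits_per_sample, channels=1):
    """Size in bytes of uncompressed PCM audio of the given duration."""
    return hours * 3600 * sample_rate * channels * bits_per_sample // 8


# (duration in hours, samples/second) as stated for each database;
# all are mono, 16 bits/sample.
DATABASES = {
    "TTSlsMarina": (2, 22050),
    "TTSlsMarica": (2, 22050),
    "TTSlsMarija": (2, 22050),
    "TTSlsSnezana": (10, 44100),
}

if __name__ == "__main__":
    for name, (hours, rate) in DATABASES.items():
        size = pcm_size_bytes(hours, rate, 16)
        print(f"{name}: {size / 1e6:.1f} MB")
```

Under these assumptions, two hours at 22050 samples/second come to 317,520,000 bytes, and ten hours at 44100 samples/second come to 3,175,200,000 bytes.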


The morphological dictionary of the Serbian language is a database containing entries for particular inflected forms of Serbian words. Each entry is classified under a particular lemma, and marked for part-of-speech, values of appropriate morphological categories, as well as the position and type of accent (see Akcenatski_recnik.pdf). An example of a dictionary entry is:

Vb-p-1-- dobićemo (dobiti) [\000]

Morphological categories that are marked are dependent on the part-of-speech, e.g. for verbs only tense/mood, gender, number and person are applicable, and the last three are applicable only to certain tenses/moods. The particular example represents the 1st person (1) of the plural (p) of the verb (V) dobiti in the future tense. The most recent version of the dictionary contains approximately 4.4 million entries, classified under approximately 102,000 lemmas. The format of the entry allows for a phonetic transcription to be specified as well if necessary (in view of the fact that phonetic transcription of almost all Serbian words is practically equal to the orthographic form), which allows the inclusion of frequent words from foreign languages (approximately 2,000 entries). The dictionary is based on the Dictionary of Serbo-Croat by Matica srpska, the Dictionary of the Serbian Academy of Sciences and Arts, as well as a number of other sources, and it has been supplemented with words that have not been found in any of the dictionaries but have been detected by automatic analysis of large text corpora in the electronic format.
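
An entry of the form shown above can be split into its fields with a short parser. The sketch below is based only on the published example: the field names, the `parse_entry` and `decode_verb_tag` helpers, and the tag positions (part-of-speech at position 0, tense/mood at 1, number at 3, person at 5) are inferred from that example and are assumptions, not the documented dictionary format. The bracketed accent field is kept as an opaque string.

```python
import re
from typing import NamedTuple


class Entry(NamedTuple):
    tag: str     # morphological tag, e.g. "Vb-p-1--"
    form: str    # inflected word form
    lemma: str   # lemma given in parentheses
    accent: str  # raw content of the square brackets (not interpreted)


ENTRY_RE = re.compile(r"^(\S+)\s+(\S+)\s+\((\S+)\)\s+\[(.*)\]$")


def parse_entry(line):
    """Split one dictionary line into tag, form, lemma and accent fields."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized entry: {line!r}")
    return Entry(*match.groups())


def decode_verb_tag(tag):
    """Pick out verb categories; positions are inferred from one example."""
    return {
        "pos": tag[0],
        "tense_mood": tag[1],
        "number": tag[3],
        "person": tag[5],
    }


if __name__ == "__main__":
    entry = parse_entry(r"Vb-p-1-- dobićemo (dobiti) [\000]")
    print(entry)
    print(decode_verb_tag(entry.tag))
```

A raw string is used for the sample line so that the backslash in the accent field is preserved literally.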

The morphological dictionary of the Croatian language is a database containing entries for particular inflected forms of Croatian words. Each entry is classified under a particular lemma, and marked for part-of-speech, values of appropriate morphological categories, as well as the position and type of accent. An example of a dictionary entry is:

Vc------ dobit (dobiti) [\0]

Morphological categories that are marked are dependent on the part-of-speech, e.g. for verbs only tense/mood, gender, number and person are applicable, and the last three are applicable only to certain tenses/moods. The particular example represents the infinitive (c) of the verb (V) dobiti, specifically its form used for building the future tense, with the i at the end dropped. The most recent version of the dictionary contains approximately 4.1 million entries, classified under approximately 97,000 lemmas. The format of the entry allows for a phonetic transcription to be specified as well if necessary (in view of the fact that phonetic transcription of almost all Croatian words is practically equal to the orthographic form), which allows the inclusion of frequent words from foreign languages (approximately 500 entries). The dictionary is based on the Dictionary of Serbo-Croat by Matica srpska, the Dictionary of the Croatian language (V. Anić), as well as a number of other sources, and it has been supplemented with words that have not been found in any of the dictionaries but have been detected by automatic analysis of large text corpora in the electronic format.

The AlfaNum text corpus of the Serbian language contains approximately 200,000 words. The corpus has been annotated with parts-of-speech, values of relevant morphological categories, as well as accentuation pattern, following the scheme outlined above. In the first phase the tagging was carried out using the AlfaNum module for automatic morphological annotation, with an accuracy of 94% (even higher on "easier" texts), while in the second phase the remaining errors were corrected manually. The corpus is a collection of texts corresponding to different functional styles, principally technical, journalistic and administrative.

The Serbian dependency treebank is a corpus of 1,148 syntactically annotated sentences in Serbian, containing a total of 7,117 words. The annotation is carried out in line with the standards set by the Prague Dependency Treebank (more precisely, its analytical level), which has been adopted as a starting point for the development of treebanks for some other kindred languages in the region. The treebank is intended for various applications in the field of natural language processing, primarily natural language understanding within human-machine dialogue.


Among more than 200 scientific papers published at regional and international conferences and in scientific journals and books, the following stand out:

Relevance of the Types and the Statistical Properties of Features in the Recognition of Basic Emotions in Speech
Milana Bojanić, Vlado Delić, Milan Sečujski
Facta Universitatis, University of Niš, 2014
2014_facta_esr.pdf

User-Awareness and Adaptation in Conversational Agents
Vlado Delić, Milan Gnjatović, Nikša Jakovljević, Branislav Popović, Ivan Jokić, Milana Bojanić
Facta Universitatis, University of Niš, 2014
2014_facta_agents.pdf

Automatic Prosody Generation in a Text-to-Speech System for Hebrew
Branislav Popović, Dragan Knežević, Milan Sečujski, Darko Pekar
Facta Universitatis, University of Niš, 2014
2014_facta_hebrew.pdf

Speaker Detection Using Phoneme Specific Hidden Markov Models
Edvin Pakoci, Nikša Jakovljević, Branislav Popović, Dragiša Mišković, Darko Pekar
SPECOM 2014
Novi Sad, Serbia, September 5th-9th, 2014
2014_specom_spk_detect.pdf

Comparison of Linear Discriminant Analysis Approaches in Automatic Speech Recognition
Nikša Jakovljević, Dragiša Mišković, Marko Janev, Milan Sečujski, Vlado Delić
Elektronika ir Elektrotechnika, Kaunas University of Technology, 2013
2013_eie_lda.pdf

Discrimination Capability of Prosodic and Spectral Features for Emotional Speech Recognition
Vlado Delić, Milana Bojanić, Milan Gnjatović, Milan Sečujski, Slobodan Jovičić
Elektronika ir Elektrotechnika, Kaunas University of Technology, 2013
2013_eie_esr.pdf

Influence of the Number of Principal Components used to the Automatic Speaker Recognition Accuracy
Ivan Jokić, Stevan Jokić, Zoran Perić, Milan Gnjatović, Vlado Delić
Elektronika ir Elektrotechnika, Kaunas University of Technology, 2012
2012_eie_pc.pdf

A Novel Split-and-Merge Algorithm for Hierarchical Clustering of Gaussian Mixture Models
Branislav Popović, Marko Janev, Darko Pekar, Nikša Jakovljević, Milan Gnjatović, Milan Sečujski, Vlado Delić
Applied Intelligence, Springer, 2012
2012_ai.pdf

Automatic Prosody Generation for Serbo-Croatian Speech Synthesis Based on Regression Trees
Milan Sečujski, Darko Pekar, Nikša Jakovljević
INTERSPEECH 2011
Florence, Italy, August 28th-31st, 2011
2011_interspeech.pdf

Speech Technologies for Serbian and Kindred South Slavic Languages
Vlado Delić, Milan Sečujski, Nikša Jakovljević, Marko Janev, Radovan Obradović, Darko Pekar
Advances in Speech Recognition (chapter in the book), SCIYO, 2010
(link to IntechOpen)

Applications of Speech Technologies in Western Balkan Countries
Darko Pekar, Dragiša Mišković, Dragan Knežević, Nataša Vujnović Sedlar, Milan Sečujski, Vlado Delić
Advances in Speech Recognition (chapter in the book), SCIYO, 2010
(link to IntechOpen)

Transformation-Based Part-of-Speech Tagging for Serbian Language
Vlado Delić, Milan Sečujski, Aleksandar Kupusinac
CIMMACS 2009
Puerto de la Cruz, Spain, December 14th-16th, 2009.
CIMMACS2009.pdf

Eigenvalues Driven Gaussian Selection in Continuous Speech Recognition Using HMMs with Full Covariance Matrices
Marko Janev, Nikša Jakovljević, Darko Pekar, Vlado Delić
Applied Intelligence, Springer, 2009
AI2009.pdf

Part-of-Speech Tagging Based on Combining Markov Models and Machine Learning
Aleksandar Kupusinac, Milan Sečujski
Speech and Language 2009
Belgrade, November 13th-14th, 2009
SL2009.pdf

Energy Normalization in Automatic Speech Recognition
Nikša Jakovljević, Marko Janev, Darko Pekar and Dragiša Mišković
Lecture Notes in Computer Science, Vol. 5246, 2008
LNCS2008.pdf

An Overview of the AlfaNum Text-to-Speech Synthesis System
Milan Sečujski, Vlado Delić, Darko Pekar, Radovan Obradović, Dragan Knežević
SPECOM 2007
Moscow, Russia, October 15th-18th, 2007
SPECOM2007.pdf

A Review of R&D of Speech Technologies in Serbian and Their Applications in Western Balkan Countries
Vlado Delić
SPECOM 2007
Moscow, Russia, October 15th-18th, 2007
SPECOM_WBC2007.pdf

Speech-Enabled Computers as a Tool for Serbian-Speaking Blind Persons
Vlado Delić, Nataša Vujnović, Milan Sečujski
EUROCON 2005
Belgrade, November 22nd-24th, 2005
EUROCON2005.pdf

Description of training procedure for AlfaNum continuous speech recognition system
Nikša Jakovljević, Darko Pekar
EUROCON 2005
Belgrade, November 22nd-24th, 2005
EUROCON_CASR2005.pdf

Assessment of Various Aspects of Synthesized Speech Quality
Milan Sečujski, Darko Pekar
Speech and Language 2004
Belgrade, November 29th-December 1st, 2004
SL2004.pdf

Speech Signal Processing in ASR&TTS Algorithms
Vlado Delić, Darko Pekar, Radovan Obradović, Milan Sečujski
Facta Universitatis, 2003
FACTA2003.pdf

The AlfaNumCASR Application - a System for Continuous Automatic Speech Recognition (Serbian)
Darko Pekar, Radovan Obradović, Vlado Delić
DOGS conference, pp. 49-56,
Bečej, May 16th-17th, 2002
AlfaNumCASR.pdf

AlfaNum System for Speech Synthesis in Serbian Language
Milan Sečujski, Radovan Obradović, Darko Pekar, Ljubomir Jovanov, Vlado Delić
TSD 2002, pp.237-244,
Brno, September 9th-12th, 2002
TSD2002.pdf

AlfaNum System for Continuous Speech Recognition
Darko Pekar, Radovan Obradović, Vlado Delić
TSD 2002, demonstration
Brno, September 9th-12th, 2002
TSD2002_ASR.pdf

A Robust Speaker-Independent CPU Based ASR System
Radovan Obradović, Darko Pekar, Srđan Krčo, Vlado Delić, Vojin Šenk
EUROSPEECH’99, Volume 6, pp. 2881-2884,
Budapest, September 5th-10th, 1999
Eurospeech99.pdf

A Method for Reducing Error Rate in Extended Phone Dialing
Vlado Delić and Vojin Šenk
YU patent P-434/97, accepted on March 4th, 1999
Cifre_patent.pdf

System for automatic tracking of audio clips in radio and TV programmes
Darko Pekar, Stevan Molerov, Goran Kočiš, Robert Vuković
Patent pending since December 26th, 2007
AM_patent.pdf

SP2: SCOPES PROJECT FOR SPEECH PROSODY
Scientific research project financed by the Swiss National Science Foundation (2014-2016), related to multilingual prosody transfer. The project is coordinated by the IDIAP Institute in Martigny, Switzerland, and the remaining participants include the Faculty of Technical Sciences, University of Novi Sad; Faculty of Electrical Engineering and Information Technologies, University of Skopje, Macedonia; and the Technological University of Budapest, Hungary.

AUDIO LIBRARY FOR THE DISABLED
Technology project of the Provincial Secretariat for Scientific and Technological Development (2011-2014), which represents a sequel to the previous project "Audio library for the visually impaired". This project focused on the development of existing technologies and implementation of new interfaces aimed at expanding the circle of users of audio libraries to persons with other types of disabilities.

DEVELOPMENT OF DIALOGUE SYSTEMS IN SERBIAN AND OTHER SOUTH SLAVIC LANGUAGES
Technology project (TR-32035) of the Ministry of Science and Technological Development (2011-2016), aimed at establishing more flexible speech communication between humans and machines. The project gathers 38 researchers from 6 scientific research institutions in Serbia, as well as 5 researchers from abroad. More info about the project can be found at the project page.

S-VERIFY: ADVANCED SPEAKER VERIFICATION
S-VERIFY is an international EUREKA project (E!-TESTED, 2009-2011) carried out in cooperation with the company "Alpineon" from Maribor, Slovenia (www.alpineon.com). The aim of the project is to research and develop innovative speaker verification technologies. More info about the project can be found at the project page.

TEXT-TO-SPEECH FOR EMBEDDED DEVICES (TESTED)
TESTED is an international EUREKA project (E!-TESTED, 2009-2011) carried out in cooperation with the company "Alpineon" from Maribor, Slovenia (www.alpineon.com). The aim of the project is to increase the functionality of existing text-to-speech solutions for the Serbian and Slovenian languages, and to port them to mobile devices.

HUMAN-MACHINE SPEECH COMMUNICATION
Technology project (TR-11001) of the Ministry of Science and Environment Protection (2008-2010), aimed at further development and improvement of quality of speech technologies. Led by Prof. Vlado Delić, the project gathered 22 researchers from several universities in Serbia.

DEVELOPMENT OF SPEECH TECHNOLOGIES IN SERBIAN AND THEIR APPLICATION IN "TELEKOM SRBIJA"
Technology project (TR-6144A) of the Ministry of Science and Environment Protection (2005-2007) with financial participation of "Telekom Srbija", aimed at further development and improvement of quality of speech technologies, as well as their application in the services offered by "Telekom Srbija". Led by Prof. Vlado Delić, the project gathered 22 researchers from several universities in Serbia.

INTELLIGENT TELEPHONE E-MAIL ACCESS (iTEMA)
iTEMA is an international EUREKA project (E!3864, 2007-2009) carried out in cooperation with the company "Alpineon" from Maribor, Slovenia (www.alpineon.com). The aim of the project was the development of the iTEMA multilingual e-mail reader: a user-friendly solution to e-mail access over the telephone. By using iTEMA, users are able to listen to received e-mails in a variety of European languages, with an emphasis on ex-Yugoslav languages: Slovenian, Serbian, Croatian, Bosnian and Macedonian. They are also able to choose between basic responses to the e-mail they have heard and to save or delete individual messages.

AUDIO-LIBRARY FOR THE VISUALLY IMPAIRED
Technology project of the Provincial Secretariat for Scientific and Technological Development (2005-2006), whose aim was realisation of an information system that provides access to information that is stored in textual format, but is accessed by users as synthesised speech. The first audio-library was installed in the School for the visually impaired pupils "Veljko Ramadanović" in Zemun (www.skolaveljkoramadanovic.edu.rs), and the initial system has been subsequently upgraded on several occasions, and new functionalities were added to it, such as Internet access, multilinguality, efficient administration. The system has a number of advantages over a classical library containing Braille and voice books - it is far simpler and less costly to maintain, it enables simultaneous access of multiple users to a single book, as well as remote access and the possibility of saving a book in the sound format (as a CD). The initial project was financed by the British non-government organisation DFID.

CONTACT
In 2004, the project "CONTACT" was carried out. The aim of this project was to create an interactive telephone speech portal as well as an interactive Web speech portal, intended to be a sort of meeting place for the visually impaired. Using this portal they can get information on the possibilities for education, self-development, even employment - using speech technologies. Besides news updated from four national news websites, there are links to books and magazines in electronic form that can be downloaded and read out through speech synthesis. The project was initially financed by OSCE, the Provincial Secretariat for Science and Technological Development, as well as the Ministry of Culture and Media of the Republic of Serbia.

COMPUTER DICTIONARY FOR THE BLIND
This project (2004) was related to the creation of the "Computer Dictionary for the Blind", a CD with an electronic edition of the book "Illustrated Computer Dictionary for Dummies", by Dan Gookin and Sandra Hardin Gookin, with an integrated speech synthesizer which reads out the explanations of more than 2,000 terms from computer science and technology to the user. The electronic edition of the dictionary includes a graphical user interface adapted to the needs and ergonomics of visually impaired people. This dictionary is of invaluable help to all those making their first brave steps in the world of computer technologies.
This project was initiated by the Association of Visually Impaired Intellectuals and Artists of Serbia, under the auspices of the Ministry of Education of the Republic of Serbia.

VISION
In 2004, the project "VISION" was initiated. The aim of the project was to train groups of visually impaired PC users to work with the new speech synthesizer for the Serbian language and to pass their knowledge and experience on to others, thus initiating a chain of education among them. Such groups were thus enabled for written communication and unaided access to the information on the Internet, as well as the use of optical character recognition software etc. This project helped the blind and partially sighted in our midst to exercise their right to education and access to information guaranteed by law, thus giving them a higher level of equality and the possibility to study, communicate and work unaided. This project has thus contributed to the improvement of the quality of life of the visually impaired, helped them to build their self-respect and enabled them to organize themselves and integrate into society more easily. The project was financed by a number of non-government organisations, including EHO, Share-See and the Fund for Open Society.