Quality has been defined as the result of a perception and judgment process, during which the user compares the perceived characteristics of the speech sound (the so-called "auditory event") to the expected or desired characteristics. Because perception and judgment processes are necessarily involved, subjective tests are still the only valid and reliable means of quantifying the impact of different types of degradations on perceived overall quality (QoE). However, instrumental models such as PESQ and the E-model have been shown to provide valid estimations of the results of perceptual tests, within the limits of the applications they have been designed for. As a consequence, these models are widely used instead of auditory tests, e.g. for network planning and monitoring. Still, their range of validity needs to be respected, and this is why they cannot simply be applied to NGMN scenarios without further validation.
In the case of transmitted speech, which is the focus of the present chapter, the perceived quality of the given conditions is collected in auditory tests. In a listening-only situation, for instance, a group of test participants is asked to judge the quality of a number of processed speech samples. The text material, read by different speakers, is chosen according to the aim of the experiment, and the recorded clean speech files are processed by the system of interest and finally presented to the listeners. Their task is to judge the perceived quality of each processed speech sample, which provides a quantitative measure of the QoE.
The Telecommunication Standardization Sector of the International Telecommunication Union (ITU-T) provides information about how such tests have to be performed in detail. There, it is specified how to choose "balanced" speech material, "normal" speakers as well as "normal-hearing" test participants. In order to increase the reliability of the experiment, the recording and play-back situation is specified, as well as further test parameters which might have a significant impact on the measurement results.
The judgments of the listeners are usually limited to the identification or scaling of pre-defined properties of the percept. For this purpose, a set of pre-defined scales is available. In the area of speech transmission, a 5-point category scale is usually employed (Absolute Category Rating, ACR). For the collection of overall quality ratings, this 5-point scale is labeled with the attributes "Excellent", "Good", "Fair", "Poor", and "Bad". Subsequently, the ratings are averaged over all participants, leading to an arithmetic mean (Mean Opinion Score, MOS) for each processing condition.
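The averaging step can be illustrated with a minimal sketch. It assumes the usual numeric coding of the ACR categories (5 for "Excellent" down to 1 for "Bad"); the condition names and the ratings are invented for illustration and are not data from any experiment reported here.

```python
# Assumed numeric coding of the ACR categories (5 = best, 1 = worst).
ACR_LABELS = {5: "Excellent", 4: "Good", 3: "Fair", 2: "Poor", 1: "Bad"}


def mean_opinion_score(ratings):
    """Return the arithmetic mean of ACR ratings for one processing condition."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for rating in ratings:
        if rating not in ACR_LABELS:
            raise ValueError(f"not a valid ACR category: {rating!r}")
    return sum(ratings) / len(ratings)


# Hypothetical ratings: one list per processing condition,
# one entry per test participant.
ratings_by_condition = {
    "condition_A": [4, 5, 4, 3, 4],
    "condition_B": [2, 3, 2, 2, 1],
}

mos_by_condition = {
    condition: mean_opinion_score(ratings)
    for condition, ratings in ratings_by_condition.items()
}

for condition, mos in mos_by_condition.items():
    print(f"{condition}: MOS = {mos:.2f}")
```

In practice, a confidence interval is usually reported alongside each MOS, since the mean alone hides the spread of the individual judgments.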
In contrast to the listening-only situation, real-world conversations take place over a bi-directional communication system. Thus, the ecologically most valid method for quality assessment is to conduct conversation experiments with two interacting participants. Here, the interlocutors are asked to have a conversation on an arbitrary or pre-defined topic. Afterwards, they judge the connection they have just used, e.g. on the above-mentioned ACR scale.
Another standardized paradigm, the so-called "CallQuality" method, can be regarded as a trade-off between the "artificial" listening-only experiments and the quite complex conversation tests. In this method, the perceived quality of a "simulated" telephone conversation is assessed. The participants are asked to listen to short extracts of a normal conversation and to answer questions verbally about the content of the stimulus they have just heard. After five of these stimuli, they rate the quality of the entire simulated conversation on the ACR scale. The answering part was introduced to come close to a real conversation with its turn-taking, and to distract subjects from concentrating on the quality until they rate it (the test participants are instructed to try to put themselves into the position of an interlocutor).
Once the perceptual effects have been quantified, instrumental quality prediction models can be developed that are capable of estimating the subjective ratings. One type of model recommended by ITU-T, the so-called "Perceptual Evaluation of Speech Quality" (PESQ), is based on the application-layer speech signals. Here, the quality of transmitted speech in a listening-only situation is estimated by comparing the clean and degraded signals on a perceptual level, i.e. by taking advantage of psycho-acoustic knowledge, such as the Bark-scale transform, loudness functions, time/frequency masking, asymmetries of "positive" and "negative" error components, as well as insensitivities to certain variations in delay, the spectrum, and the amplitude. PESQ has been extended towards wideband speech by applying a flat input filter and a different mapping function.
Currently, the requirements for a successor of PESQ are being discussed in Study Group 12 of ITU-T. This successor is intended to overcome known drawbacks of PESQ and to be valid for an even wider range of distortions (e.g., audio bandwidth, time-warping). NGMN-specific degradations, however, will not explicitly be covered by this model.
For conversational speech quality, the ITU-T recommends the so-called E-Model, which is a parametric model usually employed for offline quality estimation, e.g. for network planning. The MOS values are estimated on the basis of transmission channel parameters commonly known in classical telephonometry (e.g., loudness ratings, noise levels), but also in the context of packet-based networks (e.g., packet loss rates). These parameters are subsumed by so-called impairment factors, which are assumed to be additive on a psychological scale, the so-called transmission rating scale (R-scale). The eventual quality estimates are then obtained by a non-linear transformation of the R-values, i.e. the summations of the perceived impairments.
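The two steps described above, summing impairments on the R-scale and then transforming R non-linearly into a MOS estimate, can be sketched as follows. The sketch assumes the formulation given in ITU-T Rec. G.107, where R = Ro − Is − Id − Ie,eff + A; the function and parameter names are chosen for illustration only, and the individual impairment factors are taken as given inputs rather than computed from channel parameters.

```python
def transmission_rating(ro, i_s, i_d, ie_eff, a=0.0):
    """Combine impairment factors additively on the R-scale.

    Assumed form (ITU-T Rec. G.107): R = Ro - Is - Id - Ie,eff + A, where
    ro     is the basic signal-to-noise ratio,
    i_s    the simultaneous impairment factor,
    i_d    the delay impairment factor,
    ie_eff the effective equipment impairment factor (e.g. codec, packet loss),
    a      the advantage factor.
    """
    return ro - i_s - i_d - ie_eff + a


def r_to_mos(r):
    """Non-linear mapping from the transmission rating R to an estimated MOS."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6


# Illustrative input only: R = 93.2 is the rating G.107 yields when all
# parameters are at their default values.
print(f"R = 93.2 -> estimated MOS = {r_to_mos(93.2):.2f}")
```

Because the impairments are subtracted on the R-scale rather than on the MOS scale, the same additional impairment leads to different MOS differences depending on where on the R-scale the connection already lies.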
Until now, none of these models has been validated to correctly predict the effects of NGMN handovers and/or codec changeovers on user perception. However, such instrumental models are indispensable for rapidly designing network handover and codec changeover strategies which provide optimum quality to the user. The auditory investigations presented in this chapter can therefore be considered a basis for the development of NGMN-capable quality models.