----------------------------------------------------------------------------
The Florida SunFlash

Multimedia: Audio (4 of 6)

SunFLASH Vol 40 #28                                              April 1992
----------------------------------------------------------------------------

4 Audio

Audio plays an important role in multimedia applications. When a service representative adds a voice note to a credit record, when executives hold a video conference, when travelers listen to voice mail or have their email read to them over the phone, or when new employees complete training modules, they all use desktop audio.

Audio Applications

Audio can be used in many applications, including voice annotation, voice conferencing, voice mail, training and presentations, text-to-speech, and speech recognition. The following paragraphs describe some of these applications.

Voice Annotation

A voice annotation application enables audio comments to be added to documents, database records, and so on. These applications can be built using fairly simple record and playback capabilities. For example, someone records a voice comment and attaches it to a certain spot in a document. The document then displays an indicator, showing that an audio note is attached. While viewing the document, the reader can select the indicator to play back the audio recording. Voice annotation applications may also provide some simple audio editing capabilities for message creation.

Voice Conferencing

Voice conferences enable people to speak to each other in real time over the network. Voice conferencing is an alternative to video conferencing when the parties do not need to see each other. It is often used in a collaborative environment, where two people sit at their workstations, look at the same document at the same time, and make verbal comments about it.

Voice Mail

Voice mail provides a way to send and receive voice-quality audio recordings in a multimedia email message.
Audio messages can be recorded, sent as attachments to a multimedia email message, and played back by the recipient. With the integration of telephony into the desktop, messages could be recorded automatically by a telephone-answering application and forwarded as email to the telephone owner.

Training and Presentations

Audio can be very effective in training applications. It can provide a soundtrack for video segments, or for illustrations of any form. It can also provide help and feedback to the student with a more personal feel, and without interrupting the student's focus.

Audio can also provide richer, more interesting, and more effective presentations. Audio in combination with other multimedia technologies enables authors to create presentations of a quality that meets the expectations of today's consumers.

Text-to-Speech

Text-to-speech technology enables information stored as text to be converted to speech, in effect to be read aloud. Applications can use text-to-speech to provide verbal help, or to read your calendar, address file information, email messages, or other information.

When integrated with telephony capabilities, text-to-speech technology can provide remote access to information. For example, you could telephone your computer and have it read your mail or appointments.

This technology can also provide spoken desktop messages, such as a reminder of a pending appointment, without a pop-up window intruding on your current work. These desktop alerts could include announcing an incoming email message or, with telephone integration, an incoming phone call. They can also provide spoken status messages or warnings from the system.

Speech Recognition

Speech recognition enables you to speak to your computer through a microphone or telephone. The workstation translates the voice input into text that the system can understand. This technology enables the use of voice as an additional input channel to supplement the keyboard and mouse.
For example, radiologists save valuable time by dictating their x-ray reports into the computer for immediate viewing. The reports can then be edited by voice or by keyboard. You could also give commands to open and close windows or start applications without moving away from your current work.

With telephony integration, speech recognition would enable you to give commands to your workstation verbally over the telephone, for example, "Read the headers of any mail messages from William Tell."

Speech recognition technology in the near future will be limited in the size of its vocabulary, and will typically require you to train your system to your own voice. Eventually, these restrictions should disappear.

Key Audio Concepts

Multimedia audio applications depend on the interaction of a number of variables, such as the type of audio and how it can be digitized, edited, stored, and played back. The following paragraphs describe some of these key concepts.

Types of Audio

Workstations generally support two types of audio input and output:

  o Music-quality audio (often called CD-quality audio or 16-bit audio)

  o Voice-quality audio (also known as telephone-quality or 8-bit audio)

MIDI data, which specifies music, is also often included in the audio category.

CD-quality Audio

CD-quality audio requires both higher sample rates and greater sampling precision (more bits of data), thus making greater storage and processing demands on the workstation. Today, applications requiring CD-quality audio are found primarily in the music industry. The use of CD-quality audio for business training and presentations, while currently limited, is expected to expand considerably throughout the corporate marketplace.

CD-quality audio is typically input from a CD player or DAT (Digital Audio Tape) player, and is output through a high-quality speaker.

Voice-quality Audio

Voice-quality audio can reproduce the comparatively limited dynamic range of the human voice.
Voice-quality audio is standard on every Sun desktop workstation, enabling multimedia applications ranging from electronic voice mail to voice annotation of documents to voice control of your workstation.

Voice-quality audio is commonly input from a microphone or over a telephone, and can be output through a speaker built into or attached to the workstation, or through a telephone handset or speakerphone.

MIDI

MIDI (Musical Instrument Digital Interface) is a note-oriented control language for specifying music. MIDI data consists of codes specifying notes and timing. These codes can be generated by or output to MIDI-compatible devices such as keyboards or synthesizers. MIDI applications are generally found in the computer music industry, where they are used for studio control and audio production.

Audio Editing and Manipulation

In addition to playing it back, you can perform various operations on audio data stored in a file. Probably the most common operation is to edit the audio data. Programs that do audio editing typically generate a display of the waveform representing the data, and then enable you to specify sections of data to cut out or relocate. Editing can be used to isolate segments of interest (for example, to create a "sound bite"), or to remove leading or trailing noise, silence, or pauses.

Another common operation is the mixing of sound files, for example to overlay a voice on a music background for a training application or a presentation.

Audio Playback

Playing back stored audio data requires regenerating the analog audio signal from the digital data. This is done by a digital-to-analog converter, or DAC. The analog signal can then be output to a speaker built into or attached to your workstation, to the speaker in a telephone handset, or to a speakerphone.

Capturing and Digitizing Audio

Sound, or audio, is analog data. To store, manipulate, and enhance it using a computer, it must be digitized, that is, converted to a computer-readable format.
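
As a rough illustration of what digitizing involves, the Python sketch below takes readings of a waveform at regular intervals and rounds each reading to an 8-bit integer. It is not how any particular workstation does it: the names digitize and tone are hypothetical, the 8,000 samples-per-second rate is an assumed value for voice-quality audio, and the rounding is plain linear quantization, whereas real telephone-quality hardware typically uses a companded encoding such as mu-law.

```python
import math

def digitize(signal, duration_s, sample_rate_hz, bits):
    """Sample `signal` (a function of time in seconds returning a value in
    [-1.0, 1.0]) and round each reading to a signed integer of `bits` bits.
    Hypothetical helper for illustration only."""
    max_level = 2 ** (bits - 1) - 1               # 127 for 8-bit samples
    n_samples = round(duration_s * sample_rate_hz)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz                    # one reading per sample tick
        value = max(-1.0, min(1.0, signal(t)))    # clip to the legal range
        samples.append(round(value * max_level))  # round to an integer level
    return samples

def tone(t):
    """A 1 kHz sine wave standing in for the analog input."""
    return math.sin(2 * math.pi * 1000 * t)

# Assumed voice-quality settings: 8,000 samples per second, 8 bits each.
voice = digitize(tone, 0.01, 8000, 8)
print(len(voice), "samples:", voice[:8])
```

Raising the sample rate or the number of bits gives a closer copy of the original waveform at the cost of more data per second.
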
Audio starts as a complex analog waveform coming from some form of input device, such as a microphone, telephone handset, or CD player connected to your workstation. An audio signal is characterized by its bandwidth, the highest frequency, in cycles per second or hertz (Hz), that can be represented in the waveform.

Digitizing this signal involves two processes, sampling and quantization. These functions are generally performed by a chip known as an analog-to-digital converter, or ADC. Today the ADC and its counterpart, the DAC, are sometimes combined into a single chip called a coder-decoder, or CODEC. The quality of audio a workstation supports is primarily determined by the capabilities of the ADC and DAC components.

Audio Data Storage

Once the audio input stream has been captured and digitized, it can be stored in a data file for later playback or for editing or other processing. Even voice-quality audio is data intensive; one minute of voice-quality audio on a SPARCstation takes almost half a megabyte of storage space. One minute of uncompressed CD-quality audio (16-bit, 44.1 kHz, stereo) would require close to 10 Mbytes of storage space.

Besides the raw data, you also need to store information about the data, such as its sampling rate, the number of bits per sample, and the encoding algorithm used. This information is necessary in order to reproduce the original signal. Thus, audio data is commonly stored in files with a special format that includes this information, often in some sort of header structure. Writing the data to these files and reading it back properly often requires special routines.

Multi-Channel Audio

Many workstations, such as today's SPARCstation family, support one channel of audio, or monaural sound. Multiple channels are also possible.
Supporting two channels (stereo) requires two input and two output ports, independent ADC/DAC components for each data stream (or components designed to handle two channels), and a data representation format for the storage of multiple channels of data.

Challenges

There are still some challenging issues to tackle before audio becomes commonplace on the desktop. One of the most significant is the development of more effective ways to handle the volume of data that audio involves. Developing compression algorithms that reduce the storage space and network bandwidth audio requires, so that it can be transmitted across computer networks, is an area for further research.

Text-to-speech and speech recognition, both areas of ongoing research, pose another challenge. More human-sounding speech generation and more flexible and accurate speech recognition are important goals for the future.
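
To put the volume of data in concrete terms, the following Python sketch recomputes the storage figures quoted under Audio Data Storage. It is only a rough check: the function name bytes_per_minute is illustrative, and the 8,000 samples-per-second rate for voice-quality audio is an assumption, since the text states only that it is 8-bit.

```python
def bytes_per_minute(sample_rate_hz, bits_per_sample, channels):
    """Size in bytes of one minute of uncompressed audio, ignoring any
    file header. Illustrative helper, not part of any Sun library."""
    bytes_per_sample = bits_per_sample // 8
    return sample_rate_hz * bytes_per_sample * channels * 60

# Voice quality: 8-bit mono at an assumed 8,000 samples per second.
voice = bytes_per_minute(8000, 8, 1)     # 480,000 bytes

# CD quality: 16-bit stereo at 44.1 kHz, as quoted in the text.
cd = bytes_per_minute(44100, 16, 2)      # 10,584,000 bytes

print(f"voice quality: {voice / 1e6:.2f} Mbytes per minute")
print(f"CD quality:    {cd / 1e6:.2f} Mbytes per minute")
```

Under these assumptions one minute of CD-quality audio takes roughly 22 times the space of voice-quality audio, which is why compression matters most at the high-quality end.
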