Text-to-Speech is a term for several types of technology that turn written text into audible speech. It is best known as the primary output method of screen readers, which are used by people who are blind. It is also used by people who are unable to speak, allowing them to write text and have it spoken aloud.

The history of text-to-speech goes back to the 1940s, when phone companies began researching ways to replace human telephone operators with machines. They were also interested in finding more efficient ways to model the human voice, both to allow for clearer audio recordings and to transmit more calls over telephone wires at lower cost. Soon, though, the advent of personal computers would bring text-to-speech into people’s homes to read text or to speak aloud on their behalf.

The first text-to-speech systems were specialized hardware devices that connected to a computer via a serial or parallel port and spoke any text sent to them. This meant that the computer's CPU was free to perform other tasks and no sound card was required. Unfortunately, these hardware text-to-speech systems sounded extremely robotic and did not include the ability to customize the speech.
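
To make the idea of "output any text sent to the device as speech" concrete, here is a minimal sketch of what the computer's side of that exchange might look like. It is an illustration under stated assumptions, not the protocol of any particular device: the function name send_to_synthesizer is hypothetical, the text is assumed to be plain ASCII, and a carriage return is assumed to tell the device to start speaking. An in-memory buffer stands in for the serial port so the example runs without hardware; with a real device, any object offering write() and flush(), such as a pyserial Serial connection, could be passed in instead.

```python
import io


def send_to_synthesizer(port, text, terminator=b"\r"):
    """Write text to a byte stream standing in for the serial connection.

    Assumptions (not taken from any specific device): the synthesizer
    accepts plain ASCII and begins speaking when it sees the terminator.
    """
    # Replace any non-ASCII character with '?' so the payload stays ASCII.
    payload = text.encode("ascii", errors="replace") + terminator
    port.write(payload)
    port.flush()
    # Report how many bytes were handed to the device.
    return len(payload)


# Demonstration with an in-memory buffer instead of real hardware.
fake_port = io.BytesIO()
sent = send_to_synthesizer(fake_port, "Hello, world")
print(sent, fake_port.getvalue())
```

Note how little the computer has to do here: it only hands over bytes, and all of the work of turning them into sound happens inside the external device.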

This is a recording of the Cricket, one of the first hardware text-to-speech devices, connected to an Apple II computer via a serial port. Originally released in the late 1970s, it was one of the first text-to-speech devices to become popular in schools for the blind and in the homes of people who are blind.

As technology progressed, hardware text-to-speech systems became slightly less robotic, easier to understand, and better at pronouncing words. However, they would continue to sound somewhat artificial.

This is a recording of the sales demo built into the Accent Text-to-Speech system, a hardware text-to-speech system commonly used with DOS computers.

As computers became more powerful, specialized hardware text-to-speech systems became less and less common. Instead, the text-to-speech functions were moved into software. While this required computers to have a much more powerful CPU, more memory, and a sound card, it had multiple advantages:

  • The text-to-speech system could be upgraded or changed, without replacing hardware.
  • Users no longer needed to carry around a separate piece of hardware to hear text spoken; instead, the speech was built into the computer.
  • It made text-to-speech systems much cheaper, as they could be distributed entirely digitally.

The most popular software text-to-speech systems in use today are Eloquence, Vocalizer, and eSpeak. eSpeak and Eloquence are highly robotic voices, but users who are blind can speed them up and still understand them at extremely fast rates of speech (sometimes as high as 800 words per minute). Vocalizer is a much higher-quality voice but can’t be understood at quite such high speeds.