A token is the basic unit of text or data that a foundation model processes. It can represent a word, part of a word, or even a single character. When you input text to a model, it is first broken down into these tokens. The model then analyzes and generates text by working with sequences of tokens rather than raw text.
Most modern AI models allow users to interact with them conversationally in plain language. However, in order to process natural language, an AI must internally break a prompt's input down into small chunks called tokens. These tokens are based on common sequences of characters in the input, and they are what the AI actually processes, analyzes, and identifies patterns in. Tokens are also the primary unit by which computation is measured, which is why usage costs are based on token count.
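Because billing is tied to token counts, a rough cost estimate is simply the token count multiplied by a per-token rate. The sketch below assumes a hypothetical price of $0.01 per 1,000 input tokens purely for illustration; actual rates vary by model and provider, so check the relevant pricing page.

```python
def estimate_cost(token_count: int, price_per_1k_tokens: float = 0.01) -> float:
    """Estimate usage cost from a token count.

    price_per_1k_tokens is a hypothetical placeholder rate, not a real
    price; providers publish per-model rates for input and output tokens.
    """
    return token_count / 1000 * price_per_1k_tokens

# A 1,500-token prompt at the assumed rate:
print(f"${estimate_cost(1500):.4f}")  # -> $0.0150
```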
For example, if an AI processes the input string “The quick brown fox jumps over the lazy dog,” it breaks those 44 characters down into around 10 tokens. As a rule of thumb, for text, each token comprises around four characters. At scale, 100 tokens roughly equal 75 words of text. These estimates assume Latin-alphabet text; non-Latin characters, emojis, or binary data may produce more tokens. While token count varies by AI model, you can use OpenAI’s Tokenizer to estimate and visualize how many tokens an input string will produce.
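To see how the sample sentence tokenizes on your own machine, you can use tiktoken, OpenAI's open-source tokenizer library (pip install tiktoken). This is a minimal sketch, assuming the cl100k_base encoding used by GPT-4-era OpenAI models; other models use different vocabularies, so the exact count and chunk boundaries will differ.

```python
import tiktoken

# cl100k_base is the encoding used by GPT-4-era OpenAI models; other
# models and tokenizers will split the text differently.
enc = tiktoken.get_encoding("cl100k_base")

text = "The quick brown fox jumps over the lazy dog."
tokens = enc.encode(text)

print(f"{len(text)} characters -> {len(tokens)} tokens")
# Roughly 10 tokens for this 44-character sentence, i.e. about
# four characters per token, matching the rule of thumb above.

# Decode each token id individually to see the character chunks.
print([enc.decode([t]) for t in tokens])
```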