1. What is a token, really?
For language models like ChatGPT, text isn’t processed as whole words or sentences. Instead, it’s chopped into tokens.

A token might be:

- a full short word: cat
- a piece of a longer word: inter, nation, al
- punctuation: . , ?
- spaces or symbols

Rough rule of thumb (for English):

- ~1 token ≈ ¾ of a word
- 100 tokens ≈ 75 words (very approximate)
So when you see “this model supports 200k tokens,” think: “it can juggle roughly a big book’s worth of text at once.”
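If you want to see the chopping for yourself, tokeniser libraries will show you the exact pieces. A minimal sketch in Python, assuming OpenAI’s open-source tiktoken package (other providers ship their own tokenisers, so the splits and counts will differ):

```python
# pip install tiktoken  (OpenAI's open-source tokeniser; other models use different ones)
import tiktoken

# cl100k_base is one of OpenAI's published encodings (an assumption here;
# pick the encoding that matches the model you actually use).
enc = tiktoken.get_encoding("cl100k_base")

text = "The cat sat on the international space station."
token_ids = enc.encode(text)

print(f"{len(text.split())} words -> {len(token_ids)} tokens")
# Show how the text was chopped up: each token ID decodes back to a small piece of text.
for token_id in token_ids:
    print(token_id, repr(enc.decode([token_id])))
```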
2. Why AI models use tokens instead of words
Models don’t understand language as humans do. They see sequences of numbers representing tokens.
Why tokens (not whole words)?

- Flexibility with any language / style
  - Works with English, code, emojis, hashtags, URLs, weird spacing, etc.
- Efficient vocabulary size
  - Instead of millions of whole words, the model memorises a few tens of thousands of subword pieces.
  - These pieces can be combined to form almost any word.
So the pipeline (simplified) is:
Text → tokeniser → tokens → model → tokens → detokeniser → text
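The first and last steps of that pipeline are easy to make concrete: the tokeniser turns text into integer IDs, and the detokeniser turns IDs back into text. A minimal round-trip sketch, again assuming tiktoken (the model in the middle is the part you actually pay for):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding, as above

text = "Tokens handle emojis 👋, URLs, #hashtags and code too."
token_ids = enc.encode(text)      # text -> tokens (the numbers the model sees)
restored = enc.decode(token_ids)  # tokens -> text (what comes back out)

print(token_ids)             # a list of integers
print(restored == text)      # True: encoding and decoding is lossless
```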
3. How tokens relate to pricing
Cloud AI providers charge by how many tokens you send in and get out.
There are usually two parts:

- Input tokens: all the text you send
  - your prompt
  - system instructions
  - previous conversation history (if included)
- Output tokens: all the tokens the model generates in its reply
The bill is roughly:
Cost = (input tokens × price_in) + (output tokens × price_out)
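As a tiny worked example, here is that formula as code. The prices are made-up placeholders; real per-token prices vary by provider and model, and are usually quoted per million tokens:

```python
# Hypothetical prices, quoted per 1 million tokens (check your provider's price list).
PRICE_IN_PER_M = 3.00    # $ per 1M input tokens
PRICE_OUT_PER_M = 15.00  # $ per 1M output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost = (input tokens x price_in) + (output tokens x price_out)."""
    return (input_tokens * PRICE_IN_PER_M + output_tokens * PRICE_OUT_PER_M) / 1_000_000

# e.g. a 2,000-token prompt with an 800-token reply:
print(f"${estimate_cost(2_000, 800):.4f}")   # 0.006 + 0.012 = $0.0180
```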
Why pricing by token makes sense for providers:

- Cost to run the model grows roughly linearly with token count
  - More tokens in = more computation
  - More tokens out = more computation
- It’s a fair way to:
  - charge light users less
  - charge heavy users more
- It’s also model-agnostic:
  - Doesn’t matter what language you use
  - Doesn’t matter if it’s prose, code, or emojis

For you as a user, that means:

- Long prompts + long responses = more tokens = more cost
- Short, focused prompts = fewer tokens = cheaper & usually faster
4. Context window: why token limits matter
Every model has a maximum context window, like:

- 8k tokens
- 32k tokens
- 200k+ tokens for some “long context” models

This is the total space for all input text + the model’s output.

If you go past that limit:

- the provider will refuse the request, or
- older parts of the conversation will be dropped/trimmed (see the sketch below)
So tokens limit how much the model can “see” at once.
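When you call an API directly, the dropping/trimming is often something you do yourself. A rough sketch of keeping only the most recent turns that fit a token budget, using tiktoken for the counts (real chat formats add a few extra tokens per message, so treat the budget as approximate):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding

def trim_history(messages: list[dict], max_tokens: int) -> list[dict]:
    """Keep the most recent messages whose combined content fits the token budget."""
    kept, used = [], 0
    for msg in reversed(messages):            # walk backwards from the newest turn
        n = len(enc.encode(msg["content"]))
        if used + n > max_tokens:
            break
        kept.append(msg)
        used += n
    return list(reversed(kept))               # restore chronological order

history = [
    {"role": "user", "content": "Long question about solar panels..."},
    {"role": "assistant", "content": "Long answer..."},
    {"role": "user", "content": "Follow-up question."},
]
print(trim_history(history, max_tokens=500))
```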
5. What is token efficiency?
“Token efficiency” just means:
Getting as much useful work as possible out of each token.
For you, that means:

- Spending less money
- Getting faster responses
- Fitting more into the context window
There are two sides:
A. Being efficient as a user
Ways to reduce token usage without losing quality:

- Shorten prompts
  - Remove boilerplate (“Please answer this question in a detailed manner…”) if you’ve already set a style.
  - Use bullet points instead of long paragraphs when possible.
- Avoid resending the whole history
  - For APIs, don’t send your entire conversation every time if you can summarise it.
  - Store a compressed summary as the “memory” instead of all previous turns (see the sketch after this list).
- Use summaries
  - Ask the model to summarise long documents into shorter notes.
  - Then refer back to the summary rather than the full text.
- Be precise
  - A clearer prompt can be shorter and better:
    - Bad: “Tell me everything you know about solar panels.”
    - Better: “In 5 bullet points, explain pros/cons of rooftop solar for a small business in the UK.”
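To make the “compressed summary as memory” idea concrete, here is a rough sketch of keeping a rolling summary and only ever sending that plus the latest message. `summarise_with_model` is a hypothetical helper standing in for whatever API call you use to ask the model for a summary:

```python
def summarise_with_model(text: str) -> str:
    """Hypothetical helper: ask your model of choice for a short summary of `text`."""
    raise NotImplementedError  # replace with a real API call

class ChatMemory:
    """Keep a rolling summary instead of the full conversation history."""

    def __init__(self) -> None:
        self.summary = ""  # compressed "memory" of everything said so far

    def build_prompt(self, new_user_message: str) -> str:
        # Only the summary plus the newest message are sent, not every past turn.
        return (
            f"Conversation so far (summary): {self.summary}\n\n"
            f"User: {new_user_message}"
        )

    def update(self, user_message: str, assistant_reply: str) -> None:
        # Fold the latest exchange into the summary, keeping the memory short.
        self.summary = summarise_with_model(
            f"{self.summary}\nUser: {user_message}\nAssistant: {assistant_reply}"
        )
```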
B. Models becoming more token-efficient
Behind the scenes, researchers and companies are trying to do more with fewer tokens and less compute. Some of the big trends (in simple terms):
- Better tokenisers
  - Smarter ways of chopping text so:
    - common words use fewer tokens
    - scripts like Chinese/Japanese get fairer/easier splits
  - Result: same text → fewer tokens → cheaper & faster.
- Sparse / selective attention
  - Classic models compare every token with every other token, so cost grows with the square of the context size.
  - Newer approaches (various “efficient transformers,” special attention mechanisms, RNN-style hybrids, etc.) selectively focus on the most relevant tokens, giving:
    - much longer context windows
    - less compute per token in huge contexts
- Retrieval instead of stuffing
  - Rather than pushing a 100-page document into the prompt, systems:
    - store it in a database
    - pull out just the few relevant chunks at query time
  - This is Retrieval-Augmented Generation (RAG) in simple terms: “look things up on demand instead of carrying everything in memory.” (See the sketch at the end of this section.)
- Compression / summarisation
  - Using models to compress long chats and big docs into compact summaries that preserve key info with far fewer tokens.
- Smaller + smarter models
  - “Distilled” or fine-tuned models that:
    - use fewer parameters
    - need fewer tokens
    - still perform well on specific tasks
  - Think: “tiny specialist” instead of “huge generalist” for certain workloads.
All of this is about cutting cost per useful answer, not just cost per raw token.
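To see the “retrieval instead of stuffing” idea in miniature: split a document into chunks, score each chunk against the question, and put only the top few chunks into the prompt. This sketch uses crude word overlap for scoring and a placeholder file name; real RAG systems use embeddings and a vector database, but the token-saving principle is the same:

```python
def chunk(text: str, chunk_size: int = 200) -> list[str]:
    """Split a long document into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def top_chunks(question: str, chunks: list[str], k: int = 3) -> list[str]:
    """Score chunks by word overlap with the question and keep the best k."""
    q_words = set(question.lower().split())
    ranked = sorted(chunks, key=lambda c: len(q_words & set(c.lower().split())), reverse=True)
    return ranked[:k]

question = "What were the rooftop solar sales figures?"
document = open("big_report.txt").read()   # placeholder: a 100-page document you don't want to resend
relevant = top_chunks(question, chunk(document))

# Only the few relevant chunks go into the prompt, not the whole document.
prompt = "Answer using these excerpts:\n\n" + "\n---\n".join(relevant) + f"\n\nQuestion: {question}"
```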
6. How to personally do more with fewer tokens
Concrete habits you can use right away:
- Front-load instructions once
  - “From now on, answer in UK English, concise, technical but layperson-friendly.”
  - Then stop repeating that every time.
- Use references instead of repetition
  - “Using the same assumptions as before, now calculate X…”
  - When using APIs, store those assumptions in your own app and re-send a summary.
- Ask for structured outputs
  - Tables, bullet points, JSON.
  - Easier to reuse, and often shorter than rambling prose.
- Incremental refinement (see the sketch below)
  - First: “Give me a short outline.”
  - Then: “Expand section 2 only.”
  - Avoid: “Write the whole 10,000-word thing in one go”; that’s a massive token hit.
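A minimal sketch of the incremental-refinement habit against a chat API, assuming the OpenAI Python SDK (`pip install openai`); the model name is a placeholder, and any chat-completion-style API works the same way:

```python
from openai import OpenAI

client = OpenAI()        # reads OPENAI_API_KEY from the environment
MODEL = "gpt-4o-mini"    # placeholder; use whatever model you actually have access to

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Step 1: a cheap, short outline instead of the full piece.
outline = ask("Give me a short outline for a report on rooftop solar for small UK businesses.")

# Step 2: expand only the part you actually need next.
section_two = ask(f"Here is the outline:\n{outline}\n\nExpand section 2 only, in about 300 words.")
```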
7. Mental model to keep
You can think of tokens like:
- SMS characters for old text messages
  - Longer text = more “segments” = higher cost
- Electricity usage
  - Every token is a tiny bit of compute “energy”
  - More tokens = more energy = more cost
So: Write like someone paying by the character, but still demanding clarity.