Token Counting
The token counting feature helps you estimate how many tokens your code will use when sent to AI models like GPT-4. This is crucial for staying within context limits and optimizing your interactions with AI models.
Overview
When you run promptprep with the --count-tokens option (which requires --metadata), it will:
Process your code files as usual
Count the number of tokens in the aggregated output
Include this information in the metadata section
This helps you understand how much of an AI model’s context window your code will consume.
What are Tokens?
Tokens are the basic units in which AI models like GPT-4 read and generate text. A token can be as short as a single character or as long as a whole word. As a rule of thumb for English text:
1 token ≈ 4 characters
1 token ≈ 0.75 words
For code, the tokenization is more complex due to special characters, indentation, and syntax.
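The two rules of thumb above can be turned into a quick back-of-the-envelope estimate. The sketch below is only an illustration: the function name `estimate_tokens` is hypothetical and not part of promptprep, and the result is a rough guess for English prose, not a real tokenizer count. For code, the actual count is often noticeably different.

```python
import math


def estimate_tokens(text: str) -> dict:
    """Return two rough token estimates for English text.

    Hypothetical helper, not part of promptprep. It applies the rules of
    thumb only (1 token ~ 4 characters, 1 token ~ 0.75 words).
    """
    by_chars = math.ceil(len(text) / 4)              # 1 token ~ 4 characters
    by_words = math.ceil(len(text.split()) / 0.75)   # 1 token ~ 0.75 words
    return {"by_chars": by_chars, "by_words": by_words}


sample = "Tokens are the basic units that models process."
print(estimate_tokens(sample))
```

When the two estimates disagree strongly, that is usually a sign that the text is not ordinary prose, and only a real tokenizer run will give a usable number.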
Basic Usage
To count tokens in your code:
promptprep --metadata --count-tokens [other options]
Example:
promptprep -d ./my_project --metadata --count-tokens -o tokenized_output.txt
The output will include token count information in the metadata section:
# Code Aggregation - my_project
## Metadata
- Files processed: 10
- Total lines of code: 1,250
- Total tokens: 15,678
- Estimated GPT-4 context usage: 7.8%
## Directory Structure
...
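If you want to use the token count in a script, you can read it back from the generated file. The sketch below assumes the metadata line keeps the `- Total tokens: N` shape shown in the sample output above; the helper name `read_total_tokens` is hypothetical, and the exact output format may differ between promptprep versions.

```python
import re
from typing import Optional


def read_total_tokens(report: str) -> Optional[int]:
    """Extract the token total from promptprep output text.

    Assumes a metadata line of the form "- Total tokens: 15,678".
    Returns None when no such line is found.
    """
    match = re.search(
        r"^\s*-\s*Total tokens:\s*([\d,]+)", report, flags=re.MULTILINE
    )
    if match is None:
        return None
    # Drop the thousands separators before converting to an integer.
    return int(match.group(1).replace(",", ""))
```

A typical use would be to read `tokenized_output.txt` with `open(...).read()` and pass the text to the helper.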
Specifying a Token Model
Different AI models use different tokenizers. By default, promptprep uses the cl100k_base tokenizer (used by GPT-4), but you can specify a different one with the --token-model option:
promptprep --metadata --count-tokens --token-model MODEL [other options]
Available models:
cl100k_base: Used by GPT-4 (default)
p50k_base: Used by GPT-3.5 completion models such as text-davinci-003 (gpt-3.5-turbo uses cl100k_base)
r50k_base: Used by earlier GPT models
Example:
promptprep -d . --metadata --count-tokens --token-model p50k_base -o gpt35_tokens.txt
Context Window Estimation
The metadata includes an estimation of how much of the AI model’s context window your code will consume. This is based on the following context window sizes:
GPT-4: 8,192 tokens (base model) or 32,768 tokens (extended context)
GPT-3.5: 4,096 tokens
The estimation helps you understand if your code will fit within the model’s limits or if you need to reduce the amount of code you’re sending.
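The check itself is simple arithmetic. The sketch below assumes the usage figure is a plain ratio of token count to window size; how promptprep computes its own percentage is not documented here, so treat this as an illustration. The dictionary keys and function names are hypothetical labels, and the window sizes are the ones listed above.

```python
# Context window sizes as listed in this section (labels are illustrative).
CONTEXT_WINDOWS = {
    "gpt-4": 8_192,
    "gpt-4-32k": 32_768,
    "gpt-3.5": 4_096,
}


def context_usage(token_count: int, window: int) -> float:
    """Return the share of the context window used, as a percentage."""
    if window <= 0:
        raise ValueError("window must be a positive number of tokens")
    return 100.0 * token_count / window


def fits(token_count: int, window: int, reserve: int = 0) -> bool:
    """Check whether the code fits, leaving `reserve` tokens for the
    question and the model's reply."""
    return token_count + reserve <= window


print(f"{context_usage(2_048, CONTEXT_WINDOWS['gpt-4']):.1f}%")  # prints 25.0%
```

Leaving a reserve matters in practice, because the prompt text and the model's answer share the same window as your code.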
Optimizing Token Usage
If your code exceeds or approaches the token limit, consider these strategies:
Use Summary Mode: Extract only function/class signatures and docstrings:
promptprep -d . --summary-mode --metadata --count-tokens
Focus on Specific Files: Include only the most relevant files:
promptprep -i "src/main.py,src/utils.py" --metadata --count-tokens
Exclude Comments: Strip comments to reduce token count:
promptprep -d . --no-include-comments --metadata --count-tokens
Filter by Extension: Include only specific file types:
promptprep -d . -x ".py" --metadata --count-tokens
Exclude Directories: Skip directories that aren’t relevant:
promptprep -d . -e "tests,docs,examples" --metadata --count-tokens
Advanced Use Cases
Working with GPT-4
When preparing code for GPT-4, you can check the token count and copy the output in a single step:
promptprep -d . --metadata --count-tokens -c
This will copy the output to your clipboard, ready to paste into a GPT-4 conversation, with token count information to help you stay within limits.
Comparing Token Efficiency
Compare the token efficiency of different code versions:
# Before optimization
promptprep -d . --metadata --count-tokens -o before.txt
# After optimization (e.g., removing comments, unused code)
promptprep -d . --metadata --count-tokens -o after.txt
This helps you see how your optimizations affect token usage.
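Once you have the two totals from the metadata of before.txt and after.txt, the saving is a one-line calculation. The helper below is a hypothetical sketch, not a promptprep feature, and the numbers in the example are made up for illustration.

```python
from typing import Tuple


def token_savings(before: int, after: int) -> Tuple[int, float]:
    """Return (tokens saved, percentage saved) between two runs."""
    saved = before - after
    # Guard against division by zero for an empty "before" run.
    percent = 100.0 * saved / before if before else 0.0
    return saved, percent


saved, percent = token_savings(20_000, 15_000)  # illustrative totals
print(f"Saved {saved} tokens ({percent:.1f}%)")  # prints Saved 5000 tokens (25.0%)
```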
Project Planning
Use token counting to plan how to split a large project when working with AI models:
# Count tokens for each module
promptprep -d ./module1 --metadata --count-tokens -o module1.txt
promptprep -d ./module2 --metadata --count-tokens -o module2.txt
promptprep -d ./module3 --metadata --count-tokens -o module3.txt
This helps you decide which modules can be processed together and which need to be handled separately.
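With a token total per module, deciding what to send together is a small bin-packing problem. The sketch below uses a first-fit decreasing heuristic, which is one reasonable choice rather than anything promptprep does for you. The function name, module names, token counts, and budget are all hypothetical.

```python
from typing import Dict, List


def group_modules(token_counts: Dict[str, int], budget: int) -> List[List[str]]:
    """Group modules into batches that each stay within `budget` tokens.

    First-fit decreasing: place the largest modules first, each into the
    first batch that still has room, opening a new batch when none does.
    """
    batches: List[List[str]] = []
    remaining: List[int] = []  # free tokens left in each batch
    ordered = sorted(token_counts.items(), key=lambda kv: kv[1], reverse=True)
    for name, count in ordered:
        if count > budget:
            raise ValueError(f"{name} alone exceeds the budget of {budget} tokens")
        for i, free in enumerate(remaining):
            if count <= free:
                batches[i].append(name)
                remaining[i] -= count
                break
        else:
            # No existing batch has room, so start a new one.
            batches.append([name])
            remaining.append(budget - count)
    return batches


counts = {"module1": 5_000, "module2": 3_000, "module3": 2_500}  # made-up totals
print(group_modules(counts, budget=8_192))
# prints [['module1', 'module2'], ['module3']]
```

A module that exceeds the budget on its own raises an error here, which is a signal to reduce it first, for example with summary mode or by excluding files.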
Technical Details
Token Counting Implementation
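promptprep uses the tiktoken library from OpenAI to count tokens. tiktoken implements the same tokenizers that OpenAI’s models use, so the count for the aggregated text should match what the model sees. Chat-style requests add a few formatting tokens per message, so treat the total as a close estimate rather than an exact figure.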
The token counting process:
Aggregates all the code into a single text
Applies the selected tokenizer
Counts the resulting tokens
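The three steps above can be sketched in a few lines with tiktoken. This is a minimal illustration of the technique, not promptprep's actual code, which is not shown in this document. The function name `count_tokens` and its optional `encoding` parameter are assumptions made for the example; `tiktoken.get_encoding` and `Encoding.encode` are real tiktoken calls.

```python
def count_tokens(text: str, encoding_name: str = "cl100k_base", encoding=None) -> int:
    """Count tokens in already-aggregated text.

    `encoding` may be any object with an `encode(text)` method that returns
    a sequence of tokens. When it is omitted, the named tiktoken encoding is
    loaded (tiktoken may download the encoding data on first use).
    """
    if encoding is None:
        import tiktoken  # third-party package, imported only when needed

        encoding = tiktoken.get_encoding(encoding_name)
    # Step 2: apply the tokenizer. Step 3: count the resulting tokens.
    return len(encoding.encode(text))
```

With tiktoken installed, `count_tokens(aggregated_text, "p50k_base")` would count with a different tokenizer. Passing the encoding in as a parameter also makes the function easy to test without the real library.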
Dependencies
Token counting requires the tiktoken package, which is included as a dependency when you install promptprep.
Performance Considerations
Token counting adds some processing overhead, especially for large codebases. The impact depends on:
The size of your codebase
The number of files
The complexity of the code
For very large projects, token counting might noticeably increase processing time.
Best Practices
Always Use with Metadata: The --count-tokens option requires --metadata to display the results.
Choose the Right Tokenizer: Use the tokenizer that matches the AI model you’re targeting.
Monitor Context Usage: Pay attention to the estimated context usage to avoid hitting limits.
Combine with Summary Mode: For large codebases, use summary mode to reduce token count while preserving structure.
Focus on Relevant Code: Only include the files that are necessary for your specific question or task.
Troubleshooting
If token counting isn’t working as expected:
Check Dependencies: Ensure the tiktoken package is installed.
Verify Metadata Option: Make sure you’re including the --metadata option.
Check Tokenizer Availability: Some tokenizers might require additional dependencies.
Consider File Encoding: Non-standard file encodings might affect token counting accuracy.