Token Counting

The token counting feature estimates how many tokens your code will use when sent to AI models like GPT-4. Knowing the count helps you stay within a model’s context limit and decide how much code to include in a prompt.

Overview

When you run promptprep with the --count-tokens option (which requires --metadata), it will:

Process your code files as usual
Count the number of tokens in the aggregated output
Include this information in the metadata section

This helps you understand how much of an AI model’s context window your code will consume.

What are Tokens?

Tokens are the basic units in which AI models like GPT-4 process text. A token can be as short as a single character or as long as a whole word. As a rule of thumb for English text:

1 token ≈ 4 characters
1 token ≈ 0.75 words

For code, these ratios are less reliable, because special characters, indentation, and syntax tokenize differently from prose.
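The rules of thumb above can be turned into a quick estimate when no tokenizer is at hand. The sketch below is illustrative only: `estimate_tokens` is a hypothetical helper, not part of promptprep, and its result is a rough guess for English prose that will drift for code.

```python
def estimate_tokens(text: str) -> dict:
    """Rough token estimates from the ~4 chars/token and ~0.75 words/token rules."""
    chars = len(text)
    words = len(text.split())
    return {
        "by_chars": round(chars / 4),     # 1 token is roughly 4 characters
        "by_words": round(words / 0.75),  # 1 token is roughly 0.75 words
    }


sample = "The quick brown fox jumps over the lazy dog."
print(estimate_tokens(sample))
```

The two estimates rarely agree exactly; treat the gap between them as a reminder that only a real tokenizer gives a dependable count.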

Basic Usage

To count tokens in your code:

promptprep --metadata --count-tokens [other options]

Example:

promptprep -d ./my_project --metadata --count-tokens -o tokenized_output.txt

The output will include token count information in the metadata section:

# Code Aggregation - my_project

## Metadata

- Files processed: 10
- Total lines of code: 1,250
- Total tokens: 15,678
- Estimated GPT-4 context usage: 7.8%

## Directory Structure
...

Specifying a Token Model

Different AI models use different tokenizers. By default, promptprep uses the cl100k_base tokenizer (used by GPT-4), but you can specify a different one with the --token-model option:

promptprep --metadata --count-tokens --token-model MODEL [other options]

Available models:

cl100k_base: Used by GPT-4 and gpt-3.5-turbo (default)
p50k_base: Used by Codex models and the text-davinci-002/003 models
r50k_base: Used by earlier GPT-3 models such as davinci

Example:

promptprep -d . --metadata --count-tokens --token-model p50k_base -o gpt35_tokens.txt

Context Window Estimation

The metadata includes an estimate of how much of the model’s context window your code will consume. The estimate is based on the following context window sizes:

GPT-4: 8,192 tokens (base model) or 32,768 tokens (extended context)
GPT-3.5: 4,096 tokens

The estimate tells you whether your code will fit within the model’s limit or whether you need to send less of it. The context window also has to hold your prompt and the model’s reply, so leave some headroom.
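The percentage is simply the token count divided by the window size. The sketch below shows that arithmetic with the sizes listed above. It is an assumption that promptprep computes its figure the same way, and the dictionary keys and the `context_usage` function are hypothetical names chosen for illustration.

```python
# Context window sizes as listed above; the keys are illustrative labels.
CONTEXT_WINDOWS = {
    "gpt-4": 8_192,
    "gpt-4-32k": 32_768,
    "gpt-3.5": 4_096,
}


def context_usage(token_count: int, model: str = "gpt-4") -> float:
    """Return the share of the model's context window used, as a percentage."""
    window = CONTEXT_WINDOWS[model]
    return round(100 * token_count / window, 1)


for model in CONTEXT_WINDOWS:
    pct = context_usage(3_000, model)
    status = "fits" if pct <= 100 else "exceeds the window"
    print(f"{model}: {pct}% ({status})")
```

A value above 100% means the aggregated code alone is larger than the window, before any prompt or reply is counted.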

Optimizing Token Usage

If your code exceeds or approaches the token limit, consider these strategies:

Use Summary Mode: Extract only function/class signatures and docstrings:
```
promptprep -d . --summary-mode --metadata --count-tokens
```

Focus on Specific Files: Include only the most relevant files:

promptprep -i "src/main.py,src/utils.py" --metadata --count-tokens

Exclude Comments: Strip comments to reduce token count:

promptprep -d . --no-include-comments --metadata --count-tokens

Filter by Extension: Include only specific file types:

promptprep -d . -x ".py" --metadata --count-tokens

Exclude Directories: Skip directories that aren’t relevant:

promptprep -d . -e "tests,docs,examples" --metadata --count-tokens

Advanced Use Cases

Working with GPT-4

When preparing code for GPT-4, you can send the output straight to your clipboard:

promptprep -d . --metadata --count-tokens -c

The -c option copies the output to your clipboard, ready to paste into a GPT-4 conversation. The metadata section includes the token count, so you can check that the text fits before you paste it.

Comparing Token Efficiency

To measure how a change affects token usage, run the same command before and after making the change, writing each result to its own file:

# Before optimization
promptprep -d . --metadata --count-tokens -o before.txt

# After optimization (e.g., removing comments, unused code)
promptprep -d . --metadata --count-tokens -o after.txt

Comparing the "Total tokens" lines in before.txt and after.txt shows how much your optimizations saved.
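If you compare runs often, the difference can be computed rather than read off by eye. The following sketch assumes the metadata line has the form shown in the sample output earlier (`- Total tokens: 15,678`). The helper names are hypothetical, and the numbers in the demo are made up.

```python
import re


def total_tokens(report: str) -> int:
    """Extract the number from a 'Total tokens: 15,678' metadata line."""
    match = re.search(r"Total tokens:\s*([\d,]+)", report)
    if match is None:
        raise ValueError("no 'Total tokens' line found")
    return int(match.group(1).replace(",", ""))


def token_reduction(before: str, after: str) -> float:
    """Percentage by which the token count dropped from before to after."""
    b, a = total_tokens(before), total_tokens(after)
    return round(100 * (b - a) / b, 1)


# Inline strings stand in for the contents of before.txt and after.txt.
before = "## Metadata\n- Total tokens: 15,678\n"
after = "## Metadata\n- Total tokens: 12,000\n"
print(f"Reduction: {token_reduction(before, after)}%")
```

With real output files, read each one into a string (for example with `pathlib.Path("before.txt").read_text()`) and pass the two strings to `token_reduction`.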

Project Planning

Use token counting to plan how to split a large project when working with AI models:

# Count tokens for each module
promptprep -d ./module1 --metadata --count-tokens -o module1.txt
promptprep -d ./module2 --metadata --count-tokens -o module2.txt
promptprep -d ./module3 --metadata --count-tokens -o module3.txt

This helps you decide which modules can be processed together and which need to be handled separately.
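Once you have a token count per module, grouping them is a small bin-packing problem. The sketch below uses a first-fit-decreasing heuristic, which is one reasonable choice and not something promptprep does for you. `plan_batches` is a hypothetical helper and the counts are invented.

```python
def plan_batches(module_tokens: dict[str, int], budget: int) -> list[list[str]]:
    """Group modules into batches whose token totals stay within budget."""
    batches: list[list[str]] = []
    totals: list[int] = []
    # Place the largest modules first so the small ones can fill the gaps.
    by_size = sorted(module_tokens.items(), key=lambda kv: kv[1], reverse=True)
    for name, tokens in by_size:
        if tokens > budget:
            raise ValueError(f"{name} alone exceeds the budget of {budget} tokens")
        for i, used in enumerate(totals):
            if used + tokens <= budget:
                # First existing batch with enough room takes the module.
                batches[i].append(name)
                totals[i] += tokens
                break
        else:
            # No batch had room, so start a new one.
            batches.append([name])
            totals.append(tokens)
    return batches


counts = {"module1": 5_000, "module2": 2_500, "module3": 3_000}  # made-up numbers
print(plan_batches(counts, budget=8_192))
```

In practice, set the budget somewhat below the model’s context window so there is room left for your question and the model’s answer.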

Technical Details

Token Counting Implementation

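promptprep uses OpenAI’s tiktoken library to count tokens. tiktoken implements the same byte-pair encodings that OpenAI’s models use, so the count for the aggregated text should match what the model sees, provided you select the tokenizer that matches your target model.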

The token counting process:

Aggregates all the code into a single text
Applies the selected tokenizer
Counts the resulting tokens
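The three steps above can be sketched as follows. This is not promptprep’s actual code: `count_tokens` and `tiktoken_encoder` are hypothetical names, while `tiktoken.get_encoding` and `Encoding.encode` are real tiktoken calls. The tokenizer is passed in as a function so the demo runs even where tiktoken is not installed.

```python
from typing import Callable, Iterable, Sequence


def count_tokens(texts: Iterable[str], encode: Callable[[str], Sequence]) -> int:
    """Aggregate the texts, apply the tokenizer, and count the resulting tokens."""
    aggregated = "\n".join(texts)  # 1. aggregate all code into a single text
    tokens = encode(aggregated)    # 2. apply the selected tokenizer
    return len(tokens)             # 3. count the resulting tokens


def tiktoken_encoder(name: str = "cl100k_base") -> Callable[[str], list]:
    """Build an encode function backed by tiktoken (imported lazily)."""
    import tiktoken

    encoding = tiktoken.get_encoding(name)
    # disallowed_special=() makes strings such as "<|endoftext|>" inside source
    # files count as ordinary text instead of raising an error.
    return lambda text: encoding.encode(text, disallowed_special=())


files = ["def add(a, b):\n    return a + b\n", "print(add(1, 2))\n"]
# Whitespace splitting stands in for a real tokenizer here;
# pass tiktoken_encoder() instead to get real token counts.
print(count_tokens(files, str.split))
```

Swapping `str.split` for `tiktoken_encoder("p50k_base")` or another encoding name shows how the count changes with the tokenizer.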

Dependencies

Token counting requires the tiktoken package, which is included as a dependency when you install promptprep.

Performance Considerations

Token counting adds some processing overhead, especially for large codebases. The impact depends on:

The size of your codebase
The number of files
The complexity of the code

For very large projects, token counting might noticeably increase processing time.

Best Practices

Always Use with Metadata: The --count-tokens option requires --metadata to display the results.
Choose the Right Tokenizer: Use the tokenizer that matches the AI model you’re targeting.
Monitor Context Usage: Pay attention to the estimated context usage to avoid hitting limits.
Combine with Summary Mode: For large codebases, use summary mode to reduce token count while preserving structure.
Focus on Relevant Code: Only include the files that are necessary for your specific question or task.

Troubleshooting

If token counting isn’t working as expected:

Check Dependencies: Ensure the tiktoken package is installed.
Verify Metadata Option: Make sure you’re including the --metadata option.
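Check Tokenizer Availability: Some tokenizers might require additional dependencies. tiktoken also downloads a tokenizer’s encoding data the first time it is used, so the first run with a given tokenizer needs network access.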
Consider File Encoding: Non-standard file encodings might affect token counting accuracy.
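Two of these checks are easy to script. The sketch below tests whether tiktoken is importable and whether a file’s bytes decode cleanly as UTF-8. Treating UTF-8 as the expected encoding is an assumption here, and both function names are hypothetical.

```python
import importlib.util


def tiktoken_installed() -> bool:
    """True if the tiktoken package can be imported in this environment."""
    return importlib.util.find_spec("tiktoken") is not None


def is_utf8(data: bytes) -> bool:
    """True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print("tiktoken installed:", tiktoken_installed())
print("UTF-8 bytes accepted:", is_utf8("naïve".encode("utf-8")))
print("Latin-1 bytes accepted:", is_utf8("naïve".encode("latin-1")))
```

To check a real file, read it in binary mode (for example `pathlib.Path("src/main.py").read_bytes()`) and pass the result to `is_utf8`.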