Aggregator Module

The aggregator module is the core component of promptprep that handles file scanning, content extraction, and aggregation.

Module Overview

The aggregator module provides functionality for:

  • Scanning directories for code files

  • Filtering files based on various criteria

  • Extracting content from files

  • Generating directory trees

  • Processing files incrementally

  • Generating diffs between versions

Key Classes and Functions

CodeAggregator

class promptprep.aggregator.CodeAggregator(directory='.', output_file='full_code.txt', include_files=None, exclude_dirs=None, extensions=None, max_file_size=100.0, include_comments=True, summary_mode=False, line_numbers=False, incremental=False, last_run_timestamp=None)

The main class responsible for aggregating code files.

Parameters:
  • directory (str) – The directory to scan for code files (default: current directory)

  • output_file (str) – The file to save the output to (default: 'full_code.txt')

  • include_files (list) – List of specific files to include (default: None)

  • exclude_dirs (list) – List of directories to exclude (default: None)

  • extensions (list) – List of file extensions to include (default: None)

  • max_file_size (float) – Maximum file size in MB to include (default: 100.0)

  • include_comments (bool) – Whether to include comments in the output (default: True)

  • summary_mode (bool) – Whether to extract only signatures and docstrings (default: False)

  • line_numbers (bool) – Whether to add line numbers to the output (default: False)

  • incremental (bool) – Whether to process files incrementally (default: False)

  • last_run_timestamp (float) – Timestamp of the last run for incremental processing (default: None)

scan_directory()

Scan the directory for code files based on the configured filters.

Returns:

A list of file paths that match the criteria

Return type:

list
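The filtering described above can be sketched as follows. This is a minimal standalone illustration, not the library's implementation: the function takes the filters as arguments instead of reading them from a CodeAggregator instance, and the assumption that excluded directories are matched by name and that max_file_size is converted to bytes as MB × 1024 × 1024 is ours.

```python
import os


def scan_directory(directory, exclude_dirs=None, extensions=None, max_file_size=100.0):
    """Return sorted paths under `directory` that pass the filters (illustrative sketch)."""
    exclude = set(exclude_dirs or [])
    exts = tuple(extensions) if extensions else None
    max_bytes = max_file_size * 1024 * 1024  # assumption: limit is given in MB

    matches = []
    for root, dirs, files in os.walk(directory):
        # Prune excluded directories in place so os.walk never descends into them.
        dirs[:] = [d for d in dirs if d not in exclude]
        for name in files:
            path = os.path.join(root, name)
            if exts and not name.endswith(exts):
                continue  # extension not in the allowed list
            if os.path.getsize(path) > max_bytes:
                continue  # file is larger than the size limit
            matches.append(path)
    return sorted(matches)
```

Pruning `dirs` in place is what keeps large excluded trees such as `venv` or `node_modules` from being walked at all.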

generate_directory_tree()

Generate an ASCII representation of the directory structure.

Returns:

ASCII directory tree

Return type:

str

process_file(file_path)

Process a single file and extract its content.

Parameters:

file_path (str) – Path to the file to process

Returns:

Processed content of the file

Return type:

str

aggregate_code()

Aggregate code from all matching files.

Returns:

Aggregated code with directory tree and file headers

Return type:

str

save_output(content)

Save the aggregated content to the output file.

Parameters:

content (str) – The content to save

Returns:

None

generate_metadata()

Generate metadata about the processed files.

Returns:

Metadata as a formatted string

Return type:

str

count_tokens(content, model='cl100k_base')

Count the number of tokens in the content.

Parameters:
  • content (str) – The content to count tokens in

  • model (str) – The tokenizer model to use (default: 'cl100k_base')

Returns:

Number of tokens

Return type:

int
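The default value 'cl100k_base' matches the name of an encoding shipped with the third-party tiktoken package, so a token counter along these lines is plausible. The source does not confirm which tokenizer library promptprep uses; the tiktoken dependency and the word-count fallback below are both assumptions made for illustration.

```python
def count_tokens(content, model="cl100k_base"):
    """Count tokens with tiktoken if it is usable, else approximate by words (sketch)."""
    try:
        import tiktoken  # third-party and optional; assumed, not confirmed by the docs

        encoding = tiktoken.get_encoding(model)
        return len(encoding.encode(content))
    except Exception:
        # Fallback (assumption): tiktoken is missing, the encoding cannot be loaded,
        # or the text was rejected. Use whitespace-separated words as a rough proxy.
        return len(content.split())
```

The fallback only gives a rough estimate; word counts and token counts can differ considerably for source code.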

generate_diff(prev_file, context_lines=3)

Generate a diff between the current output and a previous output file.

Parameters:
  • prev_file (str) – Path to the previous output file

  • context_lines (int) – Number of context lines to include in the diff (default: 3)

Returns:

Diff as a formatted string

Return type:

str

FileProcessor

class promptprep.aggregator.FileProcessor(include_comments=True, summary_mode=False, line_numbers=False)

Class responsible for processing individual files.

Parameters:
  • include_comments (bool) – Whether to include comments in the output (default: True)

  • summary_mode (bool) – Whether to extract only signatures and docstrings (default: False)

  • line_numbers (bool) – Whether to add line numbers to the output (default: False)

process_file(file_path)

Process a file and extract its content based on the configured options.

Parameters:

file_path (str) – Path to the file to process

Returns:

Processed content of the file

Return type:

str

extract_summary(content, file_ext)

Extract function/class signatures and docstrings from the content.

Parameters:
  • content (str) – The file content

  • file_ext (str) – The file extension

Returns:

Extracted summary

Return type:

str
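For Python sources, signature and docstring extraction of this kind can be sketched with the standard ast module. This is an illustration under stated assumptions, not the library's method: returning non-Python content unchanged, keeping only the first docstring line, and the exact output layout are all choices made here. It requires Python 3.9 or later for ast.unparse.

```python
import ast


def extract_summary(content, file_ext):
    """Return signatures and first docstring lines for Python source (sketch)."""
    if file_ext != ".py":
        return content  # assumption: other file types pass through unchanged

    wanted = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
    nodes = [n for n in ast.walk(ast.parse(content)) if isinstance(n, wanted)]
    nodes.sort(key=lambda n: n.lineno)  # ast.walk is breadth-first; restore source order

    lines = []
    for node in nodes:
        indent = " " * node.col_offset  # keep the nesting of methods visible
        if isinstance(node, ast.ClassDef):
            lines.append(f"{indent}class {node.name}:")
        else:
            keyword = "async def" if isinstance(node, ast.AsyncFunctionDef) else "def"
            lines.append(f"{indent}{keyword} {node.name}({ast.unparse(node.args)}):")
        doc = ast.get_docstring(node)
        if doc:
            first_line = doc.splitlines()[0]
            lines.append(f'{indent}    """{first_line}"""')
    return "\n".join(lines)
```

Parsing with ast avoids the false matches a regular expression would produce on strings or comments that merely look like definitions, at the cost of failing on files with syntax errors.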

add_line_numbers(content)

Add line numbers to the content.

Parameters:

content (str) – The content to add line numbers to

Returns:

Content with line numbers

Return type:

str
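Line numbering can be sketched in a few lines. The right-aligned number followed by " | " is an assumed format chosen for illustration; the source does not specify how promptprep renders the prefix.

```python
def add_line_numbers(content):
    """Prefix each line with a right-aligned, 1-based line number (sketch)."""
    lines = content.splitlines()
    width = len(str(len(lines)))  # pad so the separators line up
    return "\n".join(
        f"{number:>{width}} | {line}" for number, line in enumerate(lines, start=1)
    )
```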

DirectoryTreeGenerator

class promptprep.aggregator.DirectoryTreeGenerator(root_dir, exclude_dirs=None, include_files=None)

Class responsible for generating ASCII directory trees.

Parameters:
  • root_dir (str) – The root directory to generate the tree for

  • exclude_dirs (list) – List of directories to exclude (default: None)

  • include_files (list) – List of specific files to include (default: None)

generate_tree()

Generate an ASCII representation of the directory structure.

Returns:

ASCII directory tree

Return type:

str
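An ASCII tree of the kind described above can be built with a short recursive walk. The sketch below is a standalone function rather than the class method, it ignores include_files, and the connector characters ("|-- " and "`-- ") are an assumed rendering, not necessarily what promptprep emits.

```python
import os


def generate_tree(root_dir, exclude_dirs=None):
    """Render the directory structure under `root_dir` as ASCII text (sketch)."""
    exclude = set(exclude_dirs or [])

    def walk(path, prefix):
        # Any entry whose name is in `exclude` is skipped.
        entries = sorted(e for e in os.listdir(path) if e not in exclude)
        rows = []
        for index, name in enumerate(entries):
            last = index == len(entries) - 1
            connector = "`-- " if last else "|-- "
            rows.append(f"{prefix}{connector}{name}")
            child = os.path.join(path, name)
            if os.path.isdir(child):
                # Children of the last entry need no vertical guide line.
                rows.extend(walk(child, prefix + ("    " if last else "|   ")))
        return rows

    root_name = os.path.basename(os.path.abspath(root_dir))
    return "\n".join([root_name + "/"] + walk(root_dir, ""))
```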

IncrementalProcessor

class promptprep.aggregator.IncrementalProcessor(last_run_timestamp=None)

Class responsible for incremental processing: deciding which files have changed since the last run.

Parameters:

last_run_timestamp (float) – Timestamp of the last run (default: None)

should_process_file(file_path, prev_output_file=None)

Determine if a file should be processed based on its modification time.

Parameters:
  • file_path (str) – Path to the file to check

  • prev_output_file (str) – Path to the previous output file (default: None)

Returns:

Whether the file should be processed

Return type:

bool
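The modification-time check can be sketched as below. This is a simplified standalone function: it takes last_run_timestamp as an argument instead of reading it from the instance, and it omits the prev_output_file fallback. Treating a missing timestamp as "process everything" is an assumption.

```python
import os


def should_process_file(file_path, last_run_timestamp=None):
    """Return True if the file changed after the last run (sketch)."""
    if last_run_timestamp is None:
        return True  # assumption: with no earlier run recorded, process every file
    # st_mtime and time.time() both use seconds since the epoch, so they compare directly.
    return os.path.getmtime(file_path) > last_run_timestamp
```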

extract_timestamp_from_file(file_path)

Extract the timestamp from a previous output file.

Parameters:

file_path (str) – Path to the file to extract the timestamp from

Returns:

Extracted timestamp or None if not found

Return type:

float or None

DiffGenerator

class promptprep.aggregator.DiffGenerator(context_lines=3)

Class responsible for generating diffs between versions.

Parameters:

context_lines (int) – Number of context lines to include in the diff (default: 3)

generate_diff(current_content, prev_file)

Generate a diff between the current content and a previous output file.

Parameters:
  • current_content (str) – The current content

  • prev_file (str) – Path to the previous output file

Returns:

Diff as a formatted string

Return type:

str
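A diff with a configurable number of context lines can be produced with the standard difflib module. The sketch below is an illustration, not the library's code: it accepts the previous content as a string rather than a file path to stay self-contained, and the choice of unified diff format and the "previous"/"current" labels are assumptions.

```python
import difflib


def generate_diff(current_content, prev_content, context_lines=3):
    """Return a unified diff from the previous content to the current content (sketch)."""
    diff = difflib.unified_diff(
        prev_content.splitlines(),
        current_content.splitlines(),
        fromfile="previous",  # hypothetical labels for the diff header
        tofile="current",
        n=context_lines,  # unchanged lines shown around each change
        lineterm="",  # input lines carry no newlines, so emit none
    )
    return "\n".join(diff)
```

Identical inputs yield an empty string, which makes it easy to test whether anything changed between two runs.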

Usage Examples

Basic Usage

from promptprep.aggregator import CodeAggregator

# Create an aggregator
aggregator = CodeAggregator(
    directory='./my_project',
    output_file='output.txt',
    exclude_dirs=['venv', 'node_modules'],
    extensions=['.py', '.js']
)

# Aggregate code
content = aggregator.aggregate_code()

# Save output
aggregator.save_output(content)

With Metadata

from promptprep.aggregator import CodeAggregator

aggregator = CodeAggregator(directory='./my_project')

# Generate metadata
metadata = aggregator.generate_metadata()

# Aggregate code
content = aggregator.aggregate_code()

# Combine metadata and content
full_content = metadata + '\n\n' + content

# Save output
aggregator.save_output(full_content)

Incremental Processing

from promptprep.aggregator import CodeAggregator
import time

# First run: record when the baseline was generated
baseline_timestamp = time.time()
aggregator = CodeAggregator(
    directory='./my_project',
    output_file='baseline.txt'
)
content = aggregator.aggregate_code()
aggregator.save_output(content)

# Later, after making changes: pass the timestamp of the earlier run,
# so that only files modified since then are processed
incremental_aggregator = CodeAggregator(
    directory='./my_project',
    output_file='updated.txt',
    incremental=True,
    last_run_timestamp=baseline_timestamp
)
updated_content = incremental_aggregator.aggregate_code()
incremental_aggregator.save_output(updated_content)

Generating Diffs

from promptprep.aggregator import CodeAggregator

aggregator = CodeAggregator(
    directory='./my_project',
    output_file='current.txt'
)
content = aggregator.aggregate_code()

# Generate diff with a previous version
diff = aggregator.generate_diff('previous.txt', context_lines=5)

# Save diff to a file
with open('diff.txt', 'w') as f:
    f.write(diff)