Knowledge Base Generator

🔗 Repository: https://github.com/PSavvateev/kb-generator.git

Overview

Purpose

A well-structured knowledge base is a key element of any customer service system. It is used not only by support agents but also for customer-facing self-service and AI automations such as chatbots and agent assists. Companies often end up with a large knowledge base documented in many different formats, a legacy of inconsistent authoring practices.

KB Generator automates the creation of knowledge base articles from documents in various formats. It analyzes document content, extracts structured information, generates well-formatted Markdown articles, and creates rich metadata for downstream LLM use (RAG agents or chatbots).

Key benefits

Save Time: Convert hours of manual KB article writing into minutes of automated processing
Consistency: Ensure uniform structure, tone, and quality across all articles
Rich Metadata: Automatically generate metadata, tags, and keywords
Multi-format Support: Process PDF, DOCX, and TXT files
Table Preservation: Accurately extract and format tables with intelligent validation
Flexible AI Providers: Support for Google Gemini, OpenAI, Anthropic Claude, and local Ollama models

Functionality

Core features

Document Parsing
- Extract text, tables, and metadata from PDF, DOCX, and TXT files
- Intelligent table validation to filter malformed extractions
- Preserve document structure and formatting
Content Cleaning
- Fix encoding issues (smart quotes, mojibake, UTF-8 errors)
- Remove artifacts (form feeds, control characters, zero-width spaces)
- Normalize whitespace and line breaks
- Remove duplicate lines
- Standardize bullet points and numbering
- Optional header/footer removal
Content Analysis
- AI-powered document type detection (tutorial, reference, how-to, troubleshooting, etc.)
- Automatic section identification and outlining
- Table placement recommendations
- Target audience identification
Article Generation
- Professional markdown article creation
- Proper heading hierarchy and structure
- Clean table formatting
- Source attribution
- Configurable tone and style
Metadata Generation
- SEO-optimized titles and descriptions
- Relevant tags and keywords
- Difficulty level assessment
- Reading time estimation
- Related article suggestions
- Prerequisites identification

Supported Document Types

| Type | Extensions | Table Extraction | Metadata Extraction |
| --- | --- | --- | --- |
| PDF | `.pdf` | ✅ Yes | ✅ Yes (limited) |
| Word | `.docx` | ✅ Yes | ✅ Yes (full) |
| Text | `.txt` | ❌ No | ❌ No |

Workflow

┌─────────────────────────────────────────────────────────────────┐
│                        KB GENERATOR PIPELINE                    │
└─────────────────────────────────────────────────────────────────┘

    Input Document (PDF/DOCX/TXT)
              │
              ▼
    ┌─────────────────────┐
    │  Stage 1: PARSING   │
    │  Document Parser    │
    │  • Extract text     │
    │  • Extract tables   │
    │  • Validate tables  │
    │  • Get metadata     │
    └──────────┬──────────┘
               │
               ▼
      Parsed Content + Tables
               │
               ▼
    ┌─────────────────────┐
    │ Stage 2: CLEANING   │
    │  Content Cleaner    │
    │  • Fix encoding     │
    │  • Remove artifacts │
    │  • Normalize text   │
    │  • Remove dupes     │
    └──────────┬──────────┘
               │
               ▼
       Clean Text + Tables
               │
               ▼
    ┌─────────────────────┐
    │ Stage 3: ANALYSIS   │
    │  Analysis Agent     │
    │  • Detect doc type  │
    │  • Identify sections│
    │  • Plan structure   │
    │  • Place tables     │
    └──────────┬──────────┘
               │
               ▼
       Content Plan + Structure
               │
               ▼
    ┌─────────────────────┐
    │ Stage 4: WRITING    │
    │  Writing Agent      │
    │  • Generate article │
    │  • Format markdown  │
    │  • Include tables   │
    │  • Apply style      │
    └──────────┬──────────┘
               │
               ▼
         KB Article (Markdown)
               │
               ▼
    ┌─────────────────────┐
    │ Stage 5: METADATA   │
    │  Metadata Agent     │
    │  • Generate title   │
    │  • Create tags      │
    │  • Extract keywords │
    │  • Suggest related  │
    └──────────┬──────────┘
               │
               ▼
    Output: Article + Metadata + JSON files
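
The staged flow in the diagram can be sketched as a simple chain in which each stage reads a shared context and adds its own results. This is only an illustration of the idea: `run_pipeline`, the stage names, and the context keys are hypothetical and not taken from `pipeline.py`.

```python
from typing import Any, Callable, Dict, List, Tuple

# A stage is a name plus a function that reads the context and returns new keys.
Stage = Tuple[str, Callable[[Dict[str, Any]], Dict[str, Any]]]


def run_pipeline(stages: List[Stage], context: Dict[str, Any]) -> Dict[str, Any]:
    """Run stages in order, merging each stage's output into the context."""
    completed = []
    for name, stage in stages:
        context = {**context, **stage(context)}
        completed.append(name)
    return {**context, "completed": completed}


# Toy stages standing in for parsing and cleaning (assumed, not the real services).
stages = [
    ("parse", lambda ctx: {"text": "  Raw text from the document  "}),
    ("clean", lambda ctx: {"clean_text": ctx["text"].strip()}),
]
result = run_pipeline(stages, {"source": "manual.pdf"})
print(result["completed"], result["clean_text"])
```

Keeping every intermediate result in one context makes it easy to write each stage's output to disk for debugging.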

Architecture

Components

1. Document Parser (`services/document_parser.py`)

Extracts content from various file formats with robust table validation.

Responsibilities:

Parse PDF, DOCX, and TXT files
Extract text content while preserving structure
Identify and extract tables using pdfplumber
Validate tables to filter malformed extractions
Convert tables to markdown format
Extract document metadata (page count, word count, etc.)

Key Features:

Strict table validation to filter PDF extraction errors
Handles empty columns, text blobs, and visual boxes
Preserves document order in DOCX files
Encoding detection for text files
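
A minimal sketch of what table validation and markdown conversion might look like. It assumes tables arrive as lists of rows of optional strings (the shape pdfplumber's `extract_tables()` returns). The function names and every threshold below are illustrative assumptions, not the values used in `document_parser.py`.

```python
from typing import List, Optional

Table = List[List[Optional[str]]]


def is_valid_table(table: Table, min_rows: int = 2, min_cols: int = 2,
                   max_cell_chars: int = 300, max_empty_ratio: float = 0.5) -> bool:
    """Heuristic filter for malformed extractions (thresholds are assumptions)."""
    if len(table) < min_rows:
        return False
    width = len(table[0])
    if width < min_cols or any(len(row) != width for row in table):
        return False
    cells = [(cell or "").strip() for row in table for cell in row]
    # Visual boxes: grids that are mostly empty cells.
    if sum(1 for c in cells if not c) / len(cells) > max_empty_ratio:
        return False
    # Text blobs: a whole paragraph captured as one cell.
    if any(len(c) > max_cell_chars for c in cells):
        return False
    # Empty columns: no row has content in that position.
    for col in range(width):
        if all(not (row[col] or "").strip() for row in table):
            return False
    return True


def table_to_markdown(table: Table) -> str:
    """Render a validated table as a markdown pipe table."""
    def fmt(row):
        cells = [(c or "").strip().replace("\n", " ").replace("|", "\\|") for c in row]
        return "| " + " | ".join(cells) + " |"

    header, *body = table
    lines = [fmt(header), "| " + " | ".join("---" for _ in header) + " |"]
    lines += [fmt(row) for row in body]
    return "\n".join(lines)
```

Rejecting a doubtful table loses less than keeping it, because the same content normally still appears in the extracted text.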

2. Content Cleaner (`services/content_cleaner.py`)

Cleans and normalizes extracted text for optimal LLM processing.

Responsibilities:

Fix encoding issues (UTF-8 mojibake, smart quotes, Latin-1 issues)
Remove artifacts (form feeds, control characters, BOM, zero-width spaces)
Normalize whitespace and line breaks
Remove consecutive duplicate lines
Standardize bullet points and list formatting
Optional removal of page headers/footers

Key Features:

Comprehensive encoding fix database (80+ patterns)
Configurable cleaning options
Statistics tracking for debugging
Conservative defaults to preserve content
Non-destructive cleaning (validates output)

What Gets Cleaned:

Encoding Issues: â€™ → ', Ã© → é, â€œ → "
Artifacts: Form feeds, control characters, zero-width spaces, BOM
Whitespace: Multiple spaces → single space, max 2 consecutive newlines
Bullets: •▪▫▸▹ → • (normalized)
Duplicates: Consecutive identical lines removed
Optional: Page headers/footers (“Page X of Y”)
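
The cleaning steps listed above can be sketched roughly as follows. The three-entry mapping is a tiny stand-in for the 80+ patterns mentioned earlier, and `clean_text` is a hypothetical name, so treat this as an illustration rather than the actual `content_cleaner.py` logic.

```python
import re

# Small illustrative subset of mojibake fixes (assumed, not the full pattern table).
MOJIBAKE = {
    "â€™": "'",
    "â€œ": '"',
    "Ã©": "é",
}

# Control characters (except tab, newline, carriage return), BOM, zero-width space.
ARTIFACTS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufeff\u200b]")
# Alternative bullet glyphs at the start of a line.
BULLETS = re.compile(r"^([ \t]*)[▪▫▸▹][ \t]*", re.MULTILINE)


def clean_text(text: str) -> str:
    for bad, good in MOJIBAKE.items():
        text = text.replace(bad, good)
    text = ARTIFACTS.sub("", text)
    text = BULLETS.sub(r"\1• ", text)
    text = re.sub(r"[ \t]+", " ", text)      # multiple spaces -> single space
    text = re.sub(r"\n{3,}", "\n\n", text)   # at most 2 consecutive newlines
    # Drop consecutive identical lines, but never collapse blank lines here.
    kept = []
    for line in text.split("\n"):
        if line.strip() and kept and line == kept[-1]:
            continue
        kept.append(line)
    return "\n".join(kept)
```

Only consecutive duplicates are removed, which fits the conservative defaults described above: a line that legitimately recurs later in the document is kept.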

3. Analysis Agent (`services/analysis_agent.py`)

AI-powered content analysis and structure planning.

Responsibilities:

Detect document type (tutorial, reference, concept, etc.)
Identify target audience and difficulty level
Extract key takeaways
Plan article sections and hierarchy
Recommend table placements
Analyze content style and tone

Key Features:

Multi-stage analysis with JSON output
Intelligent section planning
Table-to-section mapping
Content style detection
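
To make the idea of a JSON content plan with table-to-section mapping concrete, here is one possible shape for it. The field names (`doc_type`, `audience`, `sections`, `table_ids`) are assumptions for illustration. The real schema written to `article_plan.json` may differ.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class SectionPlan:
    title: str
    level: int = 2                                       # heading depth, 2 means "##"
    table_ids: List[int] = field(default_factory=list)   # tables placed in this section


@dataclass
class ContentPlan:
    doc_type: str
    audience: str
    sections: List[SectionPlan]


def plan_from_json(raw: str) -> ContentPlan:
    """Build a plan from the model's JSON reply, tolerating missing optional keys."""
    data = json.loads(raw)
    sections = [
        SectionPlan(s["title"], int(s.get("level", 2)), list(s.get("table_ids", [])))
        for s in data.get("sections", [])
    ]
    return ContentPlan(data.get("doc_type", "unknown"),
                       data.get("audience", "general"), sections)
```

Typed objects make downstream stages fail early on a malformed plan instead of partway through article generation.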

4. Writing Agent (`services/writing_agent.py`)

Generates professional markdown articles from content plans.

Responsibilities:

Generate well-structured markdown articles
Apply consistent formatting and style
Place tables in appropriate locations
Create proper heading hierarchy
Add source attribution
Maintain professional tone

Key Features:

Template-based generation
Configurable tone and style
Section-by-section writing
Table integration
Source citation
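
Section-by-section writing with table integration could be assembled along these lines. The `write_section` callable stands in for the LLM call, and the function name, argument shapes, and attribution format are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# (heading level, heading text, ids of tables to place in the section)
Section = Tuple[int, str, List[int]]


def assemble_article(title: str, sections: List[Section], tables: Dict[int, str],
                     write_section: Callable[[str], str], source: str) -> str:
    """Join generated sections, their tables, and a source line into one article."""
    parts = [f"# {title}"]
    for level, heading, table_ids in sections:
        parts.append(f"{'#' * level} {heading}")
        parts.append(write_section(heading).strip())
        # Place each mapped table directly after the section text.
        parts.extend(tables[t] for t in table_ids if t in tables)
    parts.append(f"*Source: {source}*")
    return "\n\n".join(parts) + "\n"


article = assemble_article(
    "Refund Policy",
    [(2, "Overview", []), (2, "Fees", [0])],
    {0: "| Plan | Fee |\n| --- | --- |\n| Basic | $5 |"},
    write_section=lambda heading: f"Text about {heading}.",
    source="refund_policy.pdf",
)
print(article)
```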

5. Metadata Agent (`services/metadata_agent.py`)

Creates comprehensive metadata for SEO and discoverability.

Responsibilities:

Generate SEO-optimized titles
Create meta descriptions
Extract and suggest tags
Identify keywords
Estimate reading time
Suggest related articles
Define prerequisites

Key Features:

Rich structured metadata
SEO optimization
Related content suggestions
Prerequisite identification
Comprehensive tagging
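
Reading time is the one metadata field that can be computed without a model. A minimal sketch, assuming an average reading speed of 200 words per minute (an assumed figure, not one stated by the project):

```python
import math
import re


def estimate_reading_time(markdown: str, words_per_minute: int = 200) -> int:
    """Return the estimated reading time in whole minutes (at least 1)."""
    words = re.findall(r"\b\w+\b", markdown)
    return max(1, math.ceil(len(words) / words_per_minute))


print(estimate_reading_time("word " * 450))
```

Rounding up and enforcing a one-minute minimum avoids showing "0 min read" for very short articles.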

6. LLM Client (`services/llm_client.py`)

Unified interface for multiple AI providers.

Responsibilities:

Abstract provider-specific implementations
Handle API authentication and requests
Implement retry logic and error handling
Parse JSON responses robustly
Manage rate limits

Supported Providers:

Google Gemini (gemini-1.5-flash, gemini-1.5-pro)
OpenAI (gpt-4o, gpt-4o-mini, o1-mini, o1-preview)
Anthropic Claude (claude-3-5-sonnet, claude-3-5-haiku, claude-3-opus)
Ollama (local models: llama3.1, qwen2.5, mistral, etc.)
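
Robust JSON parsing and retry logic might be combined roughly as shown here. The `call` argument is a placeholder for any provider-specific request function. The helper names, attempt count, and backoff schedule are assumptions rather than the behavior of `llm_client.py`.

```python
import json
import time
from typing import Any, Callable


def extract_json(text: str) -> Any:
    """Parse JSON from a reply that may surround the object with extra prose."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in response")
    return json.loads(text[start:end + 1])


def call_with_retry(call: Callable[[], str], attempts: int = 3,
                    base_delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> Any:
    """Call the provider, retrying with exponential backoff on any failure."""
    last_error = None
    for attempt in range(attempts):
        try:
            return extract_json(call())
        except Exception as exc:  # network errors, rate limits, bad JSON
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"LLM call failed after {attempts} attempts") from last_error
```

Because `sleep` is injectable, the backoff can be tested without waiting. A production client would likely catch narrower exception types per provider.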

Project structure

📂 kb-generator/
├── pipeline.py                     # Main CLI entry point
├── config.py                       # Configuration dataclasses
├── requirements.txt                # Python dependencies
├── .env                            # API keys (not in git)
├── .env.example                    # Example environment file
│
├── 📂 services/                   # Core service modules
│   ├── __init__.py
│   ├── document_parser.py          # Document parsing & table extraction
│   ├── content_cleaner.py          # Text cleaning & normalization
│   ├── analysis_agent.py           # Content analysis & planning
│   ├── writing_agent.py            # Article generation
│   ├── metadata_agent.py           # Metadata generation
│   └── llm_client.py               # LLM provider abstraction
│
├── 📂 outputs/                     # Generated articles (auto-created)
│   └── <document-name>/
│       ├── article.md              # Final article
│       ├── article_metadata.json   # Metadata
│       ├── article_plan.json       # Content plan
│       └── article_parsed.json     # Parsed document
│
├── 📂 logs/                        # Execution logs (auto-created)
│   └── pipeline_YYYYMMDD_HHMMSS.log
│
└── README.md
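
The per-document output layout shown in the tree could be produced with a helper like the one below. The file names come from the tree. The function name and its arguments are hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Dict


def save_outputs(document_name: str, article: str, metadata: Dict[str, Any],
                 plan: Dict[str, Any], parsed: Dict[str, Any],
                 root: Path = Path("outputs")) -> Path:
    """Write the article and its JSON companions to <root>/<document_name>/."""
    out_dir = root / document_name
    out_dir.mkdir(parents=True, exist_ok=True)  # auto-create, as the tree notes
    (out_dir / "article.md").write_text(article, encoding="utf-8")
    payloads = {
        "article_metadata.json": metadata,
        "article_plan.json": plan,
        "article_parsed.json": parsed,
    }
    for filename, payload in payloads.items():
        (out_dir / filename).write_text(
            json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8"
        )
    return out_dir
```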

Versions

Current version

v1.0.0 (20/11/2025)
