Knowledge Base Generator

AI-powered pipeline for transforming documents into LLM-ready knowledge base articles

🔗 Repository: https://github.com/PSavvateev/kb-generator.git

Overview

Purpose

A well-structured knowledge base is one of the key elements of a customer service system. It is used not only by support agents but also for customer-facing self-service and for AI automations such as chatbots and agent-assist tools. Companies often end up with a large knowledge base documented in many different formats, a legacy of inconsistent authoring practices.

KB Generator automates the creation of knowledge base articles from various document formats. It analyzes document content, extracts structured information, generates consistently formatted markdown articles, and creates rich metadata for downstream LLM use (RAG agents or chatbots).

Key benefits

  • Save Time: Convert hours of manual KB article writing into minutes of automated processing
  • Consistency: Ensure uniform structure, tone, and quality across all articles
  • Rich Metadata: Automatically generate metadata, tags, and keywords
  • Multi-format Support: Process PDFs, DOCX, and TXT files
  • Table Preservation: Accurately extract and format tables with intelligent validation
  • Flexible AI Providers: Support for Google Gemini, OpenAI, Anthropic Claude, and local Ollama models

Functionality

Core features

  1. Document Parsing

    • Extract text, tables, and metadata from PDF, DOCX, and TXT files
    • Intelligent table validation to filter malformed extractions
    • Preserve document structure and formatting
  2. Content Cleaning

    • Fix encoding issues (smart quotes, mojibake, UTF-8 errors)
    • Remove artifacts (form feeds, control characters, zero-width spaces)
    • Normalize whitespace and line breaks
    • Remove duplicate lines
    • Standardize bullet points and numbering
    • Optional header/footer removal
  3. Content Analysis

    • AI-powered document type detection (tutorial, reference, how-to, troubleshooting, etc.)
    • Automatic section identification and outlining
    • Table placement recommendations
    • Target audience identification
  4. Article Generation

    • Professional markdown article creation
    • Proper heading hierarchy and structure
    • Clean table formatting
    • Source attribution
    • Configurable tone and style
  5. Metadata Generation

    • SEO-optimized titles and descriptions
    • Relevant tags and keywords
    • Difficulty level assessment
    • Reading time estimation
    • Related article suggestions
    • Prerequisites identification

Supported Document Types

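| Type | Extensions | Table Extraction | Metadata Extraction |
| ---- | ---------- | ---------------- | ------------------- |
| PDF  | .pdf       | ✅ Yes           | ✅ Yes (limited)    |
| Word | .docx      | ✅ Yes           | ✅ Yes (full)       |
| Text | .txt       | ❌ No            | ❌ No               |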

Workflow

┌─────────────────────────────────────────────────────────────────┐
│                      KB GENERATOR PIPELINE                      │
└─────────────────────────────────────────────────────────────────┘

    Input Document (PDF/DOCX/TXT)
               │
               ▼
    ┌─────────────────────┐
    │  Stage 1: PARSING   │
    │  Document Parser    │
    │  • Extract text     │
    │  • Extract tables   │
    │  • Validate tables  │
    │  • Get metadata     │
    └──────────┬──────────┘
               │
               ▼
      Parsed Content + Tables
               │
               ▼
    ┌─────────────────────┐
    │ Stage 2: CLEANING   │
    │  Content Cleaner    │
    │  • Fix encoding     │
    │  • Remove artifacts │
    │  • Normalize text   │
    │  • Remove dupes     │
    └──────────┬──────────┘
               │
               ▼
       Clean Text + Tables
               │
               ▼
    ┌─────────────────────┐
    │ Stage 3: ANALYSIS   │
    │  Analysis Agent     │
    │  • Detect doc type  │
    │  • Identify sections│
    │  • Plan structure   │
    │  • Place tables     │
    └──────────┬──────────┘
               │
               ▼
       Content Plan + Structure
               │
               ▼
    ┌─────────────────────┐
    │ Stage 4: WRITING    │
    │  Writing Agent      │
    │  • Generate article │
    │  • Format markdown  │
    │  • Include tables   │
    │  • Apply style      │
    └──────────┬──────────┘
               │
               ▼
         KB Article (Markdown)
               │
               ▼
    ┌─────────────────────┐
    │ Stage 5: METADATA   │
    │  Metadata Agent     │
    │  • Generate title   │
    │  • Create tags      │
    │  • Extract keywords │
    │  • Suggest related  │
    └──────────┬──────────┘
               │
               ▼
    Output: Article + Metadata + JSON files
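
The five stages chain naturally as functions that each take the previous stage's output. The following is a minimal sketch of that idea, not the code in pipeline.py: `PipelineState`, `run_pipeline`, and the stand-in `parse` and `clean` stages are hypothetical names used for illustration only.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PipelineState:
    """Accumulates the output of each stage (hypothetical structure)."""
    source_path: str
    text: str = ""
    tables: list = field(default_factory=list)
    plan: dict = field(default_factory=dict)
    article: str = ""
    metadata: dict = field(default_factory=dict)


Stage = Callable[[PipelineState], PipelineState]


def run_pipeline(source_path: str, stages: list[Stage]) -> PipelineState:
    """Run the stages in order, e.g. parse, clean, analyze, write, metadata."""
    state = PipelineState(source_path=source_path)
    for stage in stages:
        state = stage(state)
    return state


# Stand-in stages for illustration; real stages would call parsers and LLMs.
def parse(state: PipelineState) -> PipelineState:
    state.text = "  Raw   text  "
    return state


def clean(state: PipelineState) -> PipelineState:
    state.text = " ".join(state.text.split())
    return state


result = run_pipeline("manual.pdf", [parse, clean])
```

Keeping every stage behind the same signature makes it straightforward to persist intermediate results (parsed document, content plan) between stages or to rerun a single stage in isolation.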

Architecture

Components

1. Document Parser (services/document_parser.py)

Extracts content from various file formats with robust table validation.

Responsibilities:

  • Parse PDF, DOCX, and TXT files
  • Extract text content while preserving structure
  • Identify and extract tables using pdfplumber
  • Validate tables to filter malformed extractions
  • Convert tables to markdown format
  • Extract document metadata (page count, word count, etc.)

Key Features:

  • Strict table validation to filter PDF extraction errors
  • Handles empty columns, text blobs, and visual boxes
  • Preserves document order in DOCX files
  • Encoding detection for text files
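
To make the table validation step concrete, here is a hedged sketch of the kind of heuristics that could reject empty columns, text blobs, and visual boxes before converting a table to markdown. It assumes the input shape that pdfplumber's `extract_tables()` produces (a list of rows, each a list of cell strings or `None`). The function names and thresholds are illustrative assumptions, not the project's actual rules.

```python
from typing import Optional

Row = list[Optional[str]]


def is_valid_table(rows: list[Row], min_rows: int = 2, min_cols: int = 2,
                   max_cell_chars: int = 300, min_fill: float = 0.5) -> bool:
    """Heuristic check that an extracted grid looks like a real table."""
    if len(rows) < min_rows:
        return False
    width = len(rows[0])
    if width < min_cols or any(len(r) != width for r in rows):
        return False
    cells = [(c or "").strip() for r in rows for c in r]
    # Reject "text blobs": a paragraph mistaken for a single cell.
    if any(len(c) > max_cell_chars for c in cells):
        return False
    # Reject visual boxes and mostly empty grids.
    if sum(1 for c in cells if c) / len(cells) < min_fill:
        return False
    # Reject tables that contain an entirely empty column.
    for col in range(width):
        if not any((r[col] or "").strip() for r in rows):
            return False
    return True


def table_to_markdown(rows: list[Row]) -> str:
    """Render rows as a markdown table, treating the first row as the header."""
    def fmt(row: Row) -> str:
        cells = [(c or "").strip().replace("\n", " ").replace("|", "\\|")
                 for c in row]
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    separator = "| " + " | ".join("---" for _ in header) + " |"
    return "\n".join([fmt(header), separator, *map(fmt, body)])
```

Erring on the side of rejection fits the "strict" validation described above: a dropped table still survives as plain text, whereas a malformed one would corrupt the generated article.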

2. Content Cleaner (services/content_cleaner.py)

Cleans and normalizes extracted text for optimal LLM processing.

Responsibilities:

  • Fix encoding issues (UTF-8 mojibake, smart quotes, Latin-1 issues)
  • Remove artifacts (form feeds, control characters, BOM, zero-width spaces)
  • Normalize whitespace and line breaks
  • Remove consecutive duplicate lines
  • Standardize bullet points and list formatting
  • Optional removal of page headers/footers

Key Features:

  • Comprehensive encoding fix database (80+ patterns)
  • Configurable cleaning options
  • Statistics tracking for debugging
  • Conservative defaults to preserve content
  • Non-destructive cleaning (validates output)

What Gets Cleaned:

  • Encoding Issues: ’ → ', é → é, “ → "
  • Artifacts: Form feeds, control characters, zero-width spaces, BOM
  • Whitespace: Multiple spaces → single space, max 2 consecutive newlines
  • Bullets: •▪▫▸▹ → • (normalized)
  • Duplicates: Consecutive identical lines removed
  • Optional: Page headers/footers (“Page X of Y”)
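
The cleaning rules listed above can be sketched in a few lines of standard-library Python. This is an illustrative approximation under stated assumptions: the three-entry `MOJIBAKE` map stands in for the much larger pattern database, and `clean_text` is a hypothetical name rather than the module's real API.

```python
import re

# Tiny illustrative subset of a mojibake map (UTF-8 bytes read as Windows-1252).
MOJIBAKE = {
    "\u00e2\u20ac\u2122": "'",   # mis-decoded right single quote
    "\u00e2\u20ac\u0153": '"',   # mis-decoded left double quote
    "\u00c3\u00a9": "\u00e9",    # mis-decoded e with acute accent
}

# Control characters (except tab, newline, CR), BOM, and zero-width space.
INVISIBLE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufeff\u200b]")
# Alternative bullet glyphs at the start of a line.
BULLETS = re.compile(r"^([ \t]*)[\u25aa\u25ab\u25b8\u25b9][ \t]*", re.MULTILINE)


def clean_text(text: str) -> str:
    text = text.replace("\r\n", "\n")
    for bad, good in MOJIBAKE.items():
        text = text.replace(bad, good)
    text = INVISIBLE.sub("", text)
    text = BULLETS.sub("\\1\u2022 ", text)       # normalize bullets
    text = re.sub(r"[ \t]+", " ", text)          # collapse runs of spaces
    kept = []
    for line in (raw.rstrip() for raw in text.split("\n")):
        # Drop consecutive identical lines, but keep blank lines for now.
        if kept and line and line == kept[-1]:
            continue
        kept.append(line)
    # Allow at most two consecutive newlines.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(kept)).strip()
```

Only consecutive duplicates are removed, which matches the conservative defaults described above: a line that legitimately recurs later in the document is left alone.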

3. Analysis Agent (services/analysis_agent.py)

AI-powered content analysis and structure planning.

Responsibilities:

  • Detect document type (tutorial, reference, concept, etc.)
  • Identify target audience and difficulty level
  • Extract key takeaways
  • Plan article sections and hierarchy
  • Recommend table placements
  • Analyze content style and tone

Key Features:

  • Multi-stage analysis with JSON output
  • Intelligent section planning
  • Table-to-section mapping
  • Content style detection
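
Because the analysis stage emits JSON, it helps to parse that output into typed objects before the writing stage consumes it. The sketch below assumes a hypothetical schema (`doc_type`, `audience`, `difficulty`, and `sections` with `title`, `level`, and `table_ids`). The actual fields in article_plan.json may differ.

```python
import json
from dataclasses import dataclass, field


@dataclass
class SectionPlan:
    title: str
    level: int = 2                                        # heading depth
    table_ids: list[int] = field(default_factory=list)    # tables placed here


@dataclass
class ContentPlan:
    doc_type: str
    audience: str
    difficulty: str
    sections: list[SectionPlan]


def parse_plan(raw_json: str) -> ContentPlan:
    """Build a ContentPlan from model output, tolerating missing fields."""
    data = json.loads(raw_json)
    sections = [
        SectionPlan(
            title=s["title"],
            level=int(s.get("level", 2)),
            table_ids=list(s.get("table_ids", [])),
        )
        for s in data.get("sections", [])
    ]
    return ContentPlan(
        doc_type=data.get("doc_type", "reference"),
        audience=data.get("audience", "general"),
        difficulty=data.get("difficulty", "beginner"),
        sections=sections,
    )
```

The `table_ids` list is one simple way to express the table-to-section mapping: each section names the extracted tables that belong under it.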

4. Writing Agent (services/writing_agent.py)

Generates professional markdown articles from content plans.

Responsibilities:

  • Generate well-structured markdown articles
  • Apply consistent formatting and style
  • Place tables in appropriate locations
  • Create proper heading hierarchy
  • Add source attribution
  • Maintain professional tone

Key Features:

  • Template-based generation
  • Configurable tone and style
  • Section-by-section writing
  • Table integration
  • Source citation
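
Once each section has been written, the final article is assembled in order with tables placed under their sections and a source line at the end. The following is a rough sketch of that assembly step only; `assemble_article` and its argument shapes are assumptions for illustration, and in the real agent the section bodies would come from the LLM.

```python
def assemble_article(title: str,
                     sections: list[tuple[str, str, list[int]]],
                     tables: dict[int, str],
                     source_name: str) -> str:
    """Join written sections into one markdown document.

    sections: (heading, body, table_ids) tuples in reading order.
    tables:   table id mapped to its markdown rendering.
    """
    parts = [f"# {title}"]
    for heading, body, table_ids in sections:
        parts.append(f"## {heading}")
        parts.append(body.strip())
        for table_id in table_ids:
            if table_id in tables:          # skip ids that failed validation
                parts.append(tables[table_id])
    parts.append(f"*Source: {source_name}*")
    return "\n\n".join(parts) + "\n"
```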

5. Metadata Agent (services/metadata_agent.py)

Creates comprehensive metadata for SEO and discoverability.

Responsibilities:

  • Generate SEO-optimized titles
  • Create meta descriptions
  • Extract and suggest tags
  • Identify keywords
  • Estimate reading time
  • Suggest related articles
  • Define prerequisites

Key Features:

  • Rich structured metadata
  • SEO optimization
  • Related content suggestions
  • Prerequisite identification
  • Comprehensive tagging
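
Some metadata fields do not need an LLM at all. As an example, reading time can be estimated from the word count. The sketch below assumes an average of 200 words per minute, a common rule of thumb that is not stated in the project, and uses hypothetical function and field names.

```python
import math
import re

WORD = re.compile(r"\b\w+\b")


def estimate_reading_minutes(markdown: str, words_per_minute: int = 200) -> int:
    """Round up to whole minutes, with a floor of one minute."""
    return max(1, math.ceil(len(WORD.findall(markdown)) / words_per_minute))


def build_metadata(title: str, article: str, tags: list[str]) -> dict:
    """Combine computed fields with normalized, de-duplicated tags."""
    return {
        "title": title,
        "tags": sorted({t.strip().lower() for t in tags if t.strip()}),
        "word_count": len(WORD.findall(article)),
        "reading_time_minutes": estimate_reading_minutes(article),
    }
```

The resulting dictionary is plain JSON-serializable data, so it can be written directly to a file such as article_metadata.json.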

6. LLM Client (services/llm_client.py)

Unified interface for multiple AI providers.

Responsibilities:

  • Abstract provider-specific implementations
  • Handle API authentication and requests
  • Implement retry logic and error handling
  • Parse JSON responses robustly
  • Manage rate limits

Supported Providers:

  • Google Gemini (gemini-1.5-flash, gemini-1.5-pro)
  • OpenAI (gpt-4o, gpt-4o-mini, o1-mini, o1-preview)
  • Anthropic Claude (claude-3-5-sonnet, claude-3-5-haiku, claude-3-opus)
  • Ollama (local models: llama3.1, qwen2.5, mistral, etc.)
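
Two of the responsibilities above, retry logic and robust JSON parsing, are independent of any particular provider. Here is a minimal sketch under stated assumptions: `send` is a hypothetical callable that wraps whichever provider SDK is in use and returns the raw reply text, and `extract_json` and `call_with_retry` are illustrative names, not the client's real interface.

```python
import json
import re
import time
from typing import Callable, Optional

FENCE = "`" * 3  # markdown code fence that models often wrap JSON in
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json(reply: str) -> dict:
    """Parse a JSON object from a reply that may include a fence or prose."""
    fenced = FENCED_BLOCK.search(reply)
    candidate = fenced.group(1) if fenced else reply
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    return json.loads(candidate[start:end + 1])


def call_with_retry(send: Callable[[str], str], prompt: str,
                    max_attempts: int = 3, base_delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call a provider, retrying with exponential backoff on any failure."""
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            return extract_json(send(prompt))
        except Exception as exc:  # real code would catch provider-specific errors
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"LLM call failed after {max_attempts} attempts") from last_error
```

Passing `sleep` as a parameter keeps the backoff testable without real delays. A production client would likely also distinguish rate-limit errors from malformed output rather than retrying both the same way.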

Project structure

📂 kb-generator/
├── pipeline.py                     # Main CLI entry point
├── config.py                       # Configuration dataclasses
├── requirements.txt                # Python dependencies
├── .env                            # API keys (not in git)
├── .env.example                    # Example environment file
│
├── 📂 services/                    # Core service modules
│   ├── __init__.py
│   ├── document_parser.py          # Document parsing & table extraction
│   ├── content_cleaner.py          # Text cleaning & normalization
│   ├── analysis_agent.py           # Content analysis & planning
│   ├── writing_agent.py            # Article generation
│   ├── metadata_agent.py           # Metadata generation
│   └── llm_client.py               # LLM provider abstraction
│
├── 📂 outputs/                     # Generated articles (auto-created)
│   └── <document-name>/
│       ├── article.md              # Final article
│       ├── article_metadata.json   # Metadata
│       ├── article_plan.json       # Content plan
│       └── article_parsed.json     # Parsed document
│
├── 📂 logs/                        # Execution logs (auto-created)
│   └── pipeline_YYYYMMDD_HHMMSS.log
│
└── README.md

Versions

Current version

v1.0.0 (20/11/2025)