Knowledge Base Generator
AI-powered pipeline for transforming documents into LLM-ready knowledge base articles
Repository: https://github.com/PSavvateev/kb-generator.git
Overview
Purpose
A well-structured knowledge base is a key element of a customer service system. It is used not only by support agents but also by customer-facing self-service and by AI automations such as chatbots and agent-assist tools. Companies often accumulate a large knowledge base documented in many different formats, a legacy of inconsistent authoring practices.
KB Generator automates the creation of knowledge base articles from various document formats. It analyzes document content, extracts structured information, generates markdown-format articles with proper formatting, and creates rich metadata for further LLM usage (RAG agents or chat bots).
Key benefits
- Save Time: Convert hours of manual KB article writing into minutes of automated processing
- Consistency: Ensure uniform structure, tone, and quality across all articles
- Rich Metadata: Automatically generate metadata, tags, and keywords
- Multi-format Support: Process PDF, DOCX, and TXT files
- Table Preservation: Accurately extract and format tables with intelligent validation
- Flexible AI Providers: Support for Google Gemini, OpenAI, Anthropic Claude, and local Ollama models
Functionality
Core features
- Document Parsing
  - Extract text, tables, and metadata from PDF, DOCX, and TXT files
  - Intelligent table validation to filter malformed extractions
  - Preserve document structure and formatting
- Content Cleaning
  - Fix encoding issues (smart quotes, mojibake, UTF-8 errors)
  - Remove artifacts (form feeds, control characters, zero-width spaces)
  - Normalize whitespace and line breaks
  - Remove duplicate lines
  - Standardize bullet points and numbering
  - Optional header/footer removal
- Content Analysis
  - AI-powered document type detection (tutorial, reference, how-to, troubleshooting, etc.)
  - Automatic section identification and outlining
  - Table placement recommendations
  - Target audience identification
- Article Generation
  - Professional markdown article creation
  - Proper heading hierarchy and structure
  - Clean table formatting
  - Source attribution
  - Configurable tone and style
- Metadata Generation
  - SEO-optimized titles and descriptions
  - Relevant tags and keywords
  - Difficulty level assessment
  - Reading time estimation
  - Related articles suggestions
  - Prerequisites identification
Supported Document Types
| Type | Extensions | Table Extraction | Metadata Extraction |
|---|---|---|---|
| PDF | .pdf | Yes | Yes (limited) |
| Word | .docx | Yes | Yes (full) |
| Text | .txt | No | No |
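The table above can be read as a small capability map keyed by file extension. The following is a minimal sketch of that idea; `FormatSupport`, `SUPPORTED_FORMATS`, and `detect_format` are hypothetical names used for illustration and are not taken from the repository.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class FormatSupport:
    """Capabilities of one input format, mirroring the table above."""
    name: str
    tables: bool
    metadata: str  # "limited", "full", or "none"


# Hypothetical lookup keyed by lowercase file extension.
SUPPORTED_FORMATS = {
    ".pdf": FormatSupport("PDF", tables=True, metadata="limited"),
    ".docx": FormatSupport("Word", tables=True, metadata="full"),
    ".txt": FormatSupport("Text", tables=False, metadata="none"),
}


def detect_format(path: str) -> FormatSupport:
    """Return the capability record for a file, or raise for unsupported types."""
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported document type: {path!r}")
    return SUPPORTED_FORMATS[suffix]
```

Lowercasing the suffix means `Manual.PDF` and `manual.pdf` are treated the same way, and rejecting unknown extensions early keeps later stages from receiving input they cannot parse.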
Workflow
┌───────────────────────────────────────────────────────────────┐
│                     KB GENERATOR PIPELINE                     │
└───────────────────────────────────────────────────────────────┘
Input Document (PDF/DOCX/TXT)
           │
           ▼
┌─────────────────────┐
│  Stage 1: PARSING   │
│  Document Parser    │
│  • Extract text     │
│  • Extract tables   │
│  • Validate tables  │
│  • Get metadata     │
└──────────┬──────────┘
           │
           ▼
Parsed Content + Tables
           │
           ▼
┌─────────────────────┐
│  Stage 2: CLEANING  │
│  Content Cleaner    │
│  • Fix encoding     │
│  • Remove artifacts │
│  • Normalize text   │
│  • Remove dupes     │
└──────────┬──────────┘
           │
           ▼
Clean Text + Tables
           │
           ▼
┌─────────────────────┐
│  Stage 3: ANALYSIS  │
│  Analysis Agent     │
│  • Detect doc type  │
│  • Identify sections│
│  • Plan structure   │
│  • Place tables     │
└──────────┬──────────┘
           │
           ▼
Content Plan + Structure
           │
           ▼
┌─────────────────────┐
│  Stage 4: WRITING   │
│  Writing Agent      │
│  • Generate article │
│  • Format markdown  │
│  • Include tables   │
│  • Apply style      │
└──────────┬──────────┘
           │
           ▼
KB Article (Markdown)
           │
           ▼
┌─────────────────────┐
│  Stage 5: METADATA  │
│  Metadata Agent     │
│  • Generate title   │
│  • Create tags      │
│  • Extract keywords │
│  • Suggest related  │
└──────────┬──────────┘
           │
           ▼
Output: Article + Metadata + JSON files
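The five stages form a linear chain in which each stage consumes what the previous one produced. A minimal sketch of that orchestration follows; `PipelineState`, `run_pipeline`, and the placeholder stage functions are assumptions made for illustration and do not reflect how `pipeline.py` is actually written.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PipelineState:
    """Accumulates the intermediate products named in the diagram."""
    source: str
    data: dict = field(default_factory=dict)
    completed: list = field(default_factory=list)


Stage = Callable[[PipelineState], None]


def run_pipeline(state: PipelineState, stages: list) -> PipelineState:
    """Run (name, stage) pairs in order; each stage extends the shared state."""
    for name, stage in stages:
        stage(state)
        state.completed.append(name)
    return state


# Placeholder stages: real ones would call the parser, cleaner, and agents.
def parse(state):
    state.data["parsed"] = f"  raw text of {state.source}  "


def clean(state):
    state.data["clean"] = state.data["parsed"].strip()


def analyze(state):
    state.data["plan"] = {"sections": ["Overview"]}


def write(state):
    state.data["article"] = "# " + state.data["plan"]["sections"][0]


def describe(state):
    state.data["metadata"] = {"title": state.source}


STAGES = [
    ("parsing", parse),
    ("cleaning", clean),
    ("analysis", analyze),
    ("writing", write),
    ("metadata", describe),
]
```

Calling `run_pipeline(PipelineState("guide.pdf"), STAGES)` runs the stages in diagram order and records each completed stage name, which makes it easy to see where a failed run stopped.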
Architecture
Components
1. Document Parser (services/document_parser.py)
Extracts content from various file formats with robust table validation.
Responsibilities:
- Parse PDF, DOCX, and TXT files
- Extract text content while preserving structure
- Identify and extract tables using pdfplumber
- Validate tables to filter malformed extractions
- Convert tables to markdown format
- Extract document metadata (page count, word count, etc.)
Key Features:
- Strict table validation to filter PDF extraction errors
- Handles empty columns, text blobs, and visual boxes
- Preserves document order in DOCX files
- Encoding detection for text files
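The validation and markdown conversion described above can be sketched as follows. The input shape (a list of rows, each a list of cells that may be `None`) matches what pdfplumber's table extraction returns, but the function names and every threshold here are assumptions chosen for illustration, not the rules used in `document_parser.py`.

```python
def is_valid_table(rows, min_rows=2, min_cols=2, max_cell_chars=200):
    """Heuristic check that an extracted grid looks like a real table."""
    if len(rows) < min_rows:
        return False
    width = len(rows[0])
    if width < min_cols or any(len(row) != width for row in rows):
        return False
    cells = [(cell or "").strip() for row in rows for cell in row]
    # Text blobs: a paragraph captured as one oversized cell.
    if any(len(cell) > max_cell_chars for cell in cells):
        return False
    # Empty columns: usually a visual box or layout rule, not data.
    for col in range(width):
        if all(not (row[col] or "").strip() for row in rows):
            return False
    return True


def table_to_markdown(rows):
    """Render a validated grid as a pipe table; the first row is the header."""
    def fmt(row):
        cells = ((cell or "").strip().replace("|", "\\|") for cell in row)
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    separator = "|" + "|".join("---" for _ in header) + "|"
    return "\n".join([fmt(header), separator, *map(fmt, body)])
```

Rejecting a doubtful grid is the conservative choice here: the text of a discarded table still reaches the article through normal text extraction, whereas a malformed table would be copied into the output verbatim.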
2. Content Cleaner (services/content_cleaner.py)
Cleans and normalizes extracted text for optimal LLM processing.
Responsibilities:
- Fix encoding issues (UTF-8 mojibake, smart quotes, Latin-1 issues)
- Remove artifacts (form feeds, control characters, BOM, zero-width spaces)
- Normalize whitespace and line breaks
- Remove consecutive duplicate lines
- Standardize bullet points and list formatting
- Optional removal of page headers/footers
Key Features:
- Comprehensive encoding fix database (80+ patterns)
- Configurable cleaning options
- Statistics tracking for debugging
- Conservative defaults to preserve content
- Non-destructive cleaning (validates output)
What Gets Cleaned:
- Encoding Issues:
- Encoding Issues: â€™ → ', Ã© → é, â€œ → "
- Artifacts: Form feeds, control characters, zero-width spaces, BOM
- Whitespace: Multiple spaces → single space, max 2 consecutive newlines
- Bullets: •▪▫▸▹ → • (normalized)
- Duplicates: Consecutive identical lines removed
- Optional: Page headers/footers ("Page X of Y")
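The cleaning rules listed above can be approximated in a few passes over the text. This is a minimal sketch under stated assumptions: the three entries in `ENCODING_FIXES` stand in for the much larger pattern table the cleaner is described as having, and `clean_text` is a hypothetical name rather than the API of `content_cleaner.py`.

```python
import re

# Illustrative subset of mojibake repairs (UTF-8 bytes mis-decoded as cp1252).
ENCODING_FIXES = {
    "\u00e2\u20ac\u2122": "'",   # mis-decoded right single quote
    "\u00e2\u20ac\u0153": '"',   # mis-decoded left double quote
    "\u00c3\u00a9": "\u00e9",    # mis-decoded e-acute
}

# Control characters (except tab, newline, carriage return), zero-width space, BOM.
ARTIFACTS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\u200b\ufeff]")
# Alternative bullet glyphs at the start of a line.
BULLETS = re.compile(r"^[ \t]*[\u25aa\u25ab\u25b8\u25b9\u2022][ \t]*", re.MULTILINE)


def clean_text(text: str) -> str:
    """Apply encoding, artifact, bullet, whitespace, and duplicate-line fixes."""
    for bad, good in ENCODING_FIXES.items():
        text = text.replace(bad, good)
    text = ARTIFACTS.sub("", text)
    text = BULLETS.sub("\u2022 ", text)
    text = re.sub(r"[ \t]{2,}", " ", text)
    text = "\n".join(line.rstrip() for line in text.split("\n"))
    text = re.sub(r"\n{3,}", "\n\n", text)  # at most 2 consecutive newlines

    lines = []
    for line in text.split("\n"):
        # Drop a non-empty line that repeats the line directly before it.
        if line and lines and line == lines[-1]:
            continue
        lines.append(line)
    return "\n".join(lines).strip()
```

Only consecutive duplicates are removed, so a sentence that legitimately recurs later in the document is kept.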
3. Analysis Agent (services/analysis_agent.py)
AI-powered content analysis and structure planning.
Responsibilities:
- Detect document type (tutorial, reference, concept, etc.)
- Identify target audience and difficulty level
- Extract key takeaways
- Plan article sections and hierarchy
- Recommend table placements
- Analyze content style and tone
Key Features:
- Multi-stage analysis with JSON output
- Intelligent section planning
- Table-to-section mapping
- Content style detection
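Because the analysis stage returns JSON, its reply can be validated and converted into typed objects before the writing stage uses it. The sketch below assumes a reply layout with `document_type`, `audience`, `difficulty`, and `sections` keys; these field names and defaults are illustrative guesses and may differ from the real content plan format.

```python
import json
from dataclasses import dataclass, field


@dataclass
class SectionPlan:
    """One planned section and the tables assigned to it."""
    title: str
    level: int = 2
    table_ids: list = field(default_factory=list)


@dataclass
class ContentPlan:
    """Structured result of the analysis stage (field names are illustrative)."""
    document_type: str
    audience: str
    difficulty: str
    sections: list


def plan_from_json(raw: str) -> ContentPlan:
    """Validate the model's JSON reply and convert it into typed objects."""
    data = json.loads(raw)
    sections = [
        SectionPlan(
            title=item["title"],
            level=int(item.get("level", 2)),
            table_ids=list(item.get("table_ids", [])),
        )
        for item in data.get("sections", [])
    ]
    if not sections:
        raise ValueError("analysis reply contained no sections")
    return ContentPlan(
        document_type=data.get("document_type", "reference"),
        audience=data.get("audience", "general"),
        difficulty=data.get("difficulty", "beginner"),
        sections=sections,
    )
```

Keeping `table_ids` on each section is one simple way to express the table-to-section mapping: the writer only needs to look up which tables belong to the section it is currently producing.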
4. Writing Agent (services/writing_agent.py)
Generates professional markdown articles from content plans.
Responsibilities:
- Generate well-structured markdown articles
- Apply consistent formatting and style
- Place tables in appropriate locations
- Create proper heading hierarchy
- Add source attribution
- Maintain professional tone
Key Features:
- Template-based generation
- Configurable tone and style
- Section-by-section writing
- Table integration
- Source citation
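Section-by-section writing implies a final assembly step that joins the generated sections, inserts tables where they were planned, and appends the source attribution. A minimal sketch of that step, assuming a hypothetical `assemble_article` helper and a simple dict-based section layout (neither is taken from `writing_agent.py`):

```python
def assemble_article(title, sections, tables, source):
    """Join written sections into one markdown document.

    sections: list of dicts with "heading", "level", "body", and optional "table_ids".
    tables:   dict mapping a table id to its markdown rendering.
    """
    parts = [f"# {title}"]
    for section in sections:
        parts.append(f"{'#' * section['level']} {section['heading']}")
        parts.append(section["body"].strip())
        # Place each table directly after the section it was assigned to.
        for table_id in section.get("table_ids", []):
            if table_id in tables:
                parts.append(tables[table_id])
    parts.append(f"*Source: {source}*")
    return "\n\n".join(parts) + "\n"
```

Joining every block with a blank line keeps headings, paragraphs, and tables separated the way markdown renderers expect.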
5. Metadata Agent (services/metadata_agent.py)
Creates comprehensive metadata for SEO and discoverability.
Responsibilities:
- Generate SEO-optimized titles
- Create meta descriptions
- Extract and suggest tags
- Identify keywords
- Estimate reading time
- Suggest related articles
- Define prerequisites
Key Features:
- Rich structured metadata
- SEO optimization
- Related content suggestions
- Prerequisite identification
- Comprehensive tagging
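Some metadata fields can be computed deterministically rather than requested from the model. The sketch below shows reading time estimation and tag normalization; the 200 words-per-minute rate, the ten-tag limit, and all function and key names are assumptions made for illustration.

```python
import math
import re


def estimate_reading_time(markdown: str, words_per_minute: int = 200) -> int:
    """Return whole minutes, rounded up, with a floor of one minute."""
    word_count = len(re.findall(r"\w+", markdown))
    return max(1, math.ceil(word_count / words_per_minute))


def normalize_tags(tags, limit=10):
    """Lowercase, trim, and de-duplicate tags while keeping their order."""
    seen, result = set(), []
    for tag in tags:
        cleaned = tag.strip().lower()
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result[:limit]


def build_metadata(title, article, tags, difficulty="beginner"):
    """Combine computed and supplied values into one metadata record."""
    return {
        "title": title.strip(),
        "tags": normalize_tags(tags),
        "difficulty": difficulty,
        "reading_time_minutes": estimate_reading_time(article),
    }
```

Computing these values locally keeps them reproducible between runs, which is harder to guarantee for fields generated by an LLM.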
6. LLM Client (services/llm_client.py)
Unified interface for multiple AI providers.
Responsibilities:
- Abstract provider-specific implementations
- Handle API authentication and requests
- Implement retry logic and error handling
- Parse JSON responses robustly
- Manage rate limits
Supported Providers:
- Google Gemini (gemini-1.5-flash, gemini-1.5-pro)
- OpenAI (gpt-4o, gpt-4o-mini, o1-mini, o1-preview)
- Anthropic Claude (claude-3-5-sonnet, claude-3-5-haiku, claude-3-opus)
- Ollama (local models: llama3.1, qwen2.5, mistral, etc.)
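Two of the responsibilities above, retry logic and robust JSON parsing, are independent of any particular provider. The following sketch illustrates one way to implement them; `extract_json`, `call_with_retry`, and the `send` callable are hypothetical and are not the interface of `llm_client.py`. Retrying on every exception is a simplification, since a real client would likely retry only on rate-limit or transient network errors.

```python
import json
import re
import time

FENCE = "`" * 3  # markdown code fence, built here to keep this example readable


def extract_json(text: str) -> dict:
    """Parse a JSON object from a reply that may include a code fence or extra prose."""
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE)
    fenced = re.search(pattern, text, re.DOTALL)
    candidate = fenced.group(1) if fenced else text
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in response")
    return json.loads(candidate[start:end + 1])


def call_with_retry(send, prompt, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call send(prompt), retrying with exponential backoff on any exception."""
    for attempt in range(retries):
        try:
            return send(prompt)
        except Exception:
            if attempt == retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Passing `sleep` as a parameter allows the backoff schedule to be tested without actually waiting, and `send` can wrap any provider call that takes a prompt and returns text.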
Project structure
kb-generator/
├── pipeline.py                # Main CLI entry point
├── config.py                  # Configuration dataclasses
├── requirements.txt           # Python dependencies
├── .env                       # API keys (not in git)
├── .env.example               # Example environment file
│
├── services/                  # Core service modules
│   ├── __init__.py
│   ├── document_parser.py     # Document parsing & table extraction
│   ├── content_cleaner.py     # Text cleaning & normalization
│   ├── analysis_agent.py      # Content analysis & planning
│   ├── writing_agent.py       # Article generation
│   ├── metadata_agent.py      # Metadata generation
│   └── llm_client.py          # LLM provider abstraction
│
├── outputs/                   # Generated articles (auto-created)
│   └── <document-name>/
│       ├── article.md             # Final article
│       ├── article_metadata.json  # Metadata
│       ├── article_plan.json      # Content plan
│       └── article_parsed.json    # Parsed document
│
├── logs/                      # Execution logs (auto-created)
│   └── pipeline_YYYYMMDD_HHMMSS.log
│
└── README.md
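The output and log layout shown in the tree can be derived from the input file name and a timestamp. This is a minimal sketch; it assumes that `<document-name>` is the input file's name without its extension, which the tree does not confirm, and `output_paths` and `log_path` are hypothetical helper names.

```python
from datetime import datetime
from pathlib import Path


def output_paths(source: str, root: str = "outputs") -> dict:
    """Map each generated artifact to its path under outputs/<document-name>/."""
    folder = Path(root) / Path(source).stem  # assumption: folder is named after the file stem
    return {
        "article": folder / "article.md",
        "metadata": folder / "article_metadata.json",
        "plan": folder / "article_plan.json",
        "parsed": folder / "article_parsed.json",
    }


def log_path(now: datetime, root: str = "logs") -> Path:
    """Build a log file name in the pipeline_YYYYMMDD_HHMMSS.log pattern."""
    return Path(root) / f"pipeline_{now:%Y%m%d_%H%M%S}.log"
```

Both helpers only compute paths. Creating the directories is left to the caller, for example with `Path.mkdir(parents=True, exist_ok=True)`.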