Sitemap Generation

Flow ID: SY-05 | Module(s): job, eshop, blog, category | Complexity: Medium | Last Updated: 2026-05-22

Business Overview

The AdvGenerateSitemaps job generates XML sitemaps for search engine crawlers. It collects URLs for all public-facing content -- products, categories, vendors, vendor product lines, vendor-category intersections, blog articles, and CMS pages -- then writes chunked sitemap files (500 URLs per file) plus a sitemap index to the dedicated storage('sitemap') disk under the sitemap/ prefix.

The sitemap index is a static file written by the cron alongside the chunks, not a controller endpoint. For local-disk tenants nginx serves all files (chunks + index) directly from public/sitemap/; for K8s tenants the same files live on S3 and are served via a CDN edge mounted at /sitemap/. Crawlers always hit the same registered URL /sitemap/sitemap_index.xml.

The job runs daily at 01:30 by default and is enabled in the core queue.

Architecture

Nightly cron (no per-request work — no controller in the path):
  AdvGenerateSitemaps
    |
    +--> getUrls()                           collect all URL sets
    |     +--> [homepage, vendors page]       static entry points
    |     +--> getCategories()                product category tree (4 levels deep)
    |     +--> getVendors()                   vendors + exclusive pages + product lines + vendor-category pages
    |     +--> getProducts()                  all active products, priority-sorted
    |     +--> getBlog()                      published blog articles
    |     +--> getPages()                     CMS static pages
    |
    +--> array_chunk(urls, 500)              split into 500-URL chunks
    +--> storage('sitemap')->listContents    list everything under sitemap/
    +--> ->delete(each)                      wipe the prefix
    +--> siteMap(chunk)                      generate XML per chunk
    +--> ->put(sitemap/sitemap-N.xml)        write each chunk
    +--> collect ->url(path) for each chunk  build absolute URLs for the index
    +--> siteMapIndex(urls)                  generate the index XML
    +--> ->put(sitemap/sitemap_index.xml)    write the index alongside the chunks

Request path (crawler hits the registered URL):
  GET /sitemap/sitemap_index.xml
    +--> Local-disk tenants  → nginx serves public/sitemap/sitemap_index.xml directly
    +--> K8s tenants         → CDN edge serves it from S3 at the same path

Key Files

| File | Role |
| --- | --- |
| ecommercen/job/libraries/AdvGenerateSitemaps.php | Job implementation |
| application/modules/job/libraries/GenerateSitemaps.php | Client-overridable subclass |
| ecommercen/helpers/xml_helper.php | siteMap() and siteMapIndex() XML generators |
| ecommercen/eshop/models/Adv_product_category_model.php | Category tree and vendor-category queries |
| ecommercen/eshop/models/Adv_vendors_model.php | Vendor sitemap queries |
| ecommercen/eshop/models/Adv_lines_model.php | Vendor product lines |
| ecommercen/eshop/models/Adv_product_model.php | getProductsForSiteMap() |
| ecommercen/blog/models/Adv_blog_model.php | Blog article list |
| ecommercen/category/models/Adv_categories_model.php | CMS pages list |
| src/Storage/Storage.php | League Flysystem wrapper used to read/write chunks on local or S3 |
| application/config/storage.php | sitemap storage type: local (root = FCPATH, visibility public) or s3 (with public CDN URL) |

Code Flow

1. Initialization

  • Resolves the current language abbreviation for URL generation.
  • Lists every file under the prefix via storage('sitemap')->listContents('sitemap') and deletes each one. This wipes the previous run's chunks and index before the new set is written. Because the wipe happens before the new files exist, crawlers may briefly receive a 404 for a chunk URL they cached earlier. This is accepted given the daily cadence, and it matches the behaviour of the pre-storage-refactor implementation.
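The wipe step can be pictured with a small in-memory stand-in for the disk. This is a sketch only: InMemoryDisk and wipePrefix() are hypothetical names, and the real storage('sitemap') wrapper may expose different signatures and return types.

```php
<?php
// Hypothetical in-memory stand-in for the sitemap disk (not the real Storage wrapper).
final class InMemoryDisk
{
    /** @var array<string, string> path => contents */
    private $files = [];

    public function put(string $path, string $contents): void
    {
        $this->files[$path] = $contents;
    }

    /** Returns the paths stored under the given prefix. */
    public function listContents(string $prefix): array
    {
        $prefix = rtrim($prefix, '/') . '/';
        $matches = [];
        foreach (array_keys($this->files) as $path) {
            if (strncmp($path, $prefix, strlen($prefix)) === 0) {
                $matches[] = $path;
            }
        }
        return $matches;
    }

    public function delete(string $path): void
    {
        unset($this->files[$path]);
    }
}

// Delete everything under the prefix and return how many files were removed.
function wipePrefix(InMemoryDisk $disk, string $prefix): int
{
    $removed = 0;
    foreach ($disk->listContents($prefix) as $path) {
        $disk->delete($path);
        $removed++;
    }
    return $removed;
}

$disk = new InMemoryDisk();
$disk->put('sitemap/sitemap-0.xml', '<urlset/>');
$disk->put('sitemap/sitemap_index.xml', '<sitemapindex/>');
$disk->put('other/keep.txt', 'untouched');
$removedCount = wipePrefix($disk, 'sitemap');
```

Files outside the sitemap/ prefix are left alone, which is why the job can share a disk root with other public assets.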

2. URL Collection

Each content type is collected independently and merged into a single URL array. Each URL entry is a map with loc, changefreq, and priority keys.

Static Pages (always included)

| URL | Frequency | Priority |
| --- | --- | --- |
| Homepage (base_url()) | daily | 1.0 |
| Vendors listing (/vendors) | weekly | 1.0 |
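As an illustration of the entry shape, the two static entries might be built and merged with another collection as follows. The urlEntry() helper, the $baseUrl value, and the sample blog slug are assumptions for this sketch; the job itself uses base_url().

```php
<?php
// Hypothetical helper: builds one URL entry with the three documented keys.
function urlEntry(string $loc, string $changefreq, float $priority): array
{
    return ['loc' => $loc, 'changefreq' => $changefreq, 'priority' => $priority];
}

// Assumed base URL for illustration only.
$baseUrl = 'https://example.com/';

// Static entry points that are always included.
$staticUrls = [
    urlEntry($baseUrl, 'daily', 1.0),
    urlEntry($baseUrl . 'vendors', 'weekly', 1.0),
];

// Every other content type contributes entries of the same shape,
// and all collections are merged into one flat array.
$blogUrls = [urlEntry($baseUrl . 'blog/hello-world', 'weekly', 0.7)];
$urls = array_merge($staticUrls, $blogUrls);
```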

Categories (gated by XML_FEEDS.IS_ENABLED_SITEMAP_CATEGORIES)

Traverses the published category tree up to 4 levels deep via product_category_model->getRecordsTreeMemory(). All published categories are added with:

  • Change frequency: daily
  • Priority: 0.9

Published category IDs are collected for use in the vendor-category intersection later.
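A depth-limited walk over a nested tree could look like the sketch below. The node keys (id, url, published, children) and the choice to skip the subtree of an unpublished node are assumptions; the structure returned by getRecordsTreeMemory() is not shown in this document.

```php
<?php
// Walk a nested category tree, stopping after $maxDepth levels.
// Published IDs are appended to $publishedIds for later reuse.
function collectCategories(array $nodes, int $maxDepth, array &$publishedIds, int $depth = 1): array
{
    $urls = [];
    if ($depth > $maxDepth) {
        return $urls;
    }
    foreach ($nodes as $node) {
        if (empty($node['published'])) {
            continue; // assumption: an unpublished node hides its subtree too
        }
        $publishedIds[] = $node['id'];
        $urls[] = ['loc' => $node['url'], 'changefreq' => 'daily', 'priority' => 0.9];
        $children = $node['children'] ?? [];
        $urls = array_merge($urls, collectCategories($children, $maxDepth, $publishedIds, $depth + 1));
    }
    return $urls;
}

$tree = [[
    'id' => 1, 'url' => 'https://example.com/l1', 'published' => 1, 'children' => [
        ['id' => 2, 'url' => 'https://example.com/l1/l2', 'published' => 1, 'children' => [
            ['id' => 3, 'url' => 'https://example.com/l1/l2/l3', 'published' => 1, 'children' => [
                ['id' => 4, 'url' => 'https://example.com/l1/l2/l3/l4', 'published' => 1, 'children' => [
                    // Level 5 lies beyond the depth limit, so it is never visited.
                    ['id' => 5, 'url' => 'https://example.com/l1/l2/l3/l4/l5', 'published' => 1],
                ]],
            ]],
        ]],
        ['id' => 6, 'url' => 'https://example.com/l1/hidden', 'published' => 0],
    ],
]];

$publishedIds = [];
$categoryUrls = collectCategories($tree, 4, $publishedIds);
```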

Products (gated by XML_FEEDS.IS_ENABLED_SITEMAP_PRODUCTS)

product_model->getProductsForSiteMap() returns all non-deleted, non-zero-price products ordered by: order count (DESC), active status (DESC), total stock (DESC).

Priority assignment:

| Condition | Priority |
| --- | --- |
| First 500 products in the sorted result (highest-selling) | 1.0 |
| Inactive products | 0.1 |
| Out-of-stock products that do not allow negative stock | 0.5 |
| All other products | 0.7 |

URL resolution: uses the product's url field if set; otherwise falls back to {vendors_base_url}/{vendor_slug}/{product_slug}.
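The priority rules and the URL fallback can be sketched as two small functions. The order in which the rules are checked, and the field names is_active, stock, allow_negative_stock, vendor_slug, and slug, are assumptions of this sketch rather than confirmed details of the job.

```php
<?php
// Priority for the product at zero-based $position in the sorted result set.
function productPriority(int $position, array $product): float
{
    if ($position < 500) {
        return 1.0; // first 500 rows of the popularity-sorted list
    }
    if (empty($product['is_active'])) {
        return 0.1; // inactive product
    }
    if ($product['stock'] <= 0 && empty($product['allow_negative_stock'])) {
        return 0.5; // out of stock and cannot be sold into negative stock
    }
    return 0.7;
}

// Prefer the explicit url field; otherwise build the vendor-based fallback.
function productUrl(array $product, string $vendorsBaseUrl): string
{
    if (!empty($product['url'])) {
        return $product['url'];
    }
    return rtrim($vendorsBaseUrl, '/') . '/' . $product['vendor_slug'] . '/' . $product['slug'];
}

$inactive = ['is_active' => 0, 'stock' => 5, 'allow_negative_stock' => 0];
$soldOut  = ['is_active' => 1, 'stock' => 0, 'allow_negative_stock' => 0];
$regular  = ['is_active' => 1, 'stock' => 3, 'allow_negative_stock' => 0];
$noUrl    = ['url' => '', 'vendor_slug' => 'acme', 'slug' => 'anvil'];
```

Under this ordering a product in the first 500 rows keeps priority 1.0 even if it is inactive or out of stock; a client that wants the opposite would reorder the checks in an override.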

Vendors (gated by XML_FEEDS.IS_ENABLED_SITEMAP_VENDORS)

Three sub-collections:

Vendor pages via vendors_model->getVendorsForSiteMap():

  • Top 15% of vendors: priority 1.0 (tracked as "top vendors")
  • Remaining vendors: priority 0.8
  • Exclusive vendors also get a /details page at priority 0.5
  • Change frequency: weekly
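A minimal sketch of the vendor tiering follows. It assumes the vendor list arrives already ordered by position, that the 15% cutoff is rounded up, and that rows carry id, slug, and is_exclusive; none of these details are confirmed by the source.

```php
<?php
// Build vendor page entries and record which vendor IDs count as "top vendors".
function vendorUrls(array $vendors, string $vendorsBaseUrl, array &$topVendorIds): array
{
    // ceil(count * 15 / 100) in integer arithmetic; the rounding is an assumption.
    $topCount = intdiv(count($vendors) * 15 + 99, 100);
    $urls = [];
    foreach (array_values($vendors) as $index => $vendor) {
        $isTop = $index < $topCount;
        if ($isTop) {
            $topVendorIds[] = $vendor['id'];
        }
        $loc = rtrim($vendorsBaseUrl, '/') . '/' . $vendor['slug'];
        $urls[] = ['loc' => $loc, 'changefreq' => 'weekly', 'priority' => $isTop ? 1.0 : 0.8];
        if (!empty($vendor['is_exclusive'])) {
            // Exclusive vendors get an extra details page.
            $urls[] = ['loc' => $loc . '/details', 'changefreq' => 'weekly', 'priority' => 0.5];
        }
    }
    return $urls;
}

// Ten sample vendors; vendors 2 and 10 are exclusive.
$vendors = [];
for ($i = 1; $i <= 10; $i++) {
    $vendors[] = ['id' => $i, 'slug' => "v{$i}", 'is_exclusive' => in_array($i, [2, 10], true) ? 1 : 0];
}

$topVendorIds = [];
$vendorUrls = vendorUrls($vendors, 'https://example.com/vendors', $topVendorIds);
```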

Vendor product lines via lines_model->getLinesForSiteMap():

  • Lines belonging to top vendors: priority 0.8
  • Other lines: priority 0.6
  • URL: /vendors/{vendor_slug}/{line_slug}

Vendor-category pages via product_category_model->getVendorsCategoriesForSiteMap():

  • Only for categories that were published (collected during category traversal)
  • Top vendor categories: priority 0.8
  • Other categories: priority 0.6
  • URL: /vendors/{vendor_slug}/{category_slug}
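Product lines and vendor-category pages share the same two-tier rule, which might be expressed as below. The sketch reads "top vendor categories" as categories belonging to a top vendor, uses hypothetical row keys (vendor_id, vendor_slug, category_id, category_slug), and assumes a weekly change frequency, which the source does not state for these pages.

```php
<?php
// 0.8 for rows that belong to a top vendor, 0.6 otherwise.
function tieredPriority(int $vendorId, array $topVendorIds): float
{
    return in_array($vendorId, $topVendorIds, true) ? 0.8 : 0.6;
}

// Keep only rows whose category was published during the category traversal.
function vendorCategoryUrls(array $rows, array $publishedCategoryIds, array $topVendorIds, string $vendorsBaseUrl): array
{
    $urls = [];
    foreach ($rows as $row) {
        if (!in_array($row['category_id'], $publishedCategoryIds, true)) {
            continue;
        }
        $urls[] = [
            'loc' => rtrim($vendorsBaseUrl, '/') . '/' . $row['vendor_slug'] . '/' . $row['category_slug'],
            'changefreq' => 'weekly', // assumption: not stated in the source
            'priority' => tieredPriority($row['vendor_id'], $topVendorIds),
        ];
    }
    return $urls;
}

$rows = [
    ['vendor_id' => 1, 'vendor_slug' => 'acme', 'category_id' => 10, 'category_slug' => 'shoes'],
    ['vendor_id' => 2, 'vendor_slug' => 'bolt', 'category_id' => 10, 'category_slug' => 'shoes'],
    ['vendor_id' => 1, 'vendor_slug' => 'acme', 'category_id' => 99, 'category_slug' => 'hidden'],
];
$vendorCategoryUrls = vendorCategoryUrls($rows, [10], [1], 'https://example.com/vendors');
```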

Blog Articles (gated by XML_FEEDS.IS_ENABLED_SITEMAP_BLOG)

Published articles with blog_date <= today via blog_model->getBlogsList():

  • Change frequency: weekly
  • Priority: 0.7
  • URL: /blog/{slug}
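The date gate can be illustrated with an in-PHP filter. In the real job this filtering most likely happens inside the model query; the row keys (slug, blog_date, published) and the blogUrls() function are assumptions for this sketch.

```php
<?php
// Keep published articles dated today or earlier and map them to URL entries.
function blogUrls(array $articles, string $baseUrl, string $today): array
{
    $urls = [];
    foreach ($articles as $article) {
        // ISO dates (YYYY-MM-DD) order correctly under plain string comparison.
        $isFuture = strcmp($article['blog_date'], $today) > 0;
        if (empty($article['published']) || $isFuture) {
            continue;
        }
        $urls[] = [
            'loc' => rtrim($baseUrl, '/') . '/blog/' . $article['slug'],
            'changefreq' => 'weekly',
            'priority' => 0.7,
        ];
    }
    return $urls;
}

$articles = [
    ['slug' => 'launch',    'blog_date' => '2026-05-01', 'published' => 1],
    ['slug' => 'today',     'blog_date' => '2026-05-22', 'published' => 1],
    ['slug' => 'scheduled', 'blog_date' => '2026-06-01', 'published' => 1],
    ['slug' => 'draft',     'blog_date' => '2026-04-01', 'published' => 0],
];
$blogUrls = blogUrls($articles, 'https://example.com/', '2026-05-22');
```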

CMS Pages (gated by XML_FEEDS.IS_ENABLED_SITEMAP_STATIC_PAGES)

All CMS pages via categories_model->get_list():

  • Change frequency: yearly
  • Priority: 0.3
  • URL: /category/{slug}

3. Chunk and Index Writing

  1. The merged URL array is split into chunks of 500 URLs each (array_chunk).
  2. Each chunk is passed to siteMap() (ecommercen/helpers/xml_helper.php), which generates a compliant <urlset> XML document using DOMDocument.
  3. Each chunk is written to sitemap/sitemap-{N}.xml via storage('sitemap')->put(...). The job also captures storage('sitemap')->url($chunkPath) for each chunk into a $chunkUrls array.
  4. Once all chunks are written, siteMapIndex($chunkUrls) builds the <sitemapindex> XML and the job writes it to sitemap/sitemap_index.xml.

There is no controller in the request path — both chunks and index are static files served directly by nginx (local-disk tenants) or by the CDN edge (K8s tenants).
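The four steps above can be sketched end to end with a fake disk and placeholder renderers. FakeSitemapDisk and writeSitemaps() are hypothetical, and the two closures stand in for siteMap() and siteMapIndex(); only the chunking, path naming, and ordering follow the description above.

```php
<?php
// Hypothetical stand-in for the sitemap disk; records writes and builds public URLs.
final class FakeSitemapDisk
{
    /** @var array<string, string> path => contents */
    public $files = [];
    private $publicBase;

    public function __construct(string $publicBase)
    {
        $this->publicBase = rtrim($publicBase, '/');
    }

    public function put(string $path, string $contents): void
    {
        $this->files[$path] = $contents;
    }

    public function url(string $path): string
    {
        return $this->publicBase . '/' . $path;
    }
}

// Write one file per chunk, then the index that lists every chunk URL.
function writeSitemaps(FakeSitemapDisk $disk, array $urls, callable $renderChunk, callable $renderIndex, int $chunkSize = 500): array
{
    $chunkUrls = [];
    foreach (array_chunk($urls, $chunkSize) as $n => $chunk) {
        $path = "sitemap/sitemap-{$n}.xml";
        $disk->put($path, $renderChunk($chunk));
        $chunkUrls[] = $disk->url($path);
    }
    $disk->put('sitemap/sitemap_index.xml', $renderIndex($chunkUrls));
    return $chunkUrls;
}

// 1201 sample entries produce chunks of 500, 500, and 201 URLs.
$entry = ['loc' => 'https://example.com/', 'changefreq' => 'daily', 'priority' => 1.0];
$disk = new FakeSitemapDisk('https://example.com');
$chunkUrls = writeSitemaps(
    $disk,
    array_fill(0, 1201, $entry),
    function (array $chunk): string { return 'urls:' . count($chunk); },    // placeholder for siteMap()
    function (array $locs): string { return 'index:' . count($locs); }      // placeholder for siteMapIndex()
);
```

Writing the index last means it never lists a chunk that has not been written yet.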

XML Structure

Sitemap chunk (sitemap-0.xml):

```xml
<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <changefreq>daily</changefreq>
    <priority>1</priority>
  </url>
  ...
</urlset>
```

Sitemap index (sitemap_index.xml):

```xml
<?xml version="1.0" encoding="utf-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap/sitemap-0.xml</loc>
  </sitemap>
  ...
</sitemapindex>
```
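A chunk document of this shape can be produced with DOMDocument roughly as follows. renderUrlset() is an illustrative name, not the real siteMap() helper, whose internals may differ; the sketch requires PHP's dom extension.

```php
<?php
// Illustrative <urlset> renderer built on DOMDocument.
function renderUrlset(array $urls): string
{
    $ns = 'http://www.sitemaps.org/schemas/sitemap/0.9';
    $doc = new DOMDocument('1.0', 'utf-8');
    $doc->formatOutput = true;

    $urlset = $doc->createElementNS($ns, 'urlset');
    $doc->appendChild($urlset);

    foreach ($urls as $entry) {
        $url = $doc->createElementNS($ns, 'url');
        foreach (['loc', 'changefreq', 'priority'] as $key) {
            $child = $doc->createElementNS($ns, $key);
            // Text nodes are escaped on output, so "&" in a URL becomes "&amp;".
            // Casting the float 1.0 to string yields "1", as in the sample above.
            $child->appendChild($doc->createTextNode((string) $entry[$key]));
            $url->appendChild($child);
        }
        $urlset->appendChild($url);
    }

    return $doc->saveXML();
}

$xml = renderUrlset([
    ['loc' => 'https://example.com/?a=1&b=2', 'changefreq' => 'daily', 'priority' => 1.0],
    ['loc' => 'https://example.com/blog/post', 'changefreq' => 'weekly', 'priority' => 0.7],
]);
```

Building the document through DOM nodes instead of string concatenation keeps special characters in URLs from producing invalid XML.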

Data Model

No dedicated tables. The job reads from:

| Table | Content Type |
| --- | --- |
| shop_product + shop_product_mui + shop_vendor_mui | Products |
| shop_product_category + shop_product_category_mui | Categories |
| shop_vendor + shop_vendor_mui | Vendors |
| shop_product_lines + shop_product_lines_mui | Product lines |
| shop_product_category_lp | Vendor-category intersections |
| blog + blog_mui | Blog articles |
| categories + categories_mui | CMS pages |
| tmp_shop_order_basket | Product popularity (joined for sort order) |

Configuration

Job Scheduling (application/config/jobs.php)

```php
['command' => 'GenerateSitemaps', 'schedule' => '30 1 * * *', 'graceTime' => 300, 'retryTimes' => 3]
```

Runs daily at 01:30 in the core queue.

Registry Settings (feature toggles)

| Group | Key | Purpose |
| --- | --- | --- |
| XML_FEEDS | IS_ENABLED_SITEMAP_CATEGORIES | Include categories in sitemap |
| XML_FEEDS | IS_ENABLED_SITEMAP_PRODUCTS | Include products in sitemap |
| XML_FEEDS | IS_ENABLED_SITEMAP_VENDORS | Include vendors, lines, and vendor-categories |
| XML_FEEDS | IS_ENABLED_SITEMAP_BLOG | Include blog articles |
| XML_FEEDS | IS_ENABLED_SITEMAP_STATIC_PAGES | Include CMS pages |
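The gating can be pictured as a map from registry key to collector. In this sketch the registry is a plain array and the closures stand in for the job's getters; how the real job reads XML_FEEDS settings is not shown in this document.

```php
<?php
// Merge static URLs with the output of every collector whose toggle is on.
function collectEnabledUrls(array $registry, array $collectors, array $staticUrls): array
{
    $urls = $staticUrls; // static pages are always included
    foreach ($collectors as $settingKey => $collector) {
        if (!empty($registry[$settingKey])) {
            $urls = array_merge($urls, $collector());
        }
    }
    return $urls;
}

// Sample XML_FEEDS values: products are switched off for this tenant.
$registry = [
    'IS_ENABLED_SITEMAP_CATEGORIES' => true,
    'IS_ENABLED_SITEMAP_PRODUCTS'   => false,
    'IS_ENABLED_SITEMAP_BLOG'       => true,
];

$entry = function (string $path, string $freq, float $priority): array {
    return ['loc' => 'https://example.com/' . $path, 'changefreq' => $freq, 'priority' => $priority];
};

// Closures standing in for getCategories(), getProducts(), and getBlog().
$collectors = [
    'IS_ENABLED_SITEMAP_CATEGORIES' => function () use ($entry) { return [$entry('shoes', 'daily', 0.9)]; },
    'IS_ENABLED_SITEMAP_PRODUCTS'   => function () use ($entry) { return [$entry('vendors/acme/anvil', 'daily', 0.7)]; },
    'IS_ENABLED_SITEMAP_BLOG'       => function () use ($entry) { return [$entry('blog/launch', 'weekly', 0.7)]; },
];

$urls = collectEnabledUrls($registry, $collectors, [$entry('', 'daily', 1.0)]);
```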

Output Location

Files are written to the dedicated sitemap storage type under the sitemap/ prefix. The actual location depends on the driver configured in application/config/storage.php:

  • Local driver (single-server deployments): chunks and the index land at FCPATH/sitemap/sitemap-N.xml and FCPATH/sitemap/sitemap_index.xml. Nginx serves them as static files at /sitemap/....
  • S3 driver (Kubernetes deployments): same paths but on s3://{SITEMAP_S3_BUCKET}/{FILES_S3_PREFIX}/sitemap/. The configured SITEMAP_S3_URL CDN endpoint serves them. Infra is expected to mount the CDN at https://{app-host}/sitemap/ (typically via CloudFront path-based routing) so the registered URL /sitemap/sitemap_index.xml resolves to the S3 object without changing the public URL.

Driver selection is per-tenant via the SITEMAP_STORAGE_DISK env var (local or s3).
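The mapping from driver to physical location can be summarised in a short sketch. Both function names are hypothetical, the fallback to local when the variable is unset is an assumption, and the actual structure of application/config/storage.php is not reproduced here.

```php
<?php
// Resolve and validate the driver name from an environment-like array.
function resolveSitemapDriver(array $env): string
{
    $driver = strtolower($env['SITEMAP_STORAGE_DISK'] ?? 'local'); // default is an assumption
    if (!in_array($driver, ['local', 's3'], true)) {
        throw new InvalidArgumentException("Unsupported sitemap driver: {$driver}");
    }
    return $driver;
}

// Where the sitemap/ prefix physically lives for each driver.
function sitemapLocation(string $driver, string $fcpath, string $bucket, string $s3Prefix): string
{
    if ($driver === 's3') {
        return 's3://' . $bucket . '/' . trim($s3Prefix, '/') . '/sitemap/';
    }
    return rtrim($fcpath, '/') . '/sitemap/';
}

$localDir = sitemapLocation(resolveSitemapDriver([]), '/var/www/public/', 'my-bucket', 'tenant-a');
$s3Dir = sitemapLocation(resolveSitemapDriver(['SITEMAP_STORAGE_DISK' => 'S3']), '/var/www/public/', 'my-bucket', 'tenant-a');
```

Either way the public URL stays /sitemap/sitemap_index.xml; only the backing location changes.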

Client Extension Points

  1. Override the job class: Extend AdvGenerateSitemaps in application/modules/job/libraries/GenerateSitemaps.php to:

    • Add custom content types (e.g., event pages, landing pages)
    • Exclude specific categories or vendors
    • Change priority assignment logic
    • Modify URL patterns
  2. Override getUrls(): Add or remove content type methods entirely.

  3. Override individual getters: Override getProducts(), getCategories(), etc. to customize filtering, ordering, or priority logic for specific content types.

  4. Override XML helpers: Replace siteMap() and siteMapIndex() in application/helpers/xml_helper.php to change XML formatting (e.g., add <lastmod> timestamps or <image:image> extensions).

  5. Custom content types: Add new getter methods in a subclass and include them in a getUrls() override.
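A client override that adds a custom content type might look like the sketch below. AdvGenerateSitemapsStub is a stand-in so the example runs on its own; the real base class, its method visibility, and its return types may differ, and getLandingPages() is a hypothetical getter.

```php
<?php
// Minimal stand-in for AdvGenerateSitemaps (assumed method names and visibility).
class AdvGenerateSitemapsStub
{
    protected function getBlog(): array
    {
        return [['loc' => 'https://example.com/blog/launch', 'changefreq' => 'weekly', 'priority' => 0.7]];
    }

    protected function getUrls(): array
    {
        return $this->getBlog();
    }

    // Public accessor so the sketch can inspect the collected URLs.
    public function urls(): array
    {
        return $this->getUrls();
    }
}

// Client subclass; in the real project it would extend AdvGenerateSitemaps.
class GenerateSitemaps extends AdvGenerateSitemapsStub
{
    // Hypothetical extra content type.
    protected function getLandingPages(): array
    {
        return [['loc' => 'https://example.com/landing/summer-sale', 'changefreq' => 'weekly', 'priority' => 0.6]];
    }

    // Keep the core collections and append the custom one.
    protected function getUrls(): array
    {
        return array_merge(parent::getUrls(), $this->getLandingPages());
    }
}

$collected = (new GenerateSitemaps())->urls();
```

Calling parent::getUrls() keeps the core collections and their feature toggles intact, so the override only has to describe what is new.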

Business Rules

| Rule | Description |
| --- | --- |
| Full regeneration | The job lists everything under sitemap/ and deletes it before writing the new chunk set + index; no incremental updates |
| Static index | The sitemap index is written by the cron as a static file (sitemap/sitemap_index.xml); no PHP request path |
| 500 URLs per file | Sitemaps are chunked at 500 URLs per file for crawler efficiency |
| Feature-gated content | Each content type can be independently enabled/disabled via registry |
| Priority-based ranking | Products sorted by popularity; first 500 get maximum priority |
| Exclusive vendor pages | Vendors with is_exclusive = 1 get an additional /details page |
| Top vendor boost | Top 15% of vendors (by position) receive priority 1.0 |
| Published content only | Categories must be published = 1; blog articles must have blog_date <= today |
| Non-zero price | Products with zero price are excluded |
| Non-deleted products | Soft-deleted products are excluded |
| Language-aware URLs | Product and category URLs use the configured language abbreviation |