Sitemap Generation
Flow ID: SY-05 | Module(s): job, eshop, blog, category | Complexity: Medium | Last Updated: 2026-05-22
Business Overview
The AdvGenerateSitemaps job generates XML sitemaps for search engine crawlers. It collects URLs for all public-facing content -- products, categories, vendors, vendor product lines, vendor-category intersections, blog articles, and CMS pages -- then writes chunked sitemap files (500 URLs per file) plus a sitemap index to the dedicated storage('sitemap') disk under the sitemap/ prefix.
The sitemap index is a static file written by the cron alongside the chunks, not a controller endpoint. For local-disk tenants nginx serves all files (chunks + index) directly from public/sitemap/; for K8s tenants the same files live on S3 and are served via a CDN edge mounted at /sitemap/. Crawlers always hit the same registered URL /sitemap/sitemap_index.xml.
The job runs daily at 01:30 by default and is enabled in the core queue.
Architecture
Nightly cron (no per-request work — no controller in the path):
AdvGenerateSitemaps
|
+--> getUrls() collect all URL sets
| +--> [homepage, vendors page] static entry points
| +--> getCategories() product category tree (4 levels deep)
| +--> getVendors() vendors + exclusive pages + product lines + vendor-category pages
| +--> getProducts() all active products, priority-sorted
| +--> getBlog() published blog articles
| +--> getPages() CMS static pages
|
+--> array_chunk(urls, 500) split into 500-URL chunks
+--> storage('sitemap')->listContents list everything under sitemap/
+--> ->delete(each) wipe the prefix
+--> siteMap(chunk) generate XML per chunk
+--> ->put(sitemap/sitemap-N.xml) write each chunk
+--> collect ->url(path) for each chunk build absolute URLs for the index
+--> siteMapIndex(urls) generate the index XML
+--> ->put(sitemap/sitemap_index.xml) write the index alongside the chunks
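The cron steps above can be sketched end to end as follows. This is a minimal illustration, not the job's actual code: InMemoryDisk is a hypothetical stand-in for the storage('sitemap') disk that models only the four calls named in the diagram (and returns plain path strings from listContents, which the real Flysystem wrapper may not), and the two callables stand in for siteMap() and siteMapIndex(). It assumes PHP 8.0 or later.

```php
<?php
// Hypothetical in-memory stand-in for the storage('sitemap') disk.
final class InMemoryDisk
{
    /** @var array<string,string> path => contents */
    public array $files = [];

    public function __construct(private string $baseUrl) {}

    /** @return string[] paths stored under the given prefix */
    public function listContents(string $prefix): array
    {
        return array_values(array_filter(
            array_keys($this->files),
            fn (string $path): bool => str_starts_with($path, $prefix . '/')
        ));
    }

    public function delete(string $path): void
    {
        unset($this->files[$path]);
    }

    public function put(string $path, string $contents): void
    {
        $this->files[$path] = $contents;
    }

    public function url(string $path): string
    {
        return rtrim($this->baseUrl, '/') . '/' . $path;
    }
}

/**
 * Wipe the prefix, write one file per 500-URL chunk, then write the index.
 * Returns the absolute chunk URLs that were passed to the index renderer.
 */
function regenerateSitemaps(InMemoryDisk $disk, array $urls, callable $renderChunk, callable $renderIndex): array
{
    // Remove the previous run's chunks and index.
    foreach ($disk->listContents('sitemap') as $path) {
        $disk->delete($path);
    }

    $chunkUrls = [];
    foreach (array_chunk($urls, 500) as $n => $chunk) {
        $path = "sitemap/sitemap-{$n}.xml";
        $disk->put($path, $renderChunk($chunk));
        $chunkUrls[] = $disk->url($path);
    }

    // The index is written last, alongside the chunks.
    $disk->put('sitemap/sitemap_index.xml', $renderIndex($chunkUrls));

    return $chunkUrls;
}
```

Writing the index last means it only ever lists chunks that already exist on the disk.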
Request path (crawler hits the registered URL):
GET /sitemap/sitemap_index.xml
+--> Local-disk tenants → nginx serves public/sitemap/sitemap_index.xml directly
+--> K8s tenants → CDN edge serves it from S3 at the same path
Key Files
| File | Role |
|---|---|
| ecommercen/job/libraries/AdvGenerateSitemaps.php | Job implementation |
| application/modules/job/libraries/GenerateSitemaps.php | Client-overridable subclass |
| ecommercen/helpers/xml_helper.php | siteMap() and siteMapIndex() XML generators |
| ecommercen/eshop/models/Adv_product_category_model.php | Category tree and vendor-category queries |
| ecommercen/eshop/models/Adv_vendors_model.php | Vendor sitemap queries |
| ecommercen/eshop/models/Adv_lines_model.php | Vendor product lines |
| ecommercen/eshop/models/Adv_product_model.php | getProductsForSiteMap() |
| ecommercen/blog/models/Adv_blog_model.php | Blog article list |
| ecommercen/category/models/Adv_categories_model.php | CMS pages list |
| src/Storage/Storage.php | League Flysystem wrapper used to read/write chunks on local or S3 |
| application/config/storage.php | sitemap storage type: local (root = FCPATH, visibility public) or s3 (with public CDN URL) |
Code Flow
1. Initialization
- Resolves the current language abbreviation for URL generation.
- Lists every file under storage('sitemap')->listContents('sitemap') and deletes each one. This wipes the previous run's chunks and index before the new set is written. There is a brief window during regeneration where crawlers may see a 404 for a chunk URL they cached earlier. This is accepted given the daily cadence and matches the behaviour of the pre-storage-refactor implementation.
2. URL Collection
Each content type is collected independently and merged into a single URL array. Each URL entry is a map with loc, changefreq, and priority keys.
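As a small illustration of that entry shape and the merge step, here is a sketch. urlEntry() and mergeUrlSets() are hypothetical helper names, not functions from the job.

```php
<?php
/**
 * Build one sitemap entry with the three keys described above.
 * urlEntry() is a hypothetical helper for this sketch.
 */
function urlEntry(string $loc, string $changefreq, float $priority): array
{
    return ['loc' => $loc, 'changefreq' => $changefreq, 'priority' => $priority];
}

/**
 * Merge independently collected URL sets into one flat list.
 * The leading [] keeps array_merge valid when no sets are passed.
 */
function mergeUrlSets(array ...$sets): array
{
    return array_merge([], ...$sets);
}
```

An empty set (for example, a content type that is switched off) contributes nothing to the merged list.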
Static Pages (always included)
| URL | Frequency | Priority |
|---|---|---|
| Homepage (base_url()) | daily | 1.0 |
| Vendors listing (/vendors) | weekly | 1.0 |
Categories (gated by XML_FEEDS.IS_ENABLED_SITEMAP_CATEGORIES)
Traverses the published category tree up to 4 levels deep via product_category_model->getRecordsTreeMemory(). All published categories are added with:
- Change frequency: daily
- Priority: 0.9
Published category IDs are collected for use in the vendor-category intersection later.
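A depth-limited walk of this kind might look like the sketch below. The node shape (id, url, published, children) is an assumption made for illustration; the structure returned by getRecordsTreeMemory() is not described here. The sketch also assumes that an unpublished category hides its whole subtree.

```php
<?php
/**
 * Walk a nested category tree down to $maxDepth levels, returning sitemap
 * entries and appending the IDs of published categories to $publishedIds.
 * Node keys (id, url, published, children) are hypothetical.
 */
function collectCategories(array $nodes, array &$publishedIds, int $depth = 1, int $maxDepth = 4): array
{
    $urls = [];
    if ($depth > $maxDepth) {
        return $urls; // deeper levels are ignored
    }

    foreach ($nodes as $node) {
        if (empty($node['published'])) {
            continue; // assumption: skipping a node also skips its children
        }

        $publishedIds[] = $node['id'];
        $urls[] = ['loc' => $node['url'], 'changefreq' => 'daily', 'priority' => 0.9];

        $children = collectCategories($node['children'] ?? [], $publishedIds, $depth + 1, $maxDepth);
        $urls = array_merge($urls, $children);
    }

    return $urls;
}
```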
Products (gated by XML_FEEDS.IS_ENABLED_SITEMAP_PRODUCTS)
product_model->getProductsForSiteMap() returns all non-deleted, non-zero-price products ordered by: order count (DESC), active status (DESC), total stock (DESC).
Priority assignment:
| Condition | Priority |
|---|---|
| First 500 products (highest-selling) | 1.0 |
| Inactive products | 0.1 |
| Out-of-stock (no negative stock allowed) | 0.5 |
| All other products | 0.7 |
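The table can be read as a first-match rule set. The sketch below assumes the conditions are evaluated top to bottom (so a top-500 product keeps priority 1.0 even if it is inactive); that ordering and the field names is_active, stock, and allow_negative_stock are assumptions, not confirmed column names.

```php
<?php
/**
 * Priority for the product at a 0-based position in the sorted result set.
 * Assumes the table's rows are checked in order; field names are hypothetical.
 */
function productPriority(int $position, array $product): float
{
    if ($position < 500) {
        return 1.0; // first 500 products by sort order
    }
    if (empty($product['is_active'])) {
        return 0.1; // inactive
    }

    $outOfStock = ($product['stock'] ?? 0) <= 0 && empty($product['allow_negative_stock']);
    if ($outOfStock) {
        return 0.5; // no stock and negative stock not allowed
    }

    return 0.7; // everything else
}
```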
URL resolution: uses the product's url field if set; otherwise falls back to {vendors_base_url}/{vendor_slug}/{product_slug}.
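A minimal sketch of that fallback, assuming each product row carries url, vendor_slug, and product_slug keys (the key names are illustrative):

```php
<?php
/**
 * Resolve a product's sitemap URL: prefer the stored url field, otherwise
 * build {vendors_base_url}/{vendor_slug}/{product_slug}.
 * Array keys are assumptions for this sketch.
 */
function productUrl(array $product, string $vendorsBaseUrl): string
{
    if (!empty($product['url'])) {
        return $product['url'];
    }

    return rtrim($vendorsBaseUrl, '/') . '/' . $product['vendor_slug'] . '/' . $product['product_slug'];
}
```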
Vendors (gated by XML_FEEDS.IS_ENABLED_SITEMAP_VENDORS)
Three sub-collections:
Vendor pages via vendors_model->getVendorsForSiteMap():
- Top 15% of vendors: priority 1.0 (tracked as "top vendors")
- Remaining vendors: priority 0.8
- Exclusive vendors also get a /details page at priority 0.5
- Change frequency: weekly
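The vendor-page rules can be sketched as below. Several details are assumptions: the 15% cutoff is rounded up, vendors arrive already ordered by position, the /details page hangs off the vendor URL and shares the weekly frequency, and the keys id, slug, and is_exclusive are illustrative.

```php
<?php
/**
 * Build vendor page entries and record which vendors count as "top".
 * Cutoff rounding, URL layout, and array keys are assumptions.
 */
function vendorEntries(array $vendors, string $vendorsBaseUrl, array &$topVendorIds): array
{
    // ceil(count * 15%) using integer arithmetic to avoid float rounding.
    $topCount = intdiv(count($vendors) * 15 + 99, 100);

    $urls = [];
    foreach (array_values($vendors) as $position => $vendor) {
        $isTop = $position < $topCount;
        if ($isTop) {
            $topVendorIds[] = $vendor['id'];
        }

        $loc = rtrim($vendorsBaseUrl, '/') . '/' . $vendor['slug'];
        $urls[] = ['loc' => $loc, 'changefreq' => 'weekly', 'priority' => $isTop ? 1.0 : 0.8];

        if (!empty($vendor['is_exclusive'])) {
            $urls[] = ['loc' => $loc . '/details', 'changefreq' => 'weekly', 'priority' => 0.5];
        }
    }

    return $urls;
}
```

The collected $topVendorIds list is what the product-line and vendor-category steps would consult when choosing between their two priority levels.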
Vendor product lines via lines_model->getLinesForSiteMap():
- Lines belonging to top vendors: priority 0.8
- Other lines: priority 0.6
- URL: /vendors/{vendor_slug}/{line_slug}
Vendor-category pages via product_category_model->getVendorsCategoriesForSiteMap():
- Only for categories that were published (collected during category traversal)
- Top vendor categories: priority 0.8
- Other categories: priority 0.6
- URL: /vendors/{vendor_slug}/{category_slug}
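The vendor-category rules amount to a filter plus a priority lookup. In the sketch below, the row keys (vendor_id, category_id, vendor_slug, category_slug) are hypothetical, "top vendor categories" is read as categories belonging to a top vendor, and the weekly change frequency is an assumption because the rules above do not state one.

```php
<?php
/**
 * Keep only rows whose category was published, and boost rows that belong
 * to a top vendor. Row keys and the changefreq value are assumptions.
 */
function vendorCategoryEntries(array $rows, array $publishedCategoryIds, array $topVendorIds, string $vendorsBaseUrl): array
{
    // Flip to ID => index maps so membership checks are constant time.
    $published = array_flip($publishedCategoryIds);
    $top = array_flip($topVendorIds);

    $urls = [];
    foreach ($rows as $row) {
        if (!isset($published[$row['category_id']])) {
            continue; // category was not collected during the category traversal
        }

        $urls[] = [
            'loc' => rtrim($vendorsBaseUrl, '/') . '/' . $row['vendor_slug'] . '/' . $row['category_slug'],
            'changefreq' => 'weekly', // assumed, not stated in the rules
            'priority' => isset($top[$row['vendor_id']]) ? 0.8 : 0.6,
        ];
    }

    return $urls;
}
```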
Blog Articles (gated by XML_FEEDS.IS_ENABLED_SITEMAP_BLOG)
Published articles with blog_date <= today via blog_model->getBlogsList():
- Change frequency: weekly
- Priority: 0.7
- URL: /blog/{slug}
CMS Pages (gated by XML_FEEDS.IS_ENABLED_SITEMAP_STATIC_PAGES)
All CMS pages via categories_model->get_list():
- Change frequency: yearly
- Priority: 0.3
- URL: /category/{slug}
3. Chunk and Index Writing
- The merged URL array is split into chunks of 500 URLs each (array_chunk).
- Each chunk is passed to siteMap() (ecommercen/helpers/xml_helper.php), which generates a compliant <urlset> XML document using DOMDocument.
- Each chunk is written to sitemap/sitemap-{N}.xml via storage('sitemap')->put(...). The job also captures storage('sitemap')->url($chunkPath) for each chunk into a $chunkUrls array.
- Once all chunks are written, siteMapIndex($chunkUrls) builds the <sitemapindex> XML and the job writes it to sitemap/sitemap_index.xml.
There is no controller in the request path — both chunks and index are static files served directly by nginx (local-disk tenants) or by the CDN edge (K8s tenants).
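For orientation, DOMDocument-based generators of this kind could look like the sketch below. It is not the code in xml_helper.php: renderUrlset(), renderSitemapIndex(), and appendTextElement() are hypothetical names, and the sketch requires PHP's dom extension.

```php
<?php
const SITEMAP_NS = 'http://www.sitemaps.org/schemas/sitemap/0.9';

/** Append <name>text</name> to $parent; createTextNode escapes &, < and >. */
function appendTextElement(DOMDocument $doc, DOMNode $parent, string $name, string $text): void
{
    $child = $doc->createElement($name);
    $child->appendChild($doc->createTextNode($text));
    $parent->appendChild($child);
}

/** Render one chunk of entries (loc, changefreq, priority) as a <urlset>. */
function renderUrlset(array $urls): string
{
    $doc = new DOMDocument('1.0', 'utf-8');
    $doc->formatOutput = true;
    $root = $doc->appendChild($doc->createElementNS(SITEMAP_NS, 'urlset'));

    foreach ($urls as $entry) {
        $url = $root->appendChild($doc->createElement('url'));
        foreach (['loc', 'changefreq', 'priority'] as $key) {
            appendTextElement($doc, $url, $key, (string) $entry[$key]);
        }
    }

    return (string) $doc->saveXML();
}

/** Render the absolute chunk URLs as a <sitemapindex>. */
function renderSitemapIndex(array $chunkUrls): string
{
    $doc = new DOMDocument('1.0', 'utf-8');
    $doc->formatOutput = true;
    $root = $doc->appendChild($doc->createElementNS(SITEMAP_NS, 'sitemapindex'));

    foreach ($chunkUrls as $chunkUrl) {
        $sitemap = $root->appendChild($doc->createElement('sitemap'));
        appendTextElement($doc, $sitemap, 'loc', $chunkUrl);
    }

    return (string) $doc->saveXML();
}
```

Building text through createTextNode rather than string concatenation keeps query-string ampersands in URLs valid XML.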
XML Structure
Sitemap chunk (sitemap-0.xml):
```xml
<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <changefreq>daily</changefreq>
    <priority>1</priority>
  </url>
  ...
</urlset>
```
Sitemap index (sitemap_index.xml):
```xml
<?xml version="1.0" encoding="utf-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap/sitemap-0.xml</loc>
  </sitemap>
  ...
</sitemapindex>
```
Data Model
No dedicated tables. The job reads from:
| Table | Content Type |
|---|---|
| shop_product + shop_product_mui + shop_vendor_mui | Products |
| shop_product_category + shop_product_category_mui | Categories |
| shop_vendor + shop_vendor_mui | Vendors |
| shop_product_lines + shop_product_lines_mui | Product lines |
| shop_product_category_lp | Vendor-category intersections |
| blog + blog_mui | Blog articles |
| categories + categories_mui | CMS pages |
| tmp_shop_order_basket | Product popularity (joined for sort order) |
Configuration
Job Scheduling (application/config/jobs.php)
```php
['command' => 'GenerateSitemaps', 'schedule' => '30 1 * * *', 'graceTime' => 300, 'retryTimes' => 3]
```
Runs daily at 01:30 in the core queue.
Registry Settings (feature toggles)
| Group | Key | Purpose |
|---|---|---|
| XML_FEEDS | IS_ENABLED_SITEMAP_CATEGORIES | Include categories in sitemap |
| XML_FEEDS | IS_ENABLED_SITEMAP_PRODUCTS | Include products in sitemap |
| XML_FEEDS | IS_ENABLED_SITEMAP_VENDORS | Include vendors, lines, and vendor-categories |
| XML_FEEDS | IS_ENABLED_SITEMAP_BLOG | Include blog articles |
| XML_FEEDS | IS_ENABLED_SITEMAP_STATIC_PAGES | Include CMS pages |
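One way to picture how these toggles gate collection is the sketch below. It does not use the real registry API: the toggles are passed in as a plain array keyed by the setting names above, and collectUrls() is a hypothetical function name.

```php
<?php
/**
 * Merge the always-included static URLs with every collector whose toggle
 * is enabled. $toggles maps a registry key to a truthy/falsy value and
 * $collectors maps the same key to a callable that returns URL entries.
 */
function collectUrls(array $toggles, array $collectors, array $staticUrls): array
{
    $urls = $staticUrls; // homepage and vendors listing are not gated

    foreach ($collectors as $key => $collector) {
        if (!empty($toggles[$key])) {
            $urls = array_merge($urls, $collector());
        }
    }

    return $urls;
}
```

A disabled collector is never called, so its queries are skipped entirely rather than run and discarded.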
Output Location
Files are written to the dedicated sitemap storage type under the sitemap/ prefix. The actual location depends on the driver configured in application/config/storage.php:
- Local driver (single-server deployments): chunks and the index land at FCPATH/sitemap/sitemap-N.xml and FCPATH/sitemap/sitemap_index.xml. Nginx serves them as static files at /sitemap/....
- S3 driver (Kubernetes deployments): same paths but on s3://{SITEMAP_S3_BUCKET}/{FILES_S3_PREFIX}/sitemap/. The configured SITEMAP_S3_URL CDN endpoint serves them. Infra is expected to mount the CDN at https://{app-host}/sitemap/ (typically via CloudFront path-based routing) so the registered URL /sitemap/sitemap_index.xml resolves to the S3 object without changing the public URL.
Driver selection is per-tenant via the SITEMAP_STORAGE_DISK env var (local or s3).
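The selection logic could be pictured roughly as follows. This is a hypothetical shape, not the contents of application/config/storage.php: the returned array keys, the function name, and the fallback to local when the variable is unset are all assumptions. Only the environment variable names and the two driver values come from the description above.

```php
<?php
/**
 * Resolve a sitemap disk definition from environment values.
 * Array keys and the 'local' default are assumptions for this sketch.
 */
function sitemapDiskConfig(array $env, string $fcpath): array
{
    $driver = $env['SITEMAP_STORAGE_DISK'] ?? 'local';

    if ($driver === 'local') {
        // Files land under FCPATH/sitemap/ and are served by nginx.
        return ['driver' => 'local', 'root' => $fcpath, 'visibility' => 'public'];
    }

    if ($driver === 's3') {
        // Files land under {prefix}/sitemap/ in the bucket; 'url' is the CDN endpoint.
        return [
            'driver' => 's3',
            'bucket' => $env['SITEMAP_S3_BUCKET'],
            'prefix' => $env['FILES_S3_PREFIX'],
            'url' => $env['SITEMAP_S3_URL'],
        ];
    }

    throw new InvalidArgumentException("Unsupported sitemap disk: {$driver}");
}
```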
Client Extension Points
- Override the job class: Extend AdvGenerateSitemaps in application/modules/job/libraries/GenerateSitemaps.php to:
  - Add custom content types (e.g., event pages, landing pages)
  - Exclude specific categories or vendors
  - Change priority assignment logic
  - Modify URL patterns
- Override getUrls(): Add or remove content type methods entirely.
- Override individual getters: Override getProducts(), getCategories(), etc. to customize filtering, ordering, or priority logic for specific content types.
- Override XML helpers: Replace siteMap() and siteMapIndex() in application/helpers/xml_helper.php to change XML formatting (e.g., add <lastmod> timestamps or <image:image> extensions).
- Custom content types: Add new getter methods in a subclass and include them in a getUrls() override.
Business Rules
| Rule | Description |
|---|---|
| Full regeneration | The job lists everything under sitemap/ and deletes it before writing the new chunk set + index; no incremental updates |
| Static index | The sitemap index is written by the cron as a static file (sitemap/sitemap_index.xml) — no PHP request path |
| 500 URLs per file | Sitemaps are chunked at 500 URLs per file for crawler efficiency |
| Feature-gated content | Each content type can be independently enabled/disabled via registry |
| Priority-based ranking | Products sorted by popularity; first 500 get maximum priority |
| Exclusive vendor pages | Vendors with is_exclusive = 1 get an additional /details page |
| Top vendor boost | Top 15% of vendors (by position) receive priority 1.0 |
| Published content only | Categories must be published = 1; blog articles must have blog_date <= today |
| Non-zero price | Products with zero price are excluded |
| Non-deleted products | Soft-deleted products are excluded |
| Language-aware URLs | Product and category URLs use the configured language abbreviation |
Related Flows
- SY-01 Cron Job Framework -- job scheduling and execution
- AD-14 SEO Management -- broader SEO system including robots.txt and meta tags