Introduction

This document describes TownNews archive policies for BLOX CMS, covering how archive content is loaded, extracted, and migrated.

Requirements

All data provided to TownNews must be in a format that software can parse for inclusion in our various product lines. Each product series may have different data requirements, which are outlined throughout this document.

Editorial

Ingestion

Standard formats for data ingestion into BLOX CMS include:

  1. NITF - an XML format for newspaper data exchange. The standard is managed by the IPTC and the full specification can be read online.

  2. All other document formats will need to meet the following requirements:

  • Unique Document ID - a unique document ID is required for every article. The unique ID is used to map documents from either an existing website or an editorial system. Note that the unique ID is what governs the automated URL migration of pages for a site.

  • Start Date/Time - we require the start date and time of an article to place the document correctly in the site's archive. Without this information, a document is treated as published on the date of ingestion.

  • Article File Formats - ideally, articles are provided in an XML format, one file per article. If possible, a schema file should be provided to explain the structure of the format. We can also accept a MySQL database dump, but only when it is accompanied by documentation explaining the layout and behavior of the database.

  • Media File Formats - images must be in JPEG, PNG, or GIF format. We also support Flash (SWF) files and video files in FLV or MP4 format. Audio files can only be provided in MP3 format.

  • AP Removal - if you have old content that is marked ‘(AP)’ or can otherwise be clearly identified as Associated Press content, we can remove it upon request. For some newspapers, depending on how the content was loaded, archiving it may violate the terms of use with AP. Newspapers should check with their AP Bureau chief for more information.
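As an illustration of the requirements above, the following is a minimal sketch of a pre-submission check for one XML article file. The element names (`id`, `start`, `media`), the ISO 8601 date format, and the exact file extensions are assumptions made for this example; they are not a TownNews schema.

```python
from datetime import datetime
from pathlib import PurePosixPath
import xml.etree.ElementTree as ET

# Extensions assumed for the media formats listed above
# (JPEG, PNG, GIF, SWF, FLV, MP4, MP3).
ALLOWED_MEDIA = {".jpg", ".jpeg", ".png", ".gif", ".swf", ".flv", ".mp4", ".mp3"}


def check_article(xml_text):
    """Return a list of problems found in one hypothetical article file."""
    problems = []
    root = ET.fromstring(xml_text)

    # Hypothetical element name for the unique document ID.
    doc_id = (root.findtext("id") or "").strip()
    if not doc_id:
        problems.append("missing unique document ID")

    # Hypothetical element name for the start date/time (ISO 8601 assumed).
    start = (root.findtext("start") or "").strip()
    try:
        datetime.fromisoformat(start)
    except ValueError:
        problems.append("missing or unparseable start date/time")

    # Hypothetical element name for attached media file names.
    for media in root.findall("media"):
        name = (media.text or "").strip()
        if PurePosixPath(name).suffix.lower() not in ALLOWED_MEDIA:
            problems.append(f"unsupported media format: {name}")

    return problems
```

An empty list means the file passed these three checks; it does not mean the file is guaranteed to load.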

Extraction

We will provide data extraction of editorial content in BLOX CMS archive format (JSON files containing metadata about the asset in question) free of charge.

NITF (XML files) is available as an alternative archive format, but it is subject to a processing fee.

Automated URL Migration

When moving sites from third-party content management systems, the BLOX CMS platform provides a mechanism that automates the task of updating the URLs that search engines hold for the site.

The process begins when a user requests a legacy URL, which would normally produce a 404 or 415 error message. On that page, a snippet of code checks whether the old URL resolves to a unique document ID from the previous site. If a match is found, a 301 redirect is issued instead of an error page, sending the user (or search engine) to the new page. Because the response is a 301 (permanent) redirect, search engines such as Google and Yahoo! will automatically update their indexes to point to the new URL.
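The lookup described above can be sketched roughly as follows. This is not the BLOX CMS implementation: the mapping table, the URL shapes, and the rule that the legacy ID is the last run of digits in the path are all assumptions made for illustration.

```python
import re

# Hypothetical mapping from legacy document IDs to new page URLs.
LEGACY_TO_NEW = {
    "12345": "/news/local/article_12345.html",
}


def resolve_legacy_url(path):
    """Return (status, location) for a request that would otherwise be an error."""
    # Assumption: the legacy document ID is the last run of digits in the path.
    ids = re.findall(r"\d+", path)
    if ids:
        new_url = LEGACY_TO_NEW.get(ids[-1])
        if new_url:
            # Permanent redirect, so search engines replace the old URL.
            return 301, new_url
    # No matching document: fall through to the normal error page.
    return 404, None
```

For example, `resolve_legacy_url("/stories/2009/03/12345.html")` yields a 301 to the new article URL, while a path with no known ID falls through to the error page.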

For this process to work for your site, the unique document ID from the legacy site must be incorporated, in some form, into the new ID on your BLOX CMS site. You will also need to make sure that legacy URLs contain enough information to identify the new document in BLOX CMS.
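One possible way to carry a legacy ID inside a new ID is sketched below. The `legacy-` prefix is a hypothetical convention chosen for this example, not a BLOX CMS rule; the point is only that the legacy ID stays recoverable from the new ID.

```python
PREFIX = "legacy-"  # hypothetical marker, not a BLOX CMS convention


def to_new_id(legacy_id):
    """Build a new document ID that embeds the legacy ID."""
    return f"{PREFIX}{legacy_id}"


def to_legacy_id(new_id):
    """Recover the legacy ID, or None if the ID does not carry one."""
    if new_id.startswith(PREFIX):
        return new_id[len(PREFIX):]
    return None
```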