Draft:International Standard Content Code

Novel hash-based identification system for digital objects

From Wikipedia, the free encyclopedia


The International Standard Content Code (ISCC) is a similarity-aware code for digital media assets including text, images, audio, and video. Unlike traditional identifiers such as ISBN or DOI, which are assigned by registration authorities, ISCCs are generated algorithmically from the digital content itself.[1] This allows independent parties to derive identical codes from the same file without central coordination.

  • Comment: Sources do not support the text. E.g. source 13, https://iscc.codes/concept does not support base32. The CODE and UNIT composition is also not directly mentioned. ChrysGalley (talk) 19:44, 9 February 2026 (UTC)
  • Comment: In accordance with the Wikimedia Foundation's Terms of Use, I disclose that I have been paid by my employer for my contributions to this article. Etma 1222 (talk) 09:35, 19 January 2026 (UTC)

An ISCC is a composite code consisting of multiple units that capture different characteristics of content, from metadata and semantic meaning to perceptual features and exact byte sequences. The system is designed to preserve similarity relationships: content that is perceptually similar produces similar codes, enabling near-duplicate detection and content matching across different formats and encodings.[2]

History

Development

The ISCC was invented by Titusz Pan (ORCID 0000-0002-0521-4214[3]), a software developer specializing in content identification and digital media processing.[4][5] Development began in 2016 as part of the Content Blockchain Project, a collaborative initiative funded by the Google Digital News Initiative and undertaken by a consortium of publishing, legal, and technology companies including Zeit Online, Golem.de, and dpa.[6]

The project aimed to explore how blockchain technologies and content identification could support journalism and digital media authenticity. Pan designed the ISCC as an open, decentralized system for content identification where the code is generated directly from the content itself rather than assigned by a central authority.[7] The project was recognized in 2019 when it won one of Germany's inaugural Digital Publishing Awards at the Leipzig Book Fair, with the jury praising it as "one of the few blockchain projects to have an immediate utility for the publishing industry."[8]

Standardization

In May 2019, the International Organization for Standardization (ISO) established Working Group 18 within Technical Committee 46, Subcommittee 9 (TC 46/SC 9, Identification and description) to develop the ISCC as an international standard.[9]

After review by national standards bodies worldwide, ISO 24138:2024 Information and documentation — International Standard Content Code (ISCC) was officially published in May 2024.[1] The standard defines the syntax, structure, and algorithms for generating ISCC codes and describes their use in conjunction with existing identifier schemes such as DOI, ISAN, ISBN, ISRC, ISSN, and ISWC.[10]

Industry adoption

The ISCC was incorporated into ONIX, the international standard for book trade metadata maintained by EDItEUR, in July 2020.[11] This enables publishers to include content-derived identifiers alongside traditional product identifiers such as ISBN in their metadata feeds.

In May 2024, the ISCC was added to the Coalition for Content Provenance and Authenticity (C2PA) approved soft binding algorithm list as the first fingerprint-based algorithm on the registry.[12] C2PA soft binding algorithms enable recovery of content provenance manifests when metadata has been stripped from media assets.

Structure

An ISCC-CODE is a composite code assembled from multiple ISCC-UNITs, each derived from different aspects of the digital content.[13]

Units

The standard defines five unit types:

  • Meta-Code: A similarity hash generated from referent metadata: name (required), description (optional), and meta (optional). Enables clustering of digital assets and discovery of items with similar metadata.
  • Semantic-Code: Reserved in ISO 24138 but not yet standardized. Experimental implementations use deep learning embeddings to capture semantic meaning of text and images.
  • Content-Code: Modality-specific perceptual fingerprints for text, images, audio, and video.
  • Data-Code: A similarity-preserving hash of raw file bytes using content-defined chunking with MinHash.
  • Instance-Code: A BLAKE3 cryptographic hash of the file for exact data integrity verification.
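The contrast between the Data-Code and the Instance-Code can be illustrated with a toy sketch. It is not the ISO 24138 algorithm: the gear-style rolling hash, the chunk-size mask, the one-bit-per-hash compression, and all function names are assumptions chosen for illustration, and BLAKE2b from the Python standard library stands in for BLAKE3.

```python
import hashlib

# Per-byte values for the rolling ("gear") hash, derived deterministically
# so the sketch produces the same result on every run.
GEAR = [int.from_bytes(hashlib.sha256(bytes([b])).digest()[:4], "big")
        for b in range(256)]


def cdc_chunks(data: bytes, mask: int = 0x3F) -> list[bytes]:
    """Content-defined chunking: cut wherever the rolling hash matches a pattern."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if h & mask == 0:  # boundary depends on content, not on position
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def toy_data_code(data: bytes, bits: int = 64) -> bytes:
    """MinHash over the chunks, compressed to one bit per hash function."""
    chunks = cdc_chunks(data) or [b""]
    code = 0
    for seed in range(bits):
        salt = seed.to_bytes(2, "big")
        minimum = min(hashlib.blake2b(salt + c, digest_size=8).digest()
                      for c in chunks)
        code = (code << 1) | (minimum[-1] & 1)
    return code.to_bytes(bits // 8, "big")


def toy_instance_code(data: bytes) -> bytes:
    """Exact-match hash. Stand-in only: BLAKE2b, not the BLAKE3 named above."""
    return hashlib.blake2b(data, digest_size=32).digest()
```

Because chunk boundaries follow the content, a small edit tends to change only the chunks near it, so most MinHash minima are likely to survive. The exact-match hash changes completely on any edit.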

Each ISCC-UNIT can be used standalone or combined into an ISCC-CODE. Units support variable bit-lengths (64 to 256 bits in 32-bit increments) while remaining prefix-compatible, allowing applications to choose between compact codes and higher precision.[1] A minimum valid ISCC-CODE requires the Data-Code and Instance-Code; the other units are optional.[1]
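Prefix compatibility means a shorter unit is the leading part of a longer unit for the same content. A minimal sketch, assuming unit bodies are handled as raw bytes (the helper names are hypothetical and not part of any ISCC library):

```python
def truncate_unit(body: bytes, bits: int) -> bytes:
    """Shorten a unit body to `bits`, using the 64 to 256 bit range given above."""
    if bits % 32 != 0 or not 64 <= bits <= 256:
        raise ValueError("bit length must be 64..256 in steps of 32")
    if bits > len(body) * 8:
        raise ValueError("cannot lengthen a unit body")
    return body[: bits // 8]


def common_prefix(a: bytes, b: bytes) -> tuple[bytes, bytes]:
    """Cut two unit bodies to their shared length so they can be compared."""
    n = min(len(a), len(b))
    return a[:n], b[:n]
```

An application that stores 256-bit units could therefore still match them against 64-bit units from another party by comparing only the first 8 bytes.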

Format

Each ISCC-UNIT consists of a 2-byte header encoding the main type, subtype, version, and bit length, followed by a variable-length body containing the fingerprint data (typically 8 bytes per unit). Multiple units are concatenated and encoded using Base32 for the canonical string representation.[13]

A typical ISCC-CODE appears as:

ISCC:KACYPXW445FTYNJ3CYSXHAFJMA2HUWULUNRFE3BLHRSCXYH2M5AEGQY
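The example string can be unpacked with the Python standard library. The sketch below assumes the string is unpadded RFC 4648 Base32 and that each of the four header fields occupies one 4-bit nibble; neither detail is stated above, and `decode_iscc` is a hypothetical helper rather than a function from the reference implementation.

```python
import base64


def decode_iscc(iscc: str) -> tuple[tuple[int, int, int, int], bytes]:
    """Split an ISCC string into four header nibbles and the raw body."""
    text = iscc.removeprefix("ISCC:")
    # Restore the "=" padding that the canonical form omits.
    padded = text + "=" * (-len(text) % 8)
    raw = base64.b32decode(padded)
    header, body = raw[:2], raw[2:]
    nibbles = (header[0] >> 4, header[0] & 0x0F,
               header[1] >> 4, header[1] & 0x0F)
    return nibbles, body


fields, body = decode_iscc(
    "ISCC:KACYPXW445FTYNJ3CYSXHAFJMA2HUWULUNRFE3BLHRSCXYH2M5AEGQY"
)
print(fields, len(body))  # (5, 0, 0, 5) 32
```

Under these assumptions the example decodes to a 2-byte header followed by 32 bytes of body, which is consistent with four 8-byte unit bodies sharing a single header.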

Similarity preservation

Unlike cryptographic hashes where any change produces a completely different output, the ISCC is designed so that similar content produces similar codes. The Hamming distance between two ISCC-UNITs correlates with the degree of similarity between the underlying content, enabling applications such as near-duplicate detection and content clustering.[14]
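The Hamming distance is the number of bit positions in which two equal-length codes differ. A minimal sketch, assuming the unit bodies are available as raw bytes (the function names are illustrative):

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Count differing bits between two equal-length unit bodies."""
    if len(a) != len(b) or not a:
        raise ValueError("bodies must be non-empty and of equal length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def similarity(a: bytes, b: bytes) -> float:
    """Map the distance to a score from 0.0 (all bits differ) to 1.0 (identical)."""
    return 1 - hamming_distance(a, b) / (len(a) * 8)
```

For two 64-bit bodies that differ in 2 bits, `similarity` returns 1 - 2/64, or about 0.97. What score counts as a match is an application decision and is not specified here.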

Comparison with other identifiers

Identifier | Scope                | Assignment                        | Similarity detection
ISBN       | Book editions        | Assigned by national agencies     | No
DOI        | Scholarly works      | Assigned by registration agencies | No
ISRC       | Sound recordings     | Assigned by national agencies     | No
ISWC       | Musical compositions | Assigned via collecting societies | No
ISCC       | Digital media files  | Generated from content            | Yes

Traditional content identifiers are assigned by registration authorities and identify abstract works or specific editions. The ISCC identifies the digital manifestation itself and can be generated by anyone with access to the file. A single book edition (one ISBN) may correspond to numerous ISCCs representing different digital formats, compression levels, or excerpts.[2]

The ISCC is designed to complement rather than replace existing identifiers. ISO 24138 specifically describes the use of ISCC in conjunction with DOI, ISAN, ISBN, ISRC, ISSN, and ISWC.[1]

Applications

The ISCC supports various use cases in digital content management:[14]

  • Content deduplication: Identification of exact and near-duplicate content for database synchronization and storage optimization
  • Integrity verification: Detection of unauthorized modifications using the Instance-Code
  • Version tracking: Identification of relationships between different versions of content
  • Rights management: Registration of content declarations with timestamped, cryptographically signed assertions[2]
  • AI training data management: Researchers at the Fraunhofer Institute for Secure Information Technology (Fraunhofer SIT) proposed the ISCC as a robust hashing method for EU AI Act compliance, enabling identification of AI-generated content and management of text and data mining opt-out declarations.[15]
  • Media verification: Source discovery and anomaly detection for fact-checking applications
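The deduplication and integrity-verification cases can be combined in one lookup step. The sketch below is a hypothetical illustration: the `Record` type, the `classify` function, and the threshold of 8 differing bits are assumptions, not values from ISO 24138.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    name: str
    data_code: bytes      # similarity-preserving unit body
    instance_code: bytes  # exact-match hash


def classify(query: Record, candidate: Record, max_distance: int = 8) -> str:
    """Label a candidate as 'exact', 'near-duplicate', or 'different'."""
    if query.instance_code == candidate.instance_code:
        return "exact"  # byte-identical files
    # Assumes both data codes have the same length; zip stops at the shorter one.
    distance = sum(bin(x ^ y).count("1")
                   for x, y in zip(query.data_code, candidate.data_code))
    return "near-duplicate" if distance <= max_distance else "different"
```

A matching Instance-Code confirms that the file is unmodified, while a close Data-Code with a different Instance-Code suggests a modified or re-encoded copy.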

CommonsDB

CommonsDB is a European Commission-funded pilot project using ISCC as the primary identifier for a public registry of rights information for public domain and openly licensed works.[16] Led by Open Future in collaboration with the Europeana Foundation, Wikimedia Sverige, and the Institute for Information Law, the project enables users to verify the rights status of content by generating an ISCC from a media file and searching for matching declarations.[17][18]

Implementation

Reference implementation

Titusz Pan maintains the official reference implementation of ISO 24138 as open-source software on GitHub under the ISCC Foundation organization.[19] The core repositories include:

Repository      | Description
iscc-core[20]   | Reference implementation of ISO 24138 algorithms
iscc-sdk[21]    | High-level Python SDK for ISCC generation
iscc-schema[22] | JSON Schema definitions and metadata models
iscc-web[23]    | REST API service powering the public demo

Additional repositories provide extended functionality:

Repository    | Description
iscc-sct[24]  | Semantic Text-Code generator using deep learning
iscc-sci[25]  | Semantic Image-Code generator
iscc-ieps[26] | ISCC Enhancement Proposals (specifications)

Third-party implementations

Community implementations include iscc-core-ts,[27] a TypeScript port for JavaScript environments.[19]

ISCC Foundation

The ISCC Foundation (Stichting ISCC) is an independent international nonprofit organization based in Hengelo, Netherlands, established as a Dutch Stichting.[28] The foundation serves as the steward of the ISCC ecosystem, with activities including:[29]

  • Research and development of content identification technology
  • Standardization work with bodies including ISO and W3C
  • Maintenance of open-source reference implementations
  • Community engagement and adoption promotion

The foundation's advisory board includes representatives from the National Information Standards Organization (NISO), the publishing industry, library science, and digital rights management sectors.[28]

