Draft:International Standard Content Code
Novel hash-based identification system for digital objects
The International Standard Content Code (ISCC) is a similarity-aware code for digital media assets including text, images, audio, and video. Unlike traditional identifiers such as ISBN or DOI, which are assigned by registration authorities, ISCCs are generated algorithmically from the digital content itself.[1] This allows independent parties to derive identical codes from the same file without central coordination.
Submission declined on 9 February 2026 by ChrysGalley (talk).
Comment: Sources do not support the text. E.g. source 13, https://iscc.codes/concept does not support base32. The CODE and UNIT composition is also not directly mentioned. ChrysGalley (talk) 19:44, 9 February 2026 (UTC)
Comment: In accordance with the Wikimedia Foundation's Terms of Use, I disclose that I have been paid by my employer for my contributions to this article. Etma 1222 (talk) 09:35, 19 January 2026 (UTC)
An ISCC is a composite code consisting of multiple units that capture different characteristics of content, from metadata and semantic meaning to perceptual features and exact byte sequences. The system is designed to preserve similarity relationships: content that is perceptually similar produces similar codes, enabling near-duplicate detection and content matching across different formats and encodings.[2]
History
Development
The ISCC was invented by Titusz Pan (ORCID 0000-0002-0521-4214[3]), a software developer specializing in content identification and digital media processing.[4][5] Development began in 2016 as part of the Content Blockchain Project, a collaborative initiative funded by the Google Digital News Initiative and undertaken by a consortium of publishing, legal, and technology companies including Zeit Online, Golem.de, and dpa.[6]
The project aimed to explore how blockchain technologies and content identification could support journalism and digital media authenticity. Pan designed the ISCC as an open, decentralized system for content identification where the code is generated directly from the content itself rather than assigned by a central authority.[7] The project was recognized in 2019 when it won one of Germany's inaugural Digital Publishing Awards at the Leipzig Book Fair, with the jury praising it as "one of the few blockchain projects to have an immediate utility for the publishing industry."[8]
Standardization
In May 2019, the International Organization for Standardization (ISO) established Working Group 18 within Technical Committee 46, Subcommittee 9 (TC 46/SC 9, Identification and description) to develop the ISCC as an international standard.[9]
After review by national standards bodies worldwide, ISO 24138:2024 Information and documentation — International Standard Content Code (ISCC) was officially published in May 2024.[1] The standard defines the syntax, structure, and algorithms for generating ISCCs and describes their use in conjunction with existing identifier schemes such as DOI, ISAN, ISBN, ISRC, ISSN, and ISWC.[10]
Industry adoption
The ISCC was incorporated into ONIX, the international standard for book trade metadata maintained by EDItEUR, in July 2020.[11] This enables publishers to include content-derived identifiers alongside traditional product identifiers such as ISBN in their metadata feeds.
In May 2024, the ISCC was added to the Coalition for Content Provenance and Authenticity (C2PA) approved soft binding algorithm list as the first fingerprint-based algorithm on the registry.[12] C2PA soft binding algorithms enable recovery of content provenance manifests when metadata has been stripped from media assets.
Structure
An ISCC-CODE is a composite code assembled from multiple ISCC-UNITs, each derived from different aspects of the digital content.[13]
Units
The standard defines five unit types:
- Meta-Code: A similarity hash generated from referent metadata: name (required), description (optional), and meta (optional). Enables clustering of digital assets and discovery of items with similar metadata.
- Semantic-Code: Reserved in ISO 24138 but not yet standardized. Experimental implementations use deep learning embeddings to capture semantic meaning of text and images.
- Content-Code: Modality-specific perceptual fingerprints:
  - Text-Code: Based on normalized character n-grams
  - Image-Code: Based on perceptual hashing
  - Audio-Code: Based on Chromaprint fingerprinting
  - Video-Code: Based on MPEG-7 video frame signatures
- Data-Code: A similarity-preserving hash of raw file bytes using content-defined chunking with MinHash.
- Instance-Code: A BLAKE3 cryptographic hash of the file for exact data integrity verification.
Each ISCC-UNIT can be used standalone or combined into an ISCC-CODE. Units support variable bit-lengths (64 to 256 bits in 32-bit increments) while remaining prefix-compatible, allowing applications to choose between compact codes and higher precision.[1] A minimum valid ISCC-CODE requires the Data-Code and Instance-Code; the other units are optional.[1]
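The variable bit-lengths described above can be illustrated with a minimal sketch. It assumes that "prefix-compatible" means a shorter unit body is simply the leading bytes of a longer one; that reading, the function names `truncate_body` and `distance_at_common_length`, and the sample body are illustrative assumptions, and ISO 24138 remains the authoritative description.

```python
# Allowed body lengths per the text above: 64 to 256 bits in 32-bit steps.
VALID_BIT_LENGTHS = range(64, 257, 32)


def truncate_body(body: bytes, bits: int) -> bytes:
    """Shorten a unit body to `bits` bits by keeping its leading bytes (hypothetical helper)."""
    if bits not in VALID_BIT_LENGTHS:
        raise ValueError(f"unsupported bit length: {bits}")
    if len(body) * 8 < bits:
        raise ValueError("body is shorter than the requested length")
    return body[: bits // 8]


def distance_at_common_length(a: bytes, b: bytes) -> int:
    """Hamming distance over the leading bytes that both bodies share."""
    n = min(len(a), len(b))
    return sum(bin(x ^ y).count("1") for x, y in zip(a[:n], b[:n]))


long_body = bytes(range(32))                  # made-up 256-bit body, not a real fingerprint
short_body = truncate_body(long_body, 64)     # compact 64-bit variant
print(short_body.hex(), distance_at_common_length(long_body, short_body))
```

Under this assumption, a compact code stored by one application can still be compared against a longer code held by another, at the cost of lower precision.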
Format
Each ISCC-UNIT consists of a 2-byte header encoding the main type, subtype, version, and bit length, followed by a variable-length body containing the fingerprint data (typically 8 bytes per unit). Multiple units are concatenated and encoded using Base32 for the canonical string representation.[13]
A typical ISCC-CODE appears as:
ISCC:KACYPXW445FTYNJ3CYSXHAFJMA2HUWULUNRFE3BLHRSCXYH2M5AEGQY
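The example string can be unpacked with a short sketch using only the Python standard library. It assumes the canonical form is unpadded RFC 4648 Base32 after the `ISCC:` prefix and that the first two bytes form the header; reading the header as four 4-bit fields in the order given in the text is a simplifying assumption, and `decode_iscc` is a hypothetical helper rather than part of any official library.

```python
import base64


def decode_iscc(iscc: str) -> tuple[bytes, bytes]:
    """Split a canonical ISCC string into (header, body) bytes (hypothetical helper)."""
    code = iscc.removeprefix("ISCC:")          # requires Python 3.9+
    padded = code + "=" * (-len(code) % 8)     # restore the padding b32decode expects
    raw = base64.b32decode(padded)
    return raw[:2], raw[2:]


header, body = decode_iscc(
    "ISCC:KACYPXW445FTYNJ3CYSXHAFJMA2HUWULUNRFE3BLHRSCXYH2M5AEGQY"
)

# Assumption: one 4-bit nibble per header field
# (main type, subtype, version, length), in that order.
nibbles = [header[0] >> 4, header[0] & 0x0F, header[1] >> 4, header[1] & 0x0F]

print(header.hex(), nibbles, len(body))
```

For this example the body is 32 bytes long, which is consistent with four fingerprints of 8 bytes each.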
Similarity preservation
Unlike cryptographic hashes, where any change to the input produces a completely different output, the ISCC is designed so that similar content produces similar codes. The Hamming distance between two ISCC-UNITs of the same type correlates with the degree of similarity between the underlying content, enabling applications such as near-duplicate detection and content clustering.[14]
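The general idea can be shown with a toy similarity hash. The sketch below is a generic SimHash over character n-grams, chosen only for illustration; it is not the algorithm specified in ISO 24138, and the function names, the n-gram size, and the sample sentences are assumptions.

```python
import hashlib


def toy_similarity_hash(text: str, ngram: int = 4, bits: int = 64) -> int:
    """Toy SimHash over character n-grams (illustrative, NOT the ISO 24138 algorithm)."""
    normalized = " ".join(text.lower().split())   # lowercase, collapse whitespace
    counts = [0] * bits
    for i in range(max(len(normalized) - ngram + 1, 1)):
        gram = normalized[i : i + ngram].encode("utf-8")
        digest = hashlib.blake2b(gram, digest_size=bits // 8).digest()
        h = int.from_bytes(digest, "big")
        # Each n-gram votes +1 or -1 on every bit position.
        for bit in range(bits):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    # A bit is set when the majority of n-grams voted for it.
    return sum(1 << bit for bit in range(bits) if counts[bit] > 0)


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


original = ("The International Standard Content Code is a similarity-aware "
            "identifier for digital media assets such as text, images, audio, and video.")
edited = original.replace("audio", "sound")
unrelated = ("Quarterly rainfall totals in the northern valleys exceeded every "
             "forecast issued by the regional weather office last spring.")

near = hamming_distance(toy_similarity_hash(original), toy_similarity_hash(edited))
far = hamming_distance(toy_similarity_hash(original), toy_similarity_hash(unrelated))
print(near, far)
```

A lightly edited text should land much closer to the original than an unrelated text does, whereas a cryptographic hash of the two versions would differ in roughly half of its bits.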
Comparison with other identifiers
| Identifier | Scope | Assignment | Similarity detection |
|---|---|---|---|
| ISBN | Book editions | Assigned by national agencies | No |
| DOI | Scholarly works | Assigned by registration agencies | No |
| ISRC | Sound recordings | Assigned by national agencies | No |
| ISWC | Musical compositions | Assigned via collecting societies | No |
| ISCC | Digital media files | Generated from content | Yes |
Traditional content identifiers are assigned by registration authorities and identify abstract works or specific editions. The ISCC identifies the digital manifestation itself and can be generated by anyone with access to the file. A single book edition (one ISBN) may correspond to numerous ISCCs representing different digital formats, compression levels, or excerpts.[2]
The ISCC is designed to complement rather than replace existing identifiers. ISO 24138 specifically describes the use of ISCC in conjunction with DOI, ISAN, ISBN, ISRC, ISSN, and ISWC.[1]
Applications
The ISCC supports various use cases in digital content management:[14]
- Content deduplication: Identification of exact and near-duplicate content for database synchronization and storage optimization
- Integrity verification: Detection of unauthorized modifications using the Instance-Code
- Version tracking: Identification of relationships between different versions of content
- Rights management: Registration of content declarations with timestamped, cryptographically signed assertions[2]
- AI training data management: Researchers at the Fraunhofer Institute for Secure Information Technology (Fraunhofer SIT) proposed the ISCC as a robust hashing method for EU AI Act compliance, enabling identification of AI-generated content and management of text and data mining opt-out declarations.[15]
- Media verification: Source discovery and anomaly detection for fact-checking applications
CommonsDB
CommonsDB is a European Commission-funded pilot project using ISCC as the primary identifier for a public registry of rights information for public domain and openly licensed works.[16] Led by Open Future in collaboration with the Europeana Foundation, Wikimedia Sverige, and the Institute for Information Law, the project enables users to verify the rights status of content by generating an ISCC from a media file and searching for matching declarations.[17][18]
Implementation
Reference implementation
Titusz Pan maintains the official reference implementation of ISO 24138 as open-source software on GitHub under the ISCC Foundation organization.[19] The core repositories include:
Additional repositories provide extended functionality:
Third-party implementations
Community implementations include iscc-core-ts,[27] a TypeScript port for JavaScript environments.[19]
ISCC Foundation
The ISCC Foundation (Stichting ISCC) is an independent international nonprofit organization based in Hengelo, Netherlands, established as a Dutch Stichting.[28] The foundation serves as the steward of the ISCC ecosystem, with activities including:[29]
- Research and development of content identification technology
- Standardization work with bodies including ISO and W3C
- Maintenance of open-source reference implementations
- Community engagement and adoption promotion
The foundation's advisory board includes representatives from the National Information Standards Organization (NISO), the publishing industry, library science, and digital rights management sectors.[28]
