BLOOM (language model)
Multilingual open-access large language model
The BigScience Large Open-science Open-access Multilingual Language Model (BLOOM) is an open-access large language model (LLM) released in 2022.[1] It was created by a volunteer-driven research effort to provide a transparently created alternative to proprietary AI models.[2]
| BLOOM | |
|---|---|
| Original author | BigScience research workshop |
| Initial release | July 12, 2022 |
| Written in | Python |
| License | BigScience Responsible AI License (RAIL) v1.0 |
| Website | bigscience |
| Repository | huggingface |
With 176 billion parameters, BLOOM is a transformer-based autoregressive model designed to generate text in 46 natural languages and 13 programming languages. The model is distributed under the project's "Responsible AI License".[3]
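To give a sense of scale for the 176 billion parameters, the following is a back-of-the-envelope sketch of how much memory the weights alone would occupy at a few common numeric precisions. The bytes-per-parameter values and the helper name `weight_memory_gb` are illustrative assumptions, not figures taken from the BLOOM documentation, and the estimate ignores activations, optimizer state, and any runtime overhead.

```python
# Rough weight-memory estimate for a 176-billion-parameter model.
# Assumption: every parameter is stored at the same precision, and
# 1 GB is taken as 10**9 bytes.
PARAMS = 176_000_000_000

# Storage width in bytes for a few common numeric formats (illustrative).
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}


def weight_memory_gb(params: int, bytes_per_param: int) -> float:
    """Return the memory needed for the weights alone, in gigabytes."""
    return params * bytes_per_param / 1e9


for dtype, width in BYTES_PER_PARAM.items():
    print(f"{dtype}: {weight_memory_gb(PARAMS, width):,.0f} GB")
```

Under these assumptions, the weights need several hundred gigabytes even at 16-bit precision, which is more than a single commodity accelerator typically holds.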
Development
BLOOM is the main outcome of the BigScience initiative, a one-year-long research workshop.[2] The project was coordinated by Hugging Face using funding from the French government and involved several hundred volunteer researchers and engineers from academia and the private sector.[2] The model was trained between March and July 2022 on the Jean Zay public supercomputer in France, managed by GENCI and IDRIS (CNRS).[4][5] Unlike GPT-3, BLOOM was trained to be multilingual.[3][6]
The source code is released under the Apache 2.0 license. The model's parameters are released under BigScience's "Responsible AI License" (RAIL), which grants open access and reuse rights subject to some usage restrictions.[7][8]
BLOOM was used in the chatbots BLOOMChat and HuggingChat due to its multilingual abilities.[6]
BLOOM's training corpus, named ROOTS, combines data extracted from the then-latest version of the web-based OSCAR corpus (38% of ROOTS) with newly collected data extracted from a manually selected and documented list of language data sources. In total, the model was trained on approximately 366 billion tokens (1.6 TB of text).[9][10] It was developed using the open-source libraries DeepSpeed and Megatron.[3]
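The corpus figures above can be related with simple arithmetic. The sketch below assumes that 1 TB means 10^12 bytes, that the 38% OSCAR share is measured by data size, and that the token count and the 1.6 TB figure describe the same data; none of these assumptions is confirmed by the text, so the results are only rough orders of magnitude.

```python
# Back-of-the-envelope arithmetic on the ROOTS figures quoted above.
# Assumptions (not confirmed by the article): 1 TB = 10**12 bytes, the
# 38% OSCAR share is by data size, and both totals cover the same data.
CORPUS_BYTES = 1.6e12      # approximately 1.6 TB
TRAINING_TOKENS = 366e9    # approximately 366 billion tokens
OSCAR_SHARE = 0.38         # OSCAR-derived fraction of ROOTS

# Average number of bytes of text represented by one token.
bytes_per_token = CORPUS_BYTES / TRAINING_TOKENS

# Approximate size of the OSCAR-derived and newly collected portions.
oscar_tb = OSCAR_SHARE * CORPUS_BYTES / 1e12
other_tb = (1 - OSCAR_SHARE) * CORPUS_BYTES / 1e12

print(f"bytes per token: {bytes_per_token:.2f}")
print(f"OSCAR-derived:   {oscar_tb:.2f} TB")
print(f"other sources:   {other_tb:.2f} TB")
```

Under these assumptions, each token stands for a little over four bytes of text on average, and the OSCAR-derived portion is roughly 0.6 TB.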
BigScience then released xP3, a multilingual dataset for LLM supervised learning. It also released BLOOMZ, a variant of BLOOM fine-tuned on xP3 to follow instructions.[11]