Large language models (LLMs) are deep learning models that have been intensively pre-trained on vast quantities of text data and continuously optimized to independently recognize, understand, predict, and generate human-readable text. Simply put, LLMs are powered by algorithms that read and understand text much as humans would. They can also be used for other natural language processing (NLP) tasks and combined with other technologies to produce outputs in image, video, and speech formats.
The development of language models has been many years in the making and was mostly closed off, restricted, and controlled by the heavily backed research departments of large corporations and universities, mainly due to the financially demanding and highly technical process of training language models. Later, the publication of the “Attention Is All You Need” paper on the Transformer architecture, along with other technological advances, simplified the training process, allowing more organizations, mainly the big tech companies (Google, Microsoft, OpenAI, etc.), to enter the picture and develop these models. However, the AI wars between these companies over secretive LLM development, and their capitalistic focus on profit, have revived historic monopoly concerns and raised the possibility of stifling future innovation in the field. This necessitates democratizing the development of LLMs in line with the open-source way, allowing wide accessibility and unrestricted information sharing for everyone.
“The power of Open Source is the power of the people. The people rule.”
– Philippe Kahn, Co-Founder, Fullpower AI
In a nutshell, open-source LLMs are large language models whose source code is publicly available under non-restrictive, cost-effective licenses for use, modification, and redistribution. They allow for rapid, exhaustive peer review by a diverse community of technology practitioners, contributing to reduced bias, faster bug fixes, and continuous technology enhancement. Training, experimentation, and performance monitoring with these models are also open to regular enthusiasts and non-experts thanks to lowered barriers to entry. Open models are released with a focus on transparency, collaboration, research for continuous innovation, and safe integration with societal activities.
In sharp contrast to open-source LLMs, the internal workings of closed-source LLMs are accessible, customizable, and maintained only by the internal teams of the developing organizations, and they typically require high licensing fees. That said, the usage of both open- and closed-source LLMs is bound by the rules and guidelines of their developers.
This article explores the types, use cases, effectiveness, accessibility, best practices, and rationale behind open-source LLMs.
Open Source LLM Terminologies
Commonly mentioned terminologies in open-source language model discourse include:
- AI Democratization: making the development of AI technologies available to all.
- Corpus: the complete collection of documents in a dataset.
- Fine-Tuning: adapting a pre-trained model to perform specific tasks by further training it on task-specific data.
- Hallucinations: when a model generates fluent but false or nonsensical text.
- LangChain: a framework that provides integrations with open-source LLM APIs and tools for building applications and services on top of them.
- Parameters: adjustable numerical values that describe and control the behavior of a model for performance improvement.
- Tokens: single units of meaning in texts, e.g., words, punctuation marks, numbers, and symbols.
- Weights: numerical values that determine the strength of connections between neurons across different layers in the model.
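The Tokens entry above can be made concrete with a toy sketch. Real LLMs use subword schemes such as byte-pair encoding rather than this naive split, so treat it only as an illustration of what counts as a single unit of meaning:

```python
import re

def naive_tokenize(text):
    # Crude stand-in for real subword tokenizers (e.g. byte-pair encoding):
    # split into runs of word characters or single punctuation/symbol marks.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = naive_tokenize("Open-source LLMs, explained!")
print(tokens)       # ['Open', '-', 'source', 'LLMs', ',', 'explained', '!']
print(len(tokens))  # 7 tokens
```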
Examples of Open Source LLMs
Since the introduction and wide acceptability of language models into the public discourse and the push for the democratization of LLMs, there has been an ever-expanding trend of open-source LLM releases in varying parameter/token sizes, licenses, and functionalities. These are some of the best open-source LLMs available today.
LLaMA 2 is a family of pre-trained and fine-tuned large language models released in July 2023 as an improvement to LLaMA 1 for a wide variety of use cases. It is designed to enable developers, researchers, and enterprises to build generative AI-powered tools and services on top of its infrastructure.
- Developer: Meta AI
- Repository: HuggingFace, GitHub
- Parameters: 7-70 Billion
- Tokens: 2 Trillion
- License: Available free of charge and without attribution for research and commercial uses within the confines of its community license agreement and usage policies.
- Limitations: Training and fine-tuning require heavy computing power.
FalconLLM is designed for creative text generation and complex problem-solving. Unveiled in March 2023 as an improved variant of the GPT-3 design, its architecture (weights) can be optimized per requirements and used to create commercial applications.
Dolly 2.0 is an instruction-following language model released in April 2023 as an updated version of Dolly, based on the Pythia model family. It can creatively generate text in multifaceted formats such as educational resources, poems, and emails, as well as generate code, and it allows organizations to own, modify, and integrate its infrastructure into their tools and offerings.
- Developer: Databricks
- Repository: HuggingFace, GitHub
- Parameters: 12 Billion; smaller 3- and 7-billion-parameter models are also available.
- License: Apache 2.0 license for research and commercial use.
- Limitations: Struggles with programming queries and occasionally hallucinates.
StarCoder is a responsibly developed, code-focused LLM trained exclusively on permissively licensed, ethically sourced data from GitHub. This data spans over 80 programming languages, plus commits, issues, and notebooks, while excluding personally identifiable information (names, emails, passwords, etc.). It is used for code generation, auto-completion, modification, and explanation. Although trained on code, the model can also generate natural-language text, owing to the text-based content (such as issues and notebooks) included in its training data.
- Developer: ServiceNow Research and Hugging Face
- Repository: Hugging Face, GitHub
- Parameters: 15 Billion
- Tokens: 35 Billion
- License: Open and Responsible AI License (OpenRAIL)
- Limitations: Generated code may contain security vulnerabilities and is not guaranteed to work as intended. Code generation prompted in natural languages other than English is also less reliable due to its training data.
Released in April 2023 to power downstream applications similar to ChatGPT, StableLM is built to effectively generate both text and code. It was trained on the Pile, an open-source dataset.
- Developer: Stability AI
- Repository: GitHub
- Parameters: 3-7 Billion
- Tokens: 1.1 Trillion
- Licenses: Creative Commons CC BY-SA 4.0, Non-Commercial Creative Commons CC BY-NC-SA 4.0 for its base and fine-tuned model checkpoints.
- Limitations: Lacks guardrails for sensitive content
Effectiveness of Open Source LLMs
Continuous announcements and releases of language models by closed ecosystems indicated a race toward ever-larger models, fueled by the assumption that models with more parameters would perform better. However, concentrating on parameter count alone is inadequate, because optimal performance from a language model depends on the right combination of multiple elements.
Open-source model developers instead focused on training on more tokens with fewer parameters (1-70 billion), doing more with less. This has shifted the conversation and research around language model development, as open models increasingly achieve similar or better performance than closed models by using more tokens (words) rather than more parameters. LLaMA 2 and its predecessor, alongside Stanford's Alpaca and, most recently, MosaicML's MPT-30B, have all outperformed GPT-3 on several tasks. Smaller models trained on more tokens are also more straightforward to re-train and fine-tune.
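As a back-of-the-envelope illustration of the tokens-versus-parameters trade-off, DeepMind's Chinchilla work suggests a compute-optimal budget of roughly 20 training tokens per parameter (a heuristic, not a guarantee); LLaMA 2's 2 trillion tokens across 7-70 billion parameters go well beyond that, trading extra training compute for stronger small models:

```python
def chinchilla_tokens(params, tokens_per_param=20):
    # Rough compute-optimal token budget (Chinchilla heuristic:
    # ~20 tokens per parameter).
    return params * tokens_per_param

print(chinchilla_tokens(7e9))  # ~140 billion tokens for a 7B-parameter model
print(2e12 / 7e9)              # ~286 tokens per parameter in LLaMA 2's 7B model
```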
It is also cheaper to develop, train, and fine-tune open-source language models. APIs that expose their infrastructure for building tools and applications are almost always free, yielding better economies of scale for research, experimentation, and commercial purposes.
The development and innovation of these models are much faster, as multidisciplinary teams across the world can easily collaborate. Fully integrated and privacy-conscious language models can now run locally on consumer-grade computers and infrastructure, without an internet connection or heavy GPUs, through services like GPT4All. Open models now run on phones, and more devices will follow as strategic partnerships push toward AI democratization.
How to access Open Source LLMs
There are ever-increasing ways of accessing and utilizing open-sourced LLMs:
Hub by Hugging Face
The Hugging Face Hub, also known as the GitHub of machine learning, is an AI hosting platform and the largest open-source LLM aggregator. It houses over 280k models, 99k demo apps (Spaces), and 50k datasets (as of Aug 6, 2023), all open-sourced and publicly available for collaboration in machine learning development. It also allows its community members to host their model checkpoints for simple storage, discovery, and sharing. Hugging Face tracks, evaluates, and ranks open-source language models hosted on its Hub through an Open LLM Leaderboard according to the Language Model Evaluation Harness framework.
Working with Language Models using the Hugging Face Hub
Hugging Face has a simplified process for experimenting with open-source models hosted on its Hub. The Transformers library provides access to all models for a wide variety of use cases: text generation, translation, summarization, and more. Courses and materials are available for learning, as well as instructional guides like this one on using Hugging Face models.
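As a minimal sketch of that workflow (assuming the `transformers` library is installed; the tiny demo checkpoint `sshleifer/tiny-gpt2` is chosen only to keep the download small, so its output is not meaningful):

```python
from transformers import pipeline

# Any text-generation model from the Hub can be substituted for the
# tiny demo checkpoint used here.
generator = pipeline("text-generation", model="sshleifer/tiny-gpt2")
result = generator("Open-source language models", max_new_tokens=20)
print(result[0]["generated_text"])
```

The same `pipeline` entry point covers other tasks by changing the task string, e.g. `"translation"` or `"summarization"`.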
The Google Colab file used for this illustration can be accessed here.
OpenLLM by BentoML
The recently announced OpenLLM is an open-source platform for running open-source large language models in production. Operating under the Apache 2.0 license, it provides easy-to-use APIs and is designed to streamline deploying and operating open-source language models in the cloud or on-premises and building applications on top of them. The OpenLLM GitHub repository lists the models it currently supports, as well as best practices for usage and deployment.
Working with Language Models using OpenLLM
The OpenLLM tooling can be used from a local computer via different commands, preferably inside a virtual environment. These commands let the supported state-of-the-art open models be started as REST or gRPC servers and queried through web and command-line interfaces or clients. Customizable AI applications can also be built on top of them using LangChain, BentoML, and Hugging Face, among other services.
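A rough sketch of that workflow is shown below; the command names follow the OpenLLM README at the time of writing and should be checked against the repository, since the CLI may have changed:

```
$ pip install openllm

# start a supported model as a REST server (downloads weights on first run)
$ openllm start dolly-v2

# in another terminal, point the client at the server and query it
$ export OPENLLM_ENDPOINT=http://localhost:3000
$ openllm query "What are open-source LLMs?"
```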
Petals by BigScience
Petals is an open-source, decentralized system for running and fine-tuning language models. In Petals, language models are broken down into layers and distributed across many end-user computers (servers) around the world. Likened to BitTorrent for AI, it aims to reduce the cost of generating text with large models. Its research and development demonstrate the possibilities of collaborative volunteer computing for utilizing open-source large language models. It achieves faster inference and fine-tuning than offloading-based approaches on the same hardware, and it can be used in client mode (for inference and fine-tuning) or server mode (for provisioning hardware to the network). Tutorials, guides, and additional information on Petals can be found in its GitHub repository.
Working with Language Models using Petals
The Google Colab file used for this illustration can be accessed here.
Implications of Open Source LLMs
The breakthrough and development in many AI technologies have depended on transparent and open experimentation and testing by researchers and enthusiasts from all spheres of life, and language models should be no different.
“Most of the progress in the past five years in AI came from open science and open source”
– Clement Delangue, CEO/Co-Founder, HuggingFace
The big tech companies have taken notice and now realize that, although they currently have some of the best models in terms of performance and quality, they cannot and will not dominate the large language model space through enclosed ecosystems as they have with other innovative technologies.
In a leaked internal document titled “We Have No Moat, and Neither Does OpenAI,” a Google researcher admitted as much. These are damning words, but an honest assessment of the realities of the industry at the moment. It is worth mentioning, however, that a major catalyst in the push for open-source LLMs was Meta AI's release of LLaMA and its successor, possibly to dilute the aggressive advantage of its competitors.
Many large companies are now concentrating on releasing non-restrictive models for greater collaboration and refined use cases. Chinese tech giant Alibaba has just announced the release of its Qwen-7B and Qwen-7B-Chat models in a bid to offer the largest open-source LLM. This is significant as it marks the first time a major Chinese tech company has open-sourced its language model.
Drawbacks of Open-Source LLMs
There are daunting issues with open-source language models, chiefly the lack of the clear guardrails that big tech companies build into their closed-source models. Generated content involving racism, misogyny, antisemitism, sectarian strife, and other harms can more easily be prevented or controlled by these companies, since they can be held to account and want to protect their reputations and maintain investor confidence.
Security vulnerabilities, a lack of community support, and poor documentation for newly released and untested models could pose a challenge for people looking to utilize them for small projects and large applications.
While open-source language models have lowered the barriers to entry, allowing enthusiasts and non-experts to experiment with them, working with these models still requires a baseline understanding of programming and technology concepts. This contrasts with the ready-to-go web interfaces powered by PaLM 2 and GPT-4, Bard and ChatGPT, which require only an email or Google account.
The big tech AI war is a good thing in that it has sparked global conversations around AI technologies, especially LLMs. Open-source LLMs, however, have initiated a resurgence in the push for global interconnectedness and interdependence through the decentralization and democratization of technology – the historical promise of the internet.
Active contributions toward developing these open models will help reduce their risks and challenges while enabling AI-powered products and services to be more accessible, coherent, equitable, and beneficial for the world at large. It will push the boundaries of human ingenuity and collaboration with technology across multiple industries to newer heights.