Introduction
Recently, NVIDIA’s CEO drew attention to the transformative potential of Generative AI, suggesting that these AI-driven systems can empower anyone to become a programmer. This statement, while inspiring, raises a critical question: how close are we to turning this vision into reality? Can we truly harness the capabilities of LLMs to democratize programming and enable a broader audience to apply AI to a wide range of tasks? The statement underscores the immense possibilities that LLMs offer, but it also prompts us to examine the practical steps needed to bridge the gap between this vision and its realization.
“Human rational behavior is shaped by a scissors whose two blades are the structure of task environments and the computational capabilities of the actor.”
— Herbert Simon, “Invariants of Human Behavior,” Annual Review of Psychology
A recent incident involving a New York law firm highlights the potentially dangerous consequences of relying solely on LLMs like ChatGPT for mission-critical tasks. A lawyer used ChatGPT to conduct legal research, with confident-sounding but unreliable results, and integrated its responses directly into a legal brief without manually verifying the cited references. As a result, the brief contained six fabricated court decisions, leading to confusion and the dismissal of the case by the judge. The lawyer’s misplaced trust in ChatGPT’s output highlights a broader issue: misconceptions surrounding AI systems like LLMs. While LLMs excel at generating fluent, confident-sounding text, they cannot inherently validate or verify the accuracy of their output. This limitation becomes particularly problematic when they serve as the sole source of truth in fields like law, healthcare, or finance. LLMs like ChatGPT offer huge potential, but they must be integrated into a comprehensive solution that prioritizes trustworthiness and correctness, particularly in critical research and decision-making scenarios.
As we delve deeper into this analysis, it’s important to recognize that the landscape of AI is constantly evolving. With the constant influx of news about groundbreaking LLM models and cutting-edge tools, it might seem as though traditional Machine Learning (ML) best practices have been cast aside. Yet amid this surge of groundbreaking ideas, the core principles of ML project lifecycles remain remarkably intact.
Drawing inspiration from Andrew Ng’s well-known Lifecycle of ML Projects, this article examines how LLMs change the questions asked, the decisions made, and the activities performed at each stage of project development.
Step 1 - Scoping: Navigating the New Horizons
This analysis begins with scoping. At this stage, it is important to clarify the reasons behind the project, define what needs to be built, establish success criteria, and make high-level design decisions.
To increase the likelihood of creating a desirable and highly utilized product, it is essential to gain a deep understanding of your target customers’ needs. This understanding can be achieved through various methods, including rapid prototyping, user research, and other product discovery techniques.
Defining acceptance or success criteria is crucial in this process. You can use a combination of optimizing and satisficing metrics to measure the product’s performance. These metrics cover aspects such as cost, latency (response time), development time (time-to-market), and model quality, ensuring a comprehensive evaluation.
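For illustration, the sketch below shows one way to encode this combination in Python: the satisficing metrics act as hard constraints, while a single optimizing metric ranks the candidates that pass. The metric names and thresholds are assumptions for illustration, not recommendations.

```python
def meets_acceptance_criteria(run: dict) -> bool:
    """Accept a candidate only if it satisfies every hard constraint
    (the satisficing metrics); survivors are then ranked by the
    optimizing metric."""
    return (
        run["p95_latency_ms"] <= 2000           # latency budget
        and run["cost_per_1k_requests"] <= 5.0  # cost ceiling
    )

candidates = [
    {"name": "prompt_v1", "quality": 0.81, "p95_latency_ms": 1800, "cost_per_1k_requests": 4.2},
    {"name": "prompt_v2", "quality": 0.85, "p95_latency_ms": 2600, "cost_per_1k_requests": 3.9},
]

# Optimize quality among the candidates that satisfy the constraints.
viable = [c for c in candidates if meets_acceptance_criteria(c)]
best = max(viable, key=lambda c: c["quality"])
print(best["name"])  # prompt_v1: prompt_v2 scores higher on quality but misses the latency budget
```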
In the context of LLMs, it’s important to recognize the expanded possibilities they offer. Prompt engineering is a valuable tool for quickly exploring new ideas and determining whether they’re worth pursuing. This involves choosing between closed-source LLM providers like OpenAI or Google’s Bard and open-source models like Meta’s Llama 2. You also need to decide whether to fine-tune the model or focus on prompt engineering, adapting your approach to your specific use case.
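As a minimal sketch of this kind of rapid exploration, the snippet below sends a single prompt to a hosted model through the openai Python package; the model name, prompt, and settings are placeholders for whatever provider and use case you are evaluating.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "You are a customer-service assistant for an airline.\n"
    "Answer the question below in two sentences or fewer.\n\n"
    "Question: Can I change the date of a non-refundable ticket?"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # illustrative; any chat-capable model works
    messages=[{"role": "user", "content": prompt}],
    temperature=0.2,  # keep exploratory runs mostly deterministic
)
print(response.choices[0].message.content)
```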
Step 2 - Collect Data and Define Metrics: The Foundation of Quality
In this step, the focus is on gathering data, especially if you plan to fine-tune your model, and evaluating the quality of the model itself. Data quality plays a critical role in determining the overall model quality. Additionally, it’s essential to establish metrics for assessing the quality of the model and set up an evaluation process, forming the basis for making metric-driven improvements in the next step.
LLMs can handle new and complex tasks that often lack standard metrics. For instance, if you’re developing an AI customer-service agent, you might want to assess qualities like factuality, perceived intelligence, empathy, or a combination of these factors. The challenge lies in measuring each of these behaviors effectively. Interestingly, LLMs themselves can be employed to evaluate such subjective behaviors, a task that traditionally relied on crowdsourced labeling.
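A minimal sketch of this LLM-as-judge pattern is shown below; the rubric, rating scale, and model name are illustrative assumptions rather than an established standard.

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "Rate the following customer-service reply for empathy on a scale of "
    "1 (cold) to 5 (highly empathetic). Respond with the number only.\n\n"
    "Reply: {reply}"
)

def judge_empathy(reply: str) -> int:
    """Score a reply with an LLM judge; temperature 0 for repeatable scores."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(reply=reply)}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())

print(judge_empathy("I'm sorry your flight was cancelled. Let's fix this together."))
```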
Furthermore, LLM applications may encounter limited publicly available data for new tasks. The good news is that the dataset sizes required are typically much smaller than those needed to train models from scratch. LLMs can also help generate synthetic data using a few-shot prompting approach, addressing the data-scarcity issue.
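The sketch below illustrates this few-shot approach: a handful of seed examples are placed in the prompt, and the model is asked to extend them in the same style. The task, examples, and model name are invented for illustration.

```python
from openai import OpenAI

client = OpenAI()

# Seed examples shown to the model so it can imitate their style.
FEW_SHOT_PROMPT = (
    "Generate 5 more customer-support questions in the same style as these "
    "examples, one per line:\n\n"
    "1. How do I reset my password?\n"
    "2. Why was my card charged twice?\n"
    "3. Can I change my delivery address after ordering?"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # illustrative
    messages=[{"role": "user", "content": FEW_SHOT_PROMPT}],
    temperature=0.9,  # higher temperature encourages varied synthetic examples
)
synthetic_questions = response.choices[0].message.content.splitlines()
```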
Step 3 - Modeling: Iteration and Experimentation
Modeling is the step where a significant portion of your iterative work takes place, whether you’re fine-tuning the model or designing prompts. The pace of each iteration largely determines your experimentation speed. In prompt engineering, an iteration involves conducting error analysis, modifying the prompt to address identified errors, executing the evaluation pipeline established in the previous step, and comparing metric values before and after the change to measure its impact on model quality. Maintaining a high experimentation velocity lets you systematically test numerous changes within a single day.
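The following sketch outlines the skeleton of such an iteration loop; `generate` and `score` are hypothetical stand-ins for your model call and your quality metric, since both depend entirely on your application.

```python
def evaluate(prompt_template: str, dataset: list, generate, score) -> float:
    """Average metric score of a prompt template over an evaluation set."""
    total = 0.0
    for example in dataset:
        output = generate(prompt_template.format(**example))
        total += score(output, example["expected"])
    return total / len(dataset)

# baseline = evaluate(PROMPT_V1, eval_set, generate, score)
# candidate = evaluate(PROMPT_V2, eval_set, generate, score)
# Keep the change only if `candidate` improves on `baseline`.
```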
Regarding LLMs, there exists a diverse range of base models to choose from based on your specific requirements. Should you choose prompt engineering, you won’t require extensive training data, but a significant portion of your time will be dedicated to experimenting with various prompt engineering techniques.
LLMs are inherently creative text generators. In cases where strict adherence to rules is required, such as when generating code, this creativity can lead to the invention of non-existent functions or variables, resulting in invalid code. This phenomenon is referred to as “model hallucination,” and there are methods available to detect and mitigate it. Additionally, LLMs require behavioral guardrails to prevent undesirable or harmful behaviors from occurring.
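As one example of a lightweight guardrail for generated code, the sketch below simply rejects output that does not parse as valid Python; real systems would layer further checks, such as linting, allow-lists of callable names, or sandboxed execution.

```python
import ast

def is_syntactically_valid(generated_code: str) -> bool:
    """Return True if the generated snippet at least parses as Python."""
    try:
        ast.parse(generated_code)
        return True
    except SyntaxError:
        return False

assert is_syntactically_valid("def add(a, b):\n    return a + b")
assert not is_syntactically_valid("def add(a, b) return a + b")  # missing colon
```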
Step 4 - Production: The Path to Deployment
Following numerous iterations, you will have developed a high-quality model or prompt that aligns with the acceptance criteria. At this point, the final step involves deploying it in a production environment. To ensure its successful operation, you must implement telemetry and monitoring systems to track the model’s performance and measure customer engagement and satisfaction.
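A minimal telemetry sketch might look like the following; the field names and logging sink are assumptions, and a production system would feed a proper observability stack rather than a plain logger.

```python
import json
import logging
import time

logger = logging.getLogger("llm_telemetry")

def log_request(prompt: str, completion: str, started_at: float, feedback=None):
    """Emit one structured telemetry record per LLM request."""
    logger.info(json.dumps({
        "latency_ms": round((time.time() - started_at) * 1000),
        "prompt_chars": len(prompt),
        "completion_chars": len(completion),
        "user_feedback": feedback,  # e.g. "thumbs_up" / "thumbs_down" / None
    }))
```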
In the context of LLMs, the traditional large, distributed data pipelines for feature engineering are often unnecessary in this step. Instead, there is a shift toward vector search databases that facilitate Retrieval Augmented Generation (RAG), replacing conventional production components. Proprietary LLMs are typically offered as a service, with fully managed GPU pools accessible via an API endpoint. This eliminates the need to train or deploy new models, and most of the heavy lifting in terms of Machine Learning Operations (MLOps) is handled for you. While prominent open-source LLMs are gradually being offered as a service as well, where this is unavailable you may need to take on the responsibility of model provisioning yourself. It’s worth noting that running most LLMs efficiently in production requires Graphics Processing Units (GPUs), which can be in high demand.
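To make the RAG pattern concrete, here is a compact sketch using an in-memory index; a production system would use a vector database and a real embedding model, and `embed` and `llm` below are hypothetical stand-ins.

```python
import numpy as np

def retrieve(query_vec, doc_vecs, docs, k=3):
    """Return the k documents whose embeddings are most similar to the query."""
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    return [docs[i] for i in np.argsort(-sims)[:k]]

def answer(question, docs, embed, llm):
    """RAG in two moves: retrieve relevant context, then ground the prompt in it."""
    doc_vecs = np.stack([embed(d) for d in docs])
    context = "\n".join(retrieve(embed(question), doc_vecs, docs))
    return llm(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
```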
Step 5 - Responsible AI Model: Ethics at Every Step
This is not a standalone step but a central theme that runs through every part of the project. As ChatGPT’s rapid adoption gained widespread attention, the societal and economic implications of LLMs took center stage in public discussions. Within each of the previously mentioned steps, responsible-AI tasks should be included to evaluate and prevent potential harm or improper use of the LLM application.
Conclusion
The process of testing LLMs mirrors the complexity and dynamism inherent in these powerful systems. While our outlined approach provides a structured framework, it is essential to recognize that testing LLMs is a multifaceted effort. It often leads to unexpected discoveries and new use cases and even prompts reconsidering the initial tool design.
The question of whether this rigorous testing process is too labor-intensive is a valid one. However, it is crucial to weigh it against the alternatives. Benchmarks can be inadequate for generation tasks with multiple correct answers, and collecting human judgments can be even more demanding, with diminishing utility as the model evolves. The choice not to test can result in a lack of understanding of the model’s behavior, which can be damaging. On the other hand, thorough testing can yield several advantages: it aids in identifying and rectifying bugs, provides valuable insights into the task itself, and uncovers significant issues in the specification early in the development process, allowing for necessary adjustments before it’s too late.
Thorough testing not only ensures the reliability and effectiveness of the model but also contributes to a deeper understanding of its capabilities and limitations. Ultimately, testing serves as an essential tool for harnessing the potential of LLMs while mitigating risks and pitfalls along the way.