How to use Large Language Models (like ChatGPT) with internal company data?
As large language models (LLMs) become more advanced, they offer increasingly valuable capabilities for analysing and processing language in the business world. Integrating LLMs with a company's data infrastructure can enable a more sophisticated analysis of customer feedback, financial documents, or other critical information. The potential applications, from knowledge-grounded chatbots to meeting-minute summarisation, are endless.
However, this can be complex and challenging, requiring careful planning and execution. While the benefits of integrating LLMs with corporate data are clear, how to build a system capable of it may not be obvious. So, where to start? The typical scenario involves the process of fine-tuning. More recent approaches, tailored to LLMs’ rising popularity, involve attaching some context to the queries. Let’s tackle them step-by-step.
The “old” way: fine-tuning
Fine-tuning refers to adapting a pre-trained language model to a specific task or domain. This involves re-training the model on a dataset related to that task or domain to improve its performance. So, in a typical scenario, one might take an off-the-shelf model trained on a large general dataset and then feed it with carefully prepared proprietary data.
Fine-tuning is not always easy, however. It requires careful data engineering, and it needs a sufficient amount of data (which usually means a lot of it). Sometimes models “forget” previously learned information in favour of the newly acquired knowledge; in the machine learning literature, this is called catastrophic forgetting. Another downside is that this approach requires a full deep learning infrastructure capable of training LLMs. Since training takes time, it may also be too slow for use cases with fast-changing data, as any addition of data means re-training.
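To make this concrete, the sketch below shows what fine-tuning a small open model on proprietary text might look like with the Hugging Face transformers and datasets libraries. It is a minimal sketch, assuming your documents have already been cleaned and exported to a plain-text file (company_docs.txt is a hypothetical path); a real setup would need proper data splits, hyperparameter tuning, and evaluation.

```python
# Minimal causal-LM fine-tuning sketch (Hugging Face transformers + datasets assumed installed).
# "company_docs.txt" is a hypothetical file containing pre-cleaned proprietary text.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "gpt2"  # a small open model, used here only for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Load the proprietary text and tokenise it.
raw = load_dataset("text", data_files={"train": "company_docs.txt"})
tokenised = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-model", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenised["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # re-trains the model weights on the company data
```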
The “new” way: in-context learning
The advent of LLMs brought the popularity of a new paradigm: in-context learning. For a general audience, the ability to draw on previous information (context) was one of the most impressive features of ChatGPT. In-context learning is the ability to learn how to solve new tasks without changing the model. In other words, the network is able to solve tasks not seen at training time without changing its weights (as happens in fine-tuning). Moreover, models are able to learn without being specifically trained to learn. On top of that, only a few examples are usually enough, which stands in stark contrast to the typical perception of machine learning models as data-hungry. The roots of this phenomenon may lie in Bayesian inference, and it is the subject of intense research effort right now.
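As a toy illustration (the task and examples below are invented), a handful of labelled examples placed directly in the prompt is often enough for the model to pick up the pattern, with no change to its weights:

```python
# A hypothetical few-shot prompt: the model infers the task from the examples alone.
few_shot_prompt = """Classify the sentiment of each customer comment as Positive or Negative.

Comment: "Delivery was quick and the support team was helpful."
Sentiment: Positive

Comment: "The invoice portal keeps logging me out."
Sentiment: Negative

Comment: "Great onboarding experience, everything worked first time."
Sentiment:"""
# Sent as-is to the model, this typically yields "Positive" without any fine-tuning.
```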
In-context learning through prompt engineering
The simplest way to use in-context learning is prompt engineering, where the prompt is simply a piece of text sent to the model. This refers to designing and refining the prompts or instructions given to a language model like ChatGPT to get the desired responses. It involves crafting specific and clear instructions with examples that guide the model's behaviour and encourage it to generate accurate and relevant output. Prompt engineering is crucial because language models like ChatGPT generate responses based on the information provided, including the initial prompt and any subsequent context.
Prompt engineering is quite often an iterative process that requires experimentation and refinement. It may involve using specific keywords or providing relevant examples. By adjusting and tweaking the prompts, users can improve the model's performance, enhance the quality of responses, and mitigate potential biases or errors. It's important to note that prompt engineering concerns only the design and configuration of the input; it does not involve modifying the model's underlying architecture or training data.
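In practice, prompt engineering often boils down to assembling a clear instruction, a worked example or two, and the user's question into a single request. The sketch below assumes the OpenAI Python client; the model name and message contents are illustrative, and the exact interface may differ between library versions.

```python
# Prompt-engineering sketch with the OpenAI Python client (interface may vary by version).
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name, substitute whichever model you use
    messages=[
        # A system message constrains the tone, format and scope of the answers.
        {"role": "system",
         "content": "You answer questions about our refund policy. "
                    "Reply in at most three sentences and cite the policy section."},
        # One worked example nudges the model towards the desired answer style.
        {"role": "user", "content": "Can I return an opened product?"},
        {"role": "assistant",
         "content": "Opened products can be returned within 14 days (Policy §2.1)."},
        # The actual question from the end user.
        {"role": "user", "content": "How long do refunds take to process?"},
    ],
)
print(response.choices[0].message.content)
```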
Is it that simple, then? Well, not really. Due to the limited size of the query window (or, more formally, the token limit), we can attach only a limited amount of information to the context. As company data tends only to grow, this approach can quickly become insufficient, not to mention the costs.
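To see how quickly the limit bites, it helps to count tokens before sending anything. The sketch below assumes the tiktoken tokeniser used by OpenAI models; the limit of 8,192 tokens is just an example figure.

```python
# Rough token budgeting sketch using tiktoken (assumed installed).
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # tokeniser used by recent OpenAI chat models

def fits_in_context(prompt: str, document: str, limit: int = 8192) -> bool:
    """Check whether prompt + document stay within a given token limit."""
    used = len(encoding.encode(prompt)) + len(encoding.encode(document))
    return used <= limit

# A few hundred pages of internal documentation will fail this check immediately,
# which is exactly why naively pasting everything into the prompt does not scale.
```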
Smarter in-context learning with indexing and vector stores
Fortunately, there are smarter ways to attach the context to the queries. If you have a huge amount of your own data, attaching everything to the prompt is not an option. However, some techniques allow you to provide just the right context. One of them involves building a set of indices on top of your data. These indices help to identify relevant “chunks” of your data and attach only the needed ones to the query.
In essence, before sending the prompt to the model, we first search for the relevant chunks of our data and then attach them to the prompt. For example, with LlamaIndex, we can store different documents (e.g. PDF files or SQL data) as indexed nodes, which are structured so that they can be queried later. Naturally, one has to take indexing speed and costs into account. Another approach uses vector databases, such as Pinecone. A vector database stores embeddings, which are numerical representations of data. Such embeddings can later be queried for approximate similarity with our query, and this can be used to build the proper context for the prompt.
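Putting the two ideas together, a minimal retrieval step might look like the sketch below. It assumes the sentence-transformers library purely for illustration and keeps everything in memory; in production, a vector database such as Pinecone or an index built with LlamaIndex would perform the same similarity search at scale. The document chunks shown are invented.

```python
# Retrieval-augmented prompting sketch: embed the chunks, pick the closest ones, build the prompt.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # a small open embedding model

chunks = [
    "Refunds are processed within 5 business days of approval.",
    "Our office in Warsaw is open Monday to Friday, 9:00-17:00.",
    "Invoices are issued on the last working day of each month.",
]
chunk_vecs = model.encode(chunks, normalize_embeddings=True)

def build_prompt(question: str, top_k: int = 2) -> str:
    """Attach only the top_k most similar chunks of company data to the question."""
    q_vec = model.encode([question], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q_vec               # cosine similarity (vectors are normalised)
    best = np.argsort(scores)[::-1][:top_k]   # indices of the most relevant chunks
    context = "\n\n".join(chunks[i] for i in best)
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")

print(build_prompt("How long does a refund take?"))
```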
Other factors to consider
There are notable differences between on-premise deployment and utilising external models (such as ChatGPT through an API) when it comes to serving a language model. On-premise deployment involves hosting within an organisation's own servers or data centres, providing complete control over the infrastructure. This approach allows an organisation to manage data security and privacy according to its specific requirements, and it may be preferable for sensitive or confidential data that needs to remain within the organisation's network. However, on-premise deployment typically requires significant upfront investment in hardware (such as GPUs), infrastructure setup, and ongoing maintenance. It also lacks the inherent scalability and flexibility of external solutions. Therefore, the final decision should always be based on meticulous planning and calculations. Cloud solutions can offer a middle ground.
Conclusions
At yellowShift, we specialise in assisting organisations with the seamless integration of their data with language models. Our expertise lies in bridging the gap between your valuable data sources and the powerful capabilities of language models like ChatGPT. By understanding your specific requirements and objectives, we can guide you through preparing and structuring your data for optimal utilisation. Our team of data engineers and experts can help you clean, preprocess, and transform your data into suitable formats for integration with language models to make AI work for you.