Techniques and Challenges of In-Context Learning
In one of our last articles, we introduced different techniques for making LLMs usable in business scenarios. Most of these scenarios require the models to be aware of the business data and documents to be queried. A promising approach for achieving this awareness is in-context learning, where the LLM is provided with the relevant data on the fly at query time. Since the model is not fine-tuned, one saves the computational costs of adjusting model weights and, at the same time, stays flexible with regard to the context of the inquiry. In this article, we explore basic in-context learning approaches and their caveats by example, using both externally provided and self-hosted LLMs.
The Charm and Curse of Document Querying
In-context learning means enclosing the data you would like to query in the prompt of the LLM: for example, entire contracts for finding a particular detail, long sales tables for finding the weeks with the highest revenue, or candidate CVs for finding the best match for the newest open position - all with just one precise formulation of the question in plain English. This approach may appear intriguingly simple: feed all documents you consider relevant to the LLM via the prompt, add your question, and wait for the answer. But even if you get the desired result in the end, you will probably be disappointed by how long you had to wait for it, or be discouraged from following this approach by the bill from the model provider. By feeding your documents to the LLM, you increase the number of tokens it needs to process and thus the time and costly computational resources required to fulfill your request. In short: we need a better strategy than passing all documents to the model without upfront selection.
Selecting Relevant Documents
In the following, we will explore such strategies with an example from HR using the LlamaIndex library: we want to ask the LLM who it deems the best fit for an open position among a set of potential candidates. To this end, we provide it with a description of the job alongside the CVs of the candidates. LlamaIndex offers a very useful toolset for selecting and passing the appropriate documents to the LLM. Collections of documents are broken down into nodes which, in turn, are organized in indices. These indices can be ordered lists or trees of documents, or unordered sets of documents in a vector space, so-called vector indices. Choosing the right index for each type of document is crucial for reliable retrieval of the desired documents and for the performance of the query.
Vector Indices
Vector indices are particularly interesting for scenarios with a plethora of documents, too many to be fed into the LLM at once. They are built by assigning each document in your collection a vector representation, a so-called embedding. These embeddings encode the semantic similarity of the documents and are created using a separate embedding model (usually a smaller model than the one being prompted, to save costs and resources). Querying the index by semantic proximity then means retrieving the documents whose embeddings are closest to your query's embedding.
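To make this retrieval step concrete, here is a toy sketch of the underlying idea of comparing embeddings via cosine similarity. The bag-of-words embed function is purely illustrative and is not how an embedding model or LlamaIndex actually computes embeddings.

```python
import numpy as np

# Purely illustrative toy "embedding": a bag-of-words vector.
# Real vector indices use a dedicated embedding model instead.
def embed(text: str, vocabulary: list[str]) -> np.ndarray:
    words = text.lower().split()
    return np.array([words.count(term) for term in vocabulary], dtype=float)

# Retrieve the top_k documents whose embeddings are closest to the
# query embedding, measured by cosine similarity.
def retrieve(query: str, documents: list[str], top_k: int = 2) -> list[str]:
    vocabulary = sorted({word for doc in documents for word in doc.lower().split()})
    query_vec = embed(query, vocabulary)
    scored = []
    for doc in documents:
        doc_vec = embed(doc, vocabulary)
        denominator = (np.linalg.norm(query_vec) * np.linalg.norm(doc_vec)) or 1.0
        scored.append((float(query_vec @ doc_vec) / denominator, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]
```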

The business documents are represented as embeddings in a document index. When the documents are to be queried, the index is searched for suitable documents. These are then passed to the LLM alongside the query.
A Naïve Approach to Indexing
Going back to our HR use case with the LlamaIndex library, a first try might be to feed all documents, meaning job offers and candidate profiles, into a single vector index and have the library choose the right ones for us. We first load the documents separately, with one SimpleDirectoryReader per document folder.
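A minimal sketch of this loading step, with placeholder directory paths (and import paths that may differ between LlamaIndex versions), could look as follows:

```python
from llama_index import SimpleDirectoryReader

# Load job offers and candidate CVs from two separate folders
# (the paths are placeholders for this example).
job_offer_docs = SimpleDirectoryReader("./data/job_offers").load_data()
candidate_docs = SimpleDirectoryReader("./data/candidate_profiles").load_data()
```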
Then we create a single vector index from these documents.
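A sketch of this step, assuming a recent LlamaIndex version (older releases name the class GPTVectorStoreIndex instead of VectorStoreIndex), could look like this:

```python
from llama_index import VectorStoreIndex

# Naive approach: all documents, job offers and candidate profiles alike,
# end up in one and the same vector index.
index = VectorStoreIndex.from_documents(job_offer_docs + candidate_docs)
```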
For querying the LLM, we simply create a query engine from the index object and send our question to it.
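Under the same assumptions as above, a sketch of this step might look like the following; the exact wording of the question is, of course, just an example:

```python
# Create a query engine from the index and pass the question to it.
query_engine = index.as_query_engine()

response = query_engine.query(
    "Who among the candidates is the best fit for the open position?"
)
print(response)
```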
While this might work in theory, in our example it turned out that rerunning the same query yielded seemingly random results.
Upon closer inspection, it turns out that only two documents are passed to the LLM per query (by default, the vector index only retrieves the most similar documents, two in our setup), rendering the LLM unable to make a statement about all potential candidates.
Custom Retrieval Mechanisms
We hence need to make LlamaIndex choose documents according to our needs in this particular case. For the aforementioned query, this means passing all candidate profiles to the LLM, as we want a statement concerning all candidates, plus the single job offer in question. To this end, we implemented a custom document retriever based on two document indices: a list index of all candidate profiles and a vector index of all potential job offers.
In the custom retriever class, we implemented the method _retrieve such that, for each query, all candidate profiles are returned alongside the job offer that is semantically closest to the query. It uses the method _get_embeddings to obtain embeddings of the query and of the job offer documents, which are then compared semantically.
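A simplified sketch of such a retriever is shown below. Instead of computing embeddings by hand via _get_embeddings, this version composes the two indices' built-in retrievers, which achieves the same selection logic; the class and variable names are our own, and import paths may differ between LlamaIndex versions.

```python
from llama_index import QueryBundle
from llama_index.retrievers import BaseRetriever
from llama_index.schema import NodeWithScore


class CandidateJobRetriever(BaseRetriever):
    """Return all candidate profiles plus the job offer closest to the query."""

    def __init__(self, candidate_list_index, job_offer_vector_index):
        super().__init__()
        # The list index retriever returns every candidate profile node.
        self._candidate_retriever = candidate_list_index.as_retriever()
        # The vector index retriever returns only the single most similar
        # job offer (similarity_top_k=1).
        self._job_offer_retriever = job_offer_vector_index.as_retriever(
            similarity_top_k=1
        )

    def _retrieve(self, query_bundle: QueryBundle) -> list[NodeWithScore]:
        candidate_nodes = self._candidate_retriever.retrieve(query_bundle)
        job_offer_nodes = self._job_offer_retriever.retrieve(query_bundle)
        return candidate_nodes + job_offer_nodes
```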
Alternatively, one could have chosen a simple keyword-matching approach, which might also suffice for finding the one job offer in question. A query can now be dispatched with the custom document retriever.
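A sketch of how this could look, reusing the document collections and the retriever class from the snippets above (again, the question text is illustrative):

```python
from llama_index import ListIndex, VectorStoreIndex
from llama_index.query_engine import RetrieverQueryEngine

# Build the two indices the custom retriever relies on: a list index over
# all candidate profiles and a vector index over the job offers.
candidate_index = ListIndex.from_documents(candidate_docs)
job_offer_index = VectorStoreIndex.from_documents(job_offer_docs)

retriever = CandidateJobRetriever(candidate_index, job_offer_index)
query_engine = RetrieverQueryEngine.from_args(retriever=retriever)

response = query_engine.query(
    "Score each candidate's fit for the open position."
)
print(response)
```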
Now, all candidate profiles and the job offer in question are consistently passed to the LLM, and it returns consistent scores for each candidate. More detailed inquiries about specific capabilities of individual candidates are now possible as well.
Potentials of Self-Hosted Models
When using LlamaIndex out of the box, as we did above, the library uses OpenAI's LLMs for generating the embeddings and answering the inquiries. In business scenarios with proprietary data this is often undesirable, as the data is subject to data protection or non-disclosure policies. With ever more powerful open-source models under business-friendly licenses appearing, such as gpt4all or MPT 7B, data-aware LLM querying in a controlled and secure environment, such as a private cloud network or on-premise infrastructure, is now just a stone's throw away.
As a proof of concept, we ported our HR use case from the default LlamaIndex setup with OpenAI to a GPU-equipped cloud notebook in which we loaded the MPT 7B model. With this setup we were also able to query our documents. However, because the memory requirement scales steeply with the size of the input, the cloud environment quickly reached its limit when more than two documents were passed to the LLM. Moreover, a top-notch GPU is necessary to get immediate responses from the MPT 7B model during inference.
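As a rough sketch of how such a self-hosted setup can be wired into LlamaIndex: the class names follow LlamaIndex's Hugging Face integration, but exact parameters vary between library versions, and the model identifier and generation settings below are illustrative rather than the exact configuration we used.

```python
from llama_index import ServiceContext, VectorStoreIndex
from llama_index.llms import HuggingFaceLLM

# Load MPT-7B (here the instruction-tuned variant) from the Hugging Face hub
# and run it locally on the GPU; MPT's custom architecture requires
# trust_remote_code to be enabled.
llm = HuggingFaceLLM(
    model_name="mosaicml/mpt-7b-instruct",
    tokenizer_name="mosaicml/mpt-7b-instruct",
    context_window=2048,
    max_new_tokens=256,
    device_map="auto",
    model_kwargs={"trust_remote_code": True},
)

# Use a local embedding model as well, so no data leaves the environment.
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")

index = VectorStoreIndex.from_documents(
    job_offer_docs + candidate_docs, service_context=service_context
)
```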
While this shows that running data-aware LLM applications on your own hardware or in your own cloud network is possible, it also underlines that a well-considered approach for selecting and feeding documents into the LLM is necessary, because compute resources of the order of magnitude required by today's LLMs are scarce.
Conclusion
In this article we explored how LLMs can be made business-ready using in-context learning. In this technique, documents are fed to the LLM via its prompt in a case-based manner: depending on the use case and the semantics of the query, input data are selected from a document index. It is important to carefully weigh the amount of data needed to answer the query against the compute resources necessary to perform the LLM inference with that data.
In an HR use case, we showed how to implement such a case-based document selection logic with a custom LlamaIndex retriever. We further examined self-hosted models and found that they are an increasingly viable alternative to externally hosted LLMs, which would require sending sensitive business data to a third party.
The recently accelerated evolution of LLMs and the emergence of document integration libraries suggest that enterprises will soon be able to integrate LLM-based data querying into a diverse set of business processes.