In RAG systems, large language models (LLMs) – including OpenAI's GPT model series, Mistral AI's Mistral series, and Meta's Llama – can be expanded with additional external knowledge sources (e.g., the AI Act) without the need for time-consuming retraining of the LLM on this additional data.
The acronym RAG stands for Retrieval-Augmented Generation. Its three steps – Retrieve, Augment, and Generate – describe the stages the RAG system goes through to generate a response to a user query:
In connection with the use of RAG systems (e.g., for knowledge management in companies), the question repeatedly arises as to where data flows occur and how data privacy is handled. The current implementation of the RTR AI Act chatbot shows that all data can remain local throughout every processing step: external knowledge sources, user input, and the LLM can all be operated on locally hosted hardware.
For a RAG system to answer a query correctly, in a first step, the data relevant to the answer has to be collected. There are various ways of achieving this, one of which is "neural retrieval." This method assumes that the documents most relevant to answering a question are those that are semantically "most similar" to it. Technically, a so-called "embedding model" is used for this purpose. To process a question, it has to be converted into a so-called query vector using the embedding model.
Vectors are lists of numbers, or coordinates, in a multidimensional space, and these numbers or coordinates represent certain properties or characteristics of the text (see below for a detailed explanation). The same procedure is applied to the external knowledge sources (in this case, the text of the AI Act and material from the AI Service Desk at RTR), and the same embedding model must be used in both cases. Before texts from external knowledge sources can be vectorized, however, they must first be divided into sections, known as chunks. This can be done automatically (for continuous text, for example, by splitting it into overlapping parts) or manually; in the case of the AI Act RAG, a separate tool was used: the annotation tool. Each chunk should represent a logically coherent text unit (a "train of thought"). For the AI Act, this typically corresponds to articles, recitals, and annexes, while the AI Service Desk material is chunked thematically (e.g., according to FAQs). For optimal retrieval performance, chunks should contain between 500 and 1,000 characters (including spaces). However, since some articles and annexes of the AI Act comprise around 10,000 characters or more (e.g., Articles 3 and 5 or Annex III), further subdivision is necessary in such cases; this additional subdivision is carried out along thematic lines.
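The automatic, overlap-based chunking of continuous text described above can be sketched as follows. This is a minimal illustration, not the annotation tool actually used for the AI Act RAG; the function name and parameter values are chosen for this sketch.

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split continuous text into chunks of at most `size` characters.

    Adjacent chunks share `overlap` characters so that a train of thought
    cut at a chunk boundary is still fully contained in one of the chunks.
    """
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
        start += size - overlap
    return chunks

# A 2,500-character text yields three overlapping chunks.
parts = chunk_text("x" * 2500)
print([len(p) for p in parts])  # → [1000, 1000, 900]
```

Manually curated chunks (as produced with the annotation tool) replace this heuristic with logically coherent units such as individual articles or recitals.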
In the RAG system, the query vector is compared with the vectors of the external knowledge sources (see figure above, RAG step 1 – Retrieve). The result, the so-called context, is collected – that is, the documents from the external knowledge sources that have the highest semantic similarity to the current user query. These passages include articles, recitals, and annexes of the AI Act, as well as material from the AI Service Desk, with cross-references also displayed in each case. These cross-references were defined in the annotation tool for each chunk. Especially in legal texts, definitions or recitals are often necessary for the precise interpretation of certain terms, yet they may not be identified by semantic similarity alone, as initial tests have shown.
In the RAG system used for the AI Act chatbot, the context can comprise a maximum of 22,000 tokens, which corresponds to approximately 88,000 characters (including spaces). This limit is determined by the LLM model used, as well as by the GPU of the hardware employed. The context is marked in black under "Relevant Sources" in the chatbot's output. Below the documents that are within this limit, other passages from external knowledge sources are listed in gray. These passages are less similar to the current query and are not passed to the LLM because the maximum token limit has already been reached. The prompt, query, and context are then combined (RAG step 2 – Augment) and passed to the LLM, which generates a response (RAG step 3 – Generate).
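The token-limited context assembly described above can be sketched as follows. The function names are hypothetical, and the token estimate uses the article's rough ratio of about four characters per token (22,000 tokens ≈ 88,000 characters); a real implementation would use the LLM's actual tokenizer.

```python
MAX_TOKENS = 22_000  # limit from the article; set by the LLM and the GPU used

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters (including spaces) per token.
    return len(text) // 4

def select_context(ranked_chunks: list[str], budget: int = MAX_TOKENS):
    """Keep the highest-ranked chunks until the token budget is exhausted.

    Returns (passed, skipped): chunks handed to the LLM (shown in black
    under "Relevant Sources") and the remaining, less similar chunks that
    are only listed in gray.
    """
    passed, skipped, used = [], [], 0
    for chunk in ranked_chunks:  # sorted by semantic similarity, best first
        cost = estimate_tokens(chunk)
        if used + cost <= budget:
            passed.append(chunk)
            used += cost
        else:
            skipped.append(chunk)
    return passed, skipped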
Both the embedding model (Arctic Embed 2 from Snowflake) and the LLM (Mistral Small from Mistral AI) are open source and hosted locally on an RTR server using the Ollama framework. The current implementation of the RTR AI Act chatbot thus demonstrates that all data can remain local. Data flows from the query to the embedding model and then to the LLM, before returning from the LLM as a generated response (see figure above). While fully local hosting is possible, it is not required: external services can be used for retrieval, and LLMs can also be accessed via API calls.
The individual RAG steps are explained in detail below.
To retrieve relevant sources, the respective query is (semantically) compared with the external knowledge sources (in this case, the AI Act and material from the AI Service Desk). The goal is to identify articles, recitals, and annexes of the AI Act, as well as material from the AI Service Desk, that contain information relevant to answering the respective query. Preparation for this process already takes place when the external knowledge sources are integrated into the RAG system, as the data is pre-processed and indexed (see figure below, step 1). As described in the previous section, the external texts must first be divided into chunks before they are converted into mathematical vectors through vector embedding. The resulting vectors are lists of numbers or coordinates in a multidimensional space, and these numbers or coordinates represent certain properties or characteristics of the text. The AI Act chatbot uses Snowflake's Arctic Embed 2 embedding model, whose vectors have 1024 dimensions (i.e., features or elements). Each of the 1024 dimensions of the vector is represented by a 32-bit floating-point number within the range -1 to 1.
When representing a term as a vector, semantic similarity plays a central role – that is, the similarity of the meaning of terms in their respective context. For example, synonyms such as “car” and “vehicle” are semantically similar because they describe the same underlying object using different words. In embedding, terms are represented so that their semantic relationships and contextual information are captured. The relative positions of the vectors thus reflect the semantic similarities and relationships between terms. This means that homonyms – terms that share the same spelling but have different meanings depending on the context in which they are used, such as "bank" (a financial institution; something to sit on) – can also be distinguished based on the surrounding context ("bank employee", "sitting on the bank").
A query to the AI Act chatbot is likewise converted into a vector (see figure below, step 2). It is essential that the same embedding model is used for vectorizing both the external knowledge source and the query. The resulting query vector is then matched against the vectors of the external knowledge base (see figure below, step 3). Vectors from the external knowledge source that have the highest similarity to the current query are selected. The semantic similarity between terms can be calculated using the so-called cosine similarity of their vectors (see example below). When a query is submitted to the AI Act chatbot, the context – that is, the relevant sources - is listed in order of semantic similarity in the user interface, with the corresponding percentage value representing the cosine similarity.
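The matching of the query vector against the knowledge-base vectors can be sketched in a few lines. This is a simplified illustration (brute-force comparison; function names are chosen for this sketch), whereas production systems typically use a vector database with indexed search.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_vec, chunk_vecs, top_k=3):
    """Rank chunk vectors by cosine similarity to the query vector, best first."""
    scores = sorted(
        ((i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)),
        key=lambda s: s[1],
        reverse=True,
    )
    return scores[:top_k]

# Toy 2-D example: the first and third chunk vectors are most similar to the query.
top = retrieve([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], top_k=2)
```

In the actual chatbot, the vectors have 1024 dimensions, and the similarity scores are the percentage values shown next to the relevant sources in the user interface.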
To improve the quality of the AI Act chatbot's response, the respective cross-references from the database are also loaded and transferred in the context of each chunk - for example, the relevant recitals for an article (see figure below, step 4).
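Loading the cross-references defined per chunk can be sketched as a simple lookup table. The chunk identifiers and structure below are hypothetical; the actual mapping comes from the annotation tool's database.

```python
# Hypothetical cross-reference table, e.g. an article pointing to the
# recitals relevant for its interpretation.
cross_refs = {
    "art_5": ["recital_15", "recital_28"],
    "recital_15": [],
}

def expand_context(chunk_ids, refs, documents):
    """Add each retrieved chunk's cross-referenced chunks to the context,
    preserving order and avoiding duplicates."""
    seen, ordered = set(), []
    for cid in chunk_ids:
        for item in [cid, *refs.get(cid, [])]:
            if item not in seen and item in documents:
                seen.add(item)
                ordered.append(item)
    return ordered
```

This guarantees that, for example, the recitals needed to interpret an article are included even when they are not semantically similar enough to the query to be retrieved on their own.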
The table below provides a fictional and strongly simplified example of vector embedding for several terms (rows: cats, cars, rose, baby). The embedding is performed (also in simplified form) using features (columns: animal, human, plural, plant), which are intended to represent different aspects of the terms. Each column represents one dimension of the concept vector, and the numerical value indicates the characteristic strength of that feature. The column marked [...] refers to additional dimensions that have been omitted in this simplified example. For comparison: the embedding model Arctic Embed 2 from Snowflake used for the AI Act chatbot creates vectors with 1024 dimensions, which would correspond to a table with 1024 columns.
The resulting vector for cats is [.91, .19, …, .94, -.51], which is a list of numbers that also correspond to coordinates in a multidimensional space.
| Term | Animal | Human | ... | Plural | Plant |
|---|---|---|---|---|---|
| Cats | .91 | .19 | ... | .94 | -.51 |
| Cars | -.56 | .31 | ... | .94 | -.5 |
| Rose | -.67 | .29 | ... | -.51 | .89 |
| Baby | .01 | .87 | ... | -.11 | -.51 |
Highly simplified example of vector embedding © RTR (CC BY 4.0)
Vector embedding also takes into account the context in which terms occur. In the case of homonymous terms, such as "bank," the respective meaning becomes apparent only in context. As the figure below illustrates, the vectors for the phrases "bank employee" and "dealing with a bank" are close to each other in the vector space. The vector for the term "bank" in the phrase "sitting on the bank," on the other hand, points in a different direction, but is close to the vector for the term "chair" in the phrase "sitting on a chair." Again, this is a highly simplified example.
The angle between two concept vectors, typically denoted by the lowercase Greek letter theta (θ), determines the similarity between the two terms (see figure below). The smaller the angle θ between the concept vectors, the greater the semantic similarity of the terms.
The cosine of the angle θ between the two vectors a and b represents the so-called cosine similarity (abbreviated as sim, from similarity) of the two vectors. Mathematically, cosine similarity is calculated by dividing the inner product (dot product) of the vectors a and b by the product of the magnitudes, i.e., the lengths, of the vectors:

sim(a, b) = cos θ = (a · b) / (|a| · |b|)
The cosine similarity yields values between -1 and 1 with the following meaning: a value of 1 means the vectors point in the same direction (maximum semantic similarity), a value of 0 means the vectors are orthogonal (no semantic relationship), and a value of -1 means they point in opposite directions.
As an example, the semantic similarity of the three terms “cow”, “sheep”, and “rose” is to be determined using cosine similarity. To keep the calculation simple, vectors with only two dimensions are used here.
The concept vectors:
The concept vectors can be represented graphically in a coordinate system (see figure below). At first glance, it is obvious that the vectors for the terms "cow" and "sheep" are almost identical. The vector for the term "rose", on the other hand, points in a significantly different direction than the other two vectors.
How semantically similar are the terms cow and sheep?
Calculation of the cosine similarity of cow (vector a) and sheep (vector b):
The cosine similarity of the terms cow and sheep is approximately 0.971, i.e., the terms cow and sheep are semantically similar by approximately 97%. The angle θ between the vectors is only 13.83 degrees, which corresponds to a high semantic similarity.
For comparison: the percentages given for the relevant sources in the AI Act chatbot correspond to the cosine similarity relative to the query vector, where the vectors have 1024 dimensions.
How semantically similar are the terms rose and cow?
The cosine similarity between the terms "rose" and "cow" is approximately 0.6309, meaning that the terms "rose" and "cow" are approximately 63% semantically similar. The angle θ between the two vectors is 50.88 degrees, which corresponds to a significantly lower similarity.
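The calculation above can be reproduced in a few lines of code. The 2-D vectors below are illustrative values chosen for this sketch (not the exact vectors from the article's example), but they show the same pattern: cow and sheep are nearly parallel, while rose points in a clearly different direction.

```python
import math

def cosine_similarity(a, b):
    """sim(a, b) = (a · b) / (|a| · |b|)"""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Illustrative 2-D concept vectors (hypothetical, for demonstration only).
cow, sheep, rose = [0.9, 0.3], [0.8, 0.4], [0.2, 0.9]

sim_cow_sheep = cosine_similarity(cow, sheep)   # close to 1: small angle
sim_cow_rose = cosine_similarity(cow, rose)     # clearly lower: large angle
angle_deg = math.degrees(math.acos(sim_cow_sheep))
```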
The prompt plays an important role in a RAG chatbot because it instructs the underlying LLM to generate a specific type of response. To create the prompt, the original user query is augmented with context – specifically, the previously identified relevant documents from external knowledge sources.
Typically, RAG prompts consist of components that remain the same for each query (such as role assignment, task definition, and, if necessary, additional instructions) and variable components that differ depending on the query (user query and context).
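Combining the fixed and variable prompt components can be sketched as a simple template. The template text and function name below are hypothetical stand-ins, not the actual prompt of the AI Act chatbot, which is reproduced further below.

```python
# Hypothetical template: the fixed parts (role assignment, task definition)
# stay the same for every query; {context} and {query} vary per request.
PROMPT_TEMPLATE = """You are an assistant answering questions about the AI Act.
Answer the question using only the context below.

Context:
{context}

Question:
{query}
"""

def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Augment the user query with the retrieved context (RAG step 2)."""
    return PROMPT_TEMPLATE.format(context="\n\n".join(context_chunks), query=query)
```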
The prompt of the AI Act chatbot (AI Service Desk) is composed as follows:
The prompt is passed to the LLM for the creation of the response ("Generate"). This is where the advantages of a RAG system over a pure LLM system become apparent. LLMs can be augmented with additional external knowledge sources (e.g., a database) without the LLM having to be retrained with this additional data. The answers provided by a RAG system are more precise and relevant because current and context-specific information can be included. In addition, external knowledge sources can be used to limit so-called “hallucinations”, which occur when relevant information is not available in the knowledge base of the LLM system. Ultimately, the required information can be accessed quickly without having to conduct time-consuming research in extensive documents. Internal knowledge bases can thus be used more efficiently.
Another important aspect to consider when looking at RAG systems is their limitations. A RAG system only works well if the relevant context can actually be found in the external knowledge sources. This requires well-organized and searchable data sources. If this is not the case, the accuracy of the answers may suffer. In addition, the structure of the answers is highly dependent on the quality of the prompt and the capabilities of the LLM. If these are not clear or well defined, the answers may be incomplete or less coherent.
Alternatively, or in addition to a RAG, comprehensive fine-tuning of the LLM may be more appropriate in certain cases, especially when it comes to considering a very large amount of specific data or adhering to certain output styles. Well-tuned fine-tuning can significantly increase the performance of an LLM for a specific task by better adapting the model parameters to the specific requirements of the application.
For quality assurance purposes, it is necessary to measure the reliability of the RAG system objectively and comprehensibly. In the current implementation of the RTR AI Act chatbot, this quality assurance is performed automatically using a predefined pool of questions, each entry consisting of a question about the AI Act, the relevant sources, and the answer. Only half of the questions were known to the development partner (in German; these are listed here). For the evaluation, the sources found in the retrieve step were compared with the relevant sources from the predefined question pool. In addition, the responses of the RTR AI Act chatbot were compared with the responses in the question pool using GPT-4o and classified into four categories (no difference in content, excessive response with/without contradiction, contradictory response). The style of the response was also evaluated in terms of professionalism. The following prompt was used for GPT-4o (originally in German, translated in this article):
"Below is a question about the AI Act and two answers to that question. Your task is to assess the equivalence of the answers from a legal perspective.
Question:
{question}
Answer A:
{existing_answer}
Answer B:
{new_answer}
Assessment task: Above, a legal question about the AI Act was answered twice as "Answer A" and "Answer B."
Compare the two texts and determine how much Answer B differs from Answer A.
Consider cases where Answer B covers all aspects of Answer A but goes beyond it.
This is a legal textbook on AI law in the EU. The questions are printed there. Legal differences are relevant. It is relevant whether other legal conclusions and subsumptions are drawn. Pure changes in wording are not relevant.
How do the two answers differ? Please adhere strictly to the following structure and evaluate in:
1: No difference: Both answers contain only the same information. Both answers come to the same legal conclusions.
2: Excessive answer, no contradiction: Both answers contain the same legal conclusions. One answer deals with more legal aspects than the other answer.
3: Excessive response, contradiction: Both responses contain the same legal conclusions. One response addresses even more aspects, but these potentially trigger a contradiction. This is particularly the case when different conclusions are reached regarding the categorization of an AI system.
4: Contradictory response: Both responses contain legal conclusions that contradict each other.
Additionally, evaluate the style of response B:
Good: Professional response, it provides legal clarity.
Poor: Unprofessional response; it is not professional in a legal context and for a textbook on the AI Act. For example, the reader is addressed informally, or reference is made to further discussions and assistance. Or reference is made to the GDPR or other laws that are not the AI Act.
Start your answer with the grade (1 to 4), directly with the number. Then explain in two sentences how you reached this conclusion. Next, evaluate the style of the answer with "good" or "poor" and explain your decision in one sentence. Start with "good" or "poor."
Please summarize the comparison between two questions under the heading "Summary:".
Please remove all ** before and after the words. For example, **Question:** should be changed to Question:.
Please adhere strictly to the following structure and ALWAYS include all sections, even if they are empty:
Question:
Existing answer:
New answer:
Summary:"
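Filling the placeholders {question}, {existing_answer}, and {new_answer} in the evaluation prompt above can be sketched as follows. The template here is abbreviated for illustration; the full text is the prompt quoted above.

```python
# Abbreviated version of the evaluation prompt quoted above; the full
# instruction text would follow the {new_answer} placeholder.
EVALUATION_PROMPT = """Below is a question about the AI Act and two answers \
to that question. Your task is to assess the equivalence of the answers \
from a legal perspective.

Question:
{question}

Answer A:
{existing_answer}

Answer B:
{new_answer}
"""

def build_eval_prompt(question: str, existing_answer: str, new_answer: str) -> str:
    """Insert one question-pool entry and the chatbot's new answer."""
    return EVALUATION_PROMPT.format(
        question=question,
        existing_answer=existing_answer,
        new_answer=new_answer,
    )
```

The filled prompt is then sent to GPT-4o, and the graded responses are aggregated over the whole question pool.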
Another element of quality assurance is an integrated feedback option on the website.
For each request, the energy consumption of the processor (CPU), graphics card (GPU), and RAM is specified, with power consumption measured directly at the hardware level and an average value calculated for the usage period of the request (see script here):
The Power Usage Effectiveness (PUE) value indicates the ratio of the data center's total energy consumption to the energy consumption of the hardware itself. According to the data center operator (Hetzner, see here), the PUE value is between 1.1 and 1.16, i.e., an additional 10 to 16% of the measured energy consumption is used for cooling the hardware, etc.
The comparative examples for energy consumption are calculated as follows:
a. Hair dryer with 2200 watts of power:
(Total energy consumption (watt-hours) / 2200 watts) * 3600 = operating time (seconds)
b. Cell phone battery with 5000 milliampere-hours capacity, 5 volts voltage, charging efficiency is 80 percent:
(Total energy consumption (watt-hours) / 5 volts) * 1000 = capacity used (milliampere-hours)
(Milliampere-hours / 5000 milliampere-hours) * 0.8 * 100 = mobile phone battery charge (percent)
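The comparative calculations above can be expressed directly in code. This is a minimal sketch; the function names are chosen here, and the PUE value of 1.13 is an assumed midpoint of the operator's stated 1.1–1.16 range.

```python
def total_energy_wh(measured_wh: float, pue: float = 1.13) -> float:
    """Scale hardware-level consumption by the data center's PUE
    (assumed midpoint of the stated 1.1-1.16 range)."""
    return measured_wh * pue

def hairdryer_seconds(total_wh: float, dryer_watts: float = 2200) -> float:
    """Operating time of a 2200 W hair dryer for the same energy."""
    return total_wh / dryer_watts * 3600

def battery_percent(total_wh: float, capacity_mah: float = 5000,
                    volts: float = 5, efficiency: float = 0.8) -> float:
    """Share of a 5000 mAh phone battery charged at 5 V,
    assuming 80% charging efficiency."""
    mah = total_wh / volts * 1000          # convert to milliampere-hours
    return mah / capacity_mah * efficiency * 100
```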
All components of the AI Act chatbot run on dedicated hardware. Specifically, a server with an NVIDIA RTX 4000 ADA graphics card, an Intel Core™ i5-13500 processor, and 64 GB of RAM is deployed. The system operates on Ubuntu 24.04 LTS as the operating system. The total disk space consumption of the server - including data, the operating system, the language model, and the embedding model - remains below 100 GB.