A new acronym is gaining ground in the language industry: Retrieval Augmented Generation, also known as RAG, a technique that uses prompting to supply large language models (LLMs) with additional knowledge, data, and context.

In other words, RAG is a sophisticated way of finding more relevant content in response to queries.

If a client or language services provider (LSP) has a TMS, that system typically contains an extensive dataset, which could include a translation memory (TM), termbase, translated documents, and a range of metadata. The data is often quite clean, structured, and validated by human experts.

But those clients and LSPs are rarely satisfied with the results they see from off-the-shelf LLMs, which lack their internal knowledge, such as preferred terminology and tone of voice.

Speaking at SlatorCon Remote June 2024, Roeland Hofkens, Chief Product and Technology Officer at LanguageWire, explained that for these companies, building a new, bespoke LLM is very difficult and expensive; full fine-tuning or continued training of a foundation model is still challenging and costly, and hard to maintain with changing data.

So LanguageWire set out to leverage its customers’ linguistic assets to build an automated RAG pipeline, which would support a better user experience.

The standard RAG process begins by analyzing the user’s intent with a look at the original prompt. The system then searches for data (including documents, databases, and other formats) that could be related to that prompt. It identifies matching pieces, retrieves the text, and uses that text to create a much richer prompt.

This augmented prompt stitches all the data together to include more information about the business. When fed to the LLM, it should return a much more relevant response.
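As a rough illustration of that flow, the sketch below embeds the user's prompt, ranks a small corpus by semantic similarity, and stitches the best matches into an augmented prompt. The embed() and complete() callables are hypothetical stand-ins for an embedding model and an LLM endpoint, not any specific vendor API.

```python
# A minimal sketch of the standard RAG flow described above.
# embed() and complete() are hypothetical stand-ins for an embedding
# model and an LLM completion endpoint.
from typing import Callable, List, Tuple
import math


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def rag_answer(
    query: str,
    corpus: List[str],                    # documents, TM segments, etc.
    embed: Callable[[str], List[float]],  # hypothetical embedding model
    complete: Callable[[str], str],       # hypothetical LLM call
    top_k: int = 3,
) -> str:
    # 1. Embed the user's prompt to capture its intent.
    query_vec = embed(query)
    # 2. Rank the corpus by semantic similarity to the prompt.
    scored: List[Tuple[float, str]] = sorted(
        ((cosine(query_vec, embed(doc)), doc) for doc in corpus),
        reverse=True,
    )
    # 3. Stitch the best matches into an augmented prompt.
    context = "\n\n".join(doc for _, doc in scored[:top_k])
    prompt = (
        "Use the following company context to answer.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    # 4. Send the richer prompt to the LLM.
    return complete(prompt)
```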

Evolution of an Editing Tool

LanguageWire’s real-world application is a content authoring assistant with a user-friendly UI and an emphasis on automation. The pipeline has already gone through a few iterations.

The first method, Hofkens explained, used TMs and termbases as data sources. These were stored in a vector database and accessed via a semantic search that could detect pieces of text with meaning similar to the prompt. The idea was to retrieve relevant segments and terms from among the results and use a template to build a new, improved prompt with those results.
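What such a template might look like is sketched below; the wording, field names, and helper function are illustrative assumptions rather than LanguageWire's actual prompt, and retrieval of the segments and terms is assumed to happen beforehand (for instance via a semantic search like the one sketched earlier).

```python
# An illustrative prompt template for this first iteration: retrieved TM
# segments and termbase entries are slotted into a fixed template.
# The template text and field names are assumptions for the sketch.
SEGMENT_PROMPT = """You are a content assistant for {client}.
Reuse the style of these approved translation segments:
{segments}

Always use this preferred terminology:
{terms}

Task: {task}"""


def build_segment_prompt(client: str, task: str,
                         segments: list[str], terms: dict[str, str]) -> str:
    return SEGMENT_PROMPT.format(
        client=client,
        task=task,
        segments="\n".join(f"- {s}" for s in segments),
        terms="\n".join(f"- {src} -> {tgt}" for src, tgt in terms.items()),
    )
```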

“When we found segments and things that did have relevance, they did not provide enough context in the prompt because segments tend to be rather short. It’s a sentence or a part of a sentence, sometimes single words and things like that,” Hofkens said. “It’s not enough to really instruct an LLM to really get good results. So that was definitely disappointing.”

A new architecture replaced TMs with “content memories”: documents customers had sent to the LSP, which could include Word, PowerPoint, XML, and other file types. The appeal of content memories is that they give the semantic search more words to work with, for instance a paragraph or even higher-level content rather than simple segments.

Putting these “chunks” of context into the vector database provided bigger pieces of information for the semantic search, which in turn produced higher quality completions.
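A rough chunking sketch along those lines is shown below; the paragraph-based splitting, chunk size, and overlap are placeholder assumptions rather than LanguageWire's actual settings.

```python
# A rough sketch of splitting "content memories" (full documents) into
# paragraph-sized chunks before embedding, as opposed to sentence-level
# TM segments. Sizes and overlap are illustrative assumptions.
def chunk_document(text: str, max_chars: int = 1500, overlap: int = 200) -> list[str]:
    """Split a document into overlapping, paragraph-aligned chunks."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        if current and len(current) + len(para) > max_chars:
            chunks.append(current)
            # Carry a short tail of the previous chunk over for continuity.
            current = current[-overlap:]
        current = f"{current}\n\n{para}".strip()
    if current:
        chunks.append(current)
    return chunks
```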

The third level of architecture was adopted once LanguageWire noticed that larger chunks (i.e., those with more text) helped the search identify significantly more semantic similarities.

The vector database performed semantic searches and ranked the results by similarity. The system then identified the “top-level” chunks and sent them to the LLM for summarization, compressing them into a smaller text set that still contained all the relevant keywords and fit into the final prompt. This improved prompt would then be plugged in to generate the final output. 
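A hedged sketch of that multistep flow is shown below: the top-ranked chunks are compressed by an LLM before the final prompt is assembled. summarize() and complete() are hypothetical LLM calls, and a character budget stands in for a real token limit.

```python
# A sketch of multistep RAG: rank chunks, compress the best ones with an
# LLM, then build the final prompt from the compressed context.
# summarize() and complete() are hypothetical LLM calls.
from typing import Callable, List, Tuple


def multistep_rag(
    query: str,
    ranked_chunks: List[Tuple[float, str]],  # (similarity, chunk), best first
    summarize: Callable[[str], str],         # LLM used for compression
    complete: Callable[[str], str],          # LLM used for the final answer
    top_n: int = 5,
    budget_chars: int = 4000,                # rough stand-in for a token budget
) -> str:
    top_chunks = "\n\n".join(chunk for _, chunk in ranked_chunks[:top_n])
    # Compress the top-ranked chunks so they fit the prompt budget while
    # keeping the keywords relevant to the query.
    compressed = summarize(
        f"Condense the following so it still answers: {query}\n\n{top_chunks}"
    )[:budget_chars]
    final_prompt = f"Context:\n{compressed}\n\nTask: {query}"
    return complete(final_prompt)
```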

Quality Jump

This multistep RAG approach, which uses the LLM to improve on RAG itself, “gave us another quality jump because we were simply able to automatically generate better prompts that were a better fit for the LLM,” Hofkens added.

The product is now in beta, and LanguageWire plans a full public launch in September 2024.

Even by then, however, there could be major changes. “[RAG] is a very, very active field of research,” Hofkens noted. Research currently focuses on simple, few-shot approaches to craft prompts based on similar translations from TMs.
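A minimal illustration of what such a few-shot prompt could look like, with placeholder language names and TM pairs:

```python
# An illustrative few-shot translation prompt built from similar TM matches.
# Language names and example pairs are placeholders for the sketch.
def few_shot_translation_prompt(source: str,
                                tm_matches: list[tuple[str, str]]) -> str:
    shots = "\n\n".join(
        f"English: {src}\nDanish: {tgt}" for src, tgt in tm_matches
    )
    return f"{shots}\n\nEnglish: {source}\nDanish:"
```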

Right now, super-fast and MT-focused models work better than LLMs for translation, but Hofkens expects the speed and cost disadvantages limiting LLMs to disappear soon.

Incredible Hardware Coming

“There is some incredible hardware coming these days that will speed it up and then bring down the cost gigantically,” Hofkens said. “So, you know, we should experiment with techniques that might lead to better machine translation.”

Context windows, for example, were originally quite small, but have expanded, with Gemini now at one million tokens, enough “space” for 1,500 pages of text in one prompt. But RAG still offers advantages, specifically with regard to operational cost, recall, and precision.

LLMs are being used to evaluate the information coming back from RAG, to assess whether it can be used as is or requires another round of RAG. 
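One way to picture that loop is sketched below, assuming hypothetical retrieve(), judge(), and complete() callables rather than any specific product API.

```python
# A sketch of using an LLM as a judge of retrieved context: if the context
# is deemed insufficient, run another retrieval round before answering.
# retrieve(), judge(), and complete() are hypothetical callables.
from typing import Callable


def iterative_rag(
    query: str,
    retrieve: Callable[[str], str],     # returns context text for a query
    judge: Callable[[str, str], bool],  # True if the context answers the query
    complete: Callable[[str], str],     # final LLM call
    max_rounds: int = 3,
) -> str:
    context = retrieve(query)
    for _ in range(max_rounds - 1):
        if judge(query, context):
            break
        # Context judged insufficient: retrieve again with a refined query.
        context += "\n\n" + retrieve(f"More detail on: {query}")
    return complete(f"Context:\n{context}\n\nQuestion: {query}")
```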

Beyond the technology, Hofkens reminded attendees that relationships with clients are critical to the success of a given pipeline or system.

“We do not hold all the information of our customers. We have what they send to us for translation, and that’s what’s in the TMS in the end. So it’s a very nice multilingual data set, but it does not cover everything,” Hofkens said. “Basically you need to compose your RAG pipeline with these different data sets.”


