Computer, how does the stratigraphy of building A relate to that of building B?
The ability to ask questions of our data using our everyday language is the Star Trek dream, isn’t it? ‘Computer, when and how was this site abandoned?’ ‘Computer, identify the logical inconsistencies in this stratigraphy.’ ‘Computer, what is the relationship between phase 4 and the area B group?’ Things like that.
Generative AI sure does sound like a human, right? That is, of course, by design. In its training, humans (and sometimes machines) select from pairs of responses the one that ‘sounds like’ a person. Not the more truthful answer, or the more accurate answer. Just something plausible, whether or not it is grounded in truth. In other words: bullshit.
But there are ways to constrain the generation towards truthfulness. In this exercise we’ll explore ‘retrieval-augmented generation’ (RAG), which is one way (but likely not the best way) to do this.
Remember, generative AI is all about the probability of the next token (word chunk) given an input chunk. So what if you could constrain the probabilities for generation to the space described by a defined body of information? With RAG, we first take our defined body of information and turn it into vectors (remember what we were doing with images? Similar to that). Then we take a question we want answered and express it in that same embedding space. Now we can measure the similarity between the query and our defined body of information. We take the three or four (or however many) source documents that are ‘closest’ to the query, and pile them, together with the query, into a kind of super-prompt for the basic language model to use when generating a response. The basic language model understands how to ‘speak’; the retrieval of appropriate information tells it what to say.
More or less.
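To make that concrete, here is a minimal sketch of the retrieve-then-generate loop, assuming the sentence-transformers library for the embeddings. The documents, the model name, and the final prompt are illustrative stand-ins, not the actual contents of the notebook, and the last generation step is left as a placeholder since the notebook supplies its own language model.

```python
# A minimal RAG sketch: embed documents, retrieve by similarity,
# assemble a 'super-prompt'. Assumes `pip install sentence-transformers`.
# The documents below are invented examples, not the demo data.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # a small, widely used embedding model

documents = [
    "Context 001: topsoil, dark loam, modern finds throughout.",
    "Context 002: demolition rubble sealing the building A floor.",
    "Context 003: occupation surface, hearth, 3rd-century pottery.",
]

# Turn each document, and the query, into vectors in the same embedding space.
doc_vectors = model.encode(documents, normalize_embeddings=True)
query = "When and how was this site abandoned?"
query_vector = model.encode([query], normalize_embeddings=True)[0]

# Cosine similarity (a dot product, since the vectors are normalized);
# keep the top k 'closest' documents.
scores = doc_vectors @ query_vector
top_k = np.argsort(scores)[::-1][:2]
retrieved = "\n".join(documents[i] for i in top_k)

# Pile the retrieved sources and the question into one super-prompt.
prompt = (
    "Answer the question using only the following excavation records:\n"
    f"{retrieved}\n\nQuestion: {query}\nAnswer:"
)
print(prompt)  # hand this to whatever language model you're using
```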
In our workbench, in the week 7 folder, start up the archaeo-rag.ipynb computational notebook. It walks you through how the whole thing works. In this case, we’re going to use some demo information describing each archaeological context from a fictional excavation. The notebook turns each row of a CSV corresponding to a context (an event in the history of deposition of materials at a site) into a vector in an embedding space. Then, when you give it a query, it expresses your question in that same embedding space, finds the contexts most likely to answer it, and generates a coherent response based on them.
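In the same spirit, the row-to-vector step might look something like the sketch below. The file name and the column names (`context_id`, `type`, `description`) are hypothetical; check the demo CSV in the notebook for the real ones.

```python
# Sketch of the notebook's first step: one CSV row per archaeological
# context becomes one text 'document', then one vector. Column names
# here are assumptions, not the demo file's actual headers.
import pandas as pd
from sentence_transformers import SentenceTransformer

df = pd.read_csv("contexts.csv")  # illustrative path to the demo data

# Flatten each row into a single sentence describing that context.
documents = [
    f"Context {row.context_id} ({row.type}): {row.description}"
    for row in df.itertuples(index=False)
]

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = model.encode(documents, normalize_embeddings=True)
print(doc_vectors.shape)  # (number of contexts, embedding dimension)
```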
It might be that this is too resource-intensive for your machine; here is a version using the Google Colab service with a GPU that you can try.
If you’re really ambitious… do you see how you could express your graveyard data as a CSV to use with this approach? What would it mean if historical or archaeological researchers employed such methods as part of their exploration of the information they’ve collected? What dangers or possibilities do you foresee? As a history student, how does this make you feel? Would you use it?