[Notes] from “So you think you can prompt”

Dipam Vasani
10 min read · Jun 23, 2024


When more than 2 people ask you for something, turn it into an article.

Introduction

Just cataloging my notes and follow-up thoughts here from an event I attended called “So you think you can prompt”. As the event description said, it was a high signal-to-noise event.

First of all, it was surreal to meet, in person, the people I follow and learn from on Twitter: Eugene Yan (who is super chill btw), Hamel Husain, Charles from Modal, and Jonathan Whitaker.

Dr. Bryan Bischof is a great prompt engineer for humans. His questions were well researched and his follow ups were insightful.

The event was split into two sessions. In the first half there were 3 lightning talks. In the second half, we were all seated at tables of 5–7 and discussed our best LLM hacks. The best hack from each table was then presented to the room, and the top 2 won 3D-printed trophies. The evaluation criteria for the hacks were:

  • Is the idea novel? (surprise element)
  • Is it useful, i.e. can the people in this room take this idea, try it and gain value from it? (business element)
  • … and I forget the third one 😭

With that, let’s get to my notes.

Session #1 — Lightning talks


Lightning talk #1 — RAGs

I did not get the name of the speaker for this talk, if I do find it I will update the article.

My notes:

The talk was about retrieval.

Most people create vector dbs by chunking and embedding their documents. However, these documents have additional signals which we should make use of. For example, “when was the document created?” can be a useful signal.

If a document was created 10 years ago, or was created a long time ago and then never opened by anyone since, it might not be that useful. The speaker suggested using open source / custom encoders to encode these signals (metadata) and finding a way to stitch them together with the chunked embeddings to get a final embedding.
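
To make the stitching idea concrete for myself, here is a rough sketch. Everything below (the recency feature, the placeholder text encoder, the shapes) is my own guess at an implementation, not something the speaker showed:

```python
import numpy as np
from datetime import datetime, timezone

def embed_text(chunk: str) -> np.ndarray:
    # Placeholder for a real text encoder (e.g. a sentence-transformers model).
    # A seeded random vector keeps the sketch runnable end to end.
    rng = np.random.default_rng(abs(hash(chunk)) % (2**32))
    return rng.normal(size=384)

def encode_recency(created_at: datetime) -> np.ndarray:
    # Turn "when was this document created?" into a small numeric signal.
    age_days = (datetime.now(timezone.utc) - created_at).days
    return np.array([np.exp(-age_days / 365.0)])  # decays as the document ages

def final_embedding(chunk: str, created_at: datetime) -> np.ndarray:
    # Stitch the chunk embedding and the metadata signal into one vector.
    return np.concatenate([embed_text(chunk), encode_recency(created_at)])

vec = final_embedding("Q3 earnings summary", datetime(2023, 5, 1, tzinfo=timezone.utc))
print(vec.shape)  # (385,)
```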

Follow up question: “Why do you suggest encoding these signals rather than pre filtering or post filtering using them?”

Answer: You can, but there are two challenges:

  1. If the user asks a question like “what was the recent news about xyz?”, you now have to interpret “recent” as the last 7 days or the last 10 days etc., which can be subjective.
  2. Doing it as a separate filter is hard because you are dealing with multiple indexes, and it gets even more complex as it scales (I didn’t understand this part).

My thoughts:

I like this idea and would want to try it. It fits in with query expansion where you use an LLM to expand things like “recent” and then find embeddings for those specific dates / ranges.
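
For reference, this is the kind of query expansion I mean. In practice an LLM would do the rewrite; the fixed 7-day window and the helper below are just my own stand-in:

```python
from datetime import date, timedelta

def expand_recent(query: str, window_days: int = 7) -> dict:
    # A fixed rule stands in for the LLM rewrite of words like "recent".
    if "recent" in query.lower():
        return {
            "query": query.replace("recent ", ""),
            "created_after": (date.today() - timedelta(days=window_days)).isoformat(),
        }
    return {"query": query, "created_after": None}

print(expand_recent("what was the recent news about xyz?"))
```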

One thing I am unclear on is how embeddings solve the subjectivity of “recent”, because even an embedding would have to learn to associate “recent” with some specific date range.

Lightning talk #2 — Function calling

Speaker: Shishir Patil, PhD student at UC Berkeley

My notes:

The talk was about agents and function calling.

Shishir spoke about gorilla-openfunctions-v2, an open source model with good function calling capabilities. They also have a function calling leaderboard, and given that they’re not associated with any AI lab, they can be unbiased about it and push back fairly.

Finally, they also have something called API Zoo, which has documentation for 60k+ APIs.

When API Zoo started, it was Shishir and others crawling the documentation pages and keeping them up to date. But now, companies are willingly contributing to it.

The thing with APIs is that the incentives are aligned. If an AI lab is scraping the web for data and training a model on it, the company they scraped the data from gets nothing.

However, if they are making function calls to Stripe, Stripe gets paid and so the incentives are well aligned.

Some other notes:

  • Parallel function calling: If the LLM has access to weather API and you ask for the weather of Berkeley and the weather of San Francisco, it needs to understand that it has to make two API calls and that it can do them in parallel.
  • How to think about adding APIs?: They don’t think about specific APIs, but rather about which languages / frameworks they want to support, like Python or RESTful APIs.
  • What evals are used with function calling? One interesting thing mentioned here was that with function calling, it’s possible to detect hallucination. For example, if the LLM is trying to figure out which tool / API to use, there is a right answer, there is a wrong answer, and then there is an API which is not even available.
  • How do you figure out which APIs are available? He mentioned using an abstract syntax tree (AST), which I need to look into, but it basically checks “Is the called API a subset of all the APIs available at this point in time?” New APIs can be added over time. This eval is consistent and clear, unlike LLM-as-a-judge, which can be vague and subjective. (My rough guess at the check is sketched right after this list.)
  • What other evals do you use? Execution is another: seeing if you get the response you expect. It’s not just about whether the call succeeded, but whether the answer was correct. You can do things like ask for the stock price of Tesla on May 11 or the price of spot instances on AWS, and since you already know the answer, the comparison becomes easy.
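
Here is my rough guess at what the AST-based check looks like in Python. The tool names are made up, and the Gorilla team’s real eval is surely more involved:

```python
import ast

AVAILABLE_APIS = {"get_weather", "get_stock_price"}  # the tools you actually expose

def call_is_grounded(generated_call: str) -> bool:
    """Return True only if every function the model calls actually exists."""
    try:
        tree = ast.parse(generated_call)
    except SyntaxError:
        return False  # not even syntactically valid
    called = {
        node.func.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }
    return bool(called) and called <= AVAILABLE_APIS

print(call_is_grounded('get_weather(city="Berkeley")'))  # True
print(call_is_grounded('book_flight(to="SFO")'))         # False: hallucinated API
```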

My thoughts:

This was a really good talk in general; lots of takeaways and follow-ups for me since I use function calling for structured extraction.

Lightning talk #3 — Evals

Speaker: Raza Habib, CEO @ Humanloop

My notes:

The talk started with the speaker’s research into active learning, a machine learning technique for selecting which examples are worth adding to your training set. You don’t want to add examples your model is already good at.

The speaker mentioned some common techniques, like oversampling rare labels, which might sound counterintuitive since we supposedly want a representative dataset. No we don’t: the data you select with active learning can be biased, because its whole objective is to get your model good at its task.
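
As a toy version of the selection step he described (pick what the model is least sure about), here is an entropy-based sketch. The numbers, the top-k rule and the budget are mine, not the speaker’s:

```python
import numpy as np

def entropy(probs: np.ndarray) -> np.ndarray:
    # Predictive entropy per example: high means the model is unsure.
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

def select_for_labeling(probs: np.ndarray, budget: int) -> np.ndarray:
    # Pick the examples the model is least sure about; skip the ones it already nails.
    return np.argsort(-entropy(probs))[:budget]

# Model predictions over 3 classes for 5 unlabeled examples.
probs = np.array([
    [0.98, 0.01, 0.01],  # confident: not worth labeling
    [0.40, 0.35, 0.25],  # uncertain: worth labeling
    [0.33, 0.33, 0.34],  # very uncertain
    [0.90, 0.05, 0.05],
    [0.50, 0.45, 0.05],
])
print(select_for_labeling(probs, budget=2))  # indices of the two most uncertain examples
```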

Having said that, the speaker reiterated the importance of a held-out test set for the actual model, since a lot of new engineers might be entering the world of LLMs. He gave a lot of good nuggets on academia vs industry. For example:

  • In academia you care about how generalized the model is, in industry you care about how good it is at the task you want it to do
  • In academia the dataset can be fixed for years, in industry it gets stale very fast
  • In academia public benchmarks matter a lot, in industry benchmarks curated for your specific usecase matter much more

Raza also suggested that in the past with ML, the product team would come up with a spec and pass it along to the engineers. However, with LLMs, the domain experts, product managers and people working with customers should be much more involved in writing the prompts and the evals because they understand what good looks like.

Writing your evals is writing the spec
- Raza

Finally, he talked a little bit about LLM as a judge.

He said that it works if used appropriately. What is appropriately?

It’s a mistake if you ask it for a score. For example, “On a scale of 1 to 10, how good is this writing?” because it’s subjective.

It works if you ask it binary questions. For example, they worked with a company that helped draft patents, and patents have certain rules where something should always be included or excluded. “Is X in the sentence?” is a good way to use LLM as a judge. Another example: job descriptions that may or may not need to include the salary range, based on the location they’re hiring in.
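
A minimal sketch of the binary-question style of judge, assuming an OpenAI-style chat client. The prompt, model name and salary-range wiring below are mine, not Humanloop’s actual setup:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def includes_salary_range(job_description: str) -> bool:
    # A binary question is much less subjective than "rate this 1-10".
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system", "content": "Answer with exactly one word: yes or no."},
            {"role": "user", "content": f"Does this job description state a salary range?\n\n{job_description}"},
        ],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

print(includes_salary_range("Senior ML Engineer, NYC. $180k-$220k plus equity."))
```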

To finish the talk, Raza said human evals are obviously a part of any eval system. However, the techniques mentioned above can help augment and scale human eval.

My thoughts:

Another great talk, can’t believe this event was free 💯

Session #2 — BYOH (bring your own hacks)


Didn’t take notes on all the talks, but here are the ones I did:

1. Probably

Peter is developing algorithms to programmatically iterate on and optimize prompts.

The setup starts with multiple models to establish a baseline, and then it iterates on the prompt for each model based on its strengths.

The framework is not open source yet, but it will be published soon and is gonna be called “Probably”.

2. Don’t think of a pink elephant phenomenon with LLMs

This idea is not new, but it’s a mental model I will use when working with LLMs.

When you tell humans, “Don’t think of a pink elephant” the first thing they will think of is a pink elephant. The same is true for LLMs.

The speaker, Rani, had some geographic data and wanted the LLM to come up with a location but not the zipcode. However, no matter how much they tried (pleading, threatening, etc.), sooner or later the LLM would return a zipcode. So using “don’t” in your prompts is not a good idea.

Follow up question: So what should you do?

If you have to steer it away from something, frame the instruction positively rather than negatively. And then, of course, there’s fine-tuning.
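
Purely to illustrate the reframing, here are two made-up prompts (mine, not Rani’s):

```python
# Negative framing: the word "zipcode" is now front and center in the prompt.
negative = "Return the location for this record. Do NOT include the zipcode."

# Positive framing: describe only the shape of the answer you want back.
positive = "Return the location for this record as city and state only, e.g. 'Berkeley, CA'."
```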

3. Don’t remove bad data when fine tuning

Charles presented a hack for fine-tuning: there is no such thing as bad data. Bad data is actually good data.

Show your model both beautiful images and shitty images. For the shitty images, have humans caption what the problem is with the image, i.e. why it is shitty.

He also mentioned something about chess players and Elo rankings which I didn’t quite get.

Follow up question #1: Do you have a sense of improvement you get by doing this?

Charles: “I would have to go back and check but it’s those last few points when you’re trying to get to the top of a leaderboard.”

Follow up question #2: Is this similar to contrastive learning? Is that a good mental model to think about it?

Charles: “In contrastive learning you’re learning to differentiate between two things, vs here you’re learning what’s good and what’s bad”

4. Soft prompting to compress

Forgot the speaker’s name.

Funny story: the speaker won the silver trophy and Charles won the gold. They both worked on a project together last year, where they learnt the hacks they presented.

Soft prompting only works with open source models where you have access to the weights, i.e. models that allow you to send embeddings directly.

I am not fully clear on how this is done, but there is some training involved: you take prompts and expected outputs, freeze the whole model except the input embedding layer, and train an embedding such that instead of passing a long prompt, you just pass in the embedding.

You’re using a vector to cache the prompt.
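
Here is roughly how I picture it, sketched with GPT-2 and PyTorch. The speaker didn’t share code, so everything below is my own reconstruction of prompt tuning: freeze the model, prepend a handful of learnable “virtual token” embeddings, and train only those:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model choice and hyperparameters are placeholders, not what the speaker used.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
for p in model.parameters():
    p.requires_grad = False  # the base model stays frozen

n_virtual_tokens = 20
embed_dim = model.get_input_embeddings().embedding_dim
soft_prompt = torch.nn.Parameter(torch.randn(n_virtual_tokens, embed_dim) * 0.02)
optimizer = torch.optim.Adam([soft_prompt], lr=1e-3)  # only the soft prompt is trained

def forward_with_soft_prompt(input_ids: torch.Tensor) -> torch.Tensor:
    token_embeds = model.get_input_embeddings()(input_ids)             # (batch, seq, dim)
    prompt = soft_prompt.unsqueeze(0).expand(input_ids.size(0), -1, -1)
    inputs_embeds = torch.cat([prompt, token_embeds], dim=1)           # prepend virtual tokens
    return model(inputs_embeds=inputs_embeds).logits

# One illustrative training step: nudge the soft prompt toward a target next token.
ids = tok("The capital of France is", return_tensors="pt").input_ids
target = tok(" Paris", return_tensors="pt").input_ids
loss = torch.nn.functional.cross_entropy(forward_with_soft_prompt(ids)[:, -1, :], target[:, 0])
loss.backward()
optimizer.step()  # the 20 learned vectors move; the model itself does not
```

At inference time you send those learned vectors instead of the long natural-language prompt, which is the “vector as a cached prompt” idea.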

If I find more documentation on this I will link it here.

Follow up question: Any use case where you’ve seen improvement?

Speaker: Yes, at an AI code editor company where they have huge prompts for generation and retrieval, and they found efficiencies by using this hack.

If we can’t optimize human language, why not optimize vector space?

Audience comments:

  • Gradient-free optimization: estimate the gradients if you have access to the outputs (a toy sketch of this follows the list)
  • Mention of a paper on exploiting black-box models
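
One classic flavor of the gradient-estimation idea, as a toy (SPSA-style finite differences; my own example, not something shown at the event):

```python
import numpy as np

def spsa_step(theta: np.ndarray, score_fn, eps: float = 1e-2, lr: float = 1e-1) -> np.ndarray:
    # Estimate a gradient from two evaluations of a black-box score (no backprop needed).
    delta = np.random.choice([-1.0, 1.0], size=theta.shape)
    g_hat = (score_fn(theta + eps * delta) - score_fn(theta - eps * delta)) / (2 * eps) * delta
    return theta + lr * g_hat  # ascend the estimated gradient

score = lambda v: -np.sum((v - 1.0) ** 2)  # toy black-box score, peaks at the all-ones vector
theta = np.zeros(8)
for _ in range(200):
    theta = spsa_step(theta, score)
print(np.round(theta, 2))  # should drift toward 1.0 in every coordinate
```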

5. Setting temperature for agents

Again, I forgot the speaker’s name, but I think it was Han.

The suggestion was: for agents, which are basically LLMs in a for loop, start with a high temperature to increase the space of possible solutions, and lower it as you go through the loop.
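
A tiny sketch of the idea; the linear schedule and the stubbed model call are mine:

```python
import random

def call_llm(prompt: str, temperature: float) -> str:
    # Stand-in for a real model call; higher temperature means more varied output.
    options = ["plan A", "plan B", "plan C", "refine the current plan"]
    k = max(1, round(temperature * len(options)))
    return random.choice(options[:k])

def run_agent(task: str, steps: int = 5, t_start: float = 1.0, t_end: float = 0.2) -> None:
    # An agent is roughly an LLM in a for loop: explore broadly early, then commit.
    for i in range(steps):
        temperature = t_start + (t_end - t_start) * i / max(steps - 1, 1)
        print(f"step {i}: T={temperature:.2f} -> {call_llm(task, temperature)}")

run_agent("ship the quarterly report")
```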

6. Few shot

The speaker spoke about the value of using a lot of examples relative to tweaking instructions; especially for structured outputs, the ROI was significantly better.

Btw, I don’t remember who said this, but I also heard that when doing few-shot examples with function calling, don’t just put the output in your prompt; include the input API call as well. It improves performance.
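
The way I read that tip: a few-shot example in an OpenAI-style message list should include the assistant’s tool call and the tool’s response, not just the final text answer. The tool name and arguments below are illustrative:

```python
few_shot_messages = [
    {"role": "user", "content": "What's the weather in Berkeley?"},
    # Show the model the call it should make, not just the answer it should give.
    {"role": "assistant", "content": None, "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": '{"city": "Berkeley"}'},
    }]},
    {"role": "tool", "tool_call_id": "call_1", "content": '{"temp_f": 64, "conditions": "sunny"}'},
    {"role": "assistant", "content": "It's 64°F and sunny in Berkeley."},
    # ...then append the real user question after these examples.
]
```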

7. Inspect AI

The speaker had a demo; however, I can’t find this company online (too many other companies are called Inspect).

The idea was that universal interfaces are really valuable when working with LLMs.

The speaker showed a demo of using a prompt in Google Sheets. Google Sheets has functions where you can reference other cells. For example, you can take the sum of A1 and B1 and apply it to whole columns.

Similarly, you can have a prompt and then use a “classify” function on certain cells to get LLM output in Google Sheets. The demo was about classifying Jira tickets into certain categories.

Follow up question: What’s the usecase for this?

Universal interfaces make it very easy to give people access to LLMs. The speaker worked with a company of 5,000 people, and with this Sheets plugin, all of them could use LLMs in their daily workflows.

Super interesting.

Conclusion

I am very selective about the events I go to; however, when anyone mentioned in this article is present, I know I will come back inspired :)

~ happy learning
