Banana
Banana provided serverless GPU inference for AI models, a CI/CD build pipeline, and a simple Python framework (Potassium) to serve your models.
This page covers how to use the Banana ecosystem within LangChain.
Installation and Setup
- Install the Python package banana-dev:
pip install banana-dev
- Get a Banana API key from the Banana.dev dashboard and set it as an environment variable (BANANA_API_KEY).
- Get your model's key and URL slug from the model's details page.
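The setup steps above can be checked from Python before any model is called. The following is a minimal sketch, not part of the Banana or LangChain APIs: BANANA_API_KEY is the variable named on this page, while the function name and the BANANA_MODEL_KEY and BANANA_MODEL_URL_SLUG variable names are hypothetical and chosen only for illustration.

```python
import os


def load_banana_settings(env=os.environ):
    """Collect the Banana credentials described in the setup steps.

    BANANA_API_KEY is the variable named on this page. The other two
    variable names are hypothetical and used only in this sketch.
    """
    api_key = env.get("BANANA_API_KEY")
    if not api_key:
        raise RuntimeError("BANANA_API_KEY is not set")
    return {
        "api_key": api_key,
        # Hypothetical variable names for the model key and URL slug.
        "model_key": env.get("BANANA_MODEL_KEY", ""),
        "model_url_slug": env.get("BANANA_MODEL_URL_SLUG", ""),
    }
```

Passing a plain dictionary instead of os.environ makes the helper easy to exercise without touching the real environment.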
Define your Banana Template
You'll need to set up a GitHub repo for your Banana app. You can get started in 5 minutes using this guide.
Alternatively, for a ready-to-go LLM example, you can check out Banana's CodeLlama-7B-Instruct-GPTQ GitHub repository. Just fork it and deploy it within Banana.
Other starter repos are available here.
Build the Banana app
To use Banana apps within LangChain, the JSON returned by your app must include the outputs key, and its value must be a string.
# Return the results as a dictionary
result = {'outputs': result}
An example inference function would be:
@app.handler("/")
def handler(context: dict, request: Request) -> Response:
    """Handle a request to generate code from a prompt."""
    model = context.get("model")
    tokenizer = context.get("tokenizer")
    max_new_tokens = request.json.get("max_new_tokens", 512)
    temperature = request.json.get("temperature", 0.7)
    prompt = request.json.get("prompt")
    prompt_template = f'''[INST] Write code to solve the following coding problem that obeys the constraints and passes the example test cases. Please wrap your code answer using ```:
{prompt}
[/INST]
'''
    input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
    output = model.generate(inputs=input_ids, temperature=temperature, max_new_tokens=max_new_tokens)
    result = tokenizer.decode(output[0])
    return Response(json={"outputs": result}, status=200)
This example is from the app.py file in CodeLlama-7B-Instruct-GPTQ.
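Because the value under the outputs key must be a string, it can help to normalize the result before building the response. The helper below is a hedged sketch of one way to do that. The function name to_langchain_payload is hypothetical and is not part of Potassium or LangChain.

```python
import json


def to_langchain_payload(result):
    """Wrap a model result so that the 'outputs' value is always a string."""
    if isinstance(result, bytes):
        text = result.decode("utf-8")
    elif isinstance(result, str):
        text = result
    else:
        # Lists, dicts, and numbers are serialized to a JSON string.
        text = json.dumps(result)
    return {"outputs": text}
```

In a handler like the one above, the returned dictionary could be passed as the json argument of the response in place of a hand-built dictionary.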
LLM
from langchain_community.llms import Banana
See a usage example.
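As a rough sketch of how the import above might be used, the code below builds the LLM from a model key and URL slug. It assumes the Banana class accepts model_key and model_url_slug keyword arguments and reads BANANA_API_KEY from the environment. The helper names banana_llm_kwargs and build_banana_llm are hypothetical, and the placeholder values must be replaced with those from your model's details page.

```python
import os


def banana_llm_kwargs(model_key, model_url_slug):
    """Build constructor arguments for the Banana LLM wrapper."""
    if not model_key or not model_url_slug:
        raise ValueError("model_key and model_url_slug are both required")
    return {"model_key": model_key, "model_url_slug": model_url_slug}


def build_banana_llm(model_key, model_url_slug):
    """Create the LLM; needs langchain-community and banana-dev installed."""
    # Imported lazily so banana_llm_kwargs works without the packages.
    from langchain_community.llms import Banana

    if "BANANA_API_KEY" not in os.environ:
        raise RuntimeError("BANANA_API_KEY is not set")
    return Banana(**banana_llm_kwargs(model_key, model_url_slug))


# Usage (requires network access and a deployed model):
# llm = build_banana_llm("YOUR_MODEL_KEY", "YOUR_MODEL_URL_SLUG")
# print(llm.invoke("Write a function that reverses a string."))
```

Validating the two identifiers up front means a missing value fails locally with a clear message rather than at request time.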