Drop LangChain and Instructor: Use This Alternative for Structured Output Generation

December 8, 2024

Structure of a compound AI system

As we move towards more compound AI systems in which the LLM is just one component among others, making sure the LLM can generate structured outputs becomes ever more crucial. For a compound AI system to work, there should be communication between its decoupled components. And for that to happen, each has to send structured responses to the others in a way that matches their respective APIs.

From any component to the LLM, it is easy. LLMs can ingest any textual input both with or without predetermined structure.

From the LLM to any other component, the generation of structured outputs (JSON, or XML for example) becomes necessary to fit the API exposed by the component that the LLM tries to interact with.

A component can be something as simple as a function to something as complex as a fully integrated API that interacts with multiple systems, handles authentication, data processing, and provides extensive functionalities to support diverse operational workflows.

Let’s say you have a function add(a, b) that you want your LLM to use whenever it has to perform some addition. Then you want your LLM to be able to generate a structured response, maybe a JSON with the parameters of that function.

{
  "a": 10,
  "b": 15
}

This way, you can just pass the parameters to the function, execute it, and get the accurate result you need. A common way to generate structured outputs is to use LangChain output parsers or instructors.

In some cases, you can even directly use the client of the LLM provider you are using. For the OpenAI’s API for instance, that implies using the “response_format” parameter.

But if you are into prompt optimization and building compound AI systems using DSPy, then I have great news for you. You can use it for structured outputs generation, and the APIs to do so is neat. Let’s look at some examples:

Employee data extraction

You want to extract the salary as int and the name as str for all employees. You just have to define that in your signature like so:

import dspy
from dotenv import load_dotenv

load_dotenv()

# Define the language model using LiteLLM
lm = dspy.LM(model="openai/gpt-4o")

dspy.configure(lm=lm)

employee_data = """John Doe is a software engineer at OpenAI.
He has been working with the company for 5 years.
He earns $100,000 per year."""

pred = dspy.Predict("employee_data:str -> salary:int, name:str")

print(pred(employee_data=employee_data))

Here is the output you get:

Prediction(
    salary=100000,
    name='John Doe'
)

Nice, isn’t it?

Extraction of clinical entities

You can even extract more complex data structures. In the following script, the goal is to parse clinical entities. DSPy makes it so easy.

import dspy
from dotenv import load_dotenv

load_dotenv()

# Define the language model using LiteLLM
lm = dspy.LM(model="openai/gpt-4o")

dspy.configure(lm=lm)


class ClinicalEntitiesExtractor(dspy.Signature):
    """Extract medical and clinical named entities of drug, frequency,
    dosage, form, and duration from the below data."""

    clinical_note: str = dspy.InputField(
        desc="The clinical note containing medical information."
        )
    answer: list[dict[str, str]] = dspy.OutputField(
        desc="""The extracted medical named entities for example
        {'drug': 'Lisinopril',
        'frequency': 'once daily',
        'dosage': '20 mg',
        'form': 'tablet',
        'duration': 'NA'}""")


pred = dspy.Predict(ClinicalEntitiesExtractor)

TEXT = """Patient John Doe, a 45-year-old male, has been prescribed
20 mg Lisinopril once daily in tablet form to manage hypertension.
Additionally, he should take 500 mg Metformin twice a day in capsule
form for blood sugar control. For his recent diagnosis of acid reflux,
the doctor advised taking 10 mg Omeprazole in the form of a delayed-release
capsule every morning before breakfast for a duration of 6 weeks.
If necessary, the patient may also use 10 ml Gaviscon suspension
up to four times daily for immediate relief."""

results = pred(clinical_note=TEXT)

print(results)

Here is the output you get:

Prediction(
    answer=[{'drug': 'Lisinopril', 'frequency': 'once daily', 'dosage': '20 mg', 'form': 'tablet', 'duration': 'NA'}, {'drug': 'Metformin', 'frequency': 'twice a day', 'dosage': '500 mg', 'form': 'capsule', 'duration': 'NA'}, {'drug': 'Omeprazole', 'frequency': 'every morning before breakfast', 'dosage': '10 mg', 'form': 'delayed-release capsule', 'duration': '6 weeks'}, {'drug': 'Gaviscon', 'frequency': 'up to four times daily', 'dosage': '10 ml', 'form': 'suspension', 'duration': 'NA'}]
)

Pydantic support

You can define the structure of your data using a pydantic model (similar to what you do when using Instructor).

from pydantic import BaseModel, Field

class EmployeeInfo(BaseModel):
    name: str = Field(...,
                      description="The name of the employee.")
    title: str = Field(...,
                       description="The title of the employee",
                       examples=["Data scientist", "Software engineer"])
    department: str = Field(...,
                            description="The department the employee works in")
    salary: float = Field(...,
                          description="The salary of the employee.")

One obvious application of this is synthetic data generation. Let’s say I want to generate 10 fake employees. I just have to define the schema using pydantic, and use DSPy to enforce correct data generation. Here is how:

import dspy
from dotenv import load_dotenv
from pydantic import BaseModel, Field

load_dotenv()

# Define the language model using LiteLLM
lm = dspy.LM(model="openai/gpt-4o")

dspy.configure(lm=lm)


class EmployeeInfo(BaseModel):
    name: str = Field(...,
                      description="The name of the employee.")
    title: str = Field(...,
                       description="The title of the employee",
                       examples=["Data scientist", "Software engineer"])
    department: str = Field(...,
                            description="The department the employee works in")
    salary: float = Field(...,
                          description="The salary of the employee.")


class EmployeeGeneration(dspy.Signature):
    """Generate employee information from the given text."""

    employee_id: int = dspy.InputField(
        desc="The unique identifier for the employee."
    )
    employee_info: EmployeeInfo = dspy.OutputField(
        desc="The generated employee information."
    )


pred = dspy.Predict(EmployeeGeneration)

for i in range(10):
    print(pred(employee_id=i))

Amazing, right? Now if you still want to use LangChain or Instructor, that’s your choice. But remember, you can do the same with DSPy and benefit from prompt optimization and other perks in the process.

Happy coding!