Part 4 of building a retail inventory API and then giving it a brain.
In Part 3 I built the chatbot foundation: FastAPI, PostgreSQL, conversation memory, context trimming, rolling summarization, and 13 PRs' worth of broken things. The API worked. It remembered what you said. It didn't fall over when the context got too long.
That was enough to call it functional. But it didn't feel finished. No streaming. No real identity. No way to run it anywhere except my machine.
Five PRs later, all of that changed. Some of it was clean. Some of it was not.
PR 14 — Auto-Title Generation
Small PR. Big quality-of-life improvement.
Every new conversation started with the title "New Chat..." and stayed that way forever. I wanted it to generate automatically from the first message, without blocking the response.
The approach: fire a background task after the conversation is created.
if not request.title:
    asyncio.create_task(
        update_conversation_title(engine, conversation.id, request.user_message)
    )
asyncio.create_task() schedules it and moves on. The 201 Created fires immediately. The title shows up a second or two later. Clean.
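One asyncio gotcha worth knowing here: the event loop only holds weak references to tasks, so a pure fire-and-forget task can in principle be garbage-collected before it runs. A minimal sketch of the reference-holding pattern (the `fire_and_forget` helper and `background_tasks` set are illustrative names, not code from this PR):

```python
import asyncio

background_tasks: set[asyncio.Task] = set()

def fire_and_forget(coro) -> asyncio.Task:
    """Schedule a coroutine and hold a strong reference until it finishes."""
    task = asyncio.create_task(coro)
    background_tasks.add(task)
    # Drop the reference once the task completes so the set doesn't grow forever.
    task.add_done_callback(background_tasks.discard)
    return task
```

The done-callback cleanup is the pattern the asyncio docs themselves recommend for background tasks.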
But before I got there, I spent an embarrassing amount of time debugging. The first version of generate_conversation_title was calling the main model (gpt-5-mini) and getting back empty responses. Latency was around 22 seconds. 22 seconds for a title.
The problem was max_completion_tokens. I had it set to 1000, which is too low for reasoning models (they need token budget to think before responding). But even after bumping it, the main model was overkill for something this simple.
The fix was a dual model setup. A utility model (gpt-5-nano) for cheap background tasks, and the main model only for actual chat. After the switch, latency dropped from 22 seconds to under 2. While testing the fix I noticed OpenAI had released gpt-5.4-mini and gpt-5.4-nano in March 2026, so I bumped both models while I was in there. 3x faster, same quality.
latency_ms: 4527 # gpt-5-mini
latency_ms: 1418 # gpt-5.4-mini
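The dual-model setup can live in a small settings object read from the environment. This is an illustrative sketch, the field and env var names are assumptions rather than the repo's exact config:

```python
import os
from dataclasses import dataclass

@dataclass
class Settings:
    # Main model: user-facing chat completions.
    openai_model: str = os.getenv("OPENAI_MODEL", "gpt-5.4-mini")
    # Utility model: cheap background work (titles, rolling summaries).
    openai_utility_model: str = os.getenv("OPENAI_UTILITY_MODEL", "gpt-5.4-nano")

config = Settings()
```

Keeping both names in config means the next model bump is a one-line env change instead of a grep through the service layer.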
The background task lives in summarizer.py and uses that utility model:
async def update_conversation_title(engine, conversation_id, user_message: str):
    """Background task to generate and set a title for a newly created conversation."""
    try:
        with Session(engine) as session:
            conv = session.get(Conversation, conversation_id)
            if not conv:
                return
            title = await generate_conversation_title(user_message)
            if title:
                conv.title = title
                session.add(conv)
                session.commit()
                logger.info(f"Title updated for conversation {conversation_id}: {title}")
    except Exception as e:
        logger.error(f"Title generation failed: {e}")
Two lessons that burned me:
Background tasks need their own DB session. You can't pass the request session in (it gets closed before the task runs). Always create a fresh Session(engine) inside the background function.
@handle_openai_errors cannot be used on background tasks. The decorator wraps exceptions into HTTP responses, which makes no sense in a fire-and-forget context. Plain try/except is the right pattern.
PR 15 — Streaming (SSE)
This one took the most time.
The goal was to replace the blocking endpoint (wait for the full response, return it) with a streaming one. Tokens arrive at the client as they're generated, using Server-Sent Events.
The Service Layer
The streaming function is an async generator. This is where the first real problem appeared.
I tried to decorate it with @handle_openai_errors like everything else:
@handle_openai_errors  # THIS BREAKS IT
async def get_chat_completion_stream(...):
    ...
    async for chunk in response:
        yield chunk
The decorator wraps the function with return await func(...). But func is an async generator — you can't await a generator. It returns a generator object, not a coroutine. The error:
object async_generator can't be used in 'await' expression
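The failure is easy to reproduce outside the app. This standalone snippet triggers the same TypeError with a no-op decorator (names are illustrative):

```python
import asyncio

def passthrough(func):
    # Same shape as a typical error-handling decorator: it awaits the wrapped call.
    async def wrapper(*args, **kwargs):
        return await func(*args, **kwargs)
    return wrapper

@passthrough
async def stream_tokens():
    yield "hello"

async def main():
    try:
        await stream_tokens()
    except TypeError as e:
        print(e)  # object async_generator can't be used in 'await' expression

asyncio.run(main())
```

Calling an async generator function returns an async generator object immediately, without running any code, so there is nothing for the decorator's `await` to await.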
The fix: remove the decorator entirely and handle errors inline.
async def get_chat_completion_stream(
    messages: list[dict], model: str | None = None, max_retries: int = 2
):
    """Streams a chat completion response from the OpenAI API."""
    model = model or config.openai_model
    for attempt in range(max_retries + 1):
        try:
            response = await client.chat.completions.create(
                model=model,
                messages=messages,
                stream=True,
                stream_options={"include_usage": True},
                max_completion_tokens=config.openai_max_completion_tokens,
            )
            async for chunk in response:
                yield chunk
            return
        except (openai.APIError, openai.APITimeoutError) as e:
            if attempt < max_retries:
                await asyncio.sleep(2**attempt)  # back off briefly before retrying
                continue
            raise