CustomGPTurbo

A UX Prototype for Faster CustomGPT Performance

Bill French

You can play with a prototype of this approach here.

Creating applications that exhibit near-zero latency is not easy. With web protocols, servers, and all sorts of rendering challenges, there’s a tension between display performance and practical implementation choices. The strategies narrow quickly.

A key concern with this project was to push response times within the Doherty Threshold of 400ms, the point where users begin to get anxious.

Productivity soars when a computer and its users interact at a pace (<400ms) that ensures that neither has to wait on the other.

Do not confuse this work with other projects that utilize Groq, the LPU inference engine. Groq also achieves straight-line performance approximating the feel of real time through hardware and software architecture. CustomGPTurbo achieves its performance through software alone. But Groq is noteworthy in this discussion because it has struck a nerve in generative AI circles. It has revealed a significant demand for real-time AI applications.

Recently, Groq announced that more than 70,000 new developers are using GroqCloud™ and more than 19,000 new applications are running on the LPU™ Inference Engine via the Groq API. The rapid migration to GroqCloud since its launch on March 1st indicates a clear demand for real-time inference as developers and companies seek lower latency and greater throughput for their generative and conversational AI applications.

Get CustomGPT

I’m also happy to consult on more complex use cases for SmartSuite AI Live. Reach out anytime. I believe there is a horizon of opportunities to build more advanced applications based on this concept.

bill.french@gmail.com

970-389-3126

To some degree, Groq inspired CustomGPTurbo.

IMPORTANT!

This is a prototype.

It will fail.

It will not always be live, but I will do my best to ensure it has high availability.

TL;DR: This incredibly fast AI app:

Uses a forward-caching architecture (FCA);

Is designed to identify the most important elements of a CustomGPT RAG solution;

Pre-positions answer content using a client-side embeddings layer;

Leverages answers provided in the corpus as well as GPT-generated answers;

Learns to cache answers forward over time.

The forward-caching architecture makes it possible to rapidly recall information and generate responses instantly at the client layer without forcing every question-answer process to traverse the RAG infrastructure and associated LLM(s).
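The idea in the paragraph above can be sketched in a few lines. This is only an illustration under stated assumptions, not the prototype's actual code: the prototype is built with Node.js and JavaScript, while Python is used here for brevity. The `embed` function is a toy bag-of-words stand-in for a real embeddings model, and the names `ForwardCache`, `answer`, `ask_rag`, and the 0.8 threshold are all hypothetical.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Optional, Tuple


def embed(text: str) -> Counter:
    """Toy stand-in for an embeddings model: bag-of-words term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ForwardCache:
    """Question/answer pairs held close to the user (hypothetical design)."""

    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold  # assumed cut-off, not from the article
        self.entries: List[Tuple[Counter, str]] = []

    def add(self, question: str, answer_text: str) -> None:
        self.entries.append((embed(question), answer_text))

    def lookup(self, prompt: str) -> Optional[str]:
        """Return the most similar cached answer, or None below the threshold."""
        vec = embed(prompt)
        best_score, best_answer = 0.0, None
        for cached_vec, cached_answer in self.entries:
            score = cosine(vec, cached_vec)
            if score > best_score:
                best_score, best_answer = score, cached_answer
        return best_answer if best_score >= self.threshold else None


def answer(prompt: str, cache: ForwardCache,
           ask_rag: Callable[[str], str]) -> Tuple[str, str]:
    """Serve from the cache when possible; otherwise call the RAG backend once
    and cache the result forward so the next similar prompt is answered locally."""
    cached = cache.lookup(prompt)
    if cached is not None:
        return cached, "cache"
    generated = ask_rag(prompt)  # the slow round trip the cache tries to avoid
    cache.add(prompt, generated)
    return generated, "rag"
```

In this sketch, `ask_rag` would wrap whatever call reaches the RAG service. Only a cache miss pays for that round trip, and each miss makes the next similar question cheaper.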

Economic Drivers

Key design objectives unfurled when I pondered these questions:

What if you knew the probability that your prompt would succeed (or fail) before sending it to the LLM?

You might change your prompting or query strategy if you had more data about the likelihood of success BEFORE hitting enter. How many times do you craft complex prompts and send them to the LLM without the foggiest idea of what will come back? And what happens when the outcome is poor? You continue to shape the prompt, hoping to zero in on the right way to nudge and bend the AI to your will.

CustomGPTurbo is my attempt to end this madness. In a Q&A solution, the goal is to put your workers’ eyes on top of the information they need as quickly and effortlessly as possible. Chatting endlessly into the ether is not an ideal interface for getting answers to questions.

What if the interface produced direct pathways to answers in Q&A RAG solutions?

Begin typing your prompt and, just a few words in, CustomGPTurbo has predicted the likely completion and matched your unfinished prompt with high-similarity candidates displayed in a small mind-map diagram.

At the root of the map is your verbatim prompt, linked to nodes that are drawn from your corpus. The nodes are color-coded to indicate the probability that each contains what you’re looking for.
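One way to picture the matching step is the sketch below. It rests on assumptions: character trigrams stand in for a real embeddings model (they tolerate half-typed words), and the function names, the color names, and the 0.5 and 0.25 cut-offs are hypothetical rather than taken from the prototype.

```python
import math
from collections import Counter
from typing import List, Tuple


def trigrams(text: str) -> Counter:
    """Toy stand-in for an embeddings model: character trigram counts."""
    s = " ".join(text.lower().split())
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def color_for(score: float) -> str:
    """Map a similarity score to a node color (assumed cut-offs)."""
    if score >= 0.5:
        return "green"
    if score >= 0.25:
        return "yellow"
    return "red"


def candidate_nodes(partial_prompt: str, corpus_titles: List[str],
                    limit: int = 3) -> List[Tuple[str, float, str]]:
    """Rank corpus entries against an unfinished prompt and attach a color."""
    vec = trigrams(partial_prompt)
    scored = [(title, cosine(vec, trigrams(title))) for title in corpus_titles]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(title, round(score, 2), color_for(score))
            for title, score in scored[:limit]]


# The unfinished prompt is the root; each returned tuple is one node of the map.
nodes = candidate_nodes("reset pass",
                        ["Reset your password", "Refund policy", "Shipping times"])
```

Because the scoring runs on every keystroke in a design like this, it has to be cheap enough to run entirely on the client.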

What if the conversational UI understood the boundaries and strengths of the RAG corpus and prevented you from wasting time?

Ideally, we want generative AI solutions to be smart enough to guide us, but this often requires more than guardrails that prevent jailbreaking and hallucinations.

We need generative AI solutions, and specifically the chat interface itself, to possess information that dynamically tests the legitimacy of new prompts before you consume CPUs and GPUs on your company’s services.

It should give you a clear sense that what you’re about to ask will probably not result in a successful outcome.

A simple visual cue is all that’s required to avoid wasting your time.
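A pre-flight check of this kind could be as simple as the sketch below. It is an assumption about one possible approach, not a description of the prototype: it measures how many of the prompt's meaningful terms appear anywhere in the corpus vocabulary and turns that into a cue before anything is sent to the backend. The stopword list, the function names, and the 0.5 minimum are hypothetical.

```python
import re
from typing import Iterable, Set

# Small illustrative stopword list (an assumption, not from the article).
STOPWORDS = {"a", "an", "the", "is", "are", "do", "does", "how", "what",
             "i", "my", "to", "of", "for", "in", "on", "and"}


def terms(text: str) -> Set[str]:
    """Lowercased content words of a text, minus stopwords."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOPWORDS}


def build_vocabulary(documents: Iterable[str]) -> Set[str]:
    """Collect every content word that occurs in the corpus."""
    vocabulary: Set[str] = set()
    for doc in documents:
        vocabulary |= terms(doc)
    return vocabulary


def coverage(prompt: str, vocabulary: Set[str]) -> float:
    """Fraction of the prompt's content words that the corpus knows about."""
    prompt_terms = terms(prompt)
    if not prompt_terms:
        return 0.0
    return len(prompt_terms & vocabulary) / len(prompt_terms)


def preflight_cue(prompt: str, vocabulary: Set[str],
                  minimum: float = 0.5) -> str:
    """Return the cue to show before any CPU or GPU time is spent."""
    return "likely" if coverage(prompt, vocabulary) >= minimum else "unlikely"


vocab = build_vocabulary(["Reset your password from the account page.",
                          "Refunds are issued within 14 days."])
cue = preflight_cue("What is the weather in Denver?", vocab)
```

A prompt about the weather shares no content words with this tiny corpus, so the interface could warn the user before the question ever leaves the browser.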

What if the average response time was about 10x faster?

CustomGPT inferences average about five to eight seconds. For every 100ms beyond the Doherty Threshold, a system will lose 7% to 14% of the user base. This is known data developed over twenty years at Google Search. Google also discovered that an extra 1/2 second in search page generation time caused traffic to drop by 20%.

I built this system as a veneer over CustomGPT.ai to address these and many more economic and technical drivers. In early tests, this UX is working well to eliminate unnecessary GPU/LLM activity while significantly speeding up generative AI responses.

Why CustomGPT?

I had to start somewhere. But the approach likely applies to many other generative AI applications.

Design Goals

While this is an experimental project, it was designed to be repeatable, not just with CustomGPT but with arbitrary generative AI RAG environments that provide a suitable API. Although this prototype is biased toward improving RAG performance, earlier experiments demonstrate that the approach can also benefit almost every generative AI system, even those that are less rigid than RAG systems, which attempt to introduce deterministic outputs.

Fast. Users must feel like they can think and work at a real-time pace.

Effortless prompting. Users must be able to use simple keywords to get started.

Learns. Prompts that intentionally blend different artifacts from the corpus must automatically be memorialized with other cached conversations.

Architecture

Performance is the prime directive for everything about CustomGPTurbo. To users, milliseconds matter. CustomGPTurbo aims to make five to ten second response times from generative AI conversations a distant memory.

Built With?

Replit

Node.js

Pure HTML, CSS, and JavaScript

CustomGPT API
