Revolgy blog

GPT-5 is finally here: Capabilities, tools & safety overview

Written by Jana Brnakova | August 8, 2025

On Thursday, OpenAI finally released GPT-5 after months of speculation. And it’s not just a slightly better version of GPT-4; it’s a completely redesigned system with specialized components working together. The company claims GPT-5 reasons better, makes fewer factual errors, and gives developers more control over how the model behaves.

Let’s look at what’s actually new, what works better, and what tools are available for businesses and developers.

What makes GPT-5 different?

GPT-5 marks a significant shift in how OpenAI builds its models. Instead of one large model handling everything, GPT-5 is a system of multiple specialized models working together, automatically adapting to what you’re asking.

At its core, GPT-5 is a dynamic system in which different models work together, managed by a real-time router that is meant to strike the right balance of speed, intelligence, and efficiency for each query.

The default model: gpt-5-main

As the successor to GPT-4o, gpt-5-main takes care of most everyday questions. It’s designed to be the default model for tasks that don’t need intensive reasoning. It has a smaller counterpart, gpt-5-main-mini, which takes over when usage limits are reached.

The deeper reasoning of gpt-5-thinking

For more complex problems, the system activates gpt-5-thinking. This model (which replaces OpenAI o3) is built for deeper reasoning on difficult, multi-variable questions. It also comes in mini and nano versions through the API for different developer needs.

Model overview:

  • GPT‑4o → gpt-5-main
  • GPT‑4o-mini → gpt-5-main-mini
  • OpenAI o3 → gpt-5-thinking
  • OpenAI o4-mini → gpt-5-thinking-mini
  • GPT‑4.1-nano → gpt-5-thinking-nano
  • OpenAI o3 Pro → gpt-5-thinking-pro

A comparison of GPT-5 models (Credit: OpenAI)

Real-time router automatically selects the right tool

We’ve mentioned the real-time router above. It’s the one component that ties everything together.

It analyzes the prompt and decides which model to use based on the conversation type, complexity, and whether specific tools are needed. The router also learns from real-world usage, improving its decision-making over time.

Core technical specifications

The GPT-5 models available through the API can handle both text and image inputs.

They offer a context window of 400,000 tokens (not the million some had predicted) and can generate up to 128,000 tokens of output.

These specs are consistent across the main developer models: gpt-5, gpt-5-mini, and gpt-5-nano.

How GPT-5 learns to think

GPT-5 was trained on public internet data, data from third-party partners, and content created by users or human trainers. OpenAI says it runs a filtering pipeline to strip out personal information and harmful content.

The knowledge cutoff date for the main GPT-5 model is October 1, 2024, while the smaller models have a cutoff of May 31, 2024.

The reasoning models in GPT-5 were trained to “think before they answer” by developing an internal chain of thought. Thanks to this, the models learn to explore different strategies and catch their own mistakes before responding to users.

What’s new and improved?

GPT-5 shows big improvements across several areas. OpenAI describes it as a “trusted PhD-level expert,” and the benchmark numbers back that up in several key areas.

Performance benchmarks

In coding, GPT-5 scores 74.9% on SWE-bench Verified, a benchmark of real-world software engineering problems.

For health-related tasks, gpt-5-thinking significantly outperforms previous models on the HealthBench benchmark, scoring 46.2% on the challenging HealthBench Hard subset (up from OpenAI o3’s 31.6%).

The system also shows improvements in writing, research, and analysis.

Reduction in hallucinations

OpenAI claims substantial improvements in factual accuracy:

  • gpt-5-thinking produces about 20% fewer factual errors than GPT-4o
  • In real-world usage, gpt-5-main has 44% fewer responses with major factual errors than GPT-4o
  • gpt-5-thinking has 78% fewer such responses than OpenAI o3
  • On factuality benchmarks without Browse enabled, gpt-5-thinking made over 5 times fewer factual errors than OpenAI o3

Better multilingual performance

GPT-5 was tested on versions of the MMLU benchmark translated into 13 languages, including Arabic, Chinese (Simplified), German, and Hindi.

The main and thinking models performed comparably to the existing state-of-the-art systems across these languages.

Steerability & developer control

One of the more practical improvements in GPT-5 is better control over how the model behaves.

New API parameters: verbosity and reasoning_effort

Two new API parameters give developers more direct control:

  • The verbosity parameter lets you control whether responses should be terse (low), balanced (medium), or expansive (high) without rewriting your prompt.
  • The reasoning_effort parameter adjusts how hard the model thinks and how readily it uses tools. You can increase it for complex tasks or decrease it for simpler ones to improve speed.
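
To make these two knobs concrete, here is a minimal Python sketch. It assumes the Responses API parameter shapes shown in OpenAI’s GPT-5 announcement (a reasoning.effort field and a text.verbosity field); verify the exact names against the current API reference before relying on them.

```python
# A minimal sketch, assuming the GPT-5 Responses API parameter shapes
# (reasoning.effort and text.verbosity); verify against current OpenAI docs.
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5",
    input="Summarize the trade-offs between SQL and NoSQL for an audit log.",
    reasoning={"effort": "minimal"},  # raise to "high" for harder, multi-step problems
    text={"verbosity": "low"},        # terse answer without rewriting the prompt
)

print(response.output_text)
```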

From hard refusals to “safe-completions”

GPT-5 changes how it handles sensitive topics.

Instead of simply refusing to answer questions about “dual-use” topics like biology or cybersecurity, the model now uses “safe-completions.” It gives helpful but high-level information while avoiding detailed instructions that could be misused.

Reduced tendency to be overly agreeable & deceptive

OpenAI has trained GPT-5 to reduce so-called sycophancy, which is the tendency to be overly agreeable.

Early tests show that sycophancy in gpt-5-main decreased by 69% for free users and 75% for paid users compared to GPT-4o. It also shows reduced deceptive behavior, with deception flagged in about 2.1% of gpt-5-thinking’s responses versus 4.8% for OpenAI o3.

New tools & prompting paradigms

GPT-5 adds a few practical features for developers who want to build more reliable applications.

Advanced API features for precision control

You can now send raw text (like Python scripts, SQL queries, or config files) straight to a custom tool without a JSON wrapper; OpenAI calls this free-form function calling. This makes it easier to work with code sandboxes, databases, or shell environments.
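
As an illustration, here is a hedged sketch of such a custom tool in Python. The sql_exec tool is hypothetical, and the "custom" tool type and output item name follow OpenAI’s published GPT-5 developer examples, so treat the exact field names as assumptions to verify.

```python
# A sketch of free-form function calling: the model sends raw text (here, SQL)
# to a custom tool that has no JSON schema. The sql_exec tool is hypothetical;
# the "custom" tool type and "custom_tool_call" item name are assumptions based
# on OpenAI's GPT-5 developer examples.
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5",
    input="Write a SQL query counting orders per customer over the last 30 days, "
          "then call the sql_exec tool with it.",
    tools=[{
        "type": "custom",      # custom tool: accepts raw text instead of JSON arguments
        "name": "sql_exec",    # hypothetical tool for illustration
        "description": "Runs a raw SQL query against the analytics database.",
    }],
)

# The tool call arrives as plain text, ready to hand to your own sandbox or database client.
for item in response.output:
    if item.type == "custom_tool_call":
        print(item.input)  # the raw SQL string produced by the model
```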

Developers can force the model to follow a strict output structure using Context-Free Grammars (CFG). Provide grammar rules (in Lark, Regex, or similar), and GPT-5 will only produce strings that match.
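
The same idea with a grammar constraint might look like the sketch below. The record_date tool and the tiny Lark grammar are invented for illustration, and the format block mirrors OpenAI’s published GPT-5 examples; double-check the field names against the current docs.

```python
# A sketch of grammar-constrained output: the model may only call the tool with
# strings matching the Lark grammar below (ISO dates such as 2025-08-07).
# The record_date tool is hypothetical; the "grammar" format block follows
# OpenAI's GPT-5 examples and should be verified against current docs.
from openai import OpenAI

client = OpenAI()

date_grammar = r"""
start: DATE
DATE: /\d{4}-\d{2}-\d{2}/
"""

response = client.responses.create(
    model="gpt-5",
    input="When was GPT-5 released? Call record_date with the date only.",
    tools=[{
        "type": "custom",
        "name": "record_date",      # hypothetical tool
        "description": "Stores a single ISO-formatted date.",
        "format": {
            "type": "grammar",
            "syntax": "lark",       # "regex" is the other syntax OpenAI mentions
            "definition": date_grammar,
        },
    }],
)
```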

Agentic & coding tasks

For applications that use multiple tools in sequence, OpenAI recommends the new Responses API. It lets the model keep its reasoning between tool calls: you pass a previous_response_id so the model builds on its prior thinking instead of starting over. In one benchmark, switching to the Responses API increased a retail task score from 73.9% to 78.2%.
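
Here is a minimal sketch of that pattern in Python. The get_inventory tool is hypothetical; passing previous_response_id and returning a function_call_output item is the documented Responses API chaining mechanism.

```python
# Chaining tool calls with the Responses API while preserving reasoning between
# turns. get_inventory is a hypothetical example tool; previous_response_id and
# function_call_output are the documented chaining mechanism.
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "name": "get_inventory",
    "description": "Look up the stock level for a product SKU.",
    "parameters": {
        "type": "object",
        "properties": {"sku": {"type": "string"}},
        "required": ["sku"],
    },
}]

first = client.responses.create(model="gpt-5", input="Is SKU A-123 in stock?", tools=tools)

# Find the tool call the model made, run it with your own code, then send the
# result back while pointing at the previous response so the model keeps its
# prior reasoning instead of rebuilding it.
call = next(item for item in first.output if item.type == "function_call")

followup = client.responses.create(
    model="gpt-5",
    previous_response_id=first.id,
    input=[{
        "type": "function_call_output",
        "call_id": call.call_id,
        "output": '{"sku": "A-123", "in_stock": true}',  # your tool's real result goes here
    }],
    tools=tools,
)

print(followup.output_text)
```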

For frontend development, GPT-5 works best with specific frameworks and tools:

  • Frameworks: Next.js (TypeScript) and React
  • Styling: Tailwind CSS and shadcn/ui
  • Icon libraries: Lucide and Material Symbols

The AI code editor Cursor tested GPT-5 early and had to retune prompts that worked on older models. For example, a prompt telling the model to “maximize context understanding” backfired: GPT-5 already tries to gather context on its own, so it overused tools on small tasks. Results improved once Cursor switched to structured XML tags and added more detailed product context.
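
To picture what that looks like, here is an illustrative sketch only: the XML tags and the product context are invented to show the structured-prompt style, not Cursor’s actual prompts.

```python
# Illustrative only: invented XML tags and product context showing the
# structured-prompt style described above, not Cursor's actual prompts.
from openai import OpenAI

client = OpenAI()

prompt = """
<product_context>
Internal admin dashboard built with Next.js (TypeScript), Tailwind CSS, and shadcn/ui.
</product_context>
<task>
Add a date-range filter to the orders table component.
</task>
<constraints>
Only touch components/orders-table.tsx and do not add new dependencies.
</constraints>
"""

response = client.responses.create(model="gpt-5", input=prompt)
print(response.output_text)
```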

Safety & responsibility

OpenAI puts a lot of focus on safety with GPT-5. Here’s how the company says it handles risk, in plain terms.

A multi-layered approach

OpenAI’s Preparedness Framework labels gpt-5-thinking as “High capability” in the Biological and Chemical domain. That label triggers extra safeguards, even though the model hasn’t been shown to help novices create serious biological harm.

For high-risk domains like biology, OpenAI uses a two-step monitoring system:

  1. A fast classifier that flags biology-related content
  2. A second-tier reasoning model that decides if the response is safe to show

On top of that, account-level enforcement can ban or, in serious cases, report users who try to misuse the system.

For business customers, GPT-5 offers security features like AES-256 encryption for stored data and TLS 1.2+ for data in transit. It also includes governance controls such as SAML SSO and compliance certifications, including SOC 2 Type 2, GDPR, and CCPA.

OpenAI also says business data isn’t used for training by default.

Dealing with novel risks

Before launch, GPT-5 went through more than 9,000 hours of testing by 400+ external experts from fields like defense, intelligence, and biosecurity.

Tests targeted things like violent attack planning, bioweaponization, and prompt injections. External organizations including the Microsoft AI Red Team, the UK AI Safety Institute, and Apollo Research also ran independent evaluations.

OpenAI tests GPT-5 against “jailbreaks,” which are attempts to get around safety rules. The models are trained to follow an “Instruction Hierarchy”: system-level safety messages outrank developer instructions, and developer instructions outrank end-user prompts.

An interesting new concern is “sandbagging,” when a model deliberately underperforms during safety evaluations. External testing found that gpt-5-thinking sometimes realizes it’s being evaluated and reasons about what the evaluator wants to see. While its baseline rate of deceptive behavior is lower than previous models’ (about 4% versus 8% for OpenAI o3), this is still an active area of research.

Confirmed vs. missed expectations

What was confirmed:

  • A unified, multi-model system: As rumored, GPT-5 uses multiple specialized models with a router to direct traffic between them.
  • Better reasoning and performance: The performance improvements on benchmarks like SWE-bench and HealthBench match expectations for better reasoning.
  • Native multimodality: GPT-5 can handle text and image inputs as predicted.
  • Improved safety and red teaming: The extensive safety testing and formal Preparedness Framework align with pre-release expectations.
  • Multiple model versions: The family includes smaller, more efficient versions as expected.

What was different or unconfirmed:

  • Context window: The 400,000 token context window falls short of the rumored 1 million tokens.
  • Video and audio input: Native video and audio processing capabilities weren’t included in the initial release.
  • AI Agent autonomy: While GPT-5 provides better tools for building agents, it doesn’t yet offer fully autonomous agents that can manage complex tasks without oversight.
  • Cost and training data rumors: Specific claims about training costs and data sources are still unconfirmed.

GPT-5 is no longer just one model but a routed system of specialized models. It’s stronger on reasoning, makes fewer factual mistakes, and gives developers more control; however, some rumored extras haven’t shipped yet. Let’s see how it performs in production.

If you want to try it in your stack, Revolgy can help you run a focused pilot or get hands-on training to start using AI efficiently in your business.

FAQs

1. What is the new system architecture of GPT-5?

GPT-5 uses a unified, multi-model system with a real-time router that automatically selects between gpt-5-main (for most queries) and gpt-5-thinking (for complex reasoning tasks).

2. What are the key features of GPT-5?

Key features include the multi-model architecture, improved reasoning capabilities, fewer factual errors, native support for text and image inputs, and new developer tools for controlling the model’s behavior.

3. What kinds of inputs can GPT-5 process?

The GPT-5 models can process both text and image inputs.

4. What is the context window and maximum output size for GPT-5?

The main GPT-5 models have a context window of 400,000 tokens and can generate up to 128,000 output tokens.

5. What is GPT-5’s knowledge cut-off date?

The main gpt-5 model has knowledge up to October 1, 2024, while gpt-5-mini and gpt-5-nano have a cutoff of May 31, 2024.

6. What are the potential applications of GPT-5?

GPT-5 can be used across various business functions:

  • Marketing: Analyzing market data and drafting launch plans
  • Engineering: Building dashboards from plain-English prompts
  • Finance: Running financial simulations
  • Strategy: Researching and responding to market changes
  • Legal: Reviewing laws and updating compliance controls

7. What are “safe-completions”?

Safe-completions are a new approach where GPT-5 provides helpful but high-level information on sensitive topics instead of either refusing entirely or giving detailed instructions that could be misused.

8. What new API features were introduced for developers with GPT-5?

New API features include the verbosity parameter to control response length, Free-Form Function Calling for sending raw text to tools, and Context-Free Grammars to enforce specific output formats.

9. What is the Responses API, and why is it recommended for GPT-5?

The Responses API allows the model to maintain its reasoning between tool calls in multi-step tasks, improving performance and reducing costs for agentic applications.

10. What is “sandbagging,” and is it a risk with GPT-5?

Sandbagging is when a model deliberately underperforms during safety evaluations. Testing shows that GPT-5 can sometimes recognize when it’s being evaluated, though its baseline rate of deceptive behavior is lower than previous models.