NVIDIA B300

NVIDIA B300: Why Blackwell Ultra Received 288 GB of HBM3E

The NVIDIA B300 is a data center accelerator of the Blackwell Ultra generation. The main difference from the B200 is memory capacity: 288 GB of HBM3E instead of 192 GB. For large AI models, this increase may matter more than peak performance, because long contexts and concurrent requests quickly hit memory limits.

Large language models need to store not only weights but also intermediate data, including the KV-cache. The longer the request, the more reasoning steps it requires, and the higher the parallel load, the faster HBM fills up. The B300 is designed for large LLMs, MoE models, long documents, and inference with a high number of simultaneous requests.

What is NVIDIA B300

The B300 belongs to the Blackwell Ultra family, an enhanced version of Blackwell for servers and AI infrastructure. It is neither a consumer graphics card nor an accelerator for standard workstations. Its place is in data centers, DGX systems, and rack-level platforms such as GB300 NVL72.

It is important to distinguish the names. B300 refers to the accelerator itself. DGX B300 is an NVIDIA server with eight of these GPUs. GB300 NVL72 is a rack-level system in which 72 Blackwell Ultra GPUs are combined over a fast NVLink interconnect.

The B300 is best considered not as a standalone board, but as part of a platform. NVIDIA sells not only GPUs but also a bundle of NVLink, NVSwitch, networking solutions, CUDA, TensorRT-LLM, and ready-made server configurations.

The Main Upgrade: 288 GB of HBM3E

The B300 has up to 288 GB of HBM3E per GPU. This is a key characteristic for the inference of large language models. The B200 has a lower memory capacity, up to 192 GB, so the increase with the B300 is not merely nominal but significant for real workloads: more room for the model, longer contexts, and more parallel requests.

The KV-cache is especially important. It holds the keys and values the model stores during generation so that it does not have to recompute the entire previous context from scratch. The longer the dialogue, document, or chain of reasoning, the more memory this cache occupies. If many users are served simultaneously, HBM usage grows even faster.

The additional 96 GB of memory compared to the B200 can provide more benefit than an increase in computational units. It allows more data to be kept in GPU memory, reduces the need to shard the model across accelerators, and decreases data transfer time. For a data center, this affects response latency, the number of concurrent requests, and generation costs.
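To make the effect of the extra memory concrete, here is a minimal sketch of a KV-cache size estimate. The model shape (80 layers, 8 KV heads, head dimension 128, FP16 cache) and the 140 GB weight footprint are hypothetical values chosen for illustration, not B300 or vendor figures. The estimate also ignores activations and runtime overhead, and it treats 1 GB as 10^9 bytes.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    """Bytes of KV-cache one token occupies across all layers."""
    # Factor 2: each layer stores one key vector and one value vector per token.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_cached_tokens(hbm_gb, weights_gb, bytes_per_token):
    """Tokens of KV-cache that fit in HBM after the weights are loaded."""
    free_bytes = (hbm_gb - weights_gb) * 10**9
    return max(0, free_bytes) // bytes_per_token


# Hypothetical model configuration (assumption, for illustration only).
per_token = kv_cache_bytes_per_token(
    num_layers=80, num_kv_heads=8, head_dim=128, bytes_per_value=2
)
weights_gb = 140  # hypothetical weight footprint in GB

tokens_192 = max_cached_tokens(192, weights_gb, per_token)  # B200-class capacity
tokens_288 = max_cached_tokens(288, weights_gb, per_token)  # B300-class capacity

print(f"KV-cache per token: {per_token / 1024:.0f} KiB")
print(f"Tokens that fit with 192 GB: {tokens_192}")
print(f"Tokens that fit with 288 GB: {tokens_288}")
```

Under these assumptions the weights take most of a 192 GB device, so the extra 96 GB goes almost entirely to cache. The number of cached tokens, summed across all concurrent requests, grows much more than the 1.5x increase in raw capacity would suggest.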

Why B300 is Important for Long Contexts and Reasoning

AI inference is becoming more demanding. Previously, a typical request to a model was often short: a question and an answer. Now, models work with large documents, codebases, tools, and tasks that require several reasoning steps. Such scenarios create more intermediate data and put more strain on memory.

Therefore, the B300 appears not merely as an upgraded version of the B200 but as the next step in Blackwell for mass inference. The H200 was a powerful accelerator of the Hopper generation. The B200 was the first significant transition to Blackwell. The B300 enhances this line with a larger HBM capacity and a better focus on long contexts.

For such tasks, comparing only TFLOPS provides little insight. What matters more is how many users can be served by a single GPU, how long a context the system can handle, and the cost of producing a response.
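The cost side of this comparison can be sketched with simple arithmetic. The hourly price and throughput below are hypothetical placeholders, not measured B300 figures. The point is only that serving cost scales inversely with sustained tokens per second.

```python
def cost_per_million_tokens(gpu_hour_cost, tokens_per_second):
    """Serving cost per one million generated tokens for one GPU."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_cost / tokens_per_hour * 1_000_000


# Hypothetical inputs (assumptions for illustration).
hourly_cost = 6.0     # currency units per GPU-hour
throughput = 5000.0   # aggregate tokens per second across all concurrent requests

baseline = cost_per_million_tokens(hourly_cost, throughput)
doubled = cost_per_million_tokens(hourly_cost, 2 * throughput)

print(f"Cost per 1M tokens: {baseline:.3f}")
print(f"Cost per 1M tokens at 2x throughput: {doubled:.3f}")
```

If more memory lets one GPU serve more requests in parallel, aggregate throughput rises and cost per token falls, even when peak TFLOPS stays the same.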

FP4 and NVFP4: Performance for Inference

For the B300, traditional FP32 metrics are secondary. The main area of focus for this accelerator is Tensor Cores and low-precision computations: FP8, FP4, and the proprietary NVFP4 format. It is in this area that NVIDIA seeks to reduce inference costs.

Low precision reduces data volume and accelerates computations. If a model can be effectively run in FP4 without noticeable quality loss, the data center achieves more tokens per second with the same infrastructure. Therefore, the B300 should be evaluated not as a universal GPU, but as an accelerator for models optimized for such formats.
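The reduction in data volume is easy to estimate. The sketch below computes the weight footprint of a hypothetical 400-billion-parameter model at different precisions and checks it against 288 GB. The parameter count is an assumption for illustration. Block-scaled formats such as NVFP4 also store scale factors, which this simple estimate ignores.

```python
BITS_PER_WEIGHT = {"FP16": 16, "FP8": 8, "FP4": 4}


def weights_gb(num_params, fmt):
    """Approximate weight footprint in GB (10^9 bytes) for a given format."""
    return num_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


def fits_on_one_gpu(num_params, fmt, hbm_gb=288):
    """True if the weights alone fit in a single GPU's HBM."""
    return weights_gb(num_params, fmt) <= hbm_gb


NUM_PARAMS = 400_000_000_000  # hypothetical model size

for fmt in BITS_PER_WEIGHT:
    size = weights_gb(NUM_PARAMS, fmt)
    print(f"{fmt}: {size:.0f} GB, fits in 288 GB: {fits_on_one_gpu(NUM_PARAMS, fmt)}")
```

In this example only the FP4 weights fit on a single 288 GB device, and even then the KV-cache still needs room on top of the weights.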

The hardware works in conjunction with the software stack. CUDA, TensorRT-LLM, Transformer Engine, and ready-made optimizations for LLMs help achieve real performance, not just good figures in specifications.

How B300 Differs from B200 and H200

The B300 does not introduce a new architecture following the B200. It is the evolution of Blackwell with a stronger emphasis on memory and inference. The main difference from the B200 is the 288 GB of HBM3E instead of 192 GB. For long contexts, KV-cache, and parallel request servicing, such an increase can be critical.

The difference from the H200 runs deeper. The H200 belongs to the Hopper generation and was also designed for heavy AI tasks, but the B300 transitions to Blackwell Ultra: more capabilities for low precision, higher inference density, and better scaling within NVIDIA’s new server platforms.

Therefore, the B300 should be viewed not as a simple accelerator swap in a server but as part of the shift from training models to operating them continuously. Training is an expensive but time-limited stage. Inference runs continuously, so its costs keep accumulating.

DGX B300 and GB300 NVL72

The DGX B300 illustrates how NVIDIA envisions this accelerator in practice. It is not a set of individual boards but a ready-made AI server with eight B300s, large GPU memory, fast interconnect, and networking interfaces for clusters.

The GB300 NVL72 is the next level: a rack with 72 Blackwell Ultra GPUs and 36 Grace CPUs. In such a system, the B300 operates as part of an overall computing platform. For large models, this is essential: the faster the GPUs exchange data, the less idle time the compute units have and the more effectively the expensive hardware is used.

In large AI workloads, what matters is not just a single specification figure but the stable scaling of the entire system. Therefore, NVIDIA promotes not only GPUs but also ready-made servers and racks.

Competitors: AMD is Close on Hardware, NVIDIA is Stronger on Platform

The main competitor to the B300 is the AMD Instinct MI355X. It is also aimed at heavy AI workloads and offers a large amount of HBM3E. On paper specifications, AMD can no longer be considered to lag significantly in hardware.

However, in data centers, memory is not the only deciding factor. Large customers care about the software stack, support for popular models, scaling between GPUs, and the availability of ready-made server solutions. NVIDIA holds a strong position here due to CUDA, TensorRT-LLM, Transformer Engine, NVLink/NVSwitch, and a large number of LLM inference optimizations.

AMD may be attractive where price, openness, and reducing dependence on NVIDIA matter. But if a company needs the most predictable infrastructure for large models, the B300 is the more obvious choice.

Limitations of B300

The B300 is a powerful accelerator but a complex one to operate. It cannot be evaluated separately from power, cooling, network, and rack costs. At this level, infrastructure directly affects the total cost of ownership.

For a small lab, the B300 may be excessive. Its advantages are revealed where there are large models, constant inference load, an optimized stack, and tasks that effectively utilize FP4, HBM, and fast inter-GPU communication.

There is also a strategic nuance: the B300 is an enhancement of Blackwell, not a new NVIDIA generation. The company is already preparing its next architectures, so the B300 is best seen as the top Blackwell Ultra part for the upcoming cycle of AI infrastructure.

Conclusion

The NVIDIA B300 matters not for any single record figure but for the combination of 288 GB of HBM3E, high memory bandwidth, FP4/NVFP4, and scaling through the NVIDIA platform. It is an accelerator for workloads where not just the chip price matters, but also response cost, latency, and the number of requests per rack.

The B300 is not for everyone. It is too expensive and specialized for regular computations. But for clouds, AI companies, and large data centers, it is one of the key accelerators of the Blackwell Ultra generation. It shows a market shift: an individual GPU is no longer what matters; it’s the complete system that reliably serves large models under realistic loads.

Basic

Label Name: NVIDIA
Platform: Desktop
Launch Date: September 2025
Model Name: B300
Generation: Server Blackwell
Base Clock: 1665 MHz
Boost Clock: 2600 MHz
Bus Interface: PCIe 5.0 x16
Transistors: 104 billion
Tensor Cores: 640
TMUs: 640
Foundry: TSMC
Process Size: 5 nm
Architecture: Blackwell Ultra

Tensor Cores are specialized processing units designed for deep learning. They deliver higher training and inference throughput than running the same work on standard FP32 units. They enable rapid computations in areas such as computer vision, natural language processing, speech recognition, text-to-speech conversion, and personalized recommendations. The two most notable applications of Tensor Cores are DLSS (Deep Learning Super Sampling) and AI Denoiser for noise reduction.

Texture Mapping Units (TMUs) are GPU components that can rotate, scale, and distort bitmap images and then place them as textures onto any plane of a given 3D model. This process is called texture mapping.

Memory Specifications

Memory Size: 144 GB
Memory Type: HBM3e
Memory Bus: 4096 bit
Memory Clock: 2000 MHz
Bandwidth: 4.10 TB/s

The memory bus width is the number of bits of data that the video memory can transfer in a single transfer cycle. The wider the bus, the more data can be moved at once, which makes it one of the crucial parameters of video memory. When memory frequencies are similar, the bus width determines the memory bandwidth.

Memory bandwidth is the data transfer rate between the graphics chip and the video memory, measured in bytes per second. It is calculated as: memory bandwidth = effective memory data rate per pin × memory bus width / 8, where the effective data rate can be a multiple of the listed memory clock and the division by 8 converts bits to bytes.
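The bandwidth formula can be checked against the listed figures. This is a minimal sketch, assuming four transfers per memory clock, which gives an effective rate of 8 Gbit/s per pin. That multiplier is an assumption inferred from the listed 4.10 TB/s, not a value stated in the table.

```python
def memory_bandwidth_tb_s(bus_width_bits, data_rate_gbit_s_per_pin):
    """Bandwidth in TB/s from bus width and effective per-pin data rate."""
    gb_per_s = bus_width_bits * data_rate_gbit_s_per_pin / 8  # bits -> bytes
    return gb_per_s / 1000  # GB/s -> TB/s


MEMORY_CLOCK_MHZ = 2000   # from the listing
BUS_WIDTH_BITS = 4096     # from the listing
TRANSFERS_PER_CLOCK = 4   # assumption, inferred from the listed bandwidth

data_rate = MEMORY_CLOCK_MHZ * TRANSFERS_PER_CLOCK / 1000  # Gbit/s per pin
bandwidth = memory_bandwidth_tb_s(BUS_WIDTH_BITS, data_rate)

print(f"Effective data rate: {data_rate} Gbit/s per pin")
print(f"Bandwidth: {bandwidth:.2f} TB/s")
```

Using the raw 2000 MHz clock without the multiplier would give about 1.02 TB/s, which does not match the listing. This is why the formula has to use the effective data rate rather than the memory clock.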

Display and Media

Outputs: No outputs

Theoretical Performance

Pixel Rate: 62.40 GPixel/s
Texture Rate: 1664.0 GTexel/s
FP16 (half): 426.0 TFLOPS
FP64 (double): 1.664 TFLOPS
FP32 (float): 105.525 TFLOPS

Pixel fill rate is the number of pixels a graphics processing unit (GPU) can render per second, measured in MPixels/s (million pixels per second) or GPixels/s (billion pixels per second). It is the most commonly used metric for the pixel processing performance of a graphics card.

Texture fill rate is the number of texture map elements (texels) that a GPU can map to pixels in a single second.

Floating-point computing capability is an important metric of GPU performance. Half-precision floating-point numbers (16-bit) are used for applications like machine learning, where lower precision is acceptable. Single-precision floating-point numbers (32-bit) are used for common multimedia and graphics processing tasks. Double-precision floating-point numbers (64-bit) are required for scientific computing that demands a wide numeric range and high accuracy.
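The theoretical rates in such listings are usually derived from unit counts and the boost clock. The sketch below reproduces them from the values on this page (24 ROPs, 640 TMUs, 20480 shading units, 2600 MHz boost). It assumes one pixel per ROP and one texel per TMU per clock, and two FP32 operations per shading unit per clock (one fused multiply-add). These are common conventions, not something the listing states.

```python
BOOST_CLOCK_GHZ = 2.6   # from the listing (2600 MHz)
ROPS = 24               # from the listing
TMUS = 640              # from the listing
SHADING_UNITS = 20480   # from the listing


def pixel_rate_gpixel_s(rops, clock_ghz):
    """Assumes one pixel per ROP per clock."""
    return rops * clock_ghz


def texture_rate_gtexel_s(tmus, clock_ghz):
    """Assumes one texel per TMU per clock."""
    return tmus * clock_ghz


def fp32_tflops(shading_units, clock_ghz):
    """Assumes 2 FLOPs per shading unit per clock (fused multiply-add)."""
    return 2 * shading_units * clock_ghz / 1000


pixel_rate = pixel_rate_gpixel_s(ROPS, BOOST_CLOCK_GHZ)
texture_rate = texture_rate_gtexel_s(TMUS, BOOST_CLOCK_GHZ)
fp32 = fp32_tflops(SHADING_UNITS, BOOST_CLOCK_GHZ)

print(f"Pixel rate:   {pixel_rate:.2f} GPixel/s")
print(f"Texture rate: {texture_rate:.1f} GTexel/s")
print(f"FP32:         {fp32:.3f} TFLOPS")
```

The pixel and texture rates match the listing. The FP32 figure computed this way is roughly 106.5 TFLOPS, slightly above the listed 105.525 TFLOPS, so the listing presumably uses a slightly different clock or rounding for that entry.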

Miscellaneous

SM Count: 160
Shading Units: 20480
L1 Cache: 256 KB (per SM)
L2 Cache: 50 MB
TDP: 1400 W
OpenCL Version: 3.0
CUDA Compute Capability: 10.3
ROPs: 24
Suggested PSU: 1800 W

Multiple Streaming Processors (SPs), along with other resources, form a Streaming Multiprocessor (SM), which is also referred to as a GPU's major core. These additional resources include warp schedulers, registers, and shared memory. The SM can be considered the heart of the GPU, similar to a CPU core, with registers and shared memory being scarce resources within the SM.

The most fundamental processing unit is the Streaming Processor (SP), where individual instructions and tasks are executed. GPUs perform parallel computing, which means many SPs work simultaneously to process tasks.

The Raster Operations Pipeline (ROP) handles the final stage of rendering: depth and stencil testing, blending, anti-aliasing resolve, and writing finished pixels to memory. The higher the resolution and the more demanding the anti-aliasing in a game, the higher the load on the ROPs; if they cannot keep up, the frame rate can drop sharply.

Benchmarks

FP32 (float) score: 105.525 TFLOPS

Compared to Other GPUs

FP32 (float), TFLOPS:
166.668 (+57.9%)
106.896 (+1.3%)
105.525 (NVIDIA B300)
80.086 (-24.1%)
66.228 (-37.2%)
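The percentages in this comparison are relative differences against the B300's 105.525 TFLOPS. A minimal sketch of that calculation, using only the values listed above:

```python
B300_FP32_TFLOPS = 105.525  # baseline from the listing


def relative_diff_percent(value, baseline):
    """Difference of value from baseline, as a percentage of the baseline."""
    return (value / baseline - 1) * 100


others = [166.668, 106.896, 80.086, 66.228]  # other entries in the comparison
diffs = [round(relative_diff_percent(v, B300_FP32_TFLOPS), 1) for v in others]

for value, diff in zip(others, diffs):
    print(f"{value:.3f} TFLOPS: {diff:+.1f}%")
```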