AI Engineer World’s Fair 2025

Published on Jun 5, 2025

Last updated on Jun 15, 2025

Introduction

The AI Engineer World’s Fair in San Francisco pulled together around 3,000 engineers, founders, and AI leads. I was excited to attend my first conference in a while in pure student mode - no presentations, no workshops to lead, no selling, just learning. The vibe was a decent mix of hype and real engineering challenges, with the world’s leading labs, startups, and chipmakers (NVIDIA, AMD, Cerebras, Groq) all present. This page is a synthesis of my raw notes and the key themes that emerged.

Opening the black box: optimizing, aligning, rewarding

A central theme was the move away from treating models as immutable black boxes. The most interesting research and engineering, in my opinion, is happening at the intersection of algorithms, hardware, and data. Duh.

Quantization & hardware co-design

The conversation has shifted from hardware-agnostic quantization to deeply integrated, hardware-aware strategies. The biggest gains in inference are no longer just about the model and its precision, but about how the model’s math is mapped onto silicon.

  • What’s New:

    • HALO: A post-training, hardware-aware quantization method that considers circuit-level delays to boost throughput by ~2.7× and cut energy by ~51% with minimal accuracy loss.
    • Dual Precision Quantization: Storing weights in 4-bit while performing computations in 8-bit floating point (W4A8), striking a balance between efficiency and accuracy for modern accelerators (a rough sketch of the idea follows this list).
    • LightMamba: A co-design of quantization and FPGA architecture to accelerate State Space Models (SSMs) like Mamba, achieving 4–6× better energy efficiency than GPUs.
    • On-Device Qwen2.5: Integrates activation-aware weight quantization (AWQ) directly into FPGA pipelines, achieving 55% compression and doubling token throughput.
  • My Reflection: Quantization must be aware of circuit-level constraints to unlock performance across diverse hardware. This is where systems thinking provides a massive edge.
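
To make the W4A8 idea concrete, here is a minimal PyTorch sketch of dual-precision inference: weights are quantized to 4-bit integers with per-output-channel scales, then dequantized at matmul time. Real kernels keep the compute in int8/FP8 on the accelerator; this sketch only emulates the numerics in float, and the function names are illustrative rather than taken from any specific paper or library.

```python
import torch

def quantize_w4(weight: torch.Tensor):
    """Symmetric per-output-channel 4-bit quantization (values in [-8, 7])."""
    scale = weight.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 7.0
    q = torch.clamp(torch.round(weight / scale), -8, 7).to(torch.int8)
    return q, scale  # int4 values stored in an int8 container

def w4a8_linear(x: torch.Tensor, q: torch.Tensor, scale: torch.Tensor):
    """Dequantize the 4-bit weights and run the matmul. Production kernels keep
    activations and compute in 8-bit (int8/FP8); floats here are just for clarity."""
    w = q.float() * scale          # dequantized weights, shape (out_features, in_features)
    return x.float() @ w.t()

# toy usage
w = torch.randn(256, 128)
q, s = quantize_w4(w)
y = w4a8_linear(torch.randn(4, 128), q, s)
print(y.shape)  # torch.Size([4, 256])
```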

Key Question: How can advanced quantization techniques be co-designed with novel AI hardware architectures to maximize inference efficiency? The Answer: Through joint algorithmic/hardware optimization that targets critical path delays, memory bandwidth, and error characteristics, as demonstrated by HALO and LightMamba.

Fine-tuning maze: SFT, DPO, GRPO, …

Fine-tuning remains more of an empirical art than a science. The practitioners on stage were honest: there is no universal recipe or complete understanding of how these methods do what they do.

  • What’s New:

    • The community is closing the gap on closed-source labs, but the giants still lead with step-change advances, largely through a combination of proprietary data, refinement techniques (RFT/DPO), and massive scale.
    • A clever technique for fine-tuning function-calling emerged: create a synthetic dataset by (D1) inverting function calls to generate likely user commands and (D2) using a larger model to distill commands into function calls. Then SFT a smaller model on the combined dataset. Hundreds of examples are surprisingly effective (a sketch of the pipeline follows this list).
    • A philosophical debate that may eventually become mathematical (please share with me if you see work on this!): does post-training (especially RL) add new circuits and knowledge, or does it simply re-weight existing pathways in the model? The jury is still out.
  • What’s Working:

    • RFT boosts pass@1 accuracy, but often at the cost of pass@k. It appears to fit the model to the “best” trace for a given problem, narrowing its creative range.
    • LoRA/PEFT are indispensable, enabling targeted fine-tuning without requiring full-model updates. This protects base knowledge and dramatically reduces resource consumption.
  • Challenges:

    • The ordering of SFT, DPO, and RFT is highly empirical. No one can reliably tell you which sequence to use for a given model or task. The common refrain was “run the experiment.”
    • Preventing SFT-acquired knowledge from being “forgotten” or corrupted during RFT is a major challenge. Regularization via a KL-divergence penalty and trust-region optimization (PPO) helps but doesn’t fully solve the problem (a sketch of the KL penalty follows below).
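
The function-calling recipe above is easy to prototype. Below is a minimal sketch under two assumptions: FUNCTIONS is your tool schema, and generate is a placeholder for a call to a larger teacher model, not a real API. D1 inverts sampled calls into plausible user requests, D2 distills user requests into calls, and the union of the two becomes the SFT set for the small model.

```python
import json

FUNCTIONS = [
    {"name": "get_weather", "parameters": {"city": "string", "unit": "string"}},
    {"name": "book_meeting", "parameters": {"title": "string", "time": "string"}},
]

def generate(prompt: str) -> str:
    """Placeholder for a call to a larger 'teacher' model (e.g. an API call)."""
    raise NotImplementedError

def d1_invert(functions, n_per_fn: int = 50):
    """D1: sample concrete function calls, then ask the teacher to write the
    user request that would have produced each call."""
    rows = []
    for fn in functions:
        for _ in range(n_per_fn):
            call = {"name": fn["name"],
                    "arguments": {k: f"<sample {k}>" for k in fn["parameters"]}}
            user_msg = generate(
                f"Write a natural user request answered by this call:\n{json.dumps(call)}")
            rows.append({"input": user_msg, "target": json.dumps(call)})
    return rows

def d2_distill(user_messages, functions):
    """D2: for (real or synthetic) user messages, have the teacher emit the
    function call; the small model is then SFT'd to imitate it."""
    schema = json.dumps(functions)
    return [{"input": m,
             "target": generate(f"Given functions {schema}, emit the JSON call for: {m}")}
            for m in user_messages]

# A few hundred rows from d1_invert(FUNCTIONS) + d2_distill(...) become the SFT set.
```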
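
On the forgetting problem, the usual mitigation is a per-token KL penalty against the frozen SFT reference policy, as in PPO-style RLHF. A minimal sketch of how that penalty shapes the reward signal (the names and the beta value are illustrative):

```python
import torch

def kl_shaped_rewards(task_reward: float,
                      policy_logprobs: torch.Tensor,  # log pi_theta(token) for each generated token
                      ref_logprobs: torch.Tensor,     # log pi_ref(token) from the frozen SFT model
                      beta: float = 0.1) -> torch.Tensor:
    """Per-token reward = -beta * (log pi_theta - log pi_ref), with the scalar
    task reward added on the final token. Penalizing divergence from the SFT
    reference is what discourages RFT from overwriting SFT-acquired behavior."""
    rewards = -beta * (policy_logprobs - ref_logprobs)
    rewards[-1] += task_reward
    return rewards

# toy usage: 5 generated tokens, task succeeded (reward 1.0)
print(kl_shaped_rewards(1.0, torch.randn(5), torch.randn(5)))
```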

Inference hardware

The hardware race is pivoting from raw training power to inference efficiency. The focus on “inference” is creating openings for new players and architectures.

  • What’s New:

    • AMD is aggressively acquiring specialized teams and targeting edge inference on consumer devices.
    • Specialized accelerators from companies like Cerebras and Groq are demonstrating massive speed and energy advantages on inference workloads.
  • Barriers:

    • NVIDIA’s software moat (CUDA, Triton) remains a significant advantage.
    • Standardization of low-precision formats is needed. NVIDIA’s Blackwell architecture is pushing new MXFP formats, further fragmenting the landscape. There is also a natural limit to how far this kind of quantization advance can go.

Reasoning & agent learning

Building agents that can reason robustly is less about finding a single, god-like model and more about designing intelligent systems.

From reward functions to self-play

How do we teach a model what “good” looks like? The design of reward functions is a critical, and still unsolved, part of the puzzle. Agents complicate this further. Since an agent is a system that orchestrates many API calls and tools, you can’t backpropagate errors end-to-end. So how does it learn?

The answer is to treat the entire agentic workflow as a reinforcement learning problem, but with a more sophisticated reward structure.

  • The credit assignment problem: If an agent completes a 10-step task and fails, which step was at fault? A single reward at the end is an inefficient learning signal. The open question is how to design a system with both a final reward (did the overall task succeed?) and instrumental, intermediate rewards that assess the quality of each decision along the way.

  • Modular optimization: This framing allows each node in the agent’s graph - whether it’s a model call, a tool use, or a routing decision - to be treated as an independently optimizable policy. A model in the middle of a chain can be rewarded for making a good decision, even if a downstream step leads to a final failure.

  • Applying modern RLHF to agents: With this modular view, each node becomes a target for policy optimization. You can theoretically apply methods like GRPO or DPO to individual components, refining the agent’s “sub-policies” based on their specific contribution to the overall goal. This bridges the gap between single-model RLHF and complex, multi-step agent learning.

  • Key Ideas:

    • Reward should be evaluated over an entire chain of thought, not just per-token.
    • For math and code, distance-based rewards (how close was the answer?) or execution-based rewards (did the code run?) can accelerate learning far more than simple binary scoring (sketched after this list).
    • A powerful pattern is hybrid expert routing: use the LLM for complex reasoning, but route trivial tasks (like 2+2) to tiny, efficient evaluators (also sketched below). Why waste a 70B-parameter model on arithmetic?
    • The use of advantage functions (PPO, GRPO) is an area of active research, particularly whether to normalize rewards per token or per turn.
  • Open Question: How does using an LLM-as-a-judge inside a reward function relate to AlphaGo-style self-play? The early consensus is that they are deeply related, especially when the judge evaluates an entire reasoning trace, creating a powerful reinforcement loop.
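
Tying the reward ideas above together, here is a minimal sketch of execution-based, distance-based, and blended per-step rewards for a code/math agent. The step scores are assumed to come from an LLM judge or a heuristic; every name and weight here is an illustrative assumption, not any speaker’s implementation.

```python
import subprocess, sys

def execution_reward(code: str, timeout: float = 5.0) -> float:
    """Execution-based reward: 1.0 if the candidate program runs cleanly, else 0.0.
    A real harness would also check test outputs, resource limits, etc."""
    try:
        proc = subprocess.run([sys.executable, "-c", code],
                              capture_output=True, timeout=timeout)
        return 1.0 if proc.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0

def distance_reward(pred: float, target: float, scale: float = 1.0) -> float:
    """Distance-based reward for math: decays smoothly with |pred - target|
    instead of giving a binary right/wrong signal."""
    return 1.0 / (1.0 + abs(pred - target) / scale)

def trajectory_rewards(step_scores: list[float], final_reward: float,
                       step_weight: float = 0.2) -> list[float]:
    """Blend intermediate per-step scores (e.g. from an LLM judge) with the
    final task reward, so earlier decisions still receive credit."""
    rewards = [step_weight * s for s in step_scores]
    rewards[-1] += final_reward
    return rewards

# toy usage
final = execution_reward("print(sum(range(10)))")   # 1.0: the snippet runs
print(trajectory_rewards([0.8, 0.5, 0.9], final))    # roughly [0.16, 0.1, 1.18]
```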
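
And a toy version of hybrid expert routing: a cheap check decides whether a query needs the big model at all. call_llm is a placeholder for whatever large-model endpoint you use, and the regex-plus-AST evaluator stands in for the “tiny, efficient evaluator.”

```python
import ast, operator, re

# Safe evaluator for trivial arithmetic like "2+2"; no LLM involved.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def tiny_arith(expr: str) -> str:
    def ev(node):
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("not simple arithmetic")
    return str(ev(ast.parse(expr, mode="eval").body))

def call_llm(prompt: str) -> str:
    """Placeholder for the expensive 70B-class model."""
    raise NotImplementedError

def route(query: str) -> str:
    """Hybrid routing: trivial arithmetic goes to the tiny evaluator,
    everything else goes to the LLM."""
    if re.fullmatch(r"[\d\s\.\+\-\*\/\(\)]+", query):
        try:
            return tiny_arith(query)
        except (ValueError, SyntaxError, ZeroDivisionError):
            pass
    return call_llm(query)

print(route("2+2"))  # prints "4" without ever touching the big model
```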

Orchestration over monoliths

The most effective agentic systems are not single, monolithic models. Instead, they are hierarchies of specialized models orchestrated to solve complex tasks. This modular approach, where different agents handle different parts of a problem, consistently outperforms a single, general-purpose LLM.

Evals, evals, evals: from vibe-checks to continuous verification

If there was one mantra at the conference, it was “evals, evals, evals.” Gut-feeling “vibe checks” are out; rigorous, continuous, and automated evaluation is in. The most forward-thinking approaches treat evaluation not as a final step, but as an integral part of the reasoning process itself.

  • Evaluation moves inside the reasoning loop. The frontier is about building agents that can self-correct in real time. Instead of generating a final answer and waiting for an external grade, the model engages in a continuous feedback cycle. This is happening in a few ways:

    • Self-critique & refinement: An agent generates a plan or a piece of code. Then a second call - acting as an internal critic or “judge” - evaluates the output against a set of rules or principles. The agent then refines its initial output based on this critique. This internal dialogue mimics a human reasoning process and is heavily inspired by LLM-as-a-judge evaluation and Anthropic’s Constitutional AI (a sketch of the loop follows this list).
    • Reinforcement from self-play: In more complex scenarios, this resembles the self-play mechanism from DeepMind’s AlphaGo, where the system learns by playing against itself and rewarding entire trajectories that lead to success.
  • Long live the arena. The community is finally acknowledging that static benchmarks like MMLU are insufficient. They don’t test for strategic thinking, deception, or collaborative behavior in dynamic environments. The future is interactive and often adversarial evaluation.

    • How it works: Instead of asking a model to answer a fixed question, new frameworks pit agents against each other in simulated environments, stress-testing their decision-making under pressure and in unpredictable situations. Projects like Arena-Hard and Google’s BIG-Bench are early steps toward this more holistic approach.
  • Quantified uncertainty. A major challenge for trust and safety is that LLMs are often confidently wrong. The push is to build systems that know what they don’t know.

    • How it works: This involves moving beyond single-point estimates and toward probabilistic outputs. Instead of one answer, a model might produce a distribution of possible answers or an explicit uncertainty score. This is being explored through several avenues: Bayesian methods, running ensembles of models to see if they agree, and hybrid neuro-symbolic architectures that combine the pattern-matching of neural networks with symbolic reasoning (a toy agreement-based sketch follows this list). Formalizing uncertainty in language models is a mission-critical step in many valuable domains.
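
As a concrete illustration of critique-in-the-loop, here is a minimal generate, critique, refine cycle. call_model stands in for any chat-completion call; the prompts, the APPROVED convention, and the stopping rule are assumptions for illustration, not any vendor’s API.

```python
def call_model(prompt: str) -> str:
    """Placeholder for a chat-completion call to whichever model you use."""
    raise NotImplementedError

PRINCIPLES = "Be correct, state assumptions explicitly, and keep code runnable."

def self_refine(task: str, max_rounds: int = 3) -> str:
    """Generate an answer, have a judge call critique it against the principles,
    and refine until the judge approves or the round budget runs out."""
    answer = call_model(f"Task: {task}\nAnswer:")
    for _ in range(max_rounds):
        critique = call_model(
            f"Principles: {PRINCIPLES}\nTask: {task}\nAnswer: {answer}\n"
            "If the answer satisfies the principles, reply APPROVED. "
            "Otherwise list the problems."
        )
        if critique.strip().startswith("APPROVED"):
            break
        answer = call_model(
            f"Task: {task}\nPrevious answer: {answer}\nCritique: {critique}\n"
            "Rewrite the answer, fixing every problem listed."
        )
    return answer
```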
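
And a toy sketch of agreement-based uncertainty: sample the same question several times (or across an ensemble) and report the majority answer with an empirical confidence score. sample_answer is a placeholder, and real systems would calibrate this signal rather than trust the raw fraction.

```python
from collections import Counter

def sample_answer(question: str) -> str:
    """Placeholder: one sampled answer from a model (or one ensemble member)."""
    raise NotImplementedError

def answer_with_confidence(question: str, n_samples: int = 10):
    """Self-consistency-style uncertainty: the fraction of samples that agree
    with the majority answer is a crude but useful confidence score."""
    votes = Counter(sample_answer(question).strip().lower() for _ in range(n_samples))
    answer, count = votes.most_common(1)[0]
    confidence = count / n_samples
    return answer, confidence  # e.g. ("paris", 0.9) is trustworthy; ("42", 0.3) gets flagged
```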

My top-level questions, revisited

I started with a list of questions. After the conference, here’s where I’ve landed:

  1. Quantization + hardware co-design → efficient inference?

    • Yes. HALO, LightMamba, and others are definitive proof. The next step is baking quantization-awareness directly into compilers and chip pipelines.
  2. Distillation + modular training to shrink models?

    • Yes. Synthetic data generation combined with distillation and LoRA can dramatically reduce model size while preserving reasoning. But we need better benchmarks that go beyond pass@1.
  3. How to choose: compression vs. bigger base model?

    • Compress when latency/energy is a hard constraint (e.g., edge devices). Use a larger base when multi-task flexibility is paramount. Often, a smaller model with hardware-aware quantization provides a better ROI.
  4. New architectures for better capability/cost scaling?

    • Sparse networks are still in early stages. Modular SSMs like Mamba, when paired with FPGAs (LightMamba), are showing near-linear scaling with significant compute savings.
  5. How to scale alignment as models get more creative?

    • Through a layered approach: combine RLHF, DPO, LLM-as-a-judge evaluations, and uncertainty calibration frameworks to create robust, multi-faceted alignment.

Final thoughts

The AI Engineer World’s Fair crystallized a few key ideas for me:

  • Co-design is the future. The boundary between hardware, software, and model architecture is dissolving. Maybe some composability layer will emerge, though I doubt the economic incentives will allow it.
  • Systematic pipelines win. The most successful teams will be those who master the full stack: from data synthesis and multi-stage fine-tuning (SFT → DPO → RFT) to modular routing and co-designed deployment.
  • Reward engineering is a frontier. Defining “good” for a model is one of the most leveraged activities in AI right now.

The era of treating LLMs as magical black boxes is over. We’re now in the systems engineering era of AI.
