StepFun 3.5 Flash is #1 cost-effective model for OpenClaw tasks (300 battles) (app.uniclaw.ai)

88 points by skysniper 3 hours ago

mgw 15 minutes ago

Missing from the comparison is MiMo V2 Flash (not Pro), which I think could put up a good fight against Step 3.5 Flash.

Pricing is essentially the same: MiMo V2 Flash: $0.09/M input, $0.29/M output Step 3.5 Flash: $0.10/M input, $0.30/M output

MiMo has 41 vs 38 for Step on the Artificial Analysis Intelligence Index, but it's 49 vs 52 for Step on their Agentic Index.

WhitneyLand 3 hours ago

StepFun is an interesting model.

If you haven’t heard of it yet there’s some good discussion here: https://news.ycombinator.com/item?id=47069179

tarruda 2 hours ago

Since that discussion, they released the base model and a midtrain checkpoint:

- https://huggingface.co/stepfun-ai/Step-3.5-Flash-Base

- https://huggingface.co/stepfun-ai/Step-3.5-Flash-Base-Midtra...

I'm not aware of other AI labs that released base checkpoint for models in this size class. Qwen released some base models for 3.5, but the biggest one is the 35B checkpoint.

They also released the entire training pipeline:

- https://huggingface.co/datasets/stepfun-ai/Step-3.5-Flash-SF...

- https://github.com/stepfun-ai/SteptronOss

skysniper 2 hours ago

thanks for the info. before running the bench i only tried it in arena.ai type of tasks and it was not impressive. i didn't expect it to be that good at agentic tasks

grimm8080 an hour ago

Yet when I tried it it did absymal compared to Gemini 2.5 Flash

skysniper 42 minutes ago

what kind of tasks did you try?

hadlock 3 hours ago

According to openrouter.ai it looks like StepFun 3.5 Flash is the most popular model at 3.5T tokens, vs GLM 5 Turbo at 2.5T tokens. Claude Sonnet is in 5th place with 1.05T tokens. Which isn't super suprising as StepFun is ~about 5% the price of Sonnet.

https://openrouter.ai/apps?url=https%3A%2F%2Fopenclaw.ai%2F

NitpickLawyer 2 hours ago

> the most popular model

It was free for a long time. That usually skews the statistics. It was the same with grok-code-fast1.

MaxikCZ 2 hours ago

Exactly. When I read the headline I thought: "Ofc it is, its free."

skysniper an hour ago

skysniper 3 hours ago

the real surprising part to me is that, despite being the cheapest model on board, stepfun is often able to score high at pure performance. Other models at the same price range (e.g. kimi) fails to do that.

smallerize 3 hours ago

It looks like Unsloth had trouble generating their dynamic quantized versions of this model, deleted the broken files, then never published an update.

dmazin 2 hours ago

why do half the comments here read like ai trying to boost some sort of scam?

sunaookami 37 minutes ago

Tried the free version on OpenRouter with pi.dev and it's competent at tool calling and creative writing is "good enough" for me (more "natural Claude-level" and not robotic GPT-slop level) but it makes some grave mistakes (had some Hanzi in the output once and typos in words) so it may be good with "simple" agentic workflows but it's definitely not made for programming nor made for long writing.

skysniper 19 minutes ago

it's actually pretty good at openclaw type of tasks for non technical users: lots of tool calls, some simple programing

skysniper 2 hours ago

another thing from the bench I didn't expect: gemini 3.1 pro is very unreliable at using skills. sometimes it just reads the skill and decide to do nothing, while opus/sonnet 4.6 and gpt 5.4 never have this issue.

skysniper 3 hours ago

I ran 300+ benchmarks across 15 models in OpenClaw and published two separate leaderboards: performance and cost-effectiveness.

The two boards look nothing alike. Top 3 performance: Claude Opus 4.6, GPT-5.4, Claude Sonnet 4.6. Top 3 cost-effectiveness: StepFun 3.5 Flash, Grok 4.1 Fast, MiniMax M2.7.

The most dramatic split: Claude Opus 4.6 is #1 on performance but #14 on cost-effectiveness. StepFun 3.5 Flash is #1 cost-effectiveness, #5 performance.

Other surprises: GLM-5 Turbo, Xiaomi MiMo v2 Pro, and MiniMax M2.7 all outrank Gemini 3.1 Pro on performance.

Rankings use relative ordering only (not raw scores) fed into a grouped Plackett-Luce model with bootstrap CIs. Same principle as Chatbot Arena — absolute scores are noisy, but "A beat B" is reliable. Full methodology: https://app.uniclaw.ai/arena/leaderboard/methodology?via=hn

I built this as part of OpenClaw Arena — submit any task, pick 2-5 models, a judge agent evaluates in a fresh VM. Public benchmarks are free.

vessenes 30 minutes ago

Cheapest just isn't a very useful metric. Can I suggest a Pareto-curve type representation? Cost / request vs ELO would be useful and you have all the data.

skysniper 2 minutes ago

TBH that was my initial thought too, but I found some problem using this approach:

Essentially I'm using the relative rank in each battle to fit a latent strength for each model, and then use a nonlinear function to map the latent strength to Elo just for human readability. The map function is actually arbitrary as long as it's a monotonically increasing function so it preserves the rank. The only reliable result (that is invariant to the choice of the function) is the relative rank of models.

That being said, if I use score/cost as metrics, the rank completely depends on the function I choose, like I can choose a more super-linear function to make high performance model rank higher in score/cost board, or use a more sub-linear function to make low performance model rank higher.

That's why I eventually tried another (the current) approach: let judge give relative rank of models just by looking at cost-effectiveness (consider both performance and cost), and compute the cost-effectiveness leaderboard directly, so the score mapping function does not affect the leaderboard at all.

johndough 2 hours ago

Could you add a column for time or number of tokens? Some models take forever because of their excessive reasoning chains.

skysniper an hour ago

both are shown in battle detail page already. Time is shown in Scores table. Number of tokens are shown in Cost details at the bottom of the Scores. (I thought most people just want to see cost in USD so I put token details at the bottom)

hadlock 3 minutes ago

refulgentis 3 hours ago

Please don’t use AI to write comments, it cuts against HN guidelines.

skysniper 3 hours ago

sorry didn't know that. Here is my hand writing tldr:

gemini is very unreliable at using skills, often just read skills and decide to do nothing.

stepfun leads cost-effectiveness leaderboard.

ranking really depends on tasks, better try your own task.

refulgentis 3 hours ago

citizenpaul an hour ago

>Other surprises: GLM-5 Turbo, Xiaomi MiMo v2 Pro, and MiniMax M2.7 all outrank Gemini 3.1 Pro on performance

This has also been my subjective experience But has also been objective in terms of cost.