Improving mathematical reasoning with process supervision

We’ve trained a model to achieve a new state-of-the-art in mathematical problem solving by rewarding each correct step of reasoning (“process supervision”) instead of simply rewarding the correct final answer (“outcome supervision”). In addition to boosting performance relative to outcome supervision, process supervision also has an important alignment benefit: it directly trains the model to produce a chain-of-thought that is endorsed by humans.
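
To make the contrast concrete, here is a minimal sketch of the two feedback signals: one reward for the final answer versus one reward per reasoning step. Everything in it is an illustrative assumption rather than the training setup described here: the `Step` class, the function names `final_answer_reward` and `step_rewards`, the binary 0/1 rewards, and the hand-written step labels are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One reasoning step plus a hypothetical human correctness label."""
    text: str
    is_correct: bool


def final_answer_reward(final_answer: str, reference_answer: str) -> float:
    """Single reward for the whole solution, based only on the final answer."""
    return 1.0 if final_answer.strip() == reference_answer.strip() else 0.0


def step_rewards(steps: List[Step]) -> List[float]:
    """One reward per reasoning step, based on that step's label."""
    return [1.0 if step.is_correct else 0.0 for step in steps]


# A made-up solution to 2x + 6 = 10 that contains one flawed step
# but still ends on the right answer.
steps = [
    Step("Subtract 6 from both sides: 2x = 4", True),
    Step("Divide by 2: x = 3", False),
    Step("Recheck the division: x = 2", True),
]

answer_signal = final_answer_reward("2", "2")
step_signal = step_rewards(steps)
```

In this toy case the final-answer signal is 1.0 and says nothing about the flawed middle step, while the per-step signal `[1.0, 0.0, 1.0]` marks exactly where the reasoning went wrong.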