How we score
Why the main result is the level, how we work with evidence, and what SP Score is for.
Scoring follows one simple rule: real behavior matters more than self-assessment.
If someone says they use AI in an advanced way but cannot describe any repeatable workflow or concrete output, the result reflects that. If someone rates themselves modestly but shows clear examples and finished workflows in the interview, the result can be higher.
# The main result is the level
First we determine the level of work with AI. It says how much AI is part of someone's real work:
- experimenting,
- regular use,
- building personal workflows,
- changing a working system.
The level is not a simple sum of points. It is an evaluation of evidence: what the person actually does, how often, with what impact, and what outputs are created.
# SP Score is a supporting number
SP Score does not say "how good someone is". It shows their position inside the level.
That matters mainly in repeated measurement. A person can stay at the same level but move significantly within it, for example from early L2 to a strong L2 close to L3. The level name alone would not capture that movement.
# What we look at during evaluation
When scoring, we look at several things at once:
| Area | What we care about |
|---|---|
| Frequency | Whether AI is an occasional attempt or a normal part of work |
| Depth of work | Whether someone accepts the first answer or iterates and improves the output |
| Tools | Whether they use the right mode for the type of work |
| Outputs | Whether concrete documents, templates, tools, or decisions are created |
| Impact | Whether AI changes one task or an entire repeated workflow |
No single item is enough on its own. For example, using an advanced tool does not automatically mean a higher level. What matters is what the person does with it.
# Why two people at the same level can have different numbers
Two people can both be L2 Builder because both build repeatable workflows with AI. One can still have an SP Score of 60 and the other 75, because the evidence is not equally strong.
The numbers are not a standalone grade. They show how developed and convincing the same level looks in practice.
An L2 Builder with a lower score typically has a first working template, assistant, or workflow. They use it mainly for themselves, the impact is described mostly in qualitative terms, and the workflow is still becoming stable.
An L2 Builder with a higher score shows the same level in a more developed way. The workflow is used repeatedly, it produces concrete artifacts, the person can explain why it works, they choose more suitable tools, and there is visible impact on time, quality, or other people's work.
That does not mean the first person is "worse" or that the second person received points for using more tools. It means the second person is closer to the L3 boundary inside L2. That is why an individual report reads the number together with the level, profile, and evidence.
# What can move the number up or down
The score starts from the determined level. Then adjustments move it within that level based on how convincing the evidence is.
| What we see in the evidence | How it affects the score reading |
|---|---|
| A clear repeatable workflow | The number tends to move up, because it is more than a one-off attempt |
| Concrete artifacts | The number tends to move up when they have real impact |
| More suitable tools | The number moves up only when the tools change the quality of work |
| Weak or generic evidence | The number stays lower, even if someone says they use AI often |
| Ad hoc use without a stable output | The number can stay lower even with high frequency |
Adjustments are not reward points. Their main job is to prevent the same number from meaning completely different realities for two people.
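The idea of a score that starts from the level and then moves only within it can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual SP Score calculation: the function name `score_within_level`, the L2 band of 50 to 79, the base value, and the adjustment sizes are all hypothetical.

```python
def score_within_level(band: tuple[int, int], base: int, adjustments: list[int]) -> int:
    """Start from a base inside the level's band, apply adjustments, stay inside the band."""
    low, high = band
    raw = base + sum(adjustments)
    # Clamp so that adjustments can never push the number out of the level.
    return max(low, min(high, raw))


# Hypothetical L2 band and adjustment sizes, chosen only for illustration.
L2_BAND = (50, 79)
weak_evidence = score_within_level(L2_BAND, base=60, adjustments=[])
strong_evidence = score_within_level(L2_BAND, base=60, adjustments=[+8, +7])
overshoot = score_within_level(L2_BAND, base=60, adjustments=[+15, +15])

print(weak_evidence, strong_evidence, overshoot)  # 60 75 79
```

The clamp is the point of the sketch: however strong the evidence, adjustments change the position inside the level and never the level itself.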
# Why evidence is not counted twice
If building personal repeatable workflows is what places someone at L2, we do not add the same fact again as a bonus to the score. Otherwise the number would overstate where the person stands.
The score inside a level should show what goes beyond a typical representative of that level: output quality, breadth of use, stability of habits, or impact on other people.
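A minimal sketch of this rule, assuming evidence is recorded as simple tags: anything already used to place the person at a level is set aside before adjustments are considered. The tag names and the function `evidence_for_adjustment` are hypothetical and only illustrate the principle.

```python
def evidence_for_adjustment(all_evidence: set[str], used_for_level: set[str]) -> set[str]:
    """Return only the evidence that was not already used to determine the level."""
    return all_evidence - used_for_level


# Hypothetical evidence tags for one person.
all_evidence = {"repeatable_workflow", "impact_on_colleagues", "stable_habit"}
used_for_level = {"repeatable_workflow"}  # assumed to be what placed the person at L2

remaining = evidence_for_adjustment(all_evidence, used_for_level)
print(sorted(remaining))  # ['impact_on_colleagues', 'stable_habit']
```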
# What re-assessment is for
Repeated measurement is often more useful than the first result. It shows what changed after a program, coaching, or several months of personal practice.
Example:
- First measurement: L2 Builder, early L2
- Repeat measurement: L2 Builder, strong L2 close to L3
The level is the same, but the progress is meaningful. This is where SP Score is most useful.
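A comparison of two measurements could be sketched as follows, reading the level first and the score second. The `Assessment` record, the `progress` helper, and the scores 60 and 75 are assumptions used for illustration, not the real report format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assessment:
    """Hypothetical result record: the level is primary, the score is secondary."""
    level: int      # e.g. 2 for L2; determined from evidence first
    sp_score: int   # position inside the level; illustrative value


def progress(first: Assessment, repeat: Assessment) -> str:
    """Describe movement between two measurements, reading the level first."""
    if repeat.level != first.level:
        return f"level changed: L{first.level} -> L{repeat.level}"
    delta = repeat.sp_score - first.sp_score
    return f"same level L{first.level}, score moved by {delta:+d}"


early = Assessment(level=2, sp_score=60)
strong = Assessment(level=2, sp_score=75)
print(progress(early, strong))  # same level L2, score moved by +15
```

Comparing the level before the score mirrors the reading order described here: the number only becomes informative once the level is known to be the same.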
# What scoring does not do
- It does not assess personality or job performance.
- It does not reduce a person to one number.
- It does not determine the level backwards from the score alone.
- It does not punish beginners. AI Explorer is a legitimate starting phase.
If the number and concrete behavior point in different directions, behavior wins. The score is a tool, not the source of truth.