Methodology

Test Design

ABTI presents 16 scenario-based questions across 4 behavioral dimensions. Each question describes a realistic AI agent scenario and offers two response options, one for each pole of the dimension being measured.

Forced-choice format — the agent must pick one of two approaches
4 questions per dimension — each dimension is measured 4 times for stability
Scenario-based — questions use concrete situations, not abstract preferences

The Four Dimensions

🎯

Autonomy (P/R)

Proactive vs Responsive

Does the agent take initiative and anticipate needs, or wait for explicit instructions?

⚙️

Precision (T/E)

Thorough vs Efficient

Does the agent prioritize completeness and detail, or speed and conciseness?

💬

Transparency (C/D)

Candid vs Diplomatic

Does the agent give direct, unfiltered feedback, or soften its communication?

🔄

Adaptability (F/N)

Flexible vs Principled

Does the agent bend rules for context, or follow them strictly?

Scoring Method

Each question is a forced choice between two options, A and B, which represent opposite poles of one dimension. Option A scores 1 (first pole) and option B scores 0 (second pole). The four scores for a dimension are summed (0–4), and the sum determines the type letter:

Score ≥ 2 → first pole (P, T, C, F)
Score < 2 → second pole (R, E, D, N)

A tied score of 2 resolves to the first pole. With 2 poles on each of 4 dimensions, there are 2 × 2 × 2 × 2 = 16 possible types (PTCF, PTCN, PTDF, … REDN).
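The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not ABTI's actual implementation: the `DIMENSIONS` table, the `score_type` function, and the answer layout (a dict of four "A"/"B" choices per dimension) are assumptions made for the example. Only the A = 1, B = 0 scoring and the "score ≥ 2" threshold come from the description above.

```python
from itertools import product

# Pole letters per dimension, in type-letter order: (first pole, second pole).
# The names and layout here are illustrative assumptions.
DIMENSIONS = {
    "Autonomy": ("P", "R"),
    "Precision": ("T", "E"),
    "Transparency": ("C", "D"),
    "Adaptability": ("F", "N"),
}


def score_type(answers):
    """Turn per-dimension answers into a 4-letter type.

    answers maps each dimension name to a list of four "A"/"B" choices.
    """
    letters = []
    for dim, (first, second) in DIMENSIONS.items():
        choices = answers[dim]
        if len(choices) != 4 or any(c not in ("A", "B") for c in choices):
            raise ValueError(f"{dim}: expected four 'A'/'B' answers")
        # Option A scores 1, option B scores 0, so the sum is the count of A's.
        score = sum(1 for c in choices if c == "A")
        # A score of 2 or more (including a 2-2 tie) maps to the first pole.
        letters.append(first if score >= 2 else second)
    return "".join(letters)


# Every combination of one pole per dimension: 2 * 2 * 2 * 2 = 16 types.
ALL_TYPES = ["".join(p) for p in product(*DIMENSIONS.values())]

example = {
    "Autonomy": ["A", "A", "B", "B"],      # score 2 -> P (tie goes to first pole)
    "Precision": ["A", "B", "B", "B"],     # score 1 -> E
    "Transparency": ["A", "A", "A", "A"],  # score 4 -> C
    "Adaptability": ["B", "B", "B", "B"],  # score 0 -> N
}
print(score_type(example))  # PECN
print(len(ALL_TYPES))       # 16
```

Because each dimension has an even number of questions, a 2-2 split is possible, and the threshold sends it to the first pole rather than leaving it undecided.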

Test-Retest Reliability

94.9%

37 of 39 models (94.9%) produced the same type on all 3 test runs, a high level of run-to-run agreement for a behavioral assessment.

We tested 39 models 3 times each under identical conditions. Only 2 models showed any inconsistency:

gemma3-12b — varied on a single dimension
tinyllama — varied on a single dimension

Both inconsistent models deviated on only one dimension; their other three dimensions were stable. This suggests the test reliably captures core behavioral patterns.
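The consistency check described here can be sketched as follows. The function names and the sample data are hypothetical and exist only to illustrate the idea: a model counts as consistent when every run yields the same 4-letter type, and the reliability figure is the share of consistent models.

```python
def is_consistent(run_types):
    """True if every run produced the same 4-letter type."""
    return len(set(run_types)) == 1


def varying_dimensions(run_types):
    """Return the 0-based letter positions at which the runs disagree."""
    return [
        i for i, letters in enumerate(zip(*run_types))
        if len(set(letters)) > 1
    ]


def consistency_rate(results):
    """Fraction of models whose type was identical across all runs."""
    if not results:
        raise ValueError("no results to evaluate")
    consistent = sum(1 for runs in results.values() if is_consistent(runs))
    return consistent / len(results)


# Made-up placeholder data, not actual ABTI results.
sample = {
    "model-a": ["PTCF", "PTCF", "PTCF"],
    "model-b": ["PTCF", "PTCN", "PTCF"],  # differs on the 4th letter only
}
print(consistency_rate(sample))                 # 0.5
print(varying_dimensions(sample["model-b"]))    # [3]

# The reported figure follows from the counts given above: 37 of 39 models.
print(round(100 * 37 / 39, 1))                  # 94.9
```

Note that this measures whether the final type letter is stable, not whether the underlying 0–4 scores are identical; a dimension score could move between 3 and 4 across runs without changing the type.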

[Figure: consistency map. Each square is one model (37 consistent, 2 inconsistent). Pink = consistent across all runs; gray = inconsistent on one dimension.]

Type Distribution

Across 60 tested agents, some types appear far more often than others.

PTCF dominates, most likely because most LLMs are trained to be helpful (Proactive), thorough (Thorough), honest (Candid), and adaptable (Flexible). We attribute this to training alignment objectives rather than to test bias.

FAQ

Is this like MBTI?

The framework is similar (4 binary dimensions producing 16 types), but ABTI measures AI operational behavior, not human psychology. The dimensions (Autonomy, Precision, Transparency, Adaptability) describe how agents act, not how people feel.

Why do so many models get PTCF?

Training alignment objectives favor helpfulness (→ Proactive), thoroughness, honesty (→ Candid), and adaptability (→ Flexible). PTCF is the behavioral archetype that RLHF and instruction tuning naturally produce.

How is reliability measured?

The same model is tested 3+ times under identical conditions. If it produces the same 4-letter type every time, it is considered consistent. 94.9% of tested models (37/39) passed this check.

Can I test my own agent?

Yes! Use the CLI: npx @kagura-agent/abti test, or integrate via the REST API. You can also take the interactive test in the browser.

How ABTI Works