Qualitative research — in-depth interviews, focus groups, ethnography, diary studies — produces rich material that is expensive to process and difficult to correlate with quantitative data (questionnaires, administrative records, panels). The classic path runs through manual coding in NVivo / Atlas.ti / MAXQDA, aggregation into indicators, and only then correlation with numbers. The bottleneck is time and inter-coder variance — Krippendorff’s α often barely exceeds 0.7.
LLMs change this equation from two sides at once: coding becomes roughly 100× cheaper and 1000× faster, but a new problem appears. The model as “coder” has to be validated, because it carries its own theoretical and cultural biases inherited from the training corpus.
The simplest application: codebook → prompt. A classic coding manual (categories, definitions, positive and negative examples) becomes the system prompt. Each transcript fragment comes back from the model as structured JSON:
{
  "fragment_id": "P07-min12-15",
  "categories": ["emotional_support", "social_isolation"],
  "intensity": 0.7,
  "confidence": 0.85,
  "evidence_quote": "..."
}
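The codebook-to-prompt step and the check on the returned JSON can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the two codebook entries, the helper names `build_system_prompt` and `validate_coding`, the 0 to 1 range for `intensity` and `confidence`, and the rule that `evidence_quote` must be a verbatim span of the fragment are all assumptions made for the example. The call to the model itself is deliberately left out.

```python
import json

# Hypothetical two-category codebook; real entries come from the coding manual.
CODEBOOK = {
    "emotional_support": {
        "definition": "Respondent describes receiving comfort or understanding from others.",
        "positive_example": "My sister calls every evening to ask how I am.",
        "negative_example": "My sister lent me money for the rent.",
    },
    "social_isolation": {
        "definition": "Respondent describes a lack of contact or belonging.",
        "positive_example": "I can go a week without talking to anyone.",
        "negative_example": "I prefer working alone.",
    },
}

REQUIRED_KEYS = {"fragment_id", "categories", "intensity", "confidence", "evidence_quote"}


def build_system_prompt(codebook):
    """Render the codebook as plain-text coding instructions."""
    lines = ["Code the transcript fragment using only these categories.", ""]
    for name, entry in codebook.items():
        lines.append(f"Category: {name}")
        lines.append(f"  Definition: {entry['definition']}")
        lines.append(f"  Applies to: {entry['positive_example']}")
        lines.append(f"  Does not apply to: {entry['negative_example']}")
    lines.append("")
    lines.append("Reply with one JSON object with keys: " + ", ".join(sorted(REQUIRED_KEYS)) + ".")
    return "\n".join(lines)


def validate_coding(raw_reply, codebook, fragment_text):
    """Parse a model reply and reject anything outside the assumed schema."""
    record = json.loads(raw_reply)
    missing = REQUIRED_KEYS - set(record)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    unknown = set(record["categories"]) - set(codebook)
    if unknown:
        raise ValueError(f"categories not in codebook: {sorted(unknown)}")
    for key in ("intensity", "confidence"):
        # Assumed range: the example values 0.7 and 0.85 suggest a 0-1 scale.
        if not 0.0 <= record[key] <= 1.0:
            raise ValueError(f"{key} outside [0, 1]")
    if record["evidence_quote"] not in fragment_text:
        raise ValueError("evidence_quote is not a verbatim span of the fragment")
    return record
```

Rejecting replies that cite categories outside the codebook, or quotes that do not occur in the fragment, is one inexpensive way to catch invented codes before they reach the data matrix.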
Three techniques boost reliability:
Categorical classification is just the start. The stronger layer: translating free-form text into quantitative scales. Example: a patient speaks 200 words about their wellbeing in an interview. The LLM receives a prompt with a scale definition (PHQ-9, BDI-II) and returns a predicted score per item — with justification grounded in specific quotes. The result: a respondent × scale item matrix that can be compared 1:1 with the questionnaire filled in by the same respondent.
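A minimal sketch of how extractor output might be turned into the respondent × scale item matrix, using PHQ-9 (nine items, each scored 0 to 3) as the example. The item identifiers `phq9_1` to `phq9_9`, the flat shape of `scale_predictions`, and the use of `None` for items the interview did not cover are assumptions made for illustration.

```python
# Hypothetical item identifiers; PHQ-9 has nine items, each scored 0-3.
PHQ9_ITEMS = [f"phq9_{i}" for i in range(1, 10)]
PHQ9_SCORES = range(0, 4)


def to_item_matrix(records):
    """Build {respondent_id: [score per item]} from extractor output."""
    matrix = {}
    for rec in records:
        predictions = rec["scale_predictions"]
        row = []
        for item in PHQ9_ITEMS:
            score = predictions.get(item)  # None: item not covered in the interview
            if score is not None and score not in PHQ9_SCORES:
                raise ValueError(f"{rec['respondent_id']}: {item}={score} is off the scale")
            row.append(score)
        matrix[rec["respondent_id"]] = row
    return matrix


def total_score(row):
    """Sum a row, or return None if any item is missing."""
    if any(score is None for score in row):
        return None
    return sum(row)
```

Keeping missing items as `None` rather than 0 avoids silently treating "not mentioned in the interview" as "symptom absent", which would bias the comparison with the questionnaire.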
This opens three paths:
Level A — Same respondent (within-subject). Interview and questionnaire from the same participant. The LLM transforms the interview onto the same scale. Spearman’s ρ between the two versions answers whether the person says the same thing they declare. Divergence reveals social desirability bias or semantic problems in the scale.
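For Level A, Spearman’s ρ can be computed as the Pearson correlation of the two rank vectors, with tied values sharing their average rank. The sketch below uses only the standard library; the score lists in the usage lines are made-up numbers for illustration, not study data.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Illustrative totals per respondent (invented numbers).
llm_from_interview = [5, 12, 9, 20, 3]
self_reported = [7, 10, 11, 18, 2]
rho = spearman_rho(llm_from_interview, self_reported)
```

In practice the same comparison is also worth running item by item, since a high ρ on totals can hide systematic divergence on individual items.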
Level B — Population aggregation. LLM categories joined with ESS, EVS, or national social-diagnosis data. Question: does the frequency of “economic anxiety” in interviews correlate with regional unemployment? Classic triangulation, except the qualitative side is now scalably coded.
Level C — Cross-prediction. A model that predicts the qualitative outcome from quantitative data (and vice versa). Prediction error = a measure of the “independent information” carried by qualitative data — what the survey did not capture.
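One possible way to operationalise the Level C idea, offered as an assumption rather than as the method the text prescribes: predict the LLM-derived qualitative score from a single survey variable with leave-one-out linear regression, and compare the squared error against a baseline that always predicts the mean. The function names and the ratio used as the summary are illustrative choices.

```python
from math import sqrt


def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
    return slope, mean_y - slope * mean_x


def loo_rmse(x, y, use_predictor=True):
    """Leave-one-out RMSE of the regression, or of a mean-only baseline."""
    squared_errors = []
    for i in range(len(x)):
        train_x, train_y = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        if use_predictor:
            slope, intercept = fit_line(train_x, train_y)
            prediction = slope * x[i] + intercept
        else:
            prediction = sum(train_y) / len(train_y)
        squared_errors.append((y[i] - prediction) ** 2)
    return sqrt(sum(squared_errors) / len(squared_errors))


def unexplained_share(x, y):
    """Out-of-sample squared error relative to the mean-only baseline."""
    return (loo_rmse(x, y, True) / loo_rmse(x, y, False)) ** 2
```

Under this reading, a value near 0 means the survey variable already predicts the qualitative score, while a value near or above 1 suggests the interviews carry information the survey did not capture (or that the LLM coding is noisy, which the validation step has to rule out).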
Interview audio
↓ Whisper (local)
Transcript
↓ pseudonymisation (PII → tokens)
Pseudonymised text
↓ LLM 1: category classifier
↓ LLM 2: scale extractor
JSON {respondent_id, categories[], scale_predictions{}}
↓ validation: 10% manual coding
Data matrix (CSV / parquet)
↓ R / Python / Stata
Correlations, regressions, SEM
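The validation step in the pipeline can be sketched as drawing a reproducible 10% sample for manual coding and then computing Krippendorff’s α between the human and the LLM codes. The version below covers only the simplest case (nominal codes, two coders, no missing values) and assumes one label per fragment, for example presence or absence of a single category; the function names and the fixed seed are illustrative choices.

```python
import random
from collections import Counter


def draw_validation_sample(fragment_ids, share=0.10, seed=42):
    """Reproducible random subset of fragments to code by hand."""
    rng = random.Random(seed)
    k = max(1, round(len(fragment_ids) * share))
    return sorted(rng.sample(list(fragment_ids), k))


def krippendorff_alpha_nominal(coder_a, coder_b):
    """Krippendorff's alpha for nominal data, two coders, no missing values."""
    if len(coder_a) != len(coder_b):
        raise ValueError("both coders must code the same fragments")
    # Coincidence matrix: each fragment contributes both ordered pairs.
    coincidences = Counter()
    for a, b in zip(coder_a, coder_b):
        coincidences[(a, b)] += 1
        coincidences[(b, a)] += 1
    totals = Counter()
    for (c, _), count in coincidences.items():
        totals[c] += count
    n = sum(totals.values())
    observed = sum(count for (c, k), count in coincidences.items() if c != k)
    expected = sum(totals[c] * totals[k] for c in totals for k in totals if c != k)
    if expected == 0:
        raise ValueError("alpha is undefined when only one category occurs")
    return 1 - (n - 1) * observed / expected


sample = draw_validation_sample([f"F{i:03d}" for i in range(50)])
alpha = krippendorff_alpha_nominal(["a", "a", "b", "b"], ["a", "a", "b", "a"])
```

Fixing the seed makes the validation sample auditable: a second researcher can regenerate exactly the same list of fragments.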
temperature = 0 plus multiple runs with explicit agreement reporting.
mistral-nemo class) is mandatory; IRB approval covering AI processing too.
Conclusion. The LLM does not replace the qualitative researcher; it moves the bottleneck from coding to validation. Mixed-methods research becomes scalable as long as the researcher treats the LLM as seriously as any other measurement instrument: with control-sample validation, explicit variance documentation, and model-bias reporting.