2025
Conference article  Open Access

Prompt-based bias control in large language models: a mechanistic analysis

Cassese M., Puccetti G., Esuli A.

Large language models; Mechanistic interpretability; Cultural bias 

This study investigates the role of prompt design in controlling stereotyped content generation in large language models (LLMs). Specifically, we examine how adding a fairness-oriented request to the prompt instructions influences both the output and the internal states of LLMs. Using the StereoSet dataset, we evaluate models from different families (Llama, Gemma, OLMo) with base and fairness-focused prompts. Human evaluations reveal that the models produce a medium level of stereotyped output by default, and that fairness prompts reduce it to varying degrees. We apply a mechanistic interpretability technique (Logit Lens) to this task for the first time, showing how deep in the stack of transformer layers the fairness prompts take effect, and finding that even with the fairness prompt, stereotypical words remain more probable than anti-stereotypical ones across most layers. While fairness prompts reduce stereotypical probabilities, they are insufficient to reverse the overall trend. This study is a first exploration of the presence and propagation of stereotype bias in LLMs, and the findings highlight the challenges of mitigating bias through prompt engineering, suggesting the need for broader interventions on models.
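As a rough illustration of the kind of layer-wise comparison the abstract describes, the sketch below projects each layer's hidden state through an unembedding matrix and compares the probability of a stereotypical token with that of an anti-stereotypical one. It is a toy under stated assumptions, not the authors' implementation: the vectors, the three-token vocabulary, and the names `logit_lens` and `stereotype_gap` are all hypothetical, and the final layer normalization that Logit Lens is usually applied with is omitted for brevity.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_lens(hidden_states, unembed):
    """Decode every layer's hidden state as if it were the last one.

    hidden_states: one d-dimensional vector per layer.
    unembed: one d-dimensional row per vocabulary token.
    Returns one probability distribution over the vocabulary per layer.
    """
    per_layer = []
    for h in hidden_states:
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembed]
        per_layer.append(softmax(logits))
    return per_layer


def stereotype_gap(per_layer_probs, stereo_id, anti_id):
    """P(stereotypical token) minus P(anti-stereotypical token), per layer."""
    return [p[stereo_id] - p[anti_id] for p in per_layer_probs]


# Hypothetical toy values, NOT data from the paper.
unembed = [
    [1.0, 0.0],  # token 0: stereotypical candidate
    [0.0, 1.0],  # token 1: anti-stereotypical candidate
    [0.5, 0.5],  # token 2: unrelated candidate
]
hidden_base = [[0.2, 0.1], [1.0, 0.3], [2.0, 0.5]]  # base prompt
hidden_fair = [[0.2, 0.1], [0.8, 0.5], [1.5, 1.0]]  # fairness prompt

gap_base = stereotype_gap(logit_lens(hidden_base, unembed), 0, 1)
gap_fair = stereotype_gap(logit_lens(hidden_fair, unembed), 0, 1)

for layer, (b, f) in enumerate(zip(gap_base, gap_fair)):
    print(f"layer {layer}: base gap {b:.3f}, fairness gap {f:.3f}")
```

In this toy setup the fairness-prompt gap is smaller than the base gap in the later layers but stays positive, which mirrors the qualitative pattern the abstract reports. With a real model, the hidden states would instead come from the model's per-layer outputs and the projection from its own unembedding weights.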

Source: CEUR WORKSHOP PROCEEDINGS, vol. 4074, pp. 324-337. Pisa, Italy, 9-10 June 2025

Publisher: CEUR-WS.org



BibTeX entry
@inproceedings{oai:iris.cnr.it:20.500.14243/560910,
	title = {Prompt-based bias control in large language models: a mechanistic analysis},
	author = {Cassese M. and Puccetti G. and Esuli A.},
	publisher = {CEUR-WS.org},
	booktitle = {CEUR WORKSHOP PROCEEDINGS, vol. 4074, pp. 324-337. Pisa, Italy, 9-10 June 2025},
	year = {2025}
}

ITSERR Italian Strengthening of the ESFRI RI RESILIENCE