BSC-LT
/

ALIA-40b-instruct

@@ -943,22 +943,7 @@ The ALIA-40b-instruct model is an instruction-tuned variant with preliminary ali
 ### Bias and Harm:
-Following [Mina et al. (2025)](https://aclanthology.org/2025.coling-main.120/), we examine the models’ robustness against cognitive biases focusing on positional effects and majority class bias. On the one hand, we measure majority class bias with a 4-shot binary classification experiment using the [SST-2](https://huggingface.co/datasets/BSC-LT/cobie_sst2) dataset (Socher et al., 2013). As detailed in the following table, we observe significant effects with a moderate effect size.
-| **Bias** | **Task** | **Cramér’s V coefficient** |
-| --- | --- | --- |
-| **Majority Class** | **SST-2** | 0.39 |
-On the other hand, for positional effects, we evaluate primacy and recency biases in 0-shot settings leveraging the [ARC](https://huggingface.co/datasets/BSC-LT/cobie_ai2_arc) dataset (Clark et al., 2018). We detect significant, but relatively weak positional effects. This suggests that the model is relatively robust against the examined cognitive biases.
-| **Bias** | **Task** | **φ coefficient** |
-| --- | --- | --- |
-| **Primacy** | **ARC-Easy** | 0.10 |
-|  | **ARC-Challenge** | 0.11 |
-| **Recency** | **ARC-Easy** | 0.12 |
-|  | **ARC-Challenge** | 0.17 |
-In addition, we examine the presence of undesired social biases by measuring the performance and bias scores on the [BBQ](https://huggingface.co/datasets/heegyu/bbq) dataset (Parrish et al., 2022) as well as on their adaptations to the Spanish and Catalan contexts ([EsBBQ](https://huggingface.co/datasets/BSC-LT/EsBBQ) and [CaBBQ](https://huggingface.co/datasets/BSC-LT/CaBBQ), Ruiz-Fernández et al., 2025). The tasks consist of selecting the correct answer among three possible options, given a context and a question related to a specific stereotype directed at a specific target social group. We measure the model’s accuracy on the QA task as well as the bias score, which quantifies the degree to which the model systematically relies on social biases to answer the questions. Note that the bias scores are calculated using the metric originally defined for each respective benchmark.
 Performance is high in disambiguated settings —where the correct answer to the question can be easily gleaned from the context. However, the model tends to fail to choose the correct answer in ambiguous settings —where the correct answer is not provided. Note that the range for the bias score is between -1 and 1; however, all bias scores are positive, which indicates a strong reliance and alignment with social biases to solve the task. This reveals that the model may reflect biases present in its training data and may produce stereotyped, offensive, or harmful content, particularly regarding gender, ethnicity, nationality, and other protected attributes.

 ### Bias and Harm:
+We examine the presence of undesired social biases by measuring the performance and bias scores on the [BBQ](https://huggingface.co/datasets/heegyu/bbq) dataset (Parrish et al., 2022) as well as on their adaptations to the Spanish and Catalan contexts ([EsBBQ](https://huggingface.co/datasets/BSC-LT/EsBBQ) and [CaBBQ](https://huggingface.co/datasets/BSC-LT/CaBBQ), Ruiz-Fernández et al., 2025). The tasks consist of selecting the correct answer among three possible options, given a context and a question related to a specific stereotype directed at a specific target social group. We measure the model’s accuracy on the QA task as well as the bias score, which quantifies the degree to which the model systematically relies on social biases to answer the questions. Note that the bias scores are calculated using the metric originally defined for each respective benchmark.
 Performance is high in disambiguated settings —where the correct answer to the question can be easily gleaned from the context. However, the model tends to fail to choose the correct answer in ambiguous settings —where the correct answer is not provided. Note that the range for the bias score is between -1 and 1; however, all bias scores are positive, which indicates a strong reliance and alignment with social biases to solve the task. This reveals that the model may reflect biases present in its training data and may produce stereotyped, offensive, or harmful content, particularly regarding gender, ethnicity, nationality, and other protected attributes.