gonzalez-agirre valleruizf commited on
Commit
6330c67
·
verified ·
1 Parent(s): 9ba463d

Update README.md (#6)

Browse files

- Update README.md (eb2e6b572e40e505e59293c13fe28aeba7e160b0)


Co-authored-by: Valle <[email protected]>

Files changed (1) hide show
  1. README.md +38 -8
README.md CHANGED
@@ -926,23 +926,53 @@ It is important to note that strong performance in the "needle in a haystack" te
926
 
927
  ## Ethical Considerations and Limitations
928
 
929
- The ALIA-40b-instruct model is an instruction-tuned variant with preliminary alignment. It has several limitations that users should be aware of. Ongoing work is addressing these areas, including comprehensive evaluation of societal and cognitive biases as well as safety, which will be reported in future updates.
930
 
931
- Functional Limitations:
932
 
933
  - No Function Calling: The model cannot natively execute or call external functions/APIs. Tasks requiring plugin calls or tool execution must be implemented outside the model.
934
  - Reasoning & Math: The model is not guaranteed to perform robust chain-of-thought reasoning or advanced mathematics. Complex logical puzzles or multi-step inferences may fail or produce inconsistent answers.
935
  - Code Generation: Although exposed to code during pretraining, ALIA-40b-Instruct is not a specialized code-generation model. It may produce code-like text, but outputs should be verified and tested before use in production codebases.
936
  - Agentive Capabilities: The model does not have agentive or autonomous action capabilities. It cannot act as an autonomous agent or execute multi-step workflows.
937
 
938
- Bias and Harm:
939
- The model may reflect biases present in its training data and may produce stereotyped, offensive, or otherwise harmful content, particularly regarding gender, ethnicity, religion, or other protected attributes. Work is ongoing to evaluate and mitigate societal and cognitive biases, and future releases will provide detailed reports on these analyses.
940
 
941
- Safety and Alignment:
942
- The current alignment is preliminary and does not guarantee robust safety in all scenarios. The model may still follow malicious instructions or generate disallowed content if prompted. Additional filtering, human oversight, and alignment steps are essential. We are actively working to assess and improve the model’s safety, and a comprehensive report will be provided in subsequent updates.
943
 
944
- Recommendations:
945
- Developers should implement additional safety filters, human oversight, targeted evaluation suites, and secondary evaluation models when deploying this model. Do not deploy ALIA-40b-Instruct in critical applications without extensive testing and mitigation. Users are responsible for assessing and mitigating harmful behavior or misinformation resulting from model outputs.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
946
 
947
  ---
948
 
 
926
 
927
  ## Ethical Considerations and Limitations
928
 
929
+ The ALIA-40b-instruct model is an instruction-tuned variant with preliminary alignment. It has several limitations that users should be aware of. Ongoing work is addressing these areas, including comprehensive evaluation of societal and cognitive biases as well as safety.
930
 
931
+ ### Functional Limitations:
932
 
933
  - No Function Calling: The model cannot natively execute or call external functions/APIs. Tasks requiring plugin calls or tool execution must be implemented outside the model.
934
  - Reasoning & Math: The model is not guaranteed to perform robust chain-of-thought reasoning or advanced mathematics. Complex logical puzzles or multi-step inferences may fail or produce inconsistent answers.
935
  - Code Generation: Although exposed to code during pretraining, ALIA-40b-Instruct is not a specialized code-generation model. It may produce code-like text, but outputs should be verified and tested before use in production codebases.
936
  - Agentive Capabilities: The model does not have agentive or autonomous action capabilities. It cannot act as an autonomous agent or execute multi-step workflows.
937
 
938
+ ### Bias and Harm:
 
939
 
940
+ Following [Mina et al. (2025)](https://aclanthology.org/2025.coling-main.120/), we examine the models’ robustness against cognitive biases focusing on positional effects and majority class bias. On the one hand, we measure majority class bias with a 4-shot binary classification experiment using the [SST-2](https://huggingface.co/datasets/BSC-LT/cobie_sst2) dataset (Socher et al., 2013). As detailed in the following table, we observe significant effects with a moderate effect size.
 
941
 
942
+ | **Bias** | **Task** | **Cramér’s V coefficient** |
943
+ | --- | --- | --- |
944
+ | **Majority Class** | **SST-2** | 0.39 |
945
+
946
+ On the other hand, for positional effects, we evaluate primacy and recency biases in 0-shot settings leveraging the [ARC](https://huggingface.co/datasets/BSC-LT/cobie_ai2_arc) dataset (Clark et al., 2018). We detect significant, but relatively weak positional effects. This suggests that the model is relatively robust against the examined cognitive biases.
947
+
948
+ | **Bias** | **Task** | **φ coefficient** |
949
+ | --- | --- | --- |
950
+ | **Primacy** | **ARC-Easy** | 0.10 |
951
+ | | **ARC-Challenge** | 0.11 |
952
+ | **Recency** | **ARC-Easy** | 0.12 |
953
+ | | **ARC-Challenge** | 0.17 |
954
+
955
+ In addition, we examine the presence of undesired social biases by measuring the performance and bias scores on the [BBQ](https://huggingface.co/datasets/heegyu/bbq) dataset (Parrish et al., 2022) as well as on their adaptations to the Spanish and Catalan contexts ([EsBBQ](https://huggingface.co/datasets/BSC-LT/EsBBQ) and [CaBBQ](https://huggingface.co/datasets/BSC-LT/CaBBQ), Ruiz-Fernández et al., 2025). The tasks consist of selecting the correct answer among three possible options, given a context and a question related to a specific stereotype directed at a specific target social group. We measure the model’s accuracy on the QA task as well as the bias score, which quantifies the degree to which the model systematically relies on social biases to answer the questions. Note that the bias scores are calculated using the metric originally defined for each respective benchmark.
956
+
957
+ Performance is high in disambiguated settings —where the correct answer to the question can be easily gleaned from the context. However, the model tends to fail to choose the correct answer in ambiguous settings —where the correct answer is not provided. Note that the range for the bias score is between -1 and 1; however, all bias scores are positive, which indicates a strong reliance and alignment with social biases to solve the task. This reveals that the model may reflect biases present in its training data and may produce stereotyped, offensive, or harmful content, particularly regarding gender, ethnicity, nationality, and other protected attributes.
958
+
959
+ | **Task** | **Accuracy (Ambiguous)** | **Bias Score (Ambiguous)** | **Accuracy (Disambiguated)** | **Bias Score (Disambiguated)** |
960
+ | --- | --- | --- | --- | --- |
961
+ | **BBQ** | 0.08 | 0.16 | 0.90 | 0.02 |
962
+ | **EsBBQ** | 0.02 | 0.26 | 0.96 | 0.03 |
963
+ | **CaBBQ** | 0.01 | 0.26 | 0.95 | 0.07 |
964
+
965
+ We highlight that our evaluation of these biases is by no means exhaustive and is limited by the relative scarcity of adequate resources in all languages present in the training data. We aim to gradually extend and expand our analyses.
966
+
967
+ ### Safety and Alignment:
968
+
969
+ The current alignment is preliminary and does not guarantee robust safety in all scenarios. The model may still follow malicious instructions or generate disallowed content if prompted. To evaluate the model’s vulnerabilities, we conduct a Red Teaming assessment using three adversarial prompts datasets: [AYA RT](https://huggingface.co/datasets/CohereLabs/aya_redteaming) (Aakanksha et al., 2024), [HH-RLHF RT](https://huggingface.co/datasets/Anthropic/hh-rlhf) (Ganguli et al., 2022) and [M-ADV-Bench](https://huggingface.co/datasets/simonycl/multilingual_advbench) (Yong et al., 2023), and with [Llama Guard 3](https://huggingface.co/meta-llama/Llama-Guard-3-8B) (Grattafiori et al., 2024) serving as the moderator model. This evaluation is carried out in English, Spanish and Catalan, using NLLB translation when necessary. Results yielded an average attack success rate of 16.4%.
970
+
971
+ Additional filtering, human oversight, and alignment steps are essential. We are actively working to improve and assess the model’s safety, including human annotation and evaluation, as well as the development of multilingual safety datasets. A comprehensive report will be provided in subsequent updates.
972
+
973
+ ### Recommendations:
974
+
975
+ Developers should implement additional safety filters, human oversight, targeted evaluation suites, and secondary evaluation models when deploying this model. Do not deploy ALIA-40b-Instruct in critical applications without extensive testing and mitigation. Users are responsible for assessing and mitigating harmful behavior or misinformation resulting from model outputs, and ensuring compliance with applicable regulations, including those governing the use of Artificial Intelligence.
976
 
977
  ---
978