TAIDE-LX-7B-GGUF
- This is a quantized (GGUF) version of taide/TAIDE-LX-7B created using llama.cpp
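For a quick smoke test, the GGUF files load directly with llama-cpp-python. A minimal sketch, assuming one of this repo's quantized files has been downloaded locally (the filename is illustrative):

```python
# Minimal sketch using llama-cpp-python (pip install llama-cpp-python).
# The GGUF filename is illustrative; use any quantization level from this repo.
from llama_cpp import Llama

llm = Llama(
    model_path="TAIDE-LX-7B.Q4_K_M.gguf",  # hypothetical local path
    n_ctx=4096,                            # matches the model's 4K max context
)

# TAIDE-LX-7B is a base (pretrained-only) model, so use plain text completion
# rather than a chat template.
output = llm("ๅฐ็ฃๆ้ซ็ๅฑฑๆฏ", max_tokens=64)
print(output["choices"][0]["text"])
```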
Model Description
- The TAIDE project aims to develop a generative AI dialogue engine model that is tailored to the linguistic and cultural characteristics of Taiwan, while also establishing a trustworthy AI environment. By combining academic, industrial, and research resources, the project seeks to advance the development of trustworthy generative AI, enhancing Taiwan's international competitiveness, promoting industrial development, and reducing dependence on foreign technologies.
- The large language models developed in this project are based on Meta's LLaMA2-7b release, incorporating text and training materials from various fields in Taiwan to enhance the model's ability to respond in Traditional Chinese and perform well in specific tasks. The publicly released models are as follows:
- TAIDE-LX-7B: This model is LLaMA2-7b continually pretrained on Traditional Chinese data. It is suitable for users who intend to fine-tune it further. Since the pretrained model has not undergone fine-tuning or preference alignment, it may produce malicious or unsafe outputs; please use with caution.
- TAIDE-LX-7B-Chat: This model enhances office-related tasks and multi-turn question-and-answer capabilities through instruction tuning on top of TAIDE-LX-7B. It is suitable for scenarios such as chat conversation or task assistance. TAIDE-LX-7B-Chat also provides a 4-bit quantized model, offered primarily for user convenience; quantization may affect performance and introduce unforeseen issues, so users are advised to take note of this.
Model Parameters
- Parameters: 7B
- Max context length: 4K
- Traditional Chinese training tokens: 41.44B
- Training time: 1531.82 H100 GPU Hours
Features
- Expanding the tokenizer vocabulary by 24,720 characters and words to enhance the model's ability to process Traditional Chinese.
- Strictly vetting training data for the model to improve its trustworthiness and applicability.
- Enhancing the model's performance for tasks such as summarization, writing articles, writing letters, and translating between Chinese and English, which are commonly used in office settings.
- Strengthening the model's understanding of Taiwan-specific culture, terminology, and context.
- Equipping the model with multi-turn conversation capabilities.
Examples
| Task | User input | Model output |
|---|---|---|
| Writing articles | ่ซไปฅไปฅไธๅ งๅฎน็บๅบ็ค๏ผๅฏซไธ็ฏๆ็ซ ๏ผๆฐๅฏซไธ็ฏไฝๆ๏ผ้ก็ฎ็บใไธๅผต่็ ง็ใ๏ผๅ งๅฎน่ฆๆฑ็บ๏ผ้ธๆไธๅผตไปคไฝ ๅฐ่ฑกๆทฑๅป็็ ง็๏ผ่ชชๆไปคไฝ ๅฐ่ฑกๆทฑๅป็ๅๅ ๏ผไธฆๆ่ฟฐ็ ง็ไธญ็ๅฝฑๅๅ่ๅพ็ๆ ไบใ่จ้ๆ้ท็้็จใ่ไปไบบ็ๆ ๆฏใ็ฐๅข่ฎ้ทๅ็พ้บ็ๆฏ่ฒใ | ไปฅไธๆฏๆ นๆๆไปคๅตไฝ็ไธ็ฏไฝๆ๏ผ ใไธๅผต่็ ง็ใ ๅจๆๅกตๅฐๅทฒไน ็ๅฎถๆ็ธ็ฐฟไธญ๏ผๆไธๅผต็นๅฅๅธๅผๆ็่็ ง็๏ผ้ฃๅผต็ ง็่จ้ไบๆๅๅฎถๆๆๅกๅจ้ๅพ็พๅฅฝๆๅ ็ๆญก่๏ผ่ณไปไป่ไปคไบบๅๅณ็ก็ชฎใ้ๅผต็ ง็ๆฏๆๅฅถๅฅถๅจไบๅๅคๅนดๅ่่พฆ็ไธๅ ดๅฎถๆ่ๆไธๆๆไธ็๏ผ็ ง็ไธญ็ไบบ็ฉๅ ๆฌไบๆ็็ถๆฏใ็ฅ็ถๆฏใๅๅใๅงๅงๅ่กจๅฅ็ญ๏ผๅคงๅฎถ่ไธๆดๆบข่ๅนธ็ฆ็็ฌๅฎน๏ผๅฝทๅฝฟๆๅ ้ๆญข่ฌๅฐ็พๅฅฝใ ็ ง็ไธญ็ๅ ดๆฏๅๅๆบซ้ฆจ๏ผๅคงๅฎถๅๅๅจไธๅผต้ทๆนๅฝข็้คๆกไธ๏ผๆกไธๆบๆปฟไบ็พๅณ็้ฃ็ฉๅ้ฃฒๆใๆกไธ็่้คๆฏๅณ็ตฑไธญๅผ็ไฝณ้ค๏ผๆ้ฆๅดๅด็็คๅ จ้ญใๅซฉๆป็้่ๅ้ฎฎ็็่ฌ่๏ผ้ไบ่้คๆฏ็ถๅนดๅฅถๅฅถ่ฆช่ช็น่ชฟ็๏ผๅฅนๆ่็ฒพๆน๏ผ่ฎๆฏๅไบบ้ฝๅๅพๆดฅๆดฅๆๅณใ ๆ่ฎๆๅฐ่ฑกๆทฑๅป็ๆฏ๏ผ็ ง็ไธญ้ๆๅฐไบๆๅนดๅนผๆ็ๆจกๆจฃใๆๅๅจๆกๅญ็ๅฆไธ็ซฏ๏ผๆ่ฃกๆฟ่ไธ็ขๆนฏ๏ผ่ไธๅธถ่ๆปฟ่ถณ็็ฌๅฎนใ้ฃๆ็ๆๆๅๆปฟๅจๆญฒ๏ผๅฐๆผ้ๅ ด็ๅคง็ๅฎถๆ่ๆ้ๆๅฐๅๅๆฐๅฅใๅพ้ๅผต็ ง็ไธญ๏ผๆ่ฝๆๅๅฐ็ถๅนดๅฎถๆๆๅก้็่ฆชๅฏๅๆบซๆ๏ผๅณไฝฟ้ไบ้้บผๅคๅนด๏ผ้ฃไปฝๆ ๆๅปไพ็ถ็ขๅบๅฐๅญๅจ่ใ ้ๅผต็ ง็่ๅพๆ่ๅไบบ็ๆ ไบใๅจ้ฃๅๆไปฃ๏ผๆ็็ฅ็ถๆฏๅ็ถๆฏ้ฝ้ๅจ่ฎไธญๅญธ๏ผไปๅๆฏๅคฉ้ฝๆๆพๅญธๅพๅฐ่พฒ็ฐ่ฃกๅนซๅฟ๏ผ่พๅคๅฐ่ไฝไปฅ็ถญๆไธๅฎถไบบ็็่จใๅ็ฎก็ๆดปๆขไปถไธฆไธๅฏ่ฃ๏ผไฝไปๅไป่ไฟๆ่ๆจ่ง็ๆ ๅบฆๅๅฐ็ๆดป็็ฑๆใๆฏๅนดๅฐไบ็งๅคฉ๏ผๅช่ฆๆถ็ฒๅญฃ็ฏไพ่จ๏ผไปๅๅฐฑๆ่่พฆๅฎถๆ่ๆ๏ผ้่ซๆๆ่ฆชๆๆๅๅไพๅ ฑไบซ่ฑๆถ็ๅๆ ใ ้จ่ๆญฒๆ็ๆต้๏ผๆ้ๆผธ้ทๅคง๏ผ้ๅงๅทฅไฝๅ็ๆดปใๆฏๆฌกๅ้๏ผๆ้ฝๆๆฝ็ฉบๅปๆขๆ่ไบบๅฎถ๏ผ่ฝ่ฝไปๅ็ถๅนด็ๅพไบ๏ผๆๅ้ฃๆฎตๅป้ชจ้ๅฟ็ๆๅ ใ่้ๅผต่็ ง็๏ผไนๅฐไธ็ด้ชไผด่ๆ๏ผๆ็บๆ็่ฒดๅๆถ็ไธ้จๅใๅฎ่ฆ่ญไบๆๅ็ๆ้ทๅ่ฎ้ท๏ผไน่จ้ไธไบ้ฃไบ้ฃไปฅๅฟๆท็็พๅฅฝๆๅปใ ้ๅผต็ ง็ไธๅ ๅฑ็คบไบๅฎถๆ็่ฆชๆ ๅๆบซๆ๏ผ้ๆๅฐๅบ้ฃๅๆไปฃไบบๅๅคๅๆจธๅฏฆ็็ๆดปๆนๅผๅๅฐๅณ็ตฑ็ๅฐ้ใๅฆไป๏ผ้จ่็คพๆ็ๅฟซ้็ผๅฑ๏ผ่จฑๅคๅณ็ตฑ็็พๅพทๅ้ขจไฟๆญฃๅจ้ๆผธๆถๅคฑใๆๅๆ่ฉฒๅฅฝๅฅฝ็ๆ้ๆจฃ็ๅๆถๅๆๅ๏ผๅฐๅฎๅๅณๆฟไธๅป๏ผ่ฎๅพไปฃๅญๅญซไน่ฝๆๅๅฐ้ฃไปฝไพ่ชๆญทๅฒ็ๆบซๆ ๅๅณๆฟใ ๅ ๆญค๏ผๆๆฑบๅฎๅฐ้ๅผต็ ง็ๅฅฝๅฅฝไฟ็ฎก่ตทไพ๏ผไธฆๅธธๅธธๆฟๅบไพ็ใๅฎไธๅ ๆฏไธๅผต่็ ง็๏ผๆดๆฏไธๆฎตๅฏถ่ฒด็ๅๆถๅๅฎถๆ็ฒพ็ฅ็่ฑกๅพตใๆๅธๆๆชไพ่ฝๆๆฉๆ้็พ็ถๅนด็็ๆณ๏ผๅไธๆฌก่ๅจไธ่ตท๏ผๅไบซๅฝผๆญค็็ๆดปๅๆ ไบ๏ผ่ฎ้ฃไปฝ่ฆชๆ ๅๅๆ ็นผ็บๅจๆๅ ไธญ่ๅปถ๏ผๆฐธไธๆถ้ใ |
| Writing letters | ่ซไปฅๅ็็ถ็็่บซไปฝ๏ผ็ตฆๅปฃๅๅ ฌๅธ็ๅตๆ็ธฝ็ฃๅฏซไธๅฐไฟก๏ผๆๅบๅฐๆผๆฐ็ขๅๅปฃๅๅฎฃๅณๆดปๅ็ๅตๆๅปบ่ญฐใ | ๅฐๆฌ็ๅตๆ็ธฝ็ฃ๏ผ ๆๅฏซ้ๅฐไฟกๆฏ็บไบๅๆจๆๅบไธไบ้ๆผๆๅๆฐ็ขๅๅปฃๅๅฎฃๅณๆดปๅ็ๅตๆๅปบ่ญฐใๆๅๅ ฌๅธๅณๅฐๆจๅบไธๆฌพๅ จๆฐ็็ขๅ๏ผ็บไบ็ขบไฟๅฎๅจๅธๅ ดไธๅๅพๆๅ๏ผๆๅๅธๆ้้ไธๅๅผไบบๆณจ็ฎ็ๅปฃๅๅฎฃๅณๆดปๅไพๅธๅผๆถ่ฒป่ ็ๆณจๆใ ๅจ่ๆ ฎๅปฃๅ็ๅตๆๅ็ญ็ฅๆ๏ผๆๅนพๅๅ ็ด ้่ฆ็ดๅ ฅ่้ใ้ฆๅ ๏ผๆๅ่ฆ้ๅฐ็ฎๆจๅฎข็พค้ฒ่ก็ ็ฉถ๏ผไปฅ็ญ่งฃไปๅ็้ๆฑๅๅๅฅฝใๅ ถๆฌก๏ผๆๅ่ฆ็ขบไฟๅปฃๅๅ งๅฎนๅ ทๆๅธๅผๅๅ่ชชๆๅ๏ผ่ฝๅผ่ตท็ฎๆจๅฎข็พค็ๅ ฑ้ณดใๆๅพ๏ผๆๅ่ฆๅฉ็จๅ็จฎๅช้ซๅนณ่บๅๆธไฝ่ก้ทๅทฅๅ ท๏ผไปฅๆดๅคงๅปฃๅ็ๅฝฑ้ฟๅใ ๅบๆผ้ไบๅ ็ด ๏ผๆๆๅบไปฅไธๅนพ้ปๅตๆๅปบ่ญฐ๏ผ 1. ็ขๅๅฎไฝ๏ผๆๅๅฏๅฐ้ๆฌพๆฐ็ขๅๅฎไฝ็บ้ซ็ซฏใ็ฐไฟใๅฅๅบทๅๆๅฐ็ไปฃ่กจ๏ผๅผท่ชฟๅ ถ็จๆ็ๅ่ฝๅ็น่ฒใๅจๅปฃๅไธญ๏ผๆๅๅฏไปฅ้้็ๅ็่ฆ่ฆบๆๆๅ็ฐกๆฝ็่ช่จไพๅณ้้ไบ็น้ปใ 2. ๆ ไบ่ก้ท๏ผๅจๅปฃๅไธญ่ฌ่ฟฐไธๅ่็ขๅๅ่ฝ็ธ้็ๅไบบๆ ไบ๏ผ่ฎๆถ่ฒป่ ่ฝๆดๆทฑๅ ฅๅฐ็ญ่งฃ็ขๅๆๅธถไพ็็ๆดป่ฎๅใไพๅฆ๏ผๆๅๅฏไปฅ่ฌ่ฟฐไธไฝๅฟ็ข็่ทๆฅญๅฉฆๅฅณ๏ผๅฆไฝไฝฟ็จๆๅ็ๆฐ็ขๅๅจๅทฅไฝๅ็ๆดปไธญๅๅพๅนณ่กก็ๆ ไบใ 3. ๅไบบๆๆ๏ผ้่ซไธไฝๅๆญก่ฟ็ๅ ฌ็พไบบ็ฉๆๆ่ฆ้ ่ขๆไปป็ขๅไปฃ่จไบบ๏ผๅฉ็จไปๅ็ๅฝฑ้ฟๅไพๆจๅปฃ็ขๅใ้ไธๅ ๅฏไปฅๅขๅ ็ขๅๆๅ ๅบฆ๏ผ้่ฝ่ฎๆดๅคๆถ่ฒป่ ไฟกไปปๅไฟก่ณด็ขๅใ 4. ่ทจ็ๅไฝ๏ผ่ไธไบ็ธ้็ขๆฅญ๏ผๅฆๆๅฐใ็ฐไฟใๅฅๅบท็ญ๏ผ้ฒ่ก่ทจ็ๅไฝ๏ผๅ ฑๅๆจๅบ้้็ๆ่ฏๅ็ขๅใ้็จฎๅไฝไธๅ ่ฝๅธๅผไธๅ้ ๅ็ๆถ่ฒป่ ๏ผ้่ฝ็บ็ขๅๅธถไพๆดๅค่ฉฑ้กๆงๅๆถ่ๅนๅผใ 5. ๆธไฝ่ก้ท๏ผๅ ๅๅฉ็จ็คพ็พคๅช้ซใๆๅฐๅผๆๅ้ป้ต่ก้ท็ญๆธไฝๅทฅๅ ท๏ผๅจๆดๅๅปฃๅๅฎฃๅณ้็จไธญไฟๆไธ่ดๆง็่จๆฏๅๅฝข่ฑกใๆๅไนๅฏ่ฃฝไฝไธ็ณปๅ็ญ่ฆ้ ปๅ็ทไธๆดปๅ๏ผไปฅๅขๅ ๆถ่ฒป่ ๅ่ๅบฆๅๅ็ๅฟ ่ช ๅบฆใ 6. ๅตๆฐๅฝขๅผ๏ผๅ่ฉฆไฝฟ็จๆฐ็ฉ็ๅปฃๅๅฝขๅผ๏ผๅฆAR๏ผๆดๅขๅฏฆๅข๏ผใVR๏ผ่ๆฌๅฏฆๅข๏ผๅ360ๅบฆๅฝฑ็็ญ๏ผ็ตฆๆถ่ฒป่ ๅธถไพๅ จๆฐ็ๅปฃๅ้ซ้ฉใ้็จฎๅตๆฐๅฝขๅผไธๅ ่ฝๅธๅผๆถ่ฒป่ ็ฎๅ ๏ผ้่ฝๆ้ซ็ขๅๅจๅธๅ ดไธ็่พจ่ญๅบฆใ 7. ๆดปๅๅ่๏ผๅจๅปฃๅๅฎฃๅณๆดปๅไธญๅ ๅ ฅไบๅ็ฐ็ฏ๏ผๅฆ็ขๅ่ฉฆ็จใๆฝ็ๅๅชๆ ๅธ็ญ๏ผ้ผๅตๆถ่ฒป่ ็ฉๆฅตๅ่ไธฆๅไบซไปๅ็้ซ้ฉใ้ๅฐๆๅฉๆผๅฃ็ขๅณๆญ๏ผ้ฒไธๆญฅๆดๅคง็ขๅๅฝฑ้ฟๅใ ็ถๅไปฅไธๅปบ่ญฐ๏ผๆๅธๆๆๅ่ฝ่ฃฝๅฎๅบไธๅฅๅ จ้ข่ๅฏๆๅตๆ็ๅปฃๅๅฎฃๅณๆดปๅ่จๅ๏ผไปฅๆๅคง้ๅบฆๅฐๆๅๆฐ็ขๅๅจๅธๅ ดไธ็็ซถ็ญๅใๅฆๆๆจๆไปปไฝ็ๅๆ้่ฆ้ฒไธๆญฅ่จ่ซ๏ผๆญก่ฟ้จๆ่ๆ่ฏ็ตกใ่ฌ่ฌๆจ็้ๆณจๅๆฏๆด๏ผ ็ฅๅ็ฅบ๏ผ ๆจ็ๅ็็ถ็ ๏ผ่จป๏ผๆญคไฟกไปถๅฏๆ นๆๅฏฆ้ๆ ๆณๅ็ขๅ็น้ป้ฒ่ก่ชฟๆด๏ผ |
| Summarization | ่ซๅฐ้็ฏๆ็ซ ็ฒพ็ฐกๆข็ๅ:ใ็ขๆฅญๅตๆฐๆขไพ็ฌฌ10ๆขไน2ๅ็ฌฌ72ๆขๆขๆไฟฎๆญฃๆกใไฟ็จฑใๅฐ็ๆถ็ๆณใ,้ๅฐๅๅฐ้ซใ้ปๅ่ปใ5G็ญๆ่กๅตๆฐไธๅฑ ๅ้ไพๆ้้้ตๅฐไฝๅ ฌๅธ,ๆไพๆ้ซ25%็ๆๅพ็จ ๆๆตๅชๆ ,ไผๆฅญ้ฉ็จ่ฆไปถๅ ๅซ็ถๅนดๅบฆ็ ็ผ่ฒป็จใ็ ็ผๅฏๅบฆ้ไธๅฎ่ฆๆจก,ไธๆๆ็จ ็้ไธๅฎๆฏ็ใ ็บๅ ๆ็ถๆฟๅไฝๆจ็ผๅฑ็ต็น(OECD)ๅๅฎถๆไฝ็จ ่ฒ ๅถ่ชฟๆด,ๅ ถไธญๆๆ็จ ็้ๆชป,ๆฐๅ112ๅนด่จ็บ12%,113ๅนดๆๅฐๆ้ซ่ณ15%,ไฝไปๅพๅฏฉ้ ๅ้้ๆไฝ็จ ่ฒ ๅถๅฏฆๆฝๆ ๅฝขใ ็ถๆฟ้จๅฎๅก่กจ็คบ,ๅทฒๅ่ฒกๆฟ้จๅๅ้ฒๅ ฅๆๅพ้ๆฎต,้คไผๆฅญ็ ็ผๅฏๅบฆ่จๅจ6%,็ฎๅๅทฒ็ขบ่ช,ไผๆฅญ่ณผ็ฝฎๅ ้ฒ่ฃฝ็จ็่จญๅๆ่ณ้้ก้100ๅๅ ไปฅไธๅฏๆตๆธใ ่ฒกๆฟ้จๅฎๅก่กจ็คบ,็ ๅ้็จไธญ,้ๅฐๅฐ็ฃ็ขๆฅญ่ๅ ถๅจๅ้้้กไผผ็ๅ ฌๅธ้ฒ่กๆทฑๅ ฅ็ ็ฉถ,ๅจ่จญๅ้จๅ,็ข็ซ้ฉ็จ็ขๅต10ไน2็ๆฅญ่ ๆฏไปฃ่กจๅฐ็ฃ้ๆใๅ้็ใ,ๆๅ ฅ้้กไธ้100ๅๅ ,ๅฏ่ฝไนๆไธไบใ ่ณๆผๅๅ้ๆณจ็็ ็ผ่ฒป็จ้ๆชป,็ถๆฟ้จๅฎๅก่กจ็คบ,ๆญท็ถ่่ฒกๆฟ้จไพๅๅฏๅ่จ่ซ,็ ็ผ่ฒป็จ้ๆชปๆๆ่ฝๅจ60ๅ่ณ70ๅๅ ไน้ใ ่ฒกๆฟ้จๅฎๅกๆๅบ,็ ็ผๆธ้ๅฐ็ฃๆชไพ็ถๆฟๆ้ทๅ่ฝ,้ๆชปไธ่ฝใ้ซไธๅฏๆใ,่ตทๅ้่จญๅฎๅจ100ๅๅ ,ไนๆไปฅๆ่ชฟ้,ๆญฃๆฏ็ผ่ฎไผๆฅญ่ฆบๅพๆ่พฆๆณ้ๅพๅฐ้ๆชปใ้ฒ่้ฉ็จ็ง็จ ๅชๆ ,ๆๆๅๅ็นผ็บๆๅ ฅ็ ็ผ,็ถญๆๅ้ไพๆ้้้ตๅฐไฝใ ็ถๆฟ้จๅฎๅก่กจ็คบ,ๅ ๅป ๅ็ ็ผ่ฒป็จๅนณๅ็บ30ใ40ๅๅ ,ๅ ถไธญ,IC่จญ่จๆฅญ่ ไปๆผ30ๅ่ณ60ๅๅ ็ฏๅ,่ฅๅฐ้ๆชป่จๅจ100ๅๅ ,็ฌฆๅๆขไปถ็ๆฅญ่ ่ผๅฐใๅบๆฟ่ชๅ ไธ่ถณ;ๆญคๅค,่ฅ็ฌฆๅ็ณ่ซ้ๆชป็ๆฅญ่ ๅขๅ ,ๅฐๅฏๆ้ซไผๆฅญๅจๅฐๆ่ณ้้ก,่ฒกๆฟ้จ็จ ๆถไน่ฝๅ ๆญค็ฒๅพๆนๆณจใ IC่จญ่จๆฅญ่ ่ฟๆฅ้ ป้ ป้ๅฐ็ขๅต10ไน2็ผ่ฒ,ๅธๆ้ไฝ้ฉ็จ้ๆชป,ๅ ไธๅๅๅๆไพๆ้่ชไธปๅใๅ ็ขผ่ฃๅฉๅๅฐ้ซ็ขๆฅญ,็ถๆฟ้จๅฎๅก่กจ็คบ,็ถๆฟ้จๅ่ฒกๆฟ้จๅฐฑ็ขๅต10ไน2้ๆๅ ฑ่ญ,็ญๅ่ฎๆดๅคๆฅญ่ ๅๆ ,็ผๅขๅผทไผๆฅญๆ่ณๅ้ๅ้ๅบๅฐ็ฃๆ่กๅฐไฝใ ่ฒกๆฟ้จๅฎๅก่กจ็คบ,็ง็จ ็ๅต็ๅถๅฎๅฟ ้ ใๆ็บๆๅฎใ,ไธฆไปฅ้ๅฐ็ๅต่จญ็ฝฎ็ฎ็็บๆ้ซๅๅ,็พ้ๆฎตๅจๆใๅๅ ง็ใ็ไผๆฅญไปๅฏ้ฉ็จ็ขๅต็ฌฌ10ๆขใ10ไน1็็ง็จ ๅชๆ ,ๅ ฑๅๅฃฏๅคงๅฐ็ฃ็ถๆฟ็ผๅฑใ ็ถๆฟ้จๅ่ฒกๆฟ้จๆญฃๅฐฑ็ ็ผ่ฒป็จ้ๆชปๅๆๅพ็ขบ่ช,ๅพ ไปๆๅ ฉๅคฉ้ ๅๅญๆณไนๅพ,็ดๆ30ๅคฉๆ้,ๅฏ่ๆฅญ็้ฒไธๆญฅ่จ่ซๅ่ชฟๆด,็ผ็ขๅต10ไน2่ฝๅจ6ๆไธ่ทฏใ | ๅฐ็ฃใ็ขๆฅญๅตๆฐๆขไพใ็ไฟฎๆญฃๆก๏ผๅ็จฑใๅฐ็ๆถ็ๆณใ,ๆจๅจๆฏๆๅๅฐ้ซใ้ปๅ่ปๅ5Gๆ่ก้ ๅไผๆฅญ็ๅตๆฐใๆฐๆฟ็ญๆไพๆ้ซ25%็ๆๅพ็จ ๆตๅ ๏ผไผๆฅญ้ๆปฟ่ถณ็ถๅนดๅบฆ็ ็ผ่ฒป็จ้ไธๅฎ่ฆๆจกใ็ ็ผๅฏๅบฆ่ถ ้6%ไปฅๅๆๆ็จ ็้ไธๅฎๆฏ็็ญๆขไปถใๅ ๆๅ้้ๅฐไผๆฅญๆไฝ็จ ่ฒ ๅถๅบฆ็่ชฟๆด๏ผๅฐ็ฃ็้ๆชป้ ่จๅๅฅๆผ2023ๅนดๅ2024ๅนด้ๅฐ12%ๅ15%ใ็ฎๅ๏ผ่่ฒกๆฟ้จ่จ่ซๅพ๏ผ็ขบๅฎ100ๅๅ ็่จญๅๆ่ณๆตๅ ๆฏๅฏ่ก็่ชๅ ๏ผไธ็ ็ผ่ฒป็จ้ๆชป้ ่จๆๅจ60่ณ70ๅๅ ไน้ใๅฐๆผ้ไฝ้ๆชป็่ฆๆฑ๏ผ็ถๆฟ้จๅ่ฒกๆฟ้จๅทฒ้ๆๅ ฑ่ญ๏ผๆจๅจๆดๅคงๅๆ ็ฏๅไธฆ้ผๅตไผๆฅญๆ่ณไปฅ็ถญๆไพๆ้็้้ตๅฐไฝใๆญคๅค๏ผ้ๅฐIC่จญ่จๆฅญ่ ็่จดๆฑ๏ผไน่ๆ ฎ่ชฟ้้ๆชปใ้ ่จๆญคๆณๆกๅฐๅจ่ฟๆๅ งๅฏฆๆฝ๏ผ็ธ้่จ่ซๅ่ชฟๆดๅฐๅจ้ ๅๆๅพ้ฒไธๆญฅ้ฒ่กใ |
| Translation | ไปฅไธๆไพ่ฑๆๅ งๅฎน๏ผ่ซๅนซๆ็ฟป่ญฏๆไธญๆใDongshan coffee is famous for its unique position, and the constant refinement of production methods. The flavor is admired by many caffeine afficionados. | ๆฑๅฑฑๅๅกไปฅๅพๅคฉ็จๅ็ๅฐไฝ๏ผไธๆท็ฒพ็ ็่ฃฝไฝๅทฅ่่็จฑ๏ผ้ขจๅณ็ฒ่ซธๅคๅๅกๆๅฅฝ่ ๆจๅดใ |
Training methods
- Software / hardware spec
- GPU: H100
- Training Framework: PyTorch
- Data preprocessing
- Character normalization
- Deduplication
- Denoise
- HTML tags and JavaScript in web content
- Non-standard characters or garbage characters
- Posts with an insufficient number of characters
- Removing specific formats such as extra line breaks added for formatting purposes
- Removing personal information such as emails and phone numbers
- Removing inappropriate content such as gambling and pornography (a rough sketch of the PII-scrubbing and deduplication steps follows this list)
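The card does not publish the preprocessing code, so as a rough illustration only, the PII-scrubbing and deduplication steps might look like this (the regex patterns and hashing choice are assumptions, not the team's actual pipeline):

```python
# Illustrative sketch of two of the steps listed above: removing emails/phone
# numbers and dropping exact-duplicate documents. Not TAIDE's actual pipeline.
import hashlib
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"0\d{1,3}[- ]?\d{3,4}[- ]?\d{3,4}")  # loose Taiwan-style pattern (assumed)

def scrub_pii(text: str) -> str:
    """Replace emails and phone numbers with placeholder tokens."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return PHONE_RE.sub("<PHONE>", text)

def deduplicate(docs: list[str]) -> list[str]:
    """Keep only the first occurrence of each exact-duplicate document."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = ["่ฏ็ตก: a@b.tw, ้ป่ฉฑ 0912-345-678", "่ฏ็ตก: a@b.tw, ้ป่ฉฑ 0912-345-678", "็กๅไบบ่ณ่จ็ๆไปถ"]
print([scrub_pii(d) for d in deduplicate(docs)])
```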
- Character and word expansion
- To enhance Traditional Chinese input and output performance, the expanded data includes the following two parts (a rough sketch of this kind of vocabulary merge follows this list):
- Obtaining Chinese characters from the Ministry of Education's "Variant Chinese Characters Dictionary" and "Corrected Characters Table".
- Collecting over 5,000,000 sentences with more than 100 characters each from the Traditional Chinese Wikipedia, news articles, and the Chinese Common Crawl data (2.1G), used to train the tokenizer for Chinese characters and words.
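The exact expansion recipe is not published; vocabulary extension of this kind is typically done by training a tokenizer on the new corpus and then merging novel pieces into the base LLaMA tokenizer. A minimal sketch using sentencepiece and transformers, with paths and sizes as illustrative assumptions:

```python
# Illustrative sketch of LLaMA-style vocabulary expansion; corpus path, vocab
# size, and merge details are assumptions, not TAIDE's exact recipe.
import sentencepiece as spm
from transformers import LlamaTokenizer

# 1. Train a tokenizer on the Traditional Chinese corpus described above.
spm.SentencePieceTrainer.train(
    input="zh_tw_corpus.txt",   # hypothetical corpus file
    model_prefix="zh_tw_sp",
    vocab_size=25000,           # in the same ballpark as the ~24,720 added pieces
    character_coverage=0.9995,
    model_type="bpe",
)

# 2. Add pieces that the base LLaMA tokenizer does not already contain.
sp = spm.SentencePieceProcessor(model_file="zh_tw_sp.model")
base = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
new_pieces = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]
added = base.add_tokens([p for p in new_pieces if p not in base.get_vocab()])
print(f"added {added} tokens; new vocab size = {len(base)}")

# 3. Before continued pretraining, the model's embedding matrix must be
#    resized to match, e.g. model.resize_token_embeddings(len(base)).
```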
- Continuous pretraining (CP)
- Supplementing the model with a large amount of reliable Traditional Chinese knowledge.
- Hyper parameters
- optimizer: AdamW
- learning rate: 1e-4
- batch size: 1M tokens
- epoch: 1
- Fine-tuning (FT)
- Enabling the model to answer questions in Traditional Chinese.
- Hyper parameters
- optimizer: AdamW
- learning rate: 5e-5
- batch size: 256K tokens
- epoch: 3
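Since both batch sizes are stated in tokens, they translate into sequence counts once the 4K context is fixed. The sketch below maps the stated hyperparameters onto Hugging Face TrainingArguments, with the device count and gradient-accumulation split as purely hypothetical choices:

```python
# Illustrative mapping of the stated hyperparameters onto TrainingArguments.
# Only the optimizer, learning rates, epochs, and token budgets come from the
# card; everything else here is an assumption.
from transformers import TrainingArguments

CTX_LEN = 4096                   # the model's 4K max context
CP_BATCH_TOKENS = 2**20          # "1M tokens", assuming binary units
SEQS_PER_BATCH = CP_BATCH_TOKENS // CTX_LEN   # = 256 sequences per global batch

N_GPUS, PER_DEVICE = 8, 4        # hypothetical hardware split
GRAD_ACCUM = SEQS_PER_BATCH // (N_GPUS * PER_DEVICE)  # = 8

cp_args = TrainingArguments(
    output_dir="taide_cp",       # hypothetical output directory
    optim="adamw_torch",         # AdamW
    learning_rate=1e-4,          # CP learning rate
    num_train_epochs=1,          # CP: 1 epoch
    per_device_train_batch_size=PER_DEVICE,
    gradient_accumulation_steps=GRAD_ACCUM,
)

# The FT stage is analogous: learning_rate=5e-5, num_train_epochs=3, and a
# 256K-token batch (2**18 // 4096 = 64 sequences per global batch).
```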
Training Data
- Continuous pre-training data (about 140GB)
| Dataset | Description |
|---|---|
| Litigation Data | Civil litigation data from judicial rulings at various levels of courts, covering 2013/01 to 2023/12. |
| CNA news | Daily news articles from June 1993 to June 2023, spanning 30 years. The content covers domains such as domestic and international politics, society, economy, culture, education, and lifestyle. |
| ETtoday news | ETtoday news data from 2011/10 to 2023/12. |
| Legislative Yuan Gazette | Data from the 1st session of the 8th term to the 7th session of the 10th term. |
| Publisher Website Book Introduction | Book introduction data from the websites of the SunColor and Gotop publishers. |
| Abstracts of GRB research projects | GRB is an information system that compiles research projects funded by government grants and their outcome reports. This dataset primarily includes research project abstracts from 1993 to 2023, in both Chinese and their English counterparts. |
| Academic conference proceedings abstracts | Abstracts of academic conference proceedings held in Taiwan from 1988 to 2009. |
| Taiwan Panorama magazine | Articles from July 1993 to June 2023, spanning 30 years. The content focuses on Taiwanese culture, tourism, and local customs. |
| ๆจ่ฉ็ถฒ | Covers approximately 187,000 academic terms in the humanities and social sciences, along with their translations. |
| Data from various ministries and commissions | Partial data from government department websites, such as the Executive Yuan's "National Overview", the Ministry of Culture's "National Cultural Memory Bank", the National Development Council's "Archives Support Teaching Network", and the Ministry of Transportation's "Traffic Safety Portal". |
| Business Today | Business Today is a weekly magazine focused on finance. The dataset includes articles from 2008/01 to 2023/07. |
| Mandarin and idiom dictionary from the Ministry of Education | Includes the Idiom Dictionary (5,338 idioms with definitions, original stories, usage explanations, and example sentences), the Revised Mandarin Dictionary (Chinese words and vocabulary with pronunciation, radicals, definitions, and other information, totaling approximately 165,539 entries), and the Concise Mandarin Dictionary (a condensed version of the Revised Mandarin Dictionary, containing 45,247 entries). |
| SCITechVista | Science news and popular science articles from the SCITechVista website. |
| iKnow | The iKnow platform provides information on market trends, strategic analysis, patent knowledge, and technology transactions for Taiwan and the global technology industry. The dataset covers 2005/01 to 2023/07. |
| Science Development Monthly Magazine | A popular science publication issued by the National Science Council (NSC) to promote science education, with articles from 2004/10 to 2020/12. In 2021 the magazine was relaunched as the "CharmingSCITech" quarterly, providing new knowledge on international technology issues. |
| Legislation Database | The latest central regulations, rules, draft bills, and local regulations issued by government agencies as of 2023/10. |
| Local Government Tourism Websites | Partial data from the tourism websites of Taiwan's county and city governments. |
| Curriculum Guidelines from the National Institute of Education | Curriculum guidelines for different subjects at various levels of education. |
| CNA's English and Chinese Name Translation Database | The Central News Agency (CNA) database of translations of foreign and Chinese surnames, personal names, organizations, and place names used in news. |
| Fairy tales | A total of 20 fairy tale books, including "Tom Sawyer", "Peter Pan", "Alice's Adventures in Wonderland", "Daddy-Long-Legs", and more. |
| RedPajama-Data-V2 | English data extracted from the RedPajama-Data-v2 multilingual dataset. |
| MathPile-commercial | A mathematics-focused dataset obtained from MathPile-commercial. |
| Traditional Chinese Wikipedia Articles | All articles in Traditional Chinese Wikipedia, up to January 2023. |
| github-code-clean | An open-source GitHub code dataset, with unlicensed code and documents removed. |
- Fine-tuning data
- The TAIDE team produces the fine-tuning data used to train the LLaMA2-series models: single- and multi-turn conversations on topics such as world knowledge, creative writing, general knowledge, translation, summarization, programming, and Taiwanese values. The set consists of 128K prompt-response pairs and will be released publicly later.
Evaluation
- taide-bench
- Data
- Tasks include writing articles, writing letters, summarizing articles, translating from English to Traditional Chinese, translating from Traditional Chinese to English. There are 500 questions in total.
- data link: taide-bench
- Evaluation method
- LLM-as-a-judge scoring by GPT-4 (a rough sketch of the pattern follows below)
- code link: taide-bench-eval
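The concrete prompts and rubric live in the linked taide-bench-eval repository; the sketch below illustrates only the general LLM-as-a-judge pattern, and the grading prompt is an assumption:

```python
# Minimal sketch of LLM-as-a-judge scoring with the OpenAI API. The rubric
# wording is illustrative; the actual prompts are in taide-bench-eval.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge(task: str, question: str, answer: str) -> str:
    prompt = (
        f"You are grading a {task} response.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Rate the answer on a 0-10 scale and reply with the number only."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

score = judge("summarization", "่ซๆ่ฆไปฅไธๆ็ซ ...", "ๆฌๆๆๅบ...")
print(score)
```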
- Scores
| Model | Translating from Traditional Chinese to English | Translating from English to Traditional Chinese | Summarization | Writing articles | Writing letters | Average |
|---|---|---|---|---|---|---|
| TAIDE-LX-7B-Chat | 7.165 | 7.685 | 7.720 | 9.635 | 9.110 | 8.263 |
| GPT-3.5 | 8.880 | 8.810 | 7.450 | 9.490 | 8.750 | 8.676 |
| LLaMA2 7B | 6.075 | 4.475 | 5.905 | 2.625 | 3.040 | 4.424 |
| LLaMA2 13B | 6.480 | 6.135 | 6.110 | 2.565 | 3.000 | 4.858 |
| LLaMA2 70B | 6.975 | 6.375 | 6.795 | 2.625 | 2.990 | 5.152 |
License
Disclaimer
- Due to limitations in its design architecture and the inevitable biases in data, any response from the LLM model does not represent the stance of TAIDE. Additional security measures should be implemented before use, and responses may also contain incorrect information. Users are advised not to fully trust the responses.
Development Team
Useful links
Model tree for QuantFactory/TAIDE-LX-7B-GGUF
- Base model: taide/TAIDE-LX-7B