TAIDE-LX-7B-GGUF

Model Description

  • The TAIDE project aims to develop a generative AI dialogue engine model that is tailored to the linguistic and cultural characteristics of Taiwan, while also establishing a trustworthy AI environment. By combining academic, industrial, and research resources, the project seeks to advance the development of trustworthy generative AI, enhancing Taiwan's international competitiveness, promoting industrial development, and reducing dependence on foreign technologies.
  • The large language models developed in this project are based on Meta's LLaMA2-7b release, incorporating text and training materials from various fields in Taiwan to enhance the model's ability to respond in Traditional Chinese and perform well in specific tasks. The publicly released models are as follows:
    • TAIDE-LX-7B: This model is a version of LLaMA2-7b continuously pretrained on Traditional Chinese data. It is suitable for users who intend to further fine-tune the model. Because the pretrained model has not undergone fine-tuning or preference alignment, it may produce malicious or unsafe outputs. Please use with caution.
    • TAIDE-LX-7B-Chat: This model enhances office-related tasks and multi-turn question-and-answer capabilities through instruction tuning on top of TAIDE-LX-7B. It is suitable for scenarios such as chat conversations or task assistance. A 4-bit quantized version of TAIDE-LX-7B-Chat is also provided; the quantized model is offered primarily for user convenience, but quantization may degrade performance and introduce unforeseen issues, so users should take note of this.

Model Parameters

  • Parameters: 7B
  • Max context length: 4K
  • Training tokens in Traditional Chinese: 41.44B
  • Training time: 1531.82 H100 GPU Hours
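As a back-of-the-envelope check on the figures above, the token count and GPU hours imply roughly 27M tokens per H100 GPU-hour, and the 1M-token batch size listed under Training methods implies about 41,440 optimizer steps for one epoch. This is a rough sketch only; real throughput depends on parallelism and sequence packing.

```python
# Back-of-the-envelope figures derived from the stated training parameters.
# Estimates only: actual throughput depends on parallelism and packing.
tokens = 41.44e9       # Traditional Chinese training tokens
gpu_hours = 1531.82    # H100 GPU hours
batch_tokens = 1e6     # CP batch size (see Training methods)

tokens_per_gpu_hour = tokens / gpu_hours
steps = round(tokens / batch_tokens)   # one epoch of continuous pretraining

print(f"~{tokens_per_gpu_hour:.2e} tokens per H100 GPU-hour")
print(f"~{steps} optimizer steps for one epoch")
```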

Features

  • Expanding the tokenizer vocabulary by 24,720 characters and words to enhance the model's ability to process Traditional Chinese.
  • Strictly vetting training data for the model to improve its trustworthiness and applicability.
  • Enhancing the model's performance for tasks such as summarization, writing articles, writing letters, and translating between Chinese and English, which are commonly used in office settings.
  • Strengthening the model's understanding of Taiwan-specific culture, terminology, and context.
  • Equipping the model with multi-turn conversation capabilities.

Examples

Task: Writing articles

User input: ่ซ‹ไปฅไปฅไธ‹ๅ…งๅฎน็‚บๅŸบ็คŽ๏ผŒๅฏซไธ€็ฏ‡ๆ–‡็ซ ๏ผšๆ’ฐๅฏซไธ€็ฏ‡ไฝœๆ–‡๏ผŒ้กŒ็›ฎ็‚บใ€Šไธ€ๅผต่ˆŠ็…ง็‰‡ใ€‹๏ผŒๅ…งๅฎน่ฆๆฑ‚็‚บ๏ผš้ธๆ“‡ไธ€ๅผตไปคไฝ ๅฐ่ฑกๆทฑๅˆป็š„็…ง็‰‡๏ผŒ่ชชๆ˜Žไปคไฝ ๅฐ่ฑกๆทฑๅˆป็š„ๅŽŸๅ› ๏ผŒไธฆๆ่ฟฐ็…ง็‰‡ไธญ็š„ๅฝฑๅƒๅŠ่ƒŒๅพŒ็š„ๆ•…ไบ‹ใ€‚่จ˜้Œ„ๆˆ้•ท็š„้Ž็จ‹ใ€่ˆ‡ไป–ไบบ็š„ๆƒ…ๆ™ฏใ€็’ฐๅขƒ่ฎŠ้ทๅ’Œ็พŽ้บ—็š„ๆ™ฏ่‰ฒใ€‚

Model output: ไปฅไธ‹ๆ˜ฏๆ นๆ“šๆŒ‡ไปคๅ‰ตไฝœ็š„ไธ€็ฏ‡ไฝœๆ–‡๏ผš

ใ€Šไธ€ๅผต่ˆŠ็…ง็‰‡ใ€‹

ๅœจๆˆ‘ๅกตๅฐๅทฒไน…็š„ๅฎถๆ—็›ธ็ฐฟไธญ๏ผŒๆœ‰ไธ€ๅผต็‰นๅˆฅๅธๅผ•ๆˆ‘็š„่ˆŠ็…ง็‰‡๏ผŒ้‚ฃๅผต็…ง็‰‡่จ˜้Œ„ไบ†ๆˆ‘ๅ€‘ๅฎถๆ—ๆˆๅ“กๅœจ้Žๅพ€็พŽๅฅฝๆ™‚ๅ…‰็š„ๆญก่š๏ผŒ่‡ณไปŠไป่ˆŠไปคไบบๅ›žๅ‘ณ็„ก็ชฎใ€‚้€™ๅผต็…ง็‰‡ๆ˜ฏๆˆ‘ๅฅถๅฅถๅœจไบŒๅๅคšๅนดๅ‰่ˆ‰่พฆ็š„ไธ€ๅ ดๅฎถๆ—่šๆœƒไธŠๆ‰€ๆ‹ไธ‹็š„๏ผŒ็…ง็‰‡ไธญ็š„ไบบ็‰ฉๅŒ…ๆ‹ฌไบ†ๆˆ‘็š„็ˆถๆฏใ€็ฅ–็ˆถๆฏใ€ๅ”ๅ”ใ€ๅง‘ๅง‘ๅ’Œ่กจๅ“ฅ็ญ‰๏ผŒๅคงๅฎถ่‡‰ไธŠๆด‹ๆบข่‘—ๅนธ็ฆ็š„็ฌ‘ๅฎน๏ผŒๅฝทๅฝฟๆ™‚ๅ…‰้œๆญข่ˆฌๅœฐ็พŽๅฅฝใ€‚

็…ง็‰‡ไธญ็š„ๅ ดๆ™ฏๅๅˆ†ๆบซ้ฆจ๏ผŒๅคงๅฎถๅœๅๅœจไธ€ๅผต้•ทๆ–นๅฝข็š„้คๆกŒไธŠ๏ผŒๆกŒไธŠๆ“บๆปฟไบ†็พŽๅ‘ณ็š„้ฃŸ็‰ฉๅ’Œ้ฃฒๆ–™ใ€‚ๆกŒไธŠ็š„่œ้คšๆ˜ฏๅ‚ณ็ตฑไธญๅผ็š„ไฝณ้คš๏ผŒๆœ‰้ฆ™ๅ™ดๅ™ด็š„็ƒคๅ…จ้ญšใ€ๅซฉๆป‘็š„้›ž่‚‰ๅ’Œ้ฎฎ็”œ็š„่”ฌ่œ๏ผŒ้€™ไบ›่œ้คšๆ˜ฏ็•ถๅนดๅฅถๅฅถ่ฆช่‡ช็ƒน่ชฟ็š„๏ผŒๅฅนๆ‰‹่—็ฒพๆน›๏ผŒ่ฎ“ๆฏๅ€‹ไบบ้ƒฝๅƒๅพ—ๆดฅๆดฅๆœ‰ๅ‘ณใ€‚

ๆœ€่ฎ“ๆˆ‘ๅฐ่ฑกๆทฑๅˆป็š„ๆ˜ฏ๏ผŒ็…ง็‰‡ไธญ้‚„ๆ‹ๅˆฐไบ†ๆˆ‘ๅนดๅนผๆ™‚็š„ๆจกๆจฃใ€‚ๆˆ‘ๅๅœจๆกŒๅญ็š„ๅฆไธ€็ซฏ๏ผŒๆ‰‹่ฃกๆ‹ฟ่‘—ไธ€็ข—ๆนฏ๏ผŒ่‡‰ไธŠๅธถ่‘—ๆปฟ่ถณ็š„็ฌ‘ๅฎนใ€‚้‚ฃๆ™‚็š„ๆˆ‘ๆ‰ๅ‰›ๆปฟๅ‘จๆญฒ๏ผŒๅฐๆ–ผ้€™ๅ ด็››ๅคง็š„ๅฎถๆ—่šๆœƒ้‚„ๆ„Ÿๅˆฐๅๅˆ†ๆ–ฐๅฅ‡ใ€‚ๅพž้€™ๅผต็…ง็‰‡ไธญ๏ผŒๆˆ‘่ƒฝๆ„Ÿๅ—ๅˆฐ็•ถๅนดๅฎถๆ—ๆˆๅ“ก้–“็š„่ฆชๅฏ†ๅ’Œๆบซๆš–๏ผŒๅณไฝฟ้Žไบ†้€™้บผๅคšๅนด๏ผŒ้‚ฃไปฝๆƒ…ๆ„Ÿๅปไพ็„ถ็‰ขๅ›บๅœฐๅญ˜ๅœจ่‘—ใ€‚

้€™ๅผต็…ง็‰‡่ƒŒๅพŒๆœ‰่‘—ๅ‹•ไบบ็š„ๆ•…ไบ‹ใ€‚ๅœจ้‚ฃๅ€‹ๆ™‚ไปฃ๏ผŒๆˆ‘็š„็ฅ–็ˆถๆฏๅ’Œ็ˆถๆฏ้ƒฝ้‚„ๅœจ่ฎ€ไธญๅญธ๏ผŒไป–ๅ€‘ๆฏๅคฉ้ƒฝๆœƒๆ”พๅญธๅพŒๅˆฐ่พฒ็”ฐ่ฃกๅนซๅฟ™๏ผŒ่พ›ๅ‹คๅœฐ่€•ไฝœไปฅ็ถญๆŒไธ€ๅฎถไบบ็š„็”Ÿ่จˆใ€‚ๅ„˜็ฎก็”ŸๆดปๆขไปถไธฆไธๅฏŒ่ฃ•๏ผŒไฝ†ไป–ๅ€‘ไป่ˆŠไฟๆŒ่‘—ๆจ‚่ง€็š„ๆ…‹ๅบฆๅ’Œๅฐ็”Ÿๆดป็š„็†ฑๆ„›ใ€‚ๆฏๅนดๅˆฐไบ†็ง‹ๅคฉ๏ผŒๅช่ฆๆ”ถ็ฒๅญฃ็ฏ€ไพ†่‡จ๏ผŒไป–ๅ€‘ๅฐฑๆœƒ่ˆ‰่พฆๅฎถๆ—่šๆœƒ๏ผŒ้‚€่ซ‹ๆ‰€ๆœ‰่ฆชๆˆšๆœ‹ๅ‹ๅ‰ไพ†ๅ…ฑไบซ่ฑๆ”ถ็š„ๅ–œๆ‚…ใ€‚

้šจ่‘—ๆญฒๆœˆ็š„ๆต้€๏ผŒๆˆ‘้€ๆผธ้•ทๅคง๏ผŒ้–‹ๅง‹ๅทฅไฝœๅ’Œ็”Ÿๆดปใ€‚ๆฏๆฌกๅ›ž้„‰๏ผŒๆˆ‘้ƒฝๆœƒๆŠฝ็ฉบๅŽปๆŽขๆœ›่€ไบบๅฎถ๏ผŒ่ฝ่ฝไป–ๅ€‘็•ถๅนด็š„ๅพ€ไบ‹๏ผŒๆ„Ÿๅ—้‚ฃๆฎตๅˆป้ชจ้Š˜ๅฟƒ็š„ๆ™‚ๅ…‰ใ€‚่€Œ้€™ๅผต่ˆŠ็…ง็‰‡๏ผŒไนŸๅฐ‡ไธ€็›ด้™ชไผด่‘—ๆˆ‘๏ผŒๆˆ็‚บๆˆ‘็่ฒดๅ›žๆ†ถ็š„ไธ€้ƒจๅˆ†ใ€‚ๅฎƒ่ฆ‹่ญ‰ไบ†ๆˆ‘ๅ€‘็š„ๆˆ้•ทๅ’Œ่ฎŠ้ท๏ผŒไนŸ่จ˜้Œ„ไธ‹ไบ†้‚ฃไบ›้›ฃไปฅๅฟ˜ๆ‡ท็š„็พŽๅฅฝๆ™‚ๅˆปใ€‚

้€™ๅผต็…ง็‰‡ไธๅƒ…ๅฑ•็คบไบ†ๅฎถๆ—็š„่ฆชๆƒ…ๅ’Œๆบซๆš–๏ผŒ้‚„ๆŠ˜ๅฐ„ๅ‡บ้‚ฃๅ€‹ๆ™‚ไปฃไบบๅ€‘ๅ‹คๅ‹žๆจธๅฏฆ็š„็”Ÿๆดปๆ–นๅผๅ’Œๅฐๅ‚ณ็ตฑ็š„ๅฐŠ้‡ใ€‚ๅฆ‚ไปŠ๏ผŒ้šจ่‘—็คพๆœƒ็š„ๅฟซ้€Ÿ็™ผๅฑ•๏ผŒ่จฑๅคšๅ‚ณ็ตฑ็š„็พŽๅพทๅ’Œ้ขจไฟ—ๆญฃๅœจ้€ๆผธๆถˆๅคฑใ€‚ๆˆ‘ๅ€‘ๆ‡‰่ฉฒๅฅฝๅฅฝ็ๆƒœ้€™ๆจฃ็š„ๅ›žๆ†ถๅ’Œๆ–‡ๅŒ–๏ผŒๅฐ‡ๅฎƒๅ€‘ๅ‚ณๆ‰ฟไธ‹ๅŽป๏ผŒ่ฎ“ๅพŒไปฃๅญๅญซไนŸ่ƒฝๆ„Ÿๅ—ๅˆฐ้‚ฃไปฝไพ†่‡ชๆญทๅฒ็š„ๆบซๆƒ…ๅ’Œๅ‚ณๆ‰ฟใ€‚

ๅ› ๆญค๏ผŒๆˆ‘ๆฑบๅฎšๅฐ‡้€™ๅผต็…ง็‰‡ๅฅฝๅฅฝไฟ็ฎก่ตทไพ†๏ผŒไธฆๅธธๅธธๆ‹ฟๅ‡บไพ†็œ‹ใ€‚ๅฎƒไธๅƒ…ๆ˜ฏไธ€ๅผต่ˆŠ็…ง็‰‡๏ผŒๆ›ดๆ˜ฏไธ€ๆฎตๅฏถ่ฒด็š„ๅ›žๆ†ถๅ’Œๅฎถๆ—็ฒพ็ฅž็š„่ฑกๅพตใ€‚ๆˆ‘ๅธŒๆœ›ๆœชไพ†่ƒฝๆœ‰ๆฉŸๆœƒ้‡็พ็•ถๅนด็š„็››ๆณ๏ผŒๅ†ไธ€ๆฌก่šๅœจไธ€่ตท๏ผŒๅˆ†ไบซๅฝผๆญค็š„็”Ÿๆดปๅ’Œๆ•…ไบ‹๏ผŒ่ฎ“้‚ฃไปฝ่ฆชๆƒ…ๅ’Œๅ‹ๆƒ…็นผ็บŒๅœจๆ™‚ๅ…‰ไธญ่”“ๅปถ๏ผŒๆฐธไธๆถˆ้€ใ€‚
Writing letters ่ซ‹ไปฅๅ“็‰Œ็ถ“็†็š„่บซไปฝ๏ผŒ็ตฆๅปฃๅ‘Šๅ…ฌๅธ็š„ๅ‰ตๆ„็ธฝ็›ฃๅฏซไธ€ๅฐไฟก๏ผŒๆๅ‡บๅฐๆ–ผๆ–ฐ็”ขๅ“ๅปฃๅ‘Šๅฎฃๅ‚ณๆดปๅ‹•็š„ๅ‰ตๆ„ๅปบ่ญฐใ€‚ ๅฐŠๆ•ฌ็š„ๅ‰ตๆ„็ธฝ็›ฃ๏ผš

ๆˆ‘ๅฏซ้€™ๅฐไฟกๆ˜ฏ็‚บไบ†ๅ‘ๆ‚จๆๅ‡บไธ€ไบ›้—œๆ–ผๆˆ‘ๅ€‘ๆ–ฐ็”ขๅ“ๅปฃๅ‘Šๅฎฃๅ‚ณๆดปๅ‹•็š„ๅ‰ตๆ„ๅปบ่ญฐใ€‚ๆˆ‘ๅ€‘ๅ…ฌๅธๅณๅฐ‡ๆŽจๅ‡บไธ€ๆฌพๅ…จๆ–ฐ็š„็”ขๅ“๏ผŒ็‚บไบ†็ขบไฟๅฎƒๅœจๅธ‚ๅ ดไธŠๅ–ๅพ—ๆˆๅŠŸ๏ผŒๆˆ‘ๅ€‘ๅธŒๆœ›้€้Žไธ€ๅ€‹ๅผ•ไบบๆณจ็›ฎ็š„ๅปฃๅ‘Šๅฎฃๅ‚ณๆดปๅ‹•ไพ†ๅธๅผ•ๆถˆ่ฒป่€…็š„ๆณจๆ„ใ€‚

ๅœจ่€ƒๆ…ฎๅปฃๅ‘Š็š„ๅ‰ตๆ„ๅ’Œ็ญ–็•ฅๆ™‚๏ผŒๆœ‰ๅนพๅ€‹ๅ› ็ด ้œ€่ฆ็ดๅ…ฅ่€ƒ้‡ใ€‚้ฆ–ๅ…ˆ๏ผŒๆˆ‘ๅ€‘่ฆ้‡ๅฐ็›ฎๆจ™ๅฎข็พค้€ฒ่กŒ็ ”็ฉถ๏ผŒไปฅ็žญ่งฃไป–ๅ€‘็š„้œ€ๆฑ‚ๅ’Œๅ–œๅฅฝใ€‚ๅ…ถๆฌก๏ผŒๆˆ‘ๅ€‘่ฆ็ขบไฟๅปฃๅ‘Šๅ…งๅฎนๅ…ทๆœ‰ๅธๅผ•ๅŠ›ๅ’Œ่ชชๆœๅŠ›๏ผŒ่ƒฝๅผ•่ตท็›ฎๆจ™ๅฎข็พค็š„ๅ…ฑ้ณดใ€‚ๆœ€ๅพŒ๏ผŒๆˆ‘ๅ€‘่ฆๅˆฉ็”จๅ„็จฎๅช’้ซ”ๅนณ่‡บๅ’Œๆ•ธไฝ่กŒ้Šทๅทฅๅ…ท๏ผŒไปฅๆ“ดๅคงๅปฃๅ‘Š็š„ๅฝฑ้ŸฟๅŠ›ใ€‚

ๅŸบๆ–ผ้€™ไบ›ๅ› ็ด ๏ผŒๆˆ‘ๆๅ‡บไปฅไธ‹ๅนพ้ปžๅ‰ตๆ„ๅปบ่ญฐ๏ผš

1. ็”ขๅ“ๅฎšไฝ๏ผšๆˆ‘ๅ€‘ๅฏๅฐ‡้€™ๆฌพๆ–ฐ็”ขๅ“ๅฎšไฝ็‚บ้ซ˜็ซฏใ€็’ฐไฟใ€ๅฅๅบทๅ’Œๆ™‚ๅฐš็š„ไปฃ่กจ๏ผŒๅผท่ชฟๅ…ถ็จๆœ‰็š„ๅŠŸ่ƒฝๅ’Œ็‰น่‰ฒใ€‚ๅœจๅปฃๅ‘Šไธญ๏ผŒๆˆ‘ๅ€‘ๅฏไปฅ้€้Ž็”Ÿๅ‹•็š„่ฆ–่ฆบๆ•ˆๆžœๅ’Œ็ฐกๆฝ”็š„่ชž่จ€ไพ†ๅ‚ณ้”้€™ไบ›็‰น้ปžใ€‚
2. ๆ•…ไบ‹่กŒ้Šท๏ผšๅœจๅปฃๅ‘Šไธญ่ฌ›่ฟฐไธ€ๅ€‹่ˆ‡็”ขๅ“ๅŠŸ่ƒฝ็›ธ้—œ็š„ๅ‹•ไบบๆ•…ไบ‹๏ผŒ่ฎ“ๆถˆ่ฒป่€…่ƒฝๆ›ดๆทฑๅ…ฅๅœฐ็žญ่งฃ็”ขๅ“ๆ‰€ๅธถไพ†็š„็”Ÿๆดป่ฎŠๅŒ–ใ€‚ไพ‹ๅฆ‚๏ผŒๆˆ‘ๅ€‘ๅฏไปฅ่ฌ›่ฟฐไธ€ไฝๅฟ™็ขŒ็š„่ทๆฅญๅฉฆๅฅณ๏ผŒๅฆ‚ไฝ•ไฝฟ็”จๆˆ‘ๅ€‘็š„ๆ–ฐ็”ขๅ“ๅœจๅทฅไฝœๅ’Œ็”Ÿๆดปไธญๅ–ๅพ—ๅนณ่กก็š„ๆ•…ไบ‹ใ€‚
3. ๅไบบๆ•ˆๆ‡‰๏ผš้‚€่ซ‹ไธ€ไฝๅ—ๆญก่ฟŽ็š„ๅ…ฌ็œพไบบ็‰ฉๆˆ–ๆ„่ฆ‹้ ˜่ข–ๆ“”ไปป็”ขๅ“ไปฃ่จ€ไบบ๏ผŒๅˆฉ็”จไป–ๅ€‘็š„ๅฝฑ้ŸฟๅŠ›ไพ†ๆŽจๅปฃ็”ขๅ“ใ€‚้€™ไธๅƒ…ๅฏไปฅๅขžๅŠ ็”ขๅ“ๆ›ๅ…‰ๅบฆ๏ผŒ้‚„่ƒฝ่ฎ“ๆ›ดๅคšๆถˆ่ฒป่€…ไฟกไปปๅ’Œไฟก่ณด็”ขๅ“ใ€‚
4. ่ทจ็•Œๅˆไฝœ๏ผš่ˆ‡ไธ€ไบ›็›ธ้—œ็”ขๆฅญ๏ผˆๅฆ‚ๆ™‚ๅฐšใ€็’ฐไฟใ€ๅฅๅบท็ญ‰๏ผ‰้€ฒ่กŒ่ทจ็•Œๅˆไฝœ๏ผŒๅ…ฑๅŒๆŽจๅ‡บ้™้‡็‰ˆๆˆ–่ฏๅ็”ขๅ“ใ€‚้€™็จฎๅˆไฝœไธๅƒ…่ƒฝๅธๅผ•ไธๅŒ้ ˜ๅŸŸ็š„ๆถˆ่ฒป่€…๏ผŒ้‚„่ƒฝ็‚บ็”ขๅ“ๅธถไพ†ๆ›ดๅคš่ฉฑ้กŒๆ€งๅ’Œๆ”ถ่—ๅƒนๅ€ผใ€‚
5. ๆ•ธไฝ่กŒ้Šท๏ผšๅ……ๅˆ†ๅˆฉ็”จ็คพ็พคๅช’้ซ”ใ€ๆœๅฐ‹ๅผ•ๆ“Žๅ’Œ้›ป้ƒต่กŒ้Šท็ญ‰ๆ•ธไฝๅทฅๅ…ท๏ผŒๅœจๆ•ดๅ€‹ๅปฃๅ‘Šๅฎฃๅ‚ณ้Ž็จ‹ไธญไฟๆŒไธ€่‡ดๆ€ง็š„่จŠๆฏๅ’Œๅฝข่ฑกใ€‚ๆˆ‘ๅ€‘ไนŸๅฏ่ฃฝไฝœไธ€็ณปๅˆ—็Ÿญ่ฆ–้ ปๅ’Œ็ทšไธŠๆดปๅ‹•๏ผŒไปฅๅขžๅŠ ๆถˆ่ฒป่€…ๅƒ่ˆ‡ๅบฆๅ’Œๅ“็‰Œๅฟ ่ช ๅบฆใ€‚
6. ๅ‰ตๆ–ฐๅฝขๅผ๏ผšๅ˜—่ฉฆไฝฟ็”จๆ–ฐ็ฉŽ็š„ๅปฃๅ‘Šๅฝขๅผ๏ผŒๅฆ‚AR๏ผˆๆ“ดๅขžๅฏฆๅขƒ๏ผ‰ใ€VR๏ผˆ่™›ๆ“ฌๅฏฆๅขƒ๏ผ‰ๅ’Œ360ๅบฆๅฝฑ็‰‡็ญ‰๏ผŒ็ตฆๆถˆ่ฒป่€…ๅธถไพ†ๅ…จๆ–ฐ็š„ๅปฃๅ‘Š้ซ”้ฉ—ใ€‚้€™็จฎๅ‰ตๆ–ฐๅฝขๅผไธๅƒ…่ƒฝๅธๅผ•ๆถˆ่ฒป่€…็›ฎๅ…‰๏ผŒ้‚„่ƒฝๆ้ซ˜็”ขๅ“ๅœจๅธ‚ๅ ดไธŠ็š„่พจ่ญ˜ๅบฆใ€‚
7. ๆดปๅ‹•ๅƒ่ˆ‡๏ผšๅœจๅปฃๅ‘Šๅฎฃๅ‚ณๆดปๅ‹•ไธญๅŠ ๅ…ฅไบ’ๅ‹•็’ฐ็ฏ€๏ผŒๅฆ‚็”ขๅ“่ฉฆ็”จใ€ๆŠฝ็Žๅ’Œๅ„ชๆƒ ๅˆธ็ญ‰๏ผŒ้ผ“ๅ‹ตๆถˆ่ฒป่€…็ฉๆฅตๅƒ่ˆ‡ไธฆๅˆ†ไบซไป–ๅ€‘็š„้ซ”้ฉ—ใ€‚้€™ๅฐ‡ๆœ‰ๅŠฉๆ–ผๅฃ็ข‘ๅ‚ณๆ’ญ๏ผŒ้€ฒไธ€ๆญฅๆ“ดๅคง็”ขๅ“ๅฝฑ้ŸฟๅŠ›ใ€‚

็ถœๅˆไปฅไธŠๅปบ่ญฐ๏ผŒๆˆ‘ๅธŒๆœ›ๆˆ‘ๅ€‘่ƒฝ่ฃฝๅฎšๅ‡บไธ€ๅฅ—ๅ…จ้ข่€ŒๅฏŒๆœ‰ๅ‰ตๆ„็š„ๅปฃๅ‘Šๅฎฃๅ‚ณๆดปๅ‹•่จˆๅŠƒ๏ผŒไปฅๆœ€ๅคง้™ๅบฆๅœฐๆๅ‡ๆ–ฐ็”ขๅ“ๅœจๅธ‚ๅ ดไธŠ็š„็ซถ็ˆญๅŠ›ใ€‚ๅฆ‚ๆžœๆ‚จๆœ‰ไปปไฝ•็–‘ๅ•ๆˆ–้œ€่ฆ้€ฒไธ€ๆญฅ่จŽ่ซ–๏ผŒๆญก่ฟŽ้šจๆ™‚่ˆ‡ๆˆ‘่ฏ็ตกใ€‚่ฌ่ฌๆ‚จ็š„้—œๆณจๅ’Œๆ”ฏๆด๏ผ

็ฅๅ•†็ฅบ๏ผ

ๆ‚จ็š„ๅ“็‰Œ็ถ“็†

๏ผˆ่จป๏ผšๆญคไฟกไปถๅฏๆ นๆ“šๅฏฆ้š›ๆƒ…ๆณๅ’Œ็”ขๅ“็‰น้ปž้€ฒ่กŒ่ชฟๆ•ด๏ผ‰
Task: Summarization

User input: ่ซ‹ๅฐ‡้€™็ฏ‡ๆ–‡็ซ ็ฒพ็ฐกๆข็†ๅŒ–:ใ€Œ็”ขๆฅญๅ‰ตๆ–ฐๆขไพ‹็ฌฌ10ๆขไน‹2ๅŠ็ฌฌ72ๆขๆขๆ–‡ไฟฎๆญฃๆกˆใ€ไฟ—็จฑใ€Œๅฐ็‰ˆๆ™ถ็‰‡ๆณ•ใ€,้‡ๅฐๅŠๅฐŽ้ซ”ใ€้›ปๅ‹•่ปŠใ€5G็ญ‰ๆŠ€่ก“ๅ‰ตๆ–ฐไธ”ๅฑ…ๅœ‹้š›ไพ›ๆ‡‰้ˆ้—œ้ตๅœฐไฝๅ…ฌๅธ,ๆไพ›ๆœ€้ซ˜25%็‡Ÿๆ‰€็จ…ๆŠ•ๆŠตๅ„ชๆƒ ,ไผๆฅญ้ฉ็”จ่ฆไปถๅŒ…ๅซ็•ถๅนดๅบฆ็ ”็™ผ่ฒป็”จใ€็ ”็™ผๅฏ†ๅบฆ้”ไธ€ๅฎš่ฆๆจก,ไธ”ๆœ‰ๆ•ˆ็จ…็އ้”ไธ€ๅฎšๆฏ”็އใ€‚
็‚บๅ› ๆ‡‰็ถ“ๆฟŸๅˆไฝœๆšจ็™ผๅฑ•็ต„็น”(OECD)ๅœ‹ๅฎถๆœ€ไฝŽ็จ…่ฒ ๅˆถ่ชฟๆ•ด,ๅ…ถไธญๆœ‰ๆ•ˆ็จ…็އ้–€ๆชป,ๆฐ‘ๅœ‹112ๅนด่จ‚็‚บ12%,113ๅนดๆ–™ๅฐ‡ๆ้ซ˜่‡ณ15%,ไฝ†ไปๅพ—ๅฏฉ้…Œๅœ‹้š›้–“ๆœ€ไฝŽ็จ…่ฒ ๅˆถๅฏฆๆ–ฝๆƒ…ๅฝขใ€‚
็ถ“ๆฟŸ้ƒจๅฎ˜ๅ“ก่กจ็คบ,ๅทฒๅ’Œ่ฒกๆ”ฟ้ƒจๅ”ๅ•†้€ฒๅ…ฅๆœ€ๅพŒ้šŽๆฎต,้™คไผๆฅญ็ ”็™ผๅฏ†ๅบฆ่จ‚ๅœจ6%,็›ฎๅ‰ๅทฒ็ขบ่ช,ไผๆฅญ่ณผ็ฝฎๅ…ˆ้€ฒ่ฃฝ็จ‹็š„่จญๅ‚™ๆŠ•่ณ‡้‡‘้ก้”100ๅ„„ๅ…ƒไปฅไธŠๅฏๆŠตๆธ›ใ€‚
่ฒกๆ”ฟ้ƒจๅฎ˜ๅ“ก่กจ็คบ,็ ”ๅ•†้Ž็จ‹ไธญ,้‡ๅฐๅฐ็ฃ็”ขๆฅญ่ˆ‡ๅ…ถๅœจๅœ‹้š›้–“้กžไผผ็š„ๅ…ฌๅธ้€ฒ่กŒๆทฑๅ…ฅ็ ”็ฉถ,ๅœจ่จญๅ‚™้ƒจๅˆ†,็•ข็ซŸ้ฉ็”จ็”ขๅ‰ต10ไน‹2็š„ๆฅญ่€…ๆ˜ฏไปฃ่กจๅฐ็ฃ้šŠๆ‰“ใ€Œๅœ‹้š›็›ƒใ€,ๆŠ•ๅ…ฅ้‡‘้กไธ้”100ๅ„„ๅ…ƒ,ๅฏ่ƒฝไนŸๆ‰“ไธไบ†ใ€‚
่‡ณๆ–ผๅ‚™ๅ—้—œๆณจ็š„็ ”็™ผ่ฒป็”จ้–€ๆชป,็ถ“ๆฟŸ้ƒจๅฎ˜ๅ“ก่กจ็คบ,ๆญท็ถ“่ˆ‡่ฒกๆ”ฟ้ƒจไพ†ๅ›žๅฏ†ๅˆ‡่จŽ่ซ–,็ ”็™ผ่ฒป็”จ้–€ๆชปๆœ‰ๆœ›่ฝๅœจ60ๅ„„่‡ณ70ๅ„„ๅ…ƒไน‹้–“ใ€‚
่ฒกๆ”ฟ้ƒจๅฎ˜ๅ“กๆŒ‡ๅ‡บ,็ ”็™ผๆ”ธ้—œๅฐ็ฃๆœชไพ†็ถ“ๆฟŸๆˆ้•ทๅ‹•่ƒฝ,้–€ๆชปไธ่ƒฝใ€Œ้ซ˜ไธๅฏๆ”€ใ€,่ตทๅˆ้›–่จญๅฎšๅœจ100ๅ„„ๅ…ƒ,ไน‹ๆ‰€ไปฅๆœƒ่ชฟ้™,ๆญฃๆ˜ฏ็›ผ่ฎ“ไผๆฅญ่ฆบๅพ—ๆœ‰่พฆๆณ•้”ๅพ—ๅˆฐ้–€ๆชปใ€้€ฒ่€Œ้ฉ็”จ็งŸ็จ…ๅ„ชๆƒ ,ๆ‰ๆœ‰ๅ‹•ๅŠ›็นผ็บŒๆŠ•ๅ…ฅ็ ”็™ผ,็ถญๆŒๅœ‹้š›ไพ›ๆ‡‰้ˆ้—œ้ตๅœฐไฝใ€‚
็ถ“ๆฟŸ้ƒจๅฎ˜ๅ“ก่กจ็คบ,ๅ› ๅป ๅ•†็ ”็™ผ่ฒป็”จๅนณๅ‡็‚บ30ใ€40ๅ„„ๅ…ƒ,ๅ…ถไธญ,IC่จญ่จˆๆฅญ่€…ไป‹ๆ–ผ30ๅ„„่‡ณ60ๅ„„ๅ…ƒ็ฏ„ๅœ,่‹ฅๅฐ‡้–€ๆชป่จ‚ๅœจ100ๅ„„ๅ…ƒ,็ฌฆๅˆๆขไปถ็š„ๆฅญ่€…่ผƒๅฐ‘ใ€ๅˆบๆฟ€่ช˜ๅ› ไธ่ถณ;ๆญคๅค–,่‹ฅ็ฌฆๅˆ็”ณ่ซ‹้–€ๆชป็š„ๆฅญ่€…ๅขžๅŠ ,ๅฐ‡ๅฏๆ้ซ˜ไผๆฅญๅœจๅฐๆŠ•่ณ‡้‡‘้ก,่ฒกๆ”ฟ้ƒจ็จ…ๆ”ถไนŸ่ƒฝๅ› ๆญค็ฒๅพ—ๆŒนๆณจใ€‚
IC่จญ่จˆๆฅญ่€…่ฟ‘ๆ—ฅ้ ป้ ป้‡ๅฐ็”ขๅ‰ต10ไน‹2็™ผ่ฒ,ๅธŒๆœ›้™ไฝŽ้ฉ็”จ้–€ๆชป,ๅŠ ไธŠๅ„ๅœ‹ๅŠ›ๆ‹šไพ›ๆ‡‰้ˆ่‡ชไธปๅŒ–ใ€ๅŠ ็ขผ่ฃœๅŠฉๅŠๅฐŽ้ซ”็”ขๆฅญ,็ถ“ๆฟŸ้ƒจๅฎ˜ๅ“ก่กจ็คบ,็ถ“ๆฟŸ้ƒจๅ’Œ่ฒกๆ”ฟ้ƒจๅฐฑ็”ขๅ‰ต10ไน‹2้”ๆˆๅ…ฑ่ญ˜,็ˆญๅ–่ฎ“ๆ›ดๅคšๆฅญ่€…ๅ—ๆƒ ,็›ผๅขžๅผทไผๆฅญๆŠ•่ณ‡ๅŠ›้“ๅŠ้žๅ›บๅฐ็ฃๆŠ€่ก“ๅœฐไฝใ€‚
่ฒกๆ”ฟ้ƒจๅฎ˜ๅ“ก่กจ็คบ,็งŸ็จ…็Žๅ‹ต็š„ๅˆถๅฎšๅฟ…้ ˆใ€Œๆœ‰็‚บๆœ‰ๅฎˆใ€,ไธฆไปฅ้”ๅˆฐ็Žๅ‹ต่จญ็ฝฎ็›ฎ็š„็‚บๆœ€้ซ˜ๅŽŸๅ‰‡,็พ้šŽๆฎตๅœจๆ‰“ใ€Œๅœ‹ๅ…ง็›ƒใ€็š„ไผๆฅญไปๅฏ้ฉ็”จ็”ขๅ‰ต็ฌฌ10ๆขใ€10ไน‹1็š„็งŸ็จ…ๅ„ชๆƒ ,ๅ…ฑๅŒๅฃฏๅคงๅฐ็ฃ็ถ“ๆฟŸ็™ผๅฑ•ใ€‚
็ถ“ๆฟŸ้ƒจๅ’Œ่ฒกๆ”ฟ้ƒจๆญฃๅฐฑ็ ”็™ผ่ฒป็”จ้–€ๆชปๅšๆœ€ๅพŒ็ขบ่ช,ๅพ…ไปŠๆ˜Žๅ…ฉๅคฉ้ ๅ‘Šๅญๆณ•ไน‹ๅพŒ,็ด„ๆœ‰30ๅคฉๆ™‚้–“,ๅฏ่ˆ‡ๆฅญ็•Œ้€ฒไธ€ๆญฅ่จŽ่ซ–ๅŠ่ชฟๆ•ด,็›ผ็”ขๅ‰ต10ไน‹2่ƒฝๅœจ6ๆœˆไธŠ่ทฏใ€‚

Model output:
ๅฐ็ฃใ€Œ็”ขๆฅญๅ‰ตๆ–ฐๆขไพ‹ใ€็š„ไฟฎๆญฃๆกˆ๏ผŒๅˆ็จฑใ€Œๅฐ็‰ˆๆ™ถ็‰‡ๆณ•ใ€,ๆ—จๅœจๆ”ฏๆŒๅŠๅฐŽ้ซ”ใ€้›ปๅ‹•่ปŠๅŠ5GๆŠ€่ก“้ ˜ๅ…ˆไผๆฅญ็š„ๅ‰ตๆ–ฐใ€‚ๆ–ฐๆ”ฟ็ญ–ๆไพ›ๆœ€้ซ˜25%็š„ๆ‰€ๅพ—็จ…ๆŠตๅ…๏ผŒไผๆฅญ้œ€ๆปฟ่ถณ็•ถๅนดๅบฆ็ ”็™ผ่ฒป็”จ้”ไธ€ๅฎš่ฆๆจกใ€็ ”็™ผๅฏ†ๅบฆ่ถ…้Ž6%ไปฅๅŠๆœ‰ๆ•ˆ็จ…็އ้”ไธ€ๅฎšๆฏ”็އ็ญ‰ๆขไปถใ€‚ๅ› ๆ‡‰ๅœ‹้š›้–“ๅฐไผๆฅญๆœ€ไฝŽ็จ…่ฒ ๅˆถๅบฆ็š„่ชฟๆ•ด๏ผŒๅฐ็ฃ็š„้–€ๆชป้ ่จˆๅˆ†ๅˆฅๆ–ผ2023ๅนดๅ’Œ2024ๅนด้”ๅˆฐ12%ๅ’Œ15%ใ€‚็›ฎๅ‰๏ผŒ่ˆ‡่ฒกๆ”ฟ้ƒจ่จŽ่ซ–ๅพŒ๏ผŒ็ขบๅฎš100ๅ„„ๅ…ƒ็š„่จญๅ‚™ๆŠ•่ณ‡ๆŠตๅ…ๆ˜ฏๅฏ่กŒ็š„่ช˜ๅ› ๏ผŒไธ”็ ”็™ผ่ฒป็”จ้–€ๆชป้ ่จˆๆœƒๅœจ60่‡ณ70ๅ„„ๅ…ƒไน‹้–“ใ€‚ๅฐๆ–ผ้™ไฝŽ้–€ๆชป็š„่ฆๆฑ‚๏ผŒ็ถ“ๆฟŸ้ƒจๅ’Œ่ฒกๆ”ฟ้ƒจๅทฒ้”ๆˆๅ…ฑ่ญ˜๏ผŒๆ—จๅœจๆ“ดๅคงๅ—ๆƒ ็ฏ„ๅœไธฆ้ผ“ๅ‹ตไผๆฅญๆŠ•่ณ‡ไปฅ็ถญๆŒไพ›ๆ‡‰้ˆ็š„้—œ้ตๅœฐไฝใ€‚ๆญคๅค–๏ผŒ้‡ๅฐIC่จญ่จˆๆฅญ่€…็š„่จดๆฑ‚๏ผŒไนŸ่€ƒๆ…ฎ่ชฟ้™้–€ๆชปใ€‚้ ่จˆๆญคๆณ•ๆกˆๅฐ‡ๅœจ่ฟ‘ๆœŸๅ…งๅฏฆๆ–ฝ๏ผŒ็›ธ้—œ่จŽ่ซ–ๅ’Œ่ชฟๆ•ดๅฐ‡ๅœจ้ ๅ‘ŠๆœŸๅพŒ้€ฒไธ€ๆญฅ้€ฒ่กŒใ€‚
Translation ไปฅไธ‹ๆไพ›่‹ฑๆ–‡ๅ…งๅฎน๏ผŒ่ซ‹ๅนซๆˆ‘็ฟป่ญฏๆˆไธญๆ–‡ใ€‚Dongshan coffee is famous for its unique position, and the constant refinement of production methods. The flavor is admired by many caffeine afficionados. ๆฑๅฑฑๅ’–ๅ•กไปฅๅพ—ๅคฉ็จๅŽš็š„ๅœฐไฝ๏ผŒไธๆ–ท็ฒพ็ ”็š„่ฃฝไฝœๅทฅ่—่‘—็จฑ๏ผŒ้ขจๅ‘ณ็ฒ่ซธๅคšๅ’–ๅ•กๆ„›ๅฅฝ่€…ๆŽจๅด‡ใ€‚

Training methods

  • Software / hardware spec
    • GPU: H100
    • Training Framework: PyTorch
  • Data preprocessing
    • Character normalization
    • Deduplication
    • Denoise
      • Html tagใ€javascript in web content
      • Non-standard characters or garbage characters
      • Posts with an insufficient number of characters
      • Removing specific formats such as extra line breaks added for formatting purposes
    • Removing personal information such as emails and phone numbers.
    • Removing inappropriate content such as gambling and pornography.
  • Character and word expansion
    • To enhance Traditional Chinese input and output performance, the expanded vocabulary draws on the following two sources:
      • Obtaining Chinese characters from the Ministry of Education's "Variant Chinese Characters Dictionary" and "Corrected Characters Table".
      • Collecting over 5,000,000 sentences with more than 100 characters each from the Traditional Chinese Wikipedia, news articles, and the Chinese Common Crawl data (2.1G), used to train the tokenizer for Chinese characters and words.
  • Continuous pretraining (CP)
    • Supplementing the model with a large amount of reliable Traditional Chinese knowledge.
    • Hyper parameters
      • optimizer: AdamW
      • learning rate: 1e-4
      • batch size: 1M tokens
      • epoch: 1
  • Fine-tuning (FT)
    • Enabling the model to answer questions in Traditional Chinese.
    • Hyper parameters
      • optimizer: AdamW
      • learning rate: 5e-5
      • batch size: 256K tokens
      • epoch: 3
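The PII-removal step in the preprocessing pipeline above can be sketched with regular expressions. The exact patterns TAIDE uses are not published; the regexes below are a minimal, assumed sketch that masks email addresses and Taiwanese-style phone numbers, and a production pipeline would need much broader rules.

```python
import re

# Illustrative PII scrubbing only: TAIDE's actual patterns are not published.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"(?:\+?886[- ]?|0)\d{1,2}[- ]?\d{3,4}[- ]?\d{4}")

def scrub(text):
    """Replace emails and phone numbers with placeholder tags."""
    text = EMAIL.sub("<EMAIL>", text)
    text = PHONE.sub("<PHONE>", text)
    return text

print(scrub("Contact me at user@example.com or 02-2712-3456."))
```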

Training Data

  • Continuous pre-training data (about 140 GB)
    • Litigation Data: Civil litigation data from various levels of courts in judicial rulings, covering 2013/01 to 2023/12.
    • CNA news: Daily news articles from June 1993 to June 2023, spanning 30 years. The content covers domains such as domestic and international politics, society, economy, culture, education, and lifestyle.
    • ETtoday news: ETtoday news data from 2011/10 to 2023/12.
    • Legislative Yuan Gazette: Data from the 1st session of the 8th term to the 7th session of the 10th term.
    • Publisher Website Book Introduction: Book introduction data from the websites of the SunColor and Gotop publishers.
    • Abstracts of GRB research projects: GRB is an information system that compiles research projects funded by government grants and their outcome reports. This dataset primarily includes research project abstracts from 1993 to 2023, in both Chinese and English.
    • Academic conference proceedings abstracts: Abstracts of academic conference proceedings held in Taiwan from 1988 to 2009.
    • Taiwan Panorama magazine: Articles from July 1993 to June 2023, spanning 30 years, focusing on Taiwanese culture, tourism, and local customs.
    • ๆจ‚่ฉž็ถฒ: Covers approximately 187,000 academic terms in the humanities and social sciences, along with their translations.
    • Data from various ministries and commissions: Partial data from government department websites, such as the Executive Yuan's "National Overview", the Ministry of Culture's "National Cultural Memory Bank", the National Development Council's "Archives Support Teaching Network", and the Ministry of Transportation's "Traffic Safety Portal".
    • Business Today: Business Today is a weekly magazine focused on finance. The dataset includes articles from 2008/01 to 2023/07.
    • Mandarin and idiom dictionaries from the Ministry of Education:
      • Idiom Dictionary: 5,338 idioms with definitions, original stories, usage explanations, and example sentences.
      • Revised Mandarin Dictionary: Chinese words and vocabulary with pronunciation, radicals, definitions, and other information, totaling approximately 165,539 entries.
      • Concise Mandarin Dictionary: A condensed version of the Revised Mandarin Dictionary, containing 45,247 entries.
    • SCITechVista: Science news and popular science articles from the SCITechVista website.
    • iKnow: The iKnow platform provides information on market trends, strategic analysis, patent knowledge, and technology transactions for the Taiwanese and global technology industries. The dataset covers 2005/01 to 2023/07.
    • Science Development Monthly Magazine: A popular science publication by the National Science Council (NSC) to promote science education, with articles from 2004/10 to 2020/12. In 2021, the magazine was relaunched as the "CharmingSCITech" quarterly, covering international technology issues.
    • Legislation Database: The latest central regulations, rules, draft bills, and local regulations issued by government agencies as of 2023/10.
    • Local Government Tourism Websites: Partial data from the tourism websites of Taiwan's county and city governments.
    • Curriculum Guidelines from the National Institute of Education: Curriculum guidelines for different subjects at various levels of education.
    • CNA's English and Chinese Name Translation Database: The Central News Agency (CNA) database of translations of foreign and Chinese surnames, personal names, organizations, and place names used in news.
    • Fairy tales: 20 fairy tale books, including "Tom Sawyer", "Peter Pan", "Alice's Adventures in Wonderland", "Daddy-Long-Legs", and more.
    • RedPajama-Data-V2: English data extracted from the RedPajama-Data-v2 multilingual dataset.
    • MathPile-commercial: A mathematics-focused dataset obtained from MathPile-commercial.
    • Traditional Chinese Wikipedia Articles: All Traditional Chinese Wikipedia articles as of January 2023.
    • github-code-clean: An open-source GitHub code dataset, with unlicensed code and documents removed.
  • Fine-tuning data
    • The TAIDE team uses LLaMA2-series models to generate the fine-tuning data: single- and multi-turn conversations on topics such as world knowledge, creative writing, general knowledge, translation, summarization, programming, and Taiwanese values. The fine-tuning data consist of 128K prompt-response pairs and will be released publicly later.
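A multi-turn fine-tuning example of the kind described above might be stored as one JSON object per line (JSONL). TAIDE has not published the exact schema of its 128K prompt-response pairs, so the field names below are assumptions for illustration only.

```python
import json

# Hypothetical record shape for one multi-turn fine-tuning example.
# The actual TAIDE schema is unpublished; these field names are assumptions.
record = {
    "topic": "translation",
    "conversations": [
        {"role": "user", "content": "่ซ‹ๅฐ‡ไปฅไธ‹่‹ฑๆ–‡็ฟป่ญฏๆˆไธญๆ–‡ใ€‚Hello!"},
        {"role": "assistant", "content": "ไฝ ๅฅฝ๏ผ"},
    ],
}

line = json.dumps(record, ensure_ascii=False)  # one JSONL line per example
assert json.loads(line) == record              # round-trips losslessly
print(line)
```

`ensure_ascii=False` keeps the Traditional Chinese text human-readable in the stored file instead of escaping it to `\uXXXX` sequences.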

Evaluation

  • taide-bench
    • Data
      • Tasks include writing articles, writing letters, summarizing articles, translating from English to Traditional Chinese, and translating from Traditional Chinese to English, for a total of 500 questions.
      • data link: taide-bench
    • Evaluation method
    • Scores
      Model | ZH→EN translation | EN→ZH translation | Summarization | Writing articles | Writing letters | Average
      TAIDE-LX-7B-Chat | 7.165 | 7.685 | 7.720 | 9.635 | 9.110 | 8.263
      GPT3.5 | 8.880 | 8.810 | 7.450 | 9.490 | 8.750 | 8.676
      LLAMA2 7B | 6.075 | 4.475 | 5.905 | 2.625 | 3.040 | 4.424
      LLAMA2 13B | 6.480 | 6.135 | 6.110 | 2.565 | 3.000 | 4.858
      LLAMA2 70B | 6.975 | 6.375 | 6.795 | 2.625 | 2.990 | 5.152
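The Average column above is the plain mean of the five task scores, which can be checked directly:

```python
# Sanity-check: the reported Average is the mean of the five task scores
# (ZH→EN, EN→ZH, summarization, articles, letters) for each model.
scores = {
    "TAIDE-LX-7B-Chat": [7.165, 7.685, 7.720, 9.635, 9.110],
    "GPT3.5":           [8.880, 8.810, 7.450, 9.490, 8.750],
    "LLAMA2 7B":        [6.075, 4.475, 5.905, 2.625, 3.040],
    "LLAMA2 13B":       [6.480, 6.135, 6.110, 2.565, 3.000],
    "LLAMA2 70B":       [6.975, 6.375, 6.795, 2.625, 2.990],
}
for model, s in scores.items():
    print(f"{model}: {sum(s) / len(s):.3f}")
```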

License

Disclaimer

  • Due to limitations in its design architecture and unavoidable biases in the training data, the model's responses do not represent the stance of TAIDE. Additional safety measures should be implemented before use, and responses may contain incorrect information; users are advised not to fully trust them.

Development Team

Useful links

GGUF

  • Model size: 7B params
  • Architecture: llama

Available quantizations: 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, 8-bit
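A rough rule of thumb for the download size of each quantization is params × bits / 8. Actual GGUF files are somewhat larger, since llama.cpp quantization schemes mix precisions and store scales and metadata, so treat these as lower-bound estimates:

```python
# Rough GGUF size estimate for a 7B-parameter model at each offered bit
# width. Lower-bound only: real files also carry scales and metadata.
params = 7e9
est = {bits: params * bits / 8 / 1e9 for bits in (2, 3, 4, 5, 6, 8)}
for bits, gb in est.items():
    print(f"{bits}-bit: ~{gb:.2f} GB")
```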


Model tree for QuantFactory/TAIDE-LX-7B-GGUF
