๐ŸงŽโ€โ™€๏ธACL 2023 Tutorial: Retrieval-based Language Models and Applications

Akari Asai, Sewon Min, Zexuan Zhong, Danqi Chen https://acl2023-retrieval-lm.github.io/

These notes summarize some of the key points of this tutorial.

Speakers' note: this tutorial covers a cutting-edge topic. Compared with parametric LLMs, we are still far from understanding how best to develop retrieval-based LMs. The tutorial mainly shares:

  • A taxonomy of existing research and key insights

  • Their views on current challenges and open questions

1. Introduction

1. What are retrieval-based language models (LMs)?

Retrieval-based LMs = Retrieval + LMs: language models that retrieve from an external datastore (at least during inference).

Such models are also referred to as semiparametric and non-parametric models.

2. The age of large language models (LLMs): an overview of the characteristics of current LLMs

  • Transformers-based, fully parametric

  • Trained on next-token prediction tasks (+ RLHF)

  • Model size ↑, data size ↑

3. Retrieval for knowledge-intensive NLP tasks

Representative tasks: open-domain QA, fact-checking, entity linking...

This line of work has driven a large body of research on better dense-retrieval algorithms, e.g., DPR, ColBERT, ANCE, Contriever, ...

4. Why retrieval-based LMs?

  • LLMs can't memorize all (long-tail) knowledge in their parameters.

  • LLMs' knowledge is easily outdated and hard to update. Existing knowledge-editing methods are still not scalable (a research direction!), whereas a datastore can be easily updated and expanded, even without retraining the model.

  • LLMs' output is challenging to interpret and verify. Attributing knowledge to retrieved sources gives better interpretability and control (generating text with citations, like the new Bing).

  • LLMs are shown to easily leak private training data. Private data could instead be kept in a datastore and used for personalization (rather than being trained into model parameters?).

  • LLMs are large and expensive to train and run, whereas a datastore can be queried at inference time, potentially reducing model size and cost. Long-term goal: can we possibly reduce the training and inference costs, and scale down the size of LLMs?

2. Definition & Preliminaries

1. A retrieval-based LM: definition — a language model (LM) that uses an external datastore at test time.

2. A language model (LM): Categories

A question here: why have decoder-only models become almost the mainstream architecture for today's LLMs?

ๅ‚่€ƒๅšๅฎข๏ผš

https://kexue.fm/archives/9529

https://www.zhihu.com/question/588325646

ไธป่ฆ่ง‚็‚น: ไปปไฝ•NLPไปปๅŠก้ƒฝๅฏไปฅๅˆ†่งฃไธบโ€œ่พ“ๅ…ฅโ€่ทŸโ€œ่พ“ๅ‡บโ€ไธค้ƒจๅˆ†๏ผŒๆˆ‘ไปฌๅฏไปฅๆŠŠๅค„็†โ€œ่พ“ๅ…ฅโ€็š„ๆจกๅž‹ๅซๅšEncoder๏ผŒ็”Ÿๆˆโ€œ่พ“ๅ‡บโ€็š„ๆจกๅž‹ๅซๅšDecoder๏ผŒ้‚ฃไนˆๆ‰€ๆœ‰ไปปๅŠก้ƒฝๅฏไปฅไปŽโ€œEncoder-Decoderโ€็š„่ง†่ง’ๆฅ็†่งฃ๏ผŒ่€ŒไธๅŒๆจกๅž‹ไน‹้—ด็š„ๅทฎ่ทๅœจไบŽEncoderใ€Decoder็š„ๆณจๆ„ๅŠ›ๆจกๅผไปฅๅŠๆ˜ฏๅฆๅ…ฑไบซๅ‚ๆ•ฐ,ๆฏ”ๅฆ‚:

| Model | Encoder attention | Decoder attention | Shared parameters |
| --- | --- | --- | --- |
| GPT | unidirectional | unidirectional | yes |
| UniLM | bidirectional | unidirectional | yes |
| T5 | bidirectional | unidirectional | no |

Further reading:

Transformerๅ‡็บงไน‹่ทฏ

Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth

3. A language model (LM): Prompting

ๅณ้€š่ฟ‡ไธๅŒ็š„prompt่ฎฉllmๅฎŒๆˆไธๅŒ็š„ไปปๅŠก

4. A language model (LM): Often evaluated with

Evaluation metrics: 1. Perplexity; 2. Downstream accuracy (zero-shot or few-shot in-context learning, or fine-tuning). Covered in detail in Section 5.

A question: why use perplexity as the main metric in this tutorial?

"Perplexity (PPL) is often used when comparing parametric language models. Whether improvements in perplexity transfer to downstream applications is still a research question, but studies have shown that perplexity correlates well with downstream tasks (especially generation tasks), and perplexity usually provides very stable results: it can be evaluated on large-scale unlabeled evaluation data (whereas downstream tasks may suffer from prompt sensitivity and a lack of large-scale labeled data, leading to unstable results)."
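Perplexity itself is just the exponentiated average negative log-likelihood per token. A minimal sketch (the function name is illustrative):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# A model that assigns probability 0.25 to every token has perplexity 4:
# it is as "confused" as a uniform choice among 4 options.
ppl = perplexity([math.log(0.25)] * 10)
```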

5. Inference: Datastore

6. Inference: Index

Goal: find the small subset of elements in the datastore that are most similar to the query.

sim: a similarity score between two pieces of text. Below are some examples of similarity scores.

index: given a query, return the k elements with the largest sim via fast nearest-neighbor search (itself a research direction: how to make the search faster and more accurate).

Related software: FAISS, Distributed FAISS, ScaNN, etc.

ๅ‚่€ƒ๏ผšFaiss

3. Retrieval-based LM: Architecture

1. Categorization of retrieval-based LMs


2. Roadmap

ๆ นๆฎ ๆฃ€็ดขไป€ไนˆ๏ผŒๅฆ‚ไฝ•ไฝฟ็”จๆฃ€็ดข๏ผŒๅœจไป€ไนˆๆ—ถๅ€™ๆฃ€็ดขๅฐ†ๆœ€่ฟ‘็š„็ ”็ฉถๆ€ป็ป“ๅฑ•็คบๅœจไธ‹้ข็š„่ทฏ็บฟๅ›พ๏ผš


This section starts with the first architecture: REALM (Retrieval-Augmented Language Model Pre-Training).

Some reading notes from Zhihu:

Motivation: pre-trained language models learn a great deal of world knowledge from unsupervised text corpora. However, storing this knowledge in parameters has two drawbacks: 1. the knowledge is implicit, making it hard to interpret what the model stores and uses; 2. the amount of knowledge a model can learn scales with model size (parameter count), so learning more knowledge requires a larger model.

Pre-training procedure: 1. sample $x$ from the pre-training corpus and mask some of its tokens ("the [MASK] at the top of the pyramid"); 2. via the retrieval module, retrieve a document $z$ from an external knowledge base (e.g., Wikipedia) that can help recover the masked token ("The pyramidion on top allows for less material higher up the pyramid"); 3. use the information in the sample $x$ together with the retrieved document $z$ to jointly predict the masked token ("pyramidion").

Model structure: both pre-training and fine-tuning are modeled as a retrieve-then-predict process. The authors treat the retrieved document $z$ as a latent variable and model the task objective $p(y|x)$ by marginalizing over all candidate documents $z$:

$p(y|x)=\sum_{z\in\mathcal{Z}} p(y|z,x)\,p(z|x)$

Two components: the neural knowledge retriever, which models $p(z|x)$, and the knowledge-augmented encoder, which models $p(y|z,x)$.
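The marginalization above can be sketched in a few lines: a softmax over retriever scores gives $p(z|x)$, and the reader's per-document answer probabilities are averaged under it. All numbers below are toy values and the function names are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def marginal_answer_prob(retriever_scores, answer_prob_given_doc):
    """p(y|x) = sum_z p(y|z,x) p(z|x): retriever scores define p(z|x) via
    softmax, and each document z contributes its reader probability
    p(y|z,x) weighted by how likely z was to be retrieved."""
    p_z = softmax(np.asarray(retriever_scores, dtype=float))
    return float(np.dot(p_z, answer_prob_given_doc))

# 3 candidate documents; the reader is confident only given the second one.
p = marginal_answer_prob([2.0, 5.0, 1.0], [0.1, 0.9, 0.2])
```

Because the sum is differentiable in the retriever scores, gradients flow into the retriever whenever a document raises the answer probability — the mechanism that trains REALM's retriever end to end.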

During pre-training the task is MLM; during fine-tuning the task is open-domain QA.

Training details: to cope with the large corpus, pre-training uses Maximum Inner Product Search (MIPS, i.e., kNN in inner-product space) to find the top-k most relevant documents. Refreshing the MIPS index constantly would be prohibitively slow, so the index is refreshed only every several steps (the index is used only to select the top-k documents; each gradient step still uses the latest retriever parameters). During fine-tuning, the MIPS index is built once at the start (with the pre-trained retriever parameters) and never updated. The authors argue the retriever has already learned good document-relevance representations during pre-training, though they note that iteratively refreshing the index during fine-tuning as well might work even better.

Tricks: 1. Salient span masking (SSM): during MLM pre-training, mask salient entities/numbers instead of random tokens; 2. null document: some MLM examples need no external document; 3. avoiding information leakage: when the MLM training corpus overlaps with the retrieval corpus, avoid retrieving the original text of sample x itself; 4. retriever initialization (cold-start): with a randomly initialized retriever, the retrieved documents would most likely be completely irrelevant and the model would get no useful gradient, so the authors warm-start the retriever with the Inverse Cloze Task (ICT).

Summary of related work: REALM and subsequent work

  • REALM (Guu et al 2020): MLM followed by fine-tuning, focusing on open-domain QA

  • DPR (Karpukhin et al 2020): Pipeline training instead of joint training, focusing on open-domain QA (no explicit language modeling)

  • RAG (Lewis et al 2020): "Generative" instead of "masked language modeling", focusing on open-domain QA & knowledge-intensive tasks (no explicit language modeling)

  • Atlas (Izacard et al 2022): Combine RAG with retrieval-based language model pre-training based on the encoder-decoder architecture (more to come in Section 4), focusing on open-domain QA & knowledge-intensive tasks

  • Papers that follow this approach focusing on LM perplexity have come out quite recently (Shi et al. 2023, Ram et al. 2023): Ram et al. 2023, "In-Context Retrieval-Augmented Language Models"; Shi et al. 2023, "REPLUG: Retrieval-Augmented Black-Box Language Models"

Retrieval-in-context LM

Related paper:

In-Context Retrieval-Augmented Language Models

ๅœจไธŠ้ข่ฟ™็ฏ‡่ฎบๆ–‡ไธญๆœ‰ไธ€ไบ›ๅฎž้ชŒ็ป“่ฎบ:1. Retrieval helps overall sizes of LMs 2. A shorter prefix (more recent tokens) as a query helps 3. Retrieving more frequently helps(ไฝ†ๆ˜ฏไผšๆถˆ่€—ๆ›ดๅคš็š„ๆŽจ็†ๆ—ถ้—ดๆˆๆœฌ)

REPLUG: Retrieval-Augmented Black-Box Language Models


"Incorporation in the 'intermediate layer' instead of the 'input' layer → designed for many chunks, frequently, more efficiently"

RETRO (Retrieval-Enhanced Transformer) — improving language models through explicit memory at unprecedented scale

ๅˆๅนถๅˆฐไธญ้—ดๅฑ‚่€Œไธๆ˜ฏ่พ“ๅ…ฅๅฑ‚ + ๆ•ฐๆฎ่ง„ๆจก็š„ๅขžๅŠ 

Related notes:

https://zhuanlan.zhihu.com/p/475346411

https://www.cnblogs.com/Matrix_Yao/p/16480698.html


Motivation: as model parameters ↑ and training data ↑, a series of problems arises, such as datasets becoming hard to interpret and model bias increasing. To address this, the DeepMind team developed an efficient pre-trained model with internet-scale retrieval. With RETRO, the model is not limited to the data seen during training: it can also access the entire training dataset through the retrieval mechanism. Compared with a standard Transformer with the same number of parameters, this brings significant performance gains.

ๆ•ฐๆฎ้›†๏ผšMassiveTextๆ•ฐๆฎ้›†(ๆฅ่‡ชgopherๆจกๅž‹่ฎบๆ–‡)

ๆๅ‡บไบ†ไธ€็ง้ฟๅ…ๆ•ฐๆณ„้œฒ็š„ๆ–นๆณ•๏ผšๆฃ€็ดข็š„่ฟ‡็จ‹ๅฐฑ่ƒฝ็›ดๆŽฅ่ฎฟ้—ฎ่ฎญ็ปƒ้›†ๆ‰€ไปฅ้˜ฒๆญขๆ•ฐๆฎๆณ„้œฒๅพˆ้‡่ฆ-ไธบๆญค่ฎบๆ–‡ไฝœ่€…ๆๅ‡บไบ†ไธ€็ง่กก้‡ๆต‹่ฏ•ๆ–‡ๆกฃไธŽ่ฎญ็ปƒ้›†ๆŽฅ่ฟ‘็จ‹ๅบฆ็š„่ฏ„ไผฐๆ–นๅผDeduplicating Training Data Makes Language Models Better

Model structure: the RETRO architecture consists of an encoder stack (processing the retrieved neighbors) and a decoder stack (processing the input). The encoder stack is made of standard Transformer encoder blocks; the decoder stack interleaves standard Transformer decoder blocks with RETRO decoder blocks (ATTN + chunked cross-attention (CCA) + feed-forward neural network (FFNN)).
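The chunk-wise retrieval at the heart of RETRO can be sketched as follows. The embedding function and datastore here are toy stand-ins (bag-of-token counts) for RETRO's frozen-BERT embeddings and ScaNN index; the chunk size is 64 in the paper, shrunk here for the toy example:

```python
import numpy as np

def split_into_chunks(tokens, chunk_size):
    """RETRO splits the input sequence into fixed-size chunks; each chunk
    gets its own retrieved neighbors."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

def neighbors_per_chunk(chunks, datastore_embs, embed, k=2):
    """For every input chunk, embed it and fetch the k nearest datastore
    chunks by inner product (a stand-in for the frozen-encoder + ANN setup)."""
    out = []
    for chunk in chunks:
        q = embed(chunk)
        scores = datastore_embs @ q
        out.append(np.argsort(-scores)[:k].tolist())
    return out

V = 10                                          # toy vocabulary size
embed = lambda chunk: np.bincount(chunk, minlength=V).astype(float)
store = [[1, 1, 2], [7, 8, 9]]                  # two datastore chunks
store_embs = np.stack([embed(c) for c in store])
chunks = split_into_chunks([1, 2, 1, 7, 8, 9], chunk_size=3)
nbrs = neighbors_per_chunk(chunks, store_embs, embed, k=1)
```

In the full model, the encoder stack encodes each chunk's neighbors and the decoder attends to them through chunked cross-attention, so retrieval cost grows with the number of chunks rather than the number of tokens.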


Simplified pipeline:

Comparison:

ๆ€่€ƒ๏ผš้™คไบ†ๆฃ€็ดขsplitๆˆchunks๏ผŒ่ฟ˜ๅฏไปฅๆ€Žไนˆๅค„็†dbไธญ็š„ๆ•ฐๆฎ๏ผŸ

↓

ๆๅ‡บkNN-LMs๏ผŒๆŠŠ่ฏญไน‰็ผ–็ ็‰นๅพๅ‘้‡็š„kๆœ€่ฟ‘้‚ปๅ’Œไธ€่ˆฌ็š„่ฏญ่จ€ๆจกๅž‹็ป“ๅˆไปŽ่€Œๆ˜พ่‘—ๆ้ซ˜่ฏญ่จ€ๆจกๅž‹็š„ๆ•ˆๆžœ

  • "A different way of using retrieval, where the LM outputs a nonparametric distribution over every token in the data."

  • "Can be seen as an incorporation in the 'output' layer"

Motivation: a language model (LM) assigns a probability to a sentence via the chain rule, which requires solving two problems: (1) computing a representation of the context; (2) predicting the next token from that representation. Both are typically handled by a single autoregressive model. A common problem with AR language modeling is the difficulty of fully capturing long-range dependencies. The paper therefore proposes combining the language model with a k-nearest-neighbor lookup over context representations to better capture the semantic relations between contexts.

Model structure:

See the slides for the detailed procedure; they explain it clearly.
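The core of kNN-LM is a single interpolation, p(w) = λ·p_kNN(w) + (1−λ)·p_LM(w), where p_kNN distributes softmax(−distance) weight over the tokens that followed the retrieved neighbors in the datastore. A toy numpy sketch (all values are made up; λ is a tuned hyperparameter):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def knn_lm_prob(lm_probs, knn_dists, knn_next_tokens, vocab_size, lam=0.25):
    """kNN-LM interpolation: p(w) = lam * p_kNN(w) + (1 - lam) * p_LM(w).
    Each retrieved neighbor votes for the token that followed it in the
    datastore, weighted by softmax(-distance)."""
    weights = softmax(-np.asarray(knn_dists, dtype=float))
    p_knn = np.zeros(vocab_size)
    for w, tok in zip(weights, knn_next_tokens):
        p_knn[tok] += w                         # aggregate neighbors by target token
    return lam * p_knn + (1 - lam) * lm_probs

# Uniform LM over 4 tokens; both neighbors' continuations are token 2.
p = knn_lm_prob(np.full(4, 0.25), [0.1, 0.2], [2, 2], vocab_size=4)
```

Because p_kNN is built directly from datastore examples, rare patterns the parametric LM underestimates can still receive probability mass — the fine-grained behavior noted in the advantages below.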

Experimental results:

Can use in-domain datastore even if parameters were not trained in-domain

Comparison and summary:

Advantages of kNN-LM: finer granularity; handles rare patterns and out-of-domain data better; can be very efficient (kNN search is fast).

Disadvantages: no cross-attention between the input and the retrieved results; the datastore is expensive (large).

ๆ€่€ƒ: ๅœจwhen to retrieveไธญ๏ผŒevery n tokensๅ’Œevery tokensๆ˜ฏๅฆๅฏไปฅๅŽปๅš adaptive ๏ผŸ โ†“

Adaptive retrieval for efficiency

Two categories: adaptive retrieval of text chunks (following retrieve-in-context); adaptive retrieval of tokens (following kNN-LM).

Motivation and overview: most existing retrieval-augmented LMs adopt a retrieve-and-generate setup that retrieves information once based on the query. However, in the more general scenario of generating long text, continually gathering information throughout generation is essential. Past efforts to retrieve multiple times while generating mostly retrieve documents at fixed intervals using the previous context as the query. This work offers a generalized view of active retrieval-augmented generation: actively deciding when and what to retrieve during generation. It proposes Forward-Looking Active REtrieval-augmented generation (FLARE), a generic method that iteratively uses a prediction of the upcoming sentence to anticipate future content and, if that sentence contains low-confidence tokens, uses it as a query to retrieve relevant documents and regenerate the sentence.

Motivated by long-form generation tasks: a single retrieval cannot satisfy the need. Similar to how humans gradually gather information when writing papers or books, long-form generation with an LM requires gathering multiple pieces of knowledge throughout the generation process. The approach taken here: generate a tentative next sentence, use it as a query to retrieve relevant documents, then regenerate the next sentence conditioned on the retrieved documents.

FLARE iteratively generates a tentative next sentence; if it contains low-probability tokens, that sentence is used as a query to retrieve relevant documents and the sentence is regenerated, repeating until the output is complete.

ๆ€่€ƒ๏ผšไป€ไนˆๆ˜ฏlow-probability tokens ๅฆ‚ไฝ•็•Œๅฎš


See the slides for the detailed procedure.

Adaptive retrieval of tokens — judging necessity: Efficient Nearest Neighbor Language Models

Adaptive retrieval of tokens — using local info: RETOMATON (Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval)


Summary:

ๆ€่€ƒ: What else beyond text chunks and tokens to retrieve? โ†“

Entities-as-experts models

Introduce a new model, Entities as Experts (EAE), that can access distinct memories of the entities mentioned in a piece of text. Unlike other efforts to inject entity-specific knowledge into sequence models, EAE learns entity representations directly from text, along with all the other model parameters.

(Figure) As shown above, a traditional Transformer must build its internal representation of Charles Darwin from the words "Charles" and "Darwin", each of which could also refer to different entities, such as the Charles River or Darwin City. In contrast, EAE can access a dedicated representation of "Charles Darwin": a memory of all the contexts in which this entity has previously been mentioned.


ไปŽๆฏไธชๅฎžไฝ“ไธ€ไธชๅ‘้‡ๅˆฐๆฏไธชๅฎžไฝ“ๆๅŠไธ€ไธชๅ‘้‡็š„่ฝฌๅ˜--Mention Memory:incorporating textual knowledge into Transformers through entity mention attention้€š่ฟ‡ๅฎžไฝ“ๆๅŠๆณจๆ„ๅŠ›ๅฐ†ๆ–‡ๆœฌ็Ÿฅ่ฏ†่žๅ…ฅtransformerไธญ

ๆ‘˜่ฆ็ฟป่ฏ‘๏ผš

Natural-language-understanding tasks such as open-domain question answering often require retrieving and assimilating factual information from multiple sources. We propose addressing this by integrating a semi-parametric representation of a large text corpus into a Transformer model as a source of factual knowledge.

ๅ…ทไฝ“ๆฅ่ฏด๏ผŒๆˆ‘ไปฌ็š„ๆ–นๆณ•็”จโ€œๆๅŠ่ฎฐๅฟ†โ€ๆฅ่กจ็คบ็Ÿฅ่ฏ†๏ผŒโ€œๆๅŠ่ฎฐๅฟ†โ€ๆ˜ฏ่ฏญๆ–™ๅบ“ไธญๆๅŠ็š„ๆฏไธชๅฎžไฝ“็š„ๅฏ†้›†ๅ‘้‡่กจ็คบ่กจใ€‚ๆ‰€ๆๅ‡บ็š„ๆจกๅž‹ - TOME - ๆ˜ฏไธ€ไธช Transformer๏ผŒๅฎƒ้€š่ฟ‡ๅ†…้ƒจ่ฎฐๅฟ†ๅฑ‚่ฎฟ้—ฎไฟกๆฏ๏ผŒๅ…ถไธญ่พ“ๅ…ฅๆฎต่ฝไธญๆๅŠ็š„ๆฏไธชๅฎžไฝ“้ƒฝๆถ‰ๅŠๆๅŠ่ฎฐๅฟ†ใ€‚่ฟ™็งๆ–นๆณ•ๅฏไปฅๅœจๅ•ไธช Transformer ๆจกๅž‹ไธญๅฏน่ฎธๅคšไธๅŒ็š„ไฟกๆฏๆบ่ฟ›่กŒ็ปผๅˆๅ’ŒๆŽจ็†ใ€‚ๅœจไฝฟ็”จ 1.5 ไบฟๆก็ปดๅŸบ็™พ็ง‘ๆๅŠ็š„ๅ†…ๅญ˜่ฟ›่กŒ็š„ๅฎž้ชŒไธญ๏ผŒTOME ๅœจๅคšไธชๅผ€ๆ”พ้ข†ๅŸŸ็Ÿฅ่ฏ†ๅฏ†้›†ๅž‹ไปปๅŠกไธŠๅ–ๅพ—ไบ†ๅ‡บ่‰ฒ็š„ๆ€ง่ƒฝ๏ผŒๅŒ…ๆ‹ฌๅฃฐๆ˜Ž้ชŒ่ฏๅŸบๅ‡† HoVer ๅ’Œ FEVER ไปฅๅŠๅคšไธชๅŸบไบŽๅฎžไฝ“็š„ QA ๅŸบๅ‡†ใ€‚ๆˆ‘ไปฌ่ฟ˜่กจๆ˜Ž๏ผŒ่ฏฅๆจกๅž‹ๅœจๆฒกๆœ‰ไปปไฝ•็›ดๆŽฅ็›‘็ฃ็š„ๆƒ…ๅ†ตไธ‹ๅญฆไผšไบ†ๅ…ณๆณจinformative mentionsใ€‚ๆœ€ๅŽ๏ผŒๆˆ‘ไปฌ่ฏๆ˜Ž่ฏฅๆจกๅž‹ๅฏไปฅ้€š่ฟ‡ๆ›ดๆ–ฐๅ†…ๅญ˜่€Œๆ— ้œ€้‡ๆ–ฐ่ฎญ็ปƒๆฅๆŽจๅนฟๅˆฐๆ–ฐ็š„็œ‹ไธ่ง็š„ๅฎžไฝ“ใ€‚


Summary:


Advantages: effective for entity-centric tasks; space-efficient.

Disadvantages: requires additional entity detection.

ไธŠ้ขๆ‰€ๆœ‰็š„ๆจกๅž‹้ƒฝๆ˜ฏๅŸบไบŽๅค–้ƒจๆ–‡ๆœฌ็š„๏ผŒ่ฟ˜ๆœ‰ๅ…ถไป–ๆ–นๆณ•ๅ—๏ผŸโ†“

Retrieval for long-range LM

Language models typically need to be trained or fine-tuned to acquire new knowledge, which involves updating their weights. Instead, we envision language models that can simply read and memorize new data at inference time, thus acquiring new knowledge immediately. In this work, we extend language models with the ability to memorize the internal representations of past inputs. We demonstrate that an approximate kNN lookup into a non-differentiable memory of recent (key, value) pairs improves language modeling across various benchmarks and tasks, including generic web text (C4), math papers (arXiv), books (PG-19), code (GitHub), and formal theorems (Isabelle). We show that performance steadily improves when the memory size is increased up to 262K tokens. On benchmarks including code and mathematics, we find that the model can make use of newly defined functions and theorems during test time. (Retrieval here is kNN-based.)

Attention over long sequences is also useful as a form of rapid learning. Facts and information stored in weight matrices must be trained slowly over tens of thousands of steps. By using attention, however, a model can simply memorize facts (e.g., function definitions) by storing them as (key, value) pairs in long-term memory and then retrieve those facts later by creating queries that attend to them. In this setting, attention acts as a form of information retrieval, allowing the model to look up facts it has seen previously.


Extend the Transformer to access the (key, value) pairs of previously seen subsequences.
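A sketch of the lookup: the current query attends only over the k cached (key, value) pairs closest to it, instead of the whole history. Pure numpy with toy dimensions and made-up cache contents; the real model does this per head inside one dedicated attention layer:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def knn_memory_attention(query, mem_keys, mem_values, k=4):
    """Memorizing-Transformers-style lookup: find the k cached keys closest
    to the query (by inner product), then attend over just those neighbors
    instead of the full history."""
    scores = mem_keys @ query
    top = np.argsort(-scores)[:k]
    attn = softmax(scores[top])
    return attn @ mem_values[top]               # weighted sum of retrieved values

# Toy 2-D cache of four (key, value) pairs.
mem_keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
mem_values = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.0]])
out = knn_memory_attention(np.array([10.0, 0.0]), mem_keys, mem_values, k=1)
```

Because only k neighbors enter the softmax, the memory can grow to hundreds of thousands of tokens while per-step attention cost stays fixed.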

Bertsch et al. 2023. Unlimiformer: Long-Range Transformers with Unlimited Length Input

Appendix: supplementary concepts

Gradient backpropagation

Essentially just gradient descent plus backpropagation. Reference: https://atcold.github.io/pytorch-Deep-Learning/zh/week02/02-1/

Gradient reversal

Used for domain adaptation. Reference: https://zhuanlan.zhihu.com/p/75470256
