Why GPT-3 Matters

Number of Parameters of GPT-3 compared to previous models. (<a href='https://www.willstats.com/'>Edited by WillStats</a>, <a href='https://arxiv.org/abs/1910.01108'>Original 1</a>, <a href='https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/'>Original 2</a>)

The sheer scale of the new GPT-3 model is hard to overstate: it’s an entire order of magnitude larger than Microsoft’s already-massive 17B-parameter Turing-NLG.[1] Loading the model’s weights alone in fp16 would take up an absolutely preposterous ~350GB of VRAM, not even including the gradients. But with massive size comes massive generalization ability: GPT-3 is competitive on many benchmarks without even tuning on the target task. And when I say many, I mean many: the full 72-page paper contains an extensive evaluation of GPT-3 across a wide range of NLP datasets. Through the OpenAI API, a vast array of impressive demos has sprung up taking advantage of GPT-3’s generalization capabilities to do extremely disparate tasks. Perhaps the most impressive part, though, is that even at such a massive scale, performance still scales smoothly with model size instead of plateauing, implying that still-larger models would perform even better. Throughout the rest of this post, my goal is to distill this massive (in multiple ways) paper down to a digestible size, and shed some light on why it matters.
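
As a sanity check on that figure, here’s the back-of-the-envelope arithmetic (my own estimate, not a number quoted in the paper):

```python
# fp16 stores each parameter in 2 bytes, so the weights alone come to:
n_params = 175e9                     # 175 billion parameters
bytes_per_param = 2                  # fp16
weights_gb = n_params * bytes_per_param / 1e9
print(f"{weights_gb:.0f} GB")        # ~350 GB, before gradients or optimizer state

# Gradients and optimizer state add several more copies of this during training.
```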

Model

The following table summarizes some of the largest autoregressive Transformer models of the past few years. I’ve excluded models like XLNet and BERT derivatives because they don’t share the same unidirectional autoregressive training objective.

Model         Parameters   Layers   Hidden Size   Attn Heads   Attn Head Dim   Context Length
GPT           0.110B       12       768           12           64              512
GPT-2         1.542B       48       1600          25           64              1024
Megatron-LM   8.3B         72       3072          32           96              1024
Turing-NLG    17B          78       4256          28           152             1024
GPT-3         175.0B       96       12288         96           128             2048

While GPT-3 isn’t that much deeper, its width is nearly 3x that of Turing-NLG. Since parameter count scales approximately with the square of the hidden size, this explains where most of the extra parameters come from. It also has double the context size, at 2048 tokens, which is impressive (and memory-expensive!), though not the longest context of any model: Transformer-XL incorporates longer contexts by passing context vectors between segments, and Reformer uses locality-sensitive hashing to enable sparser attention. In a similar vein, GPT-3 alternates dense and sparse attention layers, though the exact details are left somewhat ambiguous. It’s also interesting to note that the smaller GPT-3 versions trained for comparison with GPT-2 are slightly shallower and wider, with GPT-3-XL having only 24 layers but a hidden size of 2048.[2] GPT-3 also reuses the BPE tokenization of GPT-2. Overall, GPT-3 is essentially just a downright massive version of GPT-2.
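
To see where that quadratic dependence on width comes from, here’s a rough parameter-count estimate for a GPT-style decoder (the 12·d² per-block figure is a standard approximation for the attention projections plus a 4x-wide MLP, not the paper’s exact accounting; the vocabulary size is GPT-2’s, which GPT-3 reuses):

```python
def approx_params(n_layers: int, d_model: int, vocab_size: int = 50257) -> float:
    """Rough GPT-style parameter count: ~12*d^2 per block plus the embedding matrix."""
    per_block = 12 * d_model ** 2        # 4*d^2 attention projections + 8*d^2 MLP
    return n_layers * per_block + vocab_size * d_model

print(f"{approx_params(96, 12288) / 1e9:.0f}B")  # GPT-3:      ~175B
print(f"{approx_params(78, 4256) / 1e9:.0f}B")   # Turing-NLG: ~17B
```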

Training Data

Weighted Training Data (<a href='https://arxiv.org/abs/2005.14165'>Source</a>)

The training data is a reweighted mix of Common Crawl, WebText2 (a larger version of the original that also includes links sampled during Jan-Oct 2018), two book corpora, and English Wikipedia. Some of these components, such as Wikipedia, were seen more than 3 times during training; others, like the massive Common Crawl component, had less than half of their data seen. The authors claim that this helps raise the overall quality of the corpus by prioritising known-good datasets. Also, in contrast to the original WebText, this new corpus is not filtered by language, though English still constitutes 93% of the dataset by word count simply due to its prevalence. Altogether, the dataset is 500 billion tokens, or about 700GB[3], after filtering and cleaning. The paper also provides a detailed description of the filtering process, something the GPT-2 paper didn’t do.
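
The “seen more than 3 times” versus “less than half” contrast falls straight out of the sampling weights. A minimal sketch (the dataset sizes and weights below are illustrative placeholders, not the paper’s exact table; the ~300B-token training budget is the paper’s figure):

```python
def epochs_seen(dataset_tokens: float, weight: float, training_tokens: float = 300e9) -> float:
    """How many times a dataset is (fractionally) traversed under a fixed sampling weight."""
    return weight * training_tokens / dataset_tokens

# A small, high-quality corpus given an outsized weight gets repeated...
print(f"{epochs_seen(dataset_tokens=3e9, weight=0.03):.2f}")    # ~3.00 epochs
# ...while a huge corpus with a modest weight is only partially consumed.
print(f"{epochs_seen(dataset_tokens=410e9, weight=0.60):.2f}")  # ~0.44 epochs
```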

The authors also attempted to remove any data that overlapped with the train and test sets of the evaluations. Unfortunately, due to a bug, some overlaps were missed; to compensate, the paper provides a fairly thorough analysis of the impact of this leakage.

Evaluation

Zero-, One-, and Few-shot performance of GPT-3 scaling with parameter count (<a href='https://arxiv.org/abs/2005.14165'>Source</a>)

The Evaluation section of GPT-3 is very comprehensive, covering a massive battery of NLP tasks in the zero-shot (only a natural-language description of the task in the generation context), one-shot (a single example in the context), and few-shot (a small handful of examples in the context) settings. This setting is worth emphasizing as perhaps the biggest difference in ability between GPT-3 and its predecessors, because being able to infer the task from just one or a few examples is a massive step forward in generalization. Whereas previous models all relied on task-specific tuning, GPT-3 can be “tuned” merely by giving it instructions in plain English! In fact, the paper doesn’t even attempt to fine-tune on the target tasks, leaving that to future work.[4] One crucial conclusion is that in almost all tests, performance continues to improve with larger models, across more than three orders of magnitude of parameter count, whereas fine-tuning improves only the one task being tuned for and risks catastrophic forgetting and overfitting.
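
Concretely, “few-shot” here just means packing examples into the prompt. A minimal sketch of what such a prompt might look like (the task, wording, and separators are my own illustration, not the paper’s exact prompts):

```python
# Build a few-shot prompt: a natural-language task description followed by K
# worked examples, ending with a query for the model to complete. No weights
# are updated; the "learning" happens entirely in the context window.
task_description = "Translate English to French."
examples = [("cheese", "fromage"), ("house", "maison"), ("cat", "chat")]
query = "dog"

prompt = task_description + "\n\n"
for english, french in examples:
    prompt += f"English: {english}\nFrench: {french}\n\n"
prompt += f"English: {query}\nFrench:"   # the model is expected to continue with the answer

print(prompt)
# Dropping the examples (but keeping the description) gives the zero-shot
# setting; keeping exactly one gives one-shot.
```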

Without going too much into the individual tests, the general result is this: on most tasks, GPT-3 performs significantly worse than the fine-tuned SOTA (e.g. SuperGLUE, CoQA, Winograd), but beats the fine-tuned SOTA on some others (e.g. PhysicalQA, LAMBADA, Penn Treebank). GPT-3 does particularly well on PTB, taking the SOTA perplexity from 35.76 down to 20.5, a massive improvement. GPT-3 can also finally do some arithmetic, something GPT-2 was unable to do well.[5]
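
For context on what that perplexity drop means: perplexity is just the exponential of the average per-token negative log-likelihood, so the improvement corresponds to shaving roughly half a nat off the average per-token loss (illustrative arithmetic on the numbers above, not a figure from the paper):

```python
import math

def nll_per_token(perplexity: float) -> float:
    """Average negative log-likelihood (in nats per token) implied by a perplexity."""
    return math.log(perplexity)

print(f"{nll_per_token(35.76):.2f}")  # previous SOTA: ~3.58 nats/token
print(f"{nll_per_token(20.5):.2f}")   # GPT-3:         ~3.02 nats/token
```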

People are unable to separate GPT-3 generated news articles from real ones (<a href='https://arxiv.org/abs/2005.14165'>Source</a>)

Impressively, and perhaps somewhat alarmingly, people are unable to distinguish GPT-3-generated news stories from real ones, only exacerbating the ethical concerns already raised by GPT-2. The paper analyzes the fallout of the GPT-2 release and concludes that it has not led to widespread use of LMs for misinformation, due to the difficulty of controlling output and the variance in output quality, both among low-to-mid-skill adversaries and “advanced persistent threats” (adversaries with “high skill and long-term agendas”, such as state actors). However, the paper also acknowledges that with further development, LMs will eventually become advanced enough for these adversaries.

The authors also investigate gender bias in GPT-3, showing that GPT-3 leans male; however, they claim that some preliminary evidence on the Winogender dataset (which tests coreference resolution on the same sentence with differently gendered pronouns) suggests that larger models are more robust to bias issues. Similar issues appear for race and religion, with the sentiment of coöccurrent terms varying significantly with race. The authors claim that this issue also improves with larger models; although, without proper hypothesis testing, it’s difficult to draw any solid conclusions here.

Downstream Applications

GPT-3 has already been used for a smorgasbord of different applications through the OpenAI API. You can ask it to write code, turn natural-language commands into shell commands, and simulate chatting with famous people. You can ask it to answer medical questions, or write parodies of the navy seal copypasta. You can ask it to summarize passages for second graders, or write poetry.

It’s important to remember that all of these are done by the exact same model, trained only to model text; all that’s different is that it has been “asked nicely” to do different things. These apps showcase the versatility of GPT-3 across many disparate domains, something that, if attempted with GPT-2, would require days or even weeks of extensive data engineering and fine-tuning rather than 15 minutes of prompt crafting. This new paradigm of programming through crafting plain-English prompts, jokingly dubbed “Software 3.0”, has achieved results that are already impressive, but even more impressive when viewed through the lens of generalization: GPT-3 wasn’t trained to do any of these things in particular, but it can still be asked[6] to do them, and fairly well at that!

Conclusion

Performance continues to scale with compute. (<a href='https://arxiv.org/abs/2005.14165'>Source</a>)

But why does GPT-3 matter if it can’t even beat SOTA across all benchmarks? Why should we care about a model so large that a small computing cluster is necessary just to run inference at a reasonable speed?

The remarkable thing about GPT-3 is that it does reasonably well on tasks it has never even seen, sometimes tasks not even anticipated by the developers of the model. Additionally, instead of reaching a point of diminishing returns, GPT-3 shows that the trend of larger models performing better continues for at least another order of magnitude, with no signs of stopping. Even though GPT-3 is unwieldy, and even though it still doesn’t quite reach human-level performance across the board, it shows that it’s possible for a model to someday reach human levels of generalization in NLP; and once the impossible becomes possible, it’s only a matter of time until it becomes practical.


  1. Back when I talked about large Transformer language models like GPT-2, CTRL, and Megatron-LM late last year, I touched briefly on the trend of language models getting bigger, and covered some of the issues that simply adding more compute might not fix. My general anticipation was that the model-size arms race would soon come to a temporary standstill, with focus being diverted to better decoding strategies for text generation (perhaps via RL-based methods). I most certainly had not expected that OpenAI would be back at it so soon with such a massive model.

    This was such a surprise that I dropped everything to read the paper and work on this post, including a more theory-oriented post that I’ve been working on for a few months now. It will probably be finished soon™, after I recover from GPT-3 shock. Stay tuned! ↩︎

  2. It’s likely that this was done for easier model parallelism: bigger matrix multiplications are much easier to parallelize than sequentially-applied layers à la GPipe.

    This could have other advantages too, though. After EfficientNet came out, I independently ran some experiments applying the same concepts to Transformer models, and the result was that for the same amount of compute, wider models had a sizeable advantage over deeper ones, which corroborates the choice here to go wider. ↩︎

  3. This figure is extrapolated from the size of the Common Crawl subset, which is given in the paper. ↩︎

  4. There has been some speculation about the lack of fine-tuning results in this paper. Some have speculated that the fine-tuned performance is poor, leading OpenAI to exclude the results. It could be that with a model this big, the fine-tuning paradigm starts to break down due to the ease of overfitting, though without access to the model (and the hardware to tune it!) there’s not much we can do beyond speculate.

    An update: OpenAI has talked about releasing a fine-tuning API. It remains to be seen exactly how the API will work, but I’m personally excited to see the results on various datasets. There has been debate over whether fine-tuning GPT-3 would actually result in performance gains, and seeing the result either way would be very informative. ↩︎

  5. There is some evidence that the Byte-Pair Encoding used in GPT-3 may be the source of many issues, especially surrounding arithmetic and rhyming. BPE glues adjacent characters together, which creates difficulties for the model when those characters are individually semantically meaningful, as the short sketch below illustrates.
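
    As a quick illustration, here’s a sketch using the Hugging Face transformers GPT-2 tokenizer (GPT-3 reuses the same BPE vocabulary); the exact splits depend on the learned merges, but digits generally come out glued into multi-character chunks rather than as individual digits:

        from transformers import GPT2Tokenizer

        tok = GPT2Tokenizer.from_pretrained("gpt2")
        # Numbers are tokenized as arbitrary multi-digit chunks, so the model never
        # sees a consistent digit-level representation to do column arithmetic on.
        print(tok.tokenize("1234 + 5678"))
        print(tok.tokenize("123 + 456"))

    The chunk boundaries also shift with leading whitespace, so the same number can tokenize differently depending on its context. ↩︎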

  6. The word “ask” is often abused to refer to getting models to do things in general, but in this case we’re literally asking the model! ↩︎

...