PNG  IHDRX cHRMz&u0`:pQ<bKGD pHYsodtIME MeqIDATxw]Wug^Qd˶ 6`!N:!@xI~)%7%@Bh&`lnjVF29gΨ4E$|>cɚ{gk= %,a KX%,a KX%,a KX%,a KX%,a KX%,a KX%, b` ǟzeאfp]<!SJmɤY޲ڿ,%c ~ع9VH.!Ͳz&QynֺTkRR.BLHi٪:l;@(!MԴ=žI,:o&N'Kù\vRmJ雵֫AWic H@" !: Cé||]k-Ha oݜ:y F())u]aG7*JV@J415p=sZH!=!DRʯvɱh~V\}v/GKY$n]"X"}t@ xS76^[bw4dsce)2dU0 CkMa-U5tvLƀ~mlMwfGE/-]7XAƟ`׮g ewxwC4\[~7@O-Q( a*XGƒ{ ՟}$_y3tĐƤatgvێi|K=uVyrŲlLӪuܿzwk$m87k( `múcE)"@rK( z4$D; 2kW=Xb$V[Ru819קR~qloѱDyįݎ*mxw]y5e4K@ЃI0A D@"BDk_)N\8͜9dz"fK0zɿvM /.:2O{ Nb=M=7>??Zuo32 DLD@D| &+֎C #B8ַ`bOb $D#ͮҪtx]%`ES`Ru[=¾!@Od37LJ0!OIR4m]GZRJu$‡c=%~s@6SKy?CeIh:[vR@Lh | (BhAMy=݃  G"'wzn޺~8ԽSh ~T*A:xR[ܹ?X[uKL_=fDȊ؂p0}7=D$Ekq!/t.*2ʼnDbŞ}DijYaȲ(""6HA;:LzxQ‘(SQQ}*PL*fc\s `/d'QXW, e`#kPGZuŞuO{{wm[&NBTiiI0bukcA9<4@SӊH*؎4U/'2U5.(9JuDfrޱtycU%j(:RUbArLֺN)udA':uGQN"-"Is.*+k@ `Ojs@yU/ H:l;@yyTn}_yw!VkRJ4P)~y#)r,D =ě"Q]ci'%HI4ZL0"MJy 8A{ aN<8D"1#IJi >XjX֔#@>-{vN!8tRݻ^)N_╗FJEk]CT՟ YP:_|H1@ CBk]yKYp|og?*dGvzنzӴzjֺNkC~AbZƷ`.H)=!QͷVTT(| u78y֮}|[8-Vjp%2JPk[}ԉaH8Wpqhwr:vWª<}l77_~{s۴V+RCģ%WRZ\AqHifɤL36: #F:p]Bq/z{0CU6ݳEv_^k7'>sq*+kH%a`0ԣisqにtү04gVgW΂iJiS'3w.w}l6MC2uԯ|>JF5`fV5m`Y**Db1FKNttu]4ccsQNnex/87+}xaUW9y>ͯ骵G{䩓Գ3+vU}~jJ.NFRD7<aJDB1#ҳgSb,+CS?/ VG J?|?,2#M9}B)MiE+G`-wo߫V`fio(}S^4e~V4bHOYb"b#E)dda:'?}׮4繏`{7Z"uny-?ǹ;0MKx{:_pÚmFמ:F " .LFQLG)Q8qN q¯¯3wOvxDb\. BKD9_NN &L:4D{mm o^tֽ:q!ƥ}K+<"m78N< ywsard5+вz~mnG)=}lYݧNj'QJS{S :UYS-952?&O-:W}(!6Mk4+>A>j+i|<<|;ر^߉=HE|V#F)Emm#}/"y GII웻Jі94+v뾧xu~5C95~ūH>c@덉pʃ1/4-A2G%7>m;–Y,cyyaln" ?ƻ!ʪ<{~h~i y.zZB̃/,雋SiC/JFMmBH&&FAbϓO^tubbb_hZ{_QZ-sύodFgO(6]TJA˯#`۶ɟ( %$&+V'~hiYy>922 Wp74Zkq+Ovn錄c>8~GqܲcWꂎz@"1A.}T)uiW4="jJ2W7mU/N0gcqܗOO}?9/wìXžΏ0 >֩(V^Rh32!Hj5`;O28؇2#ݕf3 ?sJd8NJ@7O0 b־?lldщ̡&|9C.8RTWwxWy46ah嘦mh٤&l zCy!PY?: CJyв]dm4ǜҐR޻RլhX{FƯanшQI@x' ao(kUUuxW_Ñ줮[w8 FRJ(8˼)_mQ _!RJhm=!cVmm ?sFOnll6Qk}alY}; "baӌ~M0w,Ggw2W:G/k2%R,_=u`WU R.9T"v,<\Ik޽/2110Ӿxc0gyC&Ny޽JҢrV6N ``یeA16"J³+Rj*;BϜkZPJaÍ<Jyw:NP8/D$ 011z֊Ⱳ3ι֘k1V_"h!JPIΣ'ɜ* aEAd:ݺ>y<}Lp&PlRfTb1]o .2EW\ͮ]38؋rTJsǏP@芎sF\> P^+dYJLbJ C-xϐn> ι$nj,;Ǖa FU *择|h ~izť3ᤓ`K'-f tL7JK+vf2)V'-sFuB4i+m+@My=O҈0"|Yxoj,3]:cо3 $#uŘ%Y"y죯LebqtҢVzq¼X)~>4L׶m~[1_k?kxֺQ`\ |ٛY4Ѯr!)N9{56(iNq}O()Em]=F&u?$HypWUeB\k]JɩSع9 Zqg4ZĊo oMcjZBU]B\TUd34ݝ~:7ڶSUsB0Z3srx 7`:5xcx !qZA!;%͚7&P H<WL!džOb5kF)xor^aujƍ7 Ǡ8/p^(L>ὴ-B,{ۇWzֺ^k]3\EE@7>lYBȝR.oHnXO/}sB|.i@ɥDB4tcm,@ӣgdtJ!lH$_vN166L__'Z)y&kH;:,Y7=J 9cG) V\hjiE;gya~%ks_nC~Er er)muuMg2;֫R)Md) ,¶ 2-wr#F7<-BBn~_(o=KO㭇[Xv eN_SMgSҐ BS헃D%g_N:/pe -wkG*9yYSZS.9cREL !k}<4_Xs#FmҶ:7R$i,fi!~' # !6/S6y@kZkZcX)%5V4P]VGYq%H1!;e1MV<!ϐHO021Dp= HMs~~a)ަu7G^];git!Frl]H/L$=AeUvZE4P\.,xi {-~p?2b#amXAHq)MWǾI_r`S Hz&|{ +ʖ_= (YS(_g0a03M`I&'9vl?MM+m~}*xT۲(fY*V4x@29s{DaY"toGNTO+xCAO~4Ϳ;p`Ѫ:>Ҵ7K 3}+0 387x\)a"/E>qpWB=1 ¨"MP(\xp߫́A3+J] n[ʼnӼaTbZUWb={~2ooKױӰp(CS\S筐R*JغV&&"FA}J>G֐p1ٸbk7 ŘH$JoN <8s^yk_[;gy-;߉DV{c B yce% aJhDȶ 2IdйIB/^n0tNtџdcKj4϶v~- CBcgqx9= PJ) dMsjpYB] GD4RDWX +h{y`,3ꊕ$`zj*N^TP4L:Iz9~6s) Ga:?y*J~?OrMwP\](21sZUD ?ܟQ5Q%ggW6QdO+\@ ̪X'GxN @'4=ˋ+*VwN ne_|(/BDfj5(Dq<*tNt1х!MV.C0 32b#?n0pzj#!38}޴o1KovCJ`8ŗ_"]] rDUy޲@ Ȗ-;xџ'^Y`zEd?0„ DAL18IS]VGq\4o !swV7ˣι%4FѮ~}6)OgS[~Q vcYbL!wG3 7띸*E Pql8=jT\꘿I(z<[6OrR8ºC~ډ]=rNl[g|v TMTղb-o}OrP^Q]<98S¤!k)G(Vkwyqyr޽Nv`N/e p/~NAOk \I:G6]4+K;j$R:Mi #*[AȚT,ʰ,;N{HZTGMoּy) ]%dHء9Պ䠬|<45,\=[bƟ8QXeB3- &dҩ^{>/86bXmZ]]yޚN[(WAHL$YAgDKp=5GHjU&99v簪C0vygln*P)9^͞}lMuiH!̍#DoRBn9l@ xA/_v=ȺT{7Yt2N"4!YN`ae >Q<XMydEB`VU}u]嫇.%e^ánE87Mu\t`cP=AD/G)sI"@MP;)]%fH9'FNsj1pVhY&9=0pfuJ&gޤx+k:!r˭wkl03׼Ku C &ѓYt{.O.zҏ z}/tf_wEp2gvX)GN#I ݭ߽v/ .& и(ZF{e"=V!{zW`, ]+LGz"(UJp|j( #V4, 8B 0 9OkRrlɱl94)'VH9=9W|>PS['G(*I1==C<5"Pg+x'K5EMd؞Af8lG ?D FtoB[je?{k3zQ vZ;%Ɠ,]E>KZ+T/ EJxOZ1i #T<@ I}q9/t'zi(EMqw`mYkU6;[t4DPeckeM;H}_g pMww}k6#H㶏+b8雡Sxp)&C $@'b,fPߑt$RbJ'vznuS ~8='72_`{q纶|Q)Xk}cPz9p7O:'|G~8wx(a 0QCko|0ASD>Ip=4Q, d|F8RcU"/KM opKle M3#i0c%<7׿p&pZq[TR"BpqauIp$ 8~Ĩ!8Սx\ւdT>>Z40ks7 z2IQ}ItԀ<-%S⍤};zIb$I 5K}Q͙D8UguWE$Jh )cu4N tZl+[]M4k8֦Zeq֮M7uIqG 1==tLtR,ƜSrHYt&QP윯Lg' I,3@P'}'R˪e/%-Auv·ñ\> vDJzlӾNv5:|K/Jb6KI9)Zh*ZAi`?S {aiVDԲuy5W7pWeQJk֤#5&V<̺@/GH?^τZL|IJNvI:'P=Ϛt"¨=cud S Q.Ki0 !cJy;LJR;G{BJy޺[^8fK6)=yʊ+(k|&xQ2`L?Ȓ2@Mf 0C`6-%pKpm')c$׻K5[J*U[/#hH!6acB JA _|uMvDyk y)6OPYjœ50VT K}cǻP[ $:]4MEA.y)|B)cf-A?(e|lɉ#P9V)[9t.EiQPDѠ3ϴ;E:+Օ t ȥ~|_N2,ZJLt4! %ա]u {+=p.GhNcŞQI?Nd'yeh n7zi1DB)1S | S#ًZs2|Ɛy$F SxeX{7Vl.Src3E℃Q>b6G ўYCmtկ~=K0f(=LrAS GN'ɹ9<\!a`)֕y[uՍ[09` 9 +57ts6}b4{oqd+J5fa/,97J#6yν99mRWxJyѡyu_TJc`~W>l^q#Ts#2"nD1%fS)FU w{ܯ R{ ˎ󅃏џDsZSQS;LV;7 Od1&1n$ N /.q3~eNɪ]E#oM~}v֯FڦwyZ=<<>Xo稯lfMFV6p02|*=tV!c~]fa5Y^Q_WN|Vs 0ҘދU97OI'N2'8N֭fgg-}V%y]U4 峧p*91#9U kCac_AFңĪy뚇Y_AiuYyTTYЗ-(!JFLt›17uTozc. S;7A&&<ԋ5y;Ro+:' *eYJkWR[@F %SHWP 72k4 qLd'J "zB6{AC0ƁA6U.'F3:Ȅ(9ΜL;D]m8ڥ9}dU "v!;*13Rg^fJyShyy5auA?ɩGHRjo^]׽S)Fm\toy 4WQS@mE#%5ʈfFYDX ~D5Ϡ9tE9So_aU4?Ѽm%&c{n>.KW1Tlb}:j uGi(JgcYj0qn+>) %\!4{LaJso d||u//P_y7iRJ߬nHOy) l+@$($VFIQ9%EeKʈU. ia&FY̒mZ=)+qqoQn >L!qCiDB;Y<%} OgBxB!ØuG)WG9y(Ą{_yesuZmZZey'Wg#C~1Cev@0D $a@˲(.._GimA:uyw֬%;@!JkQVM_Ow:P.s\)ot- ˹"`B,e CRtaEUP<0'}r3[>?G8xU~Nqu;Wm8\RIkբ^5@k+5(By'L&'gBJ3ݶ!/㮻w҅ yqPWUg<e"Qy*167΃sJ\oz]T*UQ<\FԎ`HaNmڜ6DysCask8wP8y9``GJ9lF\G g's Nn͵MLN֪u$| /|7=]O)6s !ĴAKh]q_ap $HH'\1jB^s\|- W1:=6lJBqjY^LsPk""`]w)󭃈,(HC ?䔨Y$Sʣ{4Z+0NvQkhol6C.婧/u]FwiVjZka&%6\F*Ny#8O,22+|Db~d ~Çwc N:FuuCe&oZ(l;@ee-+Wn`44AMK➝2BRՈt7g*1gph9N) *"TF*R(#'88pm=}X]u[i7bEc|\~EMn}P瘊J)K.0i1M6=7'_\kaZ(Th{K*GJyytw"IO-PWJk)..axӝ47"89Cc7ĐBiZx 7m!fy|ϿF9CbȩV 9V-՛^pV̌ɄS#Bv4-@]Vxt-Z, &ֺ*diؠ2^VXbs֔Ìl.jQ]Y[47gj=幽ex)A0ip׳ W2[ᎇhuE^~q흙L} #-b۸oFJ_QP3r6jr+"nfzRJTUqoaۍ /$d8Mx'ݓ= OՃ| )$2mcM*cЙj}f };n YG w0Ia!1Q.oYfr]DyISaP}"dIӗթO67jqR ҊƐƈaɤGG|h;t]䗖oSv|iZqX)oalv;۩meEJ\!8=$4QU4Xo&VEĊ YS^E#d,yX_> ۘ-e\ "Wa6uLĜZi`aD9.% w~mB(02G[6y.773a7 /=o7D)$Z 66 $bY^\CuP. (x'"J60׿Y:Oi;F{w佩b+\Yi`TDWa~|VH)8q/=9!g߆2Y)?ND)%?Ǐ`k/sn:;O299yB=a[Ng 3˲N}vLNy;*?x?~L&=xyӴ~}q{qE*IQ^^ͧvü{Huu=R|>JyUlZV, B~/YF!Y\u_ݼF{_C)LD]m {H 0ihhadd nUkf3oٺCvE\)QJi+֥@tDJkB$1!Đr0XQ|q?d2) Ӣ_}qv-< FŊ߫%roppVBwü~JidY4:}L6M7f٬F "?71<2#?Jyy4뷢<_a7_=Q E=S1И/9{+93֮E{ǂw{))?maÆm(uLE#lïZ  ~d];+]h j?!|$F}*"4(v'8s<ŏUkm7^7no1w2ؗ}TrͿEk>p'8OB7d7R(A 9.*Mi^ͳ; eeUwS+C)uO@ =Sy]` }l8^ZzRXj[^iUɺ$tj))<sbDJfg=Pk_{xaKo1:-uyG0M ԃ\0Lvuy'ȱc2Ji AdyVgVh!{]/&}}ċJ#%d !+87<;qN޼Nفl|1N:8ya  8}k¾+-$4FiZYÔXk*I&'@iI99)HSh4+2G:tGhS^繿 Kتm0 вDk}֚+QT4;sC}rՅE,8CX-e~>G&'9xpW,%Fh,Ry56Y–hW-(v_,? ; qrBk4-V7HQ;ˇ^Gv1JVV%,ik;D_W!))+BoS4QsTM;gt+ndS-~:11Sgv!0qRVh!"Ȋ(̦Yl.]PQWgٳE'`%W1{ndΗBk|Ž7ʒR~,lnoa&:ü$ 3<a[CBݮwt"o\ePJ=Hz"_c^Z.#ˆ*x z̝grY]tdkP*:97YľXyBkD4N.C_[;F9`8& !AMO c `@BA& Ost\-\NX+Xp < !bj3C&QL+*&kAQ=04}cC!9~820G'PC9xa!w&bo_1 Sw"ܱ V )Yl3+ס2KoXOx]"`^WOy :3GO0g;%Yv㐫(R/r (s } u B &FeYZh0y> =2<Ϟc/ -u= c&׭,.0"g"7 6T!vl#sc>{u/Oh Bᾈ)۴74]x7 gMӒ"d]U)}" v4co[ ɡs 5Gg=XR14?5A}D "b{0$L .\4y{_fe:kVS\\O]c^W52LSBDM! C3Dhr̦RtArx4&agaN3Cf<Ԉp4~ B'"1@.b_/xQ} _߃҉/gٓ2Qkqp0շpZ2fԫYz< 4L.Cyυι1t@鎫Fe sYfsF}^ V}N<_`p)alٶ "(XEAVZ<)2},:Ir*#m_YӼ R%a||EƼIJ,,+f"96r/}0jE/)s)cjW#w'Sʯ5<66lj$a~3Kʛy 2:cZ:Yh))+a߭K::N,Q F'qB]={.]h85C9cr=}*rk?vwV렵ٸW Rs%}rNAkDv|uFLBkWY YkX מ|)1!$#3%y?pF<@<Rr0}: }\J [5FRxY<9"SQdE(Q*Qʻ)q1E0B_O24[U'],lOb ]~WjHޏTQ5Syu wq)xnw8~)c 쫬gٲߠ H% k5dƝk> kEj,0% b"vi2Wس_CuK)K{n|>t{P1򨾜j>'kEkƗBg*H%'_aY6Bn!TL&ɌOb{c`'d^{t\i^[uɐ[}q0lM˕G:‚4kb祔c^:?bpg… +37stH:0}en6x˟%/<]BL&* 5&fK9Mq)/iyqtA%kUe[ڛKN]Ě^,"`/ s[EQQm?|XJ߅92m]G.E΃ח U*Cn.j_)Tѧj̿30ڇ!A0=͜ar I3$C^-9#|pk!)?7.x9 @OO;WƝZBFU keZ75F6Tc6"ZȚs2y/1 ʵ:u4xa`C>6Rb/Yм)^=+~uRd`/|_8xbB0?Ft||Z\##|K 0>>zxv8۴吅q 8ĥ)"6>~\8:qM}#͚'ĉ#p\׶ l#bA?)|g g9|8jP(cr,BwV (WliVxxᡁ@0Okn;ɥh$_ckCgriv}>=wGzβ KkBɛ[˪ !J)h&k2%07δt}!d<9;I&0wV/ v 0<H}L&8ob%Hi|޶o&h1L|u֦y~󛱢8fٲUsւ)0oiFx2}X[zVYr_;N(w]_4B@OanC?gĦx>мgx>ΛToZoOMp>40>V Oy V9iq!4 LN,ˢu{jsz]|"R޻&'ƚ{53ўFu(<٪9:΋]B;)B>1::8;~)Yt|0(pw2N%&X,URBK)3\zz&}ax4;ǟ(tLNg{N|Ǽ\G#C9g$^\}p?556]/RP.90 k,U8/u776s ʪ_01چ|\N 0VV*3H鴃J7iI!wG_^ypl}r*jɤSR 5QN@ iZ#1ٰy;_\3\BQQ x:WJv츟ٯ$"@6 S#qe딇(/P( Dy~TOϻ<4:-+F`0||;Xl-"uw$Цi󼕝mKʩorz"mϺ$F:~E'ҐvD\y?Rr8_He@ e~O,T.(ފR*cY^m|cVR[8 JҡSm!ΆԨb)RHG{?MpqrmN>߶Y)\p,d#xۆWY*,l6]v0h15M˙MS8+EdI='LBJIH7_9{Caз*Lq,dt >+~ّeʏ?xԕ4bBAŚjﵫ!'\Ը$WNvKO}ӽmSşذqsOy?\[,d@'73'j%kOe`1.g2"e =YIzS2|zŐƄa\U,dP;jhhhaxǶ?КZ՚.q SE+XrbOu%\GتX(H,N^~]JyEZQKceTQ]VGYqnah;y$cQahT&QPZ*iZ8UQQM.qo/T\7X"u?Mttl2Xq(IoW{R^ ux*SYJ! 4S.Jy~ BROS[V|žKNɛP(L6V^|cR7i7nZW1Fd@ Ara{詑|(T*dN]Ko?s=@ |_EvF]׍kR)eBJc" MUUbY6`~V޴dJKß&~'d3i WWWWWW
Current Directory: /opt/alt/python311/lib/python3.11/site-packages/pip/_vendor/html5lib
Viewing File: /opt/alt/python311/lib/python3.11/site-packages/pip/_vendor/html5lib/_tokenizer.py
from __future__ import absolute_import, division, unicode_literals from pip._vendor.six import unichr as chr from collections import deque, OrderedDict from sys import version_info from .constants import spaceCharacters from .constants import entities from .constants import asciiLetters, asciiUpper2Lower from .constants import digits, hexDigits, EOF from .constants import tokenTypes, tagTokenTypes from .constants import replacementCharacters from ._inputstream import HTMLInputStream from ._trie import Trie entitiesTrie = Trie(entities) if version_info >= (3, 7): attributeMap = dict else: attributeMap = OrderedDict class HTMLTokenizer(object): """ This class takes care of tokenizing HTML. * self.currentToken Holds the token that is currently being processed. * self.state Holds a reference to the method to be invoked... XXX * self.stream Points to HTMLInputStream object. """ def __init__(self, stream, parser=None, **kwargs): self.stream = HTMLInputStream(stream, **kwargs) self.parser = parser # Setup the initial tokenizer state self.escapeFlag = False self.lastFourChars = [] self.state = self.dataState self.escape = False # The current token being created self.currentToken = None super(HTMLTokenizer, self).__init__() def __iter__(self): """ This is where the magic happens. We do our usually processing through the states and when we have a token to return we yield the token which pauses processing until the next token is requested. """ self.tokenQueue = deque([]) # Start processing. When EOF is reached self.state will return False # instead of True and the loop will terminate. while self.state(): while self.stream.errors: yield {"type": tokenTypes["ParseError"], "data": self.stream.errors.pop(0)} while self.tokenQueue: yield self.tokenQueue.popleft() def consumeNumberEntity(self, isHex): """This function returns either U+FFFD or the character based on the decimal or hexadecimal representation. It also discards ";" if present. If not present self.tokenQueue.append({"type": tokenTypes["ParseError"]}) is invoked. """ allowed = digits radix = 10 if isHex: allowed = hexDigits radix = 16 charStack = [] # Consume all the characters that are in range while making sure we # don't hit an EOF. c = self.stream.char() while c in allowed and c is not EOF: charStack.append(c) c = self.stream.char() # Convert the set of characters consumed to an int. charAsInt = int("".join(charStack), radix) # Certain characters get replaced with others if charAsInt in replacementCharacters: char = replacementCharacters[charAsInt] self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "illegal-codepoint-for-numeric-entity", "datavars": {"charAsInt": charAsInt}}) elif ((0xD800 <= charAsInt <= 0xDFFF) or (charAsInt > 0x10FFFF)): char = "\uFFFD" self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "illegal-codepoint-for-numeric-entity", "datavars": {"charAsInt": charAsInt}}) else: # Should speed up this check somehow (e.g. move the set to a constant) if ((0x0001 <= charAsInt <= 0x0008) or (0x000E <= charAsInt <= 0x001F) or (0x007F <= charAsInt <= 0x009F) or (0xFDD0 <= charAsInt <= 0xFDEF) or charAsInt in frozenset([0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF, 0x2FFFE, 0x2FFFF, 0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE, 0x5FFFF, 0x6FFFE, 0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF, 0x9FFFE, 0x9FFFF, 0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE, 0xCFFFF, 0xDFFFE, 0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF, 0x10FFFE, 0x10FFFF])): self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "illegal-codepoint-for-numeric-entity", "datavars": {"charAsInt": charAsInt}}) try: # Try/except needed as UCS-2 Python builds' unichar only works # within the BMP. char = chr(charAsInt) except ValueError: v = charAsInt - 0x10000 char = chr(0xD800 | (v >> 10)) + chr(0xDC00 | (v & 0x3FF)) # Discard the ; if present. Otherwise, put it back on the queue and # invoke parseError on parser. if c != ";": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "numeric-entity-without-semicolon"}) self.stream.unget(c) return char def consumeEntity(self, allowedChar=None, fromAttribute=False): # Initialise to the default output for when no entity is matched output = "&" charStack = [self.stream.char()] if (charStack[0] in spaceCharacters or charStack[0] in (EOF, "<", "&") or (allowedChar is not None and allowedChar == charStack[0])): self.stream.unget(charStack[0]) elif charStack[0] == "#": # Read the next character to see if it's hex or decimal hex = False charStack.append(self.stream.char()) if charStack[-1] in ("x", "X"): hex = True charStack.append(self.stream.char()) # charStack[-1] should be the first digit if (hex and charStack[-1] in hexDigits) \ or (not hex and charStack[-1] in digits): # At least one digit found, so consume the whole number self.stream.unget(charStack[-1]) output = self.consumeNumberEntity(hex) else: # No digits found self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "expected-numeric-entity"}) self.stream.unget(charStack.pop()) output = "&" + "".join(charStack) else: # At this point in the process might have named entity. Entities # are stored in the global variable "entities". # # Consume characters and compare to these to a substring of the # entity names in the list until the substring no longer matches. while (charStack[-1] is not EOF): if not entitiesTrie.has_keys_with_prefix("".join(charStack)): break charStack.append(self.stream.char()) # At this point we have a string that starts with some characters # that may match an entity # Try to find the longest entity the string will match to take care # of &noti for instance. try: entityName = entitiesTrie.longest_prefix("".join(charStack[:-1])) entityLength = len(entityName) except KeyError: entityName = None if entityName is not None: if entityName[-1] != ";": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "named-entity-without-semicolon"}) if (entityName[-1] != ";" and fromAttribute and (charStack[entityLength] in asciiLetters or charStack[entityLength] in digits or charStack[entityLength] == "=")): self.stream.unget(charStack.pop()) output = "&" + "".join(charStack) else: output = entities[entityName] self.stream.unget(charStack.pop()) output += "".join(charStack[entityLength:]) else: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "expected-named-entity"}) self.stream.unget(charStack.pop()) output = "&" + "".join(charStack) if fromAttribute: self.currentToken["data"][-1][1] += output else: if output in spaceCharacters: tokenType = "SpaceCharacters" else: tokenType = "Characters" self.tokenQueue.append({"type": tokenTypes[tokenType], "data": output}) def processEntityInAttribute(self, allowedChar): """This method replaces the need for "entityInAttributeValueState". """ self.consumeEntity(allowedChar=allowedChar, fromAttribute=True) def emitCurrentToken(self): """This method is a generic handler for emitting the tags. It also sets the state to "data" because that's what's needed after a token has been emitted. """ token = self.currentToken # Add token to the queue to be yielded if (token["type"] in tagTokenTypes): token["name"] = token["name"].translate(asciiUpper2Lower) if token["type"] == tokenTypes["StartTag"]: raw = token["data"] data = attributeMap(raw) if len(raw) > len(data): # we had some duplicated attribute, fix so first wins data.update(raw[::-1]) token["data"] = data if token["type"] == tokenTypes["EndTag"]: if token["data"]: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "attributes-in-end-tag"}) if token["selfClosing"]: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "self-closing-flag-on-end-tag"}) self.tokenQueue.append(token) self.state = self.dataState # Below are the various tokenizer states worked out. def dataState(self): data = self.stream.char() if data == "&": self.state = self.entityDataState elif data == "<": self.state = self.tagOpenState elif data == "\u0000": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "invalid-codepoint"}) self.tokenQueue.append({"type": tokenTypes["Characters"], "data": "\u0000"}) elif data is EOF: # Tokenization ends. return False elif data in spaceCharacters: # Directly after emitting a token you switch back to the "data # state". At that point spaceCharacters are important so they are # emitted separately. self.tokenQueue.append({"type": tokenTypes["SpaceCharacters"], "data": data + self.stream.charsUntil(spaceCharacters, True)}) # No need to update lastFourChars here, since the first space will # have already been appended to lastFourChars and will have broken # any <!-- or --> sequences else: chars = self.stream.charsUntil(("&", "<", "\u0000")) self.tokenQueue.append({"type": tokenTypes["Characters"], "data": data + chars}) return True def entityDataState(self): self.consumeEntity() self.state = self.dataState return True def rcdataState(self): data = self.stream.char() if data == "&": self.state = self.characterReferenceInRcdata elif data == "<": self.state = self.rcdataLessThanSignState elif data == EOF: # Tokenization ends. return False elif data == "\u0000": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "invalid-codepoint"}) self.tokenQueue.append({"type": tokenTypes["Characters"], "data": "\uFFFD"}) elif data in spaceCharacters: # Directly after emitting a token you switch back to the "data # state". At that point spaceCharacters are important so they are # emitted separately. self.tokenQueue.append({"type": tokenTypes["SpaceCharacters"], "data": data + self.stream.charsUntil(spaceCharacters, True)}) # No need to update lastFourChars here, since the first space will # have already been appended to lastFourChars and will have broken # any <!-- or --> sequences else: chars = self.stream.charsUntil(("&", "<", "\u0000")) self.tokenQueue.append({"type": tokenTypes["Characters"], "data": data + chars}) return True def characterReferenceInRcdata(self): self.consumeEntity() self.state = self.rcdataState return True def rawtextState(self): data = self.stream.char() if data == "<": self.state = self.rawtextLessThanSignState elif data == "\u0000": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "invalid-codepoint"}) self.tokenQueue.append({"type": tokenTypes["Characters"], "data": "\uFFFD"}) elif data == EOF: # Tokenization ends. return False else: chars = self.stream.charsUntil(("<", "\u0000")) self.tokenQueue.append({"type": tokenTypes["Characters"], "data": data + chars}) return True def scriptDataState(self): data = self.stream.char() if data == "<": self.state = self.scriptDataLessThanSignState elif data == "\u0000": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "invalid-codepoint"}) self.tokenQueue.append({"type": tokenTypes["Characters"], "data": "\uFFFD"}) elif data == EOF: # Tokenization ends. return False else: chars = self.stream.charsUntil(("<", "\u0000")) self.tokenQueue.append({"type": tokenTypes["Characters"], "data": data + chars}) return True def plaintextState(self): data = self.stream.char() if data == EOF: # Tokenization ends. return False elif data == "\u0000": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "invalid-codepoint"}) self.tokenQueue.append({"type": tokenTypes["Characters"], "data": "\uFFFD"}) else: self.tokenQueue.append({"type": tokenTypes["Characters"], "data": data + self.stream.charsUntil("\u0000")}) return True def tagOpenState(self): data = self.stream.char() if data == "!": self.state = self.markupDeclarationOpenState elif data == "/": self.state = self.closeTagOpenState elif data in asciiLetters: self.currentToken = {"type": tokenTypes["StartTag"], "name": data, "data": [], "selfClosing": False, "selfClosingAcknowledged": False} self.state = self.tagNameState elif data == ">": # XXX In theory it could be something besides a tag name. But # do we really care? self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "expected-tag-name-but-got-right-bracket"}) self.tokenQueue.append({"type": tokenTypes["Characters"], "data": "<>"}) self.state = self.dataState elif data == "?": # XXX In theory it could be something besides a tag name. But # do we really care? self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "expected-tag-name-but-got-question-mark"}) self.stream.unget(data) self.state = self.bogusCommentState else: # XXX self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "expected-tag-name"}) self.tokenQueue.append({"type": tokenTypes["Characters"], "data": "<"}) self.stream.unget(data) self.state = self.dataState return True def closeTagOpenState(self): data = self.stream.char() if data in asciiLetters: self.currentToken = {"type": tokenTypes["EndTag"], "name": data, "data": [], "selfClosing": False} self.state = self.tagNameState elif data == ">": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "expected-closing-tag-but-got-right-bracket"}) self.state = self.dataState elif data is EOF: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "expected-closing-tag-but-got-eof"}) self.tokenQueue.append({"type": tokenTypes["Characters"], "data": "</"}) self.state = self.dataState else: # XXX data can be _'_... self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "expected-closing-tag-but-got-char", "datavars": {"data": data}}) self.stream.unget(data) self.state = self.bogusCommentState return True def tagNameState(self): data = self.stream.char() if data in spaceCharacters: self.state = self.beforeAttributeNameState elif data == ">": self.emitCurrentToken() elif data is EOF: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "eof-in-tag-name"}) self.state = self.dataState elif data == "/": self.state = self.selfClosingStartTagState elif data == "\u0000": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "invalid-codepoint"}) self.currentToken["name"] += "\uFFFD" else: self.currentToken["name"] += data # (Don't use charsUntil here, because tag names are # very short and it's faster to not do anything fancy) return True def rcdataLessThanSignState(self): data = self.stream.char() if data == "/": self.temporaryBuffer = "" self.state = self.rcdataEndTagOpenState else: self.tokenQueue.append({"type": tokenTypes["Characters"], "data": "<"}) self.stream.unget(data) self.state = self.rcdataState return True def rcdataEndTagOpenState(self): data = self.stream.char() if data in asciiLetters: self.temporaryBuffer += data self.state = self.rcdataEndTagNameState else: self.tokenQueue.append({"type": tokenTypes["Characters"], "data": "</"}) self.stream.unget(data) self.state = self.rcdataState return True def rcdataEndTagNameState(self): appropriate = self.currentToken and self.currentToken["name"].lower() == self.temporaryBuffer.lower() data = self.stream.char() if data in spaceCharacters and appropriate: self.currentToken = {"type": tokenTypes["EndTag"], "name": self.temporaryBuffer, "data": [], "selfClosing": False} self.state = self.beforeAttributeNameState elif data == "/" and appropriate: self.currentToken = {"type": tokenTypes["EndTag"], "name": self.temporaryBuffer, "data": [], "selfClosing": False} self.state = self.selfClosingStartTagState elif data == ">" and appropriate: self.currentToken = {"type": tokenTypes["EndTag"], "name": self.temporaryBuffer, "data": [], "selfClosing": False} self.emitCurrentToken() self.state = self.dataState elif data in asciiLetters: self.temporaryBuffer += data else: self.tokenQueue.append({"type": tokenTypes["Characters"], "data": "</" + self.temporaryBuffer}) self.stream.unget(data) self.state = self.rcdataState return True def rawtextLessThanSignState(self): data = self.stream.char() if data == "/": self.temporaryBuffer = "" self.state = self.rawtextEndTagOpenState else: self.tokenQueue.append({"type": tokenTypes["Characters"], "data": "<"}) self.stream.unget(data) self.state = self.rawtextState return True def rawtextEndTagOpenState(self): data = self.stream.char() if data in asciiLetters: self.temporaryBuffer += data self.state = self.rawtextEndTagNameState else: self.tokenQueue.append({"type": tokenTypes["Characters"], "data": "</"}) self.stream.unget(data) self.state = self.rawtextState return True def rawtextEndTagNameState(self): appropriate = self.currentToken and self.currentToken["name"].lower() == self.temporaryBuffer.lower() data = self.stream.char() if data in spaceCharacters and appropriate: self.currentToken = {"type": tokenTypes["EndTag"], "name": self.temporaryBuffer, "data": [], "selfClosing": False} self.state = self.beforeAttributeNameState elif data == "/" and appropriate: self.currentToken = {"type": tokenTypes["EndTag"], "name": self.temporaryBuffer, "data": [], "selfClosing": False} self.state = self.selfClosingStartTagState elif data == ">" and appropriate: self.currentToken = {"type": tokenTypes["EndTag"], "name": self.temporaryBuffer, "data": [], "selfClosing": False} self.emitCurrentToken() self.state = self.dataState elif data in asciiLetters: self.temporaryBuffer += data else: self.tokenQueue.append({"type": tokenTypes["Characters"], "data": "</" + self.temporaryBuffer}) self.stream.unget(data) self.state = self.rawtextState return True def scriptDataLessThanSignState(self): data = self.stream.char() if data == "/": self.temporaryBuffer = "" self.state = self.scriptDataEndTagOpenState elif data == "!": self.tokenQueue.append({"type": tokenTypes["Characters"], "data": "<!"}) self.state = self.scriptDataEscapeStartState else: self.tokenQueue.append({"type": tokenTypes["Characters"], "data": "<"}) self.stream.unget(data) self.state = self.scriptDataState return True def scriptDataEndTagOpenState(self): data = self.stream.char() if data in asciiLetters: self.temporaryBuffer += data self.state = self.scriptDataEndTagNameState else: self.tokenQueue.append({"type": tokenTypes["Characters"], "data": "</"}) self.stream.unget(data) self.state = self.scriptDataState return True def scriptDataEndTagNameState(self): appropriate = self.currentToken and self.currentToken["name"].lower() == self.temporaryBuffer.lower() data = self.stream.char() if data in spaceCharacters and appropriate: self.currentToken = {"type": tokenTypes["EndTag"], "name": self.temporaryBuffer, "data": [], "selfClosing": False} self.state = self.beforeAttributeNameState elif data == "/" and appropriate: self.currentToken = {"type": tokenTypes["EndTag"], "name": self.temporaryBuffer, "data": [], "selfClosing": False} self.state = self.selfClosingStartTagState elif data == ">" and appropriate: self.currentToken = {"type": tokenTypes["EndTag"], "name": self.temporaryBuffer, "data": [], "selfClosing": False} self.emitCurrentToken() self.state = self.dataState elif data in asciiLetters: self.temporaryBuffer += data else: self.tokenQueue.append({"type": tokenTypes["Characters"], "data": "</" + self.temporaryBuffer}) self.stream.unget(data) self.state = self.scriptDataState return True def scriptDataEscapeStartState(self): data = self.stream.char() if data == "-": self.tokenQueue.append({"type": tokenTypes["Characters"], "data": "-"}) self.state = self.scriptDataEscapeStartDashState else: self.stream.unget(data) self.state = self.scriptDataState return True def scriptDataEscapeStartDashState(self): data = self.stream.char() if data == "-": self.tokenQueue.append({"type": tokenTypes["Characters"], "data": "-"}) self.state = self.scriptDataEscapedDashDashState else: self.stream.unget(data) self.state = self.scriptDataState return True def scriptDataEscapedState(self): data = self.stream.char() if data == "-": self.tokenQueue.append({"type": tokenTypes["Characters"], "data": "-"}) self.state = self.scriptDataEscapedDashState elif data == "<": self.state = self.scriptDataEscapedLessThanSignState elif data == "\u0000": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "invalid-codepoint"}) self.tokenQueue.append({"type": tokenTypes["Characters"], "data": "\uFFFD"}) elif data == EOF: self.state = self.dataState else: chars = self.stream.charsUntil(("<", "-", "\u0000")) self.tokenQueue.append({"type": tokenTypes["Characters"], "data": data + chars}) return True def scriptDataEscapedDashState(self): data = self.stream.char() if data == "-": self.tokenQueue.append({"type": tokenTypes["Characters"], "data": "-"}) self.state = self.scriptDataEscapedDashDashState elif data == "<": self.state = self.scriptDataEscapedLessThanSignState elif data == "\u0000": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "invalid-codepoint"}) self.tokenQueue.append({"type": tokenTypes["Characters"], "data": "\uFFFD"}) self.state = self.scriptDataEscapedState elif data == EOF: self.state = self.dataState else: self.tokenQueue.append({"type": tokenTypes["Characters"], "data": data}) self.state = self.scriptDataEscapedState return True def scriptDataEscapedDashDashState(self): data = self.stream.char() if data == "-": self.tokenQueue.append({"type": tokenTypes["Characters"], "data": "-"}) elif data == "<": self.state = self.scriptDataEscapedLessThanSignState elif data == ">": self.tokenQueue.append({"type": tokenTypes["Characters"], "data": ">"}) self.state = self.scriptDataState elif data == "\u0000": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "invalid-codepoint"}) self.tokenQueue.append({"type": tokenTypes["Characters"], "data": "\uFFFD"}) self.state = self.scriptDataEscapedState elif data == EOF: self.state = self.dataState else: self.tokenQueue.append({"type": tokenTypes["Characters"], "data": data}) self.state = self.scriptDataEscapedState return True def scriptDataEscapedLessThanSignState(self): data = self.stream.char() if data == "/": self.temporaryBuffer = "" self.state = self.scriptDataEscapedEndTagOpenState elif data in asciiLetters: self.tokenQueue.append({"type": tokenTypes["Characters"], "data": "<" + data}) self.temporaryBuffer = data self.state = self.scriptDataDoubleEscapeStartState else: self.tokenQueue.append({"type": tokenTypes["Characters"], "data": "<"}) self.stream.unget(data) self.state = self.scriptDataEscapedState return True def scriptDataEscapedEndTagOpenState(self): data = self.stream.char() if data in asciiLetters: self.temporaryBuffer = data self.state = self.scriptDataEscapedEndTagNameState else: self.tokenQueue.append({"type": tokenTypes["Characters"], "data": "</"}) self.stream.unget(data) self.state = self.scriptDataEscapedState return True def scriptDataEscapedEndTagNameState(self): appropriate = self.currentToken and self.currentToken["name"].lower() == self.temporaryBuffer.lower() data = self.stream.char() if data in spaceCharacters and appropriate: self.currentToken = {"type": tokenTypes["EndTag"], "name": self.temporaryBuffer, "data": [], "selfClosing": False} self.state = self.beforeAttributeNameState elif data == "/" and appropriate: self.currentToken = {"type": tokenTypes["EndTag"], "name": self.temporaryBuffer, "data": [], "selfClosing": False} self.state = self.selfClosingStartTagState elif data == ">" and appropriate: self.currentToken = {"type": tokenTypes["EndTag"], "name": self.temporaryBuffer, "data": [], "selfClosing": False} self.emitCurrentToken() self.state = self.dataState elif data in asciiLetters: self.temporaryBuffer += data else: self.tokenQueue.append({"type": tokenTypes["Characters"], "data": "</" + self.temporaryBuffer}) self.stream.unget(data) self.state = self.scriptDataEscapedState return True def scriptDataDoubleEscapeStartState(self): data = self.stream.char() if data in (spaceCharacters | frozenset(("/", ">"))): self.tokenQueue.append({"type": tokenTypes["Characters"], "data": data}) if self.temporaryBuffer.lower() == "script": self.state = self.scriptDataDoubleEscapedState else: self.state = self.scriptDataEscapedState elif data in asciiLetters: self.tokenQueue.append({"type": tokenTypes["Characters"], "data": data}) self.temporaryBuffer += data else: self.stream.unget(data) self.state = self.scriptDataEscapedState return True def scriptDataDoubleEscapedState(self): data = self.stream.char() if data == "-": self.tokenQueue.append({"type": tokenTypes["Characters"], "data": "-"}) self.state = self.scriptDataDoubleEscapedDashState elif data == "<": self.tokenQueue.append({"type": tokenTypes["Characters"], "data": "<"}) self.state = self.scriptDataDoubleEscapedLessThanSignState elif data == "\u0000": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "invalid-codepoint"}) self.tokenQueue.append({"type": tokenTypes["Characters"], "data": "\uFFFD"}) elif data == EOF: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "eof-in-script-in-script"}) self.state = self.dataState else: self.tokenQueue.append({"type": tokenTypes["Characters"], "data": data}) return True def scriptDataDoubleEscapedDashState(self): data = self.stream.char() if data == "-": self.tokenQueue.append({"type": tokenTypes["Characters"], "data": "-"}) self.state = self.scriptDataDoubleEscapedDashDashState elif data == "<": self.tokenQueue.append({"type": tokenTypes["Characters"], "data": "<"}) self.state = self.scriptDataDoubleEscapedLessThanSignState elif data == "\u0000": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "invalid-codepoint"}) self.tokenQueue.append({"type": tokenTypes["Characters"], "data": "\uFFFD"}) self.state = self.scriptDataDoubleEscapedState elif data == EOF: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "eof-in-script-in-script"}) self.state = self.dataState else: self.tokenQueue.append({"type": tokenTypes["Characters"], "data": data}) self.state = self.scriptDataDoubleEscapedState return True def scriptDataDoubleEscapedDashDashState(self): data = self.stream.char() if data == "-": self.tokenQueue.append({"type": tokenTypes["Characters"], "data": "-"}) elif data == "<": self.tokenQueue.append({"type": tokenTypes["Characters"], "data": "<"}) self.state = self.scriptDataDoubleEscapedLessThanSignState elif data == ">": self.tokenQueue.append({"type": tokenTypes["Characters"], "data": ">"}) self.state = self.scriptDataState elif data == "\u0000": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "invalid-codepoint"}) self.tokenQueue.append({"type": tokenTypes["Characters"], "data": "\uFFFD"}) self.state = self.scriptDataDoubleEscapedState elif data == EOF: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "eof-in-script-in-script"}) self.state = self.dataState else: self.tokenQueue.append({"type": tokenTypes["Characters"], "data": data}) self.state = self.scriptDataDoubleEscapedState return True def scriptDataDoubleEscapedLessThanSignState(self): data = self.stream.char() if data == "/": self.tokenQueue.append({"type": tokenTypes["Characters"], "data": "/"}) self.temporaryBuffer = "" self.state = self.scriptDataDoubleEscapeEndState else: self.stream.unget(data) self.state = self.scriptDataDoubleEscapedState return True def scriptDataDoubleEscapeEndState(self): data = self.stream.char() if data in (spaceCharacters | frozenset(("/", ">"))): self.tokenQueue.append({"type": tokenTypes["Characters"], "data": data}) if self.temporaryBuffer.lower() == "script": self.state = self.scriptDataEscapedState else: self.state = self.scriptDataDoubleEscapedState elif data in asciiLetters: self.tokenQueue.append({"type": tokenTypes["Characters"], "data": data}) self.temporaryBuffer += data else: self.stream.unget(data) self.state = self.scriptDataDoubleEscapedState return True def beforeAttributeNameState(self): data = self.stream.char() if data in spaceCharacters: self.stream.charsUntil(spaceCharacters, True) elif data in asciiLetters: self.currentToken["data"].append([data, ""]) self.state = self.attributeNameState elif data == ">": self.emitCurrentToken() elif data == "/": self.state = self.selfClosingStartTagState elif data in ("'", '"', "=", "<"): self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "invalid-character-in-attribute-name"}) self.currentToken["data"].append([data, ""]) self.state = self.attributeNameState elif data == "\u0000": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "invalid-codepoint"}) self.currentToken["data"].append(["\uFFFD", ""]) self.state = self.attributeNameState elif data is EOF: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "expected-attribute-name-but-got-eof"}) self.state = self.dataState else: self.currentToken["data"].append([data, ""]) self.state = self.attributeNameState return True def attributeNameState(self): data = self.stream.char() leavingThisState = True emitToken = False if data == "=": self.state = self.beforeAttributeValueState elif data in asciiLetters: self.currentToken["data"][-1][0] += data +\ self.stream.charsUntil(asciiLetters, True) leavingThisState = False elif data == ">": # XXX If we emit here the attributes are converted to a dict # without being checked and when the code below runs we error # because data is a dict not a list emitToken = True elif data in spaceCharacters: self.state = self.afterAttributeNameState elif data == "/": self.state = self.selfClosingStartTagState elif data == "\u0000": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "invalid-codepoint"}) self.currentToken["data"][-1][0] += "\uFFFD" leavingThisState = False elif data in ("'", '"', "<"): self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "invalid-character-in-attribute-name"}) self.currentToken["data"][-1][0] += data leavingThisState = False elif data is EOF: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "eof-in-attribute-name"}) self.state = self.dataState else: self.currentToken["data"][-1][0] += data leavingThisState = False if leavingThisState: # Attributes are not dropped at this stage. That happens when the # start tag token is emitted so values can still be safely appended # to attributes, but we do want to report the parse error in time. self.currentToken["data"][-1][0] = ( self.currentToken["data"][-1][0].translate(asciiUpper2Lower)) for name, _ in self.currentToken["data"][:-1]: if self.currentToken["data"][-1][0] == name: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "duplicate-attribute"}) break # XXX Fix for above XXX if emitToken: self.emitCurrentToken() return True def afterAttributeNameState(self): data = self.stream.char() if data in spaceCharacters: self.stream.charsUntil(spaceCharacters, True) elif data == "=": self.state = self.beforeAttributeValueState elif data == ">": self.emitCurrentToken() elif data in asciiLetters: self.currentToken["data"].append([data, ""]) self.state = self.attributeNameState elif data == "/": self.state = self.selfClosingStartTagState elif data == "\u0000": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "invalid-codepoint"}) self.currentToken["data"].append(["\uFFFD", ""]) self.state = self.attributeNameState elif data in ("'", '"', "<"): self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "invalid-character-after-attribute-name"}) self.currentToken["data"].append([data, ""]) self.state = self.attributeNameState elif data is EOF: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "expected-end-of-tag-but-got-eof"}) self.state = self.dataState else: self.currentToken["data"].append([data, ""]) self.state = self.attributeNameState return True def beforeAttributeValueState(self): data = self.stream.char() if data in spaceCharacters: self.stream.charsUntil(spaceCharacters, True) elif data == "\"": self.state = self.attributeValueDoubleQuotedState elif data == "&": self.state = self.attributeValueUnQuotedState self.stream.unget(data) elif data == "'": self.state = self.attributeValueSingleQuotedState elif data == ">": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "expected-attribute-value-but-got-right-bracket"}) self.emitCurrentToken() elif data == "\u0000": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "invalid-codepoint"}) self.currentToken["data"][-1][1] += "\uFFFD" self.state = self.attributeValueUnQuotedState elif data in ("=", "<", "`"): self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "equals-in-unquoted-attribute-value"}) self.currentToken["data"][-1][1] += data self.state = self.attributeValueUnQuotedState elif data is EOF: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "expected-attribute-value-but-got-eof"}) self.state = self.dataState else: self.currentToken["data"][-1][1] += data self.state = self.attributeValueUnQuotedState return True def attributeValueDoubleQuotedState(self): data = self.stream.char() if data == "\"": self.state = self.afterAttributeValueState elif data == "&": self.processEntityInAttribute('"') elif data == "\u0000": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "invalid-codepoint"}) self.currentToken["data"][-1][1] += "\uFFFD" elif data is EOF: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "eof-in-attribute-value-double-quote"}) self.state = self.dataState else: self.currentToken["data"][-1][1] += data +\ self.stream.charsUntil(("\"", "&", "\u0000")) return True def attributeValueSingleQuotedState(self): data = self.stream.char() if data == "'": self.state = self.afterAttributeValueState elif data == "&": self.processEntityInAttribute("'") elif data == "\u0000": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "invalid-codepoint"}) self.currentToken["data"][-1][1] += "\uFFFD" elif data is EOF: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "eof-in-attribute-value-single-quote"}) self.state = self.dataState else: self.currentToken["data"][-1][1] += data +\ self.stream.charsUntil(("'", "&", "\u0000")) return True def attributeValueUnQuotedState(self): data = self.stream.char() if data in spaceCharacters: self.state = self.beforeAttributeNameState elif data == "&": self.processEntityInAttribute(">") elif data == ">": self.emitCurrentToken() elif data in ('"', "'", "=", "<", "`"): self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "unexpected-character-in-unquoted-attribute-value"}) self.currentToken["data"][-1][1] += data elif data == "\u0000": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "invalid-codepoint"}) self.currentToken["data"][-1][1] += "\uFFFD" elif data is EOF: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "eof-in-attribute-value-no-quotes"}) self.state = self.dataState else: self.currentToken["data"][-1][1] += data + self.stream.charsUntil( frozenset(("&", ">", '"', "'", "=", "<", "`", "\u0000")) | spaceCharacters) return True def afterAttributeValueState(self): data = self.stream.char() if data in spaceCharacters: self.state = self.beforeAttributeNameState elif data == ">": self.emitCurrentToken() elif data == "/": self.state = self.selfClosingStartTagState elif data is EOF: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "unexpected-EOF-after-attribute-value"}) self.stream.unget(data) self.state = self.dataState else: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "unexpected-character-after-attribute-value"}) self.stream.unget(data) self.state = self.beforeAttributeNameState return True def selfClosingStartTagState(self): data = self.stream.char() if data == ">": self.currentToken["selfClosing"] = True self.emitCurrentToken() elif data is EOF: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "unexpected-EOF-after-solidus-in-tag"}) self.stream.unget(data) self.state = self.dataState else: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "unexpected-character-after-solidus-in-tag"}) self.stream.unget(data) self.state = self.beforeAttributeNameState return True def bogusCommentState(self): # Make a new comment token and give it as value all the characters # until the first > or EOF (charsUntil checks for EOF automatically) # and emit it. data = self.stream.charsUntil(">") data = data.replace("\u0000", "\uFFFD") self.tokenQueue.append( {"type": tokenTypes["Comment"], "data": data}) # Eat the character directly after the bogus comment which is either a # ">" or an EOF. self.stream.char() self.state = self.dataState return True def markupDeclarationOpenState(self): charStack = [self.stream.char()] if charStack[-1] == "-": charStack.append(self.stream.char()) if charStack[-1] == "-": self.currentToken = {"type": tokenTypes["Comment"], "data": ""} self.state = self.commentStartState return True elif charStack[-1] in ('d', 'D'): matched = True for expected in (('o', 'O'), ('c', 'C'), ('t', 'T'), ('y', 'Y'), ('p', 'P'), ('e', 'E')): charStack.append(self.stream.char()) if charStack[-1] not in expected: matched = False break if matched: self.currentToken = {"type": tokenTypes["Doctype"], "name": "", "publicId": None, "systemId": None, "correct": True} self.state = self.doctypeState return True elif (charStack[-1] == "[" and self.parser is not None and self.parser.tree.openElements and self.parser.tree.openElements[-1].namespace != self.parser.tree.defaultNamespace): matched = True for expected in ["C", "D", "A", "T", "A", "["]: charStack.append(self.stream.char()) if charStack[-1] != expected: matched = False break if matched: self.state = self.cdataSectionState return True self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "expected-dashes-or-doctype"}) while charStack: self.stream.unget(charStack.pop()) self.state = self.bogusCommentState return True def commentStartState(self): data = self.stream.char() if data == "-": self.state = self.commentStartDashState elif data == "\u0000": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "invalid-codepoint"}) self.currentToken["data"] += "\uFFFD" elif data == ">": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "incorrect-comment"}) self.tokenQueue.append(self.currentToken) self.state = self.dataState elif data is EOF: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "eof-in-comment"}) self.tokenQueue.append(self.currentToken) self.state = self.dataState else: self.currentToken["data"] += data self.state = self.commentState return True def commentStartDashState(self): data = self.stream.char() if data == "-": self.state = self.commentEndState elif data == "\u0000": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "invalid-codepoint"}) self.currentToken["data"] += "-\uFFFD" elif data == ">": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "incorrect-comment"}) self.tokenQueue.append(self.currentToken) self.state = self.dataState elif data is EOF: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "eof-in-comment"}) self.tokenQueue.append(self.currentToken) self.state = self.dataState else: self.currentToken["data"] += "-" + data self.state = self.commentState return True def commentState(self): data = self.stream.char() if data == "-": self.state = self.commentEndDashState elif data == "\u0000": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "invalid-codepoint"}) self.currentToken["data"] += "\uFFFD" elif data is EOF: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "eof-in-comment"}) self.tokenQueue.append(self.currentToken) self.state = self.dataState else: self.currentToken["data"] += data + \ self.stream.charsUntil(("-", "\u0000")) return True def commentEndDashState(self): data = self.stream.char() if data == "-": self.state = self.commentEndState elif data == "\u0000": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "invalid-codepoint"}) self.currentToken["data"] += "-\uFFFD" self.state = self.commentState elif data is EOF: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "eof-in-comment-end-dash"}) self.tokenQueue.append(self.currentToken) self.state = self.dataState else: self.currentToken["data"] += "-" + data self.state = self.commentState return True def commentEndState(self): data = self.stream.char() if data == ">": self.tokenQueue.append(self.currentToken) self.state = self.dataState elif data == "\u0000": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "invalid-codepoint"}) self.currentToken["data"] += "--\uFFFD" self.state = self.commentState elif data == "!": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "unexpected-bang-after-double-dash-in-comment"}) self.state = self.commentEndBangState elif data == "-": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "unexpected-dash-after-double-dash-in-comment"}) self.currentToken["data"] += data elif data is EOF: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "eof-in-comment-double-dash"}) self.tokenQueue.append(self.currentToken) self.state = self.dataState else: # XXX self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "unexpected-char-in-comment"}) self.currentToken["data"] += "--" + data self.state = self.commentState return True def commentEndBangState(self): data = self.stream.char() if data == ">": self.tokenQueue.append(self.currentToken) self.state = self.dataState elif data == "-": self.currentToken["data"] += "--!" self.state = self.commentEndDashState elif data == "\u0000": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "invalid-codepoint"}) self.currentToken["data"] += "--!\uFFFD" self.state = self.commentState elif data is EOF: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "eof-in-comment-end-bang-state"}) self.tokenQueue.append(self.currentToken) self.state = self.dataState else: self.currentToken["data"] += "--!" + data self.state = self.commentState return True def doctypeState(self): data = self.stream.char() if data in spaceCharacters: self.state = self.beforeDoctypeNameState elif data is EOF: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "expected-doctype-name-but-got-eof"}) self.currentToken["correct"] = False self.tokenQueue.append(self.currentToken) self.state = self.dataState else: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "need-space-after-doctype"}) self.stream.unget(data) self.state = self.beforeDoctypeNameState return True def beforeDoctypeNameState(self): data = self.stream.char() if data in spaceCharacters: pass elif data == ">": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "expected-doctype-name-but-got-right-bracket"}) self.currentToken["correct"] = False self.tokenQueue.append(self.currentToken) self.state = self.dataState elif data == "\u0000": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "invalid-codepoint"}) self.currentToken["name"] = "\uFFFD" self.state = self.doctypeNameState elif data is EOF: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "expected-doctype-name-but-got-eof"}) self.currentToken["correct"] = False self.tokenQueue.append(self.currentToken) self.state = self.dataState else: self.currentToken["name"] = data self.state = self.doctypeNameState return True def doctypeNameState(self): data = self.stream.char() if data in spaceCharacters: self.currentToken["name"] = self.currentToken["name"].translate(asciiUpper2Lower) self.state = self.afterDoctypeNameState elif data == ">": self.currentToken["name"] = self.currentToken["name"].translate(asciiUpper2Lower) self.tokenQueue.append(self.currentToken) self.state = self.dataState elif data == "\u0000": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "invalid-codepoint"}) self.currentToken["name"] += "\uFFFD" self.state = self.doctypeNameState elif data is EOF: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "eof-in-doctype-name"}) self.currentToken["correct"] = False self.currentToken["name"] = self.currentToken["name"].translate(asciiUpper2Lower) self.tokenQueue.append(self.currentToken) self.state = self.dataState else: self.currentToken["name"] += data return True def afterDoctypeNameState(self): data = self.stream.char() if data in spaceCharacters: pass elif data == ">": self.tokenQueue.append(self.currentToken) self.state = self.dataState elif data is EOF: self.currentToken["correct"] = False self.stream.unget(data) self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "eof-in-doctype"}) self.tokenQueue.append(self.currentToken) self.state = self.dataState else: if data in ("p", "P"): matched = True for expected in (("u", "U"), ("b", "B"), ("l", "L"), ("i", "I"), ("c", "C")): data = self.stream.char() if data not in expected: matched = False break if matched: self.state = self.afterDoctypePublicKeywordState return True elif data in ("s", "S"): matched = True for expected in (("y", "Y"), ("s", "S"), ("t", "T"), ("e", "E"), ("m", "M")): data = self.stream.char() if data not in expected: matched = False break if matched: self.state = self.afterDoctypeSystemKeywordState return True # All the characters read before the current 'data' will be # [a-zA-Z], so they're garbage in the bogus doctype and can be # discarded; only the latest character might be '>' or EOF # and needs to be ungetted self.stream.unget(data) self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "expected-space-or-right-bracket-in-doctype", "datavars": {"data": data}}) self.currentToken["correct"] = False self.state = self.bogusDoctypeState return True def afterDoctypePublicKeywordState(self): data = self.stream.char() if data in spaceCharacters: self.state = self.beforeDoctypePublicIdentifierState elif data in ("'", '"'): self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "unexpected-char-in-doctype"}) self.stream.unget(data) self.state = self.beforeDoctypePublicIdentifierState elif data is EOF: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "eof-in-doctype"}) self.currentToken["correct"] = False self.tokenQueue.append(self.currentToken) self.state = self.dataState else: self.stream.unget(data) self.state = self.beforeDoctypePublicIdentifierState return True def beforeDoctypePublicIdentifierState(self): data = self.stream.char() if data in spaceCharacters: pass elif data == "\"": self.currentToken["publicId"] = "" self.state = self.doctypePublicIdentifierDoubleQuotedState elif data == "'": self.currentToken["publicId"] = "" self.state = self.doctypePublicIdentifierSingleQuotedState elif data == ">": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "unexpected-end-of-doctype"}) self.currentToken["correct"] = False self.tokenQueue.append(self.currentToken) self.state = self.dataState elif data is EOF: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "eof-in-doctype"}) self.currentToken["correct"] = False self.tokenQueue.append(self.currentToken) self.state = self.dataState else: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "unexpected-char-in-doctype"}) self.currentToken["correct"] = False self.state = self.bogusDoctypeState return True def doctypePublicIdentifierDoubleQuotedState(self): data = self.stream.char() if data == "\"": self.state = self.afterDoctypePublicIdentifierState elif data == "\u0000": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "invalid-codepoint"}) self.currentToken["publicId"] += "\uFFFD" elif data == ">": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "unexpected-end-of-doctype"}) self.currentToken["correct"] = False self.tokenQueue.append(self.currentToken) self.state = self.dataState elif data is EOF: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "eof-in-doctype"}) self.currentToken["correct"] = False self.tokenQueue.append(self.currentToken) self.state = self.dataState else: self.currentToken["publicId"] += data return True def doctypePublicIdentifierSingleQuotedState(self): data = self.stream.char() if data == "'": self.state = self.afterDoctypePublicIdentifierState elif data == "\u0000": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "invalid-codepoint"}) self.currentToken["publicId"] += "\uFFFD" elif data == ">": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "unexpected-end-of-doctype"}) self.currentToken["correct"] = False self.tokenQueue.append(self.currentToken) self.state = self.dataState elif data is EOF: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "eof-in-doctype"}) self.currentToken["correct"] = False self.tokenQueue.append(self.currentToken) self.state = self.dataState else: self.currentToken["publicId"] += data return True def afterDoctypePublicIdentifierState(self): data = self.stream.char() if data in spaceCharacters: self.state = self.betweenDoctypePublicAndSystemIdentifiersState elif data == ">": self.tokenQueue.append(self.currentToken) self.state = self.dataState elif data == '"': self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "unexpected-char-in-doctype"}) self.currentToken["systemId"] = "" self.state = self.doctypeSystemIdentifierDoubleQuotedState elif data == "'": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "unexpected-char-in-doctype"}) self.currentToken["systemId"] = "" self.state = self.doctypeSystemIdentifierSingleQuotedState elif data is EOF: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "eof-in-doctype"}) self.currentToken["correct"] = False self.tokenQueue.append(self.currentToken) self.state = self.dataState else: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "unexpected-char-in-doctype"}) self.currentToken["correct"] = False self.state = self.bogusDoctypeState return True def betweenDoctypePublicAndSystemIdentifiersState(self): data = self.stream.char() if data in spaceCharacters: pass elif data == ">": self.tokenQueue.append(self.currentToken) self.state = self.dataState elif data == '"': self.currentToken["systemId"] = "" self.state = self.doctypeSystemIdentifierDoubleQuotedState elif data == "'": self.currentToken["systemId"] = "" self.state = self.doctypeSystemIdentifierSingleQuotedState elif data == EOF: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "eof-in-doctype"}) self.currentToken["correct"] = False self.tokenQueue.append(self.currentToken) self.state = self.dataState else: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "unexpected-char-in-doctype"}) self.currentToken["correct"] = False self.state = self.bogusDoctypeState return True def afterDoctypeSystemKeywordState(self): data = self.stream.char() if data in spaceCharacters: self.state = self.beforeDoctypeSystemIdentifierState elif data in ("'", '"'): self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "unexpected-char-in-doctype"}) self.stream.unget(data) self.state = self.beforeDoctypeSystemIdentifierState elif data is EOF: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "eof-in-doctype"}) self.currentToken["correct"] = False self.tokenQueue.append(self.currentToken) self.state = self.dataState else: self.stream.unget(data) self.state = self.beforeDoctypeSystemIdentifierState return True def beforeDoctypeSystemIdentifierState(self): data = self.stream.char() if data in spaceCharacters: pass elif data == "\"": self.currentToken["systemId"] = "" self.state = self.doctypeSystemIdentifierDoubleQuotedState elif data == "'": self.currentToken["systemId"] = "" self.state = self.doctypeSystemIdentifierSingleQuotedState elif data == ">": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "unexpected-char-in-doctype"}) self.currentToken["correct"] = False self.tokenQueue.append(self.currentToken) self.state = self.dataState elif data is EOF: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "eof-in-doctype"}) self.currentToken["correct"] = False self.tokenQueue.append(self.currentToken) self.state = self.dataState else: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "unexpected-char-in-doctype"}) self.currentToken["correct"] = False self.state = self.bogusDoctypeState return True def doctypeSystemIdentifierDoubleQuotedState(self): data = self.stream.char() if data == "\"": self.state = self.afterDoctypeSystemIdentifierState elif data == "\u0000": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "invalid-codepoint"}) self.currentToken["systemId"] += "\uFFFD" elif data == ">": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "unexpected-end-of-doctype"}) self.currentToken["correct"] = False self.tokenQueue.append(self.currentToken) self.state = self.dataState elif data is EOF: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "eof-in-doctype"}) self.currentToken["correct"] = False self.tokenQueue.append(self.currentToken) self.state = self.dataState else: self.currentToken["systemId"] += data return True def doctypeSystemIdentifierSingleQuotedState(self): data = self.stream.char() if data == "'": self.state = self.afterDoctypeSystemIdentifierState elif data == "\u0000": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "invalid-codepoint"}) self.currentToken["systemId"] += "\uFFFD" elif data == ">": self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "unexpected-end-of-doctype"}) self.currentToken["correct"] = False self.tokenQueue.append(self.currentToken) self.state = self.dataState elif data is EOF: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "eof-in-doctype"}) self.currentToken["correct"] = False self.tokenQueue.append(self.currentToken) self.state = self.dataState else: self.currentToken["systemId"] += data return True def afterDoctypeSystemIdentifierState(self): data = self.stream.char() if data in spaceCharacters: pass elif data == ">": self.tokenQueue.append(self.currentToken) self.state = self.dataState elif data is EOF: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "eof-in-doctype"}) self.currentToken["correct"] = False self.tokenQueue.append(self.currentToken) self.state = self.dataState else: self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "unexpected-char-in-doctype"}) self.state = self.bogusDoctypeState return True def bogusDoctypeState(self): data = self.stream.char() if data == ">": self.tokenQueue.append(self.currentToken) self.state = self.dataState elif data is EOF: # XXX EMIT self.stream.unget(data) self.tokenQueue.append(self.currentToken) self.state = self.dataState else: pass return True def cdataSectionState(self): data = [] while True: data.append(self.stream.charsUntil("]")) data.append(self.stream.charsUntil(">")) char = self.stream.char() if char == EOF: break else: assert char == ">" if data[-1][-2:] == "]]": data[-1] = data[-1][:-2] break else: data.append(char) data = "".join(data) # pylint:disable=redefined-variable-type # Deal with null here rather than in the parser nullCount = data.count("\u0000") if nullCount > 0: for _ in range(nullCount): self.tokenQueue.append({"type": tokenTypes["ParseError"], "data": "invalid-codepoint"}) data = data.replace("\u0000", "\uFFFD") if data: self.tokenQueue.append({"type": tokenTypes["Characters"], "data": data}) self.state = self.dataState return True