Cork encoding

The Cork encoding (also known as T1 or EC) is a character encoding used for encoding glyphs in fonts.^[1] It is named after the city of Cork in Ireland, where a new encoding for LaTeX was introduced at a TeX Users Group (TUG) conference in 1990.^[1] It contains 256 characters and supports most Western and Eastern European languages that use the Latin alphabet.^[2]

Details

In 8-bit TeX engines the font encoding has to match the encoding of the hyphenation patterns, which is where the Cork encoding is most commonly used.^[3] In LaTeX one can switch to this encoding with \usepackage[T1]{fontenc}, while in ConTeXt MkII it is already the default encoding. In modern engines such as XeTeX and LuaTeX, Unicode is fully supported and the 8-bit font encodings are obsolete.

Character set

Cork encoding
	0	1	2	3	4	5	6	7	8	9	A	B	C	D	E	F
0x	` 0060	´ 00B4	ˆ 02C6	˜ 02DC	¨ 00A8	˝ 02DD	˚ 02DA	ˇ 02C7	˘ 02D8	¯ 00AF	˙ 02D9	¸ 00B8	˛ 02DB	‚ 201A	‹ 2039	› 203A
1x	“ 201C	” 201D	„ 201E	« 00AB	» 00BB	– 2013	— 2014	ZWSP^[a] 200B	₀^[b] 2080	ı^[c] 0131	ȷ^[c] 0237	ﬀ FB00	ﬁ FB01	ﬂ FB02	ﬃ FB03	ﬄ FB04
2x	␣ 2423	!	"	#	$	%	&	’ 2019	(	)	*	+	,	-	.	/
3x	0	1	2	3	4	5	6	7	8	9	:	;	<	=	>	?
4x	@	A	B	C	D	E	F	G	H	I	J	K	L	M	N	O
5x	P	Q	R	S	T	U	V	W	X	Y	Z	[	\	]	^	_
6x	‘ 2018	a	b	c	d	e	f	g	h	i	j	k	l	m	n	o
7x	p	q	r	s	t	u	v	w	x	y	z	{	|	}	~	SHY^[d]
8x	Ă 0102	Ą 0104	Ć 0106	Č 010C	Ď 010E	Ě 011A	Ę 0118	Ğ 011E	Ĺ 0139	Ľ 013D	Ł 0141	Ń 0143	Ň 0147	Ŋ 014A	Ő 0150	Ŕ 0154
9x	Ř 0158	Ś 015A	Š 0160	Ş 015E	Ť 0164	Ţ 0162	Ű 0170	Ů 016E	Ÿ 0178	Ź 0179	Ž 017D	Ż 017B	Ĳ 0132	İ 0130	đ 0111	§ 00A7
Ax	ă 0103	ą 0105	ć 0107	č 010D	ď 010F	ě 011B	ę 0119	ğ 011F	ĺ 013A	ľ 013E	ł 0142	ń 0144	ň 0148	ŋ 014B	ő 0151	ŕ 0155
Bx	ř 0159	ś 015B	š 0161	ş 015F	ť 0165	ţ 0163	ű 0171	ů 016F	ÿ 00FF	ź 017A	ž 017E	ż 017C	ĳ 0133	¡ 00A1	¿ 00BF	£ 00A3
Cx	À	Á	Â	Ã	Ä	Å	Æ	Ç	È	É	Ê	Ë	Ì	Í	Î	Ï
Dx	Ð^[e]	Ñ	Ò	Ó	Ô	Õ	Ö	Œ 0152	Ø	Ù	Ú	Û	Ü	Ý	Þ	SS^[f] 1E9E
Ex	à	á	â	ã	ä	å	æ	ç	è	é	ê	ë	ì	í	î	ï
Fx	ð	ñ	ò	ó	ô	õ	ö	œ 0153	ø	ù	ú	û	ü	ý	þ	ß 00DF

Notes

Hexadecimal values under the characters in the table are the Unicode character codes.
The first 12 characters are often used as combining characters.

[a]
0x17 is dubbed a “compound word mark” (CWM) in the Cork encoding, and is an innovation of this standard. It is an invisible character that separates compounds in a complex word, for instance in German, in order to disallow esthetic ligatures at compound boundaries.^[2] It is mapped to the Unicode “zero-width space” (ZWSP, U+200B), defined at about the same time, whose purpose is similar, if not identical.
[b]
0x18 is a “small o”, used to compose ‰ or ‱ (or arbitrarily smaller quantities) from the percent sign (%).^[2]
[c]
Dotless i and dotless j may be used to compose accented variants like i with macron (ī).
[d]
0x7F is the hyphenation character, not really a soft hyphen (SHY) as defined by Unicode.
[e]
0xD0 is used both as Eth (Ð, U+00D0) and as D with stroke (Đ, U+0110), which can cause problems in some situations (such as copying text from a PDF or hyphenation).
[f]
0xDF contains an uppercase variant of ß. Traditionally, it was SS (two letters S), but some newer fonts may use ẞ.^[4] It allows TeX to automatically convert the German lowercase ß into the uppercase form.
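
The table and notes above can be read as a byte-to-code-point mapping. The following is a minimal sketch of such a decoder in Python, using only the code points listed in the table. The names LOW, MID, ASCII_OVERRIDES, HIGH_OVERRIDES and cork_to_unicode are illustrative and not part of any library. Mapping 0x7F to U+00AD follows the table's SHY label and is an assumption, since note [d] points out that the slot is not really a soft hyphen.

```python
# Sketch of a Cork (T1) byte -> Unicode decoder built from the table above.
# All names here are illustrative; this is not an existing library API.

# Rows 0x and 1x: accents, quotes, dashes, special marks and ligatures.
LOW = [
    0x0060, 0x00B4, 0x02C6, 0x02DC, 0x00A8, 0x02DD, 0x02DA, 0x02C7,
    0x02D8, 0x00AF, 0x02D9, 0x00B8, 0x02DB, 0x201A, 0x2039, 0x203A,
    0x201C, 0x201D, 0x201E, 0x00AB, 0x00BB, 0x2013, 0x2014, 0x200B,
    0x2080, 0x0131, 0x0237, 0xFB00, 0xFB01, 0xFB02, 0xFB03, 0xFB04,
]

# Rows 8x to Bx: mostly Central and Eastern European letters.
MID = [
    0x0102, 0x0104, 0x0106, 0x010C, 0x010E, 0x011A, 0x0118, 0x011E,
    0x0139, 0x013D, 0x0141, 0x0143, 0x0147, 0x014A, 0x0150, 0x0154,
    0x0158, 0x015A, 0x0160, 0x015E, 0x0164, 0x0162, 0x0170, 0x016E,
    0x0178, 0x0179, 0x017D, 0x017B, 0x0132, 0x0130, 0x0111, 0x00A7,
    0x0103, 0x0105, 0x0107, 0x010D, 0x010F, 0x011B, 0x0119, 0x011F,
    0x013A, 0x013E, 0x0142, 0x0144, 0x0148, 0x014B, 0x0151, 0x0155,
    0x0159, 0x015B, 0x0161, 0x015F, 0x0165, 0x0163, 0x0171, 0x016F,
    0x00FF, 0x017A, 0x017E, 0x017C, 0x0133, 0x00A1, 0x00BF, 0x00A3,
]

# Slots in rows 2x to 7x that differ from ASCII.
# 0x7F -> U+00AD is an assumption based on the table's "SHY" label.
ASCII_OVERRIDES = {0x20: 0x2423, 0x27: 0x2019, 0x60: 0x2018, 0x7F: 0x00AD}

# Slots in rows Cx to Fx that differ from the same positions in Latin-1.
HIGH_OVERRIDES = {0xD7: 0x0152, 0xDF: 0x1E9E, 0xF7: 0x0153, 0xFF: 0x00DF}


def cork_to_unicode(data: bytes) -> str:
    """Decode Cork-encoded bytes into a Unicode string, slot by slot."""
    chars = []
    for byte in data:
        if byte < 0x20:
            code_point = LOW[byte]
        elif byte < 0x80:
            code_point = ASCII_OVERRIDES.get(byte, byte)
        elif byte < 0xC0:
            code_point = MID[byte - 0x80]
        else:
            code_point = HIGH_OVERRIDES.get(byte, byte)
        chars.append(chr(code_point))
    return "".join(chars)
```

For example, cork_to_unicode(b"Stra\xffe") yields "Straße", because slot 0xFF holds ß rather than ÿ. The ambiguity of slot 0xD0 described in note [e] is not resolved here: the sketch always decodes it as Ð.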

Supported languages

The encoding supports most European languages written in the Latin alphabet. Notable exceptions are:

Esperanto and Maltese (using IL3)
Latvian and Lithuanian (using L7X)
Welsh

Languages with slightly suboptimal support include:

Galician, Portuguese and Spanish – due to the lack of the characters ª and º, which are not simply superscript versions of lowercase "a" and "o" (superscripts are thinner) and are often underlined
Croatian, Bosnian and Serbian – due to the shared use of the slot for Đ
Turkish – due to dotless i having different uppercase and lowercase pairings than in other languages
Romanian – due to the characters "Ş ş Ţ ţ" (with a cedilla), which modern standards consider typographically incorrect;^[5]^[6] the expected forms are "Ș ș Ț ț"^[7] (with a comma below). The cedilla forms were arguably considered acceptable when the encoding was developed, but the support retroactively became suboptimal or insufficient when the Unicode code points were disunified.
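
The Romanian case can be illustrated with a short Python sketch. It assumes text that was extracted from a Cork-encoded Romanian document and therefore contains the cedilla forms stored in slots 0x93, 0x95, 0xB3 and 0xB5, and it replaces them with the comma-below code points. The names CEDILLA_TO_COMMA and romanian_comma_below are hypothetical. The replacement is only appropriate for Romanian text: Turkish, for instance, uses Ş and ş with a cedilla as its correct forms.

```python
import unicodedata

# Cedilla forms (as found in Cork slots 0x93, 0x95, 0xB3, 0xB5)
# mapped to the disunified comma-below code points.
CEDILLA_TO_COMMA = {
    0x015E: 0x0218,  # S with cedilla -> S with comma below
    0x015F: 0x0219,  # s with cedilla -> s with comma below
    0x0162: 0x021A,  # T with cedilla -> T with comma below
    0x0163: 0x021B,  # t with cedilla -> t with comma below
}


def romanian_comma_below(text: str) -> str:
    """Replace cedilla S/T with comma-below S/T (Romanian text only)."""
    return text.translate(CEDILLA_TO_COMMA)


# Print the official Unicode names to show that the two forms are
# separate characters, not font variants of one code point.
for old, new in CEDILLA_TO_COMMA.items():
    print(f"U+{old:04X} {unicodedata.name(chr(old))}")
    print(f"  -> U+{new:04X} {unicodedata.name(chr(new))}")
```

Because str.translate only touches the four listed code points, all other characters pass through unchanged.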
