How the X11 protocol works at the lowest level

X11 is the mechanism on which the entire graphical interface of Unix and similar OSes works.

But few people know how it actually works, because over the years it has been overgrown with layer upon layer of libraries that seek to hide the very essence of the protocol.

And the protocol is essentially excellent. It is concise and almost perfect.

Full protocol documentation is available online. But the fact is that this documentation is large, written in not entirely clear language and, in fact, is just a specification. Important points are not marked in any way, and their use is also left to the imagination of the reader.

And all the books and articles on using X11 describe it via wrapper libraries like Xlib and XCB, or even worse, GTK or Qt.

Therefore, you have to read the entire documentation, work out what is important and what is not, invent usage scenarios, and write at least short programs to test how everything actually works.

Be that as it may, if anyone is interested in how things actually work, feel free to ask.

Essence

The essence of X11 is that there is a server program (the X server) that waits for connections and executes the commands it receives from clients: for example, create a window, draw something in it, and so on.

Clients connect to the server through a regular socket. They send commands and get back replies, errors if something goes wrong, and events (mouse moves, button presses, etc.).

The client is essentially a console program that has nothing to do with graphics other than this network connection.

Protocol

The entire underlying protocol is described in the X Window System Protocol document.

The most useful part of this document is Appendix “B”, which describes byte-by-byte what and where is sent and received.

I will quote passages to illustrate the text.

Identifiers

All objects in X have an ID. This is a 32-bit number, which is generated by the client and passed to the server to label the object being created: for example, a window, cursor, picture, etc.

Another type of identifier is the ATOM. Atoms are also 32-bit numbers, but they are generated by the server. The client sends a character string to the server, and the server returns a number. The same string always corresponds to the same number. This is similar to hashing, but done differently: the server simply stores a list of strings and assigns them numbers. If a client requests an atom for a string that is already in the list, it gets back the number already assigned to that string.

Atoms are primarily used so that different clients can exchange information with each other using standard text identifiers.

And in order not to load the network exchange with long text identifiers, actual numbers are transmitted.
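
As a rough illustration of how a client asks for an atom, the Python sketch below packs an InternAtom request (opcode 16 in the core protocol) for a little-endian connection. The function name `intern_atom_request` is only an illustrative choice, WM_PROTOCOLS is used simply as an example of a name that is not predefined, and reading the reply that carries the atom number is left out.

```python
import struct

def intern_atom_request(name: str, only_if_exists: bool = False) -> bytes:
    """Pack an InternAtom request (core opcode 16), little-endian."""
    data = name.encode("latin-1")
    pad = (-len(data)) % 4                 # the name is padded to a multiple of 4
    length = 2 + (len(data) + pad) // 4    # whole request, in 4-byte units
    # opcode, only-if-exists flag, request length, name length, 2 unused bytes
    header = struct.pack("<BBHHH", 16, int(only_if_exists), length, len(data), 0)
    return header + data + b"\x00" * pad

req = intern_atom_request("WM_PROTOCOLS")
print(len(req), req[:8].hex(" "))
```

The request is sent once; after that the client keeps the returned number and uses it instead of the string.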

To reduce the load on the server, the most important atoms are predefined in the standard and always have the same values. If anyone is interested, the list is here:

Standard atoms

What is written in capital letters is the string from which the atom is generated:

atomPRIMARY            = 1
atomSECONDARY          = 2
atomARC                = 3
atomATOM               = 4
atomBITMAP             = 5
atomCARDINAL           = 6
atomCOLORMAP           = 7
atomCURSOR             = 8
atomCUT_BUFFER0        = 9
atomCUT_BUFFER1        = 10
atomCUT_BUFFER2        = 11
atomCUT_BUFFER3        = 12
atomCUT_BUFFER4        = 13
atomCUT_BUFFER5        = 14
atomCUT_BUFFER6        = 15
atomCUT_BUFFER7        = 16
atomDRAWABLE           = 17
atomFONT               = 18
atomINTEGER            = 19
atomPIXMAP             = 20
atomPOINT              = 21
atomRECTANGLE          = 22
atomRESOURCE_MANAGER   = 23
atomRGB_COLOR_MAP      = 24
atomRGB_BEST_MAP       = 25
atomRGB_BLUE_MAP       = 26
atomRGB_DEFAULT_MAP    = 27
atomRGB_GRAY_MAP       = 28
atomRGB_GREEN_MAP      = 29
atomRGB_RED_MAP        = 30
atomSTRING             = 31
atomVISUALID           = 32
atomWINDOW             = 33
atomWM_COMMAND         = 34
atomWM_HINTS           = 35
atomWM_CLIENT_MACHINE  = 36
atomWM_ICON_NAME       = 37
atomWM_ICON_SIZE       = 38
atomWM_NAME            = 39
atomWM_NORMAL_HINTS    = 40
atomWM_SIZE_HINTS      = 41
atomWM_ZOOM_HINTS      = 42
atomMIN_SPACE          = 43
atomNORM_SPACE         = 44
atomMAX_SPACE          = 45
atomEND_SPACE          = 46
atomSUPERSCRIPT_X      = 47
atomSUPERSCRIPT_Y      = 48
atomSUBSCRIPT_X        = 49
atomSUBSCRIPT_Y        = 50
atomUNDERLINE_POSITION = 51
atomUNDERLINE_THICKNESS= 52
atomSTRIKEOUT_ASCENT   = 53
atomSTRIKEOUT_DESCENT  = 54
atomITALIC_ANGLE       = 55
atomX_HEIGHT           = 56
atomQUAD_WIDTH         = 57
atomWEIGHT             = 58
atomPOINT_SIZE         = 59
atomRESOLUTION         = 60
atomCOPYRIGHT          = 61
atomNOTICE             = 62
atomFONT_NAME          = 63
atomFAMILY_NAME        = 64
atomFULL_NAME          = 65
atomCAP_HEIGHT         = 66
atomWM_CLASS           = 67
atomWM_TRANSIENT_FOR   = 68

Requests

All requests in X11 are binary, with fields of different lengths. Essentially, there are 1-byte, 2-byte, and 4-byte fields.

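The first 4 bytes of a request are always present and always have the same layout: a 1-byte opcode, one byte whose meaning depends on the request (often it is unused), and a 2-byte length of the whole request, counted in 4-byte units.

After reading this header, the server knows how many bytes (or rather, double words) still need to be read to retrieve the entire request.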
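
To make the header concrete, here is a small Python sketch of what the receiving side does with the first 4 bytes: one opcode byte, one request-specific byte, and a 16-bit length counted in 4-byte units (header included). Little-endian byte order is assumed, and `parse_request_header` is just an illustrative name.

```python
import struct

def parse_request_header(header: bytes):
    """Split a 4-byte request header (little-endian connection assumed)."""
    opcode, data, length_units = struct.unpack("<BBH", header)
    remaining = length_units * 4 - 4    # bytes that still follow the header
    return opcode, data, remaining

# A request with opcode 8 and a total length of two 4-byte units
print(parse_request_header(bytes.fromhex("08000200")))  # (8, 0, 4)
```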

In order not to be too verbose, I will show a simple example:

Take the “DestroyWindow” request, and suppose we want to destroy the window with ID 0x12345678. The request consists of the opcode (4), one unused byte, the request length (2 double words) and the window ID.

As a result, the following goes through the socket (in little-endian byte order): 04 00 02 00 78 56 34 12

Upon receiving this request, the X server will destroy the window with ID 0x12345678.

In the protocol documentation (or more precisely, in its appendix), the DestroyWindow request is described with the following syntax:

     1     4                               opcode
     1                                     unused
     2     2                               request length
     4     WINDOW                          window
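
The table above translates almost literally into code. A minimal Python sketch, assuming a little-endian connection (the function name is illustrative):

```python
import struct

DESTROY_WINDOW = 4   # opcode from the protocol appendix

def destroy_window(window_id: int) -> bytes:
    # 1 byte opcode, 1 unused byte, 2 bytes request length (2 units), 4 bytes window
    return struct.pack("<BxHI", DESTROY_WINDOW, 2, window_id)

print(destroy_window(0x12345678).hex(" "))  # 04 00 02 00 78 56 34 12
```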

And now something more complicated: CreateWindow.

First you need to select the window ID. Let’s choose 0x12345678 again to make it easier.
You will also need the ID of the root window (a service window that occupies the entire display and is the parent of all top-level windows). Let’s say its ID is 0x9abcdef0 (where to get the real value, I’ll tell you a little later).

And so the final request that we send to the socket (x = 100, y = 101, width = 200, height = 102, no border, class InputOutput, visual copied from the parent, empty value-mask): 01 00 08 00 78 56 34 12 f0 de bc 9a 64 00 65 00 c8 00 66 00 00 00 01 00 00 00 00 00 00 00 00 00

Here is the full description of the request in the protocol appendix:

     1     1                               opcode
     1     CARD8                           depth
     2     8+n                             request length
     4     WINDOW                          wid
     4     WINDOW                          parent
     2     INT16                           x
     2     INT16                           y
     2     CARD16                          width
     2     CARD16                          height
     2     CARD16                          border-width
     2                                     class
          0     CopyFromParent
          1     InputOutput
          2     InputOnly
     4     VISUALID                        visual
          0     CopyFromParent
     4     BITMASK                         value-mask (has n bits set to 1)
          #x00000001     background-pixmap
          #x00000002     background-pixel
          #x00000004     border-pixmap
          #x00000008     border-pixel
          #x00000010     bit-gravity
          #x00000020     win-gravity
          #x00000040     backing-store
          #x00000080     backing-planes
          #x00000100     backing-pixel
          #x00000200     override-redirect
          #x00000400     save-under
          #x00000800     event-mask
          #x00001000     do-not-propagate-mask
          #x00002000     colormap
          #x00004000     cursor
     4n     LISTofVALUE                    value-list

  VALUEs
     4     PIXMAP                          background-pixmap
          0     None
          1     ParentRelative
     4     CARD32                          background-pixel
     4     PIXMAP                          border-pixmap
          0     CopyFromParent
     4     CARD32                          border-pixel
     1     BITGRAVITY                      bit-gravity
     1     WINGRAVITY                      win-gravity
     1                                     backing-store
          0     NotUseful
          1     WhenMapped
          2     Always
     4     CARD32                          backing-planes
     4     CARD32                          backing-pixel
     1     BOOL                            override-redirect
     1     BOOL                            save-under
     4     SETofEVENT                      event-mask
     4     SETofDEVICEEVENT                do-not-propagate-mask
     4     COLORMAP                        colormap
          0     CopyFromParent
     4     CURSOR                          cursor
          0     None

A little more complicated, but I hope it is more or less clear. The difficulty here is that a whole set of optional window parameters of different kinds and formats can be passed in the request. But in reality, everything goes sequentially and more or less logically: one bit in value-mask per parameter, and the values themselves follow in the order of their bits.
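
The following Python sketch shows one possible way to assemble a CreateWindow request with an optional value list. It assumes a little-endian connection; the function and constant names are illustrative, and the geometry in the example is arbitrary. On the wire every entry of the value list takes 4 bytes, even when the table gives it a shorter type.

```python
import struct

CW_BACK_PIXEL = 0x00000002   # value-mask bits from the table above
CW_EVENT_MASK = 0x00000800

def create_window(wid, parent, x, y, width, height, values=None,
                  depth=0, border_width=0, win_class=1, visual=0):
    """Pack a CreateWindow request (opcode 1), little-endian.

    `values` maps value-mask bits to unsigned 32-bit values; they are
    sent in ascending bit order, 4 bytes each.
    """
    values = values or {}
    mask = 0
    value_list = b""
    for bit in sorted(values):
        mask |= bit
        value_list += struct.pack("<I", values[bit])
    length = 8 + len(values)             # request length in 4-byte units
    fixed = struct.pack("<BBHIIhhHHHHII", 1, depth, length, wid, parent,
                        x, y, width, height, border_width, win_class,
                        visual, mask)
    return fixed + value_list

# White background (assuming pixel 0xffffff is white) and Exposure events (0x8000)
req = create_window(0x12345678, 0x9abcdef0, 10, 10, 400, 300,
                    {CW_BACK_PIXEL: 0xFFFFFF, CW_EVENT_MASK: 0x8000})
print(len(req), req.hex(" "))
```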

After receiving this request, the server creates a window with the specified parameters. But the window will not appear yet, because it has not been mapped onto the screen. We do that with the “MapWindow” request (opcode 8). Compared with the previous one, it is very simple:

Over the socket goes: 08 00 02 00 78 56 34 12 and the window becomes visible.

Answers

The server also sends us information over the socket. It comes in 3 types: Replies, Events and Errors.

All three types are at least 32 bytes long. (And events and errors are always exactly 32 bytes). So reading from the server is always done in portions of 32 bytes, and if it is a Reply, we take the length of the additional part from the body of the response and read it too.

All information from the server comes asynchronously, but replies and errors always arrive in the order of the requests that caused them.

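  1. Replies. If a request requires a response from the server, the server sends it over the socket as soon as it has processed the request. If the response fits in 32 bytes, that is all that needs to be read. If it is longer, its header contains the length of the additional part.

The general format of a reply is as follows: the first byte is 1, the second byte is reply-specific, then come the 2-byte sequence number of the request, the 4-byte length of the additional part (in 4-byte units) and 24 bytes of reply data.

  2. Events. They are always 32 bytes long and are generated in response to things happening in the GUI. To receive most events, the client must subscribe to them, for example through the event-mask value when it creates a window.

Some system-wide events are always sent to everyone.

The format of an event is as follows: the first byte is the event code (its top bit is set if the event was sent by another client with SendEvent), and the layout of the remaining bytes depends on the event.

  3. Errors. Sent if a client request could not be fulfilled because it contains an error in its data or parameters. The error format is as follows: the first byte is 0, the second is the error code, then come the 2-byte sequence number of the failed request, 4 bytes with the offending value or resource ID, the 2-byte minor opcode and the 1-byte major opcode of the request; the rest is unused.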
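
The three kinds of messages can be told apart by the first byte alone. Below is a hedged Python sketch of such a dispatcher for a little-endian connection; `classify_message` and the shape of its return value are my own illustrative choices, not part of the protocol.

```python
import struct

def classify_message(msg: bytes):
    """Classify one 32-byte server message (little-endian connection).

    Returns (kind, code, sequence, extra_bytes_to_read).
    """
    if len(msg) != 32:
        raise ValueError("server messages arrive in 32-byte units")
    kind = msg[0]
    # Note: a few events (KeymapNotify) carry no sequence number here
    sequence = struct.unpack_from("<H", msg, 2)[0]
    if kind == 0:                        # Error: byte 1 is the error code
        return ("error", msg[1], sequence, 0)
    if kind == 1:                        # Reply: extra length in 4-byte units
        extra_units = struct.unpack_from("<I", msg, 4)[0]
        return ("reply", msg[1], sequence, extra_units * 4)
    # Event: low 7 bits are the event code, the top bit marks SendEvent
    return ("event", kind & 0x7F, sequence, 0)
```

A read loop would fetch 32 bytes, call this function, and for a reply read the reported number of extra bytes before handling the next message.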

Connection

Now let’s take a step back and consider the most difficult thing in X11 – connecting to the server. Unfortunately, the procedure is complicated and confusing and is a stumbling block for using X11 directly.

It is the connection that raises the level of entry into the technology.

As we have seen, using the protocol is quite simple. But connecting is something else entirely!

The connection itself is essentially ordinary, we create a socket and connect to it. But first you need to find out the address of the server. There is an algorithm for this:

First, look at the content of the environment variable DISPLAY. If present, it contains the address of the X11 server in the format [host]:D.S

host is the machine the server runs on. It can be a domain name, the string “/unix”, or missing altogether. A missing host is equivalent to “/unix” and means that the server is listening on a Unix domain socket on the local machine.

By the way, this is the most common case. If a host is present, we have to connect to that host over TCP (IPv4 or IPv6).

D is the display number, and S is the screen number. In most cases, on modern configurations the screen number will be 0, even if there is more than one monitor: all of them are virtually combined into one screen.

The server’s address depends on the display number. Over TCP, the server listens on port 6000+D. If we connect through a Unix domain socket, it is located at /tmp/.X11-unix/X{D}, that is, display zero is at /tmp/.X11-unix/X0, the first at /tmp/.X11-unix/X1, and so on.
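
The algorithm above can be sketched in a few lines of Python. This is only an approximation: it treats an empty host, “unix” and “/unix” as local, does not handle IPv6 literals containing colons, and the name `parse_display` is illustrative.

```python
import re

def parse_display(display: str):
    """Split a DISPLAY value such as ':0', ':1.0' or 'example.org:10.0'."""
    m = re.fullmatch(r"(?P<host>[^:]*):(?P<d>\d+)(?:\.(?P<s>\d+))?", display)
    if m is None:
        raise ValueError(f"unsupported DISPLAY value: {display!r}")
    host = m.group("host")
    number = int(m.group("d"))
    screen = int(m.group("s") or 0)      # a missing screen number means 0
    if host in ("", "unix", "/unix"):
        # Local server: Unix domain socket
        return ("unix", f"/tmp/.X11-unix/X{number}", screen)
    # Remote server: TCP port 6000 + display number
    return ("tcp", (host, 6000 + number), screen)

print(parse_display(":0"))                # ('unix', '/tmp/.X11-unix/X0', 0)
print(parse_display("example.org:10.0"))  # ('tcp', ('example.org', 6010), 0)
```

The first element tells you which socket family to create, and the second is the address to pass to `connect`.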

And now we are connected to the socket. But once connected, you can’t just start sending requests. You must first send information about yourself to the server and log in.

All this is contained in the first (or rather zeroth) request, which is non-standard and contains the following:

The first byte defines the format in which our program understands numbers. The server will send us all numbers longer than one byte in this format and will understand the numbers we send in this format.

Then follows the minimum version of the protocol that would suit the application. If the server supports a lower version, the connection will be rejected.

Then follows the name of the authorization protocol and the actual authorization data. This is a type of proof that this application is allowed to connect to the X11 server.

Where do we get the protocol name and authorization data from? They are in a file whose path is in an environment variable $XAUTHORITY. If this variable does not exist, you can search in the file $HOME/.Xauthority – This is the most common option. If your application does not have permission to access this file, or the file does not exist, then you do not have access to this X11 server.

The file is binary and its format is not very well documented. I had to ask on Stack Overflow to figure it out, and even then it was only partially successful.

So, the file is a sequence of records with the following structure:

typedef struct xauth {
    unsigned short   family;
    unsigned short   address_length;
    char            *address;
    unsigned short   number_length;
    char            *number;
    unsigned short   name_length;
    char            *name;
    unsigned short   data_length;
    char        *data;
} Xauth;

But first, of course, there are no pointers in the file: each string is simply written out byte by byte, right after its length. Secondly, all two-byte numbers are always big-endian, regardless of the computer architecture.

address is the HOST address of the server.

number is the display number we already determined from the $DISPLAY variable, written as a text string!

name is the name of the authorization protocol. At present, as far as I know, only the MIT-MAGIC-COOKIE-1 protocol is used.

data is a byte array, something like 07 bd 70 26 1a ab 4c 7c 35 3c c1 b2 cc 25 a2 29, which we must send to the server to prove that we have access.

Iterate over this file until we find an entry in which HOST matches the host from $DISPLAY and the display number matches the display number from $DISPLAY. From this record we get the protocol name and authorization data.
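
Here is a minimal Python sketch of that search, working on the raw bytes of the file. It follows the record layout described above (big-endian 16-bit lengths, strings stored inline). The function names are illustrative, and the matching is deliberately strict: any looser rules that the real libraries may apply are ignored.

```python
import struct

def read_xauthority(raw: bytes):
    """Yield (family, address, number, name, data) from .Xauthority bytes."""
    pos = 0

    def field() -> bytes:
        nonlocal pos
        (length,) = struct.unpack_from(">H", raw, pos)   # always big-endian
        pos += 2
        value = raw[pos:pos + length]
        pos += length
        return value

    while pos < len(raw):
        (family,) = struct.unpack_from(">H", raw, pos)
        pos += 2
        yield family, field(), field(), field(), field()

def find_cookie(raw: bytes, host: bytes, display: bytes):
    """Return (protocol name, auth data) of the first exact match, or None."""
    for _family, address, number, name, data in read_xauthority(raw):
        if address == host and number == display:
            return name, data
    return None
```

Typical use would be to read the whole file in binary mode and call `find_cookie(raw, hostname, b"0")`, where the display number is passed as text, just as it is stored in the file.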

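And so we have collected all the data needed for the zeroth request and can assemble it: the byte-order mark, an unused byte, the protocol version (11.0), the lengths of the protocol name and of the authorization data as two-byte numbers, two unused bytes, and then the name and the data, each padded to a multiple of 4 bytes.

To the server goes: 6c 00 0b 00 00 00 12 00 10 00 00 00 4d 49 54 2d 4d 41 47 49 43 2d 43 4f 4f 4b 49 45 2d 31 00 00 07 bd 70 26 1a ab 4c 7c 35 3c c1 b2 cc 25 a2 29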
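
For reference, a short Python sketch that packs this request for a little-endian client, following the connection setup layout in the protocol document (byte-order mark, unused byte, version, two 16-bit lengths, two unused bytes, then the padded name and data). The helper names are illustrative.

```python
import struct

def pad4(n: int) -> int:
    """Number of padding bytes needed to reach a multiple of 4."""
    return (-n) % 4

def setup_request(auth_name: bytes, auth_data: bytes) -> bytes:
    """Pack the connection setup request for a little-endian client."""
    header = struct.pack("<BxHHHHxx",
                         0x6C,            # 'l': least significant byte first
                         11, 0,           # protocol version 11.0
                         len(auth_name), len(auth_data))
    return (header
            + auth_name + b"\x00" * pad4(len(auth_name))
            + auth_data + b"\x00" * pad4(len(auth_data)))

cookie = bytes.fromhex("07bd70261aab4c7c353cc1b2cc25a229")
print(setup_request(b"MIT-MAGIC-COOKIE-1", cookie).hex(" "))
```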

To this the server can respond in one of three ways. Which one is determined by the first byte. It can be:

0: Connection rejected. The response contains the protocol version and a text message explaining the reason, preceded by its length.

2: Additional authentication is required. I did not study this option, because I never managed to find a system that responds this way.

1: Connection accepted.

The best option for us. The response is very long and complex; it contains the main system parameters, which we should remember and use later in our requests.
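
All three variants share the same first 8 bytes, with the length of the rest (in 4-byte units) in the last two of them, so the client can read 8 bytes first and then the remainder. A hedged Python sketch, assuming little-endian byte order (the name `parse_setup_header` is illustrative):

```python
import struct

def parse_setup_header(header: bytes):
    """Decode the first 8 bytes of the server's answer to the setup request."""
    status, _detail, major, minor, extra_units = struct.unpack("<BBHHH", header)
    kinds = {0: "failed", 1: "success", 2: "authenticate"}
    # The version fields are meaningful for "failed" and "success" only
    return kinds[status], (major, minor), extra_units * 4

print(parse_setup_header(bytes.fromhex("01000b0000002a00")))
# ('success', (11, 0), 168)
```

The third element is the number of bytes that still have to be read from the socket to get the complete response.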

I was never able to draw a table complex enough to lay all of this out neatly. Therefore, here is the description of the response from the protocol documentation:

     1     1                               Success
     1                                     unused
     2     CARD16                          protocol-major-version
     2     CARD16                          protocol-minor-version
     2     8+2n+(v+p+m)/4                  length in 4-byte units of
                                           "additional data"
     4     CARD32                          release-number
     4     CARD32                          resource-id-base
     4     CARD32                          resource-id-mask
     4     CARD32                          motion-buffer-size
     2     v                               length of vendor
     2     CARD16                          maximum-request-length
     1     CARD8                           number of SCREENs in roots
     1     n                               number for FORMATs in
                                           pixmap-formats
     1                                     image-byte-order
          0     LSBFirst
          1     MSBFirst
     1                                     bitmap-format-bit-order
          0     LeastSignificant
          1     MostSignificant
     1     CARD8                           bitmap-format-scanline-unit
     1     CARD8                           bitmap-format-scanline-pad
     1     KEYCODE                         min-keycode
     1     KEYCODE                         max-keycode
     4                                     unused
     v     STRING8                         vendor
     p                                     unused, p=pad(v)
     8n     LISTofFORMAT                   pixmap-formats
     m     LISTofSCREEN                    roots (m is always a multiple of 4)

FORMAT
     1     CARD8                           depth
     1     CARD8                           bits-per-pixel
     1     CARD8                           scanline-pad
     5                                     unused

SCREEN
     4     WINDOW                          root
     4     COLORMAP                        default-colormap
     4     CARD32                          white-pixel
     4     CARD32                          black-pixel
     4     SETofEVENT                      current-input-masks
     2     CARD16                          width-in-pixels
     2     CARD16                          height-in-pixels
     2     CARD16                          width-in-millimeters
     2     CARD16                          height-in-millimeters
     2     CARD16                          min-installed-maps
     2     CARD16                          max-installed-maps
     4     VISUALID                        root-visual
     1                                     backing-stores
          0     Never
          1     WhenMapped
          2     Always
     1     BOOL                            save-unders
     1     CARD8                           root-depth
     1     CARD8                           number of DEPTHs in allowed-depths
     n     LISTofDEPTH                     allowed-depths (n is always a
                                           multiple of 4)

DEPTH
     1     CARD8                           depth
     1                                     unused
     2     n                               number of VISUALTYPES in visuals
     4                                     unused
     24n     LISTofVISUALTYPE              visuals

VISUALTYPE
     4     VISUALID                        visual-id
     1                                     class
          0     StaticGray
          1     GrayScale
          2     StaticColor
          3     PseudoColor
          4     TrueColor
          5     DirectColor
     1     CARD8                           bits-per-rgb-value
     2     CARD16                          colormap-entries
     4     CARD32                          red-mask
     4     CARD32                          green-mask
     4     CARD32                          blue-mask
     4                                     unused

But no matter how difficult it looks, all the information does not need to be memorized or even analyzed.

We will take from this answer only what is important for us. First of all, these are the two numbers in the fields resource-id-base and resource-id-mask. They give us the range in which to generate IDs for all GUI objects. (Don’t forget that in X11 all object IDs are generated on the client side, and the client tells the server what the ID of the window or other object will be.)

The server imposes only one limitation: it allocates to each program a range in which its identifiers must fit. An identifier is resource-id-base combined with some of the bits that are set to one in resource-id-mask; no other bits may differ from the base.
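
One simple way to generate such identifiers is a counter shifted into the position of the mask. The Python sketch below relies on the mask being a single contiguous run of bits, as the protocol document describes; the names are illustrative, and the base and mask in the example are made-up values.

```python
def make_id_allocator(base: int, mask: int):
    """Return a function that hands out fresh resource IDs within base/mask."""
    shift = (mask & -mask).bit_length() - 1   # position of the lowest mask bit
    limit = mask >> shift                     # largest counter that still fits
    counter = 0

    def next_id() -> int:
        nonlocal counter
        if counter > limit:
            raise RuntimeError("resource ID range exhausted")
        rid = base | (counter << shift)
        counter += 1
        return rid

    return next_id

new_id = make_id_allocator(0x04000000, 0x001FFFFF)
print(hex(new_id()), hex(new_id()))  # 0x4000000 0x4000001
```

This sketch never reuses IDs of destroyed objects, which is acceptable for short programs.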

You also need to remember the range of keycodes (min-keycode/max-keycode) for future use, and to pick out of the response the image formats that the program can use and that are convenient for it.

It is also necessary to find the corresponding SCREEN from the list and take the ID of the root window from there. We need it as the parent window for all top-level windows that we will create.

The rest can be more or less ignored.
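
As a sketch of how little of the response is really needed, the Python function below pulls out only the fields mentioned above, using the offsets from the tables. It assumes a little-endian connection, takes the complete response (the 8-byte header plus the additional data) and, as a simplification, looks only at the first SCREEN without walking its list of depths and visuals. The function name and the dictionary keys are illustrative.

```python
import struct

def parse_setup_success(reply: bytes):
    """Extract the few fields we need from a complete 'success' response."""
    id_base, id_mask = struct.unpack_from("<II", reply, 12)
    (vendor_len,) = struct.unpack_from("<H", reply, 24)
    num_screens, num_formats = reply[28], reply[29]
    min_keycode, max_keycode = reply[34], reply[35]
    if num_screens == 0:
        raise ValueError("server reported no screens")
    # Skip the fixed part (40 bytes), the padded vendor string and the formats
    pos = 40 + vendor_len + (-vendor_len) % 4 + 8 * num_formats
    (root,) = struct.unpack_from("<I", reply, pos)       # first SCREEN only
    width, height = struct.unpack_from("<HH", reply, pos + 20)
    return {
        "id_base": id_base, "id_mask": id_mask,
        "min_keycode": min_keycode, "max_keycode": max_keycode,
        "root": root, "width": width, "height": height,
        "root_depth": reply[pos + 38],
    }
```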

I usually look for the SCREEN that suits me (32 bit TrueColor) in all this diversity and use only it. And if the server does not support this, then I just end the work. This greatly simplifies work and code.

Conclusion

Well, that’s all for now. I hope I managed to explain everything more clearly than the documentation does, and to give you the understanding that will allow you to read the documentation freely (and it is really good, if you know how to read it).

As an exercise, I offer a contest-challenge: Write a bash program that establishes a connection with the X server and creates and displays a window with the title “X11 rules”.

If no one manages to or wants to, I will try to write it as an example for the next article in the series.

Ask in the comments if it is not clear. If you like something, write too. The article can and will be edited as the discussion progresses.
