Learn how to run inference on the SmolLM3 model in JupyterLab using the `transformers` library.
SmolLM3 is a family of small language models developed by Hugging Face that provides strong performance while being efficient enough to run on modest hardware.
The 3B-parameter model we'll use in this tutorial requires only 24 GB of VRAM, making it accessible for experimentation and development.
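Before loading the model, it can be useful to confirm your environment actually has a GPU with enough memory. A quick check from a notebook cell, using PyTorch's CUDA utilities (a sketch; the helper name `vram_report` is our own):

```python
import torch

def vram_report() -> str:
    """Report the name and total VRAM of the first CUDA GPU, if any."""
    if not torch.cuda.is_available():
        return "No CUDA GPU detected"
    props = torch.cuda.get_device_properties(0)
    # total_memory is in bytes; convert to GiB for readability
    return f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM"

print(vram_report())
```

If this prints less than the recommended 24 GB, generation may still work, but you risk out-of-memory errors with longer prompts.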
Start by installing the `transformers` and `accelerate` Python libraries:
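A typical install cell looks like the following (assuming a standard pip-based environment; `accelerate` is what enables the `device_map` argument used below):

```shell
pip install transformers accelerate
```

From inside a JupyterLab notebook, prefix the command with `%pip` so the packages are installed into the kernel's environment.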
Increase `max_new_tokens` and run the cell again to get a longer response (it will just take longer to run).
- `pipeline()`: Creates a high-level interface for text generation.
- `model="HuggingFaceTB/SmolLM3-3B"`: Specifies the model to use.
- `torch_dtype=torch.bfloat16`: Uses 16-bit floating point for memory efficiency.
- `device_map=0`: Places the model on the first GPU (device 0).
- `messages`: Defines a chat-like conversation with system and user roles.
- `max_new_tokens`: Controls the maximum length of the generated text.
- `temperature`: Controls randomness (0.1 = more focused, 1.0 = more creative).
- `top_k`: Limits the vocabulary to the top K most likely tokens.
- `top_p`: Uses nucleus sampling to control diversity.
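Putting the parameters above together, a complete generation cell might look like this (a sketch: the prompt text and sampling values are illustrative, and the `chat` helper is our own wrapper around the `transformers` text-generation pipeline):

```python
import torch
from transformers import pipeline

def chat(prompt: str, max_new_tokens: int = 256) -> str:
    """Generate a reply from SmolLM3-3B for a single user prompt."""
    generator = pipeline(
        "text-generation",
        model="HuggingFaceTB/SmolLM3-3B",  # the model used in this tutorial
        torch_dtype=torch.bfloat16,        # 16-bit floats for memory efficiency
        device_map=0,                      # place the model on GPU 0
    )
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt},
    ]
    outputs = generator(
        messages,
        max_new_tokens=max_new_tokens,  # cap on generated length
        do_sample=True,
        temperature=0.6,  # lower = more focused, higher = more creative
        top_k=50,         # restrict sampling to the 50 most likely tokens
        top_p=0.95,       # nucleus sampling for diversity control
    )
    # With chat-style input, generated_text holds the whole conversation;
    # the last message is the model's reply.
    return outputs[0]["generated_text"][-1]["content"]
```

In a notebook you would then run, for example, `print(chat("Explain gravity in one paragraph."))`. The first call downloads the model weights, so expect it to take a few minutes.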