# Qwen3 32B AWQ

> Latest generation LLM with advanced reasoning, instruction-following, and multilingual support.

Qwen3 32B AWQ is the latest large language model in the Qwen series, offering advancements in reasoning, instruction-following, agent capabilities, and multilingual support. It uses AWQ quantization for efficient inference while maintaining high quality.

<Card title="Try in playground" icon="play" href="https://console.runpod.io/hub/playground/text/qwen3-32b-awq" horizontal>
  Test Qwen3 32B AWQ in the Runpod Hub playground.
</Card>

|              |                                                  |
| ------------ | ------------------------------------------------ |
| **Endpoint** | `https://api.runpod.ai/v2/qwen3-32b-awq/runsync` |
| **Pricing**  | \$10.00 per 1M tokens                            |
| **Type**     | Text generation                                  |

<Note>
  This endpoint is fully compatible with the OpenAI API. See the [OpenAI compatibility examples](#openai-api-compatibility) below.
</Note>

## Request

All parameters are passed within the `input` object in the request body.

<ParamField body="input.prompt" type="string" required>
  Prompt for text generation.
</ParamField>

<ParamField body="input.max_tokens" type="integer" default="512">
  Maximum number of tokens to output.
</ParamField>

<ParamField body="input.temperature" type="float" default="0.7">
  Randomness of the output. Lower values make output more predictable and deterministic. Range: 0.0-1.0.
</ParamField>

<ParamField body="input.top_p" type="float">
  Nucleus sampling threshold. Samples from the smallest set of tokens whose cumulative probability exceeds this threshold.
</ParamField>

<ParamField body="input.top_k" type="integer">
  Restricts sampling to the K most probable tokens.
</ParamField>

<ParamField body="input.stop" type="string">
  Stops generation if the given string is encountered.
</ParamField>
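
The parameters above can be assembled into a request body programmatically. The following is a minimal sketch, not an official client: `build_request_body` is a hypothetical helper name, and the temperature check simply mirrors the 0.0-1.0 range documented above. Optional sampling parameters are omitted unless explicitly set, on the assumption that the endpoint then applies its own defaults.

```python
from typing import Any, Dict, Optional


def build_request_body(
    prompt: str,
    max_tokens: int = 512,
    temperature: float = 0.7,
    top_p: Optional[float] = None,
    top_k: Optional[int] = None,
    stop: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a /runsync request body (hypothetical helper, not part of any SDK)."""
    if not prompt:
        raise ValueError("prompt is required")
    if not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be within 0.0-1.0")

    params: Dict[str, Any] = {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    # Only include optional sampling parameters that were explicitly set.
    optional = {"top_p": top_p, "top_k": top_k, "stop": stop}
    params.update({key: value for key, value in optional.items() if value is not None})

    # All parameters are nested under the "input" object.
    return {"input": params}


body = build_request_body("Write a haiku about GPUs:", top_p=0.9, stop="\n\n")
print(body)
```

The returned dictionary can be passed as the `json=` argument of `requests.post`.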

<RequestExample>
  ```bash cURL theme={"theme":{"light":"github-light","dark":"github-dark"}}
  curl -X POST "https://api.runpod.ai/v2/qwen3-32b-awq/runsync" \
    -H "Authorization: Bearer $RUNPOD_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "input": {
        "prompt": "Write a Python function that checks if a number is prime:",
        "max_tokens": 512,
        "temperature": 0.7
      }
    }'
  ```

  ```python Python theme={"theme":{"light":"github-light","dark":"github-dark"}}
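  import os

  import requests

  # Read the API key from the environment instead of hardcoding it.
  RUNPOD_API_KEY = os.environ["RUNPOD_API_KEY"]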

  response = requests.post(
      "https://api.runpod.ai/v2/qwen3-32b-awq/runsync",
      headers={
          "Authorization": f"Bearer {RUNPOD_API_KEY}",
          "Content-Type": "application/json",
      },
      json={
          "input": {
              "prompt": "Write a Python function that checks if a number is prime:",
              "max_tokens": 512,
              "temperature": 0.7,
          }
      },
  )

  result = response.json()
  print(result["output"])
  ```

  ```javascript JavaScript theme={"theme":{"light":"github-light","dark":"github-dark"}}
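  // Read the API key from the environment (Node.js).
  const RUNPOD_API_KEY = process.env.RUNPOD_API_KEY;

  const response = await fetch(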
    "https://api.runpod.ai/v2/qwen3-32b-awq/runsync",
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${RUNPOD_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        input: {
          prompt: "Write a Python function that checks if a number is prime:",
          max_tokens: 512,
          temperature: 0.7,
        },
      }),
    }
  );

  const result = await response.json();
  console.log(result.output);
  ```
</RequestExample>

## Response

<ResponseField name="id" type="string">
  Unique identifier for the request.
</ResponseField>

<ResponseField name="status" type="string">
  Request status. Returns `COMPLETED` on success, `FAILED` on error.
</ResponseField>

<ResponseField name="delayTime" type="integer">
  Time in milliseconds the request spent in queue before processing began.
</ResponseField>

<ResponseField name="executionTime" type="integer">
  Time in milliseconds the model took to generate the response.
</ResponseField>

<ResponseField name="workerId" type="string">
  Identifier of the worker that processed the request.
</ResponseField>

<ResponseField name="output" type="array">
  The generation result. As in the example response, this is an array whose items contain the generated text, cost, and usage information.

  <ResponseField name="output.choices" type="array">
    Array of choices. Each choice has a `tokens` array containing the generated text.
  </ResponseField>

  <ResponseField name="output.cost" type="float">
    Cost of the generation in USD.
  </ResponseField>

  <ResponseField name="output.usage" type="object">
    Token usage information with `input` and `output` counts.
  </ResponseField>
</ResponseField>
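
Reading the generated text means walking `output`, then `choices`, then `tokens`. The sketch below shows one way to do that. `extract_text` is a hypothetical helper, and it assumes the shape shown in the example response on this page; it also tolerates `output` being a single object rather than a list, as a precaution.

```python
from typing import Any, Dict, List


def extract_text(result: Dict[str, Any]) -> str:
    """Join the generated token strings from a /runsync result (hypothetical helper)."""
    output = result.get("output") or []
    # Assumption: output is usually a list of objects, but accept a single object too.
    items: List[Dict[str, Any]] = output if isinstance(output, list) else [output]

    parts: List[str] = []
    for item in items:
        for choice in item.get("choices", []):
            parts.extend(choice.get("tokens", []))
    return "".join(parts)


# Trimmed-down sample in the same shape as the example response.
sample = {
    "status": "COMPLETED",
    "output": [{"choices": [{"tokens": ["def is_prime(n):", "\n    ..."]}]}],
}
print(extract_text(sample))
```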

<ResponseExample>
  ```json 200 theme={"theme":{"light":"github-light","dark":"github-dark"}}
  {
    "delayTime": 25,
    "executionTime": 3153,
    "id": "sync-0f3288b5-58e8-46fd-ba73-53945f5e8982-u2",
    "output": [
      {
        "choices": [
          {
            "tokens": [
              "def is_prime(n):\n    if n <= 1:\n        return False\n    for i in range(2, int(n**0.5) + 1):\n        if n % i == 0:\n            return False\n    return True"
            ]
          }
        ],
        "cost": 0.0001,
        "usage": {
          "input": 10,
          "output": 100
        }
      }
    ],
    "status": "COMPLETED",
    "workerId": "pkej0t9bbyjrgy"
  }
  ```

  ```json 400 theme={"theme":{"light":"github-light","dark":"github-dark"}}
  {
    "id": "sync-a1b2c3d4-e5f6-7890-abcd-ef1234567890-u1",
    "status": "FAILED",
    "error": "Invalid prompt"
  }
  ```
</ResponseExample>
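
Because failures are reported in the `status` and `error` fields of the response body, it can help to check them before reading `output`. This is a minimal sketch under that assumption; `RunpodRequestError` and `ensure_completed` are hypothetical names, not part of any Runpod SDK.

```python
from typing import Any, Dict


class RunpodRequestError(RuntimeError):
    """Hypothetical exception raised when a request did not complete."""


def ensure_completed(result: Dict[str, Any]) -> Dict[str, Any]:
    """Return the result unchanged if its status is COMPLETED, otherwise raise."""
    status = result.get("status")
    if status != "COMPLETED":
        message = result.get("error", "no error message provided")
        raise RunpodRequestError(
            f"request {result.get('id')} ended with status {status}: {message}"
        )
    return result


# Failed response in the same shape as the 400 example.
failed = {
    "id": "sync-a1b2c3d4-e5f6-7890-abcd-ef1234567890-u1",
    "status": "FAILED",
    "error": "Invalid prompt",
}
try:
    ensure_completed(failed)
except RunpodRequestError as exc:
    print(exc)
```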

## OpenAI API compatibility

Qwen3 32B AWQ is fully compatible with the OpenAI API format. You can use the OpenAI Python client by pointing its `base_url` at this endpoint's `/openai/v1` path.

```python Python (OpenAI SDK) theme={"theme":{"light":"github-light","dark":"github-dark"}}
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["RUNPOD_API_KEY"],  # Runpod API key, read from the environment.
    base_url="https://api.runpod.ai/v2/qwen3-32b-awq/openai/v1",
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-32B-AWQ",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful coding assistant.",
        },
        {
            "role": "user",
            "content": "Write a Python function that checks if a number is prime.",
        },
    ],
    max_tokens=525,
)

print(response.choices[0].message.content)
```

To stream the response, reuse the `client` from the previous example and pass `stream=True`:

```python Python (Streaming) theme={"theme":{"light":"github-light","dark":"github-dark"}}
response = client.chat.completions.create(
    model="Qwen/Qwen3-32B-AWQ",
    messages=[
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    max_tokens=525,
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```
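
If the full text is also needed after streaming finishes, the chunks can be accumulated while they are printed. The sketch below assumes chunks shaped like those in the loop above (`chunk.choices[0].delta.content`). `collect_stream` is a hypothetical helper, and the guard for an empty `choices` list is a precaution, not documented behavior of this endpoint.

```python
from typing import Any, Iterable, List


def collect_stream(chunks: Iterable[Any]) -> str:
    """Print streamed deltas as they arrive and return the full text (hypothetical helper)."""
    parts: List[str] = []
    for chunk in chunks:
        # Skip chunks that carry no choices (precautionary guard).
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
            parts.append(delta)
    return "".join(parts)
```

For example, `full_text = collect_stream(response)` would replace the `for` loop in the streaming example.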

For more details, see [Send vLLM requests](/serverless/vllm/vllm-requests) and the [OpenAI API compatibility guide](/serverless/vllm/openai-compatibility).

## Cost calculation

Qwen3 32B AWQ charges \$10.00 per 1M tokens. Example costs:

| Tokens           | Cost    |
| ---------------- | ------- |
| 1,000 tokens     | \$0.01  |
| 10,000 tokens    | \$0.10  |
| 100,000 tokens   | \$1.00  |
| 1,000,000 tokens | \$10.00 |
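
The table follows from a linear formula: cost = tokens × \$10.00 / 1,000,000. A minimal sketch of that arithmetic is shown below; `estimate_cost` is a hypothetical helper, and which token counts are billed (input, output, or both) is not specified here, so the function takes a single total.

```python
from decimal import Decimal

# USD per 1M tokens, taken from the pricing stated on this page.
PRICE_PER_MILLION_TOKENS = Decimal("10.00")


def estimate_cost(tokens: int) -> Decimal:
    """Estimate the cost in USD for a given token count (hypothetical helper)."""
    if tokens < 0:
        raise ValueError("tokens must be non-negative")
    return PRICE_PER_MILLION_TOKENS * tokens / Decimal(1_000_000)


# Recompute the rows of the table above.
for count in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{count:,} tokens: ${estimate_cost(count):.2f}")
```

`Decimal` is used instead of `float` to avoid rounding noise in currency values.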
